Enterprises often store data in both Salesforce and other external data lakes such as Snowflake, Google BigQuery, Databricks, Redshift, or cloud storage like Amazon S3. This siloing of data across different source systems poses a challenge for companies that want to harness the full power of their data.
Architects working to bring together data across multiple data lakes face key architectural decisions around how best to integrate that data. Data Cloud offers multiple options for data integration, each of which offers different pros and cons.
This guide provides a framework to evaluate which pattern best fits your requirements for latency, cost, scalability, governance, and complexity when integrating data, helping you choose when to use data ingestion, Zero Copy data federation, or a hybrid approach. The guide will also help you select between different methods of data ingestion and data federation, each of which fills a different need.
Integrating external data lakehouses with Data Cloud requires careful consideration of trade-offs between data freshness, governance, and pipeline efficiency. For example, using Zero Copy data federation live queries maximizes data freshness but can reduce pipeline efficiency as more data moves over the network. Therefore, for most real-world implementations, a combination of ingestion and federation within a multi-cloud lakehouse ecosystem is the optimal path. This hybrid approach ensures a scalable, governed, interoperable architecture that seamlessly supports both low-latency operational workloads, such as real-time personalization and fraud detection, and analytical workloads, like regulatory reporting and historical trend analysis. This decision guide will help you understand how to navigate these trade-offs and select the right strategy.
- Hybrid Architecture: Mixing data ingestion and data federation is often needed.
- Data Ingestion Frequency Matters: Choose the frequency based on business value, latency needs, and operational complexity.
- Match the Federation Pattern to Latency & Performance: Choose the one that best matches your access patterns and the requirements for freshness, performance, and cost.
- Align Governance with Data Residency Requirements: Keep data at the source via federation when residency or source-side controls such as row-level security and masking must be preserved; ingest when centralized governance, auditing, and traceability are required.
- Prioritize Ingestion for High-Value Workflows: Apply ingestion selectively to critical processes such as identity resolution, regulatory reporting, and operational activation.
- Cost & Complexity Drive the Decision: Real-time ingestion can be expensive and complex. Architects should weigh the cost of onboarding, storing, and transforming data against the cost of querying it directly via Zero Copy.
Choosing the right integration pattern—Data Ingestion, Zero Copy, or a Hybrid approach—directly impacts latency, governance, operational efficiency, and cost across multi-cloud platforms. This decision shapes how real-time insights, AI-driven activation, and personalized engagement can be delivered reliably and at scale.
This table provides a technical comparison of Data Ingestion and Zero Copy patterns in Salesforce Data Cloud, focusing on capabilities, trade-offs, and benefits, along with enterprise use cases and outcomes. Architects can use this as a reference for designing hybrid, multi-cloud data platforms that balance performance, cost, and compliance.
| Pattern Type | Mode / Tool | Benefits | Considerations | Outcomes |
|---|---|---|---|---|
| Data Ingestion | Real-Time: Sub-second latency ingestion via Ingestion APIs with CDC support. Continuous streaming pipelines. | Immediate insights; ideal for low-latency operational and personalization use cases; supports event-driven workflows | High cost; complex architecture; requires low-latency source systems; high-volume sources can cause excessive streaming, leading to saturated pipelines; I/O intensive; consider selective fields and filtering to reduce overhead | Agentforce: real-time fraud alerts, retail personalization, operational alerts. Analytics: sub-second dashboards, KPI monitoring. Compliance: continuous customer record updates for regulated workflows |
| Data Ingestion | Streaming: Micro-batch ingestion every 1–3 minutes via native connectors | Balanced cost vs. freshness; simpler architecture than real-time; supports incremental updates | Slight latency; may not be suitable for critical sub-second decisions; batch size impacts memory/compute; I/O is moderate; best for predictable, repeated update patterns; consider windowed aggregation to reduce processing load | Agentforce: timely campaign triggers, near-live engagement. Analytics: recommendation engines, near-live dashboards. Compliance: frequent updates with auditability |
| Data Ingestion | Batch: Scheduled large-volume loads via connectors or APIs. Supports object storage and ETL/ELT pipelines. | Cost-efficient for massive datasets; easy to implement; reliable for historical analytics | Data latency; unsuitable for time-sensitive operations; I/O intensive during load windows; network throughput may become a bottleneck for large files; best for historical aggregation or regulated reporting workflows | Agentforce: IT support tickets (Jira/ServiceNow), aggregated workflows. Analytics: historical analysis, trend evaluation. Compliance: regulatory reporting, patient/claims data aggregation |
| Zero Copy | Live Query: Direct queries on external systems; schema-on-read; no data duplication | Maximum freshness; minimal storage overhead; supports real-time operational insights | Dependent on source performance; high query volume may affect latency; ideal for queries with predicate pushdown and aggregation to minimize I/O; avoid unfiltered queries on massive datasets | Agentforce: dynamic workflows adapting to live activity. Analytics: operational dashboards, live reporting. Compliance: respects row-level security and masking at source |
| Zero Copy | Accelerated Query (Caching): Cached local copies for federated queries. Configurable from 15 minutes to 7 days. Optimized query execution | Reduces latency; lower cost than repeated live queries; improves performance for frequent access patterns | Cache management required; staleness depends on cache interval; best for high-frequency queries; not suitable for sub-second decisioning | Agentforce: pre-aggregated engagement metrics for fast decisioning. Analytics: BI dashboards, segmentation, analytical reporting. Compliance: consistent regulated dashboards with audit logs |
| Zero Copy | File Federation: Direct access to large historical datasets in object stores or lakes (S3, Iceberg, Google BigQuery, Redshift). | Handles massive-scale datasets; minimal storage in Data Cloud; supports AI/ML workloads | Read-only; query performance depends on external system throughput; optimized for batch-heavy, throughput-intensive jobs; not suitable for real-time dashboards | Agentforce: not typical (batch-heavy). Analytics: ML/AI training, historical analytics, petabyte-scale reporting. Compliance: governed access to external datasets without duplication |
With data ingestion, data is physically copied into Data Cloud and fully governed, unlike Zero Copy where data remains at the source. Compute for transformations happens within Data Cloud, which allows for centralized governance and auditing.
Purpose: Use data ingestion to store canonical, governed datasets in Salesforce Data Cloud for compliance and operational control. Use ingestion when full control, auditing, and traceability are required. Ideal for regulated or high-value workflows where centralized compute and governance are critical.
Ingestion is best for building a trusted foundation for identity resolution, regulatory reporting, and mission-critical AI-driven workflows and customer engagement.
Data ingestion methods vary depending on what connector you use to ingest your data. Some connectors offer a variety of ingestion methods, while others operate only in batch or streaming mode. See Data Cloud: Integrations and Connectors for a complete list of Data Cloud connectors and their available methods.
Feature | Real-Time Ingestion | Streaming Ingestion | Batch Ingestion |
---|---|---|---|
Latency and Freshness | Sub-second latency ingestion via Ingestion APIs with Change Data Capture (CDC) support. Provides continuous streaming pipelines. Best for low-latency operational use cases. | Micro-batch ingestion every 1–3 minutes via native connectors. Supports incremental updates. Slight latency is expected. | Data latency is expected. Scheduled large-volume loads. Periodic ingestion (hourly, daily, weekly). Unsuitable for time-sensitive operations. |
Primary Use Cases | Ideal for low-latency operational and personalization use cases. Used for time-sensitive workflows. Supports event-driven workflows. Used for real-time fraud alerts and operational alerts. | Suitable for moderately urgent processes. Used for campaign orchestration, near-live engagement, and operational reporting. Used for timely campaign triggers. | Cost-efficient for massive datasets. Reliable for historical analytics. Used for historical aggregation or regulated reporting workflows. Best for historical or low-velocity datasets. |
Architectural Complexity and I/O | High cost and complex architecture. Requires low-latency source systems. I/O intensive. High-volume sources can cause saturated pipelines. | Simpler architecture than real-time. I/O is moderate. Best for predictable, repeated update patterns. Batch size impacts memory/compute. | Easy to implement. I/O intensive during load windows. Network throughput may become a bottleneck for large batches. |
Cost Considerations | Highest compute and pipeline costs. Justified only for high-value, time-sensitive workflows. | Moderate compute and storage costs. Provides a balanced cost vs freshness approach. Suitable for frequent updates that can tolerate slight delays. | Lower compute costs and predictable storage. Recommended for historical datasets or low-frequency updates. Ingestion via Salesforce internal pipelines is free. |
Design Practices | Use incremental CDC to minimize data shuffling. Filter and use selective fields to reduce overhead. | Use micro-batches to control I/O spikes. Consider windowed aggregation to reduce processing load. | Favor this for archival reporting or periodic snapshots. Ensure compute locality in the same region as source storage for cost optimization. |
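To make the streaming design practices above concrete, here is a minimal Python sketch of a micro-batched push to a Data Cloud Ingestion API streaming endpoint. Treat it as an illustrative sketch only: the tenant endpoint, connector (source) API name, object name, field list, and token handling are placeholder assumptions, and the exact endpoint path and payload shape should be verified against the current Ingestion API documentation. The point is the pipeline shape: filter and project records early, then accumulate them into micro-batches before sending.

```python
import time
import requests

# Placeholder values: replace with your tenant endpoint, Ingestion API
# connector (source) API name, object name, and OAuth access token.
TENANT_ENDPOINT = "https://YOUR_TENANT.example.salesforce.com"
SOURCE_API_NAME = "Engagement_Connector"   # hypothetical connector name
OBJECT_NAME = "website_events"             # hypothetical ingestion object
ACCESS_TOKEN = "REPLACE_WITH_OAUTH_TOKEN"

# Keep only the fields the downstream model needs (selective fields)
# and drop low-value events before they hit the pipeline (filtering).
KEEP_FIELDS = {"event_id", "customer_id", "event_type", "event_ts", "amount"}
HIGH_VALUE_EVENTS = {"purchase", "refund", "fraud_signal"}


def to_payload(raw_event: dict):
    """Filter and project a raw source event into a lean ingestion record."""
    if raw_event.get("event_type") not in HIGH_VALUE_EVENTS:
        return None  # skip low-value events to reduce I/O and pipeline load
    return {k: v for k, v in raw_event.items() if k in KEEP_FIELDS}


def send_micro_batch(records: list) -> None:
    """POST one micro-batch to the (assumed) streaming ingestion endpoint."""
    url = f"{TENANT_ENDPOINT}/api/v1/ingest/sources/{SOURCE_API_NAME}/{OBJECT_NAME}"
    resp = requests.post(
        url,
        json={"data": records},
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
        timeout=30,
    )
    resp.raise_for_status()


def run(source_stream, batch_size: int = 200, max_wait_seconds: int = 60) -> None:
    """Accumulate filtered events into micro-batches to smooth I/O spikes."""
    batch, last_flush = [], time.monotonic()
    for raw_event in source_stream:
        record = to_payload(raw_event)
        if record:
            batch.append(record)
        # Flush on batch size or on a time window, whichever comes first.
        if len(batch) >= batch_size or time.monotonic() - last_flush > max_wait_seconds:
            if batch:
                send_micro_batch(batch)
            batch, last_flush = [], time.monotonic()
    if batch:
        send_micro_batch(batch)
```

The same batching loop is also a natural place to apply windowed aggregation before the POST when individual events are not needed downstream.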
Purpose: Use Zero Copy for real-time querying of external systems without data duplication, enabling agility, freshness, and scalable access to large or transient datasets. It is best for live dashboards, exploratory analytics, AI/ML model training, and real-time customer engagement directly through Salesforce Data Cloud.
When using Zero Copy, architects must further decide between three available data federation methods, each of which offers its own tradeoffs between freshness, performance, and cost.
Decision Point | Live Query | Caching (Accelerated Query) | File Federation |
---|---|---|---|
Data Source Location | External data lakehouses (Snowflake, Google BigQuery, Redshift, Databricks). | External data lakehouses (Snowflake, Google BigQuery, Redshift, Databricks) | Object stores or cloud data lakes (S3, ADLS, GCS), often using open table formats like Iceberg. |
Purpose/Use Case | Ideal for interactive analysis and real-time dashboards. Best for real-time personalization and dynamic workflows. | Best for when queries are frequent but slightly stale results are acceptable. Suitable for BI dashboards and segmentation. | Best for large-scale batch processing and AI/ML model training. Ideal for historical analytics and petabyte-scale reporting. |
Freshness/Latency | Maximum freshness; queries run directly in real time. Supports sub-second decisioning. | Slightly stale results are acceptable. Freshness depends on the cache interval, configurable from 15 minutes to 7 days. | Optimized for batch-heavy, throughput-intensive jobs. Not suitable for real-time dashboarding. |
Access Pattern | Best for infrequent or ad-hoc queries; can also serve low-latency queries where freshness is critical, though high QPS (queries per second) drives up cost. | Best for high-frequency read scenarios. Improves performance for frequent access patterns. | Read-only access. Suited for petabyte-scale datasets without ingestion.
Performance Drivers | Highly dependent on the external source system's performance. Optimized when predicates and aggregations can be pushed down to the source. | Reduces latency compared to repeated live queries. Performance depends on cache management and interval. | Performance depends heavily on object format, partitioning, and external system throughput. Use partitioned, columnar formats (Parquet/ORC). |
Cost Implications | Pay-per-query model. Costs accrue on external lakehouse compute. Cost-effective for infrequent queries but expenses can spike with high query per second (QPS) volume. | Lower cost than repeated live queries. Reduces the need to repeatedly query the external source. Adds cache storage and refresh overhead. | Cheapest storage option. Query costs depend on file size and partitioning. |
Key Consideration | Avoid unfiltered queries that scan massive data volumes unnecessarily. | Requires cache management. Not suitable for sub-second decisioning. | Query performance relies heavily on optimization via partitioning and predicate pushdown. |
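To illustrate the "avoid unfiltered queries" guidance for live queries, the sketch below contrasts a full-table scan with a query whose filter and aggregation can be pushed down to the external lakehouse. The object and field names are hypothetical, and the exact SQL dialect depends on how you query the federated object (for example, through the Data Cloud Query API or a connected notebook).

```python
# Hypothetical federated object and columns; adjust to your own data model.

# An unfiltered query forces the external lakehouse to scan the full table
# and ship every row over the network:
full_scan_sql = """
    SELECT *
    FROM Order_Events__dlm
"""

# A filtered, aggregated query lets the predicate and the GROUP BY be pushed
# down to the source, so only a small result set crosses the network.
# (Date/interval syntax varies by engine; treat this as illustrative.)
pushdown_sql = """
    SELECT customer_id__c,
           SUM(order_amount__c) AS total_spend
    FROM Order_Events__dlm
    WHERE order_date__c >= CURRENT_DATE - INTERVAL '7' DAY
      AND order_status__c = 'COMPLETED'
    GROUP BY customer_id__c
"""
```

In the second query, only a small aggregated result set crosses the network, which keeps live-query latency and external compute costs predictable even as the source table grows.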
Hybrid architectures enable architects to anchor critical datasets in Data Cloud for centralized governance while leveraging federated queries for freshness, reduced duplication, and scalable access to large external datasets. This approach balances I/O, compute locality, cost, and compliance requirements.
Purpose: Use a hybrid approach for balanced governance, freshness, and operational efficiency by combining data ingestion and Zero Copy to deliver real-time, actionable insights. Use ingestion for high-value, regulated datasets where traceability, row-level security (RLS), and masking are required, and federation for ephemeral or high-volume datasets where freshness and performance are key.
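One way to keep this routing explicit and reviewable is to capture the decision rules as code or configuration alongside the architecture documentation. The sketch below is a simplified illustration of the heuristics in this guide, not product behavior; the dataset attributes and thresholds are assumptions you would tune for your own estate.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    regulated: bool            # needs centralized governance, auditing, traceability
    used_for_identity: bool    # feeds identity resolution or activation
    needs_sub_second: bool     # drives sub-second operational decisions
    ephemeral_or_bulky: bool   # transient or very large (e.g., petabyte-scale history)
    reads_per_day: int         # rough query frequency from Data Cloud


def choose_pattern(ds: Dataset) -> str:
    """Map a dataset to an integration pattern using this guide's heuristics."""
    # High-value, regulated, or identity-critical data: ingest and govern centrally.
    if ds.regulated or ds.used_for_identity:
        return "ingest (real-time)" if ds.needs_sub_second else "ingest (streaming or batch)"
    # Huge or transient datasets: federate rather than copy.
    if ds.ephemeral_or_bulky:
        # Frequent reads justify accelerated (cached) queries; otherwise use
        # live query or file federation for batch and AI/ML workloads.
        if ds.reads_per_day > 100:
            return "zero copy (accelerated query)"
        return "zero copy (live query or file federation)"
    # Default: start with federation and promote to ingestion if the data proves high-value.
    return "zero copy (live query)"


if __name__ == "__main__":
    examples = [
        Dataset("claims_history", regulated=True, used_for_identity=False,
                needs_sub_second=False, ephemeral_or_bulky=True, reads_per_day=5),
        Dataset("web_clickstream", regulated=False, used_for_identity=False,
                needs_sub_second=False, ephemeral_or_bulky=True, reads_per_day=500),
    ]
    for ds in examples:
        print(f"{ds.name}: {choose_pattern(ds)}")
```

Encoding the rules this way also makes it easy to revisit a dataset's classification as volumes, query patterns, or regulatory requirements change.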
Below are common archetypes that illustrate how to apply this logic.
Enterprise data strategy is no longer about choosing a single integration pattern; it's about architecting controlled flexibility within an interoperable data ecosystem. Selecting the right data integration method for each source system based on business needs often leads to a hybrid approach that blends the strengths of both data ingestion and data federation.
Salesforce Data Cloud on Hyperforce delivers multi-region resilience and scalability. Its open lakehouse with Iceberg tables enables compute separation and interoperability with platforms like Snowflake, Databricks, and S3 Iceberg, forming the backbone of a truly interoperable, multi-cloud data ecosystem.
As data ecosystems evolve, continuously balance freshness, cost, performance, and compliance to maintain architectural agility. Future-proof your platform by unifying ingested, governed data with federated access. This enables real-time intelligence, AI activation, and enterprise-scale personalization across clouds, regions, and business domains.
One-size-fits-all solutions don't suit most businesses. The optimal strategy maps the right pattern to the right business driver.
Yugandhar Bora is a Software Engineering Architect at Salesforce, specializing in data architecture within the Data & Intelligence Applications platform. He leads enterprise architecture review board (EARB) initiatives focused on data governance and unified data models, while contributing to automated platform provisioning solutions.
Jan Fernando is a Principal Architect in the Office of the Chief Architect at Salesforce. He joined Salesforce in 2012, bringing a wealth of experience from his time in the startup ecosystem. Prior to joining the Office of the Chief Architect, he spent over a decade in the Platform organization, where he led several key technology transformations.