Snowflake is a leading cloud data warehouse that is evolving into a data cloud enabling various use cases. The major drawback of this evolution is the significantly growing cost of data processing. This blog post explores how data streaming with Apache Kafka and Apache Flink enables a “shift left architecture” where business teams can reduce cost, provide better data quality, and process data more efficiently. The real-time capabilities and the unification of transactional and analytical workloads using Apache Iceberg’s open table format enable new use cases and a best-of-breed approach without vendor lock-in, with the choice of various analytical query engines like Dremio, Starburst, Databricks, Amazon Athena, Google BigQuery, or Apache Flink.
Snowflake is a leading cloud-native data warehouse. Its usability and scalability made it a prevalent data platform in thousands of companies. This blog series explores different data integration and ingestion options, including traditional ETL / iPaaS and data streaming with Apache Kafka. The discussion covers why point-to-point Zero-ETL is only a short-term win, why Reverse ETL is an anti-pattern for real-time use cases, and when a Kappa architecture and shifting data processing “to the left” into the streaming layer help to build transactional and analytical real-time and batch use cases in a reliable and cost-efficient way.
This blog post is part of a series:
Subscribe to my newsletter to get an email about the next publications.
The first two blog posts explored the different design patterns and ingestion options for Snowflake via Kafka or other ETL tools. But let’s make it very clear: Ingesting all raw data into a single data lake at rest is an anti-pattern in almost all cases. It is not cost-efficient and blocks real-time use cases. Why?
Real-time data beats slow data. That’s true for almost every use case. Yet, enterprise architects build infrastructures based on the Lambda architecture, which often includes a data lake for all the analytics. The Kappa architecture, in contrast, is an event-based software architecture able to handle all data at any scale in real time, for transactional AND analytical workloads.
The central premise behind the Kappa architecture is that you can perform both real-time and batch processing with a single technology stack. The heart of the infrastructure is the streaming layer. First, the data streaming platform stores incoming data in a durable event log. From there, a stream processing engine processes the data continuously in real time or ingests the data into any other analytics database or business application via any communication paradigm and speed, including real-time, near real-time, batch, and request-response.
The business domains have the freedom of choice. In the above diagram, a consumer could be a real-time application, a stream processor, a near real-time or batch analytics platform, or a request-response API, each consuming the same events at its own speed.
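To make the “single technology stack” idea of the Kappa architecture more tangible, here is a minimal Kafka Streams sketch. The topic names and the filtering logic are hypothetical, not from this article: the raw events stay in the durable Kafka log, get curated once, and every consumer, from a real-time application to a warehouse connector, reads the resulting topic at its own pace.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class KappaOrderProcessor {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "kappa-order-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // The Kafka log durably stores the incoming raw events (hypothetical topic name).
        KStream<String, String> rawOrders = builder.stream("orders.raw");

        // Process the data once, continuously: drop test events and normalize the payload.
        rawOrders
            .filter((key, value) -> value != null && !value.contains("\"test\":true"))
            .mapValues(String::trim)
            // Every consumer (real-time app, Snowflake connector, lakehouse sink) reads this
            // curated topic at its own speed: real-time, near real-time, or batch.
            .to("orders.curated");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```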
I explored in a thorough analysis why Reverse ETL with a Data Warehouse or Data Lake like Snowflake is an ANTI-PATTERN for real-time use cases.
Reverse ETL is unavoidable in some scenarios, especially in larger organizations. Tools like FiveTran or Hightouch do a good job if you need Reverse ETL. But never design your enterprise architecture to store data in a data lake just to reverse it afterwards.
The cost of storage has become extremely cheap with object storage in the public cloud.
With the explosion of data volumes, types, and formats, object storage has risen as the de facto cheap, ubiquitous storage layer serving the traditional “data lake”.
However, even with that in place, the data lake often turned into a “data swamp” with many different file sizes and types. The cloud data warehouse was simply easier to use, which is why it made sense to point most raw data for analysis at the data warehouse.
Snowflake is a very good analytics platform. It provides strict SLAs and serves concurrent queries that need that speed. However, many of the workloads running in Snowflake don’t have that requirement. Some use cases only need ad hoc queries and are exploratory in nature, with no speed SLAs. Data warehousing should often be a destination for processed data, not an entry point for all raw data. On the other side, some applications require very low latency for query results; purpose-built databases do such analytics much faster than Snowflake, even at scale and for transactional workloads.
The combination of an object store with a table format like Apache Iceberg enables openness for different teams. You add a query engine on top, and it’s like a composable data warehouse. Now your team can run Apache Spark on your tables. Maybe the data science team can run Ray on it for machine learning, and maybe some aggregations run with Apache Flink.
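As a rough sketch of that composability (the catalog name, warehouse path, and table are made up, and the Iceberg Flink runtime dependency is assumed on the classpath), the same Iceberg table that Spark or Ray reads can be registered and queried from Flink’s Table API:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class IcebergOnObjectStore {

    public static void main(String[] args) {
        // Batch mode is fine for exploratory queries on historical table data.
        TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Register an Iceberg catalog that points at tables stored in object storage.
        tableEnv.executeSql(
            "CREATE CATALOG lakehouse WITH (" +
            " 'type' = 'iceberg'," +
            " 'catalog-type' = 'hadoop'," +
            " 'warehouse' = 's3a://my-bucket/warehouse'" +
            ")");

        // Any engine that understands Iceberg (Spark, Trino/Starburst, Dremio, Athena, ...)
        // could run an equivalent query against the very same table files.
        tableEnv.executeSql(
            "SELECT customer_id, COUNT(*) AS order_count " +
            "FROM lakehouse.sales.orders " +
            "GROUP BY customer_id")
            .print();
    }
}
```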
It has never been easier to choose a best-of-breed approach in each business unit when you leverage data streaming in combination with SaaS analytics offerings.
Therefore, teams should choose the best price/performance for their unique needs and requirements. Apache Iceberg makes that shift between different lakehouses and data platforms a lot easier, as it also helps with issues like small-file compaction.
Still, running a SQL engine directly on the data lake used to be slow and did not support key characteristics that are expected, like ACID transactions. Open table formats such as Apache Iceberg close exactly these gaps.
Dremio, Starburst Data, Spark SQL, Databricks Photon, Ray/Anyscale, Amazon Athena, Google BigQuery, Flink, and others are all part of the query engine ecosystem that a data product in a governed Kafka Topic unlocks.
Why did I add a stream processor like Flink here? Flink is an option to query streams in real time and to run ad hoc exploratory queries on both real-time and historical data upstream.
And keep in mind that not every application requires the performance of Snowflake, a real-time analytics engine like Druid or Pinot, or modern data virtualization with Starburst. If your random ad hoc query, for example how many customers bought red shoes in the store today, does not need to return results in 20 seconds but can wait 3 to 5 minutes, and if that query can run more cheaply and efficiently elsewhere, shifting left gives you the option to decide which lane to put each query in.
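As a hedged illustration (the topic, field names, and connector options are assumptions, not from this article), that “red shoes today” question could run as an ad hoc Flink query directly against the stream of purchase events, without landing the raw data in a warehouse first:

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class AdHocRedShoesQuery {

    public static void main(String[] args) {
        TableEnvironment tableEnv = TableEnvironment.create(EnvironmentSettings.inStreamingMode());

        // Expose the (hypothetical) Kafka topic of purchase events as a dynamic table.
        tableEnv.executeSql(
            "CREATE TABLE purchases (" +
            "  customer_id STRING," +
            "  product STRING," +
            "  color STRING," +
            "  purchase_time TIMESTAMP(3)" +
            ") WITH (" +
            "  'connector' = 'kafka'," +
            "  'topic' = 'store.purchases'," +
            "  'properties.bootstrap.servers' = 'localhost:9092'," +
            "  'scan.startup.mode' = 'earliest-offset'," +
            "  'format' = 'json'" +
            ")");

        // Ad hoc exploratory question: how many distinct customers bought red shoes today?
        // The result prints as a continuously updating changelog.
        tableEnv.executeSql(
            "SELECT COUNT(DISTINCT customer_id) AS red_shoe_buyers " +
            "FROM purchases " +
            "WHERE product = 'shoes' AND color = 'red' " +
            "AND CAST(purchase_time AS DATE) = CURRENT_DATE")
            .print();
    }
}
```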
If storing all data in a single data lake/data warehouse/lakehouse AND Reverse ETL are anti-patterns, how can we solve it? The answer is simpler than you might think: Move the processing “to the left” in the enterprise architecture!
Most architectures ingest raw data into multiple data sinks, i.e., databases, warehouses, lakes, and other applications:
This has several drawbacks:
Hence, it is much more consistent and cost-efficient to process data once in real-time when it is created. A data streaming platform with Apache Kafka and Flink is perfect for this:
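A minimal sketch of “processing once” (topic names and the aggregation are hypothetical): instead of every sink re-computing the same daily numbers, a single streaming job pre-aggregates the events, and all downstream systems, Snowflake included, simply consume the result.

```java
import java.time.Duration;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DailyOrderCounts {

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "daily-order-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Count orders per product in daily tumbling windows, computed once in the stream.
        builder.<String, String>stream("orders.curated")
            .groupByKey() // assume the record key is the product id
            .windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofDays(1)))
            .count(Materialized.as("orders-per-product-daily"))
            .toStream()
            // Re-key the windowed result so plain string-keyed consumers can read it.
            .map((windowedKey, count) -> KeyValue.pair(
                windowedKey.key() + "@" + windowedKey.window().startTime(), count))
            .to("orders.per_product.daily", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```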
With the shift-left approach providing a data product with good quality in the streaming layer, each business team can look at the price/performance of their most popular, frequent, and expensive queries and determine whether they truly should live in the warehouse or whether another solution fits better.
Processing the data with a data streaming platform “on the left side of the architecture” has various advantages:
This is just a primer for its own discussion. Kafka and Snowflake are complementary. However, the right integration and ingestion architecture can make a difference for your entire data architecture (not just the analytics with Snowflake). Hence, check out my in-depth article “Kappa Architecture is Mainstream Replacing Lambda” to understand why data streaming is much more than just ingestion into your cloud data warehouse.
Building (real-time or batch) data products in modern microservice and data mesh enterprise architectures is the future. Data streaming enforces good data quality and cost-efficient data processing (instead of “DBT’ing” everything again and again in Snowflake).
This gives companies options to disaggregate the compute layer and perhaps shift left the queries that really don’t need a purpose-built analytics engine, i.e., where a stream processor like Apache Flink can do the job. That reduces data replication and copies. Different teams can query the data with the query engine of their choice, based on the key requirements of the workload, and they consume a data product (i.e., a Kafka Topic) that already has good data quality built in through data quality rules and policy enforcement (topic schema plus a rules engine). For instance, Confluent’s data streaming platform has all of this built into the solution already.
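Where such policy enforcement is not available out of the box, a simplified, hand-rolled version of the same idea can live in the stream processor itself. The topic names and the deliberately naive validation rule below are hypothetical: valid events form the data product, invalid events go to a dead letter topic for inspection instead of polluting every downstream system.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PaymentQualityGate {

    // Deliberately naive rule; in practice this is a schema check plus business rules.
    private static boolean isValid(String value) {
        return value != null && value.contains("\"amount\"") && !value.contains("\"amount\":-");
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "payment-quality-gate");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> payments = builder.stream("payments.raw");

        // Good events become the governed data product that every team consumes.
        payments.filter((key, value) -> isValid(value)).to("payments.curated");

        // Bad events are not silently dropped; they land in a dead letter topic.
        payments.filterNot((key, value) -> isValid(value)).to("payments.dlq");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```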
Enterprise-wide data governance plays a critical role in ensuring that organizations derive value from their data assets, maintain compliance with regulations, and mitigate risks associated with data management across diverse IT environments.
Most data platforms provide data governance capabilities like data catalog, data lineage, self-service data portals, etc. This includes data warehouses, data lakes, lakehouses and data streaming platforms.
In contrast, enterprise-wide data governance solutions look at the entire enterprise architecture. They collect information from all different warehouses, lakes, and streaming platforms. If you apply the shift left strategy for consistent and cost-efficient data processing across real-time and batch platforms, then the data streaming layer MUST integrate with the enterprise-wide data governance solution. But frankly, you SHOULD integrate all platforms into the enterprise-wide governance view anyway, no matter if you shift some workloads to the left or not.
Today, most companies build their own enterprise-wide governance suite (if they have one at all). In the future, more and more companies will adopt dedicated solutions like Collibra or Microsoft Purview. No matter what you choose, the data streaming governance layer needs to integrate with this enterprise-wide platform, ideally via direct integrations, but at least via APIs.
While the “shift left” approach disrupts the data teams that processed all data in analytics platforms like Snowflake or Databricks in the past, the enterprise-wide governance ensures that the data teams are comfortable across different data platforms. The benefit is the unification of real-time and batch data and operational and analytical workloads.
Apache Kafka and Snowflake are complementary technologies with various integration options to build an end-to-end data pipeline with reporting and analytics in the cloud data warehouse or data cloud.
Real-time data beats slow data in almost all use cases. Some scenarios benefit from Snowpipe Streaming with Kafka Connect to build near-real-time use cases in Snowflake. Other use cases leverage stream processing for real-time analytics and operational use cases.
To ensure good data quality and cost-efficient data pipelines with data governance and policy enforcement in an enterprise architecture, a “shift left architecture” with data streaming is the best choice to build data products.
Instead of using DBT or similar tools to process data again and again at rest in Snowflake, consider pre-processing the data once in the streaming platform with technologies like Kafka Streams or Apache Flink to reduce the cost in Snowflake significantly. The consequence is that each data product provides events of good data quality, so consumers can focus on building business logic instead of ETL. And you can choose your favorite analytical query engine per use case or business unit.
Apache Iceberg and competing technologies like Apache Hudi or Databricks’ Delta Lake make the future integration between data streaming and data warehouses / data lakes even more interesting. I have already explored the adoption of Apache Iceberg as an open table format between the streaming and analytics layers. Most vendors (including Confluent, Snowflake, and Amazon) already support Iceberg today. As a concrete example, look at Confluent’s Apache Iceberg integration called Tableflow.
How do you build data pipelines for Snowflake today? Already with Apache Kafka and a “shift left architecture”, or still with traditional ETL tools? What do you think about Snowflake’s Snowpipe Streaming? Or do you already go all in with Apache Iceberg? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.