Analytical and transactional workloads have very different characteristics and requirements. The use cases differ significantly. SLAs are very different, too. Many people think that Apache Kafka is not built for transactions and should only be used for big data analytics. This blog post explores when and how to use Kafka in resilient, mission-critical architectures and when to use the built-in Transaction API.
There is some overlap between use cases and products. Hence, I use the more generic terms “transactions” and “analytics” in this blog post.
Analytical workloads have the following characteristics:
Analytics solutions exist on-premises and in all major clouds. The tools differ regarding their capabilities and sweet spots. Examples include:
Transactional workloads have unique characteristics and SLAs compared to analytical workloads:
Transactional solutions include applications, databases, messaging systems, and integration middleware:
Often, a transactional workload has to guarantee ACID principles (i.e., atomic, all-or-nothing writes across different applications and technologies).
Many solutions support a mix of transactional and analytical workloads.
For instance, many enterprises store transactional data in MongoDB but also process complex queries for analytics use cases in the same database. MongoDB started as a document-based NoSQL database. In the meantime, it has become a general-purpose database platform that also supports other kinds of queries; for example, MongoDB provides graph and tree traversal capabilities:
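For illustration, here is a minimal sketch of such a graph traversal using the $graphLookup aggregation stage via the MongoDB Java driver. The hr database, the employees collection, and its name/reportsTo fields are hypothetical placeholders:

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

import java.util.List;

public class GraphLookupExample {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> employees =
                    client.getDatabase("hr").getCollection("employees");

            // $graphLookup recursively follows the "reportsTo" field to build
            // the full reporting chain for every employee document.
            employees.aggregate(List.of(
                    Aggregates.graphLookup(
                            "employees",          // from: collection to traverse
                            "$reportsTo",         // startWith: initial value to match
                            "reportsTo",          // connectFromField
                            "name",               // connectToField
                            "reportingHierarchy"  // as: output array field
                    )
            )).forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```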
Hence, focus on the business problem first. Then, you can decide if your existing infrastructure can solve the problem or if you need yet another one. But there is no silver bullet. A vendor-independent best-of-breed approach works best in most enterprise architectures I see in success stories from the field.
Batch vs. real-time data processing is an important discussion you should have in every project. Statements like “batch processing is for analytics, real-time processing is for transactions” are not always correct. Real-time beats slow data in almost all use cases from a business value perspective. Nevertheless, batch processing is the better approach for some specific use cases.
Data at Rest means storing data in a database, data warehouse, or data lake. In many use cases, this means the data is processed too late, even if a real-time streaming component (like Kafka) ingests the data. The data processing is still a web service call, SQL query, or map-reduce batch process away from providing a result to your problem.
Don’t get me wrong. Data at Rest is not a bad thing. Several use cases such as reporting (business intelligence), analytics (batch processing), and model training (machine learning) require this approach… If you do it right! Data at Rest can be used for transactional workloads, too!
The Kafka API is the de facto standard API for Data in Motion, much like the Amazon S3 API for object storage. Why is Kafka so successful? Real-time beats slow data in most use cases across industries.
The same cloud-native approach is required for event streaming as for the modern data lake. Building a scalable and cost-efficient infrastructure is crucial for the success of a project. Event streaming and data lake technologies are complementary, not competitive.
I will not explore the reasons and use cases for the success of Kafka in this post. Instead, check out my overview of Kafka use cases across industries for more details, or read some of my vertical-specific blog posts.
In short, most added value comes from processing Data in Motion while it is relevant instead of storing Data at Rest and processing it later (or too late). Many analytical and transactional workloads use Kafka for this reason.
Even in 2022, many people think about Kafka as a data ingestion layer into data stores. This is still a critical use case. Enterprises use Kafka as the ingestion layer for different analytics platforms:
But Kafka is much more than a messaging and ingestion layer. Here are a few examples of Kafka used for analytics (often together with other analytics tools to solve a specific problem):
Here is an illustration of an AWS architecture combining Confluent and its ecosystem (including connectors, stream processing capabilities, and schema management) with several first-party AWS cloud services:
As you can see, Kafka is an excellent tool for analytical workloads. It is not a silver bullet, but it fits well into the appropriate parts of the overall data management architecture. I have another blog post that explores the relationship between Kafka and other serverless analytics platforms.
However, Kafka is NOT just used for analytical workloads!
Around 60 to 70% of use cases and deployments I see at customers across the globe leverage the Kafka ecosystem for transactional workloads. Enterprises use Kafka for:
Kafka is a distributed, fault-tolerant system that is resilient by nature (if you deploy and operate it correctly). Zero downtime and zero data loss can be guaranteed, just as with your favorite database, mainframe, or other core platforms.
Elastic scalability and rolling upgrades allow building a flexible and reliable data streaming infrastructure for transactional workloads to guarantee business continuity. Architects can even stretch a cluster across regions to ensure zero data loss and zero downtime, even if a disaster takes an entire data center down. The post “Global Kafka Deployments” explores the different deployment options and their trade-offs in more detail.
And even better: Kafka’s Transaction API, i.e., Exactly-Once Semantics (EOS), has been available since Kafka 0.11 (which went GA a long time ago). EOS makes building transactional workloads even easier, as you don’t need to handle duplicates anymore.
Kafka now supports atomic writes across multiple partitions through the transactions API. This allows a producer to send a batch of messages to multiple partitions. Either all messages in the batch are eventually visible to any consumer, or none are ever visible to consumers. Here is an example:
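(A minimal sketch using the Java producer API; the topic names, keys, payloads, and the transactional.id are hypothetical, and error handling is simplified.)

```java
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.KafkaException;
import org.apache.kafka.common.serialization.StringSerializer;

import java.util.Properties;

public class TransactionalProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // A stable transactional.id enables the transactions API (and idempotence).
        props.put(ProducerConfig.TRANSACTIONAL_ID_CONFIG, "payments-producer-1");

        KafkaProducer<String, String> producer = new KafkaProducer<>(props);
        producer.initTransactions();

        try {
            producer.beginTransaction();
            // Writes to different topics/partitions become visible atomically:
            producer.send(new ProducerRecord<>("payments", "order-42", "debit 100 EUR"));
            producer.send(new ProducerRecord<>("orders", "order-42", "CONFIRMED"));
            producer.commitTransaction();
        } catch (KafkaException e) {
            // Simplified: fatal errors (e.g., a fenced producer) would require closing
            // the producer instead of aborting.
            producer.abortTransaction();
        } finally {
            producer.close();
        }
    }
}
```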
Kafka provides a built-in transactions API. And the performance impact (that many people are worried about) is minimal. Here is a simple rule of thumb: If you care about exactly-once semantics, simply activate it! If you later run into performance issues, you can fine-tune your application or disable the feature again. But most projects are fine with the minimal performance trade-off compared to the enormous benefit of getting transactional behavior out of the box.
Nevertheless, to be clear: You don’t need to use Kafka’s Transaction API to build mission-critical, transactional workloads.
The Kafka Transactions API is optional. As discussed above, Kafka is resilient without transactions, though eliminating duplicates is then your task. Exactly-once semantics solve this problem out of the box across all Kafka components: Kafka Connect, Kafka Streams, ksqlDB, and clients for Java, C++, .NET, Go, and more support EOS.
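On the consuming side, for example, a plain Java client only needs a single setting to ignore records from aborted transactions. A minimal sketch (topic names and the consumer group are hypothetical):

```java
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

import java.time.Duration;
import java.util.List;
import java.util.Properties;

public class ReadCommittedConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "payments-consumer");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        // Only read records from committed transactions; aborted writes stay invisible.
        props.put(ConsumerConfig.ISOLATION_LEVEL_CONFIG, "read_committed");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("payments", "orders"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("%s -> %s%n", record.key(), record.value());
                }
            }
        }
    }
}
```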
However, I am also not saying that you should always use the Kafka Transaction API or that it solves every transactional problem. Keep in mind that scalable distributed systems require different design patterns than a traditional “Oracle to IBM MQ” transaction.
Some business transactions span multiple services. Hence, you need a mechanism to implement transactions that span services. A familiar design pattern and implementation for such a transactional workload is the SAGA pattern with a stateful orchestration application.
Swisscom’s Custodigit is an excellent example of such an implementation leveraging Kafka Streams. It is a modern banking platform for digital assets and cryptocurrencies that provides crucial features and guarantees for seriously regulated crypto investments – more details in my blog post about Blockchain, Crypto, NFTs, and Kafka.
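To make the orchestration idea more concrete, here is a heavily simplified, hypothetical Kafka Streams sketch of a saga orchestrator: it triggers a payment for every order and issues a compensating action if the payment fails. Topic names and payloads are made up, this is not Custodigit’s implementation, and a real orchestrator would additionally track saga state (e.g., in a state store) and handle timeouts:

```java
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

import java.util.Properties;

public class OrderSagaOrchestrator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-saga-orchestrator");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        // Exactly-once keeps each saga step's read-process-write atomic.
        props.put(StreamsConfig.PROCESSING_GUARANTEE_CONFIG, StreamsConfig.EXACTLY_ONCE_V2);

        StreamsBuilder builder = new StreamsBuilder();

        // Step 1: every new order triggers a payment command.
        builder.<String, String>stream("order-requests")
               .mapValues(order -> "RESERVE_PAYMENT for " + order)
               .to("payment-commands");

        // Step 2: route payment results; failures trigger a compensating action.
        KStream<String, String> paymentResults = builder.stream("payment-results");
        paymentResults.filter((orderId, result) -> result.startsWith("FAILED"))
                      .mapValues(result -> "CANCEL_ORDER")   // compensation
                      .to("order-compensations");
        paymentResults.filterNot((orderId, result) -> result.startsWith("FAILED"))
                      .mapValues(result -> "SHIP_ORDER")     // next saga step
                      .to("shipping-commands");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```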
And yes, there are always trade-offs between the Kafka Transaction API with exactly-once semantics, stateful orchestration in a separate application, and two-phase-commit transactions as used by Oracle DB and IBM MQ. Choose the right tool to define your appropriate enterprise architecture!
Most enterprises use Kafka as the central scalable real-time data hub. Hence, use cases include analytical and transactional workloads.
Most Kafka projects I see today also leverage Kafka Connect for data integration, Kafka Streams/ksqlDB for continuous data processing, and Schema Registry for data governance.
Thus, with Kafka, one (distributed and scalable) infrastructure enables messaging, storage, integration, and data processing. But of course, most Kafka clusters connect to other applications (like SAP or Salesforce) and data management systems (like MongoDB, Snowflake, Databricks, et al.) for analytics:
In a dedicated blog post, I explored in detail why Kafka is a database for some specific use cases but will NOT replace other databases and data lakes.
In addition to Kafka-native stream processing engines like Kafka Streams or ksqlDB, other streaming analytics frameworks like Apache Flink or Spark Streaming can easily be connected for transactional or analytical workloads. Just keep in mind that transactional workloads, in particular, get harder to guarantee end-to-end with every additional system and infrastructure you add to the enterprise architecture.
Real-time data beats slow data. That’s true for almost every use case. Yet, many enterprise architects still build new infrastructures with the Lambda architecture, which includes a separate batch layer for analytics and a real-time layer for transactional workloads.
A single real-time pipeline, called Kappa architecture, is the better fit. Real-world examples from companies such as Disney, Shopify, Uber, and Twitter explore the benefits of Kappa and also show how batch processing still fits into this discussion without a separate Lambda batch layer. In the dedicated post, learn how a Kappa architecture can revolutionize how you build analytical and transactional workloads with the same scalable real-time data hub powered by Kafka.
How do you leverage data streaming for analytical or transactional workloads? Do you use exactly-once semantics to ease the implementation of transactions? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.