Recognizing and handling errors is essential for any reliable data streaming pipeline. This blog post explores best practices for implementing error handling using a Dead Letter Queue in Apache Kafka infrastructure. The options include a custom implementation, Kafka Streams, Kafka Connect, the Spring framework, and the Parallel Consumer. Real-world case studies show how Uber, CrowdStrike, Santander Bank, and Robinhood build reliable real-time error handling at an extreme scale.
Apache Kafka has become the favorite integration middleware for many enterprise architectures. Even for a cloud-first strategy, enterprises leverage data streaming with Kafka as a cloud-native integration platform as a service (iPaaS).
Before I go into this post, I want to make you aware that this content is part of a blog series about “JMS, Message Queues, and Apache Kafka”:
I will link the other posts here as soon as they are available. Please follow my newsletter to get updated in real-time about new posts. (no spam or ads)
The Dead Letter Queue (DLQ) is a service implementation within a messaging system or data streaming platform to store messages that are not processed successfully. Instead of silently dropping the message, the system moves it to a Dead Letter Queue.
The Enterprise Integration Patterns (EIP) call the design pattern Dead Letter Channel instead. We can use both as synonyms.
This article focuses on the data streaming platform Apache Kafka. The main reason for putting a message into a DLQ in Kafka is usually a bad message format or invalid/missing message content. For instance, an application error occurs if a value is expected to be an Integer, but the producer sends a String. In more dynamic environments, a "Topic does not exist" exception might be another reason the message cannot be delivered.
Therefore, as so often, don't simply transfer knowledge from your existing middleware experience. Message queue middleware, such as JMS-compliant IBM MQ, TIBCO EMS, or RabbitMQ, works differently than a distributed commit log like Kafka. A DLQ is used in message queuing systems for many other reasons that do not map one-to-one to Kafka. For instance, a message in an MQ system expires because of a per-message TTL (time to live).
Hence, the main reason for putting messages into a DLQ in Kafka is a bad message format or invalid/missing message content.
A Dead Letter Queue in Kafka is one or more Kafka topics that receive and store messages that could not be processed in another streaming pipeline because of an error. This concept allows continuing the message stream with the following incoming messages without stopping the workflow due to the error of the invalid message.
The Kafka architecture does not support DLQ within the broker. Intentionally, Kafka was built on the same principles as modern microservices using the ‘dumb pipes and smart endpoints‘ principle. That’s why Kafka scales so well compared to traditional message brokers. Filtering and error handling happen in the client applications.
The true decoupling of the data streaming platform enables a much more clean domain-driven design. Each microservice or application implements its logic with its own choice of technology, communication paradigm, and error handling.
In traditional middleware and message queues, the broker provides this logic. The consequence is worse scalability and less flexibility in the domains, as only the middleware team can implement integration logic.
A Dead Letter Queue in Kafka is independent of the framework you use. Some components provide out-of-the-box features for error handling and Dead Letter Queues. However, it is also easy to write your Dead Letter Queue logic for Kafka applications in any programming language like Java, Go, C++, Python, etc.
The source code for a Dead Letter Queue implementation contains a try-catch block to handle expected or unexpected exceptions. The message is processed if no error occurs. Send the message to a dedicated DLQ Kafka topic if any exception occurs.
The failure cause should be added to the header of the Kafka message. The key and value should not be changed so that future re-processing and failure analysis of historical events are straightforward.
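The try-catch approach described above can be sketched in a few lines. This is a minimal, illustrative Python sketch: the in-memory list stands in for a real Kafka producer, and the topic name `orders-dlq`, the header key `dlq.error.reason`, and the `process` function are assumptions for the example, not part of any Kafka API.

```python
def process(record):
    # Business logic for this example: the value is expected to be an integer.
    return int(record["value"]) * 2

def handle_records(records, produce_to_dlq):
    """Process each record; on failure, forward it to the DLQ topic.
    Key and value stay unchanged; the failure cause goes into a header."""
    results = []
    for record in records:
        try:
            results.append(process(record))
        except Exception as exc:
            dlq_record = {
                "key": record["key"],        # key unchanged
                "value": record["value"],    # value unchanged
                "headers": {"dlq.error.reason": str(exc)},
            }
            produce_to_dlq("orders-dlq", dlq_record)
    return results

# In-memory stand-in for a Kafka producer, for illustration only.
dlq_topic = []
handle_records(
    [{"key": "a", "value": "21"}, {"key": "b", "value": "not-a-number"}],
    lambda topic, rec: dlq_topic.append(rec),
)
```

Because the original key and value are untouched, a separate application can later inspect or re-process everything that landed in the DLQ topic.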
You don’t always need to implement your Dead Letter Queue. Many components and frameworks provide their DLQ implementation already.
With your own applications, you can usually control errors or fix code when there are errors. However, integration with 3rd party applications does not necessarily allow you to deal with errors that may be introduced across the integration barrier. Therefore, DLQ becomes more important and is included as part of some frameworks.
Kafka Connect is the integration framework of Kafka. It is included in the open-source Kafka download. No additional dependencies are needed (besides the connectors themselves that you deploy into the Connect cluster).
By default, the Kafka Connect task stops if an error occurs because of consuming an invalid message (like when the wrong JSON converter is used instead of the correct Avro converter). Dropping invalid messages is another option. The latter tolerates errors.
The configuration of the DLQ in Kafka Connect is straightforward. Just set the values for the two configuration options ‘errors.tolerance’ and ‘errors.deadletterqueue.topic.name’ to the right values:
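As a sketch, the relevant fragment of a sink connector configuration could look like this (the topic name and replication factor are example values):

```json
{
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "my-connector-dlq",
  "errors.deadletterqueue.topic.replication.factor": "1",
  "errors.deadletterqueue.context.headers.enable": "true"
}
```

The optional `errors.deadletterqueue.context.headers.enable` setting adds the failure context (such as topic, partition, offset, and exception) to the headers of the DLQ message, which simplifies later failure analysis.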
The blog post ‘Kafka Connect Deep Dive – Error Handling and Dead Letter Queues‘ shows a detailed hands-on code example for using DLQs.
Kafka Connect can even be used to process the error message in the DLQ. Just deploy another connector that consumes from the DLQ topic. For instance, if your application processes Avro messages but an incoming message arrives in JSON format, a connector consumes the JSON message and transforms it into an Avro message so it can be re-processed successfully:
Note that Kafka Connect has no Dead Letter Queue for source connectors.
Kafka Streams is the stream processing library of Kafka. It is comparable to other streaming frameworks, such as Apache Flink, Storm, Beam, and similar tools. However, it is Kafka-native. This means you build the complete end-to-end data streaming within a single scalable and reliable infrastructure.
If you use Java, or more generally the JVM ecosystem, to build Kafka applications, the recommendation is almost always to use Kafka Streams instead of the standard Java client for Kafka. Why?
One of the built-in functions of Kafka Streams is the default deserialization exception handler. It allows you to manage records that fail to deserialize. Corrupt data, incorrect serialization logic, or unhandled record types can cause the error. The feature is not called Dead Letter Queue but solves the same problem out-of-the-box.
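For instance, configuring the built-in `LogAndContinueExceptionHandler` makes Kafka Streams log and skip records that cannot be deserialized instead of stopping the application; a custom handler implementing the `DeserializationExceptionHandler` interface could forward such records to a DLQ topic instead:

```properties
# Log and skip records that fail to deserialize instead of failing the application.
default.deserialization.exception.handler=org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
```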
The Spring framework has excellent support for Apache Kafka. It provides many templates to avoid writing boilerplate code by yourself. Spring-Kafka and Spring Cloud Stream Kafka support various retry and error handling options, including time / count-based retry, Dead Letter Queues, etc.
Although the Spring framework is pretty feature-rich, it is a bit heavy and has a learning curve. Hence, it is a great fit for a greenfield project or if you are already using Spring for your projects for other scenarios.
Plenty of great blog posts exist that show different examples and configuration options. There is also the official Spring Cloud Stream example for dead letter queues. Spring allows building logic, such as DLQ, with simple annotations. This programming approach is a beloved paradigm by some developers, while others dislike it. Just know the options and choose the right one for yourself.
In many customer conversations, it turns out that often the main reason for asking for a dead letter queue is handling failures from connecting to external web services or databases. Time-outs, or the inability of a single Kafka consumer to send many requests in parallel, bring down some applications. There is an excellent solution to this problem:
The Parallel Consumer for Apache Kafka is an open-source project under Apache 2.0 license. It provides a parallel Apache Kafka client wrapper with client-side queueing, a simpler consumer/producer API with key concurrency, and extendable non-blocking IO processing.
This library lets you process messages in parallel via a single Kafka Consumer, meaning you can increase Kafka consumer parallelism without increasing the number of partitions in the topic you intend to process. For many use cases, this improves both throughput and latency by reducing the load on your Kafka brokers. It also opens up new use cases like extreme parallelism, external data enrichment, and queuing.
A key feature is handling/repeating web service and database calls within a single Kafka consumer application. The parallelization avoids the need for a single web request sent at a time:
The Parallel Consumer client has powerful retry logic. This includes configurable delays and dynamic error handling. Errors can also be sent to a dead letter queue.
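The retry-then-DLQ semantics can be illustrated with a small sketch. This is not the Parallel Consumer's actual API (which is a Java library); the function names, the retry count, and the immediate retries (a real implementation would apply a configurable backoff delay) are all simplifying assumptions:

```python
def process_with_retries(record, handler, max_retries, send_to_dlq):
    """Try the handler up to max_retries times; if every attempt fails,
    route the record to the dead letter queue with the last error."""
    last_error = None
    for _attempt in range(max_retries):
        try:
            return handler(record)
        except Exception as exc:
            last_error = exc
            # A real client would wait here (configurable delay/backoff)
            # before retrying instead of retrying immediately.
    send_to_dlq(record, str(last_error))
    return None

# Simulate a web service call that always times out.
dlq = []
calls = {"n": 0}

def flaky_handler(record):
    calls["n"] += 1
    raise RuntimeError("external web service timed out")

process_with_retries({"key": "k1"}, flaky_handler, max_retries=3,
                     send_to_dlq=lambda rec, err: dlq.append((rec, err)))
```

The key point is that transient failures (a slow web service) get several chances before the record is declared dead, while the consumer keeps processing other records in parallel.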
You are not finished after sending errors to a dead letter queue! The bad messages need to be processed or at least monitored!
A Dead Letter Queue is an excellent way to take data error processing out-of-band from the event processing, which means the error handlers can be created or evolved separately from the event processing code.
Plenty of error-handling strategies exist for using dead letter queues. The following DOs and DON'Ts explore best practices and lessons learned.
Several options are available for handling messages stored in a dead letter queue:
Here are some best practices and lessons learned for error handling using a Dead Letter Queue within Kafka applications:
Remember that a DLQ sacrifices guaranteed ordering and makes any sort of offline processing much harder. Hence, a Kafka DLQ is not perfect for every use case.
Let’s explore what kinds of messages you should NOT put into a Dead Letter Queue in Kafka:
Last but not least, let’s explore the possibility to reduce or even eliminate the need for a Dead Letter Queue in some scenarios.
The Schema Registry for Kafka is a way to ensure data cleansing to prevent errors in the payload from producers. It enforces the correct message structure in the Kafka producer:
Schema Registry is a client-side check of the schema. Some implementations like Confluent Server provide an additional schema check on the broker side to reject invalid or malicious messages that come from a producer which is not using the Schema Registry.
Let’s look at four case studies from Uber, CrowdStrike, Santander Bank, and Robinhood for real-world deployment of Dead Letter Queues in a Kafka infrastructure. Keep in mind that those are very mature examples. Not every project needs that much complexity.
In distributed systems, retries are inevitable. From network errors to replication issues and even outages in downstream dependencies, services operating at a massive scale must be prepared to encounter, identify, and handle failure as gracefully as possible.
Given the scope and pace at which Uber operates, its systems must be fault-tolerant and uncompromising when failing intelligently. Uber leverages Apache Kafka for various use cases at an extreme scale to accomplish this.
Using these properties, the Uber Insurance Engineering team extended Kafka’s role in their existing event-driven architecture by using non-blocking request reprocessing and Dead Letter Queues to achieve decoupled, observable error handling without disrupting real-time traffic. This strategy helps their opt-in Driver Injury Protection program run reliably in over 200 cities, deducting per-mile premiums per trip for enrolled drivers.
Here is an example of Uber’s error handling. Errors trickle-down levels of retry topics until landing in the DLQ:
For more information, read Uber’s very detailed technical article: ‘Building Reliable Reprocessing and Dead Letter Queues with Apache Kafka‘.
CrowdStrike is a cybersecurity technology company based in Austin, Texas. It provides cloud workload and endpoint security, threat intelligence, and cyberattack response services.
CrowdStrike’s infrastructure processes trillions of events daily with Apache Kafka. I covered related use cases for creating situational awareness and threat intelligence in real-time at any scale in my ‘Cybersecurity with Apache Kafka blog series‘.
CrowdStrike defines three best practices to implement Dead Letter Queues and error handling successfully:
In a cybersecurity platform like CrowdStrike, real-time data processing at scale is crucial. This requirement is valid for error handling, too. The next cyberattack might be a malicious message that intentionally includes inappropriate or invalid content (like a JavaScript exploit). Hence, handling errors in real-time via a Dead Letter Queue is a MUST.
Santander Bank had enormous challenges with their synchronous data processing in their mailbox application to process mass volumes of data. They rearchitected their infrastructure and built a decoupled and scalable architecture called “Santander Mailbox 2.0”.
Santander re-architected its workloads and moved to Event Sourcing powered by Apache Kafka:
A key challenge in the new asynchronous event-based architecture was error handling. Santander solved the issues using error-handling built with retry and DLQ Kafka topics:
Check out the details in the Kafka Summit talk “Reliable Event Delivery in Apache Kafka Based on Retry Policy and Dead Letter Topics” from Santander’s integration partner Consdata.
Blog post UPDATE October 2022:
Robinhood, a financial services company (famous for its trading app), presented another approach for handling errors in Kafka messages at Current 2022. Instead of using only Kafka topics for error handling, they insert failed messages in a Postgresql database. A client application including CLI fixes the issues and republishes the messages to the Kafka topic:
Real-world use cases at Robinhood include:
Currently, the error handling application is “only” usable via the command line and is relatively inflexible. New features will improve the DLQ handling in the future:
Robinhood’s DLQ implementation shows that error handling is worth investing in a dedicated project in some scenarios.
Error handling is crucial for building reliable data streaming pipelines and platforms. Different alternatives exist for solving this problem. The solution includes a custom implementation of a Dead Letter Queue or leveraging frameworks in use anyway, such as Kafka Streams, Kafka Connect, the Spring framework, or the Parallel Consumer for Kafka.
The case studies from Uber, CrowdStrike, Santander Bank, and Robinhood showed that error handling is not always easy to implement. It needs to be thought through from the beginning when you design a new application or architecture. Real-time data streaming with Apache Kafka is compelling but only successful if you can handle unexpected behavior. Dead Letter Queues are an excellent option for many scenarios.
Do you use the Dead Letter Queue design pattern in your Apache Kafka applications? What are the use cases and limitations? How do you implement error handling in your Kafka applications? When do you prefer a message queue instead, and why? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
Comments:
Hey Kai - thank you so much for the great post. Quick question for you: do you have ideas about how to handle scenarios where Kafka is completely unavailable to the Producer? I think this takes away the DLQ option, right? So would you need a separate persistence layer for the event data that can't even get to your Kafka brokers? Like a MongoDB collection? Just curious if you have thoughts on this scenario. Thanks in advance!
Great question. Yes, if you need to ensure this scenario, you need local persistence. This could be a file on disk, a database like Mongo, or even a single Kafka broker. The latter has the cool advantage that you use the same API and technology at the local edge and in the data center / cloud. Kafka at the edge is coming up more and more (also as an alternative to other persistence layers):
https://www.kai-waehner.de/blog/2020/10/14/use-cases-architectures-apache-kafka-edge-computing-industrial-iot-retail-store-cell-tower-train-factory/
Also check out the pros and cons (Kafka does NOT replace MongoDB or other databases in many scenarios):
https://www.kai-waehner.de/blog/2020/03/12/can-apache-kafka-replace-database-acid-storage-transactions-sql-nosql-data-lake/
Hi Kai, is there any way to stop the messages going to DLQ Topics after a defined number of retries in Kafka?
For example, in RabbitMQ we can throw "ImmediateAcknowledgeAmqpException" to stop sending messages to DLQ Topics. Is there an equivalent exception or some sort of mechanism available in Kafka?