How do you prevent hallucinations from large language models (LLMs) in GenAI applications? LLMs need real-time, contextualized, and trustworthy data to generate the most reliable outputs. This blog post explains how RAG and a data streaming platform with Apache Kafka and Flink make that possible. A lightboard video shows how to build a context-specific real-time RAG architecture. Also, learn how the travel agency Expedia leverages data streaming with Generative AI using conversational chatbots to improve the customer experience and reduce the cost of service agents.
Generative AI (GenAI) refers to artificial intelligence (AI) systems that can create new content, such as text, images, music, or code, often mimicking human creativity. These systems use advanced machine learning techniques, particularly deep learning models like neural networks, to generate data that resembles the training data they were fed. Popular examples include language models like GPT-3 for text generation and DALL-E for image creation.
Large Language Models like ChatGPT use lots of public data, are very expensive to train, and do not provide domain-specific context. Training own models is not an option for most companies because of limitations in cost and expertise.
Retrieval Augmented Generation (RAG) is a technique in Generative AI to solve this problem. RAG enhances the performance of language models by integrating information retrieval mechanisms into the generation process. This approach aims to combine the strengths of information retrieval systems and generative models to produce more accurate and contextually relevant outputs.
Pinecone created an excellent diagram that explains RAG and shows the relation to an embedding model and vector database:
RAG brings various benefits to the GenAI enterprise architecture:
However, one of the most significant problems still exists: the missing right context and up-to-date information…
RAG is obviously crucial in enterprises where data privacy, up-to-date context, and the data integration with transactional and analytical systems like an order management system, booking platform or payment fraud engine must be consistent, scalable and in real-time.
An event-driven architecture is the foundation of data streaming with Kafka and Flink:
Apache Kafka and Apache Flink play a crucial role in the Retrieval Augmented Generation (RAG) architecture by ensuring real-time data flow and processing, which enhances the system’s ability to retrieve and generate up-to-date and contextually relevant information.
Here’s how Kafka and Flink contribute to the RAG architecture:
Data Ingestion: Kafka acts as a high-throughput, low-latency messaging system that ingests real-time data from various data sources, such as databases, APIs, sensors, or user interactions.
Event Streaming: Kafka streams the ingested data, ensuring that the data is available in real-time to downstream systems. This is critical for applications that require immediate access to the latest information.
Stream Processing: Flink processes the incoming data streams in real-time. It can perform complex transformations, aggregations, and enrichments on the data as it flows through the system.
Low Latency: Flink’s ability to handle stateful computations with low latency ensures that the processed data is quickly available for retrieval operations.
Real-Time Updates: By using Kafka and Flink, the retrieval component of RAG can access the most current data. This is crucial for generating responses that are not only accurate but also timely.
Dynamic Indexing: As new data arrives, Flink can update the retrieval index in real-time, ensuring that the latest information is always retrievable in a vector database.
Scalable Architecture: Kafka’s distributed architecture allows it to handle large volumes of data, making it suitable for applications with high throughput requirements. Flink’s scalable stream processing capabilities ensure it can process and analyze large data streams efficiently. Cloud-native implementations or cloud services take over the operations and elastic scale.
Fault Tolerance: Kafka provides built-in fault tolerance by replicating data across multiple nodes, ensuring data durability and availability, even in the case of node failures. Flink offers state recovery and exactly-once processing semantics, ensuring reliable and consistent data processing.
Contextual Data Processing: Flink can enrich the raw data with additional context before the generative model uses it. For instance, Flink can join incoming data streams with historical data or external datasets to provide a richer context for retrieval operations.
Feature Extraction: Flink can extract features from the data streams that help improving the relevance of the retrieved documents or passages.
Seamless Integration: Kafka and Flink integrate well with model servers (e.g., for model embeddings) and storage systems (e.g., vector data bases for sematic search). This makes it easy to incorporate the right information and context into the RAG architecture.
Modular Design: The use of Kafka and Flink allows for a modular design where different components (data ingestion, processing, retrieval, generation) can be developed, scaled, and maintained independently.
The following ten-minute lightboard video is an excellent interactive explanation for building a RAG architecture with embedding model, vector database, Kafka and Flink to ensure up-to-date and context-specific prompts into the LLM:
Expedia is an online travel agency that provides booking services for flights, hotels, car rentals, vacation packages, and other travel-related services. The IT architecture is built around data streaming for many years already, including the integration of transactional and analytical systems.
When Covid hit, Expedia had to innovate fast to handle all the support traffic spikes regarding flight rebookings, cancellations, and refunds. The project team trained a domain-specific conversational chatbot (long before ChatGPT and the term GenAI existed) and integrated it into the business process.
Here are some of the impressive business outcomes:
By leveraging Apache Kafka and Apache Flink, the RAG architecture can handle real-time data ingestion, processing, and retrieval efficiently. This ensures that the generative model has access to the most current and contextually rich information, resulting in more accurate and relevant responses. The scalability, fault tolerance, and flexibility offered by Kafka and Flink make them ideal components for enhancing the capabilities of RAG systems.
If you want to learn more about data streaming with GenAI, read these articles:
How do you build a RAG architecture? Do you already leveraging Kafka and Flink for it? Or what technologies and architectures do you use? Let’s connect on LinkedIn and discuss it! Stay informed about new blog posts by subscribing to my newsletter.
Discover when Apache Flink is the right tool for your stream processing needs. Explore its…
Data streaming with Apache Kafka and Flink is transforming the airline industry, enabling real-time efficiency…
The rise of stream processing has changed how we handle and act on data. While…
Siemens Healthineers, a global leader in medical technology, delivers solutions that improve patient outcomes and…
Discover my journey to achieving Lufthansa HON Circle (Miles & More) status in 2025. Learn…
Data streaming is a new software category. It has grown from niche adoption to becoming…