This blog post covers use cases and architectures for Apache Kafka and Event Streaming in Pharma and Life Sciences. The technical example explores drug development and discovery with real time data processing, machine learning, workflow orchestration and image / video processing.
The following shows some of the use cases I have seen in the field in pharma and life sciences:
What many of them have in common is that they are not new. But event streaming at scale in real time can help improve the processes and enable innovative new applications. Therefore, Apache Kafka is a perfect fit for the Pharma and Life Science industry. Having said this, it is important to start with a use case and a goal in order to add business value:
From a technical perspective, the next step is drilling down into technologies as you can see in the above picture. Typically, you combine different concepts like ‘data pipelines’ and ‘stream processing’ to implement the business solution.
The pharmaceutical and life science industry today has an unprecedented wealth of opportunities to generate added value from data.
These possibilities cover all relevant areas such as:
Novel data use:
Challenges:
Let’s now take a look at how to solve these challenges to add value from existing data…
Here are a few examples of Pharma and Life Sciences companies relying on event streaming with Kafka and its ecosystem:
These companies spoke at past Kafka Summits about their use cases. Find more details in the linked slides and video recordings.
All of them have in common that the event streaming platform based on Apache Kafka is the heart of their integration and data processing infrastructure:
Let’s now take a look at a concrete example to go into more details.
I want to cover one specific use case: Drug Discovery. Honestly, I am not an expert in this area. Therefore, I use examples from the company ‘Recursion Pharmaceutical’. They gave a talk at a past Kafka Summit titled “Drug Discovery at Scale in Real Time with Kafka Streams”.
The motivation for improving the drug development process is pretty clear: Cost reduction and faster time-to-market.
Here are a few quotes from McKinsey & Company:
The process for drug discovery is long and complex:
As you can see, the drug development process takes many years. Part of the reason is that drug discovery requires many clinical studies, each involving data processing and analytics on large data sets.
Recursion Pharmaceutical went from ‘drug discovery in manual and slow, not scalable, bursty BATCH MODE’ to ‘drug discovery in automated, scalable, reliable REAL TIME MODE’.
They created a massively parallel system that combines experimental biology, artificial intelligence, automation and real-time event streaming to accelerate drug discovery:
This hybrid event streaming architecture is explained in more detail in Recursion Pharmaceutical’s Kafka Summit talk.
While Recursion Pharmaceutical showed a concrete example, I want to share a more general view of such an architecture in the following…
The following is a possible solution to do data processing based on business rules (e.g. feature engineering or filtering) in conjunction with machine learning (e.g. image recognition using a convolutional neural network / CNN):
Such an infrastructure typically combines modern technologies with old, legacy interfaces (like file processing on an old Windows server). Different programming languages and tools are used in different parts of the process. It is not uncommon to see Python, Java, .NET and some proprietary tools in one single workflow.
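To make the combination of business rules and machine learning a bit more tangible, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the `ImageEvent` fields, the filter rule, the two features and the `stub_model` function are hypothetical and not taken from Recursion's system. In a real deployment, the events would be consumed from a Kafka topic and the stub would be replaced by a trained model such as a CNN.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Iterator, List


@dataclass
class ImageEvent:
    """Hypothetical event describing one microscopy image."""
    experiment_id: str
    well: str
    pixels: List[float]  # flattened grayscale intensities, expected in [0, 1]


def passes_business_rules(event: ImageEvent) -> bool:
    """Filter step: drop empty images and images with out-of-range values."""
    return len(event.pixels) > 0 and all(0.0 <= p <= 1.0 for p in event.pixels)


def engineer_features(event: ImageEvent) -> Dict[str, float]:
    """Feature engineering step: derive simple summary statistics."""
    n = len(event.pixels)
    mean = sum(event.pixels) / n
    variance = sum((p - mean) ** 2 for p in event.pixels) / n
    return {"mean_intensity": mean, "variance": variance}


def stub_model(features: Dict[str, float]) -> float:
    """Stand-in for a trained model; the score is made up for illustration."""
    return min(1.0, features["mean_intensity"] + features["variance"])


def process(
    events: Iterable[ImageEvent],
    model: Callable[[Dict[str, float]], float],
) -> Iterator[dict]:
    """Apply filter, feature engineering and model scoring to each event."""
    for event in events:
        if not passes_business_rules(event):
            continue  # rejected by the business rules
        features = engineer_features(event)
        yield {
            "experiment_id": event.experiment_id,
            "well": event.well,
            "score": model(features),
            **features,
        }
```

Because `process` handles one event at a time, the same three steps could run continuously on a stream instead of on a nightly batch. Each step could also live in a separate application written in a different language, with Kafka topics in between.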
The following maps the above use case to concrete cutting-edge technologies:
What makes this architecture exciting?
I did not have the time to implement this use case. But the good news is that a demo is available that shows the same architecture and combination of technologies, applied to a connected car infrastructure for data processing and analytics at scale in real time. Check out the blog post and video or the GitHub project for more details.
I want to cover a few more topics which come up regularly when I discuss Kafka use cases with customers from pharma, life sciences and other industries:
Each one is worth its own blog post, but the following will point you in the right direction.
Image and video processing is a very important topic in many industries. Many pharma and life sciences processes require it, too.
The key question: Can and should you do image / video processing with Kafka? Or how does this fit into the story at all?
Several alternatives exist (and I have seen all three in the field several times):
All approaches are valid and have their pros and cons.
LinkedIn did a great presentation in 2016 about this topic. Here are their trade-offs for sending large messages via Kafka vs. sending just the reference link:
Please keep in mind that this presentation was given in 2016. Kafka and its ecosystem have improved a lot since then. Infrastructures have also changed a lot regarding scalability and cost. Therefore, find the right architecture and cost structure for your use case!
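The option of sending just a reference instead of the large payload can be sketched as follows. This is a minimal illustration under stated assumptions: the `object_store` dictionary and the `topic` list are in-memory stand-ins for an external object store and a Kafka topic, and the function names and event fields are hypothetical.

```python
import hashlib
import json
from typing import Dict, List

# Hypothetical in-memory stand-ins for an object store and a Kafka topic.
object_store: Dict[str, bytes] = {}
topic: List[bytes] = []


def publish_image_reference(image_bytes: bytes, experiment_id: str) -> str:
    """Store the large payload externally and publish only a small event."""
    digest = hashlib.sha256(image_bytes).hexdigest()
    key = f"images/{experiment_id}/{digest}"
    object_store[key] = image_bytes
    event = {
        "experiment_id": experiment_id,
        "object_key": key,
        "size_bytes": len(image_bytes),
        "sha256": digest,
    }
    topic.append(json.dumps(event).encode("utf-8"))
    return key


def consume_image(message: bytes) -> bytes:
    """Resolve the reference and verify the payload against its checksum."""
    event = json.loads(message.decode("utf-8"))
    payload = object_store[event["object_key"]]
    if hashlib.sha256(payload).hexdigest() != event["sha256"]:
        raise ValueError("payload does not match the checksum in the event")
    return payload
```

The event stays a few hundred bytes regardless of the image size, which keeps the load on the brokers small. The price is a second system to operate and a consumer that depends on the object store being reachable.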
UPDATE 2020: I wrote a blog post about the current status of processing large messages with Kafka. Check it out for the latest capabilities and use cases.
Business processes are often complex. Some can be fully automated. Others need human interaction. In short, there are two approaches for Workflow Orchestration in a Kafka infrastructure:
The advantage of Kafka-native workflow orchestration is that there is only one infrastructure to operate 24/7. But if it is not sufficient or you want to use a nice, pre-built UI, then there is nothing wrong with combining Kafka with an external workflow orchestration tool.
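One way to picture the Kafka-native approach is a small state machine per workflow instance, keyed by a workflow ID. The sketch below is only an illustration of that idea: the step names and the transition table are hypothetical, and a plain dictionary stands in for the state store that a stateful stream processing application would maintain.

```python
from typing import Dict, Tuple

# Hypothetical workflow steps; the names are illustrative only.
TRANSITIONS: Dict[Tuple[str, str], str] = {
    ("NEW", "sample_received"): "IMAGING",
    ("IMAGING", "images_captured"): "ANALYSIS",
    ("ANALYSIS", "analysis_completed"): "REVIEW",  # waits for a human
    ("REVIEW", "review_approved"): "DONE",
}


def advance(state_store: Dict[str, str], workflow_id: str, event_type: str) -> str:
    """Apply one event to a workflow instance and return its resulting state."""
    current = state_store.get(workflow_id, "NEW")
    next_state = TRANSITIONS.get((current, event_type))
    if next_state is None:
        # Unexpected or out-of-order event: leave the state unchanged.
        return current
    state_store[workflow_id] = next_state
    return next_state
```

A human approval step fits into the same model: the workflow simply stays in `REVIEW` until a `review_approved` event arrives, whether that takes seconds or days.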
I pointed this out above already, but want to highlight it again in its own section: Apache Kafka is a great technology for deploying modern, scalable, reliable middleware. In pharma and life sciences, many different technologies, protocols, interfaces and communication paradigms have to be integrated with each other, from mainframes and batch systems to modern big data analytics platforms and real-time event streaming applications.
Kafka and its ecosystem are a perfect fit:
The following shows how you can leverage the Strangler Design Pattern to integrate and (partly) replace legacy systems like mainframes:
If you are thinking about using the Kafka ecosystem in your Pharma or Life Science projects, please check out my blogs, slides and videos about Apache Kafka vs. Middleware (MQ, ETL, ESB) and “Mainframe Offloading and Replacement with Apache Kafka”.
I created some slides and a video recording discussing Apache Kafka and Machine Learning in Pharma and Life Sciences. Check them out:
Slides:
Video recording:
The pharmaceutical and life science industry today has an unprecedented wealth of opportunities to generate added value from data. Apache Kafka and Event Streaming are a perfect fit. This includes scalable big data pipelines, machine learning for real time analytics, image / video processing, and workflow orchestration.
What are your experiences in pharma and life science projects? Did you or do you plan to use Apache Kafka and its ecosystem? What is your strategy? Let’s connect on LinkedIn and discuss!