Why I Move (Back) to Open Source for Messaging, Integration and Stream Processing

After three great years at TIBCO Software, I move back to open source and join Confluent, a company focusing on the open source project Apache Kafka to build mission-critical, scalable infrastructures for messaging, integration and streaming analytics. Confluent is a Silicon Valley startup, still in the beginning of its journey, with a 700% growing business in 2016, and is exjustpected to grow significantly in 2017 again.

In this blog post, I want to share why I see the future for middleware and big data analytics in open source technologies, why I really like Confluent, what I will focus on in the next months, and why I am so excited about this next step in my career.

Let’s talk shortly about three cutting-edge topics which get important in any industry and small, medium and big enterprises these days:

  • Big Data Analytics: Find insights and patterns in big historical datasets.
  • Real Time Streaming Platforms: Apply insights and patterns to new events in real time (e.g. for fraud detection, cross selling or predictive maintenance).
  • Machine Learning (and its hot subtopic Deep Learning): Leverage algorithms and let machines learn by themselves without programming everything explicitly.

These three topics disrupt every industry these days. Note that Machine Learning is related to the other two topics. Though today, we often see it as independent topic; many data science projects actually use only very small datasets (often less than a Gigabyte of input data). Fortunately, all three topics will be combined more and more to add additional business value.

Some industries are just in the beginning of their journey of disruption and digital transformation (like banks, telcos, insurance companies), others already realized some changes and innovation (e.g. retailers, airlines). In addition to the above topics, some other cutting edge success stories emerge in a few industries, like Internet of Things (IoT) in manufacturing or Blockchain in banking.

With all these business trends on the market, we also see a key technology trend for all these topics: The adoption of open source projects.

Key Technology Trend: Adoption of “Real” Open Source Projects

When I say “open source”, I mean some specific projects. I do not talk about very new, immature projects, but frameworks which are deployed for many years in production successfully, and used by various different developers and organizations. For example, Confluent’s Docker images like the Kafka REST Proxy or Kafka Schema Registry are each downloaded over 100.000 times, already.

A “real”, successful middleware or analytics open source project has the following characteristics:

  • Openness: Available for free and really open under a permissive license, i.e. you can use it in production and scale it out without any need to purchase a license or subscription (of course, there can be commercial, proprietary add-ons – but they need to be on top of the project, and not change the license for the used open source project under the hood)
  • Maturity: Used in business-relevant or even mission critical environments for at least one year, typically much longer
  • Adoption: Various vendors and enterprises support a project, either by contributing (source code, documentation, add-ons, tools, commercial support) or realizing projects
  • Flexibility: Deployment on any infrastructure, including on premise, public cloud, hybrid. Support for various application environments (Windows, Linux, Virtual Machine, Docker, Serverless, etc.), APIs for several programming languages (Java, .Net, Go, etc.)
  • Integration: Independent and loosely coupled, but also highly integrated (via connectors, plugins, etc.) to other relevant open source and commercial components

After defining key characteristics for successful open source projects, let’s take a look some frameworks with huge momentum.

Cutting Edge Open Source Technologies: Apache Hadoop, Apache Spark, Apache Kafka

I defined three key trends above which are relevant for any industry and many (open source and proprietary) software vendors. There is a clear trend towards some open source projects as de facto standards for new projects:

  • Big Data Analytics: Apache Hadoop (and its zoo of sub projects like Hive, Pig, Impala, HBase, etc.) and Apache Spark (which is often separated from Hadoop in the meantime) to store, process and analyze huge historical datasets
  • Real Time Streaming Platforms: Apache Kafka – not just for highly scalable messaging, but also for integration and streaming analytics. Platforms either use Kafka Streams to build stream processing applications / microservices or an “external” framework like Apache Flink, Apex, Storm or Heron.
  • Machine Learning: No clear “winner” here (and that is a good thing in my opinion as it is so multifaceted). Many great frameworks are available – for example R, Python and Scala offer various great implementations of Machine Learning algorithms, and specific frameworks like Caffee, Torch, TensorFlow or MXNet focus on Deep Learning and Artificial Neural Networks.

On top of these frameworks, various vendors build open source or proprietary tooling and offer commercial support. Think about the key Hadoop / Spark vendors: Hortonworks, Cloudera, MapR and others, or KNIME, RapidMiner or H2O.ai as specialized open source tools for machine learning in a visual coding environment.

Of course, there are many other great open source frameworks not mentioned here but also relevant on the market, for example RabbitMQ and ActiveMQ for messaging or Apache Camel for integration. In addition, new “best practice stacks” are emerging, like the SMACK Stack which combines Spark, Mesos, Akka, and Kafka.

I am so excited about Apache Kafka and Confluent, because it is used in any industry and many small and big enterprises, already. Apache Kafka production deployments accelerated in 2016 and it is now used by one-third of the Fortune 500. And this is just the beginning. Apache Kafka is not an all-rounder to solve all problems, but it is awesome in the things it is built for – as the huge and growing number of users, contributors and production deployments prove. It highly integrated with many other frameworks and tools. Therefore, I will not just focus on Apache Kafka and Confluent in my new job, but also many other technologies as discussed later.

Let’s next think about the relation of Apache Kafka and Confluent to proprietary software.

Open Source vs. Proprietary Software – A Never-ending War?

The trend is moving towards open source technologies these days. This is no question, but a fact. I have not seen a single customer in the last years who does not have projects and huge investments around Hadoop, Spark and Kafka. In the meantime, it changed from labs and first small projects to enterprise de facto standards and company-wide deployments and strategies. Replacement of closed legacy software is coming more and more – to reduce costs, but even more important to be more flexible, up-to-date and innovative.

What does this mean for proprietary software?

For some topics, I do not see much traction or demand for proprietary solutions. Two very relevant examples where closed software ruled the last ~20 years: Messaging solutions and analytics platforms. Open frameworks seem to replace them almost everywhere in any industry and enterprise in new projects (for good reasons).

New messaging projects are based on standards like MQTT or frameworks like Apache Kafka. Analytics is done with R and Python in conjunction with frameworks like scikit-learn or TensorFlow. These options leverage flexible, but also very mature implementations. Often, there is no need for a lot of proprietary, inflexible, complex or expensive tooling on top of it. Even IBM, the mega vendor, focuses on offerings around open source in the meantime.

IBM invests millions into Apache Spark for big data analytics and puts over 3500 researchers and developers to work on Spark-related projects instead of just pushing towards its various own proprietary analytics offerings like IBM SPSS. If you search for “IBM Messaging”, you find offerings based on AMQP standard and cloud services based on Apache Kafka instead of pushing towards new proprietary solutions!

I think IBM is a great example of how the software market is changing these days. Confluent (just in the beginning of its journey) or Cloudera (just went public with a successful IPO) are great examples for Silicon Valley startups going the same way.

In my opinion, a good proprietary software leverages open source technologies like Apache Kafka, Apache Hadoop or Apache Spark. Vendors should integrate natively with these technologies. Some opportunities for vendors:

  • Visual coding (for developers) to generate code (e.g. graphical components, which generate framework-compliant source code for a Hadoop or Spark job)
  • Visual tooling (e.g. for business analysts or data scientists), like a Visual Analytics tools which connect to big data stores to find insights and patterns
  • Infrastructure tooling for management and monitoring of open source infrastructure (e.g. tools to monitor and scale Apache Kafka messaging infrastructures)
  • Add-ons which are natively integrated with open source frameworks (e.g. instead of requiring own proprietary runtime and messaging infrastructures, an integration vendor should deploy its integration microservices on open cloud-native platforms like Kubernetes or Cloud Foundry and leverage open messaging infrastructures like Apache Kafka instead of pushing towards proprietary solutions)

Open Source and Proprietary Software Complement Each Other

Therefore, I do not see this as a discussion of “open source software” versus “proprietary software”. Both complement each other very well. You should always ask the following questions before making a decision for open source software, proprietary software or a combination of both:

  • What is the added value of the proprietary solution? Does this increase the complexity and increases the footprint of runtime and tooling?
  • What is the expected total cost of ownership of a project (TCO), i.e. license / subscription + project lifecycle costs?
  • How to realize the project? Who will support you, how do you find experts for delivery (typically consulting companies)? Integration and analytics projects are often huge with big investments, so how to make sure you can deliver (implementation, test, deployment, operations, change management, etc.)? Can we get commercial support for our mission-critical deployments (24/7)?
  • How to use this project with the rest of the enterprise architecture? Do you deploy everything on the same cluster? Do we want to set some (open) de facto standards in different departments and business units?
  • Do we want to use the same technology in new environments without limiting 30-day-trials or annoying sales cycles, maybe even deploy it to production without any license / subscription costs?
  • Do we want to add changes, enhancements or fixes to the platform by ourselves (e.g. if we need a specific feature immediately, not in six months)?

Let’s think about a specific example with these questions in mind:

Example: Do you still need an Enterprise Service Bus (ESB) in a World of Cloud-Native Microservices?

I faced this question a lot in the last 24 months, especially with the trend moving to flexible, agile microservices (not just for building business applications, but also for integration and analytics middleware). See my article “Do Microservices Spell the End of the ESB?”. The short summary: You still need middleware (call it ESB, integration platform, iPaaS, or somehow else), though the requirements are different today. This is true for open source and proprietary ESB products. However, something else has changed in the last 24 months…

In the past, open source and proprietary middleware vendors offered an ESB as integration platform. The offering included a runtime (to guarantee scalable, mission-critical operation of integration services) and a development environment (for visual coding and faster time-to-market). The last two years changed how we think about building new applications. We now (want to) build microservices, which run in a Docker container. The scalable, mission-critical runtime is managed by a cloud-native platform like Kubernetes or Cloud Foundry. Ideally, DevOps automates builds, tests and deployment. These days, ESB vendors adopt these technologies. So far, so good.

Now, you can deploy your middleware microservice to these cloud-native platforms like any other Java, .NET or Go microservice. However, this completely changes the added value of the ESB platform. Now, its benefit is just about visual coding and the key argument is time-to-market (you should always question and doublecheck if it is a valid argument). The runtime is not really part of the ESB anymore. In most scenarios, this completely changes the view on deciding if you still need an ESB. Ask yourself about time-to-market, license / subscription costs and TCO again! Also think about the (typically) increased resource requirements (Memory, CPU, Disk) of tooling-built services (typically some kind of big .ear file), compared to plain source code (e.g. Java’s .jar files).

Is the ESB still enough added value or should you just use a cloud-native platform and a messaging infrastructure? Is it easier to write some lines of source instead of setting up the ESB tooling where you often struggle importing your REST / Swagger or WSDL files and many other configuration environments before you actually can begin to leverage the visual coding environment? In very big, long-running projects, you might finally end up with a win. Though, in an agile, every-changing world with fail-fast ideology, various different technologies and frameworks, and automated CI/CD stacks, you might only add new complexity, but not get the expected value anymore like in the old world where the ESB was also the mission-critical runtime. The same is true for other middleware components like stream processing, analytic platforms, etc.

ESB Alternative: Apache Kafka and Confluent Open Source Platform

As alternative, you could use for example Kafka Connect, which is a very lightweight integration library based on Apache Kafka to build large-scale low-latency data pipelines. The beauty of Kafka Connect is that all the challenges around scalability, fail-over and performance are leveraged from the Kafka infrastructure. You just have to use the Kafka Connect connectors to realize very powerful integrations with a few lines of configuration for sources and sinks. If you use Apache Kafka as messaging infrastructure anyway, you need to find some very compelling reasons to use an ESB on top instead of the much less complex and much less heavyweight Kafka Connect library.

I think this section explained why I think that open source and proprietary software are complementary in many use cases. But it does not make sense to add heavyweight, resource intensive and complex tooling in every scenario. Open source is not free (you still need to spend time and efforts on the project and probably pay money for some kind of commercial support), but often open source without too much additional proprietary tooling is the better choice regarding complexity, time-to-market and TCO. You can find endless success stories about open source projects; not just from tech giants like Google, Facebook or LinkedIn, but also from many small and big traditional enterprises. Of course, any project can fail. Though, in projects with frameworks like Hadoop, Spark or Kafka, it is probably not due to issues with technology…

Confluent + “Proprietary Mega Vendors”

On the other side, I really look forward working together with “mostly proprietary” mega vendors like TIBCO, SAP, SAS and others where it makes sense to solve customer problems and build innovative, powerful, mission-critical solutions. For example, TIBCO StreamBase is a great tool if you want to develop stream processing applications via visual editor instead of writing source code. Actually, it does not even compete with Kafka Streams because the latter one is a library which you embed into any other microservice or application (deployed anywhere, e.g. in a Java application, Docker container, Apache Mesos, “you-choose-it”) while StreamBase (like its competitors Software AG Apama, IBM Streams and all the other open source frameworks like Apache Flink, Storm, Apache Apex, Heron, etc.) focus on building streaming application on its own cluster (typically either deployed on Hadoop’s YARN or on a proprietary cluster). Therefore, you could use StreamBase and its Kafka connector to build streaming applications leveraging Kafka as messaging infrastructure.

Even Confluent itself does offer some proprietary tooling like Confluent Control Center for management and monitoring on top of open source Apache Kafka and open source Confluent Platform, of course. This is the typical business model you see behind successful open source vendors like Red Hat: Embrace open source projects, offer 24/7 support and proprietary add-ons for enterprise-grade deployments. Thus, not everything is or needs to be open source. That’s absolutely fine.

So, after all these discussions about business and technology trends, open source and proprietary software, what will I do in my new job at Confluent?

Confluent Platform in Conjunction with Analytics, Machine Learning, Blockchain, Enterprise Middleware

Of course, I will focus a lot on Apache Kafka and Confluent Platform in my new job where I will work mainly with prospects and customers in EMEA, but also continue as Technology Evangelist with publications and conference talks. Let’s get into a little bit more detail here…

My focus never was being a deep level technology expert or fixing issues in production environments (but I do hands-on coding, of course). Many other technology experts are way better in very technical discussions. As in the past, I will focus on designing mission-critical enterprise architectures, discussing innovative real world use cases with prospects and customers, and evaluating cutting edge technologies in conjunction with the Confluent Platform. Here are some of my ideas for the next months:

  • Apache Kafka + Cloud Native Platforms = Highly Scalable Streaming Microservices (leveraging platforms like Kubernetes, Cloud Foundry, Apache Mesos)
  • Highly Scalable Machine Learning and Deep Learning Microservices with Apache Kafka Streams (using TensorFlow, MXNet, H2O.ai, and other open source frameworks)
  • Online Machine Learning (i.e. updating analytics models in real time for every new event) leveraging Apache Kafka as infrastructure backbone
  • Open Source IoT Platform for Core and Edge Streaming Analytics (leveraging Apache Kafka, Apache Edgent, and other IoT frameworks)
  • Comparison of Open Source Stream Processing Frameworks (differences between Kafka Streams and other modern frameworks like Heron, Apache Flink, Apex, Spark Streaming, Edgent, Nifi, StreamSets, etc.)
  • Apache Kafka / Confluent and other Enterprise Middleware (discuss when to combine proprietary middleware with Apache Kafka, and when to simply “just” use Confluent’s open source platform)
  • Middleware and Streaming Analytics as Key for Success in Blockchain Projects

You can expect publications, conference and meetup talks, and webinars about these and other topics in 2017 like I did it in the last years. Please let me know what you are most interested in and what other topics you would like to hear about!

I am also really looking forward to work together with partners on scalable, mission-critical enterprise architectures and powerful solutions around Apache Kafka and Confluent Platform. This will include combined solutions and use cases with open source but also proprietary software vendors.

Last but not least the most important part, I am excited to work with prospects and customers from traditional enterprises, tech giants and startups to realize innovative use cases and success stories to add business value.

As you can see, I am really excited to start at Confluent in May 2017. I will visit Confluent’s London and Palo Alto offices in the first weeks and also be at Kafka Summit in New York. Thus, an exciting month to get started in this awesome Silicon Valley startup.

Please let me know your feedback. Do you see the same trends? Do you share my opinions or disagree? Looking forward to discuss all these topics with customers, partners and anybody else in upcoming meetings, workshops, publications, webinars, meetups and conference talks.

Kai Waehner

builds cloud-native event streaming infrastructures for real-time data processing and analytics

Recent Posts

Real-Time Locating System and Digital Twin in Smart Factories with Data Streaming at Daimler Truck

Technologies like Real-Time Locating Systems (RTLS) and Digital Twin are transforming manufacturing processes in the…

18 hours ago

Data Streaming with Apache Kafka at Daimler Truck for Industrial IoT and Cloud Use Cases

As a global leader in the commercial vehicle sector, Daimler Truck is not only committed…

4 days ago

A New Era in Dynamic Pricing: Real-Time Data Streaming with Apache Kafka and Flink

In the age of digitization, the concept of pricing is no longer fixed or manual.…

1 week ago

IoT and Data Streaming with Kafka for a Tolling Traffic System with Dynamic Pricing

In the rapidly evolving landscape of intelligent traffic systems, innovative software provides real-time processing capabilities,…

3 weeks ago

Fraud Prevention in Under 60 Seconds with Apache Kafka: How A Bank in Thailand is Leading the Charge

In the fast-paced world of finance, the ability to prevent fraud in real-time is not…

4 weeks ago

When to Choose Apache Kafka vs. Azure Event Hubs vs. Confluent Cloud for a Microsoft Fabric Lakehouse

Choosing between Apache Kafka, Azure Event Hubs, and Confluent Cloud for data streaming is critical…

1 month ago