What is Apache Kafka?
Suppose you find yourself managing data and interconnecting many applications, services, and edge messaging systems. Sources such as databases, MQTT brokers, and web apps rapidly produce data. In such cases, it is important to orchestrate and deliver this information swiftly, orderly, and reliably between your systems.
For this task, many employ messaging systems such as message queues. However, if the messaging system cannot handle the load, your customers may experience issues like data loss and slow connectivity. Apache Kafka, a distributed data streaming platform, is a popular and robust solution that prioritizes speed, availability, and scalability to mitigate the adverse effects of load issues.
What is Apache Kafka, and how is it different from other systems?
Like MQTT, Apache Kafka uses a publish-subscribe messaging system. This means it has message producers (or publishers), consumers (or subscribers), and a broker that handles message delivery to appropriate topics. However, MQTT brokers only serve MQTT data and focus on handling discrete messages from many IoT devices. In contrast, Apache Kafka is an open-source data streaming platform capable of handling high volumes of data with low latency. Unlike typical message queues, Kafka excels in scalable and reliable data streaming, making it more suitable for large-scale distributed systems.
If you are wondering whether Kafka can replace an MQTT broker, the answer is that it is not that simple. Kafka has its own protocol, which many IoT devices do not support directly. Such devices must first talk to an MQTT gateway that can forward their messages to Kafka. Moreover, there are usually hordes of IoT devices generating discrete messages.
In this case, Kafka works best with an MQTT broker in front, aggregating all those discrete messages and passing them on in a way that leverages Kafka’s batch messaging optimizations. Therefore, Kafka is not meant to replace the MQTT broker but rather to complement it: the two systems interface with each other for optimal performance and ease of use. Read this comprehensive guide to discover more about utilizing MQTT vs Kafka.
Simply put, Apache Kafka is a reliable, durable, and scalable system that allows you to move and share data with any system without the need for complex networks. Before delving into the inner workings of Apache Kafka, let’s look into the motivations behind its development.
If you want to forward your MQTT data to Apache Kafka, explore the Pro Edition for Eclipse Mosquitto. This pro version of the open-source Mosquitto MQTT broker includes a simple-to-use Kafka Bridge, among other features. The Pro Mosquitto broker enables stable, reliable, fast, and secure transmissions between IoT devices and can integrate with external systems.
What is the history of Apache Kafka?
Despite its wide variety of use cases and its reputation as one of the leading data streaming platforms, Apache Kafka is relatively young: LinkedIn developed it in 2010 and open-sourced it a year later, in 2011.
Apache Kafka was initially designed to solve LinkedIn’s problem of high-latency ingestion of event data and the need for real-time processing. At the time, several solutions could ingest large amounts of data into offline batch systems, but none could do it in real time with adequate performance and the ability to scale. Thus, Kafka was developed to take data from source systems and move it around reliably.
After its open-source release in 2011, Apache Kafka started ingesting over one billion messages daily. It quickly gained popularity, and several LinkedIn engineers collaborated to expand Kafka’s features. Since then, Apache Kafka has continued to grow and currently ingests over seven trillion messages daily.
Now that you understand the origins of Apache Kafka, let’s look at how it works.
How does Apache Kafka work?
Apache Kafka combines the best of the queuing and publish-subscribe messaging models to provide its customers with highly scalable, reliable, distributed data streaming features.
In a queue system, messages are saved in a queue, and only a single client can consume each message at a time. This method typically does not allow multiple subscribers, meaning once a message is consumed, it is gone.
On the other hand, publish-subscribe models store messages in a topic, and each topic allows multiple subscribers. This means that every new subscriber can consume each message in the topic.
To seamlessly combine these two messaging models, Apache Kafka employs a partitioned log model. Each log represents a sequence of records (messages) grouped into topics. Topic data is then broken down into partitions and spread across different Kafka brokers. A user-defined key, included in each message destined for a particular topic, determines which partition the message goes to. Partitions are stored on disk, so they can be accessed at any time. Consumers can subscribe to the desired partitions of a chosen topic and specify the offset from which to read messages. This architecture ensures that multiple consumers can poll a partition of interest on a separate Kafka broker and access past records at any time.
Distributing partitions ensures consumers will connect to different brokers to request desired partitions. This keeps the number of consumers per broker manageable so as not to overwhelm the server.
Note that when a consumer subscribes to a particular topic, Kafka implicitly determines which partitions it reads from, so consumers do not have to manage partition assignment manually.
In addition, all partitions are replicated across a configurable number of broker nodes in the cluster. If one of the nodes goes down, another takes its place and continues serving the clients (consumers), keeping the infrastructure robust and ensuring high availability.
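To make the partitioned log model more concrete, below is a minimal producer sketch using the official Java client. The broker address, topic name (`sensor-readings`), and key (`sensor-42`) are hypothetical placeholders; the point is that the record key determines the partition, so all records sharing a key stay ordered within one partition.

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Placeholder broker address; replace with your cluster's bootstrap servers.
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The key ("sensor-42") is hashed to pick the partition, so all readings
            // from this sensor land in the same partition and keep their order.
            producer.send(new ProducerRecord<>("sensor-readings", "sensor-42", "23.5"));
            producer.flush();
        }
    }
}
```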
Apache Kafka security
It’s no surprise that a distributed data streaming platform like Apache Kafka handles large amounts of data daily, which begs the question: How safe is your data on Apache Kafka?
Apache Kafka provides several security measures to ensure the protection of your data and records:
- Authentication and authorization: Connections between clients and brokers, as well as read and write operations, undergo authentication and authorization. Various Simple Authentication and Security Layer (SASL) mechanisms are supported, such as SASL/GSSAPI, SASL/PLAIN, SASL/SCRAM-SHA-256, SASL/SCRAM-SHA-512, and SASL/OAUTHBEARER (see the client configuration sketch after this list).
- Encryption: SSL/TLS encrypts data in transit between brokers and clients.
- ACLs (Access Control Lists): Apache Kafka provides a mechanism to control read and write access to Kafka topics. It helps secure your Kafka cluster by restricting who can produce or consume messages to/from certain topics.
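As an illustration of how these measures appear on the client side, here is a minimal sketch of client properties enabling TLS encryption and SASL/SCRAM authentication. The credentials, truststore path, and chosen mechanism are placeholders and depend entirely on how your brokers are configured; pass the resulting properties to a producer or consumer alongside the usual serializer settings.

```java
import java.util.Properties;

public class SecureClientConfig {
    // A minimal sketch of client settings for a broker that requires SASL/SCRAM over TLS.
    // All credentials and paths below are placeholders.
    public static Properties secureProps() {
        Properties props = new Properties();
        props.put("security.protocol", "SASL_SSL");   // TLS encryption plus SASL authentication
        props.put("sasl.mechanism", "SCRAM-SHA-256"); // one of the mechanisms listed above
        props.put("sasl.jaas.config",
            "org.apache.kafka.common.security.scram.ScramLoginModule required "
            + "username=\"alice\" password=\"alice-secret\";");           // placeholder credentials
        props.put("ssl.truststore.location", "/etc/kafka/client.truststore.jks"); // placeholder path
        props.put("ssl.truststore.password", "changeit");                 // placeholder password
        return props;
    }
}
```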
Why use Apache Kafka?
Besides ensuring secure data ingestion and processing, Apache Kafka offers several other benefits.
Reliability and availability
Apache Kafka partitions and distributes the storage of topics in a fault-tolerant cluster across multiple Kafka nodes. These nodes can live in various locations, including different geographical regions. This guarantees a copy of your data is always available if issues arise or nodes fail – making disaster recovery easier.
Speed and throughput
Apache Kafka is optimized for speed and throughput, efficiently utilizing its resources, disk, and network I/O. For example, instead of transmitting many small individual messages, it batches messages together before transmitting them from data producers to Kafka brokers or from brokers to consumers.
Kafka also optimizes record storage and hard drive operations. It keeps persisted data files in the OS’s page cache as much as possible, which speeds up hard drive reads and writes and ensures the cache stays warm. Kafka further enhances hard drive performance by simply appending records to log files in a persistent queue fashion. This avoids storing data in advanced structures like B-trees, which are versatile but require slow hard drive seek operations, and thereby extracts maximum efficiency from the hard drives.
Kafka uses sendfile system calls to send portions of the stored files directly over the network. It also partitions topics by key across many brokers, increasing throughput by letting clients access only the nodes that hold the relevant data. This allows Kafka to deliver messages with extremely low latency, sometimes as low as two milliseconds.
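The batching behavior described above is tunable on the producer side. Below is a minimal sketch of the relevant settings; the broker address is a placeholder and the values shown are arbitrary examples rather than recommendations.

```java
import java.util.Properties;

public class ThroughputTuning {
    // Example producer settings that trade a little latency for higher throughput.
    // The concrete values are illustrative only.
    public static Properties batchingProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("batch.size", "65536");     // collect up to 64 KB of records per partition batch
        props.put("linger.ms", "10");         // wait up to 10 ms to fill a batch before sending
        props.put("compression.type", "lz4"); // compress whole batches on the wire
        return props;
    }
}
```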
Scalability
Apache Kafka uses a partitioned log model, meaning topics are broken into partitions, each of which can exist on a separate node in the cluster. This allows Kafka to load balance over multiple servers, enabling you to scale as needed or spread clusters over different locations.
Versatility
Apache Kafka works well as an advanced message queue in the traditional sense, but it is also well suited for storing data thanks to its efficient use of disk space. Kafka can store data for as long as necessary and allow numerous clients to consume it simultaneously. It can easily connect with virtually any other system to share data. Moreover, it also provides capabilities for processing the stored data and computing statistics with the help of the Streams API, making it highly flexible and versatile.
Apache Kafka use cases
Now that you have an in-depth understanding of why one uses Apache Kafka, let’s explore some of its most common use cases.
Kafka is most suitable for real-time streaming applications and data pipelines. This means that if your application requires reliable and scalable movement of large amounts of continuously generated data between systems, then Apache Kafka might be the solution for you.
Use cases for Apache Kafka may vary depending on your needs, but some proven use cases include:
- Metrics: A diverse range of ecosystems can use Apache Kafka to consolidate metrics and other statistics and create centralized feeds of operational data.
- Log aggregation: Like metrics, Kafka can collate and aggregate logs from various sources and present them in a centralized location and format for easy consumption.
- Messaging: Apache Kafka provides better throughput, built-in partitioning, data replication, and fault tolerance, which makes it a good substitute for traditional message brokers or queues.
- Activity tracking: Kafka enables the comprehensive collation and publication of site activity, like page views, searches, or other user actions, to topics. Real-time processing and monitoring systems, or offline processing and reporting systems, can then consume this information.
Where to place Kafka in a tech architecture
As previously established, Apache Kafka is a distributed data streaming platform that can aggregate information from various sources in a centralized location. To do so effectively, place Kafka in the middle of your ecosystem. Integration is rarely a problem, as most open-source producer and consumer options have existing connectors to Kafka.
Note: As long as Apache Kafka is not performing complex application logic, you can think of it as an ordered database of messages. Kafka remains unaware of the content stored in the messages, even though producers can choose to store message metadata separately.
Apache Kafka acts like a pipeline that allows data to pass from one place to another while additionally storing it on disk. Consumers in the pipeline can choose where to start consuming messages on a particular topic (from the earliest message, the latest message, or otherwise).
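As a sketch of that last point, the consumer below subscribes to a hypothetical topic and, via `auto.offset.reset`, starts from the earliest retained message when it has no committed offset. The broker address, group id, and topic name are placeholders.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class EarliestOffsetConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address
        props.put("group.id", "reporting-service");       // hypothetical consumer group
        props.put("auto.offset.reset", "earliest");       // start from the oldest retained message
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("sensor-readings"));
            while (true) {
                // Poll for new records and print where in the log each one came from.
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("partition=%d offset=%d value=%s%n",
                            record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```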
Overview of Apache Kafka Connect
Apache Kafka Connect is a tool that helps you connect external systems to Kafka by creating pipelines that either produce or consume messages. This enables you to start using Kafka in an easy, largely code-free manner, since creating a connector often requires nothing more than writing a JSON configuration file.
Connecting to external systems to import and export data
In Apache Kafka Connect, there are two types of connectors you can use:
- Source: This type of pipeline extracts data from an external source and then publishes it to a Kafka topic. Examples include pulling data from a relational database, subscribing to a message broker, watching for changes in an S3 bucket, etc.
- Sink: This pipeline consumes data from a Kafka topic and then pushes it to an external system. Examples include calling an external API, writing data to a relational database, or sending a webhook to notify clients of changes.
Depending on your application needs, it may be necessary to use both sink and source connectors to move data between two external sources. Kafka Connect enables you to perform simple transformations to specific data fields in messages. Additionally, if your system does not have an existing Kafka connector, you can create one using the Kafka Connect API.
Note: Kafka also provides a low-level Connect TCP API, which allows you to develop custom connectors in any programming language. Apache Kafka also provides a Java Client, which implements the Connect API and lets you create custom connectors by implementing Java interfaces. On top of that, there is a Connect REST API for deploying and monitoring connector instances. It is important to differentiate between these APIs to avoid confusing the terms.
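To give a feel for the Connect REST API mentioned in the note above, the sketch below posts a connector configuration to a Connect worker. It uses the FileStreamSource connector that ships with Kafka purely as an example; the worker URL (Connect workers listen on port 8083 by default), file path, connector name, and topic name are assumptions about your setup.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterFileSourceConnector {
    public static void main(String[] args) throws Exception {
        // Connector definition: read lines from a local file and publish them to a topic.
        // Uses a Java text block (JDK 15+); all names and paths are placeholders.
        String connectorJson = """
            {
              "name": "demo-file-source",
              "config": {
                "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
                "tasks.max": "1",
                "file": "/tmp/demo-input.txt",
                "topic": "demo-lines"
              }
            }
            """;

        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8083/connectors")) // placeholder worker URL
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(connectorJson))
                .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```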
Overview of the Apache Kafka Streams library
Apache Kafka Streams allows you to define how to deal with messages within your ecosystem. This Java-based client library is a part of the Kafka Java Client. It enables you to write and deploy standard Java and Scala applications on the client side. These applications pull the desired data from Kafka topics, perform transformations over it, and push it back to the defined output topics.
The Kafka Streams library lets you create a smooth pipeline in your code and define a chain of data transformations. For instance, you can combine or extract message fields, compute statistics over the messages, or aggregate them. The Kafka Streams library then determines what it needs in order to complete the defined transformations and deploys the whole process.
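As a minimal sketch of such a pipeline, the topology below reads from a hypothetical input topic, transforms each value, and writes the result to an output topic. The application id, broker address, and topic names are placeholders.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercasePipeline {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-pipeline"); // hypothetical app id
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // placeholder brokers
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every record from "raw-events", transform the value, and write to "clean-events".
        KStream<String, String> raw = builder.stream("raw-events");
        raw.mapValues(value -> value.trim().toUpperCase())
           .to("clean-events");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```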
Overview of Apache Kafka APIs
At its core, Apache Kafka is an event streaming platform. It offers advanced features such as publish-subscribe messaging, reliable and durable storage of large amounts of data, and real-time record processing.
There are five main (low-level TCP) APIs you need to know to leverage the key features of Apache Kafka:
- Producer API: Enables any application to publish streams of records to a Kafka topic, where they are stored for a predefined period (or even indefinitely).
- Consumer API: Enables applications to subscribe to one or more Kafka topics and retrieve past and real-time records stored in the topic. Your application logic can then process the retrieved data.
- Streams API: Builds on the Producer and Consumer APIs. It enables applications to perform complex processing capabilities like consuming, analyzing, aggregating, or transforming records from multiple topics and publishing the processed records to other Kafka topics. As described above, the officially supported API implementation in Java is Kafka Streams.
- Connect API: Enables applications to build custom connectors, also known as reusable publishers and consumers. These connectors can link topics to existing applications and simplify integrating external data sinks or sources into a Kafka cluster. It’s a more general, high-abstraction API compared to Producer and Consumer APIs (in fact, it abstracts away the two). It connects entire systems to the Kafka cluster in a manageable and possibly distributed way for high volumes of data. Meanwhile, Producer and Consumer APIs are more fine-grained. This can be useful in specific scenarios requiring substantial customization when subscribing to or publishing to topics.
- Admin API: Manages and inspects Kafka topics, brokers, and other Kafka objects (a short sketch follows below).
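For example, a minimal Admin API sketch that creates a topic might look like the following; the broker address, topic name, partition count, and replication factor are arbitrary illustrative values.

```java
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092"); // placeholder broker address

        try (AdminClient admin = AdminClient.create(props)) {
            // Create a topic with 3 partitions and a replication factor of 1 (illustrative values).
            NewTopic topic = new NewTopic("demo-topic", 3, (short) 1);
            admin.createTopics(Collections.singleton(topic)).all().get(); // block until the broker confirms
        }
    }
}
```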
Note that Kafka exposes all the mentioned APIs in a programming-language-agnostic way. The official Kafka client only supports Java. However, the open-source community has developed Kafka clients for many other programming languages.
In most scenarios, you can use most features in Apache Kafka with the Producer, Consumer, Connect, and Admin APIs. But if your application requires more sophisticated data or event processing capabilities, you should also use the Streams API.
Apache Kafka vs Confluent Kafka
This article mainly focuses on Apache Kafka, but did you know there is also Confluent Kafka? Although the two bear similar names, it is important to note that they are not the same.
Apache and Confluent Kafka are distributed streaming platforms offering highly scalable, resilient, and dependable pipelines for real-time data streaming and processing. However, Confluent Kafka is an enterprise version with additional features. The following table outlines the key differences:
| Differences | Apache Kafka | Confluent Kafka |
| --- | --- | --- |
| Use cases | Suited for a wide range of small to large-scale installations and can handle real-time processing efficiently. | Particularly suited for enterprise solutions of any scale, with features like Confluent Schema Registry and Confluent Control Center that make the use of Kafka easier and even more manageable. |
| Support | Open-source support mostly comes from its sizable, active community of contributors and users. | Available as open source under the Confluent Community License or a fee-based enterprise license. |
| Popularity | Widely used in sectors like banking, healthcare, and technology. | Preferred option by enterprises for its data integration, governance, and monitoring features. Examples include large-scale driver fleet management and telecom. |
| Performance | Low latency and high throughput. | Low latency and high throughput, with additional enterprise features like multi-datacenter replication and advanced caching. |
| Features (both offer robust data streaming and processing capabilities) | Open-source message broker with core functionalities like low latency, fault tolerance, and high throughput, plus additional infrastructure like the Kafka Clients, which together comprise a robust and efficient distributed event streaming platform. | Has the core features of Apache Kafka but also offers: additional pre-built components like the REST Proxy and Schema Registry; advanced tools for managing and deploying clusters, such as Confluent Control Center; Confluent Cloud, a fully managed service for deploying and running event streaming platforms on cloud providers; additional security features like RBAC, audit logging, advanced encryption, and secret protection. |
Wrap up
In large distributed IoT projects, it is essential to handle, process, stream, move, and store large amounts of data regularly. If you fail to use the right data streaming application, your customers may experience data loss and slow connectivity issues. Apache Kafka overcomes these challenges as a reliable and durable data streaming platform that can effectively handle trillions of records.
Leveraging partitioned log, queue, and publish-subscribe messaging models, Apache Kafka delivers the highly scalable, reliable, and distributed data streaming features it is known for. Additionally, it utilizes various authentication, authorization, and encryption methods to ensure the secure processing of data.
When placed in the middle of your ecosystem, Apache Kafka excels in aggregating information from various sources. The following libraries can kickstart your Kafka journey:
- Apache Kafka Connect – Use pre-built source (extracts data) and sink (consumes data) connectors, or implement your own to connect to Kafka.
- Apache Kafka Streams – A Java-based client library that helps pull, process, and push data back to topics.
To fully utilize Kafka’s core features, there are five main language-agnostic low-level TCP APIs you need to be familiar with:
- Producer API: Enables an application to publish streams of records to a Kafka topic.
- Consumer API: Enables applications to subscribe to one or more Kafka topics and retrieve records stored in these topics.
- Streams API: Builds on the Producer and Consumer APIs to enable applications to perform complex processing capabilities and publish the processed records to Kafka topics.
- Connect API: Enables applications to build connectors that link topics to existing applications and simplify integrating other data sources.
- Admin API: Manages and inspects Kafka topics, brokers, and other Kafka objects.
Apache Kafka also offers an official Java Client, which provides the Kafka Streams library and exposes classes for working with Kafka’s native APIs from Java. However, the community has developed many other Kafka clients for other programming languages.
There are several differences between Apache Kafka and Confluent Kafka. In short, the latter offers additional enterprise-centric features, in addition to the core Apache Kafka capabilities.
Now that you have a better understanding of Apache Kafka, it’s equally important to know how it can work seamlessly with an MQTT broker. The Pro Edition for Eclipse Mosquitto offers a Kafka Bridge feature and several others that are easy to use. Sign up for a free 14-day trial or 30-day on-premises trial with a basic MQTT high availability configuration to test its features. If you encounter any challenges while using the open-source Mosquitto broker, you can take advantage of the new professional support service.
About the author
Serhii Orlivskyi is a full-stack software developer at Cedalo GmbH. He previously worked in the Telekom industry and software startups, gaining experience in various areas such as web technologies, services, relational databases, billing systems, and eventually IoT.
While searching for new areas to explore, Serhii came across Cedalo and started as a Mosquitto Management Center developer. Over time, Serhii delved deeper into the MQTT protocol and the intricacies of managing IoT ecosystems.
Recognizing the immense potential of MQTT and IoT, he continues to expand his knowledge in this rapidly growing industry and contributes by writing and editing technical articles for Cedalo's blog.