Introduction
Apache Kafka is a distributed event streaming platform that can process trillions of events each day. Originally built as a messaging queue, it is now best described as an abstraction of a distributed commit log. Since LinkedIn built and open-sourced it in 2011, Kafka has quickly evolved from a messaging queue into a full-fledged event streaming platform. Kafka is written in Scala and Java, but it has clients for many other popular programming languages. It lets you store, read, and analyze streams of data. That may not sound like much at first, but thanks to its distributed nature Kafka is a powerful tool, capable of handling a huge number of events every day while remaining responsive.
Looking at more traditional message queues makes it clear how Kafka differs from them. For example, RabbitMQ is an open-source message broker. It accepts messages from Producers and delivers them to Consumers. RabbitMQ deletes each message as soon as the Consumer has received it. It pushes messages to Consumers and keeps track of their load, deciding how many messages each Consumer should be processing at any one time (there are settings for this behaviour).
Kafka, on the other hand, follows the publish-subscribe messaging pattern. This decouples the applications that send data from the applications that receive it, which in turn enables more dynamic network topologies and greater scalability. Unlike RabbitMQ, Kafka stores messages for a period of time (seven days by default) and does not delete them as soon as they have been pulled. In Kafka, "pulling" refers to the way Consumers retrieve messages from the Brokers at their own pace.
Kafka is built to scale horizontally by adding more nodes, whereas traditional messaging queues generally scale vertically by adding more power to the same machine. These aspects differentiate Kafka from typical messaging systems.
How to use Apache Kafka
Before we learn how to use Apache Kafka, let’s go through some basic terminology. These are the main terms we will use in the following text and in the next blog posts on Apache Kafka:
- Broker – a node (or server) in the Kafka cluster
- Producing a message – means sending (or pushing) a message
- Consuming a message – means receiving (or pulling) a message
- Producer – an application that produces/pushes messages
- Consumer – an application that consumes/pulls messages
A streaming platform such as Kafka generally has three fundamental capabilities: publishing and subscribing to streams of records, storing those streams of records in a fault-tolerant way, and processing them as they happen. Applications (in this case, Producers) push messages (records), each consisting of a header, a key, a value, and a timestamp, to a Kafka node (Broker). The Broker stores these messages in groups or categories called Topics. Other applications (Consumers) can consume (pull) the messages from the Topics by contacting the Brokers, independently of the Producers. However, as a Topic grows, a single Consumer may no longer keep pace with the number of messages in it, and a Lag appears. In that case, we can divide the Topic into smaller Partitions for better speed and scalability: we scale out the number of Consumers in a Consumer Group, and each Consumer reads from a separate Partition of the same Topic. Within each Partition, messages are kept in the immutable order in which they arrived and are continually appended to a structured commit log.
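To make the Producer side of this concrete, here is a minimal sketch using Kafka’s Java client. The Broker address, the `orders` Topic, and the key and value are illustrative placeholders, not part of any real setup:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class OrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        // Address of one (or more) Brokers used to bootstrap the connection (placeholder)
        props.put("bootstrap.servers", "localhost:9092");
        // Keys and values travel as bytes; these serializers turn Strings into bytes
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The record's key decides which Partition of the "orders" Topic it lands in;
            // records with the same key always go to the same Partition.
            ProducerRecord<String, String> record =
                    new ProducerRecord<>("orders", "customer-42", "order created");
            producer.send(record, (metadata, exception) -> {
                if (exception == null) {
                    System.out.printf("Stored in partition %d at offset %d%n",
                            metadata.partition(), metadata.offset());
                }
            });
        }
    }
}
```

Because the key determines the Partition, all messages with the same key keep their arrival order within that Partition.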
Kafka handles fault tolerance very effectively. We can assign a Replication Factor to each Topic, meaning we can specify the number of Replicas (or copies) of that Topic. For example, if a Topic has a Replication Factor of two, two Brokers will hold a copy of the Topic, so if one Broker fails, the other will continue to serve the Producers and Consumers of the Topic.
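As a rough sketch of how a Replication Factor is set, the Java AdminClient can create a Topic with a chosen number of Partitions and Replicas. The names and numbers below are placeholders, and a Replication Factor of two naturally requires a cluster with at least two Brokers:

```java
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateReplicatedTopic {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 Partitions, Replication Factor of 2: every Partition is copied to two Brokers,
            // so the Topic survives the loss of one of them.
            NewTopic topic = new NewTopic("orders", 3, (short) 2);
            admin.createTopics(Collections.singletonList(topic)).all().get();
        }
    }
}
```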
It’s also worth mentioning that Kafka is indifferent to the type of the records – the data is always stored in binary format. The Kafka cluster keeps all published records for a configurable retention period. Consumers poll Kafka for new messages and tell it which records they want to read. Once the retention period expires, however, records are deleted to free up space.
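The Consumer side of this polling model might look roughly like the sketch below, again with the Java client; the group id, Topic name, and Broker address are placeholders:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class OrderConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Consumers sharing the same group.id form a Consumer Group
        // and split the Topic's Partitions between them
        props.put("group.id", "order-processors");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders"));
            while (true) {
                // Poll the Broker for any records published since the last poll
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.printf("key=%s value=%s partition=%d offset=%d%n",
                            record.key(), record.value(), record.partition(), record.offset());
                }
            }
        }
    }
}
```

Adding a second instance with the same `group.id` would make Kafka rebalance the Partitions across both Consumers, which is how a lagging Consumer Group scales out.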
In terms of APIs, Kafka has five main components:
- The Producer API enables the program to publish a continuous stream of records to Topics
- The Consumer API allows the app to subscribe to Topics and process the incoming data stream
- The Streams API makes it easier to turn input streams into output streams (a short sketch follows this list)
- The Connector API allows you to create and operate reusable Producers and Consumers that connect Kafka Topics to other applications or data systems
- Topics, Brokers, and other Kafka objects can be managed and inspected using the Admin API
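As a small illustration of the Streams API mentioned above, the following sketch reads records from one Topic, transforms each value, and writes the results to another Topic. The application id and Topic names are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class UppercaseStream {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "uppercase-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        // Read every record from the input Topic, transform its value,
        // and write the result to the output Topic
        KStream<String, String> input = builder.stream("orders");
        input.mapValues(value -> value.toUpperCase()).to("orders-uppercase");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        // Stop the streams application cleanly on shutdown
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```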
Why is it so popular?
Kafka is a robust system that is simple to set up, operate, and understand. However, its popularity stems primarily from its outstanding performance. It is stable, provides dependable durability, offers a flexible publish-subscribe/queue model that scales well to any number of Consumer Groups, has resilient replication, gives Producers configurable consistency guarantees, and preserves ordering at the Partition (shard) level.
Kafka also works well with systems that need to process data streams, allowing them to aggregate, transform, and load data into other repositories. If Kafka were slow, none of these qualities would matter.
This is the first in a series of blog posts on Apache Kafka. Stay tuned for the next posts on more benefits of Apache Kafka, who should use it, and the use cases it fits best.