Understanding Kafka: A Beginner’s Guide to Its Architecture
Kafka Overview
Kafka is a messaging system in which producers publish messages to named categories called topics, and consumers subscribe to the topics they are interested in. A Kafka broker handles the communication between them.
Kafka Architecture
Kafka consists of three key components:
Producers
Producers create new messages in Kafka. They have the following key components:
- Producer Record: This keeps track of each message. It includes the topic, a key (or partition), and the message value.
- Serializer: The producer converts the key and value of the message into byte arrays (a process also called data marshalling).
- Partitioner: After serialization, the partitioner assigns the message to a partition. There are different ways this happens:
  - If a key is specified, the partitioner uses a hash function to map the message to a partition.
  - If a partition is directly specified, the message goes there without changes.
By default, each topic has a single partition, but this is configurable. It is good practice to make the number of partitions equal to, or a multiple of, the number of brokers, so that messages are distributed evenly across the cluster.
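The keyed path of the partitioner can be sketched as follows. This is illustrative only: Kafka's default partitioner uses a murmur2 hash of the serialized key, while here `zlib.crc32` stands in as the hash function, and the function name `assign_partition` is hypothetical.

```python
import zlib

def assign_partition(key: bytes, num_partitions: int, explicit_partition=None) -> int:
    """Sketch of partition assignment (not the real Kafka partitioner)."""
    # An explicitly specified partition is used as-is.
    if explicit_partition is not None:
        return explicit_partition
    # Otherwise hash the key and map it onto the partition range,
    # so the same key always lands on the same partition.
    return zlib.crc32(key) % num_partitions

print(assign_partition(b"user-42", 6))                        # deterministic for this key
print(assign_partition(b"user-42", 6, explicit_partition=3))  # 3
```

Because the mapping is deterministic, all messages with the same key end up in the same partition, which is what preserves per-key ordering.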
Producer Workflow
Once the partitioner decides the topic and partition, the producer stores the messages in batches inside a buffer. The producer then sends these batches to Kafka brokers using produce requests.
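The batching step above can be sketched with a toy in-memory producer. The class `BatchingProducer` is hypothetical, assumed here only to show the idea: records accumulate per (topic, partition) and are sent to the broker as one batch once the buffer fills.

```python
from collections import defaultdict

class BatchingProducer:
    """Minimal sketch of producer-side batching (not the real Kafka client)."""

    def __init__(self, batch_size: int = 3):
        self.batch_size = batch_size
        self.buffers = defaultdict(list)   # (topic, partition) -> pending records
        self.sent_batches = []             # stand-in for produce requests to a broker

    def send(self, topic: str, partition: int, value: bytes):
        buf = self.buffers[(topic, partition)]
        buf.append(value)
        if len(buf) >= self.batch_size:    # flush when the batch is full
            self.flush(topic, partition)

    def flush(self, topic: str, partition: int):
        batch = self.buffers.pop((topic, partition), [])
        if batch:
            self.sent_batches.append((topic, partition, batch))

p = BatchingProducer(batch_size=2)
p.send("orders", 0, b"a")
p.send("orders", 0, b"b")   # second record fills the batch and triggers a flush
print(p.sent_batches)       # [('orders', 0, [b'a', b'b'])]
```

The real client also flushes on a time limit (`linger.ms`), not only on batch size, trading a little latency for fewer, larger requests.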
Brokers
Kafka runs on multiple servers called brokers. They receive messages from producers and deliver them to consumers when requested. When a broker gets a message, it:
- Assigns an offset to the message.
- Stores the message on disk.
- Sends a response back to the producer.
If successful, the broker sends metadata (topic, partition, offset) to the producer. If not, it sends an error. The producer retries sending the message a few times. If it still fails, an error is returned.
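The broker-side steps above can be sketched as an append-only log. The class `PartitionLog` is a hypothetical in-memory stand-in for the broker's on-disk storage; it shows how each appended message receives the next sequential offset and how the success metadata is shaped.

```python
class PartitionLog:
    """Sketch of a broker-side partition log (in-memory stand-in for disk)."""

    def __init__(self, topic: str, partition: int):
        self.topic = topic
        self.partition = partition
        self.log = []                     # stand-in for the on-disk segment files

    def append(self, message: bytes) -> dict:
        offset = len(self.log)            # offsets are assigned sequentially
        self.log.append(message)
        # Metadata returned to the producer on success.
        return {"topic": self.topic, "partition": self.partition, "offset": offset}

log = PartitionLog("orders", 0)
print(log.append(b"first"))   # {'topic': 'orders', 'partition': 0, 'offset': 0}
print(log.append(b"second"))  # {'topic': 'orders', 'partition': 0, 'offset': 1}
```

Because offsets are just positions in an append-only log, assigning them is cheap, which is one reason brokers sustain such high throughput.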
Given adequate hardware, a single broker can handle thousands of partitions and millions of messages per second. Kafka brokers operate in clusters, where one broker is elected as the cluster controller using ZooKeeper. The cluster controller:
- Assigns partitions to brokers.
- Elects partition leaders.
- Monitors broker failures.
Benefits of Multiple Brokers
- Scalability: Data load is distributed across brokers.
- Replication: Partitions are replicated for fault tolerance in case of server failure.
Kafka Cluster
Each partition is replicated across multiple brokers in a cluster for fault tolerance. Only one broker, the partition leader, handles communication with producers and consumers. If a leader fails, another broker takes over. Producers and consumers must always connect to the partition leader.
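Leader failover can be sketched in a few lines. This is a simplification: real Kafka elects the new leader from the partition's in-sync replica set, whereas the hypothetical `elect_leader` below simply picks the first replica that is still alive.

```python
def elect_leader(replicas, alive):
    """Sketch of leader failover for one partition (simplified)."""
    # Pick the first replica broker that is still alive; real Kafka
    # restricts this choice to the in-sync replica (ISR) set.
    for broker in replicas:
        if broker in alive:
            return broker
    raise RuntimeError("no live replica for partition")

replicas = [1, 2, 3]                            # broker IDs hosting copies of the partition
print(elect_leader(replicas, alive={1, 2, 3}))  # 1: broker 1 leads
print(elect_leader(replicas, alive={2, 3}))     # 2: broker 1 failed, broker 2 takes over
```

Clients then refresh their metadata so subsequent requests go to the new leader.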
Consumers
Consumers, also called subscribers or readers, read messages from topic partitions. They follow this process:
- Subscribe to one or more topics.
- Send pull requests to brokers, which provide the stored messages.
- Read messages from partitions in the order they were written.
- Track consumed messages using the offset (a unique integer assigned to each message).
Offsets allow consumers to restart without losing track of messages. Consumers send pull requests to the broker that leads the partition. If they contact the wrong broker, they receive an error such as "Not a leader for partition." To prevent such mistakes, producers and consumers first request metadata from the brokers; this metadata tells them which broker leads each partition, so they can send future requests to the correct leader.
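Offset-based consumption can be sketched as follows. The class `SimpleConsumer` is hypothetical: it keeps the next offset to read per partition, which is what lets a restarted consumer resume exactly where it left off.

```python
class SimpleConsumer:
    """Sketch of offset tracking (not the real Kafka consumer)."""

    def __init__(self):
        self.positions = {}   # (topic, partition) -> next offset to read

    def poll(self, log, topic, partition, max_records=10):
        start = self.positions.get((topic, partition), 0)
        records = log[start:start + max_records]
        # Advance the position past what was just read.
        self.positions[(topic, partition)] = start + len(records)
        return records

log = [b"m0", b"m1", b"m2"]   # stand-in for a partition's stored messages
c = SimpleConsumer()
print(c.poll(log, "orders", 0, max_records=2))  # [b'm0', b'm1']
print(c.poll(log, "orders", 0, max_records=2))  # [b'm2'], resuming at offset 2
```

In real Kafka the position is committed back to the cluster, so even a replacement consumer process can pick up from the last committed offset.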
Consumer Groups
In a consumer group, multiple consumers work together to consume messages from a topic. Each partition is assigned to only one consumer within the group. Here’s how it works:
- A consumer group ensures only one member consumes each partition at a time.
- A consumer can read from multiple partitions, but a single member owns each partition.
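The one-partition-per-member rule can be sketched with a round-robin assignment. This is illustrative only: real Kafka supports several assignment strategies (range, round-robin, sticky), and the function `assign_partitions` is hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Sketch of round-robin partition assignment within a consumer group."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        # Each partition goes to exactly one consumer in the group.
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

print(assign_partitions([0, 1, 2, 3], ["c1", "c2"]))
# {'c1': [0, 2], 'c2': [1, 3]}
```

Note that every partition appears in exactly one consumer's list, while one consumer may own several partitions, which matches the two rules above.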
Why Use Consumer Groups?
While a single consumer can fetch data, consumer groups provide vital benefits:
- Scalability: If a producer sends messages faster than a single consumer can read, a consumer group spreads the load.
- Fault Tolerance: If a consumer fails, other consumers in the group can take over, preventing data loss.
Consumer groups allow horizontal scaling and provide a reliable way to handle large volumes of messages efficiently.
In summary, Kafka’s architecture comprises producers, brokers, and consumers (or consumer groups), providing a robust and scalable messaging system. With its ability to handle millions of messages, Kafka ensures efficient message distribution and fault tolerance through its partitioning and replication mechanisms. Consumer groups enable horizontal scalability, allowing systems to manage high message throughput and ensure reliability in the face of failures. Whether handling real-time data streams or scaling distributed applications, Kafka is a powerful tool for modern data-driven architectures.