How many times have you been part of a conversation where someone says, "Let us use a queue like Kafka"?
And you wince, because:
Kafka is a distributed stream processor, and there is a world of difference between a queue and a distributed stream processor. Kafka happens to be the most popular distributed stream processor; there are others too, for example, Kinesis from AWS. Even though you can use a distributed stream processor as a queue, it is a good idea not to. You need to be very clear about the difference between the two, because you have to approach and model them differently.
For the rest of this post, we will not talk specifically about Kafka. Instead, we will build a mental model of a distributed stream processor, as we did with a distributed key-value store in one of the earlier posts.
Imagine a huge file that your applications can write to and others can read from. Let us refer to this file as a "Topic." The applications that write to the file are called "Producers," and the applications that read from it are called "Consumers." Each line in this file is called an "Event" or a "Message."
There you go, we have covered much of the distributed stream processing parlance.
The lines in a file have a sequence based on the order in which they are written. The same goes for messages in a distributed stream processor; they are ordered. Each message also has a unique, incrementing sequence number that determines the order.
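To make the file analogy concrete, here is a minimal sketch in Python. The names (`Topic`, `produce`, `consume`) are illustrative, not any real client API; the point is that a topic is just an append-only file and the sequence number is just the line number:

```python
class Topic:
    """A toy topic: an append-only file. Illustrative, not a real API."""

    def __init__(self, path):
        self.path = path
        self.next_offset = 0  # incrementing sequence number

    def produce(self, message: str) -> int:
        """Append a message; return its unique, ordered sequence number."""
        with open(self.path, "a") as f:
            f.write(message + "\n")
        offset = self.next_offset
        self.next_offset += 1
        return offset

    def consume(self, offset: int) -> list[str]:
        """Read every message from `offset` onwards; reading deletes nothing."""
        with open(self.path) as f:
            return [line.rstrip("\n") for line in f][offset:]
```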
Just as a file can have multiple writers and readers, a distributed stream processor can have many producers and consumers.
A distributed stream processor is stateless; all it cares about is keeping the file available. The consumers are responsible for maintaining their own state: they track the last line they read so that they know where to resume. Some distributed stream processors may help you maintain this state, but that is not their primary concern.
While reading a file, even if you are currently at the 100th line, you can always reset and start reading from any other line; reading a line does not delete it from the file. The same is true for consumers of a distributed stream processor; they can go back and forth between messages as they wish.
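Continuing the sketch above (again, the names are hypothetical), a consumer is just a reader that remembers the last offset it processed; resetting is nothing more than overwriting that number:

```python
class Consumer:
    """Tracks its own position; the topic itself stays stateless."""

    def __init__(self, topic: Topic):
        self.topic = topic
        self.offset = 0  # the consumer, not the topic, owns this state

    def poll(self) -> list[str]:
        messages = self.topic.consume(self.offset)
        self.offset += len(messages)  # remember where to resume
        return messages

    def seek(self, offset: int) -> None:
        # Rewind or fast-forward at will; reading deleted nothing.
        self.offset = offset
```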
If we keep writing to a file without ever deleting, it will soon grow large and eat up the entire storage. Distributed stream processors have retention periods to counter this problem. The retention period is configurable based on how much storage capacity you have. Messages past the retention period are deleted.
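Retention, in this mental model, is just a periodic job that drops lines older than the configured period. A hedged sketch (per-message timestamps are an assumption; the toy `Topic` above does not record them):

```python
import time

RETENTION_SECONDS = 7 * 24 * 3600  # e.g. keep one week; tune to your storage

def enforce_retention(log: list[tuple[float, str]]) -> list[tuple[float, str]]:
    """Keep only (timestamp, message) pairs newer than the retention cutoff."""
    cutoff = time.time() - RETENTION_SECONDS
    return [(ts, msg) for ts, msg in log if ts >= cutoff]
```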
If multiple processes write to a file, you will soon hit performance limits. The same goes for a topic in a distributed stream processor; shards/partitions are used to overcome this. Instead of writing to one file, imagine that the file is split into multiple smaller files and producers write to these, thus distributing the writes.
How does a producer decide which file to write to?
The partition key aids this decision.
A data point in the message is used to determine which partition to write to. Usually, this data point is hashed, and then modulo arithmetic is used to determine the target partition. There are other schemes too for deciding the target partition.
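The hash-then-modulo scheme is easy to sketch. (The choice of hash function here is an assumption; real systems use a stable hash so that the same key always maps to the same partition, which is also why Python's salted built-in `hash()` would be a poor fit.)

```python
import hashlib

def target_partition(partition_key: str, num_partitions: int) -> int:
    """Hash the partition key, then take modulo to pick a partition."""
    digest = hashlib.md5(partition_key.encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Every message with the same key lands on the same partition,
# which preserves per-key ordering.
print(target_partition("user-42", 4))  # always the same partition for "user-42"
```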
A queue has a straightforward mental model: first in, first out. Once you fetch a message from a queue, it is deleted; it will not be available again for processing. If a queue has multiple consumers, a message will be delivered to only one of them; this is in stark contrast to the consumers of a distributed stream processor. In a distributed stream processor, since we are reading lines from a file, a message can be processed multiple times by the same or different consumers; reading never deletes anything (only retention does).
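The contrast is easy to see in code, using Python's standard library queue against the toy `Topic` from earlier (sketch code, not a real client API):

```python
from queue import Queue

q = Queue()
q.put("hello")
q.get()           # "hello" -- and it is now gone from the queue
# another q.get() would block forever: the message cannot be reprocessed

topic = Topic("orders.log")
topic.produce("hello")
topic.consume(0)  # ["hello"]
topic.consume(0)  # ["hello"] again; reading never deletes
```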
Multiple machines back a distributed stream processor so that it is resilient to machine failures; this should be obvious since it is a "Distributed" stream processor.
Distributed systems are hard; tons of engineering goes into making anything distributed, and the same goes for a distributed stream processor.
I have written along similar lines before.