Kafka Is Not A Queue

How many times have you been part of a conversation where someone says, "Let us use a queue like Kafka"?

Kafka is a distributed stream processor. There is a world of difference between a queue and a distributed stream processor. Kafka happens to be the most popular distributed stream processor; there are others too, for example, Kinesis from AWS. Even though you can use a distributed stream processor like a queue, it is not a good idea to. You need to be very clear about the difference between the two, because you have to approach and model them differently.

For the rest of this post, we will not talk specifically about Kafka. We will build a mental model of a distributed stream processor, just as we did with a distributed key-value store in one of the earlier posts.

Imagine that there is a huge file which your applications can write to. Others can read from this file. Let us refer to this file as a "Topic." The applications which write to the file are called "Producers." Applications which read from the file are called "Consumers." Each line in this file is called an "Event" or a "Message."

There you go; we have covered much of the distributed stream processing parlance.

The lines in a file have a sequence based on the order in which they are written. The same goes for messages in a distributed stream processor; they are ordered. Also, each message has a unique incrementing sequence number to determine the order.

Just as a file can have multiple writers and readers, a distributed stream processor can have many producers and consumers.

A distributed stream processor is stateless. All it cares about is ensuring the file is always available. The consumers are responsible for maintaining their own state. The consumers keep track of the last line they read so that they know where to start reading from again. Some distributed stream processors may help you in maintaining this state, but that is not their primary concern.
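
To make this concrete, here is a minimal sketch of the mental model in Python. The names (Topic, Consumer, poll, seek) are my own and not any real broker's API; the point is only that the topic is an append-only log and the read position lives with the consumer.

```python
# A toy topic: an append-only list of messages, never deleted on read.
class Topic:
    def __init__(self):
        self.messages = []

    def append(self, message):
        """What a producer does; returns the message's sequence number."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read(self, offset):
        """Reading never removes the message."""
        return self.messages[offset]


# The consumer, not the stream processor, remembers where it stopped reading.
class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0  # the consumer's own state

    def poll(self):
        if self.offset >= len(self.topic.messages):
            return None  # nothing new yet
        message = self.topic.read(self.offset)
        self.offset += 1  # remember where to resume next time
        return message

    def seek(self, offset):
        self.offset = offset  # rewind or skip ahead at will
```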

While reading a file, even though you may be reading from the 100th line, you can always reset and start reading from another line. Reading a line from a file does not delete it from the file. The same is true for consumers of a distributed stream processor; they can go back and forth between messages as they wish.

If we keep writing to a file without ever deleting anything, it will soon grow large and eat up the entire storage. Distributed stream processors have retention periods to counter this problem. The retention period is configurable based on how much storage capacity you have. Messages past the retention period are deleted.

If multiple processes write to a single file, you will soon hit performance limits. The same goes for a topic in a distributed stream processor; shards/partitions are used to overcome this. Instead of writing to one file, imagine that the file is split into multiple smaller files and producers write to these, thus distributing the writes.

How does a producer decide which file to write to?
The partition key aids this decision.

A data point in the message is used to determine which partition to write to. Usually, this data point is hashed, and then modulo arithmetic is used to determine the target partition. There are other schemes too for deciding the target partition.
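
Here is a sketch of the common hash-and-modulo scheme. Real producers use their own hash functions and partitioners, so treat this as an illustration rather than a spec.

```python
import hashlib

def choose_partition(partition_key: str, num_partitions: int) -> int:
    """Hash the partition key and use modulo arithmetic to pick a partition."""
    digest = hashlib.sha1(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % num_partitions

# Messages with the same key always land in the same partition,
# so per-key ordering survives even though writes are spread out.
print(choose_partition("user-42", 8))
```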

A queue has a straightforward mental model: first in, first out. Once you fetch a message from a queue, it is deleted; it will not be available again for processing. If a queue has multiple consumers, a message will be delivered to only one of them; this is in stark contrast to the consumers of a distributed stream processor. In a distributed stream processor, since we are reading lines from a file, a message can be processed multiple times by the same or a different consumer; there is no concept of deletion.
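
A tiny illustration of the difference, using Python's in-process Queue as a stand-in for a real message queue:

```python
from queue import Queue

# A queue: fetching a message removes it; no other consumer will see it again.
q = Queue()
q.put("job-1")
print(q.get())   # "job-1" is now gone from the queue

# A log (our "file"): reading leaves the message in place for anyone to re-read.
log = ["event-1"]
print(log[0])    # still there
print(log[0])    # another consumer can read it again
```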

Multiple machines back a distributed stream processor so that it is resilient to machine failures; this should be obvious since it is a "distributed" stream processor.

Distributed systems are hard; tons of engineering goes into making anything distributed. The same goes for a distributed stream processor.

I have written along similar lines before.

Creating Quality Software

Creating software is easy; creating quality software is hard.

In today’s world of umpteen open source libraries, cloud computing platforms, and StackOverflow, creating software is child’s play. You can look up things on the internet, glue together components, and voila – you have created something.

What is hard is creating quality software. Software that scales; software with very few bugs; software that is elegant and takes care of edge cases and boundary conditions; software that stands the test of time.

I believe the fundamental reason for this is the age-old problem of simultaneously holding opposing thoughts in mind and doing justice to them.

To create something, you need to be an optimist; you need to look at the bright side of things. Whereas if you want to build quality software, you have to foresee all the probable future problems and develop interventions for them, i.e., you need to wear the pessimist's cap.

Being an optimist and pessimist at the same time is hard, and I believe that is one of the reasons why quality software seems to be elusive.

Distributed System Fundamentals With An Imaginary Key-Value Store

The CAP theorem says that in the event of a network partition, a distributed system can be either consistent or available, not both.

Let us first define some of the terms in the CAP theorem.

A distributed system is a group of independent computers coordinating with each other to solve a problem. The group of computers is called a cluster.

A network connects the computers so that they can communicate and coordinate. Whenever a network is involved, there are bound to be delays and outages. Also, the individual computers themselves may go down due to hardware failure. An event which leads to some of the machines in the cluster not being reachable is called a network partition.

Now that we have defined some of the terms, we will build a hypothetical distributed key-value store and witness some of the distributed computing concepts and trade-offs in action.

A key-value store is a dictionary data structure with persistence. A distributed key-value store uses multiple machines for storage.

The user sees and interacts with the key-value store as a single unit through a client. The multiple computers backing the store are abstracted from the user.

Let us say that we have three machines, m0, m1, and m2, backing our key-value store. m0 is the coordinator, and m1 and m2 are the followers.

The coordinator is the one which handles the reads and the writes. The client only interacts with the coordinator. The role of the followers will become clear as you read on.

Let us say that the client writes a key-value pair k0 to the store. The coordinator persists it and then asks the followers to persist it too. m0, m1, and m2 now have the key-value pair k0. Along with the key-value pair, a version number is also stored; in this case, the version number is zero.
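
Here is a toy sketch of that write path. The class names, the (value, version) tuple layout, and the method names are illustrative assumptions, not how any real store is implemented.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}  # key -> (value, version)

    def persist(self, key, value, version):
        self.store[key] = (value, version)


class Coordinator(Node):
    def __init__(self, name, followers):
        super().__init__(name)
        self.followers = followers

    def write(self, key, value):
        _, old_version = self.store.get(key, (None, -1))
        version = old_version + 1        # a brand-new key gets version 0
        self.persist(key, value, version)
        for follower in self.followers:  # ask the followers to persist too
            follower.persist(key, value, version)
        return version


m1, m2 = Node("m1"), Node("m2")
m0 = Coordinator("m0", [m1, m2])
m0.write("k0", "hello")  # k0 is now at version 0 on m0, m1, and m2
```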

How does a read occur?
The read goes to the coordinator m0. m0 reads its own value and also fetches the values from m1 and m2. It compares the value it has with those fetched from m1 and m2 and sees that all of them are in sync, i.e., they are consistent. It responds to the client with the value.
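
Continuing the toy classes from the write-path sketch above, the read path might look like this:

```python
def read(coordinator, followers, key):
    own = coordinator.store.get(key)
    others = [f.store.get(key) for f in followers]
    if all(copy == own for copy in others):
        value, _version = own
        return value  # all replicas agree; answer the client
    raise RuntimeError("replicas have diverged; conflict resolution needed")


print(read(m0, [m1, m2], "k0"))  # "hello"
```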

Now the client updates k0 with a new value. The same sequence of steps follows, and the version number is updated to one. While this update is in transit, m2 is not reachable from m0 due to network congestion, i.e., a network partition occurs.

Recalling the CAP theorem – Our system can be either available or consistent in the event of a network partition.

If we want to design our key-value store to be consistent, we should reject the write and throw an error to the client.

If we want to design our key-value store for availability, we should accept the write even though we cannot reliably store it on all the machines in the cluster, i.e., we cannot maintain the consistency of the value.

Every distributed system has to make this trade-off.

Let our key-value store trade off consistency for availability.
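
Continuing the sketch, this is roughly the choice the coordinator faces during a partition. The flag and the function are hypothetical; `reachable` stands for whichever followers can currently be contacted.

```python
PREFER_CONSISTENCY = False  # our imaginary store picks availability instead

def write_during_partition(coordinator, followers, reachable, key, value, version):
    if PREFER_CONSISTENCY and len(reachable) < len(followers):
        # Consistency: refuse the write rather than let replicas diverge.
        raise RuntimeError("not all replicas are reachable; rejecting the write")
    # Availability: accept the write on whoever can be reached.
    coordinator.persist(key, value, version)
    for follower in reachable:
        follower.persist(key, value, version)

# m2 is unreachable, so only m0 and m1 end up with version 1 of k0.
write_during_partition(m0, [m1, m2], [m1], "k0", "hello again", 1)
```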

The update proceeds, and the state now is: m0 and m1 have version one of the key-value pair, and m2 is still at version zero. Now the coordinator m0 goes down due to a hardware fault, again a network partition; our cluster now has only m1 and m2.

We have to figure out a new coordinator for our cluster. Let m1 become the new coordinator and m2 its follower. We will ignore how and why m1 becomes the new coordinator; we will tackle that in another post.

If you are a keen reader, you will see that the cluster is in an inconsistent state now as m1 has version one of the key-value whereas m2 has version zero.

Now the client tries to read k0. The same sequence of events as earlier occurs for the read, but the coordinator realizes that the cluster is in an inconsistent state.

How to resolve this inconsistency?

Distributed key-value stores make trade-offs to resolve this.

One option is to let the last write win. m1 sees that the value it has is the latest, updates that value in m2 too, and then returns the value to the client.

This approach is prone to the notorious clock skew problem in distributed systems; "last" is hard to determine when machines do not agree on the time.

Another option is to let the client decide which value to keep. In the event of inconsistency, all the conflicting values are sent to the client, and the client is free to determine which value to retain. Once the client decides, the chosen value is updated in the cluster.

In our scenario, with the version numbers, m1 could have taken the call that it has the latest value and updated it in m2 too before responding to the client; this is what real distributed key-value stores do. But it is not possible to do this in complicated scenarios involving highly diverged versions. Vector/Lamport clocks and version numbers aid this process.
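
A self-contained sketch of resolving the divergence using version numbers: each replica reports its (value, version) pair, the highest version wins, and the stale replicas would then be repaired with the winning value. The dictionary layout is an illustrative assumption.

```python
def resolve_by_version(replica_copies):
    # e.g. {"m1": ("new-value", 1), "m2": ("old-value", 0)}
    value, version = max(replica_copies.values(), key=lambda copy: copy[1])
    return value, version


print(resolve_by_version({"m1": ("new-value", 1), "m2": ("old-value", 0)}))
# ('new-value', 1)
```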

At this stage, it should be apparent as to why we store copies of the key-value in multiple machines: to avoid catastrophic loss of data in case of machine failure.

All machines in the cluster keep sending heartbeat messages to each other so that they are aware of non-reachable machines in the cluster. The gossip protocol is one way to achieve this. This cluster membership information is propagated to the client too.
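
Here is a toy failure detector, not a real gossip protocol: each node records when it last heard from its peers and treats silent ones as unreachable. The timeout value is an arbitrary assumption for illustration.

```python
import time

HEARTBEAT_TIMEOUT = 5.0  # seconds; an assumption, real systems tune this

last_heard = {"m0": time.time(), "m1": time.time(), "m2": time.time() - 30}

def unreachable_nodes(now=None):
    now = time.time() if now is None else now
    return [name for name, t in last_heard.items() if now - t > HEARTBEAT_TIMEOUT]

print(unreachable_nodes())  # ['m2']
```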

We have glossed over a lot of details and simplified things in the interest of readability. A real key-value store is much more complicated and can involve thousands of machines coordinating with each other for storage. After reading this, you should be able to make some sense of the many seminal papers in distributed systems.

The Dynamo, Aerospike, and Bigtable papers are good places to start.

Management Distilled

There have been tomes written on team management, but if you whittle it down to the core, it boils down to the following.

You set crystal clear expectations so that everyone knows what is expected of them. You create an environment where everyone can thrive and meet or exceed the set expectations. You periodically check in to ensure people have what they need to excel. You hold people accountable. If you see people meeting or exceeding the set expectations, you positively reinforce it. If someone is not able to meet the set expectations, you figure out what is going wrong and fix it. Rinse and repeat.

The above is management in as few words as possible.

Multiple books can and have been written on each sentence in the above paragraph.

Cloud Movement

Moving from your own data center to a public cloud can be a daunting task. It can get confusing: where do you start? How do you come up with a plan for the move?

The trick is to identify non-critical components of your tech stack to experiment with and gain confidence. Once you are familiar with the nuances of operating in the cloud, you can make the big-bang move.

One approach is to start with replicating your QA/development/staging environments in the cloud; this gives you a realistic picture of what it will take to move your actual production setup to the cloud.

Another approach is to move offline data processing workloads to the cloud. Data processing thrives on elastic capacity, and the cloud is amenable to workloads like this.

Also, start creating read replicas of your production databases in the cloud; this will give you the valuable experience of running a datastore on fickle cloud servers. Operating persistent datastores in a transient environment like the cloud can be challenging. Plan for this and gain experience before you trust your production data to the cloud.

Once you have done the above, you will have enough foresight to plan your complete migration to the cloud.

Choosing Open Source Libraries

Confused about how to choose an open source library? Read on.

Nowadays, the barrier to entry for creating open source software has come down; this has led to a boom in open source libraries. When it comes to picking one, there are tons of options, and it can get confusing. Here are some simple, actionable tips to help you choose one.

Start with the documentation. The more precise the documentation, the higher the chances of the library being the right choice. I feel HashiCorp has set the standard here; the attention to detail in their documentation is impressive.

Look at the GitHub issues. Browsing through the issues gives you a good understanding of the strengths and weaknesses of the project. An important tip here is to look at the closed issues too; do not restrict yourself to only the open ones.

If the project has a mailing list, lurk around the ongoing discussions. Go through old threads too. Usually, projects have two mailing lists – one for the users of the project and the other to discuss development. The latter gives you a good sense of the direction of the project.

Go through the past releases and the release notes to get a sense of how the project has evolved.

The combination of issues, mailing list, and release history paints the right picture of the project's evolution, its velocity, maintenance, and the responsiveness of the community. All of these are very important to a project's success and durability.

Peek through the source code. The more idiomatic it is, the better the chances that the library is well written.

Browse the test cases; the more comprehensive they are, the better the quality.

Following the above will enable you to make an informed decision.
