The CAP theorem says that in the event of a network partition, a distributed system can be either consistent or available, not both.
Let us first define some of the terms in the CAP theorem.
A distributed system is a group of independent computers coordinating with each other to solve a problem. The group of computers is called a cluster.
A network connects the computers so that they can communicate and coordinate. Whenever a network is involved, there are bound to be delays and outages. Also, the individual computers themselves may go down due to hardware failure. An event which leads to some of the machines in the cluster not being reachable is called a network partition.
Now that we have defined some of the terms, we will build a hypothetical distributed key-value store and witness some of the distributed computing concepts and trade-offs in action.
A key-value store is a dictionary data structure with persistence. A distributed key-value store uses multiple machines for storage.
The user sees and interacts with the key-value store as a single unit through a client. The multiple computers backing up the store is abstracted from the user.
Let us say that we have three machines – m0, m1, and m2 backing our key-value store. m0 is the coordinator and m1, and m2 are the followers.
The coordinator is the one which handles the reads and the writes. The client only interacts with the coordinator. The role of the followers will become clear as you read on.
Let us say that the client writes key-value k0 to the store. The coordinator persists it and then asks the followers also to persist. m0, m1, and m2 now have the key-value pair k0. Along with the key value pair, a version number is also stored; in this case, the version number is zero.
How does a read occur?
The read goes to the coordinator m0. m0 reads it’s own value, also fetches the values from m1, and m2. It compares the value it has with those fetched from m1, and m2 and sees that all of them are in sync, i.e., they are consistent. It responds to the client with the value.
Now the client updates k0 with a new value. The same sequence of steps follow, and the version number is updated to one. When this update is in transit, m2 is not reachable from m0 due to network congestion, i.e., a network partition occurs.
Recalling the CAP theorem – Our system can be either available or consistent in the event of a network partition.
If we want to design our key-value store to be consistent, we should reject the write and throw an error to the client.
If we want to design our key-value store for availability, we should accept the write even though we cannot reliably store it in all the machines in the cluster, i.e., maintain the consistency of the value.
Every distributed system has to make this trade-off.
Let our key-value store trade-off consistency for availability.
The update proceeds and the state now is: m0, and m1 have the version one of the key-value and m2 is still at version zero. Now, the coordinator m0 goes down due to a hardware fault, again a network partition; our cluster has only m1 and m2 now.
We have to figure out a new coordinator for our cluster. Let m1 become the new coordinator and m2 its follower. We will ignore how and why m1 becomes the new coordinator; we will tackle that in another post.
If you are a keen reader, you will see that the cluster is in an inconsistent state now as m1 has version one of the key-value whereas m2 has version zero.
Now the client tries to read k0. The same sequence of events as earlier occur for read, but coordinator realizes that the cluster is in an inconsistent state.
How to resolve this inconsistency?
Distributed key-value stores make trade-offs to resolve this.
One option is to let the last write win. m1 sees that the value it has is the latest and updates that value in m2 too and then returns the value to the client.
This is prone to the notorious clock problem in distributed systems.
Another option is to let the client decide which value to keep. In the event of inconsistency, all the conflicting values are sent to the client, and the client is free to determine which value to retain. Once the client decides, the chosen value is updated in the cluster.
In our scenario, with the version numbers, m1 could have taken the call that it has the latest value and update it in m2 too before responding to the client; this is what real distributed key-value stores do. But, it is not possible to do this in complicated scenarios involving highly diverged versions. Vector/Lamport clocks and version numbers aid this process.
At this stage, it should be apparent as to why we store copies of the key-value in multiple machines: to avoid catastrophic loss of data in case of machine failure.
All machines in the cluster keep sending heartbeat messages to each other so that they are aware of non-reachable machines in the cluster. Gossip protocol is one of the ways to achieve this. This cluster membership information is proliferated to the client too.`
We have glossed over a lot of details and simplified things in the interest of readability. A real key-value store is much more complicated and can involve thousands of machines coordinating with each other for storage. After reading this, you should be able to make some sense of the many seminal papers in distributed systems.
Dynamo paper, Aerospike paper, Bigtable paper.