Raft: The Understandable Consensus Algorithm
We just talked about the CAP theorem. Specifically, how sometimes you have to be CP (Consistent, Partition-tolerant). But being CP is hard. It means when things go sideways, you might have to stop and wait.
Waiting for what? Waiting for everyone to agree on what happened. That’s Consensus.
Getting nodes in a distributed system to agree on a single value or state is incredibly tricky. You saw the Two Generals Problem back on Jan 13th – a fundamental impossibility with unreliable networks.
So, how do systems like etcd, CockroachDB, or even Kubernetes (for its control plane) manage to agree on things like cluster state or leadership? They use consensus algorithms. And the most popular, most understandable one is Raft.
What is Consensus, Really?
At its heart, consensus is about getting a group of servers to agree on a sequence of operations (a replicated log). If all servers reliably apply the same sequence of operations, they will all end up in the same state. Simple, right? Except when servers fail, messages get lost, or the network splits.
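To make "same log, same state" concrete, here's a toy sketch in Go. The `Op` type and `apply` function are mine, invented for illustration (not any real Raft library's API): two servers that replay an identical log of set-operations end up in identical key-value state.

```go
package main

import "fmt"

// Op is a single operation in the replicated log.
// (Hypothetical type for illustration only.)
type Op struct {
	Key, Value string
}

// apply replays a log of operations against a fresh key-value
// state machine. Because application is deterministic, any server
// that applies the same log in the same order reaches the same state.
func apply(log []Op) map[string]string {
	state := make(map[string]string)
	for _, op := range log {
		state[op.Key] = op.Value
	}
	return state
}

func main() {
	log := []Op{{"x", "1"}, {"y", "2"}, {"x", "3"}}
	// Two "servers" replaying the identical log converge.
	a, b := apply(log), apply(log)
	fmt.Println(a["x"] == b["x"] && a["y"] == b["y"]) // true
}
```

All the machinery below exists to get that one precondition: every server sees the same log, in the same order.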
Raft breaks this problem down into three sub-problems:
- Leader Election: How do you pick one server to be in charge?
- Log Replication: How does the leader get the rest of the servers to catch up?
- Safety: How do you guarantee correctness even when things fail?
Leader Election: Finding the Boss
Raft uses a timer-based approach. Every server is in one of three states:
- Follower: The default. Listens to the Leader. If it doesn’t hear from the Leader for a while (timeout), it gets suspicious.
- Candidate: When a Follower’s timer expires, it becomes a Candidate. It increments its “term” (think of it as a version number), votes for itself, and sends “RequestVote” RPCs to other servers.
- Leader: If a Candidate receives votes from a majority of servers, it becomes the Leader. It then starts sending “AppendEntries” RPCs (heartbeats) to all Followers to maintain its authority.
If a Candidate times out without winning (say the vote split between several Candidates), it increments its term and starts a new election; Raft randomizes each server’s election timeout so that repeated split votes are unlikely. If the Candidate instead receives an AppendEntries RPC from a server whose term is at least as high as its own, it accepts that server as Leader and reverts to Follower.
Log Replication: Keeping Everyone Synced
Once you have a leader, the hard work begins: replicating the state machine log.
- Client Sends Command: The client sends a command (e.g., “Set Key X to Value Y”) to the Leader.
- Leader Appends to Log: The Leader appends the command as a new entry in its own log.
- Leader Sends AppendEntries: The Leader sends this new entry (along with previous entries for consistency checks) to each Follower via an AppendEntries RPC.
- Followers Acknowledge: Followers append the entry to their logs and reply to the Leader.
- Leader Commits: Once the entry is stored on a majority of the cluster (the Leader itself plus enough Followers), the Leader “commits” it. This means the Leader guarantees that entry will be persisted and eventually applied by all healthy servers.
- Leader Applies Command: The Leader applies the command to its state machine and responds to the client.
- Followers Apply: Followers also apply committed entries to their state machines.
Crucially, if a Follower is behind, the Leader keeps sending it entries until it catches up. If a Follower’s log conflicts with the Leader’s, the Leader walks backward through that Follower’s log (retrying with an earlier index each time) until it finds the last entry they agree on, then overwrites everything after it with its own entries.
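The “backtrack and resync” behavior hinges on a consistency check inside AppendEntries: each RPC carries the index and term of the entry immediately preceding the new ones, and the Follower rejects the RPC if its own log disagrees at that point. A minimal, 0-indexed sketch, with toy types of my own rather than a real implementation:

```go
package main

import "fmt"

// Entry is one log entry: the term it was created in, plus a command.
type Entry struct {
	Term    int
	Command string
}

// appendEntries models the follower side of the RPC. prevIndex/prevTerm
// identify the entry just before the new ones. If the follower's log
// doesn't match there, it rejects (and the leader retries with an earlier
// prevIndex). On success, the follower drops any conflicting suffix and
// appends the leader's entries.
func appendEntries(log []Entry, prevIndex, prevTerm int, entries []Entry) ([]Entry, bool) {
	// Reject if we don't have prevIndex, or our term disagrees there.
	if prevIndex >= 0 && (prevIndex >= len(log) || log[prevIndex].Term != prevTerm) {
		return log, false
	}
	// Truncate everything after prevIndex, then append the leader's entries.
	log = append(log[:prevIndex+1], entries...)
	return log, true
}

func main() {
	follower := []Entry{{1, "set x=1"}, {1, "set y=2"}, {2, "set x=9"}} // stale tail
	// Leader: "the entry at index 1, term 1, precedes these new entries."
	follower, ok := appendEntries(follower, 1, 1, []Entry{{3, "set y=5"}})
	fmt.Println(ok, len(follower)) // true 3
}
```

Note how the stale term-2 entry at the tail gets replaced by the Leader’s term-3 entry; that truncation is exactly the “delete conflicting entries and resync” step.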
Safety Guarantees
Raft provides several safety properties:
- Election Safety: At most one leader can be elected in a given term. Each server votes at most once per term, and winning requires a majority, so two leaders can’t both win the same term.
- Leader Append-Only: A Leader only appends to its log; it never overwrites or deletes entries.
- Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all preceding entries.
- Leader Completeness: Once an entry is committed, every Leader elected in a later term has that entry in its log. Servers refuse to vote for Candidates whose logs are less up-to-date than their own.
These rules ensure that the system remains consistent and doesn’t diverge.
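The Log Matching property can even be stated as a checkable invariant. Here's a toy checker of my own devising (it identifies entries by the term stored at each index, following the paper's argument):

```go
package main

import "fmt"

// Entry holds just the term here; that's all Log Matching cares about.
type Entry struct{ Term int }

// logMatching checks the Log Matching property on two logs: if they
// agree at some index (same term), every earlier entry must agree too.
// The AppendEntries consistency check enforces this inductively.
func logMatching(a, b []Entry) bool {
	n := len(a)
	if len(b) < n {
		n = len(b)
	}
	for i := n - 1; i >= 0; i-- {
		if a[i].Term == b[i].Term {
			// Found a shared entry: all preceding entries must agree.
			for j := 0; j < i; j++ {
				if a[j].Term != b[j].Term {
					return false
				}
			}
			return true
		}
	}
	return true // no shared entry: the property holds vacuously
}

func main() {
	a := []Entry{{1}, {1}, {2}}
	b := []Entry{{1}, {1}, {2}, {3}} // same prefix, one extra entry
	fmt.Println(logMatching(a, b))   // true
}
```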
What I’m Learning
I used to think consensus was pure magic. How do you get everyone to agree when the network is trying its best to tear you apart?
Raft’s brilliance is in its simplicity (relative to Paxos, anyway). It breaks down the problem into understandable parts: get a leader, then replicate the log. The rules for each part are explicit.
The trade-off, as we saw with the CAP theorem, is that in a partition, the CP nature of Raft means part of the system becomes unavailable to maintain that precious consistency. A minority partition can’t elect a leader or commit new writes, so clients on that side are blocked until the network heals. That’s the price of guaranteed correctness.
It’s a cornerstone for building reliable distributed systems that need to stay in sync.
What distributed systems do you use daily that rely on consensus?