Deep Dive into the Raft Consensus Algorithm: Core Mechanisms and Engineering Optimizations
Raft is a consensus algorithm for managing a replicated log across the nodes of a distributed system. Its clarity and simplicity have led to its widespread adoption in numerous distributed middleware systems (e.g., Etcd, Kafka, RocketMQ).
However, once you dive into Raft, you’ll discover that its mechanisms are remarkably sophisticated and elegant, making it highly worthwhile to study and document.
Data Structures and Rules
A major reason for Raft’s simplicity and clarity is that it defines specific data structures and strict rules that must be followed.
Naturally, these data structures and rules only fully make sense once you have worked through the mechanisms described in the sections that follow.
State
Persistent State
All nodes store the following persistent state:
| Parameter | Description |
|---|---|
| currentTerm | Latest term seen (initially 0, increases monotonically) |
| votedFor | Candidate ID that received a vote in this term (null if none) |
| log[] | Log entries. Each entry contains an index, term, and command |
Volatile State
Volatile state on all nodes:
| Parameter | Description |
|---|---|
| commitIndex | Highest known committed log index (initially 0, increases monotonically) |
| lastApplied | Highest log index applied to the state machine (initially 0, increases monotonically) |
Volatile state on Leaders:
| Parameter | Description |
|---|---|
| nextIndex[] | Next log index to send to each follower (one per follower) |
| matchIndex[] | Highest log index known to be replicated on each follower (one per follower) |
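Taken together, this state maps naturally onto a single structure. Below is a minimal Go sketch; the field names follow the tables above, but the `LogEntry` layout is an illustrative assumption, and locking and persistence details are elided. Later snippets in this article continue the same hypothetical package.

```go
package raft

// LogEntry is a single entry in the replicated log.
type LogEntry struct {
	Index   int    // position in the log (1-based)
	Term    int    // term in which the entry was created
	Command []byte // opaque state-machine command
}

// RaftState gathers the per-server state from the tables above.
// The persistent fields must be flushed to stable storage before
// the server responds to any RPC.
type RaftState struct {
	// Persistent state on all servers.
	CurrentTerm int
	VotedFor    int // -1 stands in for "null"
	Log         []LogEntry

	// Volatile state on all servers.
	CommitIndex int
	LastApplied int

	// Volatile state on leaders, keyed by follower ID and
	// reinitialized after each election.
	NextIndex  map[int]int
	MatchIndex map[int]int
}
```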
RPCs
RequestVote RPC
Arguments for a RequestVote RPC initiated by a Candidate:
| Parameter | Description |
|---|---|
| term | Candidate’s term |
| candidateId | Candidate requesting vote |
| lastLogIndex | Index of candidate’s last log entry |
| lastLogTerm | Term of candidate’s last log entry |
Response structure received:
| Return Value | Description |
|---|---|
| term | Current term, for candidate to update itself |
| voteGranted | True means candidate received vote |
AppendEntries RPC
Called by the Leader to replicate log entries or as a heartbeat. Arguments:
| Parameter | Description |
|---|---|
| term | Leader’s term |
| leaderId | Leader’s ID, so followers can redirect client requests |
| prevLogIndex | Index of log entry immediately preceding new ones |
| prevLogTerm | Term of prevLogIndex entry |
| entries[] | Log entries to store (empty for heartbeat; may send multiple for efficiency) |
| leaderCommit | Leader’s known highest committed index |
Response structure:
| Return Value | Description |
|---|---|
| term | Current term, for Leader to update itself |
| success | True if follower contained entry matching prevLogIndex and prevLogTerm |
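Both RPCs translate directly into message types. Continuing the Go sketch above (field names mirror the tables):

```go
// RequestVote RPC messages.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateID  int // candidate requesting the vote
	LastLogIndex int // index of candidate's last log entry
	LastLogTerm  int // term of candidate's last log entry
}

type RequestVoteReply struct {
	Term        int  // currentTerm, for the candidate to update itself
	VoteGranted bool // true means the candidate received the vote
}

// AppendEntries RPC messages.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderID     int        // so followers can redirect clients
	PrevLogIndex int        // index of entry immediately preceding the new ones
	PrevLogTerm  int        // term of the prevLogIndex entry
	Entries      []LogEntry // empty for heartbeats
	LeaderCommit int        // leader's commitIndex
}

type AppendEntriesReply struct {
	Term    int  // currentTerm, for the leader to update itself
	Success bool // true if the prevLogIndex/prevLogTerm check passed
}
```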
Rules
Rules for All Servers
- If `commitIndex > lastApplied`: increment `lastApplied` and apply `log[lastApplied]` to the state machine (see the sketch below).
- If an RPC request or response contains term `T > currentTerm`: set `currentTerm = T` and convert to follower.
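The first rule is the apply loop that every server runs. A sketch, reusing `RaftState` from above and assuming 1-based indexing over an uncompacted in-memory log; `applyToStateMachine` is an assumed callback:

```go
// applyCommitted advances lastApplied toward commitIndex, applying
// each newly committed entry to the state machine in log order.
func (rs *RaftState) applyCommitted(applyToStateMachine func(LogEntry)) {
	for rs.CommitIndex > rs.LastApplied {
		rs.LastApplied++
		applyToStateMachine(rs.Log[rs.LastApplied-1]) // entry at index lastApplied
	}
}
```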
Rules for Leaders
- Upon election: send initial empty `AppendEntries` RPCs (heartbeats) to each server; repeat during idle periods to prevent election timeouts.
- When a client request arrives: append the command to the log. After the entry is committed, apply it to the state machine and respond to the client.
- If last log index ≥ `nextIndex` for a follower: send an `AppendEntries` RPC with log entries starting at `nextIndex`:
  - If successful: update `nextIndex` and `matchIndex` for the follower.
  - If `AppendEntries` fails because of log inconsistency: decrement `nextIndex` and retry.
- If there exists an `N` such that `N > commitIndex`, a majority of `matchIndex[i] ≥ N`, and `log[N].term == currentTerm`: set `commitIndex = N` (see the sketch after this list).
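The last rule is the leader’s commit-advancement step; note the `log[N].term == currentTerm` guard, which the Commitment Restriction section below justifies. A sketch under the same assumptions, where the leader counts itself toward the majority:

```go
// advanceCommitIndex finds the largest N > commitIndex such that a
// majority of servers have replicated entry N and the entry belongs
// to the current term, then commits up to N.
func (rs *RaftState) advanceCommitIndex(clusterSize int) {
	for n := len(rs.Log); n > rs.CommitIndex; n-- {
		if rs.Log[n-1].Term != rs.CurrentTerm {
			break // older-term entries are never committed directly
		}
		replicas := 1 // the leader itself holds the entry
		for _, m := range rs.MatchIndex {
			if m >= n {
				replicas++
			}
		}
		if replicas > clusterSize/2 {
			rs.CommitIndex = n
			return
		}
	}
}
```

Scanning from the tail works because log terms are non-decreasing: once the loop reaches an entry from an older term, no smaller index can satisfy the current-term guard either.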
Rules for Followers
- Respond to RPCs from candidates and leaders.
- If the election timeout elapses without receiving an `AppendEntries` RPC from the current leader or granting a vote to a candidate: convert to candidate.
Rules for Candidates
- On conversion to candidate, start an election:
  - Increment `currentTerm`
  - Vote for self
  - Reset the election timer
  - Send `RequestVote` RPCs to all other servers
- If votes are received from a majority of servers: become leader.
- If an `AppendEntries` RPC is received from a new leader: convert to follower.
- If the election timeout elapses: start a new election.
RequestVote RPC Rules
- Reply false if `term < currentTerm`.
- If `votedFor` is null or `candidateId`, and the candidate’s log is at least as up-to-date as the receiver’s log, grant the vote (`voteGranted = true`); a sketch of this handler follows.
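Combined with the all-servers term rule, a vote handler looks roughly like the following sketch (persisting `currentTerm` and `votedFor` before replying is elided):

```go
// handleRequestVote implements the two voting rules above.
func (rs *RaftState) handleRequestVote(args RequestVoteArgs) RequestVoteReply {
	reply := RequestVoteReply{Term: rs.CurrentTerm}
	if args.Term < rs.CurrentTerm {
		return reply // rule 1: reject a stale candidate
	}
	if args.Term > rs.CurrentTerm {
		rs.CurrentTerm = args.Term // all-servers rule: adopt the newer term
		rs.VotedFor = -1
		reply.Term = rs.CurrentTerm
	}
	// "At least as up-to-date": a higher last term wins; with equal
	// terms, the longer log wins.
	lastIndex := len(rs.Log)
	lastTerm := 0
	if lastIndex > 0 {
		lastTerm = rs.Log[lastIndex-1].Term
	}
	upToDate := args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
	if (rs.VotedFor == -1 || rs.VotedFor == args.CandidateID) && upToDate {
		rs.VotedFor = args.CandidateID
		reply.VoteGranted = true
	}
	return reply
}
```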
AppendEntries RPC Rules
- Reply false if `term < currentTerm`.
- Reply false if the log doesn’t contain an entry at `prevLogIndex` whose term matches `prevLogTerm`.
- If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it.
- Append any new entries not already in the log.
- If `leaderCommit > commitIndex`, set `commitIndex = min(leaderCommit, index of last new entry)`. A sketch of the full receiver logic follows.
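A corresponding sketch of the receiver side (term adoption, follower conversion, and election-timer resets are elided):

```go
// handleAppendEntries implements the five receiver rules above.
func (rs *RaftState) handleAppendEntries(args AppendEntriesArgs) AppendEntriesReply {
	reply := AppendEntriesReply{Term: rs.CurrentTerm}

	// Rule 1: reject a stale leader.
	if args.Term < rs.CurrentTerm {
		return reply
	}
	// Rule 2: the log must contain a matching entry at prevLogIndex.
	if args.PrevLogIndex > 0 &&
		(len(rs.Log) < args.PrevLogIndex ||
			rs.Log[args.PrevLogIndex-1].Term != args.PrevLogTerm) {
		return reply
	}
	// Rules 3 and 4: truncate on conflict, then append what is missing.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i // index this entry should occupy
		if idx <= len(rs.Log) && rs.Log[idx-1].Term == e.Term {
			continue // already present
		}
		rs.Log = append(rs.Log[:idx-1], args.Entries[i:]...)
		break
	}
	// Rule 5: advance commitIndex to min(leaderCommit, last new index).
	if args.LeaderCommit > rs.CommitIndex {
		lastNew := args.PrevLogIndex + len(args.Entries)
		if args.LeaderCommit < lastNew {
			rs.CommitIndex = args.LeaderCommit
		} else {
			rs.CommitIndex = lastNew
		}
	}
	reply.Success = true
	return reply
}
```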
Node States
Distributed systems consist of multiple nodes. In Raft, each node assumes one of three states:
- Leader: Handles all client requests.
- Follower: Passive state; only responds to RPCs from leaders and candidates.
- Candidate: A temporary state used to elect a new leader.
Under normal operation, there is exactly one leader, and all other nodes are followers. The candidate state only appears during leader elections.
Terms
In Raft, time is divided into terms. Each node maintains a term number.
The term number is a monotonically increasing integer that acts as a logical clock in Raft.
Every term begins with an election, and Raft guarantees that at most one leader is elected per term.
Leader Election
In Raft, nodes communicate with one another via RPCs.
Normally, the leader periodically sends heartbeats to followers to assert its authority.
If a follower receives no communication for a period (election timeout), it assumes there is no leader and initiates an election.
The follower increments its term, transitions to candidate, and sends RequestVote RPCs to other nodes.
A candidate remains in this state until one of three events occurs:
- It wins the election.
- Another server establishes itself as leader.
- A period of time goes by with no winner (election timeout).
During an election, a candidate may itself receive vote requests from other candidates, but it only defers to a request that carries a higher term.
If multiple candidates split the vote, a timeout occurs. Raft mitigates this by assigning a randomized election timeout to each server.
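A sketch of the randomized timeout, using the 150 to 300 ms window suggested in the Raft paper:

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws a fresh timeout from the 150-300 ms
// window suggested in the Raft paper. Re-randomizing before every
// election makes repeated split votes unlikely.
func randomElectionTimeout() time.Duration {
	const base = 150 * time.Millisecond
	jitter := time.Duration(rand.Int63n(int64(150 * time.Millisecond)))
	return base + jitter
}
```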
Log Replication
In Raft, only the leader handles client requests.
When a client sends a request that modifies the state machine, the leader appends the corresponding command to its local log.
The leader then sends AppendEntries RPCs in parallel to all followers to replicate the entry.
Once a majority of nodes have successfully replicated the entry, the leader marks it as committed.
The leader then applies the command to its state machine and returns the response to the client.
When do followers apply these commands? A follower applies a command only after it learns that the entry has been committed: the leader’s `leaderCommit` field advances the follower’s `commitIndex`, and entries up to `commitIndex` are then applied to the state machine.
Raft’s log replication guarantees two properties:
- If two entries in different logs have the same index and term, then they store the same command.
- If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.
The first property holds because a leader creates at most one entry with a given index in a given term, and entries never change their position in the log.
The second property stems from the AppendEntries RPC: the leader’s request includes both the new log entries and the index and term of the preceding entry. If a follower cannot match the preceding entry, it rejects the new entries.
To ensure log consistency, the leader forces followers’ logs to match its own.
The leader maintains a nextIndex for each follower, representing the next log index to send. Upon election, the leader initializes nextIndex to the index of its last log entry plus one. When a follower rejects an AppendEntries request due to inconsistency, the leader decrements nextIndex and retries until consistency is found. The follower then deletes conflicting entries and appends the leader’s entries.
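A sketch of this probe-and-back-off loop, continuing the earlier snippets; `sendAppendEntries` stands in for an assumed RPC transport:

```go
// syncFollower repeatedly sends AppendEntries, backing nextIndex off
// on each rejection until the follower's log matches the leader's.
func (rs *RaftState) syncFollower(followerID int,
	sendAppendEntries func(AppendEntriesArgs) AppendEntriesReply) {
	for {
		next := rs.NextIndex[followerID]
		prevTerm := 0
		if next > 1 {
			prevTerm = rs.Log[next-2].Term
		}
		reply := sendAppendEntries(AppendEntriesArgs{
			Term:         rs.CurrentTerm,
			PrevLogIndex: next - 1,
			PrevLogTerm:  prevTerm,
			Entries:      rs.Log[next-1:], // everything from nextIndex onward
			LeaderCommit: rs.CommitIndex,
		})
		if reply.Success {
			rs.MatchIndex[followerID] = len(rs.Log)
			rs.NextIndex[followerID] = len(rs.Log) + 1
			return
		}
		if reply.Term > rs.CurrentTerm {
			return // a newer term exists; the caller should step down
		}
		rs.NextIndex[followerID]-- // log inconsistency: back off and retry
	}
}
```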
Safety
Election Restriction
During elections, Raft enforces a rule: only a candidate that contains all committed log entries can become a leader. This prevents a newly elected leader from overwriting and losing committed entries.
Raft enforces this through its voting mechanism. Specifically, a voter will reject a RequestVote request if the candidate’s log is not at least as up-to-date as its own.
Commitment Restriction
When a new leader is elected, it may encounter entries from previous terms that are technically commit-eligible (present on a majority of nodes but with a term lower than the current one).
Raft stipulates that a leader cannot commit entries from previous terms directly. It can only commit them indirectly by committing a new entry in its current term.
Directly committing old-term entries doesn’t guarantee they won’t be overwritten by a future leader, which could break consensus. By committing a new-term entry, the leader effectively “locks in” the old-term entries, ensuring state machine safety.
Leader Completeness
Under Raft, a leader never loses committed entries. This property is known as Leader Completeness.
We can prove this by contradiction:
Suppose a leader were missing a committed entry; then that node must have won an election without the entry in its log.
Winning an election requires votes from a majority of servers, while an entry is committed only after a majority of servers have stored it. Any two majorities overlap, so at least one voter must have stored the entry.
By the election restriction, that voter would refuse to vote for the candidate, because a log lacking a committed entry cannot be at least as up-to-date as its own.
Thus, such a node could never become leader, contradicting the assumption.
Server Failure and Recovery
When a leader crashes, a new election is triggered, and the new leader then forces the followers’ logs into agreement with its own, overwriting conflicting entries through the log replication mechanism described above.
If a follower or candidate crashes, Raft handles it by retrying RPCs indefinitely. Because Raft RPCs are idempotent, a retried RPC that had already taken effect before the response was lost causes no harm.
Cluster Membership Changes
Changing cluster membership can cause split-brain scenarios, such as having two leaders (one for the old configuration and one for the new).
Raft mandates a two-phase approach called Joint Consensus to safely transition configurations:
- When a new configuration (`C-new`) is requested, the leader stores and replicates a joint configuration (`C-old,new`) as a log entry.
- A server uses the joint configuration as soon as it stores it: every decision (elections and entry commitment) then requires separate majorities from both `C-old` and `C-new`.
- Once the joint configuration is committed, the leader creates and replicates a log entry for `C-new`.
- Once `C-new` is committed, the old configuration is obsolete, and servers outside `C-new` can be shut down.
When new nodes join, they typically have no log and need time to catch up. Raft allows them to join as non-voting members, often called Learners. Once they catch up, they can participate in configuration changes.
If the old configuration’s leader is not part of the new configuration, it will step down to follower after committing the new configuration.
After a configuration update, nodes removed from the new configuration will stop receiving heartbeats. They might trigger elections and vie for leadership, causing cluster instability.
To prevent this, servers ignore RequestVote RPCs when they believe a current leader exists. Specifically, a server disregards a vote request that arrives within the minimum election timeout of last hearing from a leader, a window that the leader’s heartbeats keep renewing.
Log Compaction
Raft relies heavily on logs, but continuously growing logs can burden the cluster.
Raft uses snapshots for log compaction. Each server independently creates snapshots containing:
- The index of the last log entry included in the snapshot
- The term of that last included entry
- The state machine’s state as of that entry
- The latest cluster configuration
Once a snapshot is created, all log entries prior to the snapshot’s last included index can be safely deleted.
Although snapshots are created independently, the leader occasionally needs to send them to lagging or newly joined followers whose logs might have been compacted away. The leader sends the snapshot to help them catch up.
This is done via the InstallSnapshot RPC, which is chunked (since the state machine data can be large):
Arguments:
| Parameter | Description |
|---|---|
| term | Leader’s term |
| leaderId | Leader’s ID, for client redirection |
| lastIncludedIndex | The index of the last log entry in the snapshot |
| lastIncludedTerm | The term of the last log entry in the snapshot |
| offset | Byte offset where this chunk begins |
| data[] | Raw bytes starting from offset |
| done | True if this is the last chunk |
Response:
| Return Value | Description |
|---|---|
| term | Current term, for leader to update itself |
Servers receiving this RPC must follow these steps:
- Reply immediately if `term < currentTerm`.
- If `offset` is 0, create a new snapshot file.
- Write `data` at the specified `offset`.
- If `done` is false, wait for the next chunk.
- Save the snapshot file, discarding any existing snapshot with a smaller index.
- If the existing log contains an entry with the same index and term as `lastIncludedIndex`, retain the subsequent log entries and reply.
- Otherwise, discard the entire log.
- Reset the state machine and cluster configuration using the snapshot (see the sketch below).
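As a rough illustration, here is the receiver side in the same sketch style. It assumes chunks arrive in order and are buffered in memory, and that the log has not been compacted before; a real implementation streams chunks to a file on disk and tracks the snapshot’s base index when addressing the log.

```go
// InstallSnapshotArgs mirrors the argument table above.
type InstallSnapshotArgs struct {
	Term              int
	LeaderID          int
	LastIncludedIndex int
	LastIncludedTerm  int
	Offset            int
	Data              []byte
	Done              bool
}

// handleInstallSnapshot follows the receiver steps above and returns
// the term for the reply. pending buffers the partially received
// snapshot; resetStateMachine is an assumed callback.
func (rs *RaftState) handleInstallSnapshot(args InstallSnapshotArgs,
	pending *[]byte, resetStateMachine func(snapshot []byte)) int {
	if args.Term < rs.CurrentTerm {
		return rs.CurrentTerm // step 1: reply immediately to a stale leader
	}
	if args.Offset == 0 {
		*pending = nil // step 2: start a new snapshot
	}
	*pending = append(*pending, args.Data...) // step 3 (in-order chunks assumed)
	if !args.Done {
		return rs.CurrentTerm // step 4: wait for further chunks
	}
	// Step 6: if the log still covers the snapshot point, keep the suffix.
	if args.LastIncludedIndex <= len(rs.Log) &&
		rs.Log[args.LastIncludedIndex-1].Term == args.LastIncludedTerm {
		rs.Log = rs.Log[args.LastIncludedIndex:]
		return rs.CurrentTerm
	}
	// Steps 7 and 8: discard the whole log and load the snapshot.
	rs.Log = nil
	rs.CommitIndex = args.LastIncludedIndex // snapshot state is committed
	rs.LastApplied = args.LastIncludedIndex // and already applied
	resetStateMachine(*pending)
	return rs.CurrentTerm
}
```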
You might wonder why all nodes create snapshots independently rather than just the leader sending them:
- Performance: Sending snapshots consumes bandwidth; having followers create their own is more efficient.
- Complexity: Combining snapshot transfers with normal log replication would make the synchronization mechanism overly complex and slow.
Raft also suggests performance optimizations:
- Snapshots shouldn’t be too frequent or too rare. A simple trigger is when the log reaches a certain size threshold.
- Snapshot creation takes time, but we should minimize blocking new commands. Techniques like Copy-on-Write (CoW) can speed this up.
Client Interaction
When a client communicates with a Raft cluster, it typically sends requests to a random node. If that node isn’t the leader, it rejects the request and returns the leader’s address. If the leader crashes, the request times out, triggering a client retry.
A key goal of Raft is linearizable semantics: each command must appear to take effect exactly once. If a leader crashes after committing a log entry but before responding, the client will retry, so the same command may be submitted twice. To handle this, each command carries a unique serial number; the state machine checks whether it has already executed that command and, if so, returns the cached result instead of executing it again.
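A minimal sketch of this duplicate detection, assuming (as in the Raft dissertation’s client sessions) that clients tag each command with a client ID and a monotonically increasing sequence number; all names here are illustrative:

```go
// Session tracks the most recent command applied for one client.
type Session struct {
	LastSeq    int64  // highest sequence number applied for this client
	LastResult []byte // cached response for LastSeq
}

// DedupStateMachine wraps the real state-machine transition with
// duplicate detection keyed by (clientID, seq).
type DedupStateMachine struct {
	sessions map[string]*Session
	apply    func(cmd []byte) []byte // the real state-machine transition
}

func NewDedupStateMachine(apply func([]byte) []byte) *DedupStateMachine {
	return &DedupStateMachine{sessions: make(map[string]*Session), apply: apply}
}

// Execute applies cmd at most once per (clientID, seq); a retry of an
// already-applied command returns the cached result without re-executing.
func (sm *DedupStateMachine) Execute(clientID string, seq int64, cmd []byte) []byte {
	s, ok := sm.sessions[clientID]
	if ok && seq <= s.LastSeq {
		return s.LastResult // duplicate: answer from the session cache
	}
	result := sm.apply(cmd)
	if !ok {
		s = &Session{}
		sm.sessions[clientID] = s
	}
	s.LastSeq, s.LastResult = seq, result
	return result
}
```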
For read-only requests, simply returning data can be risky: the leader might have been deposed but is unaware, potentially returning stale data.
Raft addresses this with two measures:
- Upon election, the leader commits an empty log entry to ensure previous terms’ entries are properly committed.
- Before responding to read-only requests, the leader must communicate with a majority of nodes to confirm its authority hasn’t been revoked (see the sketch below).
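The second measure might look like the following sketch: a synchronous quorum check before serving the read, where `ackHeartbeat` and `readStateMachine` are assumed helpers.

```go
package raft

import "errors"

// serveRead answers a read-only request without appending to the log:
// before reading, the leader confirms with a majority that it has not
// been deposed. ackHeartbeat is an assumed round-trip that reports
// whether a peer still accepts this leader.
func (rs *RaftState) serveRead(peers []int,
	ackHeartbeat func(peer int) bool,
	readStateMachine func() []byte) ([]byte, error) {
	acks := 1 // the leader counts itself
	for _, p := range peers {
		if ackHeartbeat(p) {
			acks++
		}
	}
	if acks <= (len(peers)+1)/2 {
		return nil, errors.New("leadership unconfirmed; rejecting read")
	}
	return readStateMachine(), nil
}
```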
Engineering Optimizations
Pre-Vote
If network partitions isolate some nodes from the leader, they may repeatedly trigger elections and elect a leader in the minority partition.
When these nodes reconnect to the main cluster, they can cause a “false alarm” election: their term might be higher, but their logs are outdated, making them unfit to be leader. Their reconnection only bumps up the cluster’s term, triggering another full election.
This instability hurts performance and availability, leading to the Pre-Vote mechanism:
- Before starting a real election, a node first solicits pre-votes, without incrementing its term.
- A node grants a pre-vote only if it would vote for the candidate in a real election: the candidate’s log is sufficiently up-to-date, and the node has not heard from a leader within its own election timeout.
- Only if the pre-vote gathers a majority does the node increment its term and start a real election.
Essentially, pre-vote acts as a safety check to avoid unnecessary elections.
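A sketch of the receiver-side check, continuing the earlier snippets; `heardFromLeader` is an assumed freshness flag maintained by the heartbeat handler:

```go
// handlePreVote grants a pre-vote only if the receiver would vote for
// the candidate in a real election.
func (rs *RaftState) handlePreVote(args RequestVoteArgs, heardFromLeader bool) bool {
	if heardFromLeader {
		return false // a healthy leader exists; don't destabilize it
	}
	if args.Term < rs.CurrentTerm {
		return false
	}
	lastIndex := len(rs.Log)
	lastTerm := 0
	if lastIndex > 0 {
		lastTerm = rs.Log[lastIndex-1].Term
	}
	// Unlike a real vote, granting a pre-vote changes neither
	// currentTerm nor votedFor.
	return args.LastLogTerm > lastTerm ||
		(args.LastLogTerm == lastTerm && args.LastLogIndex >= lastIndex)
}
```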
Leader Transfer
Sometimes the current leader needs maintenance, or another node is better suited to lead, requiring a leader transfer.
The process is designed as follows:
- The current leader stops accepting new client requests.
- It replicates logs to bring the target node’s log up to date.
- The current leader sends a `TimeoutNow` RPC to the target node, immediately triggering an election there.
Upon receiving TimeoutNow, the target node is highly likely to win the election and become the new leader. It then informs the former leader of the new term, prompting it to step down.
Single-Server Membership Change
Joint Consensus is robust but complex to implement.
A simpler alternative is Single-Server Membership Change, which works as follows:
- The leader appends the new configuration (`C-new`) to its log.
- Nodes apply the new configuration immediately upon receipt, without waiting for it to be committed.
- Once a majority of nodes in `C-new` have replicated the entry, the leader marks it as committed.
- The change is complete, and the cluster can proceed with subsequent changes.
By changing only one server at a time, the pigeonhole principle guarantees that any majority of the old configuration and any majority of the new configuration share at least one node, so two disjoint quorums can never form and split-brain elections are prevented.
However, for correctness, a newly elected leader must commit a no-op entry before processing client requests. In terms of availability, Joint Consensus is generally more robust. Overall, while single-server changes are simpler, they are best suited for infrequent, one-off additions or removals.
Algorithm Limitations
Single-Leader Bottleneck
The leader handles all client requests, bearing significantly more load than followers. This can become a performance bottleneck.
Byzantine Fault Tolerance
Raft assumes crash-stop failures. It cannot handle Byzantine faults, where nodes behave maliciously or send conflicting information.
Multi-Leader Limitations
Raft is strictly designed around a single leader per replication group. Deployments that need to scale writes beyond one leader must shard data across multiple independent Raft groups (the Multi-Raft pattern) or adopt other consensus protocols.