Deep Dive into the Raft Consensus Algorithm: Core Mechanisms and Engineering Optimizations

Raft is a consensus algorithm used in distributed systems. Its clarity and simplicity have led to its widespread adoption in numerous distributed middleware systems (e.g., Etcd, Kafka, RocketMQ).

However, once you dive into Raft, you’ll discover that its mechanisms are remarkably sophisticated and elegant, making it highly worthwhile to study and document.

Data Structures and Rules

A major reason for Raft’s simplicity and clarity is that it defines specific data structures and strict rules that must be followed.

Naturally, these data structures and rules only make sense and become fully effective when viewed in conjunction with, or after thoroughly understanding, the subsequent mechanisms.

State

Persistent State

All nodes store the following persistent state:

  • currentTerm: Latest term the server has seen (initialized to 0, increases monotonically)
  • votedFor: Candidate ID that received this server’s vote in the current term (null if none)
  • log[]: Log entries; each entry contains an index, a term, and a command

Volatile State

Volatile state on all nodes:

  • commitIndex: Highest log index known to be committed (initialized to 0, increases monotonically)
  • lastApplied: Highest log index applied to the state machine (initialized to 0, increases monotonically)

Volatile state on Leaders:

  • nextIndex[]: For each follower, the index of the next log entry to send to it
  • matchIndex[]: For each follower, the highest log index known to be replicated on it
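
As a minimal sketch, this state might be grouped as follows in Go (the type and field names are illustrative, not taken from any particular implementation; votedFor uses -1 to stand for “null”):

```go
package raft

// LogEntry is one slot in the replicated log.
type LogEntry struct {
	Index   int    // position of the entry in the log
	Term    int    // term in which the entry was created by a leader
	Command []byte // opaque state-machine command
}

// State groups the per-server state described above.
type State struct {
	// Persistent state (written to stable storage before responding to RPCs).
	CurrentTerm int        // latest term this server has seen
	VotedFor    int        // candidate voted for in CurrentTerm (-1 if none)
	Log         []LogEntry // the replicated log

	// Volatile state on all servers.
	CommitIndex int // highest log index known to be committed
	LastApplied int // highest log index applied to the state machine

	// Volatile state on leaders, reinitialized after each election.
	NextIndex  map[int]int // per follower: next log index to send
	MatchIndex map[int]int // per follower: highest index known to be replicated
}
```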

RPCs

RequestVote RPC

Arguments for a RequestVote RPC initiated by a Candidate:

  • term: Candidate’s term
  • candidateId: Candidate requesting the vote
  • lastLogIndex: Index of the candidate’s last log entry
  • lastLogTerm: Term of the candidate’s last log entry

Response structure received:

  • term: Receiver’s current term, for the candidate to update itself
  • voteGranted: True means the candidate received the vote
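
Expressed as Go message types (a sketch; the field names simply mirror the lists above):

```go
package raft

// RequestVoteArgs is sent by a candidate to gather votes.
type RequestVoteArgs struct {
	Term         int // candidate's term
	CandidateId  int // candidate requesting the vote
	LastLogIndex int // index of the candidate's last log entry
	LastLogTerm  int // term of the candidate's last log entry
}

// RequestVoteReply is the receiver's answer.
type RequestVoteReply struct {
	Term        int  // receiver's current term, so the candidate can update itself
	VoteGranted bool // true means the candidate received this server's vote
}
```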

AppendEntries RPC

Called by the Leader to replicate log entries or as a heartbeat. Arguments:

  • term: Leader’s term
  • leaderId: Leader’s ID, so followers can redirect client requests
  • prevLogIndex: Index of the log entry immediately preceding the new ones
  • prevLogTerm: Term of the prevLogIndex entry
  • entries[]: Log entries to store (empty for heartbeat; may batch several for efficiency)
  • leaderCommit: Leader’s commitIndex (highest index known to be committed)

Response structure:

  • term: Receiver’s current term, for the Leader to update itself
  • success: True if the follower contained an entry matching prevLogIndex and prevLogTerm
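
And the corresponding Go message types (again a sketch, reusing the LogEntry type from the state sketch above):

```go
package raft

// AppendEntriesArgs is sent by the leader to replicate entries or as a heartbeat.
type AppendEntriesArgs struct {
	Term         int        // leader's term
	LeaderId     int        // so followers can redirect clients
	PrevLogIndex int        // index of the entry immediately preceding Entries
	PrevLogTerm  int        // term of the PrevLogIndex entry
	Entries      []LogEntry // empty for heartbeat; may batch several entries
	LeaderCommit int        // leader's commitIndex
}

// AppendEntriesReply is the follower's answer.
type AppendEntriesReply struct {
	Term    int  // receiver's current term, so the leader can update itself
	Success bool // true if the follower matched PrevLogIndex and PrevLogTerm
}
```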

Rules

Rules for All Servers

  1. If commitIndex > lastApplied, increment lastApplied and apply log[lastApplied] to state machine.
  2. If RPC request or response contains term T > currentTerm: set currentTerm = T, convert to follower.

Rules for Leaders

  1. Upon election: send initial empty AppendEntries RPCs (heartbeat) to each server; repeat during idle periods to prevent election timeouts.
  2. When a client request arrives: append command to log. After the entry is committed, apply it to the state machine and respond to the client.
  3. If last log index ≥ nextIndex for a follower: send an AppendEntries RPC with log entries starting at nextIndex:
  • If successful: update nextIndex and matchIndex for the follower.
  • If AppendEntries fails because of log inconsistency: decrement nextIndex and retry.
  4. If there exists an N such that N > commitIndex, a majority of matchIndex[i] ≥ N, and log[N].term == currentTerm: set commitIndex = N (see the sketch below).
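
Rule 4 is the leader’s commitment rule; a small sketch of it, reusing the LogEntry type from the state sketch and assuming a 1-based log with a sentinel entry at index 0:

```go
package raft

// advanceCommitIndex returns the leader's new commit index: the largest N greater
// than commitIndex such that a majority of servers have matchIndex >= N and
// log[N].Term equals currentTerm. matchIndex covers the followers only.
func advanceCommitIndex(log []LogEntry, currentTerm, commitIndex, clusterSize int,
	matchIndex map[int]int) int {
	for n := len(log) - 1; n > commitIndex; n-- {
		if log[n].Term != currentTerm {
			continue // commitment restriction: only entries from the current term count
		}
		replicas := 1 // the leader always holds its own entry
		for _, m := range matchIndex {
			if m >= n {
				replicas++
			}
		}
		if 2*replicas > clusterSize {
			return n
		}
	}
	return commitIndex
}
```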

Rules for Followers

  1. Respond to RPCs from candidates and leaders.
  2. If election timeout elapses without receiving AppendEntries RPC from current leader or without granting vote to a candidate: convert to candidate.

Rules for Candidates

  1. On conversion to candidate, start election:
  • Increment currentTerm
  • Vote for self
  • Reset election timer
  • Send RequestVote RPCs to all other servers
  2. If votes received from majority of servers: become leader.
  3. If AppendEntries RPC received from new leader: convert to follower.
  4. If election timeout elapses: start new election.

RequestVote RPC Rules

  1. Reply false if term < currentTerm.
  2. If votedFor is null or candidateId, and candidate’s log is at least as up-to-date as receiver’s log, grant vote (voteGranted = true).
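
A sketch of this receiver logic, reusing the State, RequestVoteArgs, and RequestVoteReply types from the earlier sketches (locking, persistence, and timer handling omitted; the log is assumed to start with a sentinel entry at index 0):

```go
package raft

// handleRequestVote applies the two rules above, plus the all-servers rule that a
// higher term converts the receiver into a follower of that term.
func handleRequestVote(s *State, args RequestVoteArgs) RequestVoteReply {
	if args.Term > s.CurrentTerm {
		s.CurrentTerm = args.Term
		s.VotedFor = -1 // new term: we have not voted yet
	}
	reply := RequestVoteReply{Term: s.CurrentTerm}

	// 1. Reject a candidate whose term is behind ours.
	if args.Term < s.CurrentTerm {
		return reply
	}

	// 2. Grant the vote only if we have not voted for someone else this term and
	//    the candidate's log is at least as up-to-date as ours: a higher last log
	//    term wins; equal last terms compare log length.
	myLastIndex := len(s.Log) - 1
	myLastTerm := s.Log[myLastIndex].Term
	upToDate := args.LastLogTerm > myLastTerm ||
		(args.LastLogTerm == myLastTerm && args.LastLogIndex >= myLastIndex)
	if (s.VotedFor == -1 || s.VotedFor == args.CandidateId) && upToDate {
		s.VotedFor = args.CandidateId
		reply.VoteGranted = true
	}
	return reply
}
```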

AppendEntries RPC Rules

  1. Reply false if term < currentTerm.
  2. Reply false if log doesn’t contain an entry at prevLogIndex whose term matches prevLogTerm.
  3. If an existing entry conflicts with a new one (same index but different terms), delete the existing entry and all that follow it.
  4. Append any new entries not already in the log.
  5. If leaderCommit > commitIndex, set commitIndex = min(leaderCommit, index of last new entry).
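
A sketch of the follower-side handler implementing these five steps, under the same assumptions as the RequestVote sketch above:

```go
package raft

// handleAppendEntries applies the five receiver rules above.
func handleAppendEntries(s *State, args AppendEntriesArgs) AppendEntriesReply {
	reply := AppendEntriesReply{Term: s.CurrentTerm}

	// 1. Reject a leader whose term is stale.
	if args.Term < s.CurrentTerm {
		return reply
	}
	if args.Term > s.CurrentTerm {
		s.CurrentTerm = args.Term
		s.VotedFor = -1
	}
	reply.Term = s.CurrentTerm

	// 2. Reject if our log has no entry at prevLogIndex with a matching term.
	if args.PrevLogIndex >= len(s.Log) || s.Log[args.PrevLogIndex].Term != args.PrevLogTerm {
		return reply
	}

	// 3 & 4. Delete conflicting entries and append any entries not already present.
	for i, entry := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(s.Log) && s.Log[idx].Term != entry.Term {
			s.Log = s.Log[:idx] // conflict: truncate this entry and everything after it
		}
		if idx >= len(s.Log) {
			s.Log = append(s.Log, entry)
		}
	}

	// 5. Advance commitIndex, bounded by the index of the last new entry.
	if args.LeaderCommit > s.CommitIndex {
		newCommit := args.LeaderCommit
		if lastNew := args.PrevLogIndex + len(args.Entries); lastNew < newCommit {
			newCommit = lastNew
		}
		if newCommit > s.CommitIndex {
			s.CommitIndex = newCommit
		}
	}

	reply.Success = true
	return reply
}
```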

Node States

Distributed systems consist of multiple nodes. In Raft, each node assumes one of three states:

  • Leader: Handles all client requests.
  • Follower: Passive state; only responds to RPCs from leaders and candidates.
  • Candidate: A temporary state used to elect a new leader.

Under normal operation, there is exactly one leader, and all other nodes are followers. The candidate state only appears during leader elections.

Terms

In Raft, time is divided into terms. Each node maintains a term number.

The term number is a monotonically increasing integer that acts as a logical clock in Raft.

Every term begins with an election, and Raft guarantees that at most one leader is elected per term.

Leader Election

In Raft, nodes communicate via RPCs for internal requests.

Normally, the leader periodically sends heartbeats to followers to assert its authority.

If a follower receives no communication for a period (election timeout), it assumes there is no leader and initiates an election.

The follower increments its term, transitions to candidate, and sends RequestVote RPCs to other nodes.

A candidate remains in this state until one of three events occurs:

  • It wins the election.
  • Another server establishes itself as leader.
  • A period of time goes by with no winner (election timeout).

During an election, a candidate may itself receive RPCs from other servers: it rejects RequestVote requests from rival candidates in the same term (it has already voted for itself), and it steps down to follower if an AppendEntries RPC arrives from a leader whose term is at least as large as its own.

If multiple candidates split the vote, no one reaches a majority and the election times out, after which a new election begins. Raft makes repeated split votes unlikely by having each server choose its election timeout at random from a fixed interval.
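
A minimal sketch of such a randomized timeout (the 150–300 ms window is the range suggested in the Raft paper; any window comfortably larger than a typical round trip works):

```go
package raft

import (
	"math/rand"
	"time"
)

// randomElectionTimeout draws a timeout uniformly from a fixed window so that
// two candidates rarely time out at the same moment twice in a row.
func randomElectionTimeout() time.Duration {
	const min, max = 150 * time.Millisecond, 300 * time.Millisecond
	return min + time.Duration(rand.Int63n(int64(max-min)))
}
```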

Log Replication

In Raft, only the leader handles client requests.

When a client sends a request that modifies the state machine, the leader appends the corresponding command to its local log.

The leader then sends AppendEntries RPCs in parallel to all followers to replicate the entry.

Once a majority of nodes have successfully replicated the entry, the leader marks it as committed.

The leader then applies the command to its state machine and returns the response to the client.

When do followers apply these commands? A follower learns which entries are committed from the leaderCommit value carried in later AppendEntries RPCs (including heartbeats); once its commitIndex covers an entry, the follower applies that entry to its local state machine.

Raft’s log replication guarantees two properties:

  • If two entries in different logs have the same index and term, then they store the same command.
  • If two entries in different logs have the same index and term, then the logs are identical in all preceding entries.

The first property holds because a leader creates at most one entry at a given index within its term, and entries never change their position in the leader’s log.

The second property stems from the AppendEntries RPC: the leader’s request includes both the new log entries and the index and term of the preceding entry. If a follower cannot match the preceding entry, it rejects the new entries.

To ensure log consistency, the leader forces followers’ logs to match its own.

The leader maintains a nextIndex for each follower, representing the next log index to send. Upon election, the leader initializes nextIndex to the index of its last log entry plus one. When a follower rejects an AppendEntries request due to inconsistency, the leader decrements nextIndex and retries until consistency is found. The follower then deletes conflicting entries and appends the leader’s entries.
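
A sketch of that leader-side repair loop for a single follower, reusing the earlier types; sendAppendEntries is an assumed RPC stub, and term changes and failures are not handled:

```go
package raft

// replicateTo brings one follower's log into agreement with the leader's by
// walking nextIndex backwards on each rejection, then sending the suffix of the
// leader's log starting at nextIndex.
func replicateTo(s *State, peer int,
	sendAppendEntries func(peer int, args AppendEntriesArgs) AppendEntriesReply) {
	for {
		next := s.NextIndex[peer]
		args := AppendEntriesArgs{
			Term:         s.CurrentTerm,
			PrevLogIndex: next - 1,
			PrevLogTerm:  s.Log[next-1].Term,
			Entries:      s.Log[next:],
			LeaderCommit: s.CommitIndex,
		}
		reply := sendAppendEntries(peer, args)
		if reply.Success {
			s.MatchIndex[peer] = next + len(args.Entries) - 1
			s.NextIndex[peer] = s.MatchIndex[peer] + 1
			return
		}
		// Log inconsistency: back up one entry and retry. nextIndex never falls
		// below 1, because the sentinel entry at index 0 always matches.
		s.NextIndex[peer] = next - 1
	}
}
```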

Safety

Election Restriction

During elections, Raft enforces a rule: only a candidate that contains all committed log entries can become a leader. This prevents a newly elected leader from overwriting and losing committed entries.

Raft enforces this through its voting mechanism. Specifically, a voter will reject a RequestVote request if the candidate’s log is not at least as up-to-date as its own.

Commitment Restriction

When a new leader is elected, it may find entries from previous terms that look commit-eligible: they are already present on a majority of nodes, yet their term is lower than the current one.

Raft stipulates that a leader cannot commit entries from previous terms directly. It can only commit them indirectly by committing a new entry in its current term.

Directly committing old-term entries doesn’t guarantee they won’t be overwritten by a future leader, which could break consensus. By committing a new entry in its own term, the leader effectively “locks in” the old-term entries: by the Log Matching Property, once the new entry is committed, every entry preceding it in the leader’s log (including the old-term ones) is committed as well, ensuring state machine safety.

Leader Completeness

Under Raft, a leader never loses committed entries. This property is known as Leader Completeness.

We can prove this by contradiction:

Suppose a leader is missing a committed entry. This means the node became leader without containing that committed entry.

Becoming leader requires votes from a majority of the cluster, while an entry is committed only after a majority of nodes have stored it.

These two majorities must overlap in at least one server, so at least one of the voters holds the committed entry.

By the election restriction, that voter grants its vote only if the candidate’s log is at least as up-to-date as its own, so a candidate missing the committed entry could not have obtained that vote.

Thus, such a node could never win the election, contradicting the assumption.

Server Failure and Recovery

When a leader crashes, the cluster elects a new leader, which then uses the log consistency check described above to overwrite any conflicting entries in followers’ logs.

If a follower or candidate crashes, Raft handles it through infinite RPC retries. Raft requires all RPCs to be idempotent, meaning that even if an RPC takes effect but the response is lost due to a crash, no harm is done.

Cluster Membership Changes

Changing cluster membership naively can cause split-brain scenarios: during the switch, the old and new configurations could each form an independent majority and elect two leaders for the same term.

Raft mandates a two-phase approach called Joint Consensus to safely transition configurations:

  1. When a change to a new configuration (C-new) is requested, the leader appends a joint configuration entry (C-old,new) to its log and replicates it.
  2. A server uses a configuration as soon as the corresponding entry appears in its log, whether or not it is committed.
  3. While the joint configuration is in effect, elections and log commitment each require separate majorities from both the old and the new configuration.
  4. Once C-old,new is committed, the leader creates and replicates a log entry for C-new.
  5. Once C-new is committed, the transition is complete; the old configuration is irrelevant and servers not in C-new can be shut down.

When new nodes join, they typically have no log and need time to catch up. Raft allows them to join as non-voting members, often called Learners. Once they have caught up, they can be promoted to voting members through a configuration change.

If the old configuration’s leader is not part of the new configuration, it will step down to follower after committing the new configuration.

After a configuration update, nodes removed from the new configuration will stop receiving heartbeats. They might trigger elections and vie for leadership, causing cluster instability.

To prevent this, servers ignore RequestVote RPCs while they believe a current leader exists: specifically, a server disregards a vote request if it has received a heartbeat from the leader within the minimum election timeout.

Log Compaction

Raft relies heavily on logs, but continuously growing logs can burden the cluster.

Raft uses snapshots for log compaction. Each server independently creates snapshots containing:

  • The index of the last log entry included in the snapshot (the last included index)
  • The term of that entry (the last included term)
  • The state machine’s state as of that entry
  • The latest cluster configuration

Once a snapshot is created, all log entries prior to the snapshot’s last included index can be safely deleted.

Although snapshots are created independently, the leader occasionally needs to send them to lagging or newly joined followers whose logs might have been compacted away. The leader sends the snapshot to help them catch up.

This is done via the InstallSnapshot RPC, which is chunked (since the state machine data can be large):

Arguments:

  • term: Leader’s term
  • leaderId: Leader’s ID, for client redirection
  • lastIncludedIndex: Index of the last log entry covered by the snapshot
  • lastIncludedTerm: Term of that entry
  • offset: Byte offset where this chunk begins within the snapshot file
  • data[]: Raw bytes of the snapshot chunk, starting at offset
  • done: True if this is the last chunk

Response:

  • term: Receiver’s current term, for the leader to update itself

Servers receiving this RPC must follow these steps:

  1. Reply immediately if term < currentTerm.
  2. If offset is 0, create a new snapshot file.
  3. Write data at the specified offset.
  4. If done is false, wait for the next chunk.
  5. Save the snapshot file, discarding any existing snapshot with a smaller index.
  6. If the existing log contains an entry with the same index and term as lastIncludedIndex, retain subsequent log entries and reply.
  7. Otherwise, discard the entire log.
  8. Reset state machine and cluster configuration using the snapshot.
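
A sketch of the decision in steps 6 and 7, reusing the LogEntry type from the state sketch; chunk assembly (steps 2–4) and restoring the state machine (step 8) are assumed to happen elsewhere, and real implementations also track a log offset so absolute indices keep working after compaction:

```go
package raft

// truncateForSnapshot returns the log entries a server should retain after
// installing a snapshot described by lastIncludedIndex/lastIncludedTerm.
// log[i].Index is assumed to hold each entry's absolute index.
func truncateForSnapshot(log []LogEntry, lastIncludedIndex, lastIncludedTerm int) []LogEntry {
	for i, e := range log {
		if e.Index == lastIncludedIndex && e.Term == lastIncludedTerm {
			// Step 6: the snapshot is a prefix of our log, so retain everything after it.
			return append([]LogEntry(nil), log[i+1:]...)
		}
	}
	// Step 7: the snapshot supersedes the whole log, so discard it entirely.
	return nil
}
```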

You might wonder why all nodes create snapshots independently rather than just the leader sending them:

  • Performance: Sending snapshots consumes bandwidth; having followers create their own is more efficient.
  • Complexity: Combining snapshot transfers with normal log replication would make the synchronization mechanism overly complex and slow.

Raft also suggests performance optimizations:

  • Snapshots shouldn’t be too frequent or too rare. A simple trigger is when the log reaches a certain size threshold.
  • Snapshot creation takes time, but we should minimize blocking new commands. Techniques like Copy-on-Write (CoW) can speed this up.

Client Interaction

When a client communicates with a Raft cluster, it typically sends requests to a random node. If that node isn’t the leader, it rejects the request and returns the leader’s address. If the leader crashes, the request times out, triggering a client retry.

A key goal of Raft is linearizable semantics. If a leader crashes after committing a log entry but before responding, the client will retry, and without extra care the command could be executed twice. To prevent this, each command carries a unique serial number assigned by the client; the state machine records the latest serial number it has executed for each client and, when it sees a duplicate, returns the cached result instead of executing the command again.
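
A sketch of that duplicate detection on the state machine side (the type and field names are illustrative):

```go
package raft

// dedupEntry remembers, per client, the highest sequence number executed and
// the cached result for that command.
type dedupEntry struct {
	Seq    int64
	Result []byte
}

// StateMachine wraps command execution with per-client duplicate detection.
type StateMachine struct {
	sessions map[string]dedupEntry // keyed by client ID
}

func NewStateMachine() *StateMachine {
	return &StateMachine{sessions: make(map[string]dedupEntry)}
}

// Apply executes a command unless it has already been executed for this client,
// in which case the cached result is returned instead.
func (sm *StateMachine) Apply(clientID string, seq int64, cmd []byte,
	exec func([]byte) []byte) []byte {
	if last, ok := sm.sessions[clientID]; ok && seq <= last.Seq {
		return last.Result // duplicate retry: do not re-execute
	}
	result := exec(cmd)
	sm.sessions[clientID] = dedupEntry{Seq: seq, Result: result}
	return result
}
```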

For read-only requests, simply returning data can be risky: the leader might have been deposed but is unaware, potentially returning stale data.

Raft addresses this with two measures:

  1. Upon election, the new leader commits a blank no-op entry for its term; this ensures that entries from previous terms are committed and tells the leader exactly where its commit index stands.
  2. Before responding to read-only requests, the leader must communicate with a majority of nodes to confirm its authority hasn’t been revoked.
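
A sketch of the second measure, the leadership check before serving a read; heartbeat is an assumed RPC stub that returns true when the peer still accepts this leader’s term:

```go
package raft

// confirmLeadership exchanges one round of heartbeats and reports whether a
// majority of the cluster (counting the leader itself) still acknowledges us,
// so a deposed leader does not serve stale reads.
func confirmLeadership(peers []int, heartbeat func(peer int) bool) bool {
	acks := 1 // the leader counts itself
	for _, p := range peers {
		if heartbeat(p) {
			acks++
		}
	}
	return 2*acks > len(peers)+1 // strict majority of the full cluster
}
```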

Engineering Optimizations

Pre-Vote

If a network partition isolates some nodes from the leader, they will repeatedly time out and start elections that can never succeed, since a minority partition cannot gather a majority of votes; each failed attempt only drives their term higher.

When these nodes reconnect to the main cluster, they can cause a “false alarm” election: their term is higher than the cluster’s, which forces the current leader to step down, yet their logs are outdated, making them unfit to lead. Their reconnection merely bumps up the cluster’s term and triggers another full election.

This instability hurts performance and availability, leading to the Pre-Vote mechanism:

  1. Before starting a real election, a node initiates a pre-vote.
  2. A node receiving a pre-vote RPC rejects it if it has received a heartbeat from a leader within its election timeout (that is, it still believes a valid leader exists); otherwise it answers as it would a real RequestVote, but without changing any state.
  3. Only if the pre-vote succeeds does the node proceed to a real election.

Essentially, pre-vote acts as a safety check: a node increments its term and starts a real election only after confirming it could actually win, which avoids unnecessary elections and needless term inflation.
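
A sketch of the receiver-side pre-vote check, reusing the types from the earlier sketches; lastHeartbeat and minElectionTimeout are illustrative parameters, and note that nothing here touches currentTerm or votedFor:

```go
package raft

import "time"

// handlePreVote answers a pre-vote request. It refuses while a live leader has
// been heard from recently, and otherwise applies the same term and log
// up-to-date checks as RequestVote, without modifying any persistent state.
func handlePreVote(s *State, args RequestVoteArgs,
	lastHeartbeat time.Time, minElectionTimeout time.Duration) bool {
	if time.Since(lastHeartbeat) < minElectionTimeout {
		return false // a current leader is alive: discourage the election
	}
	if args.Term < s.CurrentTerm {
		return false
	}
	// Would this candidate's log earn our vote?
	myLast := len(s.Log) - 1
	if args.LastLogTerm != s.Log[myLast].Term {
		return args.LastLogTerm > s.Log[myLast].Term
	}
	return args.LastLogIndex >= myLast
}
```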

Leader Transfer

Sometimes the current leader needs maintenance, or another node is better suited to lead, requiring a leader transfer.

The process is designed as follows:

  1. The current leader stops accepting new client requests.
  2. It replicates logs to bring the target node’s log up to date.
  3. The current leader sends a TimeoutNow RPC to the target node, immediately triggering an election.

Upon receiving TimeoutNow, the target node is highly likely to win the election and become the new leader. It then informs the former leader of the new term, prompting it to step down.
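
A sketch of the transfer sequence, reusing the State type from earlier; replicateTo and sendTimeoutNow are assumed stubs (the former is expected to advance matchIndex for the target):

```go
package raft

// transferLeadership follows the three steps above: new client requests are
// assumed to have been paused elsewhere, the target is caught up, and then it is
// told to time out immediately and start an election.
func transferLeadership(s *State, target int,
	replicateTo func(s *State, peer int),
	sendTimeoutNow func(peer int)) {
	// Catch the target's log up to the leader's last entry.
	for s.MatchIndex[target] < len(s.Log)-1 {
		replicateTo(s, target)
	}
	// Tell the target to trigger an election right away.
	sendTimeoutNow(target)
}
```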

Single-Server Membership Change

Joint Consensus is robust but complex to implement.

A simpler alternative is Single-Server Membership Change, which works as follows:

  1. The leader appends the new configuration (C-new) to its log.
  2. Nodes apply the new configuration immediately upon receipt, without waiting for commit.
  3. Once a majority of nodes in C-new have replicated the entry, the leader marks it as committed.
  4. The change is complete, and the cluster can proceed with subsequent changes.

By changing only one server at a time, any majority of the old configuration and any majority of the new configuration are guaranteed to share at least one server (a pigeonhole argument), so two leaders can never be elected for the same term and split-brain elections are prevented.

However, for correctness, a newly elected leader must first commit a no-op entry in its own term before starting any new configuration change. In terms of flexibility, Joint Consensus remains more general, since it can replace several servers in a single transition. Overall, while single-server changes are simpler, they are best suited for infrequent, one-off additions or removals.

Algorithm Limitations

Single-Leader Bottleneck

The leader handles all client requests, bearing significantly more load than followers. This can become a performance bottleneck.

Byzantine Fault Tolerance

Raft assumes crash-stop failures. It cannot handle Byzantine faults, where nodes behave maliciously or send conflicting information.

Multi-Leader Limitations

Raft is strictly designed for a single leader. Deployments requiring multiple concurrent leaders must use variants like Multi-Raft or other consensus protocols.
