Lab2 Document Translation#
Since my English is not very good, I used translation software for translation and then manually proofread for understanding.
Original address: https://pdos.csail.mit.edu/6.824/labs/lab-raft.html
Introduction#
This is the first in a series of experiments where we will build a fault-tolerant key/value storage system. In this experiment, we will implement Raft (a replicated state machine protocol). In the next experiment, we will build a key/value service on top of Raft. Then, you will perform sharding (hashing based on keys to determine which replica to route to) on a state machine with multiple replicas to improve performance.
Replication is achieved by storing complete copies of its state (i.e., data) on multiple replicated servers to provide fault tolerance. Even if some servers experience failures (crashes or network disconnections and jitter), replication allows them to continue operating. The challenge is that failures may lead to replicas having different data.
Raft organizes client requests into a sequence called the log and ensures that all replica servers see the same log. Each replica executes the client requests in log order, applying them to its local copy of the service state. Since all live replicas see the same log contents, they execute the same requests in the same order and thus arrive at the same service state. If a server fails but later recovers, Raft takes care of bringing its log up to date. Raft keeps operating as long as at least a majority of the servers are alive and can communicate with each other; if no majority exists, Raft makes no progress, but it picks up where it left off as soon as a majority can communicate again.
In this lab, you will implement Raft as a Go object type with associated methods, meant to be used as a module in a larger service. A set of Raft instances maintain replicated logs by talking to each other over RPC. Your Raft interface will support an indefinite sequence of numbered commands, also called log entries. The entries are numbered with index numbers. The log entry with a given index will eventually be committed; at that point, your Raft should send the log entry to the larger service for it to execute.
You should follow the design in the extended Raft paper, especially Figure 2. You will implement most of the content in the paper, including saving persistent state and reading state after node failure recovery. You will not implement changes to cluster membership (Section 6).
You may find this guide useful, as well as this advice on concurrency locking and structure. If you need a broader perspective, you can look at Paxos, Chubby, Paxos Made Live, Spanner, Zookeeper, Harp, Viewstamped Replication, and Bolosky et al.
Please remember that the most challenging part of this lab may not be implementing your solution but debugging it. To help address this challenge, you may need to spend time thinking about how to make your implementation easier to debug. You can refer to the guidance page and this blog post about effective print statements here.
We also provide a Raft interaction diagram that can help clarify how your Raft code interacts with the layers above it.
The code#
Implement Raft by adding code to `raft/raft.go`. In that file you will find skeleton code, plus examples of how to send and receive RPCs. Your implementation must support the following interface, which the tester and (eventually) your key/value service will use. You can find more details in the comments in `raft.go`.
{{< block type="tip">}}
Raft instances can only communicate via RPC and must use the `labrpc` package (e.g., they cannot use files or shared variables).
{{< /block >}}
// create a new Raft server instance:
rf := Make(peers, me, persister, applyCh)
// start agreement on a new log entry:
rf.Start(command interface{}) (index, term, isleader)
// ask a Raft for its current term, and whether it thinks it is leader
rf.GetState() (term, isLeader)
// each time a new entry is committed to the log, each Raft peer
// should send an ApplyMsg to the service (or tester).
type ApplyMsg
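To make the `ApplyMsg` side of this interface concrete, here is a minimal sketch of what the message type and its use of `applyCh` look like. The field names (`CommandValid`, `Command`, `CommandIndex`) follow the 6.824 skeleton, but treat this as an illustrative assumption rather than the authoritative definition in your copy of `raft.go`:

```go
package main

import "fmt"

// ApplyMsg is what each Raft peer sends on applyCh when a log entry
// is committed. CommandValid distinguishes real commands from other
// message kinds (e.g., snapshots in later labs).
type ApplyMsg struct {
	CommandValid bool
	Command      interface{}
	CommandIndex int
}

func main() {
	// The service (or tester) reads committed entries from this channel.
	applyCh := make(chan ApplyMsg, 1)
	applyCh <- ApplyMsg{CommandValid: true, Command: "x=1", CommandIndex: 1}
	msg := <-applyCh
	fmt.Println(msg.CommandIndex, msg.Command)
}
```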
Make#
Make(peers []*labrpc.ClientEnd, me int, persister *Persister, applyCh chan ApplyMsg)
Used to create a Raft server.
- The ports of all the Raft servers (including this one) are stored in `peers[]`; this server's port is `peers[me]`.
- All servers' `peers[]` arrays have the same order.
- `persister` is a place for this server to save its `persistent state`; it also initially holds the most recently saved state, if any.
- `applyCh` is the channel on which Raft sends `ApplyMsg` messages to the service or tester.
- `Make()` must return quickly, so it should start `goroutines` for any long-running work.
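The requirement that `Make()` return quickly can be sketched as follows. The `Raft` fields, the stand-in types, and the `ticker` shape here are assumptions of this sketch, not the lab's required layout:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Minimal stand-ins for the lab's types so the sketch compiles on its own.
type ClientEnd struct{}
type Persister struct{}
type ApplyMsg struct{}

type Raft struct {
	mu        sync.Mutex
	peers     []*ClientEnd
	persister *Persister
	me        int
	dead      bool
	applyCh   chan ApplyMsg
}

// Make returns quickly: all long-running work (election timer,
// heartbeats) happens in background goroutines.
func Make(peers []*ClientEnd, me int, persister *Persister, applyCh chan ApplyMsg) *Raft {
	rf := &Raft{peers: peers, me: me, persister: persister, applyCh: applyCh}
	// restore persistent state from persister here (lab 2C)
	go rf.ticker() // drive elections and heartbeats in the background
	return rf
}

func (rf *Raft) ticker() {
	for {
		rf.mu.Lock()
		if rf.dead {
			rf.mu.Unlock()
			return
		}
		rf.mu.Unlock()
		// check election timeout / send heartbeats here
		time.Sleep(10 * time.Millisecond)
	}
}

func main() {
	rf := Make(nil, 0, nil, make(chan ApplyMsg))
	fmt.Println(rf.me)
}
```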
Start#
Start(command interface{}) (int, int, bool)
The Raft service (e.g., a k/v server) calls `Start()` to begin agreement on the next command to be appended to the Raft log. If this server is not the leader, it returns false. Otherwise it starts the agreement and returns immediately, without waiting for the log appends to complete. There is no guarantee that the command will ever be committed to the Raft log, since the leader may fail or lose an election. Even if the Raft instance has been killed, this function should return gracefully.
The first return value is the index the command will appear at if it is ever committed. The second return value is the current term. The third return value is true if this server believes it is the leader.
Each newly committed log entry should be sent as an `ApplyMsg` on the `applyCh` passed to `Make()`.
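The `Start()` contract above can be sketched against a toy `Raft` struct. The fields and the placeholder entry at index 0 (so real entries start at index 1, as in the paper) are assumptions of this sketch:

```go
package main

import (
	"fmt"
	"sync"
)

type LogEntry struct {
	Term    int
	Command interface{}
}

type Raft struct {
	mu          sync.Mutex
	currentTerm int
	isLeader    bool
	log         []LogEntry // index 0 is a placeholder; real entries start at 1
}

// Start appends a command if this peer is the leader and returns
// immediately; it does not wait for the entry to commit.
func (rf *Raft) Start(command interface{}) (int, int, bool) {
	rf.mu.Lock()
	defer rf.mu.Unlock()
	if !rf.isLeader {
		return -1, rf.currentTerm, false
	}
	rf.log = append(rf.log, LogEntry{Term: rf.currentTerm, Command: command})
	index := len(rf.log) - 1 // the index this command will occupy if committed
	return index, rf.currentTerm, true
}

func main() {
	rf := &Raft{currentTerm: 1, isLeader: true, log: []LogEntry{{}}}
	index, term, ok := rf.Start("put x 1")
	fmt.Println(index, term, ok)
}
```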
2A - Leader Election#
Implement Raft leader election and heartbeats (`AppendEntries RPCs` with no log entries).
The goal for 2A is for a single leader to be elected, for that leader to remain the leader if there are no failures, and for a new leader to take over if the old leader fails or if packets to/from the old leader are lost.
{{< block type="tip">}}
Does this failure mean the leader has crashed? That is, as long as it does not have a runtime failure or network issues, it will always be the leader?
{{< /block >}}
Key points:
- Test your implementation by running `go test -run 2A`.
- Follow the paper's Figure 2, focusing on sending and receiving `RequestVote RPCs`, the rules for servers related to elections, and the state related to leader election.
- Add the leader election state from Figure 2 to the `Raft` struct, and define a struct to hold information about each log entry.
- Implement `RequestVote()` so that Raft peers can vote for each other. Add the `RequestVoteArgs` and `RequestVoteReply` structs. Modify `Make()` to create a background goroutine that watches for heartbeats and, if it has not heard from a peer for a while, starts a leader election by sending `RequestVote RPCs`. This way a peer will learn who the leader is if one exists, or become the leader itself.
- Implement heartbeats: define an `AppendEntries RPC` struct (even though you may not need all its fields yet) and have the leader send it periodically. Write an `AppendEntries RPC` handler that resets the election timeout, so that other servers do not step forward as leader when one has already been elected.
- Make sure the election timeouts of different peers do not always fire at the same time; otherwise every peer will vote only for itself and no one will become leader.
- During testing, the leader must not send heartbeat RPCs more than 10 times per second.
- The tester requires Raft to elect a new leader within 5 seconds of the old leader failing (as long as a majority of nodes can still communicate). Remember, though, that a `split vote` (which can happen if packets are lost or candidates unluckily choose the same random backoff) may force the election to run for multiple rounds, so you must pick election timeouts short enough that even multiple rounds finish within 5 seconds.
- The paper's Leader election section mentions election timeouts in the range 150 to 300 milliseconds. That range only makes sense if the leader sends heartbeats considerably more often than once per 150 milliseconds. Because the tester limits heartbeats to 10 per second, you must use a larger election timeout than the paper's, but not so large that elections fail to complete within 5 seconds.
- If your code fails a test, reread the paper's Figure 2; the complete logic of leader election is spread across several parts of the figure.
- Don't forget to implement `GetState()`.
- The tester calls `rf.Kill()` when it shuts down a Raft instance. You can check whether `Kill()` has been called with `rf.killed()`. You may want to do this in all loops, to avoid dead Raft instances printing confusing messages.
- Go RPC sends only struct fields whose names start with capital letters. Sub-structures must also have capitalized field names.
2B - log#
Implement the code for the leader and follower to add new logs.
Key points:
- Your first goal is to pass `TestBasicAgree2B`. Start by implementing `Start()`, then write the code to send and receive new log entries via `AppendEntries RPCs`, following Figure 2. Send each newly committed entry on each peer's `applyCh`.
- You will need to implement the `election restriction` (section 5.4.1 in the paper).
- One way to fail to reach agreement in the early Lab 2B tests is to hold repeated elections even though the leader is alive. Look for bugs in the election timer, or for not sending heartbeats immediately after winning an election.
- Your code may have loops that repeatedly check certain events. Do not let these loops run continuously without pausing, as that will slow your implementation down enough to fail tests. Use Go's condition variables, or insert a `time.Sleep(10 * time.Millisecond)` in each loop iteration.
- Read `config.go` and `test_test.go` to understand what is being tested.
Code Implementation Thoughts#
Roughly record the implementation thoughts and encountered issues.
2A#
- Add corresponding properties according to the state section in Figure 2.
- Add a `RaftRole` field to represent the current role: leader, candidate, or follower.
- Implement the `ticker` function:
  - Two `time.Timer`s can serve as the triggers.
  - Check whether a heartbeat has failed to arrive for too long, and if so initiate an election.
  - Check whether this server, as leader, needs to send heartbeats to maintain its authority.
- Pay attention to this rule throughout: if an RPC request or response contains a term T greater than `currentTerm`, set `currentTerm` to T and convert to follower.
- Implement starting an election:
  - Follow the `candidate` steps in the `Rules for Servers` of Figure 2, one by one.
  - Send `RequestVote RPCs` in parallel; on receiving a majority of the votes, convert to leader (and probably broadcast heartbeats right away).
- Implement `RequestVote`:
  - Follow the `RequestVote RPC` receiver steps in Figure 2, one by one.
- Implement `AppendEntries`:
  - For 2A it is enough to convert to follower and refresh the election timeout.
- All nodes refresh their election timeout after receiving a response (judge the specific cases).
- After becoming leader, start the heartbeat timer (its ticks refresh the followers' election timeouts).
Encountered issues:
- The first test produced `warning: Term changed even though there were no failures`.
  This happens when a leader exists but other followers still start elections. My guess was that the `election timeout` and the `AppendEntries` interval were too close together; this can only be debugged by adjusting the two timeout values. My election timeout is 200-350ms, and the heartbeat is fixed at 125ms.
  I had also added a `if new role == old role then return` check when switching roles, which prevented the `election timeout` from being refreshed.
- When sending or handling any request or response, apply the term rule above (a term greater than `currentTerm` updates `currentTerm` and converts the server to follower) immediately.
Raft Paper Translation#
Select some important segments for translation.
Original address: https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf
Abstract#
Raft is a consensus algorithm for managing a replicated log. It provides the same functionality and performance as the Paxos algorithm, but its structure is different from Paxos, which makes Raft easier to understand and a better foundation for building practical systems. To enhance understandability, Raft separates the key elements of consensus, such as leader election, log replication, and safety, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Raft also includes a mechanism for dynamically changing cluster membership, which leverages overlapping majorities to guarantee safety.
Introduction#
The Raft algorithm is similar to existing consensus algorithms in some respects (mainly Oki and Liskov's Viewstamped Replication), but it has the following new features:
- Strong leadership: Raft uses a stronger form of leadership than other consensus algorithms. For example, logs are only sent from the leader to other servers. This simplifies the management of replicated logs and makes Raft easier to understand.
- Leader election: Raft uses randomized timers to elect leaders. This adds only a small amount of mechanism to the heartbeats already required by any consensus algorithm, while making conflicts simpler and faster to resolve.
- Membership changes: Raft uses a new joint consensus algorithm to handle changes in cluster membership. During the adjustment process, the majority of two different configurations will overlap, allowing the cluster to continue operating normally during membership changes.
Replicated State Machine#
A `Replicated State Machine` is used in distributed systems to solve a variety of fault tolerance problems. For example, large single-leader cluster systems like GFS, HDFS, and RAMCloud typically use a separate replicated state machine to manage leader election and to store configuration information that must survive leader crashes. Examples of replicated state machines include Chubby and ZooKeeper.
The consensus algorithm is proposed in the context of replicated state machines, where state machines on a set of servers compute the same replica of the same state and can continue operating even in the event of some server failures.
As shown in Figure 1, the replicated state machine is implemented through a replicated log. Each server maintains a log containing a series of commands, and its state machine executes them in order. Each log contains the same commands in the same order, so each state machine executes the same sequence of commands. Because the state machine is deterministic, each state machine computes the same state and produces the same output in the same order.
The task of the consensus algorithm is to ensure the consistency of the replicated log. The consensus module on the server receives commands from clients and adds them to the log, communicating with the consensus modules on other servers to ensure that each of their logs eventually matches (the same requests in the same order), even if some services fail. Once commands are correctly replicated, each service's state machine processes them in the order of the log and returns the results to the client.
Thus, these services appear to form a single, highly reliable state machine.
In practical consensus algorithms, the following properties are typically present:
- Ensure safety (never return an incorrect result) under all non-Byzantine conditions, including network delays, partitions, and packet loss, duplication, and out-of-order delivery.
- Remain fully functional (available) as long as a majority of the servers are running and can communicate with each other and with clients. Thus a cluster of 5 servers can tolerate the failure of any 2 servers. Servers are assumed to fail by stopping; they may later recover their state from `stable storage` and rejoin the cluster.
- Do not depend on timing to ensure log consistency: faulty clocks and extreme message delays can, at worst, cause availability problems.
- Generally, the completion of a command depends on a majority of the services responding to a single round of remote calls; a few slower services do not affect the overall performance of the system.
The Raft consensus algorithm#
Raft is the algorithm used to manage the replicated logs described above. Figure 2 is a summary of the streamlined version of the algorithm, and Figure 3 lists the key properties of the algorithm, which will be discussed one by one.
Raft achieves consensus by first electing a leader and giving it full responsibility for managing log replication. The leader accepts log entries from clients, replicates them to the other servers, and tells them when it is safe to apply the entries to their state machines. Having a leader simplifies the management of the replicated log. For example, the leader can decide where to place new entries without consulting other servers, and data flows simply from the leader to the other servers. The leader may fail or become disconnected from the other servers, in which case a new leader is elected.
Based on the leader approach, Raft addresses consistency issues through three subprocesses:
- Leader election: A new leader needs to be elected when the leader fails (crashes).
- Log replication: The leader receives logs from clients and replicates them to other machines in the cluster, forcing other servers to be consistent with itself.
- Safety: Raft's safety is captured by the safety properties in Figure 3: if any server has applied a particular log entry to its state machine, then no other server may apply a different command for the same log index.
Figure 2#
State:
- Persistent state on all servers (updated on stable storage before responding to RPCs):
  - `currentTerm`: the latest term this server has seen (initialized to 0 on first boot, increases monotonically).
  - `votedFor`: the candidate ID that received this server's vote in the current term (null if none).
  - `log[]`: log entries; each entry holds a command for the state machine and the term when the leader received the entry (first index is 1).
- Volatile state on all servers:
  - `commitIndex`: index of the highest log entry known to be committed (initialized to 0, increases monotonically).
  - `lastApplied`: index of the highest log entry applied to the state machine (initialized to 0, increases monotonically).
- Volatile state on the leader (reinitialized after each election):
  - `nextIndex[]`: for each server, the index of the next log entry to send to that server (initialized to the leader's last log index + 1).
  - `matchIndex[]`: for each server, the index of the highest log entry known to be replicated on that server (initialized to 0, increases monotonically).
AppendEntries RPC
Invoked by the leader to replicate log entries; also used as a heartbeat.
- Arguments:
  - `term`: the leader's term.
  - `leaderId`: so followers can redirect clients (clients sometimes contact a follower).
  - `prevLogIndex`: index of the log entry immediately preceding the new entries.
  - `prevLogTerm`: term of the entry at `prevLogIndex`.
  - `entries[]`: log entries to store (empty for heartbeats; may send more than one at a time for efficiency).
  - `leaderCommit`: the leader's `commitIndex`.
- Results:
  - `term`: `currentTerm`, so the leader can update its own term.
  - `success`: true if the follower had an entry matching `prevLogIndex` and `prevLogTerm`.
- Receiver implementation:
  1. Reply false if `term < currentTerm` (the sender is a stale leader).
  2. Reply false if the log contains no entry at `prevLogIndex` whose term matches `prevLogTerm`.
  3. If an existing entry conflicts with a new one (same index, different terms), delete the existing entry and all entries that follow it.
  4. Append any new entries not already in the log.
  5. If `leaderCommit > commitIndex`, set `commitIndex = min(leaderCommit, index of last new entry)`.
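The five receiver steps above translate fairly directly into code. Here is a sketch of the handler body (locking, persistence, and election timer resets are omitted; the struct layouts and the placeholder entry at index 0 are assumptions of this sketch):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command interface{}
}

type AppendEntriesArgs struct {
	Term         int
	LeaderId     int
	PrevLogIndex int
	PrevLogTerm  int
	Entries      []LogEntry
	LeaderCommit int
}

type AppendEntriesReply struct {
	Term    int
	Success bool
}

type Raft struct {
	currentTerm int
	log         []LogEntry // log[0] is a placeholder; real entries start at 1
	commitIndex int
}

// AppendEntries follows the receiver steps in Figure 2.
func (rf *Raft) AppendEntries(args *AppendEntriesArgs, reply *AppendEntriesReply) {
	reply.Term = rf.currentTerm
	// 1. Reply false if the leader's term is stale.
	if args.Term < rf.currentTerm {
		reply.Success = false
		return
	}
	// 2. Reply false if our log has no entry matching prevLogIndex/prevLogTerm.
	if args.PrevLogIndex >= len(rf.log) || rf.log[args.PrevLogIndex].Term != args.PrevLogTerm {
		reply.Success = false
		return
	}
	// 3+4. Truncate on conflict, then append entries we don't already have.
	for i, e := range args.Entries {
		idx := args.PrevLogIndex + 1 + i
		if idx < len(rf.log) && rf.log[idx].Term != e.Term {
			rf.log = rf.log[:idx] // delete the conflicting entry and all after it
		}
		if idx >= len(rf.log) {
			rf.log = append(rf.log, e)
		}
	}
	// 5. Advance commitIndex to min(leaderCommit, index of last new entry).
	if args.LeaderCommit > rf.commitIndex {
		last := args.PrevLogIndex + len(args.Entries)
		if args.LeaderCommit < last {
			rf.commitIndex = args.LeaderCommit
		} else {
			rf.commitIndex = last
		}
	}
	reply.Success = true
}

func main() {
	rf := &Raft{currentTerm: 1, log: []LogEntry{{}}}
	args := &AppendEntriesArgs{Term: 1, PrevLogIndex: 0, PrevLogTerm: 0,
		Entries: []LogEntry{{Term: 1, Command: "a"}}, LeaderCommit: 1}
	var reply AppendEntriesReply
	rf.AppendEntries(args, &reply)
	fmt.Println(reply.Success, len(rf.log)-1, rf.commitIndex)
}
```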
RequestVote RPC
Invoked by candidates to gather votes.
- Arguments:
  - `term`: the candidate's term.
  - `candidateId`: the candidate requesting the vote.
  - `lastLogIndex`: index of the candidate's last log entry.
  - `lastLogTerm`: term of the candidate's last log entry.
- Results:
  - `term`: `currentTerm`, so the candidate can update its own term.
  - `voteGranted`: true means the candidate received this server's vote.
- Receiver implementation:
  1. Reply false if `term < currentTerm`.
  2. If `votedFor` is null or equals `candidateId`, and the candidate's log is at least as up-to-date as the receiver's log, grant the vote.
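A sketch of the receiver steps above, including the "at least as up-to-date" comparison from section 5.4.1 of the paper (later last term wins; equal last terms compare log length). Using `-1` for "no vote yet" and keeping `lastLogIndex`/`lastLogTerm` as plain fields are assumptions of this sketch:

```go
package main

import "fmt"

type RequestVoteArgs struct {
	Term         int
	CandidateId  int
	LastLogIndex int
	LastLogTerm  int
}

type RequestVoteReply struct {
	Term        int
	VoteGranted bool
}

type Raft struct {
	currentTerm  int
	votedFor     int // -1 means "no vote yet this term"
	lastLogIndex int
	lastLogTerm  int
}

// upToDate reports whether a candidate's log is at least as
// up-to-date as ours (section 5.4.1).
func (rf *Raft) upToDate(lastIndex, lastTerm int) bool {
	if lastTerm != rf.lastLogTerm {
		return lastTerm > rf.lastLogTerm
	}
	return lastIndex >= rf.lastLogIndex
}

func (rf *Raft) RequestVote(args *RequestVoteArgs, reply *RequestVoteReply) {
	reply.Term = rf.currentTerm
	// 1. Reject stale candidates.
	if args.Term < rf.currentTerm {
		reply.VoteGranted = false
		return
	}
	// If the candidate's term is newer, adopt it and clear our vote.
	if args.Term > rf.currentTerm {
		rf.currentTerm = args.Term
		rf.votedFor = -1
	}
	// 2. Grant the vote if we haven't voted for someone else and the
	// candidate's log is at least as up-to-date as ours.
	if (rf.votedFor == -1 || rf.votedFor == args.CandidateId) &&
		rf.upToDate(args.LastLogIndex, args.LastLogTerm) {
		rf.votedFor = args.CandidateId
		reply.VoteGranted = true
		return
	}
	reply.VoteGranted = false
}

func main() {
	rf := &Raft{currentTerm: 1, votedFor: -1}
	var reply RequestVoteReply
	rf.RequestVote(&RequestVoteArgs{Term: 2, CandidateId: 3}, &reply)
	fmt.Println(reply.VoteGranted, rf.votedFor)
}
```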
Rules for Servers
- All servers:
  - If `commitIndex > lastApplied`: increment `lastApplied` and apply `log[lastApplied]` to the state machine.
  - If any RPC request or response contains a term `T > currentTerm`: set `currentTerm = T` and convert to follower.
- Followers:
  - Never initiate requests; only respond to RPCs from candidates and leaders.
  - If the election timeout elapses without receiving an `AppendEntries RPC` from the current leader (same term) or granting a vote to a candidate: convert to candidate.
- Candidates:
  - On conversion to candidate, start an election:
    - Increment `currentTerm`.
    - Vote for itself.
    - Reset the election timer.
    - Send `RequestVote RPCs` to all other servers.
  - If it receives votes from a majority of servers: become leader.
  - If it receives an `AppendEntries RPC` from a new leader: convert to follower.
  - If the election timeout elapses: start a new election.
- Leaders:
  - Upon election: send an initial empty `AppendEntries RPC` (heartbeat) to each server, and repeat during idle periods to prevent election timeouts.
  - If a command is received from a client: append the entry to the local log, then respond to the client once the entry has been applied to the state machine.
  - If the last log index >= `nextIndex` for a follower: send an `AppendEntries RPC` with the log entries starting at `nextIndex`.
    - If successful: update that follower's `nextIndex` and `matchIndex`.
    - If it fails because of log inconsistency: decrement `nextIndex` and retry.
  - If there exists an N such that `N > commitIndex`, a majority of `matchIndex[i] >= N`, and `log[N].term == currentTerm`: set `commitIndex = N`.
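The last leader rule (finding N) can be computed by sorting a copy of `matchIndex`: after sorting, the element at the majority position is the highest index a majority of servers has reached. The function shape and the `logTerms` slice are assumptions of this sketch:

```go
package main

import (
	"fmt"
	"sort"
)

// advanceCommitIndex returns the largest N > commitIndex that is
// replicated on a majority of servers and has log[N].Term ==
// currentTerm, or commitIndex unchanged if no such N exists.
func advanceCommitIndex(matchIndex []int, logTerms []int, commitIndex, currentTerm int) int {
	sorted := append([]int(nil), matchIndex...)
	sort.Ints(sorted)
	// After sorting ascending, this element has a majority of entries
	// at or above it: it is the highest majority-replicated index.
	n := sorted[(len(sorted)-1)/2]
	if n > commitIndex && logTerms[n] == currentTerm {
		return n
	}
	return commitIndex
}

func main() {
	// 5 peers; matchIndex includes the leader's own last index.
	matchIndex := []int{3, 3, 2, 1, 3}
	logTerms := []int{0, 1, 1, 2} // term of the entry at each log index
	fmt.Println(advanceCommitIndex(matchIndex, logTerms, 1, 2))
}
```

Note the `log[N].term == currentTerm` check: a leader only directly commits entries from its own term, which is exactly the restriction section 5.4.2 of the paper motivates.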
Figure 3#
- Election Safety: Only one leader can be elected in a given term.
- Leader Append-Only: The leader never overwrites or deletes logs, only adds.
- Log Matching: If two logs contain an entry with the same index and term, then the logs are identical in all entries up through that index.
- Leader Completeness: If a log is committed in a given term, it must appear in the logs of leaders in later terms.
- State Machine Safety: If a server has applied a log entry at a given index to its state machine, then no other server will apply a different log at the same index.
Raft basics#
A Raft cluster contains several servers; 5 is a typical number, which allows the system to tolerate 2 failed servers. At any given time, each server is in one of three states: leader, follower, or candidate. In normal operation there is exactly one leader, and all of the other servers are followers.
- Followers are passive: they do not make requests themselves but only respond to requests from leaders and candidates.
- Leaders handle all client requests (if a client contacts a follower, the follower redirects to the leader).
- Candidates are used to elect a new leader (see Figure 4).
Figure 4#
As shown in Figure 5: Raft divides time into arbitrary-length terms. The numbering of terms is consecutive integers. Each term begins with an election, where one or more candidates attempt to become the leader. If a candidate wins the election, it will serve as the leader for the remainder of the term.
In some special cases, the result of the election is a split vote. In this case, the term will end without a leader. A new term (accompanied by a new round of elections) will begin shortly thereafter. Raft guarantees that there will be at most one leader in a given term.
Figure 5#
Different servers may observe transitions between terms at different times; in some cases, a server may not observe the entire election or even the entire term. Terms act as logical clocks in Raft, allowing servers to detect stale information, such as outdated leaders.
Each server stores a current term number, which monotonically increases over time. Whenever a server communicates, it exchanges its current term; if a server's current term is smaller than another server's, it updates its current term to the larger value.
If a candidate or leader discovers that its term has become outdated, it will immediately revert to follower status.
If a request received by a server has an outdated term number, it will reject that request.
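The term rules in the last three paragraphs reduce to one small piece of logic that runs on every RPC. This helper, its name, and its signature are illustrative assumptions, not part of the lab skeleton:

```go
package main

import "fmt"

type role int

const (
	follower role = iota
	candidate
	leader
)

// observeTerm applies the term-exchange rules: a message carrying a
// newer term updates currentTerm and demotes the server to follower,
// and a request carrying a stale term should be rejected.
func observeTerm(currentTerm *int, r *role, rpcTerm int) (rejected bool) {
	if rpcTerm > *currentTerm {
		*currentTerm = rpcTerm
		*r = follower
	}
	return rpcTerm < *currentTerm
}

func main() {
	term, r := 3, leader
	fmt.Println(observeTerm(&term, &r, 5), term, r == follower) // newer term demotes the leader
	fmt.Println(observeTerm(&term, &r, 2))                      // stale request is rejected
}
```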
Raft servers communicate using RPCs, and the basic consensus algorithm requires only two types. `RequestVote RPCs` are initiated by candidates during elections; `AppendEntries RPCs` are initiated by leaders to replicate log entries and to provide a heartbeat mechanism. A third RPC is added in later sections for transferring snapshots between servers. Servers retry RPCs if they do not receive a response in a timely manner, and they issue RPCs in parallel for best performance.
Leader election#
Raft uses a heartbeat mechanism to trigger leader elections. When servers start up, they begin as followers. A server remains a follower as long as it receives valid RPCs from a leader or candidate. The leader periodically sends heartbeats (`AppendEntries RPCs` that carry no log entries) to all followers to maintain its authority. If a follower receives no communication over a period of time (the election timeout), it assumes there is no viable leader and begins an election to choose a new one.
To start an election, a follower increments its current term and transitions to candidate status. It then votes for itself and issues `RequestVote RPCs` in parallel to every other server in the cluster. A candidate remains in this state until one of the following three conditions occurs:
- It wins the election.
- Another server establishes its leadership.
- A period of time passes without anyone winning.
Next, we will discuss these outcomes:
It wins the election.
If a candidate receives votes from a majority of servers in the cluster during the same term, it wins the election. Each server can vote for at most one candidate in a given term, following a first-come, first-served principle.
The principle of majority rule ensures that at most one candidate can win the election in a given term (the Election Safety property in Figure 3). Once a candidate wins the election, it becomes the leader. It then sends heartbeat messages (empty `AppendEntries RPCs` with no log entries) to all other servers to establish its authority and prevent new elections from occurring.
Another server establishes its leadership.
While waiting for votes, a candidate may receive an `AppendEntries RPC` from another server claiming to be the leader. If that leader's term (carried in the RPC) is at least as large as the candidate's current term, the candidate acknowledges the leader as legitimate and reverts to follower status. If the term in the RPC is smaller than the candidate's current term, the candidate rejects the RPC and remains in the candidate state.
A period of time passes without anyone winning.
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, the votes may be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and issuing a new round of `RequestVote RPCs`. However, without additional measures, `split votes` could repeat indefinitely.
Raft uses random election timeouts to ensure that split votes occur infrequently and can be resolved quickly. To prevent split votes from occurring from the outset, election timeouts are randomly chosen from a fixed time interval (e.g., 150-300ms). This way, each server has a different election timeout, so in most cases, only one server will time out.
A server wins the election and sends heartbeats before any other server times out. The same mechanism is used to handle `split votes`: each candidate restarts its randomized election timeout at the start of an election and waits for that timeout to elapse before starting the next election; this reduces the likelihood of another split vote in the new election.
Elections serve as an example of how understandability guided our design trade-offs. Initially, we planned to use a ranking system: each candidate would be assigned a unique rank to choose between competing candidates. If a candidate discovered another candidate with a higher rank, it would revert to follower status, allowing the higher-ranked candidate to win the next election more easily. We found that this approach introduced subtle issues in availability (if a higher-ranked service failed, a lower-ranked service might time out and become a candidate again, but if it did so too early, it could reset the progress of electing a leader). We made multiple adjustments to the algorithm, but each adjustment introduced new corner cases. Ultimately, we concluded that a random retry approach was clearer and easier to understand.
Log replication#
Once a leader has been elected, it begins serving client requests. Each client request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues `AppendEntries RPCs` in parallel to every other server to replicate the entry. Once the entry has been safely replicated (as discussed below), the leader applies the entry to its state machine and returns the result of that execution to the client. If a follower crashes or runs slowly, or if network packets are lost, the leader retries `AppendEntries RPCs` indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
The organization of logs is shown in Figure 6. Each log entry stores a state machine command and the term number when the leader received that entry. The term number in the log entry is used to detect inconsistencies between logs and ensure some properties in Figure 3. Each log entry also has an integer index to identify its position in the log.
The leader decides when it is safe to apply a log entry to the state machine; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all available state machines. A log entry is committed once the leader has replicated it on a majority of servers (for example, entry 7 in Figure 6). This also commits all preceding entries in the leader's log, including entries created by previous leaders. The leader tracks the highest index it knows to be committed and includes that index in future `AppendEntries` RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine (in log order).
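The apply step ("in log order, up to the commit index") can be sketched as follows, assuming 1-based log indices as in the paper; the function name and signature are illustrative, not from the lab skeleton:

```go
package main

import "fmt"

// LogEntry pairs a command with the term in which the leader
// received it (field names are illustrative).
type LogEntry struct {
	Term    int
	Command string
}

// applyCommitted applies every entry in (lastApplied, commitIndex]
// to the state machine in log order, returning the commands applied
// and the updated lastApplied.
func applyCommitted(log []LogEntry, lastApplied, commitIndex int) ([]string, int) {
	var applied []string
	for lastApplied < commitIndex {
		lastApplied++
		applied = append(applied, log[lastApplied-1].Command)
	}
	return applied, lastApplied
}

func main() {
	log := []LogEntry{{1, "x=1"}, {1, "y=2"}, {2, "z=3"}}
	// This server has applied entry 1 and has learned commitIndex=3.
	applied, last := applyCommitted(log, 1, 3)
	fmt.Println(applied, last) // [y=2 z=3] 3
}
```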
The Raft log mechanism maintains a high level of consistency between the logs on different servers. This not only simplifies the system's behavior and makes it more predictable, but is also an important part of ensuring safety. Raft maintains the following properties, which together constitute the Log Matching property in Figure 3.

If two entries in different logs have the same index and term:
- they store the same command, and
- the logs are identical in all preceding entries.
The first property follows from the fact that a leader creates at most one entry with a given log index in a given term, and log entries never change their position in the log. The second property is guaranteed by a simple consistency check performed by `AppendEntries`. When sending an `AppendEntries` RPC, the leader includes the index and term of the entry in its log that immediately precedes the new entries. If the follower does not find an entry in its log with the same index and term, it rejects the new entries. The consistency check acts as an induction step: the initial empty state of the logs satisfies the Log Matching property, and the consistency check preserves the Log Matching property whenever logs are extended. As a result, whenever `AppendEntries` returns successfully, the leader knows that the follower's log is identical to its own up through the new entries.
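The follower-side check can be sketched in a few lines, assuming 1-based indices where `prevLogIndex` of 0 means the new entries start at the beginning of the log (names are illustrative):

```go
package main

import "fmt"

type LogEntry struct {
	Term    int
	Command string
}

// consistent reports whether the follower's log contains an entry at
// prevLogIndex whose term matches prevLogTerm, i.e. whether the
// AppendEntries consistency check would pass.
func consistent(log []LogEntry, prevLogIndex, prevLogTerm int) bool {
	if prevLogIndex == 0 {
		return true // entries start at the beginning of the log
	}
	if prevLogIndex > len(log) {
		return false // follower is missing entries entirely
	}
	return log[prevLogIndex-1].Term == prevLogTerm
}

func main() {
	follower := []LogEntry{{1, "a"}, {1, "b"}, {2, "c"}}
	fmt.Println(consistent(follower, 3, 2)) // true: entry 3 has term 2
	fmt.Println(consistent(follower, 3, 3)) // false: term mismatch
	fmt.Println(consistent(follower, 5, 2)) // false: log too short
}
```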
During normal operation, the logs of the leader and followers stay consistent, so the `AppendEntries` consistency check never fails. However, a leader crash can leave the logs inconsistent (the old leader may not have fully replicated all of the entries in its log). These inconsistencies can compound over a series of leader and follower crashes. Figure 7 illustrates the ways in which a follower's log may differ from the new leader's log:
- A follower may miss entries from the leader.
- A follower may have additional entries that the leader does not have.
- Or both.
Missing and extra entries in the log may span multiple terms.
In Raft, the leader handles inconsistencies by forcing the logs of followers to replicate its own logs. This means that conflicting entries in the follower's log will be overwritten by entries in the leader's log. The next section will show that this is safe, provided an additional constraint is added.
To bring a follower's log into consistency with its own, the leader must find the latest log entry on which the two logs agree, delete any entries in the follower's log after that point, and send the follower all of the leader's entries after that point. All of these actions happen in response to the consistency check performed by `AppendEntries` RPCs. The leader maintains a `nextIndex` for each follower, which is the index of the next log entry the leader will send to that follower. When a leader is first elected, it initializes all `nextIndex` values to the index of its last log entry + 1 (11 in Figure 7). If a follower's log is inconsistent with the leader's, the `AppendEntries` consistency check will fail in the next `AppendEntries` RPC. After the follower rejects, the leader decrements `nextIndex` and retries the `AppendEntries` RPC. Eventually `nextIndex` reaches a point where the leader and follower logs match. When this happens, `AppendEntries` will succeed, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any). Once `AppendEntries` succeeds, the follower's log is consistent with the leader's, and it remains that way for the rest of the term.
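The back-up-and-retry loop can be collapsed into one function for illustration. This is a sketch under simplifying assumptions: the leader can see the follower's log directly, whereas in the real protocol each probe is a separate `AppendEntries` RPC; all names are hypothetical:

```go
package main

import "fmt"

type LogEntry struct{ Term int }

// syncFollower probes nextIndex backwards from the end of the
// leader's log until the consistency check would pass, then the
// follower deletes its conflicting tail and appends the leader's
// entries from that point on.
func syncFollower(leader, follower []LogEntry) []LogEntry {
	next := len(leader) + 1 // initialized to last leader index + 1
	for next > 1 {
		prev := next - 1
		if prev <= len(follower) && follower[prev-1].Term == leader[prev-1].Term {
			break // logs agree up through index prev
		}
		next-- // follower would reject: decrement and retry
	}
	// Overwrite the conflicting tail with the leader's entries.
	return append(append([]LogEntry{}, follower[:next-1]...), leader[next-1:]...)
}

func main() {
	leader := []LogEntry{{1}, {1}, {2}, {3}}
	follower := []LogEntry{{1}, {1}, {2}, {2}, {2}}
	for _, e := range syncFollower(leader, follower) {
		fmt.Print(e.Term, " ")
	}
	fmt.Println() // 1 1 2 3
}
```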
If necessary, the protocol can be optimized to reduce the number of rejected `AppendEntries` RPCs. For example, when rejecting an `AppendEntries` request, the follower can include the term of the conflicting entry along with the first index it stores for that term. With this information, the leader can decrement `nextIndex` to skip all of the conflicting entries in that term; one `AppendEntries` RPC is then required per term with conflicting entries, rather than one RPC per log entry. In practice, we suspect this optimization may not be necessary, since failures are rare and it is unlikely that there will be many inconsistent entries.
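The follower-side half of that optimization might look like this. The names `conflictTerm` and `firstIndex` are illustrative; the paper does not fix a wire format, and the lab leaves the reply fields up to you:

```go
package main

import "fmt"

type LogEntry struct{ Term int }

// conflictHint computes what the follower could report alongside a
// rejection: the term of its conflicting entry and the first index
// it stores for that term, letting the leader skip the whole run of
// conflicting entries with a single adjustment to nextIndex.
func conflictHint(log []LogEntry, prevLogIndex int) (conflictTerm, firstIndex int) {
	if prevLogIndex > len(log) {
		return 0, len(log) + 1 // log too short: back up to our last entry
	}
	conflictTerm = log[prevLogIndex-1].Term
	firstIndex = prevLogIndex
	for firstIndex > 1 && log[firstIndex-2].Term == conflictTerm {
		firstIndex-- // walk back to the first entry of this term
	}
	return conflictTerm, firstIndex
}

func main() {
	follower := []LogEntry{{1}, {2}, {2}, {2}}
	term, first := conflictHint(follower, 4)
	fmt.Println(term, first) // 2 2: the leader can jump nextIndex to 2
}
```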
Through this mechanism, a leader does not need to take any special action to restore log consistency when it comes to power. It simply begins normal operation, and the logs converge automatically in response to failures of the `AppendEntries` consistency check. A leader never overwrites or deletes entries in its own log (the Leader Append-Only property in Figure 3).
This log replication mechanism exhibits the desirable consensus properties:
- As long as a majority of servers are running, Raft can accept, replicate, and apply new log entries.
- New entries can be replicated to a majority of the cluster in a single round of RPCs.
- A single slow follower does not affect performance.
Safety#
In the previous sections, we introduced how Raft elects leaders and replicates logs. However, these mechanisms alone are not sufficient to ensure that each state machine executes the same commands in the same order. For example, a follower may be unavailable while the leader commits several log entries; that follower could then be elected leader and overwrite those entries, with the result that different state machines might execute different command sequences.
In this section, we refine Raft by adding a restriction on which servers may be elected leader. The restriction ensures that the leader for any given term contains all of the entries committed in previous terms (the Leader Completeness Property in Figure 3). Given the election restriction, we then make the rules for committing more precise. Finally, we present a brief proof of the Leader Completeness Property and explain how it leads to correct behavior of the replicated state machine.
Election restriction#
In any leader-based consensus algorithm, the leader must eventually store all committed log entries. In some consensus algorithms, such as Viewstamped Replication, a node can be elected leader even if it does not initially contain all committed entries. These algorithms include additional mechanisms to identify the missing entries and transmit them to the new leader, either during the election or shortly afterwards. Unfortunately, this results in considerable additional mechanism and complexity.
Raft uses a simpler approach that guarantees that a new leader has all previously committed log entries during elections without needing to transfer these log entries to the leader. This means that log entries are transferred unidirectionally, only from the leader to followers, and the leader never overwrites entries that already exist in its local log.
Raft uses the voting process to prevent a candidate from winning an election unless its log contains all previously committed entries. A candidate must contact a majority of the cluster in order to be elected, which means every committed entry must be present on at least one of the servers in that majority. If the candidate's log is at least as up-to-date as every other log in that majority (the definition of "up-to-date" is given below), then it must contain all committed entries.
The restriction is implemented in the `RequestVote` RPC: the RPC includes information about the candidate's log, and the voter denies its vote if its own log is more up-to-date than the candidate's.
Raft determines which of two logs is more up-to-date by comparing the index and term of the last entry in each log. If the last entries have different terms, the log with the larger term is more up-to-date. If the last entries have the same term, the longer log (the one with the larger last index) is more up-to-date.
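The comparison is small enough to write out directly. A minimal sketch, with an illustrative function name; the voter would grant its vote only when this returns true (and the other `RequestVote` conditions hold):

```go
package main

import "fmt"

// atLeastAsUpToDate reports whether the candidate's log is at least
// as up-to-date as the voter's: a higher last-entry term wins, and a
// tie on term falls back to comparing last indices.
func atLeastAsUpToDate(candTerm, candIndex, myTerm, myIndex int) bool {
	if candTerm != myTerm {
		return candTerm > myTerm
	}
	return candIndex >= myIndex
}

func main() {
	fmt.Println(atLeastAsUpToDate(3, 4, 2, 9)) // true: higher last term wins
	fmt.Println(atLeastAsUpToDate(2, 5, 2, 7)) // false: same term, shorter log
	fmt.Println(atLeastAsUpToDate(2, 7, 2, 7)) // true: identical logs suffice
}
```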
Committing entries from previous terms#
As discussed in Log replication, a leader knows that an entry from its current term is committed once that entry is stored on a majority of servers. If a leader crashes before committing an entry, future leaders will attempt to finish replicating it. However, a leader cannot immediately conclude that an entry from a previous term is committed just because it is stored on a majority of servers. Figure 8 illustrates a situation where an old log entry is stored on a majority of servers, yet can still be overwritten by a future leader.
To eliminate problems like the one in Figure 8, Raft never commits entries from previous terms by counting replicas. Only entries from the leader's current term are committed by counting replicas; once an entry from the current term has been committed in this way, all prior entries are committed indirectly because of the Log Matching property. There are situations in which a leader could safely conclude that an older entry is committed (for example, if that entry is stored on every server), but Raft takes a more conservative approach for simplicity.
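The conservative rule can be sketched as the leader-side check for advancing the commit index. `matchIndex` (the highest log index known to be replicated on each follower) follows the paper's naming; the function itself is illustrative:

```go
package main

import "fmt"

type LogEntry struct{ Term int }

// maybeAdvanceCommit advances commitIndex to an index N only if a
// majority of servers store entry N AND entry N is from the leader's
// current term. Earlier-term entries then become committed
// indirectly through the Log Matching property.
func maybeAdvanceCommit(log []LogEntry, matchIndex []int, commitIndex, currentTerm int) int {
	for n := len(log); n > commitIndex; n-- {
		if log[n-1].Term != currentTerm {
			break // never commit a previous-term entry by counting replicas
		}
		count := 1 // the leader stores every entry itself
		for _, m := range matchIndex {
			if m >= n {
				count++
			}
		}
		if count > (len(matchIndex)+1)/2 {
			return n
		}
	}
	return commitIndex
}

func main() {
	log := []LogEntry{{1}, {2}, {2}}
	// Entry 3 (term 2) is on the leader plus one of two followers.
	fmt.Println(maybeAdvanceCommit(log, []int{3, 1}, 0, 2)) // 3
	// Same replication, but the leader is now in term 3: entry 3 is
	// from a previous term, so it must not be committed by counting.
	fmt.Println(maybeAdvanceCommit(log, []int{3, 1}, 0, 3)) // 0
}
```

The second call in `main` is exactly the Figure 8 hazard: a majority stores the entry, yet the leader must wait until an entry from its own term reaches a majority.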
Raft incurs this extra complexity in the commitment rules because log entries retain their original term numbers when a leader replicates entries from previous terms. In other consensus algorithms, if a new leader re-replicates entries from prior terms, it must do so with its new term number. Raft's approach makes it easier to reason about log entries, since they maintain the same term number over time and across logs. In addition, new leaders in Raft send fewer log entries from previous terms than in other algorithms (other algorithms must send redundant log entries to renumber them before they can be committed). However, this may not matter much in practice, since leader changes are infrequent.
Follower and candidate crashes#
If a follower or candidate crashes, subsequent RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server recovers, these requests will succeed. If a server crashes after receiving an RPC but before responding, it will receive the same RPC again after it recovers. Raft's RPCs are idempotent, so retries cause no harm. For example, if a follower receives an `AppendEntries` request containing log entries already present in its log, it simply ignores those entries.
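The idempotence of `AppendEntries` can be made concrete: the follower skips any incoming entry it already stores (same index and term) and only truncates at the first real conflict, so replaying the same RPC leaves its log unchanged. A sketch with illustrative names:

```go
package main

import "fmt"

type LogEntry struct{ Term int }

// appendNew merges new entries into the follower's log starting just
// after prevLogIndex (1-based). Entries that already match are
// skipped; at the first conflict the tail is truncated and replaced.
func appendNew(log []LogEntry, prevLogIndex int, entries []LogEntry) []LogEntry {
	for i, e := range entries {
		idx := prevLogIndex + i + 1 // 1-based index of this entry
		if idx <= len(log) && log[idx-1].Term == e.Term {
			continue // already stored: ignore the duplicate
		}
		// First conflict or new tail: truncate here, append the rest.
		return append(append([]LogEntry{}, log[:idx-1]...), entries[i:]...)
	}
	return log // every entry was already present
}

func main() {
	log := []LogEntry{{1}, {1}}
	rpc := []LogEntry{{1}, {2}} // entries for indices 2 and 3
	log = appendNew(log, 1, rpc)
	log = appendNew(log, 1, rpc) // replayed RPC: no effect
	fmt.Println(len(log))        // 3
}
```

Note that the follower must not blindly truncate at `prevLogIndex`: a delayed, out-of-date RPC that arrives after newer entries were appended would otherwise erase them.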
Timing and availability#
One of Raft's requirements is that safety cannot rely on timing: the entire system must not produce incorrect results due to some events running slightly faster or slower than expected. However, availability (the system can respond to clients in a timely manner) inevitably depends on timing. For example, if message exchanges take longer than the intervals between server failures, candidates will not have enough time to win elections; without a stable leader, Raft cannot function.
Leader elections are the most critical aspect of timing requirements in Raft. Raft can elect and maintain a stable leader as long as the system meets the following timing requirements:
{{< center >}}
broadcastTime ≪ electionTimeout ≪ MTBF (Mean Time Between Failures)
{{< /center >}}
- broadcastTime: The average time taken to send RPCs in parallel from one server to other servers in the cluster and receive responses.
- electionTimeout: The election timeout.
- MTBF: The average time between failures for a server.
broadcastTime should be an order of magnitude smaller than electionTimeout so that the leader can reliably send the heartbeat messages that keep followers from starting elections; given the randomized approach to election timeouts, this inequality also makes split votes unlikely. electionTimeout should be several orders of magnitude smaller than MTBF so that the system makes steady progress. When the leader crashes, the entire system is unavailable for roughly one electionTimeout; we would like this to represent only a small fraction of the system's total running time.
broadcastTime and MTBF are determined by the system, but electionTimeout is chosen by us. Raft's RPCs require the recipient to persist information to stable storage, so broadcastTime is approximately 0.5 milliseconds to 20 milliseconds, depending on the storage technology. Therefore, electionTimeout may need to be between 10 milliseconds and 500 milliseconds. Most servers have an MTBF of several months or longer, easily meeting the timing requirements.
Cluster membership changes#
Links#
- Project address: https://pdos.csail.mit.edu/6.824/labs/lab-raft.html
- Raft official website: https://raft.github.io/
- GFS related materials: https://fzdwx.github.io/posts/2022-10-07-gfs/#links
- Raft paper: https://pdos.csail.mit.edu/6.824/papers/raft-extended.pdf
- Chinese translation of Raft paper: https://github.com/maemual/raft-zh_cn/blob/master/raft-zh_cn.md
- Diagram of Raft interactions: https://pdos.csail.mit.edu/6.824/notes/raft_diagram.pdf
- Students guide to Raft: https://thesquareplanet.com/blog/students-guide-to-raft/
- Raft locking: https://pdos.csail.mit.edu/6.824/labs/raft-locking.txt
- Raft structure: https://pdos.csail.mit.edu/6.824/labs/raft-structure.txt
- Paxos Replicated State Machines as the Basis of a High-Performance Data Store: https://static.usenix.org/event/nsdi11/tech/full_papers/Bolosky.pdf
- TiKV optimization for Raft: https://cn.pingcap.com/blog/optimizing-raft-in-tikv
- https://www.cnblogs.com/niejunlei/p/9719557.html
- https://blog.csdn.net/viskaz/article/details/124232474
- https://www.cnblogs.com/brianleelxt/p/13251540.html