optimization: Leader log sampled handshake #150

@pav-kv

Background: #144


At the moment, a raft node only accepts MsgApp log appends from the latest leader it knows about, i.e. when MsgApp.Term == raft.Term. This restriction could be relaxed, which would reduce message turnaround during leader changes.

The safety requirement is that we don't accept entries that are not in the log of the raft.Term leader. If we can deduce that an entry is in that leader's log by some means other than receiving a MsgApp directly from it (possibly before any such MsgApp arrives), we can always safely accept it.

One way to achieve this:

  • When we vote for a candidate, we know the (term, index) of the last entry of its log. If the candidate wins the election, the new leader will not overwrite entries up to this index, and will append new entries strictly after it.
  • If we receive a MsgApp (from any leader) that contains this entry, we have the guarantee that all entries at indices <= index in this append are contained in the new leader's log. It is safe to accept them.
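
The check described in the two bullets above could look roughly like the sketch below. It is not taken from the raft codebase: `entryID` and `safePrefixLen` are hypothetical names, and the append is simplified to a slice of (term, index) pairs in ascending index order. It assumes the node voted for the candidate in its current term and that the candidate won.

```go
package main

import "fmt"

// entryID identifies a log entry by its term and index (hypothetical type).
type entryID struct {
	term, index uint64
}

// safePrefixLen returns how many leading entries of an append can be accepted,
// given the last entry (voted) of the candidate this node voted for.
// entries is assumed to be a contiguous log fragment in ascending index order.
func safePrefixLen(voted entryID, entries []entryID) int {
	for i, e := range entries {
		if e == voted {
			// The append contains the voted entry, so everything up to and
			// including it is also in the new leader's log.
			return i + 1
		}
	}
	// No overlap established: fall back to the usual MsgApp.Term check.
	return 0
}

func main() {
	voted := entryID{term: 5, index: 14}
	// A late append from the previous leader (term 5).
	late := []entryID{{5, 13}, {5, 14}, {5, 15}}
	fmt.Println(safePrefixLen(voted, late)) // prints 2
}
```

Entries after the voted index (here, index 15) are not accepted, since the new leader may append different entries there.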

A more general way to achieve this is:

  • When a candidate campaigns, it should attach not only the (term, index) of its last entry, but also a sample of K other (term, index) pairs from its log. Specifically, it would be wise to attach the "fork" points of the last K terms.
  • When a node votes for a new leader, it remembers this sample.
  • When a follower handles MsgApp, it can deduce from this sample the overlap between this append message and the leader's log. The overlapping part can be safely accepted regardless of who sent it.
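
A minimal sketch of the sampled variant follows, under several assumptions that are not stated in this issue. A "fork" point is taken to be the last entry of a term in the campaigner's log, and the sample includes the log's last entry. The deduction rule (an entry is in the leader's log if the sample has a fork point with the same term and an index at least as large) is one possible reading that relies on Raft's Log Matching property. All names (`entryID`, `forkPoints`, `inLeaderLog`, `safeOverlapLen`) are hypothetical.

```go
package main

import "fmt"

// entryID identifies a log entry by its term and index (hypothetical type).
type entryID struct {
	term, index uint64
}

// forkPoints returns up to k sample entries from log, newest first: the last
// entry of the log, followed by the last entry of each preceding term.
func forkPoints(log []entryID, k int) []entryID {
	var sample []entryID
	for i := len(log) - 1; i >= 0 && len(sample) < k; i-- {
		if i == len(log)-1 || log[i].term != log[i+1].term {
			sample = append(sample, log[i])
		}
	}
	return sample
}

// inLeaderLog reports whether e can be deduced to be in the leader's log:
// the sample holds an entry of the same term at an index >= e.index.
func inLeaderLog(sample []entryID, e entryID) bool {
	for _, s := range sample {
		if s.term == e.term && e.index <= s.index {
			return true
		}
	}
	return false
}

// safeOverlapLen returns how many leading entries of an append overlap with
// the leader's log according to the sample, regardless of who sent the append.
func safeOverlapLen(sample, entries []entryID) int {
	safe := 0
	for i, e := range entries {
		if inLeaderLog(sample, e) {
			safe = i + 1
		}
	}
	return safe
}

func main() {
	// Campaigner's log: term 3 ends at index 10, term 5 at 14, term 6 at 15.
	log := []entryID{{3, 9}, {3, 10}, {5, 11}, {5, 12}, {5, 13}, {5, 14}, {6, 15}}
	sample := forkPoints(log, 3)
	// A late append from the term-5 leader, which diverges after index 14.
	late := []entryID{{5, 13}, {5, 14}, {5, 15}, {5, 16}}
	fmt.Println(safeOverlapLen(sample, late)) // prints 2
}
```

With K = 3 the sample here is (6, 15), (5, 14), (3, 10), so the follower can accept indices 13 and 14 from the old leader and must reject 15 and 16.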

The practical K would be 2 or 3, because leader changes are typically infrequent. The last 2 or 3 term changes cover a significant section of the log.

This sampling technique is equivalent to the fork point search that the leader does in the StateProbe state to establish the longest common prefix with the follower's log before transitioning it to the optimistic StateReplicate state.

This gives significant benefits:

  • By including a sample of K fork points rather than just the latest one, we increase chances of finding an overlap immediately, and reduce message turnaround.
  • By including a sample in the votes, we avoid the first MsgApp in StateProbe, and will typically be able to transition straight to StateReplicate.
  • As a bonus, the sample can be used to safely accept some MsgApp.Entries (that arrived just slightly late) from a recent leader who is stepping down.

This technique will minimize cluster disruption / slowdown during election, and reduce tail replication/commit latency in some cases.

Labels: enhancement
