Paxos
UCL Course COMP0133: Distributed Systems and Security LEC-07
Centralization and Replication
Single Point of Failure
Centralization causes a single point of failure (SPOF)
such that when the central node is down, the others must wait for it to come back up (availability)
Replication
Replicate data on several servers
such that even when one or more nodes are down, the others can still work with the replicated data
Consistency
All replicas should hold identical copies of the data (even when clients request modifications to the data)
State Machine Replication
Definition of State Machine
A state machine is a behavior model. It consists of a finite number of states.
Based on the current state and a given input, the machine performs state transitions and produces outputs
Any server is essentially a state machine
- Disk, RAM, and CPU registers are the state
- Instructions (executed in response to user requests) cause transitions between states
Replication
Replicate the state machine on multiple hosts such that
if every replica sees the same (deterministic) operations in the same order, all replicas end in the same state
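As a quick illustration of this property, here is a minimal sketch (hypothetical Python, not from the lecture) in which two replicas of a trivial deterministic state machine apply the same operations in the same order and necessarily end in the same state:

```python
# Hypothetical sketch: deterministic state machine replication.

class Counter:
    """A trivial deterministic state machine whose state is one integer."""
    def __init__(self):
        self.state = 0

    def apply(self, op):
        # Deterministic: same state + same input -> same next state.
        if op == "inc":
            self.state += 1
        elif op == "double":
            self.state *= 2

ops = ["inc", "inc", "double"]           # the same operations ...
replica_a, replica_b = Counter(), Counter()
for op in ops:                           # ... applied in the same order ...
    replica_a.apply(op)
    replica_b.apply(op)
assert replica_a.state == replica_b.state == 4   # ... end in the same state
```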
Primary and Backups
One special server is called the Primary and all other servers are called Backups
- clients should send operations to the current primary
- the current primary should
  - choose an order for the operations from clients
  - send these operations to the backups
  - reply to the client
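A minimal sketch of this primary/backup flow (all class and method names here are illustrative assumptions): the primary assigns each client operation a slot in a global order before forwarding it to the backups and replying.

```python
# Hypothetical sketch of primary/backup operation ordering.

class Backup:
    def __init__(self):
        self.log = []
    def accept(self, seq, op):
        self.log.append((seq, op))       # apply ops in the primary's order

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.next_seq = 0                # the order the primary chooses
        self.log = []

    def handle_client_op(self, op):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.log.append((seq, op))
        for b in self.backups:           # forward the ordered op to backups
            b.accept(seq, op)
        return f"op {op!r} committed at slot {seq}"   # reply to the client

backups = [Backup(), Backup()]
primary = Primary(backups)
print(primary.handle_client_op("write x=1"))
assert backups[0].log == backups[1].log == primary.log
```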
Primary Failure
- Choose a new primary (two simultaneous primaries are not allowed)
- The last operation received by the old primary may not be complete
Primary Election
Suppose that each server has a unique ID (an integer)
and the live server with the lowest ID is expected to become the new primary
Strawman Approach
After a primary failure, each live server pings the others to find the new primary
Problem
Two simultaneous primaries can be caused by
- Pings might be lost or delayed (a missing ping is wrongly taken to mean the node is dead)
- Network partition (each isolated partition chooses its own primary)
Majority Consensus Approach
A majority (more than half) of nodes must agree on the primary such that
- Any two majorities must overlap, so even if pings are lost or delayed, two different primaries cannot both win majority agreement: the nodes in the overlap would have to agree twice (see the check below)
- At most one network partition can contain a majority (more than half of the nodes)
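The overlap claim is easy to check exhaustively for a small cluster; a sketch (assuming 5 nodes, as in the walkthrough later):

```python
# Any two majorities of the same node set share at least one node,
# so two disjoint groups can never both claim a majority.
from itertools import combinations

nodes = range(5)
majority = len(nodes) // 2 + 1              # "more than half" = 3 of 5

for a in combinations(nodes, majority):
    for b in combinations(nodes, majority):
        assert set(a) & set(b), "two majorities must overlap"
print("every pair of majorities overlaps")
```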
View Change Algorithm
Based on a sequence of views, where
each view is a two-element tuple {view#, set of participant nodes}
- view#: a monotonically increasing integer (from \( view_i \) to \( view_{i+1} \))
- set of participant nodes: contains the unique IDs of the live servers
The view change algorithm must ensure agreement on a unique successor for each view
The participant set within a view allows all nodes to agree on the primary; to Paxos itself, the view is just an opaque value
Paxos does not guarantee that nodes agree on the same value, but it does guarantee that they never agree on different values
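A sketch of the view as a value (the names `View` and `primary_of` are assumptions): once all nodes agree on this opaque tuple, each one can derive the primary locally as the lowest ID in the participant set.

```python
# A view = {view#, set of participant nodes}, plus primary selection.
from typing import FrozenSet, NamedTuple

class View(NamedTuple):
    number: int                     # monotonically increasing view#
    participants: FrozenSet[int]    # unique IDs of the live servers

def primary_of(view: View) -> int:
    return min(view.participants)   # live server with the lowest ID

v = View(number=1, participants=frozenset({0, 1, 2, 3, 4}))
assert primary_of(v) == 0
```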
Paxos Overview
Paxos is guaranteed to complete only when all nodes agree on the input value (here, the input is the new view)
State
- \( n_a \): greatest \( n \) accepted by the node (init: -1)
- \( v_a \): value received together with \( n_a \) (init: nil)
- \( n_h \): greatest \( n \) seen in any \( Q_1 \) message (init: -1)
- \( done \): leader says agreement has been reached (init: false)
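This state translates directly into code. A minimal sketch (the class name `PaxosState` is an assumption; Python's `None` stands in for nil):

```python
# Per-node Paxos state, with the lecture's initial values.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PaxosState:
    n_a: int = -1              # greatest n accepted by this node
    v_a: Optional[Any] = None  # value received together with n_a (nil)
    n_h: int = -1              # greatest n seen in any Q1 message
    done: bool = False         # leader has announced agreement
```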
Phase 1
A node (maybe more than one) will
- decide to be leader
- pick a proposal number \( n \)
  - \( n \) must be unique (and is ideally higher than any known proposal number)
  - e.g., take the last known proposal number + 1 and append this node's unique ID (see the sketch below)
- send a \( Q_1(n) \) message to all nodes (including itself)
If a node receives \( Q_1(n) \) and \( n > n_h \), it will
- set \( n_h = n \)
- send a reply \( R_1(n_a, v_a) \) message
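A minimal sketch of Phase 1 (helper names are assumptions). Node IDs are single digits here, as in the walkthrough below, so "append the unique ID" amounts to decimal concatenation:

```python
# Hypothetical sketch of Phase 1: proposal numbers and the Q1 handler.
from types import SimpleNamespace

def make_proposal_number(last_known_n: int, node_id: int) -> int:
    counter = max(last_known_n // 10, 0) + 1   # last known counter + 1
    return int(f"{counter}{node_id}")          # append this node's unique ID

def on_q1(state, n):
    """Acceptor's handler for a Q1(n) message."""
    if n > state.n_h:
        state.n_h = n
        return ("R1", state.n_a, state.v_a)    # reply R1(n_a, v_a)
    return None                                # ignore stale proposals

state = SimpleNamespace(n_a=-1, v_a=None, n_h=-1)
n = make_proposal_number(last_known_n=-1, node_id=1)
assert n == 11                                 # counter 1, ID 1 -> "11"
assert on_q1(state, n) == ("R1", -1, None) and state.n_h == 11
```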
Phase 2
If the leader receives \( R_1 \) messages from a majority of nodes (including itself)
- if any \( R_1(n, v) \) contained a value \( v \)
  - set \( v \) = the value sent with the highest \( n \)
- else (all \( R_1(n, v) \) have \( v = nil \))
  - choose a value \( v \): the new view {oldview# + 1, set of participant nodes}
- send a \( Q_2(n, v) \) message to all responders (with its own, highest \( n \))
If a node receives \( Q_2(n, v) \) and \( n \geq n_h \), it will
- set \( n_h = n_a = n \)
- set \( v_a = v \)
- send a reply \( R_2() \) message
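A minimal sketch of Phase 2 (names assumed): the leader adopts the value attached to the highest accepted \( n \) among the \( R_1 \) replies, or proposes the next view if every reply carried nil:

```python
# Hypothetical sketch of Phase 2: the leader's value choice and Q2 handler.
from types import SimpleNamespace

def choose_value(r1_replies, old_view_number, participants):
    """r1_replies: (n_a, v_a) pairs from a majority of nodes."""
    accepted = [(n, v) for (n, v) in r1_replies if v is not None]
    if accepted:                                   # some R1 carried a value:
        return max(accepted, key=lambda nv: nv[0])[1]   # v with highest n
    # all v_a were nil: free to choose, here the next view
    return (old_view_number + 1, frozenset(participants))

def on_q2(state, n, v):
    """Acceptor's handler for a Q2(n, v) message."""
    if n >= state.n_h:
        state.n_h = state.n_a = n
        state.v_a = v
        return ("R2",)                             # reply R2()
    return None

assert choose_value([(-1, None)] * 3, 0, range(5)) == (1, frozenset(range(5)))
state = SimpleNamespace(n_a=-1, v_a=None, n_h=11)
assert on_q2(state, 11, (1, frozenset(range(5)))) == ("R2",)
```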
Phase 3
If the leader receives \( R_2() \) messages from a majority of the protocol participants
- send a \( Q_3() \) message to all participants
If a node receives \( Q_3() \)
- set \( done = true \) (agreement has been reached; the agreed-on value is \( v_a \))
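A minimal sketch of Phase 3 (names assumed): once a majority has replied \( R_2() \), the leader broadcasts \( Q_3() \) and every recipient records agreement:

```python
# Hypothetical sketch of Phase 3: commit once a majority replied R2().
from types import SimpleNamespace

def majority_replied(r2_count: int, total_participants: int) -> bool:
    return r2_count > total_participants // 2   # "more than half"

def on_q3(state):
    state.done = True                           # agreed-on value is state.v_a

state = SimpleNamespace(v_a=(1, frozenset(range(5))), done=False)
if majority_replied(r2_count=3, total_participants=5):
    on_q3(state)
assert state.done
```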
Process of Paxos (One Leader & No Failures)
Initialization
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | -1 | -1 | -1 | -1 | -1 |
\( v_a \) | nil | nil | nil | nil | nil |
\( n_h \) | -1 | -1 | -1 | -1 | -1 |
\( done \) | F | F | F | F | F |
Phase 1
Node 1 decides by itself to become the leader;
it chooses 1 as its proposal counter and appends its unique ID 1, giving proposal number \( n = 11 \),
then it sends \( Q_1(11) \) to all nodes, including itself
Each node receives \( Q_1(11) \) from Node 1,
compares \( n = 11 \) with its \( n_h \) and, since \( 11 > -1 \), sets \( n_h = n = 11 \),
then replies \( R_1(n_a, v_a) \) to Node 1 (currently \( n_a = -1 \) and \( v_a = nil \))
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | -1 | -1 | -1 | -1 | -1 |
\( v_a \) | nil | nil | nil | nil | nil |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | F | F | F | F | F |
Phase 2
Node 1 receives \( R_1(n_a, v_a) \) from a majority of nodes
and finds that every received \( v_a \) is nil, so it chooses a value \( v = \) {oldview# + 1, set of participant nodes},
then sends a \( Q_2(n, v) \) message to all responders (here \( n = 11 \) and \( v = \{1, \{0,\dots,4\}\} \))
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | -1 | -1 | -1 | -1 | -1 |
\( v_a \) | nil | {1, {0,…,4}} | nil | nil | nil |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | F | F | F | F | F |
Each node receives \( Q_2(n, v) \) from Node 1
and finds \( n = n_h = 11 \), so it sets \( n_h = n_a = n \) and \( v_a = v \),
then replies with an \( R_2() \) message to Node 1
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | 11 | 11 | 11 | 11 | 11 |
\( v_a \) | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | F | F | F | F | F |
Phase 3
Node 1 receives \( R_2() \) from a majority of nodes,
then it sends \( Q_3() \) to all nodes, including itself
Each node receives \( Q_3() \) from Node 1 and sets \( done \) to true
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | 11 | 11 | 11 | 11 | 11 |
\( v_a \) | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | T | T | T | T | T |
Finally, all nodes agree on the value (view) \( v = \{1, \{0,\dots,4\}\} \),
so the new primary is the participant with the lowest ID in the set: Node 0
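The full trace above can be replayed end to end in code. This is a self-contained hypothetical sketch (names are assumptions; message passing is reduced to plain function calls):

```python
# Hypothetical replay of the one-leader, no-failure run above.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    node_id: int
    n_a: int = -1
    v_a: Optional[Any] = None
    n_h: int = -1
    done: bool = False

    def on_q1(self, n):
        if n > self.n_h:
            self.n_h = n
            return (self.n_a, self.v_a)          # R1(n_a, v_a)

    def on_q2(self, n, v):
        if n >= self.n_h:
            self.n_h = self.n_a = n
            self.v_a = v
            return True                          # R2()

    def on_q3(self):
        self.done = True

nodes = [Node(i) for i in range(5)]
n = 11                                           # counter 1 + leader ID 1

# Phase 1: Node 1 sends Q1(11) to all nodes, including itself.
r1 = [x.on_q1(n) for x in nodes]
assert all(r == (-1, None) for r in r1)

# Phase 2: every v_a is nil, so the leader picks {oldview#+1, participants}.
v = (1, frozenset(range(5)))
assert all(x.on_q2(n, v) for x in nodes)

# Phase 3: a majority replied R2(), so Q3() commits the value everywhere.
for x in nodes:
    x.on_q3()
assert all(x.done and x.v_a == v for x in nodes)
print("agreed view:", v, "-> new primary is node", min(v[1]))
```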
Other Situations in Paxos
Timeout
All nodes wait a bounded period (timeout) for the messages they expect
Upon timeout, a node declares itself a leader and initiates a new Phase 1 of the algorithm
Reason:
this preserves liveness without sacrificing safety (nodes keep working through the three phases rather than waiting forever for a leader)
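A sketch of this rule (illustrative only): wait a bounded time for the message you expect, and on timeout step forward as a leader and restart Phase 1:

```python
# Hypothetical sketch of the timeout rule.
import queue

def await_or_lead(inbox: "queue.Queue", timeout_s: float):
    try:
        return inbox.get(timeout=timeout_s)   # the expected message arrived
    except queue.Empty:
        return ("BECOME_LEADER",)             # declare leadership, new Phase 1

inbox = queue.Queue()                         # empty: the leader went silent
assert await_or_lead(inbox, timeout_s=0.01) == ("BECOME_LEADER",)
```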
Non-agreement
Would occur if nodes holding different \( v_a \) received \( Q_3() \) (nodes must eventually agree on the same \( v_a \))
Solution:
once a \( Q_3() \) could have been sent, all future \( Q_3() \)s in the same round of Paxos are guaranteed to reach nodes with the same \( v_a \)
More than One Leader
Paxos applies mechanisms so that, with high probability, only one server steps forward as leader
- every node must be willing to become leader, in case of failures
- every node should delay a random period (or its unique ID times some constant) after noticing that the set of pingable nodes has changed (see the sketch below)
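A sketch of that backoff (the constants are arbitrary assumptions): staggering the moment each node steps forward makes a single leader likely, since the first proposer usually finishes before the next delay expires.

```python
# Hypothetical sketch of the leader-election backoff.
import random

def leader_backoff(node_id: int, constant_s: float = 0.25,
                   randomized: bool = True) -> float:
    if randomized:
        return random.uniform(0, 1.0)   # random delay before leading
    return node_id * constant_s         # deterministic: lower IDs go first

assert leader_backoff(3, randomized=False) == 0.75
```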
However, more than one leader is still possible
Such leaders must hold different proposal numbers \( n \), since each appends its own unique ID
Suppose the two different proposal numbers are 10 and 11
Case 1: The proposer of 10 does not receive \( R_2() \) from a majority
- because Phase 2 requires \( n \geq n_h \), no node that has already seen \( Q_1(11) \) (and thus holds \( n_h = 11 \)) will send \( R_2() \) in reply to \( Q_2(10, v_{10}) \), or
- the proposer of 10 might be in a network partition with a minority of the nodes
Either way, the proposer of 10 will never get agreement from the participants
Case 2: The proposer of 10 did receive \( R_2() \) from a majority
The majority that accepted 10 must have seen \( Q_2(10, v_{10}) \) before seeing \( Q_1(11) \) (otherwise this reduces to Case 1)
\( \implies \) the proposer of 11 must receive an \( R_1() \) from at least one node that saw \( Q_2(10, v_{10}) \),
because any two majorities overlap
- a node that saw \( Q_2(10, v_{10}) \) set its \( n_h = n_a = 10 \) and \( v_a = v_{10} \)
- that node replies \( R_1(10, v_{10}) \) to the proposer of 11 because \( 11 > n_h = 10 \)
\( \implies \) the proposer of 11 learns the value \( v_{10} \) from the received \( R_1() \)
\( \implies \) the proposer of 11 will reuse \( v_{10} \) rather than creating its own value in Phase 2,
so the participants still reach agreement, on the value \( v_{10} \)
Node Failure
Case 1: The leader fails before sending any \( Q_2() \)
Some servers will time out and become new leaders;
because the old leader never sent a \( Q_3() \), there is no risk of non-agreement caused by it
It is good (but not required) for a new leader to choose a higher \( n \) for its proposal;
otherwise, the other nodes will ignore the lower \( n \), causing another timeout,
until eventually some new leader that knows the old \( n \) chooses a higher one
Case 2: The leader fails after sending a minority of the \( Q_2() \)s
This case is the same as More than One Leader (timeout, then some servers become new leaders)
Case 3: The leader fails after sending a majority of the \( Q_2() \)s
This case is likewise the same as More than One Leader (timeout, then some servers become new leaders)
Case 4: A node fails after receiving \( Q_2() \) and sending \( R_2() \)
- if this node does not restart (so some node times out in Phase 3), another node becomes the new leader (as in More than One Leader)
- if this node does restart, it must have remembered \( v_a \) and \( n_a \) on disk before sending \( R_2() \) (see the sketch below)
Otherwise, suppose the old leader failed after sending a few \( Q_3() \)s:
the new leader must choose the same value as the old leader,
since the agreed-on value \( v_a \) has already been set at some node by \( Q_3() \)
and was known to the old leader via the \( R_1() \) replies from its majority;
this failed node, having forgotten \( v_a \), may be the only node in the intersection of the old leader's majority and the new leader's majority, so the new leader could never learn the agreed-on value
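A sketch of the required persistence (file path and helper names are assumptions): the accepted state is forced to disk before the \( R_2() \) reply goes out, so a restarted node can recover \( n_a \) and \( v_a \):

```python
# Hypothetical sketch: persist n_a and v_a durably before replying R2().
import json, os

STATE_PATH = "paxos_state.json"       # assumed on-disk location

def accept_q2(n, v, path=STATE_PATH):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"n_a": n, "v_a": v}, f)
        f.flush()
        os.fsync(f.fileno())          # force to disk, not just the page cache
    os.replace(tmp, path)             # atomic rename: old or new, never half
    return ("R2",)                    # only now is it safe to reply

def recover(path=STATE_PATH):
    with open(path) as f:
        return json.load(f)           # restore n_a and v_a after a crash

accept_q2(11, [1, [0, 1, 2, 3, 4]])
assert recover() == {"n_a": 11, "v_a": [1, [0, 1, 2, 3, 4]]}
os.remove(STATE_PATH)                 # tidy up the demo file
```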
Summary
Designed for replicated state machines (available even if some nodes are not reachable)
After each failure, perform a view change using Paxos to agree on which nodes are in the new view, and choose a new primary
How to render the data itself consistent across replicas is not discussed here,
but Paxos can likewise be used to make every replica agree on which client operation comes next in the log, which keeps the data consistent