Paxos
UCL Course COMP0133: Distributed Systems and Security LEC-07
Centralization and Replication
Single Point of Failure
Centralization causes a single point of failure (SPOF)
such that when the central node is down, the others must wait for it to come back up (availability)
Replication
Replicate data on several servers
such that even when one or more nodes are down, the others can still work with the replicated data
Consistency
All replicas should hold identical copies of the data (even when clients request modifications to the data)
State Machine Replication
Definition of State Machine
A state machine is a behavior model. It consists of a finite number of states.
Based on the current state and a given input, the machine performs state transitions and produces outputs
Any server is essentially a state machine
- Disk, RAM, and CPU registers are the state
- Instructions (executed in response to user requests) cause transitions between states
Replication
Replicate the state machine on multiple hosts such that
if every replica sees the same (deterministic) operations in the same order, all replicas end in the same state
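As a quick illustration of this property, here is a minimal sketch (hypothetical Python, not from the lecture) in which two replicas of a trivial deterministic state machine apply the same operations in the same order and necessarily end in the same state:

```python
# Hypothetical sketch: deterministic state machine replication.

class Counter:
    """A trivial deterministic state machine whose state is one integer."""
    def __init__(self):
        self.state = 0

    def apply(self, op):
        # Deterministic: same state + same input -> same next state.
        if op == "inc":
            self.state += 1
        elif op == "double":
            self.state *= 2

ops = ["inc", "inc", "double"]           # the same operations ...
replica_a, replica_b = Counter(), Counter()
for op in ops:                           # ... applied in the same order ...
    replica_a.apply(op)
    replica_b.apply(op)
assert replica_a.state == replica_b.state == 4   # ... end in the same state
```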
Primary and Backups
One special server is called the Primary and all other servers are called Backups
- clients should send operations to the current primary
- the current primary should
  - choose an order for the operations from clients
  - send these operations to the backups
  - reply to the client
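A minimal sketch of this primary/backup flow (all class and method names here are illustrative assumptions): the primary assigns each client operation a slot in a global order before forwarding it to the backups and replying.

```python
# Hypothetical sketch of primary/backup operation ordering.

class Backup:
    def __init__(self):
        self.log = []
    def accept(self, seq, op):
        self.log.append((seq, op))       # apply ops in the primary's order

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.next_seq = 0                # the order the primary chooses
        self.log = []

    def handle_client_op(self, op):
        seq, self.next_seq = self.next_seq, self.next_seq + 1
        self.log.append((seq, op))
        for b in self.backups:           # forward the ordered op to backups
            b.accept(seq, op)
        return f"op {op!r} committed at slot {seq}"   # reply to the client

backups = [Backup(), Backup()]
primary = Primary(backups)
print(primary.handle_client_op("write x=1"))
assert backups[0].log == backups[1].log == primary.log
```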
Primary Failure
- Choose a new primary (two simultaneous primaries are not allowed)
- The last operation received by the old primary may not be complete
Primary Election
Suppose that each server has a unique ID (an integer)
and the live server with the lowest ID is expected to become the new primary
Strawman Approach
After a primary failure, each live server pings the others to find the new primary
Problem
Two simultaneous primaries can be caused by
- Pings might be lost or delayed (a missing ping is wrongly taken to mean the node is dead)
- Network partition (each isolated partition chooses its own primary)
Majority Consensus Approach
A majority (more than half) of nodes must agree on the primary such that
- Any two majorities must overlap, so even if pings are lost or delayed, two different primaries cannot both win majority agreement: the nodes in the overlap would have to agree twice (see the check below)
- At most one network partition can contain a majority (more than half of the nodes)
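The overlap claim is easy to check exhaustively for a small cluster; a sketch (assuming 5 nodes, as in the walkthrough later):

```python
# Any two majorities of the same node set share at least one node,
# so two disjoint groups can never both claim a majority.
from itertools import combinations

nodes = range(5)
majority = len(nodes) // 2 + 1              # "more than half" = 3 of 5

for a in combinations(nodes, majority):
    for b in combinations(nodes, majority):
        assert set(a) & set(b), "two majorities must overlap"
print("every pair of majorities overlaps")
```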
View Change Algorithm
Based on a sequence of views, where
each view is a two-element tuple {view#, set of participant nodes}
- view#: a monotonically increasing integer (from \( view_i \) to \( view_{i+1} \))
- set of participant nodes: contains the unique IDs of the live servers
The view change algorithm must ensure agreement on a unique successor for each view
The participant set within a view allows all nodes to agree on the primary; to Paxos itself, the view is just an opaque value
Paxos does not guarantee that nodes agree on the same value, but it does guarantee that they never agree on different values
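A sketch of the view as a value (the names `View` and `primary_of` are assumptions): once all nodes agree on this opaque tuple, each one can derive the primary locally as the lowest ID in the participant set.

```python
# A view = {view#, set of participant nodes}, plus primary selection.
from typing import FrozenSet, NamedTuple

class View(NamedTuple):
    number: int                     # monotonically increasing view#
    participants: FrozenSet[int]    # unique IDs of the live servers

def primary_of(view: View) -> int:
    return min(view.participants)   # live server with the lowest ID

v = View(number=1, participants=frozenset({0, 1, 2, 3, 4}))
assert primary_of(v) == 0
```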
Paxos Overview
Paxos is guaranteed to complete only when all nodes agree on the input value (here, the input is the new view)
State
- \( n_a \): greatest \( n \) accepted by the node (init: -1)
- \( v_a \): value received together with \( n_a \) (init: nil)
- \( n_h \): greatest \( n \) seen in any \( Q_1 \) message (init: -1)
- \( done \): leader says agreement has been reached (init: false)
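This state translates directly into code. A minimal sketch (the class name `PaxosState` is an assumption; Python's `None` stands in for nil):

```python
# Per-node Paxos state, with the lecture's initial values.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PaxosState:
    n_a: int = -1              # greatest n accepted by this node
    v_a: Optional[Any] = None  # value received together with n_a (nil)
    n_h: int = -1              # greatest n seen in any Q1 message
    done: bool = False         # leader has announced agreement
```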
Phase 1
A node (maybe more than one) will
- decide to be leader
- pick a proposal number \( n \)
  - \( n \) must be unique (and is ideally higher than any known proposal number)
  - e.g., take the last known proposal number + 1 and append this node's unique ID (see the sketch below)
- send a \( Q_1(n) \) message to all nodes (including itself)
If a node receives \( Q_1(n) \) and \( n > n_h \), it will
- set \( n_h = n \)
- send a reply \( R_1(n_a, v_a) \) message
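A minimal sketch of Phase 1 (helper names are assumptions). Node IDs are single digits here, as in the walkthrough below, so "append the unique ID" amounts to decimal concatenation:

```python
# Hypothetical sketch of Phase 1: proposal numbers and the Q1 handler.
from types import SimpleNamespace

def make_proposal_number(last_known_n: int, node_id: int) -> int:
    counter = max(last_known_n // 10, 0) + 1   # last known counter + 1
    return int(f"{counter}{node_id}")          # append this node's unique ID

def on_q1(state, n):
    """Acceptor's handler for a Q1(n) message."""
    if n > state.n_h:
        state.n_h = n
        return ("R1", state.n_a, state.v_a)    # reply R1(n_a, v_a)
    return None                                # ignore stale proposals

state = SimpleNamespace(n_a=-1, v_a=None, n_h=-1)
n = make_proposal_number(last_known_n=-1, node_id=1)
assert n == 11                                 # counter 1, ID 1 -> "11"
assert on_q1(state, n) == ("R1", -1, None) and state.n_h == 11
```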
Phase 2
If the leader receives \( R_1 \) messages from a majority of nodes (including itself)
- if any \( R_1(n, v) \) contained a value \( v \)
  - set \( v \) = the value sent with the highest \( n \)
- else (all \( R_1(n, v) \) have \( v = nil \))
  - choose a value \( v \): the new view {oldview# + 1, set of participant nodes}
- send a \( Q_2(n, v) \) message to all responders (with its own, highest \( n \))
If a node receives \( Q_2(n, v) \) and \( n \geq n_h \), it will
- set \( n_h = n_a = n \)
- set \( v_a = v \)
- send a reply \( R_2() \) message
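A minimal sketch of Phase 2 (names assumed): the leader adopts the value attached to the highest accepted \( n \) among the \( R_1 \) replies, or proposes the next view if every reply carried nil:

```python
# Hypothetical sketch of Phase 2: the leader's value choice and Q2 handler.
from types import SimpleNamespace

def choose_value(r1_replies, old_view_number, participants):
    """r1_replies: (n_a, v_a) pairs from a majority of nodes."""
    accepted = [(n, v) for (n, v) in r1_replies if v is not None]
    if accepted:                                   # some R1 carried a value:
        return max(accepted, key=lambda nv: nv[0])[1]   # v with highest n
    # all v_a were nil: free to choose, here the next view
    return (old_view_number + 1, frozenset(participants))

def on_q2(state, n, v):
    """Acceptor's handler for a Q2(n, v) message."""
    if n >= state.n_h:
        state.n_h = state.n_a = n
        state.v_a = v
        return ("R2",)                             # reply R2()
    return None

assert choose_value([(-1, None)] * 3, 0, range(5)) == (1, frozenset(range(5)))
state = SimpleNamespace(n_a=-1, v_a=None, n_h=11)
assert on_q2(state, 11, (1, frozenset(range(5)))) == ("R2",)
```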
Phase 3
If the leader receives \( R_2() \) messages from a majority of the protocol participants
- send a \( Q_3() \) message to all participants
If a node receives \( Q_3() \)
- set \( done = true \) (agreement has been reached; the agreed-on value is \( v_a \))
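A minimal sketch of Phase 3 (names assumed): once a majority has replied \( R_2() \), the leader broadcasts \( Q_3() \) and every recipient records agreement:

```python
# Hypothetical sketch of Phase 3: commit once a majority replied R2().
from types import SimpleNamespace

def majority_replied(r2_count: int, total_participants: int) -> bool:
    return r2_count > total_participants // 2   # "more than half"

def on_q3(state):
    state.done = True                           # agreed-on value is state.v_a

state = SimpleNamespace(v_a=(1, frozenset(range(5))), done=False)
if majority_replied(r2_count=3, total_participants=5):
    on_q3(state)
assert state.done
```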
Process of Paxos (One Leader & No Failures)
Initialization
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | -1 | -1 | -1 | -1 | -1 |
\( v_a \) | nil | nil | nil | nil | nil |
\( n_h \) | -1 | -1 | -1 | -1 | -1 |
\( done \) | F | F | F | F | F |
Phase 1
Node 1 decides by itself to become the leader;
it chooses 1 as its proposal counter and appends its unique ID 1, giving proposal number \( n = 11 \),
then it sends \( Q_1(11) \) to all nodes, including itself
Each node receives \( Q_1(11) \) from Node 1,
compares \( n = 11 \) with its \( n_h \) and, since \( 11 > -1 \), sets \( n_h = n = 11 \),
then replies \( R_1(n_a, v_a) \) to Node 1 (currently \( n_a = -1 \) and \( v_a = nil \))
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | -1 | -1 | -1 | -1 | -1 |
\( v_a \) | nil | nil | nil | nil | nil |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | F | F | F | F | F |
Phase 2
Node 1 receives \( R_1(n_a, v_a) \) from a majority of nodes
and finds that every received \( v_a \) is nil, so it chooses a value \( v = \) {oldview# + 1, set of participant nodes},
then sends a \( Q_2(n, v) \) message to all responders (here \( n = 11 \) and \( v = \{1, \{0,\dots,4\}\} \))
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | -1 | -1 | -1 | -1 | -1 |
\( v_a \) | nil | {1, {0,…,4}} | nil | nil | nil |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | F | F | F | F | F |
Each node receives \( Q_2(n, v) \) from Node 1
and finds \( n = n_h = 11 \), so it sets \( n_h = n_a = n \) and \( v_a = v \),
then replies with an \( R_2() \) message to Node 1
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | 11 | 11 | 11 | 11 | 11 |
\( v_a \) | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | F | F | F | F | F |
Phase 3
Node 1 receives \( R_2() \) from a majority of nodes,
then it sends \( Q_3() \) to all nodes, including itself
Each node receives \( Q_3() \) from Node 1 and sets \( done \) to true
Nodes | 0 | 1 | 2 | 3 | 4 |
---|---|---|---|---|---|
\( n_a \) | 11 | 11 | 11 | 11 | 11 |
\( v_a \) | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} |
\( n_h \) | 11 | 11 | 11 | 11 | 11 |
\( done \) | T | T | T | T | T |
Finally, all nodes agree on the value (view) \( v = \{1, \{0,\dots,4\}\} \),
so the new primary is the participant with the lowest ID in the set: Node 0
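The full trace above can be replayed end to end in code. This is a self-contained hypothetical sketch (names are assumptions; message passing is reduced to plain function calls):

```python
# Hypothetical replay of the one-leader, no-failure run above.
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class Node:
    node_id: int
    n_a: int = -1
    v_a: Optional[Any] = None
    n_h: int = -1
    done: bool = False

    def on_q1(self, n):
        if n > self.n_h:
            self.n_h = n
            return (self.n_a, self.v_a)          # R1(n_a, v_a)

    def on_q2(self, n, v):
        if n >= self.n_h:
            self.n_h = self.n_a = n
            self.v_a = v
            return True                          # R2()

    def on_q3(self):
        self.done = True

nodes = [Node(i) for i in range(5)]
n = 11                                           # counter 1 + leader ID 1

# Phase 1: Node 1 sends Q1(11) to all nodes, including itself.
r1 = [x.on_q1(n) for x in nodes]
assert all(r == (-1, None) for r in r1)

# Phase 2: every v_a is nil, so the leader picks {oldview#+1, participants}.
v = (1, frozenset(range(5)))
assert all(x.on_q2(n, v) for x in nodes)

# Phase 3: a majority replied R2(), so Q3() commits the value everywhere.
for x in nodes:
    x.on_q3()
assert all(x.done and x.v_a == v for x in nodes)
print("agreed view:", v, "-> new primary is node", min(v[1]))
```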
Other Situations in Paxos
Timeout
All nodes wait a bounded period (timeout) for the messages they expect
Upon timeout, a node declares itself a leader and initiates a new Phase 1 of the algorithm
Reason:
this preserves liveness without sacrificing safety (nodes keep working through the three phases rather than waiting forever for a leader)
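A sketch of this rule (illustrative only): wait a bounded time for the message you expect, and on timeout step forward as a leader and restart Phase 1:

```python
# Hypothetical sketch of the timeout rule.
import queue

def await_or_lead(inbox: "queue.Queue", timeout_s: float):
    try:
        return inbox.get(timeout=timeout_s)   # the expected message arrived
    except queue.Empty:
        return ("BECOME_LEADER",)             # declare leadership, new Phase 1

inbox = queue.Queue()                         # empty: the leader went silent
assert await_or_lead(inbox, timeout_s=0.01) == ("BECOME_LEADER",)
```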
Non-agreement
Would occur if nodes holding different \( v_a \) received \( Q_3() \) (nodes must eventually agree on the same \( v_a \))
Solution:
once a \( Q_3() \) could have been sent, all future \( Q_3() \)s in the same round of Paxos are guaranteed to reach nodes with the same \( v_a \)
More than One Leader
Paxos applies mechanisms so that, with high probability, only one server steps forward as leader
- every node must be willing to become leader, in case of failures
- every node should delay a random period (or its unique ID times some constant) after noticing that the set of pingable nodes has changed (see the sketch below)
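A sketch of that backoff (the constants are arbitrary assumptions): staggering the moment each node steps forward makes a single leader likely, since the first proposer usually finishes before the next delay expires.

```python
# Hypothetical sketch of the leader-election backoff.
import random

def leader_backoff(node_id: int, constant_s: float = 0.25,
                   randomized: bool = True) -> float:
    if randomized:
        return random.uniform(0, 1.0)   # random delay before leading
    return node_id * constant_s         # deterministic: lower IDs go first

assert leader_backoff(3, randomized=False) == 0.75
```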
However, more than one leader is still possible
Such leaders must hold different proposal numbers \( n \), since each appends its own unique ID
Suppose the two different proposal numbers are 10 and 11
Case 1: The proposer of 10 does not receive \( R_2() \) from a majority
- because Phase 2 requires \( n \geq n_h \), no node that has already seen \( Q_1(11) \) (and thus holds \( n_h = 11 \)) will send \( R_2() \) in reply to \( Q_2(10, v_{10}) \), or
- the proposer of 10 might be in a network partition with a minority of the nodes
Either way, the proposer of 10 will never get agreement from the participants
Case 2: The proposer of 10 did receive \( R_2() \) from a majority
The majority that accepted 10 must have seen \( Q_2(10, v_{10}) \) before seeing \( Q_1(11) \) (otherwise this reduces to Case 1)
\( \implies \) the proposer of 11 must receive an \( R_1() \) from at least one node that saw \( Q_2(10, v_{10}) \),
because any two majorities overlap
- a node that saw \( Q_2(10, v_{10}) \) set its \( n_h = n_a = 10 \) and \( v_a = v_{10} \)
- that node replies \( R_1(10, v_{10}) \) to the proposer of 11 because \( 11 > n_h = 10 \)
\( \implies \) the proposer of 11 learns the value \( v_{10} \) from the received \( R_1() \)
\( \implies \) the proposer of 11 will reuse \( v_{10} \) rather than creating its own value in Phase 2,
so the participants still reach agreement, on the value \( v_{10} \)
Node Failure
Case 1: The leader fails before sending any \( Q_2() \)
Some servers will time out and become new leaders;
because the old leader never sent a \( Q_3() \), there is no risk of non-agreement caused by it
It is good (but not required) for a new leader to choose a higher \( n \) for its proposal;
otherwise, the other nodes will ignore the lower \( n \), causing another timeout,
until eventually some new leader that knows the old \( n \) chooses a higher one
Case 2: The leader fails after sending a minority of the \( Q_2() \)s
This case is the same as More than One Leader (timeout, then some servers become new leaders)
Case 3: The leader fails after sending a majority of the \( Q_2() \)s
This case is likewise the same as More than One Leader (timeout, then some servers become new leaders)
Case 4: A node fails after receiving \( Q_2() \) and sending \( R_2() \)
- if this node does not restart (so some node times out in Phase 3), another node becomes the new leader (as in More than One Leader)
- if this node does restart, it must have remembered \( v_a \) and \( n_a \) on disk before sending \( R_2() \) (see the sketch below)
Otherwise, suppose the old leader failed after sending a few \( Q_3() \)s:
the new leader must choose the same value as the old leader,
since the agreed-on value \( v_a \) has already been set at some node by \( Q_3() \)
and was known to the old leader via the \( R_1() \) replies from its majority;
this failed node, having forgotten \( v_a \), may be the only node in the intersection of the old leader's majority and the new leader's majority, so the new leader could never learn the agreed-on value
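A sketch of the required persistence (file path and helper names are assumptions): the accepted state is forced to disk before the \( R_2() \) reply goes out, so a restarted node can recover \( n_a \) and \( v_a \):

```python
# Hypothetical sketch: persist n_a and v_a durably before replying R2().
import json, os

STATE_PATH = "paxos_state.json"       # assumed on-disk location

def accept_q2(n, v, path=STATE_PATH):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"n_a": n, "v_a": v}, f)
        f.flush()
        os.fsync(f.fileno())          # force to disk, not just the page cache
    os.replace(tmp, path)             # atomic rename: old or new, never half
    return ("R2",)                    # only now is it safe to reply

def recover(path=STATE_PATH):
    with open(path) as f:
        return json.load(f)           # restore n_a and v_a after a crash

accept_q2(11, [1, [0, 1, 2, 3, 4]])
assert recover() == {"n_a": 11, "v_a": [1, [0, 1, 2, 3, 4]]}
os.remove(STATE_PATH)                 # tidy up the demo file
```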
Summary
Designed for replicated state machines (available even if some nodes are not reachable)
After each failure, perform a view change using Paxos to agree on which nodes are in the new view, and choose a new primary
How to render the data itself consistent across replicas is not discussed here,
but Paxos can likewise be used to make every replica agree on which client operation comes next in the log, which keeps the data consistent