Paxos
UCL Course COMP0133: Distributed Systems and Security LEC-07
Centralization and Replication
Single Point of Failure
Centralization creates a single point of failure (SPOF):
when a significant node is down, the others must wait for it to come back up (loss of availability)
Replication
Replicate data on several servers,
so that when one or more nodes are down, the others can still work with the replicated data
Consistency
All replicas should hold identical copies of the data (even while the data is being modified)
State Machine Replication
Definition of State Machine
A state machine is a behavioural model. It consists of a finite number of states.
Based on the current state and a given input, the machine performs a state transition and produces output
Any server is essentially a state machine
- Disk, RAM, and CPU registers are the state
- Instructions (executed on behalf of client requests) are the transitions between states
Replication
Replicate the state machine on multiple hosts such that,
if every replica sees the same (deterministic) operations in the same order, all replicas end in the same state
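A minimal sketch of this idea (not from the lecture; the KVStateMachine class and the operation format are illustrative assumptions): applying the same deterministic operations in the same order leaves every replica with identical state.

```python
# Illustrative only: a deterministic key-value state machine; applying the
# same operations in the same order yields the same state on every replica.

class KVStateMachine:
    def __init__(self):
        self.state = {}                      # stands in for disk/RAM/registers

    def apply(self, op):
        kind, key, value = op                # deterministic transition
        if kind == "put":
            self.state[key] = value
        elif kind == "delete":
            self.state.pop(key, None)

ops = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]

replicas = [KVStateMachine() for _ in range(3)]
for r in replicas:
    for op in ops:                           # same ops, same order
        r.apply(op)

assert all(r.state == replicas[0].state for r in replicas)
print(replicas[0].state)                     # {'y': 2}
```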
Primary and Backups
One special server is called the primary and all other servers are called backups
- Clients send operations to the current primary
- The current primary
  - chooses the order for the operations from clients
  - sends these operations to the backups
  - replies to the client
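A sketch of this flow under simplifying assumptions (direct method calls instead of network messages; the Primary and Replica classes are hypothetical): the primary assigns a sequence number, forwards the operation to every backup, and only then replies to the client.

```python
class Replica:
    def __init__(self):
        self.log = []                      # operations in the primary's order

    def apply(self, seq, op):
        self.log.append((seq, op))

class Primary(Replica):
    def __init__(self, backups):
        super().__init__()
        self.backups = backups
        self.next_seq = 0

    def handle_client_op(self, op):
        seq = self.next_seq                # 1) choose the order
        self.next_seq += 1
        self.apply(seq, op)
        for b in self.backups:             # 2) send the operation to backups
            b.apply(seq, op)
        return f"ok: {op!r} at seq {seq}"  # 3) reply to the client

backups = [Replica(), Replica()]
primary = Primary(backups)
print(primary.handle_client_op("put x=1"))
print(backups[0].log == primary.log)       # True
```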
Primary Failure
- Choose a new primary (two simultaneous primaries are not allowed)
- The last operation received by the old primary may not have completed
Primary Election
Suppose that each server has a unique ID (an integer)
and the live server with the lowest ID is expected to become the new primary
Strawman Approach
After a primary failure, each live server pings the others to find the new primary (the lowest-ID server that responds)
Problem
Two simultaneous primaries can arise because
- pings might be lost or delayed (no ping is taken to mean the node is dead, which may be wrong)
- the network may partition (each isolated partition chooses its own primary)
Majority Consensus Approach
A majority (more than half) of the nodes must agree on the primary, so that
- any two majorities must overlap; even if pings are lost or delayed, the nodes in the overlap will not agree twice, so two conflicting primaries cannot both be elected
- at most one network partition can contain a majority (more than half of the nodes)
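A small check of the two properties above (the node IDs and the partition are made up for illustration): any two majorities of the same five-node set intersect, and at most one side of a partition can hold a majority.

```python
from itertools import combinations

NODES = {0, 1, 2, 3, 4}
MAJORITY = len(NODES) // 2 + 1               # more than half: 3 of 5

# any two majority-sized subsets of the same node set intersect
for a in combinations(NODES, MAJORITY):
    for b in combinations(NODES, MAJORITY):
        assert set(a) & set(b)

# in any partition of the nodes, at most one side holds a majority
partition = ({0, 1}, {2, 3, 4})
sides_with_majority = [side for side in partition if len(side) >= MAJORITY]
print(len(sides_with_majority))              # 1
```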
View Change Algorithm
Based on a sequence of views, where
each view is a two-element tuple {view#, set of participant nodes}
- view# is a monotonically increasing integer
- the set of participant nodes contains the unique IDs of the live servers
The view change algorithm must ensure agreement on a unique successor for each view
The participant set within a view lets all nodes agree on the primary (the lowest ID in the set); to the agreement protocol the view is just an opaque value
It does not guarantee that nodes agree on some value, but it does guarantee that they never agree on different values
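A sketch of a view as described above, assuming the primary is simply the lowest ID in the participant set; the View dataclass and the failure scenario are illustrative.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    number: int                  # view#: monotonically increasing
    participants: frozenset      # unique IDs of the live servers

    def primary(self) -> int:
        return min(self.participants)

v0 = View(0, frozenset({0, 1, 2, 3, 4}))
v1 = View(v0.number + 1, frozenset({1, 2, 3, 4}))   # suppose node 0 failed
print(v0.primary(), v1.primary())                    # 0 1
```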
Paxos Overview
Paxos is guaranteed to complete only when all nodes agree on the input (here the input is the proposed next view)
State
- n_a: greatest proposal number accepted by this node (init: -1)
- v_a: the value received together with n_a (init: nil)
- n_h: greatest proposal number seen in any message (init: -1)
- done: leader says agreement has been reached (init: false)
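A sketch of this per-node state (the field names follow the list above; the PaxosState class itself is an illustrative assumption).

```python
from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class PaxosState:
    n_a: int = -1               # greatest proposal number accepted
    v_a: Optional[Any] = None   # value received together with n_a (nil)
    n_h: int = -1               # greatest proposal number seen in any message
    done: bool = False          # leader says agreement reached

print(PaxosState())             # PaxosState(n_a=-1, v_a=None, n_h=-1, done=False)
```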
Phase 1
A node (maybe more than one)
- decides to be leader
- picks a proposal number n
  - n must be unique (and ideally higher than any proposal number seen so far)
  - e.g., take the last known proposal number + 1 and append this node's unique ID
- sends a prepare(n) message to all nodes (including itself)
If a node receives prepare(n) and n > n_h, it
- sets n_h = n
- sends a prepare_ok(n_a, v_a) reply message
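A sketch of Phase 1 under the rules above, with hypothetical helper names; the proposal number is formed as "last known number + 1" with the node's ID appended, as the lecture suggests.

```python
def make_proposal_number(last_known: int, node_id: int) -> int:
    # "last known proposal number + 1, append this node's unique ID"
    return int(f"{last_known + 1}{node_id}")         # e.g. 0, 1 -> 11

def on_prepare(state: dict, n: int):
    """Receiver side: if n > n_h, remember n and reply prepare_ok(n_a, v_a)."""
    if n > state["n_h"]:
        state["n_h"] = n
        return ("prepare_ok", state["n_a"], state["v_a"])
    return None                                       # ignore stale proposals

state = {"n_a": -1, "v_a": None, "n_h": -1, "done": False}
print(on_prepare(state, make_proposal_number(0, 1)))  # ('prepare_ok', -1, None)
```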
Phase 2
If the leader receives prepare_ok(n_a, v_a) replies from a majority of nodes
- if any prepare_ok contained a value, v = the value sent with the highest n_a
- else (all prepare_ok have v_a = nil), the leader chooses a value v, here the next view
  {oldview# + 1, set of participant nodes}
- it sends an accept(n, v) message to all responders
If a node receives accept(n, v) and n >= n_h, it
- sets n_a = n
- sets v_a = v
- sends an accept_ok(n) reply message
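A sketch of Phase 2 under the rules above (helper names are hypothetical): choose_value implements "take the value sent with the highest n_a, otherwise pick a fresh value", and on_accept is the receiver-side n >= n_h check.

```python
def choose_value(prepare_oks, fresh_value):
    """Value sent with the highest n_a if any; otherwise a fresh value."""
    accepted = [(n_a, v_a) for (n_a, v_a) in prepare_oks if v_a is not None]
    if accepted:
        return max(accepted, key=lambda p: p[0])[1]
    return fresh_value                         # e.g. {oldview#+1, participants}

def on_accept(state: dict, n: int, v):
    """Receiver side: if n >= n_h, accept the value and reply accept_ok(n)."""
    if n >= state["n_h"]:
        state["n_a"], state["v_a"] = n, v
        return ("accept_ok", n)
    return None

state = {"n_a": -1, "v_a": None, "n_h": 11, "done": False}
v = choose_value([(-1, None)] * 5, fresh_value=(1, frozenset({0, 1, 2, 3, 4})))
print(on_accept(state, 11, v))                 # ('accept_ok', 11)
```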
Phase 3
If the leader receives accept_ok(n) replies from a majority of nodes
- it sends a decide(v_a) message to all participants
If a node receives decide(v), it sets done = true
(agreement has been reached and the agreed-on value is v)
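A sketch of Phase 3 under the rules above (helper names are hypothetical): once a majority has replied accept_ok, the leader broadcasts decide and every receiver marks done.

```python
def majority(n_nodes: int) -> int:
    return n_nodes // 2 + 1

def leader_on_accept_ok(num_accept_ok: int, n_nodes: int, v_a):
    if num_accept_ok >= majority(n_nodes):
        return ("decide", v_a)                  # sent to all participants
    return None

def on_decide(state: dict, v):
    state["v_a"], state["done"] = v, True       # agreement reached, value is v

state = {"n_a": 11, "v_a": (1, frozenset(range(5))), "n_h": 11, "done": False}
msg = leader_on_accept_ok(5, 5, state["v_a"])
if msg is not None:
    on_decide(state, msg[1])
print(state["done"])                            # True
```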
Process of Paxos (One Leader & No Failures)
Initialization
Nodes | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
n_a | -1 | -1 | -1 | -1 | -1
v_a | nil | nil | nil | nil | nil
n_h | -1 | -1 | -1 | -1 | -1
done | F | F | F | F | F
Phase 1
Node 1 decides to be the leader by itself
and it chooses 1 as its proposal number, appending its ID 1 to get n = 11
then it sends prepare(11) to all nodes (including itself)
Each node receives prepare(11)
and finds that 11 > n_h = -1, so it sets n_h = 11
then it replies prepare_ok(-1, nil)
Nodes | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
n_a | -1 | -1 | -1 | -1 | -1
v_a | nil | nil | nil | nil | nil
n_h | 11 | 11 | 11 | 11 | 11
done | F | F | F | F | F
Phase 2
Node 1 receives prepare_ok(-1, nil) from all nodes
and it finds that all the v_a values are nil, so it chooses the next view {oldview# + 1, set of participant nodes} = {1, {0,…,4}} as the value v
then it sends accept(11, {1, {0,…,4}}) to all responders
Nodes | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
n_a | -1 | -1 | -1 | -1 | -1
v_a | nil | {1, {0,…,4}} | nil | nil | nil
n_h | 11 | 11 | 11 | 11 | 11
done | F | F | F | F | F
Each node receives accept(11, {1, {0,…,4}})
and finds that 11 >= n_h, so it sets n_a = 11 and v_a = {1, {0,…,4}}
then it replies accept_ok(11)
Nodes | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
n_a | 11 | 11 | 11 | 11 | 11
v_a | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}}
n_h | 11 | 11 | 11 | 11 | 11
done | F | F | F | F | F
Phase 3
Node 1 receives accept_ok(11) from all nodes (a majority)
then it sends decide({1, {0,…,4}}) to all participants
Each node receives decide({1, {0,…,4}}) and sets done = T
Nodes | 0 | 1 | 2 | 3 | 4
---|---|---|---|---|---
n_a | 11 | 11 | 11 | 11 | 11
v_a | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}} | {1, {0,…,4}}
n_h | 11 | 11 | 11 | 11 | 11
done | T | T | T | T | T
Finally, all nodes agree on the value (the view {1, {0,…,4}}),
so the new primary is the node with the lowest ID in the participant set: Node 0
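The whole round can be replayed in a single process. The sketch below mirrors the trace above (five nodes, node 1 as the only leader, proposal number 11, no failures); message passing is simulated by direct calls and all names are illustrative.

```python
class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.n_a, self.v_a, self.n_h, self.done = -1, None, -1, False

    def on_prepare(self, n):
        if n > self.n_h:
            self.n_h = n
            return (self.n_a, self.v_a)           # prepare_ok

    def on_accept(self, n, v):
        if n >= self.n_h:
            self.n_a, self.v_a = n, v
            return n                              # accept_ok

    def on_decide(self, v):
        self.v_a, self.done = v, True

nodes = [Node(i) for i in range(5)]
n = 11                                            # "1" with leader ID 1 appended

# Phase 1: prepare / prepare_ok
oks = [node.on_prepare(n) for node in nodes]

# Phase 2: every prepare_ok carried v_a = nil, so the leader proposes the next view
assert all(v is None for (_, v) in oks)
value = (1, frozenset(range(5)))                  # {oldview#+1, participants}
accept_oks = [node.on_accept(n, value) for node in nodes]

# Phase 3: majority of accept_ok -> decide
if sum(ok is not None for ok in accept_oks) >= len(nodes) // 2 + 1:
    for node in nodes:
        node.on_decide(value)

print(all(node.done for node in nodes))           # True
print(min(value[1]))                              # new primary: node 0
```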
Other Situations in Paxos
Timeout
All nodes wait a maximum period (timeout) for each message they expect
Upon timeout, a node declares itself a leader and initiates a new Phase 1 of the algorithm
Reason:
This preserves liveness without sacrificing safety (the protocol works through the three phases again rather than waiting forever for a dead leader)
Non-agreement
Happens when competing leaders leave nodes with different n_h values, so no proposal collects accept_ok from a majority
Solution:
If a node times out without receiving decide, it becomes a leader itself and restarts Phase 1 with a higher proposal number
More than One Leader
Paxos applies mechanisms that make a single leader likely with high probability (only one server steps forward)
- Every node must be willing to become leader in case of failures
- Every node should delay a random period (or its unique ID times some constant) after realizing that the set of pingable nodes has changed
However, more than one leader is still possible
These leaders must hold different proposal numbers
Suppose the two different proposal numbers are 10 and 11
Case 1: The proposer of 10 does not receive accept_ok(10) from a majority
- because Phase 2 requires n >= n_h, no node will reply accept_ok(10) after seeing prepare(11) or accept(11), or
- because the proposer of 10 might be in a network partition with a minority of the nodes
so the proposer of 10 will never get agreement from a majority of participants
Case 2: The proposer of 10 did receive accept_ok(10) from a majority
The majority that answers prepare(11) must then contain a node that has seen accept(10, v10),
because any two majorities overlap
- A node that saw accept(10, v10) would have set its n_a = 10 and v_a = v10
- That node would reply prepare_ok(10, v10) to the proposer of 11, because 11 > n_h
so the proposer of 11 chooses the same value v10, and all participants still agree on a single value
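A sketch of why this works (choose_value is the same hypothetical helper sketched for Phase 2): one node in the overlap already accepted (10, v10), so the highest-n_a rule forces the proposer of 11 to choose v10 as well.

```python
def choose_value(prepare_oks, fresh_value):
    accepted = [p for p in prepare_oks if p[1] is not None]
    if accepted:
        return max(accepted, key=lambda p: p[0])[1]
    return fresh_value

v10 = (1, frozenset({0, 1, 2, 3, 4}))                       # accepted under n=10
prepare_oks_for_11 = [(-1, None), (-1, None), (10, v10)]    # overlap node last
print(choose_value(prepare_oks_for_11, fresh_value="anything-else") == v10)  # True
```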
Node Failure
Case 1: The leader fails before sending any accept messages
This causes timeouts at some servers, which then become new leaders
and, because the old leader has failed, it will never send any accept
A new leader must choose a higher proposal number than the old one,
otherwise the other nodes (whose n_h already holds the old number) will ignore the lower prepare
Finally, since no accept was ever sent, no value was accepted anywhere, and the new leader is free to choose any value
Case 2: The leader fails after sending accept to a minority of the nodes
This case is the same as More than One Leader (timeout, and then some servers become new leaders)
Case 3: The leader fails after sending accept to a majority of the nodes
This case is the same as More than One Leader (timeout, and then some servers become new leaders); by majority overlap, a new leader will learn of any value already accepted by a majority and re-propose it
Case 4: A node fails after receiving an accept message
- If this node does not restart (e.g., others time out in Phase 3), some node will become a new leader (More than One Leader)
- If this node does restart, it must remember v_a and n_a on disk before sending accept_ok
  otherwise, suppose the old leader failed after sending only a few decide messages:
  the new leader must choose the same value as the old leader,
  since the agreed-on value has already been set at some nodes by accept
  and was known to the old leader through accept_ok replies from a majority;
  this failed node, having forgotten its v_a, may be the only node in the intersection of the two majorities (the old leader's accept_ok majority and the new leader's prepare_ok majority), so without the saved state the new leader could never learn the chosen value
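A sketch of that durability requirement (the file name and format are illustrative assumptions): the node writes n_a and v_a to disk before replying accept_ok, so a restart cannot forget an accepted value.

```python
import json, os, tempfile

STATE_FILE = os.path.join(tempfile.gettempdir(), "paxos_node_state.json")

def persist_then_accept_ok(state: dict, n: int, v):
    if n >= state["n_h"]:
        state["n_a"], state["v_a"] = n, v
        with open(STATE_FILE, "w") as f:          # write before replying
            json.dump({"n_a": state["n_a"], "v_a": state["v_a"]}, f)
        return ("accept_ok", n)
    return None

def recover() -> dict:
    with open(STATE_FILE) as f:                   # reload after a crash
        return json.load(f)

state = {"n_a": -1, "v_a": None, "n_h": 11, "done": False}
persist_then_accept_ok(state, 11, [1, [0, 1, 2, 3, 4]])   # JSON-friendly value
print(recover())                                  # {'n_a': 11, 'v_a': [1, [0, 1, 2, 3, 4]]}
```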
Summary
Designed for replicated state machines (available even if some nodes are not reachable)
After each failure, perform a view change using Paxos to agree on which nodes are in the new view and to choose the new primary
This lecture does not discuss how to keep the data itself consistent across replicas,
but Paxos can also be used to agree on which client operation comes next in every replica's log, which keeps the data consistent
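A conceptual sketch of that closing remark, with agree_on standing in for a full Paxos round (it is a placeholder, not the protocol above): running one agreement per log slot keeps every replica's log, and hence its data, identical.

```python
def agree_on(slot: int, proposed_op: str) -> str:
    # In a real system this would be a Paxos round among the replicas;
    # here every proposal simply "wins" to keep the sketch self-contained.
    return proposed_op

client_ops = ["put x=1", "put y=2", "delete x"]
replica_logs = [[], [], []]

for slot, op in enumerate(client_ops):
    chosen = agree_on(slot, op)            # all replicas learn the same op
    for log in replica_logs:
        log.append(chosen)

print(all(log == replica_logs[0] for log in replica_logs))   # True
```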