13 minute read

UCL Course COMP0133: Distributed Systems and Security LEC-07


Centralization and Replication

Single Point of Failure

Centralization causes a single point of failure (SPOF)

such that when a critical node is down, the others must wait until it comes back up (availability suffers)

Replication

Replicate data on several servers

such that if one or more nodes are down, the others can still work with the replicated data

Consistency

All replicas should hold identical copies of the data (even while the data is being modified)


State Machine Replication

Definition of State Machine

A state machine is a behavior model. It consists of a finite number of states.

Based on the current state and a given input, the machine performs state transitions and produces outputs

Any server is essentially a state machine

  • Disk, RAM, CPU registers are state

  • Instructions (executed in response to user requests) cause transitions among states

Replication

Replicate the state machine on multiple hosts such that

if every replica sees the same (deterministic) operations in the same order, all replicas end in the same state (see the sketch below)
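
As a minimal sketch (a hypothetical key-value state machine, not part of the lecture), the following Python shows that two replicas applying the same deterministic operations in the same order end in the same state:

```python
# Minimal sketch: a hypothetical key-value state machine.
# Two replicas that apply the same deterministic operations
# in the same order end up in exactly the same state.

def apply(state, op):
    """Apply one deterministic operation to the state (a dict)."""
    kind, key, value = op
    if kind == "put":
        state[key] = value
    elif kind == "delete":
        state.pop(key, None)
    return state

ops = [("put", "x", 1), ("put", "y", 2), ("delete", "x", None)]

replica_a, replica_b = {}, {}
for op in ops:                 # same operations, same order
    apply(replica_a, op)
    apply(replica_b, op)

assert replica_a == replica_b  # replicas end in the same state
print(replica_a)               # {'y': 2}
```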


Primary and Backups

One special server is called the primary and all other servers are called backups

  • clients send operations to the current primary

  • the current primary should

    • choose an order for the operations from clients

    • send these operations to the backups

    • reply to the client (a sketch of this flow follows below)
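
A hedged sketch of this primary/backup flow, with illustrative names (Primary, Backup, client_request) rather than a real API:

```python
# Sketch of the primary/backup message flow described above (illustrative only).

class Backup:
    def __init__(self):
        self.log = []          # operations in the order chosen by the primary

    def apply(self, seq, op):
        self.log.append((seq, op))

class Primary:
    def __init__(self, backups):
        self.backups = backups
        self.seq = 0           # the primary alone chooses the operation order

    def client_request(self, op):
        self.seq += 1
        for b in self.backups:                # forward the op to every backup
            b.apply(self.seq, op)
        return f"ok: op {self.seq} applied"   # reply to the client

backups = [Backup(), Backup()]
primary = Primary(backups)
print(primary.client_request("put x 1"))
print(backups[0].log == backups[1].log)       # True: backups saw the same order
```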

Primary Failure

  • Choose a new primary (two simultaneous primaries are not allowed)

  • The last operation received by the old primary may not have completed


Primary Election

Suppose that each server has a unique ID (an integer)

and the live server with the lowest ID is expected to become the new primary

Strawman Approach

After a primary failure, the live servers ping each other to find the new primary

Problem

Two simultaneous primaries can be caused by

  • Pings might be lost or delayed (a missing ping is wrongly taken to mean the node is not alive)

  • Network partition (each isolated partition chooses its own primary)

Majority Consensus Approach

A majority (more than half) of the nodes must agree on the primary, such that

  • Any two majorities must overlap, so even if pings are lost or delayed, the nodes in the overlap prevent two conflicting agreements (a small check is sketched after this list)

  • At most one network partition can contain a majority (more than half of the nodes)
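
A small check of the overlap argument (illustrative only): with 5 nodes, a majority is 3, and any two majorities share at least one node:

```python
# Why majorities work: any two majorities of the same node set must share at
# least one node, so two partitions cannot both contain a majority.
from itertools import combinations

nodes = set(range(5))                  # 5 servers, IDs 0..4
quorum = len(nodes) // 2 + 1           # majority = 3

majorities = [set(c) for c in combinations(nodes, quorum)]
assert all(a & b for a in majorities for b in majorities)  # every pair overlaps
print(f"majority size for {len(nodes)} nodes: {quorum}")
```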

View Change Algorithm

The algorithm is based on a sequence of views, where

each view is a two-element tuple {view#, set of participant nodes}

  • view# is a monotonically increasing integer (from view i to view i+1)

  • the set of participant nodes contains the unique IDs of the live servers

The view change algorithm must ensure agreement on a unique successor for the next view

The participant set within a view allows all nodes to agree on the primary; the view itself is treated as an opaque value

There is no guarantee that nodes agree on some value, but there is a guarantee that they never agree on different values
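
One possible representation of a view (the names View, number, and participants are assumptions for illustration, not a fixed API):

```python
# Illustrative representation of a view: a monotonically increasing view
# number plus the set of live participant IDs.
from dataclasses import dataclass

@dataclass(frozen=True)
class View:
    number: int              # view#, increases on every view change
    participants: frozenset  # unique IDs of the live servers

    def primary(self):
        # convention from above: the live server with the lowest ID is primary
        return min(self.participants)

v = View(number=1, participants=frozenset({0, 1, 2, 3, 4}))
print(v.primary())   # 0
```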


Paxos Overview

Paxos is guaranteed to complete only when all nodes agree on the input (here the input is the new view)

State

  • na: greatest n accepted by node (init: -1)

  • va: value received together with na (init: nil)

  • nh: greatest n seen in Q1 message (init: -1)

  • done: leader says agreement reached (init: false)
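
A minimal sketch of this per-node state, using the initial values listed above:

```python
# Per-node Paxos state with the initial values listed above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class PaxosState:
    na: int = -1                 # greatest proposal number accepted
    va: Optional[object] = None  # value accepted together with na (nil)
    nh: int = -1                 # greatest proposal number seen in a Q1
    done: bool = False           # set once the leader announces agreement

print(PaxosState())   # PaxosState(na=-1, va=None, nh=-1, done=False)
```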


Phase 1

A node (maybe more than one)

  • decides to be the leader

  • picks a proposal number n

    • it must be unique (and ideally higher than any known proposal number)

    • e.g., take the last known proposal number + 1 and append this node's unique ID

  • sends a Q1(n) message to all nodes (including itself)

if node receives Q1(n) and n>nh

  • nh=n

  • send reply R1(na,va) message
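
An illustrative Phase 1 sketch (message passing is omitted; make_proposal_number and handle_Q1 are hypothetical helpers, and the digit-appending encoding is just one way to make n unique):

```python
# Phase 1 sketch: the proposer builds a unique proposal number and sends Q1(n);
# an acceptor replies R1(na, va) only if n exceeds the highest n it has seen.

def make_proposal_number(last_n, node_id, num_nodes=10):
    # "last known proposal number + 1, append the node's unique ID":
    # encoded here as (last_n + 1) * num_nodes + node_id, e.g. 1 and ID 1 -> 11
    return (last_n + 1) * num_nodes + node_id

def handle_Q1(state, n):
    """Acceptor side: state is a dict with keys na, va, nh, done."""
    if n > state["nh"]:
        state["nh"] = n
        return ("R1", state["na"], state["va"])   # reply R1(na, va)
    return None                                    # ignore stale proposals

state = {"na": -1, "va": None, "nh": -1, "done": False}
n = make_proposal_number(last_n=0, node_id=1)      # -> 11, as in the walkthrough below
print(handle_Q1(state, n))                         # ('R1', -1, None)
```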


Phase 2

if the leader receives R1 messages from a majority of nodes (including itself)

  • if any R1(n,v) contained a value v

    v = value sent with highest n

  • else (all R1(n,v) have v=nil)

    choose a value v = {oldview# + 1, set of participant nodes}

  • send Q2(n,v) message to all responders (with highest n)

if node receives Q2(n,v) and n≥nh

  • nh=na=n

  • va=v

  • send reply R2() message
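
An illustrative Phase 2 sketch under the same assumptions (choose_value and handle_Q2 are hypothetical helpers); note the acceptor test is n ≥ nh:

```python
# Phase 2 sketch: the leader picks the value from the R1 with the highest n if
# any value was reported, otherwise proposes its own new view; an acceptor
# accepts Q2(n, v) only if n >= nh.

def choose_value(r1_replies, new_view):
    """r1_replies: list of (na, va) from a majority; new_view: leader's own proposal."""
    accepted = [(na, va) for (na, va) in r1_replies if va is not None]
    if accepted:
        return max(accepted, key=lambda p: p[0])[1]  # value sent with the highest na
    return new_view                                  # e.g. {oldview# + 1, participants}

def handle_Q2(state, n, v):
    if n >= state["nh"]:
        state["nh"] = state["na"] = n
        state["va"] = v
        return ("R2",)
    return None

state = {"na": -1, "va": None, "nh": 11, "done": False}
v = choose_value([(-1, None)] * 3, new_view=(1, frozenset({0, 1, 2, 3, 4})))
print(handle_Q2(state, 11, v))     # ('R2',)
```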


Phase 3

if the leader receives R2() messages from a majority of the protocol participants

  • send Q3() message to all participants

if node receives Q3()

  • done=true (agreement reached; the agreed-on value is va)
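
An illustrative Phase 3 sketch (leader_phase3 and handle_Q3 are hypothetical helpers):

```python
# Phase 3 sketch: once R2() replies arrive from a majority, the leader
# broadcasts Q3(); each node marks agreement as reached on its own va.

def leader_phase3(r2_count, num_nodes):
    majority = num_nodes // 2 + 1
    return r2_count >= majority      # True => send Q3() to all participants

def handle_Q3(state):
    state["done"] = True             # agreement reached; agreed value is state["va"]
    return state["va"]

state = {"na": 11, "va": (1, frozenset({0, 1, 2, 3, 4})), "nh": 11, "done": False}
if leader_phase3(r2_count=5, num_nodes=5):
    print(handle_Q3(state))          # (1, frozenset({0, 1, 2, 3, 4}))
```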

Process of Paxos (One Leader & No Failures)

Initialization

Nodes 0 1 2 3 4
na -1 -1 -1 -1 -1
va nil nil nil nil nil
nh -1 -1 -1 -1 -1
done F F F F F

Phase 1

Node 1 decides to be the leader by itself
and it chooses 1 as its proposal number and appends its unique ID 1, giving n=11
then it sends Q1(11) to all nodes including itself

Each node receives Q1(11) from Node 1
and compares n=11 with its nh to set nh=n=11
then it replies R1(na,va) to Node 1 (currently na=-1 and va=nil)

Nodes 0 1 2 3 4
na -1 -1 -1 -1 -1
va nil nil nil nil nil
nh 11 11 11 11 11
done F F F F F

Phase 2

Node 1 receives R1(n,v) from a majority of nodes
and it finds that all values of v are nil, so it chooses a value v = {oldview# + 1, set of participant nodes}
then it sends a Q2(n,v) message to all responders (currently n=11 and v={1, {0,…,4}})

Nodes 0 1 2 3 4
na -1 -1 -1 -1 -1
va nil {1, {0,…,4}} nil nil nil
nh 11 11 11 11 11
done F F F F F

Each node receives Q2(n,v) from Node 1
and finds n=nh=11, so it sets nh=na=n and va=v
then it replies with an R2() message to Node 1

Nodes 0 1 2 3 4
na 11 11 11 11 11
va {1, {0,…,4}} {1, {0,…,4}} {1, {0,…,4}} {1, {0,…,4}} {1, {0,…,4}}
nh 11 11 11 11 11
done F F F F F

Phase 3

Node 1 receives R2() from majority of nodes
then it sends Q3() to all nodes including itself

Each node receives Q3() from Node 1 and sets done to true

Nodes 0 1 2 3 4
na 11 11 11 11 11
va {1, {0,…,4}} {1, {0,…,4}} {1, {0,…,4}} {1, {0,…,4}} {1, {0,…,4}}
nh 11 11 11 11 11
done T T T T T

Finally, all nodes agree on the value (view) v={1, {0,…,4}}
such that the new primary is the node with the lowest ID in the participant set, Node 0


Other Situations in Paxos

Timeout

All nodes wait a maximum period (timeout) for messages they expect

Upon timeout, a node declares itself a leader and initiates a new Phase 1 of the algorithm

Reason:

This makes it more likely to preserve liveness without sacrificing safety (nodes keep working through the three phases rather than waiting forever for a leader)
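
A hedged sketch of this timeout rule (wait_or_lead is a hypothetical helper; a real implementation would track more state):

```python
# Timeout sketch: wait for an expected message up to a deadline; on timeout,
# become leader and start a new Phase 1 with a fresh proposal number.
import queue

def wait_or_lead(inbox, timeout_s, my_id, last_n):
    try:
        return inbox.get(timeout=timeout_s)       # expected protocol message
    except queue.Empty:
        n = (last_n + 1) * 10 + my_id             # same encoding as the Phase 1 sketch
        return ("Q1", n)                          # declare self leader, restart Phase 1

inbox = queue.Queue()                             # empty queue simulates a silent leader
print(wait_or_lead(inbox, timeout_s=0.1, my_id=2, last_n=1))   # ('Q1', 22)
```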


Non-agreement

This would happen if nodes with different va received Q3() (nodes should eventually agree on the same va)

Solution:

If a Q3() could have been sent, any future Q3()s in the same round of Paxos are guaranteed to reach nodes with the same va


More than One Leader

Paxos applies mechanisms so that, with high probability, only one leader steps forward

  • Every node must be willing to become leader in case of failures

  • Every node should delay for a random period (or its unique ID times some constant) after noticing that the set of pingable nodes has changed (a backoff sketch follows below)
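
A hedged sketch of this backoff rule (backoff_delay and the 0.05 s slot are illustrative assumptions):

```python
# Leader-election backoff: after noticing a membership change, delay either a
# random period or unique ID * constant before proposing, so that with high
# probability only one node steps forward first.
import random

def backoff_delay(node_id, use_random=False, slot=0.05):
    if use_random:
        return random.uniform(0, 1.0)   # random delay
    return node_id * slot               # lower IDs try to lead sooner

for nid in range(5):
    print(nid, round(backoff_delay(nid), 3))   # 0.0, 0.05, 0.1, ...
```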

However, more than one leader is still possible

These leaders must hold different proposal numbers n since each appends its own unique ID

Suppose the two different proposal numbers are 10 and 11

Case 1: The proposer of 10 does not receive R2() from a majority

  • Because Phase 2 requires that n≥nh, no node will send R2() in reply to Q2(10,v10) after seeing Q1(11) (which sets nh=11), or

  • Because the proposer of 10 might be in a network partition with a minority of the nodes

such that the proposer of 10 will never reach agreement among the participants

Case 2: The proposer of 10 did receive R2() from a majority

The majority that accepted 10 must have seen Q2(10,v10) before seeing Q1(11) (otherwise this becomes Case 1)

The proposer of 11 must receive an R1() from at least one node that saw Q2(10,v10)
because any two majorities overlap

  • A node that saw Q2(10,v10) would set its nh=na=10 and va=v10

  • That node would reply R1(10,v10) to the proposer of 11 because 11>nh=10

The proposer of 11 therefore learns the value v10 of 10 from the received R1()

The proposer of 11 will use the value v10 of 10 rather than creating its own value in Phase 2

such that agreement is still reached on the single value v10, no matter which proposer completes


Node Failure

Case 1: The leader fails before sending Q2()s

This will cause timeouts at some servers, which will then become new leaders

and because the old leader has failed, it will not send any Q3()s, so there is no risk of non-agreement caused by the old leader

It is good (but not required) for the new leader to choose a higher n for its proposal

otherwise, other nodes will ignore the lower n and the round will time out again

Eventually, a new leader that knows the old n and uses a higher n will emerge

Case 2: The leader fails after sending minority of Q2()s

This case is the same as More than One Leader (timeout and then some servers will become new leaders)

Case 3: The leader fails after sending majority of Q2()s

This case is the same as More than One Leader (timeout and then some servers will become new leaders)

Case 4: The node fails after receiving Q2() and sending R2()

  • If this node does not restart (e.g., a timeout occurs in Phase 3), some node will become a new leader (see More than One Leader)

  • If this node does restart, it must have remembered va and na on disk before sending R2() (a persistence sketch follows below)

    otherwise, consider the case where the old leader failed after sending a few Q3()s:

    the new leader must choose the same value as the old leader,

    since the agreed-on value va may already have been fixed at some node by Q3(),
    and the new leader can only learn it by receiving an R1() from a node in the old leader's majority;

    this failed node, having lost its va, may be the only node in the intersection of the two majorities of the new leader and the old leader
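
A hedged sketch of persisting na and va before replying R2() (the JSON file format and helper names are assumptions, not from the lecture):

```python
# Persist na and va to disk and fsync *before* replying R2(), so a node that
# crashes and restarts still remembers the value it accepted.
import json, os

def persist_acceptance(path, na, va):
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"na": na, "va": va}, f)
        f.flush()
        os.fsync(f.fileno())            # force the write to stable storage
    os.replace(tmp, path)               # atomic rename; only now is it safe to send R2()

def recover(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"na": -1, "va": None}       # fresh node: initial Paxos state

persist_acceptance("paxos_state.json", na=11, va=[1, [0, 1, 2, 3, 4]])
print(recover("paxos_state.json"))
```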


Summary

Designed for replicated state machines (available even if some nodes are not reachable)

After each failure, perform a view change using Paxos to agree on which nodes are in the new view and to choose the new primary

There is no discussion here of how to make the data consistent across replicas,

but Paxos can still be used to agree on which client operation comes next in every replica's log, which keeps the data consistent