6 minute read

UCL Course COMP0133: Distributed Systems and Security LEC-08


Requirement of Paxos

Strong Reachability: majority of nodes must be reachable by leader

Bayou studies consistency of a distributed system with poor connections


Centralized System

The central server (owns the only copy) checks for conflicts before accepting updates

The central server returns error to user when conflict, and user decides how to solve


Automatic Conflict Resolution

Items in database cannot be viewed as bits (too little information to resolve conflicts)

e.g.,

  • “Both files have changed” can falsely conclude entire conflict

  • “Distinct record in each database changed” can falsely conclude no conflict

Ways to resolve conflict: read database, think, change request (like user updates)

Must ensure all nodes resolve conflicts in same way to keep replicas consistent

  • every node maintains an ordered list of updates

  • every node holds same deterministic updates

  • every node applies deterministic updates in same order

such that “sync” is to merge two ordered list


Write Log

The ordered list of updates (writes) are called Write Log

Each write has a decision and should contain one or more alternative decisions for conflict resolution

Each write will have an unique ID <Local-Timestamp, Originating-Node-ID> (used for ordering)

However, when writes epidemically propagate across nodes, nodes may initially apply updates in different orders

When a new write is received, it will be merged into write log.

And write log should be replayed from start such that all entries will be tentative

After each node has seen all writes, each node will agree


Local Timestamp

The global time synchronization is impossible but local timestamp can allow agreement on order

However, this ordering by the unique ID (contains local timestamp) has limitations

All entries in write log will be tentative \( \implies \) Store entire write log forever

Therefore, it is needed to commit tentaive writes to make previous parts of log entires stable


Requirements of a New Committed Write

For log entry \( X \) to be committed, each node must agree on

  • Total order of all previous committed entries

  • Fact that \( X \) is next in total order

  • Fact that all uncommitted (tentative) entries are “after” \( X \)


Committed Write, Primary Replica, and CSN

Idea

  • One node designated Primary Replica

  • Primary Replica marks each write it receives with permanent Commit Sequence Number (CSN) (monotonically increasing)

These writes will be regarded as a committed write whose unique ID <CSN, Local-Timestamp, Originating-Node-ID>

Then, nodes will exchange committed writes with CSNs

However, committed write is not stable (not safe to show users) unless node has seen all prior committed writes

This is guaranteed by Bayou Propagation Protocol (propagate in order)

such that CSN helps to define total order for committed writes

  • All nodes eventually agree on total order

  • Uncommitted (tentative) writes come after all committed writes

Then, the committed write is stable (safe to show users)

and slow or disconnected node cannot prevent the process (Primary Replica allocates CSNs)

Additionally,

ordering of commits by Primary Replica has following requirements

  • create < delete < modify

  • keep view of tentative log entries in each node

and nodes use Lamport Logical Clocks as local timestamp

Therefore, Primary Replica receives writes in per-node causal order, and commits them in that order

Problem of using Physical Clocks as Local Timestamp

  • Nodes might have wrong physical clocks

    For example,

    <701, A>: Node A asks for meeting M1 to occur at 10 AM, else 11 AM

    Then if B with a slower clock wants to delete A’s create write

    <700, B>: Node B deletes <701, A>

    such that the delete operation will be ordered before create operation which is a counterintuitive behavior

  • Solution by Lamport Logical Clocks

    By the definition of IR2, node B will update its logical clock to \( max(701+1, 700) = 702 \)

    then the delete operation from node B will have a later timestamp 702


Uncommitted (Tentative) Write

Two nodes may disagree on meaning of tentative (uncommitted) writes
even if those two nodes have synced with each other

Only writes with CSNs from primary replica can resolve these disagreements permanently

Moreover, tentative order \( \neq \) commit order
since CSN is determined by the order that Primary Replica receives writes


Snapshot for Log of Committed Writes

When nodes receive new CSNs

they can discard all committed log entries seen up to that point

and keep snapshot of whole database as of highest CSN

such that not need to keep and transfer entire log entires

When nodes sync database snapshot

When node A and node B sync (exchange) database snapshot

  • If the highest CSN in A’s snapshot is greater than B’s

    • A will send whole stable database to B

    • Then, B will use the received database as starting point

    • Then, B will discard all its committed log entries

    • Then, B will replay its tentative writes into the database

  • If the highest CSN in A’s snapshot is smaller than B’s

    • B will ingore the database sent by A

    • B will do send whole stable database to A and so on


Summary

  • Useful for weak connectivity with convenient Automatic Conflict Resolution

  • Not transparent to applications

    since writes are not bits, the checking and resolving process will be very complex

  • Not suitable for all applications

    since writes need to provide alternative decisions for conflict resolution

    • suitable for making appointments (e.g., conflict in Place 1 (Time 1) and try Place 2 or 3 (Time 2 or 3))

    • suitable for bank tranfer (e.g., not have enough 50 pounds and try 40 pounds)

    • not suitable for code repository like GitHub (hard to provide alternative decisions for conflicted codes)