Bayou
UCL Course COMP0133: Distributed Systems and Security LEC-08
Requirement of Paxos
Strong Reachability: majority of nodes must be reachable by leader
Bayou studies consistency of a distributed system with poor connections
Centralized System
The central server (owns the only copy) checks for conflicts before accepting updates
The central server returns error to user when conflict, and user decides how to solve
Automatic Conflict Resolution
Items in database cannot be viewed as bits (too little information to resolve conflicts)
e.g.,
-
“Both files have changed” can falsely conclude entire conflict
-
“Distinct record in each database changed” can falsely conclude no conflict
Ways to resolve conflict: read database, think, change request (like user updates)
Must ensure all nodes resolve conflicts in same way to keep replicas consistent
-
every node maintains an ordered list of updates
-
every node holds same deterministic updates
-
every node applies deterministic updates in same order
such that “sync” is to merge two ordered list
Write Log
The ordered list of updates (writes) are called Write Log
Each write has a decision and should contain one or more alternative decisions for conflict resolution
Each write will have an unique ID <Local-Timestamp, Originating-Node-ID>
(used for ordering)
However, when writes epidemically propagate across nodes, nodes may initially apply updates in different orders
When a new write is received, it will be merged into write log.
And write log should be replayed from start such that all entries will be tentative
After each node has seen all writes, each node will agree
Local Timestamp
The global time synchronization is impossible but local timestamp can allow agreement on order
However, this ordering by the unique ID (contains local timestamp) has limitations
All entries in write log will be tentative \( \implies \) Store entire write log forever
Therefore, it is needed to commit tentaive writes to make previous parts of log entires stable
Requirements of a New Committed Write
For log entry \( X \) to be committed, each node must agree on
-
Total order of all previous committed entries
-
Fact that \( X \) is next in total order
-
Fact that all uncommitted (tentative) entries are “after” \( X \)
Committed Write, Primary Replica, and CSN
Idea
-
One node designated Primary Replica
-
Primary Replica marks each write it receives with permanent Commit Sequence Number (CSN) (monotonically increasing)
These writes will be regarded as a committed write whose unique ID <CSN, Local-Timestamp, Originating-Node-ID>
Then, nodes will exchange committed writes with CSNs
However, committed write is not stable (not safe to show users) unless node has seen all prior committed writes
This is guaranteed by Bayou Propagation Protocol (propagate in order)
such that CSN helps to define total order for committed writes
-
All nodes eventually agree on total order
-
Uncommitted (tentative) writes come after all committed writes
Then, the committed write is stable (safe to show users)
and slow or disconnected node cannot prevent the process (Primary Replica allocates CSNs)
Additionally,
ordering of commits by Primary Replica has following requirements
-
create < delete < modify
-
keep view of tentative log entries in each node
and nodes use Lamport Logical Clocks as local timestamp
Therefore, Primary Replica receives writes in per-node causal order, and commits them in that order
Problem of using Physical Clocks as Local Timestamp
-
Nodes might have wrong physical clocks
For example,
<701, A>: Node A asks for meeting M1 to occur at 10 AM, else 11 AM
Then if B with a slower clock wants to delete A’s create write
<700, B>: Node B deletes <701, A>
such that the delete operation will be ordered before create operation which is a counterintuitive behavior
-
Solution by Lamport Logical Clocks
By the definition of IR2, node B will update its logical clock to \( max(701+1, 700) = 702 \)
then the delete operation from node B will have a later timestamp 702
Uncommitted (Tentative) Write
Two nodes may disagree on meaning of tentative (uncommitted) writes
even if those two nodes have synced with each other
Only writes with CSNs from primary replica can resolve these disagreements permanently
Moreover, tentative order \( \neq \) commit order
since CSN is determined by the order that Primary Replica receives writes
Snapshot for Log of Committed Writes
When nodes receive new CSNs
they can discard all committed log entries seen up to that point
and keep snapshot of whole database as of highest CSN
such that not need to keep and transfer entire log entires
When nodes sync database snapshot
When node A and node B sync (exchange) database snapshot
-
If the highest CSN in A’s snapshot is greater than B’s
-
A will send whole stable database to B
-
Then, B will use the received database as starting point
-
Then, B will discard all its committed log entries
-
Then, B will replay its tentative writes into the database
-
-
If the highest CSN in A’s snapshot is smaller than B’s
-
B will ingore the database sent by A
-
B will do send whole stable database to A and so on
-
Summary
-
Useful for weak connectivity with convenient Automatic Conflict Resolution
-
Not transparent to applications
since writes are not bits, the checking and resolving process will be very complex
-
Not suitable for all applications
since writes need to provide alternative decisions for conflict resolution
-
suitable for making appointments (e.g., conflict in Place 1 (Time 1) and try Place 2 or 3 (Time 2 or 3))
-
suitable for bank tranfer (e.g., not have enough 50 pounds and try 40 pounds)
-
not suitable for code repository like GitHub (hard to provide alternative decisions for conflicted codes)
-