Bayou

6 minute read

UCL Course COMP0133: Distributed Systems and Security LEC-08

Requirement of Paxos

Strong Reachability: majority of nodes must be reachable by leader

Bayou studies consistency of a distributed system with poor connections

Centralized System

The central server (owns the only copy) checks for conflicts before accepting updates

The central server returns error to user when conflict, and user decides how to solve

Automatic Conflict Resolution

Items in database cannot be viewed as bits (too little information to resolve conflicts)

e.g.,

“Both files have changed” can falsely conclude entire conflict
“Distinct record in each database changed” can falsely conclude no conflict

Ways to resolve conflict: read database, think, change request (like user updates)

Must ensure all nodes resolve conflicts in same way to keep replicas consistent

every node maintains an ordered list of updates
every node holds same deterministic updates
every node applies deterministic updates in same order

such that “sync” is to merge two ordered list

Write Log

The ordered list of updates (writes) are called Write Log

Each write has a decision and should contain one or more alternative decisions for conflict resolution

Each write will have an unique ID <Local-Timestamp, Originating-Node-ID> (used for ordering)

However, when writes epidemically propagate across nodes, nodes may initially apply updates in different orders

When a new write is received, it will be merged into write log.

And write log should be replayed from start such that all entries will be tentative

After each node has seen all writes, each node will agree

Local Timestamp

The global time synchronization is impossible but local timestamp can allow agreement on order

However, this ordering by the unique ID (contains local timestamp) has limitations

All entries in write log will be tentative \( \implies \) Store entire write log forever

Therefore, it is needed to commit tentaive writes to make previous parts of log entires stable

Requirements of a New Committed Write

For log entry \( X \) to be committed, each node must agree on

Total order of all previous committed entries
Fact that \( X \) is next in total order
Fact that all uncommitted (tentative) entries are “after” \( X \)

Committed Write, Primary Replica, and CSN

Idea

One node designated Primary Replica
Primary Replica marks each write it receives with permanent Commit Sequence Number (CSN) (monotonically increasing)

These writes will be regarded as a committed write whose unique ID <CSN, Local-Timestamp, Originating-Node-ID>

Then, nodes will exchange committed writes with CSNs

However, committed write is not stable (not safe to show users) unless node has seen all prior committed writes

This is guaranteed by Bayou Propagation Protocol (propagate in order)

such that CSN helps to define total order for committed writes

All nodes eventually agree on total order
Uncommitted (tentative) writes come after all committed writes

Then, the committed write is stable (safe to show users)

and slow or disconnected node cannot prevent the process (Primary Replica allocates CSNs)

Additionally,

ordering of commits by Primary Replica has following requirements

create < delete < modify
keep view of tentative log entries in each node

and nodes use Lamport Logical Clocks as local timestamp

Therefore, Primary Replica receives writes in per-node causal order, and commits them in that order

Problem of using Physical Clocks as Local Timestamp

Nodes might have wrong physical clocks

For example,

<701, A>: Node A asks for meeting M1 to occur at 10 AM, else 11 AM

Then if B with a slower clock wants to delete A’s create write

<700, B>: Node B deletes <701, A>

such that the delete operation will be ordered before create operation which is a counterintuitive behavior
Solution by Lamport Logical Clocks

By the definition of IR2, node B will update its logical clock to \( max(701+1, 700) = 702 \)

then the delete operation from node B will have a later timestamp 702

Uncommitted (Tentative) Write

Two nodes may disagree on meaning of tentative (uncommitted) writes
even if those two nodes have synced with each other

Only writes with CSNs from primary replica can resolve these disagreements permanently

Moreover, tentative order \( \neq \) commit order
since CSN is determined by the order that Primary Replica receives writes

Snapshot for Log of Committed Writes

When nodes receive new CSNs

they can discard all committed log entries seen up to that point

and keep snapshot of whole database as of highest CSN

such that not need to keep and transfer entire log entires

When nodes sync database snapshot

When node A and node B sync (exchange) database snapshot

If the highest CSN in A’s snapshot is greater than B’s
- A will send whole stable database to B
- Then, B will use the received database as starting point
- Then, B will discard all its committed log entries
- Then, B will replay its tentative writes into the database
If the highest CSN in A’s snapshot is smaller than B’s
- B will ingore the database sent by A
- B will do send whole stable database to A and so on

Summary

Useful for weak connectivity with convenient Automatic Conflict Resolution
Not transparent to applications

since writes are not bits, the checking and resolving process will be very complex
Not suitable for all applications

since writes need to provide alternative decisions for conflict resolution
- suitable for making appointments (e.g., conflict in Place 1 (Time 1) and try Place 2 or 3 (Time 2 or 3))
- suitable for bank tranfer (e.g., not have enough 50 pounds and try 40 pounds)
- not suitable for code repository like GitHub (hard to provide alternative decisions for conflicted codes)

Endong Liu