2021-12-06
Facebook's Delos
Virtual Consensus in Delos (2020)
-
TL; DR: Split up the reconfiguration part and the state machine replication
aspect of a consensus protocol to enable on the fly switching/testing of
various consensus protocols as well as loosening the availability
requirements of the state machine replication aspect.
-
Consensus protocols include a data plane (for ordering and storing
commands durably) and a control plane (for reconfiguring leadership,
roles, parameters, and membership). Most protocols combine both planes into
a single protocol
-
Delos exposes a VirtualLog which acts as a single log to external users, but
is in fact composed of a chain of log instances (Loglets). Ex. Entries 1-100
belong to Log A, Entries 101-200 to Log B, Entries 201-∞ belong to Log C
-
Requirements
- Low # of dependencies on other services
- Strong guarantees for durability, availability, and consistency
-
Flexible storage APIs (the project needed to create a Table API with
transactions, range queries and serialization instead of the filesystem
based Zookeeper API)
Other goals:
- Eventually deprecate Zookeeper
- Migrate between log implementation with no downtime
-
Virtual Log Abstraction:
-
append and checkTail are only operated on the most recent
log
- readNext could apply to multiple logs
- reconfigExtend is used to change logs to a new log
Requirements
-
Virtual Log Operations
Steady State
-
Uses locally cache version of the log to write to and forward client
operations to there
Reconfiguration
-
seal the current log (Ci): This prevents new appends
from succeeding on the current log.
seal
is an idempotent operation so concurrent operations are okay. After
sealing, the client calls
checkTail
to get the final log position
-
Update the metastore with the new configuration: Write the start
position of Ci+1. The metastore allows conditional writes
(Ci+1 if previous was Ci) so only one concurrent
operation will succeed.
-
Fetch updated chain from metastore: Get new configurations and store in
local cache
-
After a log is sealed, it returns an error indicating the
sealed state to the client. checkTail also returns a bit
about sealed. When this happens, the client knows to fetch new configs
from the metastore
-
If after getting a sealed response from the log and fetching from the
metastore returns no new configuration, wait before fetching again.
After an appropriate timeout, start creating a separate chain.
-
Virtual Log Metastore: Uses raft, paxos, or similar things to ensure highly
available consistent operations. Is not on the critical path as
reconfigurations are rare
-
Loglet
-
Requirements
- Totally ordered durable storage
-
Highly available seal operation (does not need fault tolerant
consensus). Does not need highly available append operation
-
"Zombie" appends can happen where the checkTail operation returns
increasing tail positions despite it being sealed. This is not relevant
because the final positions are stored durably in the virtual log
-
Native Loglet (performance optimized loglet)
-
Single sequencer assigns position for every command and remaining
machines replicate it for durability. The sequencer is a single
point of failure, but if it fails, the system will be reconfigured.
A command is considered committed if a majority of servers have
replicated it.
-
seal occurs by contacting each Native Loglet server and
setting the seal bit to true. A system is sealed if a majority of
servers are sealed
-
checkTail uses a state machine so that if some servers return
as sealed, eventually the entire system is sealed, in which case it
returns the max value of the tail returned
- Fast local reads
-
Experimental info:
-
Successful production swap between initial Zookeper based Loglet to
native loglet. 10x performance gains
-
Can modify throughout characteristics by swapping between having the
loglet on the same server as the virtual log and not. Default has the
loglet and the virtual log on the same server. With high QPS, being on
the same server results in IO contention and by offloading the loglet to
separate servers, the throughput is greatly improved. Can use in real
time to handle workload changes.
-
Can eliminate the sequencer bottleneck by using a round robin between
multiple loglets at once (StripedLoglet)
Log-structured Protocols in Delos (2021)
-
TL; DR: Create reusable protocol stacks so that developing different
applications using a shared log is easy
-
Initially had a single Table API (DelosTable). Incrementally upgraded with
no downtime to enable backups and other features. When building Zookeeper
API (Zelos), they were able to reuse the production ready stack from
DelosTable. Zookeeper API had special requirements such as stronger
consistency than linearizability (same session operations are ordered which
is a fact used by transactions). Use common performance upgrades
Batching/Leasing
-
Stack of engines. It's linearizable because it simply applies the log
operations sent to it from a lower engine. Each one inspects its own headers
for additional metadata to do operations like batching or adding leases etc.
-
Top layer uses propose and each layer uses propose to
propagate all the way down to the BaseEngine. Then each layer calls
apply on its local store before forwarding the log entry higher
up in the stack
-
Dynamic Addition of Engines: Two phase protocol where initially it is added
but can only append to shared log and not update local state. After that
with a log command it is enabled consistently across all servers.
-
SessionOrderEngine: Handle Zookeeper's ordering guarantees where same
session operations are ordered even without waiting. Solves this by
generating a sequence number and only allowing entries upstream in sequence
order. Out of order entries are filtered and re-proposed
-
Allow easy design of other operations such as building Queues and Locking
over Delos. These services are generally production ready and don't require
high amounts of customer load on operators
-
Major benefit is that each layer can add an ObserverEngine to monitor the
time cost of that engine in a transparent way.
Any error corrections or comments can be made by sending me a pull request.
Back