Paxos-based Sharded KV Storage System

At a Glance

Replication model: Multi-Paxos

Transaction model: 2PC + 2PL

Consistency target: Strong

Context: UW CSE 452

The Problem

The challenge was to build a storage system that remains correct under node failures, retries, and concurrent operations while scaling beyond a single replica group. A simple key-value server was not enough; the design needed fault tolerance, transactional coordination, and shard-aware consensus to preserve correctness.

System Architecture

Paxos Replica Group

Service

Each shard is owned by a Paxos replica group running Multi-Paxos. The stable leader assigns log slots, runs the Prepare phase when it takes over and the Accept phase for each slot, and commits an entry once a majority of replicas (counting itself) have accepted it. Followers apply committed entries to their local state machine copies in log order. On leader failure, a new leader is elected with the support of a majority. Committed state is never lost, because every committed entry has been accepted by at least ⌊n/2⌋ + 1 replicas and any new majority overlaps that set in at least one replica.
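The majority argument above can be checked by brute force for small groups. This is a minimal sketch, not code from the project; the function names are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def majority(n: int) -> int:
    """Smallest quorum size for a group of n replicas: floor(n/2) + 1."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Replicas that can fail while a majority is still reachable."""
    return n - majority(n)


def quorums_always_intersect(n: int) -> bool:
    """Check that every pair of majority quorums shares at least one replica."""
    quorums = [set(c) for c in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)
```

For example, `majority(5)` is 3 and `tolerated_failures(5)` is 2. The intersection property is what lets a new leader discover every committed entry by contacting any majority.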

Why This Component

Multi-Paxos is the practical form of Paxos for a long-lived replicated log. The single-leader optimization eliminates the Prepare phase for stable operations — once a leader holds a ballot accepted by a majority, every subsequent slot only requires the Accept phase. This cuts consensus to one round-trip under no contention.

Stable leader eliminates Prepare

The full two-phase Paxos round (Prepare + Accept) is only necessary when leadership is contested. A stable Multi-Paxos leader with a durable ballot skips Prepare entirely for each new slot. Under normal operation, a write commits in one round-trip to a majority. Phase 1 is the recovery mechanism, not the steady-state path.
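The steady-state path described above can be sketched in memory. This is an illustrative model under several assumptions, not the project's implementation: the `Acceptor` and `Leader` classes, their method names, and the round counters are all hypothetical, and direct method calls stand in for RPCs.

```python
from dataclasses import dataclass, field


@dataclass
class Acceptor:
    promised: int = -1                            # highest ballot promised so far
    accepted: dict = field(default_factory=dict)  # slot -> (ballot, value)

    def on_prepare(self, ballot):
        """Phase 1: promise to reject lower ballots and report prior accepts."""
        if ballot > self.promised:
            self.promised = ballot
            return True, dict(self.accepted)
        return False, {}

    def on_accept(self, ballot, slot, value):
        """Phase 2: accept unless a higher ballot has already been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted[slot] = (ballot, value)
            return True
        return False


class Leader:
    def __init__(self, ballot, acceptors):
        self.ballot = ballot
        self.acceptors = acceptors
        self.quorum = len(acceptors) // 2 + 1
        self.next_slot = 0
        self.elected = False
        self.log = {}            # slot -> chosen value
        self.prepare_rounds = 0  # counters exist only to show the message pattern
        self.accept_rounds = 0

    def elect(self):
        """Run Prepare once; re-propose any values earlier leaders got accepted."""
        self.prepare_rounds += 1
        replies = [a.on_prepare(self.ballot) for a in self.acceptors]
        promises = [prior for ok, prior in replies if ok]
        if len(promises) < self.quorum:
            return False
        self.elected = True
        recovered = {}
        for prior in promises:
            for slot, (b, v) in prior.items():
                if slot not in recovered or b > recovered[slot][0]:
                    recovered[slot] = (b, v)
        for slot in sorted(recovered):
            self._accept(slot, recovered[slot][1])
        self.next_slot = max(recovered, default=-1) + 1
        return self.elected

    def propose(self, value):
        """Steady state: one Accept round per slot, no Prepare."""
        if not self.elected:
            raise RuntimeError("not the leader; call elect() first")
        slot = self.next_slot
        self.next_slot += 1
        return slot if self._accept(slot, value) else None

    def _accept(self, slot, value):
        self.accept_rounds += 1
        acks = sum(a.on_accept(self.ballot, slot, value) for a in self.acceptors)
        if acks >= self.quorum:
            self.log[slot] = value
            return True
        self.elected = False     # a higher ballot exists; step down
        return False
```

With three acceptors, a leader that calls `elect()` once and then `propose()` twice performs one Prepare round and two Accept rounds. A later leader with a higher ballot recovers both slots during its own `elect()`, after which the old leader's proposals are rejected.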

Trade-offs

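A single leader per group is both the performance advantage and the availability bottleneck. Leader failure requires an election before the group can commit new entries, and writes stall during this window. The election timeout bounds that unavailability, but a shorter timeout also raises the rate of false-positive failure detection under network delay or partition. Ballot numbers keep Paxos safe even when two nodes briefly believe they are leader, so the cost of a spurious election is dueling proposers and lost throughput rather than divergent state. Tuning this timeout is the primary operational lever for trading failover speed against leadership churn.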

Alternatives

Multi-Raft is a functionally equivalent alternative with a more prescriptive protocol specification — useful when implementation clarity matters more than protocol minimalism. EPaxos allows leaderless operation with commutative commands, eliminating the single-leader bottleneck, but the correctness argument is significantly more complex and dependency tracking across commands adds overhead for non-commutative writes.

Correctness Guarantees, Layered

The system builds correctness incrementally — each layer assumes the guarantee below it and adds exactly one new property. Exactly-once RPC eliminates duplicate mutations before any replication logic runs. Viewstamped Replication adds failover without data loss. Multi-Paxos provides shard-level consensus with a stable leader. Two-phase commit and locking tie it all together for cross-shard atomicity.

Layer 1: Exactly-once RPC (adds: Deduplication)

The client tags each request with a (clientId, seqNum) pair. The server keeps a per-client deduplication table, so a retried request is recognized and answered with its cached result instead of being re-applied.
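A deduplication table of this kind might look like the following. This is a minimal sketch, not the project's code: the class and method names are hypothetical, and it assumes each client has at most one outstanding request, so only the most recent result per client needs to be cached.

```python
class DedupKVServer:
    def __init__(self):
        self.store = {}
        self.last = {}    # client_id -> (seq_num, cached result)
        self.applied = 0  # number of operations actually executed

    def handle(self, client_id, seq_num, op, key, value=None):
        seen = self.last.get(client_id)
        if seen is not None:
            last_seq, cached = seen
            if seq_num == last_seq:
                return cached   # retry of the latest request: replay the result
            if seq_num < last_seq:
                return None     # stale duplicate; the client has moved on
        result = self._apply(op, key, value)
        self.last[client_id] = (seq_num, result)
        return result

    def _apply(self, op, key, value):
        self.applied += 1
        if op == "put":
            self.store[key] = value
            return "ok"
        if op == "append":
            self.store[key] = self.store.get(key, "") + value
            return self.store[key]
        if op == "get":
            return self.store.get(key)
        raise ValueError(f"unknown op: {op}")
```

Append is the operation where this matters most: without the table, a retried `append` would add its value twice.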

Layer 2: Viewstamped Replication (adds: Failover)

Primary-backup replication with view changes on failure. A new primary is elected from the replica set, state is reconciled, and the group resumes without data loss. Exactly-once semantics are preserved across view changes.

Layer 3: Multi-Paxos Replica Groups (adds: Shard-level consensus)

Each shard is backed by a dedicated Paxos replica group. A stable leader drives all proposals for its group; Phase 1 (Prepare) is skipped after leader election. Log-based state machine replication ensures all replicas converge to the same committed history.
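The convergence claim rests on applying chosen entries strictly in slot order, even when replicas learn them out of order. The following is a hedged sketch of that rule; `ReplicaLog` and its fields are hypothetical names, and commands are simplified to (key, value) writes.

```python
class ReplicaLog:
    def __init__(self):
        self.chosen = {}        # slot -> (key, value) command
        self.next_to_apply = 0  # lowest slot not yet applied
        self.state = {}         # the replicated key-value state machine

    def learn(self, slot, command):
        """Record a chosen command, then apply every contiguous ready slot."""
        self.chosen.setdefault(slot, command)  # a chosen value never changes
        while self.next_to_apply in self.chosen:
            key, value = self.chosen[self.next_to_apply]
            self.state[key] = value
            self.next_to_apply += 1
```

A replica that learns slot 1 before slot 0 holds slot 1 back until the gap is filled, so it ends in the same state as a replica that learned both in order.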

Layer 4: 2PC + 2PL Cross-Shard (adds: Transactional atomicity)

Cross-shard operations coordinate through a two-phase commit protocol. Each shard-participant uses two-phase locking to prevent concurrent transactions from interleaving. The coordinator logs its decision before issuing Commit so crash recovery is deterministic.
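The interaction between the two protocols can be sketched as follows. This is an illustration under stated assumptions rather than the project's design: the `Participant` and `Coordinator` classes are hypothetical, only exclusive locks are modeled, a lock conflict causes an immediate "no" vote instead of waiting, and an in-memory list stands in for the coordinator's durable decision log.

```python
class Participant:
    def __init__(self):
        self.data = {}
        self.locks = {}   # key -> txid holding the exclusive lock
        self.staged = {}  # txid -> writes buffered until the decision arrives

    def prepare(self, txid, writes):
        """Vote yes only if every key can be locked (2PL growing phase)."""
        if any(self.locks.get(k, txid) != txid for k in writes):
            return False
        for k in writes:
            self.locks[k] = txid
        self.staged[txid] = dict(writes)
        return True

    def commit(self, txid):
        self.data.update(self.staged.pop(txid, {}))
        self._release(txid)

    def abort(self, txid):
        self.staged.pop(txid, None)
        self._release(txid)

    def _release(self, txid):
        """Locks are released only after the decision (2PL shrinking phase)."""
        self.locks = {k: t for k, t in self.locks.items() if t != txid}


class Coordinator:
    def __init__(self):
        self.decision_log = []  # stand-in for durable storage

    def run(self, txid, writes_by_shard):
        """writes_by_shard is a list of (participant, writes) pairs."""
        votes = [p.prepare(txid, w) for p, w in writes_by_shard]
        decision = "commit" if all(votes) else "abort"
        # Log the decision before telling anyone, so recovery can replay it.
        self.decision_log.append((txid, decision))
        for p, _ in writes_by_shard:
            if decision == "commit":
                p.commit(txid)
            else:
                p.abort(txid)
        return decision
```

If any participant votes no, every participant aborts and releases its locks, so no shard ends up with a partial write.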

Protocol Flows

Two core protocols drive correctness in this system. Paxos consensus handles single-shard replication: the leader drives prepare and accept rounds to commit each log entry. Two-phase commit coordinates cross-shard transactions, with 2PL ensuring participants hold locks through the commit decision.

Key Decisions

1. Implemented exactly-once RPC first so all higher-level replication logic could assume idempotent client semantics.
2. Used Viewstamped Replication for the primary-backup stage to keep failover logic explicit before introducing full Paxos groups.
3. Combined Multi-Paxos replica groups with shard ownership to avoid a single consensus domain becoming the bottleneck.
4. Applied 2PC and 2PL for cross-shard transactions, trading latency for correctness and easier reasoning about consistency.

Outcomes

Delivered a working sharded KV store for CSE 452 with replication, transactions, and fault-tolerance mechanisms implemented end-to-end.
Demonstrated strong consistency across distributed nodes in a course project setting modeled after production distributed databases.
Built reusable understanding of consensus, leader failover, and distributed transaction tradeoffs for later systems design work.

Lessons Learned

1. Exactly-once semantics are foundational: retry behavior becomes chaos if deduplication is not designed early.
2. Consensus solves agreement, not end-to-end system design; transaction coordination and locking still dominate complexity.
3. Sharding helps scalability, but cross-shard correctness reintroduces coordination costs very quickly.
4. Course projects are an excellent place to feel the operational pain of distributed systems before facing it in production.
