GFS: Design Decisions That Shaped Distributed Storage
The Google File System paper (2003) made design choices that looked unconventional — a single master, relaxed consistency, large chunk sizes. Understanding why those choices were made reveals the constraints of large-scale distributed storage.
The workload GFS was built for
GFS was designed in 2003 for Google's internal workloads: large batch jobs (web crawling, indexing, log processing) that read and write files in the hundreds of gigabytes to terabytes range. The paper's design assumptions are explicit:
- Files are large (multi-GB is common). Small files exist but are not the common case.
- Writes are mostly record appends, not random writes. Files are written once and read many times.
- Sequential reads are far more common than random reads.
- Commodity hardware fails regularly — failure is the norm, not the exception.
- Throughput matters more than low latency.
These assumptions drive every significant design decision in the paper. If your workload differs (many small files, random writes, latency-sensitive reads), GFS is the wrong tool.
Single master, many chunkservers
GFS splits responsibility cleanly: one master tracks metadata (where files are, which chunkservers hold each chunk), many chunkservers hold the actual data. Clients talk to the master for metadata, then directly to chunkservers for data. The master is never in the data path.
Prerequisites
- distributed systems basics
- replication
- fault tolerance
Key Points
- The master holds all metadata in memory — fast lookups, but limits total namespace size.
- Clients cache chunk locations from the master. Most reads skip the master entirely.
- Chunkservers store 64 MB chunks. Large chunks reduce master load (fewer chunks to track per file).
- The master never moves data. It only tells clients where data is.
Why a single master works
The obvious concern with a single master is that it is both a single point of failure and a bottleneck. The paper addresses this directly by keeping the master out of the data path and keeping each client-to-master interaction small and infrequent.
When a client reads a file:
- Client asks the master: "Where are the chunks for file X, offset Y?"
- Master replies with chunk handles and chunkserver locations.
- Client caches this response.
- Client reads directly from the chunkserver — no master involved.
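The read path above can be sketched in a few lines of Python. This is a minimal illustration, not GFS's real API — the class names, file path, and chunkserver names are all hypothetical:

```python
# Sketch of the GFS client read path: the master serves metadata only,
# and data is fetched directly from a chunkserver. All names are invented.

CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as in the paper

class Master:
    def __init__(self):
        # file path -> list of (chunk_handle, [chunkserver locations])
        self.files = {"/logs/crawl-001": [("h1", ["cs-a", "cs-b", "cs-c"])]}

    def lookup(self, path, offset):
        chunk_index = offset // CHUNK_SIZE
        return self.files[path][chunk_index]  # chunk handle + replica locations

class Client:
    def __init__(self, master):
        self.master = master
        self.cache = {}  # (path, chunk_index) -> (handle, locations)

    def read(self, path, offset):
        key = (path, offset // CHUNK_SIZE)
        if key not in self.cache:             # only a cache miss contacts the master
            self.cache[key] = self.master.lookup(path, offset)
        handle, locations = self.cache[key]
        # Data path: talk to a chunkserver directly; the master is not involved.
        return f"read {handle} from {locations[0]}"

client = Client(Master())
client.read("/logs/crawl-001", 0)     # cache miss: one master lookup
client.read("/logs/crawl-001", 1024)  # same chunk: served from the client cache
```

The second read never touches the master, which is the whole point: at steady state, almost all traffic flows between clients and chunkservers.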
The master handles metadata operations (file creates, renames, namespace lookups). It does not see read or write data. At Google's scale in 2003, a single machine could keep all metadata in RAM because the metadata per chunk is small (~64 bytes) and the chunk size (64 MB) limits the total number of chunks.
A 1 PB filesystem with 64 MB chunks = ~16 million chunks × 64 bytes ≈ 1 GB of metadata. That fits in RAM on a single server.
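The back-of-the-envelope arithmetic checks out exactly (using the paper's ~64 bytes of metadata per chunk as the assumption):

```python
# Verify the metadata sizing estimate from the text.
CHUNK_SIZE = 64 * 1024**2       # 64 MB per chunk
METADATA_PER_CHUNK = 64         # ~64 bytes of master metadata per chunk
filesystem_bytes = 1024**5      # 1 PB of stored data

chunks = filesystem_bytes // CHUNK_SIZE
metadata_bytes = chunks * METADATA_PER_CHUNK

print(chunks)                   # 16777216 chunks (~16 million)
print(metadata_bytes // 1024**3)  # 1 GB of metadata
```

With 4 KB blocks instead of 64 MB chunks, the same petabyte would need ~256 billion entries — hopeless for a single in-memory master. The large chunk size is what makes the single-master design viable at all.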
The trade-off: as namespace size grows, the single master becomes a scaling ceiling. Google eventually replaced GFS with Colossus, which distributes the master function across multiple nodes.
The lease mechanism for writes
Write consistency is where GFS gets interesting. For a client to write to a chunk, one chunkserver must be designated the primary — it serializes writes and determines the mutation order applied by all replicas.
GFS uses a lease to designate the primary:
- Client asks master which chunkserver holds the lease for the chunk.
- If no lease is active, master grants a 60-second lease to one replica — this becomes the primary.
- Client sends data to all replicas (pipelined along a chain).
- Client tells the primary to apply the mutation.
- Primary determines the mutation's serial number and applies it. Forwards the mutation to secondaries.
- Secondaries apply in the same order. All reply to primary. Primary replies to client.
The data transfer and the mutation application are decoupled. Data is pushed to replicas before the primary orders mutations, which lets the network pipeline fill without waiting for mutation serialization.
Client → chunkserver 1 → chunkserver 2 → chunkserver 3
(data pipelined along closest path)
Client → Primary: "apply mutation"
Primary → Secondaries: "apply mutation #42"
Secondaries → Primary: "done"
Primary → Client: "success"
If the lease expires before the write completes, the master can grant a new lease to a different replica. The master refuses to extend a lease if it is trying to revoke it (e.g., for snapshot operations).
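The mutation-ordering step can be sketched as follows. This is an illustrative model only (the data push is assumed to have already happened, and all names are invented); it shows the one thing the primary adds — a single global order that every replica applies:

```python
# Sketch of lease-based mutation ordering: the primary assigns serial
# numbers and all secondaries apply mutations in that same order.

class Replica:
    def __init__(self, name):
        self.name = name
        self.log = []  # applied (serial, mutation) pairs, in apply order

    def apply(self, serial, mutation):
        self.log.append((serial, mutation))

class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial = self.next_serial      # primary picks the position in the order
        self.next_serial += 1
        self.apply(serial, mutation)
        for s in self.secondaries:     # secondaries apply in the same order
            s.apply(serial, mutation)
        return "success"               # reply to client after all replicas apply

secondaries = [Replica("cs-b"), Replica("cs-c")]
primary = Primary("cs-a", secondaries)
primary.mutate("append record X")
primary.mutate("append record Y")
# All three replicas now hold identical logs in identical order.
```

Note what is missing: there is no voting and no two-phase commit. The primary's serial numbers are the entire ordering mechanism, which is why a primary failure mid-mutation can leave replicas disagreeing — the consistency model below is the consequence.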
The consistency model: not what you might expect
GFS's consistency model is weaker than a traditional filesystem's. The paper distinguishes between defined (all clients see the same data, and it is exactly what the mutation wrote) and merely consistent (all clients see the same data, but it may be a mix of fragments from concurrent mutations).
For concurrent writes to the same region: the region is consistent but undefined — all replicas have the same bytes, but the bytes may be a mix of fragments from multiple concurrent writes. No single write is guaranteed to be atomic relative to others.
For record appends (GFS's primary write mode): the appended record's region is defined, but the file may also contain padding or duplicate records. GFS guarantees that the record was written at least once as an atomic unit — but a failed-and-retried append can leave the record twice at some replicas, or padding at one replica where another has data.
This sounds alarming. The practical implication: applications must handle duplicates. Google's MapReduce jobs and other consumers of GFS are designed to be idempotent — duplicate records are filtered by the application, not the filesystem.
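Application-level deduplication is straightforward when each record carries a writer-assigned unique ID — an assumption made here for illustration; GFS itself imposes no record format:

```python
# Sketch of application-side deduplication over GFS record appends.
# Records are (record_id, payload) pairs; IDs are assigned by the writer.

def read_records(raw_records):
    seen = set()
    for record_id, payload in raw_records:
        if record_id in seen:     # duplicate left behind by a retried append
            continue
        seen.add(record_id)
        yield payload

# A retried append left record "r2" in the file twice:
raw = [("r1", "alpha"), ("r2", "beta"), ("r2", "beta"), ("r3", "gamma")]
list(read_records(raw))  # ['alpha', 'beta', 'gamma']
```

Pushing this burden to readers is a deliberate trade: the filesystem stays simple and fast, and the (few, Google-controlled) applications absorb the complexity once.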
Why large chunks (64 MB) make sense for GFS's workload
64 MB seems large compared to typical filesystem block sizes (4–64 KB). The benefits for GFS's workload:
- Fewer master interactions: a client reading a 10 GB file needs metadata for ~160 chunks rather than millions of blocks. The client keeps chunk locations cached.
- Persistent TCP connections: because operations on a chunk span many requests (writes are batched), clients keep a persistent connection to the chunkserver. This amortizes TCP setup cost.
- Less metadata per file: fewer chunks per file means less metadata in the master.
The downside: small files may consist of just one chunk. If many clients access the same small file (a common executable, for example), its single chunkserver becomes a hot spot. GFS mitigates this by allowing clients to read from different replicas, but hot spots are a real limitation for small-file workloads.
Master recovery: operation log and checkpoints
The master's metadata is entirely in memory. Persistence comes from an operation log: every metadata change (file creation, chunk allocation, lease grant) is written to the operation log on disk and replicated to remote machines before the master acknowledges the operation.
On failure:
- Master loads the most recent checkpoint — a serialized snapshot of all metadata state.
- Master replays operation log entries from after the checkpoint.
- Master rebuilds in-memory state and resumes.
Checkpoints are written by a separate thread while the master continues serving requests. The log entries between the last checkpoint and a failure are typically small (a few MB), so recovery is fast.
Chunkserver locations are not persisted on the master. On restart, the master asks each chunkserver what chunks it holds. This is simpler than trying to keep chunk locations accurate through chunkserver failures — the chunkservers are the authoritative source of truth for which chunks they hold.
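The recovery sequence can be sketched as a checkpoint load plus a log replay. The operation types and data structures below are simplified stand-ins, not GFS's actual log format:

```python
# Sketch of master recovery: load checkpoint, replay the operation log.
# State here is just path -> chunk handles; real metadata is richer.

def recover(checkpoint, op_log):
    # 1. Load the serialized snapshot of metadata state.
    state = dict(checkpoint)
    # 2. Replay every operation logged after the checkpoint was taken.
    for op, path, value in op_log:
        if op == "create":
            state[path] = value
        elif op == "delete":
            state.pop(path, None)
    # 3. Chunk locations are NOT persisted: after recovery the master
    #    polls chunkservers for what they hold (not shown).
    return state

checkpoint = {"/a": ["h1"], "/b": ["h2"]}
op_log = [("create", "/c", ["h3"]), ("delete", "/b", None)]
recover(checkpoint, op_log)  # {'/a': ['h1'], '/c': ['h3']}
```

Because the checkpoint bounds how much log must be replayed, recovery time depends on the handful of megabytes logged since the last checkpoint, not on the total size of the namespace.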
Quiz (hard): A GFS client writes a record to a file. The primary chunkserver applies the mutation but fails before forwarding it to the secondaries. The master grants a new lease to a secondary. What is the state of the file?
Setup: the original primary applied mutation #42; the secondaries did not. The master detects the primary's failure via missed heartbeats and grants a lease to a secondary, incrementing the chunk version number.
- A: The mutation is lost — the client must retry. Incorrect. Close, but the client receives an error and retries, and the retry may produce a duplicate at any replica that did receive the write before the failure. The new primary applies the retried mutation with a new serial number.
- B: The file is inconsistent — different replicas have different data — and GFS accepts this (the paper classifies such a region as inconsistent, hence undefined). Correct! The primary applied mutation #42; the secondaries did not, and GFS has no rollback mechanism for this case. The master increments the chunk version number when granting the new lease; the old primary's replica, now stale, is eventually identified and garbage-collected. The client receives an error and retries, potentially creating a duplicate record at the replicas that did receive the write. Applications consuming GFS must handle this.
- C: GFS uses two-phase commit to roll back the partial write atomically. Incorrect. GFS does not implement two-phase commit; its consistency guarantees are intentionally weaker — defined for successful appends, undefined for concurrent writes, and tolerant of duplicates.
- D: The master reverts the secondary replicas to match the primary's state. Incorrect. The master only tracks metadata and does not hold the data needed to do this. Data recovery is handled by re-replication from an up-to-date replica, not by rollback.
Hint: think about what "at least once" delivery means for record appends and what the version number mechanism accomplishes.