GFS

Author: Bowen Y

Key Features

  • The primary checks whether appending the record to the current chunk would cause the chunk to exceed the maximum size (64 MB). If so, it pads the chunk to the maximum size (and tells the secondaries to do the same), and the client retries the append on the next chunk.
  • Record append is restricted to at most one-fourth of the maximum chunk size, which keeps worst-case fragmentation from padding at an acceptable level.
  • GFS does not guarantee that all replicas are bytewise identical. It only guarantees that the data is written at least once as an atomic unit.
  • Like AFS, GFS uses standard copy-on-write techniques to implement snapshots. The newly created snapshot files point to the same chunks as the source files until clients send new write requests.
  • Each master operation acquires a set of locks before it runs. Typically, if it involves /d1/d2/.../dn/leaf, it will acquire read-locks on the directory names /d1, /d1/d2, ..., /d1/d2/.../dn, and either a read lock or a write lock on the full pathname /d1/d2/.../dn/leaf.
  • After a file is deleted, GFS does not immediately reclaim the available physical storage. It does so only lazily during regular garbage collection at both the file and chunk levels.
  • Instead of reclaiming resources immediately, the file is just renamed to a hidden name that includes the deletion timestamp.
  • In a similar regular scan of the chunk namespace, the master identifies orphaned chunks (i.e., those not reachable from any file) and erases the metadata for those chunks.
  • Each chunkserver uses checksumming to detect corruption of stored data.

Metadata Stored on the Master Node

  1. File-to-Chunk Mapping (Namespace):

    • The file namespace in GFS is similar to a traditional file system’s directory structure. Files are identified by their full pathnames.
    • For each file, the master stores a mapping of the file name to an array of chunk handles. Each chunk handle corresponds to a 64 MB chunk of the file.
      • Example:
        • A file like /data/logs/logfile.txt could map to chunk handles [chunkHandle1, chunkHandle2, chunkHandle3], where each handle represents a 64 MB chunk of the file.
    • This structure allows the master to keep track of which file corresponds to which chunks, facilitating data location and retrieval.
  2. Chunk Handle Information:

    • Each chunk handle in the array contains several key pieces of metadata that describe the state and location of the chunk in the system:
      1. List of Chunkservers:
        • The master keeps a list of chunkservers that store replicas of the chunk. This list allows the master to direct clients to specific chunkservers for reading or writing.
      2. Chunk Version Number:
        • The version number of the chunk is critical for ensuring data consistency. If a chunkserver goes down and misses some updates (mutations), its version number will lag behind the up-to-date replicas. When the chunkserver comes back online, the master uses the version number to detect if the chunk is stale and to prevent stale data from being used.
        • The master increments the version number whenever a new lease is granted for the chunk to ensure all replicas are synchronized.
      3. Primary Chunkserver:
        • For each chunk, the master identifies which chunkserver is the primary for mutation (write) operations. The primary is responsible for ordering writes and ensuring consistency across replicas.
        • The primary is selected when a client writes to the chunk, and this information is cached by the client to avoid repeated interactions with the master.
      4. Lease Expiration Time:
        • The master grants a lease to one of the chunk replicas (the primary) for coordinating writes. The lease ensures that all mutations to the chunk are ordered consistently. Leases typically last for 60 seconds, but they can be extended as long as the chunk is being actively written.
        • The master tracks the lease expiration time, and if the lease expires or needs to be revoked (e.g., for a snapshot operation), the master will update the lease information.
  3. Replication Information:

    • For each chunk, the master stores information about the replication level (default is three replicas) and the current locations of the replicas (the chunkservers storing them). This allows the master to:
      • Ensure that the correct number of replicas are maintained.
      • Respond to client requests by directing them to the appropriate chunkservers holding valid replicas of a chunk.
    • If a chunk becomes under-replicated (due to a chunkserver failure), the master initiates a re-replication process by copying the chunk to other chunkservers.
  4. Chunk Placement and Load Balancing:

    • The master keeps track of how chunks are distributed across chunkservers. It uses this information to make chunk placement decisions (e.g., when creating a new chunk) to ensure that chunks are evenly spread across different machines and racks for both performance and reliability.
    • The master periodically rebalances chunk replicas if it detects that some chunkservers are overloaded or underutilized.
  5. Operation Log:

    • The master maintains an operation log of all critical metadata updates (file creations, deletions, chunk allocations, etc.). This log is persisted on the master’s local disk and replicated on remote machines for fault tolerance.
    • The operation log serves as the only persistent record of metadata. In case of a failure, the master can recover by replaying the log.
  6. Chunk Garbage Collection and Orphaned Chunks:

    • The master keeps track of orphaned chunks—chunks that are no longer associated with any files (due to file deletions) and need to be garbage collected.
    • Orphaned chunks are identified through periodic namespace scans, and the master informs chunkservers to delete them when they are no longer needed.
  7. Checkpointing for Faster Recovery:

    • To avoid replaying the entire operation log after a failure, the master periodically creates checkpoints of its in-memory metadata. These checkpoints are stored on disk and contain the latest version of the file namespace and chunk mapping.
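
The structures above can be pictured with a minimal sketch like the following (Python, with hypothetical names and simplified fields; the real master keeps equivalent information in compact in-memory structures):

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class ChunkInfo:
    handle: str                    # e.g. "CHUNK_001"
    version: int                   # incremented when a new lease is granted
    replicas: List[str]            # chunkservers currently holding replicas
    primary: Optional[str] = None  # current lease holder, if any
    lease_expiration: float = 0.0  # wall-clock time at which the lease expires

@dataclass
class MasterMetadata:
    # File namespace: full pathname -> ordered list of chunk handles.
    files: Dict[str, List[str]] = field(default_factory=dict)
    # Chunk handle -> per-chunk state (replica locations, version, lease).
    chunks: Dict[str, ChunkInfo] = field(default_factory=dict)

    def chunk_for_offset(self, path: str, offset: int,
                         chunk_size: int = 64 * 2**20) -> ChunkInfo:
        """Translate a (file, byte offset) pair into the chunk covering that offset."""
        handle = self.files[path][offset // chunk_size]
        return self.chunks[handle]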

How the Master Uses Metadata

  1. File Operations:
    • When a client performs an operation like opening a file, the master consults its file-to-chunk mapping to find the relevant chunk handles.
    • For each chunk handle, the master retrieves the list of chunkservers storing replicas of the chunk and returns this information to the client.
    • The master may update the metadata if new chunks are created or if the file is renamed or deleted.
  2. Reading Data:
    • The client first asks the master for the chunk locations corresponding to the file. The master retrieves the list of chunkservers storing the chunk replicas and directs the client to one of the chunkservers (typically the closest one based on network proximity).
    • After this initial request, the client caches the chunk locations, reducing the load on the master for subsequent reads.
  3. Writing Data (Mutation Order):
    • For writes or record appends, the client requests the lease for the relevant chunk from the master. The master grants a lease to one of the replicas, making it the primary chunkserver.
    • The primary chunkserver is responsible for serializing write operations, and it replicates the writes to other chunkservers in the order it receives them, ensuring that all replicas remain consistent.
    • The master keeps track of the lease expiration and primary status for each chunk, ensuring orderly writes and preventing conflicting operations.
  4. Fault Detection and Recovery:
    • The master uses heartbeat messages to continuously monitor the health of chunkservers. If a chunkserver becomes unavailable or reports a stale chunk, the master uses the version number to detect which replicas are up-to-date and to initiate the necessary re-replication to restore the required number of replicas.
    • The master handles garbage collection by marking chunks that are no longer referenced in the namespace for deletion.
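
As a concrete illustration of the read path in items 1 and 2 above, here is a rough sketch (hypothetical master and chunkserver client objects; the cache lifetime is an assumption, since the paper does not fix one):

import time

CACHE_TTL = 60.0  # assumed client-side cache lifetime; not specified by the paper

def read(master, chunkservers, cache, path, chunk_index, byte_range):
    """Read part of one chunk of `path`, caching chunk locations to keep load off the master."""
    entry = cache.get((path, chunk_index))
    if entry is None or time.time() - entry[1] > CACHE_TTL:
        # 1. Ask the master only when no fresh locations are cached.
        replicas = master.lookup(path, chunk_index)   # hypothetical RPC returning chunkserver ids
        cache[(path, chunk_index)] = (replicas, time.time())
    else:
        replicas = entry[0]
    # 2. Read directly from one replica (the paper suggests picking the closest one).
    return chunkservers[replicas[0]].read(path, chunk_index, byte_range)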

Master Recovery Mechanisms

1. Operation Log and Checkpointing

Operation Log:

The operation log is a sequential record of all critical metadata changes in the system. Each entry records an operation that modifies the file system's state. The log entries might include operations like file creation, deletion, chunk creation, and replica updates.

Example of Operation Log Entries:

Imagine the master node's operation log contains entries like the following (simplified for illustration):

[Timestamp: 2023-10-17 10:00:00]
Operation: Create File
File Path: /data/logs/logfile.txt
Assigned Chunk Handles: [CHUNK_001]

[Timestamp: 2023-10-17 10:05:00]
Operation: Create Chunk
Chunk Handle: CHUNK_001
Replica Locations: [ChunkServer1, ChunkServer2, ChunkServer3]

[Timestamp: 2023-10-17 10:10:00]
Operation: Append Data
File Path: /data/logs/logfile.txt
Chunk Handle: CHUNK_001
Data Size: 64 MB
Client ID: Client_123A

[Timestamp: 2023-10-17 10:15:00]
Operation: Create Chunk
Chunk Handle: CHUNK_002
Replica Locations: [ChunkServer2, ChunkServer3, ChunkServer4]

[Timestamp: 2023-10-17 10:20:00]
Operation: Delete File
File Path: /data/logs/oldlog.txt

[Timestamp: 2023-10-17 10:25:00]
Operation: Lease Granted
Chunk Handle: CHUNK_002
Primary ChunkServer: ChunkServer2
Lease Expiration: 2023-10-17 10:30:00

In this example:

  • The master logs the creation of a new file and assigns it a chunk handle.
  • It records the creation of chunks and their replica locations.
  • Append operations are logged with details like file path, chunk handle, data size, and client ID.
  • Deletions and lease grants are also recorded with relevant details.

Checkpointing:

A checkpoint is a snapshot of the master node's in-memory metadata at a specific point in time. It includes the current state of the file system namespace, the mapping of files to chunks, chunk version numbers, lease information, and other necessary metadata.

Example of a Checkpoint Content:

Checkpoint Timestamp: 2023-10-17 10:30:00

File Namespace:
- /data/logs/logfile.txt
  - Chunks: [CHUNK_001, CHUNK_002]
- /data/logs/oldlog.txt (marked for deletion)
  - Chunks: [CHUNK_000]

Chunk Information:
- CHUNK_000
  - Replica Locations: [ChunkServer1, ChunkServer2, ChunkServer3]
  - Version: 2
- CHUNK_001
  - Replica Locations: [ChunkServer1, ChunkServer2, ChunkServer3]
  - Version: 1
- CHUNK_002
  - Replica Locations: [ChunkServer2, ChunkServer3, ChunkServer4]
  - Version: 1
  - Lease Info:
    - Primary: ChunkServer2
    - Lease Expiration: 2023-10-17 10:30:00

Deleted Files:
- /data/logs/oldlog.txt
  - Deletion Timestamp: 2023-10-17 10:20:00
  - Pending Deletion after: 3 days (garbage collection)

Additional Metadata:
- Next Chunk Handle ID: CHUNK_003
- Active Leases:
  - CHUNK_002
    - Primary: ChunkServer2
    - Lease Expiration: 2023-10-17 10:30:00

The checkpoint captures:

  • The file namespace and mappings to chunks.
  • Chunk information, including replica locations and version numbers.
  • Lease information for chunks with active leases.
  • Metadata about deleted files pending garbage collection.
  • A record of the next available chunk handle ID.

Master Recovery Using Operation Log and Checkpoint:

When the master node needs to recover (e.g., after a failure), it performs the following steps:

  1. Load the Most Recent Checkpoint
  2. Replay the Operation Log
  3. Re-initialize State
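
A rough sketch of these three steps (the file formats, field names, and JSON encoding are made up for illustration; the actual log and checkpoint formats are internal to GFS):

import json

def recover_master(checkpoint_path, log_path):
    """Rebuild in-memory metadata: load the latest checkpoint, then replay newer log entries."""
    # 1. Load the most recent checkpoint (assumed here to be a single JSON document).
    with open(checkpoint_path) as f:
        state = json.load(f)          # {"files": {...}, "chunks": {...}, "last_seq": N}

    # 2. Replay only the operation-log entries recorded after the checkpoint.
    with open(log_path) as f:
        for line in f:
            entry = json.loads(line)  # one logged metadata mutation per line
            if entry["seq"] <= state["last_seq"]:
                continue              # already reflected in the checkpoint
            apply_entry(state, entry)

    # 3. Chunk replica locations are not recovered from disk; the master re-learns
    #    them by polling the chunkservers (heartbeats) after this point.
    return state

def apply_entry(state, entry):
    """Apply a single logged operation (create file, create chunk, delete file, ...)."""
    if entry["op"] == "create_file":
        state["files"][entry["path"]] = entry.get("chunks", [])
    elif entry["op"] == "create_chunk":
        state["chunks"][entry["handle"]] = {"version": 1, "replicas": entry["replicas"]}
    elif entry["op"] == "delete_file":
        state["files"].pop(entry["path"], None)
    # ... other operation types (appends, lease grants, renames) elided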

2. Replication of Master Metadata

To ensure durability and availability, the master replicates its operation log and checkpoints to multiple remote machines.

Example of Replication Process:

  • Operation Log Replication:

    Each time the master appends a new entry to the operation log, it writes the entry to its local disk and sends it to multiple remote replicas.

    Operation Log Entry Replication:
    
    - Local Write: Append new log entry to local operation log file.
    - Remote Replication:
      - Send log entry to ReplicaServer1.
      - Send log entry to ReplicaServer2.
    - Confirmation: Wait for acknowledgments from replicas before considering the operation committed.
    
  • Checkpoint Replication:

    When the master creates a new checkpoint, it saves it locally and sends copies to replica servers.

    Checkpoint Replication:
    
    - Local Save: Write checkpoint file to local disk.
    - Remote Replication:
      - Send checkpoint file to ReplicaServer1.
      - Send checkpoint file to ReplicaServer2.
    - Confirmation: Verify successful receipt at replicas.
    

This replication ensures that if the master node's local storage fails, a new master can be started using the replicated logs and checkpoints.
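
A sketch of this commit rule (hypothetical replica objects and RPCs): a metadata mutation is acknowledged to the client only after the log record is durable locally and at the remote replicas.

def append_log_entry(local_log, remote_replicas, entry: bytes) -> bool:
    """Append one operation-log record; commit only after local flush and remote acks."""
    local_log.write(entry)
    local_log.flush()                   # force the record to the master's local disk first
    acks = 0
    for replica in remote_replicas:     # e.g. ReplicaServer1, ReplicaServer2
        if replica.append(entry):       # hypothetical RPC returning True once durable remotely
            acks += 1
    # Only now is the metadata change considered committed and made visible to clients.
    return acks == len(remote_replicas)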

3. Fast Recovery of the Master Node

Example of Master Recovery Steps:

Suppose the master node crashes at 10:35:00. Upon detection, the system initiates recovery:

  1. Startup:
    • A new master process is started on the same or a different machine.
  2. Load Checkpoint:
    • The master loads the last checkpoint created at 10:30:00.
  3. Replay Operation Log:
    • The master reads the operation log entries from 10:30:00 to 10:35:00.

    • Example of entries to replay:

      [Timestamp: 2023-10-17 10:31:00]
      Operation: Append Data
      File Path: /data/logs/logfile.txt
      Chunk Handle: CHUNK_002
      Data Size: 32 MB
      Client ID: Client_124
      
      [Timestamp: 2023-10-17 10:33:00]
      Operation: Lease Renewal
      Chunk Handle: CHUNK_002
      Lease Expiration: 2023-10-17 10:38:00
      
    • The master applies these operations to update the in-memory metadata:

      • Update chunk data for CHUNK_002.
      • Extend the lease expiration for CHUNK_002.
  4. Communicate with Chunkservers:
    • The master sends heartbeat messages to all chunkservers to collect the latest state of chunk replicas.
    • Chunkservers respond with their current chunk inventory, including any updates that occurred during the master's downtime.
  5. Update Chunk Replica Locations:
    • The master updates its metadata with the current locations of chunk replicas as reported by the chunkservers.
  6. Resume Operations:
    • The master is now ready to handle client requests.

4. Shadow Masters for Read-Only Access

While the primary master is recovering, shadow masters can provide read-only access to clients.

Example of Shadow Master Functionality:

  • The shadow master maintains a copy of the metadata by:
    • Periodically reading the replicated operation log and checkpoints.
    • Applying updates to its in-memory metadata structures.
  • During master downtime:
    • Clients can send read requests to the shadow master.
    • The shadow master uses its metadata to direct clients to chunkservers holding the requested data.
  • Limitation:
    • Shadow masters cannot process metadata mutations (e.g., file creations, deletions, or writes) since they are not authorized to modify the file system state.

5. Client Resilience to Master Failures

Clients are designed to handle temporary master unavailability gracefully.

Example of Client Behavior:

  • Caching Metadata:
    • Clients cache file-to-chunk mappings and chunk replica locations.
  • Handling Master Unavailability:
    • If the master is unavailable when a client needs to access metadata:
      • The client waits and retries the request after a short delay.
      • For read operations, if the client has cached metadata, it can continue accessing data directly from chunkservers.
  • Example Scenario:
    • A client wants to read from /data/logs/logfile.txt.
    • The client has cached that CHUNK_002 is part of the file and knows the locations of its replicas.
    • The client sends a read request directly to one of the chunkservers holding CHUNK_002.
    • If the client's cached metadata expires or is invalid (e.g., due to a chunkserver failure), and the master is down, the client waits until the master is back online to refresh the metadata.
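
The retry behavior described above might look like the following sketch (hypothetical client API; the retry count and delay are arbitrary):

import time

def refresh_locations(master, cache, path, chunk_index, retries=5, delay_s=1.0):
    """Return chunk replica locations, tolerating a temporarily unavailable master."""
    for attempt in range(retries):
        try:
            replicas = master.lookup(path, chunk_index)   # hypothetical RPC
            cache[(path, chunk_index)] = replicas
            return replicas
        except ConnectionError:
            # Master down or unreachable: for reads, keep using cached locations if we
            # have them; otherwise wait a short while and retry.
            if (path, chunk_index) in cache:
                return cache[(path, chunk_index)]
            time.sleep(delay_s * (attempt + 1))
    raise RuntimeError("master still unavailable; metadata could not be refreshed")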

6. Fault Detection and Monitoring

External monitoring systems detect master failures and initiate recovery.

Example of Monitoring Process:

  • Heartbeat Monitoring:
    • A monitoring service periodically sends heartbeat messages to the master.
    • If the master does not respond within a specified timeout, it is considered down.
  • Automated Recovery Actions:
    • The monitoring system starts a new master process on the same or a different machine.
    • Updates the system's configuration or DNS entries to point clients to the new master.
  • Notification:
    • Administrators are notified of the master failure and recovery actions taken.

Question: What happens when a user sends a write operation to GFS and the server has received the request but has not yet executed it?

In GFS, write operations go through multiple stages before they are committed. The steps typically include:

  1. Client sends a write request to the master to identify which chunkservers hold replicas of the relevant chunk.
  2. The master grants a lease to one of the chunk replicas, designating it as the primary chunkserver for that chunk.
  3. The primary chunkserver orders the writes and forwards them to other replicas.
  4. Replicas acknowledge the write, and once the primary receives acknowledgments from all replicas, the write is considered complete.
  5. After the primary receives all the acknowledgments from the secondaries, it sends a success confirmation directly to the client.

IMPORTANT: In Google File System (GFS), the primary chunkserver does not send a confirmation message back to the master node after it receives acknowledgment from the other replicas for a write operation.

The master is not involved in step 5. The master’s role is limited to granting the lease and informing the client of the chunkservers that hold the replicas.
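
From the primary chunkserver’s point of view, steps 3–5 can be sketched roughly as follows (hypothetical objects; data pushing and pipelining are omitted). Note that the reply goes to the client, not to the master.

def primary_handle_write(lease, local_replica, secondaries, mutation) -> str:
    """Run one mutation as the primary: order it, apply it, forward it, answer the client."""
    if lease.expired():
        return "ERROR: no valid lease; the client must re-contact the master"

    serial_no = lease.next_serial()              # the primary decides the mutation order
    local_replica.apply(mutation, serial_no)     # apply to the primary's own replica

    acks = [s.apply(mutation, serial_no) for s in secondaries]   # same order at every secondary
    if all(acks):
        return "SUCCESS"   # sent directly back to the client; the master is not notified
    # A failed secondary leaves that region inconsistent; the client is asked to retry,
    # which is what can later produce duplicates or padding in the chunk.
    return "ERROR: retry the mutation"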

Question: What happens if the master node fails between step 1 and step 2?

If the master node fails after the client has received the chunk locations and lease information (Step 1), but before the client sends data to the chunkservers (Step 2), the following key points apply:

1. The Master Is No Longer Involved in the Write Operation:

  • After the client has received the chunkserver locations and the lease information, the master node's job is done for this particular write request. The client already knows:
    • Which chunkserver is the primary.
    • Which chunkservers are holding the replicas.
  • The write operation itself does not require further communication with the master at this point, as the client directly communicates with the chunkservers.

Effect:

The failure of the master node will not immediately impact the write operation, as the client has all the information it needs to proceed.

2. Client Sends Data to Chunkservers as Usual:

  • The client will proceed with Step 2, which involves sending the data to the primary chunkserver and the secondary replicas.
  • The primary chunkserver will still coordinate the write operation and forward it to the secondaries. Once the primary receives acknowledgments from the replicas, it will send a confirmation back to the client.

Effect:

The write operation can still proceed normally without the master being online, provided that the lease is still valid and the chunkservers involved in the write are functional.

3. Lease Expiration and the Role of the Master:

  • The lease granted by the master to the primary chunkserver has a limited duration (typically 60 seconds). During this period, the primary chunkserver is authorized to handle writes for that chunk.
  • Since the master node is down, the lease cannot be renewed if it expires, and the master won't be able to grant new leases for subsequent writes to that chunk until it recovers.

Effect:

  • The current write can proceed as long as the lease is valid and hasn't expired.
  • If the lease expires during or after the write, any subsequent write requests to the same chunk will fail until the master node recovers and can issue a new lease.

4. Client-Side Impact If the Write Completes Successfully:

  • If the write completes successfully (i.e., the primary chunkserver receives acknowledgments from all secondary replicas), the client receives a confirmation directly from the primary.
  • The master is not needed to confirm that the write was successful, so the client will not experience an issue unless there is a further problem (e.g., a failure of the primary chunkserver).

Effect:

The client will receive a confirmation of success even though the master is down, assuming the write completes before the lease expires.

5. What If the Write Fails (Chunkserver Failure or Lease Expiry)?

  • If, during the process, the primary chunkserver or one of the secondary replicas fails, the client will not receive a confirmation that the write succeeded.
  • If the lease expires while the master is down, the write will not be retried by the client automatically, as the lease would need to be renewed by the master.
  • In these cases, the client may retry the write after a timeout, but the write will not succeed until the master node is back online to handle lease management.

Effect:

If any failures occur in the chunkservers or the lease expires, the client’s write operation will fail. The client will have to retry after the master recovers and issues a new lease.

Conclusion

  • The write operation can proceed even if the master node fails, as the client already has the necessary lease and chunkserver information to complete the write.
  • The master is not involved in the actual write operation, so its failure does not immediately affect the success of the write, as long as the lease is still valid.
  • If the lease expires or a chunkserver failure occurs before the write completes, the client will need to retry the operation after the master recovers.
  • Once the master recovers, it will reestablish leases, check for stale replicas, and allow any failed operations to be retried.

Question: Does the master node track content changes on chunkservers?

No, the master node in Google File System (GFS) does not track the actual content changes to a file or directly log every modification made by clients. Instead, the master’s role is primarily focused on metadata management, which includes tasks such as maintaining the file-to-chunk mappings, granting leases for write operations, and managing chunk replication.

Question: But the master node has Append Data log entries; are those write operations that the master stored?

Append Data is indeed a write operation, but there's a key distinction in how append operations are handled by the master node in GFS compared to regular writes.

The append process involves several key steps that include the master node, but the role of the master is limited to metadata management, not tracking the exact data being written.

Step 1: Client Requests the Chunk Locations

  • When a client initiates an append operation, it sends a request to the master node for the location of the last chunk in the file.
  • The master responds with the list of chunkservers that hold replicas of the last chunk.

Step 2: Check for Chunk Space

  • The primary chunkserver (for the last chunk) checks if there is enough space in the chunk to accommodate the append operation.
    • If there is space, the primary handles the append, and the data is written.
    • If there is no space (because GFS chunks are fixed-size, typically 64 MB), the client will need to request a new chunk from the master.

Step 3: New Chunk Allocation (If Necessary)

  • If a new chunk is required, the client asks the master to allocate a new chunk.
    • The master logs the creation of the new chunk and updates the file's chunk list in its metadata.

Step 4: Append to the Chunk

  • The client sends the data to the chunkservers, and the primary chunkserver assigns a serial order to the write and forwards the data to the secondary replicas.
  • Once all replicas have acknowledged the append, the primary confirms the success of the operation to the client.
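
The space check in Step 2 can be sketched like this (hypothetical chunk object; the sizes follow the 64 MB chunk and one-fourth append limit mentioned earlier):

CHUNK_SIZE = 64 * 2**20          # 64 MB chunk
MAX_APPEND = CHUNK_SIZE // 4     # record append limited to one-fourth of a chunk

def primary_record_append(chunk, record: bytes):
    """Decide where a record lands in the current chunk (the primary's view of Step 2)."""
    if len(record) > MAX_APPEND:
        raise ValueError("record too large for a single record append")
    if chunk.used + len(record) > CHUNK_SIZE:
        chunk.pad_to(CHUNK_SIZE)   # pad this chunk (secondaries are told to do the same)
        return None                # signal the client to retry on a fresh chunk
    offset = chunk.used            # the primary chooses the offset
    chunk.write_at(offset, record)
    return offset                  # returned to the client on success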

Question: What Happens If the Master Node Is Down When the Client Needs a New Chunk?

  • The client cannot create a new chunk without the master node because the master is responsible for allocating chunks and assigning chunkservers to store replicas.
  • If the master node is down, the client’s request to create a new chunk will fail, and the client will retry the operation until the master node recovers.
  • The system will be in a blocked state for any new chunk creation or append operations that require additional space until the master node is back online and able to process chunk allocation requests.

Question: What Happens If the Primary Chunkserver Fails After Step 1?

  1. Client Tries to Write to the Primary:
    • The client attempts to send the write or append operation to the primary chunkserver, but since the primary is down, the client will not receive an acknowledgment from the primary or the secondary replicas.
    • After a timeout period, the client will retry the write operation.
  2. Client Recognizes the Primary Is Unresponsive:
    • The client will attempt to communicate with the primary chunkserver multiple times, but if it detects that the primary is unresponsive (e.g., after retrying a few times or after the network connection fails), it recognizes that the write cannot proceed.
  3. Client Recontacts the Master:
    • Once the client detects that the primary chunkserver is down, it contacts the master node to report the failure.
    • The client essentially asks the master to check the lease and potentially designate a new primary chunkserver.
  4. Master Node Detects Primary Failure:
    • The master node, through its regular heartbeat messages with the chunkservers, will likely have already detected that the primary chunkserver has failed.
    • Once the failure is confirmed, the master revokes the lease from the failed primary.
  5. Master Reassigns a New Primary:
    • The master node selects one of the secondary replicas and assigns it as the new primary chunkserver.
    • The master then notifies the client of the new primary chunkserver and updates the lease information.
  6. Client Retries the Write with the New Primary:
    • After receiving the information about the new primary chunkserver, the client retries the write operation.
    • The new primary chunkserver takes over the responsibility of ordering the writes and coordinating with the other replicas to ensure consistency.

Question: What Happens If the Primary Chunkserver Fails Between Step 3 and Step 4?

If the primary chunkserver fails after it has ordered the write and forwarded it to the secondaries (Step 3) but before receiving acknowledgments from all secondary chunkservers (Step 4), the system will follow a well-defined failure recovery process.

Here’s what happens in detail:

1. Secondary Chunkservers Are Waiting for Confirmation

  • After the primary chunkserver forwards the write to the secondary chunkservers, the secondaries apply the write and send back an acknowledgment to the primary.
  • However, in this scenario, the primary chunkserver fails before receiving acknowledgments from the secondaries.

2. Secondary Chunkservers Cannot Confirm Write Completion

  • Since the primary chunkserver has failed, the secondary chunkservers are unable to confirm the successful completion of the write. They can apply the write to their local data, but the final confirmation to the client cannot be sent, as this requires coordination from the primary.

3. Client Does Not Receive Write Confirmation

  • The client is waiting for a confirmation from the primary chunkserver that the write was successful. Since the primary chunkserver has failed and cannot send this confirmation, the client will timeout after waiting for a response.
  • Once the client times out, it will retry the write operation, but before retrying, it will need to verify the state of the primary chunkserver and ensure that it is communicating with the correct chunkserver.

4. Master Node Detects Primary Failure

  • The master node detects the failure of the primary chunkserver through its heartbeat mechanism. The master sends periodic heartbeats to all chunkservers to ensure they are alive.
  • Upon detecting the failure, the master will revoke the lease from the failed primary chunkserver and assign a new primary from the remaining secondary replicas.

5. Master Assigns a New Primary Chunkserver

  • The master promotes one of the secondary chunkservers (which already has the data) to the role of the new primary chunkserver.
  • The master also updates the lease information to reflect the new primary and informs the client about the new primary chunkserver.

6. Client Retries the Write Operation with the New Primary

  • Once the client is aware that the primary has changed, it will retry the write operation with the new primary chunkserver.
  • Since the secondaries already have the data (from Step 3), the new primary will ensure that any partially applied write data is completed and that all replicas remain consistent.

7. Consistency and Data Recovery

Handling Incomplete Writes:
  • Version Numbers: GFS uses version numbers for each chunk to track whether replicas are current. Whenever the master grants a new lease on a chunk, it increments the chunk’s version number. A replica that misses this update (for example, because its chunkserver was down or did not finish the mutation before failing) is left with an older version number; the system flags that replica as stale and initiates re-replication.

    Example:

    • The failed primary forwarded the write to all secondaries, but one of the secondaries did not fully apply it before going offline.
    • When the master grants a new lease to the new primary, it increments the chunk’s version number; the replica that was offline misses the increment and is now stale.
    • The master detects the stale replica from the version numbers the chunkservers report and initiates re-replication to bring all usable replicas up to the current version of the chunk.
Atomicity of Writes:
  • If the write operation was not fully applied at all replicas by the time the primary failed, the client simply retries it. A retried record append may leave padding or duplicate records on some replicas (see the duplicate-handling questions below), but each record is still written at least once as an atomic unit.
  • The new primary ensures that retried operations are applied in the same order at every replica, so that successfully acknowledged records end up consistent across replicas.

Primary Election Procedure During a Write Request

  1. No primary currently exists for the chunk.
  2. Find the most up-to-date replicas: compare the version number stored with the chunk handle on the master against the version number reported by each chunkserver, and wait until an up-to-date replica is found.
  3. Pick replicas to serve as the primary and secondary chunkservers.
  4. The master increments the version number, and the chunkservers record the new version number.
  5. The master grants the lease to the primary.
  6. The master writes the new version number to disk.
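
Steps 2–6 might look roughly like this on the master (hypothetical structures; the 60-second lease follows the typical duration mentioned earlier):

import time

LEASE_SECONDS = 60

def grant_new_lease(chunk, reported_versions):
    """Re-establish a primary for `chunk`, given version numbers reported by its chunkservers."""
    # 2. Keep only replicas whose version is at least as new as the master's record.
    up_to_date = [cs for cs, v in reported_versions.items() if v >= chunk.version]
    if not up_to_date:
        raise RuntimeError("wait: no up-to-date replica is currently reachable")

    # 3. Pick a primary; the remaining up-to-date replicas act as secondaries.
    primary, secondaries = up_to_date[0], up_to_date[1:]

    # 4. Bump the version number; the chunkservers record the new version as well.
    chunk.version += 1

    # 5./6. Grant the lease and persist the new version number before using it.
    chunk.primary = primary
    chunk.lease_expiration = time.time() + LEASE_SECONDS
    chunk.log_version_to_disk()    # hypothetical durable write of the new version
    return primary, secondaries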

Question: How do we ensure that the system time on all the chunkservers is the same? If it is not, how can we make sure there are never two primaries that both hold a lease and each believe they are the primary server?

The key safeguard in Google File System (GFS) is that the primary chunkserver (A in this case) does not act alone. Whenever a client sends a write request to the primary, the primary chunkserver (A) must replicate that write to the secondary chunkservers. This replication step plays a crucial role in preventing conflicting primary chunkservers, especially in scenarios like clock skew.

So even if there is a period during which both chunkserver A and chunkserver B believe they are the primary, when the old primary A forwards write requests to the secondary servers, A will discover that its lease has already expired.

Question: Will the new primary chunk server B (latest data at offset 100) ask other secondary chunk servers (latest data at offset 200) to overwrite inconsistent data at offset 100?

I think yes (not sure; this is not explained in the paper).

  • Before chunkserver B (the new primary) starts writing data, it communicates with all the secondary replicas (including chunkserver A) to ensure the system is in a consistent state.
  • This process involves verifying the current chunk version number and the offset at which the most recent successful write occurred.
  • Chunkserver A, as a secondary now, would report the state of the data it holds. If A had previously written data at offset 100, this information would be part of the state verification.
  • If chunkserver A had written data at offset 100 but it wasn’t fully replicated and acknowledged before its lease expired, that data is considered uncommitted and effectively invalid.
  • When chunkserver B takes over and starts writing from offset 50, the data at offset 100 on chunkserver A is ignored and treated as stale.
  • Chunkserver A will overwrite its stale data at offset 100 with new data as soon as chunkserver B reaches and writes to that offset.

Question: How does GFS chunk server handle write failure?

Duplicated Data in GFS:

  • In GFS, the system is designed to handle potential partial writes or write failures. If a write request is interrupted (e.g., due to a network issue or client failure) after some data has already been written to the chunk, but before the client receives confirmation, the client may retry the write.
  • This retry could result in duplicate data being written to the chunk, but GFS is designed to append data rather than overwrite it. So even if the same data is written again, it will be added to the end of the chunk, leading to duplicated writes at different positions.

Handling Duplicate Writes:

  • GFS doesn't automatically remove duplicate data because the file system is optimized for sequential appends rather than overwrites.
  • If duplication occurs, the client application or higher-level systems typically need to handle detecting and eliminating duplicates during data processing or reading.

Frequency of Duplicated Writes:

  • This situation is relatively uncommon in normal operation but can occur when there are network failures or timeouts during the writing process, causing the client to resend the write request. GFS ensures that, in case of failure, the system remains consistent, but the possibility of redundant data still exists due to retries.

Question: How does GFS handle duplicated data or padding?

In Google File System (GFS), the handling of duplicated data generated by failed write requests involves several key behaviors, but GFS itself does not automatically detect or mark duplicated data at the chunk server level. Here's how the process generally works:

1. Client or Library Awareness of Duplicates:

  • GFS clients or libraries are responsible for tracking the success of write requests.
  • If a write fails and the client retries, the client does not automatically know if the previous data was already successfully written to the chunk server. The client typically resends the write request, potentially causing duplicated data.
  • The GFS system is optimized for append operations, meaning that even if the same data is written twice, it will be appended rather than overwritten. Therefore, GFS itself does not recognize or remove duplicates within the chunk.

2. Marking Duplicated or Invalid Data:

  • The chunk server does not explicitly mark duplicated or invalid data resulting from failed write requests.
  • GFS stores data in a sequential, append-only manner, meaning that even if a failed write results in duplicated data, it’s just written again without any indication in the chunk metadata that it is a duplicate.
  • Duplicate data will exist as valid entries in the chunk unless explicitly removed or filtered by higher-level systems.

3. Reading Data from the Chunk Server:

  • When the client reads data from a chunk, the chunk server sends back all the data in that chunk, including any duplicates caused by failed writes.
  • It is up to the client or application to manage and filter out duplicate data, if necessary, after retrieving it.
  • GFS itself does not differentiate between valid and duplicated data during the reading process. The chunk server treats all written data as valid unless explicitly deleted.

Handling Duplicates in Applications:

  • Many applications built on top of GFS are designed to work with append-only logs, where duplicates might not cause major issues. For applications where duplicated data could be problematic, the application layer must implement logic to:
    • Detect and filter duplicates.
    • Validate data integrity using mechanisms like checksums, timestamps, or version numbers.
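
One common application-level pattern (not part of GFS itself) is for the writer to embed a unique record ID and checksum in each appended record, and for readers to validate and de-duplicate; a minimal sketch:

import zlib

def read_records_dedup(records):
    """Yield each valid logical record once.
    `records` is assumed to be an iterable of (record_id, checksum, payload) tuples
    that the writing application embedded in the data it appended."""
    seen = set()
    for record_id, checksum, payload in records:
        if zlib.crc32(payload) != checksum:
            continue               # padding or a partially written record; drop it
        if record_id in seen:
            continue               # duplicate left behind by a retried append; skip it
        seen.add(record_id)
        yield payload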

Question: How does GFS handle outdated chunks?

1. Handling Stale Chunks:

  • Stale chunks are not directly updated. Instead, the master treats stale replicas as invalid and ignores them. The master does not attempt to synchronize them with the latest version.
  • The master considers the stale replica to be non-existent during normal operations. It will not instruct clients to read from or write to stale replicas.

2. Re-replication:

  • Once a stale replica is detected, the master triggers a re-replication process to ensure that the required number of valid replicas are available.
  • The master instructs a valid chunkserver (i.e., one with an up-to-date replica) to clone its chunk to another chunkserver.
  • This cloning creates a new, valid replica on a different chunkserver, bringing the number of replicas back to the desired replication level.

3. Garbage Collection of Stale Replicas:

  • The master periodically removes stale replicas during garbage collection. These stale replicas are eventually deleted from the chunkservers to free up space.
  • Before garbage collection, the master effectively treats stale replicas as if they do not exist, meaning they are not part of the active replication set for any chunk.
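
A sketch of this stale-replica handling on the master, assuming chunkservers report their chunk versions in heartbeats (hypothetical structures, mirroring the metadata sketch earlier):

def handle_heartbeat(master_chunks, chunkserver_id, reported):
    """`reported` maps chunk handle -> version number stored on this chunkserver."""
    stale = []
    for handle, version in reported.items():
        chunk = master_chunks.get(handle)
        if chunk is None:
            continue               # not referenced by any file: garbage-collected later
        if version < chunk.version:
            # Stale replica: drop it from the usable replica set. A clone from an
            # up-to-date replica restores the replication level, and the stale copy
            # itself is removed by regular garbage collection.
            if chunkserver_id in chunk.replicas:
                chunk.replicas.remove(chunkserver_id)
            stale.append(handle)
        elif version > chunk.version:
            # The master's record is behind (e.g. it crashed right after granting a
            # lease); it adopts the higher version as current.
            chunk.version = version
    return stale                   # handles whose replication level needs restoring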

Question: What Does "Incrementally Update the Checksum" Mean? How does the checksum work?

When you append data to a chunk, the existing chunk may have a partial block at the end (a block that hasn't been completely filled). Instead of recalculating the checksum for the entire block from scratch, GFS updates the existing checksum by processing only the newly appended data.

  • GFS divides the data in each chunk into checksum blocks (typically 64 KB each).
  • Each checksum block has its own checksum value, which is stored separately from the data. When the data in a block changes, GFS updates the checksum for that specific block.
  • If you append data to a chunk, it will only affect the checksum of the last partial checksum block (if there is one) or create new checksum blocks for newly added data.
  • new_checksum = func(old_checksum, new_data)
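
With a checksum such as CRC32 this incremental update is directly supported; a sketch in Python (zlib’s crc32 accepts the previous checksum as a starting value, and the 64 KB blocks follow the paper):

import zlib

BLOCK = 64 * 1024    # checksum block size (64 KB)

def append_with_checksums(blocks, checksums, data: bytes):
    """blocks: list of byte blocks (the last may be partial); checksums: one CRC32 per block."""
    while data:
        if blocks and len(blocks[-1]) < BLOCK:
            # Fill the partial last block and update only its checksum incrementally:
            # new_checksum = crc32(new_data, old_checksum).
            space = BLOCK - len(blocks[-1])
            piece, data = data[:space], data[space:]
            checksums[-1] = zlib.crc32(piece, checksums[-1])
            blocks[-1] += piece
        else:
            # Start a brand-new block with a fresh checksum.
            piece, data = data[:BLOCK], data[BLOCK:]
            blocks.append(piece)
            checksums.append(zlib.crc32(piece))
    return blocks, checksums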

Paper reading: GFS