Hello, https://github.com/fzdwx

GFS

  1. In order to achieve parallel data reading for performance, data is divided and stored on a large number of servers, which is called sharding.
  2. However, errors will inevitably occur on thousands of machines, so fault tolerance is necessary.
  3. The simplest way to achieve fault tolerance is through replication, where one replica is switched to another when a failure occurs.
  4. With replication, however, the replicas can drift out of sync unless great care is taken, so data inconsistency becomes a new problem.
  5. If consistency is desired, additional interactions are required to ensure consistency, resulting in low performance, which is contrary to our initial expectations.

{{< block type="tip">}}
So strong consistency tends to come at the cost of performance.
{{< /block >}}

Design Goals#

  1. Since GFS is built on a large number of computers, which are prone to failures, it is necessary to perform checks, fault tolerance, and fast recovery from failures.
  2. It primarily targets large files (e.g., files several gigabytes in size); small files are supported but not specifically optimized for.
  3. The workload mainly consists of two types of reads: large streaming reads and small random reads.
    Performance-conscious applications often batch and sort their small reads so that they advance steadily through the file rather than jump back and forth.
    • Large streaming reads typically read hundreds of kilobytes, often 1 MB or more, per operation, and successive operations from the same client usually read through a contiguous region of a file.
    • Small random reads typically read a few kilobytes at some arbitrary offset in the file.
  4. Once a file in GFS has been written, it is rarely modified again; the write workload is dominated by large sequential writes that append data, while small writes at arbitrary positions are supported but do not have to be efficient.
  5. GFS must provide efficient, well-defined semantics for multiple clients appending to the same file concurrently (atomic record append), since many clients often append to the same file in parallel.
  6. High sustained bandwidth matters more than low latency: most of the target applications emphasize high-speed bulk processing of data, and few have strict response-time requirements for individual reads or writes.

Architecture#

  1. Single master, multiple chunk servers (which store the actual files), and multiple clients.
  2. Each file is divided into chunks of a certain size (64MB), and each chunk has a unique 64-bit identifier (chunk handle).
  3. Each chunk is replicated on different chunk servers (default is 3 replicas), and users can specify different replication levels.
  4. The master manages metadata, such as the mapping between files and chunks, and the location information of chunks.
  5. The master also handles chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunk servers.
  6. The master periodically communicates with each chunk server through heartbeat messages, which it uses to send instructions and collect the chunk server's state (a rough sketch of such a heartbeat message follows this list).
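
As a rough illustration of item 6, the heartbeat exchange might carry something like the following. The type and field names (`HeartbeatArgs`, `HeartbeatReply`, and so on) are invented for this sketch and are not from the paper:

```go
package gfs

// HeartbeatArgs is what a chunk server might report to the master on each
// heartbeat: which chunks it holds and at which version.
type HeartbeatArgs struct {
	ServerAddr string            // address of the reporting chunk server
	Chunks     map[uint64]uint32 // chunk handle -> version number held locally
}

// HeartbeatReply is what the master might send back: housekeeping
// instructions, e.g. which chunks to delete (stale or orphaned ones).
type HeartbeatReply struct {
	ChunksToDelete []uint64 // chunk handles the server should garbage-collect
}
```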

Metadata in GFS Master#

  1. filename -> chunk ids (chunk handles), NV (NV = non-volatile: also persisted in the on-disk log)
  2. Mapping between chunk handles and chunk data
    • Which server the chunk is stored on (chunk server list)
    • Chunk version number NV
    • Primary chunk server for the chunk, where write operations are performed
    • Lease expiration for the primary chunk server

Both of these tables live in the master's memory; for fault tolerance (so that the data survives a restart), they are also written to a log on disk. Reads are served from memory, while every metadata change is applied to memory and appended to the on-disk log. Periodically (when the log grows beyond a certain size) the master writes a checkpoint (similar to a snapshot), so that recovery does not have to replay the log from the very beginning.
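
A minimal sketch of how these two tables and the log handle might look, continuing the invented `gfs` package from the heartbeat sketch above; the real master is of course far more involved:

```go
package gfs

import (
	"os"
	"time"
)

// ChunkInfo is the per-chunk state the master keeps in memory. Fields marked
// NV are also written to the on-disk log so they survive a restart; the rest
// is volatile and can be rebuilt from chunk server reports.
type ChunkInfo struct {
	Version     uint32    // NV: chunk version number
	Servers     []string  // chunk servers currently holding a replica (volatile)
	Primary     string    // current primary, "" if no lease is outstanding (volatile)
	LeaseExpiry time.Time // when the primary's lease runs out (volatile)
}

// Master holds the two tables from the list above plus the metadata log.
type Master struct {
	files  map[string][]uint64   // NV: filename -> ordered chunk handles
	chunks map[uint64]*ChunkInfo // chunk handle -> chunk state
	wal    *os.File              // append-only metadata log, checkpointed periodically
}
```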

GFS Read Steps#

  1. The client has a filename and the position it wants to read (an offset), and sends both to the master (a rough client-side sketch of the whole read path follows this list).
  2. Upon receiving the request, the master looks up the chunk handles for that filename; since every chunk has a fixed size, the offset directly determines which chunk handle the read starts in.
  3. It then finds the list of chunk servers that store the data corresponding to the chunk handles and returns it to the client.
  4. The client can choose a server to read from (the paper mentions selecting the nearest server, as Google's IP addresses are contiguous and proximity can be determined based on IP). Since the client only reads 1MB or 64KB of data each time, it caches the relationship between chunks and chunk servers so that it doesn't have to request it every time.
  5. When the chunk server receives the request, it finds the corresponding chunk based on the chunk handle (it is speculated that the chunk is named based on the chunk handle) and retrieves the data corresponding to the offset for the client.
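
Putting the steps together, the client side of a read might look roughly like this; the caching layout, helper names, and the single-chunk assumption are all simplifications for this sketch:

```go
package gfs

import "fmt"

const chunkSize = 64 << 20 // 64 MB per chunk

// Client caches the chunk layout it has learned from the master so that
// repeated reads of the same region do not need another master round trip.
type Client struct {
	cache map[string]map[int64]chunkLocation // filename -> chunk index -> location
}

type chunkLocation struct {
	handle  uint64   // globally unique chunk handle
	servers []string // chunk servers holding replicas of this chunk
}

// Read fetches length bytes of file starting at offset, assuming the range
// stays inside a single chunk (the cross-chunk case is discussed below).
func (c *Client) Read(file string, offset, length int64) ([]byte, error) {
	index := offset / chunkSize // fixed chunk size, so the chunk index is a simple division

	loc, ok := c.cache[file][index]
	if !ok {
		// Not cached: this is where the client would ask the master for the
		// chunk handle and chunk server list, then cache the answer.
		return nil, fmt.Errorf("no cached location for %s chunk %d", file, index)
	}

	// Read from one replica (the paper picks the closest one); the offset
	// inside the chunk is the remainder after dividing by the chunk size.
	return readFromChunkServer(loc.servers[0], loc.handle, offset%chunkSize, length)
}

// readFromChunkServer stands in for the RPC to a chunk server.
func readFromChunkServer(addr string, handle uint64, off, n int64) ([]byte, error) {
	return nil, fmt.Errorf("RPC to %s for chunk %d not implemented in this sketch", addr, handle)
}
```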

q1: What happens if the data to be read spans multiple chunks?#

For example, if the client wants to read more than 64 MB, or even just 2 bytes that happen to straddle a chunk boundary, the client library notices that the range crosses a chunk boundary and splits it into two requests, so the master may receive two read requests and the data ends up being read from different chunk servers.
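
Continuing the client sketch above, the boundary check is just arithmetic on the fixed 64 MB chunk size; `splitByChunk` below is an illustrative helper, not part of GFS itself:

```go
package gfs

// rangeInChunk describes the part of a read that falls inside one chunk.
type rangeInChunk struct {
	ChunkIndex  int64 // which chunk of the file
	OffsetInChk int64 // starting offset inside that chunk
	Length      int64 // how many bytes to read from that chunk
}

// splitByChunk splits the byte range [offset, offset+length) into per-chunk
// sub-ranges, one for each chunk the range overlaps.
func splitByChunk(offset, length int64) []rangeInChunk {
	var parts []rangeInChunk
	for length > 0 {
		index := offset / chunkSize
		inChunk := offset % chunkSize
		n := chunkSize - inChunk // bytes left in this chunk
		if n > length {
			n = length
		}
		parts = append(parts, rangeInChunk{index, inChunk, n})
		offset += n
		length -= n
	}
	return parts
}
```

For instance, `splitByChunk(64<<20-1, 2)` yields two parts: the last byte of chunk 0 and the first byte of chunk 1, each of which becomes its own request.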

Consistency of Order of Changes Among Multiple Replicas#

For a given chunk:

  1. The master grants a lease on the chunk to one of the replicas, which becomes the primary, for a certain period of time (60 seconds). This lease is known as the primary lease.
  2. The primary server orders all change operations (serial order), and the other secondary servers make changes based on this order.
  3. As long as the chunk is being changed, the primary server can request an extension of the lease from the master.
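
Using the Master and ChunkInfo types sketched earlier, the master-side lease bookkeeping might look roughly like this; only the 60-second figure comes from the text above, the rest is illustrative:

```go
package gfs

import "time"

const leaseDuration = 60 * time.Second // lease term mentioned above

// grantOrRenewLease returns the current primary for a chunk, granting a new
// lease if none is outstanding (or the old one has expired) and renewing it
// otherwise.
func (m *Master) grantOrRenewLease(handle uint64) (primary string) {
	ci := m.chunks[handle]
	now := time.Now()

	if ci.Primary == "" || now.After(ci.LeaseExpiry) {
		// No valid lease: pick one replica (here simply the first) as primary
		// and bump the chunk version (see "Expired Replica Detection" below).
		ci.Primary = ci.Servers[0]
		ci.Version++
	}
	ci.LeaseExpiry = now.Add(leaseDuration)
	return ci.Primary
}
```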

GFS Write Steps#

  1. The client asks the master for the list of chunk servers holding the chunk (primary and secondaries). If no primary currently holds a lease, the master grants one to a replica of its choice, making it the primary.
  2. Once the client obtains the list of chunk servers, it caches it and only requests it again if the primary does not respond or the lease expires.
  3. The client pushes the data to all replicas. The client does not guarantee the order of the pushes. Each chunk server saves the data in its internal LRU cache until it is used or expires.
  4. When all replicas have received the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all replicas. The primary assigns consecutive serial numbers to the mutations it receives (possibly from multiple clients) and applies them to its own local state in that order.
  5. The primary then forwards the write request to all secondaries, telling them to apply the mutations in the same serial order.
  6. Each secondary applies the mutations in that order and reports completion back to the primary.
  7. The primary replies to the client, reporting any errors encountered at any replica. When an error occurs, the write may still have succeeded at the primary and some subset of the secondaries (if it had failed at the primary itself, it would never have been forwarded to the secondaries at all). The client treats such a request as failed and handles it by retrying, i.e., repeating steps 3-7 a few times (a sketch of the primary's side of this protocol follows this list).
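
A sketch of steps 4-6 from the primary's point of view; the serial-number assignment is the essential detail, while the RPC plumbing, the LRU data cache, and all names below are invented for illustration:

```go
package gfs

import "sync"

// Primary is the per-chunk state on the chunk server that currently holds the lease.
type Primary struct {
	mu         sync.Mutex
	nextSerial uint64 // next serial number to hand out
}

// Mutation identifies data the client already pushed to every replica (step 3)
// together with the order in which it must be applied.
type Mutation struct {
	DataID string // identifies the buffered data in each replica's LRU cache
	Serial uint64 // order assigned by the primary
}

// handleWrite implements steps 4-6: assign a serial number, apply locally,
// then forward the same order to every secondary.
func (p *Primary) handleWrite(dataID string, secondaries []string) error {
	p.mu.Lock()
	m := Mutation{DataID: dataID, Serial: p.nextSerial}
	p.nextSerial++
	p.mu.Unlock()

	if err := applyLocally(m); err != nil {
		// Failed at the primary: the mutation is never forwarded to the secondaries.
		return err
	}
	for _, s := range secondaries {
		if err := forwardToSecondary(s, m); err != nil {
			// Reported back to the client, which retries steps 3-7.
			return err
		}
	}
	return nil
}

func applyLocally(m Mutation) error                    { return nil } // stand-in for the local apply
func forwardToSecondary(addr string, m Mutation) error { return nil } // stand-in for the RPC
```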

GFS Atomic Record Appends#

{{< block type="tip" title="Concurrent writes to the same region are not serializable">}}
This region may eventually contain data fragments from multiple clients.
{{< /block >}}

Record append is an atomic append operation: the client specifies only the data, and GFS appends it to the file atomically at least once, at an offset of GFS's own choosing (an attempt may fail and some chunk servers may already hold the data, which is why the client cannot fix the offset in advance), and returns that offset to the client. It is similar to writing with O_APPEND, but guarantees atomicity even with concurrent appenders. Record append follows the GFS Write Steps above, with a few special considerations:

  1. After the client pushes all the data, the primary checks if appending to the chunk would exceed the maximum size of a single chunk.
  2. If it would exceed the maximum size, the primary pads the current chunk to its full size (the secondaries do the same) and replies to the client that the operation should be retried on the next chunk. Record appends are limited to one quarter of the maximum chunk size so that worst-case fragmentation stays acceptable.
  3. If it does not exceed the maximum size, the primary appends the data as in a normal write and returns the resulting offset to the client (see the sketch after this list).
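
A sketch of the primary's decision during a record append; the one-quarter limit and the padding step are described above, while the function itself and its parameters are invented for illustration:

```go
package gfs

import "errors"

const maxAppendSize = chunkSize / 4 // a record may be at most 1/4 of a chunk

// ErrRetryNextChunk tells the client to retry the append on the next chunk.
var ErrRetryNextChunk = errors.New("chunk full: retry the append on the next chunk")

// recordAppend decides where an appended record lands in the chunk the primary
// currently manages; chunkUsed is how many bytes of the chunk are already in use.
func recordAppend(chunkUsed, recordLen int64) (offset int64, err error) {
	if recordLen > maxAppendSize {
		return 0, errors.New("record larger than 1/4 of a chunk")
	}
	if chunkUsed+recordLen > chunkSize {
		// Pad the current chunk to its maximum size (every replica does the
		// same) and ask the client to retry on a new chunk.
		return 0, ErrRetryNextChunk
	}
	// Enough room: the offset is simply the current end of the chunk; the data
	// is appended there on every replica and the offset is returned to the client.
	return chunkUsed, nil
}
```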

Expired Replica Detection#

A chunk replica can become expired (stale) if its chunk server crashes or is down and therefore misses some mutations. For each chunk, the master maintains a version number to distinguish up-to-date replicas from expired ones.

Whenever the master grants a new lease on a chunk to a primary, it increases the chunk version number and notifies all up-to-date replicas to record the new version.

As long as the replicas are consistent, the master and all replicas record the same version number (this happens before the client is notified, and therefore before the client can start writing to the chunk).

When a chunk server restarts and reports its chunks and their version numbers, the master checks whether any of its replicas are expired. If a reported version number is higher than the master's record, the master assumes it failed while granting the lease and adopts the higher version as the current one.

The master removes expired replicas during its periodic garbage collection. Until then, it simply never includes an expired replica when answering clients' requests for chunk information, so the replica is effectively invisible.

When a client retrieves the list of chunk servers from the master, it includes the version number, so it can compare and select the newest replica for operations.
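
Tying this back to the earlier sketches, the version check on a chunk server's report might look roughly like this (HeartbeatArgs and Master are the invented types from above):

```go
package gfs

// checkReportedVersions compares the versions a restarting chunk server reports
// against the master's records: replicas that are behind are flagged as expired,
// and a version newer than the master's record is adopted as the current one.
func (m *Master) checkReportedVersions(report HeartbeatArgs) (expired []uint64) {
	for handle, reported := range report.Chunks {
		ci, ok := m.chunks[handle]
		if !ok {
			continue // unknown chunk: left for garbage collection
		}
		switch {
		case reported < ci.Version:
			// The replica missed mutations while its server was down; it is
			// expired and will be removed by periodic garbage collection.
			expired = append(expired, handle)
		case reported > ci.Version:
			// The master's record is behind (e.g. it failed after granting a
			// lease but before persisting the new version), so it adopts the
			// higher version as current.
			ci.Version = reported
		}
	}
	return expired
}
```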

Summary#

GFS is not a fully qualified distributed system with multiple replicas, multi-site deployment, high availability, and self-repairing fault handling.

  1. GFS paper (original)
  2. GFS paper (Chinese translation)
  3. GFS video
  4. GFS video translation
  5. Bad Replication Design