File System

1. Basic Components of the File System#

The Linux file system allocates two data structures for each file: inode and directory entry, which are mainly used to record the file's metadata and directory hierarchy.

  • The inode is used to record the file's metadata, such as inode number, file size, access permissions, creation time, modification time, data location on disk, etc. The inode is the unique identifier for a file, with a one-to-one correspondence between them, and they are also stored on the hard disk, so inodes also occupy disk space.
  • The directory entry is used to record the file's name, inode pointer, and hierarchical relationships with other directory entries. When multiple directory entries are linked together, they form a directory structure. However, unlike inodes, directory entries are a data structure maintained by the kernel, not stored on disk, but cached in memory.

Since inodes uniquely identify a file and directory entries record the file's name, the relationship between directory entries and inodes is many-to-one, meaning a file can have multiple aliases. For example, the implementation of hard links allows multiple directory entries' inodes to point to the same file.
Note that a directory is also a file, uniquely identified by an inode. The difference from regular files is that a regular file's blocks on disk hold file data, while a directory file's blocks hold the entries (names and inode numbers) of the files and subdirectories it contains.

Are directory entries and directories the same thing?

Although their names are very similar, they are not the same. A directory is a file that is persistently stored on disk, while a directory entry is a data structure in the kernel, cached in memory.
If a directory is frequently queried from disk, the efficiency will be very low, so the kernel caches directories that have been read using the directory entry data structure. The next time the same directory is read, it can be read from memory, greatly improving the efficiency of the file system.
Note that the directory entry data structure not only represents directories but can also represent files.

How is file data stored on disk?

The smallest unit of disk read/write is a sector, which is only 512B in size. Clearly, if every read/write operation is done at such a small unit, the efficiency will be very low.
Therefore, the file system groups multiple sectors into a logical block, with the minimum unit of read/write being the logical block (data block). The logical block size in Linux is 4KB, meaning that it reads/writes 8 sectors at a time, which greatly improves the efficiency of disk read/write operations.
The above describes the relationship between inodes, directory entries, and file data. The following diagram illustrates their relationships well:
image

Inodes are stored on the hard disk, so to speed up file access, they are usually loaded into memory.
Additionally, when a disk is formatted, it is divided into three storage areas: the superblock, inode area, and data block area.

  • The superblock stores detailed information about the file system, such as the number of blocks, block size, free blocks, etc.
  • The inode area is used to store inodes;
  • The data block area is used to store file or directory data;

We cannot load all superblocks and inode areas into memory, as this would overwhelm the memory. Therefore, they are only loaded into memory when needed, and the timing of their loading is different:

  • Superblock: Loaded into memory when the file system is mounted;
  • Inode area: Loaded into memory when a file is accessed;

2. Virtual File System#

There are many types of file systems, and the operating system wants to provide a unified interface to users, so an intermediate layer is introduced between the user layer and the file system layer, called the Virtual File System (VFS).
VFS defines a set of data structures and standard interfaces supported by all file systems, allowing programmers to work without needing to understand the workings of the file system, only needing to know the unified interface provided by VFS.
In the Linux file system, the relationships between user space, system calls, virtual file systems, caches, file systems, and storage are shown in the following diagram:
Linux supports a variety of file systems, which can be divided into three categories based on storage location:

  • Disk file systems, which store data directly on disk, such as Ext2/3/4, XFS, etc.
  • Memory file systems, which do not store data on the hard disk but occupy memory space. Commonly used /proc and /sys file systems belong to this category, where reading and writing these files actually reads and writes related data in the kernel.
  • Network file systems, which are used to access data from other computer hosts, such as NFS, SMB, etc.

File systems must first be mounted to a directory to be used normally. For example, during the startup of a Linux system, the file system is mounted to the root directory.

3. File Usage#

From the user's perspective, how do we use files? First, we need to open a file through a system call.
write process
The process of writing a file is as follows (a minimal C sketch follows the list):

  • First, use the open system call to open the file, with the parameters of open including the file path and name.
  • Use write to write data, where write uses the file descriptor returned by open, not the file name as a parameter.
  • After using the file, the close system call must be used to close the file to avoid resource leakage.
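
A minimal sketch of these three steps in C (the path and message are hypothetical):

#include <fcntl.h>     // open, O_WRONLY, O_CREAT, O_TRUNC
#include <unistd.h>    // write, close
#include <stdio.h>     // perror

int main(void) {
    // open() takes the file path; it returns a file descriptor (or -1 on error).
    int fd = open("/tmp/demo.txt", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { perror("open"); return 1; }

    // write() takes the file descriptor returned by open(), not the file name.
    const char msg[] = "hello, file system\n";
    if (write(fd, msg, sizeof(msg) - 1) < 0) perror("write");

    // close() releases the descriptor so the open-file-table entry is not leaked.
    close(fd);
    return 0;
}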

Once we open a file, the operating system tracks all files opened by the process. This tracking means that the operating system maintains an open file table for each process, with each entry representing a file descriptor, making the file descriptor the identifier for the opened file.
The operating system maintains the state and information of opened files in the open file table:

  • File pointer: The system tracks the last read/write position as the current file position pointer, which is unique for a specific process opening the file;
  • File open counter: When a file is closed, the operating system must reuse its open file table entry; otherwise, there won't be enough space in the table. Since multiple processes may open the same file, the system must wait for the last process to close the file before deleting the open file entry. This counter tracks the number of opens and closes, and when it reaches 0, the system closes the file and deletes the entry;
  • File disk location: Most file operations require the system to modify file data, and this information is kept in memory to avoid reading from disk for every operation;
  • Access permissions: Each process opening a file needs an access mode (create, read-only, read-write, append, etc.), and this information is stored in the process's open file table so that the operating system can allow or deny subsequent I/O requests;

From the user's perspective, a file is a persistent data structure, but the operating system does not care about any data structure you want to store on disk. The operating system's perspective is how to map file data to disk blocks.
Thus, there are differences in the read/write operations of files between users and the operating system. Users are accustomed to reading and writing files in bytes, while the operating system reads and writes files in data blocks. The file system's job is to mask this difference.
Let's look at the processes of reading and writing files:

  • When a user process reads 1 byte of data from a file, the file system needs to obtain the data block where the byte is located and then return the corresponding data portion to the user process.
  • When a user process writes 1 byte of data to a file, the file system finds the position of the data block where the data needs to be written, modifies the corresponding part in the data block, and finally writes the data block back to disk.
Therefore, the basic operational unit of the file system is the data block.

4. File Storage#

1. Contiguous Space Storage Method#

The contiguous space storage method, as the name suggests, stores files in "contiguous" physical space on the disk. In this mode, the data of the file is tightly connected, and the read/write efficiency is very high because an entire file can be read out with one disk seek.
Using contiguous storage requires knowing the size of a file in advance so that the file system can find a contiguous space on the disk to allocate for the file based on its size.
Thus, the file header needs to record the "starting block position" and "length"; these two fields are enough to describe a file stored in a contiguous region of disk.
Note that the file header mentioned here is similar to the inode in Linux.

Contiguous Space Storage Method

Although the contiguous storage method has high read/write efficiency, it has the drawbacks of "disk space fragmentation" and "difficulty in expanding file length."

2. Non-contiguous Space Storage Method#

Linked List Method#

The linked list storage method is discrete and does not require continuity, thus eliminating disk fragmentation, greatly improving disk space utilization, and allowing dynamic expansion of file length.
Implicit Linked List
The implementation requires the file header to include the positions of the "first block" and "last block," and each data block leaves space for a pointer to store the position of the next data block. This way, one data block is linked to another, and starting from the head of the chain, all data blocks can be found by following the pointers, so the storage method can be non-contiguous.
Implicit Linked List
The disadvantage of the implicit linked list storage method is that it cannot directly access data blocks and can only access files sequentially through pointers, and the data block pointers consume some storage space. The stability of implicit linked allocation is relatively poor; if pointers in the linked list are lost or damaged due to software or hardware errors during system operation, it can lead to data loss.
Explicit Linking
This means the pointers used to link the data blocks of a file are explicitly stored in a link table in memory, with only one such table set for the entire disk. Each table entry stores a link pointer pointing to the next data block number.
For explicit linking, consider an example where file A uses disk blocks 4, 7, 2, 10, and 12 in sequence, and file B uses disk blocks 6, 3, 11, and 14. Using the table in the diagram below, you can start from block 4 and follow the chain to find all the disk blocks of file A. Similarly, starting from block 6, you can follow the chain to find all the disk blocks of file B. Finally, both chains end with a special marker that does not belong to a valid disk number (e.g., -1). This table in memory is called the File Allocation Table (FAT).
Explicit Linking

Since the lookup process is done in memory, it not only significantly improves retrieval speed but also greatly reduces the number of disk accesses. However, because the entire table is stored in memory, its main disadvantage is that it is not suitable for large disks.
For example, for a 200GB disk with a block size of 1KB, this table would need 200 million entries, each corresponding to one of the 200 million disk blocks. If each entry requires 4 bytes, this table would occupy 800MB of memory, which is clearly not suitable for large disks.
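
To make the chain lookup concrete, here is a toy sketch of walking a FAT-style table, mirroring the example above (the table size and the -1 end-of-chain marker are assumptions):

#include <stdio.h>

#define NBLOCKS 16
#define EOC     (-1)   /* assumed end-of-chain marker */

/* fat[i] holds the number of the block that follows block i, or EOC.
   This mirrors the example: file A occupies blocks 4 -> 7 -> 2 -> 10 -> 12,
   file B occupies blocks 6 -> 3 -> 11 -> 14. */
static int fat[NBLOCKS] = {
    [4] = 7, [7] = 2, [2] = 10, [10] = 12, [12] = EOC,
    [6] = 3, [3] = 11, [11] = 14, [14] = EOC,
};

static void print_chain(int first_block) {
    /* The whole walk happens in memory; the disk is only touched when
       the data of each block is actually read. */
    for (int b = first_block; b != EOC; b = fat[b])
        printf("%d ", b);
    printf("\n");
}

int main(void) {
    print_chain(4);   /* blocks of file A */
    print_chain(6);   /* blocks of file B */
    return 0;
}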

Indexing Method#

The indexing method creates an index data block for each file, which contains a list of pointers to the file's data blocks, similar to a book's table of contents: you find a chapter's content by looking it up in the table of contents.
Additionally, the file header needs to include a pointer to the "index data block," allowing the file header to know the location of the index data block and then find the corresponding data block using the index information.
When a file is created, all pointers in the index block are set to empty. When the first write occurs to the i-th block, a block is obtained from free space, and its address is written to the i-th entry of the index block.

Indexing Method
The advantages of the indexing method include:

  • Creating, enlarging, and shrinking files is very convenient;
  • There will be no fragmentation issues;
  • Supports sequential and random read/write;

Since index data is also stored in disk blocks, if a file is very small and can fit in one block, an additional block is still needed to store index data, so one of the drawbacks is the overhead of storing the index.
Chained Index Blocks
If a file is very large, to the extent that one index data block cannot hold all the index information, how should we handle the storage of large files? We can use a combined approach to manage large file storage.
First, let's look at the combination of linked lists and indexing, referred to as chained index blocks. Its implementation involves leaving a pointer in the index data block to store the next index data block, allowing the next index data block's information to be found when the current index data block's index information is exhausted. However, this method may encounter the issues mentioned earlier regarding linked lists; if a pointer is damaged, subsequent data may become unreadable.
Chained Index Blocks
Another combination is indexing plus indexing, referred to as multi-level index blocks, where an index block holds pointers to further index blocks, creating layer upon layer of indexes, like Russian nesting dolls.
Multi-level Index Blocks

3. Implementation of Unix Files#

The storage method varies based on the size of the file:

  • If the number of data blocks required to store the file is less than 10, direct lookup is used;
  • If the number of data blocks required exceeds 10, a single indirect indexing method is used;
  • If both of the previous methods are insufficient to store a large file, a double indirect indexing method is used;
  • If the double indirect indexing is also insufficient, a triple indirect indexing method is used;

Thus, the file header (inode) needs to contain 13 pointers:

  • 10 direct pointers to data blocks;
  • an 11th pointer to a single indirect index block;
  • a 12th pointer to a double indirect index block;
  • a 13th pointer to a triple indirect index block;

This method can flexibly support the storage of both small and large files:

  • For small files, direct lookup reduces the overhead of index data blocks;
  • For large files, multi-level indexing is used, so accessing data blocks for large files requires many queries;
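
As a rough illustration of why this layout scales from small to large files, the sketch below computes how much data each level of pointers can address, assuming typical 4 KB blocks and 4-byte block pointers (numbers the text itself does not specify):

#include <stdio.h>

int main(void) {
    /* Assumptions: 4 KB data blocks and 4-byte block pointers,
       so one index block holds 1024 pointers. */
    const unsigned long long block = 4096;
    const unsigned long long ptrs  = block / 4;              /* 1024 */

    unsigned long long direct = 10 * block;                  /* 10 direct pointers  */
    unsigned long long single = ptrs * block;                /* 11th pointer        */
    unsigned long long dbl    = ptrs * ptrs * block;         /* 12th pointer        */
    unsigned long long triple = ptrs * ptrs * ptrs * block;  /* 13th pointer        */

    printf("direct blocks:    %llu KB\n", direct / 1024);                       /* 40 KB */
    printf("single indirect:  %llu MB\n", single / (1024 * 1024));              /* 4 MB  */
    printf("double indirect:  %llu GB\n", dbl / (1024ULL * 1024 * 1024));       /* 4 GB  */
    printf("triple indirect:  %llu TB\n", triple / (1024ULL * 1024 * 1024 * 1024)); /* 4 TB */
    return 0;
}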

4. Free Space Management#

If I want to save a data block, where should I place it on the hard disk? Do I need to scan all blocks to find an empty place to put it?
That method would be too inefficient, so a management mechanism for free disk space must be introduced. Here are a few common methods:

  • Free table method
  • Free linked list method
  • Bitmap method

Free Table Method#

The free table method establishes a table for all free spaces, with the table content including the first block number of the free area and the number of blocks in that free area. Note that this method is for contiguous allocation. As shown in the diagram:

Free Table Method

When a request for disk space allocation is made, the system scans the free table sequentially until it finds a suitable free area. When a user deletes a file, the system reclaims its space: it again scans the free table sequentially, finds a free entry, and fills in the first physical block number of the released space and the number of blocks it occupies.
This method works well only when there are a small number of free areas. If there are many small free areas in the storage space, the free table becomes very large, leading to low query efficiency. Additionally, this allocation technique is suitable for establishing contiguous files.

Free Linked List Method#

We can also use a "linked list" method to manage free space, where each free block has a pointer pointing to the next free block, making it easy to find and manage free blocks. As shown in the diagram:
Free Linked List Method

When creating a file that requires one or more blocks, it takes the next one or more blocks from the head of the chain. Conversely, when recovering space, these free blocks are sequentially added to the head of the chain.
This technique only requires saving a pointer in main memory that points to the first free block. Its characteristics are simplicity, but it cannot be accessed randomly, and its working efficiency is low because many I/O operations are needed whenever free blocks are added or moved on the chain, and the pointers for data blocks consume some storage space.
Neither the free table method nor the free linked list method is suitable for large file systems, as they would make the free table or free linked list too large.

Bitmap Method#

A bitmap uses a single binary bit to represent the usage status of each disk block, with one bit per block.
When the bit is 0, the corresponding block is free; when it is 1, the block has been allocated. It looks like this:

1111110011111110001110110111111100111 ...

The Linux file system uses the bitmap method to manage free space, not only for managing free data blocks but also for managing free inodes, as inodes are also stored on disk and naturally require management.
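
A minimal sketch of scanning such a bitmap for a free block (the bitmap contents here are made up):

#include <stdio.h>
#include <stdint.h>

/* Returns the index of the first free (0) bit, or -1 if every block is used.
   Bit value 1 = allocated, 0 = free, matching the convention above. */
static long find_free_block(const uint8_t *bitmap, long nblocks) {
    for (long i = 0; i < nblocks; i++) {
        if ((bitmap[i / 8] & (1 << (i % 8))) == 0)
            return i;
    }
    return -1;
}

int main(void) {
    uint8_t bitmap[2] = { 0xFF, 0xFB };   /* made-up: only block 10 is free */
    printf("first free block: %ld\n", find_free_block(bitmap, 16));
    return 0;
}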

5. Structure of the File System#

When a user creates a new file, the Linux kernel finds a free inode through the inode bitmap and allocates it. To store data, it finds a free block through the block bitmap and allocates it, but careful calculation reveals that there are still issues.
The data block bitmap is stored in disk blocks; assuming it is stored in one block, one block is 4K, and each bit represents a data block, it can represent a total of 4 * 1024 * 8 = 2^15 free blocks. Since one data block is 4K in size, the maximum representable space is 2^15 * 4 * 1024 = 2^27 bytes, which is 128M.
This means that according to the above structure, if we use "one block's bitmap + a series of blocks," plus "one block's inode bitmap + a series of inode structures," the maximum space that can be represented is only 128M, which is too small, as many files today exceed this size.
In the Linux file system, this structure is called a block group; with many block groups strung together, far larger files and file systems can be represented.
The following diagram shows the structure of the entire Linux Ext2 file system and the contents of the block groups, which consist of many block groups arranged sequentially on the hard disk:
image

The first block is the boot block, which is used to enable booting when the system starts, followed by a series of contiguous block groups. The contents of a block group are as follows:

  • The superblock contains important information about the file system, such as the total number of inodes, total number of blocks, the number of inodes per block group, the number of blocks per block group, etc.
  • The block group descriptor contains the status of each block group in the file system, such as the number of free blocks and inodes in the block group. Each block group contains "descriptor information for all block groups" in the file system.
  • The data bitmap and inode bitmap indicate whether the corresponding data block or inode is free or in use.
  • The inode list contains all the inodes in the block group, which are used to store all metadata related to files and directories in the file system.
  • The data blocks contain the useful data of the files.

You may notice that each block group contains a lot of duplicate information, such as the superblock and block group descriptor table, both of which are global and very important. This is done for two reasons:

  • If the system crashes and the superblock or block group descriptor is damaged, all information about the file system's structure and content would be lost. With redundant copies, that information may be recoverable.
  • Keeping files and their management data close together reduces the seek time and rotational delay of the disk head, which improves file system performance.

However, later versions of Ext2 adopted the sparse superblock technique: the superblock and block group descriptor table are no longer stored in every block group, but are only written to block group 0, block group 1, and block groups whose numbers are powers of 3, 5, or 7.

6. Directory Storage#

The blocks of regular files store file data, while the blocks of directory files store the information of each file in the directory.
In the blocks of directory files, the simplest storage format is a list, which lists the file information (such as file name, file inode, file type, etc.) under the directory one by one.
Each item in the list represents the file name and corresponding inode of the file under that directory, and through this inode, the actual file can be found.
Directory Format Hash Table

Typically, the first item is “.”, representing the current directory, and the second item is "..", representing the parent directory, followed by the file names and inodes.

If a directory has a large number of files, searching for a file in that directory by checking each item in the list is inefficient.

Thus, the storage format of directories can be changed to a hash table: file names are hashed and the hash values are stored. To look up a file in a directory, we hash its name; if the hash matches an entry, the file's information is in the corresponding block.

The ext file system in Linux uses a hash table to store directory contents. The advantage of this method is that searching is very fast, and insertion and deletion are also relatively simple, but some preparatory measures are needed to avoid hash collisions.
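
As a toy illustration only (real ext3/ext4 directory hashing uses an HTree and is more involved), the idea is to hash the file name and search just the matching bucket instead of the whole directory:

#include <stdio.h>

#define NBUCKETS 64

/* Hash a file name to pick a bucket; only the entries stored in that
   bucket then need to be compared against the name being looked up. */
static unsigned name_hash(const char *name) {
    unsigned h = 5381;
    while (*name)
        h = h * 33 + (unsigned char)*name++;
    return h % NBUCKETS;
}

int main(void) {
    printf("'a.txt' -> bucket %u\n", name_hash("a.txt"));
    printf("'b.txt' -> bucket %u\n", name_hash("b.txt"));
    return 0;
}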

Directory queries are completed by repeatedly searching on the disk, requiring constant I/O operations, which incurs significant overhead. Therefore, to reduce I/O operations, the currently used file directory is cached in memory, so that when the file is needed again, it can be operated in memory, thus reducing the number of disk operations and improving the speed of file system access.

7. Soft Links and Hard Links#

Sometimes we want to give a file an alias, which can be achieved in Linux through hard links and soft links, both of which are special types of files but implemented differently.

A hard link is when the "inode" in multiple directory entries points to one file, meaning it points to the same inode. However, inodes cannot cross file systems, as each file system has its own inode data structure and list, so hard links cannot be used across file systems. Since multiple directory entries point to the same inode, the system will only completely delete the file when all hard links and the original file are deleted.
Hard Link

A soft link is equivalent to creating a new file, which has an independent inode, but the content of this file is the path of another file, so when accessing the soft link, it is equivalent to accessing another file. Therefore, soft links can cross file systems, and even if the target file is deleted, the linked file still exists, but it simply cannot find the file it points to.
Soft Link

8. File I/O#

The methods of reading and writing files vary, and there are many classifications for file I/O, commonly including:

  • Buffered and unbuffered I/O
  • Direct and non-direct I/O
  • Blocking and non-blocking I/O vs. synchronous and asynchronous I/O

Buffered and Unbuffered I/O#

The standard library for file operations can implement data caching, so based on "whether to use standard library buffering," file I/O can be divided into buffered I/O and unbuffered I/O:

  • Buffered I/O uses the standard library's cache to accelerate file access, while the standard library accesses files through system calls.
  • Unbuffered I/O accesses files directly through system calls without going through the standard library cache.

Here, "buffered" specifically refers to the buffering implemented internally by the standard library.
For example, many programs only output when encountering a newline, while the content before the newline is temporarily cached by the standard library. The purpose of this is to reduce the number of system calls, as system calls incur the overhead of CPU context switching.
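
A small contrast in C: fwrite goes through the standard library's user-space buffer, while write is a direct system call (the strings are arbitrary).

#include <stdio.h>    /* fwrite, fflush -- buffered I/O via the standard library */
#include <unistd.h>   /* write          -- unbuffered I/O via a direct system call */

int main(void) {
    /* Buffered: the bytes sit in the stdio buffer until it fills, a newline is
       seen (line-buffered terminal), or fflush/fclose is called. */
    const char a[] = "buffered via stdio";
    fwrite(a, 1, sizeof(a) - 1, stdout);
    fflush(stdout);   /* force the library to issue the underlying write() */

    /* Unbuffered: each call is a system call, so the data reaches the kernel
       immediately, at the cost of one context switch per call. */
    const char b[] = "\nunbuffered via write()\n";
    write(STDOUT_FILENO, b, sizeof(b) - 1);
    return 0;
}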

Direct and Non-direct I/O#

We know that disk I/O is very slow, so the Linux kernel reduces the number of disk I/O operations by copying user data to the kernel's cache after system calls. This kernel cache space is known as "page cache," and disk I/O requests are only initiated when the cache meets certain conditions.
Thus, based on "whether to use the operating system's cache," file I/O can be divided into direct I/O and non-direct I/O:

  • Direct I/O does not involve copying data between the kernel cache and user programs but accesses the disk directly through the file system.
  • Non-direct I/O involves reading data from the kernel cache to the user program during read operations, while during write operations, data is copied from the user program to the kernel cache, with the kernel deciding when to write data to disk.

If you specify the O_DIRECT flag when using file operation system call functions, it indicates the use of direct I/O. If not set, the default is non-direct I/O.
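
A hedged sketch of opening a file with O_DIRECT (the path is hypothetical; note that direct I/O generally requires the buffer, length, and offset to be aligned to the block size, hence posix_memalign):

#define _GNU_SOURCE        /* O_DIRECT is a Linux-specific flag */
#include <fcntl.h>
#include <stdlib.h>        /* posix_memalign, free */
#include <string.h>        /* memset */
#include <unistd.h>
#include <stdio.h>

int main(void) {
    /* Without O_DIRECT, the same open() would default to non-direct
       (page-cache-backed) I/O. */
    int fd = open("/tmp/direct.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    /* Direct I/O generally requires block-aligned buffers and sizes. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }
    memset(buf, 'x', 4096);

    if (write(fd, buf, 4096) < 0) perror("write");

    free(buf);
    close(fd);
    return 0;
}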

When using non-direct I/O for writing data, under what circumstances will the kernel write cached data to disk?

The following scenarios will trigger the kernel to write cached data to disk:

  • At the end of the write call, when the kernel finds that there is too much cached data, it will write the data to disk;
  • When the user actively calls sync, the kernel cache will be flushed to disk;
  • When memory is very tight and cannot allocate more pages, the kernel will also flush cached data to disk;
  • When the cache time of the kernel's cached data exceeds a certain duration, the data will also be flushed to disk;

Blocking and Non-blocking I/O vs. Synchronous and Asynchronous I/O#

Why are blocking/non-blocking and synchronous/asynchronous discussed together? Because they are indeed very similar and easily confused, but their relationships are somewhat subtle.
First, let's look at blocking I/O. When a user program executes read, the thread will be blocked, waiting for the kernel data to be ready and for the data to be copied from the kernel buffer to the application program's buffer. The read call will only return after the copy process is complete.
Note that blocking waits for the two processes of "kernel data being ready" and "data being copied from kernel space to user space." The process is shown in the diagram below:

Blocking I/O

Knowing about blocking I/O, let's look at non-blocking I/O. A non-blocking read request immediately returns when data is not ready, allowing the program to continue executing. At this point, the application program continuously polls the kernel until the data is ready, and only then can the read call obtain the result. The process is shown in the diagram below:

Non-blocking I/O

Note that the last read call to obtain data is a synchronous process that requires waiting. Here, synchronous refers to the process of copying data from kernel space to the user program's buffer.
For example, when accessing a pipe or socket, if the O_NONBLOCK flag is set, it indicates that non-blocking I/O is being used. If no settings are made, the default is blocking I/O.
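
For example, the read end of a pipe can be switched to non-blocking mode with fcntl; a read that would otherwise block then returns -1 with errno set to EAGAIN (a minimal sketch):

#include <fcntl.h>    /* fcntl, O_NONBLOCK */
#include <unistd.h>   /* pipe, read, close */
#include <errno.h>
#include <stdio.h>

int main(void) {
    int fds[2];
    if (pipe(fds) < 0) { perror("pipe"); return 1; }

    /* Add O_NONBLOCK to the read end's existing file status flags. */
    int flags = fcntl(fds[0], F_GETFL, 0);
    fcntl(fds[0], F_SETFL, flags | O_NONBLOCK);

    char buf[64];
    ssize_t n = read(fds[0], buf, sizeof(buf));   /* nothing has been written yet */
    if (n < 0 && errno == EAGAIN)
        printf("data not ready, read returned immediately\n");

    close(fds[0]);
    close(fds[1]);
    return 0;
}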
Polling the kernel for I/O readiness seems a bit silly because the application does nothing during the polling process; it just loops.
To solve this silly polling method, I/O multiplexing technology has emerged, such as select and poll, which dispatch I/O events and notify the application program to operate when kernel data is ready.
This greatly improves CPU utilization because when an I/O multiplexing interface is called, if no events occur, the current thread will block, allowing the CPU to switch to execute other threads. When the kernel detects that an event has arrived, it wakes up the thread that is blocked on the I/O multiplexing interface, allowing the user to perform subsequent event handling.
The entire process is more complex than blocking I/O and seems to waste more performance. However, the biggest advantage of the I/O multiplexing interface is that users can handle multiple socket I/O requests simultaneously within a single thread (see: I/O Multiplexing: select/poll/epoll). Users can register multiple sockets and continuously call the I/O multiplexing interface to read activated sockets, achieving the goal of handling multiple I/O requests simultaneously within the same thread. In synchronous blocking models, this can only be achieved through multithreading.
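
A minimal select-based sketch, watching a single descriptor (standard input here) purely for illustration:

#include <sys/select.h>
#include <unistd.h>
#include <stdio.h>

int main(void) {
    fd_set readfds;
    FD_ZERO(&readfds);
    FD_SET(STDIN_FILENO, &readfds);     /* register the descriptors to watch */

    struct timeval tv = { .tv_sec = 5, .tv_usec = 0 };

    /* The thread blocks inside select() until a registered descriptor becomes
       readable (or the timeout expires); the CPU can run other threads meanwhile. */
    int ready = select(STDIN_FILENO + 1, &readfds, NULL, NULL, &tv);
    if (ready > 0 && FD_ISSET(STDIN_FILENO, &readfds)) {
        char buf[128];
        /* This read is the synchronous part: copying the data from kernel
           space into the user buffer still has to be waited for. */
        ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
        printf("read %zd bytes\n", n);
    }
    return 0;
}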
The following diagram illustrates the process of using select for I/O multiplexing. Note that the read process of obtaining data (the process of copying data from kernel space to user space) is also a synchronous process that requires waiting:

I/O Multiplexing

In fact, whether it is blocking I/O, non-blocking I/O, or I/O multiplexing based on non-blocking I/O, all are synchronous calls. Because during the read call, the kernel copies data from kernel space to application space, this process requires waiting, meaning it is synchronous. If the kernel's implementation of the copy is not efficient, the read call may wait a long time during this synchronous process.
True asynchronous I/O means that neither the "kernel data being ready" nor the "data being copied from kernel space to user space" processes need to wait.
When we initiate aio_read, it returns immediately, and the kernel automatically copies data from kernel space to application space. This copy process is also asynchronous, automatically completed by the kernel, unlike the previous synchronous operations where the application program must actively initiate the copy action. The process is shown in the diagram below:

Asynchronous I/O
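
A minimal sketch using the POSIX AIO interface (the file path is hypothetical; on glibc this interface may be implemented with helper threads and may require linking with -lrt):

#include <aio.h>      /* aio_read, aio_error, aio_return */
#include <fcntl.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>

int main(void) {
    int fd = open("/tmp/demo.txt", O_RDONLY);   /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    char buf[4096];
    struct aiocb cb;
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof(buf);
    cb.aio_offset = 0;

    /* aio_read() returns immediately; the data is filled in the background. */
    if (aio_read(&cb) < 0) { perror("aio_read"); close(fd); return 1; }

    /* The process could do other work here instead of spinning; polling is
       only used to keep the sketch short. */
    while (aio_error(&cb) == EINPROGRESS)
        ;

    ssize_t n = aio_return(&cb);
    printf("asynchronously read %zd bytes\n", n);
    close(fd);
    return 0;
}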

The following diagram summarizes the various I/O models mentioned above:

image

Earlier, we learned that I/O is divided into two processes:

  1. The process of preparing data
  2. The process of copying data from kernel space to user process buffer

Blocking I/O will block on "process 1" and "process 2," while non-blocking I/O and I/O multiplexing based on non-blocking I/O will only block on "process 2." Therefore, these three can be considered synchronous I/O.
Asynchronous I/O, on the other hand, does not block on either "process 1" or "process 2."

9. Page Cache#

What is Page Cache?#

To understand Page Cache, let's first look at the Linux file I/O system, as shown in the diagram below:
image

In the diagram, the red part represents Page Cache. It can be seen that the essence of Page Cache is a memory area managed by the Linux kernel. When we read files into memory through mmap and buffered I/O, it is actually read into Page Cache.

How to View the System's Page Cache?#

By reading the /proc/meminfo file, you can obtain real-time information about the system's memory:

$ cat /proc/meminfo
...
Buffers:            1224 kB
Cached:           111472 kB
SwapCached:        36364 kB
Active:          6224232 kB
Inactive:         979432 kB
Active(anon):    6173036 kB
Inactive(anon):   927932 kB
Active(file):      51196 kB
Inactive(file):    51500 kB
...
Shmem:             10000 kB
...
SReclaimable:      43532 kB
...

From the above data, you can derive a simple identity (SwapCached appears on both sides and cancels out; the remaining terms on each side sum to 112696 kB, since 1224 + 111472 = 51196 + 51500 + 10000):

Buffers + Cached + SwapCached = Active(file) + Inactive(file) + Shmem + SwapCached

Both sides of the equation represent Page Cache, thus:

Page Cache = Buffers + Cached + SwapCached

By reading the following sections, you will understand why SwapCached and Buffers are also part of Page Cache.

Page and Page Cache#

A page is the basic unit of memory allocation and management, and Page Cache consists of multiple pages. A page is typically 4 KB in size (on both 32-bit and 64-bit systems), so the size of Page Cache is always a multiple of 4 KB.
On the other hand, not all pages are organized as Page Cache.
The memory accessible to users in a Linux system is divided into two types:

  • File-backed pages: pages in Page Cache that correspond to data blocks on disk; the main concern for these pages is writing dirty pages back to disk.
  • Anonymous pages: These pages do not correspond to any disk data blocks; they are the memory space for the process's execution (e.g., method stack, local variable table, etc.);

Why doesn't Linux call Page Cache block cache? Wouldn't that be better?
This is because the data loaded from disk into memory is not only placed in Page Cache but also in buffer cache.
For example, disk files accessed through Direct I/O do not enter Page Cache. Of course, this issue also has historical design reasons in Linux, as this is just a naming convention, and its meaning has gradually changed with the evolution of the Linux system.
Next, let's compare the performance of File-backed pages and Anonymous pages under the Swap mechanism.
Memory is a precious resource, and when memory runs short, the kernel's memory management subsystem must run reclamation algorithms to free memory. The usual way to reclaim memory is swapping, that is, writing pages out to a persistent storage device.
The essence of the Swap mechanism is that the Linux system provides a virtual memory management mechanism, where each process believes it has exclusive access to memory space, meaning the total memory space of all processes far exceeds physical memory. The portion of memory space that exceeds physical memory needs to be swapped to disk.
The operating system manages memory in pages. When a process finds that the data it needs is not in memory, a page fault occurs, and the kernel's page-fault handler loads the required page into memory.
However, the main memory space is limited, and when there is no available space in the main memory, the operating system will evict a suitable physical memory page back to disk to make room for the new memory page. The process of selecting which page to evict is called page replacement in the operating system, and the replacement operation will trigger the swap mechanism.
If physical memory is large enough, the Swap mechanism may not be needed. However, Swap still has certain advantages in this case: for applications (processes) that may experience memory leaks, the Swap partition is crucial, as it ensures that memory leaks do not lead to insufficient physical memory, ultimately causing system crashes. However, memory leaks can lead to frequent swaps, which can severely impact the performance of the operating system.
Linux controls the Swap mechanism through a parameter called swappiness: this parameter can range from 0 to 100, controlling the priority of the system's swapping:

  • High values: Higher frequency of swapping, actively swapping out inactive processes from physical memory.
  • Low values: Lower frequency of swapping, ensuring that interactive processes are not frequently swapped to disk, thus reducing response latency.

Finally, why is SwapCached also part of Page Cache?
This is because when anonymous pages (Inactive(anon) and Active(anon)) are swapped out to disk and then loaded back into memory, the original Swap File still exists, so SwapCached can also be considered a File-backed page, thus belonging to Page Cache. The process is illustrated in the diagram below.
Image

Page Cache and Buffer Cache#

When executing the free command, you may notice two columns named buffers and cached, as well as a line named "-/+ buffers/cache":

~ free -m
             total       used       free     shared    buffers     cached
Mem:        128956      96440      32515          0       5368      39900
-/+ buffers/cache:      51172      77784
Swap:        16002          0      16001

The cached column indicates the current usage of page cache (Page Cache), while the buffers column indicates the current usage of block cache (buffer cache).
In one sentence: Page Cache is used to cache page data of files, while buffer cache is used to cache block data of block devices (such as disks).

  • Pages are a logical concept, so Page Cache sits at the same level as the file system;
  • Blocks are a physical concept, so buffer cache sits at the same level as the block device driver.

The common purpose of Page Cache and buffer cache is to accelerate data I/O:

  • When writing data, it is first written to the cache and the written pages are marked dirty; the flush to external storage happens later, which is the write-back policy (the alternative, write-through, is not used by Linux by default);
  • When reading data, the cache is read first; on a miss, the data is read from external storage and added to the cache. The operating system always actively uses all free memory as Page Cache and buffer cache, and when memory is insufficient, it evicts cached pages using LRU and similar algorithms.

Before the Linux 2.4 kernel version, Page Cache and buffer cache were completely separate. However, since block devices are mostly disks, and data on disks is often organized through file systems, this design led to much data being cached twice, wasting memory.
Therefore, after the 2.4 kernel version, the two caches were nearly merged: if a file's pages are loaded into Page Cache, then the buffer cache only needs to maintain pointers to the blocks pointing to the pages. Only those blocks that do not have file representations or are directly operated on (such as the dd command) will actually be placed in the buffer cache.
Thus, when we refer to Page Cache now, we generally refer to both Page Cache and buffer cache together, and this article will no longer distinguish between them, directly referring to them as Page Cache.
The following diagram roughly illustrates a possible structure of Page Cache in a 32-bit Linux system, where the block size is 1KB and the page size is 4KB.

Image

In Page Cache, each file's pages are organized in a radix tree (essentially a multi-way search tree) whose nodes lead to pages, so a page can be located quickly from its offset within the file, as shown in the diagram below. For more on radix trees, see the English Wikipedia; it will not be elaborated here.

Image

Page Cache and Read-Ahead#

The operating system provides a read-ahead mechanism (PAGE_READAHEAD) for the read caching mechanism based on Page Cache. An example is:

  • The user thread only requests to read data from file A on disk in the range of offset 0-3KB. Since the basic read/write unit of the disk is a block (4KB), the operating system will read at least 0-4KB of content, which fits perfectly into one page.
  • However, due to the principle of locality, the operating system will also load the ranges [4KB, 8KB), [8KB, 12KB), and [12KB, 16KB) into memory, which takes up 3 additional pages.

The diagram below represents the operating system's read-ahead mechanism:

image

In the diagram, the application program uses the read system call to read 4KB of data, but the kernel uses the readahead mechanism to complete the reading of 16KB of data.

Page Cache and Consistency & Reliability of File Persistence#

Modern Linux's Page Cache, as its name implies, is a memory cache for pages (of disk) and can be used for both read and write operations.
Any system that introduces caching will encounter consistency issues: the data in memory may not match the data on disk, such as the common consistency issues between Redis cache and MySQL database in backend architectures.
Linux provides various mechanisms to ensure data consistency, but whether it is consistency between memory and disk on a single machine or consistency among the nodes of a distributed component, the key to understanding it is the trade-off: throughput and data consistency guarantees pull in opposite directions.
First, we need to understand the data of a file. File = Data + Metadata. Metadata describes various attributes of the file and must also be stored on disk. Therefore, ensuring file consistency actually involves two aspects: data consistency + metadata consistency.

The metadata of a file includes: file size, creation time, access time, owner and group information, etc.

Let's consider a consistency issue: if a write operation occurs and the corresponding data is in Page Cache, the write operation will directly affect the data in Page Cache. If the data has not yet been flushed to disk, the data in memory will be ahead of that on disk, and the corresponding page will be referred to as a Dirty page.
Currently, Linux implements file consistency in two ways:

  1. Write Through: Provides specific interfaces to the user layer, allowing applications to actively call interfaces to ensure file consistency;
  2. Write Back: The system has periodic tasks (in the form of kernel threads) that periodically synchronize dirty data blocks of files in the file system, which is the default consistency scheme in Linux;

Both methods ultimately rely on system calls, mainly divided into the following three system calls:
  • fsync(fd): flushes all dirty data and metadata of the file referred to by fd to disk.
  • fdatasync(fd): flushes the dirty data of the file referred to by fd to disk, plus only the metadata necessary for subsequent access. Here, "necessary" means information crucial for reading the file back, such as the file size; the modification time, for example, is not considered necessary.
  • sync(): flushes all dirty file data and metadata in the system to disk.
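
A minimal sketch of the Write Through style: flush a file's dirty pages explicitly after writing (the path is hypothetical):

#include <fcntl.h>
#include <unistd.h>   /* write, fsync, close */
#include <stdio.h>

int main(void) {
    int fd = open("/tmp/journal.log", O_WRONLY | O_CREAT | O_APPEND, 0644);
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "record\n";
    write(fd, rec, sizeof(rec) - 1);   /* data now sits in the page cache */

    /* Strongest guarantee: flush this file's dirty data and metadata before
       continuing. fdatasync(fd) would skip non-essential metadata such as
       the modification time; sync() would flush the whole system. */
    if (fsync(fd) < 0) perror("fsync");

    close(fd);
    return 0;
}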
These three system calls can be initiated by both user processes and kernel processes. Next, let's examine the characteristics of kernel threads related to this.

  1. The number of kernel threads created for write-back tasks is determined by the persistent storage devices in the system, creating a separate flush thread for each storage device;
  2. Regarding the architecture of multithreading, the Linux kernel adopts a lightweight approach, with a management thread and multiple flush threads (each persistent storage device corresponds to a flush thread). The management thread monitors the status of dirty pages on the device; if no dirty pages are generated within a certain period, it destroys the flush thread on that device. If it detects that there are dirty pages needing to be flushed on the device and no flush thread has been created for that device, it creates a flush thread to handle the dirty page flushing. The flush thread's task is relatively monotonous, only responsible for flushing dirty pages from the device to persistent storage.
  3. The flushing thread flushes dirty pages on the device roughly as follows:
    • Each device maintains a linked list of dirty files, which stores the inodes of the dirty files stored on that device. Flushing dirty pages means flushing certain files' dirty pages from this inode list;
    • There are multiple flushing triggers in the system: the application actively calls a flushing interface (fsync, fdatasync, sync, etc.); the management thread periodically wakes up the device's flushing thread; or a task that finds memory running low reclaims cached pages and flushes the dirty ones first. A unified framework to manage all of these flushing tasks is therefore necessary.

Write Through and Write Back differ in terms of persistence reliability:

  • Write Through sacrifices system I/O throughput to ensure that once a write returns, the data is already on disk and will not be lost;
  • Write Back cannot ensure that data is on disk in the event of a system crash, so there is a risk of data loss. However, if only the program crashes (for example, it is killed with kill -9), the operating system will still flush its data in Page Cache to disk;

Advantages and Disadvantages of Page Cache#

Advantages of Page Cache#

1. Accelerates Data Access
If data can be cached in memory, the next access does not require disk I/O, and it can be directly hit in memory.
Since memory access is much faster than disk access, accelerating data access is a significant advantage of Page Cache.
2. Reduces I/O Operations and Increases System Disk I/O Throughput
Thanks to the caching and read-ahead capabilities of Page Cache, and because programs often conform to the principle of locality, loading multiple pages into Page Cache through a single I/O operation can reduce the number of disk I/O operations, thereby increasing system disk I/O throughput.

Disadvantages of Page Cache#

Page Cache also has its disadvantages, the most direct being that it requires additional physical memory space. When physical memory is tight, this may lead to frequent swap operations, ultimately increasing the disk I/O load on the system.
Another drawback of Page Cache is that it does not provide a good management API for the application layer; it is almost transparently managed. Even if the application layer wants to optimize the use strategy of Page Cache, it is challenging to do so. Therefore, some applications choose to implement their own page management in user space instead of using Page Cache, such as the MySQL InnoDB storage engine, which manages pages of 16KB.
The final drawback of Page Cache is that in certain application scenarios, it incurs one additional disk read I/O and one additional disk write I/O compared to Direct I/O.
Direct I/O refers to direct I/O. The term "direct" is used to distinguish it from cached I/O that uses the Page Cache mechanism.

  • Cached file I/O: user-space reads and writes of a file do not interact with the disk directly but go through an intermediate layer, the Page Cache;
  • Direct file I/O: user space reads and writes the disk directly, without the intermediate Page Cache layer.

The word "direct" also has another meaning here: in the other techniques, the data is stored at least once in kernel space, whereas with Direct I/O the data is transferred directly to and from user space, bypassing the kernel's cache.

The Direct I/O model is illustrated in the diagram below:

Direct I/O

At this point, the user space directly transfers data to and from the disk and network card via DMA.
The read and write operations of Direct I/O are very characteristic:

  • Write operations: Since it does not use Page Cache, if the write file returns successfully, the data is indeed on disk (not considering the disk's built-in cache);
  • Read operations: Since it does not use Page Cache, each read operation genuinely reads from the disk and does not read from the file system's cache.

Zero-Copy#

Disk read/write is more than ten times slower than memory access, so there are many technologies for optimizing disk performance, such as zero-copy, direct I/O, and asynchronous I/O, all aimed at improving system throughput. In addition, the disk cache area in the operating system kernel effectively reduces the number of disk accesses.

What is DMA Technology?#

DMA (Direct Memory Access) means that when data is transferred between an I/O device and memory, the transfer work is handled entirely by the DMA controller; the CPU no longer participates in the copying itself and is free to handle other tasks.
The process of using the DMA controller for data transfer is as follows:
Image
The specific process is:

  • The user process calls the read method, issuing an I/O request to the operating system to read data into its memory buffer, and the process enters a blocked state;
  • The operating system receives the request and further sends the I/O request to DMA, allowing the CPU to perform other tasks;
  • DMA sends the I/O request to the disk;
  • The disk receives the DMA I/O request and reads the data from the disk into the disk controller's buffer. When the disk controller's buffer is full, it sends an interrupt signal to DMA, notifying it that its buffer is full;
  • DMA receives the signal from the disk and copies the data from the disk controller's buffer to the kernel buffer, without occupying the CPU, allowing the CPU to execute other tasks;
  • When DMA has read enough data, it sends an interrupt signal to the CPU;
  • The CPU receives the signal from DMA, indicating that the data is ready, and then copies the data from the kernel to user space, returning from the system call;

As we can see, the CPU no longer participates in the "moving data from the disk controller's buffer to kernel space" work; this part is entirely handled by DMA. However, the CPU is still essential in this process, as it needs to tell the DMA controller what data to transfer and from where to where.

Early DMA only existed on the motherboard, but now, due to the increasing number of I/O devices and varying data transfer needs, each I/O device has its own DMA controller.

How to Optimize File Transfer Performance?#

Traditional file transfer involves 4 context switches between user space and kernel space because it involves two system calls, one for read() and one for write(). Each system call requires switching from user space to kernel space and back again after the kernel completes the task.
The cost of context switching is not small; each switch takes tens of nanoseconds to several microseconds. Although this time seems short, in high-concurrency scenarios, this time can accumulate and amplify, affecting system performance.
Additionally, there are 4 data copies involved, two of which are DMA copies, and the other two are copies through the CPU.
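
In code, the traditional transfer the text describes is roughly the following loop (the descriptors are assumed to be opened and connected elsewhere); each iteration costs two system calls and bounces the data through a user-space buffer:

#include <unistd.h>

/* Traditional copy loop: read() pulls data from the kernel buffer into buf
   (one CPU copy), write() pushes it from buf into the socket buffer
   (another CPU copy); the DMA copies happen on either side of these. */
static void copy_traditional(int file_fd, int sock_fd) {
    char buf[4096];
    ssize_t n;
    while ((n = read(file_fd, buf, sizeof(buf))) > 0)
        write(sock_fd, buf, n);
}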
Therefore, to improve file transfer performance, it is necessary to reduce the number of "context switches between user space and kernel space" and "memory copies."

How to reduce the number of "context switches between user space and kernel space"?

When reading disk data, context switches occur because user space does not have permission to operate on disks or network cards. Each system call will inevitably involve 2 context switches: first switching from user space to kernel space, and then switching back to user space after the kernel completes the task.
Thus, to reduce the number of context switches, we need to minimize the number of system calls.

How to reduce the number of "data copies"?

The traditional file transfer method undergoes 4 data copies, and among these, "copying from the kernel's read buffer to the user's buffer and then from the user's buffer to the socket's buffer" is unnecessary.
In file transfer scenarios, we do not "reprocess" the data in user space, so the data can actually be transferred without being moved to user space, meaning the user's buffer is unnecessary.

How to Achieve Zero-Copy?#

Zero-copy technology is typically implemented in two ways: mmap + write, sendfile.

mmap + write#

Earlier, we learned that the read() system call copies data from the kernel buffer to the user's buffer. To reduce this overhead, we can replace the read() system call with mmap().
The mmap() system call directly "maps" the data in the kernel buffer to user space, so the operating system kernel and user space no longer need to perform any data copying operations.
Image
The specific process is as follows:

  • After the application process calls mmap(), DMA copies the disk data into the kernel buffer. The application process then "shares" this buffer with the operating system kernel;
  • The application process then calls write(), and the operating system directly copies the data from the kernel buffer to the socket buffer, all occurring in kernel space, with the CPU handling the data transfer;
  • Finally, the data in the kernel's socket buffer is copied to the network card's buffer, which is done by DMA.

We can see that by using mmap() instead of read(), we can reduce one data copying process.
However, this is still not the ideal zero-copy solution, as it still requires the CPU to copy data from the kernel buffer to the socket buffer, and it still involves 4 context switches, as there are still 2 system calls.
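
A sketch of this approach (the socket descriptor is assumed to be connected elsewhere, and a real sender would also loop on partial writes):

#include <sys/mman.h>   /* mmap, munmap */
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Send an entire file to an already-connected socket using mmap() + write(). */
static int send_file_mmap(int sock_fd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the file: the kernel buffer's pages are made visible to user space,
       so no read() copy into a separate user buffer is needed. */
    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (addr == MAP_FAILED) { perror("mmap"); close(fd); return -1; }

    /* write() still copies from the kernel buffer to the socket buffer (a CPU copy). */
    ssize_t sent = write(sock_fd, addr, st.st_size);

    munmap(addr, st.st_size);
    close(fd);
    return sent < 0 ? -1 : 0;
}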

sendfile#

In Linux kernel version 2.1, a dedicated system call function for sending files, sendfile(), was introduced. The function signature is as follows:

#include <sys/sendfile.h>
ssize_t sendfile(int out_fd, int in_fd, off_t *offset, size_t count);

The first two parameters are the destination and source file descriptors, while the last two parameters are the source offset and the length of data to copy. The return value is the actual length of data copied.
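
A sketch of sending a whole file with sendfile() on Linux (the socket descriptor is assumed to be created and connected elsewhere):

#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>

/* Send a whole file to an already-connected socket without copying it
   into a user-space buffer. */
static int send_file_zero_copy(int sock_fd, const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    struct stat st;
    fstat(fd, &st);

    off_t offset = 0;
    while (offset < st.st_size) {
        /* Data moves from the kernel buffer to the socket inside the kernel;
           offset is advanced by the number of bytes actually sent. */
        ssize_t n = sendfile(sock_fd, fd, &offset, st.st_size - offset);
        if (n <= 0) { perror("sendfile"); break; }
    }
    close(fd);
    return 0;
}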
First, it can replace the previous read() and write() system calls, thus reducing one system call and consequently reducing 2 context switches.
Secondly, this system call can directly copy data from the kernel buffer to the socket buffer without copying it to user space, resulting in only 2 context switches and 3 data copies. The process is illustrated in the diagram below:
Image
However, this is still not true zero-copy technology. If the network card supports SG-DMA (The Scatter-Gather Direct Memory Access) technology (which differs from ordinary DMA), we can further reduce the process of copying data from the kernel buffer to the socket buffer.
Thus, starting from Linux kernel version 2.4, the process of the sendfile() system call has changed under the condition that the network card supports SG-DMA technology, as follows:

  • First, DMA copies data from the disk to the kernel buffer;
  • Second, the buffer descriptor and data length are sent to the socket buffer, allowing the network card's SG-DMA controller to directly copy data from the kernel cache to the network card's buffer without needing to copy data from the operating system's kernel buffer to the socket buffer, thus reducing one data copy;

Therefore, in this process, only 2 data copies are performed, as shown in the diagram:

Image
This is what is known as zero-copy technology (Zero-copy), as we do not copy data at the memory level, meaning that no data is moved by the CPU throughout the process; all data is transferred via DMA.
Compared to the traditional method, zero-copy file transfer reduces context switches and data copies by two each: the transfer completes with only 2 context switches and 2 data copies, and both copies are performed by DMA rather than the CPU.
Thus, overall, zero-copy technology can improve file transfer performance by at least double.

What Method to Use for Transferring Large Files?#

Large file transfers should not use Page Cache, as they may occupy Page Cache and prevent "hot" small files from utilizing Page Cache.

Therefore, in high-concurrency scenarios, the method for transferring large files should use "asynchronous I/O + direct I/O" instead of zero-copy technology.

Common scenarios for using direct I/O include:

  • The application program has already implemented caching for disk data, so Page Cache is not needed again, reducing additional performance loss. In MySQL, direct I/O can be enabled through parameter settings, which is off by default;
  • When transferring large files, since large files are difficult to hit in Page Cache and may fill up Page Cache, preventing "hot" files from fully utilizing the cache, this increases performance overhead. Therefore, direct I/O should be used in this case.

Additionally, since direct I/O bypasses Page Cache, it cannot enjoy the following two optimizations from the kernel:

  • The kernel's I/O scheduling algorithm will cache as many I/O requests as possible in Page Cache, ultimately "merging" them into a larger I/O request before sending it to the disk, which reduces disk addressing operations;
  • The kernel will also "read ahead" subsequent I/O requests into Page Cache, again to reduce disk operations;

Thus, when transferring large files, using "asynchronous I/O + direct I/O" allows for non-blocking file reads.
Therefore, when transferring files, we should use different methods based on the file size:

  • For transferring large files, use "asynchronous I/O + direct I/O";
  • For transferring small files, use "zero-copy technology";

In nginx, we can use the following configuration to use different methods based on file size:

location /video/ { 
    sendfile on; 
    aio on; 
    directio 1024m; 
}

When the file size exceeds the directio value, "asynchronous I/O + direct I/O" is used; otherwise, "zero-copy technology" is used.
