MySQL Execution Process#
MySQL's architecture is divided into two layers: the Server layer and the storage engine layer.
- The Server layer is responsible for establishing connections and for parsing and executing SQL. Most of MySQL's core functional modules are implemented here, including the connector, query cache, parser, preprocessor, optimizer, and executor. Additionally, all built-in functions (such as date, time, mathematical, and encryption functions) and all cross-storage-engine functionality (such as stored procedures, triggers, and views) are implemented in the Server layer.
- The storage engine layer is responsible for data storage and retrieval. It supports multiple storage engines like InnoDB, MyISAM, Memory, etc., and different storage engines share a single Server layer. The most commonly used storage engine now is InnoDB, which became the default storage engine for MySQL starting from version 5.5. The index data structure we often refer to is implemented by the storage engine layer, and different storage engines support different types of indexes. For example, InnoDB supports B+ tree index types, which are used by default, meaning that the primary key index and secondary index created in the data table use B+ tree indexes by default.
Step 1: Connector#
The work of the connector:
- Establish a connection with the client through a TCP three-way handshake;
- Verify the client's username and password; if the username or password is incorrect, an error will be reported;
- If both the username and password are correct, it reads the user's permissions; all subsequent permission checks within this connection are based on the permissions read at that moment.
How to check how many clients are connected to the MySQL service?
Execute `show processlist`.
Do idle connections occupy resources indefinitely?
This is controlled by the wait_timeout parameter, whose default value is 8 hours (28800 seconds). If a connection stays idle longer than this, the connector automatically disconnects it.
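A minimal sketch of inspecting connections (the connection Id 6 is hypothetical, and KILL requires appropriate privileges):

```sql
-- List current client connections; idle ones show "Sleep" in the Command column
SHOW PROCESSLIST;

-- Check the idle-connection timeout (in seconds)
SHOW VARIABLES LIKE 'wait_timeout';

-- Manually disconnect a connection by its Id from SHOW PROCESSLIST
KILL CONNECTION 6;
```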
Is there a limit to the number of MySQL connections?
Controlled by the max_connections parameter.
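A quick way to inspect and adjust the limit (the value 1000 is only an illustration; changing it requires the SUPER privilege, or CONNECTION_ADMIN on 8.0):

```sql
-- View the current connection limit
SHOW VARIABLES LIKE 'max_connections';

-- Raise it for the running server; not persisted across restarts
SET GLOBAL max_connections = 1000;
```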
How to solve the memory issue caused by long connections?
- Regularly disconnect long connections.
- Clients actively reset connections.
MySQL 5.7 introduced the mysql_reset_connection() API. After a client finishes a large operation, it can call mysql_reset_connection() in its code to reset the connection and release the memory. This does not require reconnecting or re-authenticating; it simply restores the connection to the state it was in just after creation.
Step 2: Query Cache#
When the client sends an SQL statement to the MySQL service, after receiving the SQL statement, MySQL will parse the first field of the SQL statement to see what type of statement it is.
If the SQL is a query statement (select statement), MySQL will first check the query cache for cached data to see if this command has been executed before. The query cache is stored in memory in a key-value format, where the key is the SQL query statement and the value is the result of the SQL statement query.
If the query statement hits the query cache, it will directly return the value to the client. If the query statement does not hit the query cache, it will continue to execute, and after execution, the query result will be stored in the query cache.
However, for tables that are frequently updated, the hit rate of the query cache is very low because as long as a table has an update operation, the query cache for that table will be cleared. Therefore, MySQL version 8.0 directly removed the query cache.
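On 5.7 and earlier you can still see the query cache's knobs and counters; these variables were removed along with the feature in 8.0 (a sketch, assuming a 5.7 server):

```sql
-- Whether/how the query cache is used (0/OFF, 1/ON, 2/DEMAND)
SHOW VARIABLES LIKE 'query_cache_type';

-- Hit and insert counters for the query cache
SHOW STATUS LIKE 'Qcache%';
```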
Step 3: Parsing SQL#
Before officially executing the SQL query statement, MySQL will first parse the SQL statement, a task handled by the "parser."
Parser#
The parser will perform the following two tasks.
The first task is lexical analysis. MySQL scans the input string and breaks it into tokens, identifying keywords such as SELECT and FROM along with table names, field names, and so on.
The second task is syntax analysis. Based on the tokens produced by lexical analysis, the syntax parser checks whether the input SQL statement satisfies MySQL's grammar rules and, if it does, builds an SQL syntax tree, making it easier for subsequent modules to obtain the SQL type, table names, field names, where conditions, etc.
If the SQL statement we input has incorrect syntax, an error will be reported at the parsing stage.
However, note that if a table or field does not exist, it is not handled in the parser.
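For example, a misspelled keyword fails at the parsing stage (t_user is the table from the index condition pushdown example later):

```sql
-- "form" is not valid SQL, so the parser rejects the statement
select * form t_user;
-- ERROR 1064 (42000): You have an error in your SQL syntax ...
```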
Step 4: Executing SQL#
Preprocessor#
Let's first talk about what happens during the preprocessing stage.
- Check whether the tables or fields in the SQL query statement exist;
- Expand the `*` symbol in `select *` to all the columns in the table;
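This is also where a nonexistent table is caught: the statement below is grammatically valid, so it passes the parser, but preprocessing fails (the table name t_nonexistent is hypothetical):

```sql
select * from t_nonexistent;
-- ERROR 1146 (42S02): Table 'test.t_nonexistent' doesn't exist
-- ("test" being whatever the current database is)
```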
Optimizer#
After the preprocessing stage, an execution plan needs to be formulated for the SQL query statement, a task handled by the "optimizer."
The optimizer is mainly responsible for determining the execution plan of the SQL query statement. For example, when there are multiple indexes in a table, the optimizer will decide which index to use based on the cost of the query.
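If you want to watch this decision being made, MySQL 5.6+ can dump the optimizer's cost reasoning for a statement (a sketch; the product table comes from the examples below):

```sql
SET optimizer_trace = 'enabled=on';                 -- turn on tracing for this session
select * from product where id = 1;                 -- run the statement to be analyzed
SELECT * FROM information_schema.OPTIMIZER_TRACE;   -- inspect the chosen plan and its cost
SET optimizer_trace = 'enabled=off';
```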
Executor#
After the optimizer has determined the execution plan, MySQL will actually start executing the statement, a task completed by the "executor." During execution, the executor will interact with the storage engine, with interactions occurring on a record basis.
Primary Key Index Query#
```sql
select * from product where id = 1;
```
This query statement uses the primary key index for its query condition and is an equality query. Since the primary key id is unique, there will be no records with the same id, so the optimizer decides to use the access type of const for the query, meaning it will use the primary key index to query a record. The execution process between the executor and the storage engine is as follows:
- The executor's first query will call the function pointed to by the read_first_record function pointer. Since the optimizer chose the access type of const, this function pointer is directed to the InnoDB engine's index query interface, passing the condition id = 1 to the storage engine to locate the first record that meets the condition.
- The storage engine locates the first record with id = 1 through the B+ tree structure of the primary key index. If the record does not exist, it will report an error to the executor indicating that the record cannot be found, and the query will end. If the record exists, it will return the record to the executor;
- After the executor reads the record from the storage engine, it will check whether the record meets the query condition. If it does, it sends it to the client; if not, it skips that record.
- The executor's querying process is a while loop, so it will query again. However, since this is not the first query, it will call the function pointed to by the read_record function pointer. Since the optimizer chose the access type of const, this function pointer is directed to a function that always returns -1, so when this function is called, the executor exits the loop, ending the query.
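The access type is visible in the execution plan; a sketch of the relevant output:

```sql
EXPLAIN select * from product where id = 1;
-- type: const, key: PRIMARY   (at most one record, fetched via the primary key index)
```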
Full Table Scan#
```sql
select * from product where name = 'iphone';
```
This query statement does not use an index for its query condition, so the optimizer decides to use the access type of ALL for the query, meaning it will perform a full table scan. The execution process between the executor and the storage engine is as follows:
- The executor's first query will call the function pointed to by the read_first_record function pointer. Since the optimizer chose the access type of ALL, this function pointer is directed to the InnoDB engine's full scan interface, letting the storage engine read the first record in the table;
- The executor will check whether the name of the read record is iphone. If not, it skips it; if it is, it sends the record to the client (yes, every time the Server layer reads a record from the storage engine, it sends it to the client; the client simply buffers the records and displays them all only after the query statement completes).
- The executor's querying process is a while loop, so it will query again, calling the function pointed to by the read_record function pointer. Since the optimizer chose the access type of ALL, the read_record function pointer is still directed to the InnoDB engine's full scan interface, so the executor keeps requesting the next record from the storage engine, which returns it to the executor (Server layer). The executor checks the condition as before: if the record does not meet the query condition, it skips it; otherwise, it sends it to the client;
- This process repeats until the storage engine has read all records in the table, then returns the completion information to the executor (Server layer);
- The executor receives the completion information reported by the storage engine and exits the loop, stopping the query.
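Again, the plan confirms the access type (a sketch of the relevant columns):

```sql
EXPLAIN select * from product where name = 'iphone';
-- type: ALL, key: NULL   (no usable index on name, so every record is scanned)
```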
Index Condition Pushdown#
Index condition pushdown reduces the number of back-table operations during secondary index queries, improving query efficiency, because it pushes part of the filtering work that the Server layer is responsible for down to the storage engine layer.
```sql
select * from t_user where age > 20 and reward = 100000;
```
When a composite index encounters a range condition (> or <), matching stops at that column. With the composite index (age, reward), this means the age field can use the index, but the reward field cannot.
So, when not using index condition pushdown (versions before MySQL 5.6), the execution process between the executor and the storage engine is as follows:
- The Server layer first calls the storage engine's interface to locate the first secondary index record that meets the query condition, which is to locate the first record where age > 20;
- After the storage engine quickly locates this record based on the secondary index's B+ tree and retrieves the primary key value, it performs a back table operation to return the complete record to the Server layer;
- The Server layer checks whether the reward of that record equals 100000. If so, it sends it to the client; otherwise, it skips that record;
- Then, it continues to request the next record from the storage engine. After locating the record in the secondary index, the storage engine retrieves the primary key value and performs a back table operation to return the complete record to the Server layer;
- This process continues until the storage engine has read all records in the table.
As we can see, without index condition pushdown, every time a secondary index record is queried, a back table operation must be performed, and then the record is returned to the Server, which then checks whether the reward of that record equals 100000.
With index condition pushdown, the task of checking whether the reward equals 100000 is handed over to the storage engine layer, and the process is as follows:
- The Server layer first calls the storage engine's interface to locate the first secondary index record that meets the query condition, which is to locate the first record where age > 20;
- After locating a secondary index record, the storage engine does not perform a back table operation immediately; it first checks whether the condition on the column included in that index (whether reward equals 100000) holds. If it does not hold, the engine skips that secondary index record directly. If it holds, the engine performs a back table operation and returns the complete record to the Server layer.
- The Server layer checks whether other query conditions (there are no other conditions in this query) hold. If they do, it sends it to the client; otherwise, it skips that record and requests the next record from the storage engine.
- This process continues until the storage engine has read all records in the table.
As we can see, after using index condition pushdown, although the reward column cannot utilize the composite index, since it is included in the composite index (age, reward), the storage engine filters out records that meet reward = 100000 before performing the back table operation to obtain the entire record. Compared to not using index condition pushdown, this saves a lot of back table operations.
When the Extra column of the execution plan shows "Using index condition," it indicates that index condition pushdown is being used.
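A quick check, assuming t_user carries the composite index (age, reward) and ICP is enabled (the default since 5.6):

```sql
EXPLAIN select * from t_user where age > 20 and reward = 100000;
-- Extra: Using index condition   -> reward is filtered inside the storage engine
```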
How is MySQL data stored?#
Which file does MySQL data reside in?#
MySQL data is stored on disk, specifically in files, discussed here in the context of the InnoDB storage engine.
For example, for a database containing a t_order table, the database's directory holds three files:
- db.opt, used to store the default character set and character collation rules of the current database.
- t_order.frm, the table structure of t_order is stored in this file. Every time a table is created in MySQL, a .frm file is generated, which is used to save the metadata information of each table, mainly containing the table structure definition.
- t_order.ibd, the table data of t_order is stored in this file. Table data can either reside in a shared tablespace file (filename: ibdata1) or in an exclusive tablespace file (filename: table_name.ibd). This behavior is controlled by the innodb_file_per_table parameter. If the innodb_file_per_table parameter is set to 1, the stored data, indexes, and other information will be stored separately in an exclusive tablespace. Starting from MySQL version 5.6.6, its default value is 1, so from this version onward, the data of each table in MySQL is stored in a separate .ibd file.
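You can confirm the setting on your own server:

```sql
-- ON (1) means each table gets its own .ibd tablespace file
SHOW VARIABLES LIKE 'innodb_file_per_table';
```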
Now we know that the data of a database table is stored in the file named "table_name.ibd," which is also referred to as an exclusive tablespace file.
What is the structure of the tablespace file?#
A tablespace consists of segments, extents, pages, and rows. In InnoDB's logical storage structure, rows are grouped into pages, pages into extents, extents into segments, and segments make up the tablespace.
Row#
Records in a database table are stored by row, and each row's storage structure varies based on different row formats, which will be detailed later.
Page#
Records are stored by row, but database reads do not occur on a "row" basis; otherwise, a single read (i.e., a single I/O operation) would only handle one row of data, which would be very inefficient.
Therefore, InnoDB data is read and written on a "page" basis, meaning that when a record needs to be read, the entire page is read into memory rather than just the row record from disk.
The default size of each page is 16KB, ensuring a maximum of 16KB of contiguous storage space.
Pages are the smallest unit of disk management in the InnoDB storage engine, meaning that database reads and writes occur in units of 16KB. At least 16KB of content is read from disk into memory at a time, and at least 16KB of content is flushed from memory to disk at a time.
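The page size is visible as a server variable:

```sql
SHOW VARIABLES LIKE 'innodb_page_size';   -- 16384 bytes (16KB) by default
```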
There are many types of pages, with common ones including data pages, undo log pages, overflow pages, etc. Row records in a data table are managed using "data pages."
Extent#
We know that the InnoDB storage engine organizes data using B+ trees.
The leaf node layer of the B+ tree is connected by a doubly linked list. If storage space is allocated on a page basis, the physical locations of two adjacent pages in the linked list may not be contiguous and could be very far apart, leading to a lot of random I/O during disk queries, which is very slow.
The solution to this problem is simple: make the physical locations of adjacent pages in the linked list also adjacent, allowing for sequential I/O, which significantly improves performance during range queries (scanning leaf nodes).
So how is this achieved?
When there is a large amount of data in the table, space for an index is allocated not on a page basis but on an extent basis. Each extent is 1MB in size, meaning that 64 contiguous pages will be allocated as one extent, ensuring that the physical locations of adjacent pages in the linked list are also adjacent, allowing for sequential I/O.
Segment#
A tablespace is composed of various segments, and a segment consists of multiple extents. Segments are generally divided into data segments, index segments, and rollback segments, etc.
- Index segment: A collection of extents that store the non-leaf nodes of the B+ tree;
- Data segment: A collection of extents that store the leaf nodes of the B+ tree;
- Rollback segment: A collection of extents that store rollback data. We previously discussed that MVCC utilizes rollback segments to achieve multi-version query data.
What are the InnoDB row formats?#
InnoDB provides four row formats: Redundant, Compact, Dynamic, and Compressed row formats.
The Redundant row format is now rarely used, so we will focus on the Compact row format, as Dynamic and Compressed row formats are very similar to Compact.
What does the COMPACT row format look like?#
A complete record is divided into two parts: "additional information of the record" and "actual data of the record."
Additional Information of the Record#
The additional information of the record consists of three parts: variable-length field length list, NULL value list, and record header information.
Variable-Length Field Length List
The difference between varchar(n) and char(n) is well understood: char is fixed-length, while varchar is variable-length. The actual length (size) of the data stored in variable-length fields is not fixed.
Therefore, when storing data, the size occupied by the data must also be stored in the "variable-length field length list," allowing the corresponding length of data to be read based on this "variable-length field length list" when retrieving data. Other variable-length fields like TEXT and BLOB are implemented similarly.
Why is the information in the "variable-length field length list" stored in reverse order?
Because the next_record pointer in the "record header information" points to the position between the next record's "record header information" and "actual data": reading to the left gives the record header information, and reading to the right gives the actual data. With the lengths stored in reverse order, the length information of the earlier columns sits closest to those columns' actual data, so both are more likely to fall into the same CPU cache line, which improves the CPU cache hit rate.
The same logic applies to the NULL value list information, which also needs to be stored in reverse order.
Does every database table's row format have a "variable-length field length list"?
When a data table has no variable-length fields, such as all fields being of int type, the row format in the table will not have a "variable-length field length list."
What is the maximum value of n in varchar(n)?
MySQL stipulates that, except for large object types like TEXT and BLOB, the total byte length occupied by all columns (excluding hidden columns and record header information) cannot exceed 65535 bytes.
The n in varchar(n) represents the maximum number of characters that can be stored, not the byte size. To calculate the maximum byte size that varchar(n) can store, one must consider the character set of the database table, as the character set determines how many bytes a character occupies. For example, in the ASCII character set, one character occupies 1 byte, so varchar(100) means a maximum of 100 bytes of data can be stored.
If there are multiple fields, ensure that the total length of all fields + the byte size occupied by the variable-length field length list + the byte size occupied by the NULL value list <= 65535.
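A worked example (the table name t_test is hypothetical): with the ascii character set, a single nullable varchar column needs 2 bytes for the variable-length field length list (its length can exceed 255 bytes) and 1 byte for the NULL value list, so n can be at most 65535 − 2 − 1 = 65532:

```sql
-- Fails: 65533 + 2 + 1 > 65535
-- ERROR 1118 (42000): Row size too large. The maximum row size ... is 65535. ...
CREATE TABLE t_test (f VARCHAR(65533) NULL)
  ENGINE = InnoDB DEFAULT CHARACTER SET = ascii ROW_FORMAT = COMPACT;

-- Succeeds: 65532 + 2 + 1 = 65535
CREATE TABLE t_test (f VARCHAR(65532) NULL)
  ENGINE = InnoDB DEFAULT CHARACTER SET = ascii ROW_FORMAT = COMPACT;
```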
NULL Value List
If there are columns that allow NULL values, each such column corresponds to one binary bit, arranged in reverse order of the column order.
- A binary bit value of 1 indicates that the column's value is NULL.
- A binary bit value of 0 indicates that the column's value is not NULL.
Additionally, the NULL value list must be represented using an integer number of bytes (1 byte = 8 bits). If the number of binary bits used is insufficient for an integer number of bytes, the high bits of the byte are padded with 0.
Does every database table's row format have a "NULL value list"?
When all fields in a data table are defined as NOT NULL, the row format in the table will not have a NULL value list.
Therefore, when designing database tables, it is generally recommended to set fields as NOT NULL, which can save at least 1 byte of space (the NULL value list occupies at least 1 byte of space).
Is the "NULL value list" fixed at 1 byte of space? If so, how is it represented when a record has 9 fields, all of which are NULL?
The space for the "NULL value list" is not fixed at 1 byte.
When a record has 9 fields that are all NULL, 9 bits are needed, which does not fit in one byte, so a 2-byte "NULL value list" is created (9 bits round up to 2 bytes), and so on.
Record Header Information
The record header information contains a lot of content, and I won't list everything here, but I will mention a few important points:
- delete_mask: Indicates whether this data has been deleted. From this, we can see that when we execute delete to remove records, the records are not actually deleted; instead, the delete_mask of this record is marked as 1.
- next_record: The position of the next record. This shows that records are organized through a linked list. As mentioned earlier, it points to the position between the next record's "record header information" and "actual data," allowing for easy reading: reading left gives the record header information, and reading right gives the actual data.
- record_type: Indicates the type of the current record, where 0 represents a normal record, 1 represents a non-leaf node record of the B+ tree, 2 represents the minimum record, and 3 represents the maximum record.
Actual Data of the Record#
The actual data portion of the record includes not only the fields we defined but also three hidden fields: row_id, trx_id, and roll_pointer. Let's take a look at what these three fields are.
- row_id: If we specified a primary key or unique constraint column when creating the table, there will be no hidden row_id field. If neither a primary key nor a unique constraint is specified, InnoDB will add a hidden row_id field to the record. The row_id is not mandatory and occupies 6 bytes.
- trx_id: Transaction ID, indicating which transaction generated this data. The trx_id is mandatory and occupies 6 bytes.
- roll_pointer: A pointer to the previous version of this record. The roll_pointer is mandatory and occupies 7 bytes.
The trx_id and roll_pointer are related to the MVCC mechanism.
How does MySQL handle row overflow?#
The basic unit of interaction between disk and memory in MySQL is the page, which is generally 16KB, or 16384 bytes. A varchar(n) type column can store a maximum of 65532 bytes, and some large objects like TEXT and BLOB may store even more data, which means a page may not be able to hold a single record. This situation leads to row overflow, where excess data is stored in another "overflow page."
If a data page cannot hold a single record, the InnoDB storage engine will automatically store the overflow data in an "overflow page." Generally, InnoDB data is stored in "data pages." However, when row overflow occurs, the overflow data will be stored in "overflow pages."
When row overflow occurs, only a portion of the data for that column is saved in the actual data of the record, while the remaining data is placed in an "overflow page." The actual data portion then stores a 20-byte pointer to the address of the overflow page, through which the remaining data can be found.
The above shows how the Compact row format handles row overflow.
The Compressed and Dynamic row formats are very similar to Compact, with some differences in handling overflow data.
These two formats use a complete row overflow approach: the actual data portion of the record does not store any of the column's data itself, only a 20-byte pointer to the overflow page, and the actual data is stored entirely in the overflow page.