An index is a directory for data. A storage engine is essentially the implementation of how data is stored, how indexes are built on that data, and how data is updated and queried.
Classification of Indexes#
Indexes can be classified from four perspectives.
- By "data structure": B+ tree index, Hash index, Full-text index.
- By "physical storage": clustered index (primary key index), secondary index (auxiliary index).
- By "field characteristics": primary key index, unique index, ordinary index, prefix index.
- By "number of fields": single-column index, composite index.
Classification by Data Structure#
From the perspective of data structure, common indexes in MySQL include B+Tree index, HASH index, and Full-Text index.
B+Tree Index#
The B+Tree index is the index type most commonly used by MySQL storage engines.
When creating a table, the InnoDB storage engine selects different columns as indexes based on different scenarios:
- If there is a primary key, it will default to using the primary key as the clustered index key;
- If there is no primary key, it will choose the first unique column that does not contain NULL values as the clustered index key;
- In the absence of both, InnoDB will automatically generate an implicit auto-increment id column as the clustered index key.
The primary key index and secondary indexes created by default both use the B+Tree structure.
B+Tree is a multi-way tree in which only leaf nodes store data; non-leaf nodes store only keys used for navigation. The data in each node is stored in primary key order, and each leaf node has two pointers to the previous and next leaf nodes, forming a doubly linked list.
A B+Tree only 3-4 levels high can hold tens of millions of records, which means that finding a target row in a table of tens of millions of records takes at most 3-4 disk I/O operations. Compared with B-Trees and binary trees, the greatest advantage of the B+Tree is query efficiency: even with a large amount of data, looking up a single row stays at 3-4 disk I/Os.
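The 3-4 level claim can be sanity-checked with a rough calculation. The sizes below (16 KB pages, an 8-byte key plus a 6-byte page pointer per non-leaf entry, ~1 KB rows) are illustrative assumptions, not exact InnoDB internals:

```python
# Back-of-envelope estimate of how many rows a 3-level B+Tree can hold.
PAGE_SIZE = 16 * 1024   # InnoDB's default page size is 16 KB
KEY_SIZE = 8            # assumed: bigint primary key
POINTER_SIZE = 6        # assumed: size of a child-page pointer
ROW_SIZE = 1024         # assumed: ~1 KB per user record in the leaf pages

fanout = PAGE_SIZE // (KEY_SIZE + POINTER_SIZE)  # entries per non-leaf page
rows_per_leaf = PAGE_SIZE // ROW_SIZE            # records per leaf page

# height 3: root page -> internal pages -> leaf pages
capacity = fanout * fanout * rows_per_leaf
print(f"fan-out per non-leaf page: {fanout}")        # 1170
print(f"3-level capacity: about {capacity:,} rows")  # about 21,902,400 rows
```

Under these assumptions a 3-level tree already covers more than 20 million rows, which is where the "tens of millions in 3-4 levels" figure comes from.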
Why does MySQL InnoDB choose B+tree as the index data structure?
1. B+Tree vs B Tree
B+Tree stores data only in leaf nodes, while a B Tree also stores data in non-leaf nodes. A B+Tree non-leaf node therefore carries less data per entry, so more index entries fit in a single node and more of the tree can be covered with the same number of disk I/O operations. In addition, B+Tree leaf nodes are connected in a doubly linked list, which suits the range-based sequential scans common in MySQL; a B Tree has no such links.
2. B+Tree vs Hash
Hash is very fast for equality queries, with a search complexity of O(1). However, hash tables are not suitable for range queries; they are better suited for equality queries, which is why B+Tree indexes have a broader range of applicable scenarios than hash table indexes.
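A toy illustration of the difference, using Python's dict as the hash table and a sorted list as a stand-in for the ordered B+Tree leaves:

```python
import bisect

# A dict is a hash table: O(1) average equality lookup, but unordered.
data = {k: f"row{k}" for k in range(100_000)}
assert data[4242] == "row4242"   # equality query: one hash probe

# Range query on the hash table: no ordering, so every key must be examined.
hash_range = sorted(k for k in data if 100 <= k < 110)   # O(n) scan

# Range query on a sorted array (stand-in for ordered B+Tree leaves):
keys = sorted(data)
lo = bisect.bisect_left(keys, 100)   # O(log n) to locate the lower bound
hi = bisect.bisect_left(keys, 110)
tree_range = keys[lo:hi]             # then a short sequential read

assert hash_range == tree_range == list(range(100, 110))
```

Both structures answer the range query correctly, but the hash table had to visit every key while the sorted structure located the boundary in logarithmic time, which is exactly why B+Tree indexes cover more scenarios.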
Classification by Physical Storage#
By physical storage, indexes are divided into the clustered index (primary key index) and secondary indexes (auxiliary indexes).
The differences between these two are:
- The leaf nodes of the B+Tree of the primary key index store the actual data, with all complete user records stored in the leaf nodes of the primary key index's B+Tree;
- The leaf nodes of the B+Tree of the secondary index store the primary key values, not the actual data.
Covering Index: when the data a query needs can be found entirely in the secondary index, there is no need to look up the primary key index; this process is called a covering index.
Back to Table: if the queried data is not all present in the secondary index, MySQL first searches the secondary index's B+Tree down to the leaf node to obtain the primary key value, and then searches the primary key index's B+Tree to find the full record. This process is called going back to the table.
Classification by Field Characteristics#
From the perspective of field characteristics, indexes are divided into primary key index, unique index, ordinary index, and prefix index.
1. Primary Key Index
A primary key index is an index established on the primary key field, usually created together when the table is created. A table can have only one primary key index, and the values in the index column cannot be NULL.
2. Unique Index
A unique index is an index established on UNIQUE fields. A table can have multiple unique indexes, and the values in the index column must be unique but can be NULL.
3. Ordinary Index
An ordinary index is an index established on ordinary fields, which are neither required to be primary keys nor required to be UNIQUE.
4. Prefix Index
A prefix index refers to an index established on the first few characters of a character-type field rather than on the entire field. Prefix indexes can be established on columns of types char, varchar, binary, and varbinary. The purpose of using a prefix index is to reduce the storage space occupied by the index and improve query efficiency.
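One common way to choose the prefix length is to compare the number of distinct prefixes against the number of distinct full values; a sketch with made-up sample data:

```python
# Picking a prefix length: find the shortest prefix whose distinct-value
# count approaches that of the full column. The sample emails are made up.
emails = [
    "alice@example.com", "albert@example.com", "bob@example.com",
    "bruce@example.com", "carol@example.com", "carlos@example.com",
]

full = len(set(emails))   # 6 distinct full values
ratios = {n: len({e[:n] for e in emails}) / full for n in range(1, 8)}
for n, r in ratios.items():
    print(f"prefix length {n}: {r:.2f}")   # 1.00 means as selective as the full column
```

In MySQL, once a length is chosen, the prefix index itself would be created with syntax like create index idx_email on t(email(4)), for a hypothetical table t.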
Classification by Number of Fields#
From the perspective of the number of fields, indexes are divided into single-column index and composite index (compound index).
- An index established on a single column is called a single-column index, such as a primary key index;
- An index established on multiple columns is called a composite index;
Composite Index
By combining multiple fields into one index, that index is called a composite index. For example, combining the product_no and name fields in the product table into a composite index (product_no, name) can be created as follows:
CREATE INDEX index_product_no_name ON product(product_no, name);
As can be seen, the non-leaf nodes of the composite index use the values of the two fields as the key values of the B+Tree. When querying data in the composite index, it first compares the product_no field, and in cases where product_no is the same, it compares the name field.
In other words, the B+Tree of the composite index is sorted first by product_no, and then by name when product_no is the same.
Therefore, composite indexes follow the leftmost matching principle: index matching proceeds with leftmost priority. If a query does not follow the leftmost matching principle, the composite index cannot be used, and the fast-lookup benefit of the index is lost.
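The leftmost matching principle can be observed directly in an execution plan. The sketch below uses SQLite (Python's stdlib sqlite3) as a stand-in, with the product table from the example above; MySQL's EXPLAIN reports the analogous information:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE product (id INTEGER PRIMARY KEY, product_no TEXT, name TEXT, price REAL)")
conn.execute("CREATE INDEX idx_pn_name ON product(product_no, name)")

def plan(sql):
    # concatenate the detail column of EXPLAIN QUERY PLAN output
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Leading column present: the composite index is used.
p_leading = plan("SELECT * FROM product WHERE product_no = 'p1'")
p_both = plan("SELECT * FROM product WHERE product_no = 'p1' AND name = 'n1'")

# Only the second column: the leftmost prefix is missing, so a full scan results.
p_name_only = plan("SELECT * FROM product WHERE name = 'n1'")

print(p_leading)
print(p_both)
print(p_name_only)
```

The first two plans should report a SEARCH using idx_pn_name, while the name-only query falls back to a scan of the table.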
Range Queries in Composite Indexes#
The leftmost matching principle of a composite index keeps matching fields to the right until it encounters a range comparison such as > or <, at which point matching stops: the range-queried field itself can still use the composite index, but the fields after it cannot. Note that >=, <=, BETWEEN, and like with a prefix pattern ('xx%') do not stop the matching, because each of them also includes an equality boundary that lets the next field continue to be used.
Index Condition Pushdown#
Index condition pushdown optimization can filter out records that do not meet the conditions during the traversal of the composite index, reducing the number of back-to-table operations.
For a composite index (a, b), consider the statement select * from table where a > 1 and b = 2. By the leftmost matching principle, only the a field can use the index to locate records. Without index condition pushdown, for every index record with a > 1 the engine goes back to the table and the server layer then checks whether b = 2. With index condition pushdown, while traversing the secondary index the engine first checks b = 2 against the value stored in the index itself, filters out records that fail, and only goes back to the table for records that pass, greatly reducing back-to-table operations.
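The effect can be sketched with a toy model (made-up data, not InnoDB internals): the "secondary index" holds (a, b, primary key) tuples, the "table" maps primary keys to full rows, and we count back-to-table lookups with and without pushdown:

```python
# Toy model of index condition pushdown (ICP) on a composite index (a, b).
# Query: WHERE a > 1 AND b = 2
table = {pk: {"a": a, "b": b, "c": pk * 10}
         for pk, (a, b) in enumerate([(0, 2), (2, 1), (2, 2), (3, 1), (3, 2)])}
index = sorted((row["a"], row["b"], pk) for pk, row in table.items())

def without_icp():
    lookups, results = 0, []
    for a, b, pk in index:
        if a > 1:                # only the condition on a is applied in the index
            lookups += 1         # back to table for every a-match
            row = table[pk]
            if row["b"] == 2:    # b is checked at the server layer, after the lookup
                results.append(row)
    return results, lookups

def with_icp():
    lookups, results = 0, []
    for a, b, pk in index:
        if a > 1 and b == 2:     # b is checked inside the index traversal
            lookups += 1         # back to table only for full matches
            results.append(table[pk])
    return results, lookups

r1, n1 = without_icp()
r2, n2 = with_icp()
assert r1 == r2
print(f"back-to-table lookups: without ICP = {n1}, with ICP = {n2}")
```

Both versions return the same rows, but pushdown halves the back-to-table lookups in this tiny example; on a large range the savings scale with the number of filtered-out records.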
Index Selectivity#
Fields placed earlier in a composite index are more likely to be used for filtering, so in actual development work, place fields with high selectivity (distinctness) at the front of the composite index, as they are more likely to be usable by more SQL queries.
Selectivity is the number of distinct values in a column divided by the total number of rows in the table:
selectivity = count(distinct column) / count(*)
For example, a gender column has very low selectivity and is not a good candidate for an index or for the front of a composite index, while a field like a UUID has high selectivity and is well suited to being indexed or placed at the front of a composite index.
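A quick sketch of the calculation with made-up data:

```python
# Selectivity = distinct values / total rows (1.0 is perfectly selective).
rows = 10_000
genders = ["male" if i % 2 else "female" for i in range(rows)]
order_nos = [f"ORD-{i:06d}" for i in range(rows)]   # unique per row, like a UUID

def selectivity(column):
    return len(set(column)) / len(column)

print(f"gender:   {selectivity(genders):.4f}")    # 0.0002: poor index candidate
print(f"order_no: {selectivity(order_nos):.4f}")  # 1.0000: excellent index candidate
```

The same numbers can be obtained in SQL with select count(distinct column) / count(*) from table_name.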
Sorting with Composite Index#
For the following SQL, how can we improve query efficiency using indexes?
select * from t_order where status = 1 order by create_time asc
A better way is to create a composite index on the status and create_time columns, as this can avoid file sorting in the MySQL database.
When querying, if only an index on status is used, the matched rows still have to be sorted by create_time, which requires a file sort: the Extra column of the SQL execution plan will show Using filesort.
Therefore, to utilize the ordered nature of the index, a composite index should be established on the status and create_time columns, so that the data filtered by status is already sorted by create_time, avoiding file sorting and improving query efficiency.
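SQLite (via Python's stdlib sqlite3) exposes the same effect: its USE TEMP B-TREE FOR ORDER BY plan line plays the role of MySQL's Using filesort. A sketch with an illustrative t_order table:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t_order (id INTEGER PRIMARY KEY, status INTEGER, create_time TEXT, amount REAL)")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

query = "SELECT * FROM t_order WHERE status = 1 ORDER BY create_time ASC"

# With an index on status alone, the matched rows still have to be sorted.
conn.execute("CREATE INDEX idx_status ON t_order(status)")
p_single = plan(query)
print(p_single)   # contains "USE TEMP B-TREE FOR ORDER BY" (SQLite's filesort)

# With a composite index, rows come out already ordered by create_time.
conn.execute("CREATE INDEX idx_status_time ON t_order(status, create_time)")
p_composite = plan(query)
print(p_composite)
```

Once the composite index exists, the temp B-tree (sort) step disappears from the plan, which is the same win that avoiding Using filesort gives in MySQL.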
When is it necessary / unnecessary to create an index?#
The greatest benefit of indexes is to improve query speed, but indexes also have drawbacks, such as:
- They require physical space, and the larger the number, the more space they occupy;
- Creating and maintaining indexes takes time, and this time increases with the amount of data;
- They can reduce the efficiency of insert, delete, and update operations because every time an index is modified, the B+ tree needs to be dynamically maintained to keep the index ordered.
When is it suitable to use indexes?#
- For fields with uniqueness constraints, such as product codes;
- For fields frequently used in WHERE query conditions, as this can improve the overall query speed of the table. If the query condition involves multiple fields, a composite index can be created.
- For fields frequently used in GROUP BY and ORDER BY, as this means that sorting does not need to be done again during querying, since we already know that the records in the B+Tree are sorted after creating the index.
When is it unnecessary to create an index?#
- For fields that are not used in WHERE conditions, GROUP BY, or ORDER BY, as the value of the index is to quickly locate records. If a field does not serve this purpose, it is usually unnecessary to create an index because indexes occupy physical space.
- For fields with a large amount of duplicate data, indexes are unnecessary. For example, the gender field only has male and female. If the distribution of records in the database table is even, searching for any value may yield half the data. In these cases, it is better not to have an index because MySQL has a query optimizer that generally ignores indexes and performs a full table scan when it finds that a certain value appears in a high percentage of rows.
- When the table data is too small, indexes are unnecessary;
- For fields that are frequently updated, indexes should not be created. For example, do not create an index on user balances in an e-commerce project because frequently modifying indexed fields requires frequent rebuilding of the index to maintain the ordered nature of the B+Tree, which can affect database performance.
What methods are there to optimize indexes?#
Prefix Index Optimization#
As the name suggests, a prefix index is an index established using the first few characters of a string in a certain field. Why do we need to use prefixes to establish indexes?
Using prefix indexes reduces the size of indexed fields, allowing more index values to be stored in a single index page, effectively improving the query speed of the index. When using large string fields as indexes, prefix indexes can help reduce the size of index entries.
However, prefix indexes have certain limitations, such as:
- They cannot be used for order by;
- They cannot be used as covering indexes;
Covering Index Optimization#
Assuming we only need to query the name and price of a product, what method can we use to avoid back-to-table operations?
We can create a composite index on (product ID, name, price). Since all the queried data exists in the index, the query does not need to access the primary key index again, which avoids the back-to-table operation.
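SQLite's query planner labels this case explicitly, which makes it a convenient stand-in for a demonstration (MySQL would show Using index in the Extra column instead). The table and index names below are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE product (
    id INTEGER PRIMARY KEY, product_no TEXT, name TEXT, price REAL, stock INTEGER)""")
conn.execute("CREATE INDEX idx_no_name_price ON product(product_no, name, price)")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# All selected columns live in the index: no back-to-table lookup is needed.
covering = plan("SELECT name, price FROM product WHERE product_no = 'p1'")
print(covering)   # reports "USING COVERING INDEX"

# 'stock' is not in the index, so the full row must be fetched from the table.
not_covering = plan("SELECT stock FROM product WHERE product_no = 'p1'")
print(not_covering)
```

The second plan still searches the index to locate the row, but without the COVERING label: the engine has to go back to the table for the stock column.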
Primary Key Index Should Be Auto-Incremented#
If we use an auto-increment primary key, then each new piece of data will be added in order to the current index node position without needing to move existing data. When the page is full, a new page will be automatically allocated. Since each time a new record is inserted, it is an append operation that does not require moving data, this method of inserting data is very efficient.
If we use a non-auto-increment primary key, each inserted index value is effectively random, so new rows may have to be inserted into the middle of an existing full data page, which requires moving other data to make room and may even require copying data from one page to another. This situation is called page splitting. Page splitting also leaves pages fragmented and under-filled, resulting in a less compact index structure, which hurts query efficiency.
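A toy simulation (fixed-capacity "pages" holding sorted keys, nothing like real InnoDB page management) makes the effect visible:

```python
import bisect
import random

PAGE_CAPACITY = 4   # tiny pages so splits are easy to observe

def insert_all(keys):
    """Insert keys into sorted fixed-capacity pages, counting page splits."""
    pages, splits = [[]], 0
    for key in keys:
        # locate the page whose key range should hold this key
        i = len(pages) - 1
        while i > 0 and pages[i][0] > key:
            i -= 1
        page = pages[i]
        if len(page) == PAGE_CAPACITY:
            if i == len(pages) - 1 and key > page[-1]:
                pages.append([])                 # pure append: open a fresh page
                page = pages[-1]
            else:
                half = page[PAGE_CAPACITY // 2:]  # move half the records out
                del page[PAGE_CAPACITY // 2:]
                pages.insert(i + 1, half)
                splits += 1                       # this is a page split
                page = half if key >= half[0] else page
        bisect.insort(page, key)
    return splits

random.seed(42)
n = 1000
seq_splits = insert_all(range(n))                    # auto-increment style
rnd_splits = insert_all(random.sample(range(n), n))  # random primary keys
print("sequential inserts, splits:", seq_splits)
print("random inserts, splits:   ", rnd_splits)
```

Sequential keys only ever append to the rightmost page, so no data is moved; random keys land in the middle of full pages and force repeated splits.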
Indexes Should Be Set to NOT NULL#
The first reason: if the indexed column contains NULL, the optimizer's index selection becomes more complicated and harder to optimize, since nullable columns make indexing, index statistics, and value comparisons all more complex. For example, when gathering statistics, count(column) skips rows where the column is NULL.
The second reason: NULL values are meaningless but occupy physical space, leading to storage space issues. When InnoDB stores records, if there are fields that allow NULL in the table, then at least 1 byte of space will be used in the row format to store the NULL value list.
Preventing Index Invalidity#
Situations that cause index invalidity:
- When we use left fuzzy matching or both-sided fuzzy matching, such as like '%xx' or like '%xx%', the index becomes invalid;
- When we perform calculations, functions, or type conversions on indexed columns in the query conditions, these situations will cause index invalidity;
- To correctly use composite indexes, the leftmost matching principle must be followed, meaning that matching of the index is done in a leftmost priority manner; otherwise, index invalidity will occur.
- In the WHERE clause, if the condition column before OR is an indexed column, and the condition column after OR is not an indexed column, then the index will become invalid.
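The function-on-column case is easy to reproduce. The sketch below uses SQLite (Python's stdlib sqlite3) as a stand-in for MySQL; the table is made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT, age INTEGER)")
conn.execute("CREATE INDEX idx_name ON users(name)")

def plan(sql):
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Direct comparison on the indexed column: the index is used.
p_eq = plan("SELECT * FROM users WHERE name = 'alice'")
print(p_eq)

# A function applied to the indexed column: the index can no longer be used.
p_fn = plan("SELECT * FROM users WHERE substr(name, 1, 1) = 'a'")
print(p_fn)

# Leading-wildcard LIKE: also forces a full scan.
p_like = plan("SELECT * FROM users WHERE name LIKE '%ce'")
print(p_like)
```

The first query searches idx_name; the other two degrade to a table scan, the same invalidation MySQL's EXPLAIN would show.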
Which count has the best performance?#
Execution cost, from lowest to highest: count(*) = count(1) < count(field) (when the field has a secondary index) < count(primary key field) (when the table has only the primary key index) < count(field) (when the field has no index at all).
What is count?#
count(expression) is an aggregate function whose parameter can be a field name or any other expression. It counts how many of the records matching the query conditions have a non-NULL value for that parameter.
What is the execution process of count(primary key field)?#
When counting how many records there are using the count function, the MySQL server layer maintains a variable called count.
The server layer loops through InnoDB to read a record. If the parameter specified by the count function is not NULL, it increments the count variable by 1 until all records that meet the query are read, at which point it exits the loop. Finally, the value of the count variable is sent to the client.
InnoDB saves records using B+ trees, which are divided into clustered indexes and secondary indexes based on the type of index. The difference is that the leaf nodes of the clustered index store actual data, while the leaf nodes of the secondary index store primary key values, not actual data.
If the table only has a primary key index and no secondary index, then InnoDB will loop through the clustered index and return the records read to the server layer, then read the id values from the records to check if they are NULL. If they are not NULL, it increments the count variable by 1.
However, if the table has a secondary index, the object of the loop traversal by InnoDB will not be the clustered index but the secondary index.
This is because the same number of secondary index records can occupy less storage space than clustered index records, so the secondary index tree is smaller than the clustered index tree. Thus, the I/O cost of traversing the secondary index is lower than that of traversing the clustered index, which is why the "optimizer" prefers to choose the secondary index.
What is the execution process of count(1) and count(*)?#
The execution process of count(1) and count(*) is basically the same, because MySQL converts count(*) into count(0).
InnoDB loops through the clustered index (or, as described above, the smallest secondary index if one exists) and returns the records read to the server layer, but it does not read any field values from the records, because the parameter of the count function is 1, not a field. Since 1 is clearly not NULL, the server layer increments the count variable by 1 for every record InnoDB returns.
It can be seen that count(1) is one step shorter than count(primary key field) because it does not need to read field values from the records, so it is often said that count(1) executes slightly more efficiently than count(primary key field).
What is the execution process of count(field)?#
count(field) has the worst execution efficiency when the field has no index, because counting it then requires a full table scan. If the field has a secondary index, that index is traversed instead, but InnoDB must still read the field's value from every record to check whether it is NULL, so it remains slower than count(1) and count(*).
Differences between MyISAM and InnoDB#
When using the MyISAM engine, executing count(*) without a WHERE condition takes only O(1) time, because each MyISAM table stores a row_count value in its table metadata, kept consistent by table-level locks, so the result can be read directly. InnoDB cannot do this: it supports transactions, and under MVCC different transactions may see different numbers of rows, so the records must actually be counted.
How to optimize count(*)?#
When counting records in a large table, count executes very slowly. For example, the t_order table has over 12 million records; even with a secondary index, executing select count(*) from t_order takes about 5 seconds!
How can we improve efficiency?
First method, approximate values
If the business does not require a precise count (for example, a search engine showing an approximate number of matching results), we can use the show table status or explain command to get an estimated row count for the table.
Second method, save count values in an additional table
If you want to accurately obtain the total number of records in the table, you can save this count value in a separate count table.
When inserting a record into the data table, increment the count field in the count table by 1; when deleting a record, decrement it by 1. In other words, insert and delete operations must also maintain this count table.
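A sketch of this scheme, using SQLite (Python's stdlib sqlite3) with triggers to keep the count table in sync; in MySQL the same could be done with triggers or by updating the count table inside the same transaction:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t_order (id INTEGER PRIMARY KEY, amount REAL);
CREATE TABLE t_order_count (cnt INTEGER NOT NULL);
INSERT INTO t_order_count VALUES (0);

-- keep the counter in sync on every insert and delete
CREATE TRIGGER order_ins AFTER INSERT ON t_order
BEGIN UPDATE t_order_count SET cnt = cnt + 1; END;
CREATE TRIGGER order_del AFTER DELETE ON t_order
BEGIN UPDATE t_order_count SET cnt = cnt - 1; END;
""")

conn.executemany("INSERT INTO t_order (amount) VALUES (?)", [(9.9,), (19.9,), (29.9,)])
conn.execute("DELETE FROM t_order WHERE id = 1")

# O(1) read of the total: a single-row lookup, no scan of t_order
total = conn.execute("SELECT cnt FROM t_order_count").fetchone()[0]
print(total)   # -> 2
```

The trade-off is write overhead: every insert and delete now also updates the count table, so this suits tables that are read-count-heavy rather than write-heavy.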