ShuWa

TCP

Basic Concepts of TCP#

TCP Header Format#

image

  • Sequence Number: A random number generated by the host as the initial value when the connection is established, carried to the peer in the SYN packet. As data is sent, the sequence number advances by the number of data bytes transmitted. It solves the problem of out-of-order network packets.
  • Acknowledgment Number: Refers to the sequence number of the next "expected" data to be received. Once the sender receives this acknowledgment, it can assume that all data prior to this sequence number has been received normally. It is used to solve the problem of packet loss.
  • Control Flags:
    ACK: When this bit is set to 1, the "Acknowledgment" field becomes valid. TCP specifies that this bit must be set to 1 except for the initial SYN packet when establishing a connection.
    RST: When this bit is set to 1, it indicates that an exception has occurred on the TCP connection and the connection must be forcibly terminated; the connection is torn down immediately, without going through the four-way handshake.
    SYN: When this bit is set to 1, it indicates the desire to establish a connection and sets the initial value of the sequence number in the "Sequence Number" field.
    FIN: When this bit is set to 1, it indicates that no more data will be sent in the future and the connection is desired to be terminated. When communication ends and the connection is to be terminated, the hosts on both sides can exchange TCP segments with the FIN bit set to 1.

How to uniquely identify a TCP connection?#

The TCP four-tuple can uniquely identify a connection, which includes:
Source Address, Source Port, Destination Address, Destination Port

  • The fields for source and destination addresses (32 bits) are in the IP header, used to send packets to the other host via the IP protocol.
  • The fields for source and destination ports (16 bits) are in the TCP header, used to inform the TCP protocol which process the packet should be sent to.
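The four-tuple can be observed directly from a socket. A minimal sketch over the loopback interface (port numbers are chosen by the kernel, so the exact values vary per run):

```python
import socket

# A TCP connection is identified by (source IP, source port,
# destination IP, destination port). getsockname()/getpeername()
# expose the local and remote halves of that four-tuple.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

# The client's view of the four-tuple of this connection.
four_tuple = client.getsockname() + client.getpeername()
print(four_tuple)

# The server's view is the mirror image: its local address is the
# client's peer address, and vice versa.
assert conn.getsockname() == client.getpeername()
assert conn.getpeername() == client.getsockname()

for s in (conn, client, server):
    s.close()
```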

What is the maximum number of connections a server listening on a single IP and port can accept?

The server typically listens on a fixed local port, waiting for connection requests from clients.

Therefore, the client IP and port are variable, and the theoretical value calculation formula is as follows:

Maximum TCP connections = Number of client IPs * Number of client ports

For IPv4, the maximum number of client IPs is 2 to the power of 32, and the maximum number of client ports is 2 to the power of 16, which means the maximum TCP connections on a single server is approximately 2 to the power of 48.
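The arithmetic behind that theoretical bound:

```python
# Theoretical upper bound on connections to one listening (IP, port):
# every distinct client (IP, port) pair is a distinct four-tuple.
client_ips = 2 ** 32    # IPv4 address space (32-bit source address)
client_ports = 2 ** 16  # 16-bit source port field
max_connections = client_ips * client_ports
print(max_connections == 2 ** 48)  # True
```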
Of course, the maximum concurrent TCP connections on the server cannot reach the theoretical limit and will be affected by the following factors:

  1. File Descriptor Limit: Each TCP connection is a file. If the file descriptors are exhausted, a Too many open files error will occur. Linux imposes three types of limits on the number of open file descriptors:

    • System level: The maximum number of files that can be opened by the current system, viewable via cat /proc/sys/fs/file-max;
    • User level: The maximum number of files that a given user can open, configured in /etc/security/limits.conf;
    • Process level: The maximum number of files that can be opened by a single process, viewable via cat /proc/sys/fs/nr_open;
  2. Memory Limit: Each TCP connection occupies a certain amount of memory. The operating system's memory is limited, and if memory resources are exhausted, an OOM (Out of Memory) error will occur.
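A quick way to inspect these limits from a process, a sketch assuming a Linux host (the `/proc` path is Linux-specific):

```python
import resource

# Each TCP connection consumes a file descriptor, so RLIMIT_NOFILE
# caps how many connections a single process can hold open.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"per-process fd limit: soft={soft}, hard={hard}")

# The system-wide ceiling lives in /proc on Linux.
try:
    with open("/proc/sys/fs/file-max") as f:
        print("system-wide file-max:", f.read().strip())
except FileNotFoundError:
    pass  # not a Linux /proc filesystem
```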

Since the IP layer can fragment, why does the TCP layer still need MSS?#

image

  • MTU: The maximum length of a network packet, generally 1500 bytes in Ethernet;
  • MSS: The maximum length of TCP data that can be accommodated in a network packet after excluding the IP and TCP headers;

When the IP layer has data (TCP header + TCP data) that exceeds the MTU size to send, the IP layer must fragment the data into several pieces, ensuring that each fragment is smaller than the MTU. After fragmenting an IP datagram, the IP layer of the destination host will reassemble it and then pass it to the upper layer TCP transport layer.
This seems orderly, but there is a hidden danger: if one IP fragment is lost, all fragments of the entire IP datagram must be retransmitted.

Because the IP layer itself does not have a timeout retransmission mechanism, it is the TCP transport layer that is responsible for timeouts and retransmissions.
When a certain IP fragment is lost, the receiving IP layer cannot assemble a complete TCP datagram (header + data) and cannot deliver the datagram to the TCP layer. Therefore, the receiver will not respond with an ACK to the sender. Since the sender does not receive the ACK confirmation message for a long time, it will trigger a timeout retransmission, retransmitting the "entire TCP datagram (header + data)".

Thus, for optimal transmission efficiency, TCP exchanges MSS values when establishing the connection: each side announces its MSS in its SYN packet, and the smaller of the two is used. When the TCP layer finds that the data exceeds the MSS, it segments the data first, so the resulting IP packets never exceed the MTU and no IP fragmentation is needed.
After TCP layer fragmentation, if a TCP fragment is lost, retransmission will also be done in units of MSS, rather than retransmitting all fragments, greatly increasing the efficiency of retransmission.
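The MTU/MSS relationship is simple arithmetic (assuming the minimum 20-byte IP and TCP headers, i.e. no options):

```python
# Typical Ethernet: MTU 1500 bytes. Without IP/TCP options, each
# header is 20 bytes, leaving 1460 bytes of TCP payload per segment.
MTU = 1500
IP_HEADER = 20   # minimum IPv4 header
TCP_HEADER = 20  # minimum TCP header
MSS = MTU - IP_HEADER - TCP_HEADER
print(MSS)  # 1460

# Segmenting at the TCP layer in MSS-sized units guarantees every
# resulting IP packet fits in the MTU, so IP never fragments.
data_len = 10_000
segments = -(-data_len // MSS)  # ceiling division
print(segments)  # 7
```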

What are the differences between UDP and TCP? What are their respective application scenarios?#

image

Differences between TCP and UDP:#

  1. Connection
    TCP is a connection-oriented transport layer protocol, requiring a connection to be established before transmitting data.
    UDP does not require a connection and transmits data immediately.
  2. Service Object
    TCP is a one-to-one service, meaning a single connection has only two endpoints.
    UDP supports one-to-one, one-to-many, and many-to-many interactive communication.
  3. Reliability
    TCP reliably delivers data, ensuring that data is error-free, not lost, not duplicated, and arrives in order.
    UDP makes a best-effort delivery without guaranteeing reliable data delivery. However, a reliable transport protocol can be implemented on top of UDP, such as the QUIC protocol. For more details, refer to the article "How to implement reliable transmission based on the UDP protocol?".
  4. Congestion Control and Flow Control
    TCP has congestion control and flow control mechanisms to ensure the safety of data transmission.
    UDP does not have these mechanisms, and even if the network is very congested, it will not affect the sending rate of UDP.
  5. Header Overhead
    TCP has a longer header length, resulting in some overhead. The header is 20 bytes if the "options" field is not used, and it will be longer if the "options" field is used.
    UDP has a fixed header of only 8 bytes, resulting in lower overhead.
  6. Transmission Method
    TCP is stream-oriented, with no boundaries, but guarantees order and reliability.
    UDP sends data in packets, which have boundaries, but may lose packets and be out of order.
  7. Fragmentation Differences
    If TCP data size exceeds MSS, it will fragment at the transport layer. The destination host will also reassemble the TCP packet at the transport layer. If a fragment is lost, only that fragment needs to be retransmitted.
    If UDP data size exceeds MTU, it will fragment at the IP layer. The destination host will reassemble the data at the IP layer and then pass it to the transport layer.
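The "boundaries" difference in point 6 is easy to demonstrate: each UDP `sendto()` becomes exactly one datagram, and each `recvfrom()` returns exactly one. A minimal loopback sketch:

```python
import socket

# UDP preserves message boundaries and needs no connection setup.
receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))
receiver.settimeout(5)  # defensive: don't block forever
addr = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sender.sendto(b"hello", addr)  # one sendto() == one datagram
sender.sendto(b"world", addr)

first, _ = receiver.recvfrom(4096)   # returns exactly one datagram,
second, _ = receiver.recvfrom(4096)  # never a merged byte stream
print(first, second)

sender.close()
receiver.close()
```

With TCP, by contrast, the two writes could arrive as one contiguous byte stream; the application must impose its own message framing.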
Application Scenarios for TCP and UDP:#

Due to TCP being connection-oriented and able to guarantee reliable data delivery, it is often used for:

  • FTP file transfer;
  • HTTP / HTTPS;

Due to UDP being connectionless, it can send data at any time, and its processing is both simple and efficient, it is often used for:

  • Communication with a small number of packets, such as DNS, SNMP, etc.;
  • Multimedia communication such as video and audio;
  • Broadcast communication;

TCP Connection Establishment#

What is the process of the TCP three-way handshake?#

image

  • Initially, both the client and server are in the CLOSE state. The server actively listens on a certain port, in the LISTEN state.
  • The client randomly initializes a sequence number (client_isn), places this number in the "Sequence Number" field of the TCP header, and sets the SYN flag to 1, indicating a SYN packet. It then sends the first SYN packet to the server, indicating a request to establish a connection. This packet does not contain application layer data, and the client then enters the SYN-SENT state.
  • After receiving the client's SYN packet, the server also randomly initializes its own sequence number (server_isn), fills this number into the "Sequence Number" field of the TCP header, and fills the "Acknowledgment Number" field with client_isn + 1. It then sets both the SYN and ACK flags to 1. Finally, it sends this packet to the client, which also does not contain application layer data, and the server then enters the SYN-RCVD state.
  • After the client receives the server's packet, it must respond with the final acknowledgment packet. First, the ACK flag in the TCP header of this acknowledgment packet is set to 1, and the "Acknowledgment Number" field is filled with server_isn + 1. Finally, the packet is sent to the server, and this packet can carry data from the client to the server. The client then enters the ESTABLISHED state.
  • After the server receives the client's acknowledgment packet, it also enters the ESTABLISHED state.

From the above process, it can be seen that the third handshake can carry data, while the first two handshakes cannot carry data, which is a common interview question.
Once the three-way handshake is completed, both parties are in the ESTABLISHED state, and the connection is established, allowing the client and server to send data to each other.

How to check TCP status in a Linux system?

In Linux, you can check it using the command netstat -napt.

What happens if the first handshake is lost?

If the client does not receive the server's SYN-ACK packet (the second handshake) for a long time, it will trigger the "timeout retransmission" mechanism, retransmitting the SYN packet, and the retransmitted SYN packet will have the same sequence number.
In Linux, the maximum retransmission count for the client's SYN packet is controlled by the tcp_syn_retries kernel parameter, which can be customized, with a default value usually set to 5. Each timeout duration is twice that of the previous one.
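Under that backoff (assuming the 1-second initial timeout Linux uses for the first SYN), the total time before the client gives up works out as:

```python
# With tcp_syn_retries = 5 and a 1-second initial timeout that doubles
# on each retry, an unanswered SYN is abandoned after roughly:
initial_timeout = 1  # seconds (assumed Linux default for the first SYN)
retries = 5
total = sum(initial_timeout * 2 ** i for i in range(retries + 1))
print(total)  # 63 seconds: 1 + 2 + 4 + 8 + 16 + 32
```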

What happens if the second handshake is lost?

  • The client will retransmit the SYN packet (the first handshake), with the maximum retransmission count determined by the tcp_syn_retries kernel parameter;
  • The server will retransmit the SYN-ACK packet (the second handshake), with the maximum retransmission count determined by the tcp_synack_retries kernel parameter, whose default value is 5.

What happens if the third handshake is lost?

When the third handshake is lost, if the server does not receive this acknowledgment packet for a long time, it will trigger the timeout retransmission mechanism, retransmitting the SYN-ACK packet until it receives the third handshake or reaches the maximum retransmission count.
Note: ACK packets are not retransmitted. If an ACK is lost, the other party will retransmit the corresponding packet.

Why is three-way handshake necessary?#

  1. Avoid Historical Connections
    The primary reason for the three-way handshake is to prevent confusion caused by the initialization of old duplicate connections.
    Consider a scenario where the client first sends a SYN (seq = 90) packet, then the client crashes, and this SYN packet is blocked in the network, so the server does not receive it. The client then restarts and tries to establish a connection with the server again by sending a SYN (seq = 100) packet (note! This is not a retransmission of the SYN; the retransmitted SYN has the same sequence number).
    Let's see how the three-way handshake prevents historical connections:
    The client continuously sends multiple SYN packets (all with the same four-tuple) to establish a connection, and in a congested network:
    • An "old SYN packet" arrives at the server before the "latest SYN" packet, so the server will respond with a SYN + ACK packet to the client, with the acknowledgment number being 91 (90+1).
    • The client receives it and finds that the expected acknowledgment number should be 100 + 1, not 90 + 1, so it will respond with a RST packet.
    • The server receives the RST packet and releases the connection.
    • Afterward, the latest SYN arrives at the server, and the client and server can complete the three-way handshake normally.

The "old SYN packet" mentioned above is called a historical connection. The main reason TCP uses a three-way handshake to establish a connection is to prevent the initialization of "historical connections."
In the case of two-way handshakes, the server does not have an intermediate state to prevent historical connections from the client, which may lead to the server establishing a historical connection, wasting resources.
  2. Synchronize Initial Sequence Numbers
    Sequence numbers are a key factor in reliable transmission, serving to:
    • Allow the receiver to discard duplicate data;
    • Allow the receiver to accept data packets in order based on their sequence numbers;
    • Identify which of the packets sent have been received by the other party (known through the sequence number in the ACK packet).

Therefore, when the client sends a SYN packet containing the "initial sequence number," the server needs to respond with an ACK packet to indicate that the client's SYN packet has been successfully received. Likewise, when the server sends its "initial sequence number" to the client, it also needs to receive a response from the client. This back-and-forth ensures that both parties' initial sequence numbers are reliably synchronized.
  3. Avoid Resource Waste
    If there were only "two handshakes," then when the client's SYN packet is blocked in the network and the client, having received no ACK, resends it, the server has no way to know whether the client received its ACK; without a third handshake, the server must establish a connection every time it receives a SYN.
    If the client's SYN packets are blocked in the network and multiple SYNs are sent, the server will establish multiple redundant, invalid connections upon receiving them, resulting in unnecessary resource waste.

Summary: The reasons for not using "two-way handshakes" and "four-way handshakes":
"Two-way handshake": Cannot prevent the establishment of historical connections, leading to resource waste on both sides, and cannot reliably synchronize the sequence numbers of both parties;
"Four-way handshake": The three-way handshake is already theoretically the minimum for reliably establishing a connection, so there is no need for more communication rounds.

Why must the initial sequence numbers be different each time a TCP connection is established?#

  • To prevent historical packets from being received by the next connection with the same four-tuple (the main reason);
  • For security: to prevent forged TCP packets with the same sequence number from being accepted by the other party.

image
The process is as follows:

  • The client and server establish a TCP connection. If the client's data packet is blocked in the network and times out, while the server device loses power and restarts, the previous connection with the client disappears. When the server receives the client's data packet, it will send a RST packet.
  • Immediately afterward, the client establishes a connection with the server with the same four-tuple as the previous connection.
  • Once the new connection is established, the data packet from the previous connection that was blocked in the network arrives at the server, and the sequence number of this packet happens to be within the server's receiving window, so the packet will be normally received by the server, causing data confusion.

It can be seen that if the initial sequence numbers of the client and server are the same each time a connection is established, it is easy to encounter the problem of historical packets being received by the next connection with the same four-tuple.

How is the initial sequence number ISN randomly generated?

The initial ISN is based on a clock, incrementing by 1 every 4 microseconds, with a full cycle taking 4.55 hours.
RFC793 mentions the algorithm for randomly generating the initial sequence number ISN: ISN = M + F(localhost, localport, remotehost, remoteport).

  • M is a timer that increments by 1 every 4 microseconds.
  • F is a hash algorithm that generates a random value based on the source IP, destination IP, source port, and destination port. It is important to ensure that the hash algorithm cannot be easily deduced externally, and using the MD5 algorithm is a good choice.

It can be seen that the random number is incremented based on a clock timer, making it nearly impossible to randomly generate the same initial sequence number.
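The scheme described above can be sketched as follows. This is an illustration, not the actual kernel implementation; the `SECRET` key and helper name are assumptions, and MD5 is used only because the text suggests it:

```python
import hashlib
import struct
import time

# Sketch of ISN = M + F(localhost, localport, remotehost, remoteport):
# M is a clock that ticks every 4 microseconds; F is a keyed hash over
# the four-tuple so different connections get unrelated offsets.
SECRET = b"per-boot-secret"  # assumption: a random key chosen at boot

def generate_isn(src_ip: str, src_port: int, dst_ip: str, dst_port: int) -> int:
    m = int(time.time() * 1_000_000 / 4)  # increments every 4 microseconds
    material = f"{src_ip}:{src_port}:{dst_ip}:{dst_port}".encode() + SECRET
    f = struct.unpack("<I", hashlib.md5(material).digest()[:4])[0]
    return (m + f) % 2 ** 32  # sequence numbers are 32-bit

isn = generate_isn("192.0.2.1", 40000, "198.51.100.7", 80)
print(isn)
```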

What is a SYN attack? How to prevent SYN attacks?#

We know that establishing a TCP connection requires three-way handshakes. Suppose an attacker forges SYN packets with different IP addresses in a short time. Each time the server receives a SYN packet, it enters the SYN_RCVD state, but the ACK + SYN packets sent by the server cannot receive ACK responses from unknown IP hosts. Over time, this will fill the server's half-connection queue, preventing the server from serving normal users.

Half-connection queue and full-connection queue?

These are also known as the SYN queue and the accept queue.
image
Normal process:

  • When the server receives a client's SYN packet, it creates a half-connection object and adds it to the kernel's "SYN queue";
  • It then sends a SYN + ACK to the client and waits for the client to respond with an ACK packet;
  • After the server receives the ACK packet, it removes the half-connection object from the "SYN queue", creates a new connection object, and places it in the "Accept queue";
  • The application calls the accept() socket interface to retrieve the connection object from the "Accept queue".

Both the half-connection queue and the full-connection queue have a maximum length limit, and packets will be discarded by default if the limit is exceeded.

The most direct manifestation of a SYN attack is to fill the TCP half-connection queue, so when the TCP half-connection queue is full, subsequent SYN packets will be discarded, preventing clients from establishing connections with the server.
To prevent SYN attacks, the following four methods can be employed:
1. Increase netdev_max_backlog
When the network card receives packets faster than the kernel can process them, a queue holds the backlog. The maximum length of this queue is controlled by the netdev_max_backlog parameter, with a default value of 1000; this value can be raised appropriately, for example to 10000.
2. Increase the TCP half-connection queue
To increase the TCP half-connection queue, the following three parameters must be increased simultaneously:

  1. Increase net.ipv4.tcp_max_syn_backlog
  2. Increase the backlog in the listen() function
  3. Increase net.core.somaxconn

3. Enable net.ipv4.tcp_syncookies
Enabling the syncookies feature allows connections to be established successfully without using the SYN half-connection queue; in effect, the half-connection queue is bypassed entirely.
image
As seen, when tcp_syncookies is enabled, even if a SYN attack causes the SYN queue to fill up, normal connections can still be successfully established.
The net.ipv4.tcp_syncookies parameter has three possible values:

  • 0: disable the feature;
  • 1: enable it only when the SYN half-connection queue is full;
  • 2: enable it unconditionally.

Thus, to counter SYN attacks, it should be set to 1.
4. Reduce SYN+ACK retransmission counts
When the server is under a SYN attack, there will be many TCP connections in the SYN_RECV state, and connections in this state keep retransmitting the SYN+ACK. When the retransmissions exceed the maximum count, the connection is terminated.
Under a SYN attack, we can therefore reduce the SYN+ACK retransmission count to speed up the teardown of TCP connections stuck in the SYN_RECV state.
The maximum retransmission count for SYN-ACK packets is controlled by the tcp_synack_retries kernel parameter (default value is 5 times), for example, reducing tcp_synack_retries to 2 times.

TCP Connection Termination#

The Four-Way Handshake Process of TCP#

image

  • The client intends to close the connection and sends a packet with the TCP header FIN flag set to 1, known as a FIN packet, after which the client enters the FIN_WAIT_1 state.
  • After receiving this packet, the server sends an ACK acknowledgment packet to the client, then enters the CLOSE_WAIT state.
  • After the client receives the server's ACK acknowledgment packet, it enters the FIN_WAIT_2 state.
  • After the server processes the data, it sends a FIN packet to the client, then enters the LAST_ACK state.
  • After the client receives the server's FIN packet, it sends an ACK acknowledgment packet back to the server, then enters the TIME_WAIT state.
  • After the server receives the ACK acknowledgment packet, it enters the CLOSE state, completing the connection closure on the server's side.
  • After a period of 2MSL, the client automatically enters the CLOSE state, completing the connection closure on the client's side.

You can see that each direction requires one FIN and one ACK, hence it is commonly referred to as a four-way handshake.
One point to note is that only the party actively closing the connection will have the TIME_WAIT state.

Why does the termination require four steps?

The server usually needs to wait until all data has been sent and processed, so the server's ACK and FIN are generally sent separately, thus requiring four steps for termination.

What happens if the first termination step is lost?

If the first termination step is lost, and the client does not receive the passive party's ACK for a long time, it will trigger the timeout retransmission mechanism, retransmitting the FIN packet. The number of retransmissions is controlled by the tcp_orphan_retries parameter.
When the number of FIN retransmissions by the client exceeds tcp_orphan_retries, it stops sending FIN packets and waits for a period (twice the last timeout duration). If it still has not received the second termination step, it enters the CLOSED state directly.

What happens if the second termination step is lost?

ACK packets are not retransmitted, so if the server's second termination step is lost, the client will trigger the timeout retransmission mechanism, retransmitting the FIN packet until it receives the server's second termination step or reaches the maximum retransmission count.

What happens if the third termination step is lost?

When the client receives the second termination step, which is the server's ACK packet, the client will be in the FIN_WAIT2 state, waiting for the server to send the third termination step, which is the server's FIN packet.
For connections closed by the close function, since no further data can be sent or received, the FIN_WAIT2 state cannot last too long, and the tcp_fin_timeout controls the duration of this state, with a default value of 60 seconds.
This means that for connections closed by calling close, if the FIN packet is not received within 60 seconds, the client's (active closing party) connection will be directly closed.
However, note that if the active closing party uses the shutdown function to close the connection, specifying only to close the sending direction while not closing the receiving direction, it means that the active closing party can still receive data.
In this case, if the active closing party does not receive the third termination step for a long time, its connection will remain in the FIN_WAIT2 state.
When the server (passive closing party) receives the FIN packet from the client (active closing party), the kernel will automatically reply with an ACK, and the connection will enter the LAST_ACK state, waiting for the client to return an ACK to confirm the connection closure.
If this ACK is not received for a long time, the server will retransmit the FIN packet, with the number of retransmissions still controlled by the tcp_orphan_retries parameter, which is the same as the way the client retransmits the FIN packet.

What happens if the fourth termination step is lost?

When the client receives the server's third termination step, which is the FIN packet, it will send an ACK packet back to the server, which is the fourth termination step. At this point, the client's connection enters the TIME_WAIT state.
In a Linux system, the TIME_WAIT state lasts for 2MSL before entering the closed state.
Then, the server (passive closing party) remains in the LAST_ACK state until it receives the ACK packet.
If the fourth termination step's ACK packet does not reach the server, the server will retransmit the FIN packet, with the number of retransmissions still controlled by the previously mentioned tcp_orphan_retries parameter.

Why is the waiting time for TIME_WAIT 2MSL?#

MSL (Maximum Segment Lifetime) is the longest time any packet can exist in the network; a packet older than this is discarded. TCP packets are carried by the IP protocol, and the IP header contains a TTL field: the maximum number of router hops a packet may take, decremented by 1 at each router. When TTL reaches 0, the packet is discarded and an ICMP message is sent to notify the source host.
The difference between MSL and TTL: MSL is measured in time, while TTL is measured in hops. Therefore, MSL should be greater than or equal to the time it takes for TTL to reach 0 to ensure that packets have naturally disappeared.
The TTL value is generally 64, and Linux sets MSL to 30 seconds, meaning that Linux believes that the time taken for a packet to pass through 64 routers will not exceed 30 seconds. If it exceeds this time, it is considered that the packet has disappeared in the network.
TIME_WAIT waits for 2 times the MSL, which is reasonably explained as: there may be data packets from the sender in the network, and when these packets are processed by the receiver, they will send responses back to the sender, so the round trip requires waiting for 2 times the time.

Why is the TIME_WAIT state needed?#

There are two main reasons:

  1. To prevent data from historical connections from being incorrectly received by subsequent connections with the same four-tuple;
  2. To ensure that the party "passively closing the connection" can be correctly closed;

What are the dangers of excessive TIME_WAIT?#

For the server, it occupies system resources such as file descriptors, memory resources, CPU resources, thread resources, etc.;
For the client, it occupies port resources, which are also limited. Generally, the available port range is 32768 to 61000, which can also be specified through the net.ipv4.ip_local_port_range parameter.

How to optimize TIME_WAIT?#

  • Enable the net.ipv4.tcp_tw_reuse and net.ipv4.tcp_timestamps options;
    This allows sockets in the TIME_WAIT state to be reused for new connections. One point to note is that the tcp_tw_reuse feature can only be used by the client (the connection initiator) because when this feature is enabled, the kernel will randomly find a connection in the TIME_WAIT state that has exceeded 1 second to reuse for the new connection.
  • net.ipv4.tcp_max_tw_buckets
    This value defaults to 18000. When the number of connections in the TIME_WAIT state exceeds this value, the system will reset the subsequent TIME_WAIT connection states. This method is relatively aggressive.
  • Use SO_LINGER in the program to forcefully close with RST.
    If l_onoff is non-zero and l_linger is 0, then calling close will send a RST flag to the other party, and this TCP connection will skip the four-way handshake and the TIME_WAIT state, closing directly.
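A minimal sketch of the SO_LINGER trick, assuming POSIX socket behavior: with l_onoff = 1 and l_linger = 0, close() aborts the connection with a RST, and the closing side skips TIME_WAIT (note that any unsent data is discarded, which is why this method is aggressive):

```python
import socket
import struct

# Set up a loopback connection to demonstrate an abortive close.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)

client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
conn, _ = server.accept()

# l_onoff = 1, l_linger = 0: close() sends RST instead of FIN.
client.setsockopt(socket.SOL_SOCKET, socket.SO_LINGER, struct.pack("ii", 1, 0))
client.close()  # abortive close: RST, no four-way handshake, no TIME_WAIT

# The peer sees the reset as a connection error on its next read.
saw_rst = False
try:
    conn.recv(1024)
except ConnectionResetError:
    saw_rst = True
print("peer observed RST:", saw_rst)

conn.close()
server.close()
```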

What are the reasons for a server to have a large number of TIME_WAIT states?#

The TIME_WAIT state only appears for the party that actively closes the connection. Therefore, if the server has a large number of TCP connections in the TIME_WAIT state, it indicates that the server has actively closed many TCP connections.
In what scenarios will the server actively close connections?

  • The first scenario: HTTP does not use long connections.
    If either party's HTTP header contains Connection: close, the HTTP long connection mechanism cannot be used. After completing an HTTP request/processing, the connection will be closed.
    According to the implementation of most web services, regardless of which party disables HTTP Keep-Alive, the server actively closes the connection.
  • The second scenario: HTTP long connection timeout.
    Assuming the timeout for the HTTP long connection is set to 60 seconds, nginx will start a "timer." If the client does not initiate a new request within 60 seconds after completing the last HTTP request, when the timer expires, nginx will trigger a callback function to close the connection, resulting in TIME_WAIT state connections on the server.
  • The third scenario: The number of requests for the HTTP long connection reaches the limit.
    The keepalive_requests parameter in nginx indicates the number of requests that can be processed on a single HTTP long connection. If this maximum value is reached, nginx will actively close the long connection, resulting in TIME_WAIT state connections on the server.

What are the reasons for a server to have a large number of CLOSE_WAIT states?

When a server has a large number of connections in the CLOSE_WAIT state, it indicates that the server's program has not called the close function to close the connections, which is usually a code issue.

Socket Programming#

How should Socket programming be done for TCP?#

image

  • The server and client initialize the socket to obtain file descriptors;
  • The server calls bind to bind the socket to a specified IP address and port;
  • The server calls listen to start listening;
  • The server calls accept to wait for client connections;
  • The client calls connect to initiate a connection request to the server's address and port;
  • The server's accept returns the file descriptor for the socket used for transmission;
  • The client calls write to send data; the server calls read to receive data;
  • When the client disconnects, it calls close; the server's read then returns EOF. After processing the remaining data, the server calls close as well to indicate that the connection is closed.

It is important to note that when the server calls accept, a socket for an established connection is returned for subsequent data transmission.
Thus, there are "two" sockets: one is the listening socket, and the other is the completed connection socket.
After a successful connection is established, both parties begin to read and write data using the read and write functions, just like writing to a file stream.
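The flow above can be sketched as a minimal echo server and client over loopback (the thread and echo behavior are illustrative choices, not part of the API itself):

```python
import socket
import threading

def serve(server: socket.socket) -> None:
    conn, _ = server.accept()   # returns a NEW socket for this connection
    while True:
        data = conn.recv(1024)
        if not data:            # b"" means the peer closed (EOF)
            break
        conn.sendall(data)      # echo the data back
    conn.close()

# Server side: socket -> bind -> listen -> accept.
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))   # port 0: any free port
server.listen(5)                # backlog of 5
threading.Thread(target=serve, args=(server,)).start()

# Client side: socket -> connect -> write -> read.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(server.getsockname())
client.sendall(b"ping")
reply = client.recv(1024)
print(reply)  # b'ping'
client.close()
server.close()
```

Note how `server` (the listening socket) and `conn` (the connection socket returned by accept) are the "two sockets" the text describes.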

What is the significance of the backlog parameter when listening?#

The Linux kernel maintains two queues:

  • Half-connection queue (SYN queue): When a SYN connection request is received, it is in the SYN_RCVD state;
  • Full-connection queue (Accept queue): Completed the TCP three-way handshake process, in the ESTABLISHED state;

In earlier Linux kernels, backlog referred to the size of the SYN queue, which is the size of the unfinished queue.
After Linux kernel 2.2, backlog became the length of the accept queue, which is the queue for established connections, so it is now generally considered that backlog refers to the accept queue.
However, the upper limit is the size of the kernel parameter somaxconn, meaning that the accept queue length = min(backlog, somaxconn).
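As a sketch, the rule can be written out directly (the procfs path is Linux-specific; the 4096 fallback is an assumed value, common on modern kernels):

```python
# The effective accept-queue length is min(backlog, somaxconn).
def effective_accept_queue(backlog, somaxconn):
    return min(backlog, somaxconn)

# On Linux, the current somaxconn value lives in procfs.
try:
    with open('/proc/sys/net/core/somaxconn') as f:
        somaxconn = int(f.read())
except OSError:
    somaxconn = 4096   # assumed fallback for non-Linux systems

print(effective_accept_queue(1024, somaxconn))
```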

What happens when the TCP half-connection queue and full-connection queue are full?
TCP Full Connection Queue Overflow
When the full connection queue is full, the server will drop subsequent incoming TCP connections (by default the handshake-completing ACK is dropped; if tcp_abort_on_overflow is enabled, an RST is sent instead).
TCP Half Connection Queue Overflow
If the half-connection queue is full and tcp_syncookies is not enabled, the SYN packet will be discarded;
If the full connection queue is full and more than one connection request has not yet retransmitted its SYN+ACK, the SYN packet will be discarded;
If tcp_syncookies is not enabled and max_syn_backlog minus the current half-connection queue length is less than (max_syn_backlog >> 2), the SYN packet will be discarded.
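The three drop rules above can be paraphrased as plain logic. This is a simplification for illustration, not a reimplementation of the kernel's checks, and all parameter names are ours:

```python
# Decide whether an incoming SYN is dropped, per the three rules above.
def should_drop_syn(syn_queue_len, syn_queue_full, accept_queue_full,
                    young_requests, syncookies_enabled, max_syn_backlog):
    # Rule 1: half-connection queue full and syncookies disabled.
    if syn_queue_full and not syncookies_enabled:
        return True
    # Rule 2: accept queue full and more than one request has not yet
    # retransmitted its SYN+ACK ("young" requests).
    if accept_queue_full and young_requests > 1:
        return True
    # Rule 3: syncookies disabled and too little room left in the SYN queue.
    if (not syncookies_enabled and
            max_syn_backlog - syn_queue_len < (max_syn_backlog >> 2)):
        return True
    return False

print(should_drop_syn(100, False, False, 0, False, 128))  # True: rule 3
```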

At which step does accept occur in the three-way handshake?#

image
The client successfully returns from connect at the second handshake, while the server successfully returns from accept after the three-way handshake is completed.

What is the process of connection termination when the client calls close?#

image

  • The client calls close, indicating that there is no data to send, and sends a FIN packet to the server, entering the FIN_WAIT_1 state;
  • The server receives the FIN packet; the TCP protocol stack inserts an EOF marker into the receive buffer, queued after any data that has already arrived, so the application perceives the FIN through a later read call. EOF means no more data will arrive on this connection, and the server must handle that accordingly. At this point, the server enters the CLOSE_WAIT state;
  • After processing the data, the server will naturally read EOF and call close to close its socket, sending a FIN packet and entering the LAST_ACK state;
  • The client receives the server's FIN packet and sends an ACK confirmation packet back to the server, entering the TIME_WAIT state;
  • After the server receives the ACK confirmation packet, it enters the final CLOSE state;
  • After a period of 2MSL, the client also enters the CLOSE state.

Can a TCP connection be established without accept?#

Yes.
The accept system call does not participate in the TCP three-way handshake process; it is only responsible for retrieving an already established connection socket from the TCP full connection queue. The user layer can perform read and write operations on the socket obtained through the accept system call.

Can a TCP connection be established without listen?#

Yes.
A client can connect to itself to form a connection (TCP self-connection), or two clients can simultaneously send requests to each other to establish connections (TCP simultaneous open). In both cases, there is a common point: no server is involved, meaning that a TCP connection can be established without listen.
If a server is involved and it has not called the listen function, no socket is listening on that port, so the server responds with an RST packet to terminate the connection.
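A minimal sketch of a TCP self-connection, assuming a Linux host: bind and connect the same socket to its own address, and the kernel treats it as a simultaneous open, so no listen is involved. This behavior is platform-dependent:

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.bind(('127.0.0.1', 0))       # port 0: the OS picks a free port
addr = s.getsockname()
s.connect(addr)                # connect to ourselves: simultaneous open
s.sendall(b'ping')
reply = s.recv(4)              # the bytes loop back to the same socket
s.close()
print(reply)                   # b'ping'
```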

TCP Reliability Mechanisms#

Retransmission Mechanism#

Timeout Retransmission#

One of the ways to implement the retransmission mechanism is to set a timer when sending data. If the specified time elapses without receiving the other party's ACK confirmation packet, the data will be retransmitted, which is commonly referred to as timeout retransmission.
TCP will perform timeout retransmission in the following two situations:

  • Packet loss
  • Acknowledgment loss

If retransmitted data times out again, TCP's strategy is to double the timeout interval:
each timeout retransmission sets the next timeout duration to twice the previous value. Repeated timeouts indicate a poor network environment, where frequent retransmission would only make things worse.
The problem with timeout-triggered retransmissions is that the timeout period may be relatively long.
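The doubling rule can be sketched as follows (the 1-second initial RTO is an illustrative value, not TCP's computed RTO, and the function name is ours):

```python
# Exponential backoff: each timeout doubles the next retransmission timeout.
def backoff_rtos(initial_rto, retries):
    rto, timeouts = initial_rto, []
    for _ in range(retries):
        timeouts.append(rto)
        rto *= 2               # double after every timeout
    return timeouts

print(backoff_rtos(1.0, 5))    # [1.0, 2.0, 4.0, 8.0, 16.0]
```

This illustrates the drawback named above: after a few timeouts, the sender may wait many seconds before retrying.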

Fast Retransmission#

TCP also has another mechanism called Fast Retransmit, which is driven by data rather than time.
image
In the above diagram, the sender sends data packets 1, 2, 3, 4, and 5:

  • The first packet Seq1 is received first, so it acknowledges with 2;
  • Seq2 is not received for some reason, but Seq3 arrives, so it still acknowledges with 2;
  • The subsequent Seq4 and Seq5 arrive, but it still acknowledges with 2 because Seq2 is still not received;
  • The sender receives three ACKs = 2, knowing that Seq2 has not been received, and will retransmit the lost Seq2 before the timer expires.
  • Finally, Seq2 is received, and since Seq3, Seq4, and Seq5 have all been received, it acknowledges with 6.

Thus, the fast retransmission mechanism retransmits the lost segment as soon as three duplicate ACK packets are received, without waiting for the timer to expire.
Fast retransmit solves the timeout-duration problem, but it still faces another one: should it retransmit just one packet, or all packets sent after it?
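The trigger can be sketched as a duplicate-ACK counter (a simplification; real TCP tracks this state per connection):

```python
# Retransmit a segment once its ACK number is seen three extra times
# (three duplicate ACKs), without waiting for the retransmission timer.
def fast_retransmit_trigger(acks):
    dup, last, retransmitted = 0, None, []
    for ack in acks:
        if ack == last:
            dup += 1
            if dup == 3:                   # third duplicate ACK
                retransmitted.append(ack)  # resend segment starting at `ack`
        else:
            last, dup = ack, 0
    return retransmitted

# The scenario from the diagram: ACK 2 arrives four times, then ACK 6.
print(fast_retransmit_trigger([2, 2, 2, 2, 6]))  # [2]
```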

SACK Method#

SACK (Selective Acknowledgment) requires adding a SACK option in the TCP header. It allows the receiver to send information about the data that has been received to the sender, enabling the sender to know which data has been received and which has not. Knowing this information allows the sender to retransmit only the lost data.
When the sender receives three identical ACK confirmation packets, it triggers the fast retransmission mechanism. Through SACK information, it discovers that only the data in the range of 200-299 is lost, so it retransmits only that TCP segment.
image
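A sketch of how the sender can derive the missing ranges from the cumulative ACK plus SACK blocks (half-open byte ranges; the function name is ours):

```python
# Given the cumulative ACK and the receiver's SACK blocks, compute which
# byte ranges are missing and therefore need to be retransmitted.
def sack_holes(cum_ack, sack_blocks):
    holes, edge = [], cum_ack
    for left, right in sorted(sack_blocks):
        if left > edge:
            holes.append((edge, left))   # gap before this SACK block
        edge = max(edge, right)
    return holes

# ACK = 200 and the receiver holds bytes 300-599: only 200-299 is missing.
print(sack_holes(200, [(300, 600)]))     # [(200, 300)]
```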

Duplicate SACK#

Duplicate SACK, also known as D-SACK, mainly uses SACK to inform the sender which data has been received multiple times.
ACK Packet Loss:
image

  • The two ACK confirmation packets sent from the receiver to the sender are lost, so the sender times out and retransmits the first data packet (3000-3499).
  • The receiver finds that the data has been received multiple times, so it sends a SACK = 3000-3500 to inform the sender that the data in the range of 3000-3500 has already been received. Since the ACK has reached 4000, it indicates that all data before 4000 has been received, so this SACK represents D-SACK.
  • This way, the sender knows that the data has not been lost; rather, the receiver's ACK confirmation packets have been lost.

Network Delay:
image

  • The data packet (1000-1499) is delayed in the network, causing the sender not to receive the ACK for 1500.
  • The subsequent three identical ACK confirmation packets arrive, triggering the fast retransmission mechanism. However, after retransmission, the delayed data packet (1000-1499) arrives at the receiver.
  • Therefore, the receiver sends a SACK = 1000-1500; since the ACK has already reached 3000, this SACK is a D-SACK, indicating that a duplicate packet was received.
  • This way, the sender knows that the reason for triggering fast retransmission is not due to lost packets sent out or lost ACK packets, but rather due to network delays.

It can be seen that D-SACK has several benefits:

  • It allows the sender to know whether the sent packets were lost or whether the receiver's ACK packets were lost;
  • It can determine whether the sender's packets were delayed in the network;
  • It can identify whether the sender's packets were duplicated in the network.

Sliding Window#

TCP introduces the concept of a window to address the issue that the longer the round-trip time of packets, the lower the communication efficiency. The window size refers to the maximum amount of data that can be sent without waiting for an acknowledgment.
The implementation of the window is essentially a buffer space allocated by the operating system. The sending host must retain the sent data in the buffer until it receives the acknowledgment. If the acknowledgment is received on time, the data can be cleared from the buffer.

image
In the diagram, the ACK 600 confirmation packet is lost, but it does not matter, because it is covered by the next acknowledgment: as long as the sender receives ACK 700, it knows that all data before 700 has been received. This mode is called cumulative acknowledgment (cumulative ACK).

Who determines the window size?

There is a field in the TCP header called Window, which indicates the window size.
This field informs the sender how much buffer space the receiver has available to receive data. Thus, the sender can send data according to the receiver's processing capacity without overwhelming it.
Therefore, the window size is usually determined by the receiver.
The amount of data the sender sends cannot exceed the receiver's window size; otherwise, the receiver will not be able to receive the data properly.

Sending Window
The data stream sent can be divided into the following four parts: Sent and acknowledged | Sent but not acknowledged | Not sent but can be sent | Not sendable, where the sending window = Sent but not acknowledged + Not sent but can be sent.
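The four regions can be sketched with byte offsets (the variable names snd_una and snd_nxt follow common TCP naming conventions; the values are illustrative):

```python
# Partition the outgoing byte stream into the four regions described above.
# snd_una = oldest unacknowledged byte, snd_nxt = next byte to send,
# window = send window size, total = total bytes the application wants sent.
def send_regions(snd_una, snd_nxt, window, total):
    return {
        'sent_and_acked': (0, snd_una),
        'sent_not_acked': (snd_una, snd_nxt),
        'usable':         (snd_nxt, snd_una + window),  # can send right now
        'not_sendable':   (snd_una + window, total),
    }

print(send_regions(32, 46, 20, 100))
```

Here the send window spans bytes 32-51: the "sent but not acknowledged" part (32-45) plus the "not sent but can be sent" part (46-51).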

Receiving Window
The received data stream can be divided into: Received | Not received but ready to receive | Not received and not ready to receive. The receiving window = Not received but ready to receive.

Are the sizes of the receiving window and sending window equal?

They are not exactly equal; the size of the receiving window is approximately equal to the size of the sending window.
This is because the sliding window is not static. For example, when the receiving application process reads data very quickly, the receiving window frees up quickly, and the new size is communicated to the sender through the Window field in the TCP header. Because this notification takes time to arrive, the receiving window and sending window are only approximately equal.

Flow Control#

The sender cannot blindly send data to the receiver without considering the receiver's processing capacity.
If data is continuously sent to the receiver without consideration, it may trigger the retransmission mechanism, leading to unnecessary waste of network traffic.
To address this phenomenon, TCP provides a mechanism that allows the sender to control the amount of data sent based on the receiver's actual receiving capacity, which is known as flow control.
Relationship between Operating System Buffer and Sliding Window
The number of bytes stored in the sending window and receiving window is kept in the operating system's memory buffer, which is adjusted by the operating system.
When the application process cannot read the buffer contents in time, the buffer, and with it the window, is affected.
If the buffer is reduced first and then the window is shrunk, packet loss may occur.
To prevent this situation, TCP specifies that it is not allowed to reduce the buffer and shrink the window simultaneously; instead, it should shrink the window first and then reduce the buffer after a period, thus avoiding packet loss.
Window Closure
If the window size is 0, it will prevent the sender from transmitting data to the receiver until the window becomes non-zero. This is known as window closure.
When the receiver informs the sender of the window size, it does so through the ACK packet.
When window closure occurs, after the receiver processes the data, it will send a non-zero window notification ACK packet to the sender. If this notification ACK packet is lost in the network, it will cause the sender to wait indefinitely for the non-zero window notification, while the receiver will also wait for the sender's data. If no measures are taken, this mutual waiting process will lead to a deadlock.
To resolve this issue, TCP sets a persistent timer for each connection. As soon as one side of the TCP connection receives a zero window notification from the other side, it starts the persistent timer.
If the persistent timer times out, it will send a window probe packet, and the other party will provide its current receiving window size upon acknowledging this probe packet.
The number of window probes is generally 3, with each probe occurring approximately every 30-60 seconds (this may vary by implementation). If after 3 attempts the receiving window is still 0, some TCP implementations will send an RST packet to terminate the connection.
Silly Window Syndrome
If the receiver is too busy to drain its receive buffer, the window it advertises shrinks, and with it the sender's sending window.
Eventually, whenever the receiver frees up a few bytes and advertises that tiny window, the sender will unconditionally send those few bytes; this is the silly window syndrome.
Note that the TCP and IP headers together take 40 bytes; incurring that overhead to carry only a few bytes of payload is very inefficient.
To resolve the silly window syndrome, two issues must be addressed:

  1. Prevent the receiver from notifying the sender of a small window.
    The typical strategy for the receiver is as follows:
    When the "window size" is less than min(MSS, buffer space/2), meaning it is less than the minimum of MSS and half of the buffer size, it will notify the sender that the window is 0, thus preventing the sender from sending more data.
    Once the receiver processes some data, and the window size >= MSS, or the receiver's buffer space has half available, it can open the window to allow the sender to send data.
  2. Prevent the sender from sending small data.
    The typical strategy for the sender is as follows:
    Use the Nagle algorithm, which delays processing and only allows data to be sent when either of the following two conditions is met:
    Condition 1: Wait until the window size >= MSS and data size >= MSS;
    Condition 2: Receive the ACK packet for previously sent data;
    As long as neither of the above two conditions is met, the sender will continue to accumulate data until the sending conditions are satisfied.
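The two conditions can be sketched as a decision function (a simplification of Nagle's algorithm; the parameter names are ours):

```python
# Nagle's decision: send now only if a full MSS of data fits in the
# window, or if everything previously sent has already been ACKed.
def nagle_can_send(data_len, window, mss, unacked_bytes):
    if window >= mss and data_len >= mss:   # condition 1
        return True
    if unacked_bytes == 0:                  # condition 2: all data ACKed
        return True
    return False

print(nagle_can_send(1460, 2000, 1460, 500))  # True: a full MSS fits
print(nagle_can_send(10, 2000, 1460, 500))    # False: small data, ACK pending
```

Otherwise the sender keeps accumulating data, which is exactly what prevents it from dribbling out tiny segments.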

Congestion Control#

When the network is congested, continuing to send a large number of packets may lead to packet delays, losses, etc. In this case, TCP will retransmit the data, but retransmission will further burden the network, leading to greater delays and more packet losses, creating a vicious cycle that continues to amplify...
Thus, congestion control is implemented to avoid filling the entire network with data from the sender.
To adjust the amount of data the sender sends, a concept called "congestion window" is defined. It dynamically changes based on the level of network congestion.
Congestion control algorithms:
1. Slow Start
In slow start, each time the sender receives an ACK, the congestion window cwnd increases by 1. This continues until cwnd reaches the slow start threshold ssthresh:
- When cwnd < ssthresh, the slow start algorithm is used.
- When cwnd >= ssthresh, the "congestion avoidance algorithm" is used.
2. Congestion Avoidance
Every time an ACK is received, cwnd increases by 1/cwnd.
3. Congestion Occurrence
When network congestion occurs, meaning packet retransmission happens, the algorithm for congestion occurrence varies based on the type of retransmission.

Congestion occurrence algorithm for timeout retransmission

  • Set ssthresh to cwnd/2
  • Reset cwnd to 1 (restore to the initial value of cwnd, assuming the initial value is 1)

image

Congestion occurrence algorithm for fast retransmission

  • Set cwnd = cwnd/2, meaning it is set to half of the original value;
  • Set ssthresh = cwnd;
  • Enter fast recovery algorithm.

4. Fast Recovery

  • Set the congestion window cwnd = ssthresh + 3
  • Retransmit the lost packets;
  • If more duplicate ACKs are received, increase cwnd by 1;
  • If a new data ACK is received, set cwnd to the value of ssthresh from the first step, as this ACK confirms that new data has been received, indicating that the recovery process has ended and can return to the state prior to recovery, thus re-entering the congestion avoidance state.
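The four algorithms above can be tied together in a small sketch that tracks cwnd in MSS units (a simplification for illustration; real implementations work in bytes and are considerably more involved):

```python
# Slow start / congestion avoidance: cwnd growth per received ACK.
def on_ack(cwnd, ssthresh):
    if cwnd < ssthresh:
        return cwnd + 1          # slow start: +1 per ACK
    return cwnd + 1 / cwnd       # congestion avoidance: +1/cwnd per ACK

# Timeout retransmission: ssthresh = cwnd/2, cwnd back to its initial 1.
def on_timeout(cwnd):
    return 1, cwnd // 2          # (new cwnd, new ssthresh)

# Fast retransmission: halve cwnd, set ssthresh to it, enter fast recovery.
def on_three_dup_acks(cwnd):
    half = cwnd // 2
    return half + 3, half        # fast recovery: cwnd = ssthresh + 3

cwnd, ssthresh = 1, 8
for _ in range(4):               # four ACKs while still below ssthresh
    cwnd = on_ack(cwnd, ssthresh)
print(cwnd)                      # 5: still in slow start, +1 per ACK

cwnd, ssthresh = on_three_dup_acks(16)
print(cwnd, ssthresh)            # 11 8
```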