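Demystifying TCP bind and listen on Linux (Part 2) - Secer's Blog

Original: https://ops.tips/blog/how-linux-tcp-introspection/

In this article, we look at what the system does behind the scenes before a socket is ready to accept connections, and what "ready to accept connections" actually means. Because of its length, the translation is split into two parts; this is the second.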

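Checking whether the listen backlog limit is per network namespace

If the limit really is determined per network namespace, we should be able to enter a namespace, set a limit there, and still see a different limit from the outside: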

# Check the somaxconn limit as set in the
# default network namespace
cat /proc/sys/net/core/somaxconn
128


# Create a new network namespace
ip netns add mynamespace


# Join the network namespace and then
# check the value set for `somaxconn`
# within it
ip netns exec mynamespace \
        cat /proc/sys/net/core/somaxconn
128


# Modify the limit set from within the 
# network namespace
ip netns exec mynamespace \
        /bin/sh -c "echo 1024 > /proc/sys/net/core/somaxconn"


# Check whether the limit is in place there
ip netns exec mynamespace \
        cat /proc/sys/net/core/somaxconn
1024


# Check that the host's limit is still the
# same as before (128), meaning that the change
# took effect only within the namespace
cat /proc/sys/net/core/somaxconn
128

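So, through /proc we can see the value of the sysctl parameter, but is it really in effect?

To answer that, we first need to know how to find the backlog limit that was set for a given socket.

Collecting information about TCP sockets with procfs

Through /proc/net/tcp, we can see all the sockets in the current namespace.

Usually, this file gives us most of the information we need.

It contains some very useful information, such as:

connection state;
remote address and port;
local address and port;
size of the receive queue;
size of the transmit queue.

For example, once we have put a socket into the listening state, we can use the file to inspect it: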

# Retrieve a list of all of the TCP sockets that
# are either listening or that have or have had an
# established connection.

        hexadecimal representation <-.
        of the conn state.           |
                                     |
cat /proc/net/tcp                    |
   .---------------.               .----.
sl | local_address | rem_address   | st | tx_queue rx_queue 
0: | 00000000:0016 | 00000000:0000 | 0A | 00000000:00000000 
   *---------------*               *----*
    |                                  |
    *-> Local address in the format    *-.
        <ip>:<port>, where numbers are   |
        represented in the hexadecimal   |
        format.                          |
                    .--------------------*
                    |
        The states here correspond to the
        ones in include/net/tcp_states.h:

enum {
    TCP_ESTABLISHED = 1,
    TCP_SYN_SENT,
    TCP_SYN_RECV,
    TCP_FIN_WAIT1,
    TCP_FIN_WAIT2,
    TCP_TIME_WAIT,     .-> 0A = 10 --> LISTEN
    TCP_CLOSE,         |
    TCP_CLOSE_WAIT,    |
    TCP_LAST_ACK,      |
    TCP_LISTEN,  ------*
    TCP_CLOSING,
    TCP_NEW_SYN_RECV,
    TCP_MAX_STATES,
};
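
To make the hexadecimal fields above easier to read, here is a minimal sketch of how one data row of /proc/net/tcp could be decoded. The function names (decode_address, parse_tcp_line) are hypothetical, and the address decoding assumes a little-endian host such as x86, where the four address bytes appear reversed in the hex dump.

```python
import socket
import struct

# State codes from include/net/tcp_states.h, as listed above.
TCP_STATES = {
    1: "ESTABLISHED", 2: "SYN_SENT", 3: "SYN_RECV", 4: "FIN_WAIT1",
    5: "FIN_WAIT2", 6: "TIME_WAIT", 7: "CLOSE", 8: "CLOSE_WAIT",
    9: "LAST_ACK", 10: "LISTEN", 11: "CLOSING", 12: "NEW_SYN_RECV",
}


def decode_address(field):
    """Turn a field such as '0100007F:0016' into ('127.0.0.1', 22)."""
    ip_hex, port_hex = field.split(":")
    # Assumption: little-endian host, so the address bytes are
    # printed in reverse order and must be unpacked with "<I".
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def parse_tcp_line(line):
    """Decode one data row of /proc/net/tcp (not the header row)."""
    fields = line.split()
    tx_queue, rx_queue = fields[4].split(":")
    return {
        "local": decode_address(fields[1]),
        "remote": decode_address(fields[2]),
        "state": TCP_STATES.get(int(fields[3], 16), "UNKNOWN"),
        "tx_queue": int(tx_queue, 16),
        "rx_queue": int(rx_queue, 16),
    }


if __name__ == "__main__":
    # The row shown in the diagram, without the decorative pipes.
    sample = "0: 00000000:0016 00000000:0000 0A 00000000:00000000"
    print(parse_tcp_line(sample))
```

Feeding it the row from the diagram yields a socket in the LISTEN state bound to 0.0.0.0 on port 22 (0x0016).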

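Note that the backlog configured for the listening socket does not show up in /proc/net/tcp, presumably because that information is specific to sockets in the LISTEN state. For now, though, that is only a guess.

So how can we check it?

Checking the backlog size of a listening socket

The easiest way to do this is with the ss command from iproute2.

Consider the following userspace code: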

#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

int main (int argc, char** argv) {
        // Create a socket for the AF_INET
        // communication domain, of type SOCK_STREAM
        // without a protocol specified.
        int sock_fd = socket(AF_INET, SOCK_STREAM, 0);
        if (sock_fd == -1) {
                perror("socket");
                return 1;
        }

        // Mark the socket as passive with a backlog
        // size of 128.
        int err = listen(sock_fd, 128);
        if (err == -1) {
                perror("listen");
                return 1;
        }

        // Keep the process alive so that the listening
        // socket can be inspected from another terminal.
        sleep(3000);
        return 0;
}

After running the code above, run ss:

# Display a list of passive tcp sockets, showing
# as much info as possible.
ss \
  --info \     .------> Number of connections waiting
  --tcp \      |        to be accepted.
  --listen \   |    .-> Maximum size of the backlog.
  --extended   |    |
        .--------..--------.
State   | Recv-Q || Send-Q | ...
LISTEN  | 0      || 128    | ...
        *--------**--------*
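
If we want to read these two columns from a script, a rough sketch could look like the following. It assumes the main row of each listening socket has the column order shown in the diagram (State, Recv-Q, Send-Q) followed by the local address; parse_ss_listen_line is a hypothetical helper, and the extra detail rows printed by --info and --extended are not handled.

```python
def parse_ss_listen_line(line):
    """Split one main row of `ss --tcp --listen` output into named fields.

    Assumed column order: State Recv-Q Send-Q Local-Address:Port ...
    """
    fields = line.split()
    return {
        "state": fields[0],
        # Recv-Q: connections waiting to be accepted.
        "pending": int(fields[1]),
        # Send-Q: maximum size of the backlog.
        "backlog": int(fields[2]),
        "local": fields[3],
    }


if __name__ == "__main__":
    # Sample row written by hand for illustration, not captured output.
    print(parse_ss_listen_line("LISTEN 0 128 0.0.0.0:1337 0.0.0.0:*"))
```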

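We use ss here rather than /proc/net/tcp mainly because recent versions of the latter do not expose a socket's backlog, whereas ss does.

In fact, ss can provide this information because it uses a different API to retrieve data from the kernel: instead of reading from procfs, it uses netlink:

Netlink is a datagram-oriented service. [...] Used to transfer information between the kernel and user-space processes.

Since netlink can talk to many different kernel subsystems, ss needs to specify which one it intends to talk to; for sockets, that is sock_diag:

The sock_diag netlink subsystem provides a mechanism for obtaining information about sockets of various address families from the kernel.

     This subsystem can be used to obtain information about individual sockets or request a list of sockets.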

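More specifically, the backlog comes from the queue-length fields that sock_diag reports. The man page describes them under the UDIAG_SHOW_RQLEN flag, which applies to UNIX domain sockets; the INET counterparts, idiag_rqueue and idiag_wqueue, carry the same meaning for listening sockets: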

UDIAG_SHOW_RQLEN
       ...
       udiag_rqueue
              For listening sockets: the number of pending
              connections. [ ... ] 

       udiag_wqueue
              For listening sockets: the backlog length which
              equals to the value passed as the second argument
              to listen(2). [...]

Now, running the code from the previous section again, we can see that the limit really is set per namespace.
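
The capping behaviour we are checking can be summarised in a few lines. This is only a sketch of the rule, not kernel code: read_somaxconn and effective_backlog are hypothetical helpers, and the requested backlog is assumed to be non-negative.

```python
def read_somaxconn(path="/proc/sys/net/core/somaxconn"):
    """Read the limit that applies to the current network namespace."""
    with open(path) as f:
        return int(f.read().strip())


def effective_backlog(requested, somaxconn):
    """Backlog the kernel ends up using for listen(fd, requested).

    Values above somaxconn are silently capped.
    """
    return min(requested, somaxconn)


if __name__ == "__main__":
    limit = read_somaxconn()
    print("somaxconn:", limit)
    print("listen(fd, 1024) is capped to:", effective_backlog(1024, limit))
```

Run inside the namespace where somaxconn was raised to 1024, a request of 1024 would be kept as is; on the host, with the limit at 128, the same request would be capped to 128.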

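So far we have covered the size of the backlog queue, but how does it get initialized?

How listen works internally for the IPv4 protocol family

Once the backlog has been capped by the sysctl value (SOMAXCONN), the next step is to hand the listening work over to the protocol family's own function (inet_listen).

The process is shown in the figure below.

For readability, the TCP Fast Open code has been elided here; this is the implementation of inet_listen: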

int
inet_listen(struct socket* sock, int backlog)
{
    struct sock*  sk = sock->sk;
    unsigned char old_state;
    int           err, tcp_fastopen;

    // Ensure that we have a fresh socket that has
    // not been put into `LISTEN` state before, and
    // is not connected.
    //
    // Also, ensure that it's of the TCP type (otherwise
    // the idea of a connection wouldn't make sense).
    err = -EINVAL;
    if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
        goto out;

    if (_some_tcp_fast_open_stuff_) {
        // ... do some TCP fast open stuff ...

        // Initialize the necessary data structures
        // for turning this socket into a listening socket
        // that is going to be able to receive connections.
        err = inet_csk_listen_start(sk, backlog);
        if (err)
            goto out;
    }

    // Annotate the protocol-specific socket structure
    // with the backlog configured by `sys_listen` (the
    // value from userspace after being capped by the
    // kernel).
    sk->sk_max_ack_backlog = backlog;
    err                    = 0;

out:
    return err;
}

With those checks done, inet_csk_listen_start takes care of moving the socket into the listening state and allocating the connection queue:

int inet_csk_listen_start(struct sock *sk, int backlog)
{
    struct inet_connection_sock *icsk = inet_csk(sk);
    struct inet_sock *inet = inet_sk(sk);
    int err = -EADDRINUSE;

    // Initializes the internet connection accept
    // queue.
    reqsk_queue_alloc(&icsk->icsk_accept_queue);

    // Sets the maximum ACK backlog to the one that
    // was capped by the kernel.
    sk->sk_max_ack_backlog = backlog;

    // Sets the current size of the backlog to 0 (given
    // that it's not started yet).
    sk->sk_ack_backlog = 0;
    inet_csk_delack_init(sk);

    // Marks the socket as in the TCP_LISTEN state.
    sk_state_store(sk, TCP_LISTEN);

    // Tries to either reserve the port already
    // bound to the socket or pick a "random" one.
    if (!sk->sk_prot->get_port(sk, inet->inet_num)) {
        inet->inet_sport = htons(inet->inet_num);

        sk_dst_reset(sk);
        err = sk->sk_prot->hash(sk);

        if (likely(!err))
            return 0;
    }

    // If things went south, then return the error
    // but first set the state of the socket to
    // TCP_CLOSE.
    sk->sk_state = TCP_CLOSE;
    return err;
}

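Now that the socket has an address, the right state set, and a queue for ordering incoming connections, we are ready to accept connections.

Before that, though, let's look at a few situations we might run into.

What happens if we listen without binding first

If no bind is performed at all, listen(2) ends up picking a random port for you.

Why is that? If we look closely at the method that inet_csk_listen_start uses to prepare the port (get_port), we see that it picks a random ephemeral port when the underlying socket has not chosen one.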

/* Obtain a reference to a local port for the given sock,
 * if snum is zero it means select any available local port.
 * We try to allocate an odd port (and leave even ports for connect())
 */
int inet_csk_get_port(struct sock *sk, unsigned short snum)
{
    bool reuse = sk->sk_reuse && sk->sk_state != TCP_LISTEN;
    struct inet_hashinfo *hinfo = sk->sk_prot->h.hashinfo;
    int ret = 1, port = snum;
    struct inet_bind_hashbucket *head;
    struct inet_bind_bucket *tb = NULL;

    // If we didn't specify a port (port == 0)
    if (!port) {
        head = inet_csk_find_open_port(sk, &tb, &port);
        if (!head)
            return ret;
        if (!tb)
            goto tb_not_found;
        goto success;
    }

    // ...
}

So, if you don't want to bother choosing a port when listening, there you go!
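
This is easy to observe from userspace. The sketch below calls listen on a TCP socket that was never bound and then asks the kernel which address it ended up with; listen_without_bind is a hypothetical helper, and the behaviour described is the Linux one discussed above.

```python
import socket


def listen_without_bind(backlog=128):
    """Call listen() on an unbound TCP socket and return its address."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # No bind() here: the kernel has to pick a port for us.
        sock.listen(backlog)
        return sock.getsockname()


if __name__ == "__main__":
    host, port = listen_without_bind()
    print(f"listening on {host}:{port}")
```

Each run should report the wildcard address and a different ephemeral port chosen by the kernel.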

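Which metrics spike when connections are not accepted fast enough

Given that a socket in the passive state always has two queues (one for connections that have not yet completed the three-way handshake, and another for connections that have completed it but have not yet been accepted), we can imagine that once accepting cannot keep up, the second queue gradually fills up.

The first metric we can look at is one we have already covered: the idiag_rqueue and idiag_wqueue values that sock_diag reports for a particular socket.

idiag_rqueue
       For listening sockets: the number of pending connections.

       For other sockets: the amount of data in the incoming queue.

idiag_wqueue
       For listening sockets: the backlog length.

       For other sockets: the amount of memory available for sending.

While these are very useful for per-socket analysis, we can also look at higher-level information to understand whether the machine as a whole is seeing accept queue overflows.

Since the kernel records an error in ListenOverflows every time it tries and fails to move an incoming request from the syn queue to the accept queue, we can track the number of errors (available from /proc/net/netstat):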

# Retrieve the number of listen overflows
# (accept queue full, making transitioning a
# connection from `syn queue` to `accept queue`
# not possible at the moment).
cat /proc/net/netstat
TcpExt: SyncookiesSent SyncookiesRecv ...  ListenOverflows
TcpExt: 0 0 ... 105 ...

As we can see, the format of /proc/net/netstat is not very human-friendly. This is where netstat (the tool) comes in handy:

netstat --statistics | \
        grep 'times the listen queue of a socket overflowed'
105 times the listen queue of a socket overflowed
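
If we would rather track the counter from a script, the header/value line pairs of /proc/net/netstat can be zipped together. The sketch below assumes the file keeps that two-line layout (a line of names followed by a line of values with the same prefix); parse_netstat is a hypothetical helper.

```python
def parse_netstat(text):
    """Turn the contents of /proc/net/netstat into {prefix: {name: value}}."""
    stats = {}
    lines = text.strip().splitlines()
    # Lines come in pairs: names first, then the matching values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        value_prefix, numbers = values.split(":", 1)
        if prefix != value_prefix:
            continue  # Unexpected layout, skip this pair.
        stats[prefix] = {
            name: int(number)
            for name, number in zip(names.split(), numbers.split())
        }
    return stats


if __name__ == "__main__":
    with open("/proc/net/netstat") as f:
        counters = parse_netstat(f.read())
    print("ListenOverflows:", counters["TcpExt"].get("ListenOverflows"))
```

Sampling this value periodically and watching the difference between samples tells us whether the accept queue is overflowing right now, rather than at some point since boot.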

Curious about what happens in the kernel code? See tcp_v4_syn_recv_sock.

/*
 * The three way handshake has completed - we got a valid synack -
 * now create the new socket.
 */
struct sock *tcp_v4_syn_recv_sock(const struct sock *sk, struct sk_buff *skb,
                  struct request_sock *req,
                  struct dst_entry *dst,
                  struct request_sock *req_unhash,
                  bool *own_req)
{
    // ...
    if (sk_acceptq_is_full(sk))
        goto exit_overflow;
    // ...
exit_overflow:
    NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENOVERFLOWS); // (ListenOverflows)
}

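Now, what if the syn queue is close to full, yet no connection has completed the three-way handshake, so nothing can be moved to the accept queue and the syn queue is about to overflow?

That is where another metric comes in: TCPReqQFullDrop or TCPReqQFullDoCookies (depending on whether SYN cookies are enabled); see tcp_conn_request for details.

To find out how many connections are in the first queue (the syn queue) at a given moment, we can list all the sockets that are still in the syn-recv state: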

# List all sockets that are in
# the `SYN-RECV` state  towards
# the port 1337.
ss \
  --numeric \
  state syn-recv sport = :1337
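
A rough equivalent can be scripted on top of /proc/net/tcp by counting rows whose state is 03 (TCP_SYN_RECV in the enum shown earlier) for a given local port. This is a sketch under the assumption that half-open connections are listed in that file with state 03; count_syn_recv is a hypothetical helper.

```python
SYN_RECV = 0x03  # TCP_SYN_RECV in include/net/tcp_states.h


def count_syn_recv(proc_net_tcp_text, port):
    """Count rows of /proc/net/tcp in SYN-RECV state for a local port."""
    count = 0
    # The first line is the column header, so skip it.
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        local_port = int(fields[1].split(":")[1], 16)
        state = int(fields[3], 16)
        if state == SYN_RECV and local_port == port:
            count += 1
    return count


if __name__ == "__main__":
    with open("/proc/net/tcp") as f:
        print(count_syn_recv(f.read(), 1337))
```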

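Cloudflare has a great article on this topic: SYN packet handling in the wild.

Go check it out!

Closing thoughts

It is great to understand some of the edge cases involved in setting up a server TCP socket to accept new connections. I plan to explain some of the other pieces involved in this process, to help make sense of some of the quirks of modern TCP, but that is for another article. Happy reading!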
