TCP ADI in Linux(4): TCP Connection Setup

This chapter studies the key functions of TCP network programming: bind(), listen(), accept(), and connect(). To be precise, it covers the kernel implementation of the system calls behind these functions, not how user space uses them. Along the way it also introduces the kernel data structures related to these calls.

The order of the key system calls on the server side, with brief descriptions

These are hard to paraphrase accurately, so I quote the book directly :-)

socket(): Identify the correct set of socket & protocol operations and link them together with the help of the sock & socket structures. Hook this socket to the VFS and associate this socket with the inode.

bind(): Register this socket and request the kernel to associate a port number and/or IP address with the socket. At this stage the kernel will lock the port number.

listen(): Request the kernel to configure the connection backlog queue for the socket.

accept(): This is the final step to get the server application up. The server application requests the kernel to start accepting connections for it. The kernel creates a new socket on behalf of the server application, associates this socket with the VFS, and returns the new socket fd to the server application.

The bind Process and Related Data Structures


The main job of bind is to bind a socket to a sockaddr structure, as its prototype shows:

int bind(int sockfd, struct sockaddr *my_addr, int addrlen);

struct sockaddr is only a generic data structure; real code usually declares a more specific one, e.g. sockaddr_in for the AF_INET family, and casts it when calling bind.
The relevant declarations are:

// declared in include/linux/socket.h
struct sockaddr {
    sa_family_t     sa_family;      // address family, AF_xxx
    char            sa_data[14];    // generic payload area
};

// declared in include/uapi/linux/in.h
struct sockaddr_in {
    __kernel_sa_family_t    sin_family;     // address family, an unsigned short

    __be16                  sin_port;       // port number, __u16, i.e. 2 bytes
    struct in_addr          sin_addr;       // internet address

    // pad to the size of `struct sockaddr`
    unsigned char           _padXXX;        // pads the unused space; the real declaration is omitted here
};

// declared in include/uapi/linux/in.h
struct in_addr {
    __be32      s_addr;     // __u32, i.e. 4 bytes
};

As you can see, sockaddr_in uses only 6 of the 14 generic bytes (sin_port plus sin_addr) as meaningful data. Note that on the server side sin_addr is usually set to INADDR_ANY (0x00000000), meaning the socket accepts connection requests arriving on any network interface (servers often have several NICs, and hence several IP addresses).
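
As a quick illustration, here is a minimal user-space sketch of binding a listening socket to INADDR_ANY (the helper name make_listen_fd is mine; error handling omitted):

#include <string.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int make_listen_fd(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in addr;

    memset(&addr, 0, sizeof(addr));           /* zero the padding bytes too */
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);              /* network byte order (__be16) */
    addr.sin_addr.s_addr = htonl(INADDR_ANY); /* accept on any local interface */

    /* cast the specific sockaddr_in to the generic sockaddr */
    bind(fd, (struct sockaddr *)&addr, sizeof(addr));
    return fd;
}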

The flow of the bind system call

sys_bind()
    => sock = sockfd_lookup_light() // look up the socket pointer from the fd
    => err = move_addr_to_kernel()
    => sock->ops->bind()  == inet_bind()
        => sk->sk_prot->bind() == raw_bind()  // for a RAW socket
        => sk->sk_prot->get_port() == inet_csk_get_port()  // for TCP
        => sk->sk_prot->get_port() == udp_v4_get_port()  // for UDP

sys_bind() is implemented in net/socket.c and declared as:

SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)  

It first finds the socket structure from the fd, then copies the sockaddr from user space into the kernel, and finally dispatches to the protocol-specific sock->ops->bind. If fput_needed was set by the lookup, the reference taken on the file is dropped afterwards (fput_light).
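
The body of sys_bind() is short; the following is condensed from a 3.x net/socket.c (the LSM security hook is trimmed):

SYSCALL_DEFINE3(bind, int, fd, struct sockaddr __user *, umyaddr, int, addrlen)
{
    struct socket *sock;
    struct sockaddr_storage address;
    int err, fput_needed;

    sock = sockfd_lookup_light(fd, &err, &fput_needed);
    if (sock) {
        err = move_addr_to_kernel(umyaddr, addrlen, &address);
        if (err >= 0)
            /* dispatches to inet_bind() for AF_INET sockets */
            err = sock->ops->bind(sock, (struct sockaddr *)&address, addrlen);
        fput_light(sock->file, fput_needed);  /* drop the file reference */
    }
    return err;
}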

The flow of sockfd_lookup_light() is:

sockfd_lookup_light()
    => file = fget_light(fd, fput_needed);    // get the file pointer
    => sock = sock_from_file(file, err);      // get the socket pointer
        return file->private_data;  // private_data is the file's pointer to the socket structure
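
sock_from_file() itself is tiny; condensed from net/socket.c:

static struct socket *sock_from_file(struct file *file, int *err)
{
    if (file->f_op == &socket_file_ops)
        return file->private_data;  /* set when the socket was mapped to the fd */

    *err = -ENOTSOCK;
    return NULL;
}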

inet_bind() is implemented in net/ipv4/af_inet.c; its flow is:

1. For a RAW socket, call sk->sk_prot->bind() directly.
2. Check addr_len and sin_family.
3. chk_addr_ret = inet_addr_type(); // determine the address type for the checks below
4. Check the port: unprivileged users may only use ports >= 1024 (see the sketch after this list).
5. Check sk->sk_state and inet->inet_num to catch a double bind.
6. /* rcv_saddr is used for hash lookups, inet_saddr for transmit; normally they hold the same value */
   inet->inet_rcv_saddr = inet->inet_saddr = addr->sin_addr.s_addr;
7. Call sk->sk_prot->get_port(sk, snum); // snum is sin_port
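
The port-privilege check of step 4 looks like this, lightly condensed from inet_bind() (PROT_SOCK is 1024):

/* unprivileged processes may not bind ports below PROT_SOCK */
snum = ntohs(addr->sin_port);
err = -EACCES;
if (snum && snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE))
    goto out;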

Port management

Before looking at the concrete get_port functions, we need to understand how the kernel manages ports.
The kernel tracks ports in use through a hash table stored in tcp_hashinfo, reachable via sk->sk_prot->h.hashinfo.

struct inet_hashinfo {
    struct inet_ehash_bucket    *ehash;        
    spinlock_t                    *ehash_locks;
    unsigned int                ehash_mask;
    unsigned int                ehash_locks_mask;

    struct inet_bind_hashbucket        *bhash;    
    unsigned int                    bhash_size;
    struct kmem_cache                *bind_bucket_cachep;

    /* The fields above are set at bootup and read-only afterwards;
     * the ones below are frequently dirtied, hence the cacheline alignment.
     */

    struct inet_listen_hashbucket    listening_hash[INET_LHTABLE_SIZE]
                                    ____cacheline_aligned_in_smp;
    atomic_t                        bsockets;
};

struct inet_bind_hashbucket {
    spinlock_t            lock;
    struct hlist_head    chain;
};

struct hlist_head {
    struct hlist_node     *first;
};

struct hlist_node {
    struct hlist_node    *next, **pprev;
};
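
A local port is mapped to its bucket in bhash by a simple hash function; in 3.x kernels this is inet_bhashfn() in include/net/inet_hashtables.h (the net_hash_mix() term keeps network namespaces from colliding):

static inline u32 inet_bhashfn(struct net *net, const __u16 lport,
                               const unsigned int bhash_size)
{
    return (lport + net_hash_mix(net)) & (bhash_size - 1);
}

/* typical use when looking up the bucket for port snum: */
head = &hashinfo->bhash[inet_bhashfn(net, snum, hashinfo->bhash_size)];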

TCP port binding: inet_csk_get_port()

In newer kernels TCP's get_port function has been renamed inet_csk_get_port(), which differs from the book, but it does the same job. Its arguments are the sock structure and a port; if the port is 0, it must find a free port for the sock. The implementation lives in net/ipv4/inet_connection_sock.c; a detailed walkthrough can be found here. The gist is: find a usable port, then decide whether that port may be reused; a port that cannot simply be reused must additionally be checked for binding conflicts via bind_conflict().
For TCP, bind_conflict() is inet_csk_bind_conflict(), implemented in net/ipv4/inet_connection_sock.c and hooked up through ipv4_specific in net/ipv4/tcp_ipv4.c.
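
At its core the conflict check is a walk over the owners list of the matching bind bucket. The following is a heavily simplified sketch of that idea, not the exact kernel code (which also honors SO_REUSEADDR combinations and a relax argument):

struct sock *sk2;

sk_for_each_bound(sk2, &tb->owners) {
    if (sk == sk2)
        continue;
    /* sockets bound to different devices never conflict */
    if (sk->sk_bound_dev_if && sk2->sk_bound_dev_if &&
        sk->sk_bound_dev_if != sk2->sk_bound_dev_if)
        continue;
    if (!sk2->sk_reuse || sk2->sk_state == TCP_LISTEN) {
        /* a wildcard address on either side, or the same address, is a conflict */
        if (!inet_sk(sk2)->inet_rcv_saddr ||
            !inet_sk(sk)->inet_rcv_saddr ||
            inet_sk(sk2)->inet_rcv_saddr == inet_sk(sk)->inet_rcv_saddr)
            break;  /* conflict found: sk2 is non-NULL after the loop */
    }
}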

Tip: for TCP, sk->sk_prot is initialized to struct proto tcp_prot, defined in net/ipv4/tcp_ipv4.c.

The listen Process and Related Structures


sys_listen() is the kernel-side counterpart of listen; it is implemented in net/socket.c and declared as:

SYSCALL_DEFINE2(listen, int, fd, int, backlog)  

After obtaining the socket structure, sys_listen checks whether backlog exceeds the system limit (somaxconn), then calls sock->ops->listen. First, the key call chain of sys_listen:

sys_listen()
=> sock->ops->listen()  == inet_listen()
    => fastopen_init_queue(sk, backlog)    // set up the fastopen queue, if conditions allow
    => inet_csk_listen_start(sk, backlog)
        => sk->sk_state = TCP_LISTEN
        => sk->sk_prot->get_port(sk, inet->inet_num)  // re-check that the port is still usable; see the quoted explanation below
        => sk->sk_prot->hash(sk) == inet_hash(sk)     // if all is well, create a hash entry for this socket
            => __inet_hash()  // create the entry and add it to the listen hash table

inet_listen() first validates the socket's state and returns an error for anything illegal. If the socket is not yet in the listen state, more work is needed to move it there; otherwise it simply sets sk->sk_max_ack_backlog. To enter the listen state, it first checks whether the conditions for TCP_FASTOPEN (TFO) are met and, if so, runs fastopen_init_queue(); it then calls inet_csk_listen_start(). For TCP, sk->sk_prot->hash(sk) resolves to inet_hash(), implemented in net/ipv4/inet_hashtables.c.
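
Condensed, the interesting part of inet_listen() looks like this (based on a 3.x kernel, TFO details omitted):

int inet_listen(struct socket *sock, int backlog)
{
    struct sock *sk = sock->sk;
    unsigned char old_state;
    int err = -EINVAL;

    lock_sock(sk);
    if (sock->state != SS_UNCONNECTED || sock->type != SOCK_STREAM)
        goto out;

    old_state = sk->sk_state;
    if (!((1 << old_state) & (TCPF_CLOSE | TCPF_LISTEN)))
        goto out;

    /* only move to TCP_LISTEN if we are not yet listening;
     * otherwise just update the backlog below */
    if (old_state != TCP_LISTEN) {
        /* fastopen_init_queue() would run here when TFO applies */
        err = inet_csk_listen_start(sk, backlog);
        if (err)
            goto out;
    }
    sk->sk_max_ack_backlog = backlog;
    err = 0;
out:
    release_sock(sk);
    return err;
}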

As for why sys_listen must call get_port once more, I am not entirely sure. From the code comments and the book's discussion, it guards against a race between threads: the crux is that the reuse flag may have been reset in the meantime. The book's explanation, quoted:

We need to check if we are still eligible to use the same port to which we earlier bound this socket. There is a window between the bind() and listen() calls from an application when two threads can race to bind two sockets to the same port. After both the threads are bound to the same port (both the sockets are in the bind hash list, tcp_bhash), one of the sockets makes the socket port not reusable (resets sk->reuse for itself) and gets into the TCP_LISTEN state. The other thread now enters the listen() systemcall and gets into this part of the code. So, once again it needs to make sure whether it can use the same port that it requested earlier.

How calling bind and listen affects the socket's state

1. A socket that has called bind but not listen is merely bound to a port (and/or IP) and cannot accept connection requests.
   The socket is not yet in the listening state; if a client sends a connection request, the server replies with a reset.
2. When listen has been called but accept has not, the socket is in the listening state.
   If a client now sends a connection request, the three-way handshake completes successfully [note!];
   the client can even send data and receive ACKs, but after sending rwnd's worth of data it receives an ACK advertising rwnd = 0.
   The client then stops sending. Since the server never consumes the received data, the client's zero-window probes eventually time out and the connection is torn down.

tcp_ehash stores every TCP flow in the TCP_ESTABLISHED and TIME_WAIT states. When a new packet arrives, we must find the socket it belongs to, which means looking it up in tcp_ehash using the four-tuple (remote IP, port and local IP, port) as the key.
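
The four-tuple is folded into a table index by inet_ehashfn(); condensed from a 3.x include/net/inet_hashtables.h (the exact mixing has changed across kernel versions):

static inline unsigned int inet_ehashfn(struct net *net,
                                        const __be32 laddr, const __u16 lport,
                                        const __be32 faddr, const __be16 fport)
{
    return jhash_3words((__force __u32) laddr,
                        (__force __u32) faddr,
                        ((__u32) lport) << 16 | (__force __u32) fport,
                        inet_ehash_secret + net_hash_mix(net));
}

/* the lookup then walks the chain at ehash[hash & ehash_mask] */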

Distinguishing the SYN queue from the accept queue

SYN queue: when a listening socket receives a connection request (the first SYN packet), it sends a SYN/ACK and adds the corresponding connection request to the SYN queue, where it waits for the final ACK.

Accept queue: once the final ACK of the three-way handshake arrives, a new socket is created for the connection request, and the request is removed from the SYN queue. Finally, the connection request is placed on the listening socket's accept queue.
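
Both queues are capacity-checked right at the top of tcp_v4_conn_request() (the function shown in the flow below); condensed from a 3.x kernel:

/* SYN queue full: this may be a SYN flood; fall back to SYN
 * cookies if they are enabled, otherwise drop the SYN */
if (inet_csk_reqsk_queue_is_full(sk) && !isn) {
    want_cookie = tcp_syn_flood_action(sk, skb, "TCP");
    if (!want_cookie)
        goto drop;
}

/* accept queue full while young (never-retransmitted) requests
 * exist: drop the SYN and let the client retry later */
if (sk_acceptq_is_full(sk) && inet_csk_reqsk_queue_young(sk) > 1)
    goto drop;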

Flow control for handling a new connection request

When the TCP layer receives an IP packet, the function invoked is tcp_v4_rcv(), implemented in net/ipv4/tcp_ipv4.c.

tcp_v4_rcv(struct sk_buff *skb)
    => sk = __inet_lookup_skb(&tcp_hashinfo, skb, th->source, th->dest) // find the sock for this IP packet via the four-tuple
        => __inet_lookup()
            => __inet_lookup_established()  // search the tcp_ehash table
            => __inet_lookup_listener()     // if not found, search hashinfo->listening_hash[]

    => ret = tcp_v4_do_rcv(sk, skb)
        => if (sk->sk_state == TCP_ESTABLISHED)
            => tcp_rcv_established()  // analyzed in a later chapter
        => if (sk->sk_state == TCP_LISTEN)
            => struct sock *nsk = tcp_v4_hnd_req(sk, skb)    // find the sock for this skb; drop the packet if none is found
            => if (nsk != sk)  // nsk differing from sk means a sock has already been created for this connection request
                => tcp_child_process(sk, nsk, skb)  // do further processing on the newly created sock
        => tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)    // handle the packet according to the state; here we care about LISTEN and SYN_SENT
            => case TCP_LISTEN:
                => icsk->icsk_af_ops->conn_request() == tcp_v4_conn_request()
                    => inet_csk_reqsk_queue_is_full(sk)    // check whether the request (SYN) queue is full
                    => sk_acceptq_is_full(sk)              // check whether the accept queue is full
                    => req = inet_reqsk_alloc()            // allocate a request sock for the connection request
                    => tcp_parse_options()                 // parse the TCP options
                    => tcp_openreq_init()
                    => ip_build_and_send_pkt()             // add an IP header to the skbuff and send it out
                    => inet_csk_reqsk_queue_hash_add()     // add the request sock to the SYN table
            => case TCP_SYN_SENT:
                => queued = tcp_rcv_synsent_state_process(sk, skb, th, len)    // the comments in the code are fairly detailed
                    => tcp_finish_connect()     // complete the connection and do the final setup
                        => tcp_set_state(sk, TCP_ESTABLISHED)   // set sk_state
                        => tcp_init_congestion_control(sk)      // set the congestion control algorithm; the initialization of my fast-retransmit implementation also lives around here
                        => tcp_init_buffer_space(sk)

The accept Process and Related Structures


The accept system call corresponds to sys_accept() in the kernel, implemented in net/socket.c. The main call flow:

SYSCALL_DEFINE4(accept4, int, fd, struct sockaddr __user *, upeer_sockaddr,
                int __user *, upeer_addrlen, int, flags)  == sys_accept4
    => sock = sockfd_lookup_light(fd, &err, &fput_needed)  // find the sock structure from the listening socket fd
    => newsock = sock_alloc()   // allocate a new BSD socket
    => newfd = get_unused_fd_flags(flags)
    => newfile = sock_alloc_file(newsock, flags, sock->sk->sk_prot_creator->name)
    => err = sock->ops->accept(sock, newsock, sock->file->f_flags)  == inet_accept()
        => *sk2 = sk1->sk_prot->accept()  == inet_csk_accept()
            => if the accept queue is empty, wait for a connection if the socket is blocking
            => otherwise, take the very first request

            => newsk = req->sk    // take the sock pointer out of the request structure and return it
        => sock_graft(sk2, newsock)  // graft the obtained sock onto the freshly allocated BSD socket
        => newsock->state = SS_CONNECTED;
    => fd_install(newfd, newfile);   // index newfile for the socket inode in the process file table
        => fd_install essentially does: current->files->fd[fd] = file;
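
Condensed, the heart of inet_csk_accept() looks like this (error paths and TCP fastopen details trimmed, based on a 3.x kernel):

struct inet_connection_sock *icsk = inet_csk(sk);
struct sock *newsk;

lock_sock(sk);
if (reqsk_queue_empty(&icsk->icsk_accept_queue)) {
    long timeo = sock_rcvtimeo(sk, flags & O_NONBLOCK);

    /* non-blocking socket with an empty queue: fail with -EAGAIN */
    error = -EAGAIN;
    if (!timeo)
        goto out_err;
    /* blocking socket: sleep until a completed connection arrives */
    error = inet_csk_wait_for_connect(sk, timeo);
    if (error)
        goto out_err;
}
/* detach the first request and return the sock created during the handshake */
newsk = reqsk_queue_get_child(&icsk->icsk_accept_queue, sk);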

inet_accept() lives in net/ipv4/af_inet.c and performs the final step of connection establishment: accepting a pending connection. Of course, this pending connection has already completed the three-way handshake.
Pay attention to sock_graft() here, implemented in include/net/sock.h. Its first parameter is a struct sock *sk, obtained from the first entry of the accept queue and created back during the three-way handshake; its second parameter is a struct socket *parent, the concept familiar from network programming, which is fundamentally different from the kernel's sock.
Also note a small naming convention in the kernel:
instances of struct socket are usually named sock, while
instances of struct sock are usually named sk.

The Relationship Between the BSD socket and sock Structures

sock_graft() also shows the relationship between the BSD socket structure and the sock structure quite directly; the source is short, so it is quoted below. Note that, to show sock_graft's full effect, the inline helpers have been expanded by hand.

static inline void sock_graft(struct sock *sk, struct socket *parent)
{
    write_lock_bh(&sk->sk_callback_lock);
    sk->sk_wq = parent->wq;
    parent->sk = sk;
    sk->sk_tx_queue_mapping = -1;
    sk->sk_socket = parent;
    write_unlock_bh(&sk->sk_callback_lock);
}

The relationship between the file, inode, and socket structures is awkward to describe in words; I recommend looking at Figures 4.21 and 4.22 in the book together with the code.

The connect Process and Related Structures


What a client must do to establish a TCP connection is comparatively simple: call socket() to create a new BSD socket, then call connect() to reach the remote end. A more detailed and precise explanation, quoted from the book:

Socket():
    1. Identify correct set of socket & protocol operations and  
    link them together with the help of sock & socket structure.  
    2. Initialize some of the fields of protocol specific data structures.  
    3. Hook this socket to the vfs and associate this socket to the inode.

Connect():

    1. Let the kernel know what services (server port number) you want to avail and  
    from where (IP address).  
    2. Initializes protocol specific data structures, allocates resources for client  
    application, and sets up the complete protocol stack for the client side.  
    3. By default, connect blocks and returns to the application once the connection  
    is established with the server else returns an error number.  
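
The client side of the API mirrors the above; here is a minimal user-space sketch (the IP address and port are placeholders, error handling trimmed):

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* sys_socket */
    struct sockaddr_in srv;

    memset(&srv, 0, sizeof(srv));
    srv.sin_family = AF_INET;
    srv.sin_port = htons(8080);                      /* example server port */
    inet_pton(AF_INET, "192.0.2.1", &srv.sin_addr);  /* example server IP */

    /* sys_connect: by default blocks until the three-way handshake completes */
    if (connect(fd, (struct sockaddr *)&srv, sizeof(srv)) < 0)
        perror("connect");

    close(fd);
    return 0;
}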

The connect system call corresponds to sys_connect() in the kernel, implemented in net/socket.c.
The main call flow of connect:

SYSCALL_DEFINE3(connect, int, fd, struct sockaddr __user *, uservaddr,  
                int, addrlen)  == sys_connect()  
    => sock = sockfd_lookup_light()  
    => err = move_addr_to_kernel()  
    => err = sock->ops->connect()  == inet_stream_connect()  
        => any state other than SS_UNCONNECTED is unacceptable for processing  
        => err = sk->sk_prot->connect(sk, uaddr, addr_len)  == tcp_v4_connect()  
            => rt = ip_route_connect()  // get the route for the dst addr. All routing entries for the system are hashed in the global table rt_hash_table[].    
            => tcp_set_state(sk, TCP_SYN_SENT)
            => err = inet_hash_connect(&tcp_death_row, sk);  // obtain a free port; the flow resembles tcp_v4_get_port  
                => inet_get_local_port_range(&low, &high)
                => walk the candidate ports; for each one, walk the inet_bind_bucket to check for conflicts  
                => if there is no conflict, tb = inet_bind_bucket_create() creates the hash entry in the bind bucket

            => Until now we got the route to the destination, and obtained the local port number,  
               and we have initialized the remote address, remote port, local address, and local port fields of the socket.  

            => err = tcp_connect(sk)  // generate a SYN packet and give it to the IP layer  
                => tcp_connect_init(sk)  // do all connect socket setups that can be done AF independent  
                    => tcp_select_initial_window() // determine a window scaling and initial window to offer  
                => buff = alloc_skb_fclone()  // allocate an sk_buff structure; details in the next chapter  
                => tcp_transmit_skb(sk, buff, 1, sk->sk_allocation)  // clone the skb, then send it out
                    => skb = skb_clone(skb, gfp_mask)  // clone the skb  
                    => initialize the skb and the TCP header  
                    => tcp_options_write()  // write previously computed tcp options to the packet  
                    => icsk->icsk_af_ops->send_check(sk, skb)  == tcp_v4_send_check()  // compute the checksum
                    => err = icsk->icsk_af_ops->queue_xmit()  == ip_queue_xmit()  // hand the packet to the IP layer  
        => at this point the SYN has been sent; now wait for the SYN/ACK to complete the three-way handshake  
        => timeo = sock_sndtimeo(sk, flags & O_NONBLOCK)
        => inet_wait_for_connect(sk, timeo, writebias)  // finish the last part of the three-way handshake  
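
The tail of inet_stream_connect() that performs this wait is, condensed from a 3.x kernel:

timeo = sock_sndtimeo(sk, flags & O_NONBLOCK);

if ((1 << sk->sk_state) & (TCPF_SYN_SENT | TCPF_SYN_RECV)) {
    /* blocking socket: sleep until the handshake finishes;
     * non-blocking (timeo == 0): return immediately */
    if (!timeo || !inet_wait_for_connect(sk, timeo, writebias))
        goto out;

    err = sock_intr_errno(timeo);
    if (signal_pending(current))
        goto out;
}
sock->state = SS_CONNECTED;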

SUMMARY

Quoted from the book:

Protocol-specific operations on the socket are accessed from the prot field of the sock object.  
For the INET stream protocol, this field is initialized to tcp_prot.  

The tcp_hashinfo object has pointers to different hash tables for bind, established, and listening sockets.  
    1. tcp_bhash is an object of type tcp_bind_hashbucket pointing to the bind hash table.  
       This table is hashed on the port number the sockets are bound to.  
    2. ehash is an object of type tcp_ehash_bucket pointing to the established hash table.  
       Hashed on the destination and source port/IP.  
    3. tcp_listening_hash is a hash table of sock objects hashing all the listening sockets.  
       Hashed on the listening port number.  

tcp_bind_conflict() checks for any conflicts related to allocation of a port.  
tcp_port_rover stores the last allocated port number.  
tcp_listen_opt is an object that keeps information about all connection requests for a listening socket.  
    syn_table is a field of the tcp_listen_opt object, of type open_request.  
    It hashes in all the connection requests for the listening socket.  

Once the three-way handshake is over, the connection request is moved from the listener's SYN queue to the accept queue, tp->accept_queue.  
[Important] sock and tcp_opt objects are initialized for the new connection in the accept queue.  
[Important] Once an application accepts a connection request in the accept queue,  
            a BSD socket is created for the new connection and is associated with the VFS.  

__tcp_v4_lookup_established() searches for established connections in the ehash table.  
tcp_v4_lookup_listener() searches for listening sockets in the tcp_listening_hash hash table.