TCP ADI in Linux(3): Implementation of Sockets

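This chapter starts from the socket() system call and introduces some of the important data structures initialized when a socket is created. A solid grasp of these structures and of how they relate to each other helps a great deal when reading the code, so this part will be extended and refined as my understanding deepens.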

First comes the sock struct, which is the network layer's representation of a socket. It is defined in include/net/sock.h, and the main members to understand are:

// sock_common is the first embedded struct; its layout is shared with inet_timewait_sock  
struct    sock_common        __sk_common;  
...  
struct    sk_buff_head    sk_receive_queue;    // receive queue; note the type is sk_buff_head, i.e. the head of an skb list  
...  
int                        sk_forward_alloc;    // space allocated in advance  
...  
int                        sk_rcvbuf;            // size of the receive buffer (bytes)  
...  
atomic_t                sk_wmem_alloc;        // bytes committed to the transmit queue
atomic_t                sk_omem_alloc;        // 'o' stands for "other"
int                        sk_sndbuf;            // size of the send buffer (bytes)  
struct    sk_buff_head    sk_write_queue;        // send queue of packets  
...  
int                        sk_wmem_queued;        // persistent queue size
gfp_t                    sk_allocation;        // allocation mode
u32                        sk_pacing_rate;        // pacing rate (bytes per second, if supported by transport/packet scheduler)  
...  
struct    sk_buff_head    sk_error_queue;        // queue of defective packets  
...  
long                    sk_rcvtimeo;        // upper bound used to decide a receive timeout  
long                    sk_sndtimeo;        // upper bound used to decide a send timeout  
...  
struct    sk_buff            *sk_send_head;        // points to the next skb that should be sent
int                     sk_write_pending;    // a write to stream socket waits to start  
...  

Next is inet_sock. The sock struct is the first member of inet_sock, which additionally holds information such as the TTL, IP addresses and ports. It is defined in include/net/inet_sock.h.

Then comes inet_connection_sock. inet_sock is the first member of inet_connection_sock and, as the name suggests, inet_connection_sock adds connection-oriented information on top of inet_sock. Its main members are:

struct    inet_sock        icsk_inet;
...  
struct    timer_list        icsk_retransmit_timer;    // resend (no ack)  
__u32                    icsk_rto;                  // retransmit timeout  
...  
const    struct tcp_congestion_ops    *icsk_ca_ops; // hook for the congestion control algorithm  
__u8                    icsk_ca_state;              // congestion control state  
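
To make the role of icsk_ca_ops as a hook more concrete, here is a minimal user-space sketch of an ops table made of function pointers. Everything prefixed with mini_ is hypothetical and is not kernel code; the real struct tcp_congestion_ops has many more callbacks. The ssthresh rule used here (half of the congestion window, but never below 2) follows the classic Reno behaviour and serves only as an illustration.

```c
#include <assert.h>

struct mini_conn;

/* Hypothetical, reduced stand-in for struct tcp_congestion_ops. */
struct mini_congestion_ops {
    const char *name;
    /* Returns the new slow start threshold after a loss event. */
    unsigned int (*ssthresh)(const struct mini_conn *c);
};

/* Hypothetical connection state; only the fields the sketch needs. */
struct mini_conn {
    unsigned int snd_cwnd;
    unsigned int snd_ssthresh;
    const struct mini_congestion_ops *ca_ops;   /* plays the role of icsk_ca_ops */
};

/* Reno-style rule: halve the window, but keep at least 2 segments. */
static unsigned int mini_reno_ssthresh(const struct mini_conn *c)
{
    unsigned int half = c->snd_cwnd >> 1;
    return half > 2 ? half : 2;
}

static const struct mini_congestion_ops mini_reno = {
    "reno",
    mini_reno_ssthresh,
};

/* The caller only goes through the hook and never names the algorithm. */
static void mini_enter_loss(struct mini_conn *c)
{
    c->snd_ssthresh = c->ca_ops->ssthresh(c);
}
```

With this layout, switching to another algorithm only means pointing ca_ops at a different table; mini_enter_loss() stays unchanged.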

Last is the struct most closely tied to TCP, tcp_sock. In the same way, inet_connection_sock is the first member of tcp_sock. Its main members are:

struct    inet_connection_sock    inet_conn;  
...  
/* RFC793 and RFC1122 are the best references for this */
u32        rcv_nxt;                // what we want to receive next  
u32        copied_seq;                // Head of yet unread data  
u32        snd_nxt;                // next sequence we send  
u32        snd_una;                // first byte we want an ack for  
...  
u32        rcv_tstamp;                // timestamp of last rcv ack (for keepalives)  
u32        lsndtime;                // timestamp of last snd pkt (for restart window)  
...  
u32        snd_wnd;                // the window we expect to receive  
u32        max_window;                // maximal window ever seen from peer  
u32        window_clamp;            // maximal window to advertise  
u32        rcv_ssthresh;            // current window clamp  
u16        advmss;                    // advertised MSS  

/* RTT measurement */
u32        srtt;                    // smoothed RTT << 3  
u32        mdev;                    // medium deviation  
u32        mdev_max;                // maximal mdev for the last rtt period  
u32        rttvar;                    // smoothed mdev_max  
u32        rtt_seq;                // sequence number to update rttvar  

u32        packets_out;            // packets which are "in_flight"  
u32        retrans_out;            // retransmitted packets out  
...  

/* Slow start and congestion control */
u32        snd_ssthresh;            // slow start size threshold  
u32        snd_cwnd;                // sending congestion window, not to be confused with snd_wnd  
u32        snd_cwnd_cnt;            // linear increase counter  
u32        snd_cwnd_clamp;            // upper bound on snd_cwnd  
u32        snd_cwnd_used;
u32        snd_cwnd_stamp;
u32        prior_cwnd;                // cwnd value at the start of Recovery  
u32        prr_delivered;            // # of newly delivered pkts in Recovery  
u32        prr_out;                // # of total sent pkts during Recovery  

u32        rcv_wnd;                // current receiver window: our own window when acting as receiver, advertised to the peer  

struct sk_buff *highest_sack;    // skb just after the highest skb with SACKed bit set  
...  
u32        retransmit_high;
u32        lost_retrans_low;
u32        prior_ssthresh;            // ssthresh saved at Recovery start  
u32        high_seq;  
u32        retrans_stamp;            // timestamp of the last retransmit  
u32        undo_marker;            // tracking retrans started here  
int     undo_retrans;            // number of undoable retransmissions  
u32        total_retrans;            // total retransmits for entire connection  
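
The nesting described above (sock inside inet_sock inside inet_connection_sock inside tcp_sock) can be sketched in user space as follows. The mini_ structs and helpers are hypothetical and heavily reduced; they only show why placing each struct as the first member allows a pointer to the innermost struct to be cast to the outermost one, in the spirit of the kernel's tcp_sk() helper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily reduced stand-ins for the kernel structs. */
struct mini_sock {
    int sk_rcvbuf;
    int sk_sndbuf;
};

struct mini_inet_sock {
    struct mini_sock sk;                 /* must stay the first member */
    unsigned short inet_sport;
};

struct mini_inet_connection_sock {
    struct mini_inet_sock icsk_inet;     /* must stay the first member */
    unsigned int icsk_rto;
};

struct mini_tcp_sock {
    struct mini_inet_connection_sock inet_conn;   /* must stay the first member */
    unsigned int snd_cwnd;
};

/* Downcast: valid only if sk really lives inside a mini_tcp_sock. */
static struct mini_tcp_sock *mini_tcp_sk(struct mini_sock *sk)
{
    return (struct mini_tcp_sock *)sk;
}

/* Upcast: take the address of the innermost embedded struct. */
static struct mini_sock *mini_sk(struct mini_tcp_sock *tp)
{
    return &tp->inet_conn.icsk_inet.sk;
}
```

Because every level sits at offset 0 of the next one, all four views share the same address, so generic code can pass around a struct sock pointer while TCP code recovers its own fields from it.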

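Summary

As can be seen, these structs are defined by nesting one inside another, which to me looks very much like the memory layout produced by class inheritance in C++. The passage below is the book's own summary; I found it well put, so I quote it directly.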

There are two levels of socket abstraction. At the top is the BSD  
socket layer defined as struct socket and then protocol-specific  
socket defined as struct sock.  
    1. sock_register() is an interface to register BSD sockets for
    different net families. For INET family, inet_family_ops of  
    type net_proto_family is registered.  
    2. net_families is a global array indexed on net family
    number. Net family sockets are registered with this table.  
    3. inet_register_protosw() is an interface to register  
    protocol supported by the INET family. These protocols  
    are TCP, UDP and RAW.  
    4. inetsw_array is a global table that registers the  
    INET family protocols, object of type inet_protosw.  
    5. inet_stream_ops is the set of operations for INET stream BSD  
    socket, and tcp_prot is a protocol-specific set of operations  
    for TCP sockets.  
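
The registration table from points 3 to 5 can be pictured with a small user-space sketch. It is a simplified model and not the kernel code: the table is a flat array instead of per-type lists, the entries only carry names instead of real operation tables, and mini_protosw, mini_inetsw and mini_lookup are hypothetical names. The matching rule follows my reading of inet_create(): protocol 0 (IPPROTO_IP) acts as a wildcard on either side.

```c
#include <assert.h>
#include <stddef.h>
#include <netinet/in.h>
#include <sys/socket.h>

/* Hypothetical, reduced stand-in for struct inet_protosw. */
struct mini_protosw {
    int type;               /* SOCK_STREAM, SOCK_DGRAM, SOCK_RAW */
    int protocol;           /* IPPROTO_IP (0) means "any protocol" */
    const char *prot_name;  /* name of the protocol-specific ops */
    const char *ops_name;   /* name of the BSD socket ops */
};

static const struct mini_protosw mini_inetsw[] = {
    { SOCK_STREAM, IPPROTO_TCP, "tcp_prot", "inet_stream_ops" },
    { SOCK_DGRAM,  IPPROTO_UDP, "udp_prot", "inet_dgram_ops" },
    { SOCK_RAW,    IPPROTO_IP,  "raw_prot", "inet_sockraw_ops" },
};

/* Returns the matching entry, or NULL when the pair is unsupported. */
static const struct mini_protosw *mini_lookup(int type, int protocol)
{
    size_t n = sizeof(mini_inetsw) / sizeof(mini_inetsw[0]);

    for (size_t i = 0; i < n; i++) {
        const struct mini_protosw *e = &mini_inetsw[i];

        if (e->type != type)
            continue;
        /* Exact, non-wildcard match. */
        if (protocol == e->protocol && protocol != IPPROTO_IP)
            return e;
        /* Caller passed 0: take the default protocol of this type. */
        if (protocol == IPPROTO_IP && e->protocol != IPPROTO_IP)
            return e;
        /* Entry is a wildcard: accept any explicit protocol. */
        if (e->protocol == IPPROTO_IP && protocol != IPPROTO_IP)
            return e;
    }
    return NULL;
}
```

For example, mini_lookup(SOCK_STREAM, 0) resolves to the TCP entry, which is how a stream socket created with protocol 0 ends up with inet_stream_ops and tcp_prot.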

Finally, using the Linux v3.10 source, let's walk through the major routines in sys_socket() that are of interest for now.

socket() [User Space] corresponds to sys_socket() [Kernel Space]
__________________________________________________________________
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol)  // net/socket.c
    => sock_create() // in net/socket.c  
        => sock = sock_alloc()  // allocate the BSD socket
        => pf = rcu_dereference(net_families[family]);  // look up the struct net_proto_family for the given family  
        => pf->create() = inet_create()  // for PF_INET, defined in net/ipv4/af_inet.c  
            => walk the inetsw list; if protocol is not set (0), a SOCK_STREAM socket is matched to IPPROTO_TCP by default
            => sock->ops = answer->ops  // for TCP this sets sock->ops = &inet_stream_ops
            => sk = sk_alloc()  // allocate the struct sock; the address of tcp_prot is passed in, so memory for the whole tcp_sock is allocated in one go  
            => sock_init_data()  // initialize the sock struct and associate the sock with the socket

            => sk->sk_prot->init() = tcp_v4_init_sock()    
                => tcp_init_sock()  // initialize the tcp_sock struct
    => sock_map_fd()  // bind sock with fd
        => fd = get_unused_fd_flags()
        => newfile = sock_alloc_file()  // create the struct file and associate it with the socket
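
Some results of this call chain can be observed from user space. The sketch below is a small Linux-only probe, not part of the kernel walk-through; probe_default_tcp_socket and struct sock_probe are hypothetical names. It creates a stream socket with protocol 0, then checks that the returned fd is backed by a socket file, that the protocol was resolved to TCP, and that the buffer sizes reported by SO_RCVBUF and SO_SNDBUF (which I understand to reflect sk_rcvbuf and sk_sndbuf) are already set.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <unistd.h>

/* What the probe reports about a freshly created socket. */
struct sock_probe {
    int is_socket_file;   /* 1 if the fd refers to a socket file */
    int protocol;         /* protocol chosen by the kernel */
    int rcvbuf;           /* value reported by SO_RCVBUF */
    int sndbuf;           /* value reported by SO_SNDBUF */
};

/* Returns 0 on success, -1 if any system call fails. */
static int probe_default_tcp_socket(struct sock_probe *out)
{
    struct stat st;
    socklen_t len;
    int rc = -1;
    int fd = socket(AF_INET, SOCK_STREAM, 0);   /* protocol left unset */

    if (fd < 0)
        return -1;
    if (fstat(fd, &st) != 0)
        goto done;
    out->is_socket_file = S_ISSOCK(st.st_mode) ? 1 : 0;

    len = sizeof(out->protocol);
    if (getsockopt(fd, SOL_SOCKET, SO_PROTOCOL, &out->protocol, &len) != 0)
        goto done;
    len = sizeof(out->rcvbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &out->rcvbuf, &len) != 0)
        goto done;
    len = sizeof(out->sndbuf);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &out->sndbuf, &len) != 0)
        goto done;
    rc = 0;
done:
    close(fd);
    return rc;
}
```

On a typical Linux system the probe should report a socket file and IPPROTO_TCP, matching the default described for inet_create() and the file association done under sock_map_fd().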