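This chapter starts from the socket() system call and introduces the key data structures that are initialized when a socket is created. Knowing these structures and how they relate to each other makes the code much easier to follow, so this part will be extended and corrected as my understanding deepens.
First is struct sock, the network layer's representation of a socket. It is defined in include/net/sock.h; the members most worth understanding are:
// sock_common is the first embedded struct; its layout is shared with inet_timewait_sock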
struct sock_common __sk_common;
...
struct sk_buff_head sk_receive_queue; // receive queue; note that it is a queue of skbs anchored by an sk_buff_head
...
int sk_forward_alloc; // space allocated forward (pre-allocated)
...
int sk_rcvbuf; // receive buffer size (bytes)
...
atomic_t sk_wmem_alloc; // bytes committed to the transmit queue
atomic_t sk_omem_alloc; // 'o' stands for other
int sk_sndbuf; // send buffer size (bytes)
struct sk_buff_head sk_write_queue; // packet send queue
...
int sk_wmem_queued; // persistent queue size
gfp_t sk_allocation; // allocation mode
u32 sk_pacing_rate; // pacing rate in bytes per second (if supported by transport/packet scheduler)
...
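struct sk_buff_head sk_error_queue; // queue of defective packets
...
long sk_rcvtimeo; // receive timeout (upper bound before a receive is considered timed out)
long sk_sndtimeo; // send timeout (upper bound before a send is considered timed out)
...
struct sk_buff *sk_send_head; // points to the next skb that should be sent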
int sk_write_pending; // a write to stream socket waits to start
...
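Some of these members are visible from user space. As a minimal sketch, assuming a Linux host where getsockopt() with SO_RCVBUF and SO_SNDBUF reports the socket's sk_rcvbuf and sk_sndbuf values, the helper below creates a TCP socket and reads both limits. The function name query_buf_sizes is hypothetical and exists only for this illustration.

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Hypothetical helper (not a kernel or libc API): create a TCP socket and
 * read the buffer limits exposed through SO_RCVBUF and SO_SNDBUF.
 * Returns 0 on success, -1 on any failure. */
int query_buf_sizes(int *rcvbuf, int *sndbuf)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Assumed to reflect sk_rcvbuf of the underlying struct sock. */
    socklen_t len = sizeof(*rcvbuf);
    int rc = getsockopt(fd, SOL_SOCKET, SO_RCVBUF, rcvbuf, &len);

    if (rc == 0) {
        /* Assumed to reflect sk_sndbuf of the underlying struct sock. */
        len = sizeof(*sndbuf);
        rc = getsockopt(fd, SOL_SOCKET, SO_SNDBUF, sndbuf, &len);
    }

    close(fd);
    return rc == 0 ? 0 : -1;
}
```

Calling query_buf_sizes(&r, &s) from a small main() and printing both values shows the defaults on the local machine; the exact numbers depend on the system's sysctl settings.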
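Next is inet_sock. struct sock is the first member of inet_sock, which adds information such as the TTL, IP addresses, and ports. It is defined in include/net/inet_sock.h.
Then comes inet_connection_sock. inet_sock is its first member and, as the name suggests, inet_connection_sock adds connection-oriented state on top of inet_sock. Its main members are: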
struct inet_sock icsk_inet;
...
struct timer_list icsk_retransmit_timer; // resend (no ack)
__u32 icsk_rto; // retransmit timeout
...
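const struct tcp_congestion_ops *icsk_ca_ops; // congestion control algorithm hooks
__u8 icsk_ca_state; // congestion control state
Finally, the structure most specific to TCP: tcp_sock. In the same way, inet_connection_sock is the first member of tcp_sock. Its main members are: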
struct inet_connection_sock inet_conn;
...
/* RFC793 and RFC1122 are the best references for this */
u32 rcv_nxt; // what we want to receive next
u32 copied_seq; // Head of yet unread data
u32 snd_nxt; // next sequence we send
u32 snd_una; // first byte we want an ack for
...
u32 rcv_tstamp; // timestamp of last rcv ack (for keepalives)
u32 lsndtime; // timestamp of last snd pkt (for restart window)
...
u32 snd_wnd; // the window we expect to receive
u32 max_window; // maximal window ever seen from peer
u32 window_clamp; // maximal window to advertise
u32 rcv_ssthresh; // current window clamp
u16 advmss; // advertised MSS
/* RTT measurement */
u32 srtt; // smoothed RTT << 3
u32 mdev; // medium deviation
u32 mdev_max; // maximal mdev for the last rtt period
u32 rttvar; // smoothed mdev_max
u32 rtt_seq; // sequence number to update rttvar
u32 packets_out; // packets which are "in_flight"
u32 retrans_out; // retransmitted packets out
...
/* Slow start and congestion control */
u32 snd_ssthresh; // slow start size threshold
u32 snd_cwnd; // sending congestion window; not to be confused with snd_wnd
u32 snd_cwnd_cnt; // linear increase counter
u32 snd_cwnd_clamp; // upper bound on snd_cwnd
u32 snd_cwnd_used;
u32 snd_cwnd_stamp;
u32 prior_cwnd; // cwnd value at the start of Recovery
u32 prr_delivered; // # of newly delivered pkts in Recovery
u32 prr_out; // # of total sent pkts during Recovery
u32 rcv_wnd; // current receiver window: our own window when acting as receiver, advertised to the peer
struct sk_buff *highest_sack; // skb just after the highest skb with SACKed bit set
...
u32 retransmit_high;
u32 lost_retrans_low;
u32 prior_ssthresh; // ssthresh saved at Recovery start
u32 high_seq;
u32 retrans_stamp; // timestamp of the last retransmit
u32 undo_marker; // tracking retrans started here
int undo_retrans; // number of undoable retransmissions
u32 total_retrans; // total retransmits for entire connection
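The first-member nesting described above (sock inside inet_sock inside inet_connection_sock inside tcp_sock) is what makes a plain pointer cast between the levels valid. The following is a user-space sketch only: the structs are heavily simplified stand-ins with a couple of illustrative fields each, and demo_sk_alloc is a hypothetical helper, not the kernel's sk_alloc().

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified stand-ins for the kernel structures (illustrative fields only). */
struct sock                 { int sk_rcvbuf; int sk_sndbuf; };
struct inet_sock            { struct sock sk; uint16_t inet_sport; };
struct inet_connection_sock { struct inet_sock icsk_inet; uint32_t icsk_rto; };
struct tcp_sock             { struct inet_connection_sock inet_conn; uint32_t snd_cwnd; };

/* Because each outer struct starts with the inner one, a pointer to the
 * innermost struct sock is also a pointer to the enclosing tcp_sock. */
static inline struct tcp_sock *tcp_sk(struct sock *sk)
{
    return (struct tcp_sock *)sk;
}

/* Hypothetical helper: allocate room for a whole tcp_sock in one go and
 * hand it out as a struct sock pointer. Returns NULL on failure. */
struct sock *demo_sk_alloc(void)
{
    struct tcp_sock *tp = calloc(1, sizeof(*tp));
    return tp ? &tp->inet_conn.icsk_inet.sk : NULL;
}
```

Generic code can then work with the struct sock pointer while TCP-specific code calls tcp_sk(sk) to reach members such as snd_cwnd. The object must be released through the outermost pointer, for example free(tcp_sk(sk)).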
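Summary
As can be seen, these structures are defined by nesting one inside another, which to me looks a lot like the memory layout of class inheritance in C++. Below is a summary from the book that I found very good, so I quote it directly.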
There are two levels of socket abstraction. At the top is the BSD
socket layer defined as struct socket and then protocol-specific
socket defined as struct sock.
1. sock_register() is an interface to register BSD sockets for
different net families. For INET family, inet_family_ops of
type net_proto_family is registered.
2. net_families is a global array indexed on net family
number. Net family sockets are registered with this table.
3. inet_register_protosw() is an interface to register
protocol supported by the INET family. These protocols
are TCP, UDP and RAW.
4. inetsw_array is a global table that registers the
INET family protocols, object of type inet_protosw.
5. inet_stream_ops is set of operation for INET stream BSD
socket, and tcp_prot is a protocol-specific set of operations
TCP sockets.
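The registration scheme in points 1 and 2 can be sketched as a table of per-family create callbacks. This is a user-space mock under stated assumptions: every demo_ name, the table size, and the return codes are hypothetical and much simpler than the kernel's sock_register() and net_families.

```c
#include <stddef.h>

#define DEMO_NPROTO  4   /* hypothetical table size */
#define DEMO_PF_INET 2   /* hypothetical family number used as the index */

/* Stand-in for the BSD-level socket object. */
struct demo_socket {
    int family;
    const char *ops_name;   /* which operation set the family installed */
};

/* Stand-in for net_proto_family: a family number plus a create callback. */
struct demo_proto_family {
    int family;
    int (*create)(struct demo_socket *sock, int protocol);
};

/* Stand-in for the global net_families table, indexed by family number. */
static const struct demo_proto_family *demo_families[DEMO_NPROTO];

/* Register a family; -1 if out of range, -2 if the slot is already taken. */
int demo_sock_register(const struct demo_proto_family *pf)
{
    if (pf->family < 0 || pf->family >= DEMO_NPROTO)
        return -1;
    if (demo_families[pf->family] != NULL)
        return -2;
    demo_families[pf->family] = pf;
    return 0;
}

/* Look up the family in the table and delegate to its create callback. */
int demo_sock_create(int family, int protocol, struct demo_socket *sock)
{
    if (family < 0 || family >= DEMO_NPROTO || demo_families[family] == NULL)
        return -1;
    sock->family = family;
    return demo_families[family]->create(sock, protocol);
}

/* The INET family's create callback installs its own operation set. */
static int demo_inet_create(struct demo_socket *sock, int protocol)
{
    (void)protocol;   /* protocol matching is omitted in this sketch */
    sock->ops_name = "inet_stream_ops";
    return 0;
}

const struct demo_proto_family demo_inet_family_ops = {
    DEMO_PF_INET, demo_inet_create
};
```

After demo_sock_register(&demo_inet_family_ops), a call to demo_sock_create(DEMO_PF_INET, 0, &s) is routed to demo_inet_create through the table, which is the same shape as the pf->create() dispatch in the kernel path.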
Finally, using the Linux v3.10 source, here are the major routines inside sys_socket() that are of interest for now.
socket() [User Space] corresponds to sys_socket() [Kernel Space]
__________________________________________________________________
SYSCALL_DEFINE3(socket, int, family, int, type, int, protocol) // net/socket.c
  => sock_create() // in net/socket.c
    => sock = sock_alloc() // allocate the BSD socket
    => pf = rcu_dereference(net_families[family]); // look up the struct net_proto_family for this family
    => pf->create() = inet_create() // for PF_INET; defined in net/ipv4/af_inet.c
      => walk the inetsw list for the socket type; for SOCK_STREAM with protocol left unset (0), the match resolves to IPPROTO_TCP
      => sock->ops = answer->ops // for a TCP socket this sets sock->ops = &inet_stream_ops
      => sk = sk_alloc() // allocate the struct sock; the address of tcp_prot is passed in, so memory for a whole tcp_sock is allocated in one go
      => sock_init_data() // initialize the struct sock and link it to the struct socket
      => sk->sk_prot->init() = tcp_v4_init_sock()
        => tcp_init_sock() // initialize the tcp_sock
  => sock_map_fd() // bind sock with fd
    => fd = get_unused_fd_flags()
    => newfile = sock_alloc_file() // create the struct file and associate it with the socket