Nov 20, 2014

TCP ADI in Linux(8): TCP receive

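This chapter walks through how the kernel processes received TCP packets.
Broadly, the kernel handles a received TCP packet in one of two ways:

If the application is blocked in a read when an in-sequence packet is processed, the packet's data is copied straight into the user buffer.
Otherwise, an in-sequence packet is placed on the receive queue, and an out-of-order packet is placed on the out-of-order queue.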
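
To make "in-sequence" and "out-of-order" concrete, here is a minimal sketch in C. It is not kernel code: the comparison follows the idea behind the kernel's before()/after() helpers (a signed difference of 32-bit sequence numbers), while classify_segment() and the enum are hypothetical names made up for illustration. As a simplification, a segment that only partially overlaps rcv_nxt is treated as old data.

```c
#include <stdint.h>

/* Wraparound-safe comparison of 32-bit sequence numbers,
 * in the spirit of the kernel's before()/after(). */
static int seq_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

static int seq_after(uint32_t a, uint32_t b)
{
    return seq_before(b, a);
}

enum seg_class { SEG_IN_SEQUENCE, SEG_OUT_OF_ORDER, SEG_OLD };

/* Hypothetical helper: classify a segment by its starting sequence
 * number relative to rcv_nxt, the next byte the receiver expects. */
static enum seg_class classify_segment(uint32_t seq, uint32_t rcv_nxt)
{
    if (seq == rcv_nxt)
        return SEG_IN_SEQUENCE;   /* candidate for the receive queue */
    if (seq_after(seq, rcv_nxt))
        return SEG_OUT_OF_ORDER;  /* candidate for the out-of-order queue */
    return SEG_OLD;               /* starts before rcv_nxt (simplified) */
}
```

With rcv_nxt at 1000, a segment starting at 1000 is in sequence, one starting at 2000 is out of order, and one starting at 500 is old. The signed subtraction keeps the answer right when sequence numbers wrap around 2^32.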

8.1 Queuing mechanism
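Three queues are involved in processing a received TCP packet: the backlog queue, the prequeue, and the receive queue.
Note: prequeue already seems to be an outdated concept.

1. The receive queue holds only packets that have been fully processed, meaning all protocol headers have been parsed  
   and the data is just waiting to be copied to user space.

The first function in the TCP layer to handle a packet is tcp_v4_rcv(), so that is where we start tracing the flow.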

tcp_v4_rcv()  // net/ipv4/tcp_ipv4.c  
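    => sk = __inet_lookup_skb()  // find the sock that this skb belongs to  
    => if (!sock_owned_by_user(sk))  // the sock is not locked by a user process  
        => if (!tcp_prequeue(sk, skb))  // tcp_prequeue() returns true if the skb qualified for the prequeue and was added to it, false otherwise  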
            => ret = tcp_v4_do_rcv(sk, skb)  
                => tcp_rcv_established()  // receive function for the ESTABLISHED state  
                    => if (len == tcp_header_len)  // pure ACK, no payload
                        => tcp_ack(sk, skb, 0)  // dealing with incoming acks  
                            => flag |= tcp_clean_rtx_queue()  // see if we can take anything off of the retransmit queue  
                            => if (tcp_ack_is_dubious(sk, flag))  // check whether the ACK is dubious; see the code for the exact conditions  
                                => tcp_fastretrans_alert()  // enter fast retransmit  
                                    => tcp_cwnd_down()  // decrease cwnd each second ack; this is the key function that adjusts cwnd during fast retransmit  
                                    => tcp_xmit_retransmit_queue(sk)  // during retransmission, picks the appropriate data to retransmit  
                        => __kfree_skb(skb)  // free an sk_buff  
                        => tcp_data_snd_check(sk)  // if there is data waiting to be sent, send it to the peer  
                            => tcp_push_pending_frames(sk)   // send the pending data
                                => tcp_write_xmit()  // writes packets to the network; this part was analyzed in the previous chapter   
                            => tcp_check_space(sk)  // if memory has been freed, wake up whoever is waiting for send-buffer space  
                                /* when incoming ACK allows to free some skb from write_queue,  
                                 * we remember this event in flag SOCK_QUEUE_SHRUNK and wake up socket  
                                 * on the exit from tcp input handler.
                                 *  
                                 * PROBLEM: sndbuf expansion does not work well with largesend. 
                                 */
                                => tcp_new_space(sk)  
                                    => sk->sk_sndbuf = min(sndmem, sysctl_tcp_wmem[2])  // expand the sndbuf if possible  
                    => else // the packet carries data  
                        /* this segment is exactly the next data to be read, and it fits in the user buffer */
                        => if (tp->copied_seq == tp->rcv_nxt && len - tcp_header_len <= tp->ucopy.len)  
                            /* if we are called in process context and the sock is owned by the user */
                            => if (tp->ucopy.task == current && sock_owned_by_user(sk) && !copied_early)  
                                => tcp_copy_to_iovec()  // copy directly to user space  
                        => if (!eaten)  // the data was not copied directly to user space  
                            /* if truesize exceeds sk_forward_alloc, the pre-allocated quota is used up and the skb cannot go straight onto the receive queue; sk_forward_alloc usually has to be recomputed */
                            => if (skb->truesize > sk->sk_forward_alloc) goto step5  
                            => eaten = tcp_queue_rcv()  
                                => tcp_try_coalesce()  // try to merge skb into the prior one  
                                => if (!eaten) __skb_queue_tail()  // if the merge did not succeed, append the skb to the receive queue  
                        => tcp_event_data_recv(sk, skb)  // bookkeeping after the data has been received  
                            /* whenever a segment of at least 128 bytes arrives, call tcp_grow_window to raise rcv_ssthresh */
                            => if (skb->len >= 128) tcp_grow_window(sk, skb) 
                        => __tcp_ack_snd_check(sk, 0) // check if sending an ack is needed  

                    => tcp_validate_incoming(sk, skb, th, 1)  // standard slow path, [details ignored]  
                    => tcp_data_queue(sk, skb)  // process the data segment  
                        => if (TCP_SKB_CB(skb)->seq == tp->rcv_nxt)   // this is the segment we expect next  
                            => if (tcp_receive_window(tp) == 0) goto out_of_window;  // the receive window is zero, so drop the segment  
                            => if a read is in progress and this is exactly the data it wants, copy directly to user space  
                            => else eaten = tcp_queue_rcv()  // put the data on the receive queue  
                            => if (!skb_queue_empty(&tp->out_of_order_queue))  // the out-of-order queue is not empty  
                                => tcp_ofo_queue(sk)  // This one checks to see if we can put data from the out-of-order queue into the receive-queue  
                            => tcp_fast_path_check(sk)  // check whether we can go from the slow path back to the fast path  

                        => else tcp_data_queue_ofo(sk, skb)  // put the segment on the out-of-order queue  
                    => tcp_data_snd_check(sk)  // if there is data waiting to be sent, send it to the peer  
                    => tcp_ack_snd_check(sk)  // decide whether an ACK needs to be sent
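
The interplay between tcp_data_queue() and tcp_ofo_queue() in the tree above can be illustrated with a small user-space model. This is only a sketch under heavy assumptions: struct rx_model, rx_segment() and ofo_drain() are hypothetical names, the out-of-order queue is a tiny unsorted array instead of the kernel's skb queue, and overlapping segments are not handled at all.

```c
#include <stdint.h>
#include <stddef.h>

#define OFO_MAX 8

struct seg { uint32_t seq; uint32_t len; };

/* Hypothetical receiver model, not a kernel structure. */
struct rx_model {
    uint32_t rcv_nxt;         /* next sequence number expected */
    struct seg ofo[OFO_MAX];  /* stand-in for the out-of-order queue */
    size_t ofo_cnt;
    uint32_t queued_bytes;    /* bytes placed on the modeled receive queue */
};

/* Move every parked segment that has become contiguous with rcv_nxt
 * onto the receive queue (the role tcp_ofo_queue() plays). */
static void ofo_drain(struct rx_model *rx)
{
    int progress = 1;

    while (progress) {
        progress = 0;
        for (size_t i = 0; i < rx->ofo_cnt; i++) {
            if (rx->ofo[i].seq == rx->rcv_nxt) {
                rx->rcv_nxt += rx->ofo[i].len;
                rx->queued_bytes += rx->ofo[i].len;
                rx->ofo[i] = rx->ofo[--rx->ofo_cnt]; /* remove by swap */
                progress = 1;
                break;
            }
        }
    }
}

/* Returns 1 if queued in order, 0 if parked out of order, -1 if dropped. */
static int rx_segment(struct rx_model *rx, uint32_t seq, uint32_t len)
{
    if (seq == rx->rcv_nxt) {
        rx->rcv_nxt += len;
        rx->queued_bytes += len;
        if (rx->ofo_cnt)
            ofo_drain(rx);
        return 1;
    }
    if ((int32_t)(seq - rx->rcv_nxt) > 0 && rx->ofo_cnt < OFO_MAX) {
        rx->ofo[rx->ofo_cnt++] = (struct seg){ seq, len };
        return 0;
    }
    return -1; /* old data, or no room left in this toy queue */
}
```

Feeding segments 300, 200 and then 100 (each 100 bytes) into a receiver that expects 100 parks the first two and, once 100 arrives, drains both so that rcv_nxt jumps to 400.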

The role of prequeue

First, two related parameters.

sysctl_tcp_low_latency == /proc/sys/net/ipv4/tcp_low_latency
The official explanation from man 7 tcp:

tcp_low_latency (Boolean; default: disabled)

 If enabled, the TCP stack makes decisions that prefer lower
 latency as opposed to higher throughput. If this option is
 disabled, then higher throughput is preferred. An example of
 an application where this default should be changed would be a
 Beowulf compute cluster.

tcp_sock->ucopy.task
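ucopy.task != NULL means that a process is waiting in process context for data to arrive on this sock.

The following line is the key check in tcp_prequeue(): it decides whether the skb goes on the prequeue.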

if (sysctl_tcp_low_latency || !tp->ucopy.task) return false;

Read literally, this line says:
if low latency is the priority, the prequeue is not used;
if no process is currently waiting to read data, the prequeue is not used.
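
Putting this check together with the sock_owned_by_user() test from the tcp_v4_rcv() tree gives a three-way choice of where an incoming skb goes. The sketch below is a simplified model of that decision only. struct rx_state, prequeue_eligible() and choose_rx_path() are hypothetical names, and the backlog branch is my assumption from reading tcp_v4_rcv() (the case where the sock is owned by a user process), which the call tree above does not show.

```c
#include <stdbool.h>

enum rx_path { RX_PROCESS_NOW, RX_PREQUEUE, RX_BACKLOG };

/* Hypothetical snapshot of the state the decision depends on. */
struct rx_state {
    bool owned_by_user;   /* stand-in for sock_owned_by_user(sk) */
    bool reader_waiting;  /* stand-in for tp->ucopy.task != NULL */
    bool low_latency;     /* stand-in for sysctl_tcp_low_latency */
};

/* Mirrors: if (sysctl_tcp_low_latency || !tp->ucopy.task) return false; */
static bool prequeue_eligible(const struct rx_state *s)
{
    if (s->low_latency || !s->reader_waiting)
        return false;
    return true;
}

static enum rx_path choose_rx_path(const struct rx_state *s)
{
    if (s->owned_by_user)
        return RX_BACKLOG;      /* assumed: deferred until the lock is released */
    if (prequeue_eligible(s))
        return RX_PREQUEUE;     /* processed later in process context */
    return RX_PROCESS_NOW;      /* tcp_v4_do_rcv() runs right in the softirq */
}
```

In this model the prequeue is chosen only when the sock is unlocked, a reader is blocked, and tcp_low_latency is off; everything else is either handled immediately or parked on the backlog.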

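Based on my current understanding, prequeue serves the following purposes:

1. Going through the prequeue favors throughput.
    The softirq handles one packet at a time. If it bypassed the prequeue and called tcp_v4_do_rcv() to put the packet on the receive queue,  
    that would be a fair amount of work (tcp_rcv_established() is a complex function).  
    To let the softirq finish sooner, it returns as soon as the packet is on the prequeue, which frees it to take in more packets.  
    The actual processing of the prequeue is handed to process context (that is, inside the tcp_recvmsg call).   
    Note: skbs on the prequeue still end up being processed by tcp_v4_do_rcv(), so using the prequeue only changes when an skb is processed.  
2. Going through the prequeue wakes up a blocked read request sooner.  
    This one is straightforward: when an skb enters the prequeue, the waiting process is usually woken up right away.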

fastpath VS slowpath
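I only came across this concept while reading the code, and I do not fully understand it yet. From the comments and the code I have a first idea of the rules that decide whether the fast path is taken, but not of the exact reason the two paths are kept separate. I will add more once I understand it.
My current guess is that a packet meeting the fast path conditions skips a lot of checks. For now, here is the comment that precedes tcp_rcv_established: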

The fast path is disabled when:  
- A zero window was announced from us - zero window probing  
  is only handled properly in the slow path.  
- Out of order segments arrived.  
- Urgent data is expected.  
- There is no buffer space left.  
- Unexpected TCP flags/window value/header lengths are received  
  (detected by checking the TCP header against pred_flags)  
- Data is sent in both directions. Fast path only supports pure senders  
  or pure receivers (this means either the sequence number or the ack  
  value must stay constant)  
- Unexpected TCP option.  

Fast processing is turned on in tcp_data_queue when everything is OK.
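
The "checking the TCP header against pred_flags" item can be sketched as follows. This is a user-space model in host byte order, based on my reading of the header-prediction check. The constants mirror the positions of the ACK, PSH and reserved bits in the fourth 32-bit word of the TCP header, while struct hp_state, header_word(), fast_path_on() and takes_fast_path() are hypothetical names. The real kernel code works on network-byte-order words and has further conditions.

```c
#include <stdint.h>
#include <stdbool.h>

/* Bit positions within the 4th 32-bit TCP header word, host byte order:
 * data offset (4 bits) | reserved (4) | flags (8) | window (16). */
#define FLAG_ACK      0x00100000u
#define FLAG_PSH      0x00080000u
#define RESERVED_BITS 0x0F000000u
#define HP_BITS       (~(RESERVED_BITS | FLAG_PSH)) /* PSH is ignored */

/* Hypothetical per-connection state for the model. */
struct hp_state {
    uint32_t pred_flags; /* predicted header word; 0 disables the fast path */
    uint32_t rcv_nxt;    /* next sequence number expected */
    uint32_t snd_nxt;    /* next sequence number we will send */
};

/* Build the header word from header length in bytes, flags and window.
 * Shifting the byte length by 26 equals shifting the 32-bit word count by 28. */
static uint32_t header_word(uint32_t hdr_len_bytes, uint32_t flags, uint16_t window)
{
    return (hdr_len_bytes << 26) | flags | window;
}

/* Enable the fast path: predict a plain ACK with the current window. */
static void fast_path_on(struct hp_state *s, uint32_t hdr_len_bytes, uint16_t snd_wnd)
{
    s->pred_flags = header_word(hdr_len_bytes, FLAG_ACK, snd_wnd);
}

static bool takes_fast_path(const struct hp_state *s, uint32_t word,
                            uint32_t seq, uint32_t ack_seq)
{
    return (word & HP_BITS) == s->pred_flags      /* header as predicted */
        && seq == s->rcv_nxt                      /* in sequence */
        && (int32_t)(ack_seq - s->snd_nxt) <= 0;  /* ACK not beyond snd_nxt */
}
```

A single mismatch (a FIN flag, a changed window, a different header length because of an unexpected option, or an out-of-order sequence number) makes the comparison fail and sends the packet down the slow path, which matches several items in the list above. Setting pred_flags to 0 turns the fast path off, since no real header word has a zero data offset.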

Processing of Queues

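The TCP receive queues are mostly processed in tcp_recvmsg(), so we start from that function.
This CSDN blog post is a good reference, but always treat the code as the final word.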

tcp_recvmsg()  // this routine copies from a sock struct into the user buffer  
    => lock_sock(sk)  // become a socket user  
    => skb_queue_walk()  // get a skb  

    => if there is an skb to copy from  
        => err = skb_copy_datagram_iovec()  // copy data into iovec if found_ok_skb  
        /* This function should be called every time data is copied to user space.  
         * It calculates the appropriate TCP receive buffer space.  
         */  
        => tcp_rcv_space_adjust(sk)  
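            => the adjustment runs at most once per RTT  
            => space = 2 * (tp->copied_seq - tp->rcvq_space.seq)  // twice the amount of data received and copied to user space within one RTT  
            ...  
            => sk->sk_rcvbuf = space  // resize the receive buffer  
        => sk_eat_skb(sk, skb, copied_early)  // once all data in an skb has been copied, free the skb  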

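    => if there is no skb to copy from  
        /* if MSG_WAITALL is set, target == len; otherwise target == 1 (with the default SO_RCVLOWAT) */
        => if (copied >= target && !sk->sk_backlog.tail) break;  // if target bytes have been read and the backlog queue is empty, return right away  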

        => tcp_cleanup_rbuf(sk, copied)  
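            /* Note the difference between this function and sk_eat_skb(): the latter frees an skb and its memory, while this one mainly sends a window-update ACK, because the user process has consumed data from the receive buffer */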
            => if (inet_csk_ack_scheduled(sk))  // if an ACK has been scheduled via inet_csk_schedule_ack()  
                => if delayed ACK was blocked by socket lock, send an ACK  
                => if we have not ACKed data of length > 1mss, send an ACK  
                => if we have emptied the receive buffer, and there is data flow only in one direction, send an ACK  
            => rcv_window_now = tcp_receive_window(tp)  // compute the receive window currently on offer to the peer
                => win = tp->rcv_wup + tp->rcv_wnd - tp->rcv_nxt  // left edge + current receive window - amount already used  
            => new_window = __tcp_select_window(sk)  // compute the new receive window, roughly half of the free space in rcvbuf  
            => if (new_window && new_window >= 2 * rcv_window_now)  send an ACK  

            => if (time_to_ack) tcp_send_ack(sk)   // if any of the checks above called for an ACK, send one  
        /* if prequeue is not empty, we have to process it before releasing socket  
         * the queues are processed in the following priority order:  
         * receive queue: highest
         * prequeue: next
         * backlog queue: lowest
         */  
        => if prequeue is not empty, goto do_prequeue
            => tcp_prequeue_process(sk)  
                => sk_backlog_rcv(sk, skb)  == tcp_v4_do_rcv()  

        => if (copied >= target)   // the next two steps exist mainly to process the backlog queue  
            => release_sock(sk)  
                => if (sk->sk_backlog.tail)   
                    => __release_sock(sk)  
                        => sk_backlog_rcv(sk, skb)  == tcp_v4_do_rcv()  
            => lock_sock(sk) 
        => else  
            => sk_wait_data(sk, &timeo)  // sleep until new data arrives
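
The window arithmetic used by tcp_cleanup_rbuf() in the tree above is easy to get wrong, so here is a small worked sketch. It models only the two lines quoted there: the win formula and the new_window >= 2 * rcv_window_now test. struct rwin_state, receive_window() and window_update_due() are hypothetical names. Clamping a negative window to zero is my understanding of tcp_receive_window(), and the real function has additional guards that are left out here.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical subset of the receive-window state. */
struct rwin_state {
    uint32_t rcv_wup; /* rcv_nxt at the time the last window update was sent */
    uint32_t rcv_wnd; /* window advertised in that update */
    uint32_t rcv_nxt; /* next sequence number expected now */
};

/* Window still on offer to the peer: what was advertised, minus what
 * has arrived since. A negative result is treated as zero. */
static uint32_t receive_window(const struct rwin_state *s)
{
    int32_t win = (int32_t)(s->rcv_wup + s->rcv_wnd - s->rcv_nxt);

    return win < 0 ? 0 : (uint32_t)win;
}

/* An ACK carrying a window update is worthwhile once the window we
 * could advertise is non-zero and at least twice the one on offer. */
static bool window_update_due(const struct rwin_state *s, uint32_t new_window)
{
    uint32_t now = receive_window(s);

    return new_window != 0 && new_window >= 2 * now;
}
```

For example, with rcv_wup = 1000, rcv_wnd = 8000 and rcv_nxt = 7000, the peer may still send 2000 bytes. If reading from the socket frees enough buffer that 4000 bytes or more could be advertised, the model says an ACK should go out so the sender learns about the larger window.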