How does the IP receiving mechanism assemble fragmented datagrams?
First of all, this writing is based on Linux kernel 2.6.21.5.
When I looked into ip_frag_reasm(), which is commented with /* Build a new IP datagram from all its fragments. */ in net/ipv4/ip_fragment.c, I could not find the code that I was looking for: the code that constructs a new big SKB and copies all data fragments from the received SKBs into it. Of course, had I found it, it would have meant that ip_frag_reasm() was inefficient (i.e., dumb). Instead, ip_frag_reasm() only prepares the chain of SKBs to be processed by skb_copy_datagram_iovec(), as I have described elsewhere.
The preparation is detailed in the following code that I have annotated.
static struct sk_buff *ip_frag_reasm(struct ipq *qp, struct net_device *dev)
{
	struct iphdr *iph;
	struct sk_buff *fp, *head = qp->fragments;
	int len;
	int ihlen;
	ipq_kill(qp);
	BUG_TRAP(head != NULL);
	BUG_TRAP(FRAG_CB(head)->offset == 0);
	/* Allocate a new buffer for the datagram. */
	ihlen = head->nh.iph->ihl*4;
	len = ihlen + qp->len;
	if(len > 65535)
		goto out_oversize;
	/* Head of list must not be cloned. */
	if (skb_cloned(head) && pskb_expand_head(head, 0, 0, GFP_ATOMIC))
		goto out_nomem;
[EUS] The head of list must not be cloned because it is going to be modified in the following preparation code.
	/* If the first fragment is fragmented itself, we split
	 * it to two chunks: the first with data and paged part
	 * and the second, holding only fragments. */
[EUS] This is required because the head's skb_shinfo(head)->frag_list is about to be repointed at the next SKB in the list, which is what skb_copy_datagram_iovec() needs in order to perform its special "reassembly" task by recursion. Any frag_list that the head already carries must therefore first be moved out into a separate SKB (the clone below).
	if (skb_shinfo(head)->frag_list) {
		struct sk_buff *clone;
		int i, plen = 0;
[EUS] +------+                       +------+
[EUS] | head |--next---------------->| skb1 |--next--> ...
[EUS] +------+                       +------+
[EUS]  | | |                          | | |
[EUS]  | | |         +-----------+    | | |         +-----------+
[EUS]  | | +--data-->| L2 header |    | | +--data-->| L2 header |
[EUS]  | |           | L3 header |    | |           | L3 header |
[EUS]  | |           | L4 header |    | |           | L4 header |
[EUS]  | |           | DataA     |    | |           | DataB     |
[EUS]  | |           +-----------+    | |           +-----------+
[EUS]  | |                            | |
[EUS]  | |              +------+      | |              +------+
[EUS]  | +--frag_list-->| FL_A |      | +--frag_list-->| FL_B |
[EUS]  |                +------+      |                +------+
[EUS]  |                              |
[EUS]  |            +------+          |            +------+
[EUS]  +--frags[]-->| F_A  |          +--frags[]-->| F_B  |
[EUS]               +------+                       +------+
		if ((clone = alloc_skb(0, GFP_ATOMIC)) == NULL)
			goto out_nomem;
		clone->next = head->next;
[EUS] +-------+
[EUS] | clone |--next------------------+
[EUS] +-------+                        |
[EUS]                                  V
[EUS] +------+                       +------+
[EUS] | head |--next---------------->| skb1 |--next--> ...
[EUS] +------+                       +------+
[EUS]  | | |                          | | |
[EUS]  | | |         +-----------+    | | |         +-----------+
[EUS]  | | +--data-->| L2 header |    | | +--data-->| L2 header |
[EUS]  | |           | L3 header |    | |           | L3 header |
[EUS]  | |           | L4 header |    | |           | L4 header |
[EUS]  | |           | DataA     |    | |           | DataB     |
[EUS]  | |           +-----------+    | |           +-----------+
[EUS]  | |                            | |
[EUS]  | |              +------+      | |              +------+
[EUS]  | +--frag_list-->| FL_A |      | +--frag_list-->| FL_B |
[EUS]  |                +------+      |                +------+
[EUS]  |                              |
[EUS]  |            +------+          |            +------+
[EUS]  +--frags[]-->| F_A  |          +--frags[]-->| F_B  |
[EUS]               +------+                       +------+
		head->next = clone;
[EUS] +------+         +-------+         +------+
[EUS] | head |--next-->| clone |--next-->| skb1 |--next--> ...
[EUS] +------+         +-------+         +------+
[EUS]  | | |                              | | |
[EUS]  | | |         +-----------+        | | |         +-----------+
[EUS]  | | +--data-->| L2 header |        | | +--data-->| L2 header |
[EUS]  | |           | L3 header |        | |           | L3 header |
[EUS]  | |           | L4 header |        | |           | L4 header |
[EUS]  | |           | DataA     |        | |           | DataB     |
[EUS]  | |           +-----------+        | |           +-----------+
[EUS]  | |                                | |
[EUS]  | |              +------+          | |              +------+
[EUS]  | +--frag_list-->| FL_A |          | +--frag_list-->| FL_B |
[EUS]  |                +------+          |                +------+
[EUS]  |                                  |
[EUS]  |            +------+              |            +------+
[EUS]  +--frags[]-->| F_A  |              +--frags[]-->| F_B  |
[EUS]               +------+                           +------+
		skb_shinfo(clone)->frag_list = skb_shinfo(head)->frag_list;
		skb_shinfo(head)->frag_list = NULL;
[EUS] +------+         +-------+         +------+
[EUS] | head |--next-->| clone |--next-->| skb1 |--next--> ...
[EUS] +------+         +-------+         +------+
[EUS]  | |                 |              | | |
[EUS]  | |             frag_list          | | |
[EUS]  | |                 |              | | |
[EUS]  | |                 +------+       | | |
[EUS]  | |         +-----------+  |       | | |         +-----------+
[EUS]  | +--data-->| L2 header |  |       | | +--data-->| L2 header |
[EUS]  |           | L3 header |  |       | |           | L3 header |
[EUS]  |           | L4 header |  |       | |           | L4 header |
[EUS]  |           | DataA     |  |       | |           | DataB     |
[EUS]  |           +-----------+  |       | |           +-----------+
[EUS]  |                          |       | |
[EUS]  |             +------+     |       | |              +------+
[EUS]  |             | FL_A |<----+       | +--frag_list-->| FL_B |
[EUS]  |             +------+             |                +------+
[EUS]  |                                  |
[EUS]  |            +------+              |            +------+
[EUS]  +--frags[]-->| F_A  |              +--frags[]-->| F_B  |
[EUS]               +------+                           +------+
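		for (i=0; i<skb_shinfo(head)->nr_frags; i++)
			plen += skb_shinfo(head)->frags[i].size;
		clone->len = clone->data_len = head->data_len - plen;
		head->data_len -= clone->len;
		head->len -= clone->len;
[EUS] That works because head->data_len counts every byte that lives outside the head's linear data area, i.e., head->data_len = (size of skb_shinfo(head)->frags[]) + (size of skb_shinfo(head)->frag_list), while head->len additionally counts the linear area. Subtracting plen, the size of frags[], therefore leaves exactly the size of the frag_list that has just been moved to the clone.
		clone->csum = 0;
		clone->ip_summed = head->ip_summed;
		atomic_add(clone->truesize, &ip_frag_mem);
	}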
	skb_shinfo(head)->frag_list = head->next;
[EUS]     +--frag_list-----+
[EUS]     |                |
[EUS]     |                V
[EUS] +------+         +-------+         +------+
[EUS] | head |--next-->| clone |--next-->| skb1 |--next--> ...
[EUS] +------+         +-------+         +------+
[EUS]  | |                 |              | | |
[EUS]  | |             frag_list          | | |
[EUS]  | |                 |              | | |
[EUS]  | |                 +------+       | | |
[EUS]  | |         +-----------+  |       | | |         +-----------+
[EUS]  | +--data-->| L2 header |  |       | | +--data-->| L2 header |
[EUS]  |           | L3 header |  |       | |           | L3 header |
[EUS]  |           | L4 header |  |       | |           | L4 header |
[EUS]  |           | DataA     |  |       | |           | DataB     |
[EUS]  |           +-----------+  |       | |           +-----------+
[EUS]  |                          |       | |
[EUS]  |             +------+     |       | |              +------+
[EUS]  |             | FL_A |<----+       | +--frag_list-->| FL_B |
[EUS]  |             +------+             |                +------+
[EUS]  |                                  |
[EUS]  |            +------+              |            +------+
[EUS]  +--frags[]-->| F_A  |              +--frags[]-->| F_B  |
[EUS]               +------+                           +------+
	skb_push(head, head->data - head->nh.raw);
[EUS] Set head->data to point to head->nh.raw.
	atomic_sub(head->truesize, &ip_frag_mem);
	for (fp=head->next; fp; fp = fp->next) {
		head->data_len += fp->len;
		head->len += fp->len;
[EUS] This gives the illusion that head is one big SKB containing all of the data fragments of the SKBs in the chain. So, instead of constructing a new big SKB and copying all data fragments from the received SKBs into it, this clever trick is used.
		if (head->ip_summed != fp->ip_summed)
			head->ip_summed = CHECKSUM_NONE;
		else if (head->ip_summed == CHECKSUM_COMPLETE)
			head->csum = csum_add(head->csum, fp->csum);
		head->truesize += fp->truesize;
		atomic_sub(fp->truesize, &ip_frag_mem);
	}
	head->next = NULL;
[EUS]     +--frag_list-----+
[EUS]     |                |
[EUS]     |                V
[EUS] +------+         +-------+         +------+
[EUS] | head |         | clone |--next-->| skb1 |--next--> ...
[EUS] +------+         +-------+         +------+
[EUS]  | |                 |              | | |
[EUS]  | |             frag_list          | | |
[EUS]  | |                 |              | | |
[EUS]  | |                 +------+       | | |
[EUS]  | |         +-----------+  |       | | |         +-----------+
[EUS]  | +--data-->| L2 header |  |       | | +--data-->| L2 header |
[EUS]  |           | L3 header |  |       | |           | L3 header |
[EUS]  |           | L4 header |  |       | |           | L4 header |
[EUS]  |           | DataA     |  |       | |           | DataB     |
[EUS]  |           +-----------+  |       | |           +-----------+
[EUS]  |                          |       | |
[EUS]  |             +------+     |       | |              +------+
[EUS]  |             | FL_A |<----+       | +--frag_list-->| FL_B |
[EUS]  |             +------+             |                +------+
[EUS]  |                                  |
[EUS]  |            +------+              |            +------+
[EUS]  +--frags[]-->| F_A  |              +--frags[]-->| F_B  |
[EUS]               +------+                           +------+
[EUS] At this point, the structural requirement for skb_copy_datagram_iovec() to perform its special "reassembly" task by recursion has been completed.
	head->dev = dev;
	skb_set_timestamp(head, &qp->stamp);
	iph = head->nh.iph;
	iph->frag_off = 0;
	iph->tot_len = htons(len);
	IP_INC_STATS_BH(IPSTATS_MIB_REASMOKS);
	qp->fragments = NULL;
	return head;
out_nomem:
	LIMIT_NETDEBUG(KERN_ERR "IP: queue_glue: no memory for gluing "
			      "queue %p\n", qp);
	goto out_fail;
out_oversize:
	if (net_ratelimit())
		printk(KERN_INFO
			"Oversized IP packet from %d.%d.%d.%d.\n",
			NIPQUAD(qp->saddr));
out_fail:
	IP_INC_STATS_BH(IPSTATS_MIB_REASMFAILS);
	return NULL;
}
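To make the trick more concrete, here is a minimal user-space sketch; it is not kernel code. struct toy_skb, toy_glue(), and toy_copy() are names that I made up for illustration. toy_glue() imitates the chaining done at the end of ip_frag_reasm(), and toy_copy() imitates only the recursive walk over frag_list that a function like skb_copy_datagram_iovec() performs. Paged data (frags[]), checksums, and memory accounting are deliberately left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy stand-in for struct sk_buff: only the fields this sketch needs. */
struct toy_skb {
	const char *data;          /* linear data area */
	size_t len;                /* total bytes: linear + everything below */
	size_t data_len;           /* bytes that are NOT in the linear area */
	struct toy_skb *next;      /* next SKB in the same chain */
	struct toy_skb *frag_list; /* chain hanging below this SKB */
};

/* Imitate the gluing step: hang the rest of the chain under head. */
static void toy_glue(struct toy_skb *head)
{
	struct toy_skb *fp;

	head->frag_list = head->next;
	for (fp = head->next; fp; fp = fp->next) {
		head->data_len += fp->len;
		head->len += fp->len;
	}
	head->next = NULL;
}

/* Copy every byte reachable from skb into out, recursing into frag_list.
 * Returns the number of bytes copied; out must be large enough. */
static size_t toy_copy(const struct toy_skb *skb, char *out)
{
	const struct toy_skb *fp;
	size_t linear, copied;

	assert(skb->len >= skb->data_len);
	linear = skb->len - skb->data_len;
	memcpy(out, skb->data, linear);
	copied = linear;
	for (fp = skb->frag_list; fp; fp = fp->next)
		copied += toy_copy(fp, out + copied);
	return copied;
}
```

For example, if three toy SKBs holding "hello ", "world", and "!!" are linked through next and toy_glue() is called on the first one, its len becomes 13 and its data_len becomes 7, and toy_copy() on it yields "hello world!!" although no big buffer was ever built during the gluing.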
Why is there a fragmented skb when reassembling frags?
Hi,
Could you please tell me why, in the ip_frag_reasm function, there is a note as follows:
/* If the first fragment is fragmented itself, we split
* it to two chunks: the first with data and paged part
* and the second, holding only fragments. */
----------------------------------------------------
As we know, ip_frag_reasm is used to reassemble all frags in an ipq instance, and this must happen in the receiving stage, not in the sending phase. So why do we need to cope with the possibility that the first fragment (i.e., the head) is a fragmented skb? From the receiving point of view, each fragment is an individual IP packet, and all received packets are organized by the corresponding ipq instance, so it seems there would be no frag_list or frags[] at all. Well, I am confused!
We know that in ip_append_data and ip_push_pending_frames we will have a fragmented skb in some cases, but this happens only on the sending side, not on the receiving side!
For the Loopback Device
Hi Ho!
This question has been addressed in my blog post.
Specifically, I explained it as follows:
The use of a loopback device may be the answer to this question.
AFAIK, sending through a loopback device transfers the SKB intact back into the receiving part. IOW, if the sending part has fragmented the SKB in a way that uses `frags[]', the loopback device simply hands it over to the receiving part without any change whatsoever.
Besides, the "fragment" in that comment does not refer to an IP fragment.
Instead, it may refer to a memory page that contains the data of the IP fragment itself. For further details, you can read my previous post (http://kerneltrap.org/node/16224) or its source (http://vger.kernel.org/~davem/skb_data.html).
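To illustrate where the bytes of one SKB can live, here is a small sketch of the length bookkeeping only. struct toy_lens, toy_headlen(), and toy_split() are hypothetical names of mine, not kernel API; frag_size[] stands in for skb_shinfo(skb)->frags[i].size, and toy_split() mirrors the arithmetic that ip_frag_reasm() uses when it moves the head's frag_list into the clone.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FRAGS 4

/* Toy model of the length fields only; no real data is stored. */
struct toy_lens {
	size_t len;                      /* linear + paged + frag_list bytes */
	size_t data_len;                 /* paged + frag_list bytes */
	size_t nr_frags;                 /* used entries in frag_size[] */
	size_t frag_size[TOY_MAX_FRAGS]; /* stands in for frags[i].size */
};

/* Bytes in the linear data area: len minus everything held elsewhere. */
static size_t toy_headlen(const struct toy_lens *s)
{
	assert(s->len >= s->data_len);
	return s->len - s->data_len;
}

/* Move the frag_list byte count from head to clone, as the split does. */
static void toy_split(struct toy_lens *head, struct toy_lens *clone)
{
	size_t i, plen = 0;

	for (i = 0; i < head->nr_frags; i++)
		plen += head->frag_size[i];
	clone->len = clone->data_len = head->data_len - plen;
	clone->nr_frags = 0;
	head->data_len -= clone->len;
	head->len -= clone->len;
}
```

With a head that has 100 bytes in its linear area, two pages of 100 and 200 bytes in frags[], and 500 bytes in frag_list (len = 900, data_len = 800), toy_split() leaves the head with len = 400 and data_len = 300, and gives the clone len = data_len = 500. The clone has no linear data at all, which matches the alloc_skb(0, GFP_ATOMIC) call in the kernel code.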
Best regards,
Eus (FSF member #4445)