2.6.25: Weird IPv4 stack behaviour, IPv6 is fine

Previous thread: adding tcpdump/OAM support to usb ATM devices by Jeremy Jackson on Sunday, April 27, 2008 - 2:31 pm. (6 messages)

Next thread: [GIT]: Networking fixes by David Miller on Sunday, April 27, 2008 - 4:56 pm. (1 message)
From: Russell King
Date: Sunday, April 27, 2008 - 4:14 pm

Hi,

I've upgraded lists.arm.linux.org.uk to 2.6.25, and I'm now seeing some
very weird networking behaviour from the machine which seems to only
affect IPv4 - including ICMP and NFS(tcp).

A tcpdump capture is available (all 4MB worth):
 http://www.home.arm.linux.org.uk/~rmk/ping.capture

Machines involved:
  dyn-67 - x86 box 2.6.20-1.2320.fc5
  (192.168.0.67 / 2002:4e20:1eda:1:201:80ff:fe4b:1778)

  n2100 - ARM box 2.6.24
  (78.32.30.221, has ipv6 as well)

  lists - ARM box 2.6.25
  (78.32.30.220 / 2002:4e20:1eda:1:201:3dff:fe00:0156)

The dump shows three pings with 8192 bytes of data each (8200-byte ICMP
messages): one IPv4 on n2100 against lists, one IPv4 on dyn-67 against
lists, and one IPv6 on dyn-67 against lists.
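
Pings of this size are larger than a typical Ethernet MTU, so each echo
request reaches lists as several IPv4 fragments that must be reassembled
before ICMP sees it. A rough sketch of the arithmetic follows. The MTU of
1500 is an assumption (the thread does not state it), IP options are assumed
absent, and icmp_echo_fragments is a name made up for this example.

```c
#include <assert.h>

#define IPV4_HDR_LEN 20	/* assumes no IP options */
#define ICMP_HDR_LEN 8

/*
 * Number of IPv4 fragments needed for an ICMP echo carrying data_len
 * bytes of data on a link with the given MTU.
 */
static unsigned int icmp_echo_fragments(unsigned int data_len, unsigned int mtu)
{
	assert(mtu >= IPV4_HDR_LEN + 8);

	/* IP payload: ICMP header plus ping data. */
	unsigned int total = data_len + ICMP_HDR_LEN;
	/* Fragment payloads (except the last) must be a multiple of 8 bytes. */
	unsigned int per_frag = (mtu - IPV4_HDR_LEN) & ~7u;

	return (total + per_frag - 1) / per_frag;
}
```

Under those assumptions icmp_echo_fragments(8192, 1500) gives 6, whereas a
default-sized ping (56 bytes of data) fits in a single packet.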

The tcpdump was running on lists itself.

Everything looks fine until around packet 1688, where n2100 sends an
echo request to lists, which doesn't get a reply.  300ms later, dyn-67
sends an echo request to lists, which also coincidentally doesn't get
a reply.  Note, however, how the IPv6 pings continue.

The stats for the pings upon their termination are:

rmk@dyn-67:[~]:<1005> ping6 -s 8192 lists
PING lists(lists.arm.linux.org.uk) 8192 data bytes
--- lists ping statistics ---
101 packets transmitted, 101 received, 0% packet loss, time 99990ms
rtt min/avg/max/mdev = 4.132/4.488/26.585/2.374 ms, pipe 2

rmk@dyn-67:[~]:<1051> ping -s 8192 lists
PING lists.arm.linux.org.uk (78.32.30.220) 8192(8220) bytes of data.
--- lists.arm.linux.org.uk ping statistics ---
101 packets transmitted, 54 received, 46% packet loss, time 99993ms
rtt min/avg/max/mdev = 4.139/6.027/35.274/6.405 ms

root@n2100:~# ping -s 8192 lists
PING lists.arm.linux.org.uk (78.32.30.220) 8192(8220) bytes of data.
--- lists.arm.linux.org.uk ping statistics ---
101 packets transmitted, 55 received, 45% packet loss, time 100020ms
rtt min/avg/max/mdev = 4.404/4.610/13.235/1.175 ms

Lastly, in /proc/net/snmp on lists, I find:

Ip: Forwarding DefaultTTL InReceives InHdrErrors InAddrErrors ForwDatagrams InUnknownProtos ...
--

From: Russell King
Date: Sunday, April 27, 2008 - 4:17 pm

Forgot the config file for the problem kernel...

  http://www.home.arm.linux.org.uk/~rmk/bast-config-2.6.25

-- 
Russell King
--

From: David Miller
Date: Sunday, April 27, 2008 - 4:26 pm

From: Russell King <rmk@arm.linux.org.uk>

The ReasmTimeout and ReasmFails look interesting.  Maybe it was the
namespace bits?

Pavel, could you take a quick look?

Thanks.
--

From: Pavel Emelyanov
Date: Monday, April 28, 2008 - 12:02 am

Can you please also show the /proc/net/netstat contents - I'm interested
in IpExt statistics.

Thanks,
Pavel
--

From: Russell King
Date: Monday, April 28, 2008 - 2:31 am

IpExt: InNoRoutes InTruncatedPkts InMcastPkts OutMcastPkts InBcastPkts OutBcastPkts
IpExt: 0 0 0 0 0 0

I suspect that you were expecting these to be non-zero.

I've just added some debug printks into ip_input.c, and I don't think the
IP stack itself is at fault (if it were, you'd be flooded with reports.)

int ip_local_deliver(struct sk_buff *skb)
{
...
        /* Debug: 0xc0a80043 is 192.168.0.67 (dyn-67). */
        if (ip_hdr(skb)->saddr == htonl(0xc0a80043) &&
            ip_hdr(skb)->protocol == IPPROTO_ICMP)
                printk("ping 2\n");

        return NF_HOOK(PF_INET, NF_INET_LOCAL_IN, skb, skb->dev, NULL,
                       ip_local_deliver_finish);
}

static int ip_local_deliver_finish(struct sk_buff *skb)
{
        __skb_pull(skb, ip_hdrlen(skb));

        /* Point into the IP datagram, just past the header. */
        skb_reset_transport_header(skb);

        /* Debug: same check, now past the NF_INET_LOCAL_IN hooks. */
        if (ip_hdr(skb)->saddr == htonl(0xc0a80043) &&
            ip_hdr(skb)->protocol == IPPROTO_ICMP)
                printk("ping 3\n");
...
}

When the machine stops responding to pings, I see in the kernel message
log 'ping 2' but no 'ping 3' (whereas I get both when it does respond.)
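
Seeing 'ping 2' without 'ping 3' is what one would expect if a hook
registered at NF_INET_LOCAL_IN returned something other than NF_ACCEPT,
because NF_HOOK() only calls its final function argument once every hook
has accepted the packet. A minimal userspace sketch of that control flow
follows. The types and helpers (struct pkt, run_hooks, demo_delivered) are
hypothetical and only model the idea; verdicts other than accept and drop
are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Verdict values as defined in <linux/netfilter.h>. */
#define NF_DROP   0
#define NF_ACCEPT 1

/* Hypothetical stand-in for struct sk_buff. */
struct pkt {
	int delivered;
};

typedef unsigned int (*hook_fn)(struct pkt *);
typedef int (*ok_fn)(struct pkt *);

/* Simplified model of NF_HOOK(): okfn runs only if every hook accepts. */
static int run_hooks(struct pkt *p, const hook_fn *hooks, size_t n, ok_fn okfn)
{
	assert(okfn != NULL);

	for (size_t i = 0; i < n; i++) {
		if (hooks[i](p) != NF_ACCEPT)
			return -1;	/* dropped: okfn is never reached */
	}
	return okfn(p);
}

static unsigned int hook_accept(struct pkt *p) { (void)p; return NF_ACCEPT; }
static unsigned int hook_drop(struct pkt *p)   { (void)p; return NF_DROP; }

/* Models ip_local_deliver_finish(), where "ping 3" is printed. */
static int finish(struct pkt *p)
{
	p->delivered = 1;
	return 0;
}

/* Returns 1 if the packet reached finish(), 0 if a hook dropped it. */
static int demo_delivered(int last_hook_drops)
{
	struct pkt p = { 0 };
	hook_fn hooks[] = { hook_accept,
			    last_hook_drops ? hook_drop : hook_accept };

	run_hooks(&p, hooks, sizeof(hooks) / sizeof(hooks[0]), finish);
	return p.delivered;
}
```

demo_delivered(0) returns 1 (both messages would appear) and
demo_delivered(1) returns 0 (only the message before the hooks appears).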

I don't have the iptables binary installed, so there aren't any rules.
(Also, the iptables_filter module isn't loaded.)

I'll see if I can track the packet's progress through the netfilter code
today.

-- 
Russell King
--

From: Russell King
Date: Monday, April 28, 2008 - 3:18 am

(Adding netfilter mailing list.  See http://marc.info/?t=120933809600001&r=1&w=2
for the initial problem description.)

Further to this, it's looking like there's an nf_conntrack issue.  Having
placed similar printks in the netfilter code, I see the ipv4_confirm()
hook normally returning 1 (NF_ACCEPT), but it then starts returning 0
(NF_DROP), at which point there are no more ping replies.

-bash-3.1# cat /proc/net/stat/ip_conntrack
entries  searched found new invalid ignore delete delete_list insert insert_failed drop early_drop icmp_error  expect_new expect_create expect_delete
00000110  000000e2 000001c6 000003bb 00000140 00000000 000002ab 0000023a 0000034a 0000005f 00000000 00000000 0000000f  00000000 00000000 00000000

insert_failed increments when there aren't any ping replies.
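
The counters in /proc/net/stat/ip_conntrack are printed in hexadecimal, so
the insert_failed value above, 0000005f, is 95 in decimal. A small sketch
for pulling a named column out of that output is shown here, assuming the
header and value lines have already been read into strings. The name ct_stat
is made up for this example; the two sample strings are copied from the
output above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Header and value lines copied from the output quoted in this message. */
static const char ct_header[] =
	"entries  searched found new invalid ignore delete delete_list "
	"insert insert_failed drop early_drop icmp_error  "
	"expect_new expect_create expect_delete";
static const char ct_values[] =
	"00000110  000000e2 000001c6 000003bb 00000140 00000000 000002ab "
	"0000023a 0000034a 0000005f 00000000 00000000 0000000f  "
	"00000000 00000000 00000000";

/*
 * Return the value in the column called name, or -1 if there is no such
 * column. Both lines are whitespace-separated; values are hexadecimal.
 */
static long long ct_stat(const char *header, const char *values,
			 const char *name)
{
	const char *ws = " \t\n";
	size_t want;

	assert(header != NULL && values != NULL && name != NULL);
	want = strlen(name);

	for (;;) {
		/* Skip separators in front of the next pair of tokens. */
		header += strspn(header, ws);
		values += strspn(values, ws);
		if (*header == '\0' || *values == '\0')
			return -1;

		size_t hlen = strcspn(header, ws);
		if (hlen == want && strncmp(header, name, want) == 0)
			return (long long)strtoull(values, NULL, 16);

		/* Not this column: step over both tokens. */
		header += hlen;
		values += strcspn(values, ws);
	}
}
```

With the sample lines, ct_stat(ct_header, ct_values, "insert_failed") is 95
and ct_stat(ct_header, ct_values, "insert") is 842.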

The other interesting thing (though I'm not sure if it's really
related or helps) is:

-bash-3.1# grep 'ipv4.*icmp.*192.168.0.67' /proc/net/nf_conntrack
ipv4     2 icmp     1 29 src=192.168.0.67 dst=78.32.30.220 type=8 code=0 id=53823 packets=19 bytes=156180 [UNREPLIED] src=78.32.30.220 dst=192.168.0.67 type=0 code=0 id=53823 packets=0 bytes=0 mark=0 use=1
-bash-3.1# grep 'ipv4.*icmp.*192.168.0.67' /proc/net/nf_conntrack
ipv4     2 icmp     1 29 src=192.168.0.67 dst=78.32.30.220 type=8 code=0 id=53823 packets=21 bytes=172620 [UNREPLIED] src=78.32.30.220 dst=192.168.0.67 type=0 code=0 id=53823 packets=0 bytes=0 mark=0 use=1
-bash-3.1# grep 'ipv4.*icmp.*192.168.0.67' /proc/net/nf_conntrack
ipv4     2 icmp     1 29 src=192.168.0.67 dst=78.32.30.220 type=8 code=0 id=53823 packets=22 bytes=180840 [UNREPLIED] src=78.32.30.220 dst=192.168.0.67 type=0 code=0 id=53823 packets=0 bytes=0 mark=0 use=1
-bash-3.1# grep 'ipv4.*icmp.*192.168.0.67' /proc/net/nf_conntrack
ipv4     2 icmp     1 29 src=192.168.0.67 dst=78.32.30.220 type=8 code=0 id=53823 packets=23 bytes=189060 [UNREPLIED] src=78.32.30.220 dst=192.168.0.67 type=0 code=0 id=53823 packets=0 bytes=0 mark=0 use=1
-bash-3.1# grep 'ipv4.*icmp.*192.168.0.67' /proc/net/nf_conntrack
ipv4     ...
--

From: David Miller
Date: Monday, April 28, 2008 - 3:30 am

From: Russell King <rmk@arm.linux.org.uk>

There's already been a report about specific hashing problems with
conntrack on ARM.  It has something to do with how structures are
padded on ARM combined with the following patch made by Patrick:

commit 0794935e21a18e7c171b604c31219b60ad9749a9
Author: Patrick McHardy <kaber@trash.net>
Date:   Thu Jan 31 04:40:52 2008 -0800

    [NETFILTER]: nf_conntrack: optimize hash_conntrack()
    
    Avoid calling jhash three times and hash the entire tuple in one go.
    
      __hash_conntrack | -485 # 760 -> 275, # inlines: 3 -> 1, size inlines: 717 -> 252
     1 function changed, 485 bytes removed
    
    Signed-off-by: Patrick McHardy <kaber@trash.net>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index ce4c4ba..4a2cce1 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -73,15 +73,19 @@ static unsigned int nf_conntrack_hash_rnd;
 static u_int32_t __hash_conntrack(const struct nf_conntrack_tuple *tuple,
 				  unsigned int size, unsigned int rnd)
 {
-	unsigned int a, b;
+	unsigned int n;
+	u_int32_t h;
 
-	a = jhash2(tuple->src.u3.all, ARRAY_SIZE(tuple->src.u3.all),
-		   (tuple->src.l3num << 16) | tuple->dst.protonum);
-	b = jhash2(tuple->dst.u3.all, ARRAY_SIZE(tuple->dst.u3.all),
-		   ((__force __u16)tuple->src.u.all << 16) |
-		    (__force __u16)tuple->dst.u.all);
+	/* The direction must be ignored, so we hash everything up to the
+	 * destination ports (which is a multiple of 4) and treat the last
+	 * three bytes manually.
+	 */
+	n = (sizeof(tuple->src) + sizeof(tuple->dst.u3)) / sizeof(u32);
+	h = jhash2((u32 *)tuple, n,
+		   rnd ^ (((__force __u16)tuple->dst.u.all << 16) |
+			  tuple->dst.protonum));
 
-	return ((u64)jhash_2words(a, b, rnd) * size) >> 32;
+	return ((u64)h * size) >> 32;
 }
 
 static inline u_int32_t hash_conntrack(const struct nf_conntrack_tuple *tuple)
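
One detail of the patch worth spelling out is its final return statement:
multiplying the 32-bit hash by the table size and keeping the upper 32 bits
of the 64-bit product scales the hash onto [0, size) without a division.
A standalone sketch follows, with scale_hash as a hypothetical name and
16384 used below purely as an example table size.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Map a 32-bit hash onto the range [0, size), mirroring the
 * ((u64)h * size) >> 32 expression in the patch.
 */
static uint32_t scale_hash(uint32_t h, uint32_t size)
{
	assert(size != 0);	/* an empty table has no valid bucket */

	return (uint32_t)(((uint64_t)h * size) >> 32);
}
```

For example, scale_hash(0, 16384) is 0, scale_hash(0x80000000, 16384) is
8192 and scale_hash(0xffffffff, 16384) is 16383.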

--

From: Russell King
Date: Monday, April 28, 2008 - 5:00 am

Yup, reverting that appears to fix the problem.  Looking at the
structure, it will contain two bytes of padding in the 'u' union
and another two bytes in the 'dst' structure.
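
To illustrate why that padding matters: when the hash runs over the raw
bytes of the structure, any bytes that are not part of the key feed into
the result, so two tuples with identical fields can hash differently. The
sketch below models this in userspace. Everything in it is an assumption
made for the demonstration: struct demo_tuple is not the real
nf_conntrack_tuple, the explicit pad member stands in for compiler-inserted
padding so the result is deterministic, and FNV-1a stands in for jhash2().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical tuple; pad models bytes that are not part of the key. */
struct demo_tuple {
	uint32_t addr;
	uint16_t port;
	uint16_t pad;
};

/* FNV-1a over a byte range (a stand-in hash, not jhash2). */
static uint32_t hash_bytes(const void *data, size_t len)
{
	const unsigned char *b = data;
	uint32_t h = 2166136261u;

	for (size_t i = 0; i < len; i++) {
		h ^= b[i];
		h *= 16777619u;
	}
	return h;
}

/* Hash the whole object, padding included. */
static uint32_t hash_whole(const struct demo_tuple *t)
{
	return hash_bytes(t, sizeof(*t));
}

/* Hash only the fields that make up the key. */
static uint32_t hash_fields(const struct demo_tuple *t)
{
	unsigned char key[sizeof(t->addr) + sizeof(t->port)];

	memcpy(key, &t->addr, sizeof(t->addr));
	memcpy(key + sizeof(t->addr), &t->port, sizeof(t->port));
	return hash_bytes(key, sizeof(key));
}

/*
 * Returns 1 if two tuples with the same key fields but different pad
 * contents get the same hash from the given function, 0 otherwise.
 */
static int hashes_match(uint32_t (*hash)(const struct demo_tuple *))
{
	struct demo_tuple a = { .addr = 0xc0a80043, .port = 53823, .pad = 0 };
	struct demo_tuple b = a;

	assert(hash != NULL);
	b.pad = 0x00aa;		/* same key, different "padding" */
	return hash(&a) == hash(&b);
}
```

Here hashes_match(hash_whole) is 0 while hashes_match(hash_fields) is 1.
Hashing only the named fields, or making sure the padding always holds the
same value (for example by zeroing the tuple before filling it in), are two
possible directions.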

I suspect there'll be objections to packing the structure, in which
case what's the permanent fix?

-- 
Russell King
--
