Re: strange linux kernel NFS problem(s)

From: Doug Hughes
Date: Thursday, December 2, 2010 - 7:40 pm

So, this is my first post, but not my first problem of this nature. It 
just so happens that this is the first one with a recent kernel to give 
useful data, useful enough to post it and seek some advice on the subject:

Symptoms: the machine gets high load, NFS mount processes hang, and things 
(particularly NFS) stop working. ssh and IP connectivity still work, as 
does ps.

general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 1
Modules linked in: nfs auth_rpcgss autofs4 i2c_dev i2c_core lockd sunrpc 
cachefiles fscache ipmi_si ipmi_devintf ipmi_msghandler ip6t_REJECT 
xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 video output battery 
ac parport_pc lp parport joydev button sr_mod pcspkr iTCO_wdt shpchp 
dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage 
pata_acpi ata_piix ata_generic libata uhci_hcd ohci_hcd ehci_hcd [last 
unloaded: microcode]

Pid: 28573, comm: python2.5 Not tainted 2.6.34 #3 X7DWT/X7DWT
RIP: 0010:[<ffffffffa0292cdb>]  [<ffffffffa0292cdb>] 
nfs_release+0x64/0x94 [nfs]
RSP: 0018:ffff88041ccb9d58  EFLAGS: 00010246
RAX: ffff88041c47d160 RBX: ffff88041c47d1e8 RCX: ff88041c47d16088
RDX: ffff88042c593288 RSI: ffff88042c504e40 RDI: ffff88041c47d294
RBP: ffff88041ccb9d78 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000300000000 R11: 0000000000000000 R12: ffff88042c593240
R13: ffff88042c504e40 R14: ffff88041ea59ec0 R15: ffff8804273f55c0
FS:  0000000000000000(0000) GS:ffff880001840000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003fd5c03350 CR3: 0000000001613000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python2.5 (pid: 28573, threadinfo ffff88041ccb8000, task 
ffff8803e246adf0)
Stack:
  0000000300000000 ffff88042c504e40 ffff88041c47d1e8 ffff88041c47d1e8
<0> ffff88041ccb9d98 ffffffffa0290fc5 ...
From: John Stoffel
Date: Friday, December 3, 2010 - 10:36 am

>>>>> "Doug" == Doug Hughes <doug@will.to> writes:

Doug> So, this is my first post, but not my first problem of this
Doug> nature. It just so happens that this is the first one with a
Doug> recent kernel to give useful data, useful enough to post it and
Doug> seek some advice on the subject:

kernel 2.6.34 is still pretty old, and there have been lots of NFS
fixes.  Can you upgrade to something newer as a test?  Also, what
distro are you using?  

Is this an NFS client or the NFS server which is crapping out?  More
details please...

John
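
One quick way to answer the client-or-server question on the affected machine is to look for NFS client mounts in /proc/mounts. The following is only a minimal sketch: the helper names `nfs_client_mounts` and `read_proc_mounts` are made up for illustration, and it assumes the standard /proc/mounts layout (source, mount point, filesystem type, options, ...).

```python
# Filesystem types used by the Linux NFS client. "nfsd" is deliberately
# excluded: that is the server's control filesystem, not a client mount.
NFS_CLIENT_TYPES = {"nfs", "nfs4"}


def nfs_client_mounts(mounts_text):
    """Return (source, mountpoint, fstype) for each NFS client mount.

    mounts_text is the content of /proc/mounts (or text in that format).
    """
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[2] in NFS_CLIENT_TYPES:
            found.append((fields[0], fields[1], fields[2]))
    return found


def read_proc_mounts(path="/proc/mounts"):
    """Read the mount table; return an empty string if it is unavailable."""
    try:
        with open(path) as fh:
            return fh.read()
    except OSError:
        return ""


if __name__ == "__main__":
    for source, mountpoint, fstype in nfs_client_mounts(read_proc_mounts()):
        print(fstype, source, "on", mountpoint)
```

Any output means the machine is acting as an NFS client for at least one export. An empty result does not prove it is only a server.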


Doug> symptoms: machine gets high load, nfs mount processes hang, and things 
Doug> (particularly NFS) stop working. ssh and ip connectivity still works, as 
Doug> does ps.

Doug> *general protection fault: 0000 [#1] SMP
Doug> last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Doug> CPU 1
Doug> Modules linked in: nfs auth_rpcgss autofs4 i2c_dev i2c_core lockd sunrpc 
Doug> cachefiles fscache ipmi_si ipmi_devintf ipmi_msghandler ip6t_REJECT 
Doug> xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 video output battery 
Doug> ac parport_pc lp parport joydev button sr_mod pcspkr iTCO_wdt shpchp 
Doug> dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage 
Doug> pata_acpi ata_piix ata_generic libata uhci_hcd ohci_hcd ehci_hcd [last 
Doug> unloaded: microcode]

Doug> Pid: 28573, comm: python2.5 Not tainted 2.6.34 #3 X7DWT/X7DWT
Doug> RIP: 0010:[<ffffffffa0292cdb>]  [<ffffffffa0292cdb>] 
Doug> nfs_release+0x64/0x94 [nfs]
Doug> RSP: 0018:ffff88041ccb9d58  EFLAGS: 00010246
Doug> RAX: ffff88041c47d160 RBX: ffff88041c47d1e8 RCX: ff88041c47d16088
Doug> RDX: ffff88042c593288 RSI: ffff88042c504e40 RDI: ffff88041c47d294
Doug> RBP: ffff88041ccb9d78 R08: 0000000000000000 R09: 0000000000000000
Doug> R10: 0000000300000000 R11: 0000000000000000 R12: ffff88042c593240
Doug> R13: ffff88042c504e40 R14: ffff88041ea59ec0 R15: ffff8804273f55c0
Doug> FS:  0000000000000000(0000) GS:ffff880001840000(0000) ...
From: Doug Hughes
Date: Friday, December 3, 2010 - 11:47 am

It wasn't very old when we started testing it to resolve further NFS 
problems about 6 weeks ago. It takes a while to get through the 
necessary regressions to make sure things are generally ok before 
getting comfortable with a rollout to more than a couple nodes. The 
problems we experience are more of a statistical nature across nodes, so 
we don't usually experience them until we have some mass of upgraded nodes.

We checked through the changelists and didn't see anything that stood 
out as "ah ha, that's the problem". Most of the updates seemed to not 
mention NFS at all. Do you have a particular issue/patch in mind?

This is an NFS client mounting a server elsewhere. The ps listing shows 
several stuck mount commands, which is another symptom of the general 
issue. Let me know what else you need. Certainly it's possible to try 
another, newer kernel, but then I'll be posting about .36.1 in about 6-9 
weeks and chances are that it will be considered old. :\
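
For anyone wanting to collect the stuck mount commands without eyeballing ps output, here is a rough sketch that scans /proc for tasks in uninterruptible sleep (state "D"), which is where processes blocked on a hard NFS mount usually end up. The function names are hypothetical, and the parsing assumes the documented /proc/<pid>/stat layout of `pid (comm) state ...`.

```python
import os


def parse_stat_line(line):
    """Split a /proc/<pid>/stat line into (pid, comm, state).

    The command name is wrapped in parentheses and may itself contain
    spaces or parentheses, so locate the first "(" and the last ")".
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    pid = int(line[:open_paren])
    comm = line[open_paren + 1:close_paren]
    state = line[close_paren + 1:].split()[0]
    return pid, comm, state


def uninterruptible_tasks(proc_root="/proc"):
    """Return [(pid, comm)] for every process currently in state D."""
    tasks = []
    try:
        entries = os.listdir(proc_root)
    except OSError:
        return tasks  # no procfs here (non-Linux system or restricted access)
    for entry in entries:
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as fh:
                pid, comm, state = parse_stat_line(fh.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited mid-scan or the file was unreadable
        if state == "D":
            tasks.append((pid, comm))
    return tasks


if __name__ == "__main__":
    for pid, comm in uninterruptible_tasks():
        print(pid, comm)
```

Running it a few times some seconds apart helps separate tasks that are briefly waiting on I/O from ones that are genuinely stuck.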

Distro is CentOS 5.4 with updates. Kernel is from kernel.org.
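
Returning to the register dump in the first message: on x86-64 with 48-bit virtual addresses, a pointer is canonical only if bits 63 through 47 are all equal, and dereferencing a non-canonical address typically raises a general protection fault instead of a page fault. The sketch below is not from the original posts. The helper `is_canonical` and the 48-bit assumption are illustrative. It checks the register values copied from the oops.

```python
# Register values copied verbatim from the oops in the first message.
REGS = {
    "RSP": 0xffff88041ccb9d58,
    "RAX": 0xffff88041c47d160,
    "RBX": 0xffff88041c47d1e8,
    "RCX": 0xff88041c47d16088,
    "RDX": 0xffff88042c593288,
    "RSI": 0xffff88042c504e40,
    "RDI": 0xffff88041c47d294,
    "RBP": 0xffff88041ccb9d78,
    "R08": 0x0000000000000000,
    "R09": 0x0000000000000000,
    "R10": 0x0000000300000000,
    "R11": 0x0000000000000000,
    "R12": 0xffff88042c593240,
    "R13": 0xffff88042c504e40,
    "R14": 0xffff88041ea59ec0,
    "R15": 0xffff8804273f55c0,
}


def is_canonical(addr, va_bits=48):
    """Return True if bits 63..(va_bits - 1) of addr are all 0 or all 1.

    va_bits=48 assumes 4-level paging; that is an assumption about this
    machine, not something the oops itself states.
    """
    top = addr >> (va_bits - 1)
    all_ones = (1 << (64 - va_bits + 1)) - 1
    return top == 0 or top == all_ones


# Names of registers that could not be dereferenced as 64-bit pointers.
non_canonical = [name for name, value in REGS.items() if not is_canonical(value)]
print(non_canonical)  # → ['RCX']
```

Under those assumptions only RCX is flagged. Its upper 56 bits equal the lower 56 bits of RAX, so it looks like an otherwise valid kernel pointer shifted by one byte. That is only an observation about the numbers: the trace as posted does not show which instruction in nfs_release used RCX, so whether this value caused the fault remains a guess.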
