So, this is my first post, but not my first problem of this nature. It just so happens that this is the first one with a recent kernel to give useful data, useful enough to post it and seek some advice on the subject:

symptoms: machine gets high load, nfs mount processes hang, and things (particularly NFS) stop working. ssh and ip connectivity still works, as does ps.

*general protection fault: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
CPU 1
Modules linked in: nfs auth_rpcgss autofs4 i2c_dev i2c_core lockd sunrpc
cachefiles fscache ipmi_si ipmi_devintf ipmi_msghandler ip6t_REJECT
xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 video output battery
ac parport_pc lp parport joydev button sr_mod pcspkr iTCO_wdt shpchp
dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage
pata_acpi ata_piix ata_generic libata uhci_hcd ohci_hcd ehci_hcd [last
unloaded: microcode]
Pid: 28573, comm: python2.5 Not tainted 2.6.34 #3 X7DWT/X7DWT
RIP: 0010:[<ffffffffa0292cdb>] [<ffffffffa0292cdb>] nfs_release+0x64/0x94 [nfs]
RSP: 0018:ffff88041ccb9d58 EFLAGS: 00010246
RAX: ffff88041c47d160 RBX: ffff88041c47d1e8 RCX: ff88041c47d16088
RDX: ffff88042c593288 RSI: ffff88042c504e40 RDI: ffff88041c47d294
RBP: ffff88041ccb9d78 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000300000000 R11: 0000000000000000 R12: ffff88042c593240
R13: ffff88042c504e40 R14: ffff88041ea59ec0 R15: ffff8804273f55c0
FS: 0000000000000000(0000) GS:ffff880001840000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000003fd5c03350 CR3: 0000000001613000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process python2.5 (pid: 28573, threadinfo ffff88041ccb8000, task ffff8803e246adf0)
Stack:
 0000000300000000 ffff88042c504e40 ffff88041c47d1e8 ffff88041c47d1e8
<0> ffff88041ccb9d98 ffffffffa0290fc5 ...
>>>>> "Doug" == Doug Hughes <doug@will.to> writes:

Doug> So, this is my first post, but not my first problem of this
Doug> nature. It just so happens that this is the first one with a
Doug> recent kernel to give useful data, useful enough to post it and
Doug> seek some advice on the subject:

kernel 2.6.34 is still pretty old, and there have been lots of NFS fixes. Can you upgrade to something newer as a test? Also, what distro are you using? Is this an NFS client or the NFS server which is crapping out?

More details please...

John

Doug> symptoms: machine gets high load, nfs mount processes hang, and things
Doug> (particularly NFS) stop working. ssh and ip connectivity still works, as
Doug> does ps.

Doug> *general protection fault: 0000 [#1] SMP
Doug> last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map
Doug> CPU 1
Doug> Modules linked in: nfs auth_rpcgss autofs4 i2c_dev i2c_core lockd sunrpc
Doug> cachefiles fscache ipmi_si ipmi_devintf ipmi_msghandler ip6t_REJECT
Doug> xt_tcpudp ip6table_filter ip6_tables x_tables ipv6 video output battery
Doug> ac parport_pc lp parport joydev button sr_mod pcspkr iTCO_wdt shpchp
Doug> dm_snapshot dm_zero dm_mirror dm_region_hash dm_log dm_mod usb_storage
Doug> pata_acpi ata_piix ata_generic libata uhci_hcd ohci_hcd ehci_hcd [last
Doug> unloaded: microcode]
Doug> Pid: 28573, comm: python2.5 Not tainted 2.6.34 #3 X7DWT/X7DWT
Doug> RIP: 0010:[<ffffffffa0292cdb>] [<ffffffffa0292cdb>]
Doug> nfs_release+0x64/0x94 [nfs]
Doug> RSP: 0018:ffff88041ccb9d58 EFLAGS: 00010246
Doug> RAX: ffff88041c47d160 RBX: ffff88041c47d1e8 RCX: ff88041c47d16088
Doug> RDX: ffff88042c593288 RSI: ffff88042c504e40 RDI: ffff88041c47d294
Doug> RBP: ffff88041ccb9d78 R08: 0000000000000000 R09: 0000000000000000
Doug> R10: 0000000300000000 R11: 0000000000000000 R12: ffff88042c593240
Doug> R13: ffff88042c504e40 R14: ffff88041ea59ec0 R15: ffff8804273f55c0
Doug> FS: 0000000000000000(0000) GS:ffff880001840000(0000) ...
It wasn't very old when we started testing it to resolve further NFS problems about 6 weeks ago. It takes a while to get through the necessary regressions to make sure things are generally OK before getting comfortable with a rollout to more than a couple of nodes. The problems we experience are more of a statistical nature across nodes, so we don't usually experience them until we have some mass of upgraded nodes.

We checked through the changelists and didn't see anything that stood out as "ah ha, that's the problem". Most of the updates seemed not to mention NFS at all. Do you have a particular issue/patch in mind?

This is an NFS client mounting a server elsewhere. The ps listing shows several stuck mount commands, which is another symptom of the general issue. Let me know what else you need.

Certainly it's possible to try another, newer kernel, but then I'll be posting about .36.1 in about 6-9 weeks and chances are that it will be considered old. :\

Distro is CentOS 5.4 with updates. Kernel is from kernel.org.
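
For anyone trying to confirm the same symptom on their own nodes, the stuck mount commands seen in ps can also be found by scanning /proc for tasks in state D (uninterruptible sleep), which is where hung NFS operations typically sit. This is only a minimal sketch, not something from the thread: the helper names `parse_stat` and `blocked_tasks` are made up for illustration, and it assumes a Linux /proc and a reasonably recent Python 3.

```python
import os


def parse_stat(stat_line):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    lparen = stat_line.index("(")
    rparen = stat_line.rindex(")")
    pid = int(stat_line[:lparen].strip())
    comm = stat_line[lparen + 1:rparen]
    state = stat_line[rparen + 1:].split()[0]
    return pid, comm, state


def blocked_tasks(proc_root="/proc"):
    """List (pid, comm) for tasks currently in state D (uninterruptible sleep)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a process directory
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # process exited, or the stat line was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    for pid, comm in blocked_tasks():
        print(pid, comm)
```

Run periodically from cron across the upgraded nodes, a persistent list of D-state mount processes would flag an affected machine before the load climbs. A task that is only briefly in D during normal I/O is expected, so only entries that persist across several samples are interesting.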
