Background: Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver) Exporting NFS/NFSv4 mounts. Client count: 1 Uptime: 4 days Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation) NFS mount setup: pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1) Uptime: 4 days Home directory mounted via NFSv4. Problem: My workstation has been happily talking to my file server for several days without incident. An hour ago, my numeric keypad stopping working (unrelated problem... USB or X bug?). The solution to the keypad problem is usually to log out of X and log back in, or worse case, reboot. So, I log out, and log back in. At first, a few shell windows open and successfully initialize themselves (read bash profile over NFS, etc.) Then, as more shell windows open, things start hanging. I can easily switch to console and ssh to the fileserver, so it is clear this is an NFS hang. No adverse messages at all on the client. On the server, I see NFSv4 spamming dmesg with hundreds of thousands of messages: Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid! Jul 27 08:21:24 pretzel last message repeated 167966 times Jul 27 08:21:55 pretzel last message repeated 173628 times Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid! Jul 27 08:22:26 pretzel last message repeated 171286 times Jul 27 08:23:27 pretzel last message repeated 344461 times Jul 27 08:23:30 pretzel last message repeated 18656 times I rebooted the client, the problem disappeared, and everything is happy again... but clearly NFSv4 shat itself. And now I am worried this will happen again. In all my quite-heavy use of NFSv4 I've never seen this behavior before, so I would call this a regression. I always run vanilla linux-2.6.git self-built kernels on both client and server. Jeff -
Yup. Bruce has reported the same bug so it is under investigation. I'll keep you posted when I think we have a fix. Cheers Trond -
Jeff and Bruce, could you please try to reproduce the problem after either applying patches 001 to 004 or just the single NFS_ALL patch from http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/ Cheers Trond -
I suspect that the problem may involve mixing O_RDWR with O_RDONLY and/or O_WRONLY file access. Bruce might have a better suggestion, though. -
I haven't done any more work to identify when exactly the problem is triggered, but I did confirm that I could reproduce the problem reliably with current git, but *not* with current git plus your 4-patch nfs-all. --b. -
Hi, as posted earlier, 001-002 does not fix it. Then I also applied 003+004 and it seems to be fixed now. kde boots up and I can start some programs, which I couldn't before. The "inconsistent lock state" error still exists, but it really seems to be thanks Marc -
Hi,
me too, my server has 2.6.18-? (openSUSE 10.2). On the client
(2.6.23-rc1-mm1), I also see (shortly before the hang)
Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} ->
{in-softirq-W} usage.
Jul 26 13:09:19 fb07-iapwap2 kernel: hald/3873 [HC0[0]:SC1[1]:HE1:SE0] takes:
Jul 26 13:09:19 fb07-iapwap2 kernel: (rpc_credcache_lock){-+..}, at:
[<c01dc166>] _atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c013d4f7>] mark_lock+0x77/0x630
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c013c094>] add_lock_to_list+0x44/0xc0
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c013e8af>]
__lock_acquire+0x65f/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c013db0e>] mark_held_locks+0x5e/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c012448d>] local_bh_enable+0x7d/0x130
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c013f2cf>] lock_acquire+0x5f/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c01dc166>]
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c01dc166>]
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c01dc166>]
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel: [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel: [<dcecf770>] put_rpccred+0x60/0x110
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel: [<dcecf840>]
rpcauth_unbindcred+0x20/0x60 [sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel: [<dcece1f4>] rpc_put_task+0x44/0xb0
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel: [<dcec8ffd>] rpc_call_sync+0x2d/0x40
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel: ...(please don't drop CC's when you reply to email; you are cutting relevant people out of the loop) [continues] That's quite interesting. I wonder if I enable lock debugging and tracing, if new stuff will appear. Jeff -
That particular hang in rpciod_down we do have a fix for, but it is not related to the issue you were seeing Jeff. Trond
Doesn't appear related ... could you rebuild with CONFIG_FRAME_POINTER and test + resend the above log? The trace up there looks quite ugly. [ Unrelated, but I just realized from lib/Kconfig.debug that stuff in there "select"s LOCKDEP, which itself "selects" frame pointers (on some archs). But I thought "select" didn't chain properly like that? There's whole lots of depending and selecting and non-user-visible options out there in that file -- and I'm not sure all those selects and depends are really even required to successfully build (?) ] Satyam -
Hi again,
this seems to be better ...
Jul 27 19:43:33 fb07-iapwap2 kernel: =================================
Jul 27 19:43:33 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
Jul 27 19:43:33 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
Jul 27 19:43:33 fb07-iapwap2 kernel: ---------------------------------
Jul 27 19:43:33 fb07-iapwap2 kernel: inconsistent {softirq-on-W} ->
{in-softirq-W} usage.
Jul 27 19:43:33 fb07-iapwap2 kernel: kdm/5909 [HC0[0]:SC1[1]:HE1:SE0] takes:
Jul 27 19:43:33 fb07-iapwap2 kernel: (rpc_credcache_lock){-+..}, at:
[<c01dba67>] _atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
Jul 27 19:43:33 fb07-iapwap2 kernel: [<c013e800>]
__lock_acquire+0x650/0x1030
Jul 27 19:43:33 fb07-iapwap2 kernel: [<c013f241>] lock_acquire+0x61/0x80
Jul 27 19:43:33 fb07-iapwap2 kernel: [<c02db8dc>] _spin_lock+0x2c/0x40
Jul 27 19:43:33 fb07-iapwap2 kernel: [<c01dba67>]
_atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dceca5fd>] put_rpccred+0x5d/0x100
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dceca6c1>]
rpcauth_unbindcred+0x21/0x60 [sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcec8fd4>] rpc_put_task+0x44/0xa0
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcec3fe0>] rpc_call_sync+0x30/0x40
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dced172b>] rpcb_register+0xdb/0x180
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcecb5b3>] svc_register+0x93/0x160
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcecbebe>] __svc_create+0x1ee/0x220
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcecc053>] svc_create+0x13/0x20
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcf62752>] nfs_callback_up+0x82/0x120
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcf3df36>] nfs_get_client+0x176/0x390
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcf3e181>] nfs4_set_client+0x31/0x190
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel: [<dcf3e983>] ...That is very odd. I suspect the memory shrinker, but am at a loss to explain why that would be called from a softirq context. Could you please see if you can reproduce the above problem with 2.6.23-rc1 + the 4 patches? Cheers Trond -
