Re: NFSv4 poops itself

Previous thread: [PATCH] perfctr_watchdog: do not BUG_ON() when MSR is unknown by Stephane Eranian on Friday, July 27, 2007 - 5:49 am. (2 messages)

Next thread: [PATCH 1/4] fix endianness bug in l2cap_sock_listen() by Al Viro on Friday, July 27, 2007 - 5:57 am. (3 messages)
From: Jeff Garzik
Date: Friday, July 27, 2007 - 5:53 am

Background:

Server: x86-64 dual core Intel, kernel 2.6.23-rc1 (my home fileserver)
	Exporting NFS/NFSv4 mounts.  Client count: 1  Uptime: 4 days

Client: x86-64 dual core Intel, kernel 2.6.23-rc1 (my main workstation)
	NFS mount setup:
	pretzel:/ on /g type nfs4 (rw,noatime,proto=tcp,addr=10.10.10.1)
	Uptime: 4 days

	Home directory mounted via NFSv4.

Problem:

My workstation has been happily talking to my file server for several 
days without incident.  An hour ago, my numeric keypad stopping working 
(unrelated problem... USB or X bug?).  The solution to the keypad 
problem is usually to log out of X and log back in, or worse case, reboot.

So, I log out, and log back in.  At first, a few shell windows open and 
successfully initialize themselves (read bash profile over NFS, etc.) 
Then, as more shell windows open, things start hanging.  I can easily 
switch to console and ssh to the fileserver, so it is clear this is an 
NFS hang.

No adverse messages at all on the client.

On the server, I see NFSv4 spamming dmesg with hundreds of thousands of 
messages:

Jul 27 08:20:53 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
Jul 27 08:21:24 pretzel last message repeated 167966 times
Jul 27 08:21:55 pretzel last message repeated 173628 times
Jul 27 08:21:55 pretzel kernel: NFSD: preprocess_seqid_op: old stateid!
Jul 27 08:22:26 pretzel last message repeated 171286 times
Jul 27 08:23:27 pretzel last message repeated 344461 times
Jul 27 08:23:30 pretzel last message repeated 18656 times

I rebooted the client, the problem disappeared, and everything is happy 
again...  but clearly NFSv4 shat itself.  And now I am worried this will 
happen again.

In all my quite-heavy use of NFSv4 I've never seen this behavior before, 
  so I would call this a regression.

I always run vanilla linux-2.6.git self-built kernels on both client and 
server.

	Jeff


-

From: Trond Myklebust
Date: Friday, July 27, 2007 - 6:33 am

Yup. Bruce has reported the same bug so it is under investigation. I'll
keep you posted when I think we have a fix.

Cheers
  Trond
-

From: Trond Myklebust
Date: Friday, July 27, 2007 - 10:09 am

Jeff and Bruce, could you please try to reproduce the problem after
either applying patches 001 to 004 or just the single NFS_ALL patch
from 

   http://client.linux-nfs.org/Linux-2.6.x/2.6.23-rc1/

Cheers
  Trond
-

From: Jeff Garzik
Date: Friday, July 27, 2007 - 10:16 am

It's not easily reproducible here.  Any hints on triggering the behavior?

	Jeff


-

From: Trond Myklebust
Date: Friday, July 27, 2007 - 10:29 am

I suspect that the problem may involve mixing O_RDWR with O_RDONLY
and/or O_WRONLY file access. Bruce might have a better suggestion,
though.
-

From: J. Bruce Fields
Date: Friday, August 3, 2007 - 12:13 pm

I haven't done any more work to identify when exactly the problem is
triggered, but I did confirm that I could reproduce the problem reliably
with current git, but *not* with current git plus your 4-patch nfs-all.

--b.
-

From: J. Bruce Fields
Date: Monday, August 6, 2007 - 6:40 am

And that result still holds after using the patched kernel for several
more hours.

--b.
-

From: Trond Myklebust
Date: Monday, August 6, 2007 - 11:40 am

Thanks Bruce!
  Trond
-

From: Marc Dietrich
Date: Friday, July 27, 2007 - 11:29 am

Hi,


as posted earlier, 001-002 does not fix it. Then I also applied 003+004 and it 
seems to be fixed now. kde boots up and I can start some programs, which I 
couldn't before. 

The "inconsistent lock state" error still exists, but it really seems to be 

thanks

Marc
-

From: Marc Dietrich
Date: Friday, July 27, 2007 - 6:36 am

Hi,


me too, my server has 2.6.18-? (openSUSE 10.2). On the client 
(2.6.23-rc1-mm1), I also see (shortly before the hang)

Jul 26 13:09:19 fb07-iapwap2 kernel: =================================
Jul 26 13:09:19 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
Jul 26 13:09:19 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
Jul 26 13:09:19 fb07-iapwap2 kernel: ---------------------------------
Jul 26 13:09:19 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
{in-softirq-W} usage.
Jul 26 13:09:19 fb07-iapwap2 kernel: hald/3873 [HC0[0]:SC1[1]:HE1:SE0] takes:
Jul 26 13:09:19 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
[<c01dc166>] _atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013d4f7>] mark_lock+0x77/0x630
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013c094>] add_lock_to_list+0x44/0xc0
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013e8af>] 
__lock_acquire+0x65f/0x1020
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013db0e>] mark_held_locks+0x5e/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c012448d>] local_bh_enable+0x7d/0x130
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c013f2cf>] lock_acquire+0x5f/0x80
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c01dc166>] 
_atomic_dec_and_lock+0x16/0x60
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<c02d156a>] _spin_lock+0x2a/0x40
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf770>] put_rpccred+0x60/0x110 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcecf840>] 
rpcauth_unbindcred+0x20/0x60 [sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcece1f4>] rpc_put_task+0x44/0xb0 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   [<dcec8ffd>] rpc_call_sync+0x2d/0x40 
[sunrpc]
Jul 26 13:09:19 fb07-iapwap2 kernel:   ...
From: Jeff Garzik
Date: Friday, July 27, 2007 - 6:40 am

(please don't drop CC's when you reply to email; you are cutting 
relevant people out of the loop)


[continues]

That's quite interesting.  I wonder if I enable lock debugging and 
tracing, if new stuff will appear.

	Jeff


-

From: Trond Myklebust
Date: Friday, July 27, 2007 - 7:45 am

That particular hang in rpciod_down we do have a fix for, but it is not
related to the issue you were seeing Jeff.

Trond

From: Satyam Sharma
Date: Friday, July 27, 2007 - 7:58 am

Doesn't appear related ... could you rebuild with CONFIG_FRAME_POINTER
and test + resend the above log? The trace up there looks quite ugly.

[ Unrelated, but I just realized from lib/Kconfig.debug that stuff in there
"select"s LOCKDEP, which itself "selects" frame pointers (on some archs).
But I thought "select" didn't chain properly like that? There's whole lots
of depending and selecting and non-user-visible options out there in
that file -- and I'm not sure all those selects and depends are really
even required to successfully build (?) ]


Satyam
-

From: Marc Dietrich
Date: Friday, July 27, 2007 - 10:58 am

Hi again,


this seems to be better ...

Jul 27 19:43:33 fb07-iapwap2 kernel: =================================
Jul 27 19:43:33 fb07-iapwap2 kernel: [ INFO: inconsistent lock state ]
Jul 27 19:43:33 fb07-iapwap2 kernel: 2.6.23-rc1-mm1 #1
Jul 27 19:43:33 fb07-iapwap2 kernel: ---------------------------------
Jul 27 19:43:33 fb07-iapwap2 kernel: inconsistent {softirq-on-W} -> 
{in-softirq-W} usage.
Jul 27 19:43:33 fb07-iapwap2 kernel: kdm/5909 [HC0[0]:SC1[1]:HE1:SE0] takes:
Jul 27 19:43:33 fb07-iapwap2 kernel:  (rpc_credcache_lock){-+..}, at: 
[<c01dba67>] _atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel: {softirq-on-W} state was registered at:
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c013e800>] 
__lock_acquire+0x650/0x1030
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c013f241>] lock_acquire+0x61/0x80
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c02db8dc>] _spin_lock+0x2c/0x40
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<c01dba67>] 
_atomic_dec_and_lock+0x17/0x60
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dceca5fd>] put_rpccred+0x5d/0x100 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dceca6c1>] 
rpcauth_unbindcred+0x21/0x60 [sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcec8fd4>] rpc_put_task+0x44/0xa0 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcec3fe0>] rpc_call_sync+0x30/0x40 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dced172b>] rpcb_register+0xdb/0x180 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecb5b3>] svc_register+0x93/0x160 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecbebe>] __svc_create+0x1ee/0x220 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcecc053>] svc_create+0x13/0x20 
[sunrpc]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf62752>] nfs_callback_up+0x82/0x120 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3df36>] nfs_get_client+0x176/0x390 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3e181>] nfs4_set_client+0x31/0x190 
[nfs]
Jul 27 19:43:33 fb07-iapwap2 kernel:   [<dcf3e983>] ...
From: Trond Myklebust
Date: Friday, July 27, 2007 - 11:39 am

That is very odd. I suspect the memory shrinker, but am at a loss to
explain why that would be called from a softirq context.

Could you please see if you can reproduce the above problem with
2.6.23-rc1 + the 4 patches?

Cheers
  Trond
-

Previous thread: [PATCH] perfctr_watchdog: do not BUG_ON() when MSR is unknown by Stephane Eranian on Friday, July 27, 2007 - 5:49 am. (2 messages)

Next thread: [PATCH 1/4] fix endianness bug in l2cap_sock_listen() by Al Viro on Friday, July 27, 2007 - 5:57 am. (3 messages)