Having experienced 'mount.nfs4: internal error' when mounting nfsv4 in the past, I have a minimal test-case I sometimes run: $ while :; do mount -t nfs4 filer:/store /store; umount /store; done After ~100 iterations, I saw the 'mount.nfs4: internal error', followed by symptoms of memory corruption [1], a locking issue with the reporting [2] and another (related?) memory-corruption issue (off-by-1?) [3]. A little analysis shows memory being overwritten by (likely) a poison value, which gets complicated if it's not use-after-free... Anyone dare confirm this issue? NFSv4 server is x86-64 Ubuntu 8.04 2.6.24-18, client U8.04 2.6.26-rc4; batteries included [4]. I'm happy to decode addresses, test patches etc. Daniel --- [1] We see RAX (pointing to a structure) getting overwritten with the poison value, so is 0x6b6b6b6b6b6b6b6b: [4294454.232202] general protection fault: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC [4294454.232328] CPU 1 [4294454.232379] Modules linked in: snd_hda_intel snd_pcm snd_timer snd soundcore snd_page_alloc usbhid hid [4294454.232623] Pid: 6795, comm: mount.nfs4 Not tainted 2.6.26-rc4-200 #2 [4294454.232657] RIP: 0010:[<ffffffff8024a12a>] [<ffffffff8024a12a>] task_rq_lock+0x3a/0xa0 [4294454.232723] RSP: 0018:ffff8100707cf878 EFLAGS: 00010002 [4294454.232756] RAX: 6b6b6b6b6b6b6b6b RBX: ffffffff808542c0 RCX: 0000000000000001 [4294454.232794] RDX: ffff81007dd52490 RSI: ffff8100707cf8b8 RDI: 0000000000000000 [4294454.233077] RBP: ffff8100707cf898 R08: 0000000000000001 R09: 0000000000000001 [4294454.233077] R10: 0000000000000002 R11: 0000000000000001 R12: ffffffff808542c0 [4294454.233077] R13: ffff81007e378000 R14: ffff8100707cf8b8 R15: ffff81007ebf77e0 [4294454.233077] FS: 00007f42363566e0(0000) GS:ffff81007fb41320(0000) knlGS:0000000000000000 [4294454.233077] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [4294454.233077] CR2: 00000000023dc000 CR3: 0000000070711000 CR4: 00000000000026e0 [4294454.233077] DR0: 0000000000000000 DR1: ...
Hi Daniel- On Wed, Jun 4, 2008 at 7:33 PM, Daniel J Blueman We have some other reports of late model kernels with memory corruption issues during NFS mount. The problem is that by the time these canaries start singing, the evidence of what did the corrupting If these crashes are more or less reliably reproduced, it would be helpful if you could do a 'git bisect' on the client to figure out at what point in the kernel revision history this problem was introduced. > # CONFI
Firstly, I had omitted that I'd booted the kernel with debug_objects=1, which provides the canary here. The primary failure I see is 'mount.nfs4: internal error', and always after 358 umount/mount cycles (plus 1 initial mount) which gives us a clue; 'netstat' shows all these connections in a TIME_WAIT state, thus the bug relates to the inability to allocate a socket error path. I found that after the connection lifetime expired, you can mount again, which corroborates this theory. In this case, we saw the mount() syscall result in the mount.nfsv4 process being SEGV'd when booted with 'debug_object=1', without this option, we see: # strace /sbin/mount.nfs4 x1:/ /store ... mount("x1:/", "/store", "nfs4", 0, "addr=192.168.0.250,clientaddr=19"...) = -1 EIO (Input/output error) So, it's impossible to tell when the corruption was introduced, as it has only become detectable recently. It's worth a look-over of the socket-allocation error path, if someone can check, and reproduces 100% with the 'debug_object=1' param, available since 2.6.26-rc1 and 359 mounts in quick succession. Thanks! Daniel -- Daniel J Blueman --
Hi Daniel- That's nicely specific. I'm juggling several other problems at the moment, but I will try to get back to this again in a day or two. There's a lot of new code in the NFS mount path. "internal error" means the mount command has encountered something entirely unexpected. So when anyone sees this message, please go ahead and report it on the linux-nfs@vger.kernel.org list. Thanks! I will see if I can make that message a little more explicit. -- Chuck Lever chuck[dot]lever[at]oracle[dot]com --
On Sun, 15 Jun 2008 19:10:27 +0100 For some strange reason (probably something I'm doing wrong or maybe something environmental), I've not been able to reproduce this panic on a stock kernel. I did, however, apply the following fault injection patch and was able to reproduce it on the second mount attempt. The 3 patch set that I posted last week definitely prevents the oops. If you're able to confirm that it also fixes your panic it would be a helpful data point. The fault injection patch I'm using is attached. It just simulates nfs4_init_client() consistently returning an error. Cheers, -- Jeff Layton <jlayton@redhat.com>
Thanks! I hope to get time for this tonight and will get back to you; I appreciate your help, Jeff. The config I used to reproduce it on the client is http://www.quora.org/config-debug and is relevant to (at least) 2.6.26-rc5. I was able to reproduce it by exporting a single NFSv4 export on 2.6.25 (eg Ubuntu 8.04 server); client and server were x86-64. -- Daniel J Blueman --
On Thu, Jun 19, 2008 at 8:37 AM, Daniel J Blueman For the record, I was able to reproduce with a dual core 2.6.26-rc4 x86 client (using SLUB, not SLAB) on Fedora 8 against a Solaris 2008.5 NFSv4 server. When the client runs out of ports, the mount commands start to fail with EIO. Then after a few dozen of these, the mounting process hangs. Sometimes it will BUG because freed & poisoned memory is passed to kthread_stop(). I haven't seen some of the other subsequent issues that Daniel initially reported. -- I am certain that these presidents will understand the cry of the people of Bolivia, of the people of Latin America and the whole world, which wants to have more food and not more cars. First food, then if something's left over, more cars, more automobiles. I think that life has to come first. -- Evo Morales --
On Thu, 5 Jun 2008 00:33:54 +0100 Looks like it fell down while trying to take down the kthread during a failed mount attempt. I have to wonder if I might have introduced a race when I changed nfs4 callback thread to kthread API. I think we may need the BKL around the last 2 statements in the main callback thread function. If you can easily reproduce this, could you test the following patch and let me know if it helps? Note that this patch is entirely untested, so test it someplace non-critical ;-). Signed-off-by: Jeff Layton <jlayton@redhat.com> diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c index c1e7c83..a3e83f9 100644 --- a/fs/nfs/callback.c +++ b/fs/nfs/callback.c @@ -90,9 +90,9 @@ nfs_callback_svc(void *vrqstp) preverr = err; svc_process(rqstp); } - unlock_kernel(); nfs_callback_info.task = NULL; svc_exit_thread(rqstp); + unlock_kernel(); return 0; } --
This doesn't resolve the issue, alas. I'll hunker down to the bisection in the next few days. Thanks, Daniel -- Daniel J Blueman --
On Thu, 5 Jun 2008 09:28:51 +0100 Yeah, after I sent that I went back and looked and it seemed like while we don't hold the BKL there, access to the callback info is still serialized via a semaphore. Thanks for testing it anyway... -- Jeff Layton <jlayton@redhat.com> --
We certainly need to protect nfs_callback_info.task (and I believe I explained this earlier), but why do we need to protect svc_exit_thread? Also, looking at the general use of the BKL in that code, I thought we agreed that there was no need to hold the BKL while taking the nfs_callback_mutex? Trond --
On Tue, 10 Jun 2008 14:54:48 -0400
Hmm, I don't remember that discussion, but I'll take your word for it...
I think you're basically correct, but it looks to me like the
nfs_callback_mutex actually protects nfs_callback_info.task as well.
If we're starting the thread, then we can't call kthread_stop on it
until we release the mutex. So the thread can't exit until we release
the mutex, and we can be guaranteed that this:
nfs_callback_info.task = NULL;
...can't happen until after kthread_run returns and nfs_callback_up
sets it.
If that's right, then maybe this (untested, RFC only) patch would make sense?
----------[snip]------------
From cd0ce86919ede3f1abda1ba7522b72283bb94d7e Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@redhat.com>
Date: Tue, 10 Jun 2008 15:12:16 -0400
Subject: [PATCH] nfs4: remove BKL from nfs_callback_up and nfs_callback_down
The nfs_callback_mutex is sufficient protection. We don't need the
BKL here.
Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
fs/nfs/callback.c | 4 ----
1 files changed, 0 insertions(+), 4 deletions(-)
diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index c1e7c83..9e713d2 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -105,7 +105,6 @@ int nfs_callback_up(void)
struct svc_rqst *rqstp;
int ret = 0;
- lock_kernel();
mutex_lock(&nfs_callback_mutex);
if (nfs_callback_info.users++ || nfs_callback_info.task != NULL)
goto out;
@@ -149,7 +148,6 @@ out:
if (serv)
svc_destroy(serv);
mutex_unlock(&nfs_callback_mutex);
- unlock_kernel();
return ret;
out_err:
dprintk("Couldn't create callback socket or server thread; err = %d\n",
@@ -163,13 +161,11 @@ out_err:
*/
void nfs_callback_down(void)
{
- lock_kernel();
mutex_lock(&nfs_callback_mutex);
nfs_callback_info.users--;
if (nfs_callback_info.users == 0 && nfs_callback_info.task != NULL)
kthread_stop(nfs_callback_info.task);
mutex_unlock(&nfs_callback_mutex);
- unlock_kernel();
}
...On Tue, 10 Jun 2008 15:13:57 -0400 To clarify for Dan... I don't think that this patch will help the problem you're having. This is essentially a cleanup patch to remove some locking that doesn't appear to be needed. The original patch that Trond commented on above is also probably unnecessary (assuming I'm right about the locking here). -- Jeff Layton <jlayton@redhat.com> --
Thanks for the head-up, Jeff. I took it at face value, so didn't harbour the notion it would fix the memory corruption. Let's see If I can get time for this git bisect sooner rather than later... -- Daniel J Blueman --
On Tue, 10 Jun 2008 21:27:07 +0100 I've tried reproducing this, but haven't had much success (probably some differences in my kernel config). I suspect that Trond is correct here and the race has something to do with the kthread being spawned but nfs_callback_svc() never getting a chance to run. I posted a patchset to the list late last week with the intro email: [PATCH 0/3] fix potential races in lockd and nfs4-callback startup/shutdown Dan, could you apply that patchset to your kernel and see if it helps this problem? Thanks, -- Jeff Layton <jlayton@redhat.com> --
Yes, I confirm the problem reproduces without this patchset and I'm unable to reproduce it with the patchset, so looks good! Thanks again for the all the help and fixes! Dan -- Daniel J Blueman --
Hmm... I suppose that is correct, but what if nfs_alloc_client() does nfs_callback_up(); <kstrdup() fails> nfs_callback_down(); AFAICS, if nfs_callback_down() gets called before the kthread() function gets scheduled back in, then you can get left with a value of nfs_callback_info.task != NULL, since nfs_callback_svc() will never be called. Wouldn't it therefore make more sense to clear nfs_callback_info.task in nfs_callback_down()? Cheers Trond --
On Tue, 10 Jun 2008 15:58:29 -0400 I don't see this race. We can't call nfs_callback_down() until after nfs_callback_up() returns, so we're guaranteed to have "task" set to a valid task (presuming that nfs_callback_up() doesn't return error). We also can't return from nfs_callback_down() until after the nfs_callback_svc() has I suppose that makes just as much sense. It also seems more symmetrical given that we also set the var in nfs_callback_up(). I'll roll that into the BKL removal patch, and give it some testing. Look for it in a day or two... Thanks, -- Jeff Layton <jlayton@redhat.com> --
The code I'm alluding to is in kthread():
/* OK, tell user we're spawned, wait for stop or wakeup */
__set_current_state(TASK_UNINTERRUPTIBLE);
complete(&create->started);
schedule();
if (!kthread_should_stop())
ret = threadfn(data);
schedule() is called _after_ the complete() call, and _before_ we
execute threadfn() a.k.a. nfs_callback_svc(). If nfs_alloc_client() has
time to call nfs_callback_down() before the above thread gets scheduled
back in, then threadfn() doesn't get called at all, since
kthread_should_stop() is true.
Cheers
Trond
--
On Tue, 10 Jun 2008 16:33:32 -0400 That makes total sense. Thanks for clearing it up. Cheers, -- Jeff Layton <jlayton@redhat.com> --
On Tue, 10 Jun 2008 16:33:32 -0400 Hmm...this is a bigger problem than it looked at first glance. lockd also has a similar problem...ditto for the new nfsd code wrt module refcounts. In practice, I think the thread generally runs immediately (at least with current scheduler behavior), so we're probably not terribly vulnerable to this race. Still, we shouldn't rely on that... For lockd and the nfs4 callback thread, we'll also need to deal with the fact that svc_exit_thread() doesn't get called if this happens. So we'll need to call svc_exit_thread from the *_down() functions too (I presume that's OK). nfsd is a bigger problem since it exits on a signal. For that, perhaps we should declare a completion variable and have svc_set_num_threads() wait until nfsd() has actually run before continuing. Thoughts? -- Jeff Layton <jlayton@redhat.com> --
In the code I showed you, the 'kthread' task is put to sleep, then kthread_run() calls wake_up_process() on it, but the current task isn't scheduled out. Rather, it continues to run, so in almost all UP cases, the race is not only possible, it is actually rather likely if These *_up()/*_down() functions are getting very complex. Any chance we could hide some of this complexity in some helpers? Looking at the NFSv4 callback code and lockd, it seems that there might be a couple of nfsd doesn't use kthreads, does it? --
On Tue, 10 Jun 2008 17:37:56 -0400 Ok, makes sense. It looks like with the kernel_thread interface, that the function always runs, so we're not vulnerable to this race with Possibly. I may start by just closing the races and then look and see what Bruce just took a patchset to make it use kthreads too. I'm thinking 2.6.27 for that, presuming that we get all of these issues worked out. I think the only thing we can reasonably do there is make sure that the thread actually starts before allowing svc_set_num_threads to continue. -- Jeff Layton <jlayton@redhat.com> --
