Re: [2.6.26-rc4] mount.nfsv4/memory poisoning issues...

Previous thread: Re: 2.6.25 md oops during boot. by Neil Brown on Wednesday, June 4, 2008 - 4:29 pm. (1 message)

Next thread: [GIT]: Sparc by David Miller on Wednesday, June 4, 2008 - 4:56 pm. (1 message)
From: Daniel J Blueman
Date: Wednesday, June 4, 2008 - 4:33 pm

Having experienced 'mount.nfs4: internal error' when mounting nfsv4 in
the past, I have a minimal test-case I sometimes run:

$ while :; do mount -t nfs4 filer:/store /store; umount /store; done

After ~100 iterations, I saw the 'mount.nfs4: internal error',
followed by symptoms of memory corruption [1], a locking issue with
the reporting [2] and another (related?) memory-corruption issue
(off-by-1?) [3]. A little analysis shows memory being overwritten by
(likely) a poison value, which gets complicated if it's not
use-after-free...

Anyone dare confirm this issue? NFSv4 server is x86-64 Ubuntu 8.04
2.6.24-18, client U8.04 2.6.26-rc4; batteries included [4].

I'm happy to decode addresses, test patches etc.

Daniel

--- [1]

We see RAX (pointing to a structure) getting overwritten with the
poison value, so is 0x6b6b6b6b6b6b6b6b:

[4294454.232202] general protection fault: 0000 [1] PREEMPT SMP DEBUG_PAGEALLOC
[4294454.232328] CPU 1
[4294454.232379] Modules linked in: snd_hda_intel snd_pcm snd_timer
snd soundcore snd_page_alloc usbhid hid
[4294454.232623] Pid: 6795, comm: mount.nfs4 Not tainted 2.6.26-rc4-200 #2
[4294454.232657] RIP: 0010:[<ffffffff8024a12a>]  [<ffffffff8024a12a>]
task_rq_lock+0x3a/0xa0
[4294454.232723] RSP: 0018:ffff8100707cf878  EFLAGS: 00010002
[4294454.232756] RAX: 6b6b6b6b6b6b6b6b RBX: ffffffff808542c0 RCX:
0000000000000001
[4294454.232794] RDX: ffff81007dd52490 RSI: ffff8100707cf8b8 RDI:
0000000000000000
[4294454.233077] RBP: ffff8100707cf898 R08: 0000000000000001 R09:
0000000000000001
[4294454.233077] R10: 0000000000000002 R11: 0000000000000001 R12:
ffffffff808542c0
[4294454.233077] R13: ffff81007e378000 R14: ffff8100707cf8b8 R15:
ffff81007ebf77e0
[4294454.233077] FS:  00007f42363566e0(0000) GS:ffff81007fb41320(0000)
knlGS:0000000000000000
[4294454.233077] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[4294454.233077] CR2: 00000000023dc000 CR3: 0000000070711000 CR4:
00000000000026e0
[4294454.233077] DR0: 0000000000000000 DR1: ...
From: Chuck Lever
Date: Wednesday, June 4, 2008 - 4:43 pm

Hi Daniel-

On Wed, Jun 4, 2008 at 7:33 PM, Daniel J Blueman

We have some other reports of late model kernels with memory
corruption issues during NFS mount.  The problem is that by the time
these canaries start singing, the evidence of what did the corrupting

If these crashes are more or less reliably reproduced, it would be
helpful if you could do a 'git bisect' on the client to figure out at
what point in the kernel revision history this problem was introduced.

> # CONFI
From: Daniel J Blueman
Date: Sunday, June 15, 2008 - 11:10 am

Firstly, I had omitted that I'd booted the kernel with
debug_objects=1, which provides the canary here.

The primary failure I see is 'mount.nfs4: internal error', and always
after 358 umount/mount cycles (plus 1 initial mount) which gives us a
clue; 'netstat' shows all these connections in a TIME_WAIT state, thus
the bug relates to the inability to allocate a socket error path. I
found that after the connection lifetime expired, you can mount again,
which corroborates this theory.

In this case, we saw the mount() syscall result in the mount.nfsv4
process being SEGV'd when booted with 'debug_object=1', without this
option, we see:

# strace /sbin/mount.nfs4 x1:/ /store
...
mount("x1:/", "/store", "nfs4", 0,
"addr=192.168.0.250,clientaddr=19"...) = -1 EIO (Input/output error)

So, it's impossible to tell when the corruption was introduced, as it
has only become detectable recently.

It's worth a look-over of the socket-allocation error path, if someone
can check, and reproduces 100% with the 'debug_object=1' param,
available since 2.6.26-rc1 and 359 mounts in quick succession.

Thanks!
  Daniel
-- 
Daniel J Blueman
--

From: Chuck Lever
Date: Monday, June 16, 2008 - 9:17 am

Hi Daniel-



That's nicely specific.

I'm juggling several other problems at the moment, but I will try to  
get back to this again in a day or two.

There's a lot of new code in the NFS mount path.  "internal error"  
means the mount command has encountered something entirely  
unexpected.  So when anyone sees this message, please go ahead and  
report it on the linux-nfs@vger.kernel.org list.  Thanks!

I will see if I can make that message a little more explicit.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
--

From: Jeff Layton
Date: Thursday, June 19, 2008 - 5:14 am

On Sun, 15 Jun 2008 19:10:27 +0100

For some strange reason (probably something I'm doing wrong or maybe
something environmental), I've not been able to reproduce this panic on
a stock kernel. I did, however, apply the following fault injection
patch and was able to reproduce it on the second mount attempt. The 3
patch set that I posted last week definitely prevents the oops. If
you're able to confirm that it also fixes your panic it would be a
helpful data point.

The fault injection patch I'm using is attached. It just simulates
nfs4_init_client() consistently returning an error.

Cheers,
-- 
Jeff Layton <jlayton@redhat.com>
From: Daniel J Blueman
Date: Thursday, June 19, 2008 - 5:37 am

Thanks! I hope to get time for this tonight and will get back to you;
I appreciate your help, Jeff.

The config I used to reproduce it on the client is
http://www.quora.org/config-debug and is relevant to (at least)
2.6.26-rc5. I was able to reproduce it by exporting a single NFSv4
export on 2.6.25 (eg Ubuntu 8.04 server); client and server were
x86-64.

-- 
Daniel J Blueman
--

From: Chuck Lever
Date: Thursday, June 19, 2008 - 10:32 am

On Thu, Jun 19, 2008 at 8:37 AM, Daniel J Blueman

For the record, I was able to reproduce with a dual core 2.6.26-rc4
x86 client (using SLUB, not SLAB) on Fedora 8 against a Solaris 2008.5
NFSv4 server.  When the client runs out of ports, the mount commands
start to fail with EIO.  Then after a few dozen of these, the mounting
process hangs.  Sometimes it will BUG because freed & poisoned memory
is passed to kthread_stop().  I haven't seen some of the other
subsequent issues that Daniel initially reported.

-- 
I am certain that these presidents will understand the cry of the
people of Bolivia, of the people of Latin America and the whole world,
which wants to have more food and not more cars. First food, then if
something's left over, more cars, more automobiles. I think that life
has to come first.
-- Evo Morales
--

From: Jeff Layton
Date: Wednesday, June 4, 2008 - 5:35 pm

On Thu, 5 Jun 2008 00:33:54 +0100

Looks like it fell down while trying to take down the kthread during a
failed mount attempt. I have to wonder if I might have introduced a
race when I changed nfs4 callback thread to kthread API. I think we may
need the BKL around the last 2 statements in the main callback thread
function. If you can easily reproduce this, could you test the
following patch and let me know if it helps?

Note that this patch is entirely untested, so test it someplace
non-critical ;-).

Signed-off-by: Jeff Layton <jlayton@redhat.com>


diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index c1e7c83..a3e83f9 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -90,9 +90,9 @@ nfs_callback_svc(void *vrqstp)
 		preverr = err;
 		svc_process(rqstp);
 	}
-	unlock_kernel();
 	nfs_callback_info.task = NULL;
 	svc_exit_thread(rqstp);
+	unlock_kernel();
 	return 0;
 }
 
--

From: Daniel J Blueman
Date: Thursday, June 5, 2008 - 1:28 am

This doesn't resolve the issue, alas. I'll hunker down to the
bisection in the next few days.

Thanks,
  Daniel
-- 
Daniel J Blueman
--

From: Jeff Layton
Date: Thursday, June 5, 2008 - 3:32 am

On Thu, 5 Jun 2008 09:28:51 +0100

Yeah, after I sent that I went back and looked and it seemed like
while we don't hold the BKL there, access to the callback info is still
serialized via a semaphore. Thanks for testing it anyway...

-- 
Jeff Layton <jlayton@redhat.com>
--

From: Trond Myklebust
Date: Tuesday, June 10, 2008 - 11:54 am

We certainly need to protect nfs_callback_info.task (and I believe I
explained this earlier), but why do we need to protect svc_exit_thread?

Also, looking at the general use of the BKL in that code, I thought we
agreed that there was no need to hold the BKL while taking the
nfs_callback_mutex?

Trond

--

From: Jeff Layton
Date: Tuesday, June 10, 2008 - 12:13 pm

On Tue, 10 Jun 2008 14:54:48 -0400

Hmm, I don't remember that discussion, but I'll take your word for it...

I think you're basically correct, but it looks to me like the
nfs_callback_mutex actually protects nfs_callback_info.task as well.

If we're starting the thread, then we can't call kthread_stop on it
until we release the mutex. So the thread can't exit until we release
the mutex, and we can be guaranteed that this:

     nfs_callback_info.task = NULL;

...can't happen until after kthread_run returns and nfs_callback_up
sets it.

If that's right, then maybe this (untested, RFC only) patch would make sense?

----------[snip]------------

From cd0ce86919ede3f1abda1ba7522b72283bb94d7e Mon Sep 17 00:00:00 2001
From: Jeff Layton <jlayton@redhat.com>
Date: Tue, 10 Jun 2008 15:12:16 -0400
Subject: [PATCH] nfs4: remove BKL from nfs_callback_up and nfs_callback_down

The nfs_callback_mutex is sufficient protection. We don't need the
BKL here.

Signed-off-by: Jeff Layton <jlayton@redhat.com>
---
 fs/nfs/callback.c |    4 ----
 1 files changed, 0 insertions(+), 4 deletions(-)

diff --git a/fs/nfs/callback.c b/fs/nfs/callback.c
index c1e7c83..9e713d2 100644
--- a/fs/nfs/callback.c
+++ b/fs/nfs/callback.c
@@ -105,7 +105,6 @@ int nfs_callback_up(void)
 	struct svc_rqst *rqstp;
 	int ret = 0;
 
-	lock_kernel();
 	mutex_lock(&nfs_callback_mutex);
 	if (nfs_callback_info.users++ || nfs_callback_info.task != NULL)
 		goto out;
@@ -149,7 +148,6 @@ out:
 	if (serv)
 		svc_destroy(serv);
 	mutex_unlock(&nfs_callback_mutex);
-	unlock_kernel();
 	return ret;
 out_err:
 	dprintk("Couldn't create callback socket or server thread; err = %d\n",
@@ -163,13 +161,11 @@ out_err:
  */
 void nfs_callback_down(void)
 {
-	lock_kernel();
 	mutex_lock(&nfs_callback_mutex);
 	nfs_callback_info.users--;
 	if (nfs_callback_info.users == 0 && nfs_callback_info.task != NULL)
 		kthread_stop(nfs_callback_info.task);
 	mutex_unlock(&nfs_callback_mutex);
-	unlock_kernel();
 }
 
 ...
From: Jeff Layton
Date: Tuesday, June 10, 2008 - 12:18 pm

On Tue, 10 Jun 2008 15:13:57 -0400

To clarify for Dan...

I don't think that this patch will help the problem you're having. This
is essentially a cleanup patch to remove some locking that doesn't
appear to be needed.

The original patch that Trond commented on above is also probably
unnecessary (assuming I'm right about the locking here).

-- 
Jeff Layton <jlayton@redhat.com>
--

From: Daniel J Blueman
Date: Tuesday, June 10, 2008 - 1:27 pm

Thanks for the head-up, Jeff. I took it at face value, so didn't
harbour the notion it would fix the memory corruption.

Let's see If I can get time for this git bisect sooner rather than later...
-- 
Daniel J Blueman
--

From: Jeff Layton
Date: Wednesday, June 18, 2008 - 5:07 am

On Tue, 10 Jun 2008 21:27:07 +0100

I've tried reproducing this, but haven't had much success (probably some
differences in my kernel config).

I suspect that Trond is correct here and the race has something to
do with the kthread being spawned but nfs_callback_svc() never getting
a chance to run. I posted a patchset to the list late last week with the
intro email:

[PATCH 0/3] fix potential races in lockd and nfs4-callback startup/shutdown

Dan, could you apply that patchset to your kernel and see if it helps
this problem?

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>
--

From: Daniel J Blueman
Date: Saturday, June 21, 2008 - 10:52 am

Yes, I confirm the problem reproduces without this patchset and I'm
unable to reproduce it with the patchset, so looks good!

Thanks again for the all the help and fixes!
  Dan
-- 
Daniel J Blueman
--

From: Trond Myklebust
Date: Tuesday, June 10, 2008 - 12:58 pm

Hmm... I suppose that is correct, but what if nfs_alloc_client() does

	nfs_callback_up();
	<kstrdup() fails>
	nfs_callback_down();

AFAICS, if nfs_callback_down() gets called before the kthread() function
gets scheduled back in, then you can get left with a value of
nfs_callback_info.task != NULL, since nfs_callback_svc() will never be
called.

Wouldn't it therefore make more sense to clear nfs_callback_info.task in
nfs_callback_down()?

Cheers
  Trond

--

From: Jeff Layton
Date: Tuesday, June 10, 2008 - 1:13 pm

On Tue, 10 Jun 2008 15:58:29 -0400

I don't see this race.

We can't call nfs_callback_down() until after nfs_callback_up()
returns, so we're guaranteed to have "task" set to a valid task
(presuming that nfs_callback_up() doesn't return error). We also can't
return from nfs_callback_down() until after the nfs_callback_svc() has

I suppose that makes just as much sense. It also seems more symmetrical
given that we also set the var in nfs_callback_up(). I'll roll that into
the BKL removal patch, and give it some testing. Look for it in a day
or two...

Thanks,
-- 
Jeff Layton <jlayton@redhat.com>
--

From: Trond Myklebust
Date: Tuesday, June 10, 2008 - 1:33 pm

The code I'm alluding to is in kthread():

       /* OK, tell user we're spawned, wait for stop or wakeup */
        __set_current_state(TASK_UNINTERRUPTIBLE);
        complete(&create->started);
        schedule();

        if (!kthread_should_stop())
                ret = threadfn(data);

schedule() is called _after_ the complete() call, and _before_ we
execute threadfn() a.k.a. nfs_callback_svc(). If nfs_alloc_client() has
time to call nfs_callback_down() before the above thread gets scheduled
back in, then threadfn() doesn't get called at all, since
kthread_should_stop() is true.

Cheers
  Trond

--

From: Jeff Layton
Date: Tuesday, June 10, 2008 - 1:41 pm

On Tue, 10 Jun 2008 16:33:32 -0400

That makes total sense. Thanks for clearing it up.

Cheers,
-- 
Jeff Layton <jlayton@redhat.com>
--

From: Jeff Layton
Date: Tuesday, June 10, 2008 - 2:01 pm

On Tue, 10 Jun 2008 16:33:32 -0400

Hmm...this is a bigger problem than it looked at first glance. lockd
also has a similar problem...ditto for the new nfsd code wrt module
refcounts.

In practice, I think the thread generally runs immediately (at least
with current scheduler behavior), so we're probably not terribly
vulnerable to this race. Still, we shouldn't rely on that...

For lockd and the nfs4 callback thread, we'll also need to deal with
the fact that svc_exit_thread() doesn't get called if this happens. So
we'll need to call svc_exit_thread from the *_down() functions too
(I presume that's OK).

nfsd is a bigger problem since it exits on a signal. For that, perhaps
we should declare a completion variable and have svc_set_num_threads()
wait until nfsd() has actually run before continuing.

Thoughts?

-- 
Jeff Layton <jlayton@redhat.com>
--

From: Trond Myklebust
Date: Tuesday, June 10, 2008 - 2:37 pm

In the code I showed you, the 'kthread' task is put to sleep, then
kthread_run() calls wake_up_process() on it, but the current task isn't
scheduled out. Rather, it continues to run, so in almost all UP cases,
the race is not only possible, it is actually rather likely if

These *_up()/*_down() functions are getting very complex. Any chance we
could hide some of this complexity in some helpers? Looking at the NFSv4
callback code and lockd, it seems that there might be a couple of

nfsd doesn't use kthreads, does it?



--

From: Jeff Layton
Date: Tuesday, June 10, 2008 - 3:04 pm

On Tue, 10 Jun 2008 17:37:56 -0400

Ok, makes sense. It looks like with the kernel_thread interface, that
the function always runs, so we're not vulnerable to this race with

Possibly. I may start by just closing the races and then look and see what

Bruce just took a patchset to make it use kthreads too. I'm thinking
2.6.27 for that, presuming that we get all of these issues worked out.

I think the only thing we can reasonably do there is make sure that
the thread actually starts before allowing svc_set_num_threads to
continue.

-- 
Jeff Layton <jlayton@redhat.com>
--

Previous thread: Re: 2.6.25 md oops during boot. by Neil Brown on Wednesday, June 4, 2008 - 4:29 pm. (1 message)

Next thread: [GIT]: Sparc by David Miller on Wednesday, June 4, 2008 - 4:56 pm. (1 message)