Re: 2.6.27-rc7-sha1: EIP at proc_sys_compare+0x36/0x50

Previous thread: Kernel unable to adjust timeofday by Rafal on Friday, September 26, 2008 - 6:29 am. (3 messages)

Next thread: [PATCH] ARM: Delete ARM's own cnt32_to_63.h by David Howells on Friday, September 26, 2008 - 8:22 am. (1 message)
From: Alexey Dobriyan
Date: Friday, September 26, 2008 - 8:20 am

Gentlemen, this happened while script was slowly rebuilding 300+ configs
sequentially. Very little recompling activity itself, much seeking.

This is first time I see this. No debugging was on, no preemption.

Version: 2.6.27-rc7-c0f4d6d4b14a75a341d972ff73fb9740e1ceb634 +
	atl1 fixlet + "notes" kobject fixlet, but they don't matter.


ffffffff802bc690 <proc_sys_compare>:
ffffffff802bc690:	48 83 ec 08          	sub    $0x8,%rsp
ffffffff802bc694:	8b 46 04             	mov    0x4(%rsi),%eax
ffffffff802bc697:	3b 42 04             	cmp    0x4(%rdx),%eax
ffffffff802bc69a:	49 89 f0             	mov    %rsi,%r8
ffffffff802bc69d:	74 11                	je     ffffffff802bc6b0 <proc_sys_compare+0x20>
ffffffff802bc69f:	b8 01 00 00 00       	mov    $0x1,%eax
ffffffff802bc6a4:	48 83 c4 08          	add    $0x8,%rsp
ffffffff802bc6a8:	c3                   	retq   
ffffffff802bc6a9:	0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
ffffffff802bc6b0:	48 8b 76 08          	mov    0x8(%rsi),%rsi
ffffffff802bc6b4:	48 8b 7a 08          	mov    0x8(%rdx),%rdi
ffffffff802bc6b8:	89 c1                	mov    %eax,%ecx
ffffffff802bc6ba:	fc                   	cld    
ffffffff802bc6bb:	48 39 c9             	cmp    %rcx,%rcx
ffffffff802bc6be:	f3 a6                	repz cmpsb %es:(%rdi),%ds:(%rsi)
ffffffff802bc6c0:	75 dd                	jne    ffffffff802bc69f <proc_sys_compare+0xf>
ffffffff802bc6c2:	49 8b 40 e0          	mov    -0x20(%r8),%rax
ffffffff802bc6c6: ===>	48 8b 78 f0          	mov    -0x10(%rax),%rdi		<===
ffffffff802bc6ca:	e8 71 96 f7 ff       	callq  ffffffff80235d40 <sysctl_is_seen>
ffffffff802bc6cf:	85 c0                	test   %eax,%eax
ffffffff802bc6d1:	0f 94 c0             	sete   %al
ffffffff802bc6d4:	48 83 c4 08          	add    $0x8,%rsp
ffffffff802bc6d8:	0f b6 c0             	movzbl %al,%eax
ffffffff802bc6db:	c3                   	retq   
ffffffff802bc6dc:	0f 1f 40 00          	nopl   0x0(%rax)


[16526.029537] BUG: unable to handle kernel paging request at fffffffffffffff0
[16526.029643] ...
From: Linus Torvalds
Date: Friday, September 26, 2008 - 8:47 am

That would be the

	sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl)


The whole PROC_I() thing just offsets from the inode:

	container_of(inode, struct proc_inode, vfs_inode);

and 'sysctl' is indeed 16 bytes below the vfs inode on x86-64:

	struct proc_inode {
		...
	        struct ctl_table_header *sysctl;
	        struct ctl_table *sysctl_entry;
	        struct inode vfs_inode;
	};

and as far as I can tell, there is nothing to say that a /proc inode 
cannot be a negative dentry. Sure, we try to get rid of them, but during a 
parallel lookup, we will have added the dentry with a NULL inode in the 
other lookup.

So assuming that you have an inode at that point seems to be utter crap.

Now, the whole _function_ is utter crap and should probably be dropped, 
but whatever. That's just another sysctl insanity. In the meantime, 
something like this does look appropriate, no?

Al, did I miss something?

		Linus
---
 fs/proc/proc_sysctl.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/proc/proc_sysctl.c b/fs/proc/proc_sysctl.c
index f9a8b89..9435fd0 100644
--- a/fs/proc/proc_sysctl.c
+++ b/fs/proc/proc_sysctl.c
@@ -386,6 +386,8 @@ static int proc_sys_compare(struct dentry *dir, struct qstr *qstr,
 		return 1;
 	if (memcmp(qstr->name, name->name, name->len))
 		return 1;
+	if (!dentry->d_inode)
+		return 1;
 	return !sysctl_is_seen(PROC_I(dentry->d_inode)->sysctl);
 }
 
--

From: Eric W. Biederman
Date: Saturday, September 27, 2008 - 1:44 am

Where is that? 

This is an issue because if we can get negative dentries the other
dentry operations are wrong as well.n

We hold the directory mutex in real_lookup
before calling i_op->lookup.  So lookups should be serialized.
proc_sys_lookup always either succeeds and calls d_add with
a valid inode or fails and returns an error code in which case
real_lookup puts the dentry before it is hashed.

We hold the directory mutex in readdir before calling
file->f_op->readdir.  Deep inside of proc_sys_readdir
proc_sys_fill_cache only adds the dentry if it has an inode.

do_revalidate doesn't look like it could get us there.
Nor does any of the paths that call ->d_delete.

It looks like don't free the inode from an non-negative
dentry until after it is unhashed.


Clearly we don't have an inode there but I don't see we can end up with

If we can't avoid negative dentries that looks appropriate.

But won't .d_revalidate and .d_delete be susceptible to the same
problem if we do have negative dentries?  

Eric
--

From: Linus Torvalds
Date: Sunday, September 28, 2008 - 1:38 pm

Irrelevant.

The d_compare function si called _before_ we get the directory mutex. It's 
done purely under dentry->dlock (and the RCU read lock).


lookups are serialized before calling into the filesystem with ->lookup, 
but _not_ at the d_compare level. If we serialized d_compare, the dentry 
cache would be no use at all, we'd serialize all lookups, cached or not.

(Of course, sane filesystems will not have d_compare at all, just the 
memcmp, but we're talking /proc here - although I forget why it wants to 
do that insane d_compare thing)

So forget the directory mutex. d_compare is much more low-level than that. 
It can hit a dentry that is being unhashed at the same time for whatever 
reason (memory pressure, whatever).

The fact is, the "->lookup()" function is meant to be easy for filesystems 
to use, but d_compare i really low-level dentry magic and is meant to look 
at just the *name*. It's meant to be a replacement for memcpy() for things 
like case-independent comparisons or strange utf-8 rules (ie "equivalent" 
characters). It has no real locking, since the names are supposed to be 
"stable" anyway (the dentry->d_lock should guarantee that the dentry isn't 
being renamed etc).

I may be missing something, of course, but the dentry eas actually found 
_before_ takign even the dentry->d_lock, so the dentry we call compare on 
may not even be *valid* any more, because something might have unhashed it 
_after_ we found it, but _before_ we got d_lock.

d_compare() really is pretty special (d_revalidate is too, for that 
matter, although at least _slightly_ less so: by that time we have at 
least tested that the UNHASHED bit isn't set because we raced with 
removal etc).

		Linus
--

From: Al Viro
Date: Sunday, September 28, 2008 - 7:18 am

The real underlying bug, whatever it is.  If this sucker ever becomes
negative, we have a big problem.  Where _could_ that happen?  Remember,
we do not allow ->rmdir() and ->unlink() to succeed there.  So d_delete()
callers in namei.c are out of question.  We also never do d_add() with
NULL inode in there.  We _might_ be doing a bogus d_rehash() on a negative
/prooc/sys/<something> dentry that had never been hashed to start with
somewhere in generic code, but... I don't see where that could happen.
vfs_rename_dir() with negative new_dentry would have to get it from
something and that would have to be ->lookup().  And that sucker returns
ERR_PTR() or a positive dentry in all cases here.  d_splice_alias() is not
used there at all; d_move_locked() would scream bloody murder if dentry
it's rehashing is negative.  d_materialize_unique() and d_add_unique()
are not used.  So just WTF is creating this sucker?

IOW, your patch will probably be enough to stop the visible problem, but
I would dearly like to understand what's really causing it.  It appears to
be a refcounting breakage somewhere and we have *another* bug report that
smells like that - it seems like we sometimes end up with negative dentry
on alias list of an inode (outside of /proc/sys, AFAICT).  Something really
fishy is going on...
--

From: Hugh Dickins
Date: Sunday, September 28, 2008 - 12:28 pm

I got a couple of earlier instances of this on powerpc
http://lkml.org/lkml/2008/8/14/289
but saw nothing more of it, so asked Al to forget about it.

But today I've got it again, this time on x86_64, with kdb in
(but not serial console), similar kernel builds with swapping
loads as before.  Though with Andrew's latest mmotm, so some
details different from 2.6.27-rc, and could be an mmotm bug.

The dentry in question (it's for /proc/sys/kernel/ngroups_max)
looks as if the __d_drop and d_kill of prune_one_dentry() came
in on one cpu just after __d_lookup() had found the entry on
parent's hashlist, just before it acquired dentry->d_lock.

That's plausible, isn't it, and would account for the rarity,
and would say Linus's patch is good?

Do ask me for any details you'd like out of the dentry.

Hugh
--

From: Linus Torvalds
Date: Sunday, September 28, 2008 - 1:55 pm

Ok, you were definitely under memory pressure, and yes, it looks like the 
exact same bug on ppc64 - access to a pointer that is two poointers offset 


I actually like my second patch better - it looks simpler, and it means 
that the rules for filesystems using d_compare() are a bit clearer: at 
least we'll only pass them dentries to look at that haven't gone through 
d_drop (and we do hold dentry->d_lock that serializes all of that).

So here it is again (I sent it out just minutes ago, but you weren't on 
that cc, you must have picked this up off the kernel list)

NOTE! Totally untested patch! It looks sane and really obvious, but maybe 
it has some insane and non-obvious bug.

		Linus

---
 fs/dcache.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 80e9395..e7a1a99 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1395,6 +1395,10 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 		if (dentry->d_parent != parent)
 			goto next;
 
+		/* non-existing due to RCU? */
+		if (d_unhashed(dentry))
+			goto next;
+
 		/*
 		 * It is safe to compare names since d_move() cannot
 		 * change the qstr (protected by d_lock).
@@ -1410,10 +1414,8 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 				goto next;
 		}
 
-		if (!d_unhashed(dentry)) {
-			atomic_inc(&dentry->d_count);
-			found = dentry;
-		}
+		atomic_inc(&dentry->d_count);
+		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
 next:
--

From: Linus Torvalds
Date: Sunday, September 28, 2008 - 1:59 pm

Oh. I think I see at least a _potential_ insane and non-obvious bug: if 
somebody actually is going to do a __d_drop() _inside_ their d_compare(), 
this would fail horribly because we now assume that the dentry is still 
fine, since we held d_lock.

Of course, I think that would be very very buggy of a filesystem to do (we 
don't even pass in the dentry as an argument - you have to figure it out 
from the qstr, and a filesystem really should not do that!), but /proc 
_does_ look up the dentry in question, maybe some other insane filesystem 
does too and then does the __d_drop.

I'm not seeing it, though. So I still think the patch is sane and good, 
but somebody really needs to double- or triple-check me on it.

		Linus
--

From: Hugh Dickins
Date: Sunday, September 28, 2008 - 3:07 pm

Looks good to me, nicer than the first, and would have prevented my
oops today (if I'm interpreting it correctly: certainly I do have

I agree that would be insane.  There's no end to the weird things
a filesystem _could_ do in its d_compare, but it is supposed to be
about comparison, and every filesystem I can see in the tree treats

Hugh
--

From: Eric W. Biederman
Date: Sunday, September 28, 2008 - 8:05 pm

We definitely have a race between d_kill setting dentry->d_inode = NULL
and proc_sys_compare reading d_inode. 

We don't generate negative dentries for /proc/sys.

In dput atomic_dec_and_lock takes the lock before setting the count to 0.
So there is no race there.

Testing for d_unhashed and getting us out of rcu limbo before calling
into the filesystem methods makes the reasoning a lot clearer.


Looks good to me.

Reviewed-by: "Eric W. Biederman" <ebiederm@xmission.com>

--

From: Linus Torvalds
Date: Sunday, September 28, 2008 - 1:46 pm

What about pure memory pressure? We're holding only the RCU read-side lock 
when looking up dentries, and if there is any memory pressure, the 
dentries may be unhashed and the inodes removed in parallel. Yes, yes, we 
end up not actually _releasing_ the dentry, since it's all RCU, but it 
will set D_UNHASHED and be scheduled for releasing later under RCU.

And d_compare() is called before we have done any validation that the name 
is still active, including checking whether it even got released already! 

I dunno. Do we want to move the D_UNHASHED check up earlier? Or am I still 
missing something?

		Linus

----
 fs/dcache.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 80e9395..e7a1a99 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1395,6 +1395,10 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 		if (dentry->d_parent != parent)
 			goto next;
 
+		/* non-existing due to RCU? */
+		if (d_unhashed(dentry))
+			goto next;
+
 		/*
 		 * It is safe to compare names since d_move() cannot
 		 * change the qstr (protected by d_lock).
@@ -1410,10 +1414,8 @@ struct dentry * __d_lookup(struct dentry * parent, struct qstr * name)
 				goto next;
 		}
 
-		if (!d_unhashed(dentry)) {
-			atomic_inc(&dentry->d_count);
-			found = dentry;
-		}
+		atomic_inc(&dentry->d_count);
+		found = dentry;
 		spin_unlock(&dentry->d_lock);
 		break;
 next:
--

From: Linus Torvalds
Date: Sunday, September 28, 2008 - 1:50 pm

Oh, it's not D_UNHASHED, it's DCACHE_UNHASHED. Whatever. The patch still 
looks fine, I had just forgotten the naming of the d_flags fields.

		Linus
--

Previous thread: Kernel unable to adjust timeofday by Rafal on Friday, September 26, 2008 - 6:29 am. (3 messages)

Next thread: [PATCH] ARM: Delete ARM's own cnt32_to_63.h by David Howells on Friday, September 26, 2008 - 8:22 am. (1 message)