Re: still nfs problems [Was: Linux 2.6.37-rc8]

Previous thread: GPIO on VIA Epia by Udo van den Heuvel on Thursday, December 30, 2010 - 9:20 am. (1 message)

Next thread: [PATCHSET] x86: unify x86_32 and 64 NUMA init paths, take#4 by Tejun Heo on Thursday, December 30, 2010 - 10:49 am. (18 messages)
From: Uwe Kleine-König
Date: Thursday, December 30, 2010 - 10:14 am

Hello,

I wonder if the nfs-stuff is considered to be solved, because I still
see strange things.

During boot my machine sometimes (approx one out of two times) hangs with
the output pasted below on SysRq-T.  The irq

I'm not 100% sure it's related, but at least it seems to hang in
nfs_readdir.  (When the serial irq happened that triggered the sysrq the
program counter was at 0xc014601c, which is fs/nfs/dir.c:647 for me.)

This is on 2.6.37-rc8 plus some patches for machine support on an ARM
machine.

Best regards
Uwe

[ 2700.100000] SysRq : Show State
[ 2700.100000]   task                PC stack   pid father
[ 2700.100000] init          S c0285d80     0     1      0 0x00000000
[ 2700.100000] Backtrace: 
[ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c004f268>] (do_wait+0x1a4/0x20c)
[ 2700.100000] [<c004f0c4>] (do_wait+0x0/0x20c) from [<c004f378>] (sys_wait4+0xa8/0xc0)
[ 2700.100000] [<c004f2d0>] (sys_wait4+0x0/0xc0) from [<c0033e80>] (ret_fast_syscall+0x0/0x38)
[ 2700.100000]  r8:c0034088 r7:00000072 r6:00000001 r5:0000001b r4:0140b228
[ 2700.100000] kthreadd      S c0285d80     0     2      0 0x00000000
[ 2700.100000] Backtrace: 
[ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c006a30c>] (kthreadd+0x70/0xfc)
[ 2700.100000] [<c006a29c>] (kthreadd+0x0/0xfc) from [<c004f4d8>] (do_exit+0x0/0x658)
[ 2700.100000] ksoftirqd/0   S c0285d80     0     3      2 0x00000000
[ 2700.100000] Backtrace: 
[ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from [<c0052714>] (run_ksoftirqd+0x5c/0x110)
[ 2700.100000] [<c00526b8>] (run_ksoftirqd+0x0/0x110) from [<c006a294>] (kthread+0x8c/0x94)
[ 2700.100000]  r8:00000000 r7:c00526b8 r6:00000000 r5:c7843f1c r4:c7859fac
[ 2700.100000] [<c006a208>] (kthread+0x0/0x94) from [<c004f4d8>] (do_exit+0x0/0x658)
[ 2700.100000]  r7:00000013 r6:c004f4d8 r5:c006a208 r4:c7843f1c
[ 2700.100000] kworker/0:0   S c0285d80     0     4      2 0x00000000
[ 2700.100000] Backtrace: 
[ 2700.100000] [<c02858c8>] (schedule+0x0/0x534) from ...
From: Linus Torvalds
Date: Thursday, December 30, 2010 - 10:57 am

Please cc the poor hapless NFS people too, who probably otherwise
wouldn't see it. And Arnd just in case it might be locking-related.

Trond, any ideas? The sysrq thing does imply that it's stuck in some
busy-loop in fs/nfs/dir.c, and line 647 is get_cache_page(), which in
turn implies that the endless loop is either the loop in
readdir_search_pagecache() _or_ in a caller. In particular, the
EBADCOOKIE case in the caller (nfs_readdir) looks suspicious. What
protects us from endless streams of EBADCOOKIE and a successful
uncached_readdir?

                     Linus

--

From: Trond Myklebust
Date: Thursday, December 30, 2010 - 11:24 am

There is nothing we can do to protect ourselves against an infinite loop
if the server (or underlying filesystem) is breaking the rules w.r.t.
cookie generation. It should be possible to recover from all other
situations.
IOW: if the server generates non-unique cookies, then we're screwed.
Fixing that particular problem is impossible since it is basically a
variant of the halting problem.
That was why I asked which filesystem is being exported in my previous
reply.

The point of 'uncached_readdir' is to resolve a cookie that was
previously valid, but has since been invalidated; usually that is due to
the file having been unlinked. If it succeeds, it should result in a new
set of valid entries being posted to the 'filldir' callback, and a new
cookie being set in the filp->private (i.e. we should have made
progress). If it fails, we exit, as you can see.

Cheers
  Trond

-- 
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
www.netapp.com

--

From: Linus Torvalds
Date: Thursday, December 30, 2010 - 11:50 am

On Thu, Dec 30, 2010 at 10:24 AM, Trond Myklebust

Umm. Sure there is. Just make sure that you return the uncached entry
to user space, rather than loop forever.

Looping forever in kernel space is a bad idea. How about just changing
the "continue" into a "break" for the "uncached readdir returned
success".
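The difference between "continue" and "break" can be seen in a tiny model. This is a hypothetical user-space sketch, not the fs/nfs/dir.c code: search_pagecache() and uncached_readdir() are stand-ins that always report a cache miss and an uncached success, the combination Linus asked about, and max_passes is an artificial guard added only so the "continue" variant can be observed without actually hanging.

```c
/* Kernel-internal NFS errno value; user-space <errno.h> does not define it. */
#ifndef EBADCOOKIE
#define EBADCOOKIE 523
#endif

/* Hypothetical stand-ins: the cache lookup never finds the cookie,
 * and the uncached path always "succeeds". */
static int search_pagecache(void) { return -EBADCOOKIE; }
static int uncached_readdir(void) { return 0; }

/* Returns the number of loop passes taken before returning to user
 * space, or -1 if the artificial max_passes guard was hit. */
static int readdir_passes(int break_after_uncached, int max_passes)
{
    int passes = 0;

    while (passes < max_passes) {
        int res = search_pagecache();

        passes++;
        if (res == -EBADCOOKIE) {
            res = uncached_readdir();
            if (res == 0) {
                if (break_after_uncached)
                    return passes;  /* hand what we have to user space */
                continue;           /* retry the cache: may never end */
            }
        }
        return passes;  /* cache hit or hard error: leave the loop */
    }
    return -1;
}
```

With break_after_uncached set, the model returns after a single pass; without it, only the artificial guard stops the loop.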

No halting problems, no excuses. There is absolutely _no_ excuse for
an endless loop in kernel mode. Certainly not "the other end is
incompetent".

EVERYBODY is incompetent sometimes. That just means that you must
never trust the other end too much. You can't say "we require the
server to be sane in order not to lock up".

                    Linus
--

From: Trond Myklebust
Date: Thursday, December 30, 2010 - 12:25 pm

uncached_readdir is not really a problem. The real problem is
filesystems that generate "infinite directories" by producing looping
combinations of cookies.

IOW: I've seen servers that generate cookies in a sequence of a form
vaguely resembling

1 2 3 4 5 6 7 8 9 10 11 12 3...

(with possibly a thousand or so entries between the first and second
copy of '3')
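Detecting this from user space amounts to noticing that a cookie value has already been seen. A minimal sketch, assuming the program can collect the per-entry cookies into an array; the function name first_repeated_cookie is hypothetical. A repeat only marks the stream as suspicious: a finite but broken sequence can also contain duplicates.

```c
#include <stddef.h>
#include <stdint.h>

/* Return the index of the first cookie that was already seen earlier
 * in the stream, or -1 if all cookies are distinct.  Quadratic, which
 * is fine for an illustration. */
static long first_repeated_cookie(const uint64_t *cookies, size_t n)
{
    for (size_t i = 1; i < n; i++)
        for (size_t j = 0; j < i; j++)
            if (cookies[j] == cookies[i])
                return (long)i;
    return -1;
}
```

For the sequence 1 2 3 ... 12 3, the function reports index 12, the position of the second '3'.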

The kernel won't loop forever with something like that (because
eventually filldir() will declare it is out of buffer space), but
userland has a halting problem: it needs to detect that every
sys_getdents() call it is making is generating another copy of the

We should never get an endless loop in _kernel mode_ with the current

Unfortunately we must. Call it an NFS protocol failure, but it really
boils down to the fact that POSIX readdir() generates a data stream with
no well-defined concept of an offset. As a result, each and every
filesystem has their own interesting ways of generating cookies to
represent that 'offset'.

Trond

--

From: Linus Torvalds
Date: Thursday, December 30, 2010 - 1:02 pm

On Thu, Dec 30, 2010 at 11:25 AM, Trond Myklebust

But if we don't have any lseek's, the readdir cache should trivially
take care of this by just incrementing the page_index, and we should
return to user space the (eventually ending) sequence, even if there
are duplicate numbers.

(Also, I suspect that "page_index" should not be a page index, but a
position, and then the "search_for_pos()" should use that instead
of the file_pos/current_index thing, but that's a detail that would
show up only when you have duplicate cookies within one page worth of
directory caches)

And if the server really sends us an infinite stream of entries, then
that's fine - at least we give to user space the infinite entries that
were given to us, instead of _generating_ an infinite stream from what
was a finite - but broken - stream).

So it seems wrong that the directory caching code resets page_index to
the start when it then does an uncached readdir. That seems wrong.

I'm sure there's some reason for it, but wouldn't it be nice if the
rule for page_index was that it starts off at zero, and only gets

Ok, so that's obviously broken, but it's then _doubly_ broken to turn
that long broken sequence into an _endless_ broken sequence.

And I agree that when user space sees such an endless broken sequence,
it's a real stopping problem for user space. But in the absence of
lseek, it should _never_ be a problem for the kernel itself, afaik.
The kernel should happily return just the broken sequence. No?

So then perhaps the solution is to just remove the resetting of
page_index in the uncached_readdir() function? Make sure that the
page_index is monotonically increasing for any readdir(), and you
protect against turning a bad sequence into an endless sequence.
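The effect of resuming by cookie versus resuming by a monotonically increasing position can be shown with a toy model. This is a hypothetical sketch, not the NFS client code: the cached stream holds the finite but broken cookie sequence 1..12, 3, 13, 14, each "call" hands one entry to user space, and budget is an artificial limit so the looping variant can be observed.

```c
#include <stddef.h>

/* Hypothetical cached directory: each entry carries the cookie a client
 * would use to resume *after* that entry.  Note the duplicate 3. */
static const unsigned long cookies[] = {
    1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 3, 13, 14
};
#define NENTRIES (sizeof cookies / sizeof cookies[0])

/* Resume by searching the cache for the last cookie (first match wins).
 * Returns the number of entries emitted, or -1 if budget ran out. */
static long emit_resume_by_cookie(size_t budget)
{
    size_t emitted = 0;
    unsigned long last = 0;            /* 0 = start of directory */

    for (;;) {
        size_t start = 0;

        if (last != 0) {
            size_t i;
            for (i = 0; i < NENTRIES && cookies[i] != last; i++)
                ;
            start = i + 1;             /* entry after the match */
        }
        if (start >= NENTRIES)
            return (long)emitted;      /* end of directory */
        if (emitted == budget)
            return -1;                 /* would loop forever */
        last = cookies[start];         /* emit one entry */
        emitted++;
    }
}

/* Resume from a position that only ever increases. */
static long emit_resume_by_position(size_t budget)
{
    size_t pos = 0, emitted = 0;

    while (pos < NENTRIES) {
        if (emitted == budget)
            return -1;
        pos++;                         /* emit entry at pos, move on */
        emitted++;
    }
    return (long)emitted;
}
```

Searching by cookie jumps from the second '3' back to the first one and cycles through 4..12 forever, while the position-based variant returns all 15 entries once and stops.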

Of course, lseek() will have to reset page_index to zero, and if
somebody does an lseek() on the directory, then the duplicate '3'
entry in the cookie sequence will inevitably be ambiguous, but that
really is unavoidable. And rare. People ...
From: Trond Myklebust
Date: Thursday, December 30, 2010 - 10:59 am

Ccing linux-nfs@vger.kernel.org

What filesystem are you exporting on the server? What is the NFS
version? Is this nfsroot, autofs or an ordinary nfs mount?

In short, how can we reproduce this?

Trond

--

From: Uwe Kleine-König
Date: Thursday, December 30, 2010 - 12:18 pm

Hi Trond,

Yeah, good idea.  I had that ~2min after sending my report during
This is an nfsroot of /home/ukl/nfsroot/tx28 which is a symlink to a
directory on a different partition.  I don't know the filesystem of my
homedir as it resides on a server I have no access to, but I asked the
admin, so I can follow up with this info later (I'd suspect ext3, too).
The real root directory is on ext3 (rw,noatime).

The serving nfs-server is Debian's nfs-kernel-server 1:1.2.2-1.
nfs-related kernel parameters are

	ip=dhcp root=/dev/nfs nfsroot=192.168.23.2:/home/ukl/nfsroot/tx28,v3,tcp

I hope this answers your questions.  If not, please ask.

I tried without the symlink and saw some different errors, e.g.

	starting splashutils daemon.../etc/rc.d/S00splashutils: line 50: //sbin/fbsplashd.static: Unknown error 521

(this is the init script that hung before) and

	[    6.160000] NFS: server 192.168.23.2 error: fileid changed
	[    6.160000] fsid 0:c: expected fileid 0x33590a4, got 0x4d11bedc

but no hang as before.  So maybe it's related to the symlink?  I don't
know if testing that further would help or just be a waste of my time, so
please let me know if I can help you and how.

Best regards
Uwe

-- 
Pengutronix e.K.                           | Uwe Kleine-König            |
Industrial Linux Solutions                 | http://www.pengutronix.de/  |
--

From: Uwe Kleine-König
Date: Monday, January 3, 2011 - 2:38 pm

If that matters, kernel is linux-image-2.6.32-5-amd64 (2.6.32-29)
This still applies

Uwe

--

From: Trond Myklebust
Date: Monday, January 3, 2011 - 5:22 pm

I'm having trouble reproducing this with my own nfsroot setup (which is
just a 'fedora 13 live' disk with NetworkManager turned firmly off).

However looking back at your report, you said that when you remove the
symlink, you get an error message of the form:

"starting splashutils daemon.../etc/rc.d/S00splashutils: line
50: //sbin/fbsplashd.static: Unknown error 521"

Error 521 is EBADHANDLE, which basically means your client got a
corrupted filehandle. The 'fileid changed' thing also indicates some
form of corruption.
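Codes in this range are kernel-internal errors defined for NFSv3 in the kernel's include/linux/errno.h and are not part of the user-space errno set, which is why user space could only print "Unknown error 521". A small lookup table like the one below (a hypothetical helper, not a kernel or libc API) decodes them when reading such logs.

```c
#include <stddef.h>

/* Kernel-internal NFSv3 error codes from the kernel's
 * include/linux/errno.h; they are not exported to user space. */
struct kerr {
    int code;
    const char *name;
};

static const struct kerr nfs_kerrs[] = {
    {521, "EBADHANDLE"}, {522, "ENOTSYNC"},
    {523, "EBADCOOKIE"}, {524, "ENOTSUPP"},
    {525, "ETOOSMALL"},  {526, "ESERVERFAULT"},
    {527, "EBADTYPE"},   {528, "EJUKEBOX"},
};

/* Return the symbolic name for a kernel-internal NFS error code,
 * or NULL if the code is not in the table. */
static const char *kerr_name(int code)
{
    for (size_t i = 0; i < sizeof nfs_kerrs / sizeof nfs_kerrs[0]; i++)
        if (nfs_kerrs[i].code == code)
            return nfs_kerrs[i].name;
    return NULL;
}
```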

The question is whether this is something happening on the server or the
client. Does an older client kernel boot without any trouble?

--
