Re: NFS

Previous thread: Re: uvm_pagealloc_strat locking against itself by Martin Husemann on Saturday, May 30, 2009 - 3:39 pm. (1 message)

Next thread: options MODULAR improvements phase 1.0 by John Nemeth on Tuesday, June 2, 2009 - 6:48 pm. (3 messages)
From: Jukka Marin
Subject: NFS
Date: Tuesday, June 2, 2009 - 4:10 am

Dear All,

I have been using a setup where my $HOME is located on a NetBSD NFS server
(for 10+ years).  Every now and then something goes wrong with NFS and
all accesses to /home cause the process to enter disk wait (D in ps listing).
No matter how long I wait, the disk wait never completes.  I can't even
shut down and reboot the client system properly because shutdown fails
to unmount the NFS partition(s).  At the same time, all other NFS mounts
from the same server work just fine.

Now that I started running FAM (hoping it would make gimp work better),
the famd process couldn't be killed, so shutdown never even got to the
point of unmounting disks.

The only way to reboot the system was hitting reset - and now it takes
4 hours or so to recalculate raidframe parity.

All I did was launch firefox3 and bang, /home was dead.

Here's how I mount /home in case I'm using wrong options (I have tried
many combinations with no luck):

server:/home    /home   nfs     rw,-X,-i,-b,-s,-C,-x16  0 0

So, what's up with NFS?  Will it ever be fixed?  It's been the same
for years and for many NetBSD releases.  Am I supposed to be running
samba between NetBSD systems?

  -jm
From: Martin Husemann
Subject: Re: NFS
Date: Tuesday, June 2, 2009 - 4:23 am

Is there a PR for your issue?
Some things immediately spring to mind:

 - are you using IPF on the client? There was a fragment handling bug...
 - are you using amd(8)? There is something wrong in that area...
 - have you tried TCP instead of UDP mounts or vice versa? Some driver
   bugs apparently can be worked around this way
 - have you tried reducing write and read block size (like -r1024 -w1024)?
   This sometimes helps broken drivers.
 - what Ethernet driver are you using?
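
Two of the suggestions above (TCP transport and smaller read/write sizes)
could be tried together with an fstab entry along these lines.  This is an
untested sketch: it assumes the -T (TCP), -r and -w flags of NetBSD's
mount_nfs(8), keeps the -i and -s flags from the original entry, and uses
1024 bytes only because that is the example size mentioned above.

```
# /etc/fstab (illustrative; "server" is the placeholder host name used earlier)
server:/home    /home   nfs     rw,-T,-i,-s,-r1024,-w1024  0 0
```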

Martin
From: Jukka Marin
Subject: Re: NFS
Date: Tuesday, June 2, 2009 - 6:47 am

No.  I can try that, too.  (I'm using a local /home atm, though, to

I'm using ale now, but I have had similar problems with nfe and rtk
(although not as soon as today with ale).

  -jm
From: David Holland
Subject: Re: NFS
Date: Tuesday, June 2, 2009 - 11:50 pm

On Tue, Jun 02, 2009 at 04:47:08PM +0300, Jukka Marin wrote:
 > >  - are you using amd(8)? There is something wrong in that area...
 > 
 > No.  (I never figured out how to use it ;)

rm is the best way :-)

FWIW, several of my machines have NFS-mounted homedirs and I've not
seen a problem in quite a while. One likely difference is that in my
case the server is a NetApp filer... what happens on the server when
things go sour?

-- 
David A. Holland
dholland@netbsd.org
From: Michai Ramakers
Subject: Re: NFS
Date: Tuesday, June 2, 2009 - 6:50 am

> all accesses to /home cause the process enter disk wait (D in ps listing)
From: Reinoud Zandijk
Subject: Re: NFS
Date: Tuesday, June 9, 2009 - 8:18 am

I'm using a blunt version of NFS: just plain
/bulky		-network xxx.xxx.xxx.xxx/24
/bulky		-network iipv6::/64

in the /etc/exports and invoked with:

mountd=YES
nfsd=YES
nfs_client=YES	# not needed normally
nfs_server=YES
nfsd_flags="-6 -u -t -n 6"

in /etc/rc.conf

hostname:/usr/sources /usr/sources nfs rw,noauto 0 0



What are you referring to? It's working fine here with 5.0 and -current machines
mixed. There used to be some issues on NetBSD 1.6 AFAIR but those were solved

Arguably I should/could use -s myself too....

With regards,
Reinoud

From: Jukka Marin
Subject: Re: NFS
Date: Wednesday, July 8, 2009 - 10:49 pm

It happened again.

After the last discussion, I changed /home to use NFS over TCP and it has
been working well so far.  Yesterday, another UDP mounted partition died
in the middle of a compilation process.  I couldn't umount the partition,
not even with umount -f.  When I ran shutdown -r, it killed a bunch of
processes and then - nothing.  I tried sync, it hung.  I tried reboot,
it didn't.  Even halt didn't do anything useful.

The only way to "fix" the dead NFS mount was hitting the reset button
(which forced a parity rewrite of a 1 TB raidframe disk and fsck of
several largish partitions).

I'm using TCP for all NFS mounts now.. we'll see if that helps.  It would
be great if one could get over an NFS problem without the reset button,
though....

  -jm
From: Matthew Mondor
Subject: Re: NFS
Date: Wednesday, July 8, 2009 - 11:34 pm

On Thu, 9 Jul 2009 08:49:53 +0300

Interestingly I've been using TCP for NFS for a long time (years), yet
lately I tend to often see these (it might be since the
netbsd-4->netbsd-5 upgrade, but I'm not sure.  Since most of the time
this seemed harmless I didn't look into it much until now):

nfs server foo not responding
nfs server foo is alive again

In all cases, the delay between the two is quite short, on the order of
a second or two at most.  Sometimes these are rare and performance
doesn't seem affected much.  However, at other times these occur
non-stop, and performance is greatly reduced: a CVS update from a local
repository mirror mounted via NFS can then take at least three to four
times longer than usual.

I suspected a possible problem with there being too few NFS threads, but
increasing the client-side vfs.nfs.iothreads sysctl and the number of
threads in nfsd_flags on the server side didn't seem to help (despite top
showing more available threads).
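
For anyone wanting to repeat that experiment, the two knobs could be set
roughly as follows.  This is only a sketch: the sysctl name and the -n
option of nfsd are the ones mentioned in this thread, while the thread
counts are made-up illustrative values, not recommendations.

```
# client: inspect, then raise, the number of NFS I/O threads
sysctl vfs.nfs.iothreads
sysctl -w vfs.nfs.iothreads=8

# server, /etc/rc.conf: serve UDP and TCP with more nfsd threads
nfsd_flags="-u -t -n 8"
```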

So yesterday I decided to switch to UDP to verify if it's better.
However, after a certain number of hours, all processes on the client
locked in tstile wchan/state and I had to reboot the client.
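
When processes wedge like this, recording the wait channel of each stuck
process (tstile in the case above) gives a PR something concrete to go on.
A minimal sketch, assuming ps(1) output with the columns pid, stat, wchan
and command; the function name filter_disk_wait is made up for
illustration:

```sh
# Read "PID STAT WCHAN COMMAND" rows on stdin (header on the first line)
# and print the PID and wait channel of every process whose state
# starts with D (disk wait).
filter_disk_wait() {
    awk 'NR > 1 && $2 ~ /^D/ { print $1, $3 }'
}
```

On the client it would be fed from something like
"ps -axo pid,stat,wchan,command | filter_disk_wait" while the mount is hung.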

The problem doesn't seem to be network related, as an ssh
connection to the server keeps working when this happens, as well as
HTTP/FTP to the server and outside.  I see no network interface
diagnostics in dmesg either.  This didn't occur again, but I'll keep
using UDP a bit for now and see how it goes.

Thanks,
-- 
Matt
From: Jukka Marin
Subject: Re: NFS
Date: Thursday, July 9, 2009 - 1:16 am

I have seen these, too.  Usually both messages are logged with the same
timestamp.

The problem I'm whining about, however, does not show up in the log at all.

Sounds like my problem.. it usually occurs on my system once or twice
a month.

  -jm
From: Matthew Mondor
Subject: Re: NFS
Date: Thursday, July 9, 2009 - 11:15 am

On Thu, 9 Jul 2009 11:16:32 +0300


So today, on another NFS client box on which I had returned to UDP,
processes also started locking on the remote NFS mount.  The server was
fine again, but client processes were now locking in nfsrcv state.  The
client had to be rebooted and I switched back to TCP which at least
seems more reliable so far.  So I'm keeping one client box only with
NFS mounted via UDP for now for further testing.
-- 
Matt
From: Matthew Mondor
Subject: Re: NFS
Date: Saturday, July 11, 2009 - 2:30 pm

On Thu, 9 Jul 2009 14:15:04 -0400

The testing box in question now mounts again in TCP mode because the
problem occurred twice today (client NFS processes locking in nfsrcv
with a single "not responding" message in dmesg) and it was way
annoying, requiring two reboots.  However, it had run fine for
approximately 48 hours before that.  It is thus hard to reproduce; it
appears very intermittent.
-- 
Matt
From: Sverre Froyen
Date: Thursday, July 9, 2009 - 8:54 am

On Wed July 8 2009 23:49:53 Jukka Marin wrote:

It would be nice if reboot could be made to force a reboot.  Perhaps using a 
flag.  I have a NetBSD 5.0 Web server where any process accessing the Web 
content partition would get stuck in state D (disk wait) waiting on "tstile".  
(I suspect this is a wapbl issue since I have not seen the problem since 
remounting the file system without -o log.)   Reboot hung after printing 
"syncing disks", requiring physical access to restore the server.

Regards,
Sverre
From: Greg A. Woods
Date: Thursday, July 9, 2009 - 10:05 am

At Thu, 9 Jul 2009 09:54:48 -0600, Sverre Froyen <sverre@viewmark.com> wrote:

There already is such a flag, IIUC:  "RB_NOSYNC", aka 0x0004.

I don't think that's what you want though -- it will still require a
full RAIDframe parity check and possibly fsck on the next boot.

A safer mode could probably be done fairly easily too by setting some
kind of watchdog timer before the unmounting of filesystems and other
sundry cleanup, and then forcing the system to reboot if the timer
expires.

(FYI, I have some changes, against netbsd-4, which much more reliably
reboot i386 machines using much more standard methods of rebooting too.)

-- 
						Greg A. Woods
						Planix, Inc.

<woods@planix.com>       +1 416 218-0099        http://www.planix.com/
From: Sverre Froyen
Date: Friday, July 10, 2009 - 10:55 am

In my case, that would have been preferable to jumping in a car and driving 
for 30 mins.   Fsck was skipped anyway because of the "log" option -- perhaps 


Could they be added to 5 and current?

Thanks,
Sverre
From: Greg A. Woods
Date: Friday, July 10, 2009 - 5:22 pm

At Fri, 10 Jul 2009 11:55:22 -0600, Sverre Froyen <sverre@viewmark.com> wrote:

Ah yes, indeed -- I almost forgot about that, I was only thinking about

I really hate PCs, especially PC "servers".

I want a modern system with real, and simple, lights-out remote
management, just like the old AlphaServers and their RMC, or the ILOM or
ALOM that Sun servers have.  And something without a bass-ackwards
compatible BIOS, and of course all that means it will have real serial
console support in the firmware too.  I guess I should just shut up and
shell out the bucks for a new(er) Sun server, but sadly I think I'd
still be stuck with their x86 or maybe AMD platforms if I wanted to run
NetBSD.

Of course I'm assuming you at least have a serial console connected for
remote management and that you didn't just drive to the machine to
"press any key".  Proper firmware would not directly avoid that problem,

Yes, indeed, it would have to be a kernel feature.  I haven't looked
closely at the kernel shutdown sequences since the 1.6.x days but I

I would imagine....  They're at the bottom of this diff (note some are
still #if-0'ed out because I haven't had time to figure out how to do
them properly in the NetBSD context):


Index: sys/arch/i386/i386/machdep.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/i386/i386/machdep.c,v
retrieving revision 1.586.2.5
diff -u -r1.586.2.5 machdep.c
--- sys/arch/i386/i386/machdep.c	28 Aug 2007 11:46:26 -0000	1.586.2.5
+++ sys/arch/i386/i386/machdep.c	3 Jul 2009 20:10:12 -0000
@@ -278,6 +278,14 @@
 phys_ram_seg_t mem_clusters[VM_PHYSSEG_MAX];
 int	mem_cluster_cnt;
 
+#ifdef VGA_APERTURE
+# ifdef INSECURE
+int allow_vga_aperture = 1;
+# else
+int allow_vga_aperture = 0;
+# endif
+#endif
+
 int	cpu_dump(void);
 ...
From: Antti Kantee
Subject: Re: NFS
Date: Thursday, July 9, 2009 - 9:08 am

I've pretty much stopped using the in-kernel nfs client on my desktop
(laptop) and use either rump_nfs or sshfs.  I don't use them for /home,
though, and you will have issues at least with sshfs in a multiuser
environment.
From: David Brownlee
Subject: Re: NFS
Date: Friday, July 10, 2009 - 10:38 am

Could I ask if there is an obvious trick to using sshfs
 	that I'm missing? - I tried playing with it and it seems
 	to mount and let me read and rename files fine, but I can't
 	create any new files... :/

-- 
 		David/absolute       -- www.NetBSD.org: No hype required --
From: Antti Kantee
Subject: Re: NFS
Date: Friday, July 10, 2009 - 10:44 am

There are no tricks that I am aware of.  What command are you executing
and what's the error message?
From: David Brownlee
Subject: Re: NFS
Date: Friday, July 10, 2009 - 12:44 pm

I've tried the mount as:
 	    sshfs ${user}@${host}:/files/netbsd /mnt
 	and
 	    sshfs -o workaround=all -o sshfs_debug -o idmap=user ${user}@${host}:/files/netbsd /mnt

 	Client and server are both NetBSD/i386 5.0_STABLE from within
 	the last week.

 	I'm using fuse-sshfs-1.4nb1 from pkgsrc but with the
 	bluez-libs/buildlink3.mk removed and pkg-config added to USE_TOOLS

 	'mv' of an existing file works fine, but 'touch /mnt/moo' fails with:

  26361      1 touch    CALL  __stat30(0xbfbff96e,0xbfbfe790)
  26361      1 touch    NAMI  "/mnt/moo"
  26361      1 touch    RET   __stat30 -1 errno 2 No such file or directory
  26361      1 touch    CALL  open(0xbfbff96e,0x201,0x1b6)
  26361      1 touch    NAMI  "/mnt/moo"
  26361      1 touch    RET   open -1 errno 2 No such file or directory
  26361      1 touch    CALL  write(2,0xbfbfdf40,7)


-- 
 		David/absolute       -- www.NetBSD.org: No hype required --
From: Antti Kantee
Subject: Re: NFS
Date: Friday, July 10, 2009 - 1:03 pm

Oh.  I've never used sshfs from pkgsrc.  I'm talking about the
NetBSD-optimized psshfs found as mount_psshfs from base.
From: David Brownlee
Subject: Re: NFS
Date: Saturday, July 11, 2009 - 10:35 am

Arg - I was looking for a mount_ssh* not mount_*ssh* :)
 	*many* thanks - now very happy with mount_psshfs on my
 	NetBSD 5 box :)

--
 		David/absolute       -- www.NetBSD.org: No hype required --
From: Alistair Crooks
Subject: Re: NFS
Date: Friday, July 10, 2009 - 11:34 pm

OK, I have to ask - what is "fuse-sshfs-1.4nb1 from pkgsrc"?
 
Up until you said that, I'd been thinking you were using mount_psshfs(8).

Regards,
Al
From: David Brownlee
Subject: Re: NFS
Date: Saturday, July 11, 2009 - 1:01 am

Ahem, should have stated "from pkgsrc-wip" :)

-- 
 		David/absolute       -- www.NetBSD.org: No hype required --