Dear All,

I have been using a setup where my $HOME is located on a NetBSD NFS server (for 10+ years). Every now and then something goes wrong with NFS and all accesses to /home cause the process to enter disk wait (D in ps listing). No matter how long I wait, the disk wait never completes. I can't even shut down and reboot the client system properly because shutdown fails to unmount the NFS partition(s). At the same time, all other NFS mounts from the same server work just fine.

Now that I started running FAM (hoping it would make gimp work better), the famd process couldn't be killed, so shutdown never even got to the point of unmounting disks. The only way to reboot the system was hitting reset - and now it takes 4 hours or so to recalculate raidframe parity.

All I did was launch firefox3 and bang, /home was dead.

Here's how I mount /home in case I'm using wrong options (I have tried many combinations with no luck):

    server:/home /home nfs rw,-X,-i,-b,-s,-C,-x16 0 0

So, what's up with NFS? Will it ever be fixed? It's been the same for years and for many NetBSD releases. Am I supposed to be running samba between NetBSD systems?

-jm

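For anyone trying to capture this state before resetting the box, the following is a minimal sketch (not part of the original message) of a filter that keeps only processes in disk wait together with their wait channel. The function name `stuck_procs` is made up for illustration, and the `ps` keywords in the comment are an assumption about what the local `ps` accepts.

```sh
#!/bin/sh
# Hypothetical helper: keep the header line plus every process whose
# state column starts with "D" (disk wait).
# Reads "ps -axo pid,stat,wchan,comm"-style text on stdin, so it can be
# tried on saved output as well as on a live system.
stuck_procs() {
    awk 'NR == 1 || $2 ~ /^D/'
}

# Live use (assumes ps accepts these BSD-style column keywords):
#   ps -axo pid,stat,wchan,comm | stuck_procs
```

The wait channel column is the interesting part: later messages in this thread report `tstile` and `nfsrcv` there.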
Is there a PR for your issue?

Some things immediately spring to mind:

 - are you using IPF on the client? There was a fragment handling bug...
 - are you using amd(8)? There is something wrong in that area...
 - have you tried TCP instead of UDP mounts or vice versa? Some driver bugs apparently can be worked around this way
 - have you tried reducing write and read block size (like -r1024 -w1024)? This sometimes helps broken drivers.
 - what ethernet driver are you using?

Martin

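The TCP and block-size suggestions above can be sketched as fstab variants of the line quoted in the original message. This is only an illustration: the helper name `home_fstab_line` is hypothetical, `-T` is taken to be the mount_nfs(8) flag for TCP transport, and `-r1024,-w1024` simply reuses the values given in the list. Which of the original options are still worth keeping with TCP is left open.

```sh
#!/bin/sh
# Hypothetical helper: print an fstab line for server:/home with extra
# mount_nfs flags appended to the options from the original message.
home_fstab_line() {
    base="rw,-X,-i,-b,-s,-C,-x16"
    # ${1:+,$1} adds ",<flags>" only when an argument was given.
    printf 'server:/home /home nfs %s%s 0 0\n' "$base" "${1:+,$1}"
}

home_fstab_line "-T"               # TCP transport instead of UDP
home_fstab_line "-r1024,-w1024"    # smaller read/write block sizes
```

Trying one change at a time makes it easier to tell which of them, if any, affects the hangs.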
No.

I can try that, too. (I'm using a local /home atm, though, to

I'm using ale now, but I have had similar problems with nfe and rtk (although not as soon as today with ale).

-jm

On Tue, Jun 02, 2009 at 04:47:08PM +0300, Jukka Marin wrote:
> > - are you using amd(8)? There is something wrong in that area...
>
> No. (I never figured out how to use it ;)

rm is the best way :-)

FWIW, several of my machines have NFS-mounted homedirs and I've not seen a problem in quite a while. One likely difference is that in my case the server is a NetApp filer...

what happens on the server when things go sour?

--
David A. Holland
dholland@netbsd.org

I'm using a blunt version of NFS: just plain

    /bulky -network xxx.xxx.xxx.xxx/24
    /bulky -network iipv6::/64

in the /etc/exports and invoked with:

    mountd=YES
    nfsd=YES
    nfs_client=YES # not needed normally
    nfs_server=YES
    nfsd_flags="-6 -u -t -n 6"

in /etc/rc.conf

    hostname:/usr/sources /usr/sources nfs rw,noauto 0 0

What are you referring to? It's working fine here with 5.0 and -current machines mixed. There used to be some issues on NetBSD 1.6 AFAIR but those were solved

Arguably I should/could use -s myself too....

With regards,
Reinoud

It happened again. After the last discussion, I changed /home to use NFS over TCP and it has been working well so far. Yesterday, another UDP mounted partition died in the middle of a compilation process. I couldn't umount the partition, not even with umount -f.

When I ran shutdown -r, it killed a bunch of processes and then - nothing. I tried sync, it hung. I tried reboot, it didn't. Even halt didn't do anything useful. The only way to "fix" the dead NFS mount was hitting the reset button (which forced a parity rewrite of a 1 TB raidframe disk and fsck of several largish partitions).

I'm using TCP for all NFS mounts now.. we'll see if that helps. It would be great if one could get over an NFS problem without the reset button, though....

-jm

On Thu, 9 Jul 2009 08:49:53 +0300

Interestingly I've been using TCP for NFS for a long time (years), yet lately I tend to often see these (it might be since the netbsd-4->netbsd-5 upgrade, but I'm not sure. Since most of the time this seemed harmless I didn't look into it much until now):

    nfs server foo not responding
    nfs server foo is alive again

In all cases, the delay between the two is quite short, on the order of a second or two at most. Sometimes these are rare and performance doesn't seem affected much. However, at other times these occur non-stop, and performance is greatly reduced; a CVS update from a local repository mirror mounted via NFS can then take at least three to four times longer than usual.

I suspected a possible problem with NFS threads being too low, but increasing the client-side vfs.nfs.iothreads sysctl, and the number of threads in nfsd_flags on the server side, didn't seem to help (despite top showing more available threads).

So yesterday I decided to switch to UDP to verify if it's better. However, after a certain number of hours, all processes on the client locked in tstile wchan/state and I had to reboot the client. The problem doesn't seem to be network related, as an ssh connection to the server keeps working when this happens, as well as HTTP/FTP to the server and outside. I see no network interface diagnostics in dmesg either.

This didn't occur again, but I'll keep using UDP a bit for now and see how it goes.

Thanks,
--
Matt

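To put a number on "rare" versus "non-stop", the "not responding" messages can be counted per server from a saved log. The sketch below is an illustration only: the function name `nfs_timeouts` is invented, and the log line layout in the test data is an assumption. The only thing it relies on is the "nfs server NAME not responding" wording quoted in the message.

```sh
#!/bin/sh
# Hypothetical helper: count "not responding" messages per NFS server.
# Reads log text from the named files, or from stdin if none are given.
nfs_timeouts() {
    awk '
        /nfs server .* not responding/ {
            # The server name is the word right after "nfs server".
            for (i = 1; i < NF; i++)
                if ($i == "nfs" && $(i + 1) == "server") {
                    count[$(i + 2)]++
                    break
                }
        }
        END { for (s in count) print s, count[s] }
    ' "$@"
}

# Possible use (log path is an assumption):
#   nfs_timeouts /var/log/messages
```

Comparing the counts from a good day and a bad day would at least show whether the slowdowns track these messages.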
I have seen these, too. Usually both messages are logged with the same timestamp. The problem I'm whining about, however, does not show up in the log at all. Sounds like my problem.. it usually occurs on my system once or twice a month. -jm
On Thu, 9 Jul 2009 11:16:32 +0300 So today, on another NFS client box on which I had returned to UDP, processes also started locking on the remote NFS mount. The server was fine again, but client processes were now locking in nfsrcv state. The client had to be rebooted and I switched back to TCP which at least seems more reliable so far. So I'm keeping one client box only with NFS mounted via UDP for now for further testing. -- Matt
On Thu, 9 Jul 2009 14:15:04 -0400

The testing box in question now mounts again in TCP mode because the problem occurred twice today (client NFS processes locking in nfsrcv with a single "not responding" message in dmesg) and it was way annoying, requiring two reboots. However, it could run fine during approximately 48 hours previously. It is thus hard to reproduce; it appears very intermittent.

--
Matt

On Wed July 8 2009 23:49:53 Jukka Marin wrote: It would be nice if reboot could be made to force a reboot. Perhaps using a flag. I have a NetBSD 5.0 Web server where any process accessing the Web content partition would get stuck in state D (disk wait) waiting on "tstile". (I suspect this is a wapbl issue since I have not seen the problem since remounting the file system without -o log.) Reboot hung after printing "syncing disks", requiring physical access to restore the server. Regards, Sverre
At Thu, 9 Jul 2009 09:54:48 -0600, Sverre Froyen <sverre@viewmark.com> wrote:

There already is such a flag, IIUC: "RB_NOSYNC", aka 0x0004.

I don't think that's what you want though -- it will still require a full RAIDframe parity check and possibly fsck on the next boot.

A safer mode could probably be done fairly easily too by setting some kind of watchdog timer before the unmounting of filesystems and other sundry cleanup, and then forcing the system to reboot if the timer expires.

(FYI, I have some changes, against netbsd-4, which much more reliably reboot i386 machines using much more standard methods of rebooting too.)

--
Greg A. Woods Planix, Inc. <woods@planix.com> +1 416 218-0099 http://www.planix.com/

In my case, that would have been preferable to jumping in a car and driving for 30 mins. Fsck was skipped anyway because of the "log" option -- perhaps Could they be added to 5 and current? Thanks, Sverre
At Fri, 10 Jul 2009 11:55:22 -0600, Sverre Froyen <sverre@viewmark.com> wrote:

Ah yes, indeed -- I almost forgot about that, I was only thinking about

I really hate PCs, especially PC "servers". I want a modern system with real, and simple, lights-out remote management, just like the old AlphaServers and their RMC, or the ILOM or ALOM that Sun servers have. And something without a bass-ackwards compatible BIOS, and of course all that means it will have real serial console support in the firmware too. I guess I should just shut up and shell out the bucks for a new(er) Sun server, but sadly I think I'd still be stuck with their x86 or maybe AMD platforms if I wanted to run NetBSD.

Of course I'm assuming you at least have a serial console connected for remote management and that you didn't just drive to the machine to "press any key". Proper firmware would not directly avoid that problem,

Yes, indeed, it would have to be a kernel feature. I haven't looked closely at the kernel shutdown sequences since the 1.6.x days but I

I would imagine....

They're at the bottom of this diff (note some are still #if-0'ed out because I haven't had time to figure out how to do them properly in the NetBSD context):

Index: sys/arch/i386/i386/machdep.c
===================================================================
RCS file: /cvs/master/m-NetBSD/main/src/sys/arch/i386/i386/machdep.c,v
retrieving revision 1.586.2.5
diff -u -r1.586.2.5 machdep.c
--- sys/arch/i386/i386/machdep.c 28 Aug 2007 11:46:26 -0000 1.586.2.5
+++ sys/arch/i386/i386/machdep.c 3 Jul 2009 20:10:12 -0000
@@ -278,6 +278,14 @@
 phys_ram_seg_t mem_clusters[VM_PHYSSEG_MAX];
 int mem_cluster_cnt;
 
+#ifdef VGA_APERTURE
+# ifdef INSECURE
+int allow_vga_aperture = 1;
+# else
+int allow_vga_aperture = 0;
+# endif
+#endif
+
 int cpu_dump(void);
...

I've pretty much stopped using the in-kernel nfs client on my desktop (laptop) and use either rump_nfs or sshfs. I don't use them for /home, though, and you will have issues at least with sshfs in a multiuser environment.
Could I ask if there is an obvious trick to using sshfs that I'm missing? - I tried playing with it and it seems to mount and let me read and rename files fine, but I can't create any new files... :/ -- David/absolute -- www.NetBSD.org: No hype required --
There are no tricks that I am aware of. What command are you executing and what's the error message?
I've tried the mount as:
sshfs ${user}@${host}:/files/netbsd /mnt
and
sshfs -o workaround=all -o sshfs_debug -o idmap=user ${user}@${host}:/files/netbsd /mnt
Client and server are both NetBSD/i386 5.0_STABLE from within
the last week.
I'm using fuse-sshfs-1.4nb1 from pkgsrc but with the
bluez-libs/buildlink3.mk removed and pkg-config added to USE_TOOLS
'mv' of an existing file works fine, but 'touch /mnt/moo' fails with:
26361 1 touch CALL __stat30(0xbfbff96e,0xbfbfe790)
26361 1 touch NAMI "/mnt/moo"
26361 1 touch RET __stat30 -1 errno 2 No such file or directory
26361 1 touch CALL open(0xbfbff96e,0x201,0x1b6)
26361 1 touch NAMI "/mnt/moo"
26361 1 touch RET open -1 errno 2 No such file or directory
26361 1 touch CALL write(2,0xbfbfdf40,7)
--
David/absolute -- www.NetBSD.org: No hype required --
Oh. I've never used sshfs from pkgsrc. I'm talking about the NetBSD-optimized psshfs found as mount_psshfs from base.
Arg - I was looking for a mount_ssh* not mount_*ssh* :) *many* thanks - now very happy with mount_psshfs on my NetBSD 5 box :) -- David/absolute -- www.NetBSD.org: No hype required --
OK, I have to ask - what is "fuse-sshfs-1.4nb1 from pkgsrc"? Up until you said that, I'd been thinking you were using mount_psshfs(8). Regards, Al
Ahem, should have stated "from pkgsrc-wip" :) -- David/absolute -- www.NetBSD.org: No hype required --
