(cc's to me appreciated)

It would be really, really nice if "umount -f" against a hung NFS mount actually worked on Linux. As much as I hate Solaris, I consider it the gold standard in this case: if I say "umount -f /mount/that/is/hung" it just goes away, immediately, and anything still trying to use it dies (with EIO, I'm told). If I know the NFS server is down, that really is the correct behaviour.

I very much want this behaviour, and am willing to bribe/pay for it, although my resources are limited.

Unless you're interested in details of my tests, stop here.

I'm bringing this up again (I know it's been mentioned here before) because I had been told that NFS support had gotten better in Linux recently, so I have been (for my $dayjob) testing the behaviour of NFS (autofs NFS, specifically) under Linux with hard,intr and using iptables to simulate a hang.

fuser hangs, as far as I can tell indefinitely, as does lsof. umount -f returns after a long time with "busy". umount -l works after a long time but leaves the system in a very unfortunate state, such that I have to kill things by hand and manually edit /etc/mtab to get autofs to work again.

The "correct solution" to this situation according to http://nfs.sourceforge.net/ is cycles of "kill processes" and "umount -f". This has two problems:

1. It sucks.
2. If fuser and lsof both hang (and they do: fuser has been on "stat("/home/rpowell/"," for > 30 minutes now), I have no way to pick which processes to kill.

I've read every man page I could find, and the only nfs option that seems even vaguely helpful is "soft", but everything that mentions "soft" also says to never use it.

This is the single worst aspect of adminning a Linux system that I, as a career sysadmin, have to deal with. In fact, it's really the only one I even dislike. At my current workplace, we've lost multiple person-days to this issue, having to go around and reboot every Linux box that was hanging off a down NFS server. I know many ...
Robin> I'm bringing this up again (I know it's been mentioned here
Robin> before) because I had been told that NFS support had gotten
Robin> better in Linux recently, so I have been (for my $dayjob)
Robin> testing the behaviour of NFS (autofs NFS, specifically) under
Robin> Linux with hard,intr and using iptables to simulate a hang.

So why are you mounting with hard,intr semantics? At my current SysAdmin job, we mount everything (Solaris included) with 'soft,intr' and it works well. If an NFS server goes down, clients don't hang for large periods of time.

Robin> fuser hangs, as far as I can tell indefinitely, as does
Robin> lsof. umount -f returns after a long time with "busy", umount
Robin> -l works after a long time but leaves the system in a very
Robin> unfortunate state such that I have to kill things by hand and
Robin> manually edit /etc/mtab to get autofs to work again.

Robin> The "correct solution" to this situation according to
Robin> http://nfs.sourceforge.net/ is cycles of "kill processes" and
Robin> "umount -f". This has two problems: 1. It sucks. 2. If fuser
Robin> and lsof both hang (and they do: fuser has been on
Robin> "stat("/home/rpowell/"," for > 30 minutes now), I have no way to
Robin> pick which processes to kill.

Robin> I've read every man page I could find, and the only nfs option
Robin> that seems even vaguely helpful is "soft", but everything that
Robin> mentions "soft" also says to never use it.

I think the man pages are out of date, or ignoring reality. Try mounting with soft,intr and see how it works for you. I think you'll be happy.

Robin> This is the single worst aspect of adminning a Linux system that I,
Robin> as a career sysadmin, have to deal with. In fact, it's really the
Robin> only one I even dislike. At my current work place, we've lost
Robin> multiple person-days to this issue, having to go around and reboot
Robin> every Linux box that was hanging off a down NFS server.

Robin> I know many other admins who also ...
No. The price of using "soft" is the chance of data corruption, since an application may for example be left thinking that a write has succeeded when it hasn't. See http://nfs.sourceforge.net/#faq_e4 --b. -
Wow! That's _really_ a bad idea. NFS READ operations which
timeout can lead to executables which mysteriously fail, file
corruption, etc. NFS WRITE operations which fail may or may
not lead to file corruption.
Anything writable should _always_ be mounted "hard" for safety
purposes. Readonly mounted file systems _may_ be mounted "soft",
depending upon what is located on them.

Please don't. You will end up regretting it in the long run.
Taking a chance on corrupted data or critical applications which
just fail is not worth the benefit.
It would be safer for us to implement something which works like
the Solaris forced umount support for NFS.
Thanx...
-
To add to the pain, lsof or fuser hang on unresponsive shares. I wrote my own wrapper to go through the "/proc/<pid>" file tables and find any process using the unresponsive mounts and kill those processes. This works well. Also, it brings another point. If the unresponsiveness problem cannot be fixed for some NFS data corruption reasons, is it possible for a mount to have both soft & hard semantics? Some processes might want to use the mount point soft and other processes hard. This can be implemented easily in the NFS & SUNRPC layers by adding a timeout to requests, but it becomes tricky in the VFS layer. If a soft process is waiting on an inode locked by a hard process, the soft process gets hard semantics too. Thanks --Chakri -
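The wrapper described above was not posted, so the following is only a minimal sketch of the idea, not the original tool. It assumes that reading the symlinks under /proc/<pid> (cwd, root, exe, fd/*) with readlink does not contact the NFS server, which is why it can work where fuser and lsof block in stat(). The function name pids_using_mount and the proc_root parameter are made up for illustration. It does not look at mmap'd files (/proc/<pid>/maps), and it only reports PIDs; deciding what to kill is left to the admin.

```python
import os


def pids_using_mount(mount_point, proc_root="/proc"):
    """Return sorted PIDs whose cwd, root, exe or open fds point under mount_point.

    Only readlink() is used on the /proc symlinks; the link targets are never
    stat()ed or opened, so nothing here should block on the hung mount itself.
    """
    mount_point = mount_point.rstrip("/") or "/"
    prefix = "/" if mount_point == "/" else mount_point + "/"
    hits = set()
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        pid_dir = os.path.join(proc_root, entry)
        links = [os.path.join(pid_dir, name) for name in ("cwd", "root", "exe")]
        fd_dir = os.path.join(pid_dir, "fd")
        try:
            links += [os.path.join(fd_dir, fd) for fd in os.listdir(fd_dir)]
        except OSError:
            pass  # process exited, or we lack permission to list its fds
        for link in links:
            try:
                target = os.readlink(link)
            except OSError:
                continue  # link vanished or is not readable by us
            if target == mount_point or target.startswith(prefix):
                hits.add(int(entry))
                break
    return sorted(hits)
```

Run as root, something like pids_using_mount("/mount/that/is/hung") would give the candidate list for the "kill processes, then umount -f" cycle. The trailing-slash prefix check keeps /mnt/hung from matching /mnt/hungry.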
Does write + tcp make this any different? -Robin -- http://www.digitalkingdom.org/~rlpowell/ *** http://www.lojban.org/ Reason #237 To Learn Lojban: "Homonyms: Their Grate!" Proud Supporter of the Singularity Institute - http://singinst.org/ -
Nope...
TCP may make a difference if the problem is related to the network
being slow or lossy, but will not affect anything if the server
is just slow or down. Even if TCP would have eventually gotten
all of the packets in a request or response through, the client
may time out, cease waiting, and corruption may occur again.
ps
-
>>>>> "Peter" == Peter Staubach <staubach@redhat.com> writes:

Peter> John Stoffel wrote:

Robin> I'm bringing this up again (I know it's been mentioned here
Robin> before) because I had been told that NFS support had gotten
Robin> better in Linux recently, so I have been (for my $dayjob)
Robin> testing the behaviour of NFS (autofs NFS, specifically) under

Peter> Wow! That's _really_ a bad idea. NFS READ operations which
Peter> timeout can lead to executables which mysteriously fail, file
Peter> corruption, etc. NFS WRITE operations which fail may or may
Peter> not lead to file corruption.

Peter> Anything writable should _always_ be mounted "hard" for safety
Peter> purposes. Readonly mounted file systems _may_ be mounted
Peter> "soft", depending upon what is located on them.

Not in my experience. We use NetApps as our backing NFS servers, so maybe my experience isn't totally relevant. But with a mix of Linux and Solaris clients, we've never had problems with soft,intr on our NFS clients.

We also don't see file corruption, mysterious executables failing to run, etc.

Now maybe those issues are raised when you have a Linux NFS server with Solaris clients. But in my book, reliable NFS servers are key, and if they are reliable, 'soft,intr' works just fine.

Now maybe if we had NFS exported directories everywhere, and stuff cross mounted all over the place with autofs, then we might change our minds.

In any case, I don't disagree with the fundamental request to make the NFS client code on Linux easier to work with. I bet Trond (who works at NetApp) will have something to say on this issue.

John
-
Just for the others who may be reading this thread --
If you use sufficient network bandwidth and high quality
enough networks and NFS servers with plenty of resources,
then you _may_ be able to get away with "soft" mounting
for some period of time.
However, any server, including Solaris and NetApp servers,
will fail, and those failures may or may not affect the
NFS service being provided. In fact, unless the system
is being carefully administered and the applications are
written very well, with error detection and recovery in
mind, then corruption can occur, and it can be silent and
unnoticed until too late. In fact, most failures do occur
silently and get chalked up to other causes because it will
not be possible to correlate the badness with the NFS
client giving up when attempting to communicate with an
NFS server.
I wish you the best of luck, although with the environment
that you describe, it seems like "hard" mounts would work
equally well and would not incur the risks.
ps
-
The NFS server alone can't prevent the problems Peter Staubach refers to. Their frequency also depends on the network and the way you're using the filesystem. (A sufficiently paranoid application accessing the filesystem could function correctly despite the problems caused by soft mounts, but the degree of paranoia required probably isn't common.) In practice, you may get away with soft mounts and never see problems. But other people considering them should probably make sure they understand the issues before trusting anything important to them. --b. -
Would it be sufficient to ensure that the application always issues an fsync() before closing any recently written/updated file? Are there other subtle paranoid techniques that should be used? ric -
I suspect that this is not sufficient. The application should
be prepared to rewrite data if it can determine what data did
not get written. Using fsync will tell the application when
data was not written to the server correctly, but not which
part of the data.
Perhaps O_SYNC or fsync following each write, but either one of
these options will also cause a large performance degradation.
The right solution is the use of TCP and hard mounting.
ps
-
NFS already syncs on close (and on unlock), so you should just need to check the return values from any writes, fsyncs, closes, etc. (and realize that an error there may mean some or all of the previous writes to this file descriptor failed). And operations like mkdir have the same problem--a timeout leaves you not knowing whether the directory was created, because you don't know whether the operation reached the server or not. I assume the problems with executables that Peter Staubach refers to are due to reads on mmap'd files timing out. I don't use soft mounts myself and haven't had to debug user problems with them, so my understanding of it all is purely theoretical--others will have a better idea when and how these kinds of failures actually manifest themselves in practice. --b. -
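As a rough illustration of the "check every return value" advice, here is a small sketch of what a paranoid writer might look like. It is one possible reading of the suggestions in this thread, not a tested recipe for soft mounts. The function names (write_checked, write_with_rewrite, ensure_dir) are hypothetical. The assumption is that an error from write, fsync or close means some unknown part of the earlier writes may be lost, so the only safe recovery shown is rewriting the whole file.

```python
import errno
import os


def write_checked(path, data):
    """Write data to path; return None on success, else the OSError seen.

    Errors from write(), fsync() and close() are all treated as "some or all
    of the data may not have reached the server".
    """
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    except OSError as exc:
        return exc
    error = None
    try:
        view = memoryview(data)
        while view:
            written = os.write(fd, view)  # may be a short write
            view = view[written:]
        os.fsync(fd)  # surfaces write-back errors before close
    except OSError as exc:
        error = exc
    try:
        os.close(fd)  # close can report errors too
    except OSError as exc:
        error = error or exc
    return error


def write_with_rewrite(path, data, attempts=3):
    """Rewrite the whole file until one attempt reports no error."""
    last = None
    for _ in range(attempts):
        last = write_checked(path, data)
        if last is None:
            return True
    raise last


def ensure_dir(path):
    """mkdir that tolerates a directory left behind by an earlier timed-out attempt."""
    try:
        os.mkdir(path)
    except FileExistsError:
        if not os.path.isdir(path):
            raise
```

Note that this only covers the write side. A read that times out on a soft mount, for example through an mmap'd executable, cannot be handled this way.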
And you don't need all that ext3 journal overhead if your disk drives are reliable too. Gotcha. :)
Err, no. The ext3 journal overhead buys you not needing to fsck after
an unclean shutdown, and safety against crap getting written to the
inode table on an unclean power hit while the disk drive is writing
and the memory goes insane before the DMA engine and disk drive stop
working from the voltage on the power supply rails. (Hence my advice
that if you use XFS on Linux, make *sure* you have a UPS; on machines
such as the SGI Indy they added bigger capacitors to the PSU and a
real power fail interrupt, but PC-class hardware is
inexpensive/crappy, so it doesn't have such niceties.)
- Ted
-
>>>>> "Valdis" == Valdis Kletnieks <Valdis.Kletnieks@vt.edu> writes:

Valdis> And you don't need all that ext3 journal overhead if your disk
Valdis> drives are reliable too. Gotcha. :)

Yeah yeah... you got me. *grin* In a way. How to say this. NFS is like ext2 in some ways. No real protection from errors unless you turn on possibly performance-killing aspects of the code. Ext3 takes it to a higher level of consistency without compromising as much on the performance. RAID can be the base of both of these things, and that helps a lot. If your RAID is reliable.

So, my NetApps are reliable because they have NVRAM for performance, and it's battery backed for reliability. On that they build the Volume and Filesystem stuff, which also has performance and reliability built-in. On top of this, they have NFS (or CIFS or other protocols, but I use only NFS).

And we actually default to "proto=tcp,soft,intr" for all our mounts. We do this for performance, because we're confident of the underlying reliability of the layers below it. All the way down to the network switches in a way. Though I admit we don't dual-path everything since we don't have enough need for that level of reliability.

So that's where I'm coming from. Now, I'd be happy to be proven wrong, but I'd like to see people giving test scripts which can be run on a client to simulate failures and such so I can run them here in my environment as a test. Maybe I'll change my mind. Maybe I won't. At least we've got choice. :]

John
-
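Nobody in the thread posted such a test script, so the following is only a sketch of one half of it: a client-side checker that writes a reproducible pattern to a file and later verifies it by checksum. The function names and the seed value are invented for illustration. Injecting the failure (for example, blocking the server with iptables as Robin did) is left to the tester, and a read-back on the same client may be served from the page cache, so verifying from a second client or after a remount is probably more telling.

```python
import hashlib
import os


def write_pattern(path, size, seed=b"nfs-soft-test"):
    """Write `size` bytes of a reproducible pattern; return its SHA-256 hex digest."""
    digest = hashlib.sha256()
    counter = 0
    remaining = size
    with open(path, "wb") as f:
        while remaining > 0:
            # Each 32-byte block depends on the seed and its position in the file.
            block = hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
            block = block[:remaining]
            f.write(block)
            digest.update(block)
            remaining -= len(block)
            counter += 1
        f.flush()
        os.fsync(f.fileno())  # an OSError here already signals a failed write
    return digest.hexdigest()


def verify(path, expected_digest):
    """Re-read the file and report whether it still matches the recorded digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_digest
```

A run would call write_pattern() on the NFS mount while the fault is being injected, note any OSError, and then call verify() once the server is reachable again. A mismatch without any reported error would be the silent corruption Peter describes.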
So, there's a power outage and the UPS had a glitch. Oops, you've got to recover multiple TB and tell users everything since the last incremental backup is gone. You use a UPS in the computer room but management, in its cost-cutting wisdom, hasn't provided a UPS for your Unix workstations and there's a power outage. Oops, you've got lots of corrupt files but you don't know which ones they are, so you've got to recover multiple TB and tell users everything since the last incremental backup is gone. Ok, so hard mounting may not always save you in these circumstances but soft mounting will surely get you in the neck. Ian -
Murphy can get a *lot* more creative than that. So we'd outgrown the capacity on our UPS and diesel generator, and decided to replace them. So we schedule downtime for a Saturday. Rather scary, we had a Sun E10K that had been powered-up for several years, and just as expected, a good fraction of the 400+ drives it had failed to re-spinup. While recovering from that, we discovered that although the vast majority of the 400 drives were either mirrors or raidsets, due to a config error, the boot volume wasn't mirrored (fortunately, it spun up OK so we dodged the bullet), so we fixed that. Literally the next Friday, not even a week later, a contractor relocating a door into our machine room shorted out a sensor circuit in our fire suppression system, triggering a Halon dump. Of course, no amount of UPS and diesel was going to save us now, because there was a safety interlock that killed the power feeds if the Halon dumped. This time, since they'd all been stressed just a week before, only 2 of the 400+ disks on the E10K failed to spin up. Guess which two. ;)
