Hua Zhong reported an NFS regression [1] in 2.6.23-rc4 as compared to 2.6.22: "[upgrading] causes several autofs mounts to fail silently - they just [do] not appear when they should." Trond Myklebust explained that the change in default behavior was intentional, meant to prevent an NFS mount from being mounted with the wrong options. The patch also introduced a new mount option, "nosharecache", which Trond described as follows: "the new option is there in order to make it damned clear to sysadmins that this is a dangerous thing to do: mounts which don't share the same superblock also don't share the same data and attribute caches. Any file or directory which appears in both mounts had better only be used by one application at a time or be using an appropriate locking scheme."
Jakob Oestergaard defended the change, asserting, "what he 'broke' is, for example, a ro mount being mounted as rw. That *could* be a very serious security (etc.etc.) problem which he just fixed. Anything depending on read-only not being enforced will cease to work, of course, and that is what a few people complain about(!)."
Linus Torvalds disagreed strongly with the change: "that commit gets reverted or fixed. It's a regression, and your theories that it's 'better' that way are obviously broken." He added:
"The point being that you just disallowed people from doing things that are sane but _potentially_ dangerous. That's not how we work. The UNIX way is to give people rope - if you cannot *prove* that what they are doing is wrong, then you damn well better not disallow it."
In response to the concern that the changes to NFS were necessary to fix a security hole, Linus retorted, "this is *not* a security hole. In order to make it a security hole, you need to be root in the first place. So what you call a security hole is really no different from root installing a bad SUID binary. It's simply not the kernels place to then say 'SUID binaries will not work, because it's a potential security hole'."
From: Hua Zhong [email blocked]
Subject: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 14:07:43 -0700
I am re-sending this after help from Ian and git-bisect. To me it's a
show-stopper: I cannot find an acceptable workaround that I can implement.
The problem: upgrading to 2.6.23-rc4 from 2.6.22 causes several autofs
mounts to fail silently - they just not appear when they should.
I believe it's caused by the NFS change that forces multiple mounts from
different directories under the same server side filesystem to have the same
mount options by default, otherwise it returns EBUSY.
For example, if server has a filesystem /a, and it exports /a/x and /a/y
(maybe with rw or ro), and a client must mount /a/x and /a/y with the same
mount options now.
Since in my setup they are managed by autofs, and the autofs map is managed
by nis, there is no way I could easily workaround it..
If we have to live with this regression, I want to hear some suggestions
about how to fix them realistically. Thanks.
By the way, I am not sure if I did the bisect right, but FWIW, git-bisect
says:
c98451bdb2f3e6d6cc1e03adad641e9497512b49 is first bad commit
commit c98451bdb2f3e6d6cc1e03adad641e9497512b49
Author: Frank van Maarseveen [email blocked]
Date: Mon Jul 9 22:25:29 2007 +0200
NLM: fix source address of callback to client
Use the destination address of the original NLM request as the
source address in callbacks to the client.
Signed-off-by: Frank van Maarseveen [email blocked]
Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
:040000 040000 675c84bd8b2c50744018becaa0db4aeca19b8f9f
105fbd3cb3fa5e3019836b4b5268125d0181a72d M fs
:040000 040000 0138796e0806b4ebd1cc3850ed4e8c7ab24d2d41
2fec08debe51c20423a88b1a0d4281c683ba5daf M include
-----Original Message-----
From: Hua Zhong [mailto:hzhong@gmail.com]
Sent: Wednesday, August 29, 2007 1:59 PM
Subject: regression of autofs for current git?
Hi,
I am wondering if this is a known issue, but I just built the current git
and several autofs mounts mysteriously disappeared. Restarting autofs could
fix some, but then lose others. 2.6.22 was fine.
Is there anything I could check other than bisect? (It may take some time
for me to get to it)
Thanks for your help.
Hua
From: Trond Myklebust <trond.myklebust@fys.uio.no>
Subject: Re: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 18:37:13 -0400
On Thu, 2007-08-30 at 14:07 -0700, Hua Zhong wrote:
> I am re-sending this after help from Ian and git-bisect. To me it's a
> show-stopper: I cannot find an acceptable workaround that I can implement.
>
> The problem: upgrading to 2.6.23-rc4 from 2.6.22 causes several autofs
> mounts to fail silently - they just not appear when they should.
>
> I believe it's caused by the NFS change that forces multiple mounts from
> different directories under the same server side filesystem to have the same
> mount options by default, otherwise it returns EBUSY.
>
> For example, if server has a filesystem /a, and it exports /a/x and /a/y
> (maybe with rw or ro), and a client must mount /a/x and /a/y with the same
> mount options now.
Which is better than having it fail silently, or giving you a mount with
the wrong mount options.
If you need to mount the same filesystem with incompatible mount options
on the same client, then there is a new mount option "nosharecache",
which enables it.
The new option is there in order to make it damned clear to sysadmins
that this is a dangerous thing to do: mounts which don't share the same
superblock also don't share the same data and attribute caches. Any file
or directory which appears in both mounts had better only be used by one
application at a time or be using an appropriate locking scheme.
Trond
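As an illustration of the behaviour Hua and Trond describe above, here is a toy model in Python. It is a sketch under stated assumptions, not kernel code: `Client`, its `mount` method, and the string used as a filesystem id are hypothetical names invented for the example. The model assumes that mounts of the same server-side filesystem share one superblock, that a second mount with different options is refused with EBUSY, and that "nosharecache" gives the mount a private superblock.

```python
import errno


class Client:
    """Toy model of the 2.6.23-rc4 mount behaviour described in the thread."""

    def __init__(self):
        # Server-side filesystem id -> options of its shared superblock.
        self.shared = {}
        # Mountpoint -> (fsid, options, shares_cache).
        self.mounts = {}

    def mount(self, fsid, mountpoint, options=frozenset()):
        options = frozenset(options)
        if "nosharecache" in options:
            # Explicit opt-out: private superblock, private caches.
            self.mounts[mountpoint] = (fsid, options, False)
            return 0
        existing = self.shared.get(fsid)
        if existing is not None and existing != options:
            # Same server filesystem, different options: refuse the mount.
            return -errno.EBUSY
        self.shared[fsid] = options
        self.mounts[mountpoint] = (fsid, options, True)
        return 0
```

In this model, mounting `"server:/a"` read-only on one mountpoint and then read-write on another returns `-errno.EBUSY`, which is the failure autofs runs into in Hua's setup; adding `"nosharecache"` to the second mount's options lets it succeed.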
From: Andrew Morton [2] [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 18:24:35 -0700
On Thu, 30 Aug 2007 18:37:13 -0400 Trond Myklebust <trond.myklebust@fys.uio.no> wrote:
> On Thu, 2007-08-30 at 14:07 -0700, Hua Zhong wrote:
> > I am re-sending this after help from Ian and git-bisect. To me it's a
> > show-stopper: I cannot find an acceptable workaround that I can implement.
> >
> > The problem: upgrading to 2.6.23-rc4 from 2.6.22 causes several autofs
> > mounts to fail silently - they just not appear when they should.
> >
> > I believe it's caused by the NFS change that forces multiple mounts from
> > different directories under the same server side filesystem to have the same
> > mount options by default, otherwise it returns EBUSY.
> >
> > For example, if server has a filesystem /a, and it exports /a/x and /a/y
> > (maybe with rw or ro), and a client must mount /a/x and /a/y with the same
> > mount options now.
>
> Which is better than having it fail silently, or giving you a mount with
> the wrong mount options.
>
> If you need to mount the same filesystem with incompatible mount options
> on the same client, then there is a new mount option "nosharecache",
> which enables it.
> The new option is there in order to make it damned clear to sysadmins
> that this is a dangerous thing to do: mounts which don't share the same
> superblock also don't share the same data and attribute caches. Any file
> or directory which appears in both mounts had better only be used by one
> application at a time or be using an appropriate locking scheme.
>
If we're going to send a message to sysadmins, we shouldn't force them to go
through a git bisection search and a lkml discussion to receive it!
Is there at least some way in which the kernel can detect this situation
and emit a friendly printk which guides people to a friendly document?
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 20:49:34 -0700 (PDT)
On Thu, 30 Aug 2007, Trond Myklebust wrote:
>
> Which is better than having it fail silently, or giving you a mount with
> the wrong mount options.
No, Trond.
That commit gets reverted or fixed. It's a regression, and your theories
that it's "better" that way are obviously broken.
It's obviously broken because you seem to say that you know better, even
though you also admit that:
"How is the NFS client to know that these directories are disjoint, or
that no-one will ever create a hard link from one directory to another?
To my knowledge, the only way to ensure this is to put them on
different disk partitions."
the point being that you just disallowed people from doing things that are
sane but _potentially_ dangerous. That's not how we work. The UNIX way is
to give people rope - if you cannot *prove* that what they are doing is
wrong, then you damn well better not disallow it.
No regressions, Trond. Especially not for stuff that used to work, was
used, and that could be sanely expected to work (which this *definitely*
sounds like).
Please send in a fix. If the fix involves making "nosharecache" the
default, then that is better than making policy decisions like this in the
kernel. The kernel should do what the user asks and not put in unnecessary
roadblocks.
Hua - that said, I don't actually see why the commit you bisected to has
anything to do with the issue being discussed. Can you double-check that
it's literally that particular commit that breaks for you (you could try
just reverting that commit).
Linus
From: Trond Myklebust [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 00:44:45 -0400
On Thu, 2007-08-30 at 20:49 -0700, Linus Torvalds wrote:
>
> On Thu, 30 Aug 2007, Trond Myklebust wrote:
> >
> > Which is better than having it fail silently, or giving you a mount with
> > the wrong mount options.
>
> No, Trond.
>
> That commit gets reverted or fixed. It's a regression, and your theories
> that it's "better" that way are obviously broken.
>
> It's obviously broken because you seem to say that you know better, even
> though you also admit that:
>
> "How is the NFS client to know that these directories are disjoint, or
> that no-one will ever create a hard link from one directory to another?
> To my knowledge, the only way to ensure this is to put them on
> different disk partitions."
>
> the point being that you just disallowed people from doing things that are
> sane but _potentially_ dangerous. That's not how we work. The UNIX way is
> to give people rope - if you cannot *prove* that what they are doing is
> wrong, then you damn well better not disallow it.
>
> No regressions, Trond. Especially not for stuff that used to work, was
> used, and that could be sanely expected to work (which this *definitely*
> sounds like).
It did not. The previous behaviour was to always silently override the
user mount options.
> Please send in a fix. If the fix involves making "nosharecache" the
> default, then that is better than making policy decisions like this in the
> kernel. The kernel should do what the user asks and not put in unnecessary
> roadblocks.
This is _not_ a kernel policy decision. The kernel is simply informing
the user that it cannot fulfil the mount request as specified. Exactly
why do you think that NFS should be any different from other filesystems
when it comes to this?
AFAIK, every other filesystem will give you an EBUSY if you try to mount
a partition with -oro if you are already mounting somewhere else with
-orw. Every filesystem will give you an EBUSY if you try to mount the
partition with -oacl if it is mounted somewhere else with -onoacl. The
reason: exactly the same as NFS, the caches cannot remain consistent
when you try to mount two different super blocks that both refer to the
same underlying filesystem.
Trond
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 21:59:17 -0700 (PDT)
On Fri, 31 Aug 2007, Trond Myklebust wrote:
>
> It did not. The previous behaviour was to always silently override the
> user mount options.
..so it still worked for any sane setup, at least.
You broke that. Hua gave good reasons for why he cannot use the current
kernel. It's a regression.
In other words, the new behaviour is *worse* than the behaviour you
consider to be the incorrect one.
Linus
From: Trond Myklebust [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 01:04:18 -0400
On Thu, 2007-08-30 at 21:59 -0700, Linus Torvalds wrote:
>
> On Fri, 31 Aug 2007, Trond Myklebust wrote:
> >
> > It did not. The previous behaviour was to always silently override the
> > user mount options.
>
> ..so it still worked for any sane setup, at least.
>
> You broke that. Hua gave good reasons for why he cannot use the current
> kernel. It's a regression.
>
> In other words, the new behaviour is *worse* than the behaviour you
> consider to be the incorrect one.
So you are saying that it is acceptable for the kernel to decide
unilaterally to override mount options? Why aren't we doing that for any
other filesystem than NFS?
Trond
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 22:16:37 -0700 (PDT)
On Fri, 31 Aug 2007, Trond Myklebust wrote:
>
> So you are saying that it is acceptable for the kernel to decide
> unilaterally to override mount options?
IT'S WHAT WE'VE APPARENTLY ALWAYS DONE!
> Why aren't we doing that for any other filesystem than NFS?
How hard is it to acknowledge the following little word:
"regression"
It's simple. You broke things. You may want to fix them, but you need to
fix them in a way that does not break user space.
Linus
From: Jakob Oestergaard [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 09:40:28 +0200
On Thu, Aug 30, 2007 at 10:16:37PM -0700, Linus Torvalds wrote:
>
...
> > Why aren't we doing that for any other filesystem than NFS?
>
> How hard is it to acknowledge the following little word:
>
> "regression"
>
> It's simple. You broke things. You may want to fix them, but you need to
> fix them in a way that does not break user space.
Trond has a point Linus.
What he "broke" is, for example, a ro mount being mounted as rw.
That *could* be a very serious security (etc.etc.) problem which he just fixed.
Anything depending on read-only not being enforced will cease to work, of
course, and that is what a few people complain about(!).
If ext3 in some rare case (which would still mean it hit a few thousand users)
failed to remember that a file had been marked read-only and allowed writes to
it, wouldn't we want to fix that too? It would cause regressions, but we'd fix
it, right?
mount passes back the error code on a failed mount. autofs passes that error
along too (when people configure syslog correctly). In short; when these
serious mistakes are made and caught, the admin sees an error in his logs.
This is not wrong. This is good.
--
/ jakob
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 01:07:56 -0700 (PDT)
On Fri, 31 Aug 2007, Jakob Oestergaard wrote:
>
> Trond has a point Linus.
I don't dispute that the new code does something good.
But it changes existing behaviour.
When we add NEW BEHAVIOUR, we don't add it to old interfaces when that
breaks old user mode! We add a new flag saying "I want the new behaviour".
This is not rocket science, guys. This is very basic kernel behaviour. The
kernel exists only to serve user space, and that means that there is no
more important thing to do than to make sure you don't break existing
users, unless you have some *damn* strong reasons.
> What he "broke" is, for example, a ro mount being mounted as rw.
No. What he broke was a working and sane setup.
The fact that he may *also* have broken insane setups is totally
irrelevant. Don't go off on some tangent that has nothing to do with the
regression in question!
> If ext3 in some rare case (which would still mean it hit a few thousand users)
> failed to remember that a file had been marked read-only and allowed writes to
> it, wouldn't we want to fix that too? It would cause regressions, but we'd fix
> it, right?
Stop blathering. Of course we fix security holes. But we don't break
things that don't need breaking. This wasn't a security hole.
You are making up irrelevant arguments that have nothing to do with this
regression.
If you want new behaviour, you add a new flag saying you want new
behaviour. You don't just start behaving differently from what you've
always done before (and what *other* UNIXes do, for that matter).
Besides, even *if* it was a matter of somebody doing a mount with "rw",
when the previous mount was "ro", returning EBUSY is still the wrong thing
to do! If the user asks for a new mount that is read-write, he should just
get it - ie we should not re-use the old client handles, and we should do
what Solaris apparently does, namely to just make it a totally different
mount.
In other words, it should (as I already mentioned once) have used
"nosharecache" by default, which makes it all work.
Then, people who want to re-use the caches (which in turn may mean that
everything needs to have the same flags), THOSE PEOPLE, who want the NEW
SEMANTICS (errors and all) should then use a "sharecache" flag.
See? You don't have to screw people over.
> mount passes back the error code on a failed mount. autofs passes that error
> along too (when people configure syslog correctly). In short; when these
> serious mistakes are made and caught, the admin sees an error in his logs.
Bullshit. "Seeing the error in his logs" doesn't help anything. The
problem wasn't the lack of error, the problem was that it was a new and
unnecessary error in the first place. Logging it doesn't make it any less
buggy.
Linus
From: Trond Myklebust [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 08:11:38 -0400
On Fri, 2007-08-31 at 01:07 -0700, Linus Torvalds wrote:
>
> If you want new behaviour, you add a new flag saying you want new
> behaviour. You don't just start behaving differently from what you've
> always done before (and what *other* UNIXes do, for that matter).
>
> Besides, even *if* it was a matter of somebody doing a mount with "rw",
> when the previous mount was "ro", returning EBUSY is still the wrong thing
> to do! If the user asks for a new mount that is read-write, he should just
> get it - ie we should not re-use the old client handles, and we should do
> what Solaris apparently does, namely to just make it a totally different
> mount.
>
> In other words, it should (as I already mentioned once) have used
> "nosharecache" by default, which makes it all work.
>
> Then, people who want to re-use the caches (which in turn may mean that
> everything needs to have the same flags), THOSE PEOPLE, who want the NEW
> SEMANTICS (errors and all) should then use a "sharecache" flag.
That would be a major change in existing semantics. The default has been
"sharecache" ever since Al Viro introduced the "sget()" function some 6
or 7 years ago. The problem was that we never advertised the fact that
the kernel was overriding your mount options, and so sysadmins were
(rightly IMO) complaining that they should _know_ when the client does
this.
The list of known problems with a "nosharecache" default is nasty too:
- file and directory attribute and data caching breaks.
Applications will see stale data in cases where they otherwise
would not expect it.
- the existing dcache and icache issues when a file is renamed
or deleted on the server are now extended to also include the
case where the rename or deletion occurs on an alias in another
directory on the client itself. In particular, sillyrename will
break.
- file locking breaks (the server knows that the client holds
locks on one file, whereas the client thinks it holds locks on
several).
- the NFSv4 delegation model breaks: the client will be using
OPEN when it could use cached opens. More importantly, when
performing an operation that requires it to return the
delegation on the aliased file, it won't know until the server
sends it a callback.
...and of course, the amount of unnecessary traffic to the server
increases. I'm not aware of any sane way of dealing with those issues,
and I doubt Solaris has a solution for them either.
Trond
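The first item in Trond's list, stale data when two mounts of the same files keep separate caches, can be illustrated with a toy model. This is a deliberately simplified sketch: `Server` and `Mount` are hypothetical names, and the model ignores the attribute revalidation a real NFS client performs, so it overstates how long stale data would persist.

```python
class Server:
    """Stands in for the NFS server: the single source of truth."""

    def __init__(self):
        self.files = {}


class Mount:
    """One client-side mount with its own private data cache."""

    def __init__(self, server):
        self.server = server
        self.cache = {}

    def write(self, name, data):
        # Write through to the server and update only this mount's cache.
        self.server.files[name] = data
        self.cache[name] = data

    def read(self, name):
        # Serve from the private cache; fetch from the server on a miss.
        if name not in self.cache:
            self.cache[name] = self.server.files[name]
        return self.cache[name]
```

If two `Mount` objects are created on the same `Server`, a file read through the second mount and then rewritten through the first is still returned with its old contents by the second mount, because nothing invalidates its private cache.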
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 09:43:29 -0700 (PDT)
On Fri, 31 Aug 2007, Jakob Oestergaard wrote:
>
> > The fact that he may *also* have broken insane setups is totally
> > irrelevant. Don't go off on some tangent that has nothing to do with the
> > regression in question!
>
> It does not have "nothing" to do with the regression.
>
> Some setups which worked more by accident than by design earlier on were broken
> by the fix. This could have been avoided, I agree, but the breakage was caused
> by the fix (or the breakage is the fix, however you prefer to look at it).
Well, it's not a "fix" if it breaks other setups.
It's especially not a fix since the whole requirement that all the flags
be exactly the same is totally brain-dead in the first place. We *have*
that kind of mount already, and it has nothing to do with NFS: it's called
a "bind" mount.
So if you want an identical mount, with cache coherency and tying the two
mount-points together (requiring that they have the same mount flags),
then that has absolutely *nothing* to do with NFS. The VFS layer does that
for you.
> *part* of it wasn't a security hole.
>
> The other half very much was.
No, the fix was simply wrong. It was done the wrong way, and it broke
things it shouldn't have broken.
Let's put it this way: if I create a patch that stops the system from
booting, I sure as hell fix a potential security hole, don't I?
Does that make my patch a "fix"?
No it does not.
> Sure, given that Trond (or whomever) has the time it takes to go and implement
> all of this, there's no need to screw anyone.
>
> Assuming he's on a schedule and this will have to wait, I agree with him that
> it makes the most sense to play it safe security/consistency-wise rather than
> functionality-wise.
I disagree. Either that thing gets fixed before 2.6.23, or the commit that
introduced the broken behaviour gets reverted.
We've had this policy of "regressions are fixed" for a long time, and
we're not suddenly changing it.
This is *not* a security hole. In order to make it a security hole, you
need to be root in the first place. So what you call a security hole is
really no different from root installing a bad SUID binary. It's simply
not the kernels place to then say "SUID binaries will not work, because
it's a potential security hole".
See?
So stop calling this a security hole. It's certainly a misfeature, but:
- it's a misfeature that people are used to, and has been around forever.
- there are bound to be ways to fix it that don't break existing users.
- the requirement that all flags be the same for a mount to the same NFS
directory is *particularly* stupid, since there are better ways to do
that than go through NFS!
so I really don't see why people excuse the new behaviour.
Linus
From: Linus Torvalds [email blocked]
Subject: RE: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 21:38:08 -0700 (PDT)
On Fri, 31 Aug 2007, Trond Myklebust wrote:
>
> No. Solaris defaults to breaking cache consistency.
If so, and since that's obviously what people _expect_ to happen, why not
make that the default, with the "consistent" behaviour being the one that
needs an explicit option.
Just out of curiosity - Hua, is this NFSv2? Especially there, cache
"consistency" is largely a joke anyway, so defaulting to some annoying
careful mode is doubly ridiculous.
Linus
From: Hua Zhong [email blocked]
Subject: RE: recent nfs change causes autofs regression
Date: Thu, 30 Aug 2007 21:47:59 -0700
> On Fri, 31 Aug 2007, Trond Myklebust wrote:
> >
> > No. Solaris defaults to breaking cache consistency.
>
> If so, and since that's obviously what people _expect_ to happen, why
> not make that the default, with the "consistent" behaviour being the
> one that needs an explicit option.
>
> Just out of curiosity - Hua, is this NFSv2? Especially there, cache
> "consistency" is largely a joke anyway, so defaulting to some annoying
> careful mode is doubly ridiculous.
It's v3 as can be seen from the autofs maps I posted.
These directories are used mostly as read-only and get pulled in via our
build system. We do not actually write to them often, if at all. I don't
think this setup is uncommon, and I am worried that once people start using
the latest kernel their systems will mysteriously break.
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Fri, 31 Aug 2007 10:01:50 -0700 (PDT)
On Fri, 31 Aug 2007, Trond Myklebust wrote:
>
> The best I can do given the constraints appears to be to have the kernel
> first look for a superblock that matches both the fsid and the
> user-specified mount options, and then spawn off a new superblock if
> that search fails.
I think this is probably acceptable to get roughly the old behaviour, but
I still think it's a bit stupid.
What happens at "mount -o remount,..." time?
The fact is, the whole "match the fsid and user mount options, and re-use
the mount" sounds like it's trying to solve a problem that doesn't need
solving. If the user really wants to duplicate the mount, he really should
be using a bind-mount instead.
In other words, let's assume that the user has /some/nfs/mount mounted
over NFS, and wants to re-mount it (or even just a subset of it) somewhere
else, the sane thing to do is not to mount it again, but to just do
mount --bind /some/nfs/mount/subdir /new/mount/place
instead. That *guarantees* that the low-level filesystem uses the same
flags, and it also means that things like re-mounting have sane and
well-defined semantics, and will fail or succeed predictably.
In contrast, if a user wants to create a new NFS mount, it really should
be independent of the old one, because that's (a) what other systems do,
and (b) also makes the semantics of re-mounting it with other flags be
clear and unambiguous (ie the remount has nothing what-so-ever to do with
the independent NFS mount).
See?
This is why I think "nosharecache" should just be the default, because
that's the behaviour that simply does not have any subtle issues. The
*special* case should be the "sharecache" case, and 99% of the time that
one should likely be done with a "--bind" mount.
(I don't really see the point of _ever_ doing anything but a bind mount,
but maybe there are reasons to try to share at a NFS layer that I don't
really see)
> The attached patch does just that.
Hua, does this fix things for you? If it gets rid of the regression, I can
certainly live with it, but as per above, I don't really think this makes
much sense in the "bigger picture" kind of thing.
> Finally, for the record: I still feel very uncomfortable about not being
> able to report the state of the client setup back to the sysadmin.
> AFAIK, the only way to do so is to stat the mountpoints, and compare the
> device ids.
Well, not only don't I see that as being horribly wrong, I actually think
that the sysadmin should know what his mount setup is, even without having
to ask. But since he *can* ask, using easy and standard interfaces, I
don't really see what the problem really is.
Linus
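The check Trond mentions, stat the mountpoints and compare the device ids, can be sketched in Python using `os.stat`. The function name `same_superblock` is made up for this example, and it rests on the assumption, implied by Trond's remark, that mounts backed by different superblocks report different device ids.

```python
import os


def same_superblock(path_a, path_b):
    """Return True if both paths report the same device id (st_dev)."""
    return os.stat(path_a).st_dev == os.stat(path_b).st_dev
```

Called on two NFS mountpoints, a `True` result would suggest that they share a superblock and therefore their caches; `False` would suggest independent mounts.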
From: David Howells [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Tue, 04 Sep 2007 09:35:40 +0100
Linus Torvalds [email blocked] wrote:
> In other words, let's assume that the user has /some/nfs/mount mounted
> over NFS, and wants to re-mount it (or even just a subset of it) somewhere
> else, the sane thing to do is not to mount it again, but to just do
That helps one case, yes, but what about a superset? What about two sets that
might intersect but for which you don't have the common root to hand? The
current NFS code deals with all these problems by attempting to share the
dentry sets. Superblocks can now have multiple roots and we graft trees
together automatically when we discover one is a subset of another.
The case I came up with was this:
mount home:/home/fred /home/fred
mount home:/home/jim /home/jim
To effect these, the NFS mount process looks up "/home/fred" or "/home/jim"
directly rather than looking up "/" and path walking. However, the NFS client
in the kernel may note that both Fred's and Jim's home directories reside on
the same NFS volume. You cannot use a bind mount here because there's nothing
to bind from.
Then, should, say, this happen:
mount home:/home /mnt
You'll probably end up with three roots in the NFS superblock. Following with
an ls of /home, say, would then populate the dentries for /home - including
those for fred and jim, and the code would splice in the dentries now rooted at
/home/fred and /home/jim.
You can't do that with bind mounts as far as I know because I don't believe
that you can go up the tree (rootwards) from the apparent root of a vfsmount.
So bind mounts aren't quite it for this problem, and in any case your
suggestion of:
mount --bind /some/nfs/mount/subdir /new/mount/place
doesn't help with the automounter case particularly well. The automounter
*could* probe to see if the server stuff is common with an already existing
mount, but there would then be a race, and it doesn't help with the homedir
example I gave above either.
You might think "well, start by mounting '/' somewhere and then bind mounting
subdirs of it", but that doesn't work if you can't mount "/" or "/home", and
might go spectacularly wrong if the server has a symlink in the path that you
can't see.
> This is why I think "nosharecache" should just be the default, because
> that's the behaviour that simply does not have any subtle issues. The
> *special* case should be the "sharecache" case, and 99% of the time that
> one should likely be done with a "--bind" mount.
Yeah, that's probably necessary, if annoying. However, local caching can
enable sharing or make it a prerequisite option.
> (I don't really see the point of _ever_ doing anything but a bind mount,
> but maybe there are reasons to try to share at a NFS layer that I don't
> really see)
The reason I added all this NFS superblock sharing is so that I could implement
on-disk local caching much more easily. If, for instance, two netfs inodes
aren't shared, but their "index keys" say they should use the same piece of
cache then all sorts of fun ensues from the disjoint cache coherency.
Even working out that two inodes are using the same piece of cache isn't
trivial (though it seems like it ought to be).
David
From: Linus Torvalds [email blocked]
Subject: Re: recent nfs change causes autofs regression
Date: Tue, 4 Sep 2007 02:04:12 -0700 (PDT)
On Tue, 4 Sep 2007, David Howells wrote:
>
> That helps one case, yes, but what about a superset? What about two sets that
> might intersect but for which you don't have the common root to hand?
Sure. In which case bind mounts don't work. Fair enough.
> The case I came up with was this:
>
> mount home:/home/fred /home/fred
> mount home:/home/jim /home/jim
The much more trivial case is
mount -o ro server:/usr/bin /usr/share/bin
mount server:/usr/tmp /usr/share/tmp
and now tell me any reasonable reason why this should fail? (Replace "-o
ro" with any other attributes).
Quite frankly, if the above two mounts fail - just because /usr/bin and
/usr/tmp happen to be on the same filesystem on the server - then the
implementation is more than just buggy - it's a pure piece of shit.
And quite frankly, as far as I can tell, that was exactly what the NFS
changes that are being discussed did. They failed the equivalent of the
second mount, because it didn't have the same flags as the first one.
Can you really honestly say that wasn't totally broken?
> The reason I added all this NFS superblock sharing is so that I could implement
> on-disk local caching much more easily. If, for instance, two netfs inodes
> aren't shared, but their "index keys" say they should use the same piece of
> cache then all sorts of fun ensues from the disjoint cache coherency.
>
> Even working out that two inodes are using the same piece of cache isn't
> trivial (though it seems like it ought to be).
I'm just saying that the whole "require all mount flags to be identical,
and error out if they are not" is pure and utter CRAP.
So anything that does that - for *any* reason what-so-ever - is just
broken. If you require identical mount-time flags, that absolutely has to
be a special case (like using "--bind", or perhaps using a special option
like "sharecache").
It really is that simple. I don't know how anybody could possibly ever
dispute that.
As far as I can tell, the current situation in NFS is "reasonably ok", but
I already asked Trond about what happens with "remount" with the "same
mount options imply sharecache" code that he did, and afaik, I never got
an answer. In other words, let's change the above two commands to the
following three commands:
mount server:/usr/bin /usr/share/bin
mount server:/usr/tmp /usr/share/tmp
mount -o remount,ro /usr/share/bin
and I'm claiming that if the above fails (or remounts /usr/share/tmp as
read-only too), then it's also obvious CRAP (replace "ro" with any other
possible attribute - whether cache timeouts or similar)
See? It really is that simple. The obvious mount usage above absolutely
*has* to work, and anything that breaks it is crap, crap, crap. And that
was exactly what apparently happened here, and I really don't see why
anybody has the *gall* to claim that the "default to sharecache" code
wasn't totally broken.
Linus
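The "same mount options imply sharecache" approach referred to above, where the client reuses a superblock only when both the filesystem id and the mount options match and otherwise creates a new one, can be modelled as follows. This is a toy sketch, not the actual patch: `MatchingClient` and its methods are hypothetical names, and remount is left out because the thread leaves its behaviour unanswered.

```python
class MatchingClient:
    """Toy model: share a superblock only on an exact (fsid, options) match."""

    def __init__(self):
        # One (fsid, options) entry per superblock.
        self.superblocks = []
        # Mountpoint -> index into self.superblocks.
        self.mounts = {}

    def mount(self, fsid, mountpoint, options=frozenset()):
        key = (fsid, frozenset(options))
        if key in self.superblocks:
            # Matching superblock exists: share it and its caches.
            index = self.superblocks.index(key)
        else:
            # No match: create an independent superblock instead of failing.
            self.superblocks.append(key)
            index = len(self.superblocks) - 1
        self.mounts[mountpoint] = index
        return 0

    def shares_cache(self, mountpoint_a, mountpoint_b):
        return self.mounts[mountpoint_a] == self.mounts[mountpoint_b]
```

Under this model, Linus's two-mount example (a read-only mount of one directory and a default mount of another directory on the same server filesystem) succeeds, with the two mounts ending up on separate superblocks.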
Related Links:
- Archive of above thread [3]
- KernelTrap interview with Andrew Morton [4]