Re: VFS + path walktrough

To: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 8:40 am

Hi folks,

Could anyone please enlighten me: what exactly happens when
an application opens a file with a longer pathname?

Let's say we open /a/b/c/d and /a is mounted with some network
filesystem (e.g. 9P). Who exactly does the walkthrough from b to d?
The individual filesystem or the VFS?

The point is: the 9P protocol can work with whole pathnames, so
the client doesn't have to do the walkthrough manually - this
can heavily reduce traffic and latency. I'd like the 9P fs driver
to directly use this, if VFS can send the whole pathname at once.


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
---------------------------------------------------------------------
 Please visit the OpenSource QM Taskforce:
 	http://wiki.metux.de/public/OpenSource_QM_Taskforce
 Patches / Fixes for a lot dozens of packages in dozens of versions:
	http://patches.metux.de/
---------------------------------------------------------------------
--
To: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 9:06 am

I've dug a bit into the source and found out that it goes
down to link_path_walk(). It seems to split the pathname into
components and walk through them one by one.
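
The component-by-component splitting described above can be illustrated with a small userspace sketch. This is not the kernel's link_path_walk(); the names walk_components, component_fn, comp_log and record_component are hypothetical and exist only for this example, which ignores symlinks, mountpoints and the dcache entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_COMPONENTS 16	/* arbitrary limit for this sketch */
#define MAX_NAME 64		/* arbitrary limit for this sketch */

/* Hypothetical callback: invoked once per component; non-zero aborts the walk. */
typedef int (*component_fn)(const char *name, size_t len, void *ctx);

/* Split a slash-separated path and hand each non-empty component to fn. */
static int walk_components(const char *path, component_fn fn, void *ctx)
{
	assert(path != NULL && fn != NULL);

	while (*path) {
		const char *start;
		int err;

		while (*path == '/')		/* skip runs of separators */
			path++;
		if (!*path)
			break;
		start = path;
		while (*path && *path != '/')	/* find the end of this component */
			path++;
		err = fn(start, (size_t)(path - start), ctx);
		if (err)
			return err;
	}
	return 0;
}

/* Example callback state: the components seen so far, in order. */
struct comp_log {
	size_t count;
	char names[MAX_COMPONENTS][MAX_NAME];
};

/* Example callback: copy the component name into the log. */
static int record_component(const char *name, size_t len, void *ctx)
{
	struct comp_log *log = ctx;

	if (log->count >= MAX_COMPONENTS || len >= MAX_NAME)
		return -1;	/* does not fit: abort the walk */
	memcpy(log->names[log->count], name, len);
	log->names[log->count][len] = '\0';
	log->count++;
	return 0;
}
```

Calling walk_components("/a//b/c/d", record_component, &log) on a zeroed log visits a, b, c and d in that order; a real lookup would resolve each name against the directory found in the previous step.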

We could just add another call vector to struct file_operations
as a replacement for link_path_walk(): if it's unset (NULL), the
original function is used. This way a filesystem can do the
walkthrough on its own, but doesn't need to.


What do you think about this ?


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
--
To: Enrico Weigelt <weigelt@...>
Cc: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 9:13 am

That you have quite forgotten about mounts.  And we *REALLY* don't
want to shift the entire logic of link_path_walk() into filesystems -
this is insane.  Even the "let's follow that symlink" part alone, not to
mention mountpoint handling, populating the dcache, etc.
--
To: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 9:43 am

Hmm, I thought this would be done before link_path_walk().

Only for those filesystems that *really* want to do it by
themselves and set the appropriate call vector. All other
filesystems will just leave it blank (they don't even have to be
touched), so the old way remains for them.

To get around mountpoint issues, we could at least do it only
when a special mount option is given and add a big fat warning
that mountpoints within these mounts won't work. So these fast
lookups will only happen when:

#1: the fs explicitly supports it
#2: it is mounted with a special option

And if you use that option, you'll simply lose the ability
to use mountpoints within this specific mount. This won't
affect any situation other than #1 && #2. IMHO this is better
than no chance of fast lookups at all. Of course, a cleaner
approach would be better, but it's IMHO not critical.
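
The "#1 && #2" condition can be written down as a small sketch. Everything here is hypothetical and only illustrates the proposal as stated: struct fs_ops, its lookup_path member, the MNT_FASTLOOKUP flag and example_lookup_path are invented names, not existing kernel interfaces.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical mount flag standing in for the "special option" (#2). */
#define MNT_FASTLOOKUP 0x1UL

/* Hypothetical per-filesystem operations table. */
struct fs_ops {
	/* Resolve a multi-component path in one go; NULL if unsupported (#1). */
	int (*lookup_path)(const char *path);
};

/* Stub implementation, present only so the table can be filled in. */
static int example_lookup_path(const char *path)
{
	(void)path;
	return 0;
}

/* The fast path is taken only if the fs offers it AND the mount asks for it. */
static int use_fast_lookup(const struct fs_ops *ops, unsigned long mnt_flags)
{
	assert(ops != NULL);
	return ops->lookup_path != NULL && (mnt_flags & MNT_FASTLOOKUP) != 0;
}
```

A filesystem that leaves lookup_path NULL, or a mount without the flag, falls through to the ordinary component-by-component lookup.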

BTW: there are (or have been) certain speed improvements for
specific situations at the cost of losing other standard features,
e.g. fast bridging.


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
--
To: Enrico Weigelt <weigelt@...>
Cc: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 11:35 am

How on earth...?  You don't know where pathname resolution will
get you, so how could you possibly handle mountpoint transitions prior

This is crap.  First of all, the logic is already overcomplicated.
_Then_ we have a problem of populating dcache for intermediates.

Besides, that's not what that thing is for - it's to allow local
caching (which we do) with revalidation of several components
at once.  _After_ VFS has decided that nothing interesting is in
the part of path it has cached.  Then the protocol allows to do
bulk Walk, verifying that all cached intermediates still match
the reality, all in one roundtrip.
--
To: <viro@...>
Cc: <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 12:43 pm

One way this could be done cleanly is to pass the rest of the path
(as a hint) to the filesystem in its lookup function.  Most filesystems
would just ignore it, but those which have the capability can use it
to do the lookup in one go, and internally cache the result.  The VFS
doesn't need to know _anything_ about all this.  If there are
mountpoints, they are already cached, so ->lookup() wouldn't be called
at all, only ->d_revalidate(), which is a different issue.

Miklos
--
To: Miklos Szeredi <miklos@...>
Cc: <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 1:14 pm

This is still wrong.  We don't just pass the pathname to the filesystem
(note that you still need to deal with symlinks), we make that filesystem
populate the dentry tree.  Take a look at 9P walk - it does *not* give
you anything resembling stat, you just get the qids of the intermediates.
Which is bloody useful when you want to do intelligent revalidation (do a
local cached walk, then issue a single protocol request that will both do
a bulk revalidate *and* tell you where in the path you've got the first
invalid one - just compare the qids with what you've got stored locally).
However, it's just about useless for cutting corners in cold-cache
lookup.
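
The qid comparison described above can be sketched in userspace as follows. The struct mirrors the three fields of a 9P qid (type, version, path) and QTDIR is the protocol's directory bit; the function name first_stale and the choice to compare only type and path (ignoring version, which changes whenever a file is modified) are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define QTDIR 0x80	/* qid.type bit for directories in 9P2000 */

/* The fields of a 9P qid. */
struct qid {
	uint8_t  type;
	uint32_t version;
	uint64_t path;
};

/* Assumption: same type and path means "still the same file". */
static int qid_same_file(const struct qid *a, const struct qid *b)
{
	return a->type == b->type && a->path == b->path;
}

/*
 * Compare the locally cached qids of the intermediates with the qids one
 * bulk walk returned.  Returns the index of the first component that no
 * longer matches, or ncached if every cached component is still valid.
 * A reply shorter than the cached chain means the server's walk stopped
 * early, so the first missing component counts as invalid.
 */
static size_t first_stale(const struct qid *cached, size_t ncached,
			  const struct qid *returned, size_t nreturned)
{
	size_t n = nreturned < ncached ? nreturned : ncached;
	size_t i;

	assert(cached != NULL && returned != NULL);

	for (i = 0; i < n; i++)
		if (!qid_same_file(&cached[i], &returned[i]))
			break;
	return i;
}
```

Everything before the returned index can stay in the cache; only the tail from that index onward would need a fresh lookup.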

It _is_ a useful thing, no arguments about that.  However, to use it
in a sane way we need to massage the pathname resolution loop, taking
the "simple pass without symlinks or mountpoints" part into a new
helper, turning the current __link_path_walk() into a loop calling that
one and then folding it into callers.  That would also allow us to kill the
last remnants of recursion in symlink handling for the normal fs case...
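
A very rough userspace sketch of that loop structure, under heavy simplification: the path is assumed to be already split into components, each tagged in advance as plain, symlink or mountpoint (the real code only learns this while walking). The names comp_kind, simple_pass and count_simple_segments are hypothetical and do not correspond to kernel functions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of an already-split path component. */
enum comp_kind { COMP_PLAIN, COMP_SYMLINK, COMP_MOUNTPOINT };

/*
 * The "simple pass": starting at index start, consume consecutive plain
 * components.  Returns the index of the first component that needs special
 * handling, or n if the rest of the path is plain.
 */
static size_t simple_pass(const enum comp_kind *kinds, size_t n, size_t start)
{
	size_t i = start;

	assert(kinds != NULL || n == 0);

	while (i < n && kinds[i] == COMP_PLAIN)
		i++;
	return i;
}

/*
 * The outer loop: alternate between simple passes and one-at-a-time
 * handling of symlinks and mountpoints.  Returns the number of non-empty
 * plain segments, each of which is a candidate for one bulk revalidation.
 */
static size_t count_simple_segments(const enum comp_kind *kinds, size_t n)
{
	size_t i = 0, segments = 0;

	while (i < n) {
		size_t end = simple_pass(kinds, n, i);

		if (end > i)
			segments++;
		i = end;
		if (i < n)
			i++;	/* symlink or mountpoint: handled outside the helper */
	}
	return segments;
}
```

For a path whose components are plain, plain, symlink, plain, mountpoint, mountpoint, plain, the sketch finds three plain segments.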

_Then_ we can do saner logic for revalidate, allowing it on such segments.
Which, BTW, would deal with -ESTALE in a saner way, rather than "repeat
full pathname resolution from the very beginning".  And that's where
9P multi-step walk(5) would do very nicely, indeed.

And fuck the "hints" of all kinds, pardon the rudeness.  We already have
more than enough of that crap and it already makes cleaning the logic
up bloody painful.
--
To: <viro@...>
Cc: <miklos@...>, <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 1:33 pm

Symlinks are easy: filesystem just needs to *stop* the resolution the






Separate i_op for it is fine by me as well.

Not that I care very much.  I have plans for such a bulk lookup
interface in fuse, but that's far in the future.

Miklos

--
To: Miklos Szeredi <miklos@...>
Cc: <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 1:40 pm

No - you need inodes as well (i.e. as the absolute least you want
mode and ownership).  Which is to say, you need to issue stat on
each component in such situation anyway.  Not a win...
--
To: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 2:23 pm

I've just read the spec for walk again:

Assuming the server doesn't resolve symlinks itself, the walk
will fail right at the symlink. So we can take a closer look
there and try stat()'ing it (which adds one more request). If the
failure point *is* a symlink, we need to handle it properly.

Would it be very complicated to give the link target back to

Naive question: is it really *necessary* to have all the
intermediate dirs in the dcache?


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
--
To: Enrico Weigelt <weigelt@...>
Cc: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 2:34 pm

Um...  What the hell are you talking about?  How _can_ server resolve
symlinks, when result of symlink resolution depends on where the damn
thing is mounted on client and even how deeply the process trying to
do lookup happens to be chrooted?

It wouldn't work even for relative symlinks - remember that we might
bloody well have something bound on the middle of the path in question.

The answer's "yes".
--
To: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 3:02 pm

In the same way as, e.g., HTTP servers do. Of course this fails
if the symlink isn't resolvable within the server's fs.

Several years ago, I saw exactly this behaviour with Samba.

What exactly are they needed for?
Which information is needed?
Can we perhaps fake them (at least we know, on success, that the
intermediate components are dirs)?


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
--
To: Enrico Weigelt <weigelt@...>
Cc: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 3:09 pm

Umm...  You know, it might make more sense if you
	* explained what you are really trying to do
	* short of that, perhaps figured out what the hell symlinks and
bindings _are_.

Again, _no_ symlink is resolvable by server alone, simply because
server can not know if target of that symlink is overmounted from
the point of view of whoever is doing lookup.  Note that it *does*
depend on who's doing that and where in the namespace we are seeing
that sucker (the latter kills the "we want per-user connection"
variants).
--
To: <viro@...>
Cc: <miklos@...>, <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 2:03 pm

Umm, OK.  The 9P server does see the type of objects, so it should be

You're right.  It doesn't sound too good, although it all depends on
how permission checking is done.  If it's done in the server, then
neither the mode nor the ownership is needed for lookup.  The file
type *is* known for all but the last component, and doing a stat for
that one is not a big issue.  All this is modulo the symlink issue of
course.

Miklos
--
To: linux kernel list <linux-kernel@...>
Date: Monday, May 5, 2008 - 2:50 pm

It really should be done in the server. But this adds another
issue (no idea if the current 9p driver already handles this):
We need one link for each user accessing the filesystem.


cu
-- 
---------------------------------------------------------------------
 Enrico Weigelt    ==   metux IT service - http://www.metux.de/
--
To: <miklos@...>
Cc: <viro@...>, <miklos@...>, <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 2:31 pm

And actually even that *could* be a win, if the network latency is
large.  Because by doing the lookup first, the stats can be performed
in parallel.  So a path with an arbitrary number of components could
be resolved in just 2 RTTs.
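
The round-trip arithmetic behind that claim, as a minimal sketch. The function names are hypothetical, and both models are assumptions: the sequential case charges one walk plus one stat per component with no pipelining, and the bulk case assumes a single walk request covers the whole path (a real protocol may cap the number of names per request).

```c
#include <assert.h>
#include <limits.h>

/*
 * Assumption: a strictly sequential lookup needs a walk and then a stat
 * for every component, each waiting for the previous reply.
 */
static unsigned rtts_sequential(unsigned ncomponents)
{
	assert(ncomponents <= UINT_MAX / 2);	/* avoid overflow in this sketch */
	return 2 * ncomponents;
}

/*
 * Assumption: one bulk walk covers the whole path, and the stats for all
 * components are then sent together and answered within one round trip.
 */
static unsigned rtts_bulk(unsigned ncomponents)
{
	return ncomponents ? 2 : 0;
}
```

Under these assumptions a four-component path costs 8 round trips sequentially and 2 with the bulk walk followed by parallel stats; the gap grows linearly with path depth.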

Miklos
--
To: Miklos Szeredi <miklos@...>
Cc: <viro@...>, <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:16 pm

...and NFSv4 could do it in a single RPC call (assuming no symlinks or
submounts).

Cheers
  Trond

--
To: <trond.myklebust@...>
Cc: <miklos@...>, <viro@...>, <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 4:35 pm

And just to show how utterly trivially this could be done, here's a
patch (totally untested).

Hack?  Hell, yes.

Miklos

---
 fs/namei.c         |   20 ++++++++++++--------
 include/linux/fs.h |    1 +
 2 files changed, 13 insertions(+), 8 deletions(-)

Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c	2008-05-05 12:03:24.000000000 +0200
+++ linux-2.6/fs/namei.c	2008-05-05 22:25:52.000000000 +0200
@@ -519,14 +519,18 @@ static struct dentry * real_lookup(struc
 	 */
 	result = d_lookup(parent, name);
 	if (!result) {
-		struct dentry * dentry = d_alloc(parent, name);
-		result = ERR_PTR(-ENOMEM);
-		if (dentry) {
-			result = dir->i_op->lookup(dir, dentry, nd);
-			if (result)
-				dput(dentry);
-			else
-				result = dentry;
+		if (dir->i_op->lookup_path) {
+			result = dir->i_op->lookup_path(dir, name);
+		} else  {
+			struct dentry * dentry = d_alloc(parent, name);
+			result = ERR_PTR(-ENOMEM);
+			if (dentry) {
+				result = dir->i_op->lookup(dir, dentry, nd);
+				if (result)
+					dput(dentry);
+				else
+					result = dentry;
+			}
 		}
 		mutex_unlock(&dir->i_mutex);
 		return result;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h	2008-05-05 12:03:24.000000000 +0200
+++ linux-2.6/include/linux/fs.h	2008-05-05 22:26:59.000000000 +0200
@@ -1251,6 +1251,7 @@ struct file_operations {
 struct inode_operations {
 	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
 	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
+	struct dentry * (*lookup_path) (struct inode *, struct qstr *);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct inode *,struct dentry *,const char *);





--
To: <miklos@...>
Cc: <viro@...>, <weigelt@...>, <linux-kernel@...>
Date: Monday, May 5, 2008 - 1:03 pm

Better, the filesystem can just populate the dcache with the result.
The entry being looked up is locked, so no one can get at it, and so it
should be quite safe to build a tree below.

Miklos
--