Hi folks, could anyone please enlighten me: what exactly happens when an app opens a file with a longer pathname? Let's say we open /a/b/c/d and /a is mounted with some network filesystem (e.g. 9P). Who exactly does the walkthrough from b to d? The individual filesystem or the VFS? The point is: the 9P protocol can work with whole pathnames, so the client doesn't have to do the walkthrough manually - this can heavily reduce traffic and latency. I'd like the 9P fs driver to use this directly, if the VFS can send the whole pathname at once. cu -- --------------------------------------------------------------------- Enrico Weigelt == metux IT service - http://www.metux.de/ --------------------------------------------------------------------- Please visit the OpenSource QM Taskforce: http://wiki.metux.de/public/OpenSource_QM_Taskforce Patches / Fixes for a lot dozens of packages in dozens of versions: http://patches.metux.de/ --------------------------------------------------------------------- --
I've dug a bit into the source and found that it goes down to link_path_walk(). It seems to split the pathname into components and walk through them one by one. We could just add another call vector to struct file_operations as a replacement for link_path_walk() - if it's zero, the original function is used. This way a filesystem can do the walkthrough on its own, but doesn't need to. What do you think about this? cu --
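[Editor's sketch] The proposal above amounts to "split the path into components, unless the filesystem supplies its own whole-path routine". A minimal user-space illustration in C follows. All names here (walk_components, resolve, whole_path_fn, one_shot_hook) are hypothetical and are not kernel APIs. The sketch deliberately ignores mounts, symlinks and the dcache, which the replies in this thread identify as the hard part.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback invoked once per path component. */
typedef void (*component_fn)(const char *name, size_t len, void *ctx);

/* Hypothetical per-filesystem hook that takes the whole remaining path
 * and returns how many lookups it needed. */
typedef int (*whole_path_fn)(const char *path, void *ctx);

/* Default walk: skip slashes and hand each component to `visit`.
 * Returns the number of components, i.e. the number of lookups. */
static int walk_components(const char *path, component_fn visit, void *ctx)
{
    int count = 0;

    assert(path != NULL);
    while (*path) {
        const char *start;

        while (*path == '/')            /* collapse repeated slashes */
            path++;
        if (!*path)
            break;
        start = path;
        while (*path && *path != '/')   /* scan to the end of the component */
            path++;
        if (visit)
            visit(start, (size_t)(path - start), ctx);
        count++;
    }
    return count;
}

/* Dispatch as proposed: use the hook if it is set, else fall back. */
static int resolve(const char *path, whole_path_fn hook,
                   component_fn visit, void *ctx)
{
    if (hook)
        return hook(path, ctx);
    return walk_components(path, visit, ctx);
}

/* Example hook (an assumption): pretends one request covers the path. */
static int one_shot_hook(const char *path, void *ctx)
{
    (void)path;
    (void)ctx;
    return 1;
}
```

With no hook, resolve("/a/b/c/d", NULL, NULL, NULL) walks four components one by one. With one_shot_hook installed, the default loop is bypassed entirely, which is exactly why mountpoint and symlink handling would have to move into the hook.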
That you have quite forgotten about mounts. And we *REALLY* don't want to shift the entire logic of link_path_walk() into filesystems - this is insane. Even the "let's follow that symlink" part alone, not to mention mountpoint handling, populating dcache, etc. --
Hmm, I thought this would be done before the link_path_walk(). Only for those filesystems which *really* want to do it by themselves and set the appropriate call vector. All other filesystems will just leave it blank (they don't even have to be touched), and so the old way remains for them. To get around mountpoint issues, we could at least do it only when a special mount option is given and add a big fat warning that mountpoints within these mounts won't work. So these fast lookups will only happen when: #1: the fs explicitly supports it #2: it is mounted with a special option. And if you use that option, you'll simply lose the ability to use mountpoints within this specific mount. This won't affect any situation other than #1 && #2; IMHO this is better than no chance of fast lookups at all. Of course, a cleaner approach would be better, but it's IMHO not critical. BTW: there are (or have been) certain speed improvements for specific situations at the cost of losing other standard features, e.g. fast bridging. cu --
How on earth...? You don't know where pathname resolution will get you, so how could you possibly handle mountpoint transitions prior This is crap. First of all, the logic is already overcomplicated. _Then_ we have the problem of populating dcache for intermediates. Besides, that's not what that thing is for - it's to allow local caching (which we do) with revalidation of several components at once. _After_ VFS has decided that nothing interesting is in the part of the path it has cached. Then the protocol allows a bulk Walk, verifying that all cached intermediates still match reality, all in one roundtrip. --
One way this could be done cleanly, is to pass the rest of the path (as hint) to the filesystem in its lookup function. Most filesystems would just ignore it, but those which have the capabilities can use it to do the lookup in one go, and internally cache the result. The VFS doesn't need to know _anything_ about all this. If there are mountpoints, they are already cached, so ->lookup() wouldn't be called at all, only ->d_revalidate(), which is a different issue. Miklos --
This is still wrong. We would not just pass the pathname to the filesystem (note that you still need to deal with symlinks), we would also make that filesystem populate the dentry tree. Take a look at 9P walk - it does *not* give you anything resembling stat, you just get qids of intermediates. Which is bloody useful when you want to do intelligent revalidation (do a local cached walk, then issue a single protocol request that will both do bulk revalidate *and* tell you where in the path you've got the first invalid one - just compare qids with what you've got stored locally). However, it's just about useless for cutting corners in cold-cache lookup. It _is_ a useful thing, no arguments about that. However, to use it in a sane way we need to massage the pathname resolution loop, taking the "simple pass without symlinks or mountpoints" part into a new helper, turning the current __link_path_walk() into a loop calling that one and then folding it into callers. That would also allow us to kill the last remnants of recursion in symlink handling for the normal fs case... _Then_ we can do saner logic for revalidate, allowing it on such segments. Which, BTW, would deal with -ESTALE in a saner way, rather than "repeat full pathname resolution from the very beginning". And that's where 9P multi-step walk(5) would do very nicely, indeed. And fuck the "hints" of all kinds, pardon the rudeness. We already have more than enough of that crap and it already makes cleaning the logic up bloody painful. --
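[Editor's sketch] The bulk revalidation described above ("just compare qids with what you've got stored locally") can be illustrated in a few lines of C. This is a user-space sketch, not code from the 9P client. struct qid mirrors the three fields of a 9P qid (type, version, path), while first_stale and qid_same_file are hypothetical names. Comparing only type and path, and ignoring version, is an assumption made here for identity checks. A real client might choose differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for a 9P qid: type[1] version[4] path[8]. */
struct qid {
    uint8_t  type;
    uint32_t version;
    uint64_t path;
};

/* Assumption: two qids name the same file if type and path agree.
 * The version field (bumped on modification) is deliberately ignored. */
static int qid_same_file(const struct qid *a, const struct qid *b)
{
    return a->type == b->type && a->path == b->path;
}

/* Compare the qids cached locally for a run of path components with the
 * qids returned by one multi-component walk.
 *
 * Returns the index of the first component that is stale or that the
 * server did not reach; returns ncached if every cached entry matched. */
static size_t first_stale(const struct qid *cached, size_t ncached,
                          const struct qid *walked, size_t nwalked)
{
    size_t n = ncached < nwalked ? ncached : nwalked;
    size_t i;

    assert(cached != NULL && walked != NULL);
    for (i = 0; i < n; i++)
        if (!qid_same_file(&cached[i], &walked[i]))
            break;
    return i;
}
```

Everything before the returned index can stay in the cache. Resolution would resume from that component instead of restarting from the beginning of the pathname.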
Symlinks are easy: filesystem just needs to *stop* the resolution the Separate i_op for it is fine by me as well. Not that I care very much. I have plans for such a bulk lookup interface in fuse, but that's far in the future. Miklos --
No - you need inodes as well (i.e. as the absolute least you want mode and ownership). Which is to say, you need to issue stat on each component in such situation anyway. Not a win... --
I've just read the spec for walk again: assuming the server doesn't resolve symlinks itself, the walk will fail right at the symlink. So we can have a deeper look here and try stat()'ing (adds one more request). If the fail point *is* a symlink, we need to handle it properly. Would it be very complicated to give the link target back to Naive question: is it really *necessary* to have all the intermediate dirs in dcache? cu --
Um... What the hell are you talking about? How _can_ server resolve symlinks, when result of symlink resolution depends on where the damn thing is mounted on client and even how deeply the process trying to do lookup happens to be chrooted? It wouldn't work even for relative symlinks - remember that we might bloody well have something bound on the middle of the path in question. The answer's "yes". --
In the same way as, e.g., HTTP servers do. Of course this fails if the symlink isn't resolvable within the server's fs. Several years ago, I saw exactly this behaviour on Samba. What exactly are they needed for? Which information is needed? Can we perhaps fake them (at least we know - on success - that the intermediate components are dirs)? cu --
Umm... You know, it might make more sense if you * explained what you are really trying to do * short of that, perhaps figured out what the hell symlinks and bindings _are_. Again, _no_ symlink is resolvable by the server alone, simply because the server cannot know if the target of that symlink is overmounted from the point of view of whoever is doing the lookup. Note that it *does* depend on who's doing that and where in the namespace we are seeing that sucker (the latter kills the "we want per-user connection" variants). --
Umm, OK. The 9P server does see the type of objects, so it should be You're right. It doesn't sound too good, although it all depends on how permission checking is done. If it's done in the server, then neither the mode nor the ownership is needed for lookup. The file type *is* known for all but the last component, and doing a stat for that one is not a big issue. All this is modulo the symlink issue of course. Miklos --
It really should be done in the server. But this adds another issue (no idea if the current 9p driver already handles this): We need one link for each user accessing the filesystem. cu --
And actually even that *could* be a win, if the network latency is large. Because by doing the lookup first, the stats can be performed in parallel. So a path with an arbitrary number of components could be resolved in just 2 RTTs. Miklos --
...and NFSv4 could do it in a single RPC call (assuming no symlinks or submounts). Cheers Trond --
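[Editor's sketch] The round-trip arithmetic in the last two messages can be written out as below. This is a toy model, not a measurement. It assumes the component-by-component path costs exactly one request per component, that a bulk walk fits in a single request, and that all stats can be in flight at once. The function names and the 50 ms figure in the note after the code are made up for illustration.

```c
#include <assert.h>

/* Component-by-component lookup: one round trip per component (assumed). */
static unsigned rtts_per_component(unsigned ncomponents)
{
    return ncomponents;
}

/* One bulk walk, then every stat issued in parallel: two round trips,
 * independent of the path length. */
static unsigned rtts_bulk_walk_parallel_stat(unsigned ncomponents)
{
    return ncomponents ? 2 : 0;
}

/* One compound request covering the whole path (no symlinks or
 * submounts assumed): a single round trip. */
static unsigned rtts_single_compound(unsigned ncomponents)
{
    return ncomponents ? 1 : 0;
}

/* Total latency in milliseconds for a given number of round trips. */
static unsigned latency_ms(unsigned rtts, unsigned rtt_ms)
{
    assert(rtt_ms == 0 || rtts <= (unsigned)-1 / rtt_ms); /* no overflow */
    return rtts * rtt_ms;
}
```

For /a/b/c/d with /a as the mount root there are three components to resolve. At an assumed 50 ms round-trip time the three schemes come out at 150 ms, 100 ms and 50 ms respectively under this model. The gap widens as paths get deeper.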
And just to show how utterly trivially this could be done, here's a
patch (totally untested).
Hack? Hell, yes.
Miklos
---
fs/namei.c | 20 ++++++++++++--------
include/linux/fs.h | 1 +
2 files changed, 13 insertions(+), 8 deletions(-)
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2008-05-05 12:03:24.000000000 +0200
+++ linux-2.6/fs/namei.c 2008-05-05 22:25:52.000000000 +0200
@@ -519,14 +519,18 @@ static struct dentry * real_lookup(struc
 	 */
 	result = d_lookup(parent, name);
 	if (!result) {
-		struct dentry * dentry = d_alloc(parent, name);
-		result = ERR_PTR(-ENOMEM);
-		if (dentry) {
-			result = dir->i_op->lookup(dir, dentry, nd);
-			if (result)
-				dput(dentry);
-			else
-				result = dentry;
+		if (dir->i_op->lookup_path) {
+			result = dir->i_op->lookup_path(dir, name);
+		} else {
+			struct dentry * dentry = d_alloc(parent, name);
+			result = ERR_PTR(-ENOMEM);
+			if (dentry) {
+				result = dir->i_op->lookup(dir, dentry, nd);
+				if (result)
+					dput(dentry);
+				else
+					result = dentry;
+			}
 		}
 		mutex_unlock(&dir->i_mutex);
 		return result;
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h 2008-05-05 12:03:24.000000000 +0200
+++ linux-2.6/include/linux/fs.h 2008-05-05 22:26:59.000000000 +0200
@@ -1251,6 +1251,7 @@ struct file_operations {
 struct inode_operations {
 	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
 	struct dentry * (*lookup) (struct inode *,struct dentry *, struct nameidata *);
+	struct dentry * (*lookup_path) (struct inode *, struct qstr *);
 	int (*link) (struct dentry *,struct inode *,struct dentry *);
 	int (*unlink) (struct inode *,struct dentry *);
 	int (*symlink) (struct inode *,struct dentry *,const char *);
--
Better, the filesystem can just populate the dcache with the result. The entry being looked up is locked, so no one can get at it, and so it should be quite safe to build a tree below. Miklos --
