I don't understand how this leads to a more efficient implementation. These don't seem to be documented on the website. -
What crack do you guys have been smoking? ---end quoted text--- -
I'd like to apologize for this statement, it was a little harsh. I still think most of these APIs are rather braindead, but then again everyone does braindead APIs from now to then. I still think it's very futile that you try to force APIs using standizations on us. Instead of going down that route please try to present a case for every single API you want, including reasonings why this can't be fixed by speeding up existing APIs. Note that with us I don't mean just linux but also other OpenSource OSes. Unless you at least get Linux and FreeBSD and Solaris to agree on the need for the API it's very pointless to go anywhere close to a standization body. Anyway, let's go on to the individual API groups: - readdirplus This one is completely unneeded as a kernel API. Doing readdir plus calls on the wire makes a lot of sense and we already do that for NFSv3+. Doing this at the syscall layer just means kernel bloat - syscalls are very cheap. - lockg I'm more than unhappy to add new kernel-level file locking calls. The whole mess of lockf vs fcntl vs leases is bad enough that we don't want to add more to it. Doing some form of advisory locks that can be implemented in userland using a shared memory region or message passing might be fine. - openg/sutoc No way. We already have a very nice file descriptor abstraction. You can pass file descriptors over unix sockets just fine. - NFSV4acls These have nothing to do at all with I/O performance. They're also sufficiently braindead. Even if you still want to push for it you shouldn't mix it up with anything else in here. - statlite The concept generally makes sense. The specified details are however very wrong. Any statlite call should operate on the normal OS-specified stat structure and have the mask of flags as an additional argument. Because of that you can only specific existing posix stat values as mandatory, but we should have an informal agreement that assigns unique mask...
Yes, but it behaves like dup(). Gary replied to me off-list (which I didn't notice and continued replying to him off-list). I wrote: Is this for people who don't know about dup(), or do they need independent file offsets? If the latter, I think an xdup() would be preferable (would there be a security issue for OSes with revoke()?) Either that, or make the key be useful for something else. -
Not sharing the file offset means we need a separate file struct, at which point the only thing saved is doing a lookup at the time of opening the file. While a full pathname traversal can be quite costly an open is not something you do all that often anyway. And if you really need to open/close files very often you can speed it up nicely by keeping a file descriptor on the parent directory open and use openat(). Anyway, enough of talking here. We really need a very good description of the use case people want this for, and the specific performance problems they see to find a solution. And the solution definitly does not involve as second half-assed file handle time with unspecified lifetime rules :-) -
Hi all, The use model for openg() and openfh() (renamed sutoc()) is n processes spread across a large cluster simultaneously opening a file. The challenge is to avoid to the greatest extent possible incurring O(n) FS interactions. To do that we need to allow actions of one process to be reused by other processes on other OS instances. The openg() call allows one process to perform name resolution, which is often the most expensive part of this use model. Because permission checking is also performed as part of the openg(), some file systems to not require additional communication between OS and FS at openfh(). External communication channels are used to pass the handle resulting from the openg() call out to processes on other nodes (e.g. MPI_Bcast). dup(), openat(), and UNIX sockets are not viable options in this model, because there are many OS instances, not just one. All the calls that are being discussed as part of the HEC extensions are being discussed in this context of multiple OS instances and cluster file systems. Regarding the lifetime of the handle, there has been quite a bit of discussion about this. I believe that we most recently were thinking that there was an undefined lifetime for this, allowing servers to "forget" these values (as in the case where a server is restarted). Clients would need to perform the openg() again if they were to try to use an outdated handle, or simply fall back to a regular open(). This is not a problem in our use model. I've attached a graph showing the time to use individual open() calls vs. the openg()/MPI_Bcast()/openfh() combination; it's a clear win for any significant number of processes. These results are from our colleagues at Sandia (Ruth Klundt et. al.) with PVFS underneath, but I expect the trend to be similar for many cluster file systems. Regarding trying to "force APIs using standardization" on you (Christoph's 11/29/2006 message), you've got us all wrong. The standardization process is go...
Hi,
One general remark: I don't think it is feasible to add new system
calls every time somebody has a problem. Usually there are (may be not
that good) solutions that don't require big changes and work well
enough. "Let's change the interface and make the life of many
filesystem developers miserable, because they have to worry about
3-4-5 more operations" is not the easiest solution in the long run.
If the name resolution is the most expensive part, why not implement
just the name lookup part and call it "lookup" instead of "openg". Or
even better, make NFS to resolve multiple names with a single request.
If the NFS server caches the last few name lookups, the responses from
the other nodes will be fast, and you will get your file descriptor
with two instead of the proposed one request. The performance could be
just good enough without introducing any new functions and file
handles.
Thanks,
Lucho
-Hi, I agree that it is not feasible to add new system calls every time somebody has a problem, and we don't take adding system calls lightly. However, in this case we're talking about an entire *community* of people (high-end computing), not just one or two people. Of course it may still be the case that that community is not important enough to justify the addition of system calls; that's obviously not my call to make! I'm sure that you meant more than just to rename openg() to lookup(), but I don't understand what you are proposing. We still need a second call to take the results of the lookup (by whatever name) and convert that into a file descriptor. That's all the openfh() (previously named sutoc()) is for. I think the subject line might be a little misleading; we're not just talking about NFS here. There are a number of different file systems that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS, PanFS, etc.). Finally, your comment on making filesystem developers miserable is sort of a point of philosophical debate for me. I personally find myself miserable trying to extract performance given the very small amount of information passing through the existing POSIX calls. The additional information passing through these new calls will make it much easier to obtain performance without correctly guessing what the user might actually be up to. While they do mean more work in the short term, they should also mean a more straight-forward path to performance for cluster/parallel file systems. Thanks for the input. Does this help explain why we don't think we can just work under the existing calls? Rob -
I have the feeling that openg stuff is rushed without looking into all
solutions, that don't require changes to the current interface. I
don't see any numbers showing where exactly the time is spent? Is
opening too slow because of the number of requests that the file
server suddently has to respond to? Does having an operation that
looks up multiple names instead of a single name good enough? How much
The idea is that lookup doesn't open the file, just does to name
resolution. The actual opening is done by openfh (or whatever you call
it next :). I don't think it is a good idea to introduce another way
of addressing files on the file system at all, but if you still decide
to do it, it makes more sense to separate the name resolution from the
operations (at the moment only open operation, but who knows what'll
I think that the main problem is that all these file systems resove a
path name, one directory at a time bringing the server to its knees by
the huge amount of requests. I would like to see what the performance
is if you a) cache the last few hundred lookups on the server side,
and b) modify VFS and the file systems to support multi-name lookups.
Just assume for a moment that there is no any way to get these new
operations in (which is probaly going to be true anyway :). What other
solutions can you think of? :)
Thanks,
Lucho
-I also get the feeling that interfaces that already do this open-by-handle stuff haven't been explored either. Does anyone here know about the XFS libhandle API? This has been around for years and it does _exactly_ what these proposed syscalls are supposed to do (and more). See: http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname... For the libhandle man page. Basically: openg == path_to_handle sutoc == open_by_handle And here for the userspace code: http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/ Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Thanks for pointing these out Dave. These are indeed along the same lines as the openg()/openfh() approach. One difference is that they appear to perform permission checking on the open_by_handle(), which means that the entire path needs to be encoded in the handle, and makes it difficult to eliminate the path traversal overhead on N open_by_handle() operations. Regards, Rob -
The open-by-handle makes a little more sense, because the "handle" is
not opened, it only points to a resolved file. As I mentioned before,
it doesn't make much sense to bundle in openg name resolution and file
open.
Still I am not convinced that we need two ways of "finding" files.
Thanks,
Lucho
-I don't think that I understand what you're saying here. The openg() call does not perform file open (not that that is necessarily even a first-class FS operation), it simply does the lookup. When we were naming these calls, from a POSIX consistency perspective it seemed best to keep the "open" nomenclature. That seems to be confusing to some. Perhaps we should rename the function "lookup" or something similar, to help keep from giving the wrong idea? There is a difference between the openg() and path_to_handle() approach in that we do permission checking at openg(), and that does have implications on how the handle might be stored and such. That's being discussed in a separate thread. Thanks, Rob -
I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:
int openg(const char *path)
{
char *s;
do {
s = tempnam(FSROOT, ".sutoc");
link(path, s);
} while (errno == EEXIST);
mpi_broadcast(s);
sleep(10);
unlink(s);
}
and sutoc() becomes simply open(). Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.
I suppose some cluster fs' might not support cross-directory links
(AFS is one, I think), but then, no cluster fs's support openg/sutoc.
If a filesystem's willing to add support for these handles, it shouldn't
be too hard for them to treat files starting ".sutoc" specially, and as
efficiently as adding the openg/sutoc concept.
-Adding atomic reference count updating on file metadata so that we can have cross-directory links is not necessarily easier than supporting openg/openfh, and supporting cross-directory links precludes certain metadata organizations, such as the ones being used in Ceph (as I understand it). This also still forces all clients to read a directory and for N permission checking operations to be performed. I don't see what the FS could do to eliminate those operations given what you've described. Am I missing something? Also this looks too much like sillyrename, and that's hard to swallow... Regards, Rob -
open_by_handle() is checking the inode flags for things like immutibility and whether the inode is writable to determine if the open mode is valid given these flags. It's not actually checking permissions. IOWs, open_by_handle() has the same overhead as NFS filehandle to inode translation; i.e. no path traversal on open. Permission checks are done on the path_to_handle(), so in reality only root or CAP_SYS_ADMIN users can currently use the open_by_handle interface because of this lack of checking. Given that our current users of this interface need root permissions to do other things (data migration), this has never been an issue. This is an implementation detail - it is possible that file handle, being opaque, could encode a UID/GID of the user that constructed the handle and then allow any process with the same UID/GID to use open_by_handle() on that handle. (I think hch has already pointed this out.) Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Thanks for the clarification Dave. So I take it that you would be interested in this type of functionality then? Regards, Rob -
Not really - just trying to help by pointing out something no-one seemed to know about.... Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
While it could do that, I'd be interested to see how you'd construct the handle such that it's immune to a malicious user tampering with it, or saving it across a reboot, or constructing one from scratch. I suspect any real answer to this would have to involve cryptographical techniques (say, creating a secure hash of the information plus a boot-time generated nonce). Now you're starting to use a lot of bits, and compute time, and you'll need to be sure to keep the nonce secret. -
If the server has to have processed a real "open" request, say within the preceding 30s, then it would have a handle for openfh() to match against. If the server reboots, or a client tries to construct a new handle from scratch, or even tries to use the handle after the file is closed then the handle would be invalid. It isn't just an encoding for "open-by-inum", but rather a handle that references some just-created open file handle on the server. That the handle might contain the UID/GID is mostly irrelevant - either the process + network is trusted to pass the handle around without snooping, or a malicious client which intercepts the handle can spoof the UID/GID just as easily. Make the handle sufficiently large to avoid guessing and it is "secure enough" until the whole filesystem is using kerberos to avoid any number of other client/user spoofing attacks. Considering that filesystems like GFS and OCFS allow clients DIRECT ACCESS to the block device itself (which no amount of authentication will fix, unless it is in the disks themselves), the risk of passing a file handle around is pretty minimal. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
That would be fine as long as the file handle would be a kernel-level concept. The issue here is that they intent to make the whole filehandle userspace visible, for example to pass it around via mpi. As soon as an untrused user can tamper with the file descriptor we're in trouble. -
I guess it could reference some "just-created open file handle" on the server, if the server tracks that sort of thing. Or it could be a capability, as mentioned previously. So it isn't necessary to tie this to an open, but I think that would be a reasonable underlying implementation for a file system that tracks opens. If clients can survive a server reboot without a remount, then even this implementation should continue to operate if a server were rebooted, because the open file context would be reconstructed. If capabilities were being employed, we could likewise survive a server reboot. But this issue of server reboots isn't that critical -- the use case has the handle being reused relatively quickly after the initial openg(), and clients have a clean fallback in the event that the handle is no longer valid -- just use open(). Visibility of the handle to a user does not imply that the user can effectively tamper with the handle. A cryptographically secure one-way hash of the data, stored in the handle itself, would allow servers to verify that the handle wasn't tampered with, or that the client just made up a handle from scratch. The server managing the metadata for that file would not need to share its nonce with other servers, assuming that single servers are responsible for particular files. Regards, Rob -
That's either disingenuous, or missing the point. OCFS/GFS allow the kernel direct access to the block device. openg()&sutoc() are about passing around file handles to untrusted users. -
Consider - in order to intercept the file handle on the network one would have to be root on a trusted client. The same is true for direct block access. If the network isn't to be trusted or the clients aren't to be trusted, then in the absence of strong external authentication like kerberos the whole thing just falls down (i.e. root on any client can su to an arbitrary UID/GID to access files to avoid root squash, or could intercept all of the traffic on the network anyways). With some network filesystems it is at least possible to get strong authentication and crypto, but with shared block device filesystems like OCFS/GFS/GPFS they completely rely on the fact that the network and all of the clients attached thereon are secure. If the server that did the original file open and generates the unique per-open file handle can do basic sanity checking (i.e. user doing the new open is the same, the file handle isn't stale) then that is no additional security hole. Similarly, NFS passes file handles to clients that are also used to get access to the open file without traversing the whole path each time. Those file handles are even (supposed to be) persistent over reboots. Don't get me wrong - I understand that what I propose is not secure. I'm just saying it is no LESS secure than a number of other things which already exist. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
An auth header and GSS-API integration would probably be the way to go here if you really care. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Another (and highly important) difference is that usage is restricted to
root:
xfs_open_by_handle(...)
...
if (!capable(CAP_SYS_ADMIN))
return -XFS_ERROR(EPERM);
-I assume that this is because the implementation chose not to do the path encoding in the handle? Because if they did, they could do full path permission checking as part of the open_by_handle. Rob -
The original use of this interface (if I understand the Irix history correctly - this is way before my time at SGI) was a userspace NFS server and so permission checks were done after the filehandle was opened and a stat could be done on the fd and mode/uid/gid could be compared to what was in the NFS request. Paths were never needed for this because everything needed could be obtained directly from the inode. Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Thanks for looking at the graph. To clarify the workload, we do not expect that application processes will be opening a large number of files all at once; that was just how the test was run to get a reasonable average value. So I don't think that something that looked up multiple file names would help for this case. I unfortunately don't have data to show exactly where the time was spent, but it's a good guess that it is all the network traffic in the I really think that we're saying the same thing here? I think of the open() call as doing two (maybe three) things. First, performs name resolution and permission checking. Second, creates the file descriptor that allows the user process to do subsequent I/O. Third, creates a context for access, if the FS keeps track of "open" files (not all do). The openg() really just does the lookup and permission checking). The openfh() creates the file descriptor and starts that context if the Well you've caught me. I don't want to cache the values, because I fundamentally believe that sharing state between clients and servers is braindead (to use Christoph's phrase) in systems of this scale (thousands to tens of thousands of clients). So I don't want locks, so I can't keep the cache consistent, ... So someone else will have to run the tests you propose :)... Also, to address Christoph's snipe while we're here; I don't care one way or another whether the Linux community wants to help GPFS or not. I do care that I'm arguing for something that is useful to more than just my own pet project, and that was the point that I was trying to make. I'll be sure not to mention GPFS again. What's the etiquette on changing subject lines here? It might be useful to separate the openg() etc. discussion from the readdirplus() etc. discussion. Thanks again for the comments, Rob -
Is it hard to repeat the test and check what requests (and how much
Having file handles in the server looks like a cache to me :) What are
the properties of a cache that it lacks?
Thanks,
Lucho
-Besides the whole ugliness you miss a few points about the fundamental architecture of the unix filesystem permission model unfortunately. Say you want to lookup a path /foo/bar/baz, then the access permission is based on the following things: - the credentials of the user. let's only take traditional uid/gid for this example although credentials are much more complex these days - the kind of operation you want to perform - the access permission of the actual object the path points to (inode) - the lookup permission (x bit) for every object on the way to you object In your proposal sutoc is a simple conversion operation, that means openg needs to perfom all these access checks and encodes them in the fh_t. That means an fh_t must fundamentally be an object that is kept in the kernel aka a capability as defined by Henry Levy. This does imply you _do_ need to keep state. And because it needs kernel support you fh_t is more or less equivalent to a file descriptor with sutoc equivalent to a dup variant that really duplicates the backing object instead of just the userspace index into it. Note somewhat similar open by filehandle APIs like oben by inode number as used by lustre or the XFS *_by_handle APIs are privilegued operations because of exactly this problem. What according to your mail is the most important bit in this proposal is that you thing the filehandles should be easily shared with other system in a cluster. That fact is not mentioned in the actual proposal at all, and is in fact that hardest part because of inherent statefulness of Changing subject lines is fine. -
The fh_t is indeed a type of capability. fh_t, properly protected, could be passed into user space and validated by the file system when presented back to the file system. There is state here, clearly. I feel ok about that because we allow servers to forget that they handed out these fh_ts if they feel like it; there is no guaranteed lifetime in the current proposal. This allows servers to come and go without needing to persistently store these. Likewise, clients can forget them with no real penalty. This approach is ok because of the use case. Because we expect the fh_t to be used relatively soon after its creation, servers will not need to hold onto these long before the openfh() is performed and we're back into a normal "everyone has an valid fd" use case. Well, a FD has some additional state associated with it (position, I'm not sure what a properly protected fh_t couldn't be passed back into user space and handed around, but I'm not a security expert. What am I The documentation of the calls is complicated by the way POSIX calls are described. We need to have a second document describing use cases also available, so that we can avoid misunderstandings as best we can, get straight to the real issues. Sorry that document wasn't available. Thanks. Rob -
Well, there's quite a lot of papers on how to implement properly secure capabilities. The only performant way to do it is to implement them in kernel space or with hardware support. As soon as you pass them to userspace the user can manipulate them, and doing a cheap enough verification is non-trivial (e.g. it doesn't buy you anything if you spent the time you previously spent for lookup roundtrip latency Objects without defined lifetime rules are not something we're very keen on. Particularly in userspace interface they will cause all kinds of trouble because people will expect the lifetime rules they get from their The real problem is that you want to do something in a POSIX spec that is fundamentally out of scope. POSIX .1 deals with system interfaces on a single system. You want to specify semantics over multiple systems in a cluster. -
I agree that if the cryptographic verification took longer than the N namespace traversals and permission checking that would occur in the other case, that this would be a silly proposal. honestly that didn't occur to me as even remotely possible, especially given that in most cases the server will be verifying the exact same handle lots of times, rather than needing to verify a large number of different handles I agree that not being able to clearly define the lifetime of the handle is suboptimal. If the handle is a capability, then its lifetime would be bounded only by potential revocations of the capability, the same way an open FD might then suddenly cease to be valid. On the other hand, in Andreas' "open file handle" implementation the handle might have a shorter lifetime. We're attempting to allow for the underlying FS to implement this in the most natural way for that file system. Those mechanisms lead to different lifetimes. This would bother me quite a bit *if* it complicated the use model, but it really doesn't, particularly because less savvy users are likely to I agree; the real problem is that POSIX .1 is being used to specify semantics over multiple systems in a cluster. But we're stuck with that. Thanks, Rob -
- your private namespace particularities (submounts etc) Trond -
How exactly would you want a multi-name lookup to work? Are you saying
that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if
you find it, find share, if you find it, find misc, if you find it, find
pci.ids"? That would be potentially very wasteful; consider mount
points, symlinks and other such effects on the namespace. You could ask
the server to do a lot of work which you then discard ... and that's not
efficient.
-It could be inefficient, as pointed out, but defined right, it could
greatly reduce the number of over the wire trips.
The client can already tell from its own namespace when a submount may
be encountered, so know not to utilize the multicomponent pathname
lookup facility. The requirements could state that the server stops
when it encounters a non-directory/non-regular file node in the namespace.
This sort of thing...
ps
-Any support for advance filesystem semantics will definitly not be available to propritary filesystems like GPFS that violate our copyrights blatantly. -
I further wonder if these people would see appreciable gains from doing sutoc rather than doing openat(dirfd, "basename", flags, mode); If they did, I could also see openat being extended to allow dirfd to be a file fd, as long as pathname were NULL or a pointer to NUL. But with all the readx stuff being proposed, I bet they don't really need independent file offsets. That's, like, so *1970*s. -
There is a business case at the Open Group Web site. It is not a full use case document though. For a very tiny amount of background. It seems from the discussion that others (at least those working in clustered file systems) have seen the need for a statlite and readdir+ type function, what ever they might be called or how ever they might be implemented. As for openg, the gains have been seen in clustered file systems where you have 10s of thousands of processes spread out over thousands of machines. All 100k processes may open the same file and offset different amounts, sometimes strided sometimes not strided through the file. The opens all fire within a few milliseconds or less. This is a problem for large clustered file systems, open times have been seen in the minutes or worse. The writes all come at once as well quite often. Often they are complicated scatter gather operations spread out across the entire distributed memory of thousands of machines, not even in a completely uniform manner. A little knowledge about the intent of the application goes a long way when you are dealing with 100k parallelism. Additionally, having some notion of groups of processes collaborating at the file system level is useful for trying to make informed decisions about determinism and quality of service you might want to provide, how strictly you want to enforce rules on collaborating processes, etc. As for NFS acl's. This was going to be a separate extension volume, not associated with the performance portion. It comes up because many of the users of high end/clustered file system technology are also in often secure environments and have need to know issues. We were trying to be helpful to the NFSv4 community which has been kind enough to have these security features in their product. Additionally, this entire effort is being proposed as an extension, not as a change to the base POSIX I/O API. We certainly have no religion about how we make progress to...
Please don't repeat the stupid marketroid speach. If you want this to go anywhere please get someone with an actual clue to talk to us instead of you. Thanks a lot. -
The question is how does the filesystem know that the application is going to do readdir + stat every file? It has to do this as a heuristic implemented in the filesystem to determine if the ->getattr() calls match the ->readdir() order. If the application knows that it is going to be doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem take advantage of this information? If combined with the statlite interface, it can make a huge difference for clustered filesystems. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I think that this kind of heuristic would be a win for local file systems with a huge number of files as well... ric -
Hi, I agree that this is a good plan, but I'd been looking at this idea from a different direction recently. The in kernel NFS server calls vfs_getattr from its filldir routine for readdirplus and this means not only are we unable to optimise performance by (for example) sorting groups of getattr calls so that we read the inodes in disk block order, but also that its effectively enforcing a locking order of the inodes on us too. Since we can have async locking in GFS2, we should be able to do "lockahead" with readdirplus too. I had been considering proposing a readdirplus export operation, but since this thread has come up, perhaps a file operation would be preferable as it could solve two problems with one operation? Steve. -
Doing this as an export operation is wrong. Even if it's only used for nfsd for now the logical level this should be on are the file operations. If you do it you could probably prototype a syscall for it aswell - once we have the infrastructure the syscall should be no more than about 20 lines of code. -
The other thing is that a readdirplus at least for some file systems can
be implemented much more efficiently than readdir + stat because the
directory entry itself contains a lot of extra information.
To take NTFS as an example I know something about, the directory entry
caches the a/c/m time as well as the data file size (needed for "ls")
and the allocated on disk file size (needed for "du") as well as the
inode number corresponding to the name, the flags of the inode
(read-only, hidden, system, whether it is a file or directory, etc) and
some other tidbits so readdirplus on NTFS can simply return wanted
information without ever having to do a lookup() on the file name to
obtain the inode to then use that in the stat() system call... The
potential decrease in work needed is tremendous in this case...
Imagine "ls -li" running with a single readdirplus() syscall and that is
all that happens on the kernel side, too. Not a single file name needs
to be looked up and not a single inode needs to be loaded. I don't
think anyone can deny that that would be a massive speedup of "ls -li"
for file systems whose directory entries store extra information to
traditional unix file systems...
Best regards,
Anton
--
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/
-For a more extreme case, hfs and hfsplus don't even have a separation between directory entries and inode information. The code creates this separation synthetically to match the expectations of the kernel. During a readdir(), the full catalog record is loaded from disk, but all that is used is the information passed back to the filldir callback. The only thing that would be needed to return extra information would be code to copy information from the internal structure to whatever the system call used to return data to the program. Brad Boyer flar@allandria.com -
In this case you can infact already instanciate inodes froms readdir. Take a look at the NFS code. -
Sure. And having readdirplus over the wire is a great performance win for NFS, but it works only because NFS metadata consistency is already weak. Giving applications an atomic readdirplus makes things considerably simpler for distributed filesystems that want to provide strong consistency (and a reasonable interpretation of what POSIX semantics mean for a distributed filesystem). In particular, it allows the application (e.g. ls --color or -al) to communicate to the kernel and filesystem that it doesn't care about the relative ordering of each subsequent stat() with respect to other writers (possibly on different hosts, with whom synchronization can incur a heavy performance penalty), but rather only wants a snapshot of dentry+inode state. As Andreas already mentioned, detecting this (exceedingly common) case may be possible with heuristics (e.g. watching the ordering of stat() calls vs the filldir resuls), but that's hardly ideal when a cleaner interface can explicitly capture the application's requirements. sage -
What exactly do you mean by an "atomic readdirplus"? Standard readdir is
by its very nature weakly cached, and there is no guarantee whatsoever
even that you will see all files in the directory. See the SuSv3
definition, which explicitly states that there is no ordering w.r.t.
file creation/deletion:
The type DIR, which is defined in the <dirent.h> header,
represents a directory stream, which is an ordered sequence of
all the directory entries in a particular directory. Directory
entries represent files; files may be removed from a directory
or added to a directory asynchronously to the operation of
readdir().
Besides, why would your application care about atomicity of the
attribute information unless you also have some form of locking to
guarantee that said information remains valid until you are done
processing it?
Cheers,
Trond
-I mean atomic only in the sense that the stat result returned by readdirplus() would reflect the file state at some point during the time consumed by that system call. In contrast, when you call stat() separately, it's expected that the result you get back reflects the state at some time during the stat() call, and not the readdir() that may have preceeded it. readdir() results may be weakly cached, but stat() results normally aren't (ignoring the usual NFS behavior for the moment). It's the stat() part of readdir() + stat() that makes life unnecessarily difficult for a filesystem providing strong consistency. How can the filesystem know that 'ls' doesn't care if the stat() results are accurate at the time of the readdir() and not the subsequent stat()? Something like readdirplus() allows that to be explicitly communicated, without resorting to heuristics or weak metadata consistency (ala NFS attribute caching). For distributed or network filesystems that can be a big win. (Admittedly, there's probably little benefit for local filesystems beyond the possibility of better prefetching, if syscalls are as cheap as Something like 'ls' certainly doesn't care, but in general applications do care that stat() results aren't cached. They expect the stat results to reflect the file's state at a point in time _after_ they decide to call stat(). For example, for process A to see how much data a just-finished process B wrote to a file... sage -
'ls --color' and 'find' don't give a toss about most of the arguments from 'stat()'. They just want to know what kind of filesystem object they are dealing with. We already provide that information in the readdir() syscall via the 'd_type' field. Adding all the other stat() information is just going to add unnecessary AFAICS, it will not change any consistency semantics. The main irritation it will introduce will be that the NFS client will suddenly have to do things like synchronising readdirplus() and file write() in order to provide the POSIX guarantees that you mentioned. i.e: if someone has written data to one of the files in the directory, then an NFS client will now have to flush that data out before calling readdir so that the server returns the correct m/ctime or file size. Previously, it could delay that until the stat() call. Trond -
That is _almost_ true, except that "ls --color" does a stat anyways to get the file mode (to set the "*" executable type) and the file blocks (with -s) and the size (with -l) and the inode number (with -i). In a clustered filesystem getting the inode number and mode is easily done along with the uid/gid (for many kinds of "find") while getting the file size may be non-trivial. Just to be clear, I have no desire to include any kind of "synchronization" semantics to readdirplus() that is also being discussed in this thread. Just the ability to bundle select stat info along with the readdir information, and to allow stat to not return any unnecessary info (in particular size, blocks, mtime) that may be harder to gather on a clustered filesystem. Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I'm not suggesting any "synchronization" beyond what opendir()/readdir() already require for the directory entries themselves. If I'm not mistaken, readdir() is required only to return directory entries as recent as the opendir() (i.e., you shouldn't see entries that were unlink()ed before you called opendir(), and intervening changes to the directory may or may not be reflected in the result, depending on how your implementation is buffering things). I would think the stat() portion of readdirplus() would be similarly (in)consistent (i.e., return a value at least as recent as the opendir()) to make life easy for the implementation and to align with existing readdir() semantics. My only concern is the "at least as recent as the opendir()" part, in contrast to statlite(), which has undefined "recentness" of its result for fields not specified in the mask. Ideally, I'd like to see readdirplus() also take a statlite() style mask, so that you can choose between either "vaguely recent" and "at least as recent as opendir()". As you mentioned, by the time you look at the result of any call (in the absence of locking) it may be out of date. But simply establishing an ordering is useful, especially in a clustered environment where some nodes are waiting for other nodes (via barriers or whatever) and then want to see the effects of previously completed fs operations. Anyway, "synchronization" semantics aside (since I appear to be somewhat alone on this :)... I'm wondering if a corresponding opendirplus() (or similar) would also be appropriate to inform the kernel/filesystem that readdirplus() will follow, and stat information should be gathered/buffered. Or do most implementations wait for the first readdir() before doing any actual work anyway? sage -
In my opinion (which may not be that of the original statlite() authors) is that the flags should really be called "valid", and any bit set in the flag can be considered valid, and any unset bit means "this field has no valid data". Having it mean "it might be out of date" gives the false impression that it might contain valid (if slightly out of date) Ah, OK. I didn't understand what you were getting at before. I agree I'm not sure what some filesystems might do here. I suppose NFS has weak enough cache semantics that it _might_ return stale cached data from the client in order to fill the readdirplus() data, but it is just as likely that it ships the whole thing to the server and returns everything in one shot. That would imply everything would be at least as up-to-date as the opendir(). Cheers, Andreas -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
Whether or not the posix committee decides on readdirplus, I propose that we implement this sort of thing in the kernel via a readdir equivalent to posix_fadvise(). That can give exactly the barrier semantics that they are asking for, and only costs 1 extra syscall as opposed to 2 (opendirplus() and readdirplus()). Cheers Trond -
I think the "barrier semantics" are something that have just crept into this discussion and is confusing the issue. The primary goal (IMHO) of this syscall is to allow the filesystem (primarily distributed cluster filesystems, but HFS and NTFS developers seem on board with this too) to avoid tens to thousands of stat RPCs in very common ls -R, find, etc. kind of operations. I can't see how fadvise() could help this case? Yes, it would tell the filesystem that it could do readahead of the readdir() data, but the app will still be doing stat() on each of the thousands of files in the directory, instantiating inodes and dentries on that node (which need locking, and potentially immediate lock revocation if the files are being written to by other nodes). In some cases (e.g. rm -r, grep -r) that might even be a win, because the client will soon be touching all of those files, but not necessarily in the ls -lR, find cases. The filesystem can't always do "stat-ahead" on the files because that requires instantiating an inode on the client which may be stale (lock revoked) by the time the app gets to it, and the app (and the VFS) have no idea just how stale it is, and whether the stat is a "real" stat or "only" the readdir stat (because the fadvise would only be useful on the directory, and not all of the child entries), so it would need to re-stat the file. Also, this would potentially blow the client's real working set of inodes out of cache. Doing things en-masse with readdirplus() also allows the filesystem to do the stat() operations in parallel internally (which is a net win if there are many servers involved) instead of serially as the application would do. Cheers, Andreas PS - I changed the topic to separate this from the openfh() thread. -- Andreas Dilger Principal Software Engineer Cluster File Systems, Inc. -
I don't think that ls -R and find are that common cases that they need
introduction of new operations in order to made them faster. On the
other hand may be they are often being used to do microbenchmarks. If
you goal is to make these filesystems look faster on microbenchmarks,
then probably you have the right solution. For normal use, especially
on clusters, I don't see any advantage of doing that.
Thanks,
Lucho
-Hi Lucho, Andreas is right on mark. The problem here is that when one user kicks off an ls -l or ls -R on a cluster file system *while other users are trying to get work done*, all those stat RPCs and lock reclamations can kill performance. We're not interested in a "ls -lR" top 500, we're interested in making systems more usable, more tolerant to everyday user behaviors. Regards, Rob -
It is the _only_ concept that is of interest for something like NFS or 'find' should be quite happy with the existing readdir(). It does not need to use stat() or readdirplus() in order to recurse because readdir() provides d_type. The locking problem is only of interest to clustered filesystems. On local filesystems such as HFS, NTFS, and on networked filesystems like NFS or CIFS, the only lock that matters is the parent directory's inode->i_sem, which is held by readdir() anyway. If the application is able to select a statlite()-type of behaviour with the fadvise() hints, your filesystem could be told to serve up cached information instead of regrabbing locks. In fact that is a much more flexible scheme, since it also allows the filesystem to background the actual inode lookups, or to defer them altogether if that is more Then provide hints that allow the app to select which behaviour it prefers. Most (all?) apps don't _care_, and so would be quite happy with cached information. That is why the current NFS caching model exists in If your application really cared, it could add threading to 'ls' to achieve the same result. You can also have the filesystem preload that information based on fadvise hints. Trond -
Actually, wouldn't the ability for readdirplus() (with valid flag) be useful for NFS if only to indicate that it does not need to flush the It does in any but the most simplistic invocations, like "find -mtime" I guess I just don't understand how fadvise() on a directory file handle (used for readdir()) can be used to affect later stat operations (which definitely will NOT be using that file handle)? If you mean that the application should actually open() each file, fadvise(), fstat(), close(), instead of just a stat() call then we are WAY into negative improvements Most clustered filesystems have strong cache semantics, so that isn't a problem. IMHO, the mechanism to pass the hint to the filesystem IS the readdirplus_lite() that tells the filesystem exactly which data is Because in many cases it is desirable to limit the number of DLM locks on a given client (e.g. GFS2 thread with AKPM about clients with millions of DLM locks due to lack of memory pressure on large mem systems). That means a finite-size lock LRU on the client that risks being wiped out by a few thousand files in a directory doing "readdir() + 5000*stat()". Consider a system like BlueGene/L with 128k compute cores. Jobs that run on that system will periodically (e.g. every hour) create up to 128K checkpoint+restart files to avoid losing a lot of computation if a node crashes. Even if each one of the checkpoints is in a separate directory (I wish all users were so nice :-) it means 128K inodes+DLM locks for doing But it would still need 128K RPCs to get that information, and 128K new inodes on that client. And what is the chance that I can get a multi-threading "ls" into the upstream GNU ls code? In the case of local filesystems multi-threading ls would be a net loss due to seeking. But even for local filesystems readdirplus_lite() would allow them to fill in stat information they already have (either in cache or on disk), and may avoid doing extra work that isn't needed. For filesystems that do...
That is why statlite() might be useful. I'd prefer something more The only 'win' a readdirplus may give you there as far as NFS is concerned is the sysenter overhead that you would have for calling On the contrary, the readdir descriptor is used in all those funky new statat(), calls. Ditto for readlinkat(), faccessat(). You could even have openat() turn off the close-to-open GETATTR if the readdir descriptor contained a hint that told it that was unnecessary. Furthermore, since the fadvise-like caching operation works on filehandles, you could have it work both on readdir() for the benefit of the above *at() calls, and also on the regular file descriptor for the That is precisely the sort of situation where knowing when you can cache, and when you cannot would be a plus. An ls call may not need 128k dlm locks, because it only cares about the state of the inodes as they NFS doesn't 'cos it implements readdirplus under the covers as far as The thing to note, though, is that in the NFS implementation we are _very_ careful about use the GETATTR information it returns if there is already an inode instantiated for that dentry. This is precisely because we don't want to deal with the issue of synchronisation w.r.t. an inode that may be under writeout, that may be the subject of setattr() calls, etc. As far as we're concerned, READDIRPLUS is a form of mass LOOKUP, not a mass inode revalidation Trond -
