Re: NFSv4/pNFS possible POSIX I/O API standards

Previous thread: We Guuaranteees Bigger Pen-nis by Elena Waller on Sunday, November 26, 2006 - 9:59 am. (1 message)

Next thread: [PATCH -mm] gfs2 lock function parameter by Randy Dunlap on Tuesday, November 28, 2006 - 11:29 pm. (2 messages)
From: Gary Grider
Date: Monday, November 27, 2006 - 9:34 pm

From: Christoph Hellwig
Date: Monday, November 27, 2006 - 10:54 pm

What crack do you guys have been smoking?

---end quoted text---
-



From: Andreas Dilger
Date: Tuesday, November 28, 2006 - 3:54 am

IMHO, this is a logical extension to readv/writev.  It allows a single
readx/writex syscall to specify different targets in the file instead of
needing separate syscalls.  So, for example, a single syscall could be
given to dump a sparse corefile or a compiled+linked binary, allowing
the filesystem to optimize the allocations instead of getting essentially

This is a big win for clustered filesystems.  Some "stat" items are
a lot more work to gather than others, and if an application (e.g.
"ls --color" which is default on all distros) doesn't need anything
except the file mode to print "*" and color an executable green it
is a waste to gather the remaining ones.

My objection to the current proposal is that it should be possible
to _completely_ specify which fields are required and which are not,
instead of having a split "required" and "optional" section to the
stat data.  In some implementations, it might be desirable to only
find the file blocks (e.g. ls -s, du -s) and not the owner, links,
metadata, so why implement a half-baked version of a "lite" stat()?

Also, why pass the "st_litemask" as a parameter set in the struct
(which would have to be reset on each call) instead of as a parameter
to the function (makes the calling convention much clearer)?

int statlite(const char *fname, struct stat *buf, unsigned long *statflags);


[ readdirplus not referenced ]
It would be prudent, IMHO, that if we are proposing statlite() and
readdirplus() syscalls, that the readdirplus() syscall be implemented
as a special case of statlite().  It avoids the need for yet another
version in the future "readdirpluslite()" or whatever...

Namely readdirplus() takes a "statflags" paremeter (per above) so that
the dirent_plus data only has to retrieve the required stat data (e.g. ls
-iR only needs inode number) and not all of it.  Each returned stat
has a mask of valid fields in it, as e.g. some items might be in cache

Strange, group is called HECIWG, website is "hecewg"?

Cheers, ...
From: Anton Altaparmakov
Date: Tuesday, November 28, 2006 - 4:28 am

Indeed.  It is best to be able to say what the application wants.  Have
a look at the Mac OS X getattrlist() system call (note there is a
setattrlist(), too but that is off topic wrt to the stat() function).

You can see the man page here for example:

http://www.hmug.org/man/2/getattrlist.php

This interface btw also is made such that each file system can define
which attributes it actually supports and a call to getattrlist() can
determine what the current file system supports.  This allows
applications to tune themselves to what the file system supports that
they are running on...

I am not saying you should just copy it.  But in case you were not aware
of it you may want to at least look at it for what others have done in
this area.

Best regards,

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

-



From: Russell Cattelan
Date: Tuesday, November 28, 2006 - 1:17 pm

I half agree with allowing for a configurable stat call.
But when it comes to standards and expecting everybody to adhere
them a well defined list required and optional fields seems necessary.
Otherwise every app will simply start asking for various fields they
*think*
is important, without much regard for what might be and expensive stat
to obtain
cluster wide and which ones are cheap.

By clearly defining the list and standardizing that list the app
programmer and
the kernel programmer know what is expected.

The mask idea sounds like a good way to implement it, but at the same
--=20
Russell Cattelan <cattelan@thebarn.com>
From: Wendy Cheng
Date: Tuesday, November 28, 2006 - 4:28 pm

Some of the described calls look very exciting and, based on our current 
customer issues, we have needs for them today rather than tomorrow. This 
"statlite()" is definitely one of them as we have been plagued by "ls" 
performance for a while. I'm wondering whether there are implementation 
efforts to push this into LKML soon ?

-- Wendy

-



From: Christoph Hellwig
Date: Wednesday, November 29, 2006 - 2:12 am

Ameer Armaly started and implementation of this, but unfortunately never
posted an updated patch incorporating the review comments.  See
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2 for
details.
-



From: Christoph Hellwig
Date: Wednesday, November 29, 2006 - 2:04 am

I'd like to apologize for this statement, it was a little harsh.

I still think most of these APIs are rather braindead, but then again
everyone does braindead APIs from now to then.

I still think it's very futile that you try to force APIs using
standizations on us.  Instead of going down that route please try
to present a case for every single API you want, including reasonings
why this can't be fixed by speeding up existing APIs.  Note that with
us I don't mean just linux but also other OpenSource OSes.  Unless you
at least get Linux and FreeBSD and Solaris to agree on the need for
the API it's very pointless to go anywhere close to a standization body.

Anyway, let's go on to the individual API groups:

 - readdirplus

	This one is completely unneeded as a kernel API.  Doing readdir
	plus calls on the wire makes a lot of sense and we already do
	that for NFSv3+.  Doing this at the syscall layer just means
	kernel bloat - syscalls are very cheap.

 - lockg

	I'm more than unhappy to add new kernel-level file locking calls.
	The whole mess of lockf vs fcntl vs leases is bad enough that we
	don't want to add more to it.  Doing some form of advisory locks
	that can be implemented in userland using a shared memory region
	or message passing might be fine.

 - openg/sutoc

	No way.  We already have a very nice file descriptor abstraction.
	You can pass file descriptors over unix sockets just fine.

 - NFSV4acls

	These have nothing to do at all with I/O performance.  They're
	also sufficiently braindead.  Even if you still want to push for
	it you shouldn't mix it up with anything else in here.

 - statlite

	The concept generally makes sense.  The specified details are however
	very wrong.  Any statlite call should operate on the normal
	OS-specified stat structure and have the mask of flags as an
	additional argument.  Because of that you can only specific
	existing posix stat values as mandatory, but we should have an
	informal agreement that assigns unique ...
From: Christoph Hellwig
Date: Wednesday, November 29, 2006 - 2:14 am

Just thinking about the need to add another half a dozend syscalls for this.
What about somehow funneling this into the flags argument of the {f,l,}statat
syscalls?

-



From: Andreas Dilger
Date: Wednesday, November 29, 2006 - 2:48 am

The question is how does the filesystem know that the application is
going to do readdir + stat every file?  It has to do this as a heuristic
implemented in the filesystem to determine if the ->getattr() calls match
the ->readdir() order.  If the application knows that it is going to be
doing this (e.g. ls, GNU rm, find, etc) then why not let the filesystem
take advantage of this information?  If combined with the statlite
interface, it can make a huge difference for clustered filesystems.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Anton Altaparmakov
Date: Wednesday, November 29, 2006 - 3:18 am

The other thing is that a readdirplus at least for some file systems can
be implemented much more efficiently than readdir + stat because the
directory entry itself contains a lot of extra information.

To take NTFS as an example I know something about, the directory entry
caches the a/c/m time as well as the data file size (needed for "ls")
and the allocated on disk file size (needed for "du") as well as the
inode number corresponding to the name, the flags of the inode
(read-only, hidden, system, whether it is a file or directory, etc) and
some other tidbits so readdirplus on NTFS can simply return wanted
information without ever having to do a lookup() on the file name to
obtain the inode to then use that in the stat() system call...  The
potential decrease in work needed is tremendous in this case...

Imagine "ls -li" running with a single readdirplus() syscall and that is
all that happens on the kernel side, too.  Not a single file name needs
to be looked up and not a single inode needs to be loaded.  I don't
think anyone can deny that that would be a massive speedup of "ls -li"
for file systems whose directory entries store extra information to
traditional unix file systems...

Best regards,

        Anton
-- 
Anton Altaparmakov <aia21 at cam.ac.uk> (replace at with @)
Unix Support, Computing Service, University of Cambridge, CB2 3QH, UK
Linux NTFS maintainer, http://www.linux-ntfs.org/

-



From: Brad Boyer
Date: Wednesday, November 29, 2006 - 1:26 am

For a more extreme case, hfs and hfsplus don't even have a separation
between directory entries and inode information. The code creates this
separation synthetically to match the expectations of the kernel. During
a readdir(), the full catalog record is loaded from disk, but all that
is used is the information passed back to the filldir callback. The only
thing that would be needed to return extra information would be code to
copy information from the internal structure to whatever the system call
used to return data to the program.

	Brad Boyer
	flar@allandria.com

-



From: Christoph Hellwig
Date: Thursday, November 30, 2006 - 2:25 am

In this case you can infact already instanciate inodes froms readdir.
Take a look at the NFS code.

-



From: Sage Weil
Date: Thursday, November 30, 2006 - 10:49 am

Sure.  And having readdirplus over the wire is a great performance win for 
NFS, but it works only because NFS metadata consistency is already weak.

Giving applications an atomic readdirplus makes things considerably 
simpler for distributed filesystems that want to provide strong 
consistency (and a reasonable interpretation of what POSIX semantics mean 
for a distributed filesystem).  In particular, it allows the application 
(e.g. ls --color or -al) to communicate to the kernel and filesystem that 
it doesn't care about the relative ordering of each subsequent stat() with 
respect to other writers (possibly on different hosts, with whom 
synchronization can incur a heavy performance penalty), but rather only 
wants a snapshot of dentry+inode state.

As Andreas already mentioned, detecting this (exceedingly common) case may 
be possible with heuristics (e.g. watching the ordering of stat() calls vs 
the filldir resuls), but that's hardly ideal when a cleaner interface can 
explicitly capture the application's requirements.

sage
-



From: Trond Myklebust
Date: Thursday, November 30, 2006 - 10:26 pm

What exactly do you mean by an "atomic readdirplus"? Standard readdir is
by its very nature weakly cached, and there is no guarantee whatsoever
even that you will see all files in the directory. See the SuSv3
definition, which explicitly states that there is no ordering w.r.t.
file creation/deletion:

        The type DIR, which is defined in the <dirent.h> header,
        represents a directory stream, which is an ordered sequence of
        all the directory entries in a particular directory. Directory
        entries represent files; files may be removed from a directory
        or added to a directory asynchronously to the operation of
        readdir().

Besides, why would your application care about atomicity of the
attribute information unless you also have some form of locking to
guarantee that said information remains valid until you are done
processing it?

Cheers,
  Trond

-



From: Sage Weil
Date: Friday, December 1, 2006 - 12:08 am

I mean atomic only in the sense that the stat result returned by 
readdirplus() would reflect the file state at some point during the time 
consumed by that system call.  In contrast, when you call stat() 
separately, it's expected that the result you get back reflects the state 
at some time during the stat() call, and not the readdir() that may 
have preceeded it.  readdir() results may be weakly cached, but stat() 
results normally aren't (ignoring the usual NFS behavior for the moment).

It's the stat() part of readdir() + stat() that makes life unnecessarily 
difficult for a filesystem providing strong consistency.  How can the 
filesystem know that 'ls' doesn't care if the stat() results are accurate 
at the time of the readdir() and not the subsequent stat()?  Something 
like readdirplus() allows that to be explicitly communicated, without 
resorting to heuristics or weak metadata consistency (ala NFS attribute 
caching).  For distributed or network filesystems that can be a big win. 
(Admittedly, there's probably little benefit for local filesystems beyond 
the possibility of better prefetching, if syscalls are as cheap as 

Something like 'ls' certainly doesn't care, but in general applications do 
care that stat() results aren't cached.  They expect the stat results to 
reflect the file's state at a point in time _after_ they decide to call 
stat().  For example, for process A to see how much data a just-finished 
process B wrote to a file...

sage

-



From: Trond Myklebust
Date: Friday, December 1, 2006 - 7:41 am

'ls --color' and 'find' don't give a toss about most of the arguments
from 'stat()'. They just want to know what kind of filesystem object
they are dealing with. We already provide that information in the
readdir() syscall via the 'd_type' field.
Adding all the other stat() information is just going to add unnecessary

AFAICS, it will not change any consistency semantics. The main
irritation it will introduce will be that the NFS client will suddenly
have to do things like synchronising readdirplus() and file write() in
order to provide the POSIX guarantees that you mentioned.
i.e: if someone has written data to one of the files in the directory,
then an NFS client will now have to flush that data out before calling
readdir so that the server returns the correct m/ctime or file size.
Previously, it could delay that until the stat() call.

Trond

-



From: Sage Weil
Date: Friday, December 1, 2006 - 9:47 am

'ls -al' cares about the stat() results, but does not care about the 
relative timing accuracy wrt the preceeding readdir().  I'm not sure why 
'ls --color' still calls stat when it can get that from the readdir() 
results, but either way it's asking more from the kernel/filesystem than 

It sounds like you're talking about a single (asynchronous) client in a 
directory.  In that case, the client need only flush if someone calls 
readdirplus() instead of readdir(), and since readdirplus() is effectively 
also a stat(), the situation isn't actually any different.

The more interesting case is multiple clients in the same directory.  In 
order to provide strong consistency, both stat() and readdir() have to 
talk to the server (or more complicated leasing mechanisms are needed). 
In that scenario, readdirplus() is asking for _less_ 
synchronization/consistency of results than readdir()+stat(), not more. 
i.e. both the readdir() and stat() would require a server request in order 
to achieve the standard POSIX semantics, while a readdirplus() would allow 
a single request.  The NFS client already provibes weak consistency of 
stat() results for clients.  Extending the interface doesn't suddenly 
require the NFS client to provide strong consistency, it just makes life 
easier for the implementation if it (or some other filesystem) chooses to 
do so.

Consider two use cases.  Process A is 'ls -al', who doesn't really care 
about when the size/mtime are from (i.e. sometime after opendir()). 
Process B waits for a process on another host to write to a file, and then 
calls stat() locally to check the result.  In order for B to get the 
correct result, stat() _must_ return a value for size/mtime from _after_ 
the stat() initiated.  That makes 'ls -al' slow, because it probably has 
to talk to the server to make sure files haven't been modified between the 
readdir() and stat().  In reality, 'ls -al' doesn't care, but the 
filesystem has no way to know that without the presense of ...
From: Trond Myklebust
Date: Friday, December 1, 2006 - 11:07 am

Why would that be interesting? What applications do you have that
require strong consistency in that scenario? I keep looking for uses for
strong cache consistency with no synchronisation, but I have yet to meet

I'm quite happy with a proposal for a statlite(). I'm objecting to
readdirplus() because I can't see that it offers you anything useful.
You haven't provided an example of an application which would clearly
benefit from a readdirplus() interface instead of readdir()+statlite()

Using readdir() to monitor size/mtime on individual files is hardly a
case we want to optimise for. There are better tools, including
inotify() for applications that care.

I agree that an interface which allows a userland process offer hints to
the kernel as to what kind of cache consistency it requires for file
metadata would be useful. We already have stuff like posix_fadvise() etc
for file data, and perhaps it might be worth looking into how you could
devise something similar for metadata.
If what you really want is for applications to be able to manage network
filesystem cache consistency, then why not provide those tools instead?

Cheers,
  Trond

-



From: Sage Weil
Date: Friday, December 1, 2006 - 11:42 am

Okay, now I think I understand where you're coming from.

The difference between readdirplus() and readdir()+statlite() is that 
(depending on the mask you specify) statlite() either provides the "right" 
answer (ala stat()), or anything that is vaguely "recent."  readdirplus() 
would provide size/mtime from sometime _after_ the initial opendir() call, 
establishing a useful ordering.  So without readdirplus(), you either get 
readdir()+stat() and the performance problems I mentioned before, or 
readdir()+statlite() where "recent" may not be good enough.

Instead of my previous example of proccess #1 waiting for process #2 to 
finish and then checking the results with stat(), imagine instead that #1 
is waiting for 100,000 other processes to finish, and then wants to check 
the results (size/mtime) of all of them.  readdir()+statlite() won't 
work, and readdir()+stat() may be pathologically slow.

Also, it's a tiring and trivial example, but even the 'ls -al' scenario 
isn't ideally addressed by readdir()+statlite(), since statlite() might 
return size/mtime from before 'ls -al' was executed by the user.  One can 
easily imagine modifying a file on one host, then doing 'ls -al' on 
another host and not seeing the effects.  If 'ls -al' can use 
readdirplus(), it's overall application semantics can be preserved without 

True, something to manage the attribute cache consistency for statlite() 
results would also address the issue by letting an application declare how 
weak it's results are allowed to be.  That seems a bit more awkward, 
though, and would only affect statlite()--the only call that allows weak 
consistency in the first place.  In contrast, readdirplus maps nicely onto 
what filesystems like NFS are already doing over the wire.

sage
-



From: Trond Myklebust
Date: Friday, December 1, 2006 - 12:13 pm

Currently, you will never get anything other than weak consistency with
NFS whether you are talking about stat(), access(), getacl(),
lseek(SEEK_END), or append(). Your 'permitting it' only in statlite() is
irrelevant to the facts on the ground: I am not changing the NFS client
caching model in any way that would affect existing applications.

Cheers
  Trond

-



From: Sage Weil
Date: Friday, December 1, 2006 - 1:32 pm

It does with NFS, but only because NFS doesn't follow POSIX in that 
regard.  In general, stat() is supposed to return a value that's 
accurate at the time of the call.

(Although now I'm confused again.  If you're assuming stat() can return 

Clearly, if you cache attributes on the client and provide only weak 
consistency, then readdirplus() doesn't change much.  But _other_ non-NFS 
filesystems may elect to provide POSIX semantics and strong consistency, 
even though NFS doesn't.  And the interface simply doesn't allow that to 
be done efficiently in distributed environments, because applications 
can't communicate their varying consistency needs.  Instead, systems like 
NFS weaken attribute consistency globally.  That works well enough for 
most people most of the time, but it's hardly ideal.

readdirplus() allows applications like 'ls -al' to distinguish themselves 
from applications that want individually accurate stat() results.  That in 
turn allows distributed filesystems that are both strongly consistent 
_and_ efficient at scale.  In most cases, it'll trivially turn into a 
readdir()+stat() in the VFS, but in some cases filesystems can exploit 
that information for (often enormous) performance gain, while still 
maintaining well-defined consistency semantics.  readdir() already leaks 
some inode information into it's result (via d_type)... I'm not sure I 
understand the resistance to providing more.

sage
-



From: Peter Staubach
Date: Monday, December 4, 2006 - 11:02 am

I think that there are several points which are missing here.

First, readdirplus(), without any sort of caching, is going to be _very_
expensive, performance-wise, for _any_ size directory.  You can see this
by instrumenting any NFS server which already supports the NFSv3 READDIRPLUS
semantics.

Second, the NFS client side readdirplus() implementation is going to be
_very_ expensive as well.  The NFS client does write-behind and all this
data _must_ be flushed to the server _before_ the over the wire READDIRPLUS
can be issued.  This means that the client will have to step through every
inode which is associated with the directory inode being readdirplus()'d
and ensure that all modified data has been successfully written out.  This
part of the operation, for a sufficiently large directory and a sufficiently
large page cache, could take signficant time in itself.

These overheads may make this new operation expensive enough that no

Speaking of applications, how many applications are there in the world,
or even being contemplated, which are interested in a directory of
files and whether or not this set of files has changed from the previous
snapshot of the set of files?  Most applications deal with one or two
files on such a basis, not multitudes.  In fact, having worked with
file systems and NFS in particular for more than 20 years now, I have
yet to hear of one.  This is a lot of work and complexity for very
little gain, I think.

Is this not a problem which be better solved at the application level?
Or perhaps finer granularity than "noac" for the NFS attribute caching?

    Thanx...

       ps
-



From: Sage Weil
Date: Tuesday, December 5, 2006 - 4:20 pm

Are you referring to the work the server must do to gather stat 

Why can't the client send the over the wire READDIRPLUS without flushing 
inode data, and then simply ignore the stat portion of the server's 
response in instances where it's locally cached (and dirty) inode data is 

If the application calls readdirplus() only when it would otherwise do 
readdir()+stat(), the flushing you mention would happen anyway (from the 
stat()).  Wouldn't this at least allow that to happen in parallel for the 
whole directory?

sage
-



From: Peter Staubach
Date: Wednesday, December 6, 2006 - 8:48 am

Yes and the fact that the client will be forced to go over the wire for
each readdirplus() call, whereas it can use cached information today.
An application actually waiting on the response to a READDIRPLUS will

This would seem to minimize the value as far as I understand the

I don't see where the parallelism comes from.  Before issuing the
READDIRPLUS over the wire, the client would have to ensure that each
and every one of those flushes was completed.  I suppose that a
sufficiently clever and complex implementation could figure out how
to schedule all those flushes asynchronously and then wait for all
of them to complete, but there will be a performance cost.  Walking
the caches for all of those inodes, perhaps using several or all of
the cpus in the system, smacking the server with all of those WRITE
operations simultaneously with all of the associated network
bandwidth usage, all adds up to other applications on the client
and potentially the network not doing much at the same time.

All of this cost to the system and to the network for the benefit of
a single application?  That seems like a tough sell to me.

This is an easy problem to look at from the application viewpoint.
The solution seems obvious.  Give it the fastest possible way to
read the directory and retrieve stat information about every entry
in the directory.  However, when viewed from a systemic level, this
becomes a very different problem with many more aspects.  Perhaps
flow controlling this one application in favor of many other applications,
running network wide, may be the better thing to continue to do.
I dunno.

       ps
-



From: Andreas Dilger
Date: Saturday, December 2, 2006 - 6:57 pm

To be honest, I can't think of any use that actually _requires_ consistency
from stat() or readdir(), because even if the data was valid in the kernel
at the time it was gathered, there is no guarantee all the files haven't
been deleted by another thread even before the syscall is complete.  Any
pretending that the returned data is "current" is a pipe dream.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Andreas Dilger
Date: Saturday, December 2, 2006 - 6:52 pm

That is _almost_ true, except that "ls --color" does a stat anyways to
get the file mode (to set the "*" executable type) and the file blocks
(with -s) and the size (with -l) and the inode number (with -i).

In a clustered filesystem getting the inode number and mode is easily
done along with the uid/gid (for many kinds of "find") while getting
the file size may be non-trivial.

Just to be clear, I have no desire to include any kind of
"synchronization" semantics to readdirplus() that is also being discussed
in this thread.  Just the ability to bundle select stat info along with
the readdir information, and to allow stat to not return any unnecessary
info (in particular size, blocks, mtime) that may be harder to gather
on a clustered filesystem.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Sage Weil
Date: Sunday, December 3, 2006 - 9:10 am

I'm not suggesting any "synchronization" beyond what opendir()/readdir() 
already require for the directory entries themselves.  If I'm not 
mistaken, readdir() is required only to return directory entries as recent 
as the opendir() (i.e., you shouldn't see entries that were unlink()ed 
before you called opendir(), and intervening changes to the directory may 
or may not be reflected in the result, depending on how your 
implementation is buffering things).  I would think the stat() portion of 
readdirplus() would be similarly (in)consistent (i.e., return a value at 
least as recent as the opendir()) to make life easy for the implementation 
and to align with existing readdir() semantics.  My only concern is the 
"at least as recent as the opendir()" part, in contrast to statlite(), 
which has undefined "recentness" of its result for fields not specified in 
the mask.

Ideally, I'd like to see readdirplus() also take a statlite() style mask, 
so that you can choose between either "vaguely recent" and "at least as 
recent as opendir()".

As you mentioned, by the time you look at the result of any call (in the 
absence of locking) it may be out of date.  But simply establishing an 
ordering is useful, especially in a clustered environment where some nodes 
are waiting for other nodes (via barriers or whatever) and then want to 
see the effects of previously completed fs operations.

Anyway, "synchronization" semantics aside (since I appear to be somewhat 
alone on this :)...

I'm wondering if a corresponding opendirplus() (or similar) would also be 
appropriate to inform the kernel/filesystem that readdirplus() will 
follow, and stat information should be gathered/buffered.  Or do most 
implementations wait for the first readdir() before doing any actual work 
anyway?

sage
-



From: Andreas Dilger
Date: Monday, December 4, 2006 - 12:32 am

In my opinion (which may not be that of the original statlite() authors)
is that the flags should really be called "valid", and any bit set in
the flag can be considered valid, and any unset bit means "this field
has no valid data".  Having it mean "it might be out of date" gives the
false impression that it might contain valid (if slightly out of date)

Ah, OK.  I didn't understand what you were getting at before.  I agree

I'm not sure what some filesystems might do here.  I suppose NFS has weak
enough cache semantics that it _might_ return stale cached data from the
client in order to fill the readdirplus() data, but it is just as likely
that it ships the whole thing to the server and returns everything in
one shot.  That would imply everything would be at least as up-to-date
as the opendir().

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Trond Myklebust
Date: Monday, December 4, 2006 - 8:15 am

Whether or not the posix committee decides on readdirplus, I propose
that we implement this sort of thing in the kernel via a readdir
equivalent to posix_fadvise(). That can give exactly the barrier
semantics that they are asking for, and only costs 1 extra syscall as
opposed to 2 (opendirplus() and readdirplus()).

Cheers
  Trond

-



From: Rob Ross
Date: Monday, December 4, 2006 - 5:59 pm

Hi all,

I don't think that the group intended that there be an opendirplus(); 
rather readdirplus() would simply be called instead of the usual 
readdir(). We should clarify that.

Regarding Peter Staubach's comments about no one ever using the 
readdirplus() call; well, if people weren't performing this workload in 
the first place, we wouldn't *need* this sort of call! This call is 
specifically targeted at improving "ls -l" performance on large 
directories, and Sage has pointed out quite nicely how that might work.

In our case (PVFS), we would essentially perform three phases of 
communication with the file system for a readdirplus that was obtaining 
full statistics: first grabbing the directory entries, then obtaining 
metadata from servers on all objects in bulk, then gathering file sizes 
in bulk. The reduction in control message traffic is enormous, and the 
concurrency is much greater than in a readdir()+stat()s workload. We'd 
never perform this sort of optimization optimistically, as the cost of 
guessing wrong is just too high. We would want to see the call as a 
proper VFS operation that we could act upon.

The entire readdirplus() operation wasn't intended to be atomic, and in 
fact the returned structure has space for an error associated with the 
stat() on a particular entry, to allow for implementations that stat() 
subsequently and get an error because the object was removed between 
when the entry was read out of the directory and when the stat was 
performed. I think this fits well with what Andreas and others are 
thinking. We should clarify the description appropriately.

I don't think that we have a readdirpluslite() variation documented yet? 
Gary? It would make a lot of sense. Except that it should probably have 
a better name...

Regarding Andreas's note that he would prefer the statlite() flags to 
mean "valid", that makes good sense to me (and would obviously apply to 
the so-far even more hypothetical readdirpluslite()). I don't think ...
From: Gary Grider
Date: Monday, December 4, 2006 - 9:44 pm

Correct, we do not have that documented.  I suppose we could just 
have a mask like

The one use that some users talk about is just knowing the file is 
growing is important  and useful to them,
knowing exactly to the byte how much growth seems less important to 
them until they close.
On these big parallel apps, so many things can happen that can just 
hang.  They often use
the presence of checkpoint files and how big they are to gage 
progress of he application.
Of course there are other ways this can be accomplished but they do 
this sort of thing
a lot.  That is the main case I have heard that might benefit from 
"possibly-inaccurate" values.
Of course it assumes that the inaccuracy is just old information and 
not bogus information.

Thanks, we will put out a complete version of what we have in a 
document to the Open Group
site in a week or two so all the pages in their current state are 
available.  We could then
begin some iteration on all these comments we have gotten from the 
various communities.

Thanks


-



From: Christoph Hellwig
Date: Tuesday, December 5, 2006 - 3:05 am

There are better ways to do it but we refuse to do it right is hardly

Could you please stop putting out specs until you actually have working
code?  There's absolutely no point in standardizing things until it's
actually used in practice.

-



From: Trond Myklebust
Date: Monday, December 4, 2006 - 10:56 pm

...and we have pointed out how nicely this ignores the realities of
current caching models. There is no need for a  readdirplus() system
call. There may be a need for a caching barrier, but AFAICS that is all.

Trond

-



From: Christoph Hellwig
Date: Tuesday, December 5, 2006 - 3:07 am

I think Andreas mentioned that it is useful for clustered filesystems
that can avoid additional roundtrips this way.  That alone might now
be enough reason for API additions, though.  The again statlite and
readdirplus really are the most sane bits of these proposals as they
fit nicely into the existing set of APIs.  The filehandle idiocy on
the other hand is way of into crackpipe land.
-



From: Matthew Wilcox
Date: Tuesday, December 5, 2006 - 7:20 am

Right, and it needs to be discarded.  Of course, there was a real
problem that it addressed, so we need to come up with an acceptable
alternative.

The scenario is a cluster-wide application doing simultaneous opens of
the same file.  So thousands of nodes all hitting the same DLM locks
(for read) all at once.  The openg() non-solution implies that all
nodes in the cluster share the same filehandle space, so I think a
reasonable solution can be implemented entirely within the clusterfs,
with an extra flag to open(), say O_CLUSTER_WIDE.  When the clusterfs
sees this flag set (in ->lookup), it can treat it as a hint that this
pathname component is likely to be opened again on other nodes and
broadcast that fact to the other nodes within the cluster.  Other nodes
on seeing that hint (which could be structured as "The child "bin"
of filehandle e62438630ca37539c8cc1553710bbfaa3cf960a7 has filehandle
ff51a98799931256b555446b2f5675db08de6229") can keep a record of that fact.
When they see their own open, they can populate the path to that file
without asking the server for extra metadata.

There's obviously security issues there (why I say 'hint' rather than
'command'), but there's also security problems with open-by-filehandle.
Note that this solution requires no syscall changes, no application
changes, and also helps a scenario where each node opens a different
file in the same directory.

I've never worked on a clusterfs, so there may be some gotchas (eg, how
do you invalidate the caches of nodes when you do a rename).  But this
has to be preferable to open-by-fh.
-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 8:04 am

The openg() solution has the following advantages to what you propose. 
First, it places the burden of the communication of the file handle on 
the application process, not the file system. That means less work for 
the file system. Second, it does not require that clients respond to 
unexpected network traffic. Third, the network traffic is deterministic 
-- one client interacts with the file system and then explicitly 
performs the broadcast. Fourth, it does not require that the file system 
store additional state on clients.

In the O_CLUSTER_WIDE approach, a naive implementation (everyone passing 
the flag) would likely cause a storm of network traffic if clients were 
closely synchronized (which they are likely to be). We could work around 
this by having one application open early, then barrier, then have 
everyone else open, but then we might as well have just sent the handle 
as the barrier operation, and we've made the use of the O_CLUSTER_WIDE 
open() significantly more complicated for the application.

However, the application change issue is actually moot; we will make 
whatever changes inside our MPI-IO implementation, and many users will 
get the benefits for free.

The readdirplus(), readx()/writex(), and openg()/openfh() were all 
designed to allow our applications to explain exactly what they wanted 
and to allow for explicit communication. I understand that there is a 
tendency toward solutions where the FS guesses what the app is going to 
do or is passed a hint (e.g. fadvise) about what is going to happen, 
because these things don't require interface changes. But these 
solutions just aren't as effective as actually spelling out what the 
application wants.

Regards,

Rob
-



From: Matthew Wilcox
Date: Wednesday, December 6, 2006 - 8:44 am

You didn't address the disadvantages I pointed out on December 1st in a

Christoph has also touched on some of these points, and added some I

I think you're referring to a naive application, rather than a naive
cluster filesystem, right?  There's several ways to fix that problem,
including throttling broadcasts of information, having nodes ask their
immediate neighbours if they have a cache of the information, and having
the server not respond (wait for a retransmit) if it's recently sent out


Sure, but I think you're emphasising "these interfaces let us get our
job done" over the legitimate concerns that we have.  I haven't really
looked at the readdirplus() or readx()/writex() interfaces, but the
security problems with openg() makes me think you haven't really looked
at it from the "what could go wrong" perspective.  I'd be interested in
reviewing the readx()/writex() interfaces, but still don't see a document
for them anywhere.
-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 9:15 am

I coincidentally just wrote about some of this in another email. Wasn't 


The fh_t would be validated either (a) when the openfh() is called, or 
on accesses using the associated capability. As Christoph pointed out, 
this really is a capability and encapsulates everything necessary for a 
particular user to access a particular file. It can be handed to others, 
and in fact that is a critical feature for our use case.

After the openfh(), the access model is identical to a previously 
open()ed file. So the question is what happens between the openg() and 
the openfh().

Our intention was to allow servers to "forget" these fh_ts at will. So a 
revoke between openg() and openfh() would kill the fh_t, and the 
subsequent openfh() would fail, or subsequent accesses would fail 
(depending on when the FS chose to validate).


We could use advice on this point. Certainly it's possible to encode 
information about the FS from which the fh_t originated, but we haven't 
tried to spell out exactly how that would happen. Your approach 

Yes, naive application. You're right that the file system could adapt to 
this, but on the other hand if we were explicitly passing the fh_t in 
user space, we could just use MPI_Bcast and be done with it, with an 

Absolutely. Same goes for readx()/writex() also, BTW, at least for 
MPI-IO users. We will build the input parameters inside MPI-IO using 
existing information from users, rather than applying data sieving or 

I'm sorry if it seems like I'm ignoring your concerns; that isn't my 
intention. I am advocating the calls though, because the whole point in 
getting into these discussions is to improve the state of things for 
these access patterns.

Part of the problem is that the descriptions of these calls were written 
for inclusion in a POSIX document and not for discussion on this list. 
Those descriptions don't usually include detailed descriptions of 
implementation options or use cases. We should have created some 
additional ...
From: Trond Myklebust
Date: Tuesday, December 5, 2006 - 7:55 am

They provide no benefits whatsoever for the two most commonly used
networked filesystems NFS and CIFS. As far as they are concerned, the
only new thing added by readdirplus() is the caching barrier semantics.
I don't see why you would want to add that into a generic syscall like
readdir() though: it is
        
        a) networked filesystem specific. The mask stuff etc adds no
        value whatsoever to actual "posix" filesystems. In fact it is
        telling the kernel that it can violate posix semantics.
        
        b) quite unnatural to impose caching semantics on all the
        directory _entries_ using a syscall that refers to the directory
        itself (see the explanations by both myself and Peter Staubach
        of the synchronisation difficulties). Consider in particular
        that it is quite possible for directory contents to change in
        between readdirplus calls.
        
        i.e. the "strict posix caching model' is pretty much impossible
        to implement on something like NFS or CIFS using these
        semantics. Why then even bother to have "masks" to tell you when
        it is OK to violate said strict model.

        c) Says nothing about what should happen to non-stat() metadata
        such as ACL information and other extended attributes (for
        example future selinux context info). You would think that the
        'ls -l' application would care about this.

Trond

-



From: Rob Ross
Date: Tuesday, December 5, 2006 - 3:11 pm

It isn't violating POSIX semantics if we get the calls passed as an 

I want to make sure that I understand this correctly. NFS semantics 
dictate that if someone stat()s a file that all changes from that client 
need to be propagated to the server? And this call complicates that 
semantic because now there's an operation on a different object (the 
directory) that would cause this flush on the files?

Of course directory contents can change in between readdirplus() calls, 
just as they can between readdir() calls. That's expected, and we do not 

We're trying to obtain improved performance for distributed file systems 

Honestly, we hadn't thought about other non-stat() metadata because we 
didn't think it was part of the use case, and we were trying to stay 
close to the flavor of POSIX. If you have ideas here, we'd like to hear 
them.

Thanks for the comments,

Rob
-



From: Trond Myklebust
Date: Tuesday, December 5, 2006 - 4:24 pm

The only way for an NFS client to obey the POSIX requirement that
write() immediately updates the mtime/ctime is to flush out all cached


See my previous postings.

Trond

-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 9:42 am

Thanks for explaining this. I've never understood how it is decided 
where the line is drawn with respect to where NFS does obey POSIX 
semantics for a particular implementation.


No I'm trying to explain when the calls might be useful. But if you are 
only interested in NFS and CIFS, then I guess the thread might not be 

I'll do that. Thanks.

Rob
-



From: Ragnar
Date: Wednesday, December 6, 2006 - 5:22 am

I don't see what's network filesystem specific about it. Correct me if
I'm wrong, but today ls -l on a local filesystem will first do readdir
and then n stat calls. In the worst case scenario this will generate n+=
1
disk seeks.

Local filesystems go through a lot of trouble to try to make the disk
layout of the directory entries and the inodes optimal so that readahea=
d
and caching reduces the number of seeks.

With readdirplus on the other hand, the filesystem would be able to sen=
d
all the requests to the block layer and it would be free to optimize
through disk elevators and what not.=20

And this is not simply an "ls -l" optimization. Allthough I can no loge=
r
remember why, I think this is exactly what imap servers are doing when
opening up big imap folders stored in maildir.=20

--=20
Ragnar Kj=F8rstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter
-



From: Trond Myklebust
Date: Wednesday, December 6, 2006 - 8:14 am

As far as local filesystems are concerned, the procedure is still the
same: read the directory contents, then do lookup() of the files
returned to you by directory contents, then do getattr().

There is no way to magically queue up the getattr() calls before you
have done the lookups, nor is there a way to magically queue up the
lookups before you have read the directory contents.

Trond

-



From: Latchesar Ionkov
Date: Tuesday, December 5, 2006 - 9:55 am

What is your opinion on giving the file system an option to lookup a
file more than one name/directory at a time? I think that all remote
file systems can benefit from that?

Thanks,
    Lucho
-



From: Christoph Hellwig
Date: Tuesday, December 5, 2006 - 3:12 pm

Do you mean something like the 4.4BSD namei interface where the VOP_LOOKUP
routine get the entire remaining path and is allowed to resolve as much of
it as it can (or wants)?

While this allows remote filesystems to optimize deep tree traversals it
creates a pretty big mess about state that is kept on lookup operations.

For Linux in particular it would mean doing large parts of __link_path_walk
in the filesystem, which I can't thing of a sane way to do.

-



From: Latchesar Ionkov
Date: Wednesday, December 6, 2006 - 4:12 pm

The way I was thinking of implementing it is leaving all the hard
parts of the name resolution in __link_path_walk and modifying inode's
lookup operation to accept an array of qstrs (and its size). lookup
would also check and revalidate the dentries if necessary (usually the
same operation as looking up a name for the remote filesystems).
lookup will check if it reaches symbolic link or mountpoint and will
stop resolving any further. __link_path_walk will use the name to fill
an array of qstrs (we can choose some sane size of the array, like 8
or 16), then call (directly or indirectly) ->lookup (nd->flags will
reflect the flags for the last element in the array), check if the
inode of the last dentry is symlink, and do what it currently does for
symlinks.

Does that make sense? Am I missing anything?

Thanks,
    Lucho
-



From: Trond Myklebust
Date: Wednesday, December 6, 2006 - 4:33 pm

I beg to differ. Revalidation is not the same as looking up: the locking

Again: locking. How do you keep the dcache sane while the filesystem is
doing a jumble of revalidation and new lookups.

Trond

-



From: Rob Ross
Date: Tuesday, December 5, 2006 - 2:50 pm

So other than this "lite" version of the readdirplus() call, and this 
idea of making the flags indicate validity rather than accuracy, are 
there other comments on the directory-related calls? I understand that 
they might or might not ever make it in, but assuming they did, what 
other changes would you like to see?

Thanks,

Rob
-



From: Christoph Hellwig
Date: Tuesday, December 5, 2006 - 3:05 pm

I'd like to Cc Ulrich Drepper in this thread because he's going to decide
what APIs will be exposed at the C library level in the end, and he also
has quite a lot of experience with the various standardization bodies.

Ulrich, this in reply to these API proposals:

	http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
	http://www.opengroup.org/platform/hecewg/uploads/40/10898/POSIX-stat-manpages.pdf


statlite needs to separate the flag for valid fields from the actual
stat structure and reuse the existing stat(64) structure.  stat lite
needs to at least get a better name, even better be folded into *statat*,
either by having a new AT_VALID_MASK flag that enables a new
unsigned int valid argument or by folding the valid flags into the AT_
flags.  It should also drop the notation of required vs optional field.
If a filesystem always always has certain values at hand it can just
fill them even if they weren't requested.

I think having a stat lite variant is pretty much consensus, we just need
to fine tune the actual API - and of course get a reference implementation.
So if you want to get this going try to implement it based on
http://marc.theaimsgroup.com/?l=linux-fsdevel&m=115487991724607&w=2.
Bonus points for actually making use of the flags in some filesystems.

Readdir plus is a little more involved.  For one thing the actual kernel
implementation will be a variant of getdents() call anyway while a
readdirplus would only be a library level interface.  At the actual
C prototype level I would rename d_stat_err to d_stat_errno for consistency
and maybe drop the readdirplus() entry point in favour of readdirplus_r
only - there is no point in introducing new non-reenetrant APIs today.

Also we should not try to put in any of the synchronization or non-caching
behaviours mentioned earlier in this thread (they're fortunately not in
the pdf mentioned above either).  If we ever want to implement these kinds
of additional gurantees (which I doubt) that ...
From: Sage Weil
Date: Tuesday, December 5, 2006 - 4:18 pm

Can you explain what the struct stat result portion of readdirplus() 
should mean in this case?

My suggestion was that its consistency follow that of the directory entry 
(i.e. mimic the readdir() specification), which (as far as the POSIX 
description goes) means it is at least as recent as opendir().  That model 
seems to work pretty well for readdir() on both local and network 
filesystems, as it allows buffering and so forth.  This is evident from 
the fact that it's semantics haven't been relaxed by NFS et al (AFAIK).

Alternatively, one might specify that the result be valid at the time of 
the readdirplus() call, but I think everyone agrees that is unnecessary, 
and more importantly, semantically indistinguishable from a 
readdir()+stat().

The only other option I've heard seems to be that the validity of stat() 
not be specified at all.  This strikes me as utterly pointless--why create 
a call whose result has no definition.  It's also semantically 
indistinguishable from a readdir()+statlite(null mask).

The fact that NFS and maybe others returned cached results for stat() 
doesn't seem relevant to how the call is _defined_.  If the definition of 
stat() followed NFS, then it might read something like "returns 
information about a file that was accurate at some point in the last 30 
seconds or so."  On the other hand, if readdirplus()'s stat consistency is 
defined the same way as the dirent, NFS et al are still free to ignore 
that specification and return cached results, as they already do for 
stat().  (A 'lite' version of readdirplus() might even let users pick and 
choose, should the fs support both behaviors, just like statlite().)  I 
don't really care what NFS does, but if readdirplus() is going to be
specified at all, it should be defined in a way that makes sense and has 
some added semantic value.

Also, one note about the fadvise() suggestion.  I think there's a key 
distinction between what fadvise() currently does (provide hints to the 
filesystem ...
From: Ulrich Drepper
Date: Tuesday, December 5, 2006 - 4:55 pm

I know the documents.  The HECWG was actually supposed to submit an=20
actual draft to the OpenGroup-internal working group but I haven't seen=
=20

I don't think an accuracy flag is useful at all.  Programs don't want t=
o=20
use fuzzy information.  If you want a fast 'ls -l' then add a mode whic=
h=20
doesn't print the fields which are not provided.  Don't provide outdate=
d=20

Yes, this is also my pet peeve with this interface.  I don't want to=20
have another data structure.  Especially since programs might want to=20
store the value in places where normal stat results are returned.

And also yes on 'statat'.  I strongly suggest to define only a statat=20
variant.  In the standards group I'll vehemently oppose the introductio=
n=20
of yet another superfluous non-*at interface.

As for reusing the existing statat interface and magically add another=20
parameter through ellipsis: no.  We need to become more type-safe.  The=
=20
userlevel interface needs to be a new one.  For the system call there i=
s=20
no such restriction.  We can indeed extend the existing syscall.  We=20
have appropriate checks for the validity of the flags parameter in plac=
e=20
=2E

I don't like that approach.  The flag parameter should be exclusively a=
n=20
output parameter.  By default the kernel should fill in all the fields=20
it has access to.  If access is not easily possible then set the bit an=
d=20
clear the field.  There are of course certain fields which always shoul=
d=20
be added.  In the proposed man page these are already identified (i.e.,=
=20
=2E

No, readdirplus should be kept (and yes, readdirplus_r must be added).=20
The reason is that the readdir_r interface is only needed if multiple=20
threads use the _same_ DIR stream.  This is hardly ever the case.=20
=46orcing everybody to use the _r variant means that we unconditionally=
=20
have to copy the data in the user-provided buffer.  With readdir there=20
is the possibility to just pass back a pointer into the ...
From: Andreas Dilger
Date: Wednesday, December 6, 2006 - 3:06 am

Does this mean you are against the statlite() API entirely, or only against
the document's use of the flag as a vague "accuracy" value instead of a

IMHO, if the application doesn't need a particular field (e.g. "ls -i"
doesn't need size, "ls -s" doesn't need the inode number, etc) why should
these be filled in if they are not easily accessible?  As for what is
easily accessible, that needs to be determined by the filesystem itself.

It is of course fine if the filesystem fills in values that it has at
hand, even if they are not requested, but it shouldn't have to do extra
work to fill in values that will not be needed.

"ls --color" and "ls -F" are prime examples.  It does stat on files only
to get the file mode (the file type is already part of many dirent structs).
But a clustered filesystem may need to do a lot of work to also get the

That was previously suggested by me already.  IMHO, there should ONLY be
a statlite variant of readdirplus(), and I think most people agree with
that part of it (though there is contention on whether readdirplus() is
needed at all).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Ulrich Drepper
Date: Wednesday, December 6, 2006 - 10:19 am

I'm against fuzzy values.  I've no problems with a bitmap specifying=20
that certain members are not wanted or wanted (probably the later, zero=
=20
lf.

Is the size not easily accessible?  It would surprise me.  If yes, then=
,=20
by all means add it to the list.  I'm not against extending the list of=
=20
members which are optional if it makes sense.  But certain information=20

Indeed.  Given there is statlite and we have d_type information, in mos=
t=20
situations we won't need more complete stat information.  Outside of=20
programs like ls that is.


Part of why I wished the lab guys had submitted the draft to the=20
OpenGroup first is that this way they would have to be more detailed on=
=20
why each and every interface they propose for adding is really needed.=20
Maybe they can do it now and here.  What programs really require=20
readdirplus?

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 10:27 am

File size is definitely one of the more difficult of the parameters, 
either because (a) it isn't stored in one place but is instead derived, 
or (b) because a lock has to be obtained to guarantee consistency of the 

I can't speak for everyone, but "ls" is the #1 consumer as far as I am 
concerned.

Regards,

Rob

-



From: Ulrich Drepper
Date: Wednesday, December 6, 2006 - 10:42 am

OK, and looking at the man page again, it is already on the list in the=
=20

So a syscall for ls alone?

I think this is more a user problem.  For normal plain old 'ls' you get=
=20
by with readdir.  For 'ls -F' and 'ls --color' you mostly get by with=20
readdir+d_type.  If you cannot provide d_type info the readdirplus=20
extension does you no good.  For the cases when an additional stat is=20
needed (for symlinks, for instance, to test whether they are dangling)=20
readdirplus won't help.

So, readdirplus is really only useful for 'ls -l'.  But then you need=20
st_size and st_?time.  So what is gained with readdirplus?

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Ragnar
Date: Wednesday, December 6, 2006 - 11:01 am

I guess the code needs to be checked, but I would think that:
* ls
* find
* rm -r
* chown -R
* chmod -R
* rsync
* various backup software
* imap servers
are all likely users of readdirplus. Of course the ones that spend the
majority of the time doing stat are the ones that would benefit more.



--=20
Ragnar Kj=F8rstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter
-



From: Ulrich Drepper
Date: Wednesday, December 6, 2006 - 11:13 am

Then somebody do the analysis.  And please an analysis which takes into=
=20
account that some programs might need to be adapted to take advantage o=
f=20
d_type or non-optional data from the proposed statlite.

Plus, how often are these commands really used on such filesystems?  I'=
d=20
hope that chown -R or so is a once in a lifetime thing on such=20
filesystems and not worth optimizing for.

I'd suggest until such data is provided the readdirplus plans are put o=
n=20
hold.  statlite I have no problems with if the semantics is changed as =
I=20
explained.

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Ragnar
Date: Sunday, December 17, 2006 - 7:41 am

This is by no means a full analysis, but maybe someone will find it
useful anyway. All performance tests are done with a directory tree wit=
h
the lkml archive in maildir format on a local ext3 filesystem. The
numbers are systemcall walltime, seen through strace.=20


I think Andreas already wrote that "ls --color" is the default in most
distributions and needs to stat every file.

ls --color -R kernel_old:
82.27% 176.37s  0.325ms  543332 lstat
17.61%  37.75s  5.860ms    6442 getdents64
 0.04%   0.09s  0.018ms    4997 write
 0.03%   0.06s 55.462ms       1 execve
 0.02%   0.04s  5.255ms       8 poll


"find" is already smart enough to not call stat when it's not needed,
and make use of d_type when it's available. But in many cases stat is
still needed (such as with -user)

find kernel_old -not -user 1002:
83.63% 173.11s  0.319ms  543338 lstat
16.31%  33.77s  5.242ms    6442 getdents64
 0.03%   0.06s 62.882ms       1 execve
 0.01%   0.03s  6.904ms       4 poll
 0.01%   0.02s  8.383ms       2 connect

rm was a false alarm. It only uses stat to check for directories, and
it's already beeing smart about it, not statting directories with
n_links=3D=3D2.


chown uses stat to:
* check for directories / symlinks / regular files
* Only change ownership on files with a specific existing ownership.
* Only change ownership if the requested owner does not match the
  current owner.=20
* Different output when ownership is actually changed from when it's
  not necessary (in verbose mode).
* Reset S_UID, S_GID options after setting ownership in some cases.
but it seems the most recent version will not use stat for every file
with typical options:

chown -R rk kernel_old:
93.30% 463.84s  0.854ms  543337 lchown
 6.67%  33.18s  5.151ms    6442 getdents64
 0.01%   0.04s  0.036ms    1224 brk
 0.00%   0.02s  5.830ms       4 poll
 0.00%   0.02s  0.526ms      38 open


chmod needs stat to do things like "u+w", but the current implementatio=
n
uses stat regardless of if it's ...
From: Ulrich Drepper
Date: Sunday, December 17, 2006 - 12:07 pm

And how often do the scripts which are in everyday use require such a=20
command?  And the same for the other programs.

I do not doubt that such a new syscall can potentially be useful.  The=20
question is whether it is worth it given _real_ situations on today's=20
systems.  And more so: on systems where combining the operations really=
=20
makes a difference.

Exposing new data structures is no small feat.  It's always risky since=
=20
something might require a change and then backward compatibility is an=20
issue.

Introducing new syscalls just because a combination of two existing one=
s=20
happens to be used in some programs is not scalable and not the=20
Unix-way.  Small building blocks.  Otherwise I'd have more proposals=20
which can be much more widely usable (e.g., syscall to read a file into=
=20
a freshly mmaped area).  Nobody wants to go that route since it would=20
lead to creeping featurism.  So it is up to the proponents of=20
readdirplus to show this is not such a situation.

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Matthew Wilcox
Date: Sunday, December 17, 2006 - 12:38 pm

I know that the rsync load is a major factor on kernel.org right now.
With all the git trees (particularly the ones that people haven't packed
recently), there's a lot of files in a lot of directories.  If
readdirplus would help this situation, it would definitely have a real
world benefit.  Obviously, I haven't done any measurements or attempted
to quantify what the improvement would be.

For those not familiar with a git repo, it has an 'objects' directory with
256 directories named 00 to ff.  Each of those directories can contain
many files (with names like '8cd5bbfb4763322837cd1f7c621f02ebe22fef') Once
a file is written, it is never modified, so all rsync needs to do is be
able to compare the timestamps and sizes and notice they haven't changed.

-



From: Ulrich Drepper
Date: Sunday, December 17, 2006 - 2:51 pm

That should be quite easy to quantify then.  Move the readdir and stat=20
call next to each other in the sources, pass the struct stat around if=20
necessary, and then count the stat calls which do not originate from th=
e=20
stat following the readdir call.  Of course we'll also need the actual=20
improvement which can be achieved by combining the calls.  Given the=20
inodes are cached, is there more overhead then finding the right inode?=
=20
  Note that is rsync doesn't already use fstatat() it should do so and=20
this means then that there is no long file path to follow, all file=20
names are local to the directory opened with opendir().

My but feeling is that the improvements are minimal for normal (not=20
cluster etc) filesystems and hence the improvements for kernel.org woul=
d=20
be minimal.

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Ragnar
Date: Sunday, December 17, 2006 - 7:57 pm

I don't think the overhead of finding the right inode or the system
calls themselves makes a difference at all. E.g. the rsync numbers I
listed spend more than 0.3ms per stat syscall. That kind of time is not
spent in looking up kernel datastructures - it's spent doing IO.

That part that I think is important (and please correct me if I've
gotten it all wrong) is to do the IO in parallel. This applies both to
local filesystems and clustered filesystems, allthough it would probabl=
y
be much more significant for clustered filesystems since they would
typically have longer latency for each roundtrip.  Today there is no go=
od=20
way for an application to stat many files in parallel. You could do it
through threading, but with significant overhead and complexity.

I'm curious what results one would get by comparing performance of:
* application doing readdir and then stat on every single file
* application doing readdirplus
* application doing readdir and then stat on every file using a lot of
  threads or an asyncronous stat interface

As far as parallel IO goes, I would think that async stat would be
nearly as fast as readdirplus?
=46or the clustered filesystem case there may be locking issues that ma=
kes
readdirplus faster?


--=20
Ragnar Kj=F8rstad
Software Engineer
Scali - http://www.scali.com
Scaling the Linux Datacenter
-



From: Gary Grider
Date: Sunday, December 17, 2006 - 8:54 pm

We have done something similar to what you suggest.
We wrote a parallel file tree walker to run on=20
clustered file systems that spread the file systems
metadata out over multiple disks.  The program=20
parallelizes the stat operations across multiple
nodes (via MPI). We needed to walk a tree with=20
about a hundred million files in a reasonable amount of time.
We cut the time from dozens of hours to less than=20
an hour.  We were able to keep all the metadata
raids/disks much busier doing the work for the=20
stat operations.  We have used this on two
different clustered file systems with similar=20
results. In both cases, it scaled with the number
of disks the metadata was spread over, not quite=20
linearly but it was a huge win for these two
file systems.



-



From: Andreas Dilger
Date: Wednesday, December 6, 2006 - 10:57 pm

IMHO, once part of the information is optional, why bother making ANY
of it required?  Consider "ls -s" on a distributed filesystem that has
UID+GID mapping.  It doesn't actually NEED to return the UID+GID to ls
for each file, since it won't be shown, but if that is part of the 
"required" fields then the filesystem would have to remap each UID+GID
on each file in the directory.  Similar arguments can be made for "find"
with various options (-atime, -mtime, etc) where any one of the "required"
parameters isn't needed.

I don't think it is _harmful_ to fill in unrequested values if they are
readily available (it might in fact avoid a lot of conditional branches)

That is my opinion also.  Lustre can do incredibly fast IO, but it isn't
very good at "ls" at all because it has to do way more work than you

I used to think this also, but even though Lustre supplies d_type info
GNU ls will still do stat operations because "ls --color" depends on
st_mode in order to color executable files differently.  Since virtually
all distros alias ls to "ls --color" this is pretty much default behaviour.
Another popular alias is "ls -F" which also uses st_mode for executables.

open(".", O_RDONLY|O_NONBLOCK|O_LARGEFILE|O_DIRECTORY) = 3
fstat64(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents64(3, /* 53 entries */, 4096)   = 1840
lstat64("ChangeLog", {st_mode=S_IFREG|0660, st_size=48, ...}) = 0
lstat64("install-sh", {st_mode=S_IFREG|0755, st_size=7122, ...}) = 0
lstat64("config.sub", {st_mode=S_IFREG|0755, st_size=30221, ...}) = 0
lstat64("autogen.sh", {st_mode=S_IFREG|0660, st_size=41, ...}) = 0
lstat64("config.h", {st_mode=S_IFREG|0664, st_size=7177, ...}) = 0

Similarly, GNU rm will stat all of the files (when run as a regular user)
to ask the "rm: remove write-protected regular file `foo.orig'?" question,
which also depends on st_mode.


Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Ulrich Drepper
Date: Friday, December 15, 2006 - 3:37 pm

The kernel at least has to clear the fields in the stat structure in an=
y=20
case.  So, if information is easily available, why add another 'if' in=20
the case if the real information can be filled in just as easily?

I don't know the kernel code but I would sincerely look at this case.=20

Right, and only executables.

You can easily leave out the :ex=3D*** part of LS_COLORS.

I don't think it's useful to introduce a new system call just to have=20
this support.

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Andreas Dilger
Date: Saturday, December 16, 2006 - 11:13 am

The kernel doesn't necessarily have to clear the fields.  The per-field
valid flag would determine is that field had valid data or garbage.

That said, there is no harm in the kernel/fs filling in additional fields
is they are readily available.  It would be up to the caller to NOT ask
for fields that it doesn't need, as that _might_ cause additional work
on the part of the filesystem.  If the kernel returns extra valid fields

Tell that to every distro maintainer, and/or try to convince the upstream

It isn't just to fix the ls --color problem.  There are lots of other
apps that need some stat fields and not others.  Also, implementing
the compatibility support for this (statlite->stat(), flags=$all_valid)
is trivial, if potentially less performant (though no worse than today).

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Ulrich Drepper
Date: Saturday, December 16, 2006 - 12:08 pm

You cannot leak kernel memory content.  Either you clear the field or,=20
in the code which actually copies the data to userlevel, you copy again=
=20
field by field.   The latter is far too slow.  So you better clear all=20


Name them.  I've asked for it before and got the answer "it's mainly=20
ls".  Now ls is debunked.  So, provide more evidence that the=20
y).

We're not talking about statlite.  The ls case is about getdirentplus.=20
I fail to see evidence that it is really needed.

--=20
=E2=9E=A7 Ulrich Drepper =E2=9E=A7 Red Hat, Inc. =E2=9E=A7 444 Castro S=
t =E2=9E=A7 Mountain View, CA =E2=9D=96
-



From: Rob Ross
Subject: statlite()
Date: Thursday, December 14, 2006 - 4:58 pm

We're going to clean the statlite() call up based on this (and 
subsequent) discussion and post again.

Thanks!

Rob

-



From: Nikita Danilov
Date: Thursday, December 7, 2006 - 4:39 pm

Christoph Hellwig writes:
 > I'd like to Cc Ulrich Drepper in this thread because he's going to decide
 > what APIs will be exposed at the C library level in the end, and he also
 > has quite a lot of experience with the various standardization bodies.
 > 
 > Ulrich, this in reply to these API proposals:
 > 
 > 	http://www.opengroup.org/platform/hecewg/uploads/40/10903/posix_io_readdir+.pdf
 > 	http://www.opengroup.org/platform/hecewg/uploads/40/10898/POSIX-stat-manpages.pdf

What readdirplus() is supposed to return in ->d_stat field for a name
"foo" in directory "bar" when "bar/foo" is a mount-point? Note that in
the case of distributed file system, server has no idea about client
mount-points, which implies some form of local post-processing.

Nikita.

-



From: Peter Staubach
Date: Tuesday, December 5, 2006 - 7:37 am

I don't think that anyone has shown a *need* for this sort of call yet,
actually.  What application would actually benefit from this call and
where are the measurements?  Simply asserting that "ls -l" will benefit
is not enough without some measurements.  Or mention a different real
world application...

Having developed and prototyped the NFSv3 READDIRPLUS, I can tell you
that the wins were less than expected/hoped for and while it wasn't
all that hard to implement in a simple way, doing so in a high performance
fashion is much harder.  Many implementations that I have heard about
turn off READDIRPLUS when dealing with a large directory.

Caching is what makes things fast and caching means avoiding going
over the network.

       ps
-



From: Andreas Dilger
Date: Tuesday, December 5, 2006 - 3:26 am

I think the "barrier semantics" are something that have just crept
into this discussion and is confusing the issue.

The primary goal (IMHO) of this syscall is to allow the filesystem
(primarily distributed cluster filesystems, but HFS and NTFS developers
seem on board with this too) to avoid tens to thousands of stat RPCs in
very common ls -R, find, etc. kind of operations.

I can't see how fadvise() could help this case?  Yes, it would tell the
filesystem that it could do readahead of the readdir() data, but the
app will still be doing stat() on each of the thousands of files in the
directory, instantiating inodes and dentries on that node (which need
locking, and potentially immediate lock revocation if the files are
being written to by other nodes).  In some cases (e.g. rm -r, grep -r)
that might even be a win, because the client will soon be touching all
of those files, but not necessarily in the ls -lR, find cases.

The filesystem can't always do "stat-ahead" on the files because that
requires instantiating an inode on the client which may be stale (lock
revoked) by the time the app gets to it, and the app (and the VFS)  have
no idea just how stale it is, and whether the stat is a "real" stat or
"only" the readdir stat (because the fadvise would only be useful on
the directory, and not all of the child entries), so it would need to
re-stat the file.  Also, this would potentially blow the client's real
working set of inodes out of cache.

Doing things en-masse with readdirplus() also allows the filesystem to
do the stat() operations in parallel internally (which is a net win if
there are many servers involved) instead of serially as the application
would do.

Cheers, Andreas

PS - I changed the topic to separate this from the openfh() thread.
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Trond Myklebust
Date: Tuesday, December 5, 2006 - 8:23 am

It is the _only_ concept that is of interest for something like NFS or

'find' should be quite happy with the existing readdir(). It does not
need to use stat() or readdirplus() in order to recurse because
readdir() provides d_type.

The locking problem is only of interest to clustered filesystems. On
local filesystems such as HFS, NTFS, and on networked filesystems like
NFS or CIFS, the only lock that matters is the parent directory's
inode->i_sem, which is held by readdir() anyway.

If the application is able to select a statlite()-type of behaviour with
the fadvise() hints, your filesystem could be told to serve up cached
information instead of regrabbing locks. In fact that is a much more
flexible scheme, since it also allows the filesystem to background the
actual inode lookups, or to defer them altogether if that is more

Then provide hints that allow the app to select which behaviour it
prefers. Most (all?) apps don't _care_, and so would be quite happy with
cached information. That is why the current NFS caching model exists in


If your application really cared, it could add threading to 'ls' to
achieve the same result. You can also have the filesystem preload that
information based on fadvise hints.

Trond

-



From: Andreas Dilger
Date: Wednesday, December 6, 2006 - 3:28 am

Actually, wouldn't the ability for readdirplus() (with valid flag) be
useful for NFS if only to indicate that it does not need to flush the

It does in any but the most simplistic invocations, like "find -mtime"

I guess I just don't understand how fadvise() on a directory file handle
(used for readdir()) can be used to affect later stat operations (which
definitely will NOT be using that file handle)?  If you mean that the
application should actually open() each file, fadvise(), fstat(), close(),
instead of just a stat() call then we are WAY into negative improvements

Most clustered filesystems have strong cache semantics, so that isn't
a problem.  IMHO, the mechanism to pass the hint to the filesystem IS
the readdirplus_lite() that tells the filesystem exactly which data is

Because in many cases it is desirable to limit the number of DLM locks
on a given client (e.g. GFS2 thread with AKPM about clients with
millions of DLM locks due to lack of memory pressure on large mem systems).
That means a finite-size lock LRU on the client that risks being wiped
out by a few thousand files in a directory doing "readdir() + 5000*stat()".


Consider a system like BlueGene/L with 128k compute cores.  Jobs that
run on that system will periodically (e.g. every hour) create up to 128K
checkpoint+restart files to avoid losing a lot of computation if a node
crashes.  Even if each one of the checkpoints is in a separate directory
(I wish all users were so nice :-) it means 128K inodes+DLM locks for doing

But it would still need 128K RPCs to get that information, and 128K new
inodes on that client.  And what is the chance that I can get a
multi-threading "ls" into the upstream GNU ls code?  In the case of local
filesystems multi-threading ls would be a net loss due to seeking.

But even for local filesystems readdirplus_lite() would allow them to
fill in stat information they already have (either in cache or on disk),
and may avoid doing extra work that isn't needed.  For filesystems ...
From: Trond Myklebust
Date: Wednesday, December 6, 2006 - 8:10 am

That is why statlite() might be useful. I'd prefer something more

The only 'win' a readdirplus may give you there as far as NFS is
concerned is the sysenter overhead that you would have for calling

On the contrary, the readdir descriptor is used in all those funky new
statat(), calls. Ditto for readlinkat(), faccessat().

You could even have openat() turn off the close-to-open GETATTR if the
readdir descriptor contained a hint that told it that was unnecessary.

Furthermore, since the fadvise-like caching operation works on
filehandles, you could have it work both on readdir() for the benefit of
the above *at() calls, and also on the regular file descriptor for the

That is precisely the sort of situation where knowing when you can
cache, and when you cannot would be a plus. An ls call may not need 128k
dlm locks, because it only cares about the state of the inodes as they

NFS doesn't 'cos it implements readdirplus under the covers as far as

The thing to note, though, is that in the NFS implementation we are
_very_ careful about use the GETATTR information it returns if there is
already an inode instantiated for that dentry. This is precisely because
we don't want to deal with the issue of synchronisation w.r.t. an inode
that may be under writeout, that may be the subject of setattr() calls,
etc. As far as we're concerned, READDIRPLUS is a form of mass LOOKUP,
not a mass inode revalidation

Trond

-



From: Latchesar Ionkov
Date: Tuesday, December 5, 2006 - 10:06 am

I don't think that ls -R and find are that common cases that they need
introduction of new operations in order to made them faster. On the
other hand may be they are often being used to do microbenchmarks. If
you goal is to make these filesystems look faster on microbenchmarks,
then probably you have the right solution. For normal use, especially
on clusters, I don't see any advantage of doing that.

Thanks,
    Lucho
-



From: Rob Ross
Date: Tuesday, December 5, 2006 - 3:48 pm

Hi Lucho,

Andreas is right on mark. The problem here is that when one user kicks 
off an ls -l or ls -R on a cluster file system *while other users are 
trying to get work done*, all those stat RPCs and lock reclamations can 
kill performance.

We're not interested in a "ls -lR" top 500, we're interested in making 
systems more usable, more tolerant to everyday user behaviors.

Regards,

Rob

-



From: Steven Whitehouse
Date: Wednesday, November 29, 2006 - 3:25 am

Hi,


I agree that this is a good plan, but I'd been looking at this idea from
a different direction recently. The in kernel NFS server calls
vfs_getattr from its filldir routine for readdirplus and this means not
only are we unable to optimise performance by (for example) sorting
groups of getattr calls so that we read the inodes in disk block order,
but also that its effectively enforcing a locking order of the inodes on
us too. Since we can have async locking in GFS2, we should be able to do
"lockahead" with readdirplus too.

I had been considering proposing a readdirplus export operation, but
since this thread has come up, perhaps a file operation would be
preferable as it could solve two problems with one operation?

Steve.


-



From: Christoph Hellwig
Date: Thursday, November 30, 2006 - 5:29 am

Doing this as an export operation is wrong.  Even if it's only used
for nfsd for now the logical level this should be on are the file operations.
If you do it you could probably prototype a syscall for it aswell - once
we have the infrastructure the syscall should be no more than about 20
lines of code.
-



From: Ric Wheeler
Date: Friday, December 1, 2006 - 8:52 am

I think that this kind of heuristic would be a win for local file systems with a 
huge number of files as well...

ric
-



From: Matthew Wilcox
Date: Wednesday, November 29, 2006 - 5:23 am

Yes, but it behaves like dup().  Gary replied to me off-list (which I
didn't notice and continued replying to him off-list).  I wrote:

Is this for people who don't know about dup(), or do they need
independent file offsets?  If the latter, I think an xdup() would be
preferable (would there be a security issue for OSes with revoke()?)
Either that, or make the key be useful for something else.

-



From: Matthew Wilcox
Date: Wednesday, November 29, 2006 - 5:35 am

I further wonder if these people would see appreciable gains from doing
sutoc rather than doing openat(dirfd, "basename", flags, mode);

If they did, I could also see openat being extended to allow dirfd to
be a file fd, as long as pathname were NULL or a pointer to NUL.

But with all the readx stuff being proposed, I bet they don't really
need independent file offsets.  That's, like, so *1970*s.
-



From: Gary Grider
Date: Wednesday, November 29, 2006 - 9:26 am

There is a business case at the Open Group Web site.  It is not a full use case
document though.

For a very tiny amount of background.
It seems from the discussion that others (at least those working in 
clustered file systems)
have seen the need for a statlite and readdir+
type function, what ever they might be called or how ever they might 
be implemented.
As for openg, the gains have been seen in clustered file systems 
where you have 10s of thousands
of processes spread out over thousands of machines.  All 100k 
processes may open the same file
and offset different amounts, sometimes strided sometimes not strided 
through the file.  The opens
all fire within a few milliseconds or less.  This is a problem for 
large clustered file systems,
open times have been seen in the minutes or worse.  The writes all 
come at once as well quite often.
Often they are complicated scatter gather operations spread out 
across the entire distributed
memory of thousands of machines, not even in a completely uniform manner.
A little knowledge about the intent of the application
goes a long way when you are dealing with 100k 
parallelism.   Additionally, having some notion
of groups of processes collaborating at the file system level is 
useful for trying to make informed
decisions about determinism and quality of service you might want to 
provide, how strictly
you want to enforce rules on collaborating processes, etc.

As for NFS acl's.
This was going to be a separate extension volume, not associated with 
the performance
portion.  It comes up because many of the users of high end/clustered 
file system technology
are also in often secure environments and have need to know 
issues.  We were trying to be
helpful to the NFSv4 community which has been kind enough to have 
these security features
in their product.

Additionally, this entire effort is being proposed as an extension, 
not as a change to the base POSIX I/O API.

We certainly have no religion about how we make ...
From: Christoph Hellwig
Date: Wednesday, November 29, 2006 - 10:18 am

Please don't repeat the stupid marketroid speach.  If you want this
to go anywhere please get someone with an actual clue to talk to us
instead of you.  Thanks a lot.

-



From: Christoph Hellwig
Date: Wednesday, November 29, 2006 - 5:39 am

Not sharing the file offset means we need a separate file struct, at
which point the only thing saved is doing a lookup at the time of
opening the file.  While a full pathname traversal can be quite costly
an open is not something you do all that often anyway.  And if you really
need to open/close files very often you can speed it up nicely by keeping
a file descriptor on the parent directory open and use openat().

Anyway, enough of talking here.  We really need a very good description
of the use case people want this for, and the specific performance problems
they see to find a solution.  And the solution definitly does not involve
as second half-assed file handle time with unspecified lifetime rules :-)
-



From: Rob Ross
Date: Friday, December 1, 2006 - 3:29 pm

Hi all,

The use model for openg() and openfh() (renamed sutoc()) is n processes 
spread across a large cluster simultaneously opening a file. The 
challenge is to avoid to the greatest extent possible incurring O(n) FS 
interactions. To do that we need to allow actions of one process to be 
reused by other processes on other OS instances.

The openg() call allows one process to perform name resolution, which is 
often the most expensive part of this use model. Because permission 
checking is also performed as part of the openg(), some file systems to 
not require additional communication between OS and FS at openfh(). 
External communication channels are used to pass the handle resulting 
from the openg() call out to processes on other nodes (e.g. MPI_Bcast).

dup(), openat(), and UNIX sockets are not viable options in this model, 
because there are many OS instances, not just one.

All the calls that are being discussed as part of the HEC extensions are 
being discussed in this context of multiple OS instances and cluster 
file systems.

Regarding the lifetime of the handle, there has been quite a bit of 
discussion about this. I believe that we most recently were thinking 
that there was an undefined lifetime for this, allowing servers to 
"forget" these values (as in the case where a server is restarted). 
Clients would need to perform the openg() again if they were to try to 
use an outdated handle, or simply fall back to a regular open(). This is 
not a problem in our use model.

I've attached a graph showing the time to use individual open() calls 
vs. the openg()/MPI_Bcast()/openfh() combination; it's a clear win for 
any significant number of processes. These results are from our 
colleagues at Sandia (Ruth Klundt et. al.) with PVFS underneath, but I 
expect the trend to be similar for many cluster file systems.

Regarding trying to "force APIs using standardization" on you 
(Christoph's 11/29/2006 message), you've got us all wrong. The 
standardization process ...
From: Latchesar Ionkov
Date: Friday, December 1, 2006 - 7:35 pm

Hi,

One general remark: I don't think it is feasible to add new system
calls every time somebody has a problem. Usually there are (may be not
that good) solutions that don't require big changes and work well
enough. "Let's change the interface and make the life of many
filesystem developers miserable, because they have to worry about
3-4-5 more operations" is not the easiest solution in the long run.


If the name resolution is the most expensive part, why not implement
just the name lookup part and call it "lookup" instead of "openg". Or
even better, make NFS to resolve multiple names with a single request.
If the NFS server caches the last few name lookups, the responses from
the other nodes will be fast, and you will get your file descriptor
with two instead of the proposed one request. The performance could be
just good enough without introducing any new functions and file
handles.

Thanks,
    Lucho
-



From: Rob Ross
Date: Monday, December 4, 2006 - 5:37 pm

Hi,

I agree that it is not feasible to add new system calls every time 
somebody has a problem, and we don't take adding system calls lightly. 
However, in this case we're talking about an entire *community* of 
people (high-end computing), not just one or two people. Of course it 
may still be the case that that community is not important enough to 
justify the addition of system calls; that's obviously not my call to make!

I'm sure that you meant more than just to rename openg() to lookup(), 
but I don't understand what you are proposing. We still need a second 
call to take the results of the lookup (by whatever name) and convert 
that into a file descriptor. That's all the openfh() (previously named 
sutoc()) is for.

I think the subject line might be a little misleading; we're not just 
talking about NFS here. There are a number of different file systems 
that might benefit from these enhancements (e.g. GPFS, Lustre, PVFS, 
PanFS, etc.).

Finally, your comment on making filesystem developers miserable is sort 
of a point of philosophical debate for me. I personally find myself 
miserable trying to extract performance given the very small amount of 
information passing through the existing POSIX calls. The additional 
information passing through these new calls will make it much easier to 
obtain performance without correctly guessing what the user might 
actually be up to. While they do mean more work in the short term, they 
should also mean a more straight-forward path to performance for 
cluster/parallel file systems.

Thanks for the input. Does this help explain why we don't think we can 
just work under the existing calls?

Rob


-



From: Christoph Hellwig
Date: Tuesday, December 5, 2006 - 3:02 am

Any support for advance filesystem semantics will definitly not be
available to propritary filesystems like GPFS that violate our copyrights
blatantly.

-



From: Latchesar Ionkov
Date: Tuesday, December 5, 2006 - 9:47 am

I have the feeling that openg stuff is rushed without looking into all
solutions, that don't require changes to the current interface. I
don't see any numbers showing where exactly the time is spent? Is
opening too slow because of the number of requests that the file
server suddently has to respond to? Does having an operation that
looks up multiple names instead of a single name good enough? How much

The idea is that lookup doesn't open the file, just does to name
resolution. The actual opening is done by openfh (or whatever you call
it next :). I don't think it is a good idea to introduce another way
of addressing files on the file system at all, but if you still decide
to do it, it makes more sense to separate the name resolution from the
operations (at the moment only open operation, but who knows what'll

I think that the main problem is that all these file systems resove a
path name, one directory at a time bringing the server to its knees by
the huge amount of requests. I would like to see what the performance
is if you a) cache the last few hundred lookups on the server side,
and b) modify VFS and the file systems to support multi-name lookups.
Just assume for a moment that there is no any way to get these new
operations in (which is probaly going to be true anyway :). What other
solutions can you think of? :)

Thanks,
    Lucho
-



From: Matthew Wilcox
Date: Tuesday, December 5, 2006 - 10:01 am

How exactly would you want a multi-name lookup to work?  Are you saying
that open("/usr/share/misc/pci.ids") should ask the server "Find usr, if
you find it, find share, if you find it, find misc, if you find it, find
pci.ids"?  That would be potentially very wasteful; consider mount
points, symlinks and other such effects on the namespace.  You could ask
the server to do a lot of work which you then discard ... and that's not
efficient.
-



From: Peter Staubach
Date: Tuesday, December 5, 2006 - 2:50 pm

It could be inefficient, as pointed out, but defined right, it could
greatly reduce the number of over the wire trips.

The client can already tell from its own namespace when a submount may
be encountered, so know not to utilize the multicomponent pathname
lookup facility.  The requirements could state that the server stops
when it encounters a non-directory/non-regular file node in the namespace.
This sort of thing...

       ps
-



From: Rob Ross
Date: Tuesday, December 5, 2006 - 2:44 pm

Thanks for looking at the graph.

To clarify the workload, we do not expect that application processes 
will be opening a large number of files all at once; that was just how 
the test was run to get a reasonable average value. So I don't think 
that something that looked up multiple file names would help for this case.

I unfortunately don't have data to show exactly where the time was 
spent, but it's a good guess that it is all the network traffic in the 

I really think that we're saying the same thing here?

I think of the open() call as doing two (maybe three) things. First, 
performs name resolution and permission checking. Second, creates the 
file descriptor that allows the user process to do subsequent I/O. 
Third, creates a context for access, if the FS keeps track of "open" 
files (not all do).

The openg() really just does the lookup and permission checking). The 
openfh() creates the file descriptor and starts that context if the 

Well you've caught me. I don't want to cache the values, because I 
fundamentally believe that sharing state between clients and servers is 
braindead (to use Christoph's phrase) in systems of this scale 
(thousands to tens of thousands of clients). So I don't want locks, so I 
can't keep the cache consistent, ... So someone else will have to run 
the tests you propose :)...

Also, to address Christoph's snipe while we're here; I don't care one 
way or another whether the Linux community wants to help GPFS or not. I 
do care that I'm arguing for something that is useful to more than just 
my own pet project, and that was the point that I was trying to make. 
I'll be sure not to mention GPFS again.

What's the etiquette on changing subject lines here? It might be useful 
to separate the openg() etc. discussion from the readdirplus() etc. 
discussion.

Thanks again for the comments,

Rob
-



From: Christoph Hellwig
Subject: openg
Date: Wednesday, December 6, 2006 - 4:01 am

Besides the whole ugliness you miss a few points about the fundamental
architecture of the unix filesystem permission model unfortunately.

Say you want to lookup a path /foo/bar/baz, then the access permission
is based on the following things:

 - the credentials of the user.  let's only take traditional uid/gid
   for this example although credentials are much more complex these
   days
 - the kind of operation you want to perform
 - the access permission of the actual object the path points to (inode)
 - the lookup permission (x bit) for every object on the way to you object

In your proposal sutoc is a simple conversion operation, that means
openg needs to perfom all these access checks and encodes them in the
fh_t.  That means an fh_t must fundamentally be an object that is kept
in the kernel aka a capability as defined by Henry Levy.  This does imply
you _do_ need to keep state.  And because it needs kernel support you
fh_t is more or less equivalent to a file descriptor with sutoc equivalent
to a dup variant that really duplicates the backing object instead of just
the userspace index into it.

Note somewhat similar open by filehandle APIs like oben by inode number
as used by lustre or the XFS *_by_handle APIs are privilegued operations
because of exactly this problem.

What according to your mail is the most important bit in this proposal is
that you thing the filehandles should be easily shared with other system
in a cluster.  That fact is not mentioned in the actual proposal at all,
and is in fact that hardest part because of inherent statefulness of

Changing subject lines is fine.

-



From: Trond Myklebust
Subject: Re: openg
Date: Wednesday, December 6, 2006 - 8:41 am

- your private namespace particularities (submounts etc)

Trond

-



From: Rob Ross
Subject: Re: openg
Date: Wednesday, December 6, 2006 - 8:42 am

The fh_t is indeed a type of capability. fh_t, properly protected, could 
be passed into user space and validated by the file system when 
presented back to the file system.

There is state here, clearly. I feel ok about that because we allow 
servers to forget that they handed out these fh_ts if they feel like it; 
there is no guaranteed lifetime in the current proposal. This allows 
servers to come and go without needing to persistently store these. 
Likewise, clients can forget them with no real penalty.

This approach is ok because of the use case. Because we expect the fh_t 
to be used relatively soon after its creation, servers will not need to 
hold onto these long before the openfh() is performed and we're back 
into a normal "everyone has an valid fd" use case.


Well, a FD has some additional state associated with it (position, 

I'm not sure what a properly protected fh_t couldn't be passed back into 
user space and handed around, but I'm not a security expert. What am I 

The documentation of the calls is complicated by the way POSIX calls are 
described. We need to have a second document describing use cases also 
available, so that we can avoid misunderstandings as best we can, get 
straight to the real issues. Sorry that document wasn't available.


Thanks.

Rob
-



From: Christoph Hellwig
Subject: Re: openg
Date: Wednesday, December 6, 2006 - 4:32 pm

Well, there's quite a lot of papers on how to implement properly secure
capabilities.  The only performant way to do it is to implement them
in kernel space or with hardware support.  As soon as you pass them
to userspace the user can manipulate them, and doing a cheap enough
verification is non-trivial (e.g. it doesn't buy you anything if you
spent the time you previously spent for lookup roundtrip latency

Objects without defined lifetime rules are not something we're very keen
on.  Particularly in userspace interface they will cause all kinds of
trouble because people will expect the lifetime rules they get from their

The real problem is that you want to do something in a POSIX spec that
is fundamentally out of scope.  POSIX .1 deals with system interfaces
on a single system.  You want to specify semantics over multiple systems
in a cluster.

-



From: Rob Ross
Subject: Re: openg
Date: Thursday, December 14, 2006 - 4:36 pm

I agree that if the cryptographic verification took longer than the N 
namespace traversals and permission checking that would occur in the 
other case, that this would be a silly proposal. honestly that didn't 
occur to me as even remotely possible, especially given that in most 
cases the server will be verifying the exact same handle lots of times, 
rather than needing to verify a large number of different handles 

I agree that not being able to clearly define the lifetime of the handle 
is suboptimal.

If the handle is a capability, then its lifetime would be bounded only 
by potential revocations of the capability, the same way an open FD 
might then suddenly cease to be valid. On the other hand, in Andreas' 
"open file handle" implementation the handle might have a shorter lifetime.

We're attempting to allow for the underlying FS to implement this in the 
most natural way for that file system. Those mechanisms lead to 
different lifetimes.

This would bother me quite a bit *if* it complicated the use model, but 
it really doesn't, particularly because less savvy users are likely to 

I agree; the real problem is that POSIX .1 is being used to specify 
semantics over multiple systems in a cluster. But we're stuck with that.

Thanks,

Rob
-



From: Latchesar Ionkov
Date: Wednesday, December 6, 2006 - 4:25 pm

Is it hard to repeat the test and check what requests (and how much

Having file handles in the server looks like a cache to me :) What are
the properties of a cache that it lacks?

Thanks,
    Lucho
-



From: David Chinner
Date: Wednesday, December 6, 2006 - 2:48 am

I also get the feeling that interfaces that already do this
open-by-handle stuff haven't been explored either.

Does anyone here know about the XFS libhandle API? This has been
around for years and it does _exactly_ what these proposed syscalls
are supposed to do (and more).

See:

http://techpubs.sgi.com/library/tpl/cgi-bin/getdoc.cgi?coll=linux&db=man&fname...

For the libhandle man page. Basically:

openg == path_to_handle
sutoc == open_by_handle

And here for the userspace code:

http://oss.sgi.com/cgi-bin/cvsweb.cgi/xfs-cmds/xfsprogs/libhandle/

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 8:53 am

Thanks for pointing these out Dave. These are indeed along the same 
lines as the openg()/openfh() approach.

One difference is that they appear to perform permission checking on the 
open_by_handle(), which means that the entire path needs to be encoded 
in the handle, and makes it difficult to eliminate the path traversal 
overhead on N open_by_handle() operations.

Regards,

Rob
-



From: Matthew Wilcox
Date: Wednesday, December 6, 2006 - 9:04 am

Another (and highly important) difference is that usage is restricted to
root:

xfs_open_by_handle(...)
...
        if (!capable(CAP_SYS_ADMIN))
		return -XFS_ERROR(EPERM);

-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 9:20 am

I assume that this is because the implementation chose not to do the 
path encoding in the handle? Because if they did, they could do full 
path permission checking as part of the open_by_handle.

Rob
-



From: David Chinner
Date: Wednesday, December 6, 2006 - 1:57 pm

The original use of this interface (if I understand the Irix history
correctly - this is way before my time at SGI) was a userspace NFS
server and so permission checks were done after the filehandle was
opened and a stat could be done on the fd and mode/uid/gid could be
compared to what was in the NFS request. Paths were never needed for
this because everything needed could be obtained directly from
the inode.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-



From: David Chinner
Date: Wednesday, December 6, 2006 - 1:40 pm

open_by_handle() is checking the inode flags for things like
immutibility and whether the inode is writable to determine if the
open mode is valid given these flags. It's not actually checking
permissions. IOWs, open_by_handle() has the same overhead as NFS
filehandle to inode translation; i.e. no path traversal on open.

Permission checks are done on the path_to_handle(), so in reality
only root or CAP_SYS_ADMIN users can currently use the
open_by_handle interface because of this lack of checking. Given
that our current users of this interface need root permissions to do
other things (data migration), this has never been an issue.

This is an implementation detail - it is possible that file handle,
being opaque, could encode a UID/GID of the user that constructed
the handle and then allow any process with the same UID/GID to use
open_by_handle() on that handle. (I think hch has already pointed
this out.)

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-



From: Matthew Wilcox
Date: Wednesday, December 6, 2006 - 1:50 pm

While it could do that, I'd be interested to see how you'd construct
the handle such that it's immune to a malicious user tampering with it,
or saving it across a reboot, or constructing one from scratch.

I suspect any real answer to this would have to involve cryptographical
techniques (say, creating a secure hash of the information plus a
boot-time generated nonce).  Now you're starting to use a lot of bits,
and compute time, and you'll need to be sure to keep the nonce secret.
-



From: David Chinner
Date: Wednesday, December 6, 2006 - 2:09 pm

An auth header and GSS-API integration would probably be the way
to go here if you really care.

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-



From: Andreas Dilger
Date: Wednesday, December 6, 2006 - 3:09 pm

If the server has to have processed a real "open" request, say within
the preceding 30s, then it would have a handle for openfh() to match
against.  If the server reboots, or a client tries to construct a new
handle from scratch, or even tries to use the handle after the file is
closed then the handle would be invalid.

It isn't just an encoding for "open-by-inum", but rather a handle that
references some just-created open file handle on the server.  That the
handle might contain the UID/GID is mostly irrelevant - either the
process + network is trusted to pass the handle around without snooping,
or a malicious client which intercepts the handle can spoof the UID/GID
just as easily.  Make the handle sufficiently large to avoid guessing
and it is "secure enough" until the whole filesystem is using kerberos
to avoid any number of other client/user spoofing attacks.

Considering that filesystems like GFS and OCFS allow clients DIRECT
ACCESS to the block device itself (which no amount of authentication
will fix, unless it is in the disks themselves), the risk of passing a
file handle around is pretty minimal.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Matthew Wilcox
Date: Wednesday, December 6, 2006 - 3:17 pm

That's either disingenuous, or missing the point.  OCFS/GFS allow the
kernel direct access to the block device.  openg()&sutoc() are about
passing around file handles to untrusted users.
-



From: Andreas Dilger
Date: Wednesday, December 6, 2006 - 3:41 pm

Consider - in order to intercept the file handle on the network one would
have to be root on a trusted client.  The same is true for direct block
access.

If the network isn't to be trusted or the clients aren't to be trusted,
then in the absence of strong external authentication like kerberos the
whole thing just falls down (i.e. root on any client can su to an arbitrary
UID/GID to access files to avoid root squash, or could intercept all of
the traffic on the network anyways).

With some network filesystems it is at least possible to get strong
authentication and crypto, but with shared block device filesystems like
OCFS/GFS/GPFS they completely rely on the fact that the network and all
of the clients attached thereon are secure.

If the server that did the original file open and generates the unique
per-open file handle can do basic sanity checking (i.e. user doing the
new open is the same, the file handle isn't stale) then that is no
additional security hole.

Similarly, NFS passes file handles to clients that are also used to get
access to the open file without traversing the whole path each time.
Those file handles are even (supposed to be) persistent over reboots.

Don't get me wrong - I understand that what I propose is not secure.
I'm just saying it is no LESS secure than a number of other things
which already exist.

Cheers, Andreas
--
Andreas Dilger
Principal Software Engineer
Cluster File Systems, Inc.

-



From: Christoph Hellwig
Date: Wednesday, December 6, 2006 - 4:39 pm

That would be fine as long as the file handle would be a kernel-level
concept.  The issue here is that they intent to make the whole filehandle
userspace visible, for example to pass it around via mpi.  As soon as
an untrused user can tamper with the file descriptor we're in trouble.

-



From: Rob Ross
Date: Thursday, December 14, 2006 - 3:52 pm

I guess it could reference some "just-created open file handle" on the 
server, if the server tracks that sort of thing. Or it could be a 
capability, as mentioned previously. So it isn't necessary to tie this 
to an open, but I think that would be a reasonable underlying 
implementation for a file system that tracks opens.

If clients can survive a server reboot without a remount, then even this 
implementation should continue to operate if a server were rebooted, 
because the open file context would be reconstructed. If capabilities 
were being employed, we could likewise survive a server reboot.

But this issue of server reboots isn't that critical -- the use case has 
the handle being reused relatively quickly after the initial openg(), 
and clients have a clean fallback in the event that the handle is no 
longer valid -- just use open().

Visibility of the handle to a user does not imply that the user can 
effectively tamper with the handle. A cryptographically secure one-way 
hash of the data, stored in the handle itself, would allow servers to 
verify that the handle wasn't tampered with, or that the client just 
made up a handle from scratch. The server managing the metadata for that 
file would not need to share its nonce with other servers, assuming that 
single servers are responsible for particular files.

Regards,

Rob
-



From: Rob Ross
Date: Wednesday, December 6, 2006 - 1:50 pm

Thanks for the clarification Dave. So I take it that you would be 
interested in this type of functionality then?

Regards,

Rob
-



From: David Chinner
Date: Wednesday, December 6, 2006 - 2:01 pm

Not really - just trying to help by pointing out something no-one
seemed to know about....

Cheers,

Dave.
-- 
Dave Chinner
Principal Engineer
SGI Australian Software Group
-



From: Latchesar Ionkov
Date: Wednesday, December 6, 2006 - 4:19 pm

The open-by-handle makes a little more sense, because the "handle" is
not opened, it only points to a resolved file. As I mentioned before,
it doesn't make much sense to bundle in openg name resolution and file
open.

Still I am not convinced that we need two ways of "finding" files.

Thanks,
    Lucho
-



From: Rob Ross
Date: Thursday, December 14, 2006 - 2:00 pm

I don't think that I understand what you're saying here. The openg() 
call does not perform file open (not that that is necessarily even a 
first-class FS operation), it simply does the lookup.

When we were naming these calls, from a POSIX consistency perspective it 
seemed best to keep the "open" nomenclature. That seems to be confusing 
to some. Perhaps we should rename the function "lookup" or something 
similar, to help keep from giving the wrong idea?

There is a difference between the openg() and path_to_handle() approach 
in that we do permission checking at openg(), and that does have 
implications on how the handle might be stored and such. That's being 
discussed in a separate thread.

Thanks,

Rob
-



From: Matthew Wilcox
Date: Thursday, December 14, 2006 - 2:20 pm

I was just thinking about how one might implement this, when it struck
me ... how much more efficient is a kernel implementation compared to:

int openg(const char *path)
{
	char *s;
	do {
		s = tempnam(FSROOT, ".sutoc");
		link(path, s);
	} while (errno == EEXIST);

	mpi_broadcast(s);
	sleep(10);
	unlink(s);
}

and sutoc() becomes simply open().  Now you have a name that's quick to
open (if a client has the filesystem mounted, it has a handle for the
root already), has a defined lifespan, has minimal permission checking,
and doesn't require standardisation.

I suppose some cluster fs' might not support cross-directory links
(AFS is one, I think), but then, no cluster fs's support openg/sutoc.
If a filesystem's willing to add support for these handles, it shouldn't
be too hard for them to treat files starting ".sutoc" specially, and as
efficiently as adding the openg/sutoc concept.
-



From: Rob Ross
Date: Thursday, December 14, 2006 - 4:02 pm

Adding atomic reference count updating on file metadata so that we can 
have cross-directory links is not necessarily easier than supporting 
openg/openfh, and supporting cross-directory links precludes certain 
metadata organizations, such as the ones being used in Ceph (as I 
understand it).

This also still forces all clients to read a directory and for N 
permission checking operations to be performed. I don't see what the FS 
could do to eliminate those operations given what you've described. Am I 
missing something?

Also this looks too much like sillyrename, and that's hard to swallow...

Regards,

Rob
-



From: Matthew Wilcox
Date: Tuesday, November 28, 2006 - 8:08 am

I don't understand how this leads to a more efficient implementation.

These don't seem to be documented on the website.

-



Previous thread: We Guuaranteees Bigger Pen-nis by Elena Waller on Sunday, November 26, 2006 - 9:59 am. (1 message)

Next thread: [PATCH -mm] gfs2 lock function parameter by