For our application, we have one process capturing images from a frame-grabber and dumping them to a file in a memory file system. It sends a UDP message to another process that reads the images from this file and writes them to disk in a database. The first process has to guarantee that once it has finished writing an image to the memory file system, the image is actually there for the second process to read. The problem is slightly more complicated, as we have systems out in the field running 1.5.1 which we still need to support. I've tried opening the file with O_SYNC and not doing the fsync(). On 1.5.1, this has increased the write() time slightly, which was expected, but the overall processor utilization hasn't changed much. On 1.6.1, this has the same impact as just removing the fsync() - which I expected because of the MNT_SYNCHRONOUS flag in the mfs mount in 1.6.1. So far, I haven't had any problems with this solution. Will this solution guarantee that whatever the first process writes is there for the second process to read? ----- Original Message ----- From: "Chuck Silvers" <chuq@chuq.com> To: "Daniel Brewer" <danielb@cat.co.za> Cc: <tech-kern@netbsd.org> Sent: Sunday, July 06, 2003 6:00 PM noticed an inexplicably high usage on 161. After digging deeper with gprof, I found that an fsync on 161 takes significantly longer than on 151. Our software writes captured images into a ring buffer in a memory file-system, so other servers can retrieve them. Could someone explain why fsync on 1.6.1 is so significantly slower than on 1.5.1? Is there a work-around? Or Celeron) both have the same motherboard, have 128MB ram and are both running Western Digital 20Gig drives. Running the 161 box on a 2GHz celeron with
Provided the writing and the reading use the same choice of interface (which primarily means, on the one hand, read() and write(), and on the other, mmap()), you shouldn't need to do any explicit syncs. (You need to take care when mixing read/write and mmapped access, but not when all accesses use the same choice of interface.) At least that's how I've always understood it, and that's been my experience. I've been bitten occasionally by missing synchronization between users of different interfaces, but never otherwise. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
for a unified-cache system like we have in 1.6 and later, there's no coherency problem even if you switch back and forth between read/write and mmap. for non-unified-cache systems like sunos 3.x this was probably a problem. the only major non-unified-cache system left that I know of these days (HP-UX) does flushing internally to guarantee coherency for single-threaded applications even if they switch interfaces and makes a best-effort attempt at coherency for multi-threaded (or multi-process) accesses via different interfaces. -Chuck
The only case that really does cause problems is if one process is using NFS to access the files. This can cause extreme grief! The grief is compounded by NFS implementations that compare client and server times. David -- David Laight: david@l8s.co.uk
hi, oh, for the file in the MFS you shouldn't need any kind of syncing at all. the data will be available in memory for the second process to read immediately after the write() returns. this is true for both 1.6.x and 1.5.x. -Chuck
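A minimal sketch of the behaviour described above, assuming nothing beyond plain open()/write()/read(). The function name write_then_read() and the 256-byte buffer are made up for illustration, and a second descriptor in the same process stands in for the second process. No fsync() and no O_SYNC are used; the reader still sees the data as soon as write() has returned, because both descriptors go through the same kernel cache.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write len bytes to path with write(), then read them back through a
 * second, independent descriptor. No fsync() or O_SYNC is involved.
 * Returns 0 if the reader saw exactly what was written, -1 otherwise.
 * (Hypothetical helper, not taken from the application in this thread.) */
int write_then_read(const char *path, const char *data, size_t len)
{
    char buf[256];
    if (len > sizeof buf)
        return -1;

    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return -1;
    if (write(wfd, data, len) != (ssize_t)len) {
        close(wfd);
        return -1;
    }

    /* The "second process": a fresh descriptor opened after write(). */
    int rfd = open(path, O_RDONLY);
    if (rfd < 0) {
        close(wfd);
        return -1;
    }
    ssize_t n = read(rfd, buf, sizeof buf);
    close(rfd);
    close(wfd);

    return (n == (ssize_t)len && memcmp(buf, data, len) == 0) ? 0 : -1;
}
```

fsync() only matters for what survives a crash, which is moot for a memory file system.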
First off I wouldn't think you'd have to worry about fsync() on a memory filesystem. :-) In fact so long as you're not worried about system failures possibly causing loss of data then I don't think you should ever need the fsync(). Is the data _always_ going to be read by the second process? If so then maybe now with bigger pipe buffers in 1.6.x you'd be better off just using a pipe? You should only need a runtime configuration flag to select between use of a pipe and a temp MFS file to be able to also support 1.5.x with the same compiled binary. Also, why are you using UDP to communicate with another process on the same system? Why not an AF_LOCAL socket, pipe, or msgsnd()/msgrcv()? -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
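As a rough illustration of the AF_LOCAL suggestion, here is a sketch in which a forked child plays the capture process and the parent plays the database writer. The function name notify_over_local_socket() is hypothetical, and socketpair() is used only because it keeps the example self-contained; two unrelated processes would bind() and connect() a named AF_LOCAL socket instead.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* The child announces one image name over a local datagram socket and the
 * parent receives it. Returns the number of bytes received, -1 on error.
 * (Illustrative helper; the name and the protocol are assumptions.) */
ssize_t notify_over_local_socket(const char *name, char *out, size_t outlen)
{
    int sv[2];
    if (outlen == 0)
        return -1;
    if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(sv[0]);
        close(sv[1]);
        return -1;
    }
    if (pid == 0) {                 /* child: capture process */
        close(sv[0]);
        ssize_t s = send(sv[1], name, strlen(name), 0);
        _exit(s < 0 ? 1 : 0);
    }

    close(sv[1]);                   /* parent: database writer */
    ssize_t n = recv(sv[0], out, outlen - 1, 0);
    if (n >= 0)
        out[n] = '\0';
    close(sv[0]);
    waitpid(pid, NULL, 0);
    return n;
}
```

Unlike UDP over the loopback interface, this never touches the IP stack.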
Why don't you use an AF_UNIX socket to pipe the image data from process
to process or SYSV shared memory with semaphore interlocking? The latter
would be the fastest way to transfer the data and it avoids any file
system interaction.
--
tschüß,
Jochen
Homepage: http://www.unixag-kl.fh-kl.de/~jkunz/
SYSV shared memory is horrid stuff! Just mmap a file! David -- David Laight: david@l8s.co.uk
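For reference, "just mmap a file" might look roughly like the sketch below. The function name share_via_mmap() and the 4096-byte region size are assumptions made for the example. The mapping is inherited across fork() here for brevity; two unrelated processes would each open() and mmap() the same path. Waiting for the child to exit is a crude stand-in for real synchronization.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REGION_SIZE 4096   /* arbitrary size chosen for the example */

/* Parent and child share a region by mapping the same file MAP_SHARED.
 * The child stores msg in the mapping; the parent copies it out once the
 * child has exited. Returns 0 on success, -1 on error. */
int share_via_mmap(const char *path, const char *msg, char *out, size_t outlen)
{
    if (outlen == 0 || strlen(msg) >= REGION_SIZE)
        return -1;

    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, REGION_SIZE) < 0) {
        close(fd);
        return -1;
    }

    char *p = mmap(NULL, REGION_SIZE, PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    if (p == MAP_FAILED)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(p, REGION_SIZE);
        return -1;
    }
    if (pid == 0) {                 /* child: writer */
        strcpy(p, msg);
        _exit(0);
    }

    waitpid(pid, NULL, 0);          /* crude synchronization */
    strncpy(out, p, outlen - 1);
    out[outlen - 1] = '\0';
    munmap(p, REGION_SIZE);
    return 0;
}
```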
Indeed, it is just compatibility stuff. I'm appalled that new software
is still using that deprecated (and ill-conceived) API. Maybe it's
because Gnu/Linux can't do any other method (at least the last time
I looked.) That shouldn't mean that on other systems (SVR4/44BSD)
that broken API should be used as well.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
New software _should_ use POSIX IPC, but of course NetBSD doesn't yet have any implementation of POSIX IPC, so portable software that must run on NetBSD must still use UNIX System V IPC. It's really not that badly conceived an API, given the fact that it has to be very portable; and it is very "standard" too (it's been in P1003.1 since Issue 2 and has been in the BASE API set since Issue 5). (The lack of ftok() in POSIX has also been repaired since Issue 4, IIRC.) The full implementation gives some very nice user-level controls that can be an enormous boon to debugging and operational management. While a variant/subset of mmap() has been defined for IEEE-P1003.1-2001, it ends up being a hell of a lot more complex than the plain old SysV, aka XSI, shared memory you're complaining about and it doesn't include MAP_ANON either so portable applications still have to use SysV/XSI SHM for the kinds of uses I'm guessing are applicable for the application this thread has been discussing. The only part of the whole SysV IPC API that's really rather inelegant is the semaphores part (and of course that's due to the minor flaw that requires two separate system calls to create and then initialize a semaphore). IIUC this problem is fixed with POSIX semaphores, but of course NetBSD doesn't yet have any implementation of POSIX semaphores. Message queues could have been a little bit simpler too I suppose, though only at the loss of some useful flexibility, and it is unfortunate that you can't use poll() on message queues (sysV or posix). -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
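The two-call semaphore setup mentioned above looks roughly like this. It is a sketch only: create_semaphore(), semaphore_value() and destroy_semaphore() are names invented for the example, and the union is called my_semun because some systems already define union semun in <sys/sem.h> while others leave that to the caller. With IPC_PRIVATE nobody else can see the set between the two calls; with a shared key, another process could attach in that window and find an uninitialized semaphore.

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Same layout as the traditional union semun; renamed so it cannot clash
 * with a system-provided definition (an assumption made for portability). */
union my_semun {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private one-semaphore set and initialize it to `initial`.
 * Returns the semaphore id, or -1 on error. */
int create_semaphore(int initial)
{
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);   /* call 1: create */
    if (id < 0)
        return -1;

    union my_semun arg;
    arg.val = initial;
    if (semctl(id, 0, SETVAL, arg) < 0) {                /* call 2: init */
        semctl(id, 0, IPC_RMID);
        return -1;
    }
    return id;
}

/* Current value of the semaphore, or -1 on error. */
int semaphore_value(int id)
{
    return semctl(id, 0, GETVAL);
}

/* Remove the set; it would otherwise persist after the process exits. */
int destroy_semaphore(int id)
{
    return semctl(id, 0, IPC_RMID);
}
```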
I can hardly believe that someone is actually recommending the use of
this blasted, totally deprecated API. SysV IPC is a horrible
anachronism with lots of inconsistencies and bad behaviour (can't be
multiplexed with files, identifiers & resources hang around when
process exits abnormally, sometimes until the system is rebooted (you
can't always ipcrm), small, fixed limits compiled into kernel etc.)
Given that both System V R4 and higher and 4.4BSD and higher provide
much better, mmap-based shared memory mechanisms which are supported in
one way or the other by most if not all systems today (e.g. Free/NetBSD
and, iirc, Solaris can do both the SVR4 method of shared mmapping of
/dev/zero as well as the anon/shared method of 4.4BSD) there is little
point in programming for the SVIPC API, except as a fallback for
systems on which it is the only method available (GNU/Linux). A
uniform interface which encapsulates those three methods (SVR4, 44BSD,
SVIPC) towards the rest of the application is easily written in a
couple dozen lines of code so there's no reason to use the SVIPC crap
on systems where it is not necessary.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
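The "couple dozen lines" wrapper referred to above could be sketched as follows. This is not anyone's actual code from the thread: shared_alloc() and shared_alloc_demo() are hypothetical names, and the order of attempts (4.4BSD MAP_ANON, then SVR4 /dev/zero, then SysV shmget) is just one reasonable choice. Releasing the memory is left out, since munmap() and shmdt() would need to be told apart.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/mman.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>

/* Return len bytes of zero-filled memory that stays shared across fork(),
 * or NULL on failure. Tries the 4.4BSD method, then SVR4, then SysV IPC. */
void *shared_alloc(size_t len)
{
    void *p;

#ifdef MAP_ANON
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0);
    if (p != MAP_FAILED)
        return p;
#endif

    int fd = open("/dev/zero", O_RDWR);                  /* SVR4 method */
    if (fd >= 0) {
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        close(fd);
        if (p != MAP_FAILED)
            return p;
    }

    int id = shmget(IPC_PRIVATE, len, IPC_CREAT | 0600); /* SysV fallback */
    if (id < 0)
        return NULL;
    p = shmat(id, NULL, 0);
    shmctl(id, IPC_RMID, NULL);     /* removed once the last user detaches */
    return p == (void *)-1 ? NULL : p;
}

/* Demo: the child stores 42 in the shared int and the parent returns what
 * it sees after the child has exited, or -1 on error. */
int shared_alloc_demo(void)
{
    int *p = shared_alloc(sizeof *p);
    if (p == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        *p = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    return *p;
}
```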
SysV IPC is still a standard API and is far more portable than anything What do _you_ mean by "multiplexed"? And are you talking explicitly Everything has limits -- at least with SysV IPC you know what they are no need to write it from scratch -- Richard Stevens already wrote one. Indeed POSIX IPC mechanisms are far more desirable, but also equally non-portable until a majority of modern OS releases catch up to P1003.1-2001 and are deployed in a significant number of installations. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
A feature? Resources that don't get freed because a per-process
reference count is missing? Just imagine what the situation would
It's not an implementation bug if you don't know if the identifier
you see in ipcs is still in use (and deleting it might crash the
Of course a fully implemented, sensible, standardized API is
preferable in the long run. I don't know the story of the SysV IPC
stuff but it's very likely that it was a quick ad-hoc thing written
without proper design in order to provide the necessary IPC mechanism
for some specific software that needed to be written. It predates BSD
sockets in any case (wasn't it already available in SysIII?)
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
We already knew that we couldn't poll() on message queues, so what do Yes, a feature -- they are independent entities that can remain in place SysV IPC wasn't a "quick ad-hoc thing" as far as I can tell. It is very generic and was apparently designed to provide an integrated set of generic IPC facilities that could be used safely and reliably in production environments. I once used SysV IPC to build some of the core communications software in an application that was used to control real-world railroad switches and signals. I've used it in lots of other Nope. They call it SysV IPC for a reason.... (SCO back-ported it to an earlier version of XENIX, but did so badly and with some nasty bugs.) -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
I know that certain folks here like bashing Linux, but it might help to look at the facts before posting such bogus claims. Also remember that it's Linux - as far as the kernel is concerned the GNU folks didn't contribute but rather do the same bashing from time to time that you like, too.
I do not bash lignux, I state facts. I asked in 1996 when the missing
functionality was to be implemented and got told along the lines that
it will take at least a few months still because it couldn't be fitted
into the then current VM architecture easily. Last time I looked (2.4
kernel of 2002 or so) it still wasn't available. Maybe that has
miraculously changed in the meantime. But who cares, this is not a
lignux mailing list.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
Well, that's certainly wrong. MAP_ANON has been available at least through all of the 2.4 series.
Hmm.. my memory could be failing me and it was a pre-2.4 kernel against
which I tested last time. Good news, though, that they're at least
picking up certain things. I hope it also works.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
*shudder* I have yet to see any part that _isn't_ inelegant. A new namespace for each resource. A new _flat_ namespace for each resource. A new flat namespace _with human-meaningless names_ for each resource. The worst of both the persistent and transient worlds: no cleanup on Unfortunate? I would call it fatal. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Have you never heard of inode numbers? :-) Perhaps you're not aware of ftok(3) and its use to map normal pathnames I think you've mis-interpreted what I believe was a design goal of SysV IPC. The IPC entities are not supposed to go away when the last process exits -- they are supposed to persist so that a process can come along later and re-attach to them. Asking SysV IPC entities to disappear on last exit of all processes which have interacted with them would be like asking all files to disappear on last close! There's nothing fundamentally wrong with them disappearing on reboot though -- in that sense they are no different than a memory filesystem. (or pipes). (It might have been nice to have some way to explicitly ask to "unlink" an IPC entity such that it would be cleaned up on last exit, but the poll() came along to the systems in question quite a bit later than message queues -- at the time the only things that you might want to be able to do simultaneous non-blocking reads on were TTYs and the best you could do with anything at the time were manually polled non-blocking reads and when that's what you have to do then adding a msgrcv() call using IPC_NOWAIT to the loop is no big deal. BTW, it is only not portable to use poll() or select() on message queue identifiers -- it is possible to use select() on at least one more modern implementation (AIX). Also, BTW, it seems I am indeed wrong about never being able to use poll() or select() on POSIX message queues -- POSIX does specifically allow message queue descriptors to be implemented as file descriptors and so it should be possible to implement them in such a way that poll() or select() will do the right thing for them. On the other hand this can't be relied upon by a portable application. Luckily POSIX message queues have a way to establish a notify callback function that will be called just like a signal handler whenever a message appears in an empty queue (though the notifier does have to be ...
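The "msgrcv() call using IPC_NOWAIT" that gets added to a hand-rolled polling loop can be sketched like this. The struct text_msg layout and the name try_receive() are assumptions for the example; only msgrcv(), IPC_NOWAIT and ENOMSG come from the SysV API itself.

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

/* Message layout assumed for this example: a type followed by text. */
struct text_msg {
    long mtype;
    char mtext[64];
};

/* Non-blocking receive of the next message of any type.
 * Returns 1 if a message was copied into out, 0 if the queue was empty,
 * -1 on any other error. Meant to be called once per polling-loop pass. */
int try_receive(int qid, char *out, size_t outlen)
{
    struct text_msg m;

    if (outlen == 0)
        return -1;

    ssize_t n = msgrcv(qid, &m, sizeof m.mtext, 0, IPC_NOWAIT);
    if (n < 0)
        return errno == ENOMSG ? 0 : -1;

    size_t len = (size_t)n < outlen - 1 ? (size_t)n : outlen - 1;
    memcpy(out, m.mtext, len);
    out[len] = '\0';
    return 1;
}
```

The loop would call try_receive() alongside its non-blocking read()s and sleep briefly when nothing is ready, which is exactly the cost a poll()-able descriptor would avoid.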
Certainly. How does it avoid collisions? (Hint: it can't. There can be, and on large systems not too infrequently are, more files than there are possible IPC IDs.) How is there any excuse for tying this It's fine to have that option. It's broken to have that as the only Pipes do not have names. Memory filesystems, yes, and that's one ...so? Taking an OS whose unifying concept is "everything is a file descriptor" and creating three new object types that can't have file Gah. We need to get away from signal handlers. Or else we need an environment (language and OS) better suited to them. Getting away from signals was one of the things that made me implement AF_TIMER sockets. (Another was the insanely small number - one - of possible outstanding timeouts with the signal-based timer facilities. The third was the So they not only repeated the mistakes of signals, they repeated the mistake of making them unreliable. And this is the API you are holding up as a paragon of goodness?? I shudder to imagine your idea of a bad API. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Unix filesystems use inodes. :) but my point was that a namespace consisting of integers isn't such a Note I had not actually ever looked at NetBSD's implementation before, but I see now that it is far less than ideal, and strictly speaking is not correct since as far as I can tell it cannot possibly conform to the full requirements of POSIX 1003.1-2001. ftok() _should_ always create a unique key to match any unique file and 'id' parameter (and of course _should_ always return the same key for the same file & 'id'). Most implementations are done in userland (and POSIX requires that a userland implementation be possible) and like the half-baked one in NetBSD they use the combined st_ino and st_dev numbers found from stat()ing the file (and then shift in the low 8 bits of the 'id' parameter so that each file can also represent 2^8 unique keys). Once upon a time the bits used from each value would probably have guaranteed a unique key, but since then the values have widened You can ask for the world, but that doesn't mean you'll get it! ;-) There is no "option" to have a file automatically deleted on last close. You have to unlink() it explicitly and then its allocated storage will be released on last close. The unlink() means the file is immediately invisible to all other processes which do not have it open(). Does this mean the open()/close()/unlink() API is also fundamentally broken? I don't think so! I would say that any API for any complex functionality would be broken by being too complex to use if it had every imaginable option and feature. Just like the files on a memory filesystem, SysV IPC entities persist until reboot so that they can be used by transient processes and there exist tools to manage them as necessary. I.e. their whole API is sufficient to do everything that's necessary, and it is not bloated with Everything is a file descriptor in unix unless it's an object located at a memory address. :-) Message queues are the only one of the ...
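To make the collision argument concrete, here is a sketch of the kind of userland key construction described above. It is illustrative only: sketch_key() is a made-up name, and the exact split (8 bits of 'id', 8 bits of st_dev, 16 bits of st_ino) is an assumption about a typical 32-bit layout, not a quotation of any particular libc.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/stat.h>

/* Pack the low 8 bits of id, the low 8 bits of the device number and the
 * low 16 bits of the inode number into one 32-bit key. Everything above
 * those bits is discarded, which is where collisions come from. */
uint32_t sketch_key(const struct stat *st, int id)
{
    return ((uint32_t)(id & 0xff) << 24) |
           ((uint32_t)(st->st_dev & 0xff) << 16) |
           (uint32_t)(st->st_ino & 0xffff);
}
```

With this layout, two files on the same device whose inode numbers differ only above bit 15 (say 1 and 0x10001) map to the same key for the same 'id'.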
In the API?? The only places I know of they appear in the API are Maybe, maybe not, but your example is irrelevant - inumbers are _not_ Which implies that no conformant OS on a machine with 32-bit ints can ever support as many as 2^24 files simultaneously accessible. Nice idea. It won't fly in practice. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Anyone who ignores the fact that unix filesystems are inherently flat namespaces at their lowest level does so at his or her own peril! ;-) Directories are just files full of names and pointers to the true What gives you that idea? The standards do not specify what a "key_t" is. P1003.1-2001 says only that it is "Used for XSI interprocess communication" and that it is defined by including <sys/types.h>. For all you know at the API level of ftok(), shmget(), semget(), and msgget() a key_t could be a struct, or a pointer to a struct, containing a full copy of a "struct stat" along with a full copy of the stat()ed pathname in a char array and a copy of the whole int-sized 'id' value. (and that's probably what it should be, without perhaps the pathname) The resulting IPC resource descriptors returned by shmget(), msgget(), and semget(), each being in its own namespace and each being defined as type 'int' with "-1" being a reserved value, allows, strictly speaking, for UINT_MAX-1 open resources of each type, along with UINT_MAX-1 open ordinary files (though practically speaking it's INT_MAX for each of course because so many programmers assume all negative values are equivalent to "-1"). -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
You are still insisting on confusing the implementation namespace (inumbers) with the API namespace (pathnames). Given how intelligent you have proven yourself to be in other areas, I can only conclude you are being deliberately stubborn in this misunderstanding, and I see no need to even attempt to discuss matters with someone acting that way. Goodbye. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
I'm only trying to point out to you that complaining that flat namespaces are inherently broken is like complaining that the sky is blue on a clear and sunny day. I'm only blurring the boundary between kernel and user-land because when you get down to these kinds of things the boundary _should_ be blurry. Just as the unix filesystem has namei(), SysV IPC has ftok(). -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
namei() is not an API. ftok() is. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> I'm only blurring the boundary between kernel and user-land because when GAW> you get down to these kinds of things the boundary _should_ be blurry. GAW> GAW> Just as the unix filesystem has namei(), SysV IPC has ftok(). The difference, though, is that ftok() is implemented in userspace, forcing whatever uses IPC to do its own groveling, while namei() is never seen from userland -- the kernel always ALWAYS handles namei(). You might like to blur the line between kernel and userspace, and as much as the line should be blurred on occasion, this is not one of those occasions, except, it seems, to suit your own point of view. Using IPC shouldn't require kernel-like knowledge to address any more than one is required to know kernel-like things in order to call, e.g. open(2) or chdir(2). With open/chdir, you pass a pathname which is easily determined and can even be proscribed by the program in question, at which point error detection and handling and probably several fallbacks come into play. With the IPC stuff, you have to jump through quite a few hoops to locate the identifier. You must ALWAYS do this. You cannot hope to even generate the identifier on the fly. At this point, error detection and handling come into play, and there's really no fallback -- if what the system gives you doesn't work for some strange reason, you're hosed (of course, so's the system, most likely). Comparing the filesystem to IPC/shm/sem/msg is like comparing apples to astrology. There is little commonality between the two; given that they serve somewhat different needs (unless you count the open/fork/unlink trick which preceded anything resembling current IPC), this is expected. Nonetheless, that the implementation as it stands deals with nonhuman(oid)- readable identifiers which must be obtained through secondary means is just ridiculous. The least that could be done is that a shm/sem/msg could be requested by a particular name; ...
And this difference means what, exactly? Have you never heard of any systems which implement filesystems in userland? A little bird tells me You absolutely do not need any "kernel-like" knowledge to understand or use SysV IPC mechanism. If you think that is true then what you're probably missing is generic knowledge of interprocess communications techniques, as well as perhaps a general understanding of how various naming conventions are implemented. Would you say that using the DNS requires "kernel-like" knowledge too? "one" != "quite a few" I guess maybe you don't like to open your files before you read them? Perhaps you'd rather not have to bother managing your open file Why not? If you know its name then you can find its resource ID with trivial ease (in very much the same way you would know a file's name since they are after all exactly the same things, and indeed you can even use normal filesystem tools to examine a list of filenames which might have been associated with IPC resources to select from amongst them). Or do you mean you were so totally overwhelmed by the flexibility of the API that you forgot to make appropriate use of one of its key features in So, what do you do when you open a file that doesn't exist? What if your filesystem wasn't even mounted? The system is never hosed because of bugs in applications using SysV IPC -- at least not so long as it's being managed by anyone with a clue, and provided of course there are no latent bugs in the implementation. I'm sorry to have to say this but it seems as if your complaints are Hmmm... sounds like you're equating inode numbers, or maybe file descriptors, or some similar kind of resource handle, to filenames. Do you find it too confusing to have the ability to have a separate (and potentially user-replaceable) name-to-key mapping sitting on top of the Please try to follow the bouncing ball in this simple example that's been simplified even further by removing error handling for your ...
I'm not exactly a kernel guru (and unless you mean kernel data structures)-
I must say that using SysV IPC (or any other UNIX system calls) does
require a good knowledge of how things are implemented under the hood.
System calls (depending on sw and hw issues) have a finite
capacity/latency etc. which a product developer needs to understand to
take care of race conditions, scalability etc..
regards
-kamal
No, not "under the hood". A good systems programmer will need to know how to use system calls effectively and safely and what system resources they may consume; and thus the API for those calls must be well and completely documented. However just as with any well documented API it should never be necessary to know how it is implemented under the hood. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Umm, POSIX SHM _does_ use mmap. It just uses shm_open to get a suitable fd, on Solaris and Linux that would be on tmpfs.
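On a system that does provide POSIX shared memory (which, per this thread, NetBSD at the time did not), the shm_open-then-mmap sequence looks roughly like the sketch below. posix_shm_map() is a hypothetical helper name; on some systems shm_open() lives in a separate library (commonly -lrt), which is an assumption to check locally.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create (or open) the named POSIX shared memory object, size it to len
 * bytes and map it MAP_SHARED. Returns the mapping, or NULL on failure.
 * The object persists until someone calls shm_unlink(name). */
void *posix_shm_map(const char *name, size_t len)
{
    int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;

    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return NULL;
    }

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                      /* the mapping outlives the descriptor */
    return p == MAP_FAILED ? NULL : p;
}
```

Apart from shm_open() replacing open(), this is the same code one would write to mmap an ordinary file.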
Yes, that's my point. :-) If you know how IEEE standards committees work and you understand how much they (are supposed to) hate inventing new things, the fact that they invented shm_open and shm_unlink() suggests that some strong member(s) of the committee were just completely and totally unwilling to allow for mmap() to work on all normal files and that the only way they would be happy with mmap() becoming the true standard shared memory interface was if it was required that the file descriptors it used be allocated by some special new function. You would think shared memory would be simpler to describe and discuss than something like message queues which have lots of fancy features, since it is, after all, just a chunk of memory storage that can be mapped into the address space of multiple processes. However because of this strange use of mmap() and all the qualifiers they put on it for POSIX, folks like Bill Gallmeister in his O'Reilly "POSIX.4" book actually spend more pages describing shared memory and give all kinds of caveats about its use. The SysV SHM API is trivial by comparison to POSIX. I don't know why POSIX doesn't include MAP_ANON either -- that would have made things ever so much simpler! The rationale in P1003.1-2001 claims they decided to use the SysVr4 mmap() implementation as the basis of the POSIX API, and indeed SysVr4 lacks MAP_ANON, however MAP_ANON was very well known before mmap() was finalized since 4.3net2 was already widely disseminated (1003.4 was still in draft at the end of 1991). -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Not trying to defend IEEE here, but there is some sense at least behind shm_open. Given that for shm you really want an object that's not backed by permanent storage (= a normal filesystem) you need to know where to look for a tmpfs-lookalike or, in the case you mentioned above something outside the normal filesystem namespace (yuck!). As IEEE isn't into the filesystem namespace business shm_open is an okay wrapper for leaving this to the implementation. Why the heck they specified shm_unlink is completely unclear to me, Just because it was known that doesn't mean it should be standardized. And MAP_ANON really doesn't fit into the SunOS4/SVR4 VM that wants a backing vnode for each memory object unlike the Mach VM. Thus the horrible mmap() of /dev/zero hack, btw..
You don't need such a concept for mmap(MAP_ANON|MAP_SHARED) -- the filename is simply a key to the anonymous memory so that multiple Oh I agree there's some sense behind shm_open() -- just so long as you ignore the MAP_ANON jumping up and down and waving its hands and That one's easy! ;-) shm_unlink(), like unlink(), takes a pathname parameter, so given the fact shm_open() names are strictly outside the normal visible filesystem space then you need a matching unlink() interface to work in this private, invisible, namespace. (or at least you do so long as you don't also have something like a funlink() call that takes an open file Given the constraints of trying to work without MAP_ANON to thus end up with the same functionality only after inventing a dozen new API signatures to work around the lack of MAP_ANON is in fact a very good reason to standardize a far simpler API. That's why I say there must have been some very strong politics influencing the committee members. Normally these committees are loathe to invent new APIs and the mere fact that they started down that road when they thought they could do without MAP_ANON should have suggested to them that they were going in the wrong direction. "Oops! We're inventing something! Let's go back to that last fork in the road we took to get here!" (Of course POSIX.4 seems to be mostly cut from whole cloth so maybe they didn't share that same I don't buy that argument at all. SysVr4 VM has the concept of anonymous memory and the swap layer provides the backing store for anonymous pages. I suspect forcing anonymous pages to always have the MAP_PRIVATE attribute was their downfall. Anonymous pages could have been made sharable simply by associating a vnode from an ordinary file descriptor with them -- i.e. there's a vnode but it's not what's mapped, anonymous memory is mapped and thus the swap layer continues to provide the backing store. That's essentially how mmap(MAP_ANON|MAP_SHARED) works, IIUC -- the filename, ...
Thus spake Greg A. Woods ("GAW> ") sometime Today...
GAW> shm_unlink(), like unlink(), takes a pathname parameter, so given the
GAW> fact shm_open() names are strictly outside the normal visible filesystem
GAW> space then you need a matching unlink() interface to work in this
GAW> private, invisible, namespace. (or at least you do so long as you don't
GAW> also have something like a funlink() call that takes an open file
GAW> descriptor as its parameter :-)
Um, slightly off-topic, but wouldn't funlink() be somewhat disastrous in
practice? (I presume that's why the smiley). That would require a file-
system cleaner process or a routine that knew instantly how to match inode
numbers to pathnames (as it was explained to me, "The kernel routine is
called namei() for a reason. You will note that there is no converse
routine, since while name -> ino-dev is unique for each ino-dev, the
reverse is untrue -- consider /foo/bar/.. and /foo, for example...").
GAW> > Thus the horrible
GAW> > mmap() of /dev/zero hack, btw..
GAW>
GAW> Hmmm.... yes. What a stupid idea that was. :-) (A NULL vnode pointer
GAW> was apparently supposed to suffice such that a /dev/zero vnode was
GAW> unnecessary.)
Wow, a (vno_t *) NULL was supposed to allow one to create pre-cleared
pages in memory?
Despite its ugliness, /dev/zero has other uses, such as creating
arbitrarily large filespaces (for, e.g., swap (don't go there.)) without
having to rewrite a program to handle it -- one can use dd for it,
though I wouldn't have minded a 'mkfile' program to do the same thing
(thus avoiding the need for /dev/zero).
I have a question regarding mmap()ing /dev/zero:
Purportedly this was used by crt0.o and/or ld.so to create blank spots
into which to load the dynamic libraries. Surely the same thing could
have been accomplished with *alloc() and a clear routine, or
mmap() could just pre-zero whatever pages it maps. Was /dev/zero
*truly* necessary? Its sudden disappearance once or ...No more than unlink(), provided that the link count was only one; Why "instantly"? funlink() could block until it found the directory Yes, that's strictly true, but your example is slightly wrong. :-) -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Thus spake Greg A. Woods ("GAW> ") sometime Today...
GAW> > Um, slightly off-topic, but wouldn't funlink() be somewhat disastrous in
GAW> > practice?
GAW>
GAW> No more than unlink(), provided that the link count was only one;
Well, yes, there is always that restriction, but that wasn't stated.
GAW> alternately the pathname could be cached in the kernel. ;-)
I don't get that at all, sorry. What good would that do for a file that
had (N>1) links to it? If the (st_nlink == 1) restriction is not
enforced, funlink() would be tantamount to clri...
Gah! You know what? [I'm so chagrined -- this took me several passes.]
funlink() would STILL be tantamount to clri, and it would require the same
procedure (fsck) from which to recover, unless the kernel DID cache the
paths of every open (I know it caches vnodes, but pathnames?) until (last)
close or until unlink.
GAW> > That would require a file-
GAW> > system cleaner process or a routine that knew instantly how to match inode
GAW> > numbers to pathnames
GAW>
GAW> Why "instantly"? funlink() could block until it found the directory
GAW> entry and confirmed that the link count was still one. :-)
<quiz>
[ ] You expect me to wait that long for funlink() to return from that
procedure?
[ ] ...on a 1GB filesystem?
[ ] ...with a high density of inodes?
[ ] ...on a system with a slow CPU?
</quiz>
I'm guessing the smileys all over the place are conveying your intent that
funlink() would not work at all, practically speaking...
GAW> > You will note that there is no converse
GAW> > routine, since while name -> ino-dev is unique for each ino-dev, the
GAW> > reverse is untrue -- consider /foo/bar/.. and /foo, for example...").
GAW>
GAW> Yes, that's strictly true, but your example is slightly wrong. :-)
I see that's either because they're directories or, more likely what you're
hinting at, because /foo/bar may be on a different dev than /foo.
I should say, then, "consider /foo/bar/.. and /foo on the ...
Just to be clear I'm thinking of funlink() as relating to unlink() in the same way fstat() relates to stat() -- i.e. it would unlink the file that had been opened to create the file descriptor which would be passed to it as its only parameter.
funlink(), with pathnames of open file descriptors cached in the kernel, would be _exactly_ like unlink() and wouldn't require any other magic. Funlink() would merely unlink the file that was opened to create the file descriptor it was passed.
Funlink() without cached filenames would have to internally do much the same as fstat() to find the device (and then the mount point) and the inode number of the open file. Then the directory tree of the target filesystem would have to be traversed to find an entry with a matching inode number, and then if all were OK (regular file, one link, etc.) then the equivalent of a normal unlink() would be done on the found filename.
About the only semi-sane use of funlink() (at least that I have ever been able to think of over the years) could be to give a process the ability of unlinking a file that its parent process had connected to one of its file descriptors (e.g. stdin). Of course such an ability would
I was thinking more along the lines that /foo/bar/.. and /foo/bar refer to the same directory (assuming /foo/bar isn't a mount point) and of course all directories have multiple hard links. Of course you can't normally/safely unlink() a directory, regardless of whether you refer to it by its "true" name, or whether you refer to it by its ".." alias name, but that's a slightly different issue [frmdir()?] :-)
-- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
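The search-based variant described above (fstat() for the device and inode, walk the tree, check for a regular file with one link, then unlink the name that was found) can be approximated in userland. What follows is only a sketch under stated assumptions: funlink_by_search() and its demo are hypothetical names, not an existing call, and the walk uses POSIX nftw() restricted to one filesystem. Being userland code it is not atomic, so a name can change between the walk and the unlink().

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <ftw.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* nftw() callbacks take no user argument, so the search state is static. */
static dev_t target_dev;
static ino_t target_ino;
static char found_path[PATH_MAX];

/* Stop the walk at the first non-directory entry with the wanted dev/ino. */
static int match_inode(const char *path, const struct stat *sb,
                       int type, struct FTW *ftwbuf)
{
    (void)ftwbuf;
    if (type == FTW_F && sb->st_dev == target_dev && sb->st_ino == target_ino) {
        snprintf(found_path, sizeof found_path, "%s", path);
        return 1;               /* non-zero return ends nftw() early */
    }
    return 0;
}

/* Hypothetical: unlink the single name of the file open on fd, searching
 * below root. Not atomic: names can change between the walk and unlink(). */
int funlink_by_search(int fd, const char *root)
{
    struct stat sb;
    int r;

    assert(root != NULL);
    if (fstat(fd, &sb) == -1)
        return -1;
    if (!S_ISREG(sb.st_mode) || sb.st_nlink != 1) {
        errno = EINVAL;         /* refuse non-regular and multi-link files */
        return -1;
    }
    target_dev = sb.st_dev;
    target_ino = sb.st_ino;
    found_path[0] = '\0';

    r = nftw(root, match_inode, 16, FTW_PHYS | FTW_MOUNT);
    if (r == -1)
        return -1;
    if (r != 1) {
        errno = ENOENT;         /* walk finished without a match */
        return -1;
    }
    return unlink(found_path);
}

/* Self-check: create a file in a private directory and funlink it. */
int funlink_by_search_demo(void)
{
    char dir[] = "/tmp/funlink-XXXXXX";
    char path[PATH_MAX];
    int fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/victim", dir);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    ok = fd != -1 && funlink_by_search(fd, dir) == 0
         && access(path, F_OK) == -1 && errno == ENOENT;
    if (fd != -1)
        close(fd);
    unlink(path);               /* no-op when the funlink succeeded */
    rmdir(dir);
    return ok ? 0 : -1;
}
```

Calling funlink_by_search_demo() from a main() should return 0. The cost of the walk grows with the size of the tree under root, which is the price discussed in the thread.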
Thus spake Greg A. Woods ("GAW> ") sometime Today...
GAW> I was thinking more along the lines that /foo/bar/.. and /foo/bar refer
GAW> to the same directory (assuming /foo/bar isn't a mount point)
Really? cd /var/tmp/.. Where are you now? :)
GAW> and of
GAW> course all directories have multiple hard links. Of course you can't
GAW> normally/safely unlink() a directory, regardless of whether you refer to
GAW> it by its "true" name, or whether you refer to it by its ".." alias
GAW> name, but that's a slightly different issue [frmdir()?] :-)
Well, yeah, of course; the point, though, which has been sidestepped here,
is that you can't use a dev-ino to come up with a unique name, even though
you can use a name to come up with a unique dev-ino pair.
--*greywolf;
--
NetBSD: Groovy Baby!
Nothing's been sidestepped here. Perhaps you've forgotten either one of: (a) the option of caching open filenames; or (b) the qualifier that funlink() could/should fail/misbehave if the inode has multiple hard links. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Thus spake Greg A. Woods ("GAW> ") sometime Today...
GAW> Nothing's been sidestepped here. Perhaps you've forgotten either one
GAW> of: (a) the option of caching open filenames; or (b) the qualifier
GAW> that funlink() could/should fail/misbehave if the inode has multiple
GAW> hard links.
In order for funlink() to work, you will need both caching of open filenames and the requirement that st_nlink == 1. You'll need the caching to make sure you can find the name quickly, and the one-link requirement to ensure you don't break things.
The only other option would be to keep an inverse lookup db somewhere, and that could get prohibitive (unless the means you kept was a db which contained not actual names, but pointers to ino-devs which were directories, and the offsets within from whence names could be retrieved (since the kernel can divine things like data blocks from inodes and thus dereference them...)).
[that could have implications of either robustness or frailty, depending on how you looked at it. At first glance, something like that would enable run-time consistency capability, but I think it's probably more frail than that (I get starry-eyed at new features, at least until I discover their drawbacks, so I'm probably not (always) the best judge of robustitude). I mean, think about it: you'd get a cross-check of link counts available at any time. iname() would be possible, although it would return an array of objects rather than a single one; but if there was an inconsistency between the link count on the ino and the number of objects returned, that could be useful. Don't ask me how (see "starry-eyed", above). I'm sure someone, somewhere, before me, has thought of doing this. I'm equally sure that this person has presented it, only to have it go down in flames because of the unnecessary complexity it added to things like open, creat, unlink, link, mkdir, rmdir and rename. But I digress...]
I had the experience of tripping over something ...
No, not really -- i.e. not just to make it "work". You only need one or "quickly" is irrelevant -- the call would always have to block until it found the right directory entry, regardless. Assuming one doesn't want to pay the price of caching all open filenames then whether funlink() locks the inode first before searching for the First off if you have cached the open file name then you don't need to worry about the link count -- funlink() would always remove the intended file in that case since it knows the file's name precisely. The restriction on a link count of one would only be necessary when open filename caching wasn't implemented. It would be needed to make sure the kernel didn't find the "wrong" filename first and unlink it instead of the filename that was opened to create the file descriptor passed to Did it persist that way for any period of time? Or was it a momentary quirk that could have been caused by a race condition for a file that If that file really did persist until you fsck'ed then it sounds like a corrupt filesystem was mounted without being properly fsck'ed the first time around -- perhaps because of a bug in fsck; or perhaps there was a bug in the filesystem code, or an undetected data corruption on the disk, that caused the error to appear since the last boot. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
The name under which it was opened does not necessarily bear any relation to any names it has later, nor does the file which was opened under a given name necessarily bear any relation to any file that may later exist under that name. These remain true even if every file in sight has nlink 1. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Neither is of any real use by itself.
    fd = open("/home/alice/fnord/flarp",O_WRONLY|O_CREAT|O_EXCL,0666);
    chdir("/home/carol");
    rename("../alice","/home/bob");
    funlink(fd); /* What does this do? Why and how? */
Even if you cache the pathname passed to open, even though the rename() does not have anything obvious to do with the open()ed path, even though the file still has only one hard link, you have to search the filesystem. Or else you have to do an amazing amount of work (including, at a minimum, an open file table walk) in rename(), link(), unlink(), and probably other calls I haven't thought of to make sure that the "cached" pathnames remain correct as files and directories get moved around.
It might work to cache the containing-directory vnode and the opened-as name (or moral equivalent, such as having directory vnodes point to a list of currently-open files in them), and then all you have to worry about is other clients of the same fileserver renaming it out from behind your back (which admittedly is really no different from many other issues such environments already have). You don't even need the nlink==1 check if you do that.
/~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
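The effect of that rename() can be observed without any funlink() at all. The sketch below is an illustration only: it works in a scratch directory under /tmp instead of /home, uses the POSIX.1-2008 *at() calls to keep names relative to that directory, and opened_path_goes_stale() is a made-up name. It checks that after an ancestor directory is renamed, the opened-as path no longer resolves, while the descriptor still refers to the same one-link file under its new name.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>     /* for the caller's assert(), see the note below */
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if the opened-as path went stale while the descriptor still
 * refers to the same single-link file under its new name; -1 otherwise. */
int opened_path_goes_stale(void)
{
    char base[] = "/tmp/stale-path-XXXXXX";
    struct stat by_fd, by_new, by_old;
    int base_fd, fd, ok = 0;

    if (mkdtemp(base) == NULL)
        return -1;
    base_fd = open(base, O_RDONLY | O_DIRECTORY);
    if (base_fd == -1) {
        rmdir(base);
        return -1;
    }

    /* All names below are relative to the scratch directory. */
    mkdirat(base_fd, "alice", 0700);
    mkdirat(base_fd, "alice/fnord", 0700);
    fd = openat(base_fd, "alice/fnord/flarp",
                O_WRONLY | O_CREAT | O_EXCL, 0600);

    /* Rename an ancestor; the file's own directory entry is untouched. */
    if (fd != -1 && renameat(base_fd, "alice", base_fd, "bob") == 0) {
        int stale = fstatat(base_fd, "alice/fnord/flarp", &by_old, 0) == -1
                    && errno == ENOENT;
        ok = stale
             && fstat(fd, &by_fd) == 0
             && fstatat(base_fd, "bob/fnord/flarp", &by_new, 0) == 0
             && by_fd.st_dev == by_new.st_dev
             && by_fd.st_ino == by_new.st_ino
             && by_fd.st_nlink == 1;
    }

    /* Clean up whichever of the two layouts exists; errors are ignored. */
    if (fd != -1)
        close(fd);
    unlinkat(base_fd, "bob/fnord/flarp", 0);
    unlinkat(base_fd, "bob/fnord", AT_REMOVEDIR);
    unlinkat(base_fd, "bob", AT_REMOVEDIR);
    unlinkat(base_fd, "alice/fnord/flarp", 0);
    unlinkat(base_fd, "alice/fnord", AT_REMOVEDIR);
    unlinkat(base_fd, "alice", AT_REMOVEDIR);
    close(base_fd);
    rmdir(base);
    return ok ? 0 : -1;
}
```

A main() containing assert(opened_path_goes_stale() == 0); is enough to try it. A pathname cached at open() time would be exactly the stale string this function detects.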
If you don't have an opened filename cache then assuming the rename() works (i.e. assuming /home/bob isn't a mount point) then the funlink() call would unlink that newly created file no matter what its name (which does suggest one possible reason which might make funlink() useful even
If you do have a record of the opened filename then I would say the easy way out means the funlink() must simply fail because it would be exactly equivalent to: unlink("/home/alice/fnord/flarp");
(I should clarify that what I have been implying by the "opened filename" is a fully qualified name which would have to be computed using sys/kern/vfs_getcwd.c:getcwd_common() or similar if the name passed to open was a relative one. Caching the process' cwd vnode at the time of open for each open file (even if only for those opened with relative names) would probably complicate far too many other things to be worth considering.)
I guess the choice between just failing the funlink() call in your example above, or trying to make it work if possible, depends on what possible reason one might have for implementing funlink() in the first place, and thus how one defines it.
If all one is trying to do is flesh out the set of f*() system calls which have file descriptor parameters to make it orthogonal to the set which have pathname parameters then I suppose it depends on who one wants to pay the price for this feature (and perhaps whether or not one can conceive of any other reason to bother caching the opened filenames). (Note this fleshing out of the f*() system call set was the original purpose I had when I played this same mental exercise many years ago.)
If on the other hand what the funlink() implementor is trying to do is actually make the funlink() call "work" no matter what name the opened file might have at the time the funlink() call is made then, as you say above, caching the opened filename may not be any help and one really does have to force the caller to pay the price of the ...
...you shouldn't do funlink() in this form in the first place, as unlink() is not an operation on a file but an operation on a link to a file. Now, an funlink() that takes an fd on a directory and a (slashless) component name, that would be a sensible way to add an fd-based variant of unlink(). But there are more serious things to fix first, like the inability to use open()/fchdir() with directories that are execute-only. (To fix It could, because then rename() could notice the change and change the cache to refer to the new location. (In which case I'm not sure how fair it is to still call it a "cache"....) I'm not sure what to do with link/unlink pairs, though. Neither call should update the cache, but together they should.... /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
You could also argue that it would be useful to have variants of open() that take a directory fd and a pathname. David -- David Laight: david@l8s.co.uk
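Both shapes suggested here, an unlink() and an open() that take a directory descriptor plus a name, were later standardised in POSIX.1-2008 as unlinkat() and openat(); the thread predates them. The sketch below only illustrates those two calls in a scratch directory; dirfd_variants_demo() is a made-up name. Note that unlinkat() still removes a link, identified by directory and name, and not "the file behind a descriptor".

```c
#define _XOPEN_SOURCE 700
#include <assert.h>     /* for the caller's assert(), see the note below */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Returns 0 if a file could be created and its entry removed using only a
 * directory descriptor and a component name; -1 otherwise. */
int dirfd_variants_demo(void)
{
    char dir[] = "/tmp/dirfd-XXXXXX";
    struct stat sb;
    int dfd, fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    dfd = open(dir, O_RDONLY | O_DIRECTORY);
    if (dfd == -1) {
        rmdir(dir);
        return -1;
    }

    /* open() variant: a directory descriptor plus a name relative to it. */
    fd = openat(dfd, "scratch", O_WRONLY | O_CREAT | O_EXCL, 0600);

    /* unlink() variant: removes the entry "scratch" in that directory. */
    ok = fd != -1
         && unlinkat(dfd, "scratch", 0) == 0
         && fstatat(dfd, "scratch", &sb, 0) == -1 && errno == ENOENT
         && fstat(fd, &sb) == 0;    /* the open file itself still exists */

    if (fd != -1)
        close(fd);
    unlinkat(dfd, "scratch", 0);    /* no-op when the removal succeeded */
    close(dfd);
    rmdir(dir);
    return ok ? 0 : -1;
}
```

A main() containing assert(dirfd_variants_demo() == 0); exercises it. Because the directory is named by descriptor, renaming the directory itself does not redirect either call.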
But that's the whole point of funlink() (and perhaps even some of the other f*() calls, such as fchdir()) -- turn an operation on a file into an operation on a filename (i.e. a link to a file). funlink() would of course act upon the file (i.e. the inode and the storage it points to) as well as the link recorded in the parent directory, since of course it will also have to mark the file (inode) (and its storage) as free.
I suppose by stretching one's imagination a wee bit it's possible to see how funlink() could help eliminate a TOCTOU race condition for a process that must unlink a temporary file in some unsafe place like /tmp. That's quite a bit of a stretch though and it's an especially long and unlikely stretch if you consider that "safe" sub-directories can and should _always_ be used in unsafe directories (since rmdir() is always safe to use in a world-writable directory). Perhaps funlink() could have secure programming uses if a process has to remove files created in a "safe" sub-directory of a world-writable directory and for some reason it cannot chdir to that sub-directory any longer, but I can't at the
If the process has an FD open on the parent directory then it should be able to much more easily just fchdir() there first, obtaining and keeping an open FD for its PWD for those cases where it has to go right back to where it started.
For other uses (such as unlinking stdin) such a form would be unusable, and after all the goal is to define a system call that accepts a file descriptor and acts upon the file that was opened in the same way as the s/^f// system call would act when passed a filename referring to the same file that was opened. As I say above funlink() would actually act
Yes, I suppose that such an ability could be somewhat helpful at times.
I've always thought the only extremely serious omission in the f*() function call set has been faccess() (and of course there should not have ever been an access() call since it is inherently insecure); ...
Not quite. chdir() operates on a process and a directory; the pathname is necessary only to identify the directory. It could equally well identify the directory by a file descriptor open on it, and that's exactly what fchdir() does. truncate() operates on a file; the pathname is used merely to find the desired file. It too could equally well identify the file by a file descriptor open on it, and that's exactly what ftruncate() does. However, unlink() does not operate on a file, except as a side effect; it operates on a link to a file. The file is affected, sometimes very mildly, sometimes severely, but the critical point is that the pathname is not serving just to locate the file, but rather to locate a particular link to the file. Since file descriptors are on files, not Yes, it should be able to. But since you can't open "." if your current directory is execute-only, it can't. (It'd also be nice to be able to do one syscall (funlink of this flavor) instead of three Which is why you can't do funlink(), because unlink doesn't operate on files; it operates on links to files. The file is operated on only in that it's garbage-collected once it's no longer referenceable. (Which may be when its refcount goes to zero, or it may be an indeterminate It doesn't really make sense, unless you also add fopen() (which name has unfortunately already been preempted by stdio), to re-open a file with potentially different access rights. Otherwise, faccess() doesn't How so? Provided you realize what it does, and more importantly what it doesn't do, there's nothing wrong with it. (In particular, access() No, it wouldn't. You still couldn't save and restore your current directory by opening "." and fchdir()ing back there if your current directory is execute-only, even with O_MKDIR, without O_NOACCESS. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
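The save-and-restore idiom referred to here, opening "." and later calling fchdir() on that descriptor, looks roughly like the sketch below. visit_and_return() is a made-up name. The open() of "." is the step that fails in the case discussed above, since opening a directory for reading needs read permission on it.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/* Visit dir, then return to the starting directory via a saved descriptor.
 * Returns 0 if we end up where we began, -1 on any failure. */
int visit_and_return(const char *dir)
{
    struct stat before, after;
    int saved;

    assert(dir != NULL);
    saved = open(".", O_RDONLY);    /* needs read permission on "." */
    if (saved == -1)
        return -1;
    if (fstat(saved, &before) == -1 || chdir(dir) == -1) {
        close(saved);
        return -1;
    }

    /* ... work with names relative to dir would go here ... */

    if (fchdir(saved) == -1) {      /* identifies the directory by fd */
        close(saved);
        return -1;
    }
    close(saved);
    if (stat(".", &after) == -1)
        return -1;
    return (before.st_dev == after.st_dev &&
            before.st_ino == after.st_ino) ? 0 : -1;
}
```

For example, visit_and_return("/") should return 0 when started from a readable directory. The descriptor identifies the directory itself, so the return works even if the directory was renamed in the meantime.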
I think it's not proper to regard [f]unlink() as operating on a file.
Unlink only operates on directories (which can be simplified as
ordinary files, although this is not quite right).
In most systems you delete files (that's also what probably anybody,
including those who are familiar with the internal workings of a
typical Unix system, is thinking when he's working with the system.)
However, that's not what's being done on Unix. The kernel deletes
files, opaquely and behind the scenes (if necessary with the help of
fsck after an unexpected reboot.) Not the user, who just removes
reference entries from directory files. Of course you know all that,
I just want to emphasize it to support my argument.
That the "garbage collection" of files which are no longer referenced
is more or less immediate by using a simple reference counter in the
file's on-disk structure imho should be regarded as an implementation
detail. For proper operation, one could also conceive to have a
process which regularly scans the filesystem and collects files which
are no longer referenced (like the memory management of certain
programming languages does.) It is irrelevant to the exported API how
exactly this is implemented. Unlink() should only work as an editing
operation on directories. Any contrived operation that tries to find
a proper name for an open file through its descriptor is a rather
unclean thing which probably cannot be done correctly for all cases
and is way beside the design of the Unix filesystem.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
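The reference counting described in this message can be watched from userland through st_nlink. The following is a small sketch under the assumption that /tmp is on a filesystem supporting hard links; link_count_demo() and nlink_of() are made-up names. It checks that link() and unlink() only edit directory entries and the count, and that the data stays readable through an open descriptor after the last name is gone.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>     /* for the caller's assert(), see the note below */
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Link count of the file open on fd, or -1 if fstat() fails. */
static long nlink_of(int fd)
{
    struct stat sb;
    return fstat(fd, &sb) == 0 ? (long)sb.st_nlink : -1;
}

/* Returns 0 if the count goes 1 -> 2 -> 1 as names are added and removed
 * and the data survives removal of the last name; -1 otherwise. */
int link_count_demo(void)
{
    char dir[] = "/tmp/nlink-XXXXXX";
    char a[PATH_MAX], b[PATH_MAX], buf[4];
    int fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(a, sizeof a, "%s/a", dir);
    snprintf(b, sizeof b, "%s/b", dir);

    fd = open(a, O_RDWR | O_CREAT | O_EXCL, 0600);
    ok = fd != -1
         && write(fd, "data", 4) == 4
         && nlink_of(fd) == 1
         && link(a, b) == 0   && nlink_of(fd) == 2  /* second name added */
         && unlink(a) == 0    && nlink_of(fd) == 1  /* first name removed */
         && unlink(b) == 0                          /* no names remain */
         && pread(fd, buf, 4, 0) == 4               /* file still readable */
         && memcmp(buf, "data", 4) == 0;

    if (fd != -1)
        close(fd);      /* storage is reclaimed after the last close */
    unlink(a);          /* no-ops when the sequence above succeeded */
    unlink(b);
    rmdir(dir);
    return ok ? 0 : -1;
}
```

A main() containing assert(link_count_demo() == 0); runs the check. The count after the final unlink() is deliberately not tested, since what a descriptor reports at that point can vary by filesystem.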
In addition, the only proper semantics for an funlink() system call I
could see would be to set the reference counter in the inode to zero,
close/invalidate the file and all descriptors to it and have the
kernel, some external process or fsck at reboot remove all references
from directories. It might be surprising for users who are accustomed
to what unlink() does but it would be consistent with the
file/dir-entry schism. It would indeed delete the file, not a
directory entry, which is literally what the caller requested.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
No, not zero -- just decremented by one (assuming the proper directory
No, absolutely NOT! unlink() doesn't do this for VERY good reasons, and as I say in my other reply this is impossible to do safely without locking the whole filesystem (or unmounting it or going to single-user mode and assuming the admin is clueful). Forcing the user of funlink() to suffer through an internal ftw() and still risk a failure if the link count is not exactly one is one thing, but forcing the whole system to do without a filesystem for the time it
That's completely wrong given how I've described funlink().
-- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Sure it could. If you want it to be an operation on the file, as opposed to a link to the file, it has to be _something_ that is independent of any particular pathname to the file. Otherwise, you're making it an operation on a link that was at some past time (save pathname at open() time) or is now (search filesystem) linked to the file, and only secondarily an operation on the file itself. You have to go through great contortions (saving pathnames, searching filesystems, checking that nlink==1 and/or that the pathname still refers to the same file) to make funlink() perform the same operation unlink() does. This is a clue that what you are trying to do is inappropriate. I don't know where you got this resistance to comprehending that unlink destroys links to files, only secondarily affecting the files themselves, but it's clear to me that you have it. You insist on trying to somehow tie an open file descriptor to a particular link to its file, and, since file descriptors refer to files, not links to files, you're having trouble. Changing file descriptors to refer to not files but particular links to files would be a major philosophical change, and, as you're discovering, would demand either a thorough redesign or some extremely heavy performance penalties. It probably could be done, but I see no point, and I do not understand why you persist in trying to do it. In any case, I see no point in discussing it further with you. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
But it should, not, indeed must not, else a fundamental feature of how No, not really -- it should be obvious to anyone aware of how unix filesystems are structured that implementing funlink() has it A file is more than its contents -- it is all the metadata that controls access to the content and allows the content to sit on the same storage media along with many other distinct files. Unlink() cannot, and luckily does not, "just" destroy links to files. The primary purpose of unlink() is _also_ to decrement the link count in the file metadata. Unlink() _always_ acts on both a file and the directory entry which points to that file. However directory entries (names) are just pointers to the real files. The only part of what unlink() also often does in addition to those first two critical functions, which could safely be left for some cleanup daemon to do, would be the moving of data block pointers from inodes which have a link Indeed since that is the only way to implement the semantics of funlink() as I've described it. It's not such a difficult thing to do for the most common case, even if one doesn't cache the opened filename, though it could potentially incur a lot of disk reads.... -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Of course it is and I only gave this example to show how nonsense
an "funlink()" call would be -- because the only valid behaviour
in the current framework (clearing of inode / invalidating all fds)
would be totally impractical.
Now can we stop this thing? Or should I also propose some b*llsh!t
system call and have a hundred-mails long thread debate its uselessness?
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
Your example was flawed though, even for its intent, since it did not describe "the only valid behaviour in the current framework". In fact it described a completely different behaviour unrelated to the funlink() call I initially proposed.
I described, informally, the valid behaviour for funlink() as I saw it, along with its API and its limitations. I did so in order to make a point and to hopefully help give some rationale for some other related proposals. funlink() is the most natural way one naive of the internals of Unix filesystems might think to use to make it possible to allow one to safely get away without using temporary directories for temporary files. Indeed the lack of a funlink() call has been mentioned, and the consequences of this lack discussed, by several experts in secure programming practices.
However as I've already said I do agree it is less practical to actually implement funlink() than it is to simply use the existing mechanisms which do allow safe use of temporary files in temporary directories.
Unfortunately I still see too many NetBSD programs creating and removing temporary files directly in /tmp and /var/tmp, and of course the default sysinst still creates those directories on filesystems which also contain other sensitive information. While setting the sticky bit on those directories can protect most such programs from the most obvious attacks, I don't believe the most important of these programs (i.e. those most often run as root) will actively check to make sure that either the sticky bit is set or at least that the directory is a
I guess you're not someone who enjoys a good academic discussion even if its merit is only academic, and/or you don't care that people can learn from it and that it can lead to true innovation of related things, e.g. O_MKDIR.
I hadn't really considered O_MKDIR before, but having the occasion to re-read in the right context code designed to facilitate the safe creation and disposal of temporary files where ...
Greg, in my opinion it is. The functionality you describe for
sure is something desirable but it probably would be better kept
as an ordinary utility function in libc, perhaps with a different
name. As a syscall, as I have voiced in my opinion, it simply
doesn't fit (imho, of course).
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
I really don't want to sound condescending, but I do get the distinct feeling that you don't really understand the underlying reason why funlink() is desirable in the first place. The underlying goal of having a system call that can unlink a file when given a file descriptor open on that file is to avoid an unfortunately common insecure programming technique commonly called a "Time-Of-Check, Time-Of-Use (TOCTOU) race condition". Calls to unlink() are vulnerable if they are passed the fully qualified pathname of a file that was created in or under an insecure (i.e. world-writable) directory, even if that path is checked for vulnerabilities and the file's metadata is compared to that of the originally created file before the unlink() call is made. Implementing funlink() in userland would simply move the race condition to a new place and thus be no fix at all. However a system-call implementation of funlink() could ensure the new race condition is impossible, thus ensuring the functionality fulfils the underlying requirements. Indeed it would be dangerous to imply that a userland implementation of funlink() could do something that, as a userland implementation, it most certainly could not possibly do. As we've explored funlink() doesn't make sense as solution for unix-like systems for very different and far more practical reasons, and it is the explanation of those reasons that leads to learning what must be done instead and how the alternatives could be optimized. -- Greg A. Woods +1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca> Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>
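The check-then-unlink pattern described above looks roughly like the sketch below. It is shown to make the race window visible, not as a recommendation; checked_unlink() and its demo are made-up names. The demo runs inside a private directory created with mkdtemp() (mode 0700), which is the "safe sub-directory" arrangement under which no other unprivileged user can exploit the window.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Check-then-use: compare the path's metadata with the open file, then
 * unlink by name. Returns 0 on removal, -1 otherwise. Racy by design. */
int checked_unlink(int fd, const char *path)
{
    struct stat by_fd, by_path;

    assert(path != NULL);
    if (fstat(fd, &by_fd) == -1 || lstat(path, &by_path) == -1)
        return -1;
    if (by_fd.st_dev != by_path.st_dev || by_fd.st_ino != by_path.st_ino) {
        errno = EINVAL;     /* path no longer names the file we opened */
        return -1;
    }
    /* RACE WINDOW: anyone able to modify a directory along path can swap
     * an entry here, after the check above and before the use below. */
    return unlink(path);
}

/* Self-check in a private 0700 directory, where the window is harmless. */
int checked_unlink_demo(void)
{
    char dir[] = "/tmp/toctou-XXXXXX";
    char path[PATH_MAX];
    int fd, ok;

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/tmpfile", dir);
    fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);
    ok = fd != -1 && checked_unlink(fd, path) == 0
         && access(path, F_OK) == -1 && errno == ENOENT;
    if (fd != -1)
        close(fd);
    unlink(path);           /* no-op when checked_unlink() succeeded */
    rmdir(dir);
    return ok ? 0 : -1;
}
```

A main() containing assert(checked_unlink_demo() == 0); runs it. No amount of extra checking in userland closes the window between lstat() and unlink(), which is the point made above about a libc implementation.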
IMHO, the best solution (albeit outside the established Unix framework)
would be to fully separate operations on directories and the flat file
system (inodes/device-numbers or equivalents)... There would be an
operation, let's call it lookup() : pathname -> identifier, to
translate a symbolic pathname into a more or less opaque identifier
(similar to an fd) which is unique and reversible for both the
referenced file and the directory entry during the period of its
allocation to the process. Open(2) would then take this identifier to
actually open the file, not a pathname. The advantage would be that
the application would have a handle on the actual directory entry,
other than the volatile pathname. One could then use something like
funlink() on that identifier to delete the directory entry and simulate
unlink() without having to care for the case that a new entry with the
same name has been established in the directory in the meantime.
Unfortunately this poses several problems: some filesystems cannot
easily produce such an indirection, it has to be emulated on them
(should be feasible, though), and it doesn't work with the current
established Unix filesystem API. It somewhat surprises me in hindsight
that such an approach was not taken in the original Unix
implementation. However, I guess, this is influenced by the fact that
most likely the open()-style API was established already before Unix
got real directories (and perhaps also to keep it stupid simple.)
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
Yes, "file handles" as some folks call them. They are effectively
We already have fhopen(2), fhstat(2), and fhstatfs(2). Of course these calls (including getfh()) are currently restricted to the superuser because they have not had the necessary ACL semantics defined for them. We are missing at least fhchdir(2) (though for the superuser this can be
You can't do that (directly) in any filesystem that has all of the same semantics as a unix filesystem. File names are just pointers to files that exist in special files called directory files. By convention the first file (inode #0) in a filesystem is also the root directory for the namespace we lay over top of the filesystem. By convention we have the first two entries in a directory file point to the directory file itself and the parent directory file. However by convention we do not have the filename(s) recorded in the files themselves and thus the only way to find the name for a file is to traverse the directory structure until one encounters a name pointing to the file in question. Of course since a file may have more than one name there's never any sure way to know if the name encountered is the one the user had in mind for such a multi-named file. Finding all the names for a file is of course possible (especially since the link count tells us how many to look for), but it still doesn't help decide which was intended by the user.
Note that we don't want to try to record the filename(s) in the file because there would be significantly more overhead and complication to maintain those "reverse pointers", especially if you consider the number of possible updates needed in a hierarchical filesystem for an operation such as "mv /usr /user". We also don't want to do this because we don't want to have to allocate variable numbers of disk blocks for one inode (which we would likely end up having to do sometimes if a file had many names, even on filesystems with large blocks).
Once you begin down the path ...
Ok, I didn't know about file handles. They're also not what I meant. I wanted a handle on the directory entry (essentially the pathname) as some kind of invariant representation of a particular directory entry at a discrete point (originally specified by pathname). This would then be used instead of a pathname in open(2) etc. That this handle also points to a specific file (through the dir-entry's data) is a rather less interesting detail in this context. It's just so that the application's got a more direct grip on the directory entry through which a file is to be opened, for later use (such as your proposed funlink()). The particular entry in the directory would be marked somehow (or locked), so if the entry gets unlinked in the meantime it won't get reused and overwritten immediately -- something like a zombie entry. This would persist until the process releases the handle, or exits.
funlink() could then be implemented as follows. You want to preserve atomicity for security reasons; no problem -- you could have a system call which, instead of expecting a pathname like unlink(), would accept such a directory entry handle, the same one conveniently passed to open() instead of a real pathname in your particular program (but it is not necessary to open the file, of course.) The system would know which entry in which directory exactly was used for opening a particular file (it would associate that information with this handle.) It could then unlink the entry from the directory and do the other cleanup stuff (like decrementing the counter in the inode, etc.) without having to fear colliding with a newly generated pathname of the same name that was originally used to obtain the directory entry handle.
Of course this is of theoretical nature; building an extra API for that would be ugly, unportable and few people would use it (especially not existing applications). If at all, the underlying API of open() etc.
would have to be changed to this design, which is not feasible, ...
Oh, I know they're not exactly what you meant, but you can't (easily) have what you meant -- it's next to impossible (and certainly not ever practical) to implement what you wanted with any unix-like hierarchical filesystem that allows multiple hard links and rename operations on directories. What you're suggesting is massively more complex and obviously much more invasive than my simple little funlink(2) idea! ;-) (and it doesn't seem to offer much beyond what funlink() and/or But a directory handle isn't a representation of a pathname (though one can intuit the name of the directory by walking back down to the root directory and ascertaining the pathnames of each previous directory along the way). You'd have to do this for all the directories in the path, and you'd have to somehow lock all the directories in the path in order to make sure that every rename() involving any such directory could co-operatively update those handles. I.e. you're just beginning to enter the twisty little maze of passages If you want to keep the open(2) API intact then you could try to implement it as a function call that's the moral equivalent of fhopen(getfh(path)). Either way the only logical thing to do (without affecting the open() API) is to make it easy to translate a file descriptor back into the file handle it came from. I haven't thought of the implications of having file descriptors in user-land, though presumably everything could be transmuted to use file handles, including even descriptor passing through AF_LOCAL sockets. The tricky parts involve seek pointers and such -- and that's something I always get very confused about unless I diagram it all out on a big That's part of the problem, not the solution. Please study the example safe_dir() implementation in the book I referenced. You can find it in here: http://www.buildingsecuresoftware.com/bss_examples-1.0.tar.gz Unfortunately that's not (yet) true. To quote from the fhopen(2) manual page on NetBSD: ...
Yes, "file handles" as some folks call them. They are effectively vnodes in the *BSD terminology. "file handles" and "vnodes" are not the same thing. i can have multiple file handles for the same vnode...
How so? There's a 1-to-1 correspondence between the output of getfh() (and VFS_VPTOFH()) and the in-core vnode, which is rather important for an NFS server. ;-) Take care, Bill
Because if you have the file handle, then you can spoof NFS traffic to it, and spoof your UID to be something that can access the file. Or you can snoop NFS traffic and get a file handle, then use fhopen() to open something you don't have path permissions to access. Thus for now you have to be root, as these calls can circumvent too many security checks. ;-) Take care, Bill
The reason we have the three fh calls we do is because when I created them, I didn't see any need for anything else. Once we have fhopen(2), we have all of the fxx(2) calls. fhstat(2) and fhstatfs(2) are two calls that made sense to be usable w/o having to open the file (say if it's a device :-) . All the others were ones where it looked like it was better to have the file open, or ones where it didn't seem fundamentally important enough to have a syscall when you could just fhopen() then f_whatever_(). Take care, Bill
Yes, true enough -- unless fhopen(2) enforces the access rights checks that open(2) would have enforced. In that case fhchdir(2) gives one the same advantage that could be afforded by O_NOACCESS. -- Greg A. Woods +1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca> Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>
NetBSD actually has a mechanism that works like this. Look at getfh(2) and fhopen(2). It's designed for implementing NFS servers, so there's no You suggested that the flat file-system would use i-nodes, not directory entries. The distinction is important (and file handles refer to i-nodes, so they wouldn't actually be useful for funlink()). -- Ben Harris
As an observer of this whole exchange, I'd have to say that a lot of us don't have the "understanding" of why funlink() is desirable, because no one who's championing "it" has really defined it nor explained what it's to do. As best I can tell, you and greywolf have had different things in Then my suggestion is funlink(fd, path). The call does a namei lookup on path, and if it gets the same thing as fd, removes it. And it can return different errors depending on: unable to lookup path (permissions along the way), unable to unlink (write permissions on parent dir), no such file, file and fd differ, multiple links(*), or file already unlinked (0 links). (*) Some flag may be in order to indicate if the unlink should proceed if there are multiple links yet the one we're asked to remove exists; an unlink in this case should get a different error code. While this is an error in the "I've made a temp file and tried to unlink it" case, there are other cases where I can see it may be useful. Since the program should (or can) know the path the file should have, let it take care of remembering it. That way only the cases that need this bother with it. Take care, Bill
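For readers who want to see the shape of funlink(fd, path), the following is only a userland approximation under stated assumptions. The names funlink_sketch() and anonymous_temp(), the allow_multiple_links flag, and the errno values chosen for each failure are all invented here. It is also not atomic: the path can be replaced between lstat() and unlink(), which is exactly why the proposal is for a system call that does the comparison and removal with the vnode locked.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical approximation of funlink(fd, path): remove `path` only if
 * it currently names the file that `fd` refers to.
 * Returns 0 on success, -1 with errno set on failure.
 */
int
funlink_sketch(int fd, const char *path, int allow_multiple_links)
{
    struct stat fst, pst;

    assert(path != NULL);
    if (fstat(fd, &fst) == -1)
        return -1;                  /* bad descriptor */
    if (fst.st_nlink == 0) {
        errno = ENOENT;             /* file already unlinked (0 links) */
        return -1;
    }
    if (lstat(path, &pst) == -1)
        return -1;                  /* lookup failed: ENOENT, EACCES, ... */
    if (pst.st_dev != fst.st_dev || pst.st_ino != fst.st_ino) {
        errno = EINVAL;             /* file and fd differ (arbitrary errno) */
        return -1;
    }
    if (fst.st_nlink > 1 && !allow_multiple_links) {
        errno = EMLINK;             /* multiple links, caller did not opt in */
        return -1;
    }
    return unlink(path);            /* may still fail: no write permission */
}

/* The temp-file case: create exclusively, then drop the name again. */
int
anonymous_temp(const char *path)
{
    int fd, saved;

    fd = open(path, O_RDWR | O_CREAT | O_EXCL, 0600);
    if (fd == -1)
        return -1;
    if (funlink_sketch(fd, path, 0) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}
```

As in the suggestion above, the program remembers the path itself, so only the code that wants this check pays for it.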
Not really -- file descriptors inherited from parent processes: shell redirections and such. But the problem is not race conditions in world writeable directories -- that is already solved. Sticky bits, etc. Don't make mode 777 directories for temporary files. I can't see why we need a new system call either. Are we certain the horse is dead yet? Franchement! -w
Yes, but come on, how many programs really really are going to want to What Greg is talking about here is a way a program can make & remove temp files and know they are gone all by itself. What you're talking about requires administrative assistance. While it is easy to do, the program has to trust the admin to have gotten things right. While I'm not saying get rid of sticky bits and such, I can see the utility of a call that a program can use to make SURE that a file has been unlinked. The only way you can be really sure the path and the file descriptor are the same is if you do the comparison & removal without unlocking the vnode. Since we're talking about vnode locks, we're talking about code in the kernel. Thus a system call. :-) Take care, Bill
That is easy enough to do - just fstat() the fd and check that its link count is zero. The problem is not making sure that it got unlinked. The problem is making sure you don't unlink something else by mistake. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
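The check described here is short enough to show. This is a minimal sketch and the helper names fd_is_unlinked() and unlink_and_confirm() are made up for illustration. It only answers "did it get unlinked?", not the harder question of whether the right thing was unlinked.

```c
#include <assert.h>
#include <fcntl.h>      /* open(2), for callers that create the file */
#include <sys/stat.h>
#include <unistd.h>

/* Returns 1 if fd has no names left, 0 if it still has links, -1 on error. */
int
fd_is_unlinked(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return -1;
    return st.st_nlink == 0;
}

/* Remove `path`, then report whether the file behind `fd` is now nameless. */
int
unlink_and_confirm(const char *path, int fd)
{
    assert(path != NULL);
    if (unlink(path) == -1)
        return -1;
    return fd_is_unlinked(fd);
}
```

A result of 0 from unlink_and_confirm() means a name was removed but the open file still has at least one link, either because it has other names or because the path named a different file.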
Hi, Some 31ee7 hAx0r group will surely find a way to, err, make use of unlinking stdin/stdout, if they find a vulnerable networked program. Seriously: think of passing some open descriptors to plug-ins, then letting _them_ dispose of the file... although you could open, unlink, then call the plugin, if it doesn't need the filename. -is -- seal your e-mail: http://www.gnupg.org/
Since you're passing the fd in, why can't you pass in the path? I'm not saying it's unreasonable for a module to unlink descriptors. I'm saying it is unlikely that they will need to unlink (as opposed to close) ones for which it can't find/know the path. I'd really expect that in something like this, you'd do the unlink in the same code block that did the open. That way, if you have an error unlinking, you don't use "not-really anonymous" temporary files. Take care, Bill
I'm (still?) not convinced this is a problem. Given O_EXCL, stat() and fstat(), and sticky directories, I can't see what the danger is. I discount admins doing things like leaving /usr world-writeable, or /tmp non-sticky. I consider that akin to leaving /dev/mem mode 666: you can do it, but you have nobody to blame but yourself. And it most certainly is not for the system to protect you from the consequences. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
Thus spake Greg A. Woods ("GAW> ") sometime Saturday... GAW> > I don't know where you got this resistance to GAW> > comprehending that unlink destroys links to files, only secondarily GAW> > affecting the files themselves, but it's clear to me that you have it. GAW> GAW> A file is more than its contents -- it is all the metadata that controls GAW> access to the content and allows the content to sit on the same storage GAW> media along with many other distinct files. Gee, didn't you just say that "a file is more than its metadata -- it is all the content" not too long ago? GAW> Unlink() cannot, and luckily does not, "just" destroy links to files. Actually, that is all it does. The fact that the data goes away and gets GCd when the link count and the reference count (in-core) go to zero is a side effect of the link count and the reference count (in-core) going to zero. GAW> The primary purpose of unlink() is _also_ to decrement the link count in GAW> the file metadata. Unlink() _always_ acts on both a file and the GAW> directory entry which points to that file. You're really trying to force the data and the metadata to be equal. They are not -- the metadata supercedes the data; the data is a child of the metadata. If the metadata goes away, there is nothing to hold onto the data. This is UNIX by design. The only thing that binds a pathname to a set of data is the fortuitous bunch of hoops that we must jump thorough to resolve to an inode, i.e. metadata. GAW> However directory entries GAW> (names) are just pointers to the real files. Correction: They are pointers to metadata (inodes). What those inodes reference is determined once they are accessed; what the references are specifically are determined once they are open()ed. GAW> The only part of what GAW> unlink() also often does in addition to those first two critical GAW> functions, which could safely be left for some cleanup daemon to do, GAW> would be the moving of data block pointers from inodes ...
True, but the modification of the link count "MUST" be done at the very same time that the directory entry is modified (just afterwards, but before the unlink() call returns). I.e. unlink() does do more than just clear the inode number in the directory entry, and it does, and "MOST", modify the file itself as well, and at the same time. (if you want to talk about side effects then talk about the modification I'm not trying to force the issue because I don't need to. They are exactly equivalent. You cannot have a file without its metadata, though Thank you for confirming that a file is inherently the metadata, and that the content data is only an adjunct to a true file (if and only if the file is a regular file which can contain data, and if and only if there is any content in the file). I.e. don't forget other "empty" files, such as devices and FIFOs. They No, not a "correction" -- real files _are_ the "inodes". Period. The data they contain, or the lack thereof, is irrelevant here. File _NAMES_ are the pointers to the real files. Please try to keep No, I am not. You have clearly not paid any attention whatsoever to the core of this discussion. funlink() as I've described it can in fact be "trivially" implemented totally transparently to any existing kernel function or data structure (in-core or on-disk). Obviously such a "trivial" implementation would not be the most efficient or optimized implementation, but it would work just fine without changing any existing filesystem semantics or syntax. -- Greg A. Woods +1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca> Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> > GAW> Unlink() cannot, and luckily does not, "just" destroy links to files. GAW> > GAW> > Actually, that is all it does. GAW> GAW> Yes, OK, black is white. Now you're just being stupid. Coming from you, I'm going to take that as a compliment. GAW> > The fact that the data goes away and gets GCd when the link count and the GAW> > reference count (in-core) go to zero is a side effect of the link count and GAW> > the reference count (in-core) going to zero. GAW> GAW> True, but the modification of the link count "MUST" be done at the very GAW> same time that the directory entry is modified (just afterwards, but GAW> before the unlink() call returns). And this relates to the price of beer in Germany just how, again? GAW> I.e. unlink() does do more than just clear the inode number in the GAW> directory entry, and it does, and "MOST", modify the file itself as GAW> well, and at the same time. Ah. I see what you're getting at. You're trying to tell me that the file is the inode, and that data is connected with said inode is a side effect. I get it. (Not.) GAW> (if you want to talk about side effects then talk about the modification GAW> of the timestamp fields in the file, or other TRUE side effects) Yes, those are there, too. GAW> > You're really trying to force the data and the metadata to be equal. GAW> GAW> I'm not trying to force the issue because I don't need to. They are GAW> exactly equivalent. You cannot have a file without its metadata, though GAW> you can have an empty file, i.e. a file with no content data! You cannot have a file's *data* without its metadata. You can have metadata that corresponds to zero allocation to data (think "device"). GAW> > They are not -- the metadata supercedes the data; the data GAW> > is a child of the metadata. If the metadata goes away, there is nothing GAW> > to hold onto the data. This is UNIX by design. GAW> GAW> Thank you for confirming that a ...
Actually they're talking about increasing the alcohol tax.. I hope
for Greg he doesn't have a hand in it...
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
Thank you, finally. FYI you'll find essentially the same definition in Perhaps you've never done anything even remotely like "find -x /mountpoint -inum 12345 -print". You can't have. Otherwise you'd probably understand at least some of what I'm talking about. (Yes, It's not impossible, or even necessarily impractical given the lack of better alternatives for safe and secure programming in some situations, so, it is still something to think about. I do think der Mouse's O_NOACCESS flag (and perhaps my O_MKDIR flag as well), along with the extra fchdir() and fstat() calls they require, is better than funlink(2) since they have a better chance of completing successfully, even in the normal case, and a potentially much better chance of giving useful diagnostics in the failure case and doing so in a timely fashion. However until you understand what I'm talking about you can't even begin to sanely discuss funlink(2) or its alternatives on the same level playing field. I.e. unless and until you can understand funlink(2) as I've described it, together with all its implications and limitations and possible optimisations, you cannot possibly even begin to make any fair or reasonable assessment of any of the alternatives to this most obvious solution to the underlying problem (i.e. the problem which prompted me to open the discussion in the first place). -- Greg A. Woods +1 416 218-0098 VE3TCP RoboHack <woods@robohack.ca> Planix, Inc. <woods@planix.com> Secrets of the Weird <woods@weird.com>
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> > For your definition that file == inode, fine. GAW> GAW> Thank you, finally. FYI you'll find essentially the same definition in GAW> all the serious books about Unix internals. Sorry, I always took "file" to mean "data", as opposed to a hard-line definition that "file" == "inode". GAW> GAW> > name -> inode. Great. Fine. Whatever. N:1 GAW> > inode -> name. 1:N. Pain in the patella. GAW> GAW> Perhaps you've never done anything even remotely like "find -x GAW> /mountpoint -inum 12345 -print". You can't have. Otherwise you'd GAW> probably understand at least some of what I'm talking about. (Yes, GAW> that's meant entirely as sarcasm.) Oh, no, not at all. I've NEVER been THAT adventurous. ;) [I refuse to buy into the premise that I'm as stupid as you seem to think.] GAW> > As the semantics exist right now, it is not a practical idea. GAW> GAW> It's not impossible, or even necessarily impractical given the lack of GAW> better alternatives for safe and secure programming in some situtations, GAW> so, it is still something to think about. No, it's not impossible, but you must define the level of practicality you're discussing here. If you maintain a table of reverse lookups in conjunction with open/rename/ link/unlink/mkdir/rmdir, somewhere, that incurs the overhead of rewriting the system calls to handle that; if, at any time, that scrambles the API, it becomes impractical, because things will then not behave as defined. As it stands, such a rewrite is unlikely. The other choice is to trudge through every directory on the filesystem and perform unlink()s (effectively) on them. This is time-consuming; seeing as system calls which do this sort of thing are not expected to take quite that long, and that they are expected to lock objects until they complete, this approach is impractical. GAW> I do think der Mouse's O_NOACCESS flag (and perhaps my O_MKDIR flag as GAW> well), along with the ...
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> > could see would be to set the reference counter in the inode to zero, GAW> GAW> No, not zero -- just decremented by one (assuming the proper directory GAW> entry can be found). Greg, file descriptors are not associated with pathnames. There is no "proper directory entry". unlink() operates on a pathname (i.e. dirent). Only. That's it. GAW> No, absolutely NOT! unlink() doesn't to this for VERY good reasons, and GAW> funlink() could not do so either. unlink() operates on a pathname from which an fd might otherwise be generated. funlink would work on an fd, from which it is not possible to divine a unique name. Matthias is saying effectively that funlink() is tantamount to clri(), plus revoke() semantics on all open fds to the node that just got funlink()ed (something I'm not altogether sure I agree with, since you might actually want the reference, but not the physical object, to remain present.). In short, he gets the concept. GAW> > It would indeed delete the file, not a GAW> > directory entry, which is literally what the caller requested. GAW> GAW> That's completely wrong given how I've described funlink(). You want a 1:1 relationship between filehandles and pathnames, more or less. Who's been taking lessons from M$, now? --*greywolf; -- NetBSD: We Suck Less
How is it that suddenly you have absolutely no imagination at all? File descriptors are associated with files. Filenames are associated with files. If you can't make the connection implied by these two axioms then I would humbly suggest you're not seeing the whole picture with even remotely enough clarity and understanding to make this discussion useful No, I absolutely do not. If you think this then by now I can only conclude that you have not read all of what I wrote carefully enough. However I do know that for the case where a file has a link count of exactly one then there is guaranteed to be a one-to-one relationship with one particular pathname in the directory tree of the filesystem that file belongs to. Because of this guaranteed truth I know that it is possible, if somewhat costly, to implement funlink() as I've described it such that it could be safe and useful to use in the vast majority of situations where it might make sense to be used. I also know that one could spread the cost around a little bit and further increase the reliability of funlink(), as I've described it. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Yes. Therefore, a file descriptor can in principle be mapped to a set of pathnames. (And a pathname can in principle be mapped to a set of file descriptors.) You are trying to add a way to go from a file descriptor to one particular member of that set of pathnames. (And then do something with it, but the hard part is that mapping. Depending on which message I read, either you're willing to error out unless the set has only one element, or you want a rather ill-defined member of the set, one somehow related to how the file descriptor was obtained.) This is possible to do. It is also a major philosophical shift, with the corresponding design shift that implies. It may be an interesting thing to consider when designing a new OS; it is of no particular value for an existing one with an existing commitment to the present design (and the philosophy behind it). Hint: this list is about the latter. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
I think I've always been willing to have the funlink() call fail with an error if it finds the link count of the file to be greater than one. In any case that has always been my intent. The ultimate goal is to find a safer way for a process to unlink a temporary file. The ability to do this to a file passed via stdin, etc., is only a secondary feature. (and one for which I still can't think of any good application :-) The caching of the opened filenames obviously optimizes the implementation of funlink(2) significantly, but of course at some expense that must be shared with all open() calls (though I suppose another O_CACHENAME flag could be added to help tune it :-). The caching of the opened filename also helps deal with the case where the file has gained additional links, but the original pathname still points to the opened file. In fact I'd be just as happy to have funlink() fail if the original pathname no longer existed -- i.e. do away with the internal ftw() idea entirely and rely only on the cached opened filename, especially if the need for funlink(2) was anticipated (as it normally would be) such that the filename caching would only be done when explicitly requested. Indeed one would expect no better chance of success if some third part process had renamed the file to be unlinked behind one's back, so to speak. The thing we're trying to avoid is some symlink replacing a directory in the pathname of file being renamed such that the wrong file is unlinked. Obviously several other things have to be broken for such a situation to result in a true vulnerability, but still the availability of funlink(2) would eliminate the need to always carefully fchdir() into a safe directory, and indeed may eliminate some of the I see no philosophical shift implied by my funlink(2) proposal, least of all any that one might call "major". The "trivial" implementation of funlink() is (logically at least) no different than using "find -inum", but of course it can be made ...
> [Greg Woods going on about funlink()] If I've understood you right, it would be just as satisfactory to have a call unlink_if_pathname_matches_fd(const char *path, int fd), which would work like unlink() applied to its first argument, but only if that pathname resolves to the same object the fd is attached to. (Modulo name issues, of course; I'm not thrilled with that name. :-) For some uses, it might be even better to have an O_UNLINK flag, which would be abstractly equivalent to opening the file with O_CREAT|O_EXCL, but then unlink()ing the path, the critical part being that this sequence is atomic with respect to all other filesystem operations. /~\ The ASCII der Mouse \ / Ribbon Campaign X Against HTML mouse@rodents.montreal.qc.ca / \ Email! 7D C8 61 52 5D E7 2D 39 4E F1 31 3E E8 B3 27 4B
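O_UNLINK does not exist, so the closest thing available today is the two-step version below. This is a sketch only: open_unlinked() is a made-up name, and the sequence is not atomic with respect to other filesystem operations, which is the whole point of the proposed flag. The fstat() at the end catches one symptom of losing that race, namely somebody adding a second hard link between the open() and the unlink().

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Non-atomic stand-in for the hypothetical O_UNLINK flag: create the
 * file exclusively, then remove its name. Returns a descriptor for a
 * nameless file, or -1 with errno set.
 */
int
open_unlinked(const char *path, mode_t mode)
{
    struct stat st;
    int fd, saved;

    assert(path != NULL);
    fd = open(path, O_RDWR | O_CREAT | O_EXCL, mode);
    if (fd == -1)
        return -1;                  /* EEXIST if the name was already taken */
    if (unlink(path) == -1 || fstat(fd, &st) == -1) {
        saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    if (st.st_nlink != 0) {         /* someone linked it in the window */
        close(fd);
        errno = EMLINK;             /* arbitrary choice of errno */
        return -1;
    }
    return fd;
}
```

Usage is the same as any scratch file: write to the returned descriptor, and the storage goes away when the last descriptor is closed.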
I think you're forgetting that a "unix file" includes its metadata (especially if you're talking about kernel internals) and unlink() most certainly always operates directly on the metadata of a file, even if the link count is greater than one since the link count is always Unlink() _also_ operates on directories, but most importantly it Ah, no, that's wrong. The kernel most definitely always decrements the link count of an inode when unlink() is called. It _also_ zeros out the The directory reference is only a part of the picture -- if you ignore the link count in the inode then you have failed to understand the unix The fact that the unix filesystem is primarily a table of inodes is no No, one could not conceive of such a thing since it would have to be run either in single user mode or while the filesystem was not mounted (or while the whole filesystem is locked from modification). Do you forget that Unix systems are inherently multi-processing systems? It is absolutely fundamental and critical that unlink() modify the file's link count (as well as of course freeing the directory entry). (in fact the moral opposite is true -- it would be possible to not immediately free the directory entry if inodes included two reference counts instead of just one such count since an attempt to reference a file with no valid links but only remaining directory references could No doubt -- but there it is none the less. :-) -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> But that's the whole point of funlink() (and perhaps even some of the GAW> other f*() calls, such as fchdir()) -- turn an operation on a file into GAW> an operation on a filename (i.e. a link to a file). <setattr param="baud" value="110"> fchdir does not operate on a filename. It operates on a DESCRIPTOR which ultimately HAS NO MEMORY, CACHE or CONNECTION _whatsoever_ to the original NAME. This DESCRIPTOR ultimately contains a reference to a VNODE which points to a FILESYSTEM OBJECT, i.e. METADATA. There is a difference between METADATA and a FILENAME. The FILENAME gets you the location, ultimately, of the METADATA, at which point the ONLY THING that is going to remember the FILENAME is going to be the *user's program*. The things that are pretty much stopping funlink from being practical are that potentially many names correspond to a particular file, and there needs to be some cached pointer to the dirent being referenced. QED. Now, is there anyone ELSE (put your hand down, Greg) who wishes to refute this? --*greywolf; -- NetBSD: the free unix for the rest of us.
On 1057854937 seconds since the Beginning of the UNIX epoch
I think that funlink(2) as an idea doesn't really make sense.
unlink(2) specifically operates on directory entries, not on files.
--
Roland Dowdeswell http://www.Imrryr.ORG/~elric/
Unlink(2) may specifically also operate on files as well as the link in the parent directory, and it will always also operate on the file if that file has only one link. Strictly funlink() isn't necessary for secure programming provided you have fchdir() and given you can always lstat() and then open() and then fstat() the parent directory in which you need to do the unlink() and thus to which you need to fchdir() before you call unlink(basename()). Strictly funlink() is also going to incur a lot more overhead than one or two additional fchdir() (and open()) calls. Thus I agree it doesn't make a whole lot of sense to implement funlink() unless you also want the ability to unlink a file you were handed on stdin, for example. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
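The lstat() / open() / fstat() / fchdir() / unlink(basename) sequence described above might look roughly like this. It is a sketch under assumptions rather than the safe_dir() code from the book: unlink_in_dir() is a made-up name, EAGAIN for "the directory changed between lstat() and open()" is an arbitrary choice, only the last component of the directory path is verified, and the function changes the process-wide working directory, so it is not thread-safe.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Unlink `name` inside `dir` after confirming that the directory we
 * opened is the one lstat() saw. Returns 0 on success, -1 with errno set.
 */
int
unlink_in_dir(const char *dir, const char *name)
{
    struct stat before, after;
    int cwd, dfd, rv = -1, saved = 0;

    assert(dir != NULL && name != NULL);
    if (strchr(name, '/') != NULL) {        /* must be a bare basename */
        errno = EINVAL;
        return -1;
    }
    if (lstat(dir, &before) == -1)
        return -1;
    if (!S_ISDIR(before.st_mode)) {         /* e.g. a planted symlink */
        errno = ENOTDIR;
        return -1;
    }
    if ((cwd = open(".", O_RDONLY)) == -1)  /* remember where we were */
        return -1;
    if ((dfd = open(dir, O_RDONLY)) == -1) {
        saved = errno;
    } else if (fstat(dfd, &after) == -1) {
        saved = errno;
    } else if (after.st_dev != before.st_dev ||
        after.st_ino != before.st_ino) {
        saved = EAGAIN;                     /* directory swapped under us */
    } else if (fchdir(dfd) == -1) {
        saved = errno;
    } else {
        rv = unlink(name);                  /* relative to the verified dir */
        saved = errno;
        if (fchdir(cwd) == -1 && rv == 0) { /* could not get back */
            rv = -1;
            saved = errno;
        }
    }
    if (dfd != -1)
        close(dfd);
    close(cwd);
    errno = saved;
    return rv;
}
```

That is two extra open() calls and two fchdir() calls per unlink, which is the overhead being weighed against funlink() in the message above.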
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> Date: Fri, 11 Jul 2003 13:58:02 -0400 (EDT) GAW> From: Greg A. Woods <woods@weird.com> GAW> Reply-To: NetBSD Kernel Technical Discussion List GAW> <tech-kern@NetBSD.org> GAW> To: Roland Dowdeswell <elric@imrryr.org> GAW> Cc: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.org> GAW> Subject: Re: funlink() for fun! GAW> GAW> [ On Friday, July 11, 2003 at 11:03:03 (-0400), Roland Dowdeswell wrote: ] GAW> > Subject: Re: funlink() for fun! GAW> > GAW> > I think that funlink(2) as an idea doesn't really make sense. GAW> > unlink(2) specifically operates on directory entries, not on files. GAW> GAW> Unlink(2) may specifically also operate on files as well as the link in GAW> the parent directory, and it will always also operate on the file if GAW> that file has only one link. Your logic is flawed. Please show how unlink(2) operates on files. - it doesn't write to them. - it doesn't create them. - it doesn't even really destroy them, though it arranges for them to be potentially destroyed once the link count goes to zero and the filesystem reclaims the blocks associated with them. --*greywolf; -- NetBSD: exercised any daemons lately?
Actually it does -- provided their link count is only one and soon to be zero. It doesn't (normally) write to their content of course -- just their metadata (i.e. it writes a zero into the link count, amongst other things!). (a unix file is an inode and any data blocks the inode may point to, including indirect blocks that point to other data blocks -- i.e. the inode is a part of what we call a "file" and thus a file is (often) How it works under the hood is irrelevant. For all you know unlink() on some super-paranoid system could implement a cryptographically secure 35-pass overwrite that must complete before the call returns. -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Thus spake Greg A. Woods ("GAW> ") sometime Today... GAW> Unlink(2) may specifically also operate on files as well as the link in GAW> the parent directory, and it will always also operate on the file if GAW> that file has only one link. No, it doesn't. The only thing that happens to the file, per se, is a side effect of the deletion of all physical entries associated with a record of the data. The operation of unlink(2) is only associated with a file in that said file is a special type of file called a directory. Otherwise, all that happens is the link count of the node representing the file drops to zero, the last physical entry which names that inode is removed, and, unless another process is holding open a file descriptor associated with that inode, the data blocks of that inode are unallocated and the inode is cleared. Strictly speaking, this has *nothing* to do with the file and *everything* to do with the file's metadata. Sure, the end result is that the data is lost, but when you have cleared all physical references to the metadata, what else are you supposed to do? :-) To pinball off in another direction and re-iterate: The only way that a funlink() call can hope to work is to maintain a table of (vno_t **) around which contain offsets into dirs represented by other (vno_t *), and if you need to funlink() a file, you can at least get the offsets of all the links, visit the dirs associated with them, clear their entries and remove the (vno_t **) from the table. We wouldn't even need to know pathnames if we knew dir vps and offsets into them for the entries. There's some details I am missing, I'm sure, but anyone reading this who does anything with fs internals will probably get the concept. I ran it by a few people yesterday and they seemed to understand it well enough, even if the design isn't fully fleshed out. Whether or not we need funlink(2), though, remains to be seen. I rather suspect we really don't. Other things to consider: Where does ...
You've obviously forgotten what a file is (in a unix filesystem), despite my many attempts to remind you. You cannot, and must not, Never. But not for the reasons you seem to be implying. As I've already said: perhaps. :-) -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
Well, for a traditional UNIX filesystem both can be implemented by *stat. For local filesystems without ACLs too. As soon as ACLs are implemented e.g. in FFS, NTFS, AFS to name a few *stat is _not_ enough to implement access or faccess. Even implementing proper ACL checks in user space might not be enough for network filesystems like AFS if you don't check the actual credentials. So basically an implementation of access or I like the idea of flink(2). With flink you could extend the creat(3) syntax (or open(2)) to allow creation of anonymous files on a filesystem given just one writable directory and later add a reference somewhere. E.g. passwd creates an anonymous file, updates it with the content of /etc/master.passwd and links it there. To cleanup mess with unexpected Me too.
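To illustrate the point about *stat versus access(2): on a filesystem with only classic mode bits, a user-space check built on stat(2) gives the same answer as the kernel. The sketch below is a rough illustration only. The helper names mode_bits_allow_read and compare_on_private_file are invented for this example, it handles read permission only, and it deliberately knows nothing about ACLs or server-side credentials, which is exactly where such a check would go wrong.

```python
import os
import stat
import tempfile


def mode_bits_allow_read(path):
    """Guess read permission from stat(2) data alone (no ACLs, no remote checks)."""
    st = os.stat(path)
    uid = os.getuid()  # access(2) also checks against the real IDs
    if uid == 0:
        return True  # the superuser is not limited by the read bits
    if st.st_uid == uid:
        return bool(st.st_mode & stat.S_IRUSR)
    if st.st_gid == os.getgid() or st.st_gid in os.getgroups():
        return bool(st.st_mode & stat.S_IRGRP)
    return bool(st.st_mode & stat.S_IROTH)


def compare_on_private_file():
    """Return (stat-based guess, kernel answer) for a fresh mode-0600 file."""
    fd, path = tempfile.mkstemp(prefix="access-demo-")
    try:
        os.close(fd)
        os.chmod(path, 0o600)
        return mode_bits_allow_read(path), os.access(path, os.R_OK)
    finally:
        os.unlink(path)


if __name__ == "__main__":
    print(compare_on_private_file())
```

The two answers agree here only because nothing beyond the mode bits is in play. With ACLs or a network filesystem, the os.access() result is the one that reflects what the kernel (or server) will actually allow.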
Apparently. There is only one anonymous object in the system -- why would you need a "fake" pathname to represent it? (there's no backing That would be _very_ expensive in terms of VM for every process since Indeed mmap(MAP_ANON) memory is zero-filled (and must be, else the garbage it revealed may violate someone's privacy because that garbage would very likely have come from some other process) -- Greg A. Woods +1 416 218-0098; <g.a.woods@ieee.org>; <woods@robohack.ca> Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>
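The zero-fill property of anonymous mappings mentioned above is easy to check. This is a small sketch, assuming Python's mmap module, where passing -1 as the file descriptor requests an anonymous mapping; the function name fresh_anon_page_is_zeroed is made up for the example.

```python
import mmap


def fresh_anon_page_is_zeroed(length=mmap.PAGESIZE):
    """Map `length` bytes of anonymous memory and check that every byte is zero."""
    # fileno -1 means "no backing file", i.e. an anonymous mapping.
    with mmap.mmap(-1, length) as region:
        return region[:] == bytes(length)


if __name__ == "__main__":
    print(fresh_anon_page_is_zeroed())
```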
I think it's rather elegant.. certainly much more so than the anon
version of bsd.. it builds on existing objects (/dev/zero in that
case) at least.
--
Matthias Buelow; mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}
I forwarded Bill Gallmeister your message, and he gave me the following to use as a response: And to comment on the email thread you forwarded: hindsight's a beautiful thing. shm_open et al were designed to allow a trivial implementation atop mmap (or atop the Ludicrous Sys V Interface (TM)), but at the time we came up with the standard, I believe it was only the rocket scientists at Sun who had mmap--no one else had ponied up to the memory==file proposition and all its implications. There wasn't even an INTERNET back then, for Chrissake. Al Gore hadn't been BORN. We wrote the damn standard using OIL LANTERNS. Okay, maybe I'm exaggerating a little. It was a DECADE ago! -- Matt Thomas Internet: matt@3am-software.com 3am Software Foundry WWW URL: http://www.3am-software.com/bio/matt/ Cupertino, CA Disclaimer: I avow all knowledge of this message
Thanks for that! I've had Bill's book open here on my desk for other reasons lately and when this came up here I did a rough page count of each section and while doing so I was reading all the "this is tricky" disclaimers. :-)

How quickly people forget their heritage though. By that I mean if Unix people of the day had not forgotten their Multics background then the "memory==file" proposition would not have been so strange and its implications would have been well understood by them. I was an avid user of Multics right up to the day I first learned to use SysV IPC so perhaps that's why it never confused me or made me cringe.

The SysV IPC mechanisms have had a long, and IMVNSHO greatly undeserved, history of being decried, discredited, and disparaged. I suspect this would have happened to SHM in particular even if it had not shared the ipcs/ipcrm resource identifier namespace quirks of message queues and semaphores simply because of the stupid politics separating USL and the BSD/CSRG crowds at the time (and still :-).

Regardless POSIX shared memory is still effectively stuck with a flat, invisible, namespace that now looks ever so much more like a flat filesystem but has almost none of the utility. I.e. there's a long and vast gap between the goal of making POSIX shared memory capable of being implemented on top of either true mmap() or SHM, but in the end the final standard leaves an application author hanging high and dry because the restrictions required for portable applications are so ludicrous that it's almost infinitely easier for the application to independently support both variants instead of trying to contort through the POSIX API. (and in modern systems that almost invariably means just using SHM).

In some senses it was also unfortunate that the POSIX Realtime working group got stuck with defining IPC mechanisms, but of course that seemed to be just more fallout from the stupid USL vs. BSD politics. The X/Open guys had a much more, well, open, ...
IMHO (in retrospect), if a system call does not provide any new feature (shm vs mmap), it needs to be deprecated/moved to user-space [using a common mechanism to share data efficiently]. regards -kamal Matt Thomas <matt@3am-software.com> Sent by: tech-kern-owner@netbsd.org 07/09/2003 07:13 PM To: tech-kern@netbsd.org (NetBSD Kernel Technical Discussion List) cc: Subject: Re: fsync performance hit on 1.6.1 POSIX.
I have used BSD-style shared memory using mmap(2) with MAP_ANON on Linux 2.4 and it worked fine... For synchronization I used flock(2) on temporary lock files with it. Matt
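The combination described above (a MAP_ANON shared mapping inherited across fork, serialized with flock on a temporary lock file) could look roughly like the following. This is a sketch under assumptions, not the poster's code: the function name run_counter_demo, the worker and increment counts, and the choice of a single 64-bit counter are all invented for illustration. One detail worth noting is that each child opens the lock file itself, because flock locks belong to the open file description and a descriptor inherited from the parent would share one lock among all children.

```python
import fcntl
import mmap
import os
import struct
import tempfile


def run_counter_demo(workers=4, increments=1000):
    """Fork `workers` children that each bump a shared counter under flock."""
    # Anonymous shared mapping; children inherit it across fork().
    shared = mmap.mmap(-1, mmap.PAGESIZE, flags=mmap.MAP_SHARED | mmap.MAP_ANON)
    lock_fd, lock_path = tempfile.mkstemp(prefix="shm-lock-")
    os.close(lock_fd)
    try:
        pids = []
        for _worker in range(workers):
            pid = os.fork()
            if pid == 0:
                status = 1
                try:
                    # Open the lock file here so this child has its own lock.
                    fd = os.open(lock_path, os.O_RDWR)
                    for _step in range(increments):
                        fcntl.flock(fd, fcntl.LOCK_EX)
                        (value,) = struct.unpack_from("Q", shared, 0)
                        struct.pack_into("Q", shared, 0, value + 1)
                        fcntl.flock(fd, fcntl.LOCK_UN)
                    status = 0
                finally:
                    os._exit(status)  # never fall through into the parent's code
            pids.append(pid)

        for pid in pids:
            _, wait_status = os.waitpid(pid, 0)
            if not os.WIFEXITED(wait_status) or os.WEXITSTATUS(wait_status) != 0:
                raise RuntimeError("worker %d failed" % pid)

        (total,) = struct.unpack_from("Q", shared, 0)
        return total
    finally:
        shared.close()
        os.unlink(lock_path)


if __name__ == "__main__":
    print(run_counter_demo())
```

Without the flock calls the read-modify-write on the counter would race and increments could be lost; with them the final value should equal workers times increments.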
