Re: funlink() for fun!

From: Daniel Brewer
Date: Monday, July 7, 2003 - 12:45 am

For our application, we have 1 process capturing images from a frame-grabber
and dumping them to a file in a memory file system. It sends a UDP message
to another process that reads the images from this file and writes them to
disk in a database. The first process has to guarantee that, once it's done
writing an image to the memory file system, the image is actually there for
the second process to read. The problem is slightly more complicated, as we
have systems out in the field running 1.5.1 which we still need to support.

I've tried opening the file with O_SYNC and not doing the fsync(). On 1.5.1,
this has increased the write() time slightly, which was expected, but the
overall processor utilization hasn't changed much. On 1.6.1, this has the
same impact as just removing the fsync() - which I expected because of the
MNT_SYNCHRONOUS flag in the mfs mount in 1.6.1. So far, I haven't had any
problems with this solution. Will this solution guarantee that whatever the
first process writes is there for the second process to read?

----- Original Message -----
From: "Chuck Silvers" <chuq@chuq.com>
To: "Daniel Brewer" <danielb@cat.co.za>
Cc: <tech-kern@netbsd.org>
Sent: Sunday, July 06, 2003 6:00 PM
noticed an inexplicably high usage on 161. After digging deeper with gprof,
I found that an fsync on 161 takes significantly longer than on 151. Our
software writes captured images into a ring buffer in a memory file-system,
so other servers can retrieve them. Could someone explain why fsync on
1.6.1 is so significantly slower than on 1.5.1? Is there a work-around? Or
Celeron) both have the same motherboard, have 128MB ram and are both running
Western Digital 20Gig drives. Running the 161 box on a 2GHz celeron with



From: der Mouse
Date: Monday, July 7, 2003 - 12:53 am

Provided the writing and the reading use the same choice of interface
(which primarily means, on the one hand, read() and write(), and on the
other, mmap()), you shouldn't need to do any explicit syncs.  (You need
to take care when mixing read/write and mmapped access, but not when
all accesses use the same choice of interface.)

At least that's how I've always understood it, and that's been my
experience.  I've been bitten occasionally by missing synchronization
between users of different interfaces, but never otherwise.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Chuck Silvers
Date: Monday, July 7, 2003 - 9:39 am

for a unified-cache system like we have in 1.6 and later,
there's no coherency problem even if you switch back and forth
between read/write and mmap.  for non-unified-cache systems like
sunos 3.x this was probably a problem.  the only major non-unified-cache
system left that I know of these days (HP-UX) does flushing internally to
guarantee coherency for single-threaded applications even if they switch
interfaces and makes a best-effort attempt at coherency for multi-threaded
(or multi-process) accesses via different interfaces.
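
A minimal sketch of the mixed-interface case discussed above, assuming a unified buffer cache: data is stored with write() and then inspected through a MAP_SHARED mapping, with no explicit sync in between. The function name and the path passed to it are hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Store data with write(), then look at it through a shared mapping.
 * Returns 0 if the mapping shows the written bytes, -1 otherwise.
 * len must be greater than zero. */
int write_then_mmap(const char *path, const char *data, size_t len)
{
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, data, len) == (ssize_t)len) {
        /* No fsync() or msync() here: with a unified cache both
         * interfaces are backed by the same pages. */
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (p != MAP_FAILED) {
            if (memcmp(p, data, len) == 0)
                rc = 0;
            munmap(p, len);
        }
    }
    close(fd);
    unlink(path);
    return rc;
}
```

On a system without a unified cache the comparison could in principle fail, which is the situation the older systems mentioned above had to work around.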

-Chuck





From: David Laight
Date: Monday, July 7, 2003 - 1:58 pm

The only case that really does cause problems is if one process is
using NFS to access the files.  This can cause extreme grief!
The grief is compounded by NFS implementations that compare client
and server times.

	David

-- 
David Laight: david@l8s.co.uk


From: Chuck Silvers
Date: Monday, July 7, 2003 - 9:33 am

hi,

oh, for the file in the MFS you shouldn't need any kind of syncing at all.
the data will be available in memory for the second process to read
immediately after the write() returns.  this is true for both 1.6.x and 1.5.x.
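
A minimal sketch of that claim, assuming a local (non-NFS) file system: a writer descriptor and an independent reader descriptor on the same file, with no fsync() and no O_SYNC. The function name and path are hypothetical; in the real application the two descriptors would belong to two processes.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write `len` bytes to `path`, then read them back through a second,
 * independent descriptor without calling fsync(). Returns 0 if the
 * reader sees exactly what the writer wrote, -1 otherwise. */
int write_then_read(const char *path, const char *data, size_t len)
{
    char buf[256];
    int rc = -1;

    if (len > sizeof(buf))
        return -1;

    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (wfd < 0)
        return -1;
    int rfd = open(path, O_RDONLY);
    if (rfd < 0) {
        close(wfd);
        unlink(path);
        return -1;
    }

    /* The read follows the completed write, so it is served from the
     * same cached pages the write just filled. */
    if (write(wfd, data, len) == (ssize_t)len &&
        read(rfd, buf, len) == (ssize_t)len &&
        memcmp(buf, data, len) == 0)
        rc = 0;

    close(rfd);
    close(wfd);
    unlink(path);
    return rc;
}
```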

-Chuck




From: Greg A. Woods
Date: Monday, July 7, 2003 - 11:45 am

First off I wouldn't think you'd have to worry about fsync() on a memory
filesystem.  :-)

In fact so long as you're not worried about system failures possibly
causing loss of data then I don't think you should ever need the fsync().

Is the data _always_ going to be read by the second process?  If so then
maybe now with bigger pipe buffers in 1.6.x you'd be better off just
using a pipe?  You should only need a runtime configuration flag to
select between use of a pipe and a temp MFS file to be able to also
support 1.5.x with the same compiled binary.
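
As a rough sketch of the pipe alternative (function name hypothetical, and both ends kept in one process only to keep the example short), the data itself travels through the pipe instead of through a file:

```c
#include <string.h>
#include <unistd.h>

/* Push `len` bytes through a pipe and read them back.
 * Returns 0 if the bytes arrive intact, -1 otherwise. */
int pipe_roundtrip(const char *data, size_t len)
{
    int fds[2];
    char buf[512];   /* small enough that the write cannot block on an empty pipe */
    int rc = -1;

    if (len == 0 || len > sizeof(buf))
        return -1;
    if (pipe(fds) < 0)
        return -1;

    /* In a real program fds[1] would stay with the capture process and
     * fds[0] with the database writer, e.g. after fork(). */
    if (write(fds[1], data, len) == (ssize_t)len &&
        read(fds[0], buf, len) == (ssize_t)len &&
        memcmp(buf, data, len) == 0)
        rc = 0;

    close(fds[0]);
    close(fds[1]);
    return rc;
}
```

Whole images are larger than a pipe buffer, so a real writer would block until the reader drains the pipe; that coupling is the trade-off against the ring buffer in a file.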


Also, why are you using UDP to communicate with another process on the
same system?  Why not an AF_LOCAL socket, pipe, or msgsnd()/msgrcv()?
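
A minimal sketch of the AF_LOCAL option, assuming the notification is a short datagram (the message text and function name are made up for illustration). SOCK_DGRAM keeps message boundaries, much like the UDP messages it would replace:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>

/* Send one "image ready" notice over a local datagram socket pair and
 * receive it on the other end. Returns 0 if the notice arrives intact. */
int notify_over_local_socket(const char *msg)
{
    int sv[2];
    char buf[128];
    size_t len = strlen(msg);
    int rc = -1;

    if (len == 0 || len > sizeof(buf))
        return -1;
    if (socketpair(AF_LOCAL, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* sv[0] and sv[1] would normally be held by two different processes. */
    if (send(sv[0], msg, len, 0) == (ssize_t)len &&
        recv(sv[1], buf, sizeof(buf), 0) == (ssize_t)len &&
        memcmp(buf, msg, len) == 0)
        rc = 0;

    close(sv[0]);
    close(sv[1]);
    return rc;
}
```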

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Jochen Kunz
Date: Tuesday, July 8, 2003 - 1:16 am

Why don't you use an AF_UNIX socket to pipe the image data from process
to process, or SYSV shared memory with semaphore interlocking? The latter
would be the fastest way to transfer the data and it avoids any file
system interaction.
-- 


tschüß,
       Jochen

Homepage: http://www.unixag-kl.fh-kl.de/~jkunz/



From: David Laight
Date: Tuesday, July 8, 2003 - 5:17 am

SYSV shared memory is horrid stuff!
Just mmap a file!

	David

-- 
David Laight: david@l8s.co.uk


From: Matthias Buelow
Date: Tuesday, July 8, 2003 - 11:18 am

Indeed, it is just compatibility stuff.  I'm appalled that new software
is still using that deprecated (and ill-conceived) API.  Maybe it's
because Gnu/Linux can't do any other method (at least the last time
I looked.)  That shouldn't mean that on other systems (SVR4/44BSD)
that broken API should be used as well.

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Tuesday, July 8, 2003 - 1:57 pm

New software _should_ use POSIX IPC, but of course NetBSD doesn't yet
have any implementation of POSIX IPC, so portable software that must run
on NetBSD must still use UNIX System V IPC.

It's really not that badly conceived an API, given the fact that it has
to be very portable; and it is very "standard" too (it's been in P1003.1
since Issue 2 and has been in the BASE API set since Issue 5).  (The
lack of ftok() in POSIX has also been repaired since Issue 4, IIRC.)

The full implementation gives some very nice user-level controls that
can be an enormous boon to debugging and operational management.

While a variant/subset of mmap() has been defined for IEEE-P1003.1-2001,
it ends up being a hell of a lot more complex than the plain old SysV,
aka XSI, shared memory you're complaining about and it doesn't include
MAP_ANON either so portable applications still have to use SysV/XSI SHM
for the kinds of uses I'm guessing are applicable for the application
this thread has been discussing.

The only part of the whole SysV IPC API that's really rather inelegant
is the semaphores part (and of course that's due to the minor flaw that
requires two separate system calls to create and then initialize a
semaphore).  IIUC this problem is fixed with POSIX semaphores, but of
course NetBSD doesn't yet have any implementation of POSIX semaphores.
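
The two-call pattern can be sketched as follows. This is an illustration only: the function name is hypothetical, and `union sem_arg` is a locally named stand-in with the same layout as the `union semun` that POSIX asks applications to declare (some systems already declare `union semun` in <sys/sem.h>, so the name is changed to avoid a clash).

```c
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

/* Locally named equivalent of `union semun` (an assumption, see above). */
union sem_arg {
    int val;
    struct semid_ds *buf;
    unsigned short *array;
};

/* Create a private one-semaphore set, set it to `initial`, read the
 * value back, and remove the set. Returns the value read, or -1. */
int sem_create_and_init(int initial)
{
    /* Call 1: create the set. The standard does not define its value yet. */
    int id = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    if (id < 0)
        return -1;

    /* Call 2: initialize. With a shared key instead of IPC_PRIVATE,
     * another process could attach between the two calls and see an
     * uninitialized semaphore. */
    union sem_arg arg;
    arg.val = initial;
    int value = -1;
    if (semctl(id, 0, SETVAL, arg) == 0)
        value = semctl(id, 0, GETVAL);

    semctl(id, 0, IPC_RMID);   /* SysV objects persist until removed */
    return value;
}
```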

Message queues could have been a little bit simpler too I suppose,
though only at the loss of some useful flexibility, and it is
unfortunate that you can't use poll() on message queues (sysV or posix).

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Matthias Buelow
Date: Tuesday, July 8, 2003 - 3:32 pm

I can hardly believe that someone is actually recommending the use of
this blasted, totally deprecated API.  SysV IPC is a horrible
anachronism with lots of inconsistencies and bad behaviour (can't be
multiplexed with files, identifiers & resources hang around when
process exits abnormally, sometimes until the system is rebooted (you
can't always ipcrm), small, fixed limits compiled into kernel etc.)
Given that both System V R4 and higher and 4.4BSD and higher provide
much better, mmap-based shared memory mechanisms which are supported in
one way or the other by most if not all systems today (e.g. Free/NetBSD
and, iirc, Solaris can do both the SVR4 method of shared mmapping of
/dev/zero as well as the anon/shared method of 4.4BSD) there is little
point in programming for the SVIPC API, except as a fallback for
systems on which it is the only method available (GNU/Linux).  A
uniform interface which encapsulates those three methods (SVR4, 44BSD,
SVIPC) towards the rest of the application is easily written in a
couple dozen lines of code so there's no reason to use the SVIPC crap
on systems where it is not necessary.
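
A rough sketch of such a wrapper, covering only two of the three methods (4.4BSD-style MAP_ANON, then SVR4-style /dev/zero) and leaving out the SVIPC fallback. The names `shared_alloc` and `shared_alloc_demo` are hypothetical.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Get `len` bytes of zero-filled memory that stays shared across fork().
 * Returns NULL if neither method works. */
void *shared_alloc(size_t len)
{
    void *p = MAP_FAILED;
#ifdef MAP_ANON
    /* 4.4BSD method: anonymous shared mapping, no file involved. */
    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_ANON | MAP_SHARED, -1, 0);
#endif
    if (p == MAP_FAILED) {
        /* SVR4 method: shared mapping of /dev/zero. */
        int fd = open("/dev/zero", O_RDWR);
        if (fd >= 0) {
            p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
            close(fd);
        }
    }
    return p == MAP_FAILED ? NULL : p;
}

/* The child stores a value; the parent reads it back after waitpid().
 * Returns the value seen by the parent, or -1 on error. */
int shared_alloc_demo(void)
{
    int *cell = shared_alloc(sizeof(*cell));
    if (cell == NULL)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        munmap(cell, sizeof(*cell));
        return -1;
    }
    if (pid == 0) {
        *cell = 42;
        _exit(0);
    }
    waitpid(pid, NULL, 0);

    int seen = *cell;
    munmap(cell, sizeof(*cell));
    return seen;
}
```

Both methods only share memory between related processes (the mapping is inherited over fork()), which is one reason named mechanisms keep coming up later in the thread.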

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Tuesday, July 8, 2003 - 6:59 pm

SysV IPC is still a standard API and is far more portable than anything

What do _you_ mean by "multiplexed"?  And are you talking explicitly



Everything has limits -- at least with SysV IPC you know what they are

no need to write it from scratch -- Richard Stevens already wrote one.

Indeed POSIX IPC mechanisms are far more desirable, but also equally
non-portable until a majority of modern OS releases catch up to
P1003.1-2001 and are deployed in a significant number of installations.

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Matthias Buelow
Date: Tuesday, July 8, 2003 - 7:54 pm

A feature?  Resources that don't get freed because a per-process
reference count is missing?  Just imagine what the situation would

It's not an implementation bug if you don't know if the identifier
you see in ipcs is still in use (and deleting it might crash the

Of course a fully implemented, sensible, standardized API is
preferable in the long run.  I don't know the story of the SysV IPC
stuff but it's very likely that it was a quick ad-hoc thing written
without proper design in order to provide the necessary IPC mechanism
for some specific software that needed to be written.  It predates BSD
sockets in any case (wasn't it already available in SysIII?)

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Tuesday, July 8, 2003 - 8:44 pm

We already knew that we couldn't poll() on message queues, so what do

Yes, a feature -- they are independent entities that can remain in place


SysV IPC wasn't a "quick ad-hoc thing" as far as I can tell.  It is very
generic and was apparently designed to provide an integrated set of
generic IPC facilities that could be used safely and reliably in
production environments.  I once used SysV IPC to build some of the core
communications software in an application that was used to control
real-world railroad switches and signals.  I've used it in lots of other


Nope.  They call it SysV IPC for a reason....

(SCO back-ported it to an earlier version of XENIX, but did so badly and
with some nasty bugs.)

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Christoph Hellwig
Date: Wednesday, July 9, 2003 - 1:04 am

I know that certain folks here like bashing Linux, but it might help
to look at the facts before posting such bogus claims.  Also remember
that it's Linux - as far as the kernel is concerned the GNU folks
didn't contribute but rather do the same bashing from time to time
that you like, too.



From: Matthias Buelow
Date: Wednesday, July 9, 2003 - 11:43 am

I do not bash lignux, I state facts.  I asked in 1996 when the missing
functionality was to be implemented and got told along the lines that
it will take at least a few months still because it couldn't be fitted
into the then current VM architecture easily.  Last time I looked (2.4
kernel of 2002 or so) it still wasn't available.  Maybe that has
miraculously changed in the meantime.  But who cares, this is not a
lignux mailing list.

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Christoph Hellwig
Date: Wednesday, July 9, 2003 - 5:14 pm

Well, that's certainly wrong.  MAP_ANON has been available at least
through all of the 2.4 series.



From: Matthias Buelow
Date: Wednesday, July 9, 2003 - 6:32 pm

Hmm.. my memory could be failing me and it was a pre-2.4 kernel against
which I tested last time.  Good news, though, that they're at least
picking up certain things.  I hope it also works.

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: der Mouse
Date: Tuesday, July 8, 2003 - 8:09 pm

*shudder*

I have yet to see any part that _isn't_ inelegant.

A new namespace for each resource.

A new _flat_ namespace for each resource.

A new flat namespace _with human-meaningless names_ for each resource.

The worst of both the persistent and transient worlds: no cleanup on

Unfortunate?  I would call it fatal.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 12:01 am

Have you never heard of inode numbers?   :-)

Perhaps you're not aware of ftok(3) and its use to map normal pathnames

I think you've mis-interpreted what I believe was a design goal of SysV
IPC.  The IPC entities are not supposed to go away when the last process
exits -- they are supposed to persist so that a process can come along
later and re-attach to them.

Asking SysV IPC entities to disappear on last exit of all processes
which have interacted with them would be like asking all files to
disappear on last close!

There's nothing fundamentally wrong with them disappearing on reboot
though -- in that sense they are no different than a memory filesystem.
(or pipes).

(It might have been nice to have some way to explicitly ask to "unlink"
an IPC entity such that it would be cleaned up on last exit, but the

poll() came along to the systems in question quite a bit later than
message queues -- at the time the only things that you might want to be
able to do simultaneous non-blocking reads on were TTYs and the best you
could do with anything at the time were manually polled non-blocking
reads and when that's what you have to do then adding a msgrcv() call
using IPC_NOWAIT to the loop is no big deal.
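
A minimal sketch of one such non-blocking check (the struct, message text and function name are made up for illustration). A real event loop would make this call alongside its other manually polled reads:

```c
#include <errno.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/msg.h>

struct note {
    long mtype;       /* message type, must be > 0 */
    char text[32];
};

/* Poll an empty queue, send one message, poll again. Returns 0 if the
 * first poll reports ENOMSG and the second delivers the message. */
int poll_queue_once(void)
{
    struct note in, out;
    int rc = -1;
    int q = msgget(IPC_PRIVATE, IPC_CREAT | 0600);
    if (q < 0)
        return -1;

    /* Nothing queued: IPC_NOWAIT makes msgrcv() fail with ENOMSG
     * instead of blocking. */
    if (msgrcv(q, &in, sizeof(in.text), 0, IPC_NOWAIT) == -1 &&
        errno == ENOMSG) {
        memset(&out, 0, sizeof(out));
        out.mtype = 1;
        strcpy(out.text, "frame ready");
        if (msgsnd(q, &out, sizeof(out.text), 0) == 0 &&
            msgrcv(q, &in, sizeof(in.text), 0, IPC_NOWAIT) ==
                (ssize_t)sizeof(in.text) &&
            strcmp(in.text, "frame ready") == 0)
            rc = 0;
    }

    msgctl(q, IPC_RMID, NULL);   /* remove the queue explicitly */
    return rc;
}
```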

BTW, it is only not portable to use poll() or select() on message queue
identifiers -- it is possible to use select() on at least one more
modern implementation (AIX).

Also, BTW, it seems I am indeed wrong about never being able to use
poll() or select() on POSIX message queues -- POSIX does specifically
allow message queue descriptors to be implemented as file descriptors
and so it should be possible to implement them in such a way that poll()
or select() will do the right thing for them.  On the other hand this
can't be relied upon by a portable application.  Luckily POSIX message
queues have a way to establish a notify callback function that will be
called just like a signal handler whenever a message appears in an empty
queue (though the notifier does have to be ...
From: der Mouse
Date: Wednesday, July 9, 2003 - 12:11 am

Certainly.  How does it avoid collisions?  (Hint: it can't.  There can
be, and on large systems not too infrequently are, more files than
there are possible IPC IDs.)  How is there any excuse for tying this

It's fine to have that option.  It's broken to have that as the only

Pipes do not have names.  Memory filesystems, yes, and that's one

...so?  Taking an OS whose unifying concept is "everything is a file
descriptor" and creating three new object types that can't have file

Gah.  We need to get away from signal handlers.  Or else we need an
environment (language and OS) better suited to them.  Getting away from
signals was one of the things that made me implement AF_TIMER sockets.
(Another was the insanely small number - one - of possible outstanding
timeouts with the signal-based timer facilities.  The third was the

So they not only repeated the mistakes of signals, they repeated the
mistake of making them unreliable.

And this is the API you are holding up as a paragon of goodness??  I
shudder to imagine your idea of a bad API.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 1:36 am

Unix filesystems use inodes.  :)

but my point was that a namespace consisting of integers isn't such a

Note I had not actually ever looked at NetBSD's implementation before,
but I see now that it is far less than ideal, and strictly speaking is
not correct since as far as I can tell it cannot possibly conform to the
full requirements of POSIX 1003.1-2001.

ftok() _should_ always create a unique key to match any unique file and
'id' parameter (and of course _should_ always return the same key for
the same file & 'id').  Most implementations are done in userland (and
POSIX requires that a userland implementation be possible) and like the
half-baked one in NetBSD they use the combined st_ino and st_dev numbers
found from stat()ing the file (and then shift in the low 8 bits of the
'id' parameter so that each file can also represent 2^8 unique keys).
Once upon a time the bits used from each value would probably  have
guaranteed a unique key, but since then the values have widened

You can ask for the world, but that doesn't mean you'll get it!  ;-)

There is no "option" to have a file automatically deleted on last close.
You have to unlink() it explicitly and then its allocated storage will
be released on last close.  The unlink() means the file is immediately
invisible to all other processes which do not have it open().  Does this
mean the open()/close()/unlink() API is also fundamentally broken?  I
don't think so!
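
The unlink()-while-open behaviour described here can be sketched like this (function name and path hypothetical, local file system assumed):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a file, unlink it while it is still open, and check that the
 * name is gone but the data remains reachable through the descriptor.
 * Returns 0 if both hold, -1 otherwise. */
int unlink_while_open(const char *path)
{
    struct stat st;
    char buf[8];
    int rc = -1;
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0)
        return -1;

    if (write(fd, "scratch", 7) == 7 &&
        unlink(path) == 0 &&
        stat(path, &st) == -1 &&            /* the name is gone ...        */
        lseek(fd, 0, SEEK_SET) == 0 &&
        read(fd, buf, 7) == 7 &&            /* ... the data is still there */
        memcmp(buf, "scratch", 7) == 0)
        rc = 0;

    if (rc != 0)
        unlink(path);   /* best-effort cleanup if we failed before unlinking */
    close(fd);          /* last close: the storage is released here */
    return rc;
}
```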

I would say that any API for any complex functionality would be broken
by being too complex to use if it had every imaginable option and
feature.

Just like the files on a memory filesystem, SysV IPC entities persist
until reboot so that they can be used by transient processes and there
exist tools to manage them as necessary.  I.e. their whole API is
sufficient to do everything that's necessary, and it is not bloated with

Everything is a file descriptor in unix unless it's an object located at a
memory address.  :-)

Message queues are the only one of the ...
From: der Mouse
Date: Wednesday, July 9, 2003 - 12:21 pm

In the API??  The only places I know of they appear in the API are

Maybe, maybe not, but your example is irrelevant - inumbers are _not_

Which implies that no conformant OS on a machine with 32-bit ints can
ever support as many as 2^24 files simultaneously accessible.
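
To make the collision argument concrete, here is a toy key derivation of the kind described earlier in the thread (bits of st_dev and st_ino combined with the low 8 bits of 'id'). The exact bit layout is an assumption chosen for illustration and `toy_ftok` is a hypothetical name; it is not claimed to match any particular system's ftok().

```c
#include <stdint.h>

/* Hypothetical 32-bit key: 8 bits of id, 8 bits of device number and
 * 16 bits of inode number. Only 24 bits identify the file. */
uint32_t toy_ftok(uint32_t dev, uint64_t ino, int id)
{
    return ((uint32_t)(id & 0xff) << 24) |
           ((dev & 0xffu) << 16) |
           (uint32_t)(ino & 0xffffu);
}
```

With this layout, inodes 1 and 0x10001 on the same device map to the same key, since everything above the low 16 bits of the inode number is discarded. Any scheme that squeezes a file identity into 24 bits has the same problem once more than 2^24 files exist.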

Nice idea.  It won't fly in practice.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 12:56 pm

Anyone who ignores the fact that unix filesystems are inherently flat
namespaces at their lowest level does so at his or her own peril!  ;-)
Directories are just files full of names and pointers to the true

What gives you that idea?

The standards do not specify what a "key_t" is.  P1003.1-2001 says only
that it is "Used for XSI interprocess communication" and that it is
defined by including <sys/types.h>.

For all you know at the API level of ftok(), shmget(), semget(), and
msgget() a key_t could be a struct, or a pointer to a struct, containing
a full copy of a "struct stat" along with a full copy of the stat()ed
pathname in a char array and a copy of the whole int-sized 'id' value.
(and that's probably what it should be, without perhaps the pathname)

The resulting IPC resource descriptors returned by shmget(), msgget(),
and semget(), each being in its own namespace and each being defined as
type 'int' with "-1" being a reserved value, allows, strictly speaking,
for UINT_MAX-1 open resources of each type, along with UINT_MAX-1 open
ordinary files (though practically speaking it's INT_MAX for each of
course because so many programmers assume all negative values are
equivalent to "-1").

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: der Mouse
Date: Wednesday, July 9, 2003 - 1:05 pm

You are still insisting on confusing the implementation namespace
(inumbers) with the API namespace (pathnames).

Given how intelligent you have proven yourself to be in other areas, I
can only conclude you are being deliberately stubborn in this
misunderstanding, and I see no need to even attempt to discuss matters
with someone acting that way.

Goodbye.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 1:40 pm

I'm only trying to point out to you that complaining that flat
namespaces are inherently broken is like complaining that the sky is
blue on a clear and sunny day.

I'm only blurring the boundary between kernel and user-land because when
you get down to these kinds of things the boundary _should_ be blurry.

Just as the unix filesystem has namei(), SysV IPC has ftok().

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: der Mouse
Date: Wednesday, July 9, 2003 - 1:50 pm

namei() is not an API.  ftok() is.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greywolf
Date: Wednesday, July 9, 2003 - 2:06 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> I'm only blurring the boundary between kernel and user-land because when
GAW> you get down to these kinds of things the boundary _should_ be blurry.
GAW>
GAW> Just as the unix filesystem has namei(), SysV IPC has ftok().

The difference, though, is that ftok() is implemented in userspace, forcing
whatever uses IPC to do its own groveling, while namei() is never seen from
userland -- the kernel always ALWAYS handles namei().

You might like to blur the line between kernel and userspace, and as much
as the line should be blurred on occasion, this is not one of those
occasions, except, it seems, to suit your own point of view.  Using IPC
shouldn't require kernel-like knowledge to address any more than one is
required to know kernel-like things in order to call, e.g. open(2) or
chdir(2).

With open/chdir, you pass a pathname which is easily determined and can
even be prescribed by the program in question, at which point error
detection and handling and probably several fallbacks come into play.

With the IPC stuff, you have to jump through quite a few hoops to locate
the identifier.  You must ALWAYS do this.  You cannot hope to even generate
the identifier on the fly.  At this point, error detection and handling
come into play, and there's really no fallback -- if what the system
gives you doesn't work for some strange reason, you're hosed (of course,
so's the system, most likely).

Comparing the filesystem to IPC/shm/sem/msg is like comparing apples
to astrology.  There is little commonality between the two; given that they
serve somewhat different needs (unless you count the open/fork/unlink
trick which preceded anything resembling current IPC), this is expected.

Nonetheless, that the implementation as it stands deals with nonhuman(oid)-
readable identifiers which must be obtained through secondary means is just
ridiculous.  The least that could be done is that a shm/sem/msg could
be requested by a particular name; ...
From: Greg A. Woods
Date: Thursday, July 10, 2003 - 12:06 am

And this difference means what, exactly?  Have you never heard of any
systems which implement filesystems in userland?  A little bird tells me

You absolutely do not need any "kernel-like" knowledge to understand or
use SysV IPC mechanism.  If you think that is true then what you're
probably missing is generic knowledge of interprocess communications
techniques, as well as perhaps a general understanding of how various
naming conventions are implemented.

Would you say that using the DNS requires "kernel-like" knowledge too?

"one" != "quite a few"

I guess maybe you don't like to open your files before you read them?
Perhaps you'd rather not have to bother managing your open file

Why not?  If you know its name then you can find its resource ID with
trivial ease (in very much the same way you would know a file's name
since they are after all exactly the same things, and indeed you can
even use normal filesystem tools to examine a list of filenames which
might have been associated with IPC resources to select from amongst
them).
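
A minimal sketch of that name-to-resource lookup, assuming SysV shared memory is available (the function name and path are hypothetical, and this is not the truncated example from the original message):

```c
#include <fcntl.h>
#include <string.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <unistd.h>

/* Map a pathname to a key with ftok(), create a segment under that key,
 * then look it up a second time by the same name. Returns 0 if both
 * lookups yield the same, attachable segment, -1 otherwise. */
int shm_by_name(const char *path)
{
    int rc = -1;

    /* ftok() needs an existing file to stat(). */
    int fd = open(path, O_RDONLY | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    close(fd);

    key_t key = ftok(path, 'A');
    if (key != (key_t)-1) {
        int id1 = shmget(key, 4096, IPC_CREAT | 0600);
        /* What a second, unrelated process would do with the same name. */
        int id2 = shmget(ftok(path, 'A'), 4096, 0);
        if (id1 >= 0 && id1 == id2) {
            char *mem = shmat(id1, NULL, 0);
            if (mem != (char *)-1) {
                strcpy(mem, "frame");
                rc = 0;
                shmdt(mem);
            }
        }
        if (id1 >= 0)
            shmctl(id1, IPC_RMID, NULL);   /* segments persist until removed */
    }
    unlink(path);
    return rc;
}
```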

Or do you mean you were so totally overwhelmed by the flexibility of the
API that you forgot to make appropriate use of one of its key features in

So, what do you do when you open a file that doesn't exist?  What if
your filesystem wasn't even mounted?

The system is never hosed because of bugs in applications using SysV IPC
-- at least not so long as it's being managed by anyone with a clue, and
provided of course there are no latent bugs in the implementation.

I'm sorry to have to say this but it seems as if your complaints are

Hmmm... sounds like you're equating inode numbers, or maybe file
descriptors, or some similar kind of resource handle, to filenames.  Do
you find it too confusing to have the ability to have a separate (and
potentially user-replaceable) name-to-key mapping sitting on top of the

Please try to follow the bouncing ball in this simple example that's
been simplified even further by removing error handling for your ...
From: Kamal R Prasad
Date: Thursday, July 10, 2003 - 12:23 am

I'm not exactly a kernel guru (and unless you mean kernel data structures)-
I must say that using SysV IPC (or any other UNIX system calls) does 
require a good knowledge of how things are implemented under the hood. 
System calls (depending on sw and hw issues) have a finite 
capacity/latency etc. which a product developer needs to understand to 
take care of race conditions, scalability etc..
 
regards
-kamal





From: Greg A. Woods
Date: Thursday, July 10, 2003 - 9:40 am

No, not "under the hood".  A good systems programmer will need to know
how to use system calls effectively and safely and what system resources
they may consume; and thus the API for those calls must be well and
completely documented.  However just as with any well documented API it
should never be necessary to know how it is implemented under the hood.

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Christoph Hellwig
Date: Wednesday, July 9, 2003 - 1:07 am

Umm, posix SHM _does_ use mmap.  It just uses shm_open to get a suitable
fd, on Solaris and Linux that would be on tmpfs.
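
A minimal sketch of that sequence on a system that implements POSIX shared memory (function and object names are hypothetical; older C libraries may need -lrt at link time): shm_open() only supplies the descriptor, and mmap() does the actual sharing.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create a POSIX shared memory object, map it twice, write through one
 * mapping and read through the other. Returns 0 on success, -1 on error. */
int posix_shm_roundtrip(const char *name)
{
    const size_t len = 4096;
    int rc = -1;

    int fd = shm_open(name, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (ftruncate(fd, (off_t)len) == 0) {   /* new objects start at size 0 */
        char *a = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        if (a != MAP_FAILED && b != MAP_FAILED) {
            strcpy(a, "hello");
            if (strcmp(b, "hello") == 0)
                rc = 0;
        }
        if (a != MAP_FAILED)
            munmap(a, len);
        if (b != MAP_FAILED)
            munmap(b, len);
    }

    close(fd);
    shm_unlink(name);   /* the name lives in its own namespace, hence shm_unlink() */
    return rc;
}
```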



From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 9:26 am

Yes, that's my point.  :-)

If you know how IEEE standards committees work and you understand how
much they (are supposed to) hate inventing new things, the fact that
they invented shm_open and shm_unlink() suggests that some strong
member(s) of the committee were just completely and totally unwilling to
allow for mmap() to work on all normal files and that the only way they
would be happy with mmap() becoming the true standard shared memory
interface was if it was required that the file descriptors it used be
allocated by some special new function.

You would think shared memory would be simpler to describe and discuss
than something like message queues which have lots of fancy features,
since it is, after all, just a chunk of memory storage that can be
mapped into the address space of multiple processes.  However because of
this strange use of mmap() and all the qualifiers they put on it for
POSIX, folks like Bill Gallmeister in his O'Reilly "POSIX.4" book
actually spend more pages describing shared memory and give all kinds of
caveats about its use.  The SysV SHM API is trivial by comparison to POSIX.

I don't know why POSIX doesn't include MAP_ANON either -- that would
have made things ever so much simpler!  The rationale in P1003.1-2001
claims they decided to use the SysVr4 mmap() implementation as the basis
of the POSIX API, and indeed SysVr4 lacks MAP_ANON, however MAP_ANON was
very well known before mmap() was finalized since 4.3net2 was already
widely disseminated  (1003.4 was still in draft at the end of 1991).

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Christoph Hellwig
Date: Wednesday, July 9, 2003 - 9:43 am

Not trying to defend IEEE here, but there is some sense at least behind
shm_open.  Given that for shm you really want an object that's not
backed by permanent storage (= a normal filesystem) you need to know
where to look for a tmpfs-lookalike or, in the case you mentioned above
something outside the normal filesystem namespace (yuck!).  As IEEE
isn't into the filesystem namespace business shm_open is an okay wrapper
for leaving this to the implementation.

Why the heck they specified shm_unlink is completely unclear to me,

Just because it was known doesn't mean it should be standardized.
And MAP_ANON really doesn't fit into the SunOS4/SVR4 VM that wants a backing
vnode for each memory object unlike the Mach VM.  Thus the horrible
mmap() of /dev/zero hack, btw..



From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 11:17 am

You don't need such a concept for mmap(MAP_ANON|MAP_SHARED) -- the
filename is simply a key to the anonymous memory so that multiple

Oh I agree there's some sense behind shm_open() -- just so long as you
ignore the MAP_ANON jumping up and down and waving its hands and

That one's easy!  ;-)

shm_unlink(), like unlink(), takes a pathname parameter, so given the
fact shm_open() names are strictly outside the normal visible filesystem
space then you need a matching unlink() interface to work in this
private, invisible, namespace.  (or at least you do so long as you don't
also have something like a funlink() call that takes an open file

Given the constraints of trying to work without MAP_ANON to thus end up
with the same functionality only after inventing a dozen new API
signatures to work around the lack of MAP_ANON is in fact a very good
reason to standardize a far simpler API.  That's why I say there must
have been some very strong politics influencing the committee members.
Normally these committees are loath to invent new APIs and the mere fact
that they started down that road when they thought they could do without
MAP_ANON should have suggested to them that they were going in the wrong
direction.  "Oops!  We're inventing something!  Let's go back to that
last fork in the road we took to get here!"  (Of course POSIX.4 seems to
be mostly cut from whole cloth so maybe they didn't share that same

I don't buy that argument at all.  SysVr4 VM has the concept of
anonymous memory and the swap layer provides the backing store for
anonymous pages.  I suspect forcing anonymous pages to always have the
MAP_PRIVATE attribute was their downfall.  Anonymous pages could have
been made sharable simply by associating a vnode from an ordinary file
descriptor with them -- i.e. there's a vnode but it's not what's mapped,
anonymous memory is mapped and thus the swap layer continues to provide
the backing store.  That's essentially how mmap(MAP_ANON|MAP_SHARED)
works, IIUC -- the filename, ...
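
A minimal sketch of one way mmap(MAP_ANON|MAP_SHARED) gets used: related processes share the pages by inheritance across fork(), so no name in any namespace is involved.  The function name is hypothetical, and the sketch covers only the fork() case, not the filename-as-key idea discussed above.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* assumption: glibc; exposes MAP_ANON, harmless elsewhere */
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Let a child process store `value` in shared anonymous memory and
 * return what the parent reads back afterwards, or -1 on error.
 */
static int share_int_with_child(int value)
{
	int *shared, seen, status;
	pid_t pid;

	assert(value >= 0);	/* -1 is reserved for errors */
	shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_SHARED, -1, 0);
	if (shared == MAP_FAILED)
		return -1;
	*shared = 0;

	pid = fork();
	if (pid == -1) {
		munmap(shared, sizeof *shared);
		return -1;
	}
	if (pid == 0) {
		*shared = value;	/* written by the child */
		_exit(0);
	}
	if (waitpid(pid, &status, 0) == -1) {
		munmap(shared, sizeof *shared);
		return -1;
	}
	seen = *shared;			/* visible to the parent */
	munmap(shared, sizeof *shared);
	return seen;
}
```

With MAP_PRIVATE in place of MAP_SHARED the parent would read back 0, since the child's store would land in its own copy of the page.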
From: Greywolf
Date: Wednesday, July 9, 2003 - 1:00 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> shm_unlink(), like unlink(), takes a pathname parameter, so given the
GAW> fact shm_open() names are strictly outside the normal visible filesystem
GAW> space then you need a matching unlink() interface to work in this
GAW> private, invisible, namespace.  (or at least you do so long as you don't
GAW> also have something like a funlink() call that takes an open file
GAW> descriptor as its parameter :-)

Um, slightly off-topic, but wouldn't funlink() be somewhat disastrous in
practice? (I presume that's why the smiley).  That would require a file-
system cleaner process or a routine that knew instantly how to match inode
numbers to pathnames (as it was explained to me, "The kernel routine is
called namei() for a reason.  You will note that there is no converse
routine, since while name -> ino-dev is unique for each ino-dev, the
reverse is untrue -- consider /foo/bar/.. and /foo, for example...").

GAW> > Thus the horrible
GAW> > mmap() of /dev/zero hack, btw..
GAW>
GAW> Hmmm.... yes.  What a stupid idea that was.  :-)  (A NULL vnode pointer
GAW> was apparently supposed to suffice such that a /dev/zero vnode was
GAW> unnecessary.)

Wow, a (vno_t *) NULL was supposed to allow one to create pre-cleared
pages in memory?

Despite its ugliness, /dev/zero has other uses, such as creating
arbitrarily large filespaces (for, e.g., swap (don't go there.)) without
having to rewrite a program to handle it -- one can use dd for it,
though I wouldn't have minded a 'mkfile' program to do the same thing
(thus avoiding the need for /dev/zero).

I have a question regarding mmap()ing /dev/zero:

    Purportedly this was used by crt0.o and/or ld.so to create blank spots
    into which to load the dynamic libraries.  Surely the same thing could
    have been accomplished with *alloc() and a clear routine, or
    mmap() could just pre-zero whatever pages it maps.  Was /dev/zero
    *truly* necessary?  Its sudden disappearance once or ...
From: Greg A. Woods
Date: Thursday, July 10, 2003 - 9:35 am

No more than unlink(), provided that the link count was only one;

Why "instantly"?  funlink() could block until it found the directory

Yes, that's strictly true, but your example is slightly wrong.  :-)

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Greywolf
Date: Thursday, July 10, 2003 - 2:52 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> > Um, slightly off-topic, but wouldn't funlink() be somewhat disastrous in
GAW> > practice?
GAW>
GAW> No more than unlink(), provided that the link count was only one;

Well, yes, there is always that restriction, but that wasn't stated.

GAW> alternately the pathname could be cached in the kernel.  ;-)

I don't get that at all, sorry.   What good would that do for a file that
had (N>1) links to it?  If the (st_nlink == 1) restriction is not
enforced, funlink() would be tantamount to clri...

Gah!  You know what?  [I'm so chagrined -- this took me several passes.]
funlink() would STILL be tantamount to clri, and it would require the same
procedure (fsck) from which to recover, unless the kernel DID cache the
paths of every open (I know it caches vnodes, but pathnames?) until (last)
close or until unlink.

GAW> > That would require a file-
GAW> > system cleaner process or a routine that knew instantly how to match inode
GAW> > numbers to pathnames
GAW>
GAW> Why "instantly"?  funlink() could block until it found the directory
GAW> entry and confirmed that the link count was still one.  :-)

<quiz>
[ ] You expect me to wait that long for funlink() to return from that
    procedure?
    [ ]	...on a 1GB filesystem?
    [ ] ...with a high density of inodes?
    [ ] ...on a system with a slow CPU?
</quiz>

I'm guessing the smileys all over the place are conveying your intent that
funlink() would not work at all, practically speaking...

GAW> >  You will note that there is no converse
GAW> > routine, since while name -> ino-dev is unique for each ino-dev, the
GAW> > reverse is untrue -- consider /foo/bar/.. and /foo, for example...").
GAW>
GAW> Yes, that's strictly true, but your example is slightly wrong.  :-)

I see that's either because they're directories or, more likely what you're
hinting at, because /foo/bar may be on a different dev than /foo.

I should say, then, "consider /foo/bar/.. and /foo on the ...
From: Greg A. Woods
Date: Thursday, July 10, 2003 - 3:34 pm

Just to be clear I'm thinking of funlink() as relating to unlink() in
the same way fstat() relates to stat() -- i.e. it would unlink the file
that had been opened to create the file descriptor which would be passed
to it as its only parameter.

funlink(), with pathnames of open file descriptors cached in the kernel,
would be _exactly_ like unlink() and wouldn't require any other magic.
Funlink() would merely unlink the file that was opened to create the
file descriptor it was passed.

Funlink() without cached filenames would have to internally do much the
same as fstat() to find the device (and then the mount point) and the
inode number of the open file.  Then the directory tree of the target
filesystem would have to be traversed to find an entry with a matching
inode number, and then if all were OK (regular file, one link, etc.)
then the equivalent of a normal unlink() would be done on the found
filename.

About the only semi-sane use of funlink() (at least that I have ever
been able to think of over the years) could be to give a process the
ability of unlinking a file that its parent process had connected to one
of its file descriptors (e.g. stdin).  Of course such an ability would


I was thinking more along the lines that /foo/bar/.. and /foo/bar refer
to the same directory (assuming /foo/bar isn't a mount point) and of
course all directories have multiple hard links.  Of course you can't
normally/safely unlink() a directory, regardless of whether you refer to
it by its "true" name, or whether you refer to it by its ".." alias
name, but that's a slightly different issue [frmdir()?]  :-)



From: Greywolf
Date: Thursday, July 10, 2003 - 3:51 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> I was thinking more along the lines that /foo/bar/.. and /foo/bar refer
GAW> to the same directory (assuming /foo/bar isn't a mount point)

Really?

cd /var/tmp/..

Where are you now? :)

GAW> and of
GAW> course all directories have multiple hard links.  Of course you can't
GAW> normally/safely unlink() a directory, regardless of whether you refer to
GAW> it by its "true" name, or whether you refer to it by its ".." alias
GAW> name, but that's a slightly different issue [frmdir()?]  :-)

Well, yeah, of course; the point, though, which has been sidestepped here,
is that you can't use a dev-ino to come up with a unique name, even though
you can use a name to come up with a unique dev-ino pair.

				--*greywolf;
--
NetBSD: Groovy Baby!


From: Greg A. Woods
Date: Thursday, July 10, 2003 - 4:56 pm

Nothing's been sidestepped here.  Perhaps you've forgotten either one
of:  (a) the option of caching open filenames; or (b) the qualifier that
funlink() could/should fail/misbehave if the inode has multiple hard
links.



From: Greywolf
Date: Thursday, July 10, 2003 - 5:31 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> Nothing's been sidestepped here.  Perhaps you've forgotten either one
GAW> of:  (a) the option of caching open filenames; or (b) the qualifier
GAW> that funlink() could/should fail/misbehave if the inode has multiple
GAW> hard links.

In order for funlink() to work, you will need both caching of open
filenames and the requirement that st_nlink == 1. You'll need the caching
to make sure you can find the name quickly, and the one-link requirement
to ensure you don't break things.

The only other option would be to keep an inverse lookup db somewhere, and
that could get prohibitive (unless the means you kept was a db which
contained not actual names, but pointers to ino-devs which were
directories, and the offsets within from whence names could be retrieved
(since the kernel can divine things like data blocks from inodes and thus
dereference them...)).

[that could have implications of either robustness or frailty, depending
 on how you looked at it.  At first glance, something like that would
 enable run-time consistency capability, but I think it's probably more
 frail than that (I get starry-eyed at new features, at least until I
 discover their drawbacks, so I'm probably not (always) the best judge of
 robustitude).

 I mean, think about it:  you'd get a cross-check of link counts available
 at any time.  iname() would be possible, although it would return an
 array of objects rather than a single one; but if there was an
 inconsistency between the link count on the ino and the number of objects
 returned, that could be useful.  Don't ask me how (see "starry-eyed",
 above).

 I'm sure someone, somewhere, before me, has thought of doing this. I'm
 equally sure that this person has presented it, only to have it go down
 in flames because of the unnecessary complexity it added to things like
 open, creat, unlink, link, mkdir, rmdir and rename.

 But I digress...]

I had the experience of tripping over something ...
From: Greg A. Woods
Date: Thursday, July 10, 2003 - 11:38 pm

No, not really -- i.e. not just to make it "work".  You only need one or

"quickly" is irrelevant -- the call would always have to block until it
found the right directory entry, regardless.

Assuming one doesn't want to pay the price of caching all open filenames
then whether funlink() locks the inode first before searching for the

First off if you have cached the open file name then you don't need to
worry about the link count -- funlink() would always remove the intended
file in that case since it knows the file's name precisely.

The restriction on a link count of one would only be necessary when open
filename caching wasn't implemented.  It would be needed to make sure
the kernel didn't find the "wrong" filename first and unlink it instead
of the filename that was opened to create the file descriptor passed to

Did it persist that way for any period of time?  Or was it a momentary
quirk that could have been caused by a race condition for a file that

If that file really did persist until you fsck'ed then it sounds like a
corrupt filesystem was mounted without being properly fsck'ed the first
time around -- perhaps because of a bug in fsck; or perhaps there was a
bug in the filesystem code, or an undetected data corruption on the
disk, that caused the error to appear since the last boot.



From: der Mouse
Date: Thursday, July 10, 2003 - 11:43 pm

The name under which it was opened does not necessarily bear any
relation to any names it has later, nor does the file which was opened
under a given name necessarily bear any relation to any file that may
later exist under that name.

These remain true even if every file in sight has nlink 1.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: der Mouse
Date: Thursday, July 10, 2003 - 8:41 pm

Neither is of any real use by itself.

	fd = open("/home/alice/fnord/flarp",O_WRONLY|O_CREAT|O_EXCL,0666);
	chdir("/home/carol");
	rename("../alice","/home/bob");
	funlink(fd); /* What does this do?  Why and how? */

Even if you cache the pathname passed to open, even though the rename()
does not have anything obvious to do with the open()ed path, even
though file still has only one hard link, you have to search the
filesystem.  Or else you have to do an amazing amount of work
(including, at a minimum, an open file table walk) in rename(), link(),
unlink(), and probably other calls I haven't thought of to make sure
that the "cached" pathnames remain correct as files and directories get
moved around.

It might work to cache the containing-directory vnode and the opened-as
name (or moral equivalent, such as having directory vnodes point to a
list of currently-open files in them), and then all you have to worry
about is other clients of the same fileserver renaming it out from
behind your back (which admittedly is really no different from many
other issues such environments already have).  You don't even need the
nlink==1 check if you do that.
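
The staleness problem can be observed without any new system call.  A small sketch with made-up names and paths under /tmp: it opens a file, renames the directory above it, and checks that the pathname given to open() no longer resolves while the descriptor still refers to a file with one link.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Returns 1 if the opened-as pathname went stale while the descriptor
 * stayed valid with st_nlink == 1, 0 if not, -1 if the setup failed.
 */
static int cached_name_goes_stale(void)
{
	char top[] = "/tmp/funlink-demo-XXXXXX";
	char olddir[64], newdir[64], cached[64], moved[64];
	struct stat by_name, by_fd;
	int fd, n, result = -1;

	if (mkdtemp(top) == NULL)
		return -1;
	snprintf(olddir, sizeof olddir, "%s/alice", top);
	snprintf(newdir, sizeof newdir, "%s/bob", top);
	snprintf(moved, sizeof moved, "%s/flarp", newdir);
	n = snprintf(cached, sizeof cached, "%s/flarp", olddir);
	assert(n > 0 && (size_t)n < sizeof cached);

	if (mkdir(olddir, 0700) == 0) {
		fd = open(cached, O_WRONLY | O_CREAT | O_EXCL, 0600);
		if (fd != -1) {
			if (rename(olddir, newdir) == 0) {
				/* old name is gone; the fd is unaffected */
				result = stat(cached, &by_name) == -1 &&
				    fstat(fd, &by_fd) == 0 &&
				    by_fd.st_nlink == 1;
				unlink(moved);
				rmdir(newdir);
			} else {
				unlink(cached);
			}
			close(fd);
		}
		rmdir(olddir);	/* fails harmlessly if the rename succeeded */
	}
	rmdir(top);
	return result;
}
```

A kernel that cached only the string passed to open() would be left in the same position as the stat() call here.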



From: Greg A. Woods
Date: Friday, July 11, 2003 - 12:31 am

If you don't have an opened filename cache then assuming the rename()
works (i.e. assuming /home/bob isn't a mount point) then the funlink()
call would unlink that newly created file no matter what its name (which
does suggest one possible reason which might make funlink() useful even

If you do have a record of the opened filename then I would say the
easy way out means the funlink() must simply fail because it would be
exactly equivalent to:

	unlink("/home/alice/fnord/flarp");

(I should clarify that what I have been implying by the "opened
filename" is a fully qualified name which would have to be computed
using sys/kern/vfs_getcwd.c:getcwd_common() or similar if the name
passed to open was a relative one.  Caching the process' cwd vnode at
the time of open for each open file (even if only for those opened with
relative names) would probably complicate far too many other things to
be worth considering.)

I guess the choice between just failing the funlink() call in your
example above, or trying to make it work if possible, depends on what
possible reason one might have for implementing funlink() in the first
place, and thus how one defines it.

If all one is trying to do is flesh out the set of f*() system calls
which have file descriptor parameters to make it orthogonal to the set
which have pathname parameters then I suppose it depends on who one
wants to pay the price for this feature (and perhaps whether or not one
can conceive of any other reason to bother caching the opened filenames).
(Note this fleshing out of the f*() system call set was the original
purpose I had when I played this same mental exercise many years ago.)

If on the other hand what the funlink() implementor is trying to do is
actually make the funlink() call "work" no matter what name the opened
file might have at the time the funlink() call is made then, as you
say above, caching the opened filename may not be any help and one
really does have to force the caller to pay the price of the ...
From: der Mouse
Date: Friday, July 11, 2003 - 1:12 am

...you shouldn't do funlink() in this form in the first place, as
unlink() is not an operation on a file but an operation on a link to a
file.

Now, an funlink() that takes an fd on a directory and a (slashless)
component name, that would be a sensible way to add an fd-based variant
of unlink().

But there are more serious things to fix first, like the inability to
use open()/fchdir() with directories that are execute-only.  (To fix

It could, because then rename() could notice the change and change the
cache to refer to the new location.  (In which case I'm not sure how
fair it is to still call it a "cache"....)  I'm not sure what to do
with link/unlink pairs, though.  Neither call should update the cache,
but together they should....



From: David Laight
Date: Friday, July 11, 2003 - 2:36 am

You could also argue that it would be useful to have variants of open()
that take a directory fd and a pathname.

	David

-- 
David Laight: david@l8s.co.uk
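
Interfaces of this shape were later standardized in POSIX.1-2008 as openat() and unlinkat(), which take a directory descriptor plus a name.  A brief sketch with hypothetical wrapper names; the slash check enforces the single-component restriction suggested earlier in the thread, which the real calls do not impose.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE	/* assumption: glibc; harmless elsewhere */
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Create `name` (one component, no slashes) in the directory open on dirfd. */
static int create_in_dir(int dirfd, const char *name)
{
	assert(name != NULL);
	if (strchr(name, '/') != NULL) {
		errno = EINVAL;
		return -1;
	}
	return openat(dirfd, name, O_WRONLY | O_CREAT | O_EXCL, 0600);
}

/* Remove the entry `name` from the directory open on dirfd. */
static int unlink_in_dir(int dirfd, const char *name)
{
	assert(name != NULL);
	if (strchr(name, '/') != NULL) {
		errno = EINVAL;
		return -1;
	}
	return unlinkat(dirfd, name, 0);
}
```

The operation stays an operation on a link (a directory plus a name), but the directory is pinned by a descriptor instead of being looked up again through a pathname.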


From: Greg A. Woods
Date: Friday, July 11, 2003 - 10:47 am

But that's the whole point of funlink() (and perhaps even some of the
other f*() calls, such as fchdir()) -- turn an operation on a file into
an operation on a filename (i.e. a link to a file).  funlink() would of
course act upon the file (i.e. the inode and the storage it points to)
as well as the link recorded in the parent directory, since of course it
will also have to mark the file (inode) (and its storage) as free.

I suppose by stretching one's imagination a wee bit it's possible to see
how funlink() could help eliminate a TOCTOU race condition for a process
that must unlink a temporary file in some unsafe place like /tmp.
That's quite a bit of a stretch though and it's an especially long and
unlikely stretch if you consider that "safe" sub-directories can and
should _always_ be used in unsafe directories (since rmdir() is always
safe to use in a world-writable directory).  Perhaps funlink() could
have secure programming uses if a process has to remove files created in
a "safe" sub-directory of a world-writable directory and for some reason
it cannot chdir to that sub-directory any longer, but I can't at the

If the process has an FD open on the parent directory then it should be
able to much more easily just fchdir() there first, obtaining and
keeping an open FD for its PWD for those cases where it has to go right
back to where it started.

For other uses (such as unlinking stdin) such a form would be unusable,
and after all the goal is to define a system call that accepts a file
descriptor and acts upon the file that was opened in the same way as the
s/^f// system call would act when passed a filename referring to the
same file that was opened.  As I say above funlink() would actually act

Yes, I suppose that such an ability could be somewhat helpful at times.

I've always thought the only extremely serious omission in the f*()
function call set has been faccess() (and of course there should not
have ever been an access() call since it is inherently insecure); ...
From: der Mouse
Date: Friday, July 11, 2003 - 1:48 pm

Not quite.  chdir() operates on a process and a directory; the pathname
is necessary only to identify the directory.  It could equally well
identify the directory by a file descriptor open on it, and that's
exactly what fchdir() does.  truncate() operates on a file; the
pathname is used merely to find the desired file.  It too could equally
well identify the file by a file descriptor open on it, and that's
exactly what ftruncate() does.

However, unlink() does not operate on a file, except as a side effect;
it operates on a link to a file.  The file is affected, sometimes very
mildly, sometimes severely, but the critical point is that the pathname
is not serving just to locate the file, but rather to locate a
particular link to the file.  Since file descriptors are on files, not

Yes, it should be able to.  But since you can't open "." if your
current directory is execute-only, it can't.  (It'd also be nice to be
able to do one syscall (funlink of this flavor) instead of three

Which is why you can't do funlink(), because unlink doesn't operate on
files; it operates on links to files.  The file is operated on only in
that it's garbage-collected once it's no longer referenceable.  (Which
may be when its refcount goes to zero, or it may be an indeterminate

It doesn't really make sense, unless you also add fopen() (which name
has unfortunately already been preempted by stdio), to re-open a file
with potentially different access rights.  Otherwise, faccess() doesn't

How so?  Provided you realize what it does, and more importantly what
it doesn't do, there's nothing wrong with it.  (In particular, access()

No, it wouldn't.  You still couldn't save and restore your current
directory by opening "." and fchdir()ing back there if your current
directory is execute-only, even with O_MKDIR, without O_NOACCESS.



From: Ignatios Souvatzis
Date: Saturday, July 12, 2003 - 6:18 am

Hi,


Just what I wanted to write myself.
	-is
From: Matthias Buelow
Date: Friday, July 11, 2003 - 6:38 pm

I think it's not proper to regard [f]unlink() as operating on a file.
Unlink only operates on directories (which can be simplified as
ordinary files, although this is not quite right).
In most systems you delete files (that's also what probably anybody,
including those who are familiar with the internal workings of a
typical Unix system, is thinking when he's working with the system.)
However, that's not what's being done on Unix.  The kernel deletes
files, opaquely and behind the scenes (if necessary with the help of
fsck after an unexpected reboot.)  Not the user, who just removes
reference entries from directory files.  Of course you know all that,
I just want to emphasize it to support my argument.
That the "garbage collection" of files which are no longer referenced
is more or less immediate by using a simple reference counter in the
file's on-disk structure imho should be regarded as an implementation
detail.  For proper operation, one could also conceive of having a
process which regularly scans the filesystem and collects files which
are no longer referenced (like the memory management of certain
programming languages does.)  It is irrelevant to the exported API how
exactly this is implemented.  Unlink() should only work as an editing
operation on directories.  Any contrived operation that tries to find
a proper name for an open file through its descriptor is a rather
unclean thing which probably cannot be done correctly for all cases
and is way beside the design of the Unix filesystem.
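
The split between directory entries and the file itself is easy to observe from userland.  A short sketch, with made-up function names and temporary paths, that records st_nlink for one open file as its names are removed:

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Link count of the file open on fd, or -1 on error. */
static long nlink_of(int fd)
{
	struct stat sb;

	return fstat(fd, &sb) == -1 ? -1L : (long)sb.st_nlink;
}

/* Fill counts[] with st_nlink after link(), first unlink(), second unlink(). */
static int observe_link_counts(long counts[3])
{
	char first[] = "/tmp/nlink-demo-XXXXXX";
	char second[sizeof first + 3];	/* room for ".ln" */
	int fd, n;

	fd = mkstemp(first);
	if (fd == -1)
		return -1;
	n = snprintf(second, sizeof second, "%s.ln", first);
	assert(n > 0 && (size_t)n < sizeof second);
	if (link(first, second) == -1) {
		unlink(first);
		close(fd);
		return -1;
	}
	counts[0] = nlink_of(fd);	/* two directory entries */
	unlink(first);
	counts[1] = nlink_of(fd);	/* one entry left */
	unlink(second);
	counts[2] = nlink_of(fd);	/* no names, file still open */
	close(fd);			/* only now can the file be reclaimed */
	return 0;
}
```

Each unlink() edits one directory and decrements the count.  The file itself goes away later, when the count is zero and the last descriptor is closed.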

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Matthias Buelow
Date: Friday, July 11, 2003 - 6:53 pm

In addition, the only proper semantics for an funlink() system call I
could see would be to set the reference counter in the inode to zero,
close/invalidate the file and all descriptors to it and have the
kernel, some external process or fsck at reboot remove all references
from directories.  It might be surprising for users who are accustomed
to what unlink() does but it would be consistent with the
file/dir-entry schism.  It would indeed delete the file, not a
directory entry, which is literally what the caller requested.

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Saturday, July 12, 2003 - 1:54 am

No, not zero -- just decremented by one (assuming the proper directory

No, absolutely NOT!  unlink() doesn't do this for VERY good reasons, and

as I say in my other reply this is impossible to do safely without
locking the whole filesystem (or unmounting it or going to single-user
mode and assuming the admin is clueful).

Forcing the user of funlink() to suffer through an internal ftw() and
still risk a failure if the link count is not exactly one is one thing,
but forcing the whole system to do without a filesystem for the time it

That's completely wrong given how I've described funlink().



From: der Mouse
Date: Saturday, July 12, 2003 - 1:57 am

Sure it could.  If you want it to be an operation on the file, as
opposed to a link to the file, it has to be _something_ that is
independent of any particular pathname to the file.

Otherwise, you're making it an operation on a link that was at some
past time (save pathname at open() time) or is now (search filesystem)
linked to the file, and only secondarily an operation on the file
itself.

You have to go through great contortions (saving pathnames, searching
filesystems, checking that nlink==1 and/or that the pathname still
refers to the same file) to make funlink() perform the same operation
unlink() does.  This is a clue that what you are trying to do is
inappropriate.  I don't know where you got this resistance to
comprehending that unlink destroys links to files, only secondarily
affecting the files themselves, but it's clear to me that you have it.
You insist on trying to somehow tie an open file descriptor to a
particular link to its file, and, since file descriptors refer to
files, not links to files, you're having trouble.  Changing file
descriptors to refer to not files but particular links to files would
be a major philosophical change, and, as you're discovering, would
demand either a thorough redesign or some extremely heavy performance
penalties.  It probably could be done, but I see no point, and I do not
understand why you persist in trying to do it.  In any case, I see no
point in discussing it further with you.



From: Greg A. Woods
Date: Saturday, July 12, 2003 - 10:48 am

But it should not, indeed must not, else a fundamental feature of how


No, not really -- it should be obvious to anyone aware of how unix
filesystems are structured that implementing funlink() has it

A file is more than its contents -- it is all the metadata that controls
access to the content and allows the content to sit on the same storage
media along with many other distinct files.

Unlink() cannot, and luckily does not, "just" destroy links to files.
The primary purpose of unlink() is _also_ to decrement the link count in
the file metadata.  Unlink() _always_ acts on both a file and the
directory entry which points to that file.  However directory entries
(names) are just pointers to the real files.  The only part of what
unlink() also often does in addition to those first two critical
functions, which could safely be left for some cleanup daemon to do,
would be the moving of data block pointers from inodes which have a link

Indeed since that is the only way to implement the semantics of
funlink() as I've described it.  It's not such a difficult thing to do
for the most common case, even if one doesn't cache the opened filename,
though it could potentially incur a lot of disk reads....



From: Matthias Buelow
Date: Sunday, July 13, 2003 - 1:25 pm

Of course it is and I only gave this example to show how nonsensical
an "funlink()" call would be -- because the only valid behaviour
in the current framework (clearing of inode / invalidating all fds)
would be totally impractical.

Now can we stop this thing?  Or should I also propose some b*llsh!t
system call and have a hundred-mails long thread debate its uselessness?

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Sunday, July 13, 2003 - 3:58 pm

Your example was flawed though, even for its intent, since it did not
describe "the only valid behaviour in the current framework".  In fact
it described a completely different behaviour unrelated to the
funlink() call I initially proposed.  I described, informally, the valid
behaviour for funlink() as I saw it, along with its API and its
limitations.  I did so in order to make a point and to hopefully help
give some rationale for some other related proposals.

funlink() is the most natural approach that someone naive of the
internals of Unix filesystems might think of for safely getting away
without using temporary directories for temporary files.
Indeed the lack of a funlink() call has been mentioned, and the
consequences of this lack discussed, by several experts in secure
programming practices.

However as I've already said I do agree it is less practical to actually
implement funlink() than it is to simply use the existing mechanisms
which do allow safe use of temporary files in temporary directories.
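
A minimal sketch of one such existing mechanism, assuming mkdtemp() is available: create a private sub-directory under /tmp, keep the temporary file inside it, and remove both afterwards.  The function name and the "myprog" prefix are made up for illustration.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

/* Create, use and remove a temporary file inside a private directory. */
static int use_private_tmpdir(void)
{
	char dir[] = "/tmp/myprog-XXXXXX";
	char file[sizeof dir + sizeof "/scratch"];
	struct stat sb;
	int fd, n, rc = -1;

	if (mkdtemp(dir) == NULL)	/* created with mode 0700 */
		return -1;
	n = snprintf(file, sizeof file, "%s/scratch", dir);
	assert(n > 0 && (size_t)n < sizeof file);

	/* Nobody else can create or replace names in here. */
	if (stat(dir, &sb) == 0 && (sb.st_mode & 077) == 0) {
		fd = open(file, O_WRONLY | O_CREAT | O_EXCL, 0600);
		if (fd != -1) {
			/* ... write temporary data through fd ... */
			close(fd);
			if (unlink(file) == 0)
				rc = 0;
		}
	}
	if (rmdir(dir) == -1)
		rc = -1;
	return rc;
}
```

Because only the owner can modify the sub-directory, the later unlink() by pathname cannot be redirected by another user, and the final rmdir() only ever removes an empty directory.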

Unfortunately I still see too many NetBSD programs creating and removing
temporary files directly in /tmp and /var/tmp, and of course the default
sysinst still creates those directories on filesystems which also
contain other sensitive information.  While setting the sticky bit on
those directories can protect most such programs from the most obvious
attacks, I don't believe the most important of these programs
(i.e. those most often run as root) will actively check to make sure
that either the sticky bit is set or at least that the directory is a

I guess you're not someone who enjoys a good academic discussion even if
its merit is only academic, and/or you don't care that people can learn
from it and that it can lead to true innovation of related things,
e.g. O_MKDIR.  I hadn't really considered O_MKDIR before, but having the
occasion to re-read in the right context code designed to facilitate the
safe creation and disposal of temporary files where ...
From: Matthias Buelow
Date: Monday, July 14, 2003 - 7:19 am

Greg, in my opinion it is.  The functionality you describe for
sure is something desirable but it probably would be better kept
as an ordinary utility function in libc, perhaps with a different
name.  As a syscall, as I have voiced in my opinion, it simply
doesn't fit (imho, of course).

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Monday, July 14, 2003 - 9:08 am

I really don't want to sound condescending, but I do get the distinct
feeling that you don't really understand the underlying reason why
funlink() is desirable in the first place.

The underlying goal of having a system call that can unlink a file when
given a file descriptor open on that file is to avoid an unfortunately
common insecure programming technique commonly called a "Time-Of-Check,
Time-Of-Use (TOCTOU) race condition".  Calls to unlink() are vulnerable
if they are passed the fully qualified pathname of a file that was
created in or under an insecure (i.e. world-writable) directory, even if
that path is checked for vulnerabilities and the file's metadata is
compared to that of the originally created file before the unlink() call
is made.  Implementing funlink() in userland would simply move the race
condition to a new place and thus be no fix at all.  However a
system-call implementation of funlink() could ensure the new race
condition is impossible, thus ensuring the functionality fulfils the
underlying requirements.

Indeed it would be dangerous to imply that a userland implementation of
funlink() could do something that, as a userland implementation, it most
certainly could not possibly do.

As we've explored, funlink() doesn't make sense as a solution for unix-like
systems for very different and far more practical reasons, and it is the
explanation of those reasons that leads to learning what must be done
instead and how the alternatives could be optimized.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>


From: Matthias Buelow
Date: Monday, July 14, 2003 - 9:34 am

IMHO, the best solution (albeit outside the established Unix framework)
would be to fully separate operations on directories and the flat file
system (inodes/device-numbers or equivalents)...  There would be an
operation, let's call it lookup() : pathname -> identifier, to
translate a symbolic pathname into a more or less opaque identifier
(similar to an fd) which is unique and reversible for both the
referenced file and the directory entry during the period of its
allocation to the process.  Open(2) would then take this identifier to
actually open the file, not a pathname.  The advantage would be that
the application would have a handle on the actual directory entry,
other than the volatile pathname.  One could then use something like
funlink() on that identifier to delete the directory entry and simulate
unlink() without having to care for the case that a new entry with the
same name has been established in the directory in the meantime.
Unfortunately this poses several problems: some filesystems cannot
easily produce such an indirection, it has to be emulated on them
(should be feasible, though), and it doesn't work with the current
established Unix filesystem API.  It somewhat surprises me in hindsight
that such an approach was not taken in the original Unix
implementation.  However, I guess, this is influenced by the fact that
most likely the open()-style API was established already before Unix
got real directories (and perhaps also to keep it stupid simple.)

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Monday, July 14, 2003 - 11:30 am

Yes, "file handles" as some folks call them.  They are effectively
vnodes in the *BSD terminology.


We already have fhopen(2), fhstat(2), and fhstatfs().

Of course these calls (including getfh()) are currently restricted to
the superuser because they have not had the necessary ACL semantics
defined for them.

We are missing at least fhchdir(2) (though for the superuser this can be

You can't do that (directly) in any filesystem that has all of the same
semantics as a unix filesystem.

File names are just pointers to files that exist in special files called
directory files.  By convention the first file (inode #0) in a
filesystem is also the root directory for the namespace we lay over top
of the filesystem.  By convention we have the first two entries in a
directory file point to the directory file itself and the parent
directory file.

However by convention we do not have the filename(s) recorded in the
files themselves and thus the only way to find the name for a file is to
traverse the directory structure until one encounters a name pointing to
the file in question.  Of course since a file may have more than one
name there's never any sure way to know if the name encountered is the
one the user had in mind for such a multi-named file.  Finding all the
names for a file is of course possible (especially since the link count
tells us how many to look for), but it still doesn't help decide which
was intended by the user.

Note that we don't want to try to record the filename(s) in the file
because there would be significantly more overhead and complication to
maintain those "reverse pointers", especially if you consider the number
of possible updates needed in a hierarchical filesystem for an operation
such as "mv /usr /user".  We also don't want to do this because we don't
want to have to allocate variable numbers of disk blocks for one
inode (which we would likely end up having to do sometimes if a file had
many names, even on filesystems with large blocks).

Once you begin down the path ...
From: Matthias Buelow
Date: Monday, July 14, 2003 - 12:46 pm

Ok, I didn't know about file handles.  They're also not what I meant.
I wanted a handle on the directory entry (essentially the pathname) as
some kind of invariant representation of a particular directory entry
at a discrete point (originally specified by pathname).  This would
then be used instead of a pathname in open(2) etc.  That this handle
also points to a specific file (through the dir-entry's data) is a
rather less interesting detail in this context.  It's just so that the
application's got a more direct grip on the directory entry through
which a file is to be opened, for later use (such as your proposed
funlink()).  The particular entry in the directory would be marked
somehow (or locked), so if the entry gets unlinked in the meantime it
won't get reused and overwritten immediately -- something like a zombie
entry.  This would persist until the process releases the handle, or
exits.  funlink() could then be implemented as follows.  You want to
preserve atomicity for security reasons; no problem -- you could have a
system call which, instead of expecting a pathname like unlink(), would
accept such a directory entry handle, the same one conveniently passed
to open() instead of a real pathname in your particular program (but it
is not necessary to open the file, of course.)  The system would know
which entry in which directory exactly was used for opening a
particular file (it would associate that information with this
handle.)  It could then unlink the entry from the directory and do the
other cleanup stuff (like decrementing the counter in the inode, etc.)
without having to fear colliding with a newly generated pathname of the
same name that was originally used to obtain the directory entry
handle.  Of course this is of theoretical nature; building an extra API
for that would be ugly, unportable and few people would use it
(especially not existing applications).  If at all, the underlying API
of open() etc. would have to be changed to this design, which is not
feasible, ...
From: Greg A. Woods
Date: Monday, July 14, 2003 - 1:58 pm

Oh, I know they're not exactly what you meant, but you can't (easily)
have what you meant -- it's next to impossible (and certainly not ever
practical) to implement what you wanted with any unix-like hierarchical
filesystem that allows multiple hard links and rename operations on
directories.

What you're suggesting is massively more complex and obviously much more
invasive than my simple little funlink(2) idea!  ;-)

(and it doesn't seem to offer much beyond what funlink() and/or

But a directory handle isn't a representation of a pathname (though one
can intuit the name of the directory by walking back down to the root
directory and ascertaining the pathnames of each previous directory
along the way).

You'd have to do this for all the directories in the path, and you'd
have to somehow lock all the directories in the path in order to make
sure that every rename() involving any such directory could
co-operatively update those handles.

I.e. you're just beginning to enter the twisty little maze of passages

If you want to keep the open(2) API intact then you could try to
implement it as a function call that's the moral equivalent of
fhopen(getfh(path)).  Either way the only logical thing to do (without
affecting the open() API) is to make it easy to translate a file
descriptor back into the file handle it came from.

I haven't thought of the implications of having file descriptors in
user-land, though presumably everything could be transmuted to use file
handles, including even descriptor passing through AF_LOCAL sockets.
The tricky parts involve seek pointers and such -- and that's something
I always get very confused about unless I diagram it all out on a big

That's part of the problem, not the solution.  Please study the example
safe_dir() implementation in the book I referenced.  You can find it in
here:

	http://www.buildingsecuresoftware.com/bss_examples-1.0.tar.gz


Unfortunately that's not (yet) true.  To quote from the fhopen(2) manual
page on NetBSD:

 ...
From: matthew green
Date: Monday, July 14, 2003 - 8:55 pm

Yes, "file handles" as some folks call them.  They are effectively
   vnodes in the *BSD terminology.

"file handles" and "vnodes" are not the same thing.  i can have
multiple file handles for the same vnode... 


From: Bill Studenmund
Date: Tuesday, July 15, 2003 - 10:31 am

How so? There's a 1-to-1 correspondence between the output of getfh() (and
VFS_VPTOFH()) and the in-core vnode, which is rather important for an NFS
server. ;-)

Take care,

Bill



From: Bill Studenmund
Date: Tuesday, July 15, 2003 - 10:34 am

Because if you have the file handle, then you can spoof NFS traffic to it,
and spoof your UID to be something that can access the file. Or you can
snoop NFS traffic and get a file handle, then use fhopen() to open
something you don't have path permissions to access. Thus for now you have
to be root, as these calls can circumvent too many security checks. ;-)

Take care,

Bill



From: Bill Studenmund
Date: Tuesday, July 15, 2003 - 12:00 pm

The reason we have the three fh calls we do is because when I created
them, I didn't see any need for anything else. Once we have fhopen(2), we
have all of the fxx(2) calls. fhstat(2) and fhstatfs(2) are two calls
that made sense to be usable w/o having to open the file (say if it's a
device :-) . All the others were ones where it looked like it was better
to have the file open, or ones where it didn't seem fundamentally
important enough to have a syscall when you could just fhopen() then
f_whatever_().

Take care,

Bill



From: Greg A. Woods
Date: Wednesday, July 16, 2003 - 12:05 am

Yes, true enough -- unless fhopen(2) enforces the access rights checks
that open(2) would have enforced.  In that case fhchdir(2) gives one the
same advantage that could be afforded by O_NOACCESS.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>


From: Ben Harris
Date: Monday, July 14, 2003 - 11:06 am

NetBSD actually has a mechanism that works like this.  Look at getfh(2) and
fhopen(2).  It's designed for implementing NFS servers, so there's no

You suggested that the flat file-system would use i-nodes, not directory
entries.  The distinction is important (and file handles refer to i-nodes,
so they wouldn't actually be useful for funlink()).

-- 
Ben Harris


From: Bill Studenmund
Date: Tuesday, July 15, 2003 - 11:47 am

As an observer of this whole exchange, I'd have to say that a lot of us
don't have the "understanding" of why funlink() is desirable, because no
one who's championing "it" has really defined it nor explained what it's
to do. As best I can tell, you and greywolf have had different things in

Then my suggestion is funlink(fd, path). The call does a nami lookup on
path, and if it gets the same thing as fd, removes it. And it can return
different errors depending on: unable to lookup path (permissions along
the way), unable to unlink (write permissions on parent dir), no such
file, file and fd differ, multiple links(*), or file already unlinked (0
links).

(*) Some flag may be in order to indicate if the unlink should proceed if
there are multiple links yet the one we're asked to remove exists; an
unlink in this case should get a different error code. While this is an
error in the "I've made a temp file and tried to unlink it" case, there
are other cases where I can see it may be useful.

Since the program should (or can) know the path the file should have, let
it take care of remembering it. That way only the cases that need this
bother with it.
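
A rough userland approximation of the funlink(fd, path) semantics described above might look like the following C sketch. The enum names, the function name and the allow_multiple_links flag are all hypothetical. Because the comparison and the unlink() are separate system calls, this version keeps the race that an in-kernel implementation would close, so it only illustrates the error cases.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical result codes, one per failure case listed above. */
enum funlink_result {
    FUNLINK_OK = 0,
    FUNLINK_BAD_FD,            /* fstat() on the descriptor failed */
    FUNLINK_ALREADY_UNLINKED,  /* file has 0 links */
    FUNLINK_LOOKUP_FAILED,     /* path missing or not searchable */
    FUNLINK_DIFFERENT_FILE,    /* path and fd name different files */
    FUNLINK_MULTIPLE_LINKS,    /* more than one link and not allowed */
    FUNLINK_UNLINK_FAILED      /* e.g. no write permission on parent */
};

enum funlink_result
funlink_sketch(int fd, const char *path, int allow_multiple_links)
{
    struct stat by_fd, by_name;

    assert(path != NULL);
    if (fstat(fd, &by_fd) == -1)
        return FUNLINK_BAD_FD;
    if (by_fd.st_nlink == 0)
        return FUNLINK_ALREADY_UNLINKED;
    if (lstat(path, &by_name) == -1)
        return FUNLINK_LOOKUP_FAILED;
    if (by_fd.st_dev != by_name.st_dev || by_fd.st_ino != by_name.st_ino)
        return FUNLINK_DIFFERENT_FILE;
    if (by_fd.st_nlink > 1 && !allow_multiple_links)
        return FUNLINK_MULTIPLE_LINKS;
    /* Userland only: the name can still change before this call. */
    if (unlink(path) == -1)
        return FUNLINK_UNLINK_FAILED;
    return FUNLINK_OK;
}

/* Usage sketch: the second call sees a file that has no names left. */
int funlink_sketch_demo(void)
{
    char path[] = "/tmp/funlink-sketch.XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    enum funlink_result first = funlink_sketch(fd, path, 0);
    enum funlink_result second = funlink_sketch(fd, path, 0);
    close(fd);
    return first == FUNLINK_OK && second == FUNLINK_ALREADY_UNLINKED;
}
```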

Take care,

Bill



From: ww
Date: Tuesday, July 15, 2003 - 12:09 pm

Not really -- file descriptors inherited from parent processes:
shell redirections and such.

But the problem is not race conditions in world writeable directories
-- that is already solved. Sticky bits, etc. Don't make mode 777
directories for temporary files. I can't see why we need a new system
call either.

Are we certain the horse is dead yet?

Franchement!

-w


From: Bill Studenmund
Date: Tuesday, July 15, 2003 - 12:40 pm

Yes, but come on, how many programs really really are going to want to

What Greg is talking about here is a way a program can make & remove temp
files and know they are gone all by itself. What you're talking about
requires administrative assistance. While it is easy to do, the program
has to trust the admin to have gotten things right.

While I'm not saying get rid of sticky bits and such, I can see the
utility of a call that a program can use to make SURE that a file has been
unlinked.

The only way you can be really sure the path and the file descriptor are
the same is if you do the comparison & removal without unlocking the
vnode. Since we're talking about vnode locks, we're talking about code in
the kernel. Thus a system call. :-)

Take care,

Bill



From: der Mouse
Date: Tuesday, July 15, 2003 - 12:49 pm

That is easy enough to do - just fstat() the fd and check that its link
count is zero.

The problem is not making sure that it got unlinked.  The problem is
making sure you don't unlink something else by mistake.
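
The fstat() check mentioned here can be sketched in C as follows. The helper and demo names are invented for illustration, and the sketch assumes a local filesystem that reports st_nlink as 0 once the last name is gone (some network filesystems may behave differently).

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>
#include <unistd.h>

/* 1 if the open file has no names left, 0 if at least one remains,
 * -1 if fstat() fails. */
int fd_is_unlinked(int fd)
{
    struct stat st;

    if (fstat(fd, &st) == -1)
        return -1;
    return st.st_nlink == 0;
}

/* Usage sketch: the answer flips from 0 to 1 across unlink(). */
int fd_is_unlinked_demo(void)
{
    char path[] = "/tmp/nlink-check.XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    int before = fd_is_unlinked(fd);
    int rc = unlink(path);
    int after = fd_is_unlinked(fd);
    close(fd);
    return rc == 0 && before == 0 && after == 1;
}
```

This confirms that the file is gone; it says nothing about whether the unlink() that ran beforehand hit the intended name.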

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Ignatios Souvatzis
Date: Tuesday, July 15, 2003 - 1:02 pm

Hi,


Some 31ee7 hAx0r group will surely find a way to, err, make use of
unlinking stdin/stdout, if they find a vulnerable networked program.

Seriously: think of passing some open descriptors to plug-ins , then letting
_them_ dispose of the file... although you could open, unlink, then call
the plugin, if it doesn't need the filename.


	-is
-- 
seal your e-mail: http://www.gnupg.org/
From: Bill Studenmund
Date: Tuesday, July 15, 2003 - 1:28 pm

Since you're passing the fd in, why can't you pass in the path?

I'm not saying it's unreasonable for a module to unlink descriptors. I'm
saying it is unlikely that they will need to unlink (as opposed to close)
ones for which it can't find/know the path.

I'd really expect that in something like this, you'd do the unlink in the
same code block that did the open. That way, if you have an error
unlinking, you don't use "not-really anonymous" temporary files.

Take care,

Bill



From: der Mouse
Date: Monday, July 14, 2003 - 4:56 pm

I'm (still?) not convinced this is a problem.  Given O_EXCL, stat() and
fstat(), and sticky directories, I can't see what the danger is.

I discount admins doing things like leaving /usr world-writeable, or
/tmp non-sticky.  I consider that akin to leaving /dev/mem mode 666:
you can do it, but you have nobody to blame but yourself.  And it most
certainly is not for the system to protect you from the consequences.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greywolf
Date: Monday, July 14, 2003 - 9:24 am

Thus spake Greg A. Woods ("GAW> ") sometime Saturday...

GAW> >  I don't know where you got this resistance to
GAW> > comprehending that unlink destroys links to files, only secondarily
GAW> > affecting the files themselves, but it's clear to me that you have it.
GAW>
GAW> A file is more than its contents -- it is all the metadata that controls
GAW> access to the content and allows the content to sit on the same storage
GAW> media along with many other distinct files.

Gee, didn't you just say that "a file is more than its metadata -- it is
all the content" not too long ago?

GAW> Unlink() cannot, and luckily does not, "just" destroy links to files.

Actually, that is all it does.

The fact that the data goes away and gets GCd when the link count and the
reference count (in-core) go to zero is a side effect of the link count and
the reference count (in-core) going to zero.

GAW> The primary purpose of unlink() is _also_ to decrement the link count in
GAW> the file metadata.  Unlink() _always_ acts on both a file and the
GAW> directory entry which points to that file.

You're really trying to force the data and the metadata to be equal.
They are not -- the metadata supersedes the data; the data
is a child of the metadata.   If the metadata goes away, there is nothing
to hold onto the data.  This is UNIX by design.  The only thing that
binds a pathname to a set of data is the fortuitous bunch of hoops that
we must jump through to resolve to an inode, i.e. metadata.

GAW>  However directory entries
GAW> (names) are just pointers to the real files.

Correction:  They are pointers to metadata (inodes).  What those inodes
reference is determined once they are accessed; what the references are
specifically are determined once they are open()ed.

GAW> The only part of what
GAW> unlink() also often does in addition to those first two critical
GAW> functions, which could safely be left for some cleanup daemon to do,
GAW> would be the moving of data block pointers from inodes ...
From: Greg A. Woods
Date: Monday, July 14, 2003 - 10:44 am

True, but the modification of the link count "MUST" be done at the very
same time that the directory entry is modified (just afterwards, but
before the unlink() call returns).

I.e. unlink() does do more than just clear the inode number in the
directory entry, and it does, and "MUST", modify the file itself as
well, and at the same time.
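
The effect on the file's own metadata is easy to observe from userland. The following C sketch (names and /tmp paths are invented for illustration) gives one file two names and reads the link count through an open descriptor after each step, assuming /tmp is on a filesystem that supports hard links.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdio.h>      /* snprintf() */
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>
#include <unistd.h>

/* Link count of the file open on fd, or -1 if fstat() fails. */
static long nlink_of(int fd)
{
    struct stat st;
    return fstat(fd, &st) == -1 ? -1L : (long)st.st_nlink;
}

/*
 * Records the link count after link(), after the first unlink(), and
 * after the second unlink().  Returns 0 on success, -1 on failure.
 */
int link_count_demo(long counts[3])
{
    char first[] = "/tmp/nlink-demo.XXXXXX";
    char second[sizeof first + 4];

    int fd = mkstemp(first);
    if (fd == -1)
        return -1;
    snprintf(second, sizeof second, "%s.ln", first);

    if (link(first, second) == -1) {    /* second name, same inode */
        unlink(first);
        close(fd);
        return -1;
    }
    counts[0] = nlink_of(fd);
    unlink(first);                      /* removes one name */
    counts[1] = nlink_of(fd);
    unlink(second);                     /* removes the last name */
    counts[2] = nlink_of(fd);
    close(fd);
    return 0;
}
```

Each unlink() changes what fstat() reports for the same descriptor, which is the sense in which unlink() modifies the file and not only the directory.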

(if you want to talk about side effects then talk about the modification
of the timestamp fields in the file, or other TRUE side effects)

I'm not trying to force the issue because I don't need to.  They are
exactly equivalent.  You cannot have a file without its metadata, though
you can have an empty file, i.e. a file with no content data!

Thank you for confirming that a file is inherently the metadata, and
that the content data is only an adjunct to a true file (if and only if
the file is a regular file which can contain data, and if and only if
there is any content in the file).

I.e. don't forget other "empty" files, such as devices and FIFOs.  They

No, not a "correction" -- real files _are_ the "inodes".  Period.  The
data they contain, or the lack thereof, is irrelevant here.

File _NAMES_ are the pointers to the real files.  Please try to keep

No, I am not.  You have clearly not paid any attention whatsoever to the
core of this discussion.  funlink() as I've described it can in fact be
"trivially" implemented totally transparently to any existing kernel
function or data structure (in-core or on-disk).  Obviously such a
"trivial" implementation would not be the most efficient or optimized
implementation, but it would work just fine without changing any
existing filesystem semantics or syntax.

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>


From: Greywolf
Date: Monday, July 14, 2003 - 11:20 am

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> > GAW> Unlink() cannot, and luckily does not, "just" destroy links to files.
GAW> >
GAW> > Actually, that is all it does.
GAW>
GAW> Yes, OK, black is white.  Now you're just being stupid.

Coming from you, I'm going to take that as a compliment.

GAW> > The fact that the data goes away and gets GCd when the link count and the
GAW> > reference count (in-core) go to zero is a side effect of the link count and
GAW> > the reference count (in-core) going to zero.
GAW>
GAW> True, but the modification of the link count "MUST" be done at the very
GAW> same time that the directory entry is modified (just afterwards, but
GAW> before the unlink() call returns).

And this relates to the price of beer in Germany just how, again?

GAW> I.e. unlink() does do more than just clear the inode number in the
GAW> directory entry, and it does, and "MUST", modify the file itself as
GAW> well, and at the same time.

Ah.  I see what you're getting at.  You're trying to tell me that the file
is the inode, and that the data being connected with said inode is a side effect.
I get it.

(Not.)

GAW> (if you want to talk about side effects then talk about the modification
GAW> of the timestamp fields in the file, or other TRUE side effects)

Yes, those are there, too.

GAW> > You're really trying to force the data and the metadata to be equal.
GAW>
GAW> I'm not trying to force the issue because I don't need to.  They are
GAW> exactly equivalent.  You cannot have a file without its metadata, though
GAW> you can have an empty file, i.e. a file with no content data!

You cannot have a file's *data* without its metadata.  You can have metadata
that corresponds to zero allocation to data (think "device").

GAW> > They are not -- the metadata supersedes the data; the data
GAW> > is a child of the metadata.   If the metadata goes away, there is nothing
GAW> > to hold onto the data.  This is UNIX by design.
GAW>
GAW> Thank you for confirming that a ...
From: Matthias Buelow
Date: Monday, July 14, 2003 - 1:51 pm

Actually they're talking about increasing the alcohol tax.. I hope
for Greg he doesn't have a hand in it...

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Greg A. Woods
Date: Monday, July 14, 2003 - 2:43 pm

Thank you, finally.  FYI you'll find essentially the same definition in
all the serious books about Unix internals.

Perhaps you've never done anything even remotely like "find -x
/mountpoint -inum 12345 -print".  You can't have.  Otherwise you'd
probably understand at least some of what I'm talking about.  (Yes,
that's meant entirely as sarcasm.)

It's not impossible, or even necessarily impractical given the lack of
better alternatives for safe and secure programming in some situations,
so, it is still something to think about.

I do think der Mouse's O_NOACCESS flag (and perhaps my O_MKDIR flag as
well), along with the extra fchdir() and fstat() calls they require, is
better than funlink(2) since they have a better chance of completing
successfully, even in the normal case, and a potentially much better
chance of giving useful diagnostics in the failure case and doing so in
a timely fashion.  However until you understand what I'm talking about
you can't even begin to sanely discuss funlink(2) or its alternatives on
the same level playing field.

I.e. unless and until you can understand funlink(2) as I've described
it, together with all its implications and limitations and possible
optimisations, you cannot possibly even begin to make any fair or
reasonable assessment of any of the alternatives to this most obvious
solution to the underlying problem (i.e. the problem which prompted me
to open the discussion in the first place).

-- 
						Greg A. Woods

+1 416 218-0098                  VE3TCP            RoboHack <woods@robohack.ca>
Planix, Inc. <woods@planix.com>          Secrets of the Weird <woods@weird.com>


From: Greywolf
Date: Monday, July 14, 2003 - 4:05 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> > For your definition that file == inode, fine.
GAW>
GAW> Thank you, finally.  FYI you'll find essentially the same definition in
GAW> all the serious books about Unix internals.

Sorry, I always took "file" to mean "data", as opposed to a hard-line
definition that "file" == "inode".

GAW>
GAW> > name -> inode.  Great.  Fine.  Whatever.  N:1
GAW> > inode -> name.  1:N.  Pain in the patella.
GAW>
GAW> Perhaps you've never done anything even remotely like "find -x
GAW> /mountpoint -inum 12345 -print".  You can't have.  Otherwise you'd
GAW> probably understand at least some of what I'm talking about.  (Yes,
GAW> that's meant entirely as sarcasm.)

Oh, no, not at all.  I've NEVER been THAT adventurous.  ;)

[I refuse to buy into the premise that I'm as stupid as you seem to think.]

GAW> > As the semantics exist right now, it is not a practical idea.
GAW>
GAW> It's not impossible, or even necessarily impractical given the lack of
GAW> better alternatives for safe and secure programming in some situations,
GAW> so, it is still something to think about.

No, it's not impossible, but you must define the level of practicality
you're discussing here.

If you maintain a table of reverse lookups in conjunction with open/rename/
link/unlink/mkdir/rmdir, somewhere, that incurs the overhead of rewriting
the system calls to handle that; if, at any time, that scrambles the API,
it becomes impractical, because things will then not behave as defined.

As it stands, such a rewrite is unlikely.

The other choice is to trudge through every directory on the filesystem
and perform unlink()s (effectively) on them.  This is time-consuming;
seeing as system calls which do this sort of thing are not expected to
take quite that long, and that they are expected to lock objects until
they complete, this approach is impractical.

GAW> I do think der Mouse's O_NOACCESS flag (and perhaps my O_MKDIR flag as
GAW> well), along with the ...
From: Greywolf
Date: Saturday, July 12, 2003 - 2:11 am

Thus spake Greg A. Woods ("GAW> ") sometime Today...


GAW> > could see would be to set the reference counter in the inode to zero,
GAW>
GAW> No, not zero -- just decremented by one (assuming the proper directory
GAW> entry can be found).

Greg, file descriptors are not associated with pathnames.  There is no
"proper directory entry".

unlink() operates on a pathname (i.e. dirent).  Only.  That's it.

GAW> No, absolutely NOT!  unlink() doesn't do this for VERY good reasons, and
GAW> funlink() could not do so either.

unlink() operates on a pathname from which an fd might otherwise
be generated.  funlink would work on an fd, from which it is not possible
to divine a unique name.  Matthias is saying effectively that funlink()
is tantamount to clri(), plus revoke() semantics on all open fds to the
node that just got funlink()ed (something I'm not altogether sure I agree
with, since you might actually want the reference, but not the physical
object, to remain present.).

In short, he gets the concept.

GAW> >  It would indeed delete the file, not a
GAW> > directory entry, which is literally what the caller requested.
GAW>
GAW> That's completely wrong given how I've described funlink().

You want a 1:1 relationship between filehandles and pathnames, more or
less.   Who's been taking lessons from M$, now?

				--*greywolf;
--
NetBSD: We Suck Less


From: Greg A. Woods
Date: Saturday, July 12, 2003 - 10:57 am

How is it that suddenly you have absolutely no imagination at all?

File descriptors are associated with files.

Filenames are associated with files.

If you can't make the connection implied by these two axioms then I
would humbly suggest you're not seeing the whole picture with even
remotely enough clarity and understanding to make this discussion useful

No, I absolutely do not.  If you think this then by now I can only
conclude that you have not read all of what I wrote carefully enough.

However I do know that for the case where a file has a link count of
exactly one then there is guaranteed to be a one-to-one relationship
with one particular pathname in the directory tree of the filesystem
that file belongs to.  Because of this guaranteed truth I know that it
is possible, if somewhat costly, to implement funlink() as I've
described it such that it could be safe and useful to use in the vast
majority of situations where it might make sense to be used.  I also
know that one could spread the cost around a little bit and further
increase the reliability of funlink(), as I've described it.
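
The costly search alluded to here (the moral equivalent of "find -inum") can be sketched in userland C with nftw(3). This is only an illustration of the tree walk, not the proposed system call: the function names are hypothetical, the walk is not atomic with respect to other processes, and it refuses to answer unless the link count is exactly one.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <ftw.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* nftw() callbacks take no user argument, so the search state is global. */
static dev_t want_dev;
static ino_t want_ino;
static char found_path[PATH_MAX];

static int match_inode(const char *path, const struct stat *sb,
                       int type, struct FTW *ftw)
{
    (void)ftw;
    if (type != FTW_NS && sb->st_dev == want_dev && sb->st_ino == want_ino) {
        snprintf(found_path, sizeof found_path, "%s", path);
        return 1;               /* non-zero stops the walk */
    }
    return 0;
}

/* Finds the single name of the file open on fd somewhere under root.
 * Returns 0 and fills out on success, -1 with errno set otherwise. */
int name_for_fd(int fd, const char *root, char *out, size_t outlen)
{
    struct stat st;

    assert(root != NULL && out != NULL);
    if (fstat(fd, &st) == -1)
        return -1;
    if (st.st_nlink != 1) {     /* zero or several names: refuse to guess */
        errno = EMLINK;
        return -1;
    }
    want_dev = st.st_dev;
    want_ino = st.st_ino;
    /* FTW_PHYS: do not follow symlinks; FTW_MOUNT: stay on one filesystem. */
    int rc = nftw(root, match_inode, 16, FTW_PHYS | FTW_MOUNT);
    if (rc != 1) {
        if (rc == 0)
            errno = ENOENT;     /* walk finished without a match */
        return -1;
    }
    if (strlen(found_path) >= outlen) {
        errno = ENAMETOOLONG;
        return -1;
    }
    strcpy(out, found_path);
    return 0;
}

/* Usage sketch: create one file in a private directory and find it again. */
int name_for_fd_demo(void)
{
    char dir[] = "/tmp/ftw-demo.XXXXXX";
    char path[64], found[PATH_MAX];

    if (mkdtemp(dir) == NULL)
        return -1;
    snprintf(path, sizeof path, "%s/file", dir);
    int fd = open(path, O_CREAT | O_EXCL | O_RDWR, 0600);
    if (fd == -1) {
        rmdir(dir);
        return -1;
    }
    int ok = name_for_fd(fd, dir, found, sizeof found) == 0 &&
             strcmp(found, path) == 0;
    unlink(path);
    close(fd);
    rmdir(dir);
    return ok;
}
```

The walk visits every entry under the starting directory in the worst case, which is where the cost comes from when the starting point is the root of a large filesystem.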

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: der Mouse
Date: Monday, July 14, 2003 - 3:00 pm

Yes.  Therefore, a file descriptor can in principle be mapped to a set
of pathnames.  (And a pathname can in principle be mapped to a set of
file descriptors.)

You are trying to add a way to go from a file descriptor to one
particular member of that set of pathnames.  (And then do something
with it, but the hard part is that mapping.  Depending on which message
I read, either you're willing to error out unless the set has only one
element, or you want a rather ill-defined member of the set, one
somehow related to how the file descriptor was obtained.)

This is possible to do.  It is also a major philosophical shift, with
the corresponding design shift that implies.  It may be an interesting
thing to consider when designing a new OS; it is of no particular value
for an existing one with an existing commitment to the present design
(and the philosophy behind it).

Hint: this list is about the latter.

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greg A. Woods
Date: Monday, July 14, 2003 - 5:07 pm

I think I've always been willing to have the funlink() call fail with an
error if it finds the link count of the file to be greater than one.

In any case that has always been my intent.  The ultimate goal is to
find a safer way for a process to unlink a temporary file.  The ability
to do this to a file passed via stdin, etc., is only a secondary feature.
(and one for which I still can't think of any good application  :-)

The caching of the opened filenames obviously optimizes the
implementation of funlink(2) significantly, but of course at some
expense that must be shared with all open() calls (though I suppose
another O_CACHENAME flag could be added to help tune it :-).

The caching of the opened filename also helps deal with the case where
the file has gained additional links, but the original pathname still
points to the opened file.

In fact I'd be just as happy to have funlink() fail if the original
pathname no longer existed -- i.e. do away with the internal ftw() idea
entirely and rely only on the cached opened filename, especially if the
need for funlink(2) was anticipated (as it normally would be) such that
the filename caching would only be done when explicitly requested.

Indeed one would expect no better chance of success if some third-party
process had renamed the file to be unlinked behind one's back, so to
speak.  The thing we're trying to avoid is some symlink replacing a
directory in the pathname of the file being renamed such that the wrong file
is unlinked.  Obviously several other things have to be broken for such
a situation to result in a true vulnerability, but still the
availability of funlink(2) would eliminate the need to always carefully
fchdir() into a safe directory, and indeed may eliminate some of the

I see no philosophical shift implied by my funlink(2) proposal, least of
all any that one might call "major".

The "trivial" implementation of funlink() is (logically at least) no
different than using "find -inum", but of course it can be made ...
From: der Mouse
Date: Monday, July 14, 2003 - 5:42 pm

> [Greg Woods going on about funlink()]

If I've understood you right, it would be just as satisfactory to have
a call unlink_if_pathname_matches_fd(const char *path, int fd), which
would work like unlink() applied to its first argument, but only if
that pathname resolves to the same object the fd is attached to.
(Modulo name issues, of course; I'm not thrilled with that name. :-)

For some uses, it might be even better to have an O_UNLINK flag, which
would be abstractly equivalent to opening the file with O_CREAT|O_EXCL,
but then unlink()ing the path, the critical part being that this
sequence is atomic with respect to all other filesystem operations.
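
Without such a flag, the usual userland approximation is to create the file exclusively and unlink it straight away, as in this C sketch. The function names and the /tmp template are invented for illustration. Unlike the hypothetical O_UNLINK, the two steps here are not atomic, so the name is briefly visible to other processes.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <stdlib.h>     /* mkstemp() */
#include <sys/stat.h>
#include <unistd.h>

/* Returns a descriptor for a temporary file that no longer has a name,
 * or -1 with errno set. */
int open_unlinked_tmp(void)
{
    char path[] = "/tmp/o-unlink.XXXXXX";
    int fd = mkstemp(path);     /* creates exclusively, as with O_CREAT|O_EXCL */

    if (fd == -1)
        return -1;
    /* Not atomic: the name exists between mkstemp() and unlink(). */
    if (unlink(path) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;
        return -1;
    }
    return fd;
}

/* Usage sketch: the nameless file is still readable and writable. */
int open_unlinked_tmp_demo(void)
{
    struct stat st;
    int fd = open_unlinked_tmp();

    if (fd == -1)
        return -1;
    int ok = write(fd, "abc", 3) == 3 && fstat(fd, &st) == 0 &&
             st.st_nlink == 0 && st.st_size == 3;
    close(fd);
    return ok;
}
```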

/~\ The ASCII				der Mouse
\ / Ribbon Campaign
 X  Against HTML	       mouse@rodents.montreal.qc.ca
/ \ Email!	     7D C8 61 52 5D E7 2D 39  4E F1 31 3E E8 B3 27 4B


From: Greg A. Woods
Date: Saturday, July 12, 2003 - 1:47 am

I think you're forgetting that a "unix file" includes its metadata
(especially if you're talking about kernel internals) and unlink() most
certainly always operates directly on the metadata of a file, even if
the link count is greater than one since the link count is always

Unlink() _also_ operates on directories, but most importantly it

Ah, no, that's wrong.  The kernel most definitely always decrements the
link count of an inode when unlink() is called.  It _also_ zeros out the

The directory reference is only a part of the picture -- if you ignore
the link count in the inode then you have failed to understand the unix

The fact that the unix filesystem is primarily a table of inodes is no

No, one could not conceive of such a thing since it would have to be run
either in single user mode or while the filesystem was not mounted (or
while the whole filesystem is locked from modification).  Do you forget
that Unix systems are inherently multi-processing systems?  It is
absolutely fundamental and critical that unlink() modify the file's link
count (as well as of course freeing the directory entry).

(in fact the moral opposite is true -- it would be possible to not
immediately free the directory entry if inodes included two reference
counts instead of just one such count since an attempt to reference a
file with no valid links but only remaining directory references could

No doubt -- but there it is none the less.  :-)

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Greywolf
Date: Friday, July 11, 2003 - 7:20 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> But that's the whole point of funlink() (and perhaps even some of the
GAW> other f*() calls, such as fchdir()) -- turn an operation on a file into
GAW> an operation on a filename (i.e. a link to a file).

<setattr param="baud" value="110">

fchdir does not operate on a filename.  It operates on a DESCRIPTOR which
ultimately HAS NO MEMORY, CACHE or CONNECTION _whatsoever_ to the original
NAME.  This DESCRIPTOR ultimately contains a reference to a VNODE which
points to a FILESYSTEM OBJECT, i.e. METADATA.

There is a difference between METADATA and a FILENAME.  The FILENAME
gets you the location, ultimately, of the METADATA, at which point the
ONLY THING that is going to remember the FILENAME is going to be the *user's
program*.

The things that are pretty much stopping funlink from being practical
are that potentially many names correspond to a particular file, and
there needs to be some cached pointer to the dirent being referenced.
QED.

Now, is there anyone ELSE (put your hand down, Greg) who wishes to
refute this?

				--*greywolf;
--
NetBSD: the free unix for the rest of us.


From: Roland Dowdeswell
Date: Friday, July 11, 2003 - 8:03 am

On 1057854937 seconds since the Beginning of the UNIX epoch

I think that funlink(2) as an idea doesn't really make sense.
unlink(2) specifically operates on directory entries, not on files.

--
    Roland Dowdeswell                      http://www.Imrryr.ORG/~elric/


From: Greg A. Woods
Date: Friday, July 11, 2003 - 10:58 am

Unlink(2) may specifically also operate on files as well as the link in
the parent directory, and it will always also operate on the file if
that file has only one link.

Strictly funlink() isn't necessary for secure programming provided you
have fchdir() and given you can always lstat() and then open() and then
fstat() the parent directory in which you need to do the unlink() and
thus to which you need to fchdir() before you call unlink(basename()).
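The lstat()/open()/fstat()/fchdir()/unlink(basename) sequence described above might look roughly like the following sketch. The function name and the error codes are illustrative assumptions, and the lstat() check only covers the final component of `dir`; a symlink earlier in the path is not detected by this code.

```c
#define _XOPEN_SOURCE 700	/* ask for POSIX.1-2008 declarations */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Unlink `name` inside `dir` after verifying the directory we opened. */
int
unlink_in_checked_dir(const char *dir, const char *name)
{
	struct stat before, after;
	int dfd, cwd, rc, saved;

	assert(dir != NULL && name != NULL);
	if (strchr(name, '/') != NULL) {	/* must be a bare basename */
		errno = EINVAL;
		return -1;
	}
	if (lstat(dir, &before) == -1)
		return -1;
	if (!S_ISDIR(before.st_mode)) {		/* also rejects a symlink here */
		errno = ENOTDIR;
		return -1;
	}
	if ((dfd = open(dir, O_RDONLY)) == -1)
		return -1;
	if (fstat(dfd, &after) == -1) {
		saved = errno;
		close(dfd);
		errno = saved;
		return -1;
	}
	if (after.st_dev != before.st_dev || after.st_ino != before.st_ino) {
		close(dfd);
		errno = EAGAIN;			/* arbitrary: dir changed under us */
		return -1;
	}
	if ((cwd = open(".", O_RDONLY)) == -1) {
		saved = errno;
		close(dfd);
		errno = saved;
		return -1;
	}
	rc = fchdir(dfd) == 0 ? unlink(name) : -1;
	saved = errno;
	if (fchdir(cwd) == -1 && rc == 0) {	/* could not return to start */
		rc = -1;
		saved = errno;
	}
	close(cwd);
	close(dfd);
	errno = saved;
	return rc;
}
```

Note that this changes the working directory of the whole process for a moment, so it is not safe as written in a multi-threaded program.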

Strictly funlink() is also going to incur a lot more overhead than one
or two additional fchdir() (and open()) calls.

Thus I agree it doesn't make a whole lot of sense to implement funlink()
unless you also want the ability to unlink a file you were handed on
stdin, for example.

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Greywolf
Date: Friday, July 11, 2003 - 1:42 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> Date: Fri, 11 Jul 2003 13:58:02 -0400 (EDT)
GAW> From: Greg A. Woods <woods@weird.com>
GAW> Reply-To: NetBSD Kernel Technical Discussion List
GAW>     <tech-kern@NetBSD.org>
GAW> To: Roland Dowdeswell <elric@imrryr.org>
GAW> Cc: NetBSD Kernel Technical Discussion List <tech-kern@NetBSD.org>
GAW> Subject: Re: funlink() for fun!
GAW>
GAW> [ On Friday, July 11, 2003 at 11:03:03 (-0400), Roland Dowdeswell wrote: ]
GAW> > Subject: Re: funlink() for fun!
GAW> >
GAW> > I think that funlink(2) as an idea doesn't really make sense.
GAW> > unlink(2) specifically operates on directory entries, not on files.
GAW>
GAW> Unlink(2) may specifically also operate on files as well as the link in
GAW> the parent directory, and it will always also operate on the file if
GAW> that file has only one link.

Your logic is flawed.  Please show how unlink(2) operates on files.

- it doesn't write to them.
- it doesn't create them.
- it doesn't even really destroy them, though it arranges for them to be
  potentially destroyed once the link count goes to zero and the filesystem
  reclaims the blocks associated with them.

				--*greywolf;
--
NetBSD:  exercised any daemons lately?


From: Greg A. Woods
Date: Friday, July 11, 2003 - 4:02 pm

Actually it does -- provided their link count is only one and soon to be
zero.  It doesn't (normally) write to their content of course -- just
their metadata (i.e. it writes a zero into the link count, amongst other
things!).

(a unix file is an inode and any data blocks the inode may point to,
including indirect blocks that point to other data blocks -- i.e. the
inode is a part of what we call a "file" and thus a file is (often)


How it works under the hood is irrelevant.  For all you know unlink() on
some super-paranoid system could implement a cryptographically secure
35-pass overwrite that must complete before the call returns.

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Greywolf
Date: Friday, July 11, 2003 - 4:20 pm

Thus spake Greg A. Woods ("GAW> ") sometime Today...

GAW> Unlink(2) may specifically also operate on files as well as the link in
GAW> the parent directory, and it will always also operate on the file if
GAW> that file has only one link.

No, it doesn't.  The only thing that happens to the file, per se, is a
side effect of the deletion of all physical entries associated with a
record of the data.   The operation of unlink(2) is only associated with
a file in that said file is a special type of file called a directory.
Otherwise, all that happens is the link count of the node representing
the file drops to zero, the last physical entry which names that inode
is removed, and, unless another process is holding open a file descriptor
associated with that inode, the data blocks of that inode are unallocated
and the inode is cleared.

Strictly speaking, this has *nothing* to do with the file and *everything*
to do with the file's metadata.  Sure, the end result is that the data is
lost, but when you have cleared all physical references to the metadata,
what else are you supposed to do? :-)
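Whichever side of the terminology one takes, the observable behaviour is easy to check from user space. The sketch below (function names are made up for illustration) records st_nlink through an open descriptor as each name is removed, then reads the data back after the last name is gone. On a traditional local filesystem one would expect the counts to go 2, 1, 0.

```c
#define _XOPEN_SOURCE 700	/* ask for POSIX.1-2008 declarations */
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Link count as seen through an open descriptor, or -1 on error. */
static long
nlink_of(int fd)
{
	struct stat st;

	return fstat(fd, &st) == -1 ? -1L : (long)st.st_nlink;
}

/*
 * Create `a`, hard-link it as `b`, then unlink both names while the
 * descriptor stays open.  counts[] receives st_nlink after each step.
 * Returns 0 if the data is still readable once no name is left.
 */
int
link_count_demo(const char *a, const char *b, long counts[3])
{
	static const char msg[] = "still here";
	char buf[sizeof msg];
	int fd, ok;

	assert(counts != NULL);
	fd = open(a, O_RDWR | O_CREAT | O_EXCL, 0600);
	if (fd == -1)
		return -1;
	if (write(fd, msg, sizeof msg) != (ssize_t)sizeof msg ||
	    link(a, b) == -1) {
		close(fd);
		unlink(a);
		return -1;
	}
	counts[0] = nlink_of(fd);	/* two names */
	unlink(a);
	counts[1] = nlink_of(fd);	/* one name */
	unlink(b);
	counts[2] = nlink_of(fd);	/* no names, descriptor still open */

	ok = pread(fd, buf, sizeof buf, 0) == (ssize_t)sizeof buf &&
	    memcmp(buf, msg, sizeof msg) == 0;
	close(fd);	/* last reference: blocks may now be reclaimed */
	return ok ? 0 : -1;
}
```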

To pinball off in another direction and re-iterate:  The only way that a
funlink() call can hope to work is to maintain a table of (vno_t **)
around which contain offsets into dirs represented by other (vno_t *), and
if you need to funlink() a file, you can at least get the offsets of all
the links, visit the dirs associated with them, clear their entries and
remove the (vno_t **) from the table.  We wouldn't even need to know pathnames
if we knew dir vps and offsets into them for the entries.
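As a toy illustration only, the back-reference table could be modelled in plain C like this. Every type and function name here is hypothetical (there is no vno_t in the sketch, and nothing resembling real vnode code): each "inode" remembers which directory slots point at it, so a funlink() can clear them all without knowing any pathname.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_SLOTS 8	/* entries per toy directory */
#define MAX_LINKS 4	/* back-references per toy inode */

struct toy_inode;

struct toy_dir {
	struct toy_inode *slot[MAX_SLOTS];	/* NULL means a free entry */
};

struct toy_link {			/* "which dir, at what offset" */
	struct toy_dir *dir;
	size_t offset;
};

struct toy_inode {
	int nlink;			/* number of valid back-references */
	struct toy_link links[MAX_LINKS];
};

/* Enter ip into dp at `offset` and record the back-reference. */
int
toy_link(struct toy_inode *ip, struct toy_dir *dp, size_t offset)
{
	if (offset >= MAX_SLOTS || dp->slot[offset] != NULL ||
	    ip->nlink >= MAX_LINKS)
		return -1;
	dp->slot[offset] = ip;
	ip->links[ip->nlink].dir = dp;
	ip->links[ip->nlink].offset = offset;
	ip->nlink++;
	return 0;
}

/* Visit every recorded dir/offset, clear the entry, drop the record. */
int
toy_funlink(struct toy_inode *ip)
{
	int removed = 0;

	while (ip->nlink > 0) {
		struct toy_link *l = &ip->links[--ip->nlink];

		assert(l->dir->slot[l->offset] == ip);
		l->dir->slot[l->offset] = NULL;
		removed++;
	}
	return removed;
}
```

A real implementation would also have to keep these records correct across rename(), link() and directory compaction, which this model ignores entirely.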

There's some details I am missing, I'm sure, but anyone reading this who
does anything with fs internals will probably get the concept.  I ran it
by a few people yesterday and they seemed to understand it well enough,
even if the design isn't fully fleshed out.

Whether or not we need funlink(2), though, remains to be seen.
I rather suspect we really don't.

Other things to consider:

Where does ...
From: Greg A. Woods
Date: Saturday, July 12, 2003 - 1:11 am

You've obviously forgotten what a file is (in a unix filesystem),
despite my many attempts to remind you.  You cannot, and must not,


Never.  But not for the reasons you seem to be implying.



As I've already said:  perhaps.  :-)

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: joerg
Date: Saturday, July 12, 2003 - 4:07 am

Well, for a traditional UNIX filesystem both can be implemented by *stat.
For local filesystems without ACLs too. As soon as ACLs are implemented
e.g. in FFS, NTFS, AFS to name a few *stat is _not_ enough to implement
access or faccess. Even implementing proper ACL checks in user space
might not be enough for network filesystems like AFS if you don't check
the actual credentials. So basically an implementation of access or

I like the idea of flink(2). With flink you could extend the creat(3)
syntax (or open(2)) to allow creation of anonymous files on a filesystem
given just one writable directory and later add a reference somewhere.
E.g. passwd creates an anonymous file, updates it with the content of
/etc/master.passwd and links it there. To cleanup mess with unexpected

Me too.



From: Greg A. Woods
Date: Thursday, July 10, 2003 - 9:37 am

Apparently.  There is only one anonymous object in the system -- why
would you need a "fake" pathname to represent it?  (there's no backing

That would be _very_ expensive in terms of VM for every process since

Indeed mmap(MAP_ANON) memory is zero-filled (and must be else the
garbage it revealed may violate someone's privacy because that garbage
would very likely have come from some other process).
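The zero-fill guarantee can be checked with a few lines of C. This is only a sketch with a made-up function name; it uses the BSD spelling MAP_ANON, and the _DEFAULT_SOURCE define is there only so that glibc exposes that flag under strict -std= modes.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map `len` bytes of anonymous memory and report whether every byte
 * reads as zero: 1 if so, 0 if not, -1 if mmap() failed.
 */
int
anon_mapping_is_zeroed(size_t len)
{
	unsigned char *p;
	size_t i;
	int zeroed = 1;

	assert(len > 0);
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_PRIVATE, -1, 0);
	if (p == MAP_FAILED)
		return -1;
	for (i = 0; i < len; i++) {
		if (p[i] != 0) {
			zeroed = 0;
			break;
		}
	}
	munmap(p, len);
	return zeroed;
}
```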

-- 
								Greg A. Woods

+1 416 218-0098;            <g.a.woods@ieee.org>;           <woods@robohack.ca>
Planix, Inc. <woods@planix.com>; VE3TCP; Secrets of the Weird <woods@weird.com>


From: Matthias Buelow
Date: Wednesday, July 9, 2003 - 11:34 am

I think it's rather elegant.. certainly much more so than the anon
version of bsd.. it builds on existing objects (/dev/zero in that
case) at least.

-- 
  Matthias Buelow;  mkb@{mukappabeta.de,informatik.uni-wuerzburg.de}


From: Matt Thomas
Date: Wednesday, July 9, 2003 - 11:13 am

I forwarded Bill Gallmeister your message, and he gave me the following
to use as a response:

And to comment on the email thread you forwarded:  hindsight's a beautiful
thing.  shm_open et al were designed to allow a trivial implementation atop
mmap (or atop the Ludicrous Sys V Interface (TM)), but at the time we came up
with the standard, I believe it was only the rocket scientists at Sun who had
mmap--no one else had ponied up to the memory==file proposition and all its
implications.  There wasn't even an INTERNET back then, for Chrissake.  Al Gore
hadn't been BORN.  We wrote the damn standard using OIL LANTERNS.  Okay, maybe
I'm exaggerating a little.  It was a DECADE ago!



-- 
Matt Thomas               Internet:   matt@3am-software.com
3am Software Foundry      WWW URL:    http://www.3am-software.com/bio/matt/
Cupertino, CA             Disclaimer: I avow all knowledge of this message 



From: Greg A. Woods
Date: Wednesday, July 9, 2003 - 12:22 pm

Thanks for that!  I've had Bill's book open here on my desk for other
reasons lately and when this came up here I did a rough page count of
each section and while doing so I was reading all the "this is tricky"
disclaimers.  :-)

How quickly people forget their heritage though.  By that I mean if Unix
people of the day had not forgotten their Multics background then the
"memory==file" proposition would not have been so strange and its
implications would have been well understood by them.  I was an avid
user of Multics right up to the day I first learned to use SysV IPC so
perhaps that's why it never confused me or made me cringe.

The SysV IPC mechanisms have had a long, and IMVNSHO greatly undeserved,
history of being decried, discredited, and disparaged.  I suspect this
would have happened to SHM in particular even if it had not shared the
ipcs/ipcrm resource identifier namespace quirks of message queues and
semaphores simply because of the stupid politics separating USL and the
BSD/CSRG crowds at the time (and still :-).

Regardless POSIX shared memory is still effectively stuck with a flat,
invisible, namespace that now looks ever so much more like a flat
filesystem but has almost none of the utility.  I.e. there's a long and
vast gap between the goal of making POSIX shared memory capable of being
implemented on top of either true mmap() or SHM, but in the end the
final standard leaves an application author hanging high and dry because
the restrictions required for portable applications are so ludicrous
that it's almost infinitely easier for the application to independently
support both variants instead of trying to contort through the POSIX API.
(and in modern systems that almost invariably means just using SHM).

In some senses it was also unfortunate that the POSIX Realtime working
group got stuck with defining IPC mechanisms, but of course that seemed
to be just more fallout from the stupid USL vs. BSD politics.  The
X/Open guys had a much more, well, open, ...
From: Kamal R Prasad
Date: Wednesday, July 9, 2003 - 11:11 pm

IMHO (in retrospect), if a system call does not provide any new feature 
(shm vs mmap), it needs to be deprecated/moved to user-space [using a 
common mechanism to share data efficiently]. 

regards
-kamal


From: Matthew Mondor
Date: Thursday, July 10, 2003 - 3:08 pm

I have used BSD style shared memory using mmap(2) with MAP_ANON on linux 2.4
and it worked fine... For synchronization I used flock(2) on temporary
lock files with it
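A rough sketch of that arrangement, under assumptions of my own: one shared counter, one child process, and hypothetical function names. Each process opens the lock file for itself, because flock() locks belong to the open file description and a descriptor inherited across fork() would share a single lock.

```c
#define _DEFAULT_SOURCE 1	/* glibc: expose MAP_ANON and flock() */
#include <assert.h>
#include <fcntl.h>
#include <sys/file.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Add 1 to *counter `rounds` times, each under an exclusive flock(). */
static int
bump(volatile long *counter, const char *lock_path, int rounds)
{
	int fd, i;

	if ((fd = open(lock_path, O_RDWR)) == -1)	/* our own description */
		return -1;
	for (i = 0; i < rounds; i++) {
		if (flock(fd, LOCK_EX) == -1) {
			close(fd);
			return -1;
		}
		*counter += 1;
		flock(fd, LOCK_UN);
	}
	close(fd);
	return 0;
}

/* Parent and child both bump a MAP_ANON|MAP_SHARED counter. */
long
shared_counter_demo(const char *lock_path, int rounds)
{
	long *counter, result;
	int fd, status, parent_ok, child_ok;
	pid_t pid;

	assert(lock_path != NULL);
	if ((fd = open(lock_path, O_RDWR | O_CREAT, 0600)) == -1)
		return -1;
	close(fd);
	counter = mmap(NULL, sizeof *counter, PROT_READ | PROT_WRITE,
	    MAP_ANON | MAP_SHARED, -1, 0);	/* starts out zero-filled */
	if (counter == MAP_FAILED) {
		unlink(lock_path);
		return -1;
	}
	if ((pid = fork()) == -1) {
		munmap(counter, sizeof *counter);
		unlink(lock_path);
		return -1;
	}
	if (pid == 0)
		_exit(bump(counter, lock_path, rounds) == 0 ? 0 : 1);
	parent_ok = bump(counter, lock_path, rounds) == 0;
	child_ok = waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0;
	result = (parent_ok && child_ok) ? *counter : -1;
	munmap(counter, sizeof *counter);
	unlink(lock_path);		/* the lock file was only temporary */
	return result;
}
```

With both processes doing `rounds` increments, the function should return twice that number if the locking works.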

Matt


