Re: Syslets, Threadlets, generic AIO support, v6

From: Zach Brown
Date: Tuesday, May 29, 2007 - 2:27 pm

I'm pleased to announce the availability of version 6 of the syslet subsystem.
Ingo and I agreed that I'll handle syslet releases while he's busy with CFS.  I
copied the cc: list from Ingo's v5 announcement.  If you'd like to be dropped
(or added), please let me know.

The v6 patch series against 2.6.21 can be downloaded from:

  http://oss.oracle.com/~zab/syslets/v6/

Example applications and previous syslet releases can be found at:

 http://people.redhat.com/~mingo/syslet-patches/
  
The syslet subsystem aims to provide user-space with an efficient interface for
managing the asynchronous submission and completion of existing system calls.

The only changes since v5 are small changes that I made to support the
experimental aio patch described below.

My syslet subsystem todo list is as follows, in no particular order:

 - replace WARN_ON() calls with error handling or avoidance
 - split the x86_64-async.patch into more specific patches
 - investigate integration with ptrace
 - investigate rare ./syslet-test cpu spinning
 - provide distro kernel rpms and documentation for developers
 - compat design problems, still? http://lkml.org/lkml/2007/3/7/523

Included in this patch series is an experimental patch which reworks fs/aio.c
to reuse the syslet subsystem to process iocb requests from user space.  The
intent of this work is to simplify the code and broaden aio functionality.  

Many issues need to be addressed before this aio work could be merged:

 - support cancellation by sending signals to async_threads 
 - figure out what to do about signals from handlers, like SIGXFSZ
 - verify that heavy loads do not consume excessive cpu or memory 
 - concurrent dio writes
 - cfq gets confused, share io_context amongst threads?
 - restrict allowed operations like .aio_{r,w} methods used to

More details on this work in progress can be found in the patch.

Any and all feedback is welcome and encouraged!

 - z
-

From: Linus Torvalds
Date: Tuesday, May 29, 2007 - 2:49 pm

.. so don't keep us in suspense. Do you have any numbers for anything 
(like Oracle, to pick a random thing out of thin air ;) that might 
actually indicate whether this actually works or not?

Or is it just so experimental that no real program that uses aio can 
actually work yet?

		Linus
-

From: Zach Brown
Date: Tuesday, May 29, 2007 - 3:49 pm

I haven't gotten to running Oracle's database against it.  It is going
to be Very Cranky if O_DIRECT writes aren't concurrent, and that's going
to take a bit of work in fs/direct-io.c.

I've done initial micro-benchmarking runs for basic sanity testing with
fio.  They haven't wildly regressed, that's about as much as can be said
with confidence so far :).

Take a streaming O_DIRECT read.  1meg requests, 64 in flight.

str: (g=0): rw=read, bs=1M-1M/1M-1M, ioengine=libaio, iodepth=64

mainline:

	  read : io=3,405MiB, bw=97,996KiB/s, iops=93, runt= 36434msec

aio+syslets:

	  read : io=3,452MiB, bw=99,115KiB/s, iops=94, runt= 36520msec

That's on an old gigabit copper FC array with 10 drives behind a, no
seriously, qla2100.

The real test is the change in memory and cpu consumption, and I haven't
modified fio to take reasonably precise measurements of those yet.  Once
I get O_DIRECT writes concurrent that'll be the next step. 

I was pleased to see my motivation for the patches, to avoid having to
add specific support for operations to be called from fs/aio.c, work
out.  

Take the case of 4k random buffered reads from a block device with a
cold cache:

read: (g=0): rw=randread, bs=4K-4K/4K-4K, ioengine=libaio, iodepth=64

mainline:

  read : io=16,116KiB, bw=457KiB/s, iops=111, runt= 36047msec
    slat (msec): min=    4, max=  629, avg=563.17, stdev=71.92
    clat (msec): min=    0, max=    0, avg= 0.00, stdev= 0.00

aio+syslets:

  read : io=125MiB, bw=3,634KiB/s, iops=887, runt= 36147msec
    slat (msec): min=    0, max=    3, avg= 0.00, stdev= 0.08
    clat (msec): min=    2, max=  643, avg=71.59, stdev=74.25

aio+syslets w/o cfq

  read : io=208MiB, bw=6,057KiB/s, iops=1,478, runt= 36071msec
    slat (msec): min=    0, max=   15, avg= 0.00, stdev= 0.09
    clat (msec): min=    2, max=  758, avg=42.75, stdev=37.33

Everyone step back and thank Jens for writing a tool that gives us
interesting data without us always having to craft some stupid ...
From: Jeff Garzik
Date: Tuesday, May 29, 2007 - 3:16 pm

You should pick up the kevent work :)

Having async request and response rings would be quite useful, and most 
closely match what is going on under the hood in the kernel and hardware.

	Jeff


-

From: Zach Brown
Date: Tuesday, May 29, 2007 - 4:09 pm

> You should pick up the kevent work :)


Yeah, but I have lots of competing thoughts about this.

For the time being I'm focusing on simplifying the mechanisms that
support the sys_io_*() interface so I never ever have to debug fs/aio.c
(also known as chewing glass to those of us with the scars) again.

That said, I'll gladly work closely with developers who are seriously
considering putting some next gen interface to the test.  That todo item
about producing documentation and distro kernels is specifically to bait
Uli into trying to implement posix aio on top of syslets in glibc.

'cause we can go back and forth about potential interfaces for, well,
how long has it been?  years?  I want non-trivial users who we can
measure so we can *stop* designing and implementing the moment something
is good enough for them.

- z
-

From: Ulrich Drepper
Date: Tuesday, May 29, 2007 - 4:20 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Get DaveJ to pick up the code for Fedora kernels and I'll get to it.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)
Comment: Using GnuPG with Fedora - http://enigmail.mozdev.org

iD8DBQFGXLUk2ijCOnn/RHQRAjL0AJ0UQzNnMn8xpj7ga0OeEWUhnkhZfgCfTH+j
iQ52SLZgWwp4wmAGCy/eLZs=
=hpyn
-----END PGP SIGNATURE-----
-

From: Dave Jones
Date: Tuesday, May 29, 2007 - 6:11 pm

On Tue, May 29, 2007 at 04:20:04PM -0700, Ulrich Drepper wrote:
 > -----BEGIN PGP SIGNED MESSAGE-----
 > Hash: SHA1
 > 
 > Zach Brown wrote:
 > > That todo item
 > > about producing documentation and distro kernels is specifically to bait
 > > Uli into trying to implement posix aio on top of syslets in glibc.
 > 
 > Get DaveJ to pick up the code for Fedora kernels and I'll get to it.

With F7 out the door, I'm looking at getting devel/ back in shape again,
so I can get something done there soon-ish.  With the usual caveat that if
this isn't upstream by the time we do a release, we'll have to drop it
due to the added syscall. (Maybe we can just get that reserved upstream now?)

	Dave

-- 
http://www.codemonkey.org.uk
-

From: Zach Brown
Date: Wednesday, May 30, 2007 - 10:08 am

Maybe, but we'd have to agree on the bare syslet interface that is being
supported :).

Personally, I'd like that to be the simplest thing that works for people
and I'm not convinced that the current syslet-specific syscalls are that.
Certainly not the atom interface, anyway.

+asmlinkage __attribute__((weak)) long
+sys_umem_add(unsigned long __user *uptr, unsigned long inc)
+{
+       unsigned long val, new_val;
+
+       if (get_user(val, uptr))
+               return -EFAULT;
+       /*
+        * inc == 0 means 'read memory value':
+        */
+       if (!inc)
+               return val;
+
+       new_val = val + inc;
+       if (__put_user(new_val, uptr))
+               return -EFAULT;
+
+       return new_val;
+}

A syscall for *long addition* strikes me as a bit much, I have to admit.
Where do we stop?  (Where's the compat wrapper? :))

Maybe this would be fine for some wildly aggressive optimization some
number of years in the future when we have millions of syslet interface
users complaining about the cycle overhead of their syslet engines, but
it seems like we can do something much less involved in the first pass
without harming the possibility of promising to support this complex
optimization in the future.

- z
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 12:26 am

note that async request and response rings are implemented already in 
essence: that's how FIO uses syslets. The linked list of syslet atoms is 
the 'request ring' (it's just that 'ring' is not a hard-enforced data 
structure - you can use other request formats too), and the completion 
ring is the 'response ring'.

	Ingo
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 12:20 am

3 months ago i verified the published kevent vs. epoll benchmark and 
found that benchmark to be fatally flawed. When i redid it properly 
kevent showed no significant advantage over epoll. Note that i did those 
measurements _before_ the recent round of epoll speedups. So unless 
someone does believable benchmarks i consider kevent an over-hyped, 
mis-benchmarked complication to do something that epoll is perfectly 
capable of doing.

	Ingo
-

From: Ulrich Drepper
Date: Wednesday, May 30, 2007 - 12:31 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


I'm not going to judge your tests but saying there are no significant
advantages is too one-sided.  There is one huge advantage: the
interface.  A memory-based interface is simply the best form.  File
descriptors are a resource the runtime cannot transparently consume.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFGXShu2ijCOnn/RHQRAi5ZAJ920rRneulUMjTETu6XoiOaOi7SLgCfbmO+
UDM1CLqbaEZREAMnuOWRzuY=
=CERV
-----END PGP SIGNATURE-----
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 1:42 am

yeah - this is a fundamental design question for Linus i guess :-) glibc 
(and other infrastructure libraries) have a fundamental problem: they 
cannot (and do not) presently use persistent file descriptors to make 
use of kernel functionality, due to ABI side-effects. [applications can 
dup into an fd used by glibc, applications can close it - shells close 
fds blindly for example, etc.] Today glibc simply cannot open a file 
descriptor and keep it open while application code is running due to 
these problems.

we should perhaps enable glibc to have its separate fd namespace (or 
'hidden' file descriptors at the upper end of the fd space) so that it 
can transparently listen to netlink events (or do epoll), without 
impacting the application fd namespace - instead of ducking to a memory 
based API as a workaround.

it is a serious flexibility issue that should not be ignored. The 
unified fd space is a blessing on one hand because it's simple and 
powerful, but it's also a curse because nested use of the fd space for 
libraries is currently not possible. But it should be detached from any
fundamental question of kevent vs. epoll. (By improving library use of
file descriptors we'll improve the utility of all syscalls - by ducking
to a memory based API we only solve that particular event based usage.)

	Ingo
-

From: Evgeniy Polyakov
Date: Wednesday, May 30, 2007 - 1:51 am

There is another issue with file descriptors - userspace must dig into
kernel each time it wants to get a new set of events, while with memory
based approach it has them without doing so. After it has returned from
kernel and knows that there are some events, kernel can add more of them
into the ring (if there is a place) and userspace will process them
without additional syscalls.
Although syscall overhead is very small, it does exist and should not be 

-- 
	Evgeniy Polyakov
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 2:05 am

Firstly, this is not a fundamental property of epoll. If we wanted to, 
it would be possible to extend epoll to fill in a ring of events from 
the wakeup handler. It's an incremental add-on to epoll that should not 
impact the design. How much info to put into a single event is another 
incremental thing - for most of the high-performance cases all the 
information we need is the type of the event and the fd it occurred on. 
Currently epoll supports that minimal approach.

Secondly, our current syscall overhead is below 0.1 usecs on latest 
hardware:

  dione:~/l> ./lat_syscall null
  Simple syscall: 0.0911 microseconds

so you need millions of events _per cpu_ for the syscall overhead to 
show up.

Thirdly, our main problem was not the structure of epoll, our main 
problem was that event APIs were not widely available, so applications 
couldn't go to a pure event based design - they always had to handle 
certain types of event domains specially, due to lack of coverage. The
latest epoll patches largely address that. This was a huge barrier
against adoption of epoll.

	Ingo
-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 8:16 am

Well, quite frankly, to me, the most important part of syslets is that if 
they are done right, they introduce _no_ new interfaces at all that people 
actually use.

Over the years, we've done lots of nice "extended functionality" stuff. 
Nobody ever uses them. The only thing that gets used is the standard stuff 
that everybody else does too.

So when it comes to syslets, the most important interface will be the 
existing aio_read() etc interfaces _without_ any in-memory stuff at all, 
and everything done by the kernel to just make it look exactly like it 
used to look. And the biggest advantage is that it simplifies the internal 
kernel code, and makes us use the same code for aio and non-aio (and I 
think we have a good possibility of improving performance too, if only 
because we will get much more natural and fine-grained scheduling points!)

Any extended "direct syslets" use is technically _interesting_, but 
ultimately almost totally pointless. Which was why I was pushing really 
really hard for a simple interface and not being too clever or exposing 
internal designs too much. An in-memory thing tends to be the absolute 

glibc has a more fundamental problem: the "fun" stuff is generally not 
worth it. 

For example, any AIO thing that requires glibc to be rewritten is almost 
totally uninteresting. It should work with _existing_ binaries, and 
_existing_ ABI's to be useful - since 99% of all AIO users are binary- 
only and won't recompile for some experimental library.

The whole epoll/kevent flame-wars have ignored a huge issue: almost nobody 
uses either. People still use poll and select, to such an _overwhelming_ 
degree that it almost doesn't even matter if you were to make the 

Yeah, I don't think it would be at all wrong to have "private file 
descriptors". I'd prefer that over memory-based (for all the abstraction 
issues, and because a lot of things really *are* about file descriptors!). 

		Linus
-

From: Ulrich Drepper
Date: Wednesday, May 30, 2007 - 8:39 am

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Something like this would only work reliably if you have actual
protection coming with it.  Also, there are still reasons why an
application might want to see, close, handle, whatever these descriptors
in a separate namespace.

I think such namespaces are a broken concept.  How many do you want to
introduce?  Plus, then you get away from the normal file descriptor
interfaces anyway.  If you'd represent these alternative namespace
descriptors with ordinary ints you gain nothing.  You'd have to use
tuples (namespace,descriptor) and then you need a whole set of new
interfaces or some sticky namespace selection which will only cause

It's not "ducking".  Memory mapping is one of the most natural
interfaces.  Just because Unix/Linux is built around the concept of file
descriptors does not mean this is the ultimate in usability.  File
descriptors are in fact clumsy: if you have a file descriptor to read
and write data, all auxiliary data for that communication must be
transferred out-of-band (e.g, fcntl) or in very magical and hard to use
ways (recvmsg, sendmsg).  With a memory based event mechanism this

Too simple.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFGXZqX2ijCOnn/RHQRAsSFAKCNrd8/sRss1wBA9hkpnYIeALDbXQCfRNAb
yZy2Nofz2CgDo9PQYK3C/bo=
=klUJ
-----END PGP SIGNATURE-----
-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 12:40 pm

Here I think we are forgetting that glibc is userspace and there's no 
separation between the application code and glibc code. An application 
linking to glibc can break glibc in a thousand ways, independently from fds 
or not fds. Like complaining that glibc is broken because printf() 
suddenly does not work anymore ;)

#include <stdio.h>
#include <unistd.h>     /* close() */
int main(void) {
        close(fileno(stdout));
        printf("Whiskey Tango Foxtrot?\n");
        return 0;
}



- Davide


-

From: Ulrich Drepper
Date: Wednesday, May 30, 2007 - 12:55 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


It's not (only/mainly) about breaking.  File descriptors are a resource
which has to be used under the control of the program.  The runtime
cannot just steal some for itself.  This indirectly leads to breaking
code.  We've seen this many times and I keep repeating the same issue
over and over again: why do we have MAP_ANON instead of keeping a file
descriptor with /dev/null open?  Why is mmap made more complicated by
allowing the file descriptor to be closed after the mmap() call is done?

Take a look at a process running your favorite shell.  Ever wonder why
there is this stray file descriptor with a high number?

$ cat /proc/3754/cmdline
bash
$ ll /proc/3754/fd/
total 0
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 0 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 1 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:49 2 -> /dev/pts/19
lrwx------ 1 drepper drepper 64 2007-05-30 12:50 255 -> /dev/pts/19

File descriptors must be requested explicitly and cannot be implicitly
consumed.

All that and the other problem I mentioned earlier today about auxiliary
data.  File descriptors are not the ideal interface.  Elegant: yes,
ideal: no.  From physics and math you might have learned that not every
result that looks clean and beautiful is correct.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFGXdbC2ijCOnn/RHQRAgBbAJ0RoNsQr4L6Bm5hLy7somAKeTqCcQCbBHmx
8hzG+1w0rYMTqXxNmi/QQ7o=
=O7Xm
-----END PGP SIGNATURE-----
-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 1:00 pm

No, Davide, the problem is that some applications depend on getting 
_specific_ file descriptors.

For example, if you do

	close(0);
	.. something else ..
	if (open("myfile", O_RDONLY) < 0)
		exit(1);

you can (and should) depend on the open returning zero.

So library routines *must not* open file descriptors in the normal space.

(The same is true of real applications doing the equivalent of

	for (i = 0; i < NR_OPEN; i++)
		close(i);

to clean up all file descriptors before doing something new. And yes, I 
think it was bash that used to *literally* do something like that a long 
time ago.)

Another example of the same thing: people open file descriptors and know 
that they'll be "dense" in the result, and then use "select()" on them.

So it's true that file descriptors can't be used randomly by the standard 
libraries - they'd need to have some kind of separate "private space".

Which *could* be something as simple as saying "bit 30 in the file 
descriptor specifies a separate fd space" along with some flags to make 
open and friends return those separate fd's. That makes them useless for 
"select()" (which assumes a flat address space, of course), but would be 
useful for just about anything else.

		Linus
-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 1:21 pm

Right. I misunderstood Uli and Ingo. I thought it was like trying to 

I think it can be solved in a few ways. Yours or Ingo's (or something 
else) can work, to solve the above "legacy" fd space expectations.



- Davide


-

From: Eric Dumazet
Date: Wednesday, May 30, 2007 - 1:31 pm

Then you can also exclude multi-threading, since a thread (even not inside 
glibc) can also use socket()/pipe()/open()/whatever and take the zero file 
descriptor as well.

Frankly I dont buy this fd namespace stuff.

The only hardcoded thing in Unix is 0, 1 and 2 fds.
People usually take care of these, or should use a Microsoft OS.

POSIX mandates that open() returns the lowest available fd.
But this obviously works only if you dont have another thread messing with 
fds, or if you dont call a library function that opens a file.


Quite buggy IMHO

This hack was to avoid bugs coming from ancestors applications, 
forking/execing a shell, and at times where one process could not open more 
than 20 files (AT&T Unix, 21 years ago)

Unix has fcntl(fd, F_SETFD, FD_CLOEXEC). A library should use this to make 


Please dont do that. Second class fds.

Then what about having ten different shared libraries ? Third class fds ?


-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 1:44 pm

No. The application is _correct_. It's how file descriptors are defined to 

Totally different. That's an application internal issue. It does *not* 

Wrong. I already gave an example of real code that just didn't bother to 
keep track of which fd's it had open, and closed them all. Partly, in 
fact, because you can't even _know_ which fd's you have open when somebody 
else just execve's you.

You can call it buggy, but the fact is, if you do, you're SIMPLY WRONG. 

You cannot just change years and years of coding practice, and standard 
documentations. The behaviour of file descriptors is a fact. Ignoring that 
fact because you don't like it is na
From: Eric Dumazet
Date: Wednesday, May 30, 2007 - 2:53 pm

If someone really cares, /proc/self/fd can help. But one shouldn't care at all.

About the things that the process can do before execing() a process, file 
descriptors outside of 0,1,2 are the most obvious thing, but you also have 

I want to change nothing. Current situation is fine and well documented, thank 
you.

If a program does "for (i = 0; i < NR_OPEN; i++) close(i);", this 
*will*/*should* work as intended : close all file descriptors from 0 to 
NR_OPEN. Big deal.

But you wont find in a program :

FILE *fp = fopen("somefile", "r");
for (i = 0; i < NR_OPEN; i++)
     close(i);
while (fgets(buff, sizeof(buff), fp)) {
}


You and/or others want to add fd namespaces and other hacks.

I saw on this thread suspicious examples, I am waiting for a real one, 
justifying all this stuff.

After file descriptors separation, I guess we'll need memory space separation 
as well, signal separations (SIGALRM comes to mind), uid/gid separation, cpu 
time separation, and so on... setrlimit() layered for every shared lib.


-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 2:31 pm

Looking at it now, I'd agree (although I think I have that somewhere in my 
old code too). Consider though, that such code is contained also in 
reference books like Richard Stevens "UNIX Network Programming".



- Davide


-

From: Ulrich Drepper
Date: Wednesday, May 30, 2007 - 2:16 pm

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1


Indeed.  It was not only bash, though, I fixed probably a dozen
applications.  But even the new and better solution (readdir of
/proc/self/fd) does not prevent the problem of closing descriptors the

I don't like special cases.  For me things better come in quantities 0,
1, and unlimited (well, reasonably high limit).  Otherwise, who gets to
use that special namespace?  The C library is not the only body of code
which would want to use descriptors.

And then the semantics: should these descriptors show up in
/proc/self/fd?  Are there separate directories for each namespace?  Do
they count against the rlimit?

This seems to me like a shot from the hip without thinking about other
possibilities.

- --
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFGXemS2ijCOnn/RHQRAjsFAKCGhakZosSsRzCwOvruxECbzcwIzACeJAiY
z9ql4FJa8XTSiZzRG79ocwM=
=0E7f
-----END PGP SIGNATURE-----
-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 2:27 pm

Well, don't think of it as a special case at all: think of bit 30 as a 
"the user asked for a non-linear fd".

In fact, to make it effective, I'd suggest literally scrambling the low 
bits (using, for example, some silly per-boot xor value to actually 
generate the "true" index - the equivalent of a really stupid randomizer). 

That way you'd have the legacy "linear" space, and a separate "non-linear 
space" where people simply *cannot* make assumptions about contiguous fd 
allocations. There's no special case there - it's just an extension which 
explicitly allows us to say "if you do that, your fd's won't be allocated 
the traditional way any more, but you *can* mix the traditional and the 

Oh, absolutely. They'd be real fd's in every way. People could use them 
100% equivalently (and concurrently) with the traditional ones. The whole, 
and the _only_ point, would be that it breaks the legacy guarantees of a 
dense fd space.

Most apps don't actually *need* that dense fd space in any case. But by 
defaulting to it, we wouldn't break those (few) apps that actually depend 
on it.

		Linus
-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 2:48 pm

I agree. What would be a good interface to allocate fds in such area? We 
don't want to replicate syscalls, so maybe a special new dup function?



- Davide


-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 3:01 pm

I'd do it with something like "newfd = dup2(fd, NONLINEAR_FD)" or similar, 
and just have NONLINEAR_FD be some magic value (for example, make it be 
0x40000000 - the bit that says "private, nonlinear" in the first place).

But what's gotten lost in the current discussion is that we probably don't 
actually _need_ such a private space. I'm just saying that if the *choice* 
is between memory-mapped interfaces and a private fd-space, we should 
probably go for the latter. "Everything is a file" is the UNIX way, after 
all. But there's little reason to introduce private fd's otherwise.

			Linus
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 11:13 pm

it's both a flexibility and a speedup thing as well:

flexibility: for libraries to be able to open files and keep them open 
comes up regularly. For example currently glibc is quite wasteful in a 
number of common networking related functions (Ulrich, please correct me 
if i'm wrong), which could be optimized if glibc could just keep a 
netlink channel fd open and could poll() it for changes and cache the 
results if there are no changes (or something like that).

speedup: i suggested O_ANY 6 years ago as a speedup to Apache - 
non-linear fds are cheaper to allocate/map:

  http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg23820.html

(i definitely remember having written code for that too, but i cannot 
find that in the archives. hm.) In theory we could avoid _all_ fd-bitmap 
overhead as well and use a per-process list/pool of struct file buffers 
plus a maximum-fd field as the 'non-linear fd allocator' (at the price 
of only deallocating them at process exit time).

	Ingo
-

From: Eric Dumazet
Date: Thursday, May 31, 2007 - 12:35 am

On Thu, 31 May 2007 08:13:03 +0200

Only very few apps need to open more than 100.000 files.

As these files are likely sockets, O_ANY is not a solution.

A trick is to try to keep first 64 handles freed, so that kernel wont consume
too much cpu time and cache in get_unused_fd()

http://lkml.org/lkml/2005/9/15/307

This trick is portable (not linux centric).

-

From: Ingo Molnar
Date: Thursday, May 31, 2007 - 2:26 am

yes. I did not list it as a primary reason for private fds, it's just a 
nice side-effect. As long as the other apps are not hurt, i see no 

why not? It would be a natural thing to extend sys_socket() with a 
'flags' parameter and pass in O_ANY (along with any other possible fd 

this is basically a user-space front-end cache to fd allocation - which 
duplicates data needlessly. I dont see any problem with doing this in 
the kernel. (Also, obviously 'first 64 handles' could easily break with 
certain types of apps so glibc cannot do this.)

	Ingo
-

From: Ingo Molnar
Date: Thursday, May 31, 2007 - 2:02 am

to measure this i've written fd-scale-bench.c:

   http://redhat.com/~mingo/fd-scale-patches/fd-scale-bench.c

which tests the (cache-hot or cache-cold) cost of open()-ing of two fds 
while there are N other fds already open: one is from the 'middle' of 
the range, one is from the end of it.

Lets check our current 'extreme high end' performance with 1 million 
fds. (which is not realistic right now but there certainly are systems 
with over a hundred thousand open fds). Results from a fast CPU with 2MB 
of cache:

 cache-hot:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 2, best cost: 1.40 us, worst cost: 1.40 us
 num_fds: 3, best cost: 1.40 us, worst cost: 2.00 us
 num_fds: 4, best cost: 1.40 us, worst cost: 1.40 us
 ...
 num_fds: 77117, best cost: 1.60 us, worst cost: 2.00 us
 num_fds: 96397, best cost: 2.00 us, worst cost: 2.20 us
 num_fds: 120497, best cost: 2.20 us, worst cost: 2.40 us
 num_fds: 150622, best cost: 2.20 us, worst cost: 3.00 us
 num_fds: 188278, best cost: 2.60 us, worst cost: 3.00 us
 num_fds: 235348, best cost: 2.80 us, worst cost: 3.80 us
 num_fds: 294186, best cost: 3.40 us, worst cost: 4.20 us
 num_fds: 367733, best cost: 4.00 us, worst cost: 5.00 us
 num_fds: 459667, best cost: 4.60 us, worst cost: 6.00 us
 num_fds: 574584, best cost: 5.60 us, worst cost: 8.20 us
 num_fds: 718231, best cost: 6.40 us, worst cost: 10.00 us
 num_fds: 897789, best cost: 7.60 us, worst cost: 11.80 us
 num_fds: 1000000, best cost: 8.20 us, worst cost: 9.60 us

 cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 4.60 us, worst cost: 7.00 us
 num_fds: 2, best cost: 5.00 us, worst cost: 6.60 us
 ...
 num_fds: 77117, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 96397, best cost: 5.60 us, worst cost: 7.40 us
 num_fds: 120497, best cost: 6.20 us, worst cost: 6.80 us
 num_fds: ...
From: Eric Dumazet
Date: Thursday, May 31, 2007 - 3:41 am

On Thu, 31 May 2007 11:02:52 +0200

Your numbers do not match mine (mine were more than two years old so I redid a test before replying)

I tried your bench and found two problems :
- You scan half of the bitmap
- You incorrectly divide best_delta and worst_delta by LOOPS (5)

Try to close not a 'middle fd', but a really low one (10 for example), and latency is doubled.

with a corrected bench; cache-cold numbers are > 100 us on this Intel Pentium-M

num_fds: 1000000, best cost: 120.00 us, worst cost: 131.00 us

On an Opteron x86_64 machine, results are better :)

num_fds: 1000000, best cost: 28.00 us, worst cost: 106.00 us
-

From: Ingo Molnar
Date: Thursday, May 31, 2007 - 3:50 am

that was intentional. I really didnt want to fabricate a worst-case 
result but something more representative: in real apps the bitmap isnt 
fully filled all the time and most of the find-bit sequences are short. 

ah, indeed, that's a bug - victim of a last minute edit :) Since the 
divisor is constant it doesnt really matter to the validity of the 
relative nature of the slowdown (which is what i was interested in), but 
you are right - i have fixed the download and have redone the numbers. 
Here are the correct results from my box:

 # ./fd-scale-bench 1000000 0
 checking the cache-hot performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 6.00 us, worst cost: 8.00 us
 num_fds: 2, best cost: 6.00 us, worst cost: 7.00 us
 ...
 num_fds: 31586, best cost: 7.00 us, worst cost: 8.00 us
 num_fds: 39483, best cost: 8.00 us, worst cost: 8.00 us
 num_fds: 49354, best cost: 7.00 us, worst cost: 9.00 us
 num_fds: 61693, best cost: 8.00 us, worst cost: 10.00 us
 num_fds: 77117, best cost: 8.00 us, worst cost: 13.00 us
 num_fds: 96397, best cost: 9.00 us, worst cost: 11.00 us
 num_fds: 120497, best cost: 10.00 us, worst cost: 14.00 us
 num_fds: 150622, best cost: 11.00 us, worst cost: 13.00 us
 num_fds: 188278, best cost: 12.00 us, worst cost: 15.00 us
 num_fds: 235348, best cost: 14.00 us, worst cost: 20.00 us
 num_fds: 294186, best cost: 16.00 us, worst cost: 22.00 us
 num_fds: 367733, best cost: 19.00 us, worst cost: 35.00 us
 num_fds: 459667, best cost: 22.00 us, worst cost: 37.00 us
 num_fds: 574584, best cost: 26.00 us, worst cost: 40.00 us
 num_fds: 718231, best cost: 31.00 us, worst cost: 62.00 us
 num_fds: 897789, best cost: 37.00 us, worst cost: 54.00 us
 num_fds: 1000000, best cost: 41.00 us, worst cost: 59.00 us

and cache-cold:

 # ./fd-scale-bench 1000000 1
 checking the cache-cold performance of open()-ing 1000000 fds.
 num_fds: 1, best cost: 24.00 us, worst cost: 32.00 us
 ...
 num_fds: 49354, best cost: 26.00 us, worst cost: 28.00 us
 num_fds: 61693, ...
From: Ingo Molnar
Date: Thursday, May 31, 2007 - 2:32 am

btw., this also allows mostly-lockless fd allocation, which would 
probably benefit threaded apps too. (we can just recycle it from a 
per-CPU list of cached fds for that process)

	Ingo
-

From: Jens Axboe
Date: Thursday, May 31, 2007 - 2:34 am

See also:

http://lkml.org/lkml/2006/6/16/144

which originates from a much simpler patch I did to fix performance
regressions in this area for the SLES10 kernel.

-- 
Jens Axboe

-

From: Eric Dumazet
Date: Wednesday, May 30, 2007 - 3:09 pm

If the deal is to be able to get faster open()/socket()/pipe()/... calls by 
not finding the first 0 bit in a huge bitmap, a better way would be to have a 
flag in struct task, reset to 0 at exec time.

A new syscall would say : This process is OK to receive *random* fds.


-

From: Ulrich Drepper
Date: Wednesday, May 30, 2007 - 2:47 pm


This sounds easy but doesn't really solve all the issues.  Let me repeat
your example and the solution currently in use:

problem: application wants to close all file descriptors except a select
few, cleaning up what is currently open.  It doesn't know all the
descriptors that are open.  Maybe all this in preparation of an exec call.

Today the best method to do this is to readdir() /proc/self/fd and
exclude the descriptors on the whitelist.
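
A minimal sketch of that method, assuming Linux with /proc mounted; the function names close_all_except() and in_whitelist() are made up for illustration:

```c
#include <assert.h>
#include <dirent.h>
#include <stdlib.h>
#include <unistd.h>

/* Return 1 if fd appears in the keep[] list. */
static int in_whitelist(int fd, const int *keep, int n_keep)
{
	int i;

	for (i = 0; i < n_keep; i++)
		if (keep[i] == fd)
			return 1;
	return 0;
}

/*
 * Close every descriptor listed in /proc/self/fd except those in keep[].
 * Returns the number of descriptors closed, or -1 if the directory
 * cannot be opened.
 */
static int close_all_except(const int *keep, int n_keep)
{
	DIR *dir = opendir("/proc/self/fd");
	struct dirent *de;
	int closed = 0;
	int dfd;

	assert(n_keep == 0 || keep != NULL);
	if (!dir)
		return -1;
	dfd = dirfd(dir);	/* the directory stream's own descriptor */

	while ((de = readdir(dir)) != NULL) {
		char *end;
		long fd = strtol(de->d_name, &end, 10);

		/* Skip "." and ".." and anything that is not a number. */
		if (end == de->d_name || *end != '\0')
			continue;
		if (fd == dfd || in_whitelist((int)fd, keep, n_keep))
			continue;
		if (close((int)fd) == 0)
			closed++;
	}
	closedir(dir);
	return closed;
}
```

A caller would typically put 0, 1 and 2 on the whitelist along with whatever it wants to survive the exec. Any descriptor the kernel lists in that directory is visible to this loop, which is the visibility problem described here.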

If the special, non-sequential descriptors are also listed in that
directory the runtimes still cannot use them since they are visible.

If you go ahead with this, then at the very least add a flag which
causes the descriptor to not show up in /proc/*/fd.


You also have to be aware that open() is just one piece of the puzzle.
What about socket()?  I've cursed this interface many times before and
now it's biting you: there is no parameter to pass a flag.  What about
transferring file descriptors via Unix domain sockets?  How can I decide
the transferred descriptor should be in the private namespace?

There are likely many many more problems and cornercases like this.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 3:06 pm

Well, we can't just replicate/change every system call that creates a file 
descriptor. So I'm for something like:

int sys_fdup(int fd, int flags);

So you basically create your fds with their native/existing system calls, 
and then you dup/move them into the preferred fd space.
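
The closest thing available today is fcntl(F_DUPFD), which duplicates a descriptor to the lowest free slot at or above a given number. A rough userspace stand-in for the create-then-move pattern, assuming the chosen base is below RLIMIT_NOFILE; move_fd_above() is a made-up name and this does not give a separate fd space, only a high range in the ordinary one:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for the proposed sys_fdup(): duplicate fd to the
 * lowest free descriptor >= min_fd, then close the original.
 * Returns the new descriptor, or -1 with errno set (original left open).
 */
static int move_fd_above(int fd, int min_fd)
{
	int newfd;

	assert(fd >= 0 && min_fd >= 0);

	newfd = fcntl(fd, F_DUPFD, min_fd);
	if (newfd < 0)
		return -1;

	close(fd);
	return newfd;
}
```

Usage would be along the lines of fd = move_fd_above(socket(...), 100), so the creating syscall itself stays unchanged.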



- Davide


-

From: David M. Lloyd
Date: Wednesday, May 30, 2007 - 2:51 pm

On Wed, 30 May 2007 14:27:52 -0700 (PDT)

If the sole point is to protect an fd from being closed or operated on
outside of a certain context, why not just provide the ability to
"protect" an fd to prevent its use.  Maybe a pair of syscalls like
"fdprotect" and "fdunprotect" that take an fd and an integer key.
Protected fds would return EBADF or something if accessed.  The same
integer key must be provided to fdunprotect in order to gain access
to it again.  Then glibc or valgrind or whatever would just unprotect
the fd before operating on it.

- DML
-

From: William Lee Irwin III
Date: Wednesday, May 30, 2007 - 3:24 pm

One could always stuff a seed or per-cpu seeds in the files_struct and
use a PRNG. The only trick would be cacheline bounces and/or space
consumption of seeds. Another possibility would be bitreversed
contiguity or otherwise a bit permutation of some contiguous range,
modulo (of course) the high bit used to tag the randomized range.

With "truly" random/sparse fd numbers it may be meaningful to use a
different data structure from a bitmap to track them in-kernel, though
xor and other easily-computed mappings to/from contiguous ranges won't
need such in earnest.
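
A small sketch of the bit-reversal variant, assuming a 30-bit range with bit 30 as the tag; both constants and the function names are made up for illustration. The kernel would still track a contiguous index internally and only permute it on the way in and out:

```c
#include <assert.h>
#include <stdint.h>

#define FD_RANGE_BITS 30u			/* assumed width of the permuted range */
#define FD_RANDOM_TAG (1u << FD_RANGE_BITS)	/* assumed tag bit for this range */

/* Reverse the low `bits` bits of v. */
static uint32_t bitrev(uint32_t v, unsigned bits)
{
	uint32_t r = 0;
	unsigned i;

	for (i = 0; i < bits; i++) {
		r = (r << 1) | (v & 1u);
		v >>= 1;
	}
	return r;
}

/* Map a contiguous internal index to the sparse-looking fd number. */
static uint32_t index_to_fd(uint32_t idx)
{
	assert(idx < FD_RANDOM_TAG);
	return FD_RANDOM_TAG | bitrev(idx, FD_RANGE_BITS);
}

/* Inverse mapping: bit reversal is its own inverse. */
static uint32_t fd_to_index(uint32_t fd)
{
	assert(fd & FD_RANDOM_TAG);
	return bitrev(fd & (FD_RANDOM_TAG - 1u), FD_RANGE_BITS);
}
```

Consecutive indices 0, 1, 2 map to 0x40000000, 0x60000000, 0x50000000, so the numbers look scattered to user space while a plain bitmap keeps working underneath.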


-- wli
-

From: Jeremy Fitzhardinge
Date: Wednesday, May 30, 2007 - 2:38 pm

Valgrind could certainly make use of it.  It currently reserves a set of
fds "high enough", and tries hard to hide them from apps, but
/proc/self/fd makes it intractable in general (there was only so much
simulation I was willing to do in Valgrind).

    J
-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 2:39 pm

Please, do not drop me out of the Cc list. If you have a valid point, you 
should be able to carry it forward regardless, no?



- Davide


-

From: Jeremy Fitzhardinge
Date: Wednesday, May 30, 2007 - 2:36 pm

Some programs - legitimately, I think - scan /proc/self/fd to close
everything.  The question is whether the glibc-private fds should appear
there.  And something like a "close-on-fork" flag might be useful,
though I guess glibc can keep track of its own fds closely enough to not
need something like that.

    J
-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 2:44 pm

Sure. I think there are things we can do (like make the non-linear fd's 
appear somewhere else, and make them close-on-exec by default etc).

And it's not like it's necessarily at all the only way to do things. 

I just threw it out as a possible solution - and one that is almost 
certainly *superior* to trying to work around the fd thing with some 
shared memory area which has tons of much more serious problems of its own 
(*).

		Linus

(*) Ranging from: specialized-only interfaces, inability to pass it 
around, lack of any abstraction interfaces, and almost impossible to 
debug. The security implications of kernel and user space sharing 
read-write access to some shared area are also legion!
-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 2:48 pm

Side note: it might not even be a "close-on-exec by default" thing: it 
might well be a *always* close-on-exec.

That COE is pretty horrid to do, we need to scan a bitmap of those things 
on each exec. So it might be totally sensible to just declare that the 
non-linear fd's would simply always be "local", and never bleed across an 
execve().
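
For comparison, this is how close-on-exec is requested per descriptor today; an "always local" range would make the step implicit. A minimal sketch, with set_cloexec() as a made-up helper name:

```c
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Mark fd close-on-exec while preserving its other descriptor flags.
 * Returns 0 on success, -1 with errno set on failure.
 */
static int set_cloexec(int fd)
{
	int flags;

	assert(fd >= 0);

	flags = fcntl(fd, F_GETFD);
	if (flags < 0)
		return -1;

	return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```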

			Linus
-

From: Jeremy Fitzhardinge
Date: Wednesday, May 30, 2007 - 2:54 pm

Hm, I wouldn't limit the mechanism prematurely.  Using Valgrind as an
example of an alternate user of this mechanism, it would be useful to
use a pipe to transmit out-of-band information from an exec-er to an
exec-ee process.  At the moment there's a lot of mucking around with
execve() to transmit enough information from the parent valgrind to its
successor.

    J
-

From: Matt Mackall
Date: Wednesday, May 30, 2007 - 3:27 pm

Or.. we could have a method of swizzling in and out an entire FD
array, similar to UML's trick for swizzling MMs.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: William Lee Irwin III
Date: Wednesday, May 30, 2007 - 3:38 pm

I like that notion even better than randomization. I think it should
happen. I like SKAS, too, of course.


-- wli
-

From: Evgeniy Polyakov
Date: Wednesday, May 30, 2007 - 1:32 am

Hi Ingo, developers.


I did not want to start another round of ping-pong insults :), but, 
Ingo, you did not show that kevent works worse. I did show that
sometimes it works better. It floated between a 0 and 30% win in those tests; 
in the results Johann Bork presented, kevent and epoll behaved the same. In
the results I posted earlier, I said that sometimes epoll behaved better, 
sometimes kevent. What does that say? Just the fact that in that given 
workload the result was the one we saw. Nothing more, nothing less.
It does not show that something is broken, and definitely not that it is:
citation1:
we're heading to yet-another monolitic interface, we're heading with no
valid reasons given if other than some handwaving.
citation2:
consider kevent an over-hyped, mis-benchmarked complication to do 
something that epoll is perfectly

Taking into account the other features kevent has (and what it was
originally designed for: network AIO, which is quite hard, if possible 
at all, with files and epoll; I'm not talking about syslets as AIO, 
which is a different and likely simpler approach, and that alone already 
makes it very good), it is not what people said in the above 
citations.

It looks like you take this personally, which I do not
understand. But that has nothing to do with the technical side of the problem, so
let's stop such rhetoric, concentrate on the real problem and forget any
possible personal issues which might have been raised along the way :).

Although I closed kevent and eventfs projects, I would gladly continue
if we can and want to have progress in that area.


-- 
	Evgeniy Polyakov
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 1:54 am

let me refresh your recollection:

  http://lkml.org/lkml/2007/2/25/116

where you said:

 "But note, that on my athlon64 3500 test machine kevent is about 7900
  requests per second compared to 4000+ epoll, so expect a challenge."

for a long time you made much fuss about how kevents is so much better 
and how epoll cannot perform and scale as well (you said various 
arguments why that is supposedly so), and some people bought into the 
performance argument and advocated kevent due to its supposed 
performance and scalability advantages - while now we are down to "epoll 
and kevent are break-even"?

in my book that is way too much of a difference, it is (best-case) a way 
too sloppy approach to something as fundamental as Linux's basic event 
model and design, and it is also compounded by your continued "nothing 
happened, really, lets move on" stance. Losing trust is easy, winning it 
back is hard. Let me reuse a phrase of yours: "expect a challenge".

	Ingo
-

From: Evgeniy Polyakov
Date: Wednesday, May 30, 2007 - 2:30 am

You can also find in those threads that I managed to run an epoll server on 
that machine with 9k requests per second, although that was not

You just draw a picture you want to see.

Even on the kevent page I have links to other people's benchmarks, which
show how kevent behaves compared to epoll under their load.
_My_ tests showed a kevent performance win, you tuned my (possibly
broken) epoll code and the results changed - this is the development process,

Well, I do not care much about what people think I did wrong or right.
There are obviously bad and good ideas and implementations.
I might be absolutely wrong with something, but that is a process of
solving problems, which I really enjoy.

I just want there to be no personal insults; if I made such things,

-- 
	Evgeniy Polyakov
-

From: Jeff Garzik
Date: Wednesday, May 30, 2007 - 2:28 am

You snipped the key part of my response, so I'll say it again:

Event rings (a) most closely match what is going on in the hardware and 
(b) often closely match what is going on in multi-socket, event-driven 
software applications.

To echo Uli and paraphrase an ad, "it's the interface, silly."

This is not something epoll is capable of doing, at the present time.

	Jeff


-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 6:02 am

event rings are just pure data structures that describe a set of data, 
and they have advantages and disadvantages. For the record, we've 
already got direct experience with rings as software APIs: they were 
used for KAIO and they were an implementational and maintenance 
nightmare and nobody used them. Kevent might be better, but you make it 
sound as if it was a trivial design choice while it certainly isnt!

Sure, for hardware interfaces like networking cards tx and rx rings are 
the best thing but that is apples to oranges: hardware itself is about 
_limited_ physical resources, matching a _limited_ data structure like a 
ring quite well. But for software APIs, the built-in limit of rings 
makes it a baroque data structure that has a fair share of disadvantages in 

epoll is very much capable of doing it - but why bother if something 
more flexible than a ring can be used and the performance difference is 
negligible? (Read my other reply in this thread for further points.)

but, for the record, syslets very much use a completion ring, so i'm not 
fundamentally opposed to the idea. I just think it's seriously 
over-hyped, just like most other bits of the kevent approach. (Nor do we 
have to attach this to syslets and threadlets - kevents are an 
orthogonal approach not directly related to asynchronous syscalls - 
syslets/threadlets can make use of epoll just as much as they can make 
use of kevent APIs.)
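
To make the "built-in limit" point concrete, here is a toy single-producer, single-consumer ring. It is a sketch only: the names and the 4-entry size are made up, and it is not the syslet completion ring layout. Once RING_SIZE events are pending the producer has to fail, block or drop, and an API built on a ring has to pick one:

```c
#include <assert.h>
#include <stddef.h>

#define RING_SIZE 4u	/* must be a power of two for the index math below */

struct event_ring {
	unsigned head;			/* next slot the consumer reads */
	unsigned tail;			/* next slot the producer writes */
	long events[RING_SIZE];
};

/* Queue one event. Returns 0, or -1 if the ring is full. */
static int ring_push(struct event_ring *r, long ev)
{
	assert(r != NULL);

	if (r->tail - r->head == RING_SIZE)
		return -1;	/* the built-in limit: no room left */

	r->events[r->tail % RING_SIZE] = ev;
	r->tail++;
	return 0;
}

/* Dequeue the oldest event. Returns 0, or -1 if the ring is empty. */
static int ring_pop(struct event_ring *r, long *ev)
{
	assert(r != NULL && ev != NULL);

	if (r->head == r->tail)
		return -1;

	*ev = r->events[r->head % RING_SIZE];
	r->head++;
	return 0;
}
```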

	Ingo
-

From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 6:20 am

in particular i'd like to (re-)stress this point:

 Thirdly, our main problem was not the structure of epoll, our main
 problem was that event APIs were not widely available, so applications
 couldnt go to a pure event based design - they always had to handle
 certain types of event domains specially, due to lack of coverage. The
 latest epoll patches largely address that. This was a huge barrier
 against adoption of epoll.

starting with putting limits into the design by going to over-smart data 
structures like rings is just stupid. Lets fix, enhance and speed up 
what we have now (epoll) so that it becomes ubiquitous, and _then_ we 
can extend epoll to maybe fill events into rings. We should have our 
priorities right and should stop rewriting the whole world, especially 
when it comes to user APIs. Right now we have _no_ event API with 
complete coverage, and that's far more of a problem than the actual 
micro-structure of the API.

	Ingo
-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 8:31 am

I have rather strong counter-arguments:

 (a) yes, it's how hardware does it, but if you actually look at hardware, 
     you quickly realize that every single piece of hardware uses a 
     *different* ring interface.

     This should really tell you something. In fact, it may not be rings 
     at all, but structures with more complex formats (eg the USB 
     descriptors).

 (b) yes, event-driven software tends to use some data structures that are 
     sometimes approximated by event rings, but they all use *different* 
     software structures. There simply *is* no common "event" structure: 
     each program tends to have its own issues, its own allocation 
     policies, and its own "ring" structures.

     They may not be rings at all. They can be priority queues/heaps or 

THERE IS NO INTERFACE! You're just making that up, and glossing over the 
most important part of the whole thing! 

If you could actually point to something specific that matches what 
everybody needs, and is architecture-neutral, it would be a different 
issue. As is, you're just saying "memory-mapped interfaces" without 
actually going into enough detail to show HOW MUCH IT SUCKS.

There really are very few programs that would use them. We had a trivial 
benchmark, the only function of which was to show usage, and here Ingo and 
Evgeniy are (once more) talking about bugs in that one months later.

THAT should tell you something.

Make poll/select/aio/read etc faster. THAT is where the payoffs are.

In fact, if somebody wants to look at a standard interface that could be 
speeded up, the prime thing to look at is "readdir()" (aka getdents). 
Making _that_ thing go faster and scale better and do read-ahead is likely 
to be a lot more important for performance. It was one of the bottle-necks 
for samba several years ago, and nobody has really tried to improve it.

And yes, that's because it's hard - people would rather make up new 
interfaces that are largely irrelevant even before ...
From: Ingo Molnar
Date: Wednesday, May 30, 2007 - 9:09 am

looking over the list of our new generic APIs (see further below) i 
think there are three important things that are needed for an API to 
become widely used:

 1) it should solve a real problem (ha ;-), it should be intuitive to 
    humans and it should fit into existing things naturally.

 2) it should be ubiquitous. (if it's about IO it should cover block IO,
    network IO, timers, signals and everything) Even if it might look
    silly in some of the cases, having complete, utter, no compromises,
    100% coverage for everything massively helps the uptake of an API, 
    because it allows the user-space coder to pick just one paradigm 
    that is closest to his application and stick to it and only to it.

 3) it should be end-to-end supported by glibc.

our failed API attempts so far were:

 - sendfile(). This API mainly failed on #2. It partly failed on #1 too.
   (couldnt be used in certain types of scenarios so was unintuitive.)
   splice() fixes this almost completely.

 - KAIO. It fails on #2 and #3.

our more successful new APIs:

 - futexes. After some hiccups they form the base of all modern 
   user-space locking.

 - splice. (a bit too early to tell but it's looking good so far. Would
   be nice if someone did a brute-force memcpy() based vmsplice to user
   memory, just to make usage fully symmetric.)

partially successful, not yet failed new APIs:

 - epoll. It currently fails at #2 (v2.6.22 mostly fills the gaps but
   not completely). Despite the non-complete coverage of event domains a
   good number of apps are using it, and in particular a couple really
   'high end' apps with massive amounts of event sources - which apps 
   would have no chance with poll, select or threads.

 - inotify. It's being used quite happily on the desktop, despite some
   of its limitations. (Possibly integratable into epoll?)

	Ingo
-

From: Jens Axboe
Date: Wednesday, May 30, 2007 - 10:57 am

Heh, I actually agree, at least then the interface is complete! We can
always replace it with something more clever, should someone feel so
inclined. Here's a rough patch to do that, it's totally untested (but it
compiles). sparse will warn about the __user removal, though. I'm sure
viro would shoot me dead on the spot, should he see this...

diff --git a/fs/splice.c b/fs/splice.c
index 12f2828..5023c01 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -657,9 +657,9 @@ out_ret:
  * key here is the 'actor' worker passed in that actually moves the data
  * to the wanted destination. See pipe_to_file/pipe_to_sendpage above.
  */
-ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
-			   struct file *out, loff_t *ppos, size_t len,
-			   unsigned int flags, splice_actor *actor)
+ssize_t __splice_from_pipe(struct pipe_inode_info *pipe, void *actor_priv,
+			   loff_t *ppos, size_t len, unsigned int flags,
+			   splice_actor *actor)
 {
 	int ret, do_wakeup, err;
 	struct splice_desc sd;
@@ -669,7 +669,7 @@ ssize_t __splice_from_pipe(struct pipe_inode_info *pipe,
 
 	sd.total_len = len;
 	sd.flags = flags;
-	sd.file = out;
+	sd.file = actor_priv;
 	sd.pos = *ppos;
 
 	for (;;) {
@@ -1240,28 +1240,104 @@ static int get_iovec_page_array(const struct iovec __user *iov,
 	return error;
 }
 
+static int pipe_to_user(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
+			struct splice_desc *sd)
+{
+	int ret;
+
+	ret = buf->ops->pin(pipe, buf);
+	if (!ret) {
+		void __user *dst = sd->userptr;
+		/*
+		 * use non-atomic map, can be optimized to map atomically if we
+		 * prefault the user memory.
+		 */
+		char *src = buf->ops->map(pipe, buf, 0);
+
+		if (copy_to_user(dst, src, sd->len))
+			ret = -EFAULT;
+
+		buf->ops->unmap(pipe, buf, src);
+
+		if (!ret)
+			return sd->len;
+	}
+
+	return ret;
+}
+
+/*
+ * For lack of a better implementation, implement vmsplice() to userspace
+ * as a simple copy of the pipes pages to the user iov.
+ */
+static ...
From: Mark Lord
Date: Wednesday, May 30, 2007 - 12:05 pm

I wonder how useful it would be to reimplement sendfile()
using splice(), either in glibc or inside the kernel itself?

sendfile() does get used a fair bit, but I really doubt that anyone
outside of a handful of people on this list actually use splice().

Cheers
-

From: Jens Axboe
Date: Wednesday, May 30, 2007 - 12:10 pm

It's indeed the plan, I even have a git branch for it. Just never took the
time to actually finish it.

http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=splice-sendfile

-- 
Jens Axboe

-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 12:15 pm

I'd like that, if only because right now we have two separate paths that 
kind of do the same thing, and splice really is the only one that is 
generic.

I thought Jens even had some experimental patches for it. It might be 
worth to "just do it" - there's some internal overhead, but on the other 
hand, it's also likely the best way to make sure any issues get sorted 
out.

		Linus
-

From: Jens Axboe
Date: Wednesday, May 30, 2007 - 12:32 pm

I do, this is a one year old patch that does that:

http://git.kernel.dk/?p=linux-2.6-block.git;a=commitdiff;h=f8f550e027fd07ad8d871101788...

I'll update it, test, and submit for 2.6.23.

-- 
Jens Axboe

-

From: Eric Dumazet
Date: Wednesday, May 30, 2007 - 1:07 pm

Last time I played with splice(), I found a bug with readahead logic, most 
probably because nobody but me tried it before.

(corrected by Fengguang Wu in commit 9ae9d68cbf3fe0ec17c17c9ecaa2188ffb854a66 )

So yes, reimplementing sendfile() should help to find the last splice() bugs, and as 
a bonus it could add non-blocking disk io (O_NONBLOCK on input file -> socket)


-

From: Linus Torvalds
Date: Wednesday, May 30, 2007 - 1:31 pm

Well, to get those kinds of advantages, you'd have to use splice directly, 
since sendfile() hasn't supported nonblocking disk IO, and the interface 
doesn't really allow for it.

In fact, since nonblocking accesses require also some *polling* method, 
and we don't have that for files, I suspect the best option for those 
things is to simply mix AIO and splice(). AIO tends to be the right thing 
for disk waits (read: short, often cached), and if we can improve AIO 
performance for the cached accesses (which is exactly what the threadlets 
should hopefully allow us to do), I would seriously suggest going that 
route.

But the pure "use splice to _implement_ sendfile()" thing is worth doing 
for all the other reasons, even if nonblocking file access is not likely 
one of them.

		Linus
-

From: Eric Dumazet
Date: Wednesday, May 30, 2007 - 1:46 pm

sendfile() interface doesnt allow it, but if you open("somediskfile", O_RDONLY 
| O_NONBLOCK), then splice() based sendfile() can perform a non blocking disk 
io, (while starting an io with readahead)

I actually use this trick myself :)

(splice(disk -> pipe, NONBLOCK), splice(pipe -> worker))
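
A rough sketch of that two-step pattern, assuming Linux and glibc's splice() wrapper. The helper name splice_copy() is made up, and whether the first step really avoids blocking on disk depends on the kernel and filesystem:

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Move up to len bytes from in_fd to out_fd through a private pipe.
 * Step 1 (in_fd -> pipe) is issued with SPLICE_F_NONBLOCK; step 2
 * (pipe -> out_fd) drains whatever step 1 produced.
 * Returns the number of bytes moved, or -1 if nothing could be moved
 * because of an error (including EAGAIN on the first splice).
 */
static ssize_t splice_copy(int in_fd, int out_fd, size_t len)
{
	int p[2];
	ssize_t total = 0;

	assert(in_fd >= 0 && out_fd >= 0);

	if (pipe(p) < 0)
		return -1;

	while (len > 0) {
		ssize_t in = splice(in_fd, NULL, p[1], NULL, len,
				    SPLICE_F_NONBLOCK | SPLICE_F_MOVE);
		if (in <= 0) {
			/* 0 is end of input; <0 is EAGAIN or a real error. */
			if (in < 0 && total == 0)
				total = -1;
			break;
		}
		len -= (size_t)in;

		/* Drain the pipe completely before refilling it. */
		while (in > 0) {
			ssize_t out = splice(p[0], NULL, out_fd, NULL,
					     (size_t)in, SPLICE_F_MOVE);
			if (out <= 0) {
				total = -1;
				goto done;
			}
			in -= out;
			total += out;
		}
	}
done:
	close(p[0]);
	close(p[1]);
	return total;
}
```

In a real server the second splice would target a socket, and an EAGAIN from the first one would be the cue to go do other work while readahead brings the data in.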


-

From: Davide Libenzi
Date: Wednesday, May 30, 2007 - 12:52 pm

I think, as Linus pointed out (as I did a few months ago), that there's 
confusion about the term "Unification" or "Single Interface".
Unification is not about fetching all the data coming from the more 
diverse sources, into a single interface. That is just broken, because 
each data source wants a different data structure to be reported. 
This is ABI-hell 101. Unification is the ability to uniformly wait for 
readiness, and then fetch data with source-dependent collectors (read(2), 
io_getevents(2), ...). That way you have ABI isolation on the single 
data source, and not monster structures trying to blob together the more 
diverse data formats.
AFAIK, inotify works with select/poll/epoll as is.
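
A minimal sketch of that split, assuming Linux with epoll and eventfd available (eventfd is used here only as a second, differently formatted source). The struct and function names are made up for illustration: one uniform wait, then each ready fd is drained by its own collector.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

#define MAX_READY 8

struct source {
	int fd;
	long (*collect)(int fd);	/* source-dependent collector */
	long last;			/* result of the last collection */
};

/* Collector for a byte stream: returns the number of bytes read. */
static long collect_pipe(int fd)
{
	char buf[64];

	return (long)read(fd, buf, sizeof(buf));
}

/* Collector for an eventfd: returns the 64-bit counter value. */
static long collect_eventfd(int fd)
{
	uint64_t count;

	if (read(fd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return -1;
	return (long)count;
}

/*
 * Wait once for readiness on all sources, then let each ready source
 * fetch its data in its own format. Returns the number of ready
 * sources, or -1 on error.
 */
static int wait_and_collect(struct source *srcs, int n, int timeout_ms)
{
	struct epoll_event ev = { 0 }, ready[MAX_READY];
	int epfd, i, nready;

	assert(srcs != NULL && n > 0 && n <= MAX_READY);

	epfd = epoll_create1(0);
	if (epfd < 0)
		return -1;

	for (i = 0; i < n; i++) {
		ev.events = EPOLLIN;
		ev.data.ptr = &srcs[i];
		if (epoll_ctl(epfd, EPOLL_CTL_ADD, srcs[i].fd, &ev) < 0) {
			close(epfd);
			return -1;
		}
	}

	nready = epoll_wait(epfd, ready, MAX_READY, timeout_ms);
	for (i = 0; i < nready; i++) {
		struct source *s = ready[i].data.ptr;

		s->last = s->collect(s->fd);
	}

	close(epfd);
	return nready;
}
```

The wait path never needs to know what a pipe or an eventfd delivers; only the collectors do, which is the ABI isolation argued for above.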



- Davide


-

From: Jens Axboe
Date: Wednesday, May 30, 2007 - 12:40 am

On Tue, May 29 2007, Zach Brown wrote:


Yeah, it'll confuse CFQ a lot actually. The threads either need to share
an io context (clean approach, however will introduce locking for things
that were previously lockless), or CFQ needs to get better support for
cooperating processes. The problem is that CFQ will wait for a dependent
IO for a given process, which may arrive from a totally unrelated
process.

For the fio testing, we can make some improvements there. Right now you
don't get any concurrency of the io requests if you set eg iodepth=32,
as the 32 requests will be submitted as a linked chain of atoms. For io
saturation, that's not really what you want.

I'll take a stab at improving both of the above.

-- 
Jens Axboe

-

From: Zach Brown
Date: Wednesday, May 30, 2007 - 9:55 am

Just to be clear: I'm currently focusing on supporting sys_io_*() so I'm
using fio's libaio engine.  I'm not testing the syslet syscall interface
yet.

- z
-

From: Jens Axboe
Date: Wednesday, May 30, 2007 - 10:33 am

Ah ok, then there's no issue from that end!

-- 
Jens Axboe

-
