Linux: Syslets, Generic Asynchronous System Call Support

Submitted by Jeremy
on February 15, 2007 - 12:11pm

Ingo Molnar [interview] posted a set of 11 patches introducing "the first release of the 'Syslet' kernel feature and kernel subsystem, which provides generic asynchronous system call support". Ingo explains:

"Syslets are small, simple, lightweight programs (consisting of system-calls, 'atoms') that the kernel can execute autonomously (and, not the least, asynchronously), without having to exit back into user-space. Syslets can be freely constructed and submitted by any unprivileged user-space context - and they have access to all the resources (and only those resources) that the original context has access to."

Ingo goes on in his email to explain in greater detail how syslets work, then adds, "as it might be obvious to some of you, the syslet subsystem takes many ideas and experience from my Tux in-kernel webserver :) The syslet code originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss infrastructure." He also offered some benchmark results, showing a 33.9% speedup comparing uncached synchronous IO to syslets, and a 19.2% speedup comparing cached synchronous IO to syslets, "so syslets, in this particular workload, are a nice speedup /both/ in the uncached and in the cached case. (note that i used only a single disk, so the level of parallelism in the hardware is quite limited.)"


From: Ingo Molnar [email blocked]
To:  linux-kernel
Subject: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
Date:	Tue, 13 Feb 2007 15:20:10 +0100

I'm pleased to announce the first release of the "Syslet" kernel feature 
and kernel subsystem, which provides generic asynchronous system call 
support:

   http://redhat.com/~mingo/syslet-patches/

Syslets are small, simple, lightweight programs (consisting of 
system-calls, 'atoms') that the kernel can execute autonomously (and, 
not the least, asynchronously), without having to exit back into 
user-space. Syslets can be freely constructed and submitted by any 
unprivileged user-space context - and they have access to all the 
resources (and only those resources) that the original context has 
access to.

because the proof of the pudding is eating it, here are the performance 
results from async-test.c which does open()+read()+close() of 1000 small 
random files (smaller is better):

                  synchronous IO      |   Syslets:
                  --------------------------------------------
  uncached:       45.8 seconds        |  34.2 seconds   ( +33.9% )
  cached:         31.6 msecs          |  26.5 msecs     ( +19.2% )

("uncached" results were done via "echo 3 > /proc/sys/vm/drop_caches". 
The default IO scheduler was the deadline scheduler, the test was run on 
ext3, using a single PATA IDE disk.)

So syslets, in this particular workload, are a nice speedup /both/ in 
the uncached and in the cached case. (note that i used only a single
disk, so the level of parallelism in the hardware is quite limited.)

the testcode can be found at:

     http://redhat.com/~mingo/syslet-patches/async-test-0.1.tar.gz

The boring details:

Syslets consist of 'syslet atoms', where each atom represents a single 
system-call. These atoms can be chained to each other: serially, in 
branches or in loops. The return value of an executed atom is checked 
against the condition flags. So an atom can specify 'exit on nonzero' or 
'loop until non-negative' kind of constructs.
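
As an illustration of the chaining described above, here is a minimal user-space sketch in C. It is not taken from the patchset: the structure `demo_atom`, the `DEMO_*` flags and the helper functions are names invented for this example, and a plain function pointer stands in for the system call that a real atom would carry. It only models the idea that each atom's return value is checked against per-atom condition flags to decide whether to stop, repeat, or move on.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical condition flags, named for this sketch only. */
enum {
    DEMO_STOP_ON_NEGATIVE    = 1 << 0, /* end the chain if ret < 0 */
    DEMO_STOP_ON_NONZERO     = 1 << 1, /* end the chain if ret != 0 */
    DEMO_LOOP_WHILE_POSITIVE = 1 << 2  /* re-run this atom while ret > 0 */
};

struct demo_atom {
    long (*call)(void *arg);  /* stands in for one system call */
    void *arg;
    unsigned flags;
    struct demo_atom *next;   /* NULL ends the chain */
};

/* Run a chain; returns the return value of the last atom executed. */
static long demo_run(struct demo_atom *atom)
{
    long ret = 0;

    while (atom) {
        assert(atom->call != NULL);
        ret = atom->call(atom->arg);
        if ((atom->flags & DEMO_STOP_ON_NEGATIVE) && ret < 0)
            break;
        if ((atom->flags & DEMO_STOP_ON_NONZERO) && ret != 0)
            break;
        if ((atom->flags & DEMO_LOOP_WHILE_POSITIVE) && ret > 0)
            continue;         /* loop: execute the same atom again */
        atom = atom->next;
    }
    return ret;
}

/* Example calls: drain a budget (like read() consuming input),
 * fail unconditionally, and record that an atom was reached. */
static long demo_drain(void *arg)
{
    long *left = arg;
    return (*left > 0) ? (*left)-- : 0;
}

static long demo_fail(void *arg) { (void)arg; return -1; }

static long demo_mark(void *arg) { *(int *)arg = 1; return 0; }
```

Chaining `demo_drain` (with the loop flag) to `demo_mark` runs the first atom until it returns 0 and then falls through to the second, while a chain headed by `demo_fail` with the stop-on-negative flag ends before the next atom runs.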

Syslet atoms fundamentally execute only system calls, thus to be able to 
manipulate user-space variables from syslets i've added a simple special 
system call: sys_umem_add(ptr, val). This can be used to increase or 
decrease the user-space variable (and to get the result), or to simply 
read out the variable (if 'val' is 0).
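
The semantics described above can be modelled in ordinary user-space C. This is only a sketch of the described behaviour, not the kernel implementation: `model_umem_add` and its `long` types are assumptions made for the example, and the real system call would operate on a user-space pointer from inside the kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Model of the described behaviour: add 'val' to the variable that
 * 'ptr' points to and return the resulting value. A 'val' of 0
 * leaves the variable unchanged, so the call just reads it out. */
static long model_umem_add(long *ptr, long val)
{
    assert(ptr != NULL);
    *ptr += val;
    return *ptr;
}
```

A syslet could use such a call as a loop counter: decrement a "connections remaining" variable by passing -1 and keep looping while the returned value is positive.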

So a single syslet (submitted and executed via a single system call) can 
be arbitrarily complex. For example it can be like this:

       --------------------
       |     accept()     |-----> [ stop if returns negative ]
       --------------------
                |
                V
  -------------------------------
  |   setsockopt(TCP_NODELAY)   |-----> [ stop if returns negative ]
  -------------------------------
                |
                v
       --------------------
       |      read()      |<---------
       --------------------         | [ loop while positive ]
           |    |                   |
           |    ---------------------
           |
        -----------------------------------------
        | decrease and read user space variable |
        -----------------------------------------                    A
                    |                                                |
                    -------[ loop back to accept() if positive ]------

(you can find a VFS example and a hello.c example in the user-space 
testcode.)

A syslet is executed opportunistically: i.e. the syslet subsystem 
assumes that the syslet will not block, and it will switch to a 
cachemiss kernel thread from the scheduler. This means that even a 
single-atom syslet (i.e. a pure system call) is very close in 
performance to a pure system call. The syslet NULL-overhead in the 
cached case is roughly 10% of the SYSENTER NULL-syscall overhead. This 
means that two atoms are a win already, even in the cached case.

When a 'cachemiss' occurs, i.e. if we hit schedule() and are about to 
consider other threads, the syslet subsystem picks up a 'cachemiss 
thread' and switches the current task's user-space context over to the 
cachemiss thread, and makes the cachemiss thread available. The original 
thread (which now becomes a 'busy' cachemiss thread) continues to block. 
This means that user-space will still be executed without stopping - 
even if user-space is single-threaded.

if the submitting user-space context /knows/ that a system call will 
block, it can request immediate 'cachemiss' via the SYSLET_ASYNC flag. 
This would be used if for example an O_DIRECT file is read() or 
write()n.

likewise, if user-space knows (or expects) that a system call takes a lot 
of CPU time even in the cached case, and it wants to offload it to 
another asynchronous context, it can request that via the SYSLET_ASYNC 
flag too.

completions of asynchronous syslets are done via a user-space ringbuffer 
that the kernel fills and user-space clears. Waiting is done via the 
sys_async_wait() system call. Completion can be suppressed on a per-atom 
basis via the SYSLET_NO_COMPLETE flag, for atoms that include some 
implicit notification mechanism. (such as sys_kill(), etc.)
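
To make the "kernel fills, user-space clears" convention concrete, here is a small single-threaded model of such a completion ring in C. Everything in it is an assumption for illustration: the names `demo_ring`, `demo_ring_complete` and `demo_ring_reap`, the slot count, and the use of a NULL slot to mean "empty". A real shared ring would also need memory-ordering guarantees between producer and consumer, which this sketch leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define RING_SLOTS 4  /* hypothetical size */

struct demo_ring {
    void  *slot[RING_SLOTS]; /* NULL means "empty" */
    size_t kernel_idx;       /* next slot the producer fills */
    size_t user_idx;         /* next slot the consumer checks */
};

/* Producer side (the kernel's role): publish one completion.
 * Returns 0 on success, -1 if the consumer has not yet cleared
 * the slot, i.e. the ring is full. */
static int demo_ring_complete(struct demo_ring *r, void *done)
{
    assert(done != NULL);
    if (r->slot[r->kernel_idx] != NULL)
        return -1;
    r->slot[r->kernel_idx] = done;
    r->kernel_idx = (r->kernel_idx + 1) % RING_SLOTS;
    return 0;
}

/* Consumer side (user-space's role): take one completion and clear
 * its slot. Returns NULL when nothing is pending, which is the point
 * where a real program would block in the wait system call. */
static void *demo_ring_reap(struct demo_ring *r)
{
    void *done = r->slot[r->user_idx];

    if (done == NULL)
        return NULL;
    r->slot[r->user_idx] = NULL;
    r->user_idx = (r->user_idx + 1) % RING_SLOTS;
    return done;
}
```

Because the consumer clears each slot after reading it, the producer can tell a free slot from a pending one without a shared counter; completions are reaped in the order they were published.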

As it might be obvious to some of you, the syslet subsystem takes many 
ideas and experience from my Tux in-kernel webserver :) The syslet code 
originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss 
infrastructure.

Open issues:

 - the 'TID' of the 'head' thread currently varies depending on which 
   thread is running the user-space context.

 - signal support is not fully thought through - probably the head 
   should be getting all of them - the cachemiss threads are not really 
   interested in executing signal handlers.

 - sys_fork() and sys_async_exec() should be filtered out from the 
   syscalls that are allowed - first one only makes sense with ptregs, 
   second one is a nice kernel recursion thing :) I didn't want to 
   duplicate the sys_call_table though - maybe others have a better 
   idea.

See more details in Documentation/syslet-design.txt. The patchset is 
against v2.6.20, but should apply to the -git head as well.

Thanks to Zach Brown for the idea to drive cachemisses via the 
scheduler. Thanks to Arjan van de Ven for early review feedback.

Comments, suggestions, reports are welcome!

	Ingo


From: Alan [email blocked]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
Date: Tue, 13 Feb 2007 15:00:19 +0000

> A syslet is executed opportunistically: i.e. the syslet subsystem
> assumes that the syslet will not block, and it will switch to a
> cachemiss kernel thread from the scheduler. This means that even a

How is scheduler fairness maintained ? and what is done for resource
accounting here ?

> that the kernel fills and user-space clears. Waiting is done via the
> sys_async_wait() system call. Completion can be supressed on a per-atom

They should be selectable as well iff possible.

> Open issues:

Let me add some more

sys_setuid/gid/etc need to be synchronous only and not occur
while other async syscalls are running in parallel to meet current kernel
assumptions.

sys_exec and other security boundaries must be synchronous only
and not allow async "spill over" (consider setuid async binary patching)

>  - sys_fork() and sys_async_exec() should be filtered out from the
>    syscalls that are allowed - first one only makes sense with ptregs,

clone and vfork. async_vfork is a real mindbender actually.

>    second one is a nice kernel recursion thing :) I didnt want to
>    duplicate the sys_call_table though - maybe others have a better
>    idea.

What are the semantics of async sys_async_wait and async sys_async ?


From: Andi Kleen [email blocked]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
Date: 13 Feb 2007 17:39:28 +0100

Alan [email blocked] writes:

Funny, it sounds like batch() on stereoids @)

Ok with an async context it becomes somewhat more interesting.

> sys_setuid/gid/etc need to be synchronous only and not occur
> while other async syscalls are running in parallel to meet current kernel
> assumptions.
>
> sys_exec and other security boundaries must be synchronous only
> and not allow async "spill over" (consider setuid async binary patching)

He probably would need some generalization of Andrea's seccomp work.
Perhaps using bitmaps? For paranoia I would suggest to white list, not black list
calls.

-Andi


From: Linus Torvalds [email blocked]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
Date: Tue, 13 Feb 2007 08:26:03 -0800 (PST)

On Tue, 13 Feb 2007, Andi Kleen wrote:

> > sys_exec and other security boundaries must be synchronous only
> > and not allow async "spill over" (consider setuid async binary patching)
>
> He probably would need some generalization of Andrea's seccomp work.
> Perhaps using bitmaps? For paranoia I would suggest to white list, not black list
> calls.

It's actually more likely a lot more efficient to let the system call
itself do the sanity checking. That allows the common system calls (that
*don't* need to even check) to just not do anything at all, instead of
having some complex logic in the common system call execution trying to
figure out for each system call whether it is ok or not.

Ie, we could just add to "do_fork()" (which is where all of the
vfork/clone/fork cases end up) a simple case like

	err = wait_async_context();
	if (err)
		return err;

or

	if (in_async_context())
		return -EINVAL;

or similar. We need that "async_context()" function anyway for the other
cases where we can't do other things concurrently, like changing the UID.

I would suggest that "wait_async_context()" would do:

 - if we are *in* an async context, return an error. We cannot wait for
   ourselves!

 - if we are the "real thread", wait for all async contexts to go away
   (and since we are the real thread, no new ones will be created, so
   this is not going to be an infinite wait)

The new thing would be that wait_async_context() would possibly return
-ERESTARTSYS (signal while an async context was executing), so any system
call that does this would possibly return EINTR. Which "fork()" hasn't
historically done. But if you have async events active, some operations
likely cannot be done (setuid() and execve() comes to mind), so you really
do need something like this.

And obviously it would only affect any program that actually would _use_
any of the suggested new interfaces, so it's not like a new error return
would break anything old.

		Linus

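
The behaviour Linus suggests for "wait_async_context()" can be modelled outside the kernel roughly as follows. This is a hedged sketch, not kernel code: `model_task`, `model_wait_async_context` and `model_do_fork` are invented names, returning -EINVAL for the "in an async context" case is an assumption (the email only says "return an error"), and decrementing a counter stands in for actually sleeping until an async context finishes. `MODEL_ERESTARTSYS` mirrors the kernel-internal ERESTARTSYS value, which user-space errno.h does not define.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_ERESTARTSYS 512 /* kernel-internal code, absent from errno.h */

struct model_task {
    bool in_async;       /* running inside an async context? */
    int  async_active;   /* async contexts still in flight */
    bool signal_pending; /* a signal arrived while waiting */
};

/* Returns 0 once no async contexts remain, a negative code otherwise. */
static int model_wait_async_context(struct model_task *t)
{
    assert(t != NULL);
    if (t->in_async)
        return -EINVAL;            /* we cannot wait for ourselves */
    while (t->async_active > 0) {
        if (t->signal_pending)
            return -MODEL_ERESTARTSYS;
        t->async_active--;         /* stands in for one context exiting */
    }
    return 0;
}

/* A call that must not run concurrently with async contexts checks
 * first, in the style of the do_fork() snippet quoted above. */
static int model_do_fork(struct model_task *t)
{
    int err = model_wait_async_context(t);

    if (err)
        return err;
    return 0; /* the actual fork work would go here */
}
```

The wait terminates because, as the email notes, only the "real thread" can create new async contexts, and it is the one waiting.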

From: Ingo Molnar [email blocked]
Subject: Re: [patch 00/11] ANNOUNCE: "Syslets", generic asynchronous system call support
Date: Tue, 13 Feb 2007 18:03:48 +0100

* Linus Torvalds <torvalds@linux-foundation.org> wrote:

> Ie, we could just add to "do_fork()" (which is where all of the
> vfork/clone/fork cases end up) a simple case like
>
> 	err = wait_async_context();
> 	if (err)
> 		return err;
>
> or
>
> 	if (in_async_context())
> 		return -EINVAL;

ok, this is a much nicer solution. I've scrapped the
sys_async_sys_call_table[] thing.

	Ingo

Amazing thing!

Anonymous (not verified)
on February 15, 2007 - 1:03pm

Ideas like this make me love the Linux world even more!

Keep up the good work!

I was just trying to read

on June 15, 2007 - 9:40pm

I was just trying to read that interview with Ingo but it's a dead link.
Could you please fix it? Thanks.

Linus: I hate it

Anonymous (not verified)
on February 15, 2007 - 4:22pm

You missed the next email Linus has sent after "having a closer look".

Link

on February 15, 2007 - 4:35pm

FWIW: Anonymous refers to this message from Linus.

And then there's kevent..

Johan (not verified)
on February 16, 2007 - 10:34am

And then there's kevent.. looking forward to people finally giving it a proper review when it hits 2.6.21-rc1 - which it probably isn't going to get anyway... oh well.
