Announcing the third version of his syslets subsystem patches, Ingo Molnar noted that he has implemented many fundamental changes to the code, including the introduction of threadlets: "'threadlets' are basically the user-space equivalent of syslets: small functions of execution that the kernel attempts to execute without scheduling. If the threadlet blocks, the kernel creates a real thread from it, and execution continues in that thread. The 'head' context (the context that never blocks) returns to the original function that called the threadlet." Because threadlets are only moved into a separate thread context if they block, Ingo refers to them as 'optional threads'. He also describes them as 'on-demand parallelism': "user-space does not have to worry about setting up, sizing and feeding a thread pool - the kernel will execute the workload in a single-threaded manner as long as it makes sense, but once the context blocks, a parallel context is created. So parallelism inside applications is utilized in a natural way."
Ingo goes on to note that the syslet code and API have been significantly enhanced in this latest release: "the v3 code is ABI-incompatible with v2, due to these fundamental changes." He adds, "syslets (small, kernel-side, scripted 'syscall plugins') are still supported - they are (much...) harder to program than threadlets but they allow the highest performance. Core infrastructure libraries like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool already includes support for v2 syslets, and the following patch updates FIO to the v3 API".
From: Ingo Molnar [email blocked]
To: linux-kernel
Subject: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Wed, 21 Feb 2007 22:13:55 +0100

this is the v3 release of the syslet/threadlet subsystem:

    http://redhat.com/~mingo/syslet-patches/

This release came a few days later than i originally wanted, because i've implemented many fundamental changes to the code. The biggest highlights of v3 are:

 - "Threadlets": the introduction of the 'threadlet' execution concept.

 - syslets: multiple rings support with no kernel-side footprint, the elimination of mlock() pinning, no async_register/unregister() calls needed anymore and more.

"Threadlets" are basically the user-space equivalent of syslets: small functions of execution that the kernel attempts to execute without scheduling. If the threadlet blocks, the kernel creates a real thread from it, and execution continues in that thread. The 'head' context (the context that never blocks) returns to the original function that called the threadlet. Threadlets are very easy to use:

    long my_threadlet_fn(void *data)
    {
            char *name = data;
            int fd;

            fd = open(name, O_RDONLY);
            if (fd < 0)
                    goto out;

            fstat(fd, &stat);
            read(fd, buf, count)
            ...

     out:
            return threadlet_complete();
    }

    main()
    {
            done = threadlet_exec(threadlet_fn, new_stack, &user_head);
            if (!done)
                    reqs_queued++;
    }

There is no limitation whatsoever about how a threadlet function can look like: it can use arbitrary system-calls and all execution will be procedural. There is no 'registration' needed when running threadlets either: the kernel will take care of all the details, user-space just runs a threadlet without any preparation and that's it.

Completion of async threadlets can be done from user-space via any of the existing APIs: in threadlet-test.c (see the async-test-v3.tar.gz user-space examples at the URL above) i've for example used a futex between the head and the async threads to do threadlet notification. But select(), poll() or signals can be used too - whichever is most convenient to the application writer.

Threadlets can also be thought of as 'optional threads': they execute in the original context as long as they do not block, but once they block, they are moved off into their separate thread context - and the original context can continue execution.

Threadlets can also be thought of as 'on-demand parallelism': user-space does not have to worry about setting up, sizing and feeding a thread pool - the kernel will execute the workload in a single-threaded manner as long as it makes sense, but once the context blocks, a parallel context is created. So parallelism inside applications is utilized in a natural way. (The best place to do this is in the kernel - user-space has no idea about what level of parallelism is best for any given moment.)

I believe this threadlet concept is what user-space will want to use for programmable parallelism.

[ Note that right now there's a pair of system-calls: sys_threadlet_on() and sys_threadlet_off() that demarks the beginning and the end of a syslet function, which enter the kernel even in the 'cached' case - but my plan is to do these two system calls via a vsyscall, without having to enter the kernel at all. That will reduce cached threadlet execution NULL-overhead to around 10 nsecs - making it essentially zero. ]

Threadlets share much of the scheduling infrastructure with syslets.

Syslets (small, kernel-side, scripted "syscall plugins") are still supported - they are (much...) harder to program than threadlets but they allow the highest performance. Core infrastructure libraries like glibc/libaio are expected to use syslets. Jens Axboe's FIO tool already includes support for v2 syslets, and the following patch updates FIO to the v3 API:

    http://redhat.com/~mingo/syslet-patches/fio-syslet-v3.patch

Furthermore, the syslet code and API has been significantly enhanced as well:

 - support for multiple completion rings has been added

 - there is no more mlock()ing of the completion ring(s)

 - sys_async_register()/unregister() has been removed as it is not needed anymore. sys_async_exec() can be called straight away.

 - there is no kernel-side resource used up by async completion rings at all (all the state is in user-space), so an arbitrary number of completion rings are supported.

plus lots of bugs were fixed and a good number of cleanups were done as well. The v3 code is ABI-incompatible with v2, due to these fundamental changes.

As always, comments, suggestions, reports are welcome.

Ingo
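The announcement says that threadlet-test.c uses a futex between the head and the async threads for completion notification. That file is not reproduced here, so what follows is only a minimal sketch of what such a notification could look like, assuming Linux and the raw futex(2) system call. The names completion_counter, completion_signal() and completion_wait() are hypothetical and are not part of the syslet/threadlet patches.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <linux/futex.h>
#include <stdatomic.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical completion word shared by the head context and async threads. */
struct completion_counter {
    _Atomic uint32_t done;   /* number of requests finished so far */
};

/* Thin wrapper: glibc does not export a futex() function, so use syscall(). */
static long futex_call(_Atomic uint32_t *addr, int op, uint32_t val)
{
    return syscall(SYS_futex, (uint32_t *)addr, op, val, NULL, NULL, 0);
}

/* Called by an async context when its request has finished. */
static void completion_signal(struct completion_counter *c)
{
    atomic_fetch_add(&c->done, 1);
    futex_call(&c->done, FUTEX_WAKE, INT_MAX);   /* wake every waiter */
}

/* Called by the head context: sleep until at least `target` requests are done. */
static void completion_wait(struct completion_counter *c, uint32_t target)
{
    for (;;) {
        uint32_t seen = atomic_load(&c->done);

        if (seen >= target)
            return;

        /* The kernel sleeps only if the word still equals `seen`;
         * otherwise it fails with EAGAIN and we re-check the counter. */
        long rc = futex_call(&c->done, FUTEX_WAIT, seen);
        assert(rc == 0 || errno == EAGAIN || errno == EINTR);
        (void)rc;
    }
}
```

In this sketch the head context would count how many requests went asynchronous (the reqs_queued counter in the example above) and then call completion_wait() with that number, while each async thread calls completion_signal() just before it finishes. Counter wrap-around is ignored for brevity.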
From: "Michael K. Edwards" [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Wed, 21 Feb 2007 14:46:32 -0800

On 2/21/07, Ingo Molnar [email blocked] wrote:

> I believe this threadlet concept is what user-space will want to use for
> programmable parallelism.

This is brilliant. Now it needs just four more things:

1) Documentation of what you can and can't do safely from a threadlet, given that it runs in an unknown thread context;

2) Facilities for manipulating pools of threadlets, so you can throttle their concurrency, reprioritize them, and cancel them in bulk, disposing safely of any dynamically allocated memory, synchronization primitives, and so forth that they may be holding;

3) Reworked threadlet scheduling to allow tens of thousands of blocked threadlets to be dispatched efficiently in a controlled, throttled, non-cache-and-MMU-thrashing manner, immediately following the softirq that unblocks the I/O they're waiting on; and

4) AIO vsyscalls whose semantics resemble those of IEEE 754 floating point operations, with a clear distinction between a) pipeline state vs. operands, b) results vs. side effects, and c) coding errors vs. not-a-number results vs. exceptions that cost you a pipeline flush and nonlocal branch.

When these four problems are solved (and possibly one or two more that I'm not thinking of), you will have caught up with the state of the art in massively parallel event-driven cooperative multitasking frameworks. This would be a really, really good thing for Linux and its users.

Cheers,
- Michael
From: Ingo Molnar [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Thu, 22 Feb 2007 00:03:51 +0100

* Michael K. Edwards [email blocked] wrote:

> 1) Documentation of what you can and can't do safely from a threadlet,
> given that it runs in an unknown thread context;

you can do just about anything from a threadlet, using bog standard procedural programming. (Certain system-calls are excluded at the moment out of caution - but i'll probably lift restrictions like sys_clone() use because sys_clone() can be done safely from a threadlet.)

The code must be thread-safe, because the kernel can move execution to a new thread anytime and then it will execute in parallel with the main thread. There's no other requirement.

Wrt. performance, one good model is to run request-alike functionality from a threadlet, to maximize parallelism.

ingo
From: Ingo Molnar [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Thu, 22 Feb 2007 00:24:07 +0100

* Michael K. Edwards [email blocked] wrote:

> 2) Facilities for manipulating pools of threadlets, so you can
> throttle their concurrency, reprioritize them, and cancel them in
> bulk, disposing safely of any dynamically allocated memory,
> synchronization primitives, and so forth that they may be holding;

pthread_cancel() [if/once threadlets are integrated into pthreads] ought to do that. A threadlet, if it gets moved to an async context, is a full-blown thread.

Ingo
From: Ingo Molnar [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Wed, 21 Feb 2007 23:57:11 +0100

* Michael K. Edwards [email blocked] wrote:

> 3) Reworked threadlet scheduling to allow tens of thousands of blocked
> threadlets to be dispatched efficiently in a controlled, throttled,
> non-cache-and-MMU-thrashing manner, immediately following the softirq
> that unblocks the I/O they're waiting on; and

threadlets, when they dont block, are just regular user-space function calls - so no need to schedule or throttle them. [*]

threadlets, when they block, are regular kernel threads, so the regular O(1) scheduler takes care of them. If MMU trashing is of any concern then syslets should be used to implement the most performance-critical events: under Linux a kernel thread that does not exit out to user-space does not do any TLB switching at all. (even if there are multiple processes active and their syslets intermix)

throttling of outstanding async contexts is most easily done by user-space - you can see an example in threadlet-test.c, but there's also fio/engines/syslet-rw.c. v2 had a kernel-space throttling mechanism as well, i'll probably reintroduce that in later versions.

Ingo

[*] although certain more advanced scheduling tactics like the detection of frequently executed threadlet functions and their pushing out to separate contexts is possible too - but this is an optional add-on and for later.
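The user-space throttling mentioned in this reply can be pictured with a small bookkeeping structure. This is a minimal sketch under stated assumptions, not the code from threadlet-test.c or fio/engines/syslet-rw.c: the struct throttle type, its functions and the fixed BACKLOG_CAP are all hypothetical, and requests are represented by plain integer ids.

```c
#include <assert.h>
#include <stddef.h>

#define BACKLOG_CAP 64   /* hypothetical limit on parked requests */

/* Caps the number of outstanding async contexts; extra requests are parked. */
struct throttle {
    unsigned max_in_flight;      /* saturation point chosen by the application */
    unsigned in_flight;          /* requests currently submitted */
    int backlog[BACKLOG_CAP];    /* FIFO ring of parked request ids */
    size_t head;                 /* index of the oldest parked request */
    size_t count;                /* number of parked requests */
};

static void throttle_init(struct throttle *t, unsigned max_in_flight)
{
    t->max_in_flight = max_in_flight;
    t->in_flight = 0;
    t->head = 0;
    t->count = 0;
}

/* Returns 1 if the caller should submit `req` now, 0 if it was parked,
 * and -1 if the backlog is full and the caller has to wait. */
static int throttle_submit(struct throttle *t, int req)
{
    if (t->in_flight < t->max_in_flight) {
        t->in_flight++;
        return 1;
    }
    if (t->count == BACKLOG_CAP)
        return -1;
    t->backlog[(t->head + t->count) % BACKLOG_CAP] = req;
    t->count++;
    return 0;
}

/* One submitted request finished. Returns 1 and stores the oldest parked
 * request in *next if the caller should submit it now, otherwise 0. */
static int throttle_complete(struct throttle *t, int *next)
{
    assert(t->in_flight > 0);
    t->in_flight--;
    if (t->count == 0)
        return 0;
    *next = t->backlog[t->head];
    t->head = (t->head + 1) % BACKLOG_CAP;
    t->count--;
    t->in_flight++;
    return 1;
}
```

With a limit of two, a third throttle_submit() call parks its request, and the next throttle_complete() hands that request back for submission. The structure is not thread-safe as written; a real application would call it from the head context only or protect it with a lock.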
From: Ingo Molnar [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Thu, 22 Feb 2007 00:31:11 +0100

* Michael K. Edwards [email blocked] wrote:

> 4) AIO vsyscalls whose semantics resemble those of IEEE 754 floating
> point operations, with a clear distinction between a) pipeline state
> vs. operands, b) results vs. side effects, and c) coding errors vs.
> not-a-number results vs. exceptions that cost you a pipeline flush and
> nonlocal branch.

threadlets (and syslets) are parallel contexts and they behave so - queuing and execution semantics are then ontop of that, implemented either by glibc, or implemented by the application. There is no 'pipeline' of requests imposed - the structure of pending requests is totally free-form. For example in threadlet-test.c i've in essence implemented a 'set of requests' with the submission site only interested in whether all requests are done or not - but any stricter (or even looser) semantics and ordering can be used too.

in terms of AIO, the best queueing model is i think what the kernel uses internally: freely ordered, with barrier support. (That is equivalent to a "queue of sets", where the queue are the barriers, and the sets are the requests within barriers. If there is no barrier pending then there's just one large freely-ordered set of requests.)

Ingo
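The "queue of sets" model described in this reply can be sketched as a per-set count of unfinished requests, where every barrier opens a new set. This is only an illustration of the ordering rule, not kernel code: struct barrier_queue, the bq_*() functions and the fixed MAX_SETS are hypothetical names chosen for the sketch.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_SETS 16   /* hypothetical cap: at most MAX_SETS - 1 barriers */

/* Requests inside one set are freely ordered; barriers separate the sets. */
struct barrier_queue {
    unsigned current;             /* index of the set that new requests join */
    unsigned pending[MAX_SETS];   /* unfinished requests in each set */
};

/* Adds a request to the current set and returns the index of that set. */
static unsigned bq_submit(struct barrier_queue *q)
{
    q->pending[q->current]++;
    return q->current;
}

/* Inserts a barrier: later requests go into a new set. Returns false
 * when the fixed-size sketch has run out of sets. */
static bool bq_barrier(struct barrier_queue *q)
{
    if (q->current + 1 == MAX_SETS)
        return false;
    q->current++;
    return true;
}

/* A request may run once every earlier set has fully completed. */
static bool bq_can_run(const struct barrier_queue *q, unsigned set)
{
    for (unsigned i = 0; i < set; i++)
        if (q->pending[i] != 0)
            return false;
    return true;
}

/* Marks one request of `set` as finished. */
static void bq_complete(struct barrier_queue *q, unsigned set)
{
    assert(set < MAX_SETS && q->pending[set] > 0);
    q->pending[set]--;
}
```

If no barrier is ever inserted, every request lands in set 0 and bq_can_run() is always true, which corresponds to the "one large freely-ordered set of requests" case.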
From: Ulrich Drepper [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Wed, 21 Feb 2007 15:46:45 -0800

Ingo Molnar wrote:

> in terms of AIO, the best queueing model is i think what the kernel uses
> internally: freely ordered, with barrier support.

Speaking of AIO, how do you imagine lio_listio is implemented? If there is no asynchronous syscall it would mean creating a threadlet for each request but this means either waiting or creating several/many threads.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
From: Ingo Molnar [email blocked]
Subject: Re: [patch 00/13] Syslets, "Threadlets", generic AIO support, v3
Date: Thu, 22 Feb 2007 08:40:44 +0100

* Ulrich Drepper [email blocked] wrote:

> Ingo Molnar wrote:
> > in terms of AIO, the best queueing model is i think what the kernel uses
> > internally: freely ordered, with barrier support.
>
> Speaking of AIO, how do you imagine lio_listio is implemented? If
> there is no asynchronous syscall it would mean creating a threadlet
> for each request but this means either waiting or creating
> several/many threads.

my current thinking is that special-purpose (non-programmable, static) APIs like aio_*() and lio_*(), where every last cycle of performance matters, should be implemented using syslets - even if it is quite tricky to write syslets (which they no doubt are - just compare the size of syslet-test.c to threadlet-test.c). So i'd move syslets into the same category as raw syscalls: pieces of the raw infrastructure between the kernel and glibc, not an exposed API to apps. [and even if we keep them in that category they still need quite a bit of API work, to clean up the 32/64-bit issues, etc.]

The size of the async thread pool can be kept in check either from user-space (by starting to queue up requests after a certain point of saturation without submitting them) or from kernel-space which involves waiting (the latter was present in v2 but i temporarily removed it from v3). "You have to wait" is the eventual final answer in every well-behaved queueing system anyway.

How things work out with a large number of outstanding threads in real apps is still an open question (until someone tries it) but i'm cautiously optimistic: in my own (FIO based) measurements syslets beat the native KAIO interfaces both in the cached and in the non-cached [== many threads] case. I did not expect the latter at all: the non-cached syslet codepath is not optimized at all yet, so i expected it to have (much) higher CPU overhead than KAIO. This means that KAIO is in worse shape than i thought - there's just way too much context KAIO has to build up to submit parallel IO contexts. Many years of optimizations went into KAIO already, so it's probably at its outer edge of performance capabilities.

Furthermore, what KAIO has to compete against in the syslet case are the synchronous syscalls turned async, and more than a decade of optimizations went into all the synchronous syscalls.

Plus the 'threading overhead of syslets' really boils down to 'scheduling overhead' in the end - and we can do over a million context-switches a second, per CPU. What killed user-space thread-based AIO performance many moons ago wasnt really the threading concept itself or scheduling overhead, it was the (then) fragile threading implementation of Linux, combined with the resulting signal-based AIO code. Catching and handling a single signal is more expensive than a context-switch - and signals have legacies attached to them that make them hard to scale within the kernel. Plus with syslets the 'threading overhead' is optional, it only happens when it has to.

Plus there's the fundamental killer that KAIO is a /lot/ harder to implement (and to maintain) on the kernel side: it has to be implemented for every IO discipline, and even for the IO disciplines it supports at the moment, it is not truly asynchronous for things like metadata blocking or VFS blocking. To handle things like metadata blocking it has to resort to non-statemachine techniques like retries - which are bad for performance.

Syslets/threadlets on the other hand, once the core is implemented, have near zero ongoing maintainance cost (compared to KAIO pushed into every IO subsystem) and cover all IO disciplines and API variants immediately, and they are as perfectly asynchronous as it gets.

So all in one, i used to think that AIO state-machines have a long-term place within the kernel, but with syslets i think i've proven myself embarrasingly wrong =B-)

Ingo
pthreads?
Would it be possible to use that thing for a pthreads implementation?
What would we win from another fancy toy? Scheduling is not an overhead and is more or less not a show stopper under Linux anyway.
What guarantees do we have that this thing will not die the same ugly death that the futex, of the same Red Hat fame, suffered?
As for me, Linux already does threads well. It would be nice to have something to exploit the multi-core nature of modern CPUs & DSPs (aka Cell) - but it seems to me that the interface of syslets/threadlets isn't a good fit. What threadlets need are some predefined/preallocated input and output memory areas, so that they can be sent to another CPU or SPE to do the actual work. But then we come back to good ol' pipes and a couple of pthreads... So what's the big deal anyway?
AIO!
The point of this research is not to get yet another thread implementation, but to use in-kernel or hybrid threading techniques as an effective means of implementing AIO (Asynchronous I/O, see http://en.wikipedia.org/wiki/Asynchronous_I/O), e.g. for databases which need to read and process huge files, want to take advantage of data already in the cache, and often can process data blocks in arbitrary order.
Read http://kerneltrap.org/node/7728 or http://lwn.net/Articles/219954/, http://lwn.net/Articles/220897/ and http://lwn.net/Articles/221913/ to put it into historical context.
wikipedia
Did you read the page on wikipedia you linked to? It confuses AIO with NBIO (non-blocking I/O), so I doubt it's of much use to anyone.
Ok, care to explain?
Care to explain how asynchronous reads are not non-blocking, and how non-blocking reads are not asynchronous? (Writes are less interesting. Reads are where it matters.)
Reads from... where? All asynchronous operations are non-blocking. Non-blocking reads from the disk for example make no sense. Disk blocks will not magically be paged in somehow. From the network when data is arriving asynchronously anyway, getting it with non-blocking reads could be called "asynchronous", but that's a special case.
Non-blocking reads from a disk do make sense. The amount of time it takes a disk to seek and read a block that's not cached can be significant, even if the disk is not under load. I think you're forgetting the fact that disks are orders of magnitude slower than the processor.
Disks, actually
Disks are enormously slow, but apps might have several different sets of things they need to do. For instance, imagine a complex database query. Later stages of the query depend on data brought in by early stages, with a dependency graph that looks more like a general directed graph than a linear set of queries. In that case, you want to advance down as many paths in parallel as possible, issuing your I/O to the kernel, and letting the scheduler work out the best order to complete the I/O.
In a synchronous model, the application must pick an order in which to serialize the requests, and the OS has much less opportunity to reorder the I/O in the context of all requests (including requests coming from other apps).
It's not quite accurate, but that wiki link helped me get on the right path concerning one problem that I had.
What guarantees we have that
What exactly are you talking about here? AFAIK, futexes are alive and well.
check glibc
The biggest user of futexes was glibc. And many stable versions ago they stopped using them. And only Red Hat had them enabled for everything.
The only association I have with futexes is Red Hat Linux 8's RPM: it hung every time you tried to run it. If you checked which syscall it was hanging on, it was precisely "futex". And it was impossible to ^C it, since futexes are so blazingly fast (as RH people bragged on lkml) that they do not check signals. Nothing but a "nightmare", to be frank. (Especially given that no RH developer on the mailing lists could check that, since RHL8 was already outdated by the time of its release, and internally they used only RawHide, where it all "worked for them perfectly".) (Though that had one positive outcome: moving my ex-employer from RH to SUSE.)
They might be "alive and well", but nobody uses them. At least on my systems (SUSE, Debian & Gentoo) I see no hits for them. (I run "strace" quite often to check how apps are working - especially I/O - and no futex syscalls have been spotted in the last 3 years.)
From what I know, the NPTL implementation of pthreads (POSIX threads) is built on futexes.
From pthreads(7):
Both threading implementations employ the Linux clone(2) system call. In NPTL, thread synchronisation primitives (mutexes, thread joining, etc.) are implemented using the Linux futex(2) system call
The initial design of futexes was bad. The API was wrong. But they've fixed that now, and futexes are going to be the locking primitive of choice for the future.
Ulrich Drepper, the maintainer of glibc, works at Red Hat. So I'm sure he's in the loop about futexes, despite all your doom and gloom.
fibrils vs syslets
To me it looks like Ingo's syslet/threadlet work is not favoured on LKML, but instead everyone likes the fibril approach more.
This shows the deeper weaknesses of the open-source development model: essentially, work is duplicated and one effort will be wasted. Which one? I would really like to see benchmarks decide this instead of the theoretical viewpoints of various 'big names' on LKML. We will see...
I don't see it that way at all.
First of all, never forget Frederick Brooks' dictum, "Plan to throw one away. You will anyhow." Does that mean it's wasted effort? Hardly. For open source, these competing implementations allow some of that process to occur in parallel.
There have been many ideas that have met with poor reception on LKML that later made it into the kernel, and many ideas that were accepted readily and later cut out of the kernel once they proved to be bad ideas. Don't let the fact that ideas get ground up and spat back out distract you from the fact they're making some mighty tasty intellectual sausage.
Mighty tasty intellectual sausage?
OK. Windows NT emulated AIO with kernel-thread pools since the beginning.
And still it performs so embarrassingly badly in the IO department? Intriguing...
The NT kernel is far more scalable than Linux.
Agreed, as we see the NT kernel being used by anything from standard x86 boxes, to... Ehm... Standard x86 boxes.
Linux OTOH only scales from small embedded boards, to... big... mainframes... Erm... Ehm...
Nevermind
"This shows the deeper weaknesses of the open-source development model"
Umm. Now that's a red herring if I ever saw one.
It shows no such thing. It does rather the opposite, in fact. It shows one of the deeper strengths of the open-source development model - that it is based on the scientific principle of peer review of competing ideas.
So. Try again?
I really don't understand how this works, and how it's supposed to let "user-space not have to worry about setting up, sizing and feeding a thread pool".
From what I've understood, a threadlet_exec() is executed in the same thread as the calling function until it stumbles on a blocking syscall. At this point, a new, independent thread is created, so that the calling thread can continue in an asynchronous fashion.
I've run benchmarks (2.6.19 vanilla, Athlon XP 2000+) showing that creating+destroying an NPTL thread takes 16 MILLISECONDS, while feeding an existing thread with a command (using condition variables) takes 3.5 milliseconds, which can probably be cut down further by an order of magnitude if you implement a command queue with a size > 1, or if the receiving thread is idle when you feed it.
So, my questions are two:
(while if I use an existing thread pool it only costs me 3.5ms or probably much less)
I am definitely confused. Any explanation would be really welcome.
(typo)
Sorry, the actual values are 15 microseconds and 2.7 microseconds, not milliseconds.
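The thread creation figure quoted in this comment can be checked with a short measurement. The poster's benchmark is not shown, so this is only a sketch of one way to time it, assuming POSIX threads and clock_gettime(); the function names noop() and avg_create_join_ns() are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Thread body that does nothing, so only create/join cost is measured. */
static void *noop(void *arg)
{
    return arg;
}

/* Returns the average wall-clock time in nanoseconds of one
 * pthread_create() + pthread_join() pair, or -1.0 on error. */
static double avg_create_join_ns(int iterations)
{
    struct timespec start, end;

    if (iterations <= 0)
        return -1.0;
    if (clock_gettime(CLOCK_MONOTONIC, &start) != 0)
        return -1.0;

    for (int i = 0; i < iterations; i++) {
        pthread_t tid;

        if (pthread_create(&tid, NULL, noop, NULL) != 0)
            return -1.0;
        if (pthread_join(tid, NULL) != 0)
            return -1.0;
    }

    if (clock_gettime(CLOCK_MONOTONIC, &end) != 0)
        return -1.0;

    double total_ns = (double)(end.tv_sec - start.tv_sec) * 1e9
                    + (double)(end.tv_nsec - start.tv_nsec);
    return total_ns / iterations;
}
```

Calling avg_create_join_ns(10000) and printing the result gives a per-thread cost that can be compared with the corrected 15 microsecond figure. The number depends heavily on the hardware, kernel and libc in use, so no expected value is given here. On older glibc versions the program may need to be linked with -pthread.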
The right tool for the right job
It saves user space from having to keep a pool of threads around for the purpose of avoiding blocking on system calls. Threads have uses other than hiding system call blocking, and syslets don't serve those purposes.
Is it just me, or does this method for user mode seem wrong?
Let's just say that, as a coder, I am lazy. I don't want to have to alter my source code to take advantage of this.
Would it not make more sense if the compiler picked out the sections that are most likely to gain from syslet conversion, with __attribute__((syslets)) function () {} to force it and __attribute__((nosyslets)) to forbid it? How much of the syslet transfer could be prebuilt and just passed like a different function call? Handling this in the compiler might remove the overhead for some of the uses. It would also allow functions that are only used once not to be assigned a syslet.
Also, my code could be built with and without syslets and still function exactly how I wanted.
And somehow magically detect which functions can be run in parallel and which can't? (race conditions and the like) Also, functions usually return something or have side effects, so when a function has returned, the program assumes that it has finished executing. In your model that's false.