Ingo Molnar has implemented a replacement for the old sys_exit call, which was an O(N) function. The new version is an O(1) function. With the old function, the overhead would go up as the number of processes and threads (N) went up. O(1) means that the overhead is constant, reguardless of the number of processes. Initial post and a link to the thread follows.
this patch is the next step in the journey to get top-notch threading
support implemented under Linux.
every Linux kernel in existence, including 2.5.31, has a fundamentally
unscalable sys_exit() implementation, with an overhead that goes up if the
number of tasks in the system goes up. It does not matter whether those
tasks are doing any work - just sleeping indefinitely causes sys_exit()
overhead to go up.
200 tasks is typical for a relatively idle server system. 1000 tasks or
more is not uncommon during usual server load on a midrange server.
There are servers that use 5000 tasks or more. It is not uncommon for
highly threaded code to use even more threads - client/server models are
often the easiest to implement by using a per-connection thread model.
[Eg. vsftpd, one of the fastest and most secure FTP servers under Linux,
uses 2 (often 3) threads per client connection [which, for security reason
is implemented via inside per-client isolated processes].]
Some numbers to back this up, i've tested 2.5.31-BK-curr on a 525 MHz
PIII, it produces the following lat_proc fork+exit latencies:
# of tasks in the system 200 1000 2000 4000 ------------------------------------------------------------------ ./lat_proc fork+exit (microseconds): 743.0 923.1 1153.4 1622.2
it can be seen that the fork()+exit() overhead more than doubles with
every 4000 tasks in the system.
for threaded applications the situation is even worse. A threading
benchmark that just tests the (linear) creation and exit of 100 thousand
threads using the old glibc libpthreads library, gives the following
results:
# of tasks in the system 200 1000 2000 4000 ------------------------------------------------------------------ ./perf -s 100000 -t 1 -r 0 -T in seconds: 17.8 37.3 61.1 108.0 latency of single-thread create+exit in microseconds: 178 373 611 1080 using the same test linked against new libpthreads: # of tasks in the system 200 1000 2000 4000 ------------------------------------------------------------------ ./perf -s 100000 -t 1 -r 0 -T in seconds: 6.8 25.6 48.7 95.1 latency of single-thread create+exit in microseconds: 68 256 487 951
the same regression happens as in the old-pthreads case, but with a
(dramatically) lower baseline [which are due to the other optimizations].
With a couple of hundred threads created, the thread create+exit latency
becomes dominated by the hefty sys_exit() overhead.
all in one - sys_exit() is O(nr_tasks), and heavily so - even a reasonable
number of completely idle tasks increase the exit() overhead very visibly.
why is sys_exit() so expensive? The main overhead is in
forget_original_parent(), which must be called for every thread that
exits: the kernel has to find all children the exiting task has created
during its lifetime, and has to 'reparent' them to some other task. The
current implementation uses a for_each_task() over every task in the
system, and finds those tasks whose real parent is the now exiting task.
This is a fundamental work of the kernel that cannot be optimized away -
the child/parent tree must always stay coherent.
but the for_each_task() iteration is excessive. There is a subtle reason
for it though: ptrace reparents debugged tasks to the debugging task,
which detaches the child from the original parent. Thus
forget_original_parent() has to search the whole tasklist, to make sure
even debugged tasks are properly reparented.
the attached patch (against BK-curr) reimplements this logic in a scalable
way: the pthreads code also maintains a global list of debugged tasks -
which list is usually empty in a normal system. It has at most a few tasks
- those one which are debugged currently. This list can be maintained very
cheaply, in a number of strategic places.
forget_original_parent() then searched the exiting task's ->children list,
plus the global ptrace list. In the usual 'task has no children and there
are no debugged tasks around' case this becomes as cheap as two
list_empty() tests!
now on to the performance results, on the same 525 MHz PIII box, lat_proc:
# of tasks in the system 200 1000 2000 4000 ---------------------------------------------------------------------- ./lat_proc fork+exit (microseconds): 657.1 680.6 640.8 682.5 (unpatched kernel results): (743.0) (923.1) (1153.4) (1622.2)
process creation latency is essentially constant, it's independent of the
number of tasks in the system. Even the baseline results got improved by
more than 10%. For the 4000 tasks case the speedup is more than 130%.
the speedup is even bigger for threaded applications using the old
pthreads code:
# of tasks in the system 200 1000 2000 4000 -------------------------------------------------------------------- ./perf -s 100000 -t 1 -r 0 -T in seconds: 12.6 12.8 11.9 11.9 (unpatched results): (17.8) (37.3) (61.1) (108.0) latency of single-thread create+exit in microseconds: 126 128 119 119 (unpatched kernel): (178) (373) (611) (1080) improvement: 41% 191% 413% 807%
ie. in the 4000 tasks case the improvement is almost 10-fold!
testing the new glibc libpthreads code shows dramatic improvements:
# of tasks in the system 200 1000 2000 4000 ------------------------------------------------------------------ ./perf -s 100000 -t 1 -r 0 -T in seconds: 1.7 1.9 1.9 1.9 (unpatched results): (6.8) (25.6) (48.7) (95.1) latency of single-thread create+exit in microseconds: 17 19 19 19 (unpatched kernel): (68) (256) (487) (951) improvement: 300% 1247% 2463% 4900%
in the 200 tasks case the speedup is 4-fold, in the 4000 tasks case the
speedup is 50-fold (!). Thread create+destroy latency is a steady 19
usecs. This also enables the head-to-head comparison of old pthreads and
new libpthreads: new libpthreads is more than 6 times faster. This is what
i'd finally call good threading performance.
the hardest part of the patch was to make sure ptrace() semantics are
still correct. I believe i have managed to at least test the typical
workload: i've tested a complex mix of high-load strace situations,
threaded and unthreaded code as well, SMP and UP, so i'm reasonably
certain that it works for the kind of load i use on my systems. [ But
ptrace() is complex beyond belief, so i might as well have missed some of
the subtler items. ]
- the patch also includes another optimization to make bash's wait4() job
easier as well: wait4() does not have to consider non-debugged
CLONE_DETACHED tasks, it wont get any exit code from them, and those
tasks can clean themselves up.
- the patch also cleans up the includes, and moves the
pthread specific declarations from mm.h to ptrace.h - most of the
existing includes were bogus.
Comments?
Ingo
From: Linus Torvalds (torvalds@transmeta.com)
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6
Newsgroups: linux.kernel
Date: 2002-08-19 10:40:09 PST
Hmm.. This looks good, but I wonder if the real problem isn't really that
our ptrace approach has always been kind of flaky.
Basically, we started with the notion that only parents can trace their
children, so no reparenting was ever needed. Then PTRACE_ATTACH came
along, and we did the reparenting, and I think _that_ is where we made our
big mistake.
We sh ould just have made a separate "tsk->tracer" pointer, instead of
continuing with the perverted "my parent is my tracer" logic. We shouldn't
really re-write the parent/child relationship just because we're being
traced.
I'd be happy to apply this patch (well, your fixed version), but I think
I'd prefer even more to make the whole reparenting go away, and keep the
child list valid all through the lifetime of a process. How painful could
that be?
Linus
From: Ingo Molnar (mingo@elte.hu)
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6
Newsgroups: linux.kernel
Date: 2002-08-19 12:40:06 PST
On Mon, 19 Aug 2002, Dave McCracken wrote:
> In looking at the code I was wondering something. What happens to the
> real parent of a ptraced task when it calls wait4()? If that's its only
> child, won't it return ECHILD?
yes, this is ugly beyond belief.
eg. under bash start up some code that blocks, eg:
# cat
the shell will not display a prompt, it wait4()s for the 'cat' process to
exit.
strace the bash PID - it shows the wait4().
strace 'cat' PID (via strace -p) and keep the strace running - it will
show 'cat' blocked on console input. [some signal like this could happen
to a shell anyway, in a reasonably complex script.]
now do something that breaks bash out of its wait4 - eg. send a 'kill
-SIGCONT BASH_PID' signal - bash returns with a prompt! The 'cat' becomes
a 'background' task - and it gets majorly confused when doing the next
shell command in bash - it displays "cat: -: Input/output error" and goes
zombie.
ptrace is clearly broken - and i tested this with a stock 2.4 kernel.
Ingo
On Mon, 19 Aug 2002, Ingo Molnar wrote:
>
> well, this means that we'd still have to iterate through both lists in
> wait4(), and we'd have to maintain the ptrace list(s) in all the relevant
> codepaths - does this buy us anything relative to -B4?
Ok, you've convinced me. The reparenting is fairly ugly, but it sounds
like other implementations would be fairly equivalent and it would be
mainly an issue of just which list we'd work on.
Linus
Here is the link to the discussion that followed.
Google search on complexity theory
Ingo Molnar .. you rock!
Ingo Molnar has made some great performance improvements to the kernel, and I just wanted to say thanks for all the hard work. This kind of improvement is exactly what we need for Linux to continue competing against proprietary implementations.
new glibc pthreads????
This is the first I've heard mention of a "new glibc pthreads". Anyone
have any pointers to what's going on with that, what they're changing,
when it'll be introducted, etc Thanks.
re: pthread
Probably it's about the upcoming glibc 2.3, iirc they reworked the thread implementation.
new glibc pthreads????
See http://www-124.ibm.com/developerworks/oss/pthreads/ for more details.
Radek
new glibc pthreads
The optimizations Ingo mentions are mostly TLS and PLT stuff. It is not pth-ng or whatever ibm has been cooking up. It will be out in glibc 2.3.
Check Drepper's homepage:
http://people.redhat.com/drepper/
or specfically his 2.3 todo list.
http://people.redhat.com/drepper/todo-2.3.html
Great news!
This is great news! I did a before/after comparison with some heavily multithreaded code against OpenBSD. KDE applications segfaulted on average 12% faster with this patch, which is on-par with OpenBSD's performance.
Way to Go!
What do you mean by "segfaul
What do you mean by "segfaulted" faster?
Sorry for the late reply
I was in the middle of writing a reply to your comment when KDE segfaulted. The new branch IS faster. I had to run around and find all the RPM dependencies for Fluxbox but then some of them broke other stuff, so I paid someone to port the other stuff to the newer libs I had to install. I better stop talking and post before Mozilla crashes again. I could just copy the text into clip board in case, but even that is broken. Please ignore any spelling mistakes as I cannot seem to compile my office suite. This free software thing IS great!
Kindly
Would you like, stop posting. Thanks.
re: Kindly
:)
Lies, Lies, ***** lies and statistics
96.42% of all quoted statistics are made up on the fly.
This is cool
I'm glad they are cutting down on some of the overhead.
Right now Linux is slow, unstable, and huge.
I hope these changes make into the stable branch ASAP.
re: This is cool
Ignoring the FUD, these changes will not get into 2.4 period, they will remain in 2.5, and will be in 2.6 when it is released.
Re: Ignoring the FUD
That is misuse of the word FUD.