Linux: Scalable O(1) sys_exit from Ingo

Submitted by nero
on August 19, 2002 - 11:02pm

Ingo Molnar has implemented a replacement for the old sys_exit call, which was an O(N) function. The new version is an O(1) function. With the old function, the overhead would go up as the number of processes and threads (N) went up. O(1) means that the overhead is constant, reguardless of the number of processes. Initial post and a link to the thread follows.


From: Ingo Molnar
To: linux-kernel
Date: 2002-08-19 12:16:01
Subject: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6

this patch is the next step in the journey to get top-notch threading
support implemented under Linux.

every Linux kernel in existence, including 2.5.31, has a fundamentally
unscalable sys_exit() implementation, with an overhead that goes up if the
number of tasks in the system goes up. It does not matter whether those
tasks are doing any work - just sleeping indefinitely causes sys_exit()
overhead to go up.

200 tasks is typical for a relatively idle server system. 1000 tasks or
more is not uncommon during usual server load on a midrange server.
There are servers that use 5000 tasks or more. It is not uncommon for
highly threaded code to use even more threads - client/server models are
often the easiest to implement by using a per-connection thread model.
[Eg. vsftpd, one of the fastest and most secure FTP servers under Linux,
uses 2 (often 3) threads per client connection [which, for security reason
is implemented via inside per-client isolated processes].]

Some numbers to back this up, i've tested 2.5.31-BK-curr on a 525 MHz
PIII, it produces the following lat_proc fork+exit latencies:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./lat_proc fork+exit (microseconds):  743.0  923.1  1153.4 1622.2



it can be seen that the fork()+exit() overhead more than doubles with
every 4000 tasks in the system.

for threaded applications the situation is even worse. A threading
benchmark that just tests the (linear) creation and exit of 100 thousand
threads using the old glibc libpthreads library, gives the following
results:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                            17.8   37.3    61.1   108.0

  latency of single-thread create+exit
  in microseconds:                        178    373     611    1080

using the same test linked against new libpthreads:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                             6.8   25.6    48.7    95.1

  latency of single-thread create+exit
  in microseconds:                         68    256     487     951



the same regression happens as in the old-pthreads case, but with a
(dramatically) lower baseline [which are due to the other optimizations].
With a couple of hundred threads created, the thread create+exit latency
becomes dominated by the hefty sys_exit() overhead.

all in one - sys_exit() is O(nr_tasks), and heavily so - even a reasonable
number of completely idle tasks increase the exit() overhead very visibly.

why is sys_exit() so expensive? The main overhead is in
forget_original_parent(), which must be called for every thread that
exits: the kernel has to find all children the exiting task has created
during its lifetime, and has to 'reparent' them to some other task. The
current implementation uses a for_each_task() over every task in the
system, and finds those tasks whose real parent is the now exiting task.
This is a fundamental work of the kernel that cannot be optimized away -
the child/parent tree must always stay coherent.

but the for_each_task() iteration is excessive. There is a subtle reason
for it though: ptrace reparents debugged tasks to the debugging task,
which detaches the child from the original parent. Thus
forget_original_parent() has to search the whole tasklist, to make sure
even debugged tasks are properly reparented.

the attached patch (against BK-curr) reimplements this logic in a scalable
way: the pthreads code also maintains a global list of debugged tasks -
which list is usually empty in a normal system. It has at most a few tasks
- those one which are debugged currently. This list can be maintained very
cheaply, in a number of strategic places.

forget_original_parent() then searched the exiting task's ->children list,
plus the global ptrace list. In the usual 'task has no children and there
are no debugged tasks around' case this becomes as cheap as two
list_empty() tests!

now on to the performance results, on the same 525 MHz PIII box, lat_proc:

  # of tasks in the system                200     1000     2000     4000
  ----------------------------------------------------------------------
  ./lat_proc fork+exit (microseconds):  657.1    680.6    640.8   682.5

  (unpatched kernel results):          (743.0)  (923.1) (1153.4) (1622.2)



process creation latency is essentially constant, it's independent of the
number of tasks in the system. Even the baseline results got improved by
more than 10%. For the 4000 tasks case the speedup is more than 130%.

the speedup is even bigger for threaded applications using the old
pthreads code:

  # of tasks in the system                200    1000    2000    4000
  --------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                            12.6    12.8    11.9    11.9
  (unpatched results):                   (17.8) (37.3)  (61.1) (108.0)

  latency of single-thread create+exit
  in microseconds:                        126    128     119      119
  (unpatched kernel):                    (178)   (373)   (611)  (1080)

  improvement:                             41%    191%   413%     807%



ie. in the 4000 tasks case the improvement is almost 10-fold!

testing the new glibc libpthreads code shows dramatic improvements:

  # of tasks in the system                200   1000    2000    4000
  ------------------------------------------------------------------
  ./perf -s 100000 -t 1 -r 0 -T
  in seconds:                             1.7    1.9     1.9     1.9
  (unpatched results):                   (6.8) (25.6)  (48.7)  (95.1)

  latency of single-thread create+exit
  in microseconds:                         17     19      19      19
  (unpatched kernel):                     (68)  (256)   (487)   (951)

  improvement:                            300%  1247%   2463%   4900%



in the 200 tasks case the speedup is 4-fold, in the 4000 tasks case the
speedup is 50-fold (!). Thread create+destroy latency is a steady 19
usecs. This also enables the head-to-head comparison of old pthreads and
new libpthreads: new libpthreads is more than 6 times faster. This is what
i'd finally call good threading performance.

the hardest part of the patch was to make sure ptrace() semantics are
still correct. I believe i have managed to at least test the typical
workload: i've tested a complex mix of high-load strace situations,
threaded and unthreaded code as well, SMP and UP, so i'm reasonably
certain that it works for the kind of load i use on my systems. [ But
ptrace() is complex beyond belief, so i might as well have missed some of
the subtler items. ]

- the patch also includes another optimization to make bash's wait4() job
easier as well: wait4() does not have to consider non-debugged
CLONE_DETACHED tasks, it wont get any exit code from them, and those
tasks can clean themselves up.

- the patch also cleans up the includes, and moves the
pthread specific declarations from mm.h to ptrace.h - most of the
existing includes were bogus.

Comments?

Ingo


From: Linus Torvalds (torvalds@transmeta.com)
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6
Newsgroups: linux.kernel
Date: 2002-08-19 10:40:09 PST

Hmm.. This looks good, but I wonder if the real problem isn't really that
our ptrace approach has always been kind of flaky.

Basically, we started with the notion that only parents can trace their
children, so no reparenting was ever needed. Then PTRACE_ATTACH came
along, and we did the reparenting, and I think _that_ is where we made our
big mistake.

We sh ould just have made a separate "tsk->tracer" pointer, instead of
continuing with the perverted "my parent is my tracer" logic. We shouldn't
really re-write the parent/child relationship just because we're being
traced.

I'd be happy to apply this patch (well, your fixed version), but I think
I'd prefer even more to make the whole reparenting go away, and keep the
child list valid all through the lifetime of a process. How painful could
that be?

Linus


From: Ingo Molnar (mingo@elte.hu)
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6
Newsgroups: linux.kernel
Date: 2002-08-19 12:40:06 PST

On Mon, 19 Aug 2002, Dave McCracken wrote:

> In looking at the code I was wondering something. What happens to the
> real parent of a ptraced task when it calls wait4()? If that's its only
> child, won't it return ECHILD?

yes, this is ugly beyond belief.

eg. under bash start up some code that blocks, eg:

# cat

the shell will not display a prompt, it wait4()s for the 'cat' process to
exit.

strace the bash PID - it shows the wait4().

strace 'cat' PID (via strace -p) and keep the strace running - it will
show 'cat' blocked on console input. [some signal like this could happen
to a shell anyway, in a reasonably complex script.]

now do something that breaks bash out of its wait4 - eg. send a 'kill
-SIGCONT BASH_PID' signal - bash returns with a prompt! The 'cat' becomes
a 'background' task - and it gets majorly confused when doing the next
shell command in bash - it displays "cat: -: Input/output error" and goes
zombie.

ptrace is clearly broken - and i tested this with a stock 2.4 kernel.

Ingo


From: Linus Torvalds (torvalds@transmeta.com)
Subject: Re: [patch] O(1) sys_exit(), threading, scalable-exit-2.5.31-A6
Newsgroups: linux.kernel
Date: 2002-08-19 12:20:14 PST

On Mon, 19 Aug 2002, Ingo Molnar wrote:
>
> well, this means that we'd still have to iterate through both lists in
> wait4(), and we'd have to maintain the ptrace list(s) in all the relevant
> codepaths - does this buy us anything relative to -B4?

Ok, you've convinced me. The reparenting is fairly ugly, but it sounds
like other implementations would be fairly equivalent and it would be
mainly an issue of just which list we'd work on.

Linus


Related Links:

Here is the link to the discussion that followed.
Google search on complexity theory

Ingo Molnar .. you rock!

molo
on
August 20, 2002 - 11:42am

Ingo Molnar has made some great performance improvements to the kernel, and I just wanted to say thanks for all the hard work. This kind of improvement is exactly what we need for Linux to continue competing against proprietary implementations.

new glibc pthreads????

flugstadp
on
August 21, 2002 - 7:35am

This is the first I've heard mention of a "new glibc pthreads". Anyone
have any pointers to what's going on with that, what they're changing,
when it'll be introducted, etc Thanks.

re: pthread

darkspecter
on
August 21, 2002 - 8:42am

Probably it's about the upcoming glibc 2.3, iirc they reworked the thread implementation.

new glibc pthreads????

Anonymous
on
August 23, 2002 - 7:46am

new glibc pthreads

Anonymous
on
August 23, 2002 - 9:11am

The optimizations Ingo mentions are mostly TLS and PLT stuff. It is not pth-ng or whatever ibm has been cooking up. It will be out in glibc 2.3.

Check Drepper's homepage:
http://people.redhat.com/drepper/
or specfically his 2.3 todo list.
http://people.redhat.com/drepper/todo-2.3.html

Great news!

Anonymous
on
August 21, 2002 - 5:20pm

This is great news! I did a before/after comparison with some heavily multithreaded code against OpenBSD. KDE applications segfaulted on average 12% faster with this patch, which is on-par with OpenBSD's performance.

Way to Go!

What do you mean by "segfaul

Anonymous
on
August 23, 2002 - 7:57am

What do you mean by "segfaulted" faster?

Sorry for the late reply

Anonymous
on
August 24, 2002 - 7:59am

I was in the middle of writing a reply to your comment when KDE segfaulted. The new branch IS faster. I had to run around and find all the RPM dependencies for Fluxbox but then some of them broke other stuff, so I paid someone to port the other stuff to the newer libs I had to install. I better stop talking and post before Mozilla crashes again. I could just copy the text into clip board in case, but even that is broken. Please ignore any spelling mistakes as I cannot seem to compile my office suite. This free software thing IS great!

Kindly

nero
on
August 24, 2002 - 10:04pm

Would you like, stop posting. Thanks.

re: Kindly

Jeremy
on
August 24, 2002 - 11:10pm

:)

Lies, Lies, ***** lies and statistics

Anonymous
on
August 23, 2002 - 11:50am

96.42% of all quoted statistics are made up on the fly.

This is cool

Anonymous
on
August 23, 2002 - 10:48pm

I'm glad they are cutting down on some of the overhead.

Right now Linux is slow, unstable, and huge.

I hope these changes make into the stable branch ASAP.

re: This is cool

nero
on
August 24, 2002 - 1:52am

Ignoring the FUD, these changes will not get into 2.4 period, they will remain in 2.5, and will be in 2.6 when it is released.

Re: Ignoring the FUD

Anonymous
on
August 24, 2002 - 8:46pm

That is misuse of the word FUD.

Comment viewing options

Select your preferred way to display the comments and click "Save settings" to activate your changes.