This is a resubmission of the posix timer rework patch, posted a few days ago.
This addresses Oleg Nesterov's comments, removing the RCU stuff from the patch
and un-inlining the thread_group_cputime() function for SMP. I left the
function itself as an inline defined in sched.h but the SMP version just calls
thread_group_cputime_smp() which is defined in posix-cpu-timers.c. (The UP
version was a one-liner; I left it as an inline.)
The original README follows (to keep it together with the patch itself):
Overview
This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.
The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.
This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."
Code Changes
This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.
We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:
struct task_cputime {
cputime_t utime;
cputime_t ...On Fri, 12 Sep 2008 09:54:39 -0700 It took a bit of massaging to make it all apply and compile on linux-next, but it looks like I got it landed OK. From: Frank Mayhar <fmayhar@google.com> Overview This patch reworks the handling of POSIX CPU timers, including the ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together with the help of Roland McGrath, the owner and original writer of this code. The problem we ran into, and the reason for this rework, has to do with using a profiling timer in a process with a large number of threads. It appears that the performance of the old implementation of run_posix_cpu_timers() was at least O(n*3) (where "n" is the number of threads in a process) or worse. Everything is fine with an increasing number of threads until the time taken for that routine to run becomes the same as or greater than the tick time, at which point things degrade rather quickly. This patch fixes http://bugzilla.kernel.org/show_bug.cgi?id=9906, "Weird hang with NPTL and SIGPROF." Code Changes This rework corrects the implementation of run_posix_cpu_timers() to make it run in constant time for a particular machine. (Performance may vary between one machine and another depending upon whether the kernel is built as single- or multiprocessor and, in the latter case, depending upon the number of running processors.) To do this, at each tick we now update fields in signal_struct as well as task_struct. The run_posix_cpu_timers() function uses those fields to make its decisions. We define a new structure, "task_cputime," to contain user, system and scheduler times and use these in appropriate places: struct task_cputime { cputime_t utime; cputime_t stime; unsigned long long sum_exec_runtime; }; This is included in the structure "thread_group_cputime," which is a new substructure of signal_struct and which varies for uniprocessor versus multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as a simple ...
i've applied it to tip/timers/posixtimers for more testing, thanks Frank. Ingo --
applied the small build fix below to tip/timers/posixtimers. Ingo ------------> From 430b5294bd72c085c730e1e4b86580f164d976bf Mon Sep 17 00:00:00 2001 From: Ingo Molnar <mingo@elte.hu> Date: Sun, 14 Sep 2008 16:33:01 +0200 Subject: [PATCH] timers: fix itimer/many thread hang, fix MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 8bit fix: kernel/fork.c:843: error: ‘struct signal_struct’ has no member named ‘sum_sched_runtime’ kernel/irq/handle.c:117: warning: ‘sparse_irq_lock’ defined but not used Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/fork.c | 1 - 1 files changed, 0 insertions(+), 1 deletions(-) diff --git a/kernel/fork.c b/kernel/fork.c index a8ac2ef..1181b9a 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -834,7 +834,6 @@ static int copy_signal(unsigned long clone_flags, struct task_struct *tsk) sig->min_flt = sig->maj_flt = sig->cmin_flt = sig->cmaj_flt = 0; sig->inblock = sig->oublock = sig->cinblock = sig->coublock = 0; task_io_accounting_init(&sig->ioac); - sig->sum_sched_runtime = 0; INIT_LIST_HEAD(&sig->cpu_timers[0]); INIT_LIST_HEAD(&sig->cpu_timers[1]); INIT_LIST_HEAD(&sig->cpu_timers[2]); --
the patch wasnt built on UP i guess - see the fix below.
the wider question is, shouldnt the UP case be just the same as the SMP
case? Especially the type assymetry struct thread_group_cputime looks
ugly. (and assymetries like that tend to be a constant source of
breakage like the one below.)
Ingo
-------------->
From 0a8eaa4f9b58759595a1bfe13a1295fdc25ba026 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Sun, 14 Sep 2008 17:03:52 +0200
Subject: [PATCH] timers: fix itimer/many thread hang, fix #2
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
fix the UP build:
In file included from arch/x86/kernel/asm-offsets_32.c:9,
from arch/x86/kernel/asm-offsets.c:3:
include/linux/sched.h: In function ‘thread_group_cputime_clone_thread’:
include/linux/sched.h:2272: warning: no return statement in function returning non-void
include/linux/sched.h: In function ‘thread_group_cputime_account_user’:
include/linux/sched.h:2284: error: invalid type argument of ‘->’ (have ‘struct task_cputime’)
include/linux/sched.h:2284: error: invalid type argument of ‘->’ (have ‘struct task_cputime’)
include/linux/sched.h: In function ‘thread_group_cputime_account_system’:
include/linux/sched.h:2291: error: invalid type argument of ‘->’ (have ‘struct task_cputime’)
include/linux/sched.h:2291: error: invalid type argument of ‘->’ (have ‘struct task_cputime’)
include/linux/sched.h: In function ‘thread_group_cputime_account_exec_runtime’:
include/linux/sched.h:2298: error: invalid type argument of ‘->’ (have ‘struct task_cputime’)
distcc[14501] ERROR: compile arch/x86/kernel/asm-offsets.c on a/30 failed
make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 26d7a5f..ed355f0 100644
--- a/include/linux/sched.h
+++ ...Yeah, I've been focusing so much on getting the SMP case right that I've I'm not overly fond of this one, either; I did it at Roland's suggestion (it's all _his_ fault, yeah, _that's_ the ticket! :-); his opinion IIRC was that the UP case will perform better without the extra pointer dereferences. I agree that it's a potential source of pain such as the one you point out. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
i dont know ... lets try that simplification as a delta patch, ok? Please check the before/after size of 'vmlinux' in a 'make defconfig' [with SMP disabled after make defconfig] UP build. If there's visible size difference then Roland's point holds. Ingo --
I finally finished this. Yeah, there's a visible size difference,
although it's small:
bobble ~/build/linux-2.6>size vmlinux
text data bss dec hex filename
6417840 824160 578416 7820416 775480 vmlinux
bobble ~/build/up-simplify-tree>size vmlinux
text data bss dec hex filename
6417999 824160 578416 7820575 77551f vmlinux
For a difference of 159 bytes. Not big, but nonzero. The delta patch
(relative to a tree containing all suggested changes _except_ this
simplification) follows:
--- ../linux-2.6/include/linux/sched.h 2008-09-16 14:39:26.000000000 -0700
+++ ./include/linux/sched.h 2008-09-17 10:28:33.000000000 -0700
@@ -454,15 +454,9 @@ struct task_cputime {
* This structure contains the version of task_cputime, above, that is
* used for thread group CPU clock calculations.
*/
-#ifdef CONFIG_SMP
struct thread_group_cputime {
struct task_cputime *totals;
};
-#else
-struct thread_group_cputime {
- struct task_cputime totals;
-};
-#endif
/*
* NOTE! "signal_struct" does not have it's own
@@ -2124,10 +2118,8 @@ static inline int spin_needbreak(spinloc
/*
* Thread group CPU time accounting.
*/
-#ifdef CONFIG_SMP
-extern int thread_group_cputime_alloc_smp(struct task_struct *);
-extern void thread_group_cputime_smp(struct task_struct *, struct task_cputime *);
+extern int thread_group_cputime_alloc(struct task_struct *);
static inline void thread_group_cputime_init(struct signal_struct *sig)
{
@@ -2138,9 +2130,13 @@ static inline int thread_group_cputime_c
{
if (curr->signal->cputime.totals)
return 0;
- return thread_group_cputime_alloc_smp(curr);
+ return thread_group_cputime_alloc(curr);
}
+#ifdef CONFIG_SMP
+
+extern void thread_group_cputime_smp(struct task_struct *, struct task_cputime *);
+
static inline void thread_group_cputime_free(struct signal_struct *sig)
{
free_percpu(sig->cputime.totals);
@@ -2161,31 +2157,15 @@ static inline void thread_group_cputime(
...I had been thinking not so much about code size, but about the unnecessary indirection and allocation for the UP case. I'm happy to defer to Ingo on whether to worry about that. Thanks, Roland --
Yeah, I was just responding to his request for a code-size check. I'm okay either way, although making the data structures consistent between the two cases does simplify the code a bit. The next patch is waiting on Ingo's verdict. :-) -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
i'd not worry about those 160 bytes - getting this stuff to work fine is far more important. Details around threading seem to be one of the slowest converging technological details of Linux. why is there _any_ assymetry needed between UP and SMP? These days we just write straight code for SMP, and UP is just a single-CPU special-case of it. _Sometimes_ if it's really worth it we do some UP special cases but it's the exception, not the rule. Ingo --
As far as I know, I still need to handle the SMP per_cpu allocate/free differently from the UP kmalloc. I'll check again, though; if that's no longer the case, I'll fix it. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
This is the second resubmission of the posix timer rework patch, posted a few days ago. This includes the changes from the previous resubmittion, which addressed Oleg Nesterov's comments, removing the RCU stuff from the patch and un-inlining the thread_group_cputime() function for SMP. In addition, per Ingo Molnar it simplifies the UP code, consolidating much of it with the SMP version and depending on lower-level SMP/UP handling to take care of the differences. It also cleans up some UP compile errors, moves the scheduler stats-related macros into kernel/sched_stats.h, cleans up a merge error in kernel/fork.c and has a few other minor fixes and cleanups as suggested by Oleg and Ingo. Thanks for the review, guys. The original README follows (to keep it together with the patch itself): Overview This patch reworks the handling of POSIX CPU timers, including the ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together with the help of Roland McGrath, the owner and original writer of this code. The problem we ran into, and the reason for this rework, has to do with using a profiling timer in a process with a large number of threads. It appears that the performance of the old implementation of run_posix_cpu_timers() was at least O(n*3) (where "n" is the number of threads in a process) or worse. Everything is fine with an increasing number of threads until the time taken for that routine to run becomes the same as or greater than the tick time, at which point things degrade rather quickly. This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF." Code Changes This rework corrects the implementation of run_posix_cpu_timers() to make it run in constant time for a particular machine. (Performance may vary between one machine and another depending upon whether the kernel is built as single- or multiprocessor and, in the latter case, depending upon the number of running processors.) To do this, at each tick we now update fields in signal_struct ...
below is the delta patch i've applied to tip/timers/posixtimers - might be easier to review for those who've seen v1 and who'd like to check the changes. i've also merged v2 into tip/master and started testing it. Ingo ----------------> From bb34d92f643086d546b49cef680f6f305ed84414 Mon Sep 17 00:00:00 2001 From: Frank Mayhar <fmayhar@google.com> Date: Fri, 12 Sep 2008 09:54:39 -0700 Subject: [PATCH] timers: fix itimer/many thread hang, v2 This is the second resubmission of the posix timer rework patch, posted a few days ago. This includes the changes from the previous resubmittion, which addressed Oleg Nesterov's comments, removing the RCU stuff from the patch and un-inlining the thread_group_cputime() function for SMP. In addition, per Ingo Molnar it simplifies the UP code, consolidating much of it with the SMP version and depending on lower-level SMP/UP handling to take care of the differences. It also cleans up some UP compile errors, moves the scheduler stats-related macros into kernel/sched_stats.h, cleans up a merge error in kernel/fork.c and has a few other minor fixes and cleanups as suggested by Oleg and Ingo. Thanks for the review, guys. Signed-off-by: Frank Mayhar <fmayhar@google.com> Cc: Roland McGrath <roland@redhat.com> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/kernel_stat.h | 1 + include/linux/sched.h | 183 +----------------------------------------- kernel/fork.c | 5 +- kernel/posix-cpu-timers.c | 155 +++++++++++++++--------------------- kernel/sched.c | 47 ++---------- kernel/sched_stats.h | 136 ++++++++++++++++++++++++++++++++ 6 files changed, 215 insertions(+), 312 deletions(-) diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index cf9f40a..cac3750 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -52,6 +52,7 @@ ...
-tip testing found a boot crash on a 32-bit x86 test system, and i've bisected it down to this commit. I've attached the .config. The box is hard to debug and the crash happens very early so there's no log - but it was reproducible so you should be able to trigger it by doing this in the tip tree: git checkout 788222bf3d6d7ee08ce308f078c8774eb896c59e [ that is the -tip tree just before the revert commit. ] ... and use the attached .config to build a test kernel and to reproduce the crash. (it doesnt matter if it wont fully boot on your testbox - the crash happen very early) here is how you can access the tree: http://people.redhat.com/mingo/tip.git/README i also found a boot hang on a 64-bit x86 test system - that too is solved with this revert. I've attached that config too. The common theme in both configs is that they are !SMP - and many updates in v2 relate to !SMP details, so i'd suspect a bug in that area. ( and thanks to working with delta patches i was able to fall back to the v1+fixes state in tip/timers/posix-timers, instead of having to drop/revert the whole work. So please update this stuff with delta patches in the future. ) Ingo
Confused. Unless I misread the patch, tgtimes->totals can be NULL on UP too? Do we really need the special version for !CONFIG_SMP ? per_cpu_ptr() just returns the pointer. Yes, get_cpu() disables preemption, but perhaps this is tolerable? (of course, this can't explain the crash) Oleg. --
Yep. I've had about four different irons in the fire over the last few days and I think because of that I ended up giving this less attention than I should have. I'll beat on this some more and reroll the patch in the next day or so. I'll also provide the delta patch as Ingo requested. Sorry, folks. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Yeah. I totally messed up on this, too many things going on at once. I've eliminated the SMP/UP split in sched_stats.h and pulled it all into I'll be investigating this in the morning (I'm at home today and without a test box). -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
A couple of days ago I re-resubmitted the itimer/many thread hang. Unfortunately, I had too many things going on at once and screwed it up, not verifying the UP version and entirely missing a set of cleanups that I had intended to include. This rectifies that screwup. At Ingo's request, this is an incremental patch, relative to the code in his -tip tree. These changes have been minimally tested on both UP and SMP environments. They simplify the UP code, consolidating pretty much all of it with the SMP version and depending on lower-level SMP/UP handling to take care of the differences. It also cleans up some UP compile errors, moves the scheduler stats-related macros into kernel/sched_stats.h, cleans up a merge error in kernel/fork.c and has a few other minor fixes and cleanups as suggested by Oleg and Ingo. Again, thanks for the reviews, guys. I'm dropping the original README; if you want it it's in the archives at least three times. Signed-off-by: Frank Mayhar <fmayhar@google.com> include/linux/kernel_stat.h | 1 + include/linux/sched.h | 182 +------------------------------------------ kernel/fork.c | 5 +- kernel/posix-cpu-timers.c | 155 +++++++++++++++--------------------- kernel/sched.c | 46 ++---------- kernel/sched_stats.h | 86 ++++++++++++++++++++ 6 files changed, 163 insertions(+), 312 deletions(-) diff --git a/include/linux/kernel_stat.h b/include/linux/kernel_stat.h index 21249d8..a97caac 100644 --- a/include/linux/kernel_stat.h +++ b/include/linux/kernel_stat.h @@ -70,6 +70,7 @@ static inline unsigned int kstat_irqs(unsigned int irq) return sum; } +extern unsigned long long task_delta_exec(struct task_struct *); extern void account_user_time(struct task_struct *, cputime_t); extern void account_user_time_scaled(struct task_struct *, cputime_t); extern void account_system_time(struct task_struct *, int, cputime_t); diff --git a/include/linux/sched.h b/include/linux/sched.h index ...
hmmm ... where do we get 'rq' from? in v3 you did this: - rq = task_rq_lock(p, &flags); which removed the deadlock but left us with a random uninitialized rq variable ... the right solution for the bug would have been to unlock it. Miraculously we didnt actually crash anywhere visibly, found it by reviewing the code. I thought this code gets excercised quite frequently. The commit below fixes it. Could you please functionality-test latest tip/master: http://people.redhat.com/mingo/tip.git/README with your testcase that excercises these codepaths heavily? Thanks, Ingo --
You know, I just knew I had a brain around here _somewhere_. Weirdly, this didn't fall over on any of my testing. The gods know what it was actually doing, though. I'm picking up tip/master right now and will test your fix for this shortly. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Okay, I've put it through its paces and everything appears to be working well with Ingo's fix. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
great, thanks for checking! it has not caused any new problems in -tip testing either, and that means a couple of thousand test builds and test boots so far in the past week, so we've got the usual codepaths covered pretty well. With your test of the less common codepaths we should be pretty good in general. Ingo --
a couple of fixlets below, pointed out by scripts/checkpatch.pl.
Ingo
---------->
From 5ce73a4a5a4893a1aa4cdeed1b1a5a6de42c43b6 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Sun, 14 Sep 2008 17:11:46 +0200
Subject: [PATCH] timers: fix itimer/many thread hang, cleanups
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
include/linux/sched.h | 2 +-
kernel/posix-cpu-timers.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ed355f0..7ce8d4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -430,7 +430,7 @@ struct pacct_struct {
* @utime: time spent in user mode, in &cputime_t units
* @stime: time spent in kernel mode, in &cputime_t units
* @sum_exec_runtime: total time spent on the CPU, in nanoseconds
- *
+ *
* This structure groups together three kinds of CPU time that are
* tracked for threads and thread groups. Most things considering
* CPU time want to group these counts together and treat all three
diff --git a/kernel/posix-cpu-timers.c b/kernel/posix-cpu-timers.c
index dba1c33..9a7ea04 100644
--- a/kernel/posix-cpu-timers.c
+++ b/kernel/posix-cpu-timers.c
@@ -94,7 +94,7 @@ void update_rlimit_cpu(unsigned long rlim_new)
cputime = secs_to_cputime(rlim_new);
if (cputime_eq(current->signal->it_prof_expires, cputime_zero) ||
- cputime_lt(current->signal->it_prof_expires, cputime)) {
+ cputime_lt(current->signal->it_prof_expires, cputime)) {
spin_lock_irq(&current->sighand->siglock);
set_process_cpu_timer(current, CPUCLOCK_PROF, &cputime, NULL);
spin_unlock_irq(&current->sighand->siglock);
@@ -1372,9 +1372,9 @@ void run_posix_cpu_timers(struct task_struct *tsk)
* tsk->signal is non-NULL; this probably can't happen but cover the
* possibility anyway.
*/
- if (unlikely(!sig) || !fastpath_timer_check(tsk, sig)) {
+ if (unlikely(!sig) || !fastpath_timer_check(tsk, sig))
return;
- }
+
...please uninline these functions. Ingo --
> please uninline these functions. They are short and have only one caller. Thanks, Roland --
then they should be defined in the .c file which uses them. really, sched.h is large enough already and has lots of unnecessary stuff in it. Ingo --
Well, they are actually used in sched.h, in the inline routines account_group_user_time() and friends, which are in turn used by routines in sched.c. The point here is that the thread_group_cputime_account_xxx() routines are defined differently depending upon whether we're building for UP or SMP and the Agreed, but Roland and I were trying to make life easier for folks who have to maintain SMP and UP versions of this stuff, keeping it all in one place rather than scattering it about. If you insist, I'll un-inline the routines but this will mean moving all of this (thread_group_cputime_account_xxx() and account_group_xxx() routines) to sched.c. Maybe that makes sense. It's your call; I'll hold off on the changes (and resubmitting the patch) while I wait for you to let me know what you prefer. Thanks. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
it's already in tip/sched/posix-cpu-timers, so please send a delta patch against tip/master: http://people.redhat.com/mingo/tip.git/README i'd suggest to move all the scheduler-internal accounting functions into kernel/sched_stats.h. AFAICS there's no scheduler-external user of these APIs. The most widely used external APIs is thread_group_cputime_smp(), and that is out of line already. (except on UP where it's trivial) Ingo --
