Re: [PATCH 2.6.27-rc5] Fix itimer/many thread hang.

Previous thread: [git pull] scheduler fixes by Ingo Molnar on Monday, September 8, 2008 - 1:06 pm. (1 message)

Next thread: [GIT]: Sparc by David Miller on Monday, September 8, 2008 - 1:59 pm. (1 message)
From: Frank Mayhar
Date: Monday, September 8, 2008 - 1:44 pm

Overview

This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling.  It was put together
with the help of Roland McGrath, the owner and original writer of this code.

The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads.  It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.

This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."  And thanks
very much to Roland McGrath, without whom this patch wouldn't exist in this
form.


Code Changes

This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine.  (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.)  To do this, at each tick we now update fields in
signal_struct as well as task_struct.  The run_posix_cpu_timers() function
uses those fields to make its decisions.

We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:

struct task_cputime {
	cputime_t utime;
	cputime_t stime;
	unsigned long long sum_exec_runtime;
};

This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels.  For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:

struct thread_group_cputime {
	struct task_cputime totals;
};

struct ...
From: Ingo Molnar
Date: Tuesday, September 9, 2008 - 12:03 am

[ i've extended the Cc: line accordingly. ]

	Ingo
--

From: Oleg Nesterov
Date: Tuesday, September 9, 2008 - 9:01 am

I'll try to read this patch on weekend. A couple of naive questions

The patch has a lot of

	rcu_read_lock();
	sig = rcu_dereference(tsk->signal);

This is bogus, task_struct->signal is not protected by RCU.

However, at first glance the code (this and other funcs) looks correct...
Either tsk == current, or the code runs under ->siglock. Or we know that
->signal can't go away (wait_task_zombie).

As for this particular function, it seems to me that ->signal == NULL
is not possible, no?

Please remove the false RCU stuff.

Btw, this function has a lot of callers, perhaps it is better to

So, the first CLONE_THREAD creates ->cputime.totals. After that
thread_group_cputime_account_xxx() start to use it even if the task
doesn't have the attached cpu timers.

Stupid question: can't we allocate .totals in posix_cpu_timer_create() /



Ugh. Probably I misunderstand the patch, but...

Let's suppose the task doesn't have cpu timers. Currently, in this case
run_posix_cpu_timers() quickly checks UNEXPIRED() and returns. With this
patch we call fastpath_timer_check(). The first task_cputime_expired()
returns 0, so we are doing thread_group_cputime()->for_each_possible_cpu().

Not good, this code runs every timer tick. Perhaps it makes sense
to add a fastpath check.

(again, rcu stuff is bogus)

Oleg.

--

From: Oleg Nesterov
Date: Tuesday, September 9, 2008 - 9:13 am

Just to clarify: yes I see, .totals is also used by do_task_stat()/etc,
but they can iterate over sub-threads as before.

And my apologies if this was already discussed ;)

Oleg.

--

From: Frank Mayhar
Date: Tuesday, September 9, 2008 - 1:29 pm

Roland suggested this approach; I just want to snapshot the field

That's not completely clear to me.  I'm allowing for the possibility
that it might be called during, say, process teardown.  It's used in so


If that's the consensus I'll do so.  I assumed that speed was more

That was the original plan but we (that is, Roland and I) decided to
eliminate the separate storage for the dead-threads totals.  It's now
all kept in the totals field, for the whole thread group.  Note that
this doesn't affect single-threaded processes at all.  Multithreaded
processes, though, now keep their times in the new field, eliminating
the old dead-thread fields from the signal structure, reducing the bloat

After looking at the code again, you're right.  I paid so much attention
to fast-pathing the unexpired timer case that I forgot to handle the
_no_-timer case.  I've added code to fastpath_timer_check() to handle
that case.

I'll be addressing the rest of your comments (particularly the RCU
stuff) and producing a new patch fairly shortly.
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

--

From: Oleg Nesterov
Date: Wednesday, September 10, 2008 - 5:12 am

Are you sure inline will be faster? It has a lot of calllers, think
about i-cache. And the function call is not that expensive.


I see, thanks.

Oleg.

--

From: Frank Mayhar
Date: Wednesday, September 10, 2008 - 10:50 am

I saw it.  I dunno, I'm more of the "belt-and-suspenders" mindset.  I'll
add a comment, thought, that this is probably a "can't happen" but we

Hmm.  It just seems to me that making it inline enables the optimizer to
do smarter things with the flow of control.  The routine isn't all that
long, disassembling the posix_cpu_timers_exit_group() routine gives a
good view of it and it appears to be around 120 bytes.  Certainly less
than 200 bytes (all of posix_cpu_timers_exit_group() is only 201 bytes).
That doesn't seem like a cache buster but I defer to those who are more

You're welcome.  I actually like the new method of keeping track of this
stuff; it seems cleaner and certainly removes the requirement of walking
the thread group to add it up.  Should help some microbenchmarks, at
least. :-)
-- 
Frank Mayhar <fmayhar@google.com>
Google, Inc.

--

From: Oleg Nesterov
Date: Wednesday, September 10, 2008 - 9:32 am

Missed this part...

No, no, you don't need an atomic read. Just do

	sig = tsk->signal;
	if (sig)
		use(sig);

->signal must be stable, or the code is buggy anyway. (and the task
never changes its ->signal, it becomes NULL when the task dies).


BTW, thanks for doing this!

Oleg.

--

From: Roland McGrath
Date: Tuesday, September 9, 2008 - 9:04 pm

The concern is to make sure that the timer tick path is always safe for a
tick that hits during release_task(current) in exit_notify().  A timer tick
there can come with current->signal == NULL.  We just need to make sure one
way or another that this is safe.


Thanks,
Roland
--

From: Oleg Nesterov
Date: Wednesday, September 10, 2008 - 4:44 am

run_posix_cpu_timers() must check ->signal != NULL anyway. (The same for
other functions like account_group_user_time() which are called by the
timer tick).

Apart from the timer tick, "current" should not use this function after
exit_notify().

And, if tsk != current, ->signal must be pinned, this means it can't be NULL.


Perhaps I missed something, but imho this check is confusing and misleading.
However, this is just a minor detail even if I am right.

Oleg.

--

From: Roland McGrath
Date: Tuesday, September 9, 2008 - 8:59 pm

Thanks, Frank!  I won't have time to do a thorough fresh review of this
until after next week.  But from my recollection of the state of your patch
the last time we batted it around, I think it was getting close to right.
Oleg has a keen eye for this stuff.  I'm sure the next patch after
incorporating his review will be even closer.


Thanks,
Roland
--

Previous thread: [git pull] scheduler fixes by Ingo Molnar on Monday, September 8, 2008 - 1:06 pm. (1 message)

Next thread: [GIT]: Sparc by David Miller on Monday, September 8, 2008 - 1:59 pm. (1 message)