Overview
This patch reworks the handling of POSIX CPU timers, including the
ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
with the help of Roland McGrath, the owner and original writer of this code.
The problem we ran into, and the reason for this rework, has to do with using
a profiling timer in a process with a large number of threads. It appears
that the performance of the old implementation of run_posix_cpu_timers() was
at least O(n*3) (where "n" is the number of threads in a process) or worse.
Everything is fine with an increasing number of threads until the time taken
for that routine to run becomes the same as or greater than the tick time, at
which point things degrade rather quickly.
This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF." And thanks
very much to Roland McGrath, without whom this patch wouldn't exist in this
form.
Code Changes
This rework corrects the implementation of run_posix_cpu_timers() to make it
run in constant time for a particular machine. (Performance may vary between
one machine and another depending upon whether the kernel is built as single-
or multiprocessor and, in the latter case, depending upon the number of
running processors.) To do this, at each tick we now update fields in
signal_struct as well as task_struct. The run_posix_cpu_timers() function
uses those fields to make its decisions.
We define a new structure, "task_cputime," to contain user, system and
scheduler times and use these in appropriate places:
struct task_cputime {
cputime_t utime;
cputime_t stime;
unsigned long long sum_exec_runtime;
};
This is included in the structure "thread_group_cputime," which is a new
substructure of signal_struct and which varies for uniprocessor versus
multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
a simple substructure, while for multiprocessor kernels it is a pointer:
struct thread_group_cputime {
struct task_cputime totals;
};
struct ...[ i've extended the Cc: line accordingly. ] Ingo --
I'll try to read this patch on weekend. A couple of naive questions The patch has a lot of rcu_read_lock(); sig = rcu_dereference(tsk->signal); This is bogus, task_struct->signal is not protected by RCU. However, at first glance the code (this and other funcs) looks correct... Either tsk == current, or the code runs under ->siglock. Or we know that ->signal can't go away (wait_task_zombie). As for this particular function, it seems to me that ->signal == NULL is not possible, no? Please remove the false RCU stuff. Btw, this function has a lot of callers, perhaps it is better to So, the first CLONE_THREAD creates ->cputime.totals. After that thread_group_cputime_account_xxx() start to use it even if the task doesn't have the attached cpu timers. Stupid question: can't we allocate .totals in posix_cpu_timer_create() / Ugh. Probably I misunderstand the patch, but... Let's suppose the task doesn't have cpu timers. Currently, in this case run_posix_cpu_timers() quickly checks UNEXPIRED() and returns. With this patch we call fastpath_timer_check(). The first task_cputime_expired() returns 0, so we are doing thread_group_cputime()->for_each_possible_cpu(). Not good, this code runs every timer tick. Perhaps it makes sense to add a fastpath check. (again, rcu stuff is bogus) Oleg. --
Just to clarify: yes I see, .totals is also used by do_task_stat()/etc, but they can iterate over sub-threads as before. And my apologies if this was already discussed ;) Oleg. --
Roland suggested this approach; I just want to snapshot the field That's not completely clear to me. I'm allowing for the possibility that it might be called during, say, process teardown. It's used in so If that's the consensus I'll do so. I assumed that speed was more That was the original plan but we (that is, Roland and I) decided to eliminate the separate storage for the dead-threads totals. It's now all kept in the totals field, for the whole thread group. Note that this doesn't affect single-threaded processes at all. Multithreaded processes, though, now keep their times in the new field, eliminating the old dead-thread fields from the signal structure, reducing the bloat After looking at the code again, you're right. I paid so much attention to fast-pathing the unexpired timer case that I forgot to handle the _no_-timer case. I've added code to fastpath_timer_check() to handle that case. I'll be addressing the rest of your comments (particularly the RCU stuff) and producing a new patch fairly shortly. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Are you sure inline will be faster? It has a lot of calllers, think about i-cache. And the function call is not that expensive. I see, thanks. Oleg. --
I saw it. I dunno, I'm more of the "belt-and-suspenders" mindset. I'll add a comment, thought, that this is probably a "can't happen" but we Hmm. It just seems to me that making it inline enables the optimizer to do smarter things with the flow of control. The routine isn't all that long, disassembling the posix_cpu_timers_exit_group() routine gives a good view of it and it appears to be around 120 bytes. Certainly less than 200 bytes (all of posix_cpu_timers_exit_group() is only 201 bytes). That doesn't seem like a cache buster but I defer to those who are more You're welcome. I actually like the new method of keeping track of this stuff; it seems cleaner and certainly removes the requirement of walking the thread group to add it up. Should help some microbenchmarks, at least. :-) -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Missed this part... No, no, you don't need an atomic read. Just do sig = tsk->signal; if (sig) use(sig); ->signal must be stable, or the code is buggy anyway. (and the task never changes its ->signal, it becomes NULL when the task dies). BTW, thanks for doing this! Oleg. --
The concern is to make sure that the timer tick path is always safe for a tick that hits during release_task(current) in exit_notify(). A timer tick there can come with current->signal == NULL. We just need to make sure one way or another that this is safe. Thanks, Roland --
run_posix_cpu_timers() must check ->signal != NULL anyway. (The same for other functions like account_group_user_time() which are called by the timer tick). Apart from the timer tick, "current" should not use this function after exit_notify(). And, if tsk != current, ->signal must be pinned, this means it can't be NULL. Perhaps I missed something, but imho this check is confusing and misleading. However, this is just a minor detail even if I am right. Oleg. --
Thanks, Frank! I won't have time to do a thorough fresh review of this until after next week. But from my recollection of the state of your patch the last time we batted it around, I think it was getting close to right. Oleg has a keen eye for this stuff. I'm sure the next patch after incorporating his review will be even closer. Thanks, Roland --
