On Wed, 6 Feb 2008 16:33:20 -0800 (PST) It's probably better to handle this one via email, so please send that testcase vie reply-to-all to this email, thanks. --
Testcase attached.
Build with
gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
gcc -lpthread -o hangc-2 hangc-2.o
Run with
hangc-2 4500 4500
--
Frank Mayhar <fmayhar@google.com>
Google, Inc.FWIW this is not reproducible on 2.6.24/x86/CentOS-51. (I tried running it nearly 1500 times in a loop.) Assuming those many tries are sufficient to reproduce this bug, there seems to be something specific to the environment/architecture/configuration that's necessary to trigger it. It might be helpful to provide full details like glibc version, compiler version, .config etc. Parag --
El Wed, 6 Feb 2008 21:57:38 -0500 (EST) I can reproduce it on my Ubuntu 7.10 kernel 2.6.24 Note i use CC=gcc-4.2 gcc --version gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) If some fields are empty or look unusual you may have an old version. Compare to the current minimal requirements in Documentation/Changes. Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux Gnu C 4.1.3 Gnu make 3.81 binutils 2.18 util-linux 2.13 mount 2.13 module-init-tools 3.3-pre2 e2fsprogs 1.40.2 jfsutils 1.1.11 reiserfsprogs 3.6.19 pcmciautils 014 PPP 2.4.4 Linux C Library 2.6.1 Dynamic linker (ldd) 2.6.1 Procps 3.2.7 Net-tools 1.60 Kbd [opcion...][archivo Console-tools 0.2.3 Sh-utils 5.97 udev 113 wireless-tools 29 Modules Loaded af_packet binfmt_misc rfcomm l2cap bluetooth ipv6 powernow_k8 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative nf_conntrack_ftp nf_conntrack_irc xt_tcpudp ipt_ULOG xt_limit xt_state iptable_filter nf_conntrack_ipv4 nf_conntrack ip_tables x_tables kvm_amd kvm w83627ehf hwmon_vid lp snd_hda_intel arc4 ecb blkcipher cryptomgr crypto_algapi snd_pcm_oss snd_mixer_oss snd_pcm snd_mpu401 snd_mpu401_uart snd_seq_dummy rt2500pci rt2x00pci rt2x00lib snd_seq_oss rfkill snd_seq_midi input_polldev snd_rawmidi crc_itu_t snd_seq_midi_event snd_seq mac80211 usbhid snd_timer snd_seq_device cfg80211 usblp ff_memless snd eeprom_93cx6 nvidia i2c_ali1535 i2c_ali15x3 evdev snd_page_alloc sr_mod cdrom button soundcore uli526x 8250_pnp 8250 serial_core i2c_core k8temp hwmon parport_pc parport pata_acpi pcspkr rtc floppy sg ata_generic ehci_hcd r8169 ohci_hcd usbcore unix thermal processor fan fuse # # Automatically generated make config: don't edit # Lin...
On Thu, 7 Feb 2008, Alejandro Riveira Fern
Several, among which were i686+SMP+GnuC-4.0.3+Glibc-2.3.6. No PREEMPT. Linux 2.6.18, 2.6.21 and 2.6.24-rc4. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
That should of course be x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible. --
El Thu, 7 Feb 2008 10:56:16 -0500 (EST) From my previous mail Note that i use CC=gcc-4.2 $gcc-4.2 --version gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4) --
On Thu, 7 Feb 2008, Alejandro Riveira Fern
Not reproducible with PREEMPT either. Parag --
Okay, here's an analysis of the problem and a potential solution. I mentioned this in the bug itself but I'll repeat it here: A couple of us here have been investigating this thing and have concluded that the problem lies in the implementation of run_posix_cpu_timers() and specifically in the quadratic nature of the implementation. It calls check_process_timers() to sum the utime/stime/sched_time (in 2.6.18.5, under another name in 2.6.24+) of all threads in the thread group. This means that runtime there grows with the number of threads. It can go through the list _again_ if and when it decides to rebalance expiry times. After thinking through it, it seems clear that the critical number of threads is that in which run_posix_cpu_timers() takes as long as or longer than a tick to get its work done. The system makes progress to that point but after that everything goes to hell as it gets further and further behind. This explains all the symptoms we've seen, including seeing run_posix_cpu_timers() at the top of a bunch of profiling stats (I saw it get more than a third of overall processing time on a bunch of tests, even where the system _didn't_ hang!). It explains the fact that things get slow right before they go to hell and it explains why under certain conditions the system can recover (if the threads have started exiting by the time it hangs, for example). I've come up with a potential fix for the problem. It does two things. First, rather than summing the utime/stime/sched_time at interrupt it adds all of those times to a new task_struct field on the group leader then at interrupt just consults those fields; this avoids repeatedly blowing the cache as well as a loop across all the threads. Second, if there are more than 1000 threads in the process (as noted in task->signal->live), it just punts all of the processing to a workqueue. With these changes I've gone from a hang at 4500 (or fewer) threads to running out of resources at more than 32000 threads on...
Thanks for the detailed explanation and for bringing this to my attention. This is a problem we knew about when I first implemented posix-cpu-timers and process-wide SIGPROF/SIGVTALRM. I'm a little surprised it took this long to become a problem in practice. I originally expected to have to revisit it sooner than this, but I certainly haven't thought about it for quite some time. I'd guess that HZ=1000 becoming common is what did it. The obvious implementation for the process-wide clocks is to have the tick interrupt increment shared utime/stime/sched_time fields in signal_struct as well as the private task_struct fields. The all-threads totals accumulate in the signal_struct fields, which would be atomic_t. It's then trivial for the timer expiry checks to compare against those totals. The concern I had about this was multiple CPUs competing for the signal_struct fields. (That is, several CPUs all running threads in the same process.) If the ticks on each CPU are even close to synchronized, then every single time all those CPUs will do an atomic_add on the same word. I'm not any kind of expert on SMP and cache effects, but I know this is bad. However bad it is, it's that bad all the time and however few threads (down to 2) it's that bad for that many CPUs. The implementation we have instead is obviously dismal for large numbers of threads. I always figured we'd replace that with something based on more sophisticated thinking about the CPU-clash issue. I don't entirely follow your description of your patch. It sounds like it should be two patches, though. The second of those patches (workqueue) sounds like it could be an appropriate generic cleanup, or like it could be a complication that might be unnecessary if we get a really good solution to main issue. The first patch I'm not sure whether I understand what you said or not. Can you elaborate? Or just post the unfinished patch as illustration, marking it as not for submission until you've finished. Th...
Put this on the patch but I'm emailing it as well. Well, the iron is getting bigger, too, so it's beginning to be feasible My first patch did essentially what you outlined above, incrementing shared utime/stime/sched_time fields, except that they were in the task_struct of the group leader rather than in the signal_struct. It's not clear to me exactly how the signal_struct is shared, whether it is shared among all threads or if each has its own version. So each timer routine had something like: /* If we're part of a thread group, add our time to the leader. */ if (p->group_leader != NULL) p->group_leader->threads_sched_time += tmp; and check_process_timers() had /* Times for the whole thread group are held by the group leader. */ utime = cputime_add(utime, tsk->group_leader->threads_utime); stime = cputime_add(stime, tsk->group_leader->threads_stime); sched_time += tsk->group_leader->threads_sched_time; Of course, this alone is insufficient. It speeds things up a tiny bit but not nearly enough. The other issue has to do with the rest of the processing in run_posix_cpu_timers(), walking the timer lists and walking the whole thread group (again) to rebalance expiry times. My second patch moved all that work to a workqueue, but only if there were more than 100 threads in the process. This basically papered over the problem by moving the processing out of interrupt and into a kernel thread. It's still insufficient, though, because it takes just as long and will get backed up just as badly on large numbers of threads. This was made clear in a test I ran yesterday where I generated some 200,000 threads. The work queue was unreasonably large, as you might expect. I am looking for a way to do everything that needs to be done in fewer operations, but unfortunately I'm not familiar enough with the SIGPROF/SIGVTALRM semantics or with the details of the Linux implementation to know where it is safe to consolidate things. -- Frank...
There is a 1:1 correspondence between "shares signal_struct" and "member of
same thread group". signal_struct is the right place for such new fields.
Don't be confused by the existing fields utime, stime, gtime, and
sum_sched_runtime. All of those are accumulators only touched when a
non-leader thread dies (in __exit_signal), and governed by the siglock.
Their only purpose now is to represent the threads that are dead and gone
when calculating the cumulative total for the whole thread group. If you
were to provide cumulative totals that are updated on every tick, then
The task_struct.group_leader field is never NULL. Every thread is a member
of some thread group. The degenerate case is that it's the only member of
It sounds like you sped up only one of the sampling loops. Having a
cumulative total already on hand means cpu_clock_sample_group can also
become simple and cheap, as can the analogues in do_getitimer and
k_getrusage. These are what's used in clock_gettime and in the timer
manipulation calls, and in getitimer and getrusage. That's all just gravy.
The real benefit of having a cumulative total is for the basic logic of
run_posix_cpu_timers (check_process_timers) and the timer expiry setup.
It sounds like you didn't take advantage of the new fields for that.
When a cumulative total is on hand in the tick handler, then there is no
need at all to derive per-thread expiry times from group-wide CPU timers
("rebalance") either there or when arming the timer in the first place.
All of that complexity can just disappear from the implementation.
check_process_timers can look just like check_thread_timers, but
consulting the shared fields instead of the per-thread ones for both the
clock accumulators and the timers' expiry times. Likewise, arm_timer
only has to set signal->it_*_expires; process_timer_rebalance goes away.
If you do all that then the time spent in run_posix_cpu_timers should
not be affected at all by the number of threads. The only "walking the
...Based on Roland's comments and from reading the source, I have a possible fix. I'm posting the attached patch _not_ for submission but _only_ for comment. For one thing it's based on 2.6.18.5 and for another it hasn't had much testing yet. I wanted to get it out here for comment, though, in case anyone can see where I might have gone wrong. Comments, criticism and (especially!) testing enthusiastically Replaces the utime, stime and sched_time fields in signal_struct with shared_utime, shared_stime and shared_schedtime, respectively. It also adds it_sched_expires to the signal struct. Each place that loops through all threads in a thread group to sum task->utime and/or task->stime now loads the value from task->signal->shared_[us]time. This includes compat_sys_times(), do_task_stat(), do_getitimer(), sys_times() and k_getrusage(). Certain routines that used task->signal->[us]time now use the shared fields instead, which may change their semantics slightly. These include fill_prstatus() (in fs/binfmt_elf.c), do_task_stat() (in fs/proc/array.c), wait_task_zombie() and do_notify_parent(). The shared fields are updated at each tick, in update_cpu_clock() (shared_schedtime), account_user_time() (shared_utime) and account_system_time() (shared_stime). Each of these functions updates the task-private field followed by the shared version in the signal structure if one is present. Note that if different threads of the same process are being run by different CPUs at the tick, there may be serious cache contention here. Finally, kernel/posix-cpu-timers.c has changed quite dramatically. First, run_posix_cpu_timers() decides whether a timer has expired by consulting the it_*_expires and shared_* fields in the signal struct. The check_process_timers() routine bases its computations on the new shared fields, removing two loops through the threads. "Rebalancing" is no longer required, the process_timer_rebalance() routine as disappeared en...
The previous email was missing one small part of the patch, reproduced
below. Remain calm.
----------------------------------BEGIN------------------------------------
diff -urp /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c linux-2.6.18.5/kernel/compat.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c 2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/compat.c 2008-03-06 17:26:21.000000000 -0800
@@ -161,18 +161,11 @@ asmlinkage long compat_sys_times(struct
if (tbuf) {
struct compat_tms tmp;
struct task_struct *tsk = current;
- struct task_struct *t;
cputime_t utime, stime, cutime, cstime;
read_lock(&tasklist_lock);
- utime = tsk->signal->utime;
- stime = tsk->signal->stime;
- t = tsk;
- do {
- utime = cputime_add(utime, t->utime);
- stime = cputime_add(stime, t->stime);
- t = next_thread(t);
- } while (t != tsk);
+ utime = tsk->signal->shared_utime;
+ stime = tsk->signal->shared_stime;
/*
* While we have tasklist_lock read-locked, no dying thread
-----------------------------------END-------------------------------------
--
Frank Mayhar <fmayhar@google.com>
Google, Inc.
--Okay, my understanding of this is still evolving, so please (please!)
correct me when I get it wrong. I take what you're saying to mean that,
first, run_posix_cpu_timers() only needs to be run once per thread
group. It _sounds_ like it should be checking the shared fields rather
than the per-task fields for timer expiration (in fact, the more I think
about it the more sure I am that that's the case).
The old process_timer_rebalance() routine was intended to distribute the
remaining ticks across all the threads, so that the per-task fields
would cause run_posix_cpu_timers() to run at the appropriate time. With
it checking the shared fields this becomes no longer necessary.
Since the shared fields are getting all the ticks, this will work for
per-thread timers as well.
The arm_timers() routine, instead of calling posix_timer_rebalance(),
should just directly set signal->it_*_expires to the expiration time,
e.g.:
switch (CPUCLOCK_WHICH(timer->it_clock)) {
default:
BUG();
case CPUCLOCK_VIRT:
if (!cputime_eq(p->signal->it_virt_expires,
cputime_zero) &&
cputime_lt(p->signal->it_virt_expires,
timer->it.cpu.expires.cpu))
break;
p->signal->it_virt_expires = timer->it.cpu.expires.cpu;
goto rebalance;
case CPUCLOCK_PROF:
if (!cputime_eq(p->signal->it_prof_expires,
cputime_zero) &&
cputime_lt(p->signal->it_prof_expires,
timer->it.cpu.expires.cpu))
break;
i = p->signal->rlim[RLIMIT_CPU].rlim_cur;
if (i != RLIM_INFINITY &&
i <= cputime_to_secs(timer->it.cpu.expires.cpu))
break;
p->signal->it_prof_expires = timer->it.cpu.expires.cpu;
goto rebalance;
case CPUCLOCK_SCHED:
p->signal->it_sched_expires = timer->it.cpu.expires.sched;
break;
It's still probably worthwhile to defer processing to a workqueue
thread, though, just because it's still a lot ...[I changed the subject and trimmed the CC list, as this is now quite far away from the "some mysterious NPTL problem" subject. If anyone else Not quite. check_process_timers only needs to be run once per thread run_posix_cpu_timers does two things: thread CPU timers and process CPU timers. The thread CPU timers track the thread CPU clocks, which are what the per-thread fields in task_struct count. check_thread_timers finds what thread CPU timers have fired. The task_struct.it_*_expires fields are set when there are thread CPU timers set on those clocks. The process CPU timers track the process CPU clocks. Each process CPU clock (virt, prof, sched) is just the sum of the corresponding thread CPU clock across all threads in the group. In the original code, these clocks are never maintained in any storage as such, but sampled by summing all the thread clocks when a current value is needed. check_process_timers finds what process CPU timers have fired. The signal_struct.it_*_expires fields are set when there are process CPU timers set on those clocks. The "rebalance" stuff also sets the task_struct.it_*_expires fields of all the threads in the group when there are process CPU timers. So each of these fields is really the minimum of the expiration time of the earliest thread CPU timer and the "balanced" sample-and-update time I do not follow your logic at all here. The signal_struct fields being proposed track each process CPU clock's value. The thread CPU timers I think the natural place for it is run_timer_softirq. This is where natural time timers run. The posix-cpu-timers processing is analogous to the posix-timers processing, which runs via timer callbacks from here. That roughly means doing it right after the interrupt normally, and falling back to something similar to a workqueue when the load is too heavy. In the common case, it doesn't entail any context switches, as workqueues always do. The issue with either the workqueue or the softirq approach ...
Where "interesting tick" means "tick in which a process timer has And my changes introduce these clocks as separate fields in the signal Okay, I hadn't been clear on the distinction between process-wide and thread-only timers. So, really, run_posix_cpu_timers() needs to check both sets, the versions in the signal struct for the process-wide timers I'm going to table this for now. Based on my preliminary performance results, the changes I've made mean that using a workqueue or softirq is not necessary. In the profiles of a couple of testcases I've run, run_posix_cpu_timers() didn't show up at all, whereas before my change it was right at the top with ~35% of the time. So, check_process_timers() checks for and handles any expired timers for the currently-running process, whereas check_thread_timers() checks for and handles any expired timers for the currently-running thread. Is that correct? And, since these timers are only counting CPU time, if a thread is never running at the tick (since that's how we account time in the first place) any timers it might have will never expire. Sorry to repeat the obvious but sometimes it's better to state things very explicitly. At each tick a process-wide timer may have expired. Also, at each tick a thread-only timer may have expired. Or, of course, both. So we need to detect both events and fire the appropriate timer in the appropriate context. I think my current code (i.e. the patch I published for comment a few days ago) does this, with one exception: If a thread-only timer expires it _won't_ be detected when run_posix_cpu_timers() runs, since I'm only checking the process-wide timers. This implies that I need to do twice I've actually gotten a little bit of insight into this. I don't think that a straight set of shared fields is sufficient except in UP (and possibly dual-CPU) environments. I was able to run a reasonable test on both a four-core Opteron system and a sixteen-core Opteron system. (That's two dual-cor...
I'm not surprised by this result. (I do want to see much more detailed That is exactly what I had in mind. (I hadn't noticed alloc_percpu, and it has one more level of indirection than I'd planned. But that wastes less space when num_possible_cpus() is far greater than num_online_cpus(), and I imagine it's vastly superior for NUMA.) Don't forget do_[gs]etitimer and k_getrusage can use this too. (Though maybe no reason to bother in k_getrusage since it has I tend to agree. It's the only plan I've thought through in detail. But my remarks stand, about thorough analysis of performance impacts of options we can think of. Thanks, Roland --
After the recent conversation with Roland and after more testing, I have another patch for review (although _not_ for submission, as again it's against 2.6.18.5). This patch breaks the shared utime/stime/sched_time fields out into their own structure which is allocated as needed via alloc_percpu(). This avoids cache thrashing when running lots of threads on lots of CPUs. Please take a look and let me know what you think. In the meantime I'll be working on a similar patch to 2.6-head that has optimizations for uniprocessor and two-CPU operation, to avoid the overhead of the percpu functions when they are unneeded. This patch: Replaces the utime, stime and sched_time fields in signal_struct with the shared_times structure, which is cacheline aligned and allocated when needed using the alloc_percpu() mechanism. There is one copy of this structure per running CPU when it is being used. Each place that loops through all threads in a thread group to sum task->utime and/or task->stime now use the shared_*_sum() inline functions defined in sched.h to sum the per-CPU structures. This includes compat_sys_times(), do_task_stat(), do_getitimer(), sys_times() and k_getrusage(). Certain routines that used task->signal->[us]time now use the shared_*_sum() functions instead, which may (but hopefully will not) change their semantics slightly. These include fill_prstatus() (in fs/binfmt_elf.c), do_task_stat() (in fs/proc/array.c), wait_task_zombie() and do_notify_parent(). At each tick, update_cpu_clock(), account_user_time() and account_system_time() update the relevant field of the shared_times structure using a pointer obtained using per_cpu_ptr, with the effect that these functions do not compete with one another for the cacheline. Each of these functions updates the task-private field followed by the shared_times version if one is present. Finally, kernel/posix-cpu-timers.c has changed quite dramatically. First, run_posix_cpu_timers() decide...
My mention of a 2-CPU special case was just an off-hand idea. I don't
really have any idea if that would be optimal given the tradeoff of
I think I misled you about the use of the it_*_expires fields, sorry.
The task_struct.it_*_expires fields are used solely as a cache of the
head of cpu_timers[]. Despite the poor choice of the same name, the
signal_struct.it_*_expires fields serve a different purpose. For an
analogous cache of the soonest timer to expire, you need to add new
fields. The signal_struct.it_{prof,virt}_{expires,incr} fields hold
the setitimer settings for ITIMER_{PROF,VTALRM}. You can't change
those in arm_timer. For a quick cache you need a new field that is
the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.
The shared_utime_sum et al names are somewhat oblique to anyone who
hasn't just been hacking on exactly this thing like you and I have.
Things like thread_group_*time make more sense.
There are now several places where you call both shared_utime_sum and
shared_stime_sum. It looks simple because they're nicely encapsulated.
But now you have two loops through all CPUs, and three loops in
check_process_timers.
I think what we want instead is this:
struct task_cputime
{
cputime_t utime;
cputime_t stime;
unsigned long long schedtime;
};
Use one in task_struct to replace the utime, stime, and sum_sched_runtime
fields, and another to replace it_*_expires. Use a single inline function
thread_group_cputime() that fills a sum struct task_cputime using a single
loop. For the places only one or two of the sums is actually used, the
compiler should optimize away the extra summing from the loop.
Don't use __cacheline_aligned on this struct type itself, because most of
the uses don't need that. When using alloc_percpu, you can rely on it to
take care of those needs--that's what it's for. If you implement a
variant that uses a flat array, you can use a wrapper struct with
__cacheline_aligned for that.
Thanks,
Roland...Actually, after looking at the code again and thinking about it a bit, it appears that the signal_struct.it_*_incr field holds the actual interval as set by setitimer. Initially the it_*_expires field holds the expiration time as set by setitimer, but after the timer fires the first time that value becomes <firing time>+it_*_incr. In other words, the first time it fires at the value set by setitimer() but from then on it fires at a time indicated by whatever the time was the last time the timer fired plus the value in it_*_incr. This time is stored in signal_struct.it_*_expires. I guess I could be wrong about this, but it appears to be what the code is doing. If my analysis is correct, I really don't need a new field, since the old fields work just fine. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
I would really like to just ignore the 2-cpu scenario and just have two versions, the UP version and the n-way SMP version. It would make life, Okay, I'll go back over this and make sure I got it right. It's interesting, though, that my current patch (written without this particular bit of knowledge) actually performs no differently from the kernel gets: ./nohangc-3 1300 200000 Interval timer off. Threads: 1300 Max prime: 200000 Elapsed: 95.421s Execution: User 356.001s, System 0.029s, Total 356.030s Context switches: vol 1319, invol 7402 ./hangc-3 1300 200000 Interval timer set to 0.010 sec. Threads: 1300 Max prime: 200000 Elapsed: 131.457s Execution: User 435.037s, System 59.495s, Total 494.532s Context switches: vol 1464, invol 10123 Ticks: 22612, tics/sec 45.724, secs/tic 0.022 (More than 1300 threads hangs the old kernel with this test.) With my patch it gets: ./nohangc-3 1300 200000 Interval timer off. Threads: 1300 Max prime: 200000 Elapsed: 94.097s Execution: User 366.000s, System 0.052s, Total 366.052s Context switches: vol 1336, invol 28928 ./hangc-3 1300 200000 Interval timer set to 0.010 sec. Threads: 1300 Max prime: 200000 Elapsed: 93.583s Execution: User 366.117s, System 0.047s, Total 366.164s Context switches: vol 1323, invol 28875 Ticks: 12131, tics/sec 33.130, secs/tic 0.030 In the latest cut I've named them "process_*" but "thread_group" makes Good point, although so far it's been undetectable in my performance testing. (I can't say that it will stay that way down the road a bit, Excellent idea! This method hadn't occurred to me since I was looking at it from the viewpoint of the existing structure and keeping the Yeah, I had caught that one. F...
Like I've said, it's only something to investigate for best performance.
If the conditional code is encapsulated well, it will be simple to add
There are several important scenarios you did not test.
Analysis of combinations of all these variables is needed.
1. Tests with a few threads, like as many threads as CPUs or only 2x as many.
2. Tests with a process CPU timer set for a long expiration time.
i.e. a timer set, but that never goes off in your entire run.
(This is what a non-infinity RLIMIT_CPU limit does.)
With the old code, a long enough timer and a small enough number
That's correct. The it_*_expires fields store itimerval.it_value (the
current timer) and the it_*_incr fields store itimerval.it_interval (the
The analysis above is correct but your conclusion here is wrong.
The current value of an itimer is a user feature, not just a piece
of internal bookkeeping.
getitimer returns in it_value the amount of time until the itimer
fires, regardless of whether or not it will reload after it fires or
with what value it will be reloaded. In a setitimer call, the
it_value sets the time at which the itimer must fire, regardless of
the reload setting in it_interval. Consider the case where
it_interval={0,0}; it_value is still meaningful.
Your code causes any timer_settime or timer_delete call on a process
CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
itimer setting just as if the user had made some setitimer call that
was never made or intended. That's wrong.
Thanks,
Roland
--This is my official first cut at a patch that will fix bug 9906, "Weird hang with NPTL and SIGPROF." The problem is that run_posix_cpu_timers() repeatedly walks the entire thread group every time it runs, which is at interrupt. With heavy load and lots of threads, this can take longer than the tick, at which point the kernel stops doing anything put servicing clock ticks and the occasional interrupt. Many thanks to Roland McGrath for his help in my attempt to understand his code. The change adds a new structure to the signal_struct, thread_group_cputime. On an SMP kernel, this is allocated as a percpu structure when needed (from do_setitimer()) using the alloc_percpu() mechanism). It is manipulated via a set of inline functions and macros defined in sched.h, thread_group_times_init(), thread_group_times_free(), thread_group_times_alloc(), thread_group_update() (the macro) and thread_group_cputime(). The thread_group_update macro is used to update a single field of the thread_group_cputime structure when needed; the thread_group_cputime() function sums the fields for each cpu into a passed structure. In the uniprocessor case, the thread_group_cputime structure becomes a simple substructure of the signal_struct, allocation and freeing go away and updating and "summing" become simple assignments. In addition to fixing the hang, this change removes the overloading of it_prof_expires for RLIMIT_CPU handling, replacing it with a new field, rlim_expires, which is checked instead. This makes the code simpler and more straightforward. I've made some decisions in this work that could have gone in different directions and I'm certainly happy to entertain comments and criticisms. Performance with this fix is at least as good as before and in a few cases is slightly improved, possibly due to the reduced tick overhead. Signed-off-by: Frank Mayhar <fmayhar@google.com> include/linux/sched.h | 172 ++++++++++++++++++++++++++++ kernel/compat.c | 31 ++++-- ...
This is my second cut at a patch that will fix bug 9906, "Weird hang
with NPTL and SIGPROF." Thanks to Ingo Molnar who sent me feedback
regarding the first cut.
Again, the problem is that run_posix_cpu_timers() repeatedly walks the
entire thread group every time it runs, which is at interrupt. With
heavy load and lots of threads, this can take longer than the tick, at
which point the kernel stops doing anything put servicing clock ticks
and the occasional interrupt. Many thanks to Roland McGrath for his
help in my attempt to understand his code.
The change adds a new structure to the signal_struct,
thread_group_cputime. On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism). It is manipulated via a set of functions defined in sched.c
and sched.h. These new functions are
* thread_group_times_free(), inline function to free via
free_percpu() (SMP) or kfree (UP) the thread_group_cputime
structure.
* thread_group_times_alloc(), external function to allocate the
thread_group_cputime structure when needed.
* thread_group_update(), inline function to update a field of the
thread_group_cputime structure; called at interrupt from tick
handlers, generally. It depends on the "offsetof()" macro to
know which field to update and on compiler optimization to
remove the unused code paths in each case.
* thread_group_cputime(), inline function that sums the time
fields for all running CPUs (SMP) or snapshots the time fields
(UP) into a passed structure.
I've changed the uniprocessor case to retain the dynamic allocation of
the thread_group_cputime structure as needed; this makes the code
somewhat more consistent between SMP and UP and retains the feature of
reducing overhead for processes that don't use interval timers.
In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT...On Fri, 28 Mar 2008 15:46:40 -0700 kernel/sys.c: In function 'sys_times': kernel/sys.c:885: error: 'sig' undeclared (first use in this function) --
Thanks. As I said privately, I don't know how this snuck in but it's certainly time to blow away my build tree and reapply the next patch from scratch. Speaking of which, expect that next patch in a day or two. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
return 0 is the proper form - please run your patch through scripts/checkpatch.pl. Ingo --
Well, if it's acceptable, for a first cut (and the patch I'll submit), I'll handle the UP and SMP cases, encapsulating them in sched.h in such a way as to make it invisible (as much as is possible) to the rest of I've actually done this, although I didn't find the numbers particularly After looking at the code again, I now understand what you're talking about. You overloaded it_*_expires to support both the POSIX interval timers and RLIMIT_CPU. So the way I have things, setting one can stomp Right, because the original effect was to only set the it_*_expires on each individual task struct, leaving the one in the signal struct alone. Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather than rolling it into the itimer handling? -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
For clarity, please never mention that identifier without indicating which struct you're talking about. signal_struct.it_*_expires has never been overloaded. signal_struct.it_prof_expires is the ITIMER_PROF setting; signal_struct.it_virt_expires is the ITIMER_VIRTUAL setting; there is no signal_struct.it_sched_expires field. task_struct.it_*_expires has never been overloaded. task_struct.it_prof_expires is the next value of (task_struct.utime + task_struct.stime) at which run_posix_cpu_timers() It is not "rolled into the itimer handling". run_posix_cpu_timers() handles three separate features: 1. ITIMER_VIRTUAL, ITIMER_PROF itimers (setitimer/getitimer) 2. POSIX timers CPU timers (timer_* calls) 3. CPU time rlimits (RLIMIT_CPU for process-wide, RLIMIT_RTTIME for each thread) The poorly-named task_struct.it_*_expires fields serve a single purpose: to optimize run_posix_cpu_timers(). task_struct.it_prof_expires is the I don't see the point of adding this field at all. It serves solely to cache the secs_to_cputime calculation on the RLIMIT_CPU rlim_cur value, This does not make sense. There is no need for any new state at all in the UP case, just reorganizing what is already there. The existing signal_struct fields utime, stime, and sum_sched_runtime are no longer needed. These accumulate the times of dead threads in the group (see __exit_signal in exit.c) solely so cpu_clock_sample_group can add them in. Keeping both those old fields and the dynamically allocated per-CPU counters is wrong. You will count double all the threads that died since the struct was allocated (i.e. since the first timer was set). For the UP case, just replace these with a single struct thread_group_cputime in signal_struct, and increment it directly on every tick. __exit_signal never touches it. For the SMP case, you need a bit of complication. When there are no expirations (none of the three features in use on a process-wide CPU clock) or only one live thread, then you don't ne...
Roland, I'm very much having to read between the lines of what you've
written. And, obviously, getting it wrong at least half the time. :-)
So you've cleared part of my understanding with your latest email.
Here's what I've gotten from it:
struct task_cputime {
cputime_t utime; /* User time. */
cputime_t stime; /* System time. */
unsigned long long sched_runtime; /* Scheduler time. */
};
This is for both SMP and UP, defined before signal_struct in sched.h
(since that structure refers to this one). Following that:
struct thread_group_cputime;
Which is a forward reference to the real definition later in the file.
The inline functions depend on signal_struct and task_struct, so they
have to come after:
#ifdef SMP
struct thread_group_cputime {
struct task_cputime *totals;
};
< ... inline functions ... >
#else /* SMP */
struct thread_group_cputime {
struct task_cputime totals;
};
< ... inline functions ... >
#endif
The SMP version is percpu, the UP version is just a substructure. In
signal_struct itself, delete utime & stime, add
struct thread_group_cputime cputime;
The inline functions include the ones you defined for UP plus equivalent
ones for SMP. The SMP inlines check the percpu pointer
(sig->cputime.totals) and don't update if it's NULL. One small
correction to one of your inlines, in thread_group_cputime:
*cputime = sig->cputime;
should be
*cputime = sig->cputime.totals;
A representative inline for SMP is:
static inline void account_group_system_time(struct task_struct *task,
cputime_t cputime)
{
struct task_cputime *times;
if (!sig->cputime.totals)
return;
times = per_cpu_ptr(sig->cputime.totals, get_cpu());
times->stime...I'm sorry I wasn't able to communicate more clearly. Please ask me to elaborate whenever it would help. I hope the experience will inspire you to add some good comments to the code. If the comments come to explain everything you now wish had been brought to your attention This is what I'd called it originally (based on the sched_clock() name). Now it's an aggregate of task_struct.se.sum_exec_runtime, so perhaps There is no need for that. This struct is used directly in signal_struct and so has to be defined before it anyway. (And when a struct is used only as the type of a pointer in another type declaration, no forward Right. And if we were to try different variants based on NR_CPUS or whatever, that would entail just a new struct thread_group_cputime By do_setitimer, you mean set_process_cpu_timer and posix_cpu_timer_set. In the last message I said "there are several options" for dealing with the dead-threads issue on SMP and did not go into it. This is one option. It is attractive in being as lazy as possible. But I think it is a prohibitively dubious proposition to do any allocation in __exit_signal. That is part of the hot path for recovering from OOM situations and all sorts of troubles; it's also expected to be reliable so that I am not comfortable with the bookkeeping dropping bits in extreme conditions (i.e. when you cannot possibly succeed in the allocation and have to proceed without it or ultimately wedge the system). The first thing to do is move the existing summation of utime, stime, and sum_exec_runtime in __exit_signal into an inline thread_group_cputime_exit. That abstracts it into the set of inlines that can vary for the different flavors of storage model. For UP, it does nothing. 1. One set of options keeps static storage for dead-threads accounting, as we have now. i.e., utime, stime, sum_sched_runtime in signal_struct are replaced by a "struct task_cputime dead_totals;" in thread_group_cputime. a. thread_group_cputime_exit ...
And another quick note: It appears that with the "allocate percpu storage in copy_signal CLONE_THREAD case" mechanism, I don't need to worry about allocating it anywhere else. If I need it (which is only in the case of multiple threads and an interval timer) then I'll have it because it was allocated with the second thread. So I just eliminate the allocation in do_setitimer() entirely. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
One quick note: this inline isn't needed for the 2b solution (allocate percpu storage in copy_signal CLONE_THREAD case), since if there's more than one thread there'll always be a percpu area and if there's only one thread the summation code won't be entered. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
That's true. I still think it's a good idea to have it, even if it winds up being empty in the variants we really use. The principle is that the new set of types/functions could be used to implement exactly what we have now. In fact, it's usually best to do a series of small patches that start Again, I'd leave the call to the inline that would do it. For this implementation plan, its body is: BUG_ON(!task->signal->cputime.totals && !thread_group_empty(task)); Thanks, Roland --
Ah, okay. Well, except that the whole point of this exercise is to fix BTW I did look at allocating it in posix_cpu_timer_set() and set_process_cpu_timer() but the first at least is doing stuff with locks held. I'll keep looking at it, though. One little gotcha we just ran into, though: When checking tsk->signal->(anything) in run_posix_cpu_timers(), we have to hold tasklist_lock to avoid a race with release_task(). This is going to make even the null case always cost more than before. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Yeah, it's a little sticky. It would be simple enough to arrange things to
do the allocation when needed before taking locks, but would not make the
code wonderfully self-contained. I wouldn't worry about it unless/until we
conclude for other reasons that this really is the best way to go. We seem
This is reminiscent of something that came up once before. I think it
was the same issue of what happens on a tick while the thread is in the
middle of exiting and racing with release_task. See commit
72ab373a5688a78cbdaf3bf96012e597d5399bb7, commit
3de463c7d9d58f8cf3395268230cb20a4c15bffa and related history (some of
This is a big can of worms that we really don't need. Complicating the
data structure handling this way is really not warranted at all just to
address this race. You'll just create another version of the same race
with a different pointer, and then solve it some simple way that you
could have just used to solve the existing problem. If you don't have
some independent (and very compelling) reasons to reorganize the data
structures, nix nix nix.
We can make posix_cpu_timers_exit() or earlier in the exit/reap path
tweak any state we need to ensure that this problem won't come up.
But, off hand, I don't think we need any new state.
Probably the right fix is to make the fast-path check do:
rcu_read_lock();
signal = rcu_dereference(current->signal);
if (unlikely(!signal) || !fastpath_process_timer_check(signal)) {
rcu_read_unlock();
return;
}
sighand = lock_task_sighand(current, &flags);
rcu_read_unlock();
if (likely(sighand))
slowpath_process_timer_work(signal);
unlock_task_sighand(sighand, &flags);
Another approach that is probably fine too is just to do:
if (unlikely(current->exit_state))
return;
We can never get to the __exit_signal code that causes the race if we
are not already late in exit. The former seems a little preferable
because the added fast-path cost is the same (or perhaps even more
cache-friendly), bu...Well, if we allocate at clone time then some things do get a bit simpler. (Hmm, for some reason I was thinking that we still allocate in Yeah, I checked that out. The one difference here is that that was a race between do_exit() (actually release_task()/__exit_signal()) and run_posix_cpu_timers(). While this race was the same in that respect, there was also a race between all of the timer-tick routines that call any of account_group_user_time(), account_group_system_time() or account_group_exec_runtime() and __exit_signal(). This is because those Erm, well, this isn't reorganizing the data structures per se, since these are new data structures. Regardless, the way I did this was by making the refcounted structure follow the lifetime of the task structure. It's instantiated when needed and at that point each task structure in the thread group gets a reference, which is released when that task structure is destroyed. The last release destroys the data structure. The upshot of this is that none of the timer routines dereference tsk->signal, so the races go away, no locking needed. From my perspective this was the simplest solution, since lock dependency ordering is _really_ a can of worms. That's certainly a possibility, but it'll still require more code in the I'll have to look at the code some more and thing about this. For the other timer routines I would think that the second approach would be better, but I still think that best of all is to not have to check anything at all (except of course the presence of the structure pointer itself). Regarding the second approach, without locking wouldn't that still be racy? Couldn't exit_state change (and therefore __exit_signal() run) between the check and the dereference? -- Frank Mayhar frank@exit.com http://www.exit.com/ Exit Consulting http://www.gpsclock.com/ http://www.exit.com/blog/frank/ http://www.zazzle.com/fmayhar* -...
The essence that matters is the same: something that current does with its Tomato, tomato. You're adding new data structures and lifetime rules to replace data that was described in a different data structure before, yet your new data's meaningful semantic lifetime exactly matches that of signal_struct. You could as well make everything release_task cleans up be done in __put_task_struct instead, but that would not be a good idea either. You've added a word to task_struct (100000 words per 100000-thread With the perspective of tunnel vision to just your one test case, adding something entirely new considering only that case always seems simplest. That's not how we keep the entire system from getting the wrong kinds of No. current->exit_state can go from zero to nonzero only by current running code in the do_exit path. current does not progress on that path while current is inside one update_process_times call. Thanks, Roland --
Okay. One of the paths to the update code is through update_curr() in
sched_fair.c, which (in my tree) calls account_group_exec_runtime() to
update the sum_exec_runtime field:
delta_exec = (unsigned long)(now - curr->exec_start);
__update_curr(cfs_rq, curr, delta_exec);
curr->exec_start = now;
if (entity_is_task(curr)) {
struct task_struct *curtask = task_of(curr);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
To make sure that I understand what's going on, I put an invariant at
the beginning of account_group_exec_runtime():
static inline void account_group_exec_runtime(struct task_struct *tsk,
unsigned long long ns)
{
struct signal_struct *sig = tsk->signal;
struct task_cputime *times;
BUG_ON(tsk != current);
if (unlikely(tsk->exit_state))
return;
if (!sig->cputime.totals)
return;
times = per_cpu_ptr(sig->cputime.totals, get_cpu());
times->sum_exec_runtime += ns;
put_cpu_no_resched();
}
And, you guessed it, the invariant gets violated. Apparently the passed
task_struct isn't the same as "current" at this point.
Any ideas? Am I checking the wrong thing? If we're really not updating
current then the task we are updating could very easily be running
through __exit_signal() on another CPU. (And while I wait for your
response I will of course continue to try to figure this out.)
--
Frank Mayhar <fmayhar@google.com>
Google, Inc.
--My explanation about the constraints on exit_state was specifically about
the context of update_process_times(), which is part of the path for a
The scheduler code has gotten a lot more complex since I first implemented
posix-cpu-timers, and I've never been any expert on the scheduler at all.
But I'm moderately sure all those things are all involved in context
switch where the task of interest is about to be on a CPU or just was on a
CPU. I doubt those are places where the task in question could be
simultaneously executing in exit_notify() on another CPU. But we'd need
to ask the scheduler experts to be sure we know what we're talking about
I haven't figured out what actual code path this refers to.
This sort of concern is among the reasons that checking ->signal was the
course I found wise to suggest to begin with. We can figure out what the
constraints on ->exit_state are in all the places by understanding every
corner of the scheduler. We can measure whether it winds up being in a
cooler cache line than ->signal and a net loss to add the load, or has
superior performance as you seem to think. Or we can just test the
constraint that matters, whether the pointer we loaded was in fact null,
and rely on RCU to make it not matter if there is a race after that load.
It doesn't matter whether tsk is current or not, it only matters that we
have the pointer and that we're using some CPU array slot or other that
noone else is using simultaneously.
static inline void account_group_exec_runtime(struct task_struct *tsk,
unsigned long long runtime)
{
struct signal_struct *sig;
struct task_cputime *times;
rcu_read_lock();
sig = rcu_dereference(tsk->signal);
if (likely(sig) && sig->cputime.totals) {
times = per_cpu_ptr(sig->cputime.totals, get_cpu());
times->sum_exec_runtime += runtime;
put_cpu_no_resched();
}
rcu_read_unlock();
}
Thanks,
Roland
--This was my conclusion as well. Certainly the path through do_fork() (elucidated below) doesn't even allow the task in question to even be executing, much less on a different CPU, but all these routines with "struct task_struct *" parameters make me nervous. Which is why I do_fork=>wake_up_new_task=>task_new_fair=> Well, as much as I would like to take the time to do that, I do have a Yeah, agreed. Of course, I was hoping (in vain, apparently) to avoid this level of overhead here. And I suspect I'll really have to do it in each of these routines. But I suppose it can't be helped. Even with a thorough understanding of the scheduler(s) and code based on that understanding, we would still not (necessarily) be protected from future changes that violate the assumptions we make on that basis. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Found the exception. do_fork() violates the invariant when it's cranking up a new process. Hmmm. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
This is one thing that has been unclear. The relationship of signal_struct to task_struct is, as far as I can tell, an unwritten one. Certainly the interrupt routines are adjusting values that live only inside task_struct and (with the exception of run_posix_cpu_timers()) While true, that's not the only reason to do it. The tradeoff here is between performance (i.e. having to do checks before dereferencing tsk->signal) versus space. It's really a judgment call. (Although This isn't exactly how I would state it but yes, this is generally true as well. The problem is that knowing exactly what is "the wrong kinds" relies on knowledge possessed by only a few. Prying that knowledge out Well, okay, this is the vital bit of data that puts everything above into perspective. Had I known this, I would not have made the change I did. I guess the key bit of knowledge is that a "task" is really a scheduling unit, right? And, really, from the scheduler's perspective, "task" is the same as "thread." The only thing that makes a set of threads into a multithreaded process is that they share a signal struct (well, and their memory map, of course). So a "task" can only be executed on a single cpu at any time, it can't be executed on more than one cpu at a time. Therefore if a "task" is executing and is interrupted, the value of "current" at the interrupt will be that task, which is entirely suspended for the duration of the interrupt. Is this correct? (This is not just for this fix, but for my general understanding of Linux scheduling.) Unfortunately, these things are often implicit in the code but as far as I know aren't written down anywhere. This whole exercise has been for me a process of becoming really familiar with the internals of the Linux kernel for the first time. -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
No, the performance idea there is a myth. You're talking about one test and branch-not-taken for a word you're already loading into a register right there anyway (if testing ->signal). It's maybe two cycles that were most likely already idle in a load stall. The cache effects alone of But when it comes out, it's flying at high velocity straight into your face! There are several other things that are implicitly required to be shared when signal_struct is shared, too. But approximately, yes. Had I been in charge of the world, task_struct would be 'struct thread' and signal_struct would be 'struct process'. (Cue left-field flames from the peanut gallery Everything I know I learned from reading the source. So I sympathize with the sense of starting out lost without bearings, but I may be a little hard-hearted about anyone wanting more than their eyeballs and full-text searching to find their own bootstraps and pull (in my day, it was up hill both ways, and all that). Thanks, Roland --
This race, by the way, is because we're dereferencing task->signal at interrupt once per tick. We ran into a case where a process was going through release_task() and being torn down on one CPU while running a timer tick on another. Under load. It's not a very likely race but with sufficient time or load it's pretty much inevitable. My thought is to move thread_group_cputime from the signal structure to hanging directly off the task structure. It would be shared in the same way as the signal structure is now but would be deallocated with the task structure rather than the signal structure. This should mean that I could avoid getting tasklist_lock under most conditions. Thoughts? -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
Okay, having run face-first into this race and having every combination of spinlock serialization fail for me, I've done a variation of the above scheme. For the local environment, I solved the problem by moving the percpu structure out of the signal structure entirely and by making it refcounted. It is allocated as before, but now in two parts, a normal structure with an atomic refcount that has a pointer to the percpu structure. The signal structure doesn't point to it any longer, but each task_struct in the thread group does, and each of these references is counted. New threads will also get a reference (at the top of copy_signal()) and be counted. All access goes through the task structure. References are removed in __put_task_struct() when the task itself is destroyed; when the last reference goes away, the structures are freed. This eliminates the races with signal_struct being freed and has the nice effect that there is a bit less overhead in places like account_group_user_time() and friends. In run_posix_cpu_timers(), though, I have to pick up the tasklist_lock early (and therefore in every case) because it's still dereferencing tsk->signal in the early comparison. I'm thinking about moving all of the itimer stuff (i.e. the cputime_expires structures) into the refcounted structure as well, thus avoiding the signal_struct entirely so we don't need the tasklist_lock in the fast path. I don't know how any of this will affect the UP case, though. I'll have to continue to think about it and I'm sure you have something to say as well. (And if anyone else wants to chime in, they're welcome.) -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
At least it's beginning to make sense. Rather than worrying overmuch
about commentary, though, I'll think about writing an explanation for
the Documentation directory. The implementation is pretty scattered so
there's no obvious single place for good commentary; a separate document
Yeah, I remembered this finally. The structures are declared before
signal_struct in all cases, with the inlines appearing (much) later in
the file.
Here are the declarations (starting at around line 425; the inlines
start at around line 2010 or so):
struct task_cputime {
cputime_t utime; /* User time. */
cputime_t stime; /* System time. */
unsigned long long sum_exec_runtime; /* Scheduler time. */
};
/* Alternate field names when used to cache expirations. */
#define prof_exp stime
#define virt_exp utime
#define sched_exp sum_exec_runtime
#ifdef CONFIG_SMP
struct thread_group_cputime {
struct task_cputime *totals;
};
#else
struct thread_group_cputime {
struct task_cputime totals;
};
#endif
The #defines there are for when I refer to the task_cputime fields as
expiration values, to make it a little more self-explanatory (and less
confusing). I can remove them if it violates a Linux kernel coding
Yeah, that bothered me, too. It's obviously impossible to do the
allocate itself in __exit_signal() itself (since the caller,
release_task() is holding tasklist_lock) and doing it in release_task()
This seems kind of hacky to me, particularly the bit about subtracting
This is a little better and is along the lines I was thinking at one
point, but it still has the dead_totals field which becomes redundant
when the percpu structure is allocated. I would rather eliminate that
field entirely, especially since we're growing signal_struct in other
I had also thought hard about doing this. One the one hand, it makes
dealing w...Okay, my proposed fix for this is to introduce a new field in signal_struct, rlim_expires, a cputime_t. Everywhere that the RLIMIT_CPU code formerly set it_prof_expires it will now set rlim_expires and in run_posix_cpu_timers() I'll check it against the thread group prof_time. I believe that that will solve the problem, if I understand this correctly. If I don't, I trust that you will set me straight. :-) -- Frank Mayhar <fmayhar@google.com> Google, Inc. --
