login
Header Space

 
 

Re: posix-cpu-timers revamp

Previous thread: none

Next thread: Re: [RFC] ext3 freeze feature by Takashi Sato on Wednesday, February 6, 2008 - 9:05 pm. (1 message)
To: <fmayhar@...>
Cc: <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Wednesday, February 6, 2008 - 8:50 pm

On Wed,  6 Feb 2008 16:33:20 -0800 (PST)

It's probably better to handle this one via email, so please send that
testcase vie reply-to-all to this email, thanks.

--
To: Andrew Morton <akpm@...>
Cc: <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Wednesday, February 6, 2008 - 8:58 pm

Testcase attached.

Build with
        gcc -D_GNU_SOURCE -c hangc-2.c -o hangc-2.o
        gcc -lpthread -o hangc-2 hangc-2.o

Run with
        hangc-2 4500 4500

-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.
To: Frank Mayhar <fmayhar@...>
Cc: Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Wednesday, February 6, 2008 - 10:57 pm

FWIW this is not reproducible on 2.6.24/x86/CentOS-51. (I tried running 
it nearly 1500 times in a loop.) Assuming those many tries are sufficient 
to reproduce this bug, there seems to be something specific to the 
environment/architecture/configuration that's necessary to trigger it. 

It might be helpful to provide full details like glibc version, compiler 
version, .config etc.

Parag
--
To: <parag.warudkar@...>
Cc: Frank Mayhar <fmayhar@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 11:22 am

El Wed, 6 Feb 2008 21:57:38 -0500 (EST)

I can reproduce it on my Ubuntu 7.10 kernel 2.6.24


Note i use CC=gcc-4.2 

gcc --version

gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
 
If some fields are empty or look unusual you may have an old version.
Compare to the current minimal requirements in Documentation/Changes.
 
Linux Varda 2.6.24 #2 SMP PREEMPT Fri Jan 25 01:05:47 CET 2008 x86_64 GNU/Linux
 
Gnu C                  4.1.3
Gnu make               3.81
binutils               2.18
util-linux             2.13
mount                  2.13
module-init-tools      3.3-pre2
e2fsprogs              1.40.2
jfsutils               1.1.11
reiserfsprogs          3.6.19
pcmciautils            014
PPP                    2.4.4
Linux C Library        2.6.1
Dynamic linker (ldd)   2.6.1
Procps                 3.2.7
Net-tools              1.60
Kbd                    [opcion...][archivo
Console-tools          0.2.3
Sh-utils               5.97
udev                   113
wireless-tools         29
Modules Loaded         af_packet binfmt_misc rfcomm l2cap bluetooth ipv6 powernow_k8 cpufreq_userspace cpufreq_stats cpufreq_powersave cpufreq_ondemand freq_table cpufreq_conservative nf_conntrack_ftp nf_conntrack_irc xt_tcpudp ipt_ULOG xt_limit xt_state iptable_filter nf_conntrack_ipv4 nf_conntrack ip_tables x_tables kvm_amd kvm w83627ehf hwmon_vid lp snd_hda_intel arc4 ecb blkcipher cryptomgr crypto_algapi snd_pcm_oss snd_mixer_oss snd_pcm snd_mpu401 snd_mpu401_uart snd_seq_dummy rt2500pci rt2x00pci rt2x00lib snd_seq_oss rfkill snd_seq_midi input_polldev snd_rawmidi crc_itu_t snd_seq_midi_event snd_seq mac80211 usbhid snd_timer snd_seq_device cfg80211 usblp ff_memless snd eeprom_93cx6 nvidia i2c_ali1535 i2c_ali15x3 evdev snd_page_alloc sr_mod cdrom button soundcore uli526x 8250_pnp 8250 serial_core i2c_core k8temp hwmon parport_pc parport pata_acpi pcspkr rtc floppy sg ata_generic ehci_hcd r8169 ohci_hcd usbcore unix thermal processor fan fuse

#
# Automatically generated make config: don't edit
# Lin...
To: Alejandro Riveira Fernández <ariveira@...>
Cc: <parag.warudkar@...>, Frank Mayhar <fmayhar@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 11:53 am

On Thu, 7 Feb 2008, Alejandro Riveira Fern
To: <parag.warudkar@...>
Cc: Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 1:36 pm

Several, among which were i686+SMP+GnuC-4.0.3+Glibc-2.3.6.  No PREEMPT.
Linux 2.6.18, 2.6.21 and 2.6.24-rc4.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Alejandro Riveira Fernández <ariveira@...>
Cc: Frank Mayhar <fmayhar@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 11:56 am

That should of course be   
 x86_64+SMP+PREEMPT+GnuC-4.1.3+Glibc-2.6.1 = Reproducible.	

--
To: <parag.warudkar@...>
Cc: Frank Mayhar <fmayhar@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 11:54 am

El Thu, 7 Feb 2008 10:56:16 -0500 (EST)
From my previous mail 

Note that i use CC=gcc-4.2 

 $gcc-4.2 --version

 gcc-4.2 (GCC) 4.2.1 (Ubuntu 4.2.1-5ubuntu4)
--
To: Alejandro Riveira Fernández <ariveira@...>
Cc: <parag.warudkar@...>, Frank Mayhar <fmayhar@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 12:01 pm

On Thu, 7 Feb 2008, Alejandro Riveira Fern
To: Parag Warudkar <parag.warudkar@...>
Cc: Alejandro Riveira Fernández <ariveira@...>, Frank Mayhar <fmayhar@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Thursday, February 7, 2008 - 12:53 pm

Not reproducible with PREEMPT either. 

Parag
--
To: <parag.warudkar@...>
Cc: Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Roland McGrath <roland@...>, Jakub Jelinek <jakub@...>
Date: Friday, February 29, 2008 - 3:55 pm

Okay, here's an analysis of the problem and a potential solution.  I
mentioned this in the bug itself but I'll repeat it here:

A couple of us here have been investigating this thing and have
concluded that the problem lies in the implementation of
run_posix_cpu_timers() and specifically in the quadratic nature of the
implementation.  It calls check_process_timers() to sum the
utime/stime/sched_time (in 2.6.18.5, under another name in 2.6.24+) of
all threads in the thread group.  This means that runtime there grows
with the number of threads.  It can go through the list _again_ if and
when it decides to rebalance expiry times.

After thinking through it, it seems clear that the critical number of
threads is that in which run_posix_cpu_timers() takes as long as or
longer than a tick to get its work done.  The system makes progress to
that point but after that everything goes to hell as it gets further and
further behind.  This explains all the symptoms we've seen, including
seeing run_posix_cpu_timers() at the top of a bunch of profiling stats
(I saw it get more than a third of overall processing time on a bunch of
tests, even where the system _didn't_ hang!).  It explains the fact that
things get slow right before they go to hell and it explains why under
certain conditions the system can recover (if the threads have started
exiting by the time it hangs, for example).

I've come up with a potential fix for the problem.  It does two things.
First, rather than summing the utime/stime/sched_time at interrupt it
adds all of those times to a new task_struct field on the group leader
then at interrupt just consults those fields; this avoids repeatedly
blowing the cache as well as a loop across all the threads.

Second, if there are more than 1000 threads in the process (as noted in
task-&gt;signal-&gt;live), it just punts all of the processing to a workqueue.

With these changes I've gone from a hang at 4500 (or fewer) threads to
running out of resources at more than 32000 threads on...
To: Frank Mayhar <fmayhar@...>
Cc: <parag.warudkar@...>, Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <bugme-daemon@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Jakub Jelinek <jakub@...>
Date: Tuesday, March 4, 2008 - 3:00 am

Thanks for the detailed explanation and for bringing this to my attention.

This is a problem we knew about when I first implemented posix-cpu-timers
and process-wide SIGPROF/SIGVTALRM.  I'm a little surprised it took this
long to become a problem in practice.  I originally expected to have to
revisit it sooner than this, but I certainly haven't thought about it for
quite some time.  I'd guess that HZ=1000 becoming common is what did it.

The obvious implementation for the process-wide clocks is to have the
tick interrupt increment shared utime/stime/sched_time fields in
signal_struct as well as the private task_struct fields.  The all-threads
totals accumulate in the signal_struct fields, which would be atomic_t.
It's then trivial for the timer expiry checks to compare against those
totals.

The concern I had about this was multiple CPUs competing for the
signal_struct fields.  (That is, several CPUs all running threads in the
same process.)  If the ticks on each CPU are even close to synchronized,
then every single time all those CPUs will do an atomic_add on the same
word.  I'm not any kind of expert on SMP and cache effects, but I know
this is bad.  However bad it is, it's that bad all the time and however
few threads (down to 2) it's that bad for that many CPUs.

The implementation we have instead is obviously dismal for large numbers
of threads.  I always figured we'd replace that with something based on
more sophisticated thinking about the CPU-clash issue.  

I don't entirely follow your description of your patch.  It sounds like it
should be two patches, though.  The second of those patches (workqueue)
sounds like it could be an appropriate generic cleanup, or like it could
be a complication that might be unnecessary if we get a really good
solution to main issue.  

The first patch I'm not sure whether I understand what you said or not.
Can you elaborate?  Or just post the unfinished patch as illustration,
marking it as not for submission until you've finished.


Th...
To: Roland McGrath <roland@...>
Cc: <parag.warudkar@...>, Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Jakub Jelinek <jakub@...>
Date: Tuesday, March 4, 2008 - 3:52 pm

Put this on the patch but I'm emailing it as well.



Well, the iron is getting bigger, too, so it's beginning to be feasible

My first patch did essentially what you outlined above, incrementing
shared utime/stime/sched_time fields, except that they were in the
task_struct of the group leader rather than in the signal_struct.  It's
not clear to me exactly how the signal_struct is shared, whether it is
shared among all threads or if each has its own version.

So each timer routine had something like:

	/* If we're part of a thread group, add our time to the leader. */
	if (p-&gt;group_leader != NULL)
		p-&gt;group_leader-&gt;threads_sched_time += tmp;

and check_process_timers() had

	/* Times for the whole thread group are held by the group leader. */
	utime = cputime_add(utime, tsk-&gt;group_leader-&gt;threads_utime);
	stime = cputime_add(stime, tsk-&gt;group_leader-&gt;threads_stime);
	sched_time += tsk-&gt;group_leader-&gt;threads_sched_time;

Of course, this alone is insufficient.  It speeds things up a tiny bit
but not nearly enough.

The other issue has to do with the rest of the processing in
run_posix_cpu_timers(), walking the timer lists and walking the whole
thread group (again) to rebalance expiry times.  My second patch moved
all that work to a workqueue, but only if there were more than 100
threads in the process.  This basically papered over the problem by
moving the processing out of interrupt and into a kernel thread.  It's
still insufficient, though, because it takes just as long and will get
backed up just as badly on large numbers of threads.  This was made
clear in a test I ran yesterday where I generated some 200,000 threads.
The work queue was unreasonably large, as you might expect.

I am looking for a way to do everything that needs to be done in fewer
operations, but unfortunately I'm not familiar enough with the
SIGPROF/SIGVTALRM semantics or with the details of the Linux
implementation to know where it is safe to consolidate things.
-- 
Frank...
To: Frank Mayhar <fmayhar@...>
Cc: <parag.warudkar@...>, Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Jakub Jelinek <jakub@...>
Date: Wednesday, March 5, 2008 - 12:08 am

There is a 1:1 correspondence between "shares signal_struct" and "member of
same thread group".  signal_struct is the right place for such new fields.
Don't be confused by the existing fields utime, stime, gtime, and
sum_sched_runtime.  All of those are accumulators only touched when a
non-leader thread dies (in __exit_signal), and governed by the siglock.
Their only purpose now is to represent the threads that are dead and gone
when calculating the cumulative total for the whole thread group.  If you
were to provide cumulative totals that are updated on every tick, then

The task_struct.group_leader field is never NULL.  Every thread is a member
of some thread group.  The degenerate case is that it's the only member of

It sounds like you sped up only one of the sampling loops.  Having a
cumulative total already on hand means cpu_clock_sample_group can also
become simple and cheap, as can the analogues in do_getitimer and
k_getrusage.  These are what's used in clock_gettime and in the timer
manipulation calls, and in getitimer and getrusage.  That's all just gravy.

The real benefit of having a cumulative total is for the basic logic of
run_posix_cpu_timers (check_process_timers) and the timer expiry setup.  
It sounds like you didn't take advantage of the new fields for that.

When a cumulative total is on hand in the tick handler, then there is no
need at all to derive per-thread expiry times from group-wide CPU timers
("rebalance") either there or when arming the timer in the first place.
All of that complexity can just disappear from the implementation.

check_process_timers can look just like check_thread_timers, but
consulting the shared fields instead of the per-thread ones for both the
clock accumulators and the timers' expiry times.  Likewise, arm_timer
only has to set signal-&gt;it_*_expires; process_timer_rebalance goes away.

If you do all that then the time spent in run_posix_cpu_timers should
not be affected at all by the number of threads.  The only "walking the
...
To: Roland McGrath <roland@...>
Cc: <parag.warudkar@...>, Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Jakub Jelinek <jakub@...>
Date: Friday, March 7, 2008 - 7:26 pm

Based on Roland's comments and from reading the source, I have a
possible fix.  I'm posting the attached patch _not_ for submission but
_only_ for comment.  For one thing it's based on 2.6.18.5 and for
another it hasn't had much testing yet.  I wanted to get it out here for
comment, though, in case anyone can see where I might have gone wrong.
Comments, criticism and (especially!) testing enthusiastically

	Replaces the utime, stime and sched_time fields in signal_struct with
	shared_utime, shared_stime and shared_schedtime, respectively.  It
	also adds it_sched_expires to the signal struct.

	Each place that loops through all threads in a thread group to sum
	task-&gt;utime and/or task-&gt;stime now loads the value from
	task-&gt;signal-&gt;shared_[us]time.  This includes compat_sys_times(),
	do_task_stat(), do_getitimer(), sys_times() and k_getrusage().

	Certain routines that used task-&gt;signal-&gt;[us]time now use the shared
	fields instead, which may change their semantics slightly.  These
	include fill_prstatus() (in fs/binfmt_elf.c), do_task_stat() (in
	fs/proc/array.c), wait_task_zombie() and do_notify_parent().

	The shared fields are updated at each tick, in update_cpu_clock()
	(shared_schedtime), account_user_time() (shared_utime) and
	account_system_time() (shared_stime).  Each of these functions updates
	the task-private field followed by the shared version in the signal
	structure if one is present.  Note that if different threads of the
	same process are being run by different CPUs at the tick, there may
	be serious cache contention here.

	Finally, kernel/posix-cpu-timers.c has changed quite dramatically.
	First, run_posix_cpu_timers() decides whether a timer has expired by
	consulting the it_*_expires and shared_* fields in the signal struct.
	The check_process_timers() routine bases its computations on the new
	shared fields, removing two loops through the threads.  "Rebalancing"
	is no longer required, the process_timer_rebalance() routine as
	disappeared en...
To: Roland McGrath <roland@...>
Cc: <parag.warudkar@...>, Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Jakub Jelinek <jakub@...>
Date: Friday, March 7, 2008 - 8:01 pm

The previous email was missing one small part of the patch, reproduced
below.  Remain calm.
----------------------------------BEGIN------------------------------------
diff -urp /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c linux-2.6.18.5/kernel/compat.c
--- /home/fmayhar/Static/linux-2.6.18.5/kernel/compat.c	2006-12-01 16:13:05.000000000 -0800
+++ linux-2.6.18.5/kernel/compat.c	2008-03-06 17:26:21.000000000 -0800
@@ -161,18 +161,11 @@ asmlinkage long compat_sys_times(struct 
 	if (tbuf) {
 		struct compat_tms tmp;
 		struct task_struct *tsk = current;
-		struct task_struct *t;
 		cputime_t utime, stime, cutime, cstime;
 
 		read_lock(&amp;tasklist_lock);
-		utime = tsk-&gt;signal-&gt;utime;
-		stime = tsk-&gt;signal-&gt;stime;
-		t = tsk;
-		do {
-			utime = cputime_add(utime, t-&gt;utime);
-			stime = cputime_add(stime, t-&gt;stime);
-			t = next_thread(t);
-		} while (t != tsk);
+		utime = tsk-&gt;signal-&gt;shared_utime;
+		stime = tsk-&gt;signal-&gt;shared_stime;
 
 		/*
 		 * While we have tasklist_lock read-locked, no dying thread
-----------------------------------END-------------------------------------
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <parag.warudkar@...>, Alejandro Riveira <ariveira@...>, Andrew Morton <akpm@...>, <linux-kernel@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, Jakub Jelinek <jakub@...>
Date: Thursday, March 6, 2008 - 3:04 pm

Okay, my understanding of this is still evolving, so please (please!)
correct me when I get it wrong.  I take what you're saying to mean that,
first, run_posix_cpu_timers() only needs to be run once per thread
group.  It _sounds_ like it should be checking the shared fields rather
than the per-task fields for timer expiration (in fact, the more I think
about it the more sure I am that that's the case).

The old process_timer_rebalance() routine was intended to distribute the
remaining ticks across all the threads, so that the per-task fields
would cause run_posix_cpu_timers() to run at the appropriate time.  With
it checking the shared fields this becomes no longer necessary.

Since the shared fields are getting all the ticks, this will work for
per-thread timers as well.

The arm_timers() routine, instead of calling posix_timer_rebalance(),
should just directly set signal-&gt;it_*_expires to the expiration time,
e.g.:
			switch (CPUCLOCK_WHICH(timer-&gt;it_clock)) {
			default:
				BUG();
			case CPUCLOCK_VIRT:
				if (!cputime_eq(p-&gt;signal-&gt;it_virt_expires,
						cputime_zero) &amp;&amp;
				    cputime_lt(p-&gt;signal-&gt;it_virt_expires,
					       timer-&gt;it.cpu.expires.cpu))
					break;
				p-&gt;signal-&gt;it_virt_expires = timer-&gt;it.cpu.expires.cpu;
				goto rebalance;
			case CPUCLOCK_PROF:
				if (!cputime_eq(p-&gt;signal-&gt;it_prof_expires,
						cputime_zero) &amp;&amp;
				    cputime_lt(p-&gt;signal-&gt;it_prof_expires,
					       timer-&gt;it.cpu.expires.cpu))
					break;
				i = p-&gt;signal-&gt;rlim[RLIMIT_CPU].rlim_cur;
				if (i != RLIM_INFINITY &amp;&amp;
				    i &lt;= cputime_to_secs(timer-&gt;it.cpu.expires.cpu))
					break;
				p-&gt;signal-&gt;it_prof_expires = timer-&gt;it.cpu.expires.cpu;
				goto rebalance;
			case CPUCLOCK_SCHED:
				p-&gt;signal-&gt;it_sched_expires = timer-&gt;it.cpu.expires.sched;
				break;

It's still probably worthwhile to defer processing to a workqueue
thread, though, just because it's still a lot ...
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Tuesday, March 11, 2008 - 3:50 am

[I changed the subject and trimmed the CC list, as this is now quite far
away from the "some mysterious NPTL problem" subject.  If anyone else

Not quite.  check_process_timers only needs to be run once per thread

run_posix_cpu_timers does two things: thread CPU timers and process CPU
timers.  The thread CPU timers track the thread CPU clocks, which are
what the per-thread fields in task_struct count.  check_thread_timers
finds what thread CPU timers have fired.  The task_struct.it_*_expires
fields are set when there are thread CPU timers set on those clocks.

The process CPU timers track the process CPU clocks.  Each process CPU
clock (virt, prof, sched) is just the sum of the corresponding thread
CPU clock across all threads in the group.  In the original code, these
clocks are never maintained in any storage as such, but sampled by
summing all the thread clocks when a current value is needed.
check_process_timers finds what process CPU timers have fired.  The
signal_struct.it_*_expires fields are set when there are process CPU
timers set on those clocks.

The "rebalance" stuff also sets the task_struct.it_*_expires fields of
all the threads in the group when there are process CPU timers.  So each
of these fields is really the minimum of the expiration time of the
earliest thread CPU timer and the "balanced" sample-and-update time


I do not follow your logic at all here.  The signal_struct fields being
proposed track each process CPU clock's value.  The thread CPU timers


I think the natural place for it is run_timer_softirq.  This is where
natural time timers run.  The posix-cpu-timers processing is analogous
to the posix-timers processing, which runs via timer callbacks from
here.  That roughly means doing it right after the interrupt normally,
and falling back to something similar to a workqueue when the load is
too heavy.  In the common case, it doesn't entail any context switches,
as workqueues always do.

The issue with either the workqueue or the softirq approach ...
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Tuesday, March 11, 2008 - 5:05 pm

Where "interesting tick" means "tick in which a process timer has

And my changes introduce these clocks as separate fields in the signal

Okay, I hadn't been clear on the distinction between process-wide and
thread-only timers.  So, really, run_posix_cpu_timers() needs to check
both sets, the versions in the signal struct for the process-wide timers

I'm going to table this for now.  Based on my preliminary performance
results, the changes I've made mean that using a workqueue or softirq is
not necessary.  In the profiles of a couple of testcases I've run,
run_posix_cpu_timers() didn't show up at all, whereas before my change
it was right at the top with ~35% of the time.


So, check_process_timers() checks for and handles any expired timers for
the currently-running process, whereas check_thread_timers() checks for
and handles any expired timers for the currently-running thread.  Is
that correct?

And, since these timers are only counting CPU time, if a thread is never
running at the tick (since that's how we account time in the first
place) any timers it might have will never expire.  Sorry to repeat the
obvious but sometimes it's better to state things very explicitly.

At each tick a process-wide timer may have expired.  Also, at each tick
a thread-only timer may have expired.  Or, of course, both.  So we need
to detect both events and fire the appropriate timer in the appropriate
context.

I think my current code (i.e. the patch I published for comment a few
days ago) does this, with one exception:  If a thread-only timer expires
it _won't_ be detected when run_posix_cpu_timers() runs, since I'm only
checking the process-wide timers.  This implies that I need to do twice

I've actually gotten a little bit of insight into this.  I don't think
that a straight set of shared fields is sufficient except in UP (and
possibly dual-CPU) environments.  I was able to run a reasonable test on
both a four-core Opteron system and a sixteen-core Opteron system.
(That's two dual-cor...
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Tuesday, March 11, 2008 - 5:35 pm

I'm not surprised by this result.  (I do want to see much more detailed

That is exactly what I had in mind.  (I hadn't noticed alloc_percpu, and it
has one more level of indirection than I'd planned.  But that wastes less
space when num_possible_cpus() is far greater than num_online_cpus(), and I
imagine it's vastly superior for NUMA.)

Don't forget do_[gs]etitimer and k_getrusage can use this too.
(Though maybe no reason to bother in k_getrusage since it has

I tend to agree.  It's the only plan I've thought through in detail.
But my remarks stand, about thorough analysis of performance impacts
of options we can think of.


Thanks,
Roland
--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Thursday, March 13, 2008 - 8:37 pm

After the recent conversation with Roland and after more testing, I have
another patch for review (although _not_ for submission, as again it's
against 2.6.18.5).  This patch breaks the shared utime/stime/sched_time
fields out into their own structure which is allocated as needed via
alloc_percpu().  This avoids cache thrashing when running lots of
threads on lots of CPUs.

Please take a look and let me know what you think.  In the meantime I'll
be working on a similar patch to 2.6-head that has optimizations for
uniprocessor and two-CPU operation, to avoid the overhead of the percpu
functions when they are unneeded.

This patch:

	Replaces the utime, stime and sched_time fields in signal_struct with
	the shared_times structure, which is cacheline aligned and allocated
	when needed using the alloc_percpu() mechanism.  There is one copy of
	this structure per running CPU when it is being used.

	Each place that loops through all threads in a thread group to sum
	task-&gt;utime and/or task-&gt;stime now use the shared_*_sum() inline
	functions defined in sched.h to sum the per-CPU structures.  This
	includes compat_sys_times(), do_task_stat(), do_getitimer(),
	sys_times() and k_getrusage().

	Certain routines that used task-&gt;signal-&gt;[us]time now use the
	shared_*_sum() functions instead, which may (but hopefully will not)
	change their semantics slightly.  These include fill_prstatus() (in
	fs/binfmt_elf.c), do_task_stat() (in fs/proc/array.c),
	wait_task_zombie() and do_notify_parent().

	At each tick, update_cpu_clock(), account_user_time() and
	account_system_time() update the relevant field of the shared_times
	structure using a pointer obtained using per_cpu_ptr, with the effect
	that these functions do not compete with one another for the cacheline.
	Each of these functions updates the task-private field followed by the
	shared_times version if one is present.

	Finally, kernel/posix-cpu-timers.c has changed quite dramatically.
	First, run_posix_cpu_timers() decide...
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Friday, March 21, 2008 - 3:18 am

My mention of a 2-CPU special case was just an off-hand idea.  I don't
really have any idea if that would be optimal given the tradeoff of

I think I misled you about the use of the it_*_expires fields, sorry.
The task_struct.it_*_expires fields are used solely as a cache of the
head of cpu_timers[].  Despite the poor choice of the same name, the
signal_struct.it_*_expires fields serve a different purpose.  For an
analogous cache of the soonest timer to expire, you need to add new
fields.  The signal_struct.it_{prof,virt}_{expires,incr} fields hold
the setitimer settings for ITIMER_{PROF,VTALRM}.  You can't change
those in arm_timer.  For a quick cache you need a new field that is
the sooner of it_foo_expires or the head cpu_timers[foo] expiry time.

The shared_utime_sum et al names are somewhat oblique to anyone who
hasn't just been hacking on exactly this thing like you and I have.
Things like thread_group_*time make more sense.

There are now several places where you call both shared_utime_sum and
shared_stime_sum.  It looks simple because they're nicely encapsulated.
But now you have two loops through all CPUs, and three loops in
check_process_timers.

I think what we want instead is this:

	struct task_cputime
	{
		cputime_t utime;
		cputime_t stime;
		unsigned long long schedtime;
	};

Use one in task_struct to replace the utime, stime, and sum_sched_runtime
fields, and another to replace it_*_expires.  Use a single inline function
thread_group_cputime() that fills a sum struct task_cputime using a single
loop.  For the places only one or two of the sums is actually used, the
compiler should optimize away the extra summing from the loop.

Don't use __cacheline_aligned on this struct type itself, because most of
the uses don't need that.  When using alloc_percpu, you can rely on it to
take care of those needs--that's what it's for.  If you implement a
variant that uses a flat array, you can use a wrapper struct with
__cacheline_aligned for that.


Thanks,
Roland...
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Friday, March 21, 2008 - 4:40 pm

Actually, after looking at the code again and thinking about it a bit,
it appears that the signal_struct.it_*_incr field holds the actual
interval as set by setitimer.  Initially the it_*_expires field holds
the expiration time as set by setitimer, but after the timer fires the
first time that value becomes &lt;firing time&gt;+it_*_incr.  In other words,
the first time it fires at the value set by setitimer() but from then on
it fires at a time indicated by whatever the time was the last time the
timer fired plus the value in it_*_incr.  This time is stored in
signal_struct.it_*_expires.

I guess I could be wrong about this, but it appears to be what the code
is doing.  If my analysis is correct, I really don't need a new field,
since the old fields work just fine.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Friday, March 21, 2008 - 1:57 pm

I would really like to just ignore the 2-cpu scenario and just have two
versions, the UP version and the n-way SMP version.  It would make life,

Okay, I'll go back over this and make sure I got it right.  It's
interesting, though, that my current patch (written without this
particular bit of knowledge) actually performs no differently from the
kernel gets:

./nohangc-3 1300 200000
Interval timer off.
Threads:                1300
Max prime:              200000
Elapsed:                95.421s
Execution:              User 356.001s, System 0.029s, Total 356.030s
Context switches:       vol 1319, invol 7402

./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads:                1300
Max prime:              200000
Elapsed:                131.457s
Execution:              User 435.037s, System 59.495s, Total 494.532s
Context switches:       vol 1464, invol 10123
Ticks:                  22612, tics/sec 45.724, secs/tic 0.022

(More than 1300 threads hangs the old kernel with this test.)

With my patch it gets:

./nohangc-3 1300 200000
Interval timer off.
Threads:                1300
Max prime:              200000
Elapsed:                94.097s
Execution:              User 366.000s, System 0.052s, Total 366.052s
Context switches:       vol 1336, invol 28928

./hangc-3 1300 200000
Interval timer set to 0.010 sec.
Threads:                1300
Max prime:              200000
Elapsed:                93.583s
Execution:              User 366.117s, System 0.047s, Total 366.164s
Context switches:       vol 1323, invol 28875
Ticks:                  12131, tics/sec 33.130, secs/tic 0.030


In the latest cut I've named them "process_*" but "thread_group" makes

Good point, although so far it's been undetectable in my performance
testing.  (I can't say that it will stay that way down the road a bit,

Excellent idea!  This method hadn't occurred to me since I was looking
at it from the viewpoint of the existing structure and keeping the

Yeah, I had caught that one.

F...
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Saturday, March 22, 2008 - 5:58 pm

Like I've said, it's only something to investigate for best performance.
If the conditional code is encapsulated well, it will be simple to add


There are several important scenarios you did not test.
Analysis of combinations of all these variables is needed.
1. Tests with a few threads, like as many threads as CPUs or only 2x as many.
2. Tests with a process CPU timer set for a long expiration time.
   i.e. a timer set, but that never goes off in your entire run.
   (This is what a non-infinity RLIMIT_CPU limit does.)
   With the old code, a long enough timer and a small enough number

That's correct.  The it_*_expires fields store itimerval.it_value (the
current timer) and the it_*_incr fields store itimerval.it_interval (the

The analysis above is correct but your conclusion here is wrong.
The current value of an itimer is a user feature, not just a piece
of internal bookkeeping.

getitimer returns in it_value the amount of time until the itimer
fires, regardless of whether or not it will reload after it fires or
with what value it will be reloaded.  In a setitimer call, the
it_value sets the time at which the itimer must fire, regardless of
the reload setting in it_interval.  Consider the case where
it_interval={0,0}; it_value is still meaningful.  

Your code causes any timer_settime or timer_delete call on a process
CPU timer or any setrlimit call on RLIMIT_CPU to suddenly change the
itimer setting just as if the user had made some setitimer call that
was never made or intended.  That's wrong.


Thanks,
Roland
--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Thursday, March 27, 2008 - 8:52 pm

This is my official first cut at a patch that will fix bug 9906, "Weird
hang with NPTL and SIGPROF."  The problem is that run_posix_cpu_timers()
repeatedly walks the entire thread group every time it runs, which is at
interrupt.  With heavy load and lots of threads, this can take longer
than the tick, at which point the kernel stops doing anything put
servicing clock ticks and the occasional interrupt.  Many thanks to
Roland McGrath for his help in my attempt to understand his code.

The change adds a new structure to the signal_struct,
thread_group_cputime.  On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism).  It is manipulated via a set of inline functions and macros
defined in sched.h, thread_group_times_init(),
thread_group_times_free(), thread_group_times_alloc(),
thread_group_update() (the macro) and thread_group_cputime().  The
thread_group_update macro is used to update a single field of the
thread_group_cputime structure when needed; the thread_group_cputime()
function sums the fields for each cpu into a passed structure.

In the uniprocessor case, the thread_group_cputime structure becomes a
simple substructure of the signal_struct, allocation and freeing go away
and updating and "summing" become simple assignments.

In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT_CPU handling, replacing it with a new field,
rlim_expires, which is checked instead.  This makes the code simpler and
more straightforward.

I've made some decisions in this work that could have gone in different
directions and I'm certainly happy to entertain comments and criticisms.
Performance with this fix is at least as good as before and in a few
cases is slightly improved, possibly due to the reduced tick overhead.

Signed-off-by:  Frank Mayhar &lt;fmayhar@google.com&gt;

 include/linux/sched.h     |  172 ++++++++++++++++++++++++++++
 kernel/compat.c           |   31 ++++--
 ...
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Friday, March 28, 2008 - 6:46 pm

This is my second cut at a patch that will fix bug 9906, "Weird hang
with NPTL and SIGPROF."  Thanks to Ingo Molnar who sent me feedback
regarding the first cut.

Again, the problem is that run_posix_cpu_timers() repeatedly walks the
entire thread group every time it runs, which is at interrupt.  With
heavy load and lots of threads, this can take longer than the tick, at
which point the kernel stops doing anything put servicing clock ticks
and the occasional interrupt.  Many thanks to Roland McGrath for his
help in my attempt to understand his code.

The change adds a new structure to the signal_struct,
thread_group_cputime.  On an SMP kernel, this is allocated as a percpu
structure when needed (from do_setitimer()) using the alloc_percpu()
mechanism).  It is manipulated via a set of functions defined in sched.c
and sched.h.  These new functions are

      * thread_group_times_free(), inline function to free via
        free_percpu() (SMP) or kfree (UP) the thread_group_cputime
        structure.
      * thread_group_times_alloc(), external function to allocate the
        thread_group_cputime structure when needed.
      * thread_group_update(), inline function to update a field of the
        thread_group_cputime structure; called at interrupt from tick
        handlers, generally.  It depends on the "offsetof()" macro to
        know which field to update and on compiler optimization to
        remove the unused code paths in each case.
      * thread_group_cputime(), inline function that sums the time
        fields for all running CPUs (SMP) or snapshots the time fields
        (UP) into a passed structure.

I've changed the uniprocessor case to retain the dynamic allocation of
the thread_group_cputime structure as needed; this makes the code
somewhat more consistent between SMP and UP and retains the feature of
reducing overhead for processes that don't use interval timers.

In addition to fixing the hang, this change removes the overloading of
it_prof_expires for RLIMIT...
To: Frank Mayhar <fmayhar@...>
Cc: <roland@...>, <linux-kernel@...>
Date: Tuesday, April 1, 2008 - 2:45 pm

On Fri, 28 Mar 2008 15:46:40 -0700

kernel/sys.c: In function 'sys_times':
kernel/sys.c:885: error: 'sig' undeclared (first use in this function)
--
To: Andrew Morton <akpm@...>
Cc: <roland@...>, <linux-kernel@...>
Date: Tuesday, April 1, 2008 - 5:46 pm

Thanks.  As I said privately, I don't know how this snuck in but it's
certainly time to blow away my build tree and reapply the next patch
from scratch.

Speaking of which, expect that next patch in a day or two.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Frank Mayhar <fmayhar@...>
Cc: Roland McGrath <roland@...>, <linux-kernel@...>, Peter Zijlstra <a.p.zijlstra@...>, Thomas Gleixner <tglx@...>
Date: Friday, March 28, 2008 - 6:28 am

return 0 is the proper form - please run your patch through 
scripts/checkpatch.pl.

	Ingo
--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Monday, March 24, 2008 - 1:34 pm

Well, if it's acceptable, for a first cut (and the patch I'll submit),
I'll handle the UP and SMP cases, encapsulating them in sched.h in such
a way as to make it invisible (as much as is possible) to the rest of

I've actually done this, although I didn't find the numbers particularly


After looking at the code again, I now understand what you're talking
about.  You overloaded it_*_expires to support both the POSIX interval
timers and RLIMIT_CPU.  So the way I have things, setting one can stomp

Right, because the original effect was to only set the it_*_expires on
each individual task struct, leaving the one in the signal struct alone.

Might it be cleaner to handle the RLIMIT_CPU stuff separately, rather
than rolling it into the itimer handling?
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Monday, March 31, 2008 - 1:44 am

For clarity, please never mention that identifier without indicating which
struct you're talking about.  signal_struct.it_*_expires has never been
overloaded.  signal_struct.it_prof_expires is the ITIMER_PROF setting;
signal_struct.it_virt_expires is the ITIMER_VIRTUAL setting; there is no
signal_struct.it_sched_expires field.  task_struct.it_*_expires has never
been overloaded.  task_struct.it_prof_expires is the next value of
(task_struct.utime + task_struct.stime) at which run_posix_cpu_timers()

It is not "rolled into the itimer handling".  
run_posix_cpu_timers() handles three separate features:

1. ITIMER_VIRTUAL, ITIMER_PROF itimers (setitimer/getitimer)
2. POSIX timers CPU timers (timer_* calls)
3. CPU time rlimits (RLIMIT_CPU for process-wide, RLIMIT_RTTIME for each thread)

The poorly-named task_struct.it_*_expires fields serve a single purpose:
to optimize run_posix_cpu_timers().  task_struct.it_prof_expires is the

I don't see the point of adding this field at all.  It serves solely to
cache the secs_to_cputime calculation on the RLIMIT_CPU rlim_cur value,

This does not make sense.  There is no need for any new state at all in
the UP case, just reorganizing what is already there.

The existing signal_struct fields utime, stime, and sum_sched_runtime are
no longer needed.  These accumulate the times of dead threads in the group
(see __exit_signal in exit.c) solely so cpu_clock_sample_group can add
them in.  Keeping both those old fields and the dynamically allocated
per-CPU counters is wrong.  You will count double all the threads that
died since the struct was allocated (i.e. since the first timer was set).

For the UP case, just replace these with a single struct thread_group_cputime
in signal_struct, and increment it directly on every tick.  __exit_signal
never touches it.

For the SMP case, you need a bit of complication.  When there are no
expirations (none of the three features in use on a process-wide CPU
clock) or only one live thread, then you don't ne...
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Monday, March 31, 2008 - 4:24 pm

Roland, I'm very much having to read between the lines of what you've
written.  And, obviously, getting it wrong at least half the time. :-)

So you've cleared part of my understanding with your latest email.
Here's what I've gotten from it:

        struct task_cputime {
        	cputime_t utime;		/* User time. */
        	cputime_t stime;		/* System time. */
        	unsigned long long sched_runtime; /* Scheduler time. */
        };
        
This is for both SMP and UP, defined before signal_struct in sched.h
(since that structure refers to this one).  Following that:

        struct thread_group_cputime;

Which is a forward reference to the real definition later in the file.
The inline functions depend on signal_struct and task_struct, so they
have to come after:

        #ifdef SMP

        struct thread_group_cputime {
        	struct task_cputime *totals;
        };

        &lt; ... inline functions ... &gt;

        #else /* SMP */

        struct thread_group_cputime {
        	struct task_cputime totals;
        };

        &lt; ... inline functions ... &gt;

        #endif

The SMP version is percpu, the UP version is just a substructure.  In
signal_struct itself, delete utime &amp; stime, add
        struct thread_group_cputime cputime;

The inline functions include the ones you defined for UP plus equivalent
ones for SMP.  The SMP inlines check the percpu pointer
(sig-&gt;cputime.totals) and don't update if it's NULL.  One small
correction to one of your inlines, in thread_group_cputime:
                *cputime = sig-&gt;cputime;
should be
                *cputime = sig-&gt;cputime.totals;

A representative inline for SMP is:

        static inline void account_group_system_time(struct task_struct *task,
        					      cputime_t cputime)
        {
        	struct task_cputime *times;
        
        	if (!sig-&gt;cputime.totals)
        		return;
        	times = per_cpu_ptr(sig-&gt;cputime.totals, get_cpu());
        	times-&gt;stime...
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Tuesday, April 1, 2008 - 10:07 pm

I'm sorry I wasn't able to communicate more clearly.  Please ask me to
elaborate whenever it would help.  I hope the experience will inspire
you to add some good comments to the code.  If the comments come to
explain everything you now wish had been brought to your attention

This is what I'd called it originally (based on the sched_clock() name).
Now it's an aggregate of task_struct.se.sum_exec_runtime, so perhaps

There is no need for that.  This struct is used directly in signal_struct
and so has to be defined before it anyway.  (And when a struct is used
only as the type of a pointer in another type declaration, no forward


Right.  And if we were to try different variants based on NR_CPUS or
whatever, that would entail just a new struct thread_group_cputime


By do_setitimer, you mean set_process_cpu_timer and posix_cpu_timer_set.

In the last message I said "there are several options" for dealing with
the dead-threads issue on SMP and did not go into it.  This is one option.
It is attractive in being as lazy as possible.  But I think it is a
prohibitively dubious proposition to do any allocation in __exit_signal.
That is part of the hot path for recovering from OOM situations and all
sorts of troubles; it's also expected to be reliable so that I am not
comfortable with the bookkeeping dropping bits in extreme conditions
(i.e. when you cannot possibly succeed in the allocation and have to
proceed without it or ultimately wedge the system).

The first thing to do is move the existing summation of utime, stime, and
sum_exec_runtime in __exit_signal into an inline thread_group_cputime_exit.
That abstracts it into the set of inlines that can vary for the different
flavors of storage model.  For UP, it does nothing.

1. One set of options keeps static storage for dead-threads accounting, as we
   have now.  i.e., utime, stime, sum_sched_runtime in signal_struct are
   replaced by a "struct task_cputime dead_totals;" in thread_group_cputime.

   a. thread_group_cputime_exit ...
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 2:42 pm

And another quick note:  It appears that with the "allocate percpu
storage in copy_signal CLONE_THREAD case" mechanism, I don't need to
worry about allocating it anywhere else.  If I need it (which is only in
the case of multiple threads and an interval timer) then I'll have it
because it was allocated with the second thread.  So I just eliminate
the allocation in do_setitimer() entirely.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 1:42 pm

One quick note:  this inline isn't needed for the 2b solution (allocate
percpu storage in copy_signal CLONE_THREAD case), since if there's more
than one thread there'll always be a percpu area and if there's only one
thread the summation code won't be entered.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 3:48 pm

That's true.  I still think it's a good idea to have it, even if it winds
up being empty in the variants we really use.  The principle is that the
new set of types/functions could be used to implement exactly what we have
now.  In fact, it's usually best to do a series of small patches that start


Again, I'd leave the call to the inline that would do it.
For this implementation plan, its body is:
	BUG_ON(!task-&gt;signal-&gt;cputime.totals &amp;&amp; !thread_group_empty(task));


Thanks,
Roland
--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 4:34 pm

Ah, okay.  Well, except that the whole point of this exercise is to fix

BTW I did look at allocating it in posix_cpu_timer_set() and
set_process_cpu_timer() but the first at least is doing stuff with locks
held.  I'll keep looking at it, though.

One little gotcha we just ran into, though:  When checking
tsk-&gt;signal-&gt;(anything) in run_posix_cpu_timers(), we have to hold
tasklist_lock to avoid a race with release_task().  This is going to
make even the null case always cost more than before.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Friday, April 4, 2008 - 7:17 pm

Yeah, it's a little sticky.  It would be simple enough to arrange things to
do the allocation when needed before taking locks, but would not make the
code wonderfully self-contained.  I wouldn't worry about it unless/until we
conclude for other reasons that this really is the best way to go.  We seem

This is reminiscent of something that came up once before.  I think it
was the same issue of what happens on a tick while the thread is in the
middle of exiting and racing with release_task.  See commit
72ab373a5688a78cbdaf3bf96012e597d5399bb7, commit
3de463c7d9d58f8cf3395268230cb20a4c15bffa and related history (some of

This is a big can of worms that we really don't need.  Complicating the
data structure handling this way is really not warranted at all just to
address this race.  You'll just create another version of the same race
with a different pointer, and then solve it some simple way that you
could have just used to solve the existing problem.  If you don't have
some independent (and very compelling) reasons to reorganize the data
structures, nix nix nix.

We can make posix_cpu_timers_exit() or earlier in the exit/reap path
tweak any state we need to ensure that this problem won't come up.
But, off hand, I don't think we need any new state.

Probably the right fix is to make the fast-path check do:

	rcu_read_lock();
	signal = rcu_dereference(current-&gt;signal);
	if (unlikely(!signal) || !fastpath_process_timer_check(signal)) {
		rcu_read_unlock();
		return;
	}
	sighand = lock_task_sighand(current, &amp;flags);
	rcu_read_unlock();
	if (likely(sighand))
		slowpath_process_timer_work(signal);
	unlock_task_sighand(sighand, &amp;flags);

Another approach that is probably fine too is just to do:

	if (unlikely(current-&gt;exit_state))
		return;

We can never get to the __exit_signal code that causes the race if we
are not already late in exit.  The former seems a little preferable
because the added fast-path cost is the same (or perhaps even more
cache-friendly), bu...
To: Roland McGrath <roland@...>
Cc: Frank Mayhar <fmayhar@...>, <linux-kernel@...>
Date: Sunday, April 6, 2008 - 1:26 am

Well, if we allocate at clone time then some things do get a bit
simpler.  (Hmm, for some reason I was thinking that we still allocate in

Yeah, I checked that out.  The one difference here is that that was a
race between do_exit() (actually release_task()/__exit_signal()) and
run_posix_cpu_timers().  While this race was the same in that respect,
there was also a race between all of the timer-tick routines that call
any of account_group_user_time(), account_group_system_time() or
account_group_exec_runtime() and __exit_signal().  This is because those

Erm, well, this isn't reorganizing the data structures per se, since
these are new data structures.  Regardless, the way I did this was by
making the refcounted structure follow the lifetime of the task
structure.  It's instantiated when needed and at that point each task
structure in the thread group gets a reference, which is released when
that task structure is destroyed.  The last release destroys the data
structure.

The upshot of this is that none of the timer routines dereference
tsk-&gt;signal, so the races go away, no locking needed.  From my
perspective this was the simplest solution, since lock dependency
ordering is _really_ a can of worms.


That's certainly a possibility, but it'll still require more code in the

I'll have to look at the code some more and thing about this.  For the
other timer routines I would think that the second approach would be
better, but I still think that best of all is to not have to check
anything at all (except of course the presence of the structure pointer
itself).

Regarding the second approach, without locking wouldn't that still be
racy?  Couldn't exit_state change (and therefore __exit_signal() run)
between the check and the dereference?
-- 
Frank Mayhar frank@exit.com     http://www.exit.com/
Exit Consulting                 http://www.gpsclock.com/
                                http://www.exit.com/blog/frank/
                                http://www.zazzle.com/fmayhar*
-...
To: <frank@...>
Cc: Frank Mayhar <fmayhar@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 4:08 pm

The essence that matters is the same: something that current does with its

Tomato, tomato.  You're adding new data structures and lifetime rules to
replace data that was described in a different data structure before, yet
your new data's meaningful semantic lifetime exactly matches that of
signal_struct.  You could as well make everything release_task cleans up be
done in __put_task_struct instead, but that would not be a good idea
either.  You've added a word to task_struct (100000 words per 100000-thread

With the perspective of tunnel vision to just your one test case, adding
something entirely new considering only that case always seems simplest.
That's not how we keep the entire system from getting the wrong kinds of

No.  current-&gt;exit_state can go from zero to nonzero only by current
running code in the do_exit path.  current does not progress on that
path while current is inside one update_process_times call.


Thanks,
Roland
--
To: Roland McGrath <roland@...>
Cc: <frank@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 5:27 pm

Okay.  One of the paths to the update code is through update_curr() in
sched_fair.c, which (in my tree) calls account_group_exec_runtime() to
update the sum_exec_runtime field:

	delta_exec = (unsigned long)(now - curr-&gt;exec_start);

	__update_curr(cfs_rq, curr, delta_exec);
	curr-&gt;exec_start = now;

	if (entity_is_task(curr)) {
		struct task_struct *curtask = task_of(curr);

		cpuacct_charge(curtask, delta_exec);
		account_group_exec_runtime(curtask, delta_exec);
	}

To make sure that I understand what's going on, I put an invariant at
the beginning of account_group_exec_runtime():

static inline void account_group_exec_runtime(struct task_struct *tsk,
					       unsigned long long ns)
{
	struct signal_struct *sig = tsk-&gt;signal;
	struct task_cputime *times;

	BUG_ON(tsk != current);
	if (unlikely(tsk-&gt;exit_state))
		return;
	if (!sig-&gt;cputime.totals)
		return;
	times = per_cpu_ptr(sig-&gt;cputime.totals, get_cpu());
	times-&gt;sum_exec_runtime += ns;
	put_cpu_no_resched();
}

And, you guessed it, the invariant gets violated.  Apparently the passed
task_struct isn't the same as "current" at this point.

Any ideas?  Am I checking the wrong thing?  If we're really not updating
current then the task we are updating could very easily be running
through __exit_signal() on another CPU.  (And while I wait for your
response I will of course continue to try to figure this out.)
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Frank Mayhar <fmayhar@...>
Cc: <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 6:49 pm

My explanation about the constraints on exit_state was specifically about
the context of update_process_times(), which is part of the path for a

The scheduler code has gotten a lot more complex since I first implemented
posix-cpu-timers, and I've never been any expert on the scheduler at all.
But I'm moderately sure all those things are all involved in context
switch where the task of interest is about to be on a CPU or just was on a
CPU.  I doubt those are places where the task in question could be
simultaneously executing in exit_notify() on another CPU.  But we'd need
to ask the scheduler experts to be sure we know what we're talking about

I haven't figured out what actual code path this refers to.


This sort of concern is among the reasons that checking -&gt;signal was the
course I found wise to suggest to begin with.  We can figure out what the
constraints on -&gt;exit_state are in all the places by understanding every
corner of the scheduler.  We can measure whether it winds up being in a
cooler cache line than -&gt;signal and a net loss to add the load, or has
superior performance as you seem to think.  Or we can just test the
constraint that matters, whether the pointer we loaded was in fact null,
and rely on RCU to make it not matter if there is a race after that load.
It doesn't matter whether tsk is current or not, it only matters that we
have the pointer and that we're using some CPU array slot or other that
noone else is using simultaneously.

    static inline void account_group_exec_runtime(struct task_struct *tsk,
						  unsigned long long runtime)
    {
	    struct signal_struct *sig;
	    struct task_cputime *times;

	    rcu_read_lock();
	    sig = rcu_dereference(tsk-&gt;signal);
	    if (likely(sig) &amp;&amp; sig-&gt;cputime.totals) {
		    times = per_cpu_ptr(sig-&gt;cputime.totals, get_cpu());
		    times-&gt;sum_exec_runtime += runtime;
		    put_cpu_no_resched();
	    }
	    rcu_read_unlock();
    }


Thanks,
Roland
--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 9, 2008 - 12:29 pm

This was my conclusion as well.  Certainly the path through do_fork()
(elucidated below) doesn't even allow the task in question to even be
executing, much less on a different CPU, but all these routines with
"struct task_struct *" parameters make me nervous.  Which is why I

do_fork=&gt;wake_up_new_task=&gt;task_new_fair=&gt;

Well, as much as I would like to take the time to do that, I do have a

Yeah, agreed.  Of course, I was hoping (in vain, apparently) to avoid
this level of overhead here.  And I suspect I'll really have to do it in
each of these routines.  But I suppose it can't be helped.

Even with a thorough understanding of the scheduler(s) and code based on
that understanding, we would still not (necessarily) be protected from
future changes that violate the assumptions we make on that basis.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <frank@...>, <linux-kernel@...>
Date: Tuesday, April 8, 2008 - 5:52 pm

Found the exception.  do_fork() violates the invariant when it's
cranking up a new process.  Hmmm.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <frank@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 5:31 pm

This is one thing that has been unclear.  The relationship of
signal_struct to task_struct is, as far as I can tell, an unwritten one.
Certainly the interrupt routines are adjusting values that live only
inside task_struct and (with the exception of run_posix_cpu_timers())

While true, that's not the only reason to do it.  The tradeoff here is
between performance (i.e. having to do checks before dereferencing
tsk-&gt;signal) versus space.  It's really a judgment call.  (Although


This isn't exactly how I would state it but yes, this is generally true
as well.  The problem is that knowing exactly what is "the wrong kinds"
relies on knowledge possessed by only a few.  Prying that knowledge out

Well, okay, this is the vital bit of data that puts everything above
into perspective.  Had I known this, I would not have made the change I
did.

I guess the key bit of knowledge is that a "task" is really a scheduling
unit, right?  And, really, from the scheduler's perspective, "task" is
the same as "thread."  The only thing that makes a set of threads into a
multithreaded process is that they share a signal struct (well, and
their memory map, of course).  So a "task" can only be executed on a
single cpu at any time, it can't be executed on more than one cpu at a
time.  Therefore if a "task" is executing and is interrupted, the value
of "current" at the interrupt will be that task, which is entirely
suspended for the duration of the interrupt.

Is this correct?  (This is not just for this fix, but for my general
understanding of Linux scheduling.)

Unfortunately, these things are often implicit in the code but as far as
I know aren't written down anywhere.  This whole exercise has been for
me a process of becoming really familiar with the internals of the Linux
kernel for the first time.
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Frank Mayhar <fmayhar@...>
Cc: <frank@...>, <linux-kernel@...>
Date: Monday, April 7, 2008 - 6:02 pm

No, the performance idea there is a myth.  You're talking about one test
and branch-not-taken for a word you're already loading into a register
right there anyway (if testing -&gt;signal).  It's maybe two cycles that were
most likely already idle in a load stall.  The cache effects alone of

But when it comes out, it's flying at high velocity straight into your face!


There are several other things that are implicitly required to be shared
when signal_struct is shared, too.  But approximately, yes.  Had I been in
charge of the world, task_struct would be 'struct thread' and signal_struct
would be 'struct process'.  (Cue left-field flames from the peanut gallery


Everything I know I learned from reading the source.  So I sympathize
with the sense of starting out lost without bearings, but I may be a
little hard-hearted about anyone wanting more than their eyeballs and
full-text searching to find their own bootstraps and pull (in my day,
it was up hill both ways, and all that).


Thanks,
Roland
--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 5:42 pm

This race, by the way, is because we're dereferencing task-&gt;signal at
interrupt once per tick.  We ran into a case where a process was going
through release_task() and being torn down on one CPU while running a
timer tick on another.  Under load.  It's not a very likely race but
with sufficient time or load it's pretty much inevitable.

My thought is to move thread_group_cputime from the signal structure to
hanging directly off the task structure.  It would be shared in the same
way as the signal structure is now but would be deallocated with the
task structure rather than the signal structure.  This should mean that
I could avoid getting tasklist_lock under most conditions.

Thoughts?
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Thursday, April 3, 2008 - 8:53 pm

Okay, having run face-first into this race and having every combination
of spinlock serialization fail for me, I've done a variation of the
above scheme.

For the local environment, I solved the problem by moving the percpu
structure out of the signal structure entirely and by making it
refcounted.  It is allocated as before, but now in two parts, a normal
structure with an atomic refcount that has a pointer to the percpu
structure.  The signal structure doesn't point to it any longer, but
each task_struct in the thread group does, and each of these references
is counted.  New threads will also get a reference (at the top of
copy_signal()) and be counted.  All access goes through the task
structure.  References are removed in __put_task_struct() when the task
itself is destroyed; when the last reference goes away, the structures
are freed.

This eliminates the races with signal_struct being freed and has the
nice effect that there is a bit less overhead in places like
account_group_user_time() and friends.  In run_posix_cpu_timers(),
though, I have to pick up the tasklist_lock early (and therefore in
every case) because it's still dereferencing tsk-&gt;signal in the early
comparison.

I'm thinking about moving all of the itimer stuff (i.e. the
cputime_expires structures) into the refcounted structure as well, thus
avoiding the signal_struct entirely so we don't need the tasklist_lock
in the fast path.  I don't know how any of this will affect the UP case,
though.  I'll have to continue to think about it and I'm sure you have
something to say as well.  (And if anyone else wants to chime in,
they're welcome.)
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Wednesday, April 2, 2008 - 12:34 pm

At least it's beginning to make sense.  Rather than worrying overmuch
about commentary, though, I'll think about writing an explanation for
the Documentation directory.  The implementation is pretty scattered so
there's no obvious single place for good commentary; a separate document

Yeah, I remembered this finally.  The structures are declared before
signal_struct in all cases, with the inlines appearing (much) later in
the file.

Here are the declarations (starting at around line 425; the inlines
start at around line 2010 or so):

        struct task_cputime {
        	cputime_t utime;			/* User time. */
        	cputime_t stime;			/* System time. */
        	unsigned long long sum_exec_runtime;	/* Scheduler time. */
        };
        /* Alternate field names when used to cache expirations. */
        #define prof_exp	stime
        #define virt_exp	utime
        #define sched_exp	sum_exec_runtime
        
        #ifdef CONFIG_SMP
        struct thread_group_cputime {
        	struct task_cputime *totals;
        };
        #else
        struct thread_group_cputime {
        	struct task_cputime totals;
        };
        #endif

The #defines there are for when I refer to the task_cputime fields as
expiration values, to make it a little more self-explanatory (and less
confusing).  I can remove them if it violates a Linux kernel coding


Yeah, that bothered me, too.  It's obviously impossible to do the
allocate itself in __exit_signal() itself (since the caller,
release_task() is holding tasklist_lock) and doing it in release_task()

This seems kind of hacky to me, particularly the bit about subtracting

This is a little better and is along the lines I was thinking at one
point, but it still has the dead_totals field which becomes redundant
when the percpu structure is allocated.  I would rather eliminate that
field entirely, especially since we're growing signal_struct in other


I had also thought hard about doing this.  One the one hand, it makes
dealing w...
To: Roland McGrath <roland@...>
Cc: <linux-kernel@...>
Date: Monday, March 24, 2008 - 6:43 pm

Okay, my proposed fix for this is to introduce a new field in
signal_struct, rlim_expires, a cputime_t.  Everywhere that the
RLIMIT_CPU code formerly set it_prof_expires it will now set
rlim_expires and in run_posix_cpu_timers() I'll check it against the
thread group prof_time.

I believe that that will solve the problem, if I understand this
correctly.  If I don't, I trust that you will set me straight. :-)
-- 
Frank Mayhar &lt;fmayhar@google.com&gt;
Google, Inc.

--
Previous thread: none

Next thread: Re: [RFC] ext3 freeze feature by Takashi Sato on Wednesday, February 6, 2008 - 9:05 pm. (1 message)