posix-cpu-timers revamp

Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
From: Roland McGrath
Date: Tuesday, March 11, 2008 - 12:50 am

[I changed the subject and trimmed the CC list, as this is now quite far
away from the "some mysterious NPTL problem" subject.  If anyone else
wanted to be individually CC'd, you can add them back in followups.]


Not quite.  check_process_timers only needs to be run once per thread
group (per interesting tick).


run_posix_cpu_timers does two things: thread CPU timers and process CPU
timers.  The thread CPU timers track the thread CPU clocks, which are
what the per-thread fields in task_struct count.  check_thread_timers
finds what thread CPU timers have fired.  The task_struct.it_*_expires
fields are set when there are thread CPU timers set on those clocks.

The process CPU timers track the process CPU clocks.  Each process CPU
clock (virt, prof, sched) is just the sum of the corresponding thread
CPU clock across all threads in the group.  In the original code, these
clocks are never maintained in any storage as such, but sampled by
summing all the thread clocks when a current value is needed.
check_process_timers finds what process CPU timers have fired.  The
signal_struct.it_*_expires fields are set when there are process CPU
timers set on those clocks.

The "rebalance" stuff also sets the task_struct.it_*_expires fields of
all the threads in the group when there are process CPU timers.  So each
of these fields is really the minimum of the expiration time of the
earliest thread CPU timer and the "balanced" sample-and-update time
computed from the earliest process CPU timer's expiration time.


This is true of check_process_timers.


I do not follow your logic at all here.  The signal_struct fields being
proposed track each process CPU clock's value.  The thread CPU timers
react to the thread CPU clocks, not the process CPU clocks.


Correct.


I think the natural place for it is run_timer_softirq.  This is where
natural time timers run.  The posix-cpu-timers processing is analogous
to the posix-timers processing, which runs via timer callbacks from
here.  That roughly means doing it right after the interrupt normally,
and falling back to something similar to a workqueue when the load is
too heavy.  In the common case, it doesn't entail any context switches,
as workqueues always do.

The issue with either the workqueue or the softirq approach is that it
means the call will sometimes (softirq) or always (workqueue) be made by
another task.  Currently we always have current taking samples of its
own clocks and firing timers set on them.  (Because of this you can't
actually use softirq in the simple fashion, i.e. moving the call to
run_posix_cpu_timers.  It would only be guaranteed to run once per CPU
and wouldn't know which task it was supposed to look at.  You'd have to
keep a per-CPU list of tasks pending consideration, i.e. a workqueue.)

I can't tell you off hand about serious complications of doing this work
from another task rather than by current on current.  I think I might
have had some in mind when I did the original implementation, but I just
don't remember any more.  It seems like a potential can of worms.  My
first inclination is to do every other cleanup we like first before
touching this question.

Also note that the bulk of the work (and everything that's not O(1))
has to be done with interrupts disabled anyway.  That's necessary to
take siglock.  That lock both protects signal_struct, and it protects
the task_struct.cpu_timers lists.  (You can do a cheap and lossy test
on current->signal->it_*_expires without taking the lock, for the
nothing-fires fast path.)


What I said is only actually necessary once per thread group is the work
that check_process_timers does.  In the current style of code where
there is a loop through all the threads anyway, then you could in fact
weave in the check_thread_timers work there too and then all that would
only need to be done once per thread group per tick.  (But I don't think
that's what I suggested last time.)


Indeed.  That's why I said I would not endorse any patch that doesn't
address this up front, and show concrete measurements about this
overhead.  (To a first approximation, the overhead on every tick for
every task in the system is always more important than the complications
for tasks that are using any CPU timers, including ITIMER_PROF and
ITIMER_VIRTUAL.)


Exactly this is the first idea I had about this.  (I considered this in
the original implementation and decided for the first crack to err on
the side of no new code or data structures in the paths taken by every
thread, with the new hair only affecting processes that actually use any
process CPU timers.)

But this is not without its own issues.  Currently on my configuration
(64-bit) the utime, stime, sum_sched_runtime fields (which now only
accumulate the contributions to process CPU clocks of threads that are
already dead) take 24 bytes.  Throw in gtime (which is analogous in its
bookkeeping, but has no POSIX clock/timer interface to it) and call it
32 bytes.  That's out of 904 bytes total in signal_struct.

On some common configurations, SMP_CACHE_BYTES is 128 and NR_CPUS is 64.
So the obvious static addition to signal_struct would bloat it by 8192
bytes (i.e. 904 to 9096, or more than 10x), of which 96*NR_CPUS (2/3)
would be wasted even when you really have NR_CPUS running.  That is way
beyond the acceptable size (it's too big to even work right with the
kernel allocators), even if it weren't mostly wasted space.

This leads to some obvious follow-on ideas.  With some fancy
footwork, you could use a pointer to a separately allocated chunk of
only num_possible_cpus() * SMP_CACHE_BYTES.  You needn't allocate it
at all until the first timer is set on that process CPU clock.  That
makes the bloat smaller, and limits it to the processes that actually
need to check on each tick.

With all of this properly encapsulated in a struct and some helper
functions, it would be simple to conditionalize the implementation.
For uniprocessor kernels, clearly it would be preferable just to use
the existing signal_struct fields.  For NR_CPUS=2, it might be
reasonable to use the static aligned fields bloating signal_struct
(or perhaps for NR_CPUS * SMP_CACHE_BYTES < some threshold).  etc.

An alternative notion is to have single shared fields per clock in
signal_struct but add to them only at context switch.  If the threads
in the process don't all yield at the same time, then maybe that
works out ok for cache contention.  It's not a well-developed idea.

This all adds up to me thinking there is no simple answer.  I think
we need to consider several alternatives, and get real measurements
of their overheads in various workloads, etc.  I am hesitant to pick
any "simple" changes to put in the kernel before we have examined the
real trade-offs fully.


Thanks,
Roland
--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and S ..., Alejandro Riveira , (Thu Feb 7, 8:22 am)
Re: [Bugme-new] [Bug 9906] New: Weird hang with NPTL and S ..., Alejandro Riveira , (Thu Feb 7, 8:54 am)
posix-cpu-timers revamp, Roland McGrath, (Tue Mar 11, 12:50 am)
Re: posix-cpu-timers revamp, Frank Mayhar, (Tue Mar 11, 2:05 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Tue Mar 11, 2:35 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Thu Mar 13, 5:37 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Fri Mar 21, 12:18 am)
Re: posix-cpu-timers revamp, Frank Mayhar, (Fri Mar 21, 10:57 am)
Re: posix-cpu-timers revamp, Frank Mayhar, (Fri Mar 21, 1:40 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Sat Mar 22, 2:58 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Mon Mar 24, 10:34 am)
Re: posix-cpu-timers revamp, Frank Mayhar, (Mon Mar 24, 3:43 pm)
[PATCH 2.6.25-rc6] Fix itimer/many thread hang., Frank Mayhar, (Thu Mar 27, 5:52 pm)
Re: [PATCH 2.6.25-rc6] Fix itimer/many thread hang., Ingo Molnar, (Fri Mar 28, 3:28 am)
Re: posix-cpu-timers revamp, Roland McGrath, (Sun Mar 30, 10:44 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Mon Mar 31, 1:24 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Tue Apr 1, 7:07 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Wed Apr 2, 9:34 am)
Re: posix-cpu-timers revamp, Frank Mayhar, (Wed Apr 2, 10:42 am)
Re: posix-cpu-timers revamp, Frank Mayhar, (Wed Apr 2, 11:42 am)
Re: posix-cpu-timers revamp, Roland McGrath, (Wed Apr 2, 12:48 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Wed Apr 2, 1:34 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Wed Apr 2, 2:42 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Thu Apr 3, 5:53 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Fri Apr 4, 4:17 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Sat Apr 5, 10:26 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Mon Apr 7, 1:08 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Mon Apr 7, 2:31 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Mon Apr 7, 3:02 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Tue Apr 8, 2:27 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Tue Apr 8, 2:52 pm)
Re: posix-cpu-timers revamp, Roland McGrath, (Tue Apr 8, 3:49 pm)
Re: posix-cpu-timers revamp, Frank Mayhar, (Wed Apr 9, 9:29 am)