[I changed the subject and trimmed the CC list, as this is now quite far away from the "some mysterious NPTL problem" subject. If anyone else wanted to be individually CC'd, you can add them back in followups.]Not quite. check_process_timers only needs to be run once per thread group (per interesting tick). run_posix_cpu_timers does two things: thread CPU timers and process CPU timers. The thread CPU timers track the thread CPU clocks, which are what the per-thread fields in task_struct count. check_thread_timers finds what thread CPU timers have fired. The task_struct.it_*_expires fields are set when there are thread CPU timers set on those clocks. The process CPU timers track the process CPU clocks. Each process CPU clock (virt, prof, sched) is just the sum of the corresponding thread CPU clock across all threads in the group. In the original code, these clocks are never maintained in any storage as such, but sampled by summing all the thread clocks when a current value is needed. check_process_timers finds what process CPU timers have fired. The signal_struct.it_*_expires fields are set when there are process CPU timers set on those clocks. The "rebalance" stuff also sets the task_struct.it_*_expires fields of all the threads in the group when there are process CPU timers. So each of these fields is really the minimum of the expiration time of the earliest thread CPU timer and the "balanced" sample-and-update time computed from the earliest process CPU timer's expiration time. This is true of check_process_timers. I do not follow your logic at all here. The signal_struct fields being proposed track each process CPU clock's value. The thread CPU timers react to the thread CPU clocks, not the process CPU clocks. Correct. I think the natural place for it is run_timer_softirq. This is where natural time timers run. The posix-cpu-timers processing is analogous to the posix-timers processing, which runs via timer callbacks from here. That roughly means doing it right after the interrupt normally, and falling back to something similar to a workqueue when the load is too heavy. In the common case, it doesn't entail any context switches, as workqueues always do. The issue with either the workqueue or the softirq approach is that it means the call will sometimes (softirq) or always (workqueue) be made by another task. Currently we always have current taking samples of its own clocks and firing timers set on them. (Because of this you can't actually use softirq in the simple fashion, i.e. moving the call to run_posix_cpu_timers. It would only be guaranteed to run once per CPU and wouldn't know which task it was supposed to look at. You'd have to keep a per-CPU list of tasks pending consideration, i.e. a workqueue.) I can't tell you off hand about serious complications of doing this work from another task rather than by current on current. I think I might have had some in mind when I did the original implementation, but I just don't remember any more. It seems like a potential can of worms. My first inclination is to do every other cleanup we like first before touching this question. Also note that the bulk of the work (and everything that's not O(1)) has to be done with interrupts disabled anyway. That's necessary to take siglock. That lock both protects signal_struct, and it protects the task_struct.cpu_timers lists. (You can do a cheap and lossy test on current->signal->it_*_expires without taking the lock, for the nothing-fires fast path.) What I said is only actually necessary once per thread group is the work that check_process_timers does. In the current style of code where there is a loop through all the threads anyway, then you could in fact weave in the check_thread_timers work there too and then all that would only need to be done once per thread group per tick. (But I don't think that's what I suggested last time.) Indeed. That's why I said I would not endorse any patch that doesn't address this up front, and show concrete measurements about this overhead. (To a first approximation, the overhead on every tick for every task in the system is always more important than the complications for tasks that are using any CPU timers, including ITIMER_PROF and ITIMER_VIRTUAL.) Exactly this is the first idea I had about this. (I considered this in the original implementation and decided for the first crack to err on the side of no new code or data structures in the paths taken by every thread, with the new hair only affecting processes that actually use any process CPU timers.) But this is not without its own issues. Currently on my configuration (64-bit) the utime, stime, sum_sched_runtime fields (which now only accumulate the contributions to process CPU clocks of threads that are already dead) take 24 bytes. Throw in gtime (which is analogous in its bookkeeping, but has no POSIX clock/timer interface to it) and call it 32 bytes. That's out of 904 bytes total in signal_struct. On some common configurations, SMP_CACHE_BYTES is 128 and NR_CPUS is 64. So the obvious static addition to signal_struct would bloat it by 8192 bytes (i.e. 904 to 9096, or more than 10x), of which 96*NR_CPUS (2/3) would be wasted even when you really have NR_CPUS running. That is way beyond the acceptable size (it's too big to even work right with the kernel allocators), even if it weren't mostly wasted space. This leads to some obvious follow-on ideas. With some fancy footwork, you could use a pointer to a separately allocated chunk of only num_possible_cpus() * SMP_CACHE_BYTES. You needn't allocate it at all until the first timer is set on that process CPU clock. That makes the bloat smaller, and limits it to the processes that actually need to check on each tick. With all of this properly encapsulated in a struct and some helper functions, it would be simple to conditionalize the implementation. For uniprocessor kernels, clearly it would be preferable just to use the existing signal_struct fields. For NR_CPUS=2, it might be reasonable to use the static aligned fields bloating signal_struct (or perhaps for NR_CPUS * SMP_CACHE_BYTES < some threshold). etc. An alternative notion is to have single shared fields per clock in signal_struct but add to them only at context switch. If the threads in the process don't all yield at the same time, then maybe that works out ok for cache contention. It's not a well-developed idea. This all adds up to me thinking there is no simple answer. I think we need to consider several alternatives, and get real measurements of their overheads in various workloads, etc. I am hesitant to pick any "simple" changes to put in the kernel before we have examined the real trade-offs fully. Thanks, Roland --
| Jeremy Fitzhardinge | Re: [RFC 00/15] x86_64: Optimize percpu accesses |
| jmerkey | [ANNOUNCE] mdb: Merkey's Linux Kernel Debugger 2.6.27-rc4 released |
| Greg Kroah-Hartman | [PATCH 021/196] ISDN: Convert from class_device to device for ISDN capi |
| Ingo Molnar | Re: [PATCH 00/23] per device dirty throttling -v8 |
git: | |
| Linus Torvalds | Re: VCS comparison table |
| Peter Stahlir | Git as a filesystem |
| Johannes Schindelin | Re: git on MacOSX and files with decomposed utf-8 file names |
| Bill Lear | Meaning of "fatal: protocol error: bad line length character"? |
| Mayuresh Kathe | Re: What is our ultimate goal?? |
| Richard Stallman | Real men don't attack straw men |
| bofh | Re: web development on OpenBSD |
| Kevin | uvm_mapent_alloc: out of static map entries on 4.3 i386 |
| Mark Lord | Re: 2.6.25-rc8: FTP transfer errors |
| Evgeniy Polyakov | Re: [BUG] New Kernel Bugs |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Andi Kleen | [PATCH RFC] [1/9] Core module symbol namespaces code and intro. |
