> My first patch did essentially what you outlined above, incrementing

There is a 1:1 correspondence between "shares signal_struct" and "member of same thread group". signal_struct is the right place for such new fields.

Don't be confused by the existing fields utime, stime, gtime, and sum_sched_runtime. All of those are accumulators only touched when a non-leader thread dies (in __exit_signal), and governed by the siglock. Their only purpose now is to represent the threads that are dead and gone when calculating the cumulative total for the whole thread group. If you were to provide cumulative totals that are updated on every tick, then these old fields would not be needed.

The task_struct.group_leader field is never NULL. Every thread is a member of some thread group. The degenerate case is that it's the only member of the group; then p->group_leader == p.

It sounds like you sped up only one of the sampling loops. Having a cumulative total already on hand means cpu_clock_sample_group can also become simple and cheap, as can the analogues in do_getitimer and k_getrusage. These are what's used in clock_gettime and in the timer manipulation calls, and in getitimer and getrusage.

That's all just gravy. The real benefit of having a cumulative total is for the basic logic of run_posix_cpu_timers (check_process_timers) and the timer expiry setup. It sounds like you didn't take advantage of the new fields for that. When a cumulative total is on hand in the tick handler, then there is no need at all to derive per-thread expiry times from group-wide CPU timers ("rebalance") either there or when arming the timer in the first place. All of that complexity can just disappear from the implementation. check_process_timers can look just like check_thread_timers, but consulting the shared fields instead of the per-thread ones for both the clock accumulators and the timers' expiry times. Likewise, arm_timer only has to set signal->it_*_expires; process_timer_rebalance goes away.
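To make the shape of that simplification concrete, here is a minimal user-space sketch, not kernel code. It assumes a single shared accumulator per thread group that every tick updates, so that the expiry check and the arming step touch only the shared fields. The names group_model, thread_model, thread_init, account_tick, arm_group_timer and group_timer_expired are hypothetical and exist only for this illustration; the model is single-threaded and ignores locking and SMP cache effects entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the shared per-group fields (not the real signal_struct). */
struct group_model {
    uint64_t cputime_total;   /* cumulative CPU time, updated on every tick */
    uint64_t it_prof_expires; /* 0 means no group-wide timer is armed */
};

/* Hypothetical stand-in for a thread (not the real task_struct). */
struct thread_model {
    uint64_t cputime;                  /* per-thread accumulator */
    struct thread_model *group_leader; /* never NULL; points to itself for a lone thread */
    struct group_model *group;         /* shared by every thread in the group */
};

/* Passing leader == NULL models the degenerate one-thread group. */
void thread_init(struct thread_model *t, struct thread_model *leader,
                 struct group_model *g)
{
    t->cputime = 0;
    t->group_leader = leader ? leader : t;
    t->group = g;
}

/* Charge one tick to the thread and to the shared cumulative total. */
void account_tick(struct thread_model *t, uint64_t delta)
{
    assert(t->group_leader != NULL);
    t->cputime += delta;
    t->group->cputime_total += delta;
}

/* Arming sets only the shared expiry; nothing is spread across threads. */
void arm_group_timer(struct group_model *g, uint64_t expires)
{
    g->it_prof_expires = expires;
}

/* Constant-time check: the cost does not depend on the number of threads. */
bool group_timer_expired(const struct group_model *g)
{
    return g->it_prof_expires != 0 &&
           g->cputime_total >= g->it_prof_expires;
}
```

In this model a process-wide check reads two shared fields and never walks the thread list, which is the property the paragraph above attributes to having a cumulative total on hand in the tick handler.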
If you do all that then the time spent in run_posix_cpu_timers should not be affected at all by the number of threads. The only "walking the timer lists" that happens is popping the expired timers off the head of the lists that are kept in ascending order of expiry time. For each flavor of timer, there are n+1 steps in the "walk" for n timers that have expired. So already no costs here should scale with the number of timers, just with the number of timers that all expire at the same time.

Back for a moment to the status quo and your second patch. What I would expect is that there be at most one item in the queue for each process (thread group). If you have 200000 threads in one process, you still only need one iteration of check_process_timers to run. If it hasn't run by the time more threads in the same group get more ticks, then all that matters is that it indeed runs once reasonably soon (for an overall effect of not much less often than once per tick interval).

I can help you with all of that. What I'll need from you is careful performance analysis of all the effects of any changes we consider. The simplifications I described above will obviously greatly improve your test case (many threads and with some process timers expiring pretty frequently). We need to consider and analyze the other kinds of cases too. That is, cases with a few threads (not many more than the number of CPUs); cases where no timer is close to expiring very often.

The most common cases, from one-thread cases to one-million-thread cases, are when no timers are going off before next Tuesday (if any are set at all). Then run_posix_cpu_timers always bails out early, and none of the costs you've seen become relevant at all. Any change to what the timer interrupt handler does on every tick affects those cases too.
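The "n+1 steps for n expired timers" point can be illustrated with a small user-space sketch, assuming a singly linked list kept in ascending order of expiry. The names timer_model, timer_insert and pop_expired are hypothetical and are not the kernel's data structures; the steps counter exists only to make the cost visible.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timer node; the list is kept in ascending order of expires. */
struct timer_model {
    uint64_t expires;
    struct timer_model *next;
};

/* Insert a timer, preserving ascending order of expiry time. */
void timer_insert(struct timer_model **head, struct timer_model *t)
{
    assert(head != NULL && t != NULL);
    while (*head != NULL && (*head)->expires <= t->expires)
        head = &(*head)->next;
    t->next = *head;
    *head = t;
}

/*
 * Pop every timer with expires <= now off the head of the list.
 * Returns the number of timers popped. *steps counts how many times
 * the list head was inspected, which is always the return value plus one:
 * one inspection per expired timer, plus the one that stops the walk.
 */
size_t pop_expired(struct timer_model **head, uint64_t now, size_t *steps)
{
    size_t fired = 0;

    assert(head != NULL && steps != NULL);
    *steps = 0;
    for (;;) {
        struct timer_model *t = *head;

        ++*steps;
        if (t == NULL || t->expires > now)
            break;          /* first unexpired timer (or empty list) ends the walk */
        *head = t->next;    /* unlink the expired timer */
        t->next = NULL;
        ++fired;
    }
    return fired;
}
```

With timers at 10, 20, 30 and 40 and now = 25, the walk inspects 10, 20 and 30 and stops: two timers popped in three steps, regardless of how many later timers remain on the list.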
As I mentioned in my last message, my concern about this originally was with the SMP cache/lock effects of multiple CPUs touching the same memory in signal_struct on every tick (which presumably all tend to happen simultaneously on all the CPUs). I'd insist that we have measurements and analysis as thorough as possible of the effects of introducing that frequent/synchronized sharing, before endorsing such changes. I have a couple of guesses as to what might be reasonable ways to mitigate that. But it needs a lot of measurement and wise opinion on the low-level performance effects of each proposal.

Thanks,
Roland
