On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:Yes, four flip states per stage, and four stages per grace period: rcu_try_flip_idle_state: Stay here if nothing is happening. Flip the counter if something starts happening. rcu_try_flip_waitack_state: Wait here for all CPUs to notice that the counter has flipped. This prevents the old set of counters from ever being incremented once we leave this state, which in turn is necessary because we cannot test any individual counter for zero -- we can only check the sum. rcu_try_flip_waitzero_state: Wait here for the sum of the old per-CPU counters to reach zero. rcu_try_flip_waitmb_state: Wait here for each of the other CPUs to execute a memory barrier. This is necessary to ensure that these other CPUs really have finished executing their RCU read-side critical sections, despite their CPUs wildly reordering memory. This state They are needed to allow memory barriers to be removed from rcu_read_lock(), to allow for the fact that the stages mean that grace-period boundaries are now fuzzy, as well as to account for the fact that each CPU now has its own callback queue. (Aside: classic RCU doesn't have memory barriers -- or anything else -- in rcu_read_lock() and rcu_read_unlock(), but from an implementation viewpoint, the read-side critical section extends forwards and backwards to the next/previous context switches, which do have lots of memory barriers.) Taking the reasons for stages one at a time: 1. The first stage is needed for generic RCU. Everyone must complete any pre-existing RCU read-side critical sections before the grace period can be permitted to complete. This stage corresponds to the "waitlist" and "waittail" in earlier RCU implementations. 2. An additional stage is needed due to the fact that memory reordering can cause an RCU read-side critical section's rcu_dereference() to be executed before the rcu_read_lock() -- no memory barriers! This can result in the following sequence of events: o rcu_dereference() from CPU 0. o list_del_rcu() from CPU 1. o call_rcu() from CPU 1. o counter flip from CPU 2. o rcu_read_lock() from CPU 0 (delayed due to memory reordering). CPU 0 increments the new counter, not the old one that its rcu_dereference() is associated with. o All CPUs acknowledge the counter flip. o CPU 2 discovers that the old counters sum to zero (because CPU 0 has incremented its new counter). o All CPUs do their memory barriers. o CPU 1's callback is invoked, destroying the memory still being used by CPU 0. An additional pass through the state machine fixes this problem. Note that the read-after-write misordering can happen on x86, as well as on s390. 3. An additional stage is also needed due to the fact that each CPU has its own set of callback queues. This means that callbacks that were registered quite some time after the most recent counter flip can make it into the current round of the state machine, as follows: o CPU 0 flips the counter. o CPU 1 does an rcu_read_lock() followed by an rcu_dereference(). CPU 1 of course uses the new value of the counter. o CPU 2 does a list_del_rcu() on the element that CPU 1 did the rcu_dereference on. o CPU 2 does a call_rcu(), placing the callback on its local queue. o CPU 2 gets its scheduling-clock interrupt, and acknowledges the counter flip, also moving its "next" list (which contains the new callback) to the "wait" list. o The old counters sum to zero, and everyone does a memory barrier, and the callback is invoked. Too bad that CPU 0 is still using the data element just freed! (And yes, I do understand that this scenario is incompatible with the previous scenario. But the point of these scenarios is to demonstrate that a given effect can result in failure, not to show all possible misfortunes that the effect can cause. If you wish to try to convince me that the two causes underlying these symptoms cannot be combined to create a failure spanning two full flip intervals, then the task before you is to prove that there is no such scenario.) This scenario can happen even on sequentially consistent machines. 4. An additional stage is needed due to the fact that different CPUs can disagree as to when the counter flip happened. This effect is in some way similar to #3, but can (and did) happen even when there is only one global callback queue. I believe that this can happen even on sequentially consistent machines, but am not 100% certain of this. So, four mechanisms for going bad mapped to four states in the grace-period pipeline. It is quite possible that a more-complex state machine would require fewer stages, but then again it might not end up being any faster overall. :-/ It is also quite possible that one or another of the additional stages covers more than one of the above mechanisms. So, we know just from the above scenarios that at least two stages are required, and the most that they can possibly require is four stages. I have looked very hard for other underlying sources of badness, and am quite sure that I have them covered (famous last words). I will rerun with GP_STAGES==3 on POWER to double-check -- it is entirely possible that the last such run preceded some bug removal. On the other hand, there are most definitely relatively straightforward ways to speed things up on uniprocessor machines, for example, skipping the acknowledgement and memory-barrier states, short-circuiting synchronize_rcu() and friends, and so on. But I need to avoid premature optimization here -- one thing at a time. We have to remain API-compatible with classic RCU, where the separate grace-period types are absolutely required if the networking stack is to hold up against certain types of DOS attacks. Networking uses call_rcu_bh(), which has quiescent states at any point in the code that interrupts are enabled. It might well be possible to fuse the _bh variant into the classic variant for CONFIG_PREEMPT (as opposed to CONFIG_PREEMPT_RT) kernels, but as far as I know, no one has tried this. And there is little motivation to try, from what I can see. OK, good point either way. Critical portions of the GP protection happen in the scheduler-clock interrupt, which is a hardirq. For example, the .completed counter is always incremented in hardirq context, and we cannot tolerate a .completed increment in this code. Allowing such an increment would defeat the counter-acknowledge state in the state machine. There are ways to avoid the interrupt disabling, but all the ones that I am aware of thus far have grace-period-latency consequences. Hey!!! If it wasn't nuts, I probably wouldn't be working on it!!! ;-) Yeah, the current implementation of synchronize_sched() is a really bad idea on a 4096-CPU system, and probably much smaller. The base RCU implementation is now probably good to several tens of CPUs, perhaps even to a couple hundred. In contrast, the implementation currently in -rt might start choking pretty badly much earlier due to the single global callback queue. Thanx, Paul -
| Scott Preece | Re: Linux Foundation Technical Advisory Board Elections |
| Luis R. Rodriguez | Re: [Announce] Linux-tiny project revival |
| Andrew Morton | 2.6.23-rc1-mm2 |
| Dave Hansen | [PATCH 02/24] rearrange may_open() to be r/o friendly |
git: | |
| David Miller | [GIT]: Networking |
| David Miller | Re: [BUG] New Kernel Bugs |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
