Work in progress, still not for inclusion. But code now complete! This is a respin of the following prior posting: http://lkml.org/lkml/2007/9/5/268 This release adds an additional patch that adds fixes to comments and RCU documentation, along with one macro being renamed. The rcutorture patch has a modification to make it a bit more vicious to priority boosting (though the current design relies on -rt latencies for much of the priority-boost torturing effectiveness in this case -- run the test in presence of CPU hotplug operations to get the same effect in -mm). Next step is rebasing this to a more recent version of Linux. Thanx, Paul -
Work in progress, not for inclusion. Fix rcu_barrier() to work properly in preemptive kernel environment. Also, the ordering of callback must be preserved while moving callbacks to another CPU during CPU hotplug. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- rcuclassic.c | 2 +- rcupdate.c | 10 ++++++++++ 2 files changed, 11 insertions(+), 1 deletion(-) diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcuclassic.c linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c --- linux-2.6.22-a-splitclassic/kernel/rcuclassic.c 2007-07-19 15:03:51.000000000 -0700 +++ linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c 2007-07-19 17:10:46.000000000 -0700 @@ -349,9 +349,9 @@ static void __rcu_offline_cpu(struct rcu if (rcp->cur != rcp->completed) cpu_quiet(rdp->cpu, rcp); spin_unlock_bh(&rcp->lock); + rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail); rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail); rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail); - rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail); } static void rcu_offline_cpu(int cpu) diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcupdate.c linux-2.6.22-b-fixbarriers/kernel/rcupdate.c --- linux-2.6.22-a-splitclassic/kernel/rcupdate.c 2007-07-19 14:19:03.000000000 -0700 +++ linux-2.6.22-b-fixbarriers/kernel/rcupdate.c 2007-07-19 17:13:31.000000000 -0700 @@ -115,7 +115,17 @@ void rcu_barrier(void) mutex_lock(&rcu_barrier_mutex); init_completion(&rcu_barrier_completion); atomic_set(&rcu_barrier_cpu_count, 0); + /* + * The queueing of callbacks in all CPUs must be atomic with + * respect to RCU, otherwise one CPU may queue a callback, + * wait for a grace period, decrement barrier count and call + * complete(), while other CPUs have not yet queued anything. + * So, we need to make sure that grace periods cannot complete + * until all the callbacks are queued. + */ + rcu_read_lock(); ...
Work in progress, not for inclusion. This patch re-organizes the RCU code to enable multiple implementations of RCU. Users of RCU continues to include rcupdate.h and the RCU interfaces remain the same. This is in preparation for subsequently merging the preemptible RCU implementation. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcuclassic.h | 149 +++++++++++ include/linux/rcupdate.h | 151 +++--------- kernel/Makefile | 2 kernel/rcuclassic.c | 558 ++++++++++++++++++++++++++++++++++++++++++++ kernel/rcupdate.c | 561 ++------------------------------------------- 5 files changed, 779 insertions(+), 642 deletions(-) diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcuclassic.h linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h --- linux-2.6.22/include/linux/rcuclassic.h 1969-12-31 16:00:00.000000000 -0800 +++ linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700 @@ -0,0 +1,149 @@ +/* + * Read-Copy Update mechanism for mutual exclusion (classic version) + * + * This program is free software; you can redistribute it and/or modify + * it under the terms of the GNU General Public License as published by + * the Free Software Foundation; either version 2 of the License, or + * (at your option) any later version. + * + * This program is distributed in the hope that it will be useful, + * but WITHOUT ANY WARRANTY; without even the implied warranty of + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the + * GNU General Public License for more details. + * + * You should have received a copy of the GNU General Public License + * along with this program; if not, write to the Free Software + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA. + * + * Copyright IBM Corporation, 2001 + * + * Author: Dipankar Sarma <dipankar@in.ibm.com> + * + * Based on the original work ...
On Mon, Sep 10, 2007 at 11:32:08AM -0700, Paul E. McKenney wrote: Why is rcu_batches_completed and rcu_batches_completed_bh moved from rcupdate.h to rcuclassic.h? [ continued ...] -- Steve -
Work in progress, not for inclusion. This patch implements a new version of RCU which allows its read-side critical sections to be preempted. It uses a set of counter pairs to keep track of the read-side critical sections and flips them when all tasks exit read-side critical section. The details of this implementation can be found in this paper - http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf This patch was developed as a part of the -rt kernel development and meant to provide better latencies when read-side critical sections of RCU don't disable preemption. As a consequence of keeping track of RCU readers, the readers have a slight overhead (optimizations in the paper). This implementation co-exists with the "classic" RCU implementations and can be switched to at compiler. Also includes RCU tracing summarized in debugfs and RCU_SOFTIRQ for the preemptible variant of RCU. Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ) Signed-off-by: Paul McKenney <paulmck@us.ibm.com> --- include/linux/interrupt.h | 1 include/linux/rcuclassic.h | 2 include/linux/rcupdate.h | 7 include/linux/rcupreempt.h | 78 +++ include/linux/rcupreempt_trace.h | 100 +++++ include/linux/sched.h | 5 kernel/Kconfig.preempt | 38 + kernel/Makefile | 7 kernel/fork.c | 4 kernel/rcupreempt.c | 767 +++++++++++++++++++++++++++++++++++++++ kernel/rcupreempt_trace.c | 330 ++++++++++++++++ 11 files changed, 1336 insertions(+), 3 deletions(-) diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/interrupt.h linux-2.6.22-c-preemptrcu/include/linux/interrupt.h --- linux-2.6.22-b-fixbarriers/include/linux/interrupt.h 2007-07-08 16:32:17.000000000 -0700 +++ linux-2.6.22-c-preemptrcu/include/linux/interrupt.h 2007-08-22 15:21:06.000000000 -0700 @@ -269,6 +269,7 @@ ...
[ continued here from comment on patch 1] And here we put back rcu_batches_completed and rcu_batches_completed_bh from rcuclassic.h to rcupdate.h ;-) -- Steve -
Hmmm... Good point!!! I guess it would be OK to just leave them in rcupdate.h throughout. ;-) Will fix. And good eyes! Thanx, Paul -
Good questions :) I can't remember why I did this - probably because I was breaking up into classic and preemptible RCU in incremental patches with the goal that the break-up patch can be merged before the rcu-preempt patches. IIRC, I had to make *batches_completed*() a common RCU API later on. Thanks Dipankar -
I take it that GP stand for "grace period". Might want to state that here. /* Grace period stages */ When I was looking at this code at 1am, I kept asking myself "what's this GP?" (General Protection??). But Can you have a pointer somewhere that explains these states. And not a "it's in this paper or directory". Either have a short discription here, or specify where exactly to find the information (perhaps a Documentation/RCU/preemptible_states.txt?). Trying to understand these states has caused me the most agony in Nitpick, but other places in the kernel usually use "t" or "p" as a variable to assign current to. It's just that "me" thows me off a little while reviewing this. But this is just a nitpick, so do as you Isn't the GP detection done via a tasklet/softirq. So wouldn't a local_bh_disable be sufficient here? You already cover NMIs, which would Just trying to understand all this. Here at flip_idle, only a CPU with no pending RCU calls will flip it. Then all the cpus flags will be set to rcu_flipped, and the ctrl.completed counter is incremented. When a CPU calls process_callbacks, it would then move all it's callbacks to the next list (next -> wait[GP...] -> done), and set it's unique completed counter to completed. So no more moving up the lists will be done. It also will set it's state flag to rcu_flip_seen. Even if the cpu doesn't have any RCU callbacks, the state would be in rcu_try_flip_waitack_state so we go to the next switch case for a rcu_try_flip call. Also the filp counter has been flipped so that all new rcu_read_locks will increment the cpus "other" index. We just need to wait for the Now, we just wait to see if all the cpus have called process_callbacks and now have the flag rcu_flip_seen set. So all the cpus have pushed Now this is where we wait for the sum of all the CPUs counters to reach zero. The reason for the sum is that the task may have been preempted and migrated to another CPU and decremented that counter. So the one ...
On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <rostedt@goodmis.org> I thought the 4 flip states corresponded to the 4 GP stages, but now you confused me. It seems to indeed progress one stage for every 4 flip states. Hmm, now I have to puzzle how these 4 stages are required by the lock struct task_struct *curr = current; This is also my understanding, but I think this disable is an 'optimization' in that it avoids the regular IRQs from jumping through Its the large SMP case that's nuts, and on that I have to agree with Paul, its not really large SMP friendly. -
Yes, four flip states per stage, and four stages per grace period: rcu_try_flip_idle_state: Stay here if nothing is happening. Flip the counter if something starts happening. rcu_try_flip_waitack_state: Wait here for all CPUs to notice that the counter has flipped. This prevents the old set of counters from ever being incremented once we leave this state, which in turn is necessary because we cannot test any individual counter for zero -- we can only check the sum. rcu_try_flip_waitzero_state: Wait here for the sum of the old per-CPU counters to reach zero. rcu_try_flip_waitmb_state: Wait here for each of the other CPUs to execute a memory barrier. This is necessary to ensure that these other CPUs really have finished executing their RCU read-side critical sections, despite their CPUs wildly reordering memory. They are needed to allow memory barriers to be removed from rcu_read_lock(), to allow for the fact that the stages mean that grace-period boundaries are now fuzzy, as well as to account for the fact that each CPU now has its own callback queue. (Aside: classic RCU doesn't have memory barriers -- or anything else -- in rcu_read_lock() and rcu_read_unlock(), but from an implementation viewpoint, the read-side critical section extends forwards and backwards to the next/previous context switches, which do have lots of memory barriers.) Taking the reasons for stages one at a time: 1. The first stage is needed for generic RCU. Everyone must complete any pre-existing RCU read-side critical sections before the grace period can be permitted to complete. This stage corresponds to the "waitlist" and "waittail" in earlier RCU implementations. 2. An additional stage is needed due to the fact that memory reordering can cause an RCU read-side critical section's rcu_dereference() to be executed before the rcu_read_lock() -- no memory barriers! This can result in the following sequence of ...
No, I mean that both the rcu_batches_completed and rcu_batches_completed_bh are identical. Perhaps we can just put in a #define rcu_batches_completed_bh rcu_batches_completed in rcupreempt.h. In rcuclassic, they are different. But no need to have But isn't disabling irqs slower than doing a local_bh_disable? So the Hmm, that could be true. But on large SMP systems, you usually have a large amounts of memory, so hopefully a really long synchronize_rcu would not be a problem. -- Steve -
If you do a synchronize_rcu() it might well have to wait through the following sequence of states: Stage 0: (might have to wait through part of this to get out of "next" queue) rcu_try_flip_idle_state, /* "I" */ rcu_try_flip_waitack_state, /* "A" */ rcu_try_flip_waitzero_state, /* "Z" */ rcu_try_flip_waitmb_state /* "M" */ Stage 1: rcu_try_flip_idle_state, /* "I" */ rcu_try_flip_waitack_state, /* "A" */ rcu_try_flip_waitzero_state, /* "Z" */ rcu_try_flip_waitmb_state /* "M" */ Stage 2: rcu_try_flip_idle_state, /* "I" */ rcu_try_flip_waitack_state, /* "A" */ rcu_try_flip_waitzero_state, /* "Z" */ rcu_try_flip_waitmb_state /* "M" */ Stage 3: rcu_try_flip_idle_state, /* "I" */ rcu_try_flip_waitack_state, /* "A" */ rcu_try_flip_waitzero_state, /* "Z" */ rcu_try_flip_waitmb_state /* "M" */ Stage 4: rcu_try_flip_idle_state, /* "I" */ rcu_try_flip_waitack_state, /* "A" */ rcu_try_flip_waitzero_state, /* "Z" */ rcu_try_flip_waitmb_state /* "M" */ The current code absolutely must exclude the scheduling-clock hardirq Somewhere in the range from 64 to a few hundred CPUs, the global lock protecting the try_flip state machine would start sucking air pretty badly. But the real problem is synchronize_sched(), which loops through all the CPUs -- this would likely cause problems at a few tens of CPUs, perhaps as early as 10-20. Thanx, Paul -
-- Yes they do. I'm now at the point that I'm just "trusting" you that you understand that each of these stages are needed. My IQ level only lets me understand next -> wait -> done, but not the extra 3 shifts in wait. It's the app I watch sitting there waiting it's turn for it's callback to ACKed, hehe, From someone who's largest box is 4 CPUs, to me 16 CPUS is large. But I can see hundreds, let alone thousands of CPUs would make a huge grinding halt on things like synchronize_sched. God, imaging if all CPUs did that approximately at the same time. The system would should a huge jitter. -- Steve -
In the spirit of full disclosure, I am not -absolutely- certain that Well, the first time the SGI guys tried to boot a 1024-CPU Altix, I got an email complaining about RCU overheads. ;-) Manfred Spraul fixed things up for them, though. Thanx, Paul -
Covering the pieces that weren't in Peter's reply. ;-) My fingers do this without any help from the rest of me, but I suppose Good point, perhaps a comment block above the enum giving a short description of the purpose of each state. Maybe more detail in I apologize for the repetition in this email. I apologize for the repetition in this email. I apologize for the repetition in this email. Yep, will fix with either #define or static inline, as you suggested A later patch in the series fixes these -- I believe I got all of them. s/no pending RCU calls/at least one pending RCU call/, but otherwise spot on. So if the RCU grace-period machinery is idle, the first CPU to take a scheduling-clock interrupt after having posted an RCU callback will Yep. The wait-for-acknowledgement cycle is there to handle rcu_read_lock() invocations on other CPUs that executed concurrently with the counter Yep. Note that rcu_read_lock() might well be able to use its local .completed counter rather than the global one in rcu_ctrlblk, but (1) it is not clear that the reduction in cache misses would outweigh the per-CPU access overhead given that rcu_read_lock() happens a -lot- more often than does a counter flip and (2) doing this could well require Yep! More importantly, we know that no CPU will ever increment the The only way it could cause a problem would be if there was ever more than 4,294,967,296 outstanding rcu_read_lock() calls. I believe that lockdep screams if it sees more than 30 nested locks within a single task, so for systems that support no more than 100M tasks, we should be OK. It might sometime be necessary to make this be a long Yep. Because there are no memory barriers in rcu_read_unlock(), the CPU is free to reorder the contents of the RCU read-side critical section to follow the counter decrement. This means that this CPU would still be referencing RCU-protected data after it had told the world that it was no longer doing so. ...
Thanks, so many places in the kernel have acronyms that are just suppose
to be "obvious". I hate them, because they make me feel so stupid when I
don't know what they are. After I find out, I usually slap my forehead and
I said 'no' becaues of this:
+rcu_try_flip_idle(void)
+{
+ int cpu;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
+ if (!rcu_pending(smp_processor_id())) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
+ return 0;
+ }
But now I'm a bit more confused. :-/
Looking at the caller in kernel/timer.c I see
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_tick);
And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
that we call this when we have a pending rcu, but if we have a pending
Sure, why not. More and more and more overkill!!!
And this seem reasonable to me that this would be enough to satisfy a
grace period. But the CPU moving around the rcu_read_(un)lock's around.
Are we sure that adding all these grace periods stages is better than just
That is definitely an accomplishment. And I know as well as you do that it
happened because of a lot of people sharing ideas. Some good, some bad,
but all helpful for heathy development!
-- Steve
-
We don't enter unless there is something for RCU to do (might be a pending callback, for example, but might also be needing to acknowledge a counter flip). If, by the time we get to rcu_try_flip_idle(), there is no longer anything to do (!rcu_pending()), we bail. So a given CPU kicks the state machine out of idle only if it -still- has something to do once it gets to rcu_try_flip_idle(), right? Good question. I believe so, because the extra stages don't require much additional processing, and because the ratio of rcu_read_lock() calls to the number of grace periods is extremely high. But, if I can prove it is safe, I will certainly decrease GP_STAGES or otherwise optimize the state machine. Indeed! The current version is quite a bit different than my early-2005 posting (which relied on locks!), and a -lot- of people had a hand in making it what it is today. Thanx, Paul -
[took off Ingo, because he has my ISP blacklisted, and I'm tired of getting those return mail messages. He can read LKML or you can re-CC him. Sad since this is a topic he should be reading. ] -- Now I can slap my forehead! Duh, I wasn't seeing that ! in front of the rcu_pending condition in the rcu_try_flip_idle. We only flip if we do indeed have something pending. I need some sleep. I also need to re-evaluate some of my analysis of that code. But it doesn't change my But until others besides yourself understand that state machine (doesn't really need to be me) I would be worried about applying it without barriers. The barriers may add a bit of overhead, but it adds some confidence in the code. I'm arguing that we have barriers in there until there's a fine understanding of why we fail with 3 stages and not 4. Perhaps you don't have a box with enough cpus to fail at 4. I don't know how the higher ups in the kernel command line feel, but I think that memory barriers on critical sections are justified. But if you can show a proof that adding extra stages is sufficient to deal with CPUS moving memory writes around, then so be it. But I'm still not convinced that these extra stages are really solving the bug instead of just making it much less likely to happen. Ingo praised this code since it had several years of testing in the RT tree. But that version has barriers, so this new verison without the True. -- Steve -
Fair point... Though the -rt variant has its shortcomings as well, such as being unusable from NMI/SMI handlers. How about this: I continue running the GP_STAGES=3 run on the pair of POWER machines (which are both going strong, and I also get a document together describing the new version (and of course apply the changes we have discussed, and merge with recent CPU-hotplug changes -- Gautham Shenoy is currently working this), work out a good answer to "how big exactly does GP_STAGES need to be", test whatever that number is, assuming it is neither 3 nor 4, and figure out why the gekko-lp1 machine choked on GP_STAGES=3. Then we can work out the best path forward from wherever that ends up being. [ . . . ] Thanx, Paul -
Paul, Looking further into this, I still think this is a bit of overkill. We go through 20 states from call_rcu to list->func(). On call_rcu we put our stuff on the next list. Before we move stuff from next to wait, we need to go through 4 states. So we have next -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states -> wait[2] -> 4 states -> wait[3] -> 4 states -> done. That's 20 states that we go through from the time we add our function to the list to the time it actually gets called. Do we really need the 4 wait lists? Seems a bit overkill to me. What am I missing? -- Steve -
"Nothing kills like overkill!!!" ;-) Seriously, I do expect to be able to squeeze this down over time, but feel the need to be a bit on the cowardly side at the moment. In any case, I will be looking at the scenarios more carefully. If it turns out that GP_STAGES can indeed be cranked down a bit, well, that is an easy change! I just fired off a POWER run with GP_STAGES set to 3, will let you know how it goes. Thanx, Paul -
The first attempt blew up during boot badly enough that ABAT was unable to recover the machine (sorry, grahal!!!). Just for grins, I am trying it again on a machine that ABAT has had a better record of reviving... Thanx, Paul -
-- This still frightens the hell out of me. Going through 15 states and failing. Seems the CPU is holding off writes for a long long time. That means we flipped the counter 4 times, and that still wasn't good enough? Maybe I'll boot up my powerbook to see if it has the same issues. Well, I'm still finishing up on moving into my new house, so I wont be available this weekend. Thanks, -- Steve -
Might be that the other machine has its 2.6.22 version of .config messed up. I will try booting it on a stock 2.6.22 kernel when it comes back to life -- not sure I ever did that before. Besides, the other similar machine seems to have gone down for the count, but without me torturing it... Also, keep in mind that various stages can "record" a memory misordering, The other machine not only booted, but has survived several minutes of rcutorture thus far. I am also trying POWER5 machine as well, as the one currently running is a POWER4, which is a bit less aggressive about memory reordering than is the POWER5. Even if they pass, I refuse to reduce GP_STAGES until proven safe. Trust me, you -don't- want to be unwittingly making use of a subtely busted RCU implementation!!! Thanx, Paul -
[ sneaks away from the family for a bit to answer emails ] -- I totally agree. This is the same reason I want to understand -why- it fails with 3 stages. To make sure that adding a 4th stage really does fix it, and doesn't just make the chances for the bug smaller. I just have that funny feeling that we are band-aiding this for POWER with extra stages and not really solving the bug. I could be totally out in left field on this. But the more people have a good understanding of what is happening (this includes why things fail) the more people in general can trust this code. Right now I'm thinking you may be the only one that understands this code enough to trust it. I'm just wanting you to help people like me to trust the code by understanding and not just having faith in others. If you ever decide to give up jogging, we need to make sure that there are people here that can still fill those running shoes (-: -- Steve -
Or if it really does break, as opposed to my having happened upon a sick Agreed. Trusting me is grossly insufficient. For one thing, the Linux Well, I certainly don't jog as fast or as far as I used to! ;-) Thanx, Paul -
Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly, rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by rcu_ctrlblk.fliplock Could you please tell more, why do we need this cli? It can't "protect" rcu_ctrlblk.completed, and the only change which affects the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done" above. (of course, we must disable preemption while changing rcu_flipctr). Hmm... this was already discussed... from another message: > Critical portions of the GP protection happen in the scheduler-clock > interrupt, which is a hardirq. For example, the .completed counter > is always incremented in hardirq context, and we cannot tolerate a > .completed increment in this code. Allowing such an increment would > defeat the counter-acknowledge state in the state machine. Still can't understand, please help! .completed could be incremented by Isn't it "obvious" that this barrier is not needed? rcu_flipctr could be change only by this CPU, regardless of the actual value of idx, we can't read the "wrong" value of rcu_flipctr[idx], no? OTOH, I can't understand how it works. With ot without local_irq_save(), another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed at any time, we can see the old value... rcu_try_flip_waitzero() can miss us? OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both 0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much my understanding is correct. Apart from this why GP_STAGES > 1 ??? Well, I think this code is just too tricky for me! Will try to read again Why? Any code sequence which relies on that? For example, rcu_check_callbacks does if (rcu_ctrlblk.completed == rdp->completed) rcu_try_flip(); it is possible that the timer interrupt calls rcu_check_callbacks() exactly because rcu_pending() sees rcu_flipped, but without rmb() in between we can see the old value of ...
Thank you for the kind words, and most especially for the careful review!!! Definitely are two of them in rcupdate.h, good eyes! The ones in rcuclassic.h and rcupreempt.h, by the time all the patches are applied, are for rcu_check_callbacks_rt(). However, this, along with the other _rt() cross-calls from rcuclassic.c to rcupreempt.c, In the case of rcu_flipctr, this is historical -- the implementation in -rt has only one rcu_data, rather than one per CPU. I cannot think of a problem with placing it in rcu_data right off-hand, will look it over carefully and see. Looks like it to me, thank you for the tip! Hmmm... Why not the same for rcu_data? I guess because there is very little sharing? The only example thus far of sharing would be if rcu_flipctr were to be moved into rcu_data (if an RCU read-side critical section were preempted and then started on some other CPU), though I will need to check more carefully. Of course, if we start having CPUs access each others' rcu_data structures to make RCU more power-friendly in dynticks situations, then maybe there would be more reason to use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data. Yeah, I guess my explanation -was- a bit lacking... When I re-read it, it didn't even help -me- much! ;-) I am putting together better documentation, but in the meantime, let me try again... The problem is not with .completed being incremented, but only by it (1) being incremented (presumably by some other CPU and then (2) having this CPU get a scheduling-clock interrupt, which then causes this CPU to prematurely update the rcu_flip_flag. This is a problem, because updating rcu_flip_flag is making a promise that you won't ever again increment the "old" counter set (at least not until the next counter flip). If the scheduling clock interrupt were to happen between the time that we pick up the .completed field and the time that we increment our counter, we will have broken that promise, and that could ...
even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared",
Thanks a lot! Actually, this explains most of my questions. I was greatly
confused even before I started to read the code. Looking at rcu_flipctr[2]
I wrongly assumed that these 2 counters "belong" to different GPs. When
Yes, yes, I see now. We really need this barriers, except I think
rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,
// rcu_try_flip_waitzero()
if (A == 0) {
mb();
B == 0;
}
Do we really need the mb() in this case? How it is possible that STORE
Hmm. Still can't understand.
Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
(say, another CPU does rcu_try_flip_idle() in between).
We ack the flip, call_rcu() enables irqs, the timer interrupt calls
__rcu_advance_callbacks() again and moves the callbacks.
So, it is still possible that "move callbacks" and "ack the flip" happen
out of order. But why this is bad?
This can't "speedup" the moving of our callbacks from next to done lists.
Yes, RCU state machine can switch to the next state/stage, but this looks
safe to me.
Help!
The last question, rcu_check_callbacks does
if (rcu_ctrlblk.completed == rdp->completed)
rcu_try_flip();
Could you clarify the check above? Afaics this is just optimization,
technically it is correct to rcu_try_flip() at any time, even if ->completed
are not synchronized. Most probably in that case rcu_try_flip_waitack() will
oldmask could be obsolete now. Suppose that the admin moves that task to some
cpuset or just changes its ->cpus_allowed while it does synchronize_sched().
I think there is another problem. It would be nice to eliminate taking the global
sched_hotcpu_mutex in sched_setaffinity() (I think without CONFIG_HOTPLUG_CPU
it is not needed right now). In that case sched_setaffinity(0, cpumask_of_cpu(cpu))
can in fact return on the "wrong" CPU ...Suppose that A was most recently stored by a CPU that shares a store buffer with this CPU. Then it is possible that some other CPU sees the store to B as happening before the store that "A==0" above is loading from. This other CPU would naturally conclude that the store to B must have happened before the load from A. In more detail, suppose that CPU 0 and 1 share a store buffer, and that CPU 2 and 3 share a second store buffer. This happens naturally if CPUs 0 and 1 are really just different hardware threads within a single core. So, suppose the cacheline for A is initially owned by CPUs 2 and 3, and that the cacheline for B is initially owned by CPUs 0 and 1. Then consider the following sequence of events: o CPU 0 stores zero to A. This is a cache miss, so the new value for A is placed in CPU 0's and 1's store buffer. o CPU 1 executes the above code, first loading A. It sees the value of A==0 in the store buffer, and therefore stores zero to B, which hits in the cache. (I am assuming that we left out the mb() above). o CPU 2 loads from B, which misses the cache, and gets the value that CPU 1 stored. Suppose it checks the value, and based on this check, loads A. The old value of A might still be in cache, which would lead CPU 2 to conclude that the store to B by CPU 1 must have happened before the store to A by CPU 0. Memory barriers would prevent this confusion. An intro to store buffers can be found at http://www.cs.utah.edu/mpv/papers/neiger/fmcad2001.pdf, Callbacks would be able to be injected into a grace period after it started. Or are you arguing that as long as interrupts remain disabled between Ah -- you are in fact arguing that interrupts remain disabled throughout. I would still rather that the rcu_flip_seen transition be adjacent to the callback movement in the code. My fear is that the connection might be lost otherwise... "Oh, but we can just momentarily enable From a conceptual viewpoint, if this CPU hasn't caught ...
Ah, I was confused by the comment,
smp_mb(); /* Don't call for memory barriers before we see zero. */
^^^^^^^^^^^^^^^^^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.
But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called under
Thanks a lot!!! This fills another gap in my understanding.
OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace "CPU 0 stores" with
"CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?
Suppose that CPU 1 does
wmb();
B = 0
Can we assume that CPU 2 doing
if (B == 0) {
rmb();
Look, what happens is
// call_rcu()
rcu_flip_flag = rcu_flipped
insert the new callback
// timer irq
move the callbacks (the new one goes to wait[0])
Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed
I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue,
if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not
interesting for us, we can miss it.
Hmm... but wake_up_process() should be moved under spin_lock().
Oleg.
-
Fair point! Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the The final rcu_read_unlock() that zeroed the sum was -not- under fliplock, Yes. CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's reads following its rmb() must therefore see all of CPU 1's stores Quite possibly my paranoia -- need to think about this some more. Guess I need to expand the code-level portion of the document to cover the grace-period side. That would force me either to get my explanation The other approach would be to simply have a separate thread for this purpose. Batching would amortize the overhead (a single trip around the CPUs could satisfy an arbitrarily large number of synchronize_sched() requests). Thanx, Paul -
Yes, but still I think this mb() is not necessary. Becasue we don't need
the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor issue,
Ah, but I asked the different question. We must see CPU 1's stores by
definition, but what about CPU 0's stores (which could be seen by CPU 1)?
Let's take a "real life" example,
A = B = X = 0;
P = Q = &A;
CPU_0 CPU_1 CPU_2
P = &B; *P = 1; if (X) {
wmb(); rmb();
X = 1; BUG_ON(*P != 1 && *Q != 1);
}
Yes, this way we don't need to uglify migration_thread(). OTOH, we need
another kthread ;)
Oleg.
-
That can't be. CPU_2 sees X=1, that happened after (or same time at most - from a cache inv. POV) to *P=1, that must have happened after P=&B (in order for *P to assign B). So P=&B happened, from a pure time POV, before the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too. - Davide -
Actually, CPU designers have to go quite a ways out of their way to prevent this BUG_ON from happening. One way that it would happen naturally would be if the cache line containing P were owned by CPU 2, and if CPUs 0 and 1 shared a store buffer that they both snooped. So, here is what could happen given careless or sadistic CPU designers: o CPU 0 stores &B to P, but misses the cache, so puts the result in the store buffer. This means that only CPUs 0 and 1 can see it. o CPU 1 fetches P, and sees &B, so stores a 1 to B. Again, this value for P is visible only to CPUs 0 and 1. o CPU 1 executes a wmb(), which forces CPU 1's stores to happen in order. But it does nothing about CPU 0's stores, nor about CPU 1's loads, for that matter (and the only reason that POWER ends up working the way you would like is because wmb() turns into "sync" rather than the "eieio" instruction that would have been used for smp_wmb() -- which is maybe what Oleg was thinking of, but happened to abbreviate. If my analysis is buggy, Anton and Paulus will no doubt correct me...) o CPU 1 stores to X. o CPU 2 loads X, and sees that the value is 1. o CPU 2 does an rmb(), which orders its loads, but does nothing about anyone else's loads or stores. o CPU 2 fetches P from its cached copy, which still points to A, which is still zero. So the BUG_ON fires. o Some time later, CPU 0 gets the cache line containing P from CPU 2, and updates it from the value in the store buffer, but too late... Unfortunately, cache-coherence protocols don't care much about pure time... It is possible to make a 16-CPU machine believe that a single variable has more than ten different values -at- -the- -same- -time-. This is easy to do -- have all the CPUs store different values to the same variable at the same time, then reload, collecting timestamps between each pair of operations. On a large SMP, the values will sit in the store buffers for many hundreds of nanoseconds, perhaps ...
Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted that it was a cross-CPU sync point (ala read_barrier_depends). If that's a If a store buffer is shared between CPU_0 and CPU_1, it is very likely that a sync done on CPU_1 is going to sync even CPU_0 stores that are held in the buffer at the time of CPU_1's sync. - Davide -
That would indeed be one approach that CPU designers could take to avoid being careless or sadistic. ;-) Another approach would be to make CPU 1 refrain from snooping CPU 0's entries in the shared store queue. Thanx, Paul -
That'd be the easier (unique maybe) approach too for them, from an silicon complexity POV. Distinguishing between different CPUs stores once inside a shared store buffer, would require tagging them in some way. That'd defeat most of the pros of having a shared store buffer ;) - Davide -
Tagging requires but one bit per entry. Depends on the workload -- if lots of barriers, bursty stores and little sharing, tagging might win. If lots of sharing, then your suggested approach might win. Thanx, Paul -
Davide, Paul, thank you very much! I've been wondering about this for the long time, now I know the answer. Great. Oleg. -
It depends. ;-) o Itanium: because both wmb() and rmb() map to the "mf" instruction, and because "mf" instructions map to a single global order, the BUG_ON cannot happen. (But I could easily be mistaken -- I cannot call myself an Itanium memory-ordering expert.) See: ftp://download.intel.com/design/Itanium/Downloads/25142901.pdf for the official story. o POWER: because wmb() maps to the "sync" instruction, cumulativity applies, so that any instruction provably following "X = 1" will see "P = &B" if the "*P = 1" statement saw it. So the BUG_ON cannot happen. o i386: memory ordering respects transitive visibility, which seems to be similar to POWER's cumulativity (http://developer.intel.com/products/processor/manuals/318147.pdf), so the BUG_ON cannot happen. o x86_64: same as i386. o s390: the basic memory-ordering model is tight enough that the BUG_ON cannot happen. (If I am confused about this, the s390 guys will not be shy about correcting me!) True enough!!! Thanx, Paul -
Work in progress, not for inclusion. The combination of CPU hotplug and PREEMPT_RCU has resulted in deadlocks due to the migration-based implementation of synchronize_sched() in -rt. This experimental patch maps synchronize_sched() back onto Classic RCU, eliminating the migration, thus hopefully also eliminating the deadlocks. It is not clear that this is a good long-term approach, but it will at least permit people doing CPU hotplug in -rt kernels additional wiggle room in their design and implementation. The basic approach is to cause the -rt kernel to incorporate rcuclassic.c as well as rcupreempt.c, but to #ifdef out the conflicting portions of rcuclassic.c so that only the code needed to implement synchronize_sched() remains in a PREEMPT_RT build. Invocations of grace-period detection from the scheduling-clock interrupt go to rcuclassic.c, which then invokes the corresponding functions in rcupreempt.c (with _rt suffix added to keep the linker happy). Also applies the RCU_SOFTIRQ to classic RCU. The bulk of this patch just moves code around, but likely increases scheduling-clock latency. If this patch does turn out to be the right approach, the #ifdefs in kernel/rcuclassic.c might be dealt with. ;-) At current writing, Gautham Shenoy's most recent CPU-hotplug fixes seem likely to obsolete this patch (which would be a very good thing indeed!). If this really pans out, this portion of the patch will vanish during the forward-porting process. Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcuclassic.h | 79 +++++-------------------------------- include/linux/rcupdate.h | 30 ++++++++++++-- include/linux/rcupreempt.h | 27 ++++++------ kernel/Makefile | 2 kernel/rcuclassic.c | 95 ++++++++++++++++++++++++++++++++++++--------- kernel/rcupdate.c | 22 ++++++++-- kernel/rcupreempt.c | 50 +++++------------------ ...
Work in progress, not for inclusion. This patch allows preemptible RCU to tolerate CPU-hotplug operations. It accomplishes this by maintaining a local copy of a map of online CPUs, which it accesses under its own lock. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/rcuclassic.h | 2 include/linux/rcupreempt.h | 2 kernel/rcuclassic.c | 8 +++ kernel/rcupreempt.c | 93 +++++++++++++++++++++++++++++++++++++++++++-- 4 files changed, 100 insertions(+), 5 deletions(-) diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h --- linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h 2007-08-22 15:38:22.000000000 -0700 +++ linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h 2007-08-22 15:55:39.000000000 -0700 @@ -143,6 +143,8 @@ extern int rcu_needs_cpu(int cpu); #define rcu_check_callbacks_rt(cpu, user) #define rcu_init_rt() #define rcu_needs_cpu_rt(cpu) 0 +#define rcu_offline_cpu_rt(cpu) +#define rcu_online_cpu_rt(cpu) #define rcu_pending_rt(cpu) 0 #define rcu_process_callbacks_rt(unused) diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h --- linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h 2007-08-22 15:38:22.000000000 -0700 +++ linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h 2007-08-22 15:55:39.000000000 -0700 @@ -59,6 +59,8 @@ extern void rcu_advance_callbacks_rt(int extern void rcu_check_callbacks_rt(int cpu, int user); extern void rcu_init_rt(void); extern int rcu_needs_cpu_rt(int cpu); +extern void rcu_offline_cpu_rt(int cpu); +extern void rcu_online_cpu_rt(int cpu); extern int rcu_pending_rt(int cpu); struct softirq_action; extern void rcu_process_callbacks_rt(struct softirq_action *unused); diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/kernel/rcuclassic.c linux-2.6.22-e-hotplugcpu/kernel/rcuclassic.c --- ...
Imho, all these barriers are unneeded and confusing, we can't do them on behalf of a dead CPU anyway. Can't we just do per_cpu(rcu_mb_flag, cpu) = rcu_mb_done; per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen; ? Why can't we also do __get_cpu_var(rcu_flipctr)[0] += per_cpu(rcu_flipctr, cpu)[0]; per_cpu(rcu_flipctr, cpu)[0] = 0; __get_cpu_var(rcu_flipctr)[1] += per_cpu(rcu_flipctr, cpu)[1]; per_cpu(rcu_flipctr, cpu)[1] = 0; ? This way rcu_try_flip_waitzero() can also use rcu_cpu_online_map. This cpu is dead, nobody can modify per_cpu(rcu_flipctr, cpu). And we can't confuse What if _cpu_up() fails? I think rcu_cpu_notify(CPU_UP_CANCELED) should call rcu_offline_cpu_rt() too. Oleg. -
You are likely correct, but this is a slow path, extremely hard to Very good point!!! This would reduce latencies on systems where the number of possible CPUs greatly exceeds that of the number of Good catch, will fix!!! Thanx, Paul -
Work in progress, not for inclusion. RCU priority boosting is needed when running a workload that might include CPU-bound user tasks running at realtime priorities with preemptible RCU. In this situation, RCU priority boosting is needed to avoid OOM. Please note that because Classic RCU does not permit RCU read-side critical sections to be preempted, there is no need to boost the priority of Classic RCU readers. Boosting the priority of a running process does not make it run any faster, at least not on any hardware that I am aware of. ;-) Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- include/linux/init_task.h | 13 include/linux/rcupdate.h | 17 + include/linux/rcupreempt.h | 20 + include/linux/sched.h | 24 + init/main.c | 1 kernel/Kconfig.preempt | 14 - kernel/fork.c | 6 kernel/rcupreempt.c | 608 ++++++++++++++++++++++++++++++++++++++++++--- kernel/rtmutex.c | 7 kernel/sched.c | 5 lib/Kconfig.debug | 34 ++ 11 files changed, 703 insertions(+), 46 deletions(-) diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/init_task.h linux-2.6.22-F-boostrcu/include/linux/init_task.h --- linux-2.6.22-E-hotplug/include/linux/init_task.h 2007-07-08 16:32:17.000000000 -0700 +++ linux-2.6.22-F-boostrcu/include/linux/init_task.h 2007-08-31 14:09:02.000000000 -0700 @@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy; .signalfd_list = LIST_HEAD_INIT(sighand.signalfd_list), \ } +#ifdef CONFIG_PREEMPT_RCU_BOOST +#define INIT_RCU_BOOST_PRIO .rcu_prio = MAX_PRIO, +#define INIT_PREEMPT_RCU_BOOST(tsk) \ + .rcub_rbdp = NULL, \ + .rcub_state = RCU_BOOST_IDLE, \ + .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry), +#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */ +#define INIT_RCU_BOOST_PRIO +#define INIT_PREEMPT_RCU_BOOST(tsk) +#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */ + extern struct group_info init_groups; #define ...
Hi Paul, Some silly doubts. I am wondering, can a task block on a lock while in RCU read-side Event #3. is missing? The state change from RCU_BOOST_BLOCKED to RCU_BOOSTED is not done by -- Gautham R Shenoy Linux Technology Center IBM India. "Freedom comes with a price tag of responsibility, which is still a bargain, because Freedom is priceless!" -
-- I think this may be specific to the -rt patch. In the -rt patch, spin_locks turn into mutexes, and therefor can block a read-side critical I guess Paul needs to answer that one ;-) -- Steve -
An older version had three, the new one has two, and I forgot to s/three/two/. Thanx, Paul -
On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote: Why is this masking required? When we increment the rcu_boost_idx in rcu_booster, we do perform a modulo operation -- Gautham R Shenoy Linux Technology Center IBM India. "Freedom comes with a price tag of responsibility, which is still a bargain, because Freedom is priceless!" -
Because we are not masking rcu_boost_idx, we are masking (rcu_boost_idx + 1) which may extend the bounderies of RCU_BOOST_ELEMENTS. -- Steve -
Thanks! But I'm still trying to understand why the (increment + masking) is required at all. The thread(producer) that requires boosting is added to the element with index rcu_boost_idx. The booster thread(consumer) increments the rcu_boost_idx to (rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS, before it fetches the least recently used rcu_boot_dat elements and boost the eligible tasks queued in that element. So, can't we just return per_cpu(rcu_boost_dat, cpu)[rcu_boost_idx] from rcu_rbd_boosting(cpu) ? Isn't that already the least recently used Thanks and Regards gautham. -- Gautham R Shenoy Linux Technology Center IBM India. "Freedom comes with a price tag of responsibility, which is still a bargain, because Freedom is priceless!" -
Good catch -- we need to advance the index -after- boosting, so that new sleeping tasks are not immediately dropped on the to-be-boosted list. Will fix! (Non-fatal -- but means that the algorithm is effectively only using three elements of the four-element array, so does need to be fixed.) Thanx, Paul -
Work in progress, not for inclusion. Still uses xtime because this patch is still against 2.6.22. This patch modifies rcutorture to also torture RCU priority boosting. The torturing involves forcing RCU read-side critical sections (already performed as part of the torturing of RCU) to run for extremely long time periods, increasing the probability of their being preempted and thus needing priority boosting. The fact that rcutorture's "nreaders" module parameter defaults to twice the number of CPUs helps ensure lots of the needed preemption. To cause the torturing to be fully effective in -mm, run in presence of CPU-hotplug operations. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- rcutorture.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++---------- 1 file changed, 77 insertions(+), 14 deletions(-) diff -urpNa -X dontdiff linux-2.6.22-f-boostrcu/kernel/rcutorture.c linux-2.6.22-g-boosttorture/kernel/rcutorture.c --- linux-2.6.22-f-boostrcu/kernel/rcutorture.c 2007-07-08 16:32:17.000000000 -0700 +++ linux-2.6.22-g-boosttorture/kernel/rcutorture.c 2007-09-09 16:58:35.000000000 -0700 @@ -58,6 +58,7 @@ static int stat_interval; /* Interval be static int verbose; /* Print more debug info. */ static int test_no_idle_hz; /* Test RCU's support for tickless idle CPUs. */ static int shuffle_interval = 5; /* Interval between shuffles (in sec)*/ +static int preempt_torture; /* Realtime task preempts torture readers. */ static char *torture_type = "rcu"; /* What RCU implementation to torture. */ module_param(nreaders, int, 0444); @@ -72,6 +73,8 @@ module_param(test_no_idle_hz, bool, 0444 MODULE_PARM_DESC(test_no_idle_hz, "Test support for tickless idle CPUs"); module_param(shuffle_interval, int, 0444); MODULE_PARM_DESC(shuffle_interval, "Number of seconds between shuffles"); +module_param(preempt_torture, bool, 0444); +MODULE_PARM_DESC(preempt_torture, "Enable realtime preemption torture"); module_param(torture_type, charp, 0444); ...
Work in progress, not for inclusion.
This patch modified the RCU priority booster to explicitly sleep when
there are no RCU readers in need of priority boosting. This should be
a power-consumption improvement over the one-second polling cycle in
the underlying RCU priority-boosting patch.
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---
include/linux/rcupreempt.h | 15 ++++++
kernel/rcupreempt.c | 102 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 115 insertions(+), 2 deletions(-)
diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h
--- linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h 2007-08-24 11:24:59.000000000 -0700
+++ linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h 2007-08-24 18:12:41.000000000 -0700
@@ -60,6 +60,21 @@ enum rcu_boost_state {
#define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1)
+/*
+ * RCU-booster state with respect to sleeping. The RCU booster
+ * sleeps when no task has recently been seen sleeping in an RCU
+ * read-side critical section, and is awakened when a new sleeper
+ * appears.
+ */
+enum rcu_booster_state {
+ RCU_BOOSTER_ACTIVE = 0, /* RCU booster actively scanning. */
+ RCU_BOOSTER_DROWSY = 1, /* RCU booster is considering sleeping. */
+ RCU_BOOSTER_SLEEPING = 2, /* RCU booster is asleep. */
+ RCU_BOOSTER_INVALID = 3, /* For bogus state sightings. */
+};
+
+#define N_RCU_BOOSTER_STATE (RCU_BOOSTER_INVALID + 1)
+
#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/kernel/rcupreempt.c linux-2.6.22-H-boostsleep/kernel/rcupreempt.c
--- linux-2.6.22-G-boosttorture/kernel/rcupreempt.c 2007-08-27 15:42:57.000000000 -0700
+++ linux-2.6.22-H-boostsleep/kernel/rcupreempt.c 2007-08-27 15:42:37.000000000 -0700
@@ -108,6 +108,7 @@ struct rcu_boost_dat {
unsigned long rbs_unboosted;
#ifdef ...Work in progress, not for inclusion. This patch updates the RCU documentation to reflect preemptible RCU as well as recent publications. Fix an incorrect comment in the code. Change the name ORDERED_WRT_IRQ() to ACCESS_ONCE() to better describe its function. Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> --- Documentation/RCU/RTFP.txt | 234 ++++++++++++++++++++++++++++++++++++++++-- Documentation/RCU/rcu.txt | 20 +++ Documentation/RCU/torture.txt | 44 ++++++- kernel/Kconfig.preempt | 1 kernel/rcupreempt.c | 24 ++-- 5 files changed, 292 insertions(+), 31 deletions(-) diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt --- linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt 2007-07-08 16:32:17.000000000 -0700 +++ linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt 2007-09-08 16:54:17.000000000 -0700 @@ -36,6 +36,14 @@ o How can the updater tell when a grace executed in user mode, or executed in the idle loop, we can safely free up that item. + Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the + same effect, but require that the readers manipulate CPU-local + counters. These counters allow limited types of blocking + within RCU read-side critical sections. SRCU also uses + CPU-local counters, and permits general blocking within + RCU read-side critical sections. These two variants of + RCU detect grace periods by sampling these counters. + o If I am running on a uniprocessor kernel, which can only do one thing at a time, why should I wait for a grace period? @@ -46,7 +54,10 @@ o How can I see where RCU is currently u Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu", "rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh", "srcu_read_lock", "srcu_read_unlock", "synchronize_rcu", - "synchronize_net", and "synchronize_srcu". + "synchronize_net", "synchronize_srcu", and the other RCU + primitives. Or grab ...
cool! Now after 2 years of development and testing i think this is one of the most mature patchsets on lkml - so i'd like to see this designated for potential upstream inclusion. I.e. everyone who can see some bug, please holler now. Ingo -
