Re: [PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU

Previous thread: [PATCH] Hookup group-scheduler with task container infrastructure by Srivatsa Vaddagiri on Monday, September 10, 2007 - 10:10 am. (30 messages)

Next thread: [PATCH] dcache: trivial comment fix by J. Bruce Fields on Monday, September 10, 2007 - 11:46 am. (2 messages)
From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:30 am

Work in progress, still not for inclusion.  But code now complete!

This is a respin of the following prior posting:

http://lkml.org/lkml/2007/9/5/268

This release adds an additional patch that adds fixes to comments and RCU
documentation, along with one macro being renamed.  The rcutorture patch
has a modification to make it a bit more vicious to priority boosting
(though the current design relies on -rt latencies for much of the
priority-boost torturing effectiveness in this case -- run the test
in presence of CPU hotplug operations to get the same effect in -mm).
Next step is rebasing this to a more recent version of Linux.

						Thanx, Paul
-

From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:33 am

Work in progress, not for inclusion.

Fix rcu_barrier() to work properly in preemptive kernel environment.
Also, the ordering of callback must be preserved while moving
callbacks to another CPU during CPU hotplug.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 rcuclassic.c |    2 +-
 rcupdate.c   |   10 ++++++++++
 2 files changed, 11 insertions(+), 1 deletion(-)

diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcuclassic.c linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c
--- linux-2.6.22-a-splitclassic/kernel/rcuclassic.c	2007-07-19 15:03:51.000000000 -0700
+++ linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c	2007-07-19 17:10:46.000000000 -0700
@@ -349,9 +349,9 @@ static void __rcu_offline_cpu(struct rcu
 	if (rcp->cur != rcp->completed)
 		cpu_quiet(rdp->cpu, rcp);
 	spin_unlock_bh(&rcp->lock);
+	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
 	rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
 	rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
-	rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
 }
 
 static void rcu_offline_cpu(int cpu)
diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcupdate.c linux-2.6.22-b-fixbarriers/kernel/rcupdate.c
--- linux-2.6.22-a-splitclassic/kernel/rcupdate.c	2007-07-19 14:19:03.000000000 -0700
+++ linux-2.6.22-b-fixbarriers/kernel/rcupdate.c	2007-07-19 17:13:31.000000000 -0700
@@ -115,7 +115,17 @@ void rcu_barrier(void)
 	mutex_lock(&rcu_barrier_mutex);
 	init_completion(&rcu_barrier_completion);
 	atomic_set(&rcu_barrier_cpu_count, 0);
+	/*
+	 * The queueing of callbacks in all CPUs must be atomic with
+	 * respect to RCU, otherwise one CPU may queue a callback,
+	 * wait for a grace period, decrement barrier count and call
+	 * complete(), while other CPUs have not yet queued anything.
+	 * So, we need to make sure that grace periods cannot complete
+	 * until all the callbacks are queued.
+	 */
+	rcu_read_lock();
 ...
From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:32 am

Work in progress, not for inclusion.

This patch re-organizes the RCU code to enable multiple implementations
of RCU. Users of RCU continues to include rcupdate.h and the
RCU interfaces remain the same. This is in preparation for
subsequently merging the preemptible RCU implementation.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com> 
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> 
---

 include/linux/rcuclassic.h |  149 +++++++++++
 include/linux/rcupdate.h   |  151 +++---------
 kernel/Makefile            |    2 
 kernel/rcuclassic.c        |  558 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/rcupdate.c          |  561 ++-------------------------------------------
 5 files changed, 779 insertions(+), 642 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcuclassic.h linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h
--- linux-2.6.22/include/linux/rcuclassic.h	1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h	2007-08-22 14:42:23.000000000 -0700
@@ -0,0 +1,149 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion (classic version)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>
+ *
+ * Based on the original work ...
From: Steven Rostedt
Date: Thursday, September 20, 2007 - 9:14 pm

On Mon, Sep 10, 2007 at 11:32:08AM -0700, Paul E. McKenney wrote:




Why is rcu_batches_completed and rcu_batches_completed_bh moved from
rcupdate.h to rcuclassic.h?

[ continued ...]

-- Steve

-

From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:34 am

Work in progress, not for inclusion.

This patch implements a new version of RCU which allows its read-side
critical sections to be preempted. It uses a set of counter pairs
to keep track of the read-side critical sections and flips them
when all tasks exit read-side critical section. The details
of this implementation can be found in this paper -

http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf

This patch was developed as a part of the -rt kernel development and
meant to provide better latencies when read-side critical sections of
RCU don't disable preemption.  As a consequence of keeping track of RCU
readers, the readers have a slight overhead (optimizations in the paper).
This implementation co-exists with the "classic" RCU implementations
and can be switched to at compiler.

Also includes RCU tracing summarized in debugfs and RCU_SOFTIRQ for
the preemptible variant of RCU.

Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ)
Signed-off-by: Paul McKenney <paulmck@us.ibm.com>
---

 include/linux/interrupt.h        |    1 
 include/linux/rcuclassic.h       |    2 
 include/linux/rcupdate.h         |    7 
 include/linux/rcupreempt.h       |   78 +++
 include/linux/rcupreempt_trace.h |  100 +++++
 include/linux/sched.h            |    5 
 kernel/Kconfig.preempt           |   38 +
 kernel/Makefile                  |    7 
 kernel/fork.c                    |    4 
 kernel/rcupreempt.c              |  767 +++++++++++++++++++++++++++++++++++++++
 kernel/rcupreempt_trace.c        |  330 ++++++++++++++++
 11 files changed, 1336 insertions(+), 3 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/interrupt.h linux-2.6.22-c-preemptrcu/include/linux/interrupt.h
--- linux-2.6.22-b-fixbarriers/include/linux/interrupt.h	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/interrupt.h	2007-08-22 15:21:06.000000000 -0700
@@ -269,6 +269,7 @@ ...
From: Steven Rostedt
Date: Thursday, September 20, 2007 - 9:17 pm

[ continued here from comment on patch 1]


And here we put back rcu_batches_completed and rcu_batches_completed_bh
from rcuclassic.h to rcupdate.h ;-)

-- Steve

-

From: Paul E. McKenney
Date: Thursday, September 20, 2007 - 10:50 pm

Hmmm...  Good point!!!  I guess it would be OK to just leave them
in rcupdate.h throughout.  ;-)

Will fix.  And good eyes!

						Thanx, Paul
-

From: Dipankar Sarma
Date: Thursday, September 20, 2007 - 10:56 pm

Good questions :) I can't remember why I did this - probably because
I was breaking up into classic and preemptible RCU in incremental
patches with the goal that the break-up patch can be merged
before the rcu-preempt patches. IIRC, I had to make *batches_completed*()
a common RCU API later on.

Thanks
Dipankar
-

From: Steven Rostedt
Date: Friday, September 21, 2007 - 7:40 am

I take it that GP stand for "grace period". Might want to state that
here. /* Grace period stages */  When I was looking at this code at 1am,
I kept asking myself "what's this GP?" (General Protection??). But

Can you have a pointer somewhere that explains these states. And not a
"it's in this paper or directory". Either have a short discription here,
or specify where exactly to find the information (perhaps a
Documentation/RCU/preemptible_states.txt?).

Trying to understand these states has caused me the most agony in


Nitpick, but other places in the kernel usually use "t" or "p" as a
variable to assign current to.  It's just that "me" thows me off a
little while reviewing this.  But this is just a nitpick, so do as you


Isn't the GP detection done via a tasklet/softirq. So wouldn't a
local_bh_disable be sufficient here? You already cover NMIs, which would



Just trying to understand all this. Here at flip_idle, only a CPU with
no pending RCU calls will flip it. Then all the cpus flags will be set
to rcu_flipped, and the ctrl.completed counter is incremented.

When a CPU calls process_callbacks, it would then move all it's
callbacks to the next list (next -> wait[GP...] -> done), and set it's
unique completed counter to completed. So no more moving up the lists
will be done. It also will set it's state flag to rcu_flip_seen. Even if
the cpu doesn't have any RCU callbacks, the state would be in
rcu_try_flip_waitack_state so we go to the next switch case for a
rcu_try_flip call.

Also the filp counter has been flipped so that all new rcu_read_locks
will increment the cpus "other" index. We just need to wait for the

Now, we just wait to see if all the cpus have called process_callbacks
and now have the flag rcu_flip_seen set. So all the cpus have pushed

Now this is where we wait for the sum of all the CPUs counters to reach
zero. The reason for the sum is that the task may have been preempted
and migrated to another CPU and decremented that counter. So the one ...
From: Peter Zijlstra
Date: Friday, September 21, 2007 - 8:46 am

On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <rostedt@goodmis.org>

I thought the 4 flip states corresponded to the 4 GP stages, but now
you confused me. It seems to indeed progress one stage for every 4 flip
states.

Hmm, now I have to puzzle how these 4 stages are required by the lock


struct task_struct *curr = current;


This is also my understanding, but I think this disable is an
'optimization' in that it avoids the regular IRQs from jumping through

Its the large SMP case that's nuts, and on that I have to agree with
Paul, its not really large SMP friendly.


-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 3:06 pm

Yes, four flip states per stage, and four stages per grace period:

rcu_try_flip_idle_state:	Stay here if nothing is happening.
				Flip the counter if something starts
				happening.

rcu_try_flip_waitack_state:	Wait here for all CPUs to notice that
				the counter has flipped.  This prevents
				the old set of counters from ever being
				incremented once we leave this state,
				which in turn is necessary because we
				cannot test any individual counter for
				zero -- we can only check the sum.

rcu_try_flip_waitzero_state:	Wait here for the sum of the old
				per-CPU counters to reach zero.

rcu_try_flip_waitmb_state:	Wait here for each of the other
				CPUs to execute a memory barrier.
				This is necessary to ensure that
				these other CPUs really have finished
				executing their RCU read-side
				critical sections, despite their
				CPUs wildly reordering memory.


They are needed to allow memory barriers to be removed from
rcu_read_lock(), to allow for the fact that the stages mean that
grace-period boundaries are now fuzzy, as well as to account for the
fact that each CPU now has its own callback queue.  (Aside: classic RCU
doesn't have memory barriers -- or anything else -- in rcu_read_lock()
and rcu_read_unlock(), but from an implementation viewpoint, the read-side
critical section extends forwards and backwards to the next/previous
context switches, which do have lots of memory barriers.)

Taking the reasons for stages one at a time:

1.	The first stage is needed for generic RCU.  Everyone must
	complete any pre-existing RCU read-side critical sections
	before the grace period can be permitted to complete.  This
	stage corresponds to the "waitlist" and "waittail" in earlier
	RCU implementations.

2.	An additional stage is needed due to the fact that memory reordering
	can cause an RCU read-side critical section's rcu_dereference()
	to be executed before the rcu_read_lock() -- no memory barriers!

	This can result in the following sequence of ...
From: Steven Rostedt
Date: Friday, September 21, 2007 - 3:31 pm

No, I mean that both the rcu_batches_completed and
rcu_batches_completed_bh are identical. Perhaps we can just put in a

#define rcu_batches_completed_bh rcu_batches_completed

in rcupreempt.h.  In rcuclassic, they are different. But no need to have


But isn't disabling irqs slower than doing a local_bh_disable? So the

Hmm, that could be true. But on large SMP systems, you usually have a
large amounts of memory, so hopefully a really long synchronize_rcu
would not be a problem.

-- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 3:44 pm

If you do a synchronize_rcu() it might well have to wait through the
following sequence of states:

Stage 0: (might have to wait through part of this to get out of "next" queue)
	rcu_try_flip_idle_state,	/* "I" */
	rcu_try_flip_waitack_state, 	/* "A" */
	rcu_try_flip_waitzero_state,	/* "Z" */
	rcu_try_flip_waitmb_state	/* "M" */
Stage 1:
	rcu_try_flip_idle_state,	/* "I" */
	rcu_try_flip_waitack_state, 	/* "A" */
	rcu_try_flip_waitzero_state,	/* "Z" */
	rcu_try_flip_waitmb_state	/* "M" */
Stage 2:
	rcu_try_flip_idle_state,	/* "I" */
	rcu_try_flip_waitack_state, 	/* "A" */
	rcu_try_flip_waitzero_state,	/* "Z" */
	rcu_try_flip_waitmb_state	/* "M" */
Stage 3:
	rcu_try_flip_idle_state,	/* "I" */
	rcu_try_flip_waitack_state, 	/* "A" */
	rcu_try_flip_waitzero_state,	/* "Z" */
	rcu_try_flip_waitmb_state	/* "M" */
Stage 4:
	rcu_try_flip_idle_state,	/* "I" */
	rcu_try_flip_waitack_state, 	/* "A" */
	rcu_try_flip_waitzero_state,	/* "Z" */
	rcu_try_flip_waitmb_state	/* "M" */




The current code absolutely must exclude the scheduling-clock hardirq

Somewhere in the range from 64 to a few hundred CPUs, the global lock
protecting the try_flip state machine would start sucking air pretty
badly.  But the real problem is synchronize_sched(), which loops through
all the CPUs --  this would likely cause problems at a few tens of
CPUs, perhaps as early as 10-20.

						Thanx, Paul
-

From: Steven Rostedt
Date: Friday, September 21, 2007 - 4:23 pm

--

Yes they do. I'm now at the point that I'm just "trusting" you that you
understand that each of these stages are needed. My IQ level only lets me
understand next -> wait -> done, but not the extra 3 shifts in wait.


It's the app I watch sitting there waiting it's turn for it's callback to

ACKed,

hehe, From someone who's largest box is 4 CPUs, to me 16 CPUS is large.
But I can see hundreds, let alone thousands of CPUs would make a huge
grinding halt on things like synchronize_sched. God, imaging if all CPUs
did that approximately at the same time. The system would should a huge
jitter.

-- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 4:44 pm

In the spirit of full disclosure, I am not -absolutely- certain that

Well, the first time the SGI guys tried to boot a 1024-CPU Altix, I got
an email complaining about RCU overheads.  ;-)  Manfred Spraul fixed
things up for them, though.

							Thanx, Paul
-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 5:26 pm

Covering the pieces that weren't in Peter's reply.  ;-)


My fingers do this without any help from the rest of me, but I suppose


Good point, perhaps a comment block above the enum giving a short
description of the purpose of each state.  Maybe more detail in

I apologize for the repetition in this email.

I apologize for the repetition in this email.

I apologize for the repetition in this email.

Yep, will fix with either #define or static inline, as you suggested


A later patch in the series fixes these -- I believe I got all of them.




s/no pending RCU calls/at least one pending RCU call/, but otherwise
spot on.

So if the RCU grace-period machinery is idle, the first CPU to take
a scheduling-clock interrupt after having posted an RCU callback will

Yep.  The wait-for-acknowledgement cycle is there to handle rcu_read_lock() 
invocations on other CPUs that executed concurrently with the counter

Yep.  Note that rcu_read_lock() might well be able to use its local
.completed counter rather than the global one in rcu_ctrlblk, but
(1) it is not clear that the reduction in cache misses would outweigh
the per-CPU access overhead given that rcu_read_lock() happens a -lot-
more often than does a counter flip and (2) doing this could well require

Yep!  More importantly, we know that no CPU will ever increment the


The only way it could cause a problem would be if there was ever
more than 4,294,967,296 outstanding rcu_read_lock() calls.  I believe
that lockdep screams if it sees more than 30 nested locks within a
single task, so for systems that support no more than 100M tasks, we
should be OK.  It might sometime be necessary to make this be a long

Yep.  Because there are no memory barriers in rcu_read_unlock(), the
CPU is free to reorder the contents of the RCU read-side critical section
to follow the counter decrement.  This means that this CPU would still
be referencing RCU-protected data after it had told the world that it
was no longer doing so.  ...
From: Steven Rostedt
Date: Friday, September 21, 2007 - 6:15 pm

Thanks, so many places in the kernel have acronyms that are just suppose
to be "obvious". I hate them, because they make me feel so stupid when I
don't know what they are. After I find out, I usually slap my forehead and








I said 'no' becaues of this:

+rcu_try_flip_idle(void)
+{
+       int cpu;
+
+       RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
+       if (!rcu_pending(smp_processor_id())) {
+               RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
+               return 0;
+       }

But now I'm a bit more confused. :-/

Looking at the caller in kernel/timer.c I see

	if (rcu_pending(cpu))
		rcu_check_callbacks(cpu, user_tick);

And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
that we call this when we have a pending rcu, but if we have a pending

Sure, why not. More and more and more overkill!!!


And this seem reasonable to me that this would be enough to satisfy a
grace period. But the CPU moving around the rcu_read_(un)lock's around.

Are we sure that adding all these grace periods stages is better than just





That is definitely an accomplishment. And I know as well as you do that it
happened because of a lot of people sharing ideas. Some good, some bad,
but all helpful for heathy development!

-- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 6:53 pm

We don't enter unless there is something for RCU to do (might be a
pending callback, for example, but might also be needing to acknowledge
a counter flip).  If, by the time we get to rcu_try_flip_idle(), there
is no longer anything to do (!rcu_pending()), we bail.

So a given CPU kicks the state machine out of idle only if it -still-
has something to do once it gets to rcu_try_flip_idle(), right?



Good question.  I believe so, because the extra stages don't require
much additional processing, and because the ratio of rcu_read_lock()
calls to the number of grace periods is extremely high.  But, if I
can prove it is safe, I will certainly decrease GP_STAGES or otherwise
optimize the state machine.





Indeed!  The current version is quite a bit different than my early-2005
posting (which relied on locks!), and a -lot- of people had a hand in
making it what it is today.

							Thanx, Paul
-

From: Steven Rostedt
Date: Friday, September 21, 2007 - 8:15 pm

[took off Ingo, because he has my ISP blacklisted, and I'm tired of
getting those return mail messages. He can read LKML or you can re-CC
him. Sad since this is a topic he should be reading. ]

--

Now I can slap my forehead!  Duh, I wasn't seeing that ! in front of the
rcu_pending condition in the rcu_try_flip_idle.  We only flip if we do
indeed have something pending. I need some sleep. I also need to
re-evaluate some of my analysis of that code. But it doesn't change my

But until others besides yourself understand that state machine (doesn't
really need to be me) I would be worried about applying it without
barriers.  The barriers may add a bit of overhead, but it adds some
confidence in the code.  I'm arguing that we have barriers in there until
there's a fine understanding of why we fail with 3 stages and not 4.
Perhaps you don't have a box with enough cpus to fail at 4.

I don't know how the higher ups in the kernel command line feel, but I
think that memory barriers on critical sections are justified. But if you
can show a proof that adding extra stages is sufficient to deal with
CPUS moving memory writes around, then so be it. But I'm still not
convinced that these extra stages are really solving the bug instead of
just making it much less likely to happen.

Ingo praised this code since it had several years of testing in the RT
tree. But that version has barriers, so this new verison without the



True.

-- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 9:07 pm

Fair point...  Though the -rt variant has its shortcomings as well,
such as being unusable from NMI/SMI handlers.

How about this:  I continue running the GP_STAGES=3 run on the pair of
POWER machines (which are both going strong, and I also get a document
together describing the new version (and of course apply the changes we
have discussed, and merge with recent CPU-hotplug changes -- Gautham
Shenoy is currently working this), work out a good answer to "how
big exactly does GP_STAGES need to be", test whatever that number is,
assuming it is neither 3 nor 4, and figure out why the gekko-lp1 machine
choked on GP_STAGES=3.

Then we can work out the best path forward from wherever that ends up
being.

[ . . . ]

						Thanx, Paul
-

From: Steven Rostedt
Date: Friday, September 21, 2007 - 8:20 am

Paul,

Looking further into this, I still think this is a bit of overkill. We
go through 20 states from call_rcu to list->func().

On call_rcu we put our stuff on the next list. Before we move stuff from
next to wait, we need to go through 4 states. So we have

next -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states ->
wait[2] -> 4 states -> wait[3] -> 4 states -> done.

That's 20 states that we go through from the time we add our function to
the list to the time it actually gets called. Do we really need the 4
wait lists?

Seems a bit overkill to me.

What am I missing?

-- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 4:03 pm

"Nothing kills like overkill!!!"  ;-)

Seriously, I do expect to be able to squeeze this down over time, but
feel the need to be a bit on the cowardly side at the moment.

In any case, I will be looking at the scenarios more carefully.  If
it turns out that GP_STAGES can indeed be cranked down a bit, well,
that is an easy change!  I just fired off a POWER run with GP_STAGES
set to 3, will let you know how it goes.

						Thanx, Paul
-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 5:32 pm

The first attempt blew up during boot badly enough that ABAT was unable
to recover the machine (sorry, grahal!!!).  Just for grins, I am trying
it again on a machine that ABAT has had a better record of reviving...

						Thanx, Paul
-

From: Steven Rostedt
Date: Friday, September 21, 2007 - 6:19 pm

--

This still frightens the hell out of me. Going through 15 states and
failing. Seems the CPU is holding off writes for a long long time. That
means we flipped the counter 4 times, and that still wasn't good enough?

Maybe I'll boot up my powerbook to see if it has the same issues.

Well, I'm still finishing up on moving into my new house, so I wont be
available this weekend.

Thanks,

-- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 6:43 pm

Might be that the other machine has its 2.6.22 version of .config messed
up.  I will try booting it on a stock 2.6.22 kernel when it comes back
to life -- not sure I ever did that before.  Besides, the other similar
machine seems to have gone down for the count, but without me torturing
it...

Also, keep in mind that various stages can "record" a memory misordering,

The other machine not only booted, but has survived several minutes of
rcutorture thus far.  I am also trying POWER5 machine as well, as the
one currently running is a POWER4, which is a bit less aggressive about
memory reordering than is the POWER5.

Even if they pass, I refuse to reduce GP_STAGES until proven safe.
Trust me, you -don't- want to be unwittingly making use of a subtely
busted RCU implementation!!!

						Thanx, Paul
-

From: Steven Rostedt
Date: Friday, September 21, 2007 - 7:56 pm

[ sneaks away from the family for a bit to answer emails ]

--

I totally agree. This is the same reason I want to understand -why- it
fails with 3 stages. To make sure that adding a 4th stage really does fix
it, and doesn't just make the chances for the bug smaller.

I just have that funny feeling that we are band-aiding this for POWER with
extra stages and not really solving the bug.

I could be totally out in left field on this. But the more people have a
good understanding of what is happening (this includes why things fail)
the more people in general can trust this code.  Right now I'm thinking
you may be the only one that understands this code enough to trust it. I'm
just wanting you to help people like me to trust the code by understanding
and not just having faith in others.

If you ever decide to give up jogging, we need to make sure that there are
people here that can still fill those running shoes (-:


 -- Steve

-

From: Paul E. McKenney
Date: Friday, September 21, 2007 - 9:10 pm

Or if it really does break, as opposed to my having happened upon a sick

Agreed.  Trusting me is grossly insufficient.  For one thing, the Linux

Well, I certainly don't jog as fast or as far as I used to!  ;-)

						Thanx, Paul
-

From: Oleg Nesterov
Date: Sunday, September 23, 2007 - 10:38 am

Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly,
rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by
rcu_ctrlblk.fliplock


Could you please tell more, why do we need this cli?

It can't "protect" rcu_ctrlblk.completed, and the only change which affects
the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
above. (of course, we must disable preemption while changing rcu_flipctr).

Hmm... this was already discussed... from another message:

	> Critical portions of the GP protection happen in the scheduler-clock
	> interrupt, which is a hardirq.  For example, the .completed counter
	> is always incremented in hardirq context, and we cannot tolerate a
	> .completed increment in this code.  Allowing such an increment would
	> defeat the counter-acknowledge state in the state machine.

Still can't understand, please help! .completed could be incremented by

Isn't it "obvious" that this barrier is not needed? rcu_flipctr could be
change only by this CPU, regardless of the actual value of idx, we can't
read the "wrong" value of rcu_flipctr[idx], no?

OTOH, I can't understand how it works. With ot without local_irq_save(),
another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
at any time, we can see the old value... rcu_try_flip_waitzero() can miss us?

OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
my understanding is correct. Apart from this why GP_STAGES > 1 ???

Well, I think this code is just too tricky for me! Will try to read again

Why? Any code sequence which relies on that?

For example, rcu_check_callbacks does

	if (rcu_ctrlblk.completed == rdp->completed)
		rcu_try_flip();

it is possible that the timer interrupt calls rcu_check_callbacks()
exactly because rcu_pending() sees rcu_flipped, but without rmb() in
between we can see the old value of ...
From: Paul E. McKenney
Date: Sunday, September 23, 2007 - 5:15 pm

Thank you for the kind words, and most especially for the careful review!!!


Definitely are two of them in rcupdate.h, good eyes!  The ones in
rcuclassic.h and rcupreempt.h, by the time all the patches are
applied, are for rcu_check_callbacks_rt().  However, this, along
with the other _rt() cross-calls from rcuclassic.c to rcupreempt.c,


In the case of rcu_flipctr, this is historical -- the implementation in
-rt has only one rcu_data, rather than one per CPU.  I cannot think of
a problem with placing it in rcu_data right off-hand, will look it over
carefully and see.


Looks like it to me, thank you for the tip!

Hmmm...  Why not the same for rcu_data?  I guess because there is
very little sharing?  The only example thus far of sharing would be
if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
critical section were preempted and then started on some other CPU),
though I will need to check more carefully.

Of course, if we start having CPUs access each others' rcu_data structures
to make RCU more power-friendly in dynticks situations, then maybe there
would be more reason to use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data.


Yeah, I guess my explanation -was- a bit lacking...  When I re-read it, it
didn't even help -me- much!  ;-)

I am putting together better documentation, but in the meantime, let me
try again...

	The problem is not with .completed being incremented, but only
	by it (1) being incremented (presumably by some other CPU and
	then (2) having this CPU get a scheduling-clock interrupt, which
	then causes this CPU to prematurely update the rcu_flip_flag.
	This is a problem, because updating rcu_flip_flag is making a
	promise that you won't ever again increment the "old" counter set
	(at least not until the next counter flip).  If the scheduling
	clock interrupt were to happen between the time that we pick
	up the .completed field and the time that we increment our
	counter, we will have broken that promise, and that could ...
From: Oleg Nesterov
Date: Wednesday, September 26, 2007 - 8:13 am

even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared",

Thanks a lot! Actually, this explains most of my questions. I was greatly
confused even before I started to read the code. Looking at rcu_flipctr[2]
I wrongly assumed that these 2 counters "belong" to different GPs. When



Yes, yes, I see now. We really need this barriers, except I think
rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,

	// rcu_try_flip_waitzero()
	if (A == 0) {
		mb();
		B == 0;
	}

Do we really need the mb() in this case? How it is possible that STORE


Hmm. Still can't understand.

Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
(say, another CPU does rcu_try_flip_idle() in between).

We ack the flip, call_rcu() enables irqs, the timer interrupt calls
__rcu_advance_callbacks() again and moves the callbacks.

So, it is still possible that "move callbacks" and "ack the flip" happen
out of order. But why this is bad?

This can't "speedup" the moving of our callbacks from next to done lists.
Yes, RCU state machine can switch to the next state/stage, but this looks
safe to me.

Help!

The last question, rcu_check_callbacks does

	if (rcu_ctrlblk.completed == rdp->completed)
		rcu_try_flip();

Could you clarify the check above? Afaics this is just optimization,
technically it is correct to rcu_try_flip() at any time, even if ->completed
are not synchronized. Most probably in that case rcu_try_flip_waitack() will

oldmask could be obsolete now. Suppose that the admin moves that task to some
cpuset or just changes its ->cpus_allowed while it does synchronize_sched().

I think there is another problem. It would be nice to eliminate taking the global
sched_hotcpu_mutex in sched_setaffinity() (I think without CONFIG_HOTPLUG_CPU
it is not needed right now). In that case sched_setaffinity(0, cpumask_of_cpu(cpu))
can in fact return on the "wrong" CPU ...
From: Paul E. McKenney
Date: Thursday, September 27, 2007 - 8:46 am

Suppose that A was most recently stored by a CPU that shares a store
buffer with this CPU.  Then it is possible that some other CPU sees
the store to B as happening before the store that "A==0" above is
loading from.  This other CPU would naturally conclude that the store
to B must have happened before the load from A.

In more detail, suppose that CPU 0 and 1 share a store buffer, and
that CPU 2 and 3 share a second store buffer.  This happens naturally
if CPUs 0 and 1 are really just different hardware threads within a
single core.

So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
and that the cacheline for B is initially owned by CPUs 0 and 1.
Then consider the following sequence of events:

o	CPU 0 stores zero to A.  This is a cache miss, so the new value
	for A is placed in CPU 0's and 1's store buffer.

o	CPU 1 executes the above code, first loading A.  It sees
	the value of A==0 in the store buffer, and therefore
	stores zero to B, which hits in the cache.  (I am assuming
	that we left out the mb() above).

o	CPU 2 loads from B, which misses the cache, and gets the
	value that CPU 1 stored.  Suppose it checks the value,
	and based on this check, loads A.  The old value of A might
	still be in cache, which would lead CPU 2 to conclude that
	the store to B by CPU 1 must have happened before the store
	to A by CPU 0.

Memory barriers would prevent this confusion.  An intro to store buffers
can be found at http://www.cs.utah.edu/mpv/papers/neiger/fmcad2001.pdf,


Callbacks would be able to be injected into a grace period after it
started.

Or are you arguing that as long as interrupts remain disabled between

Ah -- you are in fact arguing that interrupts remain disabled throughout.
I would still rather that the rcu_flip_seen transition be adjacent
to the callback movement in the code.  My fear is that the connection
might be lost otherwise...  "Oh, but we can just momentarily enable

From a conceptual viewpoint, if this CPU hasn't caught ...
From: Oleg Nesterov
Date: Friday, September 28, 2007 - 7:47 am

Ah, I was confused by the comment,

	smp_mb();  /* Don't call for memory barriers before we see zero. */
	                                             ^^^^^^^^^^^^^^^^^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.

But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called under

Thanks a lot!!! This fills another gap in my understanding.

OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace "CPU 0 stores" with
"CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?

Suppose that CPU 1 does

	wmb();
	B = 0

Can we assume that CPU 2 doing

	if (B == 0) {
		rmb();




Look, what happens is

	// call_rcu()
	rcu_flip_flag = rcu_flipped
	insert the new callback
	// timer irq
	move the callbacks (the new one goes to wait[0])


Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed



I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue,
if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not
interesting for us, we can miss it.

Hmm... but wake_up_process() should be moved under spin_lock().

Oleg.

-

From: Paul E. McKenney
Date: Friday, September 28, 2007 - 11:57 am

Fair point!

Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the

The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,


Yes.  CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
reads following its rmb() must therefore see all of CPU 1's stores

Quite possibly my paranoia -- need to think about this some more.

Guess I need to expand the code-level portion of the document to cover
the grace-period side.  That would force me either to get my explanation

The other approach would be to simply have a separate thread for this
purpose.  Batching would amortize the overhead (a single trip around the
CPUs could satisfy an arbitrarily large number of synchronize_sched()
requests).

						Thanx, Paul
-

From: Oleg Nesterov
Date: Sunday, September 30, 2007 - 9:31 am

Yes, but still I think this mb() is not necessary. Becasue we don't need
the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor issue,

Ah, but I asked the different question. We must see CPU 1's stores by
definition, but what about CPU 0's stores (which could be seen by CPU 1)?

Let's take a "real life" example,

                A = B = X = 0;
                P = Q = &A;

CPU_0           CPU_1           CPU_2

P = &B;         *P = 1;         if (X) {
                wmb();                  rmb();
                X = 1;                  BUG_ON(*P != 1 && *Q != 1);
                                }


Yes, this way we don't need to uglify migration_thread(). OTOH, we need
another kthread ;)

Oleg.

-

From: Davide Libenzi
Date: Sunday, September 30, 2007 - 4:02 pm

That can't be. CPU_2 sees X=1, that happened after (or same time at most - 
from a cache inv. POV) to *P=1, that must have happened after P=&B (in 
order for *P to assign B). So P=&B happened, from a pure time POV, before 
the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.



- Davide


-

From: Paul E. McKenney
Date: Sunday, September 30, 2007 - 6:37 pm

Actually, CPU designers have to go quite a ways out of their way to
prevent this BUG_ON from happening.  One way that it would happen
naturally would be if the cache line containing P were owned by CPU 2,
and if CPUs 0 and 1 shared a store buffer that they both snooped.  So,
here is what could happen given careless or sadistic CPU designers:

o	CPU 0 stores &B to P, but misses the cache, so puts the
	result in the store buffer.  This means that only CPUs 0 and 1
	can see it.

o	CPU 1 fetches P, and sees &B, so stores a 1 to B.  Again,
	this value for P is visible only to CPUs 0 and 1.

o	CPU 1 executes a wmb(), which forces CPU 1's stores to happen
	in order.  But it does nothing about CPU 0's stores, nor about CPU
	1's loads, for that matter (and the only reason that POWER ends
	up working the way you would like is because wmb() turns into
	"sync" rather than the "eieio" instruction that would have been
	used for smp_wmb() -- which is maybe what Oleg was thinking of,
	but happened to abbreviate.  If my analysis is buggy, Anton and
	Paulus will no doubt correct me...)

o	CPU 1 stores to X.

o	CPU 2 loads X, and sees that the value is 1.

o	CPU 2 does an rmb(), which orders its loads, but does nothing
	about anyone else's loads or stores.

o	CPU 2 fetches P from its cached copy, which still points to A,
	which is still zero.  So the BUG_ON fires.

o	Some time later, CPU 0 gets the cache line containing P from
	CPU 2, and updates it from the value in the store buffer, but
	too late...

Unfortunately, cache-coherence protocols don't care much about pure
time...  It is possible to make a 16-CPU machine believe that a single
variable has more than ten different values -at- -the- -same- -time-.
This is easy to do -- have all the CPUs store different values to the
same variable at the same time, then reload, collecting timestamps
between each pair of operations.  On a large SMP, the values will sit
in the store buffers for many hundreds of nanoseconds, perhaps ...
From: Davide Libenzi
Date: Monday, October 1, 2007 - 11:44 am

Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted
that it was a cross-CPU sync point (ala read_barrier_depends). If that's a

If a store buffer is shared between CPU_0 and CPU_1, it is very likely 
that a sync done on CPU_1 is going to sync even CPU_0 stores that are 
held in the buffer at the time of CPU_1's sync.



- Davide


-

From: Paul E. McKenney
Date: Monday, October 1, 2007 - 12:21 pm

That would indeed be one approach that CPU designers could take to
avoid being careless or sadistic.  ;-)

Another approach would be to make CPU 1 refrain from snooping CPU 0's
entries in the shared store queue.

							Thanx, Paul
-

From: Davide Libenzi
Date: Monday, October 1, 2007 - 3:09 pm

That'd be the easier (unique maybe) approach too for them, from an silicon 
complexity POV. Distinguishing between different CPUs stores once inside a 
shared store buffer, would require tagging them in some way. That'd defeat 
most of the pros of having a shared store buffer ;)



- Davide


-

From: Paul E. McKenney
Date: Monday, October 1, 2007 - 3:24 pm

Tagging requires but one bit per entry.  Depends on the workload -- if
lots of barriers, bursty stores and little sharing, tagging might win.
If lots of sharing, then your suggested approach might win.

						Thanx, Paul
-

From: Oleg Nesterov
Date: Tuesday, October 2, 2007 - 11:02 am

Davide, Paul, thank you very much! I've been wondering about this for the
long time, now I know the answer. Great.

Oleg.

-

From: Paul E. McKenney
Date: Sunday, September 30, 2007 - 6:20 pm

It depends.  ;-)

o	Itanium: because both wmb() and rmb() map to the "mf"
	instruction, and because "mf" instructions map to a
	single global order, the BUG_ON cannot happen.  (But
	I could easily be mistaken -- I cannot call myself an
	Itanium memory-ordering expert.)  See:

	ftp://download.intel.com/design/Itanium/Downloads/25142901.pdf
	
	for the official story.

o	POWER: because wmb() maps to the "sync" instruction,
	cumulativity applies, so that any instruction provably
	following "X = 1" will see "P = &B" if the "*P = 1"
	statement saw it.  So the BUG_ON cannot happen.

o	i386: memory ordering respects transitive visibility,
	which seems to be similar to POWER's cumulativity
	(http://developer.intel.com/products/processor/manuals/318147.pdf),
	so the BUG_ON cannot happen.

o	x86_64: same as i386.

o	s390: the basic memory-ordering model is tight enough that the
	BUG_ON cannot happen.  (If I am confused about this, the s390
	guys will not be shy about correcting me!)


True enough!!!

							Thanx, Paul
-

From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:35 am

Work in progress, not for inclusion.

The combination of CPU hotplug and PREEMPT_RCU has resulted in deadlocks
due to the migration-based implementation of synchronize_sched() in -rt.
This experimental patch maps synchronize_sched() back onto Classic RCU,
eliminating the migration, thus hopefully also eliminating the deadlocks.
It is not clear that this is a good long-term approach, but it will at
least permit people doing CPU hotplug in -rt kernels additional wiggle
room in their design and implementation.

The basic approach is to cause the -rt kernel to incorporate rcuclassic.c
as well as rcupreempt.c, but to #ifdef out the conflicting portions of
rcuclassic.c so that only the code needed to implement synchronize_sched()
remains in a PREEMPT_RT build.  Invocations of grace-period detection from
the scheduling-clock interrupt go to rcuclassic.c, which then invokes
the corresponding functions in rcupreempt.c (with _rt suffix added to
keep the linker happy).  Also applies the RCU_SOFTIRQ to classic RCU.
The bulk of this patch just moves code around, but likely increases
scheduling-clock latency.

If this patch does turn out to be the right approach, the #ifdefs in
kernel/rcuclassic.c might be dealt with.  ;-)  At current writing, Gautham
Shenoy's most recent CPU-hotplug fixes seem likely to obsolete this patch
(which would be a very good thing indeed!).  If this really pans out,
this portion of the patch will vanish during the forward-porting process.

Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/rcuclassic.h |   79 +++++--------------------------------
 include/linux/rcupdate.h   |   30 ++++++++++++--
 include/linux/rcupreempt.h |   27 ++++++------
 kernel/Makefile            |    2 
 kernel/rcuclassic.c        |   95 ++++++++++++++++++++++++++++++++++++---------
 kernel/rcupdate.c          |   22 ++++++++--
 kernel/rcupreempt.c        |   50 +++++------------------
 ...
From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:36 am

Work in progress, not for inclusion.

This patch allows preemptible RCU to tolerate CPU-hotplug operations.
It accomplishes this by maintaining a local copy of a map of online 
CPUs, which it accesses under its own lock.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/rcuclassic.h |    2 
 include/linux/rcupreempt.h |    2 
 kernel/rcuclassic.c        |    8 +++
 kernel/rcupreempt.c        |   93 +++++++++++++++++++++++++++++++++++++++++++--
 4 files changed, 100 insertions(+), 5 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h
--- linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h	2007-08-22 15:38:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h	2007-08-22 15:55:39.000000000 -0700
@@ -143,6 +143,8 @@ extern int rcu_needs_cpu(int cpu);
 #define rcu_check_callbacks_rt(cpu, user)
 #define rcu_init_rt()
 #define rcu_needs_cpu_rt(cpu) 0
+#define rcu_offline_cpu_rt(cpu)
+#define rcu_online_cpu_rt(cpu)
 #define rcu_pending_rt(cpu) 0
 #define rcu_process_callbacks_rt(unused)
 
diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h
--- linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h	2007-08-22 15:38:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h	2007-08-22 15:55:39.000000000 -0700
@@ -59,6 +59,8 @@ extern void rcu_advance_callbacks_rt(int
 extern void rcu_check_callbacks_rt(int cpu, int user);
 extern void rcu_init_rt(void);
 extern int  rcu_needs_cpu_rt(int cpu);
+extern void rcu_offline_cpu_rt(int cpu);
+extern void rcu_online_cpu_rt(int cpu);
 extern int  rcu_pending_rt(int cpu);
 struct softirq_action;
 extern void rcu_process_callbacks_rt(struct softirq_action *unused);
diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/kernel/rcuclassic.c linux-2.6.22-e-hotplugcpu/kernel/rcuclassic.c
--- ...
From: Oleg Nesterov
Date: Sunday, September 30, 2007 - 9:38 am

Imho, all these barriers are unneeded and confusing, we can't do them on behalf
of a dead CPU anyway. Can't we just do

	per_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
	per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
?

Why can't we also do

	__get_cpu_var(rcu_flipctr)[0] += per_cpu(rcu_flipctr, cpu)[0];
	per_cpu(rcu_flipctr, cpu)[0] = 0;	
	__get_cpu_var(rcu_flipctr)[1] += per_cpu(rcu_flipctr, cpu)[1];
	per_cpu(rcu_flipctr, cpu)[1] = 0;	

? This way rcu_try_flip_waitzero() can also use rcu_cpu_online_map. This cpu
is dead, nobody can modify per_cpu(rcu_flipctr, cpu). And we can't confuse

What if _cpu_up() fails? I think rcu_cpu_notify(CPU_UP_CANCELED) should call
rcu_offline_cpu_rt() too.

Oleg.

-

From: Paul E. McKenney
Date: Sunday, September 30, 2007 - 6:41 pm

You are likely correct, but this is a slow path, extremely hard to

Very good point!!!  This would reduce latencies on systems where
the number of possible CPUs greatly exceeds that of the number of

Good catch, will fix!!!

							Thanx, Paul
-

From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:39 am

Work in progress, not for inclusion.

RCU priority boosting is needed when running a workload that might include
CPU-bound user tasks running at realtime priorities with preemptible RCU.
In this situation, RCU priority boosting is needed to avoid OOM.

Please note that because Classic RCU does not permit RCU read-side
critical sections to be preempted, there is no need to boost the priority
of Classic RCU readers.  Boosting the priority of a running process
does not make it run any faster, at least not on any hardware that I am
aware of.  ;-)

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/init_task.h  |   13 
 include/linux/rcupdate.h   |   17 +
 include/linux/rcupreempt.h |   20 +
 include/linux/sched.h      |   24 +
 init/main.c                |    1 
 kernel/Kconfig.preempt     |   14 -
 kernel/fork.c              |    6 
 kernel/rcupreempt.c        |  608 ++++++++++++++++++++++++++++++++++++++++++---
 kernel/rtmutex.c           |    7 
 kernel/sched.c             |    5 
 lib/Kconfig.debug          |   34 ++
 11 files changed, 703 insertions(+), 46 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/init_task.h linux-2.6.22-F-boostrcu/include/linux/init_task.h
--- linux-2.6.22-E-hotplug/include/linux/init_task.h	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/include/linux/init_task.h	2007-08-31 14:09:02.000000000 -0700
@@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy;
 	.signalfd_list	= LIST_HEAD_INIT(sighand.signalfd_list),	\
 }
 
+#ifdef CONFIG_PREEMPT_RCU_BOOST
+#define INIT_RCU_BOOST_PRIO .rcu_prio	= MAX_PRIO,
+#define INIT_PREEMPT_RCU_BOOST(tsk)					\
+	.rcub_rbdp	= NULL,						\
+	.rcub_state	= RCU_BOOST_IDLE,				\
+	.rcub_entry	= LIST_HEAD_INIT(tsk.rcub_entry),
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+#define INIT_RCU_BOOST_PRIO
+#define INIT_PREEMPT_RCU_BOOST(tsk)
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
 extern struct group_info init_groups;
 
 #define ...
From: Gautham R Shenoy
Date: Friday, September 28, 2007 - 3:56 pm

Hi Paul, 

Some silly doubts.


I am wondering, can a task block on a lock while in RCU read-side

Event #3. is missing?


The state change from RCU_BOOST_BLOCKED to RCU_BOOSTED is not done by

-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Steven Rostedt
Date: Friday, September 28, 2007 - 4:05 pm

--

I think this may be specific to the -rt patch. In the -rt patch,
spin_locks turn into mutexes, and therefor can block a read-side critical

I guess Paul needs to answer that one ;-)

-- Steve

-

From: Paul E. McKenney
Date: Saturday, September 29, 2007 - 8:11 pm

An older version had three, the new one has two, and I forgot to s/three/two/.

							Thanx, Paul
-

From: Gautham R Shenoy
Date: Friday, October 5, 2007 - 4:46 am

On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:


Why is this masking required? When we increment 
the rcu_boost_idx in rcu_booster, we do perform a modulo operation 

-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Steven Rostedt
Date: Friday, October 5, 2007 - 5:24 am

Because we are not masking rcu_boost_idx, we are masking
 (rcu_boost_idx + 1) which may extend the bounderies of
RCU_BOOST_ELEMENTS.

-- Steve
-

From: Gautham R Shenoy
Date: Friday, October 5, 2007 - 6:21 am

Thanks! 

But I'm still trying to understand why the (increment + masking)
is required at all.

The thread(producer) that requires boosting is added to the element
with index rcu_boost_idx.

The booster thread(consumer) increments the rcu_boost_idx to 
(rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS, before it fetches the least
recently used rcu_boot_dat elements and boost the eligible tasks queued
in that element.

So, can't we just return per_cpu(rcu_boost_dat, cpu)[rcu_boost_idx] from
rcu_rbd_boosting(cpu) ? Isn't that already the least recently used

Thanks and Regards
gautham.
-- 
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-

From: Paul E. McKenney
Date: Friday, October 5, 2007 - 7:07 am

Good catch -- we need to advance the index -after- boosting, so that
new sleeping tasks are not immediately dropped on the to-be-boosted
list.  Will fix!

(Non-fatal -- but means that the algorithm is effectively only using
three elements of the four-element array, so does need to be fixed.)

						Thanx, Paul
-

From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:39 am

Work in progress, not for inclusion.  Still uses xtime because this
patch is still against 2.6.22.

This patch modifies rcutorture to also torture RCU priority boosting.
The torturing involves forcing RCU read-side critical sections (already
performed as part of the torturing of RCU) to run for extremely long
time periods, increasing the probability of their being preempted and
thus needing priority boosting.  The fact that rcutorture's "nreaders"
module parameter defaults to twice the number of CPUs helps ensure lots
of the needed preemption.

To cause the torturing to be fully effective in -mm, run in presence
of CPU-hotplug operations.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 rcutorture.c |   91 +++++++++++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 77 insertions(+), 14 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-f-boostrcu/kernel/rcutorture.c linux-2.6.22-g-boosttorture/kernel/rcutorture.c
--- linux-2.6.22-f-boostrcu/kernel/rcutorture.c	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-g-boosttorture/kernel/rcutorture.c	2007-09-09 16:58:35.000000000 -0700
@@ -58,6 +58,7 @@ static int stat_interval;	/* Interval be
 static int verbose;		/* Print more debug info. */
 static int test_no_idle_hz;	/* Test RCU's support for tickless idle CPUs. */
 static int shuffle_interval = 5; /* Interval between shuffles (in sec)*/
+static int preempt_torture;	/* Realtime task preempts torture readers. */
 static char *torture_type = "rcu"; /* What RCU implementation to torture. */
 
 module_param(nreaders, int, 0444);
@@ -72,6 +73,8 @@ module_param(test_no_idle_hz, bool, 0444
 MODULE_PARM_DESC(test_no_idle_hz, "Test support for tickless idle CPUs");
 module_param(shuffle_interval, int, 0444);
 MODULE_PARM_DESC(shuffle_interval, "Number of seconds between shuffles");
+module_param(preempt_torture, bool, 0444);
+MODULE_PARM_DESC(preempt_torture, "Enable realtime preemption torture");
 module_param(torture_type, charp, 0444);
 ...
From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:41 am

Work in progress, not for inclusion.

This patch modified the RCU priority booster to explicitly sleep when
there are no RCU readers in need of priority boosting.  This should be
a power-consumption improvement over the one-second polling cycle in
the underlying RCU priority-boosting patch.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 include/linux/rcupreempt.h |   15 ++++++
 kernel/rcupreempt.c        |  102 ++++++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 115 insertions(+), 2 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h
--- linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h	2007-08-24 11:24:59.000000000 -0700
+++ linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h	2007-08-24 18:12:41.000000000 -0700
@@ -60,6 +60,21 @@ enum rcu_boost_state {
 
 #define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1)
 
+/*
+ * RCU-booster state with respect to sleeping.  The RCU booster
+ * sleeps when no task has recently been seen sleeping in an RCU
+ * read-side critical section, and is awakened when a new sleeper
+ * appears.
+ */
+enum rcu_booster_state {
+	RCU_BOOSTER_ACTIVE = 0,   /* RCU booster actively scanning. */
+	RCU_BOOSTER_DROWSY = 1,   /* RCU booster is considering sleeping. */
+	RCU_BOOSTER_SLEEPING = 2, /* RCU booster is asleep. */
+	RCU_BOOSTER_INVALID = 3,  /* For bogus state sightings. */
+};
+
+#define N_RCU_BOOSTER_STATE (RCU_BOOSTER_INVALID + 1)
+
 #endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
 
 #define call_rcu_bh(head, rcu) call_rcu(head, rcu)
diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/kernel/rcupreempt.c linux-2.6.22-H-boostsleep/kernel/rcupreempt.c
--- linux-2.6.22-G-boosttorture/kernel/rcupreempt.c	2007-08-27 15:42:57.000000000 -0700
+++ linux-2.6.22-H-boostsleep/kernel/rcupreempt.c	2007-08-27 15:42:37.000000000 -0700
@@ -108,6 +108,7 @@ struct rcu_boost_dat {
 	unsigned long rbs_unboosted;
 #ifdef ...
From: Paul E. McKenney
Date: Monday, September 10, 2007 - 11:42 am

Work in progress, not for inclusion.

This patch updates the RCU documentation to reflect preemptible RCU as
well as recent publications.  Fix an incorrect comment in the code. 
Change the name ORDERED_WRT_IRQ() to ACCESS_ONCE() to better describe
its function.

Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---

 Documentation/RCU/RTFP.txt    |  234 ++++++++++++++++++++++++++++++++++++++++--
 Documentation/RCU/rcu.txt     |   20 +++
 Documentation/RCU/torture.txt |   44 ++++++-
 kernel/Kconfig.preempt        |    1 
 kernel/rcupreempt.c           |   24 ++--
 5 files changed, 292 insertions(+), 31 deletions(-)

diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt
--- linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt	2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt	2007-09-08 16:54:17.000000000 -0700
@@ -36,6 +36,14 @@ o	How can the updater tell when a grace 
 	executed in user mode, or executed in the idle loop, we can
 	safely free up that item.
 
+	Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
+	same effect, but require that the readers manipulate CPU-local
+	counters.  These counters allow limited types of blocking
+	within RCU read-side critical sections.  SRCU also uses
+	CPU-local counters, and permits general blocking within
+	RCU read-side critical sections.  These two variants of
+	RCU detect grace periods by sampling these counters.
+
 o	If I am running on a uniprocessor kernel, which can only do one
 	thing at a time, why should I wait for a grace period?
 
@@ -46,7 +54,10 @@ o	How can I see where RCU is currently u
 	Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
 	"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
 	"srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
-	"synchronize_net", and "synchronize_srcu".
+	"synchronize_net", "synchronize_srcu", and the other RCU
+	primitives.  Or grab ...
From: Ingo Molnar
Date: Monday, September 10, 2007 - 11:44 am

cool! Now after 2 years of development and testing i think this is one 
of the most mature patchsets on lkml - so i'd like to see this 
designated for potential upstream inclusion. I.e. everyone who can see 
some bug, please holler now.

	Ingo
-

Previous thread: [PATCH] Hookup group-scheduler with task container infrastructure by Srivatsa Vaddagiri on Monday, September 10, 2007 - 10:10 am. (30 messages)

Next thread: [PATCH] dcache: trivial comment fix by J. Bruce Fields on Monday, September 10, 2007 - 11:46 am. (2 messages)