Work in progress, still not for inclusion. But code now complete!
This is a respin of the following prior posting:
http://lkml.org/lkml/2007/9/5/268
This release adds an additional patch that adds fixes to comments and RCU
documentation, along with one macro being renamed. The rcutorture patch
has a modification to make it a bit more vicious to priority boosting
(though the current design relies on -rt latencies for much of the
priority-boost torturing effectiveness in this case -- run the test
in presence of CPU hotplug operations to get the same effect in -mm).
Next step is rebasing this to a more recent version of Linux.Thanx, Paul
-
cool! Now after 2 years of development and testing i think this is one
of the most mature patchsets on lkml - so i'd like to see this
designated for potential upstream inclusion. I.e. everyone who can see
some bug, please holler now.Ingo
-
Work in progress, not for inclusion.
This patch updates the RCU documentation to reflect preemptible RCU as
well as recent publications. Fix an incorrect comment in the code.
Change the name ORDERED_WRT_IRQ() to ACCESS_ONCE() to better describe
its function.Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---Documentation/RCU/RTFP.txt | 234 ++++++++++++++++++++++++++++++++++++++++--
Documentation/RCU/rcu.txt | 20 +++
Documentation/RCU/torture.txt | 44 ++++++-
kernel/Kconfig.preempt | 1
kernel/rcupreempt.c | 24 ++--
5 files changed, 292 insertions(+), 31 deletions(-)diff -urpNa -X dontdiff linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt
--- linux-2.6.22-h-boostsleep/Documentation/RCU/rcu.txt 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-i-cleanups/Documentation/RCU/rcu.txt 2007-09-08 16:54:17.000000000 -0700
@@ -36,6 +36,14 @@ o How can the updater tell when a grace
executed in user mode, or executed in the idle loop, we can
safely free up that item.+ Preemptible variants of RCU (CONFIG_PREEMPT_RCU) get the
+ same effect, but require that the readers manipulate CPU-local
+ counters. These counters allow limited types of blocking
+ within RCU read-side critical sections. SRCU also uses
+ CPU-local counters, and permits general blocking within
+ RCU read-side critical sections. These two variants of
+ RCU detect grace periods by sampling these counters.
+
o If I am running on a uniprocessor kernel, which can only do one
thing at a time, why should I wait for a grace period?@@ -46,7 +54,10 @@ o How can I see where RCU is currently u
Search for "rcu_read_lock", "rcu_read_unlock", "call_rcu",
"rcu_read_lock_bh", "rcu_read_unlock_bh", "call_rcu_bh",
"srcu_read_lock", "srcu_read_unlock", "synchronize_rcu",
- "synchronize_net", and "synchronize_srcu".
+ "synchronize_net", "synchronize_srcu", and the other RCU
+ primitives. Or grab ...
Work in progress, not for inclusion.
This patch modified the RCU priority booster to explicitly sleep when
there are no RCU readers in need of priority boosting. This should be
a power-consumption improvement over the one-second polling cycle in
the underlying RCU priority-boosting patch.Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---include/linux/rcupreempt.h | 15 ++++++
kernel/rcupreempt.c | 102 ++++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 115 insertions(+), 2 deletions(-)diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h
--- linux-2.6.22-G-boosttorture/include/linux/rcupreempt.h 2007-08-24 11:24:59.000000000 -0700
+++ linux-2.6.22-H-boostsleep/include/linux/rcupreempt.h 2007-08-24 18:12:41.000000000 -0700
@@ -60,6 +60,21 @@ enum rcu_boost_state {#define N_RCU_BOOST_STATE (RCU_BOOST_INVALID + 1)
+/*
+ * RCU-booster state with respect to sleeping. The RCU booster
+ * sleeps when no task has recently been seen sleeping in an RCU
+ * read-side critical section, and is awakened when a new sleeper
+ * appears.
+ */
+enum rcu_booster_state {
+ RCU_BOOSTER_ACTIVE = 0, /* RCU booster actively scanning. */
+ RCU_BOOSTER_DROWSY = 1, /* RCU booster is considering sleeping. */
+ RCU_BOOSTER_SLEEPING = 2, /* RCU booster is asleep. */
+ RCU_BOOSTER_INVALID = 3, /* For bogus state sightings. */
+};
+
+#define N_RCU_BOOSTER_STATE (RCU_BOOSTER_INVALID + 1)
+
#endif /* #ifdef CONFIG_PREEMPT_RCU_BOOST */#define call_rcu_bh(head, rcu) call_rcu(head, rcu)
diff -urpNa -X dontdiff linux-2.6.22-G-boosttorture/kernel/rcupreempt.c linux-2.6.22-H-boostsleep/kernel/rcupreempt.c
--- linux-2.6.22-G-boosttorture/kernel/rcupreempt.c 2007-08-27 15:42:57.000000000 -0700
+++ linux-2.6.22-H-boostsleep/kernel/rcupreempt.c 2007-08-27 15:42:37.000000000 -0700
@@ -108,6 +108,7 @@ struct rcu_boost_dat {
unsigned long rbs_unboosted;
#ifdef CONFIG_PR...
Work in progress, not for inclusion. Still uses xtime because this
patch is still against 2.6.22.This patch modifies rcutorture to also torture RCU priority boosting.
The torturing involves forcing RCU read-side critical sections (already
performed as part of the torturing of RCU) to run for extremely long
time periods, increasing the probability of their being preempted and
thus needing priority boosting. The fact that rcutorture's "nreaders"
module parameter defaults to twice the number of CPUs helps ensure lots
of the needed preemption.To cause the torturing to be fully effective in -mm, run in presence
of CPU-hotplug operations.Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---rcutorture.c | 91 +++++++++++++++++++++++++++++++++++++++++++++++++----------
1 file changed, 77 insertions(+), 14 deletions(-)diff -urpNa -X dontdiff linux-2.6.22-f-boostrcu/kernel/rcutorture.c linux-2.6.22-g-boosttorture/kernel/rcutorture.c
--- linux-2.6.22-f-boostrcu/kernel/rcutorture.c 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-g-boosttorture/kernel/rcutorture.c 2007-09-09 16:58:35.000000000 -0700
@@ -58,6 +58,7 @@ static int stat_interval; /* Interval be
static int verbose; /* Print more debug info. */
static int test_no_idle_hz; /* Test RCU's support for tickless idle CPUs. */
static int shuffle_interval = 5; /* Interval between shuffles (in sec)*/
+static int preempt_torture; /* Realtime task preempts torture readers. */
static char *torture_type = "rcu"; /* What RCU implementation to torture. */module_param(nreaders, int, 0444);
@@ -72,6 +73,8 @@ module_param(test_no_idle_hz, bool, 0444
MODULE_PARM_DESC(test_no_idle_hz, "Test support for tickless idle CPUs");
module_param(shuffle_interval, int, 0444);
MODULE_PARM_DESC(shuffle_interval, "Number of seconds between shuffles");
+module_param(preempt_torture, bool, 0444);
+MODULE_PARM_DESC(preempt_torture, "Enable realtime preemption torture");
module_param(torture_type, charp, 0444);
MO...
Work in progress, not for inclusion.
RCU priority boosting is needed when running a workload that might include
CPU-bound user tasks running at realtime priorities with preemptible RCU.
In this situation, RCU priority boosting is needed to avoid OOM.Please note that because Classic RCU does not permit RCU read-side
critical sections to be preempted, there is no need to boost the priority
of Classic RCU readers. Boosting the priority of a running process
does not make it run any faster, at least not on any hardware that I am
aware of. ;-)Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---include/linux/init_task.h | 13
include/linux/rcupdate.h | 17 +
include/linux/rcupreempt.h | 20 +
include/linux/sched.h | 24 +
init/main.c | 1
kernel/Kconfig.preempt | 14 -
kernel/fork.c | 6
kernel/rcupreempt.c | 608 ++++++++++++++++++++++++++++++++++++++++++---
kernel/rtmutex.c | 7
kernel/sched.c | 5
lib/Kconfig.debug | 34 ++
11 files changed, 703 insertions(+), 46 deletions(-)diff -urpNa -X dontdiff linux-2.6.22-E-hotplug/include/linux/init_task.h linux-2.6.22-F-boostrcu/include/linux/init_task.h
--- linux-2.6.22-E-hotplug/include/linux/init_task.h 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-F-boostrcu/include/linux/init_task.h 2007-08-31 14:09:02.000000000 -0700
@@ -87,6 +87,17 @@ extern struct nsproxy init_nsproxy;
.signalfd_list = LIST_HEAD_INIT(sighand.signalfd_list), \
}+#ifdef CONFIG_PREEMPT_RCU_BOOST
+#define INIT_RCU_BOOST_PRIO .rcu_prio = MAX_PRIO,
+#define INIT_PREEMPT_RCU_BOOST(tsk) \
+ .rcub_rbdp = NULL, \
+ .rcub_state = RCU_BOOST_IDLE, \
+ .rcub_entry = LIST_HEAD_INIT(tsk.rcub_entry),
+#else /* #ifdef CONFIG_PREEMPT_RCU_BOOST */
+#define INIT_RCU_BOOST_PRIO
+#define INIT_PREEMPT_RCU_BOOST(tsk)
+#endif /* #else #ifdef CONFIG_PREEMPT_RCU_BOOST */
+
extern struct group_info init_groups;#define...
On Mon, Sep 10, 2007 at 11:39:01AM -0700, Paul E. McKenney wrote:
Why is this masking required? When we increment
the rcu_boost_idx in rcu_booster, we do perform a modulo operation--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-
Because we are not masking rcu_boost_idx, we are masking
(rcu_boost_idx + 1) which may extend the bounderies of
RCU_BOOST_ELEMENTS.-- Steve
-
Thanks!
But I'm still trying to understand why the (increment + masking)
is required at all.The thread(producer) that requires boosting is added to the element
with index rcu_boost_idx.The booster thread(consumer) increments the rcu_boost_idx to
(rcu_boost_idx + 1) % RCU_BOOST_ELEMENTS, before it fetches the least
recently used rcu_boot_dat elements and boost the eligible tasks queued
in that element.So, can't we just return per_cpu(rcu_boost_dat, cpu)[rcu_boost_idx] from
rcu_rbd_boosting(cpu) ? Isn't that already the least recently usedThanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-
Good catch -- we need to advance the index -after- boosting, so that
new sleeping tasks are not immediately dropped on the to-be-boosted
list. Will fix!(Non-fatal -- but means that the algorithm is effectively only using
three elements of the four-element array, so does need to be fixed.)Thanx, Paul
-
Hi Paul,
Some silly doubts.
I am wondering, can a task block on a lock while in RCU read-side
Event #3. is missing?
The state change from RCU_BOOST_BLOCKED to RCU_BOOSTED is not done by
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"
-
--
I think this may be specific to the -rt patch. In the -rt patch,
spin_locks turn into mutexes, and therefor can block a read-side criticalI guess Paul needs to answer that one ;-)
-- Steve
-
An older version had three, the new one has two, and I forgot to s/three/two/.
Thanx, Paul
-
Work in progress, not for inclusion.
This patch allows preemptible RCU to tolerate CPU-hotplug operations.
It accomplishes this by maintaining a local copy of a map of online
CPUs, which it accesses under its own lock.Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---include/linux/rcuclassic.h | 2
include/linux/rcupreempt.h | 2
kernel/rcuclassic.c | 8 +++
kernel/rcupreempt.c | 93 +++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 100 insertions(+), 5 deletions(-)diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h
--- linux-2.6.22-d-schedclassic/include/linux/rcuclassic.h 2007-08-22 15:38:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/include/linux/rcuclassic.h 2007-08-22 15:55:39.000000000 -0700
@@ -143,6 +143,8 @@ extern int rcu_needs_cpu(int cpu);
#define rcu_check_callbacks_rt(cpu, user)
#define rcu_init_rt()
#define rcu_needs_cpu_rt(cpu) 0
+#define rcu_offline_cpu_rt(cpu)
+#define rcu_online_cpu_rt(cpu)
#define rcu_pending_rt(cpu) 0
#define rcu_process_callbacks_rt(unused)diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h
--- linux-2.6.22-d-schedclassic/include/linux/rcupreempt.h 2007-08-22 15:38:22.000000000 -0700
+++ linux-2.6.22-e-hotplugcpu/include/linux/rcupreempt.h 2007-08-22 15:55:39.000000000 -0700
@@ -59,6 +59,8 @@ extern void rcu_advance_callbacks_rt(int
extern void rcu_check_callbacks_rt(int cpu, int user);
extern void rcu_init_rt(void);
extern int rcu_needs_cpu_rt(int cpu);
+extern void rcu_offline_cpu_rt(int cpu);
+extern void rcu_online_cpu_rt(int cpu);
extern int rcu_pending_rt(int cpu);
struct softirq_action;
extern void rcu_process_callbacks_rt(struct softirq_action *unused);
diff -urpNa -X dontdiff linux-2.6.22-d-schedclassic/kernel/rcuclassic.c linux-2.6.22-e-hotplugcpu/kernel/rcuclassic.c
--- linux-2.6.2...
Imho, all these barriers are unneeded and confusing, we can't do them on behalf
of a dead CPU anyway. Can't we just doper_cpu(rcu_mb_flag, cpu) = rcu_mb_done;
per_cpu(rcu_flip_flag, cpu) = rcu_flip_seen;
?Why can't we also do
__get_cpu_var(rcu_flipctr)[0] += per_cpu(rcu_flipctr, cpu)[0];
per_cpu(rcu_flipctr, cpu)[0] = 0;
__get_cpu_var(rcu_flipctr)[1] += per_cpu(rcu_flipctr, cpu)[1];
per_cpu(rcu_flipctr, cpu)[1] = 0;? This way rcu_try_flip_waitzero() can also use rcu_cpu_online_map. This cpu
is dead, nobody can modify per_cpu(rcu_flipctr, cpu). And we can't confuseWhat if _cpu_up() fails? I think rcu_cpu_notify(CPU_UP_CANCELED) should call
rcu_offline_cpu_rt() too.Oleg.
-
You are likely correct, but this is a slow path, extremely hard to
Very good point!!! This would reduce latencies on systems where
the number of possible CPUs greatly exceeds that of the number ofGood catch, will fix!!!
Thanx, Paul
-
Work in progress, not for inclusion.
The combination of CPU hotplug and PREEMPT_RCU has resulted in deadlocks
due to the migration-based implementation of synchronize_sched() in -rt.
This experimental patch maps synchronize_sched() back onto Classic RCU,
eliminating the migration, thus hopefully also eliminating the deadlocks.
It is not clear that this is a good long-term approach, but it will at
least permit people doing CPU hotplug in -rt kernels additional wiggle
room in their design and implementation.The basic approach is to cause the -rt kernel to incorporate rcuclassic.c
as well as rcupreempt.c, but to #ifdef out the conflicting portions of
rcuclassic.c so that only the code needed to implement synchronize_sched()
remains in a PREEMPT_RT build. Invocations of grace-period detection from
the scheduling-clock interrupt go to rcuclassic.c, which then invokes
the corresponding functions in rcupreempt.c (with _rt suffix added to
keep the linker happy). Also applies the RCU_SOFTIRQ to classic RCU.
The bulk of this patch just moves code around, but likely increases
scheduling-clock latency.If this patch does turn out to be the right approach, the #ifdefs in
kernel/rcuclassic.c might be dealt with. ;-) At current writing, Gautham
Shenoy's most recent CPU-hotplug fixes seem likely to obsolete this patch
(which would be a very good thing indeed!). If this really pans out,
this portion of the patch will vanish during the forward-porting process.Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ)
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---include/linux/rcuclassic.h | 79 +++++--------------------------------
include/linux/rcupdate.h | 30 ++++++++++++--
include/linux/rcupreempt.h | 27 ++++++------
kernel/Makefile | 2
kernel/rcuclassic.c | 95 ++++++++++++++++++++++++++++++++++++---------
kernel/rcupdate.c | 22 ++++++++--
kernel/rcupreempt.c | 50 +++++-------------...
Work in progress, not for inclusion.
This patch implements a new version of RCU which allows its read-side
critical sections to be preempted. It uses a set of counter pairs
to keep track of the read-side critical sections and flips them
when all tasks exit read-side critical section. The details
of this implementation can be found in this paper -http://www.rdrop.com/users/paulmck/RCU/OLSrtRCU.2006.08.11a.pdf
This patch was developed as a part of the -rt kernel development and
meant to provide better latencies when read-side critical sections of
RCU don't disable preemption. As a consequence of keeping track of RCU
readers, the readers have a slight overhead (optimizations in the paper).
This implementation co-exists with the "classic" RCU implementations
and can be switched to at compiler.Also includes RCU tracing summarized in debugfs and RCU_SOFTIRQ for
the preemptible variant of RCU.Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Steven Rostedt <rostedt@goodmis.org> (for RCU_SOFTIRQ)
Signed-off-by: Paul McKenney <paulmck@us.ibm.com>
---include/linux/interrupt.h | 1
include/linux/rcuclassic.h | 2
include/linux/rcupdate.h | 7
include/linux/rcupreempt.h | 78 +++
include/linux/rcupreempt_trace.h | 100 +++++
include/linux/sched.h | 5
kernel/Kconfig.preempt | 38 +
kernel/Makefile | 7
kernel/fork.c | 4
kernel/rcupreempt.c | 767 +++++++++++++++++++++++++++++++++++++++
kernel/rcupreempt_trace.c | 330 ++++++++++++++++
11 files changed, 1336 insertions(+), 3 deletions(-)diff -urpNa -X dontdiff linux-2.6.22-b-fixbarriers/include/linux/interrupt.h linux-2.6.22-c-preemptrcu/include/linux/interrupt.h
--- linux-2.6.22-b-fixbarriers/include/linux/interrupt.h 2007-07-08 16:32:17.000000000 -0700
+++ linux-2.6.22-c-preemptrcu/include/linux/interrupt.h 2007-08-22 15:21:06.000000000 -0700
@@ -269,6...
Just curious, any reason why rcu_flipctr can't live in rcu_data ? Similarly,
rcu_try_flip_state can be a member of rcu_ctrlblk, it is even protected by
rcu_ctrlblk.fliplockCould you please tell more, why do we need this cli?
It can't "protect" rcu_ctrlblk.completed, and the only change which affects
the state machine is rcu_flipctr[idx]++, so I can't understand the "half-done"
above. (of course, we must disable preemption while changing rcu_flipctr).Hmm... this was already discussed... from another message:
> Critical portions of the GP protection happen in the scheduler-clock
> interrupt, which is a hardirq. For example, the .completed counter
> is always incremented in hardirq context, and we cannot tolerate a
> .completed increment in this code. Allowing such an increment would
> defeat the counter-acknowledge state in the state machine.Still can't understand, please help! .completed could be incremented by
Isn't it "obvious" that this barrier is not needed? rcu_flipctr could be
change only by this CPU, regardless of the actual value of idx, we can't
read the "wrong" value of rcu_flipctr[idx], no?OTOH, I can't understand how it works. With ot without local_irq_save(),
another CPU can do rcu_try_flip_idle() and increment rcu_ctrlblk.completed
at any time, we can see the old value... rcu_try_flip_waitzero() can miss us?OK, GP_STAGES > 1, so rcu_try_flip_waitzero() will actually check both
0 and 1 lastidx's before synchronize_rcu() succeeds... I doubt very much
my understanding is correct. Apart from this why GP_STAGES > 1 ???Well, I think this code is just too tricky for me! Will try to read again
Why? Any code sequence which relies on that?
For example, rcu_check_callbacks does
if (rcu_ctrlblk.completed == rdp->completed)
rcu_try_flip();it is possible that the timer interrupt calls rcu_check_callbacks()
exactly because rcu_pending() sees rcu_flipped, but without rmb() in
between we can see the old value of...
Thank you for the kind words, and most especially for the careful review!!!
Definitely are two of them in rcupdate.h, good eyes! The ones in
rcuclassic.h and rcupreempt.h, by the time all the patches are
applied, are for rcu_check_callbacks_rt(). However, this, along
with the other _rt() cross-calls from rcuclassic.c to rcupreempt.c,In the case of rcu_flipctr, this is historical -- the implementation in
-rt has only one rcu_data, rather than one per CPU. I cannot think of
a problem with placing it in rcu_data right off-hand, will look it over
carefully and see.Looks like it to me, thank you for the tip!
Hmmm... Why not the same for rcu_data? I guess because there is
very little sharing? The only example thus far of sharing would be
if rcu_flipctr were to be moved into rcu_data (if an RCU read-side
critical section were preempted and then started on some other CPU),
though I will need to check more carefully.Of course, if we start having CPUs access each others' rcu_data structures
to make RCU more power-friendly in dynticks situations, then maybe there
would be more reason to use DEFINE_PER_CPU_SHARED_ALIGNED for rcu_data.Yeah, I guess my explanation -was- a bit lacking... When I re-read it, it
didn't even help -me- much! ;-)I am putting together better documentation, but in the meantime, let me
try again...The problem is not with .completed being incremented, but only
by it (1) being incremented (presumably by some other CPU and
then (2) having this CPU get a scheduling-clock interrupt, which
then causes this CPU to prematurely update the rcu_flip_flag.
This is a problem, because updating rcu_flip_flag is making a
promise that you won't ever again increment the "old" counter set
(at least not until the next counter flip). If the scheduling
clock interrupt were to happen between the time that we pick
up the .completed field and the time that we increment our
counter, we will have broken that promise, and that could cause
someone t...
even if rcu lock/unlock happen on different CPUs, rcu_flipctr is not "shared",
Thanks a lot! Actually, this explains most of my questions. I was greatly
confused even before I started to read the code. Looking at rcu_flipctr[2]
I wrongly assumed that these 2 counters "belong" to different GPs. WhenYes, yes, I see now. We really need this barriers, except I think
rcu_try_flip_idle() can use wmb. However, I have a bit offtopic question,// rcu_try_flip_waitzero()
if (A == 0) {
mb();
B == 0;
}Do we really need the mb() in this case? How it is possible that STORE
Hmm. Still can't understand.
Suppose that we are doing call_rcu(), and __rcu_advance_callbacks() sees
rdp->completed == rcu_ctrlblk.completed but rcu_flip_flag = rcu_flipped
(say, another CPU does rcu_try_flip_idle() in between).We ack the flip, call_rcu() enables irqs, the timer interrupt calls
__rcu_advance_callbacks() again and moves the callbacks.So, it is still possible that "move callbacks" and "ack the flip" happen
out of order. But why this is bad?This can't "speedup" the moving of our callbacks from next to done lists.
Yes, RCU state machine can switch to the next state/stage, but this looks
safe to me.Help!
The last question, rcu_check_callbacks does
if (rcu_ctrlblk.completed == rdp->completed)
rcu_try_flip();Could you clarify the check above? Afaics this is just optimization,
technically it is correct to rcu_try_flip() at any time, even if ->completed
are not synchronized. Most probably in that case rcu_try_flip_waitack() willoldmask could be obsolete now. Suppose that the admin moves that task to some
cpuset or just changes its ->cpus_allowed while it does synchronize_sched().I think there is another problem. It would be nice to eliminate taking the global
sched_hotcpu_mutex in sched_setaffinity() (I think without CONFIG_HOTPLUG_CPU
it is not needed right now). In that case sched_setaffinity(0, cpumask_of_cpu(cpu))
can in fact return on the "wr...
Suppose that A was most recently stored by a CPU that shares a store
buffer with this CPU. Then it is possible that some other CPU sees
the store to B as happening before the store that "A==0" above is
loading from. This other CPU would naturally conclude that the store
to B must have happened before the load from A.In more detail, suppose that CPU 0 and 1 share a store buffer, and
that CPU 2 and 3 share a second store buffer. This happens naturally
if CPUs 0 and 1 are really just different hardware threads within a
single core.So, suppose the cacheline for A is initially owned by CPUs 2 and 3,
and that the cacheline for B is initially owned by CPUs 0 and 1.
Then consider the following sequence of events:o CPU 0 stores zero to A. This is a cache miss, so the new value
for A is placed in CPU 0's and 1's store buffer.o CPU 1 executes the above code, first loading A. It sees
the value of A==0 in the store buffer, and therefore
stores zero to B, which hits in the cache. (I am assuming
that we left out the mb() above).o CPU 2 loads from B, which misses the cache, and gets the
value that CPU 1 stored. Suppose it checks the value,
and based on this check, loads A. The old value of A might
still be in cache, which would lead CPU 2 to conclude that
the store to B by CPU 1 must have happened before the store
to A by CPU 0.Memory barriers would prevent this confusion. An intro to store buffers
can be found at http://www.cs.utah.edu/mpv/papers/neiger/fmcad2001.pdf,Callbacks would be able to be injected into a grace period after it
started.Or are you arguing that as long as interrupts remain disabled between
Ah -- you are in fact arguing that interrupts remain disabled throughout.
I would still rather that the rcu_flip_seen transition be adjacent
to the callback movement in the code. My fear is that the connection
might be lost otherwise... "Oh, but we can just momentarily enableFrom a conceptual viewpoint, if this CPU hasn't caught up w...
Ah, I was confused by the comment,
smp_mb(); /* Don't call for memory barriers before we see zero. */
^^^^^^^^^^^^^^^^^^
So, in fact, we need this barrier to make sure that _other_ CPUs see these
changes in order, thanks. Of course, _we_ already saw zero.But in that particular case this doesn't matter, rcu_try_flip_waitzero()
is the only function which reads the "non-local" per_cpu(rcu_flipctr), so
it doesn't really need the barrier? (besides, it is always called underThanks a lot!!! This fills another gap in my understanding.
OK, the last (I promise :) off-topic question. When CPU 0 and 1 share a
store buffer, the situation is simple, we can replace "CPU 0 stores" with
"CPU 1 stores". But what if CPU 0 is equally "far" from CPUs 1 and 2?Suppose that CPU 1 does
wmb();
B = 0Can we assume that CPU 2 doing
if (B == 0) {
rmb();Look, what happens is
// call_rcu()
rcu_flip_flag = rcu_flipped
insert the new callback
// timer irq
move the callbacks (the new one goes to wait[0])Before this callback will be flushed, we need 2 rdp->completed != rcu_ctrlblk.completed
I hope this is OK, note that migration_call(CPU_DEAD) flushes ->migration_queue,
if we take rq->lock after that we must see !cpu_online(cpu). CPU_UP event is not
interesting for us, we can miss it.Hmm... but wake_up_process() should be moved under spin_lock().
Oleg.
-
Fair point!
Perhaps: "Ensure that all CPUs see their rcu_mb_flag -after- the
The final rcu_read_unlock() that zeroed the sum was -not- under fliplock,
Yes. CPU 2 saw something following CPU 1's wmb(), so any of CPU 2's
reads following its rmb() must therefore see all of CPU 1's storesQuite possibly my paranoia -- need to think about this some more.
Guess I need to expand the code-level portion of the document to cover
the grace-period side. That would force me either to get my explanationThe other approach would be to simply have a separate thread for this
purpose. Batching would amortize the overhead (a single trip around the
CPUs could satisfy an arbitrarily large number of synchronize_sched()
requests).Thanx, Paul
-
Yes, but still I think this mb() is not necessary. Becasue we don't need
the "if we saw rcu_mb_flag we must see sum(lastidx)==0" property. When another
CPU calls rcu_try_flip_waitzero(), it will use another lastidx. OK, minor issue,Ah, but I asked the different question. We must see CPU 1's stores by
definition, but what about CPU 0's stores (which could be seen by CPU 1)?Let's take a "real life" example,
A = B = X = 0;
P = Q = &A;CPU_0 CPU_1 CPU_2
P = &B; *P = 1; if (X) {
wmb(); rmb();
X = 1; BUG_ON(*P != 1 && *Q != 1);
}Yes, this way we don't need to uglify migration_thread(). OTOH, we need
another kthread ;)Oleg.
-
It depends. ;-)
o Itanium: because both wmb() and rmb() map to the "mf"
instruction, and because "mf" instructions map to a
single global order, the BUG_ON cannot happen. (But
I could easily be mistaken -- I cannot call myself an
Itanium memory-ordering expert.) See:ftp://download.intel.com/design/Itanium/Downloads/25142901.pdf
for the official story.
o POWER: because wmb() maps to the "sync" instruction,
cumulativity applies, so that any instruction provably
following "X = 1" will see "P = &B" if the "*P = 1"
statement saw it. So the BUG_ON cannot happen.o i386: memory ordering respects transitive visibility,
which seems to be similar to POWER's cumulativity
(http://developer.intel.com/products/processor/manuals/318147.pdf),
so the BUG_ON cannot happen.o x86_64: same as i386.
o s390: the basic memory-ordering model is tight enough that the
BUG_ON cannot happen. (If I am confused about this, the s390
guys will not be shy about correcting me!)True enough!!!
Thanx, Paul
-
That can't be. CPU_2 sees X=1, that happened after (or same time at most -
from a cache inv. POV) to *P=1, that must have happened after P=&B (in
order for *P to assign B). So P=&B happened, from a pure time POV, before
the rmb(), and the rmb() should guarantee that CPU_2 sees P=&B too.- Davide
-
Actually, CPU designers have to go quite a ways out of their way to
prevent this BUG_ON from happening. One way that it would happen
naturally would be if the cache line containing P were owned by CPU 2,
and if CPUs 0 and 1 shared a store buffer that they both snooped. So,
here is what could happen given careless or sadistic CPU designers:o CPU 0 stores &B to P, but misses the cache, so puts the
result in the store buffer. This means that only CPUs 0 and 1
can see it.o CPU 1 fetches P, and sees &B, so stores a 1 to B. Again,
this value for P is visible only to CPUs 0 and 1.o CPU 1 executes a wmb(), which forces CPU 1's stores to happen
in order. But it does nothing about CPU 0's stores, nor about CPU
1's loads, for that matter (and the only reason that POWER ends
up working the way you would like is because wmb() turns into
"sync" rather than the "eieio" instruction that would have been
used for smp_wmb() -- which is maybe what Oleg was thinking of,
but happened to abbreviate. If my analysis is buggy, Anton and
Paulus will no doubt correct me...)o CPU 1 stores to X.
o CPU 2 loads X, and sees that the value is 1.
o CPU 2 does an rmb(), which orders its loads, but does nothing
about anyone else's loads or stores.o CPU 2 fetches P from its cached copy, which still points to A,
which is still zero. So the BUG_ON fires.o Some time later, CPU 0 gets the cache line containing P from
CPU 2, and updates it from the value in the store buffer, but
too late...Unfortunately, cache-coherence protocols don't care much about pure
time... It is possible to make a 16-CPU machine believe that a single
variable has more than ten different values -at- -the- -same- -time-.
This is easy to do -- have all the CPUs store different values to the
same variable at the same time, then reload, collecting timestamps
between each pair of operations. On a large SMP, the values will sit
in the store buffers for many hundreds of nanoseconds, perhaps even
se...
Davide, Paul, thank you very much! I've been wondering about this for the
long time, now I know the answer. Great.Oleg.
-
Ohh, I misinterpreted that rmb(), sorry. Somehow I gave it for granted
that it was a cross-CPU sync point (ala read_barrier_depends). If that's aIf a store buffer is shared between CPU_0 and CPU_1, it is very likely
that a sync done on CPU_1 is going to sync even CPU_0 stores that are
held in the buffer at the time of CPU_1's sync.- Davide
-
That would indeed be one approach that CPU designers could take to
avoid being careless or sadistic. ;-)Another approach would be to make CPU 1 refrain from snooping CPU 0's
entries in the shared store queue.Thanx, Paul
-
That'd be the easier (unique maybe) approach too for them, from an silicon
complexity POV. Distinguishing between different CPUs stores once inside a
shared store buffer, would require tagging them in some way. That'd defeat
most of the pros of having a shared store buffer ;)- Davide
-
Tagging requires but one bit per entry. Depends on the workload -- if
lots of barriers, bursty stores and little sharing, tagging might win.
If lots of sharing, then your suggested approach might win.Thanx, Paul
-
Paul,
Looking further into this, I still think this is a bit of overkill. We
go through 20 states from call_rcu to list->func().On call_rcu we put our stuff on the next list. Before we move stuff from
next to wait, we need to go through 4 states. So we havenext -> 4 states -> wait[0] -> 4 states -> wait[1] -> 4 states ->
wait[2] -> 4 states -> wait[3] -> 4 states -> done.That's 20 states that we go through from the time we add our function to
the list to the time it actually gets called. Do we really need the 4
wait lists?Seems a bit overkill to me.
What am I missing?
-- Steve
-
"Nothing kills like overkill!!!" ;-)
Seriously, I do expect to be able to squeeze this down over time, but
feel the need to be a bit on the cowardly side at the moment.In any case, I will be looking at the scenarios more carefully. If
it turns out that GP_STAGES can indeed be cranked down a bit, well,
that is an easy change! I just fired off a POWER run with GP_STAGES
set to 3, will let you know how it goes.Thanx, Paul
-
The first attempt blew up during boot badly enough that ABAT was unable
to recover the machine (sorry, grahal!!!). Just for grins, I am trying
it again on a machine that ABAT has had a better record of reviving...Thanx, Paul
-
--
This still frightens the hell out of me. Going through 15 states and
failing. Seems the CPU is holding off writes for a long long time. That
means we flipped the counter 4 times, and that still wasn't good enough?Maybe I'll boot up my powerbook to see if it has the same issues.
Well, I'm still finishing up on moving into my new house, so I wont be
available this weekend.Thanks,
-- Steve
-
Might be that the other machine has its 2.6.22 version of .config messed
up. I will try booting it on a stock 2.6.22 kernel when it comes back
to life -- not sure I ever did that before. Besides, the other similar
machine seems to have gone down for the count, but without me torturing
it...Also, keep in mind that various stages can "record" a memory misordering,
The other machine not only booted, but has survived several minutes of
rcutorture thus far. I am also trying POWER5 machine as well, as the
one currently running is a POWER4, which is a bit less aggressive about
memory reordering than is the POWER5.Even if they pass, I refuse to reduce GP_STAGES until proven safe.
Trust me, you -don't- want to be unwittingly making use of a subtely
busted RCU implementation!!!Thanx, Paul
-
[ sneaks away from the family for a bit to answer emails ]
--
I totally agree. This is the same reason I want to understand -why- it
fails with 3 stages. To make sure that adding a 4th stage really does fix
it, and doesn't just make the chances for the bug smaller.I just have that funny feeling that we are band-aiding this for POWER with
extra stages and not really solving the bug.I could be totally out in left field on this. But the more people have a
good understanding of what is happening (this includes why things fail)
the more people in general can trust this code. Right now I'm thinking
you may be the only one that understands this code enough to trust it. I'm
just wanting you to help people like me to trust the code by understanding
and not just having faith in others.If you ever decide to give up jogging, we need to make sure that there are
people here that can still fill those running shoes (-:-- Steve
-
Or if it really does break, as opposed to my having happened upon a sick
Agreed. Trusting me is grossly insufficient. For one thing, the Linux
Well, I certainly don't jog as fast or as far as I used to! ;-)
Thanx, Paul
-
I take it that GP stand for "grace period". Might want to state that
here. /* Grace period stages */ When I was looking at this code at 1am,
I kept asking myself "what's this GP?" (General Protection??). ButCan you have a pointer somewhere that explains these states. And not a
"it's in this paper or directory". Either have a short discription here,
or specify where exactly to find the information (perhaps a
Documentation/RCU/preemptible_states.txt?).Trying to understand these states has caused me the most agony in
Nitpick, but other places in the kernel usually use "t" or "p" as a
variable to assign current to. It's just that "me" thows me off a
little while reviewing this. But this is just a nitpick, so do as youIsn't the GP detection done via a tasklet/softirq. So wouldn't a
local_bh_disable be sufficient here? You already cover NMIs, which wouldJust trying to understand all this. Here at flip_idle, only a CPU with
no pending RCU calls will flip it. Then all the cpus flags will be set
to rcu_flipped, and the ctrl.completed counter is incremented.When a CPU calls process_callbacks, it would then move all it's
callbacks to the next list (next -> wait[GP...] -> done), and set it's
unique completed counter to completed. So no more moving up the lists
will be done. It also will set it's state flag to rcu_flip_seen. Even if
the cpu doesn't have any RCU callbacks, the state would be in
rcu_try_flip_waitack_state so we go to the next switch case for a
rcu_try_flip call.Also the filp counter has been flipped so that all new rcu_read_locks
will increment the cpus "other" index. We just need to wait for theNow, we just wait to see if all the cpus have called process_callbacks
and now have the flag rcu_flip_seen set. So all the cpus have pushedNow this is where we wait for the sum of all the CPUs counters to reach
zero. The reason for the sum is that the task may have been preempted
and migrated to another CPU and decremented that counter. So the o...
Covering the pieces that weren't in Peter's reply. ;-)
My fingers do this without any help from the rest of me, but I suppose
Good point, perhaps a comment block above the enum giving a short
description of the purpose of each state. Maybe more detail inI apologize for the repetition in this email.
I apologize for the repetition in this email.
I apologize for the repetition in this email.
Yep, will fix with either #define or static inline, as you suggested
A later patch in the series fixes these -- I believe I got all of them.
s/no pending RCU calls/at least one pending RCU call/, but otherwise
spot on.So if the RCU grace-period machinery is idle, the first CPU to take
a scheduling-clock interrupt after having posted an RCU callback willYep. The wait-for-acknowledgement cycle is there to handle rcu_read_lock()
invocations on other CPUs that executed concurrently with the counterYep. Note that rcu_read_lock() might well be able to use its local
.completed counter rather than the global one in rcu_ctrlblk, but
(1) it is not clear that the reduction in cache misses would outweigh
the per-CPU access overhead given that rcu_read_lock() happens a -lot-
more often than does a counter flip and (2) doing this could well requireYep! More importantly, we know that no CPU will ever increment the
The only way it could cause a problem would be if there was ever
more than 4,294,967,296 outstanding rcu_read_lock() calls. I believe
that lockdep screams if it sees more than 30 nested locks within a
single task, so for systems that support no more than 100M tasks, we
should be OK. It might sometime be necessary to make this be a longYep. Because there are no memory barriers in rcu_read_unlock(), the
CPU is free to reorder the contents of the RCU read-side critical section
to follow the counter decrement. This means that this CPU would still
be referencing RCU-protected data after it had told the world that it
was no longer doing so. Forcing a...
Thanks, so many places in the kernel have acronyms that are just suppose
to be "obvious". I hate them, because they make me feel so stupid when I
don't know what they are. After I find out, I usually slap my forehead andI said 'no' becaues of this:
+rcu_try_flip_idle(void)
+{
+ int cpu;
+
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_i1);
+ if (!rcu_pending(smp_processor_id())) {
+ RCU_TRACE_ME(rcupreempt_trace_try_flip_ie1);
+ return 0;
+ }But now I'm a bit more confused. :-/
Looking at the caller in kernel/timer.c I see
if (rcu_pending(cpu))
rcu_check_callbacks(cpu, user_tick);And rcu_check_callbacks is the caller of rcu_try_flip. The confusion is
that we call this when we have a pending rcu, but if we have a pendingSure, why not. More and more and more overkill!!!
And this seem reasonable to me that this would be enough to satisfy a
grace period. But the CPU moving around the rcu_read_(un)lock's around.Are we sure that adding all these grace periods stages is better than just
That is definitely an accomplishment. And I know as well as you do that it
happened because of a lot of people sharing ideas. Some good, some bad,
but all helpful for heathy development!-- Steve
-
We don't enter unless there is something for RCU to do (might be a
pending callback, for example, but might also be needing to acknowledge
a counter flip). If, by the time we get to rcu_try_flip_idle(), there
is no longer anything to do (!rcu_pending()), we bail.So a given CPU kicks the state machine out of idle only if it -still-
has something to do once it gets to rcu_try_flip_idle(), right?Good question. I believe so, because the extra stages don't require
much additional processing, and because the ratio of rcu_read_lock()
calls to the number of grace periods is extremely high. But, if I
can prove it is safe, I will certainly decrease GP_STAGES or otherwise
optimize the state machine.Indeed! The current version is quite a bit different than my early-2005
posting (which relied on locks!), and a -lot- of people had a hand in
making it what it is today.Thanx, Paul
-
[took off Ingo, because he has my ISP blacklisted, and I'm tired of
getting those return mail messages. He can read LKML or you can re-CC
him. Sad since this is a topic he should be reading. ]--
Now I can slap my forehead! Duh, I wasn't seeing that ! in front of the
rcu_pending condition in the rcu_try_flip_idle. We only flip if we do
indeed have something pending. I need some sleep. I also need to
re-evaluate some of my analysis of that code. But it doesn't change myBut until others besides yourself understand that state machine (doesn't
really need to be me) I would be worried about applying it without
barriers. The barriers may add a bit of overhead, but it adds some
confidence in the code. I'm arguing that we have barriers in there until
there's a fine understanding of why we fail with 3 stages and not 4.
Perhaps you don't have a box with enough cpus to fail at 4.I don't know how the higher ups in the kernel command line feel, but I
think that memory barriers on critical sections are justified. But if you
can show a proof that adding extra stages is sufficient to deal with
CPUS moving memory writes around, then so be it. But I'm still not
convinced that these extra stages are really solving the bug instead of
just making it much less likely to happen.Ingo praised this code since it had several years of testing in the RT
tree. But that version has barriers, so this new verison without theTrue.
-- Steve
-
Fair point... Though the -rt variant has its shortcomings as well,
such as being unusable from NMI/SMI handlers.How about this: I continue running the GP_STAGES=3 run on the pair of
POWER machines (which are both going strong, and I also get a document
together describing the new version (and of course apply the changes we
have discussed, and merge with recent CPU-hotplug changes -- Gautham
Shenoy is currently working this), work out a good answer to "how
big exactly does GP_STAGES need to be", test whatever that number is,
assuming it is neither 3 nor 4, and figure out why the gekko-lp1 machine
choked on GP_STAGES=3.Then we can work out the best path forward from wherever that ends up
being.[ . . . ]
Thanx, Paul
-
On Fri, 21 Sep 2007 10:40:03 -0400 Steven Rostedt <rostedt@goodmis.org>
I thought the 4 flip states corresponded to the 4 GP stages, but now
you confused me. It seems to indeed progress one stage for every 4 flip
states.Hmm, now I have to puzzle how these 4 stages are required by the lock
struct task_struct *curr = current;
This is also my understanding, but I think this disable is an
'optimization' in that it avoids the regular IRQs from jumping throughIts the large SMP case that's nuts, and on that I have to agree with
Paul, its not really large SMP friendly.-
No, I mean that both the rcu_batches_completed and
rcu_batches_completed_bh are identical. Perhaps we can just put in a#define rcu_batches_completed_bh rcu_batches_completed
in rcupreempt.h. In rcuclassic, they are different. But no need to have
But isn't disabling irqs slower than doing a local_bh_disable? So the
Hmm, that could be true. But on large SMP systems, you usually have a
large amounts of memory, so hopefully a really long synchronize_rcu
would not be a problem.-- Steve
-
If you do a synchronize_rcu() it might well have to wait through the
following sequence of states:Stage 0: (might have to wait through part of this to get out of "next" queue)
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 1:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 2:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 3:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */
Stage 4:
rcu_try_flip_idle_state, /* "I" */
rcu_try_flip_waitack_state, /* "A" */
rcu_try_flip_waitzero_state, /* "Z" */
rcu_try_flip_waitmb_state /* "M" */The current code absolutely must exclude the scheduling-clock hardirq
Somewhere in the range from 64 to a few hundred CPUs, the global lock
protecting the try_flip state machine would start sucking air pretty
badly. But the real problem is synchronize_sched(), which loops through
all the CPUs -- this would likely cause problems at a few tens of
CPUs, perhaps as early as 10-20.Thanx, Paul
-
--
Yes they do. I'm now at the point that I'm just "trusting" you that you
understand that each of these stages are needed. My IQ level only lets me
understand next -> wait -> done, but not the extra 3 shifts in wait.It's the app I watch sitting there waiting it's turn for it's callback to
ACKed,
hehe, From someone who's largest box is 4 CPUs, to me 16 CPUS is large.
But I can see hundreds, let alone thousands of CPUs would make a huge
grinding halt on things like synchronize_sched. God, imaging if all CPUs
did that approximately at the same time. The system would should a huge
jitter.-- Steve
-
In the spirit of full disclosure, I am not -absolutely- certain that
Well, the first time the SGI guys tried to boot a 1024-CPU Altix, I got
an email complaining about RCU overheads. ;-) Manfred Spraul fixed
things up for them, though.Thanx, Paul
-
Yes, four flip states per stage, and four stages per grace period:
rcu_try_flip_idle_state: Stay here if nothing is happening.
Flip the counter if something starts
happening.rcu_try_flip_waitack_state: Wait here for all CPUs to notice that
the counter has flipped. This prevents
the old set of counters from ever being
incremented once we leave this state,
which in turn is necessary because we
cannot test any individual counter for
zero -- we can only check the sum.rcu_try_flip_waitzero_state: Wait here for the sum of the old
per-CPU counters to reach zero.rcu_try_flip_waitmb_state: Wait here for each of the other
CPUs to execute a memory barrier.
This is necessary to ensure that
these other CPUs really have finished
executing their RCU read-side
critical sections, despite their
CPUs wildly reordering memory.They are needed to allow memory barriers to be removed from
rcu_read_lock(), to allow for the fact that the stages mean that
grace-period boundaries are now fuzzy, as well as to account for the
fact that each CPU now has its own callback queue. (Aside: classic RCU
doesn't have memory barriers -- or anything else -- in rcu_read_lock()
and rcu_read_unlock(), but from an implementation viewpoint, the read-side
critical section extends forwards and backwards to the next/previous
context switches, which do have lots of memory barriers.)Taking the reasons for stages one at a time:
1. The first stage is needed for generic RCU. Everyone must
complete any pre-existing RCU read-side critical sections
before the grace period can be permitted to complete. This
stage corresponds to the "waitlist" and "waittail" in earlier
RCU implementations.2. An additional stage is needed due to the fact that memory reordering
can cause an RCU read-side critical section's rcu_dereference()
to be executed before the rcu_read_lock() -- no memory barriers!This can result in the following sequence of events:...
[ continued here from comment on patch 1]
And here we put back rcu_batches_completed and rcu_batches_completed_bh
from rcuclassic.h to rcupdate.h ;-)-- Steve
-
Good questions :) I can't remember why I did this - probably because
I was breaking up into classic and preemptible RCU in incremental
patches with the goal that the break-up patch can be merged
before the rcu-preempt patches. IIRC, I had to make *batches_completed*()
a common RCU API later on.Thanks
Dipankar
-
Hmmm... Good point!!! I guess it would be OK to just leave them
in rcupdate.h throughout. ;-)Will fix. And good eyes!
Thanx, Paul
-
Work in progress, not for inclusion.
This patch re-organizes the RCU code to enable multiple implementations
of RCU. Users of RCU continues to include rcupdate.h and the
RCU interfaces remain the same. This is in preparation for
subsequently merging the preemptible RCU implementation.Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---include/linux/rcuclassic.h | 149 +++++++++++
include/linux/rcupdate.h | 151 +++---------
kernel/Makefile | 2
kernel/rcuclassic.c | 558 ++++++++++++++++++++++++++++++++++++++++++++
kernel/rcupdate.c | 561 ++-------------------------------------------
5 files changed, 779 insertions(+), 642 deletions(-)diff -urpNa -X dontdiff linux-2.6.22/include/linux/rcuclassic.h linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h
--- linux-2.6.22/include/linux/rcuclassic.h 1969-12-31 16:00:00.000000000 -0800
+++ linux-2.6.22-a-splitclassic/include/linux/rcuclassic.h 2007-08-22 14:42:23.000000000 -0700
@@ -0,0 +1,149 @@
+/*
+ * Read-Copy Update mechanism for mutual exclusion (classic version)
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright IBM Corporation, 2001
+ *
+ * Author: Dipankar Sarma <dipankar@in.ibm.com>
+ *
+ * Based on the o...
On Mon, Sep 10, 2007 at 11:32:08AM -0700, Paul E. McKenney wrote:
Why is rcu_batches_completed and rcu_batches_completed_bh moved from
rcupdate.h to rcuclassic.h?[ continued ...]
-- Steve
-
Work in progress, not for inclusion.
Fix rcu_barrier() to work properly in preemptive kernel environment.
Also, the ordering of callback must be preserved while moving
callbacks to another CPU during CPU hotplug.Signed-off-by: Dipankar Sarma <dipankar@in.ibm.com>
Signed-off-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com>
---rcuclassic.c | 2 +-
rcupdate.c | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcuclassic.c linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c
--- linux-2.6.22-a-splitclassic/kernel/rcuclassic.c 2007-07-19 15:03:51.000000000 -0700
+++ linux-2.6.22-b-fixbarriers/kernel/rcuclassic.c 2007-07-19 17:10:46.000000000 -0700
@@ -349,9 +349,9 @@ static void __rcu_offline_cpu(struct rcu
if (rcp->cur != rcp->completed)
cpu_quiet(rdp->cpu, rcp);
spin_unlock_bh(&rcp->lock);
+ rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
rcu_move_batch(this_rdp, rdp->curlist, rdp->curtail);
rcu_move_batch(this_rdp, rdp->nxtlist, rdp->nxttail);
- rcu_move_batch(this_rdp, rdp->donelist, rdp->donetail);
}static void rcu_offline_cpu(int cpu)
diff -urpNa -X dontdiff linux-2.6.22-a-splitclassic/kernel/rcupdate.c linux-2.6.22-b-fixbarriers/kernel/rcupdate.c
--- linux-2.6.22-a-splitclassic/kernel/rcupdate.c 2007-07-19 14:19:03.000000000 -0700
+++ linux-2.6.22-b-fixbarriers/kernel/rcupdate.c 2007-07-19 17:13:31.000000000 -0700
@@ -115,7 +115,17 @@ void rcu_barrier(void)
mutex_lock(&rcu_barrier_mutex);
init_completion(&rcu_barrier_completion);
atomic_set(&rcu_barrier_cpu_count, 0);
+ /*
+ * The queueing of callbacks in all CPUs must be atomic with
+ * respect to RCU, otherwise one CPU may queue a callback,
+ * wait for a grace period, decrement barrier count and call
+ * complete(), while other CPUs have not yet queued anything.
+ * So, we need to make sure that grace periods cannot complete
+ * until all...
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg Kroah-Hartman | [PATCH 001/196] Chinese: Add the known_regression URI to the HOWTO |
| Benjamin Herrenschmidt | Re: [PATCH] Remove process freezer from suspend to RAM pathway |
| Bart Van Assche | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Arjan van de Ven | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| Natalie Protasevich | [BUG] New Kernel Bugs |
