Re: [PATCH RFC 3/9] RCU: Preemptible RCU

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Steven Rostedt <rostedt@...>, <linux-kernel@...>, <linux-rt-users@...>, <mingo@...>, <akpm@...>, <dipankar@...>, <josht@...>, <tytso@...>, <dvhltc@...>, <tglx@...>, <bunk@...>, <ego@...>, <oleg@...>
Date: Friday, September 21, 2007 - 6:06 pm

On Fri, Sep 21, 2007 at 05:46:53PM +0200, Peter Zijlstra wrote:

Yes, four flip states per stage, and four stages per grace period:

rcu_try_flip_idle_state:	Stay here if nothing is happening.
				Flip the counter if something starts
				happening.

rcu_try_flip_waitack_state:	Wait here for all CPUs to notice that
				the counter has flipped.  This prevents
				the old set of counters from ever being
				incremented once we leave this state,
				which in turn is necessary because we
				cannot test any individual counter for
				zero -- we can only check the sum.

rcu_try_flip_waitzero_state:	Wait here for the sum of the old
				per-CPU counters to reach zero.

rcu_try_flip_waitmb_state:	Wait here for each of the other
				CPUs to execute a memory barrier.
				This is necessary to ensure that
				these other CPUs really have finished
				executing their RCU read-side
				critical sections, despite their
				CPUs wildly reordering memory.

				This state 


They are needed to allow memory barriers to be removed from
rcu_read_lock(), to allow for the fact that the stages mean that
grace-period boundaries are now fuzzy, as well as to account for the
fact that each CPU now has its own callback queue.  (Aside: classic RCU
doesn't have memory barriers -- or anything else -- in rcu_read_lock()
and rcu_read_unlock(), but from an implementation viewpoint, the read-side
critical section extends forwards and backwards to the next/previous
context switches, which do have lots of memory barriers.)

Taking the reasons for stages one at a time:

1.	The first stage is needed for generic RCU.  Everyone must
	complete any pre-existing RCU read-side critical sections
	before the grace period can be permitted to complete.  This
	stage corresponds to the "waitlist" and "waittail" in earlier
	RCU implementations.

2.	An additional stage is needed due to the fact that memory reordering
	can cause an RCU read-side critical section's rcu_dereference()
	to be executed before the rcu_read_lock() -- no memory barriers!

	This can result in the following sequence of events:

	o	rcu_dereference() from CPU 0.

	o	list_del_rcu() from CPU 1.

	o	call_rcu() from CPU 1.

	o	counter flip from CPU 2.

	o	rcu_read_lock() from CPU 0 (delayed due to memory reordering).

		CPU 0 increments the new counter, not the old one that
		its rcu_dereference() is associated with.

	o	All CPUs acknowledge the counter flip.

	o	CPU 2 discovers that the old counters sum to zero (because
		CPU 0 has incremented its new counter).

	o	All CPUs do their memory barriers.

	o	CPU 1's callback is invoked, destroying the memory still
		being used by CPU 0.

	An additional pass through the state machine fixes this problem.

	Note that the read-after-write misordering can happen on x86,
	as well as on s390.

3.	An additional stage is also needed due to the fact that each CPU
	has its own set of callback queues.  This means that callbacks
	that were registered quite some time after the most recent counter
	flip can make it into the current round of the state machine,
	as follows:

	o	CPU 0 flips the counter.

	o	CPU 1 does an rcu_read_lock() followed by an rcu_dereference().
		CPU 1 of course uses the new value of the counter.

	o	CPU 2 does a list_del_rcu() on the element that CPU 1
		did the rcu_dereference on.

	o	CPU 2 does a call_rcu(), placing the callback on
		its local queue.

	o	CPU 2 gets its scheduling-clock interrupt, and acknowledges
		the counter flip, also moving its "next" list (which
		contains the new callback) to the "wait" list.

	o	The old counters sum to zero, and everyone does a memory
		barrier, and the callback is invoked.  Too bad that CPU 0
		is still using the data element just freed!

	(And yes, I do understand that this scenario is incompatible
	with the previous scenario.  But the point of these scenarios
	is to demonstrate that a given effect can result in failure,
	not to show all possible misfortunes that the effect can cause.
	If you wish to try to convince me that the two causes underlying
	these symptoms cannot be combined to create a failure spanning
	two full flip intervals, then the task before you is to prove
	that there is no such scenario.)

	This scenario can happen even on sequentially consistent
	machines.

4.	An additional stage is needed due to the fact that different
	CPUs can disagree as to when the counter flip happened.  This
	effect is in some way similar to #3, but can (and did) happen
	even when there is only one global callback queue.  

	I believe that this can happen even on sequentially consistent
	machines, but am not 100% certain of this.

So, four mechanisms for going bad mapped to four states in the grace-period
pipeline.  It is quite possible that a more-complex state machine would
require fewer stages, but then again it might not end up being any faster
overall.  :-/  It is also quite possible that one or another of the
additional stages covers more than one of the above mechanisms.  So,
we know just from the above scenarios that at least two stages are
required, and the most that they can possibly require is four stages.
I have looked very hard for other underlying sources of badness, and
am quite sure that I have them covered (famous last words).

I will rerun with GP_STAGES==3 on POWER to double-check -- it is entirely
possible that the last such run preceded some bug removal.

On the other hand, there are most definitely relatively straightforward
ways to speed things up on uniprocessor machines, for example,
skipping the acknowledgement and memory-barrier states, short-circuiting
synchronize_rcu() and friends, and so on.  But I need to avoid premature
optimization here -- one thing at a time.


We have to remain API-compatible with classic RCU, where the separate
grace-period types are absolutely required if the networking stack is
to hold up against certain types of DOS attacks.  Networking uses
call_rcu_bh(), which has quiescent states at any point in the code
that interrupts are enabled.

It might well be possible to fuse the _bh variant into the classic variant
for CONFIG_PREEMPT (as opposed to CONFIG_PREEMPT_RT) kernels, but as far
as I know, no one has tried this.  And there is little motivation to try,
from what I can see.


OK, good point either way.


Critical portions of the GP protection happen in the scheduler-clock
interrupt, which is a hardirq.  For example, the .completed counter
is always incremented in hardirq context, and we cannot tolerate a
.completed increment in this code.  Allowing such an increment would
defeat the counter-acknowledge state in the state machine.

There are ways to avoid the interrupt disabling, but all the ones
that I am aware of thus far have grace-period-latency consequences.


Hey!!!  If it wasn't nuts, I probably wouldn't be working on it!!!  ;-)


Yeah, the current implementation of synchronize_sched() is a really bad
idea on a 4096-CPU system, and probably much smaller.  The base RCU
implementation is now probably good to several tens of CPUs, perhaps
even to a couple hundred.  In contrast, the implementation currently
in -rt might start choking pretty badly much earlier due to the single
global callback queue.

						Thanx, Paul
-
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
[PATCH RFC 0/9] RCU: Preemptible RCU, Paul E. McKenney, (Mon Sep 10, 2:30 pm)
Re: [PATCH RFC 0/9] RCU: Preemptible RCU, Ingo Molnar, (Mon Sep 10, 2:44 pm)
[PATCH RFC 6/9] RCU priority boosting for preemptible RCU, Paul E. McKenney, (Mon Sep 10, 2:39 pm)
Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU, Paul E. McKenney, (Fri Oct 5, 10:07 am)
Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU, Gautham R Shenoy, (Fri Sep 28, 6:56 pm)
Re: [PATCH RFC 6/9] RCU priority boosting for preemptible RCU, Paul E. McKenney, (Sat Sep 29, 11:11 pm)
[PATCH RFC 5/9] RCU: CPU hotplug support for preemptible RCU, Paul E. McKenney, (Mon Sep 10, 2:36 pm)
[PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Mon Sep 10, 2:34 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Oleg Nesterov, (Sun Sep 23, 1:38 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Sun Sep 23, 8:15 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Oleg Nesterov, (Wed Sep 26, 11:13 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Thu Sep 27, 11:46 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Oleg Nesterov, (Fri Sep 28, 10:47 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 28, 2:57 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Oleg Nesterov, (Sun Sep 30, 12:31 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Sun Sep 30, 9:20 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Davide Libenzi, (Sun Sep 30, 7:02 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Sun Sep 30, 9:37 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Oleg Nesterov, (Tue Oct 2, 2:02 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Davide Libenzi, (Mon Oct 1, 2:44 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Mon Oct 1, 3:21 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Davide Libenzi, (Mon Oct 1, 6:09 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Mon Oct 1, 6:24 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 11:20 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 7:03 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 8:32 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 9:19 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 9:43 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 10:56 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Sat Sep 22, 12:10 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 10:40 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 8:26 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 9:15 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 9:53 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 11:15 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Sat Sep 22, 12:07 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Peter Zijlstra, (Fri Sep 21, 11:46 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 6:31 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 6:44 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 7:23 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 7:44 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 6:06 pm)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Steven Rostedt, (Fri Sep 21, 12:17 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Dipankar Sarma, (Fri Sep 21, 1:56 am)
Re: [PATCH RFC 3/9] RCU: Preemptible RCU, Paul E. McKenney, (Fri Sep 21, 1:50 am)
[PATCH RFC 2/9] RCU: Fix barriers, Paul E. McKenney, (Mon Sep 10, 2:33 pm)