Re: [PATCH, RFC] v4 scalable classic RCU implementation

Previous thread: Re: char/tpm: tpm_infineon no longer loaded for HP 2510p laptop by Kay Sievers on Thursday, August 21, 2008 - 2:18 pm. (9 messages)

Next thread: [PATCH 1/2] smp_call_function: don't use lock in call_function_data by Jeremy Fitzhardinge on Thursday, August 21, 2008 - 5:29 pm. (3 messages)
From: Paul E. McKenney
Date: Thursday, August 21, 2008 - 4:43 pm

Hello!

Experimental, not for inclusion.

Attached is a patch to Classic RCU that applies a hierarchy, greatly
reducing the contention on the top-level lock for large machines.
This passes mild rcutorture testing on x86 and ppc64, but is most
definitely not ready for inclusion.  It is OK for experimental work
assuming sufficiently brave experimenters.  See also Manfred Spraul's
patch at http://lkml.org/lkml/2008/8/21/336 (or his earlier work from
2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
We will converge onto a common patch in the fullness of time, but
are currently exploring different regions of the design space.

This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
of the RCU hierarchy.  Defaults to 32 on 32-bit machines and 64 on
64-bit machines.  If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
there is no hierarchy.  By default, the RCU initialization code will
adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
this balancing, allowing the hierarchy to be exactly aligned to the
underlying hardware.  Up to two levels of hierarchy are permitted
(in addition to the root node), allowing up to 16,384 CPUs on 32-bit
systems and up to 262,144 CPUs on 64-bit systems.  I just know that I
am going to regret saying this, but this seems more than sufficient
for the foreseeable future.  (Some architectures might wish to set
CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs.
If this becomes a real problem, additional levels can be added, but I
doubt that it will make a significant difference on real hardware.)

In the common case, a given CPU will manipulate its private rcu_data
structure and the rcu_node structure that it shares with its immediate
neighbors.  This can reduce both lock and memory contention by multiple
orders of magnitude, which should eliminate the need for the strange
manipulations that are reported to be required when running ...
From: Ingo Molnar
Date: Thursday, August 21, 2008 - 9:37 pm

just a quick stylistic suggestion: if feasible then such sizing ugliness 
should be hidden in a Kconfig file. (if Kconfig is capable enough for 
this that is)

	Ingo
--

From: Paul E. McKenney
Date: Friday, August 22, 2008 - 6:47 am

I have no idea if Kconfig can do it, but I will check.

							Thanx, Paul
--

From: Paul E. McKenney
Date: Friday, August 22, 2008 - 10:22 am

OK, Kconfig does not currently support arithmetic, based on zconf.y:

expr:	  symbol				{ $$ = expr_alloc_symbol($1); }
	| symbol T_EQUAL symbol			{ $$ = expr_alloc_comp(E_EQUAL, $1, $3); }
	| symbol T_UNEQUAL symbol		{ $$ = expr_alloc_comp(E_UNEQUAL, $1, $3); }
	| T_OPEN_PAREN expr T_CLOSE_PAREN	{ $$ = $2; }
	| T_NOT expr				{ $$ = expr_alloc_one(E_NOT, $2); }
	| expr T_OR expr			{ $$ = expr_alloc_two(E_OR, $1, $3); }
	| expr T_AND expr			{ $$ = expr_alloc_two(E_AND, $1, $3); }
;

All we currently get is basic comparison and logical operators.  It would
not be all -that- hard to add general arithmetic (famous last words),
but when I tried mapping out what the sizing code would look like in
such an augmented Kconfig, it was even uglier than the above.

So I took a hard look at the current mess, and prettied it as shown below.
Is this a sufficient improvement?

Another alternative I am considering is moving this to a separate
include file.

Thoughts?

							Thanx, Paul

#define MAX_RCU_LEVELS 3
#define RCU_FANOUT (CONFIG_RCU_FANOUT)
#define RCU_FANOUT_SQ (RCU_FANOUT * RCU_FANOUT)
#define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)

#if (NR_CPUS) <= RCU_FANOUT
#  define NUM_RCU_LVLS 1
#  define NUM_RCU_LVL_0 1
#  define NUM_RCU_LVL_1 (NR_CPUS)
#  define NUM_RCU_LVL_2 0
#  define NUM_RCU_LVL_3 0
#elif (NR_CPUS) <= RCU_FANOUT_SQ
#  define NUM_RCU_LVLS 2
#  define NUM_RCU_LVL_0 1
#  define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
#  define NUM_RCU_LVL_2 (NR_CPUS)
#  define NUM_RCU_LVL_3 0
#elif (NR_CPUS) <= RCU_FANOUT_CUBE
#  define NUM_RCU_LVLS 3
#  define NUM_RCU_LVL_0 1
#  define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
#  define NUM_RCU_LVL_2 (((NR_CPUS) + (RCU_FANOUT) - 1) / (RCU_FANOUT))
#  define NUM_RCU_LVL_3 NR_CPUS
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT */

#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
#define ...
From: Josh Triplett
Date: Friday, August 22, 2008 - 11:16 am

I personally don't think that would help.  The revised version seems

- Josh Triplett


--

From: Ingo Molnar
Date: Saturday, August 23, 2008 - 9:07 am

yeah - looks much better. This was the block that meets the eye for the 
first time in the patch so it stuck out.

just one more small pet peeve of mine: please use vertical alignment too 



but no strong feelings on that one. (maybe inserting a space at the 
right places helps too, no need for a full tab)

	Ingo
--

From: Paul E. McKenney
Date: Saturday, August 23, 2008 - 7:44 pm

Yep, just like you, spaced it just enough to keep the longest one from
running over one line.  ;-)

I left the definitions for RCU_SUM and NUM_RCU_NODES compact, though:

#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)

The other alternative would be to stack RCU_SUM as follows:

#define RCU_SUM		      (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + \
			       NUM_RCU_LVL_2 + NUM_RCU_LVL_3)

which seemed to me to add more ugly than enlightenment.

Testing is going well.  Having to occasionally restrain myself to keep
from going full-bore for 4096 CPU optimality -- but have to keep it
simple until/unless someone with that large of a machine shows where
improvements are needed.

							Thanx, Paul
--

From: Josh Triplett
Date: Friday, August 22, 2008 - 4:29 pm

Why put these references here rather than in Documentation/RCU?  It
seems easier to keep documentation up to date in one place.  If you
think these represent a good "getting started" set of documents, how
about a Documentation/RCU/ReadTheseFirst with links to them, or how




u8 or bool, depending on semantics.  If just a simple flag, how about

Looks like several whitespace changes occurred here; several of these
lines didn't actually change except in whitespace.

The same comment about sized types applies here, but these fields didn't

Some whitespace changes again here; several of these lines didn't change


Might want to document in the commit message that you have tracing
information through RCU_TRACE, and that it applies to non-preemptible


You might want to give a specific example of a NUMA machine, the
appropriate value to use on that machine, and the result with and

It might actually make sense here to do this instead:

ifeq ($(CONFIG_RCU_TRACE),y)
obj-$(CONFIG_CLASSIC_RCU) += rcuclassic_trace.o
obj-$(CONFIG_PREEMPT_RCU) += rcupreempt_trace.o

Same comment as before; maintaining these in a single place seems

How about making these state structures static, along with removing the



ACCESS_ONCE, like memory barriers, benefits from an accompanying

No need to explicitly say "inline"; GCC should do the right thing here.


This comment applies to the original code, but:
You only call __call_rcu twice, in call_rcu and call_rcu_bh.  Both
times, you set head first, then wrap the call with local_irq_save.  How
about moving both into __call_rcu, making call_rcu and call_rcu_bh

static DEFINE_MUTEX(rcuclassic_trace_mutex);

Did you perhaps want PAGE_SIZE?

- Josh Triplett


--

From: Paul E. McKenney
Date: Friday, August 22, 2008 - 6:53 pm

OK for grpnum and level, but grphi and grplo need to be "int" to


This will need to be a non-bool shortly.

OK, so what the heck -are- the official type names???  u8 seems
to be defined in a powerpc-specific file.  OK, it also appears in
include/asm-generic/int-l64.h.  s8, u8, s16, u16, s32, u32, s64, and

Whitespace-only changes are kind of lost in the noise with this patch.

Both can be bool.  The completed and gpnum fields can probably be s32,




Indeed.  But I need to see an actual example before I document it.
It would be easy to make things slower by following the NUMA hardware






It will try again on the next scheduling-clock interrupt.  The reason
I did this is because ->onofflock is a global lock acquired when
beginning a quiescent state or when onlining/offlining.  Can't let
force_quiescent_state() monopolize things, and would like to exclude
online/offline while doing force_quiescent_state().  Hence make
force_quiescent_state() back off if the lock is held.


OK.  These are accessed without holding the relevant lock, and I don't
want the compiler to refetch them.  (For example, if inlined into other


Someone once told me why this was necessary, but I forget.  It was in the
original, and I didn't put it there.  Some weirdness about conversion
to 32-bit integer when the lower 32 bits of the pointer was zero or
some such.  So if your pointer value was 0x100000000, for example,

I can't pass "rcu_data" to a function (or at least I don't know how to
do so, short of passing __per_cpu_rcu_data and doing the per-CPU stuff
by hand).  I could make __call_rcu() be a macro, but that seemed more
ugly than it seemed worthwhile.



I really want some way of gracefully handling arbitrarily long output
to debugfs.  I am sure that some such exists, but haven't found it.
What I do instead is to arbitrarily truncate output to 4096 bytes,
which will be stunningly less than useful on a 4,096-CPU machine.  :-/

Suggestions welcome!

							Thanx, ...
From: Josh Triplett
Date: Monday, August 25, 2008 - 3:02 pm

Fair enough; the CPU-manipulation primitives do indeed use "int".  Odd

levelspread can, since it will never exceed 64, but levelcnt cannot.


Yes. {s,u}{8,16,32,64}, defined in include/asm-generic/int-{l,ll}64.h,

Hmmm, true.  Unfortunate, particularly if only for the benefit of




Primarily concerned about the possibility of perpetual failure.  Then
again, eventually a grace period will occur "naturally".  Just wondering

Good point!  That doesn't apply if you use ||, though.  If you just did
"return somepointer" that could potentially cause the problem you
describe.  In any case, it can't *hurt* to have it; GCC should do the


I can see two possibilities, depending on how much complexity you want.

The complicated way: do one pass calling snprintf everywhere and adding
up the total length used, and if you run out of memory during that pass,
reallocate the buffer to at least the total length you accumulated.  Or
something like that.

The simple hack:

- Josh Triplett


--

From: Paul E. McKenney
Date: Tuesday, August 26, 2008 - 9:05 am

It does allow use of -1 for "no particular CPU" or for error checking,



Indeed.  Putting rcuclassic_trace.c into rcuclassic.c gets pretty ugly.
I suppose that another possibility would be to #include rcuclassic_trace.c


Ah!  So the lock can fail for the following reasons:

1.	Some other CPU is in force_quiescent_state().  Here there is
	clearly no problem.

2.	Some other CPU is initializing the rcu_node hierarchy to set
	up a new quiescent state.  Here, we shouldn't have been
	executing force_quiescent_state() in the first place, so
	again no problem.

3.	Some other CPU is adjusting the rcu_node hierarchy to account
	for a CPU online or offline operation.  There is enough overhead
	in onlining and offlining CPUs that it seems unlikely that this
	could result in a denial of service.  However, if someone can
	make this happen, I will make the online/offline operation check
	to see if it should do a force_quiescent_state() -- which will
	require an __force_quiescent_state() where the onofflock is
	acquired by the caller.

So we are covered on #1 and #2, and very likely covered on #3, with an

OK.  I will review this towards the end, leaving it there to remind me
in the meantime.

So, would I need the !! on the left-hand operand of the first || due

The only other thing I can think of is dynamically allocated per-CPU

Given that this doesn't show up in production kernels, I will take
door #2.  Though I was hoping for some sort of interface that "just
made it work" regardless of the size of user reads and the length
and pattern of in-kernel prints, but that might be a bit much...

						Thanx, Paul
--

From: Josh Triplett
Date: Tuesday, August 26, 2008 - 5:38 pm

No.  || will always return 1 or 0.  You only need the !! if you want to
directly return the boolean value of a potentially 64-bit pointer.

- Josh Triplett


--

From: Paul E. McKenney
Date: Wednesday, August 27, 2008 - 11:34 am

Even if one argument of || is long and the other int or some fool thing
like that?  (What, me paranoid???)

							Thanx, Paul
--

From: Josh Triplett
Date: Wednesday, August 27, 2008 - 1:23 pm

What, you don't know exactly how C behaves in every strange corner
case? ;)

|| always produces a result of type int, and it compares each of its two
arguments to 0 independently; to the best of my knowledge the size of
those arguments never matters.

- Josh Triplett


--

From: Paul E. McKenney
Date: Wednesday, August 27, 2008 - 1:41 pm

I used to, back when identifiers were only guaranteed to be

I suppose I should read the spec.

							Thanx, Paul
--

From: Peter Zijlstra
Date: Monday, August 25, 2008 - 3:34 am

I actually like in code comments and 'documentation' more than
Documentation/ stuff. Mostly because Documentation/ is:
 - far away from the code
 - therefore, more easily bitrotted
 - and easily forgotten



--

From: Paul E. McKenney
Date: Monday, August 25, 2008 - 8:16 am

I know!!!

#ifdef JOSH_TRIPLETT
 * 	Documentation/RCU
 * 	http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?)
 * 	http://lwn.net/Articles/263130/ (What is RCU's Usage?)
 * 	http://lwn.net/Articles/264090/ (What is RCU's API? + references)
#elif PETER_ZIJLSTRA
 * 	Documentation/RCU
#endif

(Sorry, couldn't resist!!!)

Seriously, I know where all the documentation is, as I wrote most of it.
These comments are for you guys.  So, any thoughts on how I should
resolve this?  My default is, as always, a coin flip.  ;-)

							Thanx, Paul
--

From: Peter Zijlstra
Date: Monday, August 25, 2008 - 8:26 am

I guess we could do the 'this is how the concept works and can be used
like so and so' documentation in Documentation/

And the stuff that says 'this code does like so and so, because blah'
should stay near the code.

And in any case of doubt - stay near the code :-)

I always view Documentation/ as end user stuff (be that a kernel
programmer that needs to learn a new API, or userland folks or people
wanting to know what a certain feature is about).

--

From: Paul E. McKenney
Date: Wednesday, August 27, 2008 - 11:28 am

Good point...

#ifdef READER_LIKES_DOCUMENTATION_URLS_IN_COMMENTS
 * 	Documentation/RCU
 * 	http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?)
 * 	http://lwn.net/Articles/263130/ (What is RCU's Usage?)
 * 	http://lwn.net/Articles/264090/ (What is RCU's API? + references)
#else
 * 	Documentation/RCU
#endif

Of course, the C preprocessor would just remove the whole comment

Documentation/RCU/whatisRCU.txt does in fact contain the three URLs
listed above.  And there is always Documentation/RCU/RTFP.txt for

I confess to erring on the side of spamming all channels.  Then again,
I am a serial junk-mailer, so perhaps this is just me.

							Thanx, Paul
--

From: Manfred Spraul
Date: Sunday, August 24, 2008 - 1:08 am

I'm not sure if a bitmap is the right storage. If I understand the code 
correctly, it contains two information:
1) If the bitmap is clear, then all cpus have completed whatever they 
need to do.
A counter is more efficient than a bitmap. Especially: It would allow to 
choose the optimal fan-out, independent from 32/64 bits.
2) The information if the current cpu must do something to complete the 
current period.non
This is a local information, usually (always?) only the current cpu 
needs to know if it must do something.
But this doesn't need to be stored in a shared structure, the 
Do you have a description of the events between call_rcu() and the rcu 
callback?
Is the following description correct?

__call_rcu() queues in RCU_NEXT_TAIL.
In the middle of the current grace period: rcu_check_quiescent_state() 
calls rcu_next_callbacks_are_ready().
Entry now in RCU_NEXT_READY_TAIL
** 0.5 cycles: wait until all cpus have completed the current cycle.
rcu_process_gp_end() moves from NEXT_READY_TAIL to WAIT_TAIL

** full grace period
rcu_process_gp_end() moves from WAIT_TAIL to DONE_TAIL
Why this mb()? There was a grace period between the last read side 
critical section that might have accessed the pointer.
The rcu internal code already does spin_lock()+spin_unlock(). Isn't that 
The docbook entry is duplicated: They are in include/linux/rcupdate.h 
and here.
What about removing one of them?
I would go one step further:
Even add call_rcu_sched() into rcupdate.h. Add a Kconfig bool 
"RCU_NEEDS_SCHED" and automatically define either the extern or the #define.

--
    Manfred
--

From: Paul E. McKenney
Date: Sunday, August 24, 2008 - 9:32 am

I am using the bitmap in force_quiescent_state() to work out who to
check dynticks and who to send reschedule IPIs to.  I could scan all
of the per-CPU rcu_data structures, but am assuming that after a few
jiffies there would typically be relatively few CPUs still needing to do
a quiescent state.  Given this assumption, on systems with large numbers
of CPUs, scanning the bitmask greatly reduces the number of cache misses

Yes, that is the sequence of events if grace periods are happening back to
back and if the current CPU has not yet passed through a quiescent state.
If RCU is idle, then there is no need to wait for a previous grace
period to complete.  If this CPU already passed through its quiescent
state for the first grace period, and is not the CPU that starts the next
grace period, then there will be an additional grace period to move from
RCU_NEXT_TAIL to RCU_NEXT_READY_TAIL.  If this CPU is the one starting

The combination of spin_lock()+spin_unlock()+spin_lock()+spin_unlock()
would suffice.  But the pair of smp_mb()s suffice regardless of fast
paths through the rest of the mechanism.  Because rcu_process_callbacks()
is on the slow path, the trival proof of ordering is more important than

I have (and am) considering this.  It would require an additional per-CPU
data structure, would require an additional check in rcu_pending(), and
require a bit more manipulation when callbacks become "done".  However,
but would simplify the requeuing code in rcu_do_batch() a bit.

One interesting side effect would be that blimit would apply globally
to both rcu and rcu_bh, and that a burst of (say) rcu callbacks could


Another approach would be to have an rcupdate.h definition that
was a wrapper around __call_rcu_sched(), and put the docbook stuff in
rcupdate.h.  It would appear that docbook was not created with the idea
of alternative implementations in mind.  ;-)

							Thanx, Paul
--

From: Manfred Spraul
Date: Sunday, August 24, 2008 - 11:25 am

It's an optimization question: What is rarer? force_quiescent_state() or 
"normal" cpu_quiet calls.
You have optimized for force_quiescent_state(), I have optimized for 
"normal" cpu_quiet calls. [ok, I admit: force_quiescent_state() is still 
missing in my code].
Do you have any statistics?

--
    Manfred
--

From: Paul E. McKenney
Date: Sunday, August 24, 2008 - 2:19 pm

If the system is completely busy, then I would expect normal cpu_quiet()
calls to be more common.  But if the system were sized for peak
workload, it would spend a fair amount of time with many of the CPUs
idle.  Power-conservation measures would hopefully push the idleness
into single cores/dies/whatever which could then be powered down.

A large fraction of the systems I see have utilizations well under 50%.
And latency concerns would also focus on force_quiescent state.

That said, I haven't had much to do with systems having more than 128
CPUs.

							Thanx, Paul
--

From: Paul E. McKenney
Date: Sunday, August 24, 2008 - 5:07 pm

Hello!

Still experimental, not for inclusion.

Updates from earlier version:

o	Handles dyntick idle state, including interrupts and NMIs,
	but while allowing dyntick-idle CPUs to remain idle (in
	contrast to the previous version, which simply whacked all
	non-responding CPUs with a resched IPI).

o	Made force_quiescent_state() more intelligent, so that it
	no longer whacks all CPUs whether they need it or not.

o	Cleaned up cpp code that determines size and shape of the
	rcu_node hierarchy.

o	Added debugfs tracing capability (was in previous patch,
	but forgot to mention it).

Attached is an updated patch to Classic RCU that applies a hierarchy,
greatly reducing the contention on the top-level lock for large machines.
This passes mild rcutorture testing on x86 and ppc64, but is most
definitely not ready for inclusion.  It is OK for experimental work
assuming sufficiently brave experimenters.  See also Manfred Spraul's
patch at http://lkml.org/lkml/2008/8/21/336 (or his earlier work from
2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).  We will
converge onto a common patch in the fullness of time, but are currently
exploring different regions of the design space.

This patch provides CONFIG_RCU_FANOUT, which controls the bushiness
of the RCU hierarchy.  Defaults to 32 on 32-bit machines and 64 on
64-bit machines.  If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT,
there is no hierarchy.  By default, the RCU initialization code will
adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA
architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable
this balancing, allowing the hierarchy to be exactly aligned to the
underlying hardware.  Up to two levels of hierarchy are permitted
(in addition to the root node), allowing up to 16,384 CPUs on 32-bit
systems and up to 262,144 CPUs on 64-bit systems.  I just know that I
am going to regret saying this, but this seems more than sufficient
for the foreseeable future.  (Some architectures might ...
From: Paul E. McKenney
Date: Friday, August 29, 2008 - 5:49 pm

Hello!

Still experimental, not for inclusion.  But getting better!

Updates from v2:

o	Fixed a number of bugs uncovered by running rcutorture in
	parallel with onlining and offlining CPUs.  Many of these
	were due to the fact that there can be a multiple-grace-period
	window during which RCU and the process scheduler disagree
	about whether a given CPU is offline.  The solution was
	to make force_quiescent_state() check for RCU waiting on
	offlined CPUs, and then cleaning up all the locking gotchas
	that resulted from that change.

o	Upgraded tracing capability with additional statistics, for
	example, per-CPU counts of how often force_quiescent_state()
	responded on their behalf (because they were offline, in
	dyntick-idle state, or needed a resched IPI).  Also abbreviated
	more severely to allow the system to run longer within the
	confines of an 80-character xterm.

o	Added sparse annotations so that it sparses cleanly.

o	Added an argument to force_quiescent_state() so that for normal
	callers, it checks for enough time having passed since the
	last try.  Emergency callers (__call_rcu() with more than
	10,000 RCU callbacks piled up on the local CPU) get their
	quiescent state forced unconditionally.

o	Added mapping from CPU to rcu_data structure to allow RCU to
	easily switch its attention from (say) the CPU being offlined
	to the currently running CPU, should the offlining kick off
	a new RCU grace period.

o	Made the trace buffer's size a function of the number of
	CPUs so that the rcudata debugfs file works correctly on
	128-CPU machines.

Attached is an updated patch to Classic RCU that applies a hierarchy,
greatly reducing the contention on the top-level lock for large
machines.  This passes mild rcutorture testing on x86 and ppc64,
including some 12-hour runs on 8-CPU machines and an hour thus far
on a 128-CPU machine, but is most definitely not ready for inclusion.
It is OK for experimental work assuming sufficiently brave experimenters.
See also ...
From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 2:33 am

Already done and available in the -tip tree, curtesy of Mathieu.


--

From: Paul E. McKenney
Date: Saturday, August 30, 2008 - 7:10 am

Very cool!!!  I see one of his patches at http://lkml.org/lkml/2008/4/17/342,
but how do I find out which branch of -tip this is on?  (I am learning
git, but it is a slow process...)

This would also simplify preemptable RCU's dyntick interface, removing
the need for proofs.

							Thanx, Paul
--

From: Peter Zijlstra
Date: Saturday, August 30, 2008 - 8:40 am

Not sure - my git-foo isn't good enough either :-(

All I can offer is that its available in tip/master (the collective
merge of all of tip's branches) as commit:
0d84b78a606f1562532cd576ee8733caf5a4aed3, which I found using
git-annotate include/linux/hardirq.h

How to find from which particular topic branch it came from, I too am
clueless.

---
commit 0d84b78a606f1562532cd576ee8733caf5a4aed3
Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date:   Mon May 12 21:21:07 2008 +0200

    x86 NMI-safe INT3 and Page Fault
    
    Implements an alternative iret with popf and return so trap and exception
    handlers can return to the NMI handler without issuing iret. iret would cause
    NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
    copy the return instruction pointer to the top of the previous stack, issue a
    popf, loads the previous esp and issue a near return (ret).
    
    It allows placing immediate values (and therefore optimized trace_marks) in NMI
    code since returning from a breakpoint would be valid. Accessing vmalloc'd
    memory, which allows executing module code or accessing vmapped or vmalloc'd
    areas from NMI context, would also be valid. This is very useful to tracers like
    LTTng.
    
    This patch makes all faults, traps and exception safe to be called from NMI
    context *except* single-stepping, which requires iret to restore the TF (trap
    flag) and jump to the return address in a single instruction. Sorry, no kprobes
    support in NMI handlers because of this limitation.  We cannot single-step an
    NMI handler, because iret must set the TF flag and return back to the
    instruction to single-step in a single instruction. This cannot be emulated with
    popf/lret, because lret would be single-stepped. It does not apply to immediate
    values because they do not use single-stepping. This code detects if the TF
    flag is set and uses the iret path for single-stepping, even if it ...
From: Paul E. McKenney
Date: Saturday, August 30, 2008 - 12:38 pm

That works -- thank you!!!

--

From: Mathieu Desnoyers
Date: Tuesday, September 2, 2008 - 6:26 am

If you're interested in knowing the topic it came from : it's required
so a following patch can use a "popf; ret" instead of iret to return
from trap handlers executed in NMI context. There is an architectural
problem on x86 causing NMIs to be reactivated after the first iret
encountered, which leads to NMI handler races if nmi handlers trap. This
works around the problem by returning from the trap handlers without
using the iret instruction.

It's useful to Immediate Values which put a temporary breakpoint in the
instruction stream when proceeding to code modification and also useful
to LTTng (available in the -lttng tree) which writes tracing data to
vmap'd memory buffers (which can cause a minor page fault).

I'm glad to see NMI context detection is useful to others too !


-- 
Mathieu Desnoyers
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--

From: Peter Zijlstra
Date: Tuesday, September 2, 2008 - 6:41 am

While an interesting detail, its not the answer to the question.

Given a bunch of topic branches, and a branch that has all those topic
merged, how, for any particular commit from the merge branch, do you
find from which topic branch it originiated?

IOW, the answer to the above question would have been a series of git
commands that would have resulted in something like tip/tracing/nmisafe



--

From: Paul E. McKenney
Date: Tuesday, September 2, 2008 - 7:55 am

I guess that it turned out that there was a series of mutt commands that
eventually got the answer.  That said, a series of git commands would
be quite nice.  ;-)

But it would appear that the series of git commands would need to come
from someone with better git-foo than either Mathieu or myself.  :-/

							Thanx, Paul
--

From: Lai Jiangshan
Date: Saturday, August 30, 2008 - 2:58 am

I just had a fast review. so my comments is nothing but cleanup.

          Thanks, Lai.


I tried a mask like qsmaskinit before. The system came to deadlock
when I did on/offline cpus.
I didn't find out the whys for I bethought of these two problem:

problem 1:
----race condition 1:
<cpu_down>
synchronize_rcu <called from offline handler in other subsystem>
rcu_offline_cpu


-----race condition 2:
rcu_online_cpu
synchronize_rcu <called from online handler in other subsystem>
<cpu_up>

in these two condition, synchronize_rcu isblocked for ever for
synchronize_rcu have to wait a cpu in rnp->qsmask, but this
cpu don't run.



problem 2:
we need call rcu_offline_cpu() in these two cases in rcu_cpu_notify()
since qsmaskinit had changed by rcu_online_cpu()

	case CPU_UP_CANCELED:



struct rcu_state has a field: struct rcu_data *rda[NR_CPUS]
so we can move these lines around __call_rcu into __call_rcu.

__call_rcu(struct rcu_head *head, struct rcu_state *rsp)
{
	local_irq_save(flags);
	struct rcu_data *rdp = rsp->rda[smp_processor_id()];
.....
	local_irq_save(flags);




--

From: Manfred Spraul
Date: Saturday, August 30, 2008 - 6:32 am

Can we disallow synchronize_rcu() from the cpu notifiers? Are there any 
users that do a synchronize_rcu() from within the notifiers?
I don't see any other solution.
Something like qsmaskinit is needed - always enumerating all cpus just 
doesn't scale.

Perhaps it's possible to rely on CPU_DYING, but I haven't figured out 
yet how to handle read-side critical sections in CPU_DYING handlers.
Interrupts after CPU_DYING could be handled by rcu_irq_enter(), 
rcu_irq_exit() [yes, they exist on x86: the arch code enables the local 
interrupts in order to process the currently queued interrupts]

--
    Manfred
--

From: Paul E. McKenney
Date: Saturday, August 30, 2008 - 7:34 am

I made force_quiescent_state() check for offline CPUs.  (Well, actually
it is rcu_implicit_offline_qs(), which is indirectly called from


My feeling is that CPU online/offline will be quite rare, so it should
be OK to clean up after the races in force_quiescent_state(), which in
this version is called every three ticks in a given grace period.

Yes, I did worry about the possibility of all CPUs being in dyntick-idle
mode, and the solution for that is (1) don't let a CPU that has RCU
callbacks pending go into dyntick-idle mode via rcu_needs_cpu() and
(2) don't let a grace period start unless there is at least one callback
that is not yet in the done state.  But no, I am not certain that I have
gotten this completely correct yet.

							Thanx, Paul
--

From: Manfred Spraul
Date: Sunday, August 31, 2008 - 3:58 am

If you add failing cpu offline calls, then the problem appears to be 
unsolvable:
If I get it right, the offlining process looks like this:
* one cpu in the system makes the CPU_DOWN_PREPARE notifier call. These 
calls can sleep (e.g. slab sleeps on semaphores). The cpu that goes 
offline is still alive, still doing arbitrary work. cpu_quiet calls on 
behalf of the cpu would be wrong.
* stop_machine: all cpus schedule to a special kernel thread [1], only 
the dying cpu runs.
* The cpu that goes offline calls the CPU_DYING notifiers.
* __cpu_disable(): The cpu that goes offline check if it's possible to 
offline the cpu. At least on i386, this can fail.
On success:
* at least on i386: the cpu that goes offline handles outstanding 
interrupts. I'm not sure, perhaps even softirqs are handled.
* the cpus stopps handling interrupts.
* stop machine leaves, the remaining cpus continue their work.
* The CPU_DEAD notifiers are called. They can sleep.
On failure:
* all cpus continue their work. call_rcu, synchronize_rcu(), ...
* some time later: the CPU_DOWN_FAILED callbacks are called.

Is that description correct?
Then:
- treating a cpu as always quiet after the rcu notifer was called with 
CPU_OFFLINE_PREPARE is wrong: the target cpu still runs normal code: 
user space, kernel space, interrupts, whatever. The target cpu still 
accepts interrupst, thus treating it as "normal" should work.
__cpu_disable() success:
- after CPU_DYING, a cpu is either in an interrupt or outside read-side 
critical sections. Parallel synchronize_rcu() calls are impossible until 
the cpu is dead. call_rcu() is probably possible.
- The CPU_DEAD notifiers are called. a synchronize_rcu() call before the 
rcu notifier is called is possible.
__cpu_disable() failure:
- CPU_DYING is called, but the cpu remains fully alive. The system comes 
fully alive again.
- some time later, CPU_DEAD is called.

With the current CPU_DYING callback, it's impossible to be both 
deadlock-free and race-free with the given ...
From: Paul E. McKenney
Date: Sunday, August 31, 2008 - 10:20 am

As I understand it, this is the point where the dying CPU disables
interrupts and removes itself from the online masks.  Though I would
feel better if there was an smp_mb() after the last local_irq_disable()


Indeed!  My current code doesn't declare them offline until the CPU_DEAD
notifiers are called.  And force_quiescent_state() does not consider
them to be offline until after they have cleared their bit in
cpu_online_map, which does not happen until the outgoing CPU has
disabled interrupts, at least in x86.  So my current code should be
OK on x86.

It -looks- like stop_cpu() expects to be called with irqs disabled,
but I don't see what would be disabling irqs.  (Don't kthreads normally
start with irqs enabled?)  Ah, I see it -- the stop_cpu() threads
sequence through a state machine, and one of the states disables
irqs for everyone.

So the only problem would occur in architectures that re-enable irqs
in the middle of __cpu_disable(), as x86 does (but x86 correctly orders
the clearing of the cpu_online_mask bit, so is OK).  This of course
has the added benefit that irq handlers aren't running on a CPU that
is marked offline.

Checking other architectures:

o	ARM arch/arm/kernel/smp.c __cpu_disable() does not re-enable
	irqs, so is OK.

!	arch/ia64/kernel/smpboot.c __cpu_disable() clears itself
	from the cpu_online mask before flushing pending irqs, which
	might include RCU read-side critical sections.	I believe that
	the "cpu_clear(cpu, cpu_online_map)" must move to after the
	"fixup_irqs()".

o	arch/powerpc/kernel/smp.c __cpu_disable() does not disable
	irqs directly, but calls subarch-specific functions noted
	below.

o	arch/powerpc/platforms/powermac/smp.c smp_core99_cpu_disable()
	does not appear to re-enable irqs, so should be OK.

!	arch/powerpc/kernel/smp.c generic_cpu_disable() clears itself
	from the cpu_online_mask before invoking fixup_irqs(), which
	momentarily enables irqs.  I believe that the  "cpu_clear(cpu,
	cpu_online_map)" must move to after ...
From: Manfred Spraul
Date: Sunday, August 31, 2008 - 10:45 am

Yes, that would work:
Rule 1: after CPU_DEAD, a cpu is gone. The cpu is quiet, rcu callbacks 
must be moved to other cpus, ...
Rule 2: if a cpu is not listed in cpu_online_mask, then it can be 
considered as outside a read-side critical section.

The problem with rule 2 is that it means someone 
[force_quiescent_state()] must poll the cpu_online_mask and look for 
changes.
I'd really prefer a notifier. CPU_DYING is nearly the correct thing, it 
only has to be moved down 3 lines ;-)
I made a mistake, get_online_cpus() stores current, not a cpu number. 
Thus the described race it not possible. Perhaps there are other users 
that could deadlock.
I don't know enough about the preempt algorithm, thus I can't confirm if 
your proposal would work or not.

--
    Manfred
--

From: Paul E. McKenney
Date: Sunday, August 31, 2008 - 10:55 am

But some later CPU_DYING notifier might decide that the CPU cannot be
removed after all, which would mean bringing the CPU back.  And then
whatever the CPU was needed for might have actually happened in the

Well, that is on my list of things to look into...

							Thanx, Paul
--

From: Manfred Spraul
Date: Sunday, August 31, 2008 - 11:18 am

CPU_DYING must not fail, the current code doesn't support that.

--
    Manfred
--

From: Paul E. McKenney
Date: Sunday, August 31, 2008 - 12:23 pm

Good point!

							Thanx, Paul
--

From: Paul E. McKenney
Date: Saturday, August 30, 2008 - 7:29 am

First, only one of these race conditions can happen at a time, since
only one online/offline action can be happening at a time.

What I did to solve it was to make force_quiescent_state() check
to see if the CPU currently blocking the grace period is offline.
(The actual checking for offline is in rcu_implicit_offline_qs(),
which is called indirectly from force_quiescent_state().)

So when this race occurs, it is taken care of within three jiffies.
This happened -many- times during my most recent test ("of=" in the

Good catch!!!  Fixed.  The current code would work in this case, but grace
periods would be unnecessarily extended until force_quiescent_state()



Very good point!!!  And then call_rcu() and call_rcu_bh() become



--

From: Andi Kleen
Date: Monday, September 1, 2008 - 2:38 am

It would be better if you hung rcu_irq_enter in the irq_enter() if 
statement that checks if the task was idle or not. This way it would
be zero overhead for interruptions of non busy CPUs, keeping 
it out of many fast paths.

Haven't read everything, sorry.

-Andi
--

From: Paul E. McKenney
Date: Monday, September 1, 2008 - 6:05 pm

So that I lose the #else above, and so that irq_enter() and irq_exit()
look something like the following (with additional adjustments to suit)?
Makes a lot of sense to me...

And it has the very nice side effect of allowing me to have a separate
rcu_irq_enter() and rcu_nmi_enter(), trivializing the RCU-dynticks
interface!!!  Very cool, thank you very much!!!

							Thanx, Paul

void irq_enter(void)
{
#ifdef CONFIG_NO_HZ
	int cpu = smp_processor_id();
	if (idle_cpu(cpu) && !in_interrupt())
		tick_nohz_stop_idle(cpu);
#endif
	__irq_enter();
#ifdef CONFIG_NO_HZ
	if (idle_cpu(cpu)) {
		rcu_irq_enter();
		tick_nohz_update_jiffies();
	}
#endif
}

void irq_exit(void)
{
	account_system_vtime(current);
	trace_hardirq_exit();
	sub_preempt_count(IRQ_EXIT_OFFSET);
	if (!in_interrupt() && local_softirq_pending())
		invoke_softirq();

#ifdef CONFIG_NO_HZ
	/* Make sure that timer wheel updates are propagated */
	if (idle_cpu(smp_processor_id())) {
		if (!in_interrupt() && !need_resched())
			tick_nohz_stop_sched_tick(0);
		rcu_irq_exit();
	}
#endif
	preempt_enable_no_resched();
}
--

From: Andi Kleen
Date: Monday, September 1, 2008 - 11:18 pm

Yes looks good.



-Andi
--

From: Paul E. McKenney
Date: Friday, September 5, 2008 - 8:29 am

Hello!

Still experimental, not for inclusion.  But ready for serious experimental
use, in particular, experience on an actual >1000-CPU machine would be
most welcome.

Updates from v3:

o	The hierarchical-RCU implementation has been moved to its own
	"rcutree" set of files.  This allows configuring three different
	implementations of RCU (CLASSIC_RCU, PREEMPT_RCU, and the new
	TREE_RCU).  More importantly, it enables easy application of
	this patch to a wide variety of Linux versions.

	I hope that this implementation can completely replace Classic
	RCU, but in the meantime, this split makes for easier testing
	and review.

o	The stalled-CPU detection is now implemented and working,
	enabled by the CONFIG_RCU_CPU_STALL config parameter.  Complaints
	are kprint()ed 3 seconds into the stall, and every 30 seconds
	thereafter.  It also now attempts to force quiescent states.

o	The algorithm uses pre-fabricated masks rather than shifting
	on each access.

o	Review comments have been applied (thank you all!!!).
	For but one example, call_rcu() and call_rcu_bh() are now
	one-liners.

o	The rcu_pending() and rcu_needs_cpu() primitives are now
	much more aggressive about permitting CPUs to enter dynticks
	idle mode.  Only CPUs that have RCU callbacks are kept out
	of dynticks idle mode.

Attached is an updated patch to Classic RCU that applies a
hierarchy, greatly reducing the contention on the top-level lock
for large machines.  This passes 10-hour concurrent rcutorture and
online-offline testing on 128-CPU ppc64.  It is OK for experimental
work assuming only modestly brave experimenters (and perhaps even
cowardly experiementers), but not yet ready for inclusion.  See also
Manfred Spraul's recent patches (or his earlier work from 2004 at
http://marc.info/?l=linux-kernel&m=108546384711797&w=2).  We will
converge onto a common patch in the fullness of time, but are currently
exploring different regions of the design space.  That said, I have
already gratefully stolen a ...
From: Andrew Morton
Date: Friday, September 5, 2008 - 12:33 pm

On Fri, 5 Sep 2008 08:29:30 -0700

The CONFIG_RCU_CPU_STALL identifier seems poorly-chosen to me - it
sounds like it will stall my CPU.  Should it be
CONFIG_RCU_CPU_STALL_DETECTOR?  If it's a debugging option then it

You forgot

  o	Adds yet another RCU flavour

Having alternative implementations of the same thing is a real cost in

Using NR_CPUS for anything at all is grossly, grossly inaccurate. 
Vendors can and will ship kernels with NR_CPUS=1024 and their customers
can and will run those kernels on 4-cpu machines.  Lots of customers.

That's a two-and-a-half-order-of-magnitude inaccuracy.  It makes all
your above work meaningless.


So this is a 4096-byte structure on some setups.


did they have to be implemented in macros?  (it's generally best to use

static inline void rcu_init_sched(void)
{
}


These are massive.  But it seems they'll only ever be used once, in





I don't personally think such fugliness gains us enough.

static void dyntick_record_completed(struct rcu_state *rsp, int comp)
{
}


I dislike the L's here.  They don't do anything and they have an
encapsulation-violation feeling about them.  Do we really want code
sprinkled all over the palce whcih "knows" spefically which type was
used to implement these fields?  Or do we want to stick a plain old "2"

]]

hm, does that work?

akpm:/usr/src/25> grep -r __releases Documentation 
akpm:/usr/src/25> 



Looks great!  I don't understand a line of it!

It's a pretty big pill to swallow.  Nice performance testing results
will help it to slide down.
--

From: Paul E. McKenney
Date: Friday, September 5, 2008 - 4:04 pm

CONFIG_RCU_CPU_STALL_DETECTOR it is!  It is sufficiently lightweight
to be included in production systems, and similar mechanisms have proven

Agreed, but note well the quote from above:

	The hierarchical-RCU implementation has been moved to its own
 	"rcutree" set of files.  This allows configuring three different
 	implementations of RCU (CLASSIC_RCU, PREEMPT_RCU, and the new
 	TREE_RCU).  More importantly, it enables easy application of
 	this patch to a wide variety of Linux versions.
 
 	I hope that this implementation can completely replace Classic
 	RCU, but in the meantime, this split makes for easier testing
 	and review.

The idea is to avoid adding another RCU flavour (or flavor, for that

I agree in principle, but this case is an exception.

Suppose that we have NR_CPUS=1024 on a 4-CPU 64-bit machine.  Since 64^2
is greater than 1024, we end up with a two-level hierarchy, with one
rcu_node structure at the root and 16 rcu_node leaf structures, each
of which takes up a single 128-byte cache line.  There will be two such
structures in the system, one for rcu and one for rcu_bh.

So I do not believe that this will be a problem.

One laptops, this is even less of an issue -- NR_CPUS=8 on my laptop,
which would reduce to a pair rcu_node structures, one for rcu, the other
for rcu_bh.

Making the decision at runtime would bloat the code by much more than the
extra data consumed.  And introduce yet more races between online/offline
and everything else.  Besides, this way I am being bloat-compatible

You lost me on this one.  On a 64-bit system, I have 8 bytes each for
lock, qsmask, qsmaskinit, grpmask, and parent, four bytes each for grplo
and grphi, and another byte each for grpnum and level, for a total of
50 bytes for each struct rcu_node, which comes to a single cache line
for most large system.  Given the default CONFIG_RCU_FANOUT=64 and
NR_CPUS=4096, we have a two-level hierarchy with one root rcu_node
structure and 64 leaf rcu_node structures.  This ...
From: Andrew Morton
Date: Friday, September 5, 2008 - 4:52 pm

Is it likely that anyone will ship kernels with NR_CPUS=8?  What are

____cacheline_internodealigned_in_smp will expand this structure to

OK, thanks.  As it's effectively a bugfix, a full description of the
bug would grease some wheels.

--

From: Paul E. McKenney
Date: Friday, September 5, 2008 - 9:16 pm

Ubuntu Feisty ships with NR_CPUS=8, but yes, I imagine that this number
will increase.  Perhaps a table for 64-bit CPUs:

			 Cachelines per	      Total
       NR_CPUs		 Implementation	    Cachelines

	1-64			 1		 2	[laptop distros]
       65-128			 3		 6	[x86 distros, Power]
      129-192			 4		 8
      193-256			 5		10
      257-320			 6		12
      321-384			 7		14
      385-448			 8		16
      449-512			 9		18	[SGI ca. 2004]
      513-576			10		20
      577-640			11		22
      641-704			12		24
      705-768			13		26
      769-832			14		28
      833-896			15		30
      897-960			16		32
      961-1024			17		34	[SGI ca. 2006?]

      . . .

     3967-4032			64	       128
     4033-4096			65	       130	[SGI ca. 2008/9?]
     4097-4160			67	       134
     4161-4224			68	       136

And so on, limiting out at 262,144 CPUs, which should be at least
3-5 years in the future (famous last words).  The "Cachelines per
Implementation" column is for each of rcu and rcu_bh, while the "Total
Cachelines" column is for the combination of both.  As you can see, the
number of cachelines consumed by the rcu_nodes hierarchy is quite modest
as a function of the number of CPUs, even for pretty big numbers of CPUs.


Ah!  I guess that I have not been paying sufficient attention.

OK, http://www.scalemp.com/ says that they go to 128 cores, so perhaps
256 hardware threads, and 1TB of memory.  This is about 4GB per hardware
thread.  My approach costs them 4096/64=64 bytes per hardware thread,
or about 2 millionths of a percent of memory.

I think that this is OK.  If not, one option is to pick a smaller

Here is the gist of what I was told:

	Above several hundred CPUs, the current RCU implementation
	suffers from severe lock contention, rendering the machine
	useless.  Crude workarounds exist, but are expected to cause
	their own problems at higher CPU counts.

I never have used a machine with that many CPUs, so am just passing on
what I heard.  But given that Classic RCU was ...
From: Manfred Spraul
Date: Saturday, September 6, 2008 - 9:37 am

Hi Paul,


I've noticed that right now rcu_enter_nohz() can be nested within 
rcu_irq_enter():
irq_exit() first calls tick_nohz_stop_sched_tick(), then rcu_irq_exit().
And tick_nohz_stop_sched_tick() can switch into nohz mode.

Is that intentional? Does rcupreempt support that? It broke my rcustate 
code on x86-64.

I would prefer if something like the attached patch is applied. What do 
you think?
Do you need the patch as well?

--
    Manfred
From: Paul E. McKenney
Date: Sunday, September 7, 2008 - 10:25 am

Good question -- when I tried splitting irqs from NMIs, things broke
badly, and this might well explain it.  Thank you very much for the
hint!!!


--

From: Manfred Spraul
Date: Sunday, September 7, 2008 - 3:18 am

I fear that this won't work.
E.g. look at x86, smp_callin() [arch/x86/kernel/smpboot.c]:
The boot code must enable local interrupts around calibrate_delay(), 
otherwise the NMI watchdog would complain.
That means the first interrupts, the first read side critical sections 
can run way before the cpu bit is set within cpu_online_map.
cpus are just started, we are not within stop_machine. Thus we cannot 
make any assumption about what the remaining cpus are doing.

What about introducing a CPU_STARTING notifier call, similar to CPU_DYING:
- called with disabled interrupts
- called before interrupts are enabled
- must not sleep
- called on the new cpu.

This might also be useful for something like kvm. I'm not sure if it's 
guaranteed that hardware_enable() runs early enough.

Attached is a patch proposal. Note that it doesn't work correctly: On 
x86-64, I have seen that CPU_STARTING is called after CPU_ONLINE. Thus 
frozen_cpus could already be cleared, then _FROZEN would be wrong.

--
Manfred

I would find that useful too. I had several cases where i had 
to add smp_call_function_single() with a second callback to the CPU_UP 
notifier, which always seemed ugly.

-Andi
-- 
ak@linux.intel.com
--

From: Paul E. McKenney
Date: Sunday, September 7, 2008 - 12:46 pm

I suppose another approach would be to make calibrate_delay() kick the
watchdog timer every so often, but the effort explaining to people why
their machine's bogomips had decreased would probably far exceed any
possible benefit...

So given a CPU_STARTING notifier, RCU could set a bit in a coming-online
bitmap, which would be reset rcu_online_cpu().  Then RCU would consider
a CPU offline only if it is marked offline in cpu_online_map -and- it is

Then notify_cpu_starting() is invoked right before smp_callin() in
start_secondary()?


--

From: Paul E. McKenney
Date: Monday, September 15, 2008 - 9:02 am

Hello!

This patch fixes a long-standing performance bug in classic RCU that
results in massive lock contention on the internal RCU lock on systems
with more than a few hundred CPUs.  Although this patch creates a
separate flavor of RCU for easy of review and patch maintenance, it
is intended to replace classic RCU.

Still experimental, not for inclusion, but given that I am now finding
more bugs in the rest of Linux than in this code, I suspect that it is
getting close.  Definitely ready for serious experimental use, especially

Updates from v4:

o	Separated dynticks interface so that NMIs and irqs call separate
	functions, greatly simplifying it.  In particular, this code
	no longer requires a proof of correctness.  ;-)

o	Separated dynticks state out into its own per-CPU structure,
	avoiding the duplicated accounting.

o	The case where a dynticks-idle CPU runs an irq handler that
	invokes call_rcu() is now correctly handled, forcing that CPU
	out of dynticks-idle mode.

o	Review comments have been applied (thank you all!!!).
	For but one example, fixed the dynticks-ordering issue that
	Manfred pointed out, saving me much debugging.  ;-)

o	Adjusted rcuclassic and rcupreempt to handle dynticks changes.

Attached is an updated patch to Classic RCU that applies a
hierarchy, greatly reducing the contention on the top-level lock
for large machines.  This passes 10-hour concurrent rcutorture and
online-offline testing on 128-CPU ppc64 without dynticks enabled,
and exposes some timekeeping bugs in presence of dynticks (exciting
working on a system where "sleep 1" hangs until interrupted...).
It is OK for experimental work, but not yet ready for inclusion.
See also Manfred Spraul's recent patches (or his earlier work from
2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2).
We will converge onto a common patch in the fullness of time, but are
currently exploring different regions of the design space.  That said,
I have already gratefully stolen quite a few of ...
From: Manfred Spraul
Date: Tuesday, September 16, 2008 - 9:52 am

Hi Paul,

I'm still comparing my implementation with your code:
- f is called once for each cpu in the system, correct?
- if at least one cpu is in nohz mode, this loop will be needed for 
every grace period.

That means an O(NR_CPUS) loop with disabled local interrupts :-(
Is that correct?

Unfortunately, my solution is even worse:
My rcu_irq_exit() acquires a global spinlock when called on a nohz cpus.
A few cpus in cpu_idle, nohz, executing 50k network interrupts/sec would 
cacheline-trash that spinlock.
I'm considering counting interrupts: if a nohz cpu executes more than a 
few interrupts/tick, then add a timer that check rcu_pending().

Perhaps even wouldn't be enough: I remember that the initial unhandled 
irq detection code broke miserably on large SGI systems:
An atomic_inc(&global_var) in the local timer interrupt (i.e.: 
NR_CPUS*HZ calls/sec) caused so severe trashing that the system wouldn't 
boot. IIRC that was with 512 cpus.


--
    Manfred
--

From: Paul E. McKenney
Date: Tuesday, September 16, 2008 - 10:30 am

Hello, Manfred!


Not necessarily.  If all CPUs corresponding to this rcu_state struct
have checked in already, we don't even get to this loop -- see the

The outer loop, yes.  The inner loop only for those rcu_state structs

With the definition of "O()" being the worst-case execution time, yes.
But this worst case could only happen when the system was mostly idle,
in which case the added overhead should not be too horribly bad.  If the
system was busy enough that each CPU ran at least one process during each
grace period, then this function would not be invoked in the first place.

If this does prove to be a problem in practice, I will rework
force_quiescent_state() to run incrementally.  But I would rather
avoid both the added complexity and the resulting longer grace periods,
so someone needs to bring me a real-world problem before I take that

I tried putting a cpu_quiet() in my rcu_irq_exit() as well, and quickly

/me runs off and checks to make sure that all of my dyntick entry/exit
code restricts itself to per-CPU variables...

Yep!  (Whew!!!)

							Thanx, Paul
--

From: Manfred Spraul
Date: Tuesday, September 16, 2008 - 10:48 am

No: "was mostly running cpu_idle()". A cpu_idle() cpu could execute lots 
of irqs and softirqs.
So the worst case would be a system with 1 cpu/node for reserved for irq 
handling.
The "idle" cpu would be always in no_hz mode, even though it might be 
100% busy handling irqs.
The remaning cpus might be 100% busy handling user space.

And every quiescent state will end up in that O(NR_CPUS) loop.

--
    Manfred
--

From: Paul E. McKenney
Date: Tuesday, September 16, 2008 - 11:22 am

Good point!

Indeed, if you had a 1024-CPU box acting as (say) a router/hub using
the Linux-kernel protocol stacks with no user-mode processing, then
you could indeed have the system mostly busy with no user-space code
running, and thus no quiescent states.

However, last I checked, almost all 1024-CPU boxes run HPC workloads
mostly in user mode, so this scenario would not occur.  However, again,
if it does come up, I would add an additional level of state machine
to the force_quiescent_state() family of functions, so that the scan
would be done incrementally.  Perhaps arranging for CPU groups to be
scanned by CPUs within that group.

But again, I don't want to take that step until I see someone actually
needing it.  Maybe the Vyatta guys will be there sooner than I think,
but...

							Thanx, Paul
--

From: Manfred Spraul
Date: Sunday, September 21, 2008 - 4:09 am

Hi Paul,

Some further thoughts about design differences between your and my 
implementation:

- rcutree's qsmaskinit  is the worst-case list of cpus that could be in 
rcu read side critical sections.
- rcustate's cpu_total is the accurate list of cpus that could be in rcu 
read side critical sections.

Both variables are read rarely: for rcu_state, twice per grace period.

rcutree fixes up cpus that are "incorrectly" listed in qsmaskinit with 
force_quiescent_state(). It forces rcutree to use a cpu bitmask for 
qsmask and it forces rcutree to store the "done" information in a global 
structure. Additionately, in the worst case force_quiescent_state() must 
loop over all cpus.
rcustate can use per-cpu structures and a global atomic_t. There is no 
loop over all cpus. That's a big advantage, thus I think it's worth the 
effort to maintain an accurate list.
Unfortunately, I don't have an efficient implementation for the accurate 
list.

Some random ideas:
- cpu_total is only read rarely. Thus it would be ok if the read 
operation is expensive [e.g. collect data from multiple cachelines, 
acquire spinlocks...]
- updates to cpu_total happen with every interrupt on an idle system 
with no_hz.
    Thus it must be very scalable, preferably per-cpu data.
    And: Updates are far more frequent than grace periods.
- updates to cpu_total happen nearly never without no_hz.
   Especially: far less frequent than grace periods.

What about adding an "invalid" flag to cpu_total? The "real" data is 
stored in per-cpu structures.
- when a cpu enters/leaves nohz, then it invalidates the global 
cpu_total and updates a per-cpu structure
- when the state machine needs the number of rcu-tracked cpus, then it 
checks if the global cpu_total is valid.
If it's valid, then cpu_total is used directly. Otherwise the per-cpu 
structures are enumerated and the new value is stored as cpu_total.

What do you think?

--
    Manfred
--

From: Paul E. McKenney
Date: Sunday, September 21, 2008 - 2:14 pm

That has been one of my biggest questions about your approach, that and
the need to hit holdouts with resched IPI.  (Though perhaps you have

Agreed.  I could use this approach as well, having each CPU set and clear
its qsmaskinit bit on every exit from and entry to dynticks idle state,
but see below...

In your case, you would need to carefully keep state so that a CPU
entering dynticks idle mode would know whether or not it needed to
respond to the current grace period -- but you need this in any case,

Yep!  Hence my reluctance to add overhead to the dynticks side of the


I use an analogous algorithm, as the qsinitmask values might change while
setting up for the next quiescent state.  The tough part of this was
correctly handling races between setting up for the quiescent state and
onlining/offlining CPUs (and, in your case, CPUs entering/leaving dynticks
idle mode).  I chose to use a global lock that excludes online/offline
and starting a quiescent state (except in the case where there are so
few CPUs that there is but one rcu_node structure, in which case that
structure's lock suffices, so that onofflock need not be acquired).

But although acquiring a global lock is reasonable for CPU online/offline
(as there can be but one such operation at a time), it would be quite
painful for the dynticks case.

Of course, I might be able to avoid the need for this global lock if I
were willing to acquire the interior-node rcu_node locks when setting
up for the next grace period.  But this would require me to put in a
cleanup step after grace-period setup, as some set of dyntick operations
might otherwise end the grace period before it is fully initialized.
This could potentially result in two different CPUs setting up grace
periods concurrently (yuck!).  Worse yet, the grace period could end
while the initialization was only halfway through the leaves, so that
the CPU doing the initialization would need to recognize this and stop
further initialization -- and, again, clean up ...
From: Paul E. McKenney
Date: Tuesday, September 23, 2008 - 4:53 pm

Hello!

This patch fixes a long-standing performance bug in classic RCU that
results in massive lock contention on the internal RCU lock on systems
with more than a few hundred CPUs.  Although this patch creates a
separate flavor of RCU for easy of review and patch maintenance, it
is intended to replace classic RCU.

Still experimental, not for inclusion, but given that I am now finding
more bugs in the rest of Linux than in this code, I suspect that it
is getting close.  Definitely ready for serious experimental use,
especially in !CONFIG_NO_HZ configurations.  In particular, experience
on an actual 1000+ CPU machine would be most welcome, and now appears
to be forthcoming!

Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):

o	Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
	changeset some time ago, and finally got around to retesting
	this option).

o	Fix some tracing bugs in rcupreempt that caused incorrect
	totals to be printed.

o	I now test with a more brutal random-selection online/offline
	script (attached).  Probably more brutal than it needs to be
	on the people reading it as well, but so it goes.

o	A number of optimizations and usability improvements:

	o	Make rcu_pending() ignore the grace-period timeout when
		there is no grace period in progress.

	o	Make force_quiescent_state() avoid going for a global
		lock in the case where there is no grace period in
		progress.

	o	Rearrange struct fields to improve struct layout.

	o	Make call_rcu() initiate a grace period if RCU was
		idle, rather than waiting for the next scheduling
		clock interrupt.

	o	Invoke rcu_irq_enter() and rcu_irq_exit() only when
		idle, as suggested by Andi Kleen.  I still don't
		completely trust this change, and might back it out.

	o	Make CONFIG_RCU_TRACE be the single config variable
		manipulated for all forms of RCU, instead of the prior
		confusion.

	o	Document tracing files and formats for both rcupreempt
		and rcutree.

Updates ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 12:26 am

could you please send this bit separately, even if rcutree isnt finished 
yet? Seems like a quite useful debug feature.

	Ingo
--

From: Paul E. McKenney
Date: Thursday, September 25, 2008 - 7:05 am

Good point -- I guess I do need to implement the CLASSIC_RCU piece of
this in any case.

							Thanx, Paul
--

From: Ingo Molnar
Date: Thursday, September 25, 2008 - 12:29 am

i'm wondering about those timekeeping bugs. Do you have an idea what's 
it about and does it affect mainline?

	Ingo
--

From: Paul E. McKenney
Date: Thursday, September 25, 2008 - 7:18 am

Sad to say, they do affect mainline -- I can reproduce these problems
in 2.6.27-rc7.  Thomas has given me a couple of fixes that got rid of
earlier problems in which jiffies would stop counting, and I believe
has located the cause of at least one other problem.

							Thanx, Paul
--

From: Paul E. McKenney
Date: Friday, October 10, 2008 - 9:09 am

Hello!

This patch fixes a long-standing performance bug in classic RCU that
results in massive lock contention on the internal RCU lock on systems
with more than a few hundred CPUs.  Although this patch creates a
separate flavor of RCU for easy of review and patch maintenance, it
is intended to replace classic RCU.

Still experimental, not for inclusion, but getting quite close.  I expect
to have it in shape for 2.6.29.  Definitely ready for -serious- testing
and abuse.  In particular, experience on an actual 1000+ CPU machine
would be most welcome, and still appears to be forthcoming...

Updates from v6 (http://lkml.org/lkml/2008/9/23/448):

o	Fix a number of checkpatch.pl complaints.

o	Apply review comments from Ingo Molnar and Lai Jiangshan
	on the stall-detection code.

o	Fix several bugs in !CONFIG_SMP builds.

o	Fix a misspelled config-parameter name so that RCU now announces
	at boot time if stall detection is configured.

o	Run tests on numerous combinations of configurations parameters,
	which after the fixes above, now build and run correctly.

Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line):

o	Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a
	changeset some time ago, and finally got around to retesting
	this option).

o	Fix some tracing bugs in rcupreempt that caused incorrect
	totals to be printed.

o	I now test with a more brutal random-selection online/offline
	script (attached).  Probably more brutal than it needs to be
	on the people reading it as well, but so it goes.

o	A number of optimizations and usability improvements:

	o	Make rcu_pending() ignore the grace-period timeout when
		there is no grace period in progress.

	o	Make force_quiescent_state() avoid going for a global
		lock in the case where there is no grace period in
		progress.

	o	Rearrange struct fields to improve struct layout.

	o	Make call_rcu() initiate a grace period if RCU was
		idle, rather than waiting for the next ...
From: Manfred Spraul
Date: Sunday, October 12, 2008 - 8:52 am

What about using CPU_DYING and CPU_STARTING?

Hmm - your code must loop multiple times over the cpus.
I've use a different approach: More forcing is only required for a nohz 
cpu when it was hit within a long-running interrupt.
Thus I've added a '->kick_poller' flag, rcu_irq_exit() reports back when 
the long-running interrupt completes. Never more than one loop over the 
outstanding cpus is required.

--
    Manfred
--

From: Paul E. McKenney
Date: Sunday, October 12, 2008 - 3:46 pm

Because I don't want to tie RCU too tightly to the details of the
online/offline implementation.  It is too easy for someone to make a
"simple" change and break things, especially given that the online/offline
code still seems to be adjusting a bit.

So I might well use CPU_DYING and CPU_STARTING, but I would still keep

Do you send a reschedule IPI to CPUs that are not in dyntick idle mode,
but who have failed to pass through a quiescent state?

In my case, more forcing is required only for a nohz CPU in a long-running
interrupt (as with your approach), for sending the aforementioned IPI,
and for checking for offlined CPUs as noted above.

							Thanx, Paul
--

From: Manfred Spraul
Date: Monday, October 13, 2008 - 11:03 am

Actually, I never send reschedule IPIs.

- For "usual" cpus, rcu_check_callbacks() checks for quiescent states.
There should be a set_need_resched() if a cpu holds up a grace period 
for too long. I just haven't implemented it yet.
IMHO it doesn't make sense to perform a "for_each_cpu() 
smd_send_reschedule()". rcu has a hook in each cpu, thus a 
set_need_resched() by the per-cpu hook is faster/simpler.

- For nohz cpus, a poller function [schedule_work(), enabled interrupts] 
peeks into the per-cpu data of the nohz cpu and checks if it is quiet or 
if it passed through a quiescent state.
If it didn't, then it sets a cpu_data->kick_poller flag and 
rcu_irq_exit() reports the grace period.
No need for an IPI either - rcu has a hook in the irq exit path.

Right now, I cheat if a nohz cpu is in a long-running nmi 
[while(other_cpu_is_in_nmi()) cpu_relax()], but I think I can fix that 
with an set_need_resched() in the rcu_nmi_exit().

--
    Manfred
--

From: Paul E. McKenney
Date: Tuesday, October 14, 2008 - 6:11 pm

One indeed only sends resched IPIs to CPUs that are not responding on
their own.  And in the case of rcutree.c, the loop is not over the CPUs
themselves, but rather over the leaf rcu_node structures (one per 64
CPUs) -- the per-CPU rcu_data structure is touched only for CPUs that

I considered adding a cpu_quiet() on the irq exit path, but eventually
decided that I should instead place the added overhead in the infrequently
invoked force_quiescent_state() function.  Could be argued either way,

Hmmm...  I don't see where the NMI exit path checks the TIF_NEED_RESCHED
flag, but I could easily be missing something.

I instead maintain separate counters for the NMI and irq entry/exit code,
which are checked in the force_quiescent_state() path.

							Thanx, Paul
--

From: Manfred Spraul
Date: Wednesday, October 15, 2008 - 1:13 am

rcu_irq_exit() is only called on idle cpus.
Good point.
I haven't looked at the issue yet.
Perhaps a smd_send_reschedule(smp_processor_id()) is necessary.

Btw, I found a bug in my state machine: Right now, the state machine 
will lock up if all cpus are in nohz mode.
I'm not sure if it applies to your code as well.

--
    Manfred
--

From: Paul E. McKenney
Date: Wednesday, October 15, 2008 - 8:26 am

Only once per such CPU every grace period -- seems in the noise to me.
But I should revisit, as I have changed things quite a bit since I

Is legal to call that from an NMI handler?  Looks to me that some x86
architectures do sequences of device-register reads and writes to send

I avoid this problem by forbidding a CPU with an active RCU callback
from entering nohz mode.  Therefore, the only way that all CPUs can be in
nohz mode is if there are no RCU callbacks in the system.  In this case,
RCU grace periods will never complete, but that is OK because there is
no need for RCU grace periods.  One leaves this all-nohz state when some
irq handler either invokes call_rcu() or awakens some task.  In the former
case, rcu_irq_exit() will see the callback and invoke set_need_resched(),
while in the latter case the normal dynticks mechanism will wake up
some CPU.

This is not a problem from NMI handlers, as NMI handlers are not permitted
to invoke call_rcu().  Or much of anything else, for that matter.  ;-)

							Thanx, Paul
--

From: Manfred Spraul
Date: Wednesday, October 22, 2008 - 11:41 am

Another small point:
Does your implementation support rcu_check_callbacks() with cpu != 
smp_processor_id()?
I don't think my locking would support it properly.
Thus:
- cpu != smp_processor_id() doesn't work.
- stack space for a useless parameter.
- the explicit cpu parameter prevents the rcu code from using get_cpu_var().

What about modifying the rcu_check_callbacks() prototype? I'd propose to 
remove the cpu parameter.

--
    Manfred
--

From: Paul E. McKenney
Date: Wednesday, October 22, 2008 - 2:02 pm

That would work fine for rcutree.c.  If I were to invoke
rcu_check_callbacks() remotely, I would use something like
smp_call_function() to make it happen.

Hmmm...  Looks like rcu_pending is also always called with its cpu
parameter set to the current CPU, and same for rcu_needs_cpu().
And given that all the external uses of rcu_check_callbacks() are
of the following form:

	if (rcu_pending(cpu))
		rcu_check_callbacks(cpu, whatever);


perhaps rcu_pending() should be an internal-to-RCU API invoked from
rcu_check_callbacks().

Thoughts?

							Thanx, Paul
--

From: Manfred Spraul
Date: Wednesday, October 22, 2008 - 2:24 pm

From my point of view: Yes, change it.

In the long run, I'd like to move the stall detector code to rcupdate.c, 
with an 'rcu_cpu_missing' callback. That one would need a cpu flag, but 
that's a new function.

--
    Manfred
--

From: Paul E. McKenney
Date: Monday, October 27, 2008 - 9:45 am

Agreed.  Perhaps a good change to make while introducing stall detection
to preemptable RCU -- there would then be three examples, which should
allow good generalization.

							Thanx, Paul
--

From: Manfred Spraul
Date: Monday, October 27, 2008 - 12:48 pm

Two implementations. IMHO the current rcu-classic code should be dropped 
immediately when you add rcu-tree:
rcu-classic is buggy, as far as I can see long-running interrupts on 
nohz cpus are not handled correctly. I don't think it makes sense to 
keep it in the kernel in parallel to rcu-tree.

I would propose that rcu-tree replaces rcu-classic.
I'll continue to update rcu-state, I think that it will achieve lower 
latency than rcu-tree [average/max time between call_rcu() and 
destruction callback] and it doesn't have the irq disabled loop to find 
the missing cpus.
If I find decent benchmarks where I can quantify the advantages, then 
I'll propose to merge rcu-state as a third implementation in addition to 
rcu-tree and rcu-preempt.

Paul: What do you think?

--
    Manfred


--

From: Paul E. McKenney
Date: Monday, October 27, 2008 - 4:52 pm

In keeping with my reputation as a "conservative programmer", I would
suggest that rcuclassic.c remain for a year or so.  Distros branching
off during this time should continue making rcuclassic.c be the default.
Other uses should have rcutree.c as the default.  At the end of the year,
we remove rcuclassic.c.

All that said, one attractive aspect of your suggestion is immediately
removing rcuclassic.c would eliminate the need to do further work on it.  ;-)

Your benchmarking proposal for rcu-state makes sense to me.

One other possible place for techniques from rcu-state may be in making
preemptable RCU scale.  This may take some time, as other parts of
the RT kernel have their limitations, but sooner or later people are
going to expect real-time response from even the largest machines.
In addition, preemptable RCU has a number of shorter-term issues:

1.	RCU-boosting mechanism.  (I need to combine the best of
	Steve's and my mechanisms.  The treercu.c effort has been
	sort of a warm-up exercise for RCU-boosting.)

2.	Reducing the latency contribution of the preemptable RCU
	state machine (but note that moving this state machine out
	of the scheduling-clock irq handler means more stuff to boost).

3.	Porting the simpler dynticks interface from rcutree to
	preemptable RCU.

4.	Making the preemptable RCU tracing code use seqfile.

Hmmm...  Maybe it is (past) time for me to publish an RCU to-do list?

							Thanx, Paul
--

From: Manfred Spraul
Date: Monday, October 27, 2008 - 10:30 pm

How do you intend to handle nohz cpus?
I would create a separate patch that removes rcuclassic.c. distros that 
want to keep rcuclassic could just revert that change.

--

--

From: Paul E. McKenney
Date: Tuesday, October 28, 2008 - 8:17 am

In which variant of RCU?  My current thought is to apply the rcutree.c
version to rcupreempt.c.  If rcuclassic.c can be dropped, my thought
would be to leave it alone -- it is unnecessarily awakening CPUs, but

That does make a lot of sense.  At least it would make my life simple.  ;-)

--

From: Manfred Spraul
Date: Tuesday, October 28, 2008 - 10:21 am

For rcuclassic.

As far as  I can see, rcuclassic treats nohz cpus as always outside 
rcu_read_lock():
rcu_start_batch() contains
 >
 >                cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
 >
As soon as all cpus from rcp->cpumask reported a grace period, the 
callbacks are called.
That a bug, therefore I would drop rcuclassic as soon as rcutree is merged.

--
    Manfred

--

From: Paul E. McKenney
Date: Tuesday, October 28, 2008 - 10:35 am

If we were to keep rcuclassic for any length of time, I would modify
rcu_pending() and rcu_check_callbacks() to invoke force_quiescent_state()

Good point, I had forgotten that issue.  Making this modification would
cause the resulting rcuclassic to be just as suspect as is rcutree,
I suppose.

A strong argument for moving to rcutree.c quickly rather than slowly,
I must admit!

							Thanx, Paul
--

From: Gautham R Shenoy
Date: Friday, October 17, 2008 - 1:34 am

Hi Paul,

				   ^^^^^^^^^^^^^^^^^^
=>
       This can also happen when a CPU has just gone offline,
       but RCU hasn't yet marked it as offline. However, it's impact
       on delaying the grace period may not be high as in the
=>
	This check is safe here since this callpath is invoked
	from a softirq, and thus the system cannot do a stop_machine()
	as yet. This implies that the cpu in question cannot go offline

=>	Just wondering, can't NMI handlers queue callbacks? If yes,
=>					       ^^^^^^^^^^^^^^^^^^^
=>	----------------> [1]
>
From: Gautham R Shenoy
Date: Friday, October 17, 2008 - 8:35 am

^^^^^^^^^^^^^^^^^^

        This can also happen when a CPU has just gone offline,
        but RCU hasn't yet marked it as offline. However, it's impact
        on delaying the grace period may not be high as in the

	This check is safe here since this callpath is invoked
	from a softirq, and thus the system cannot do a stop_machine()
	as yet. This implies that the cpu in question cannot go offline
					       ^^^^^^^^^^^^^^^^^^^
	----------------> [1]
		If this condition is true, then,
		rsp->gpnum != rsp->completed. Hence, we will always enter
		the if() condition in print_other_cpu_stall() at
		[1] (See above), and return without ratting our buddy.

		That defeats the purpose of the stall check or I am
	rdtp->dynticks is odd. Hence rdtp->dynticks + 1 should be even.
-- 
Thanks and Regards
gautham
--

From: Paul E. McKenney
Date: Friday, October 17, 2008 - 8:46 am

The "/" command in "vi" works pretty well for me.  ;-)

These are the same as the ones in your earlier note, correct?

--

From: Paul E. McKenney
Date: Friday, October 17, 2008 - 8:43 am

Thank you for looking this over, and especially for noting several issues!

Responses interspersed.




Yep!  I note that in the comment preceding the cpu_is_offline() above.
Is that sufficient, or should I reiterate that point in another comment

NMI handlers are forbidden to queue callbacks, as the current call_rcu()
implementation is not NMI-safe.  It would be possible to create an
NMI-safe implementation, but there needs to be someone who needs it

> > + * is already in a quiescent state courtesy of dynticks id
From: Paul E. McKenney
Date: Monday, December 8, 2008 - 11:42 am

Because this line should instead be:

	rdtp->dynticks_nmi = (rdtp->dynticks_nmi + 1) & ~0x1;

Well spotted, even if it did take me a good long time to figure out
that this really was a bug in my code!  ;-)

That said, you would have to really work to exercise this one...  Near as
I can tell, you would need to wrap the ->dynticks counter, which would
then cause the dynticks_nmi counter to appear to go backwards.  And then
you would have to prevent the newly onlined CPU from ever passing through
a quiescent state, which would cause a failure in any case.

Still, good to fix, even if I can't figure out how it would result in
a failure.  Real hardware and software tends to be -much- better than me
at finding such failures!

							Thanx, Paul
--

From: Manfred Spraul
Date: Sunday, November 2, 2008 - 1:10 pm

What about some generic files, with the same content for all rcu backends?
The implementation could be in rcupdate.c. At least counting the rcu 
callbacks could be done from generic code, the grace periods could be 
queried [Do all backends implement rcu_batches_completed?]

Attached is a hack that I use right now for myself.
Btw - on my 4-cpu system, the average latency from call_rcu() to the rcu 
callback is 4-5 milliseconds, (CONFIG_HZ_1000).

--
    Manfred
From: Paul E. McKenney
Date: Monday, November 3, 2008 - 1:33 pm

For the grace-period latencies, I use the attached kernel
module.  This module simply times synchronize_rcu() to obtain
the grace-period performance.  I have similar modules for other
performance measures.

(Please note that this code was written for my own use only, others
may find the kernel-module parameter names to be quite obnoxious.
But yes, nreaders=1 will give you one synchronize_rcu() -update-
task.  What can I say? I used this module to collect data for
http://www.research.ibm.com/journal/sj/472/guniguntala.pdf, and was not
considering use by others.)

I would rather avoid decorating the RCU code with grace-period-latency
measurement code, since this can be done independently with a kernel
module, and I suspect that more people read the RCU code than measure
the grace-period latency.

All backends do implement rcu_batches_completed(), but its return

Hmmm...  I would expect that if you have some CPUs in dyntick idle mode.
But if I run treercu on an CONFIG_HZ_250 8-CPU Power box, I see 2.5
jiffies per grace period if CPUs are kept out of dyntick idle mode, and
4 jiffies per grace period if CPUs are allowed to enter dyntick idle mode.

Alternatively, if you were testing with multiple concurrent
synchronize_rcu() invocations, you can also see longer grace-period
latencies due to the fact that a new synchronize_rcu() must wait for an
earlier grace period to complete before starting a new one.

The 2.5 jiffies is expected when RCU is idle: the synchronize_rcu()
starts a new grace period immediately, it takes up to a jiffy for the
other CPUs do their scheduling-clock interrupt (which notices that the
grace period started), another jiffy for all the CPUs to do their
next scheduling-clock interrupt (which notices the quiescent state),
and on average a half jiffy for the initiating CPU to notice that the
grace period has completed.

I considered trying to trim an additional jiffy off of this grace-period
latency by having CPUs keep track of whether they were in a ...
From: Manfred Spraul
Date: Wednesday, November 5, 2008 - 12:48 pm

That's the reason why I decided to measure the real latency, from 
call_rcu() to the final callback. It includes the delays for waiting 
until the current grace period completes, until the softirq is 
scheduled, etc.
Probably one cpu was not in user space when the timer interrupt arrived.
I'll continue to investigate that. Unfortunately, my first attempt 
failed: adding too many printk's results in too much time spent within 
do_syslog(). And then the timer interrupt always arrives on the 
spin_unlock_irqrestore in do_syslog()....

--
    Manfred

--

From: Paul E. McKenney
Date: Wednesday, November 5, 2008 - 2:27 pm

I believe that I get very close to the same effect by timing a call to
synchronize_rcu() in a kernel module.  Repeating measurements and
printing out cumulative statistics periodically reduces the heisenberg

;-)

							Thanx, Paul
--

From: Paul E. McKenney
Date: Saturday, November 15, 2008 - 4:20 pm

Hello!

This patch fixes a long-standing performance bug in classic RCU that
results in massive internal-to-RCU lock contention on systems with
more than a few hundred CPUs.  Although this patch creates a separate
flavor of RCU for ease of review and patch maintenance, it is intended
to replace classic RCU.

The current patch passes more stress tests than does mainline, so the next
version will be against -tip in preparation for 2.6.29.  Definitely ready
for -serious- testing and abuse, and probably would do OK in production.
In particular, experience on an actual 1000+ CPU machine would be most
welcome, and still appears to be forthcoming some day...

Updates from v7 (http://lkml.org/lkml/2008/10/10/291):

o	Fixed a number of problems noted by Gautham Shenoy, including
	the cpu-stall-detection bug that he was having difficulty
	convincing me was real.  ;-)

o	Changed cpu-stall detection to wait for ten seconds rather than
	three in order to reduce false positive, as suggested by Ingo
	Molnar.

o	Produced a design document (http://lwn.net/Articles/305782/).
	The act of writing this document uncovered a number of both
	theoretical and "here and now" bugs as noted below.

o	Fix dynticks_nesting accounting confusion, simplify WARN_ON()
	condition, fix kerneldoc comments, and add memory barriers
	in dynticks interface functions.

o	Add more data to tracing.

o	Remove unused "rcu_barrier" field from rcu_data structure.

o	Count calls to rcu_pending() from scheduling-clock interrupt
	to use as a surrogate timebase should jiffies stop counting.

o	Fix a theoretical race between force_quiescent_state() and
	grace-period initialization.  Yes, initialization does have to
	go on for some jiffies for this race to occur, but given enough
	CPUs...

Updates from v6 (http://lkml.org/lkml/2008/9/23/448):

o	Fix a number of checkpatch.pl complaints.

o	Apply review comments from Ingo Molnar and Lai Jiangshan
	on the stall-detection code.

o	Fix several bugs in !CONFIG_SMP ...
Previous thread: Re: char/tpm: tpm_infineon no longer loaded for HP 2510p laptop by Kay Sievers on Thursday, August 21, 2008 - 2:18 pm. (9 messages)

Next thread: [PATCH 1/2] smp_call_function: don't use lock in call_function_data by Jeremy Fitzhardinge on Thursday, August 21, 2008 - 5:29 pm. (3 messages)