Hello! Experimental, not for inclusion. Attached is a patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes mild rcutorture testing on x86 and ppc64, but is most definitely not ready for inclusion. It is OK for experimental work assuming sufficiently brave experimenters. See also Manfred Spraul's patch at http://lkml.org/lkml/2008/8/21/336 (or his earlier work from 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2). We will converge onto a common patch in the fullness of time, but are currently exploring different regions of the design space. This patch provides CONFIG_RCU_FANOUT, which controls the bushiness of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on 64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT, there is no hierarchy. By default, the RCU initialization code will adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable this balancing, allowing the hierarchy to be exactly aligned to the underlying hardware. Up to two levels of hierarchy are permitted (in addition to the root node), allowing up to 16,384 CPUs on 32-bit systems and up to 262,144 CPUs on 64-bit systems. I just know that I am going to regret saying this, but this seems more than sufficient for the foreseeable future. (Some architectures might wish to set CONFIG_RCU_FANOUT=4, which would limit such architectures to 64 CPUs. If this becomes a real problem, additional levels can be added, but I doubt that it will make a significant difference on real hardware.) In the common case, a given CPU will manipulate its private rcu_data structure and the rcu_node structure that it shares with its immediate neighbors. This can reduce both lock and memory contention by multiple orders of magnitude, which should eliminate the need for the strange manipulations that are reported to be required when running ...
just a quick stylistic suggestion: if feasible then such sizing ugliness should be hidden in a Kconfig file. (if Kconfig is capable enough for this that is) Ingo --
I have no idea if Kconfig can do it, but I will check. Thanx, Paul --
OK, Kconfig does not currently support arithmetic, based on zconf.y:
expr: symbol { $$ = expr_alloc_symbol($1); }
| symbol T_EQUAL symbol { $$ = expr_alloc_comp(E_EQUAL, $1, $3); }
| symbol T_UNEQUAL symbol { $$ = expr_alloc_comp(E_UNEQUAL, $1, $3); }
| T_OPEN_PAREN expr T_CLOSE_PAREN { $$ = $2; }
| T_NOT expr { $$ = expr_alloc_one(E_NOT, $2); }
| expr T_OR expr { $$ = expr_alloc_two(E_OR, $1, $3); }
| expr T_AND expr { $$ = expr_alloc_two(E_AND, $1, $3); }
;
All we currently get is basic comparison and logical operators. It would
not be all -that- hard to add general arithmetic (famous last words),
but when I tried mapping out what the sizing code would look like in
such an augmented Kconfig, it was even uglier than the above.
So I took a hard look at the current mess, and prettied it as shown below.
Is this a sufficient improvement?
Another alternative I am considering is moving this to a separate
include file.
Thoughts?
Thanx, Paul
#define MAX_RCU_LEVELS 3
#define RCU_FANOUT (CONFIG_RCU_FANOUT)
#define RCU_FANOUT_SQ (RCU_FANOUT * RCU_FANOUT)
#define RCU_FANOUT_CUBE (RCU_FANOUT_SQ * RCU_FANOUT)
#if (NR_CPUS) <= RCU_FANOUT
# define NUM_RCU_LVLS 1
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 (NR_CPUS)
# define NUM_RCU_LVL_2 0
# define NUM_RCU_LVL_3 0
#elif (NR_CPUS) <= RCU_FANOUT_SQ
# define NUM_RCU_LVLS 2
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT - 1) / RCU_FANOUT)
# define NUM_RCU_LVL_2 (NR_CPUS)
# define NUM_RCU_LVL_3 0
#elif (NR_CPUS) <= RCU_FANOUT_CUBE
# define NUM_RCU_LVLS 3
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 (((NR_CPUS) + RCU_FANOUT_SQ - 1) / RCU_FANOUT_SQ)
# define NUM_RCU_LVL_2 (((NR_CPUS) + (RCU_FANOUT) - 1) / (RCU_FANOUT))
# define NUM_RCU_LVL_3 NR_CPUS
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT */
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
#define ...I personally don't think that would help. The revised version seems - Josh Triplett --
yeah - looks much better. This was the block that meets the eye for the first time in the patch so it stuck out. just one more small pet peeve of mine: please use vertical alignment too but no strong feelings on that one. (maybe inserting a space at the right places helps too, no need for a full tab) Ingo --
Yep, just like you, spaced it just enough to keep the longest one from running over one line. ;-) I left the definitions for RCU_SUM and NUM_RCU_NODES compact, though: #define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3) #define NUM_RCU_NODES (RCU_SUM - NR_CPUS) The other alternative would be to stack RCU_SUM as follows: #define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + \ NUM_RCU_LVL_2 + NUM_RCU_LVL_3) which seemed to me to add more ugly than enlightenment. Testing is going well. Having to occasionally restrain myself to keep from going full-bore for 4096 CPU optimality -- but have to keep it simple until/unless someone with that large of a machine shows where improvements are needed. Thanx, Paul --
Why put these references here rather than in Documentation/RCU? It seems easier to keep documentation up to date in one place. If you think these represent a good "getting started" set of documents, how about a Documentation/RCU/ReadTheseFirst with links to them, or how u8 or bool, depending on semantics. If just a simple flag, how about Looks like several whitespace changes occurred here; several of these lines didn't actually change except in whitespace. The same comment about sized types applies here, but these fields didn't Some whitespace changes again here; several of these lines didn't change Might want to document in the commit message that you have tracing information through RCU_TRACE, and that it applies to non-preemptible You might want to give a specific example of a NUMA machine, the appropriate value to use on that machine, and the result with and It might actually make sense here to do this instead: ifeq ($(CONFIG_RCU_TRACE),y) obj-$(CONFIG_CLASSIC_RCU) += rcuclassic_trace.o obj-$(CONFIG_PREEMPT_RCU) += rcupreempt_trace.o Same comment as before; maintaining these in a single place seems How about making these state structures static, along with removing the ACCESS_ONCE, like memory barriers, benefits from an accompanying No need to explicitly say "inline"; GCC should do the right thing here. This comment applies to the original code, but: You only call __call_rcu twice, in call_rcu and call_rcu_bh. Both times, you set head first, then wrap the call with local_irq_save. How about moving both into __call_rcu, making call_rcu and call_rcu_bh static DEFINE_MUTEX(rcuclassic_trace_mutex); Did you perhaps want PAGE_SIZE? - Josh Triplett --
OK for grpnum and level, but grphi and grplo need to be "int" to This will need to be a non-bool shortly. OK, so what the heck -are- the official type names??? u8 seems to be defined in a powerpc-specific file. OK, it also appears in include/asm-generic/int-l64.h. s8, u8, s16, u16, s32, u32, s64, and Whitespace-only changes are kind of lost in the noise with this patch. Both can be bool. The completed and gpnum fields can probably be s32, Indeed. But I need to see an actual example before I document it. It would be easy to make things slower by following the NUMA hardware It will try again on the next scheduling-clock interrupt. The reason I did this is because ->onofflock is a global lock acquired when beginning a quiescent state or when onlining/offlining. Can't let force_quiescent_state() monopolize things, and would like to exclude online/offline while doing force_quiescent_state(). Hence make force_quiescent_state() back off if the lock is held. OK. These are accessed without holding the relevant lock, and I don't want the compiler to refetch them. (For example, if inlined into other Someone once told me why this was necessary, but I forget. It was in the original, and I didn't put it there. Some weirdness about conversion to 32-bit integer when the lower 32 bits of the pointer was zero or some such. So if your pointer value was 0x100000000, for example, I can't pass "rcu_data" to a function (or at least I don't know how to do so, short of passing __per_cpu_rcu_data and doing the per-CPU stuff by hand). I could make __call_rcu() be a macro, but that seemed more ugly than it seemed worthwhile. I really want some way of gracefully handling arbitrarily long output to debugfs. I am sure that some such exists, but haven't found it. What I do instead is to arbitrarily truncate output to 4096 bytes, which will be stunningly less than useful on a 4,096-CPU machine. :-/ Suggestions welcome! Thanx, ...
Fair enough; the CPU-manipulation primitives do indeed use "int". Odd
levelspread can, since it will never exceed 64, but levelcnt cannot.
Yes. {s,u}{8,16,32,64}, defined in include/asm-generic/int-{l,ll}64.h,
Hmmm, true. Unfortunate, particularly if only for the benefit of
Primarily concerned about the possibility of perpetual failure. Then
again, eventually a grace period will occur "naturally". Just wondering
Good point! That doesn't apply if you use ||, though. If you just did
"return somepointer" that could potentially cause the problem you
describe. In any case, it can't *hurt* to have it; GCC should do the
I can see two possibilities, depending on how much complexity you want.
The complicated way: do one pass calling snprintf everywhere and adding
up the total length used, and if you run out of memory during that pass,
reallocate the buffer to at least the total length you accumulated. Or
something like that.
The simple hack:
- Josh Triplett
--
It does allow use of -1 for "no particular CPU" or for error checking, Indeed. Putting rcuclassic_trace.c into rcuclassic.c gets pretty ugly. I suppose that another possibility would be to #include rcuclassic_trace.c Ah! So the lock can fail for the following reasons: 1. Some other CPU is in force_quiescent_state(). Here there is clearly no problem. 2. Some other CPU is initializing the rcu_node hierarchy to set up a new quiescent state. Here, we shouldn't have been executing force_quiescent_state() in the first place, so again no problem. 3. Some other CPU is adjusting the rcu_node hierarchy to account for a CPU online or offline operation. There is enough overhead in onlining and offlining CPUs that it seems unlikely that this could result in a denial of service. However, if someone can make this happen, I will make the online/offline operation check to see if it should do a force_quiescent_state() -- which will require an __force_quiescent_state() where the onofflock is acquired by the caller. So we are covered on #1 and #2, and very likely covered on #3, with an OK. I will review this towards the end, leaving it there to remind me in the meantime. So, would I need the !! on the left-hand operand of the first || due The only other thing I can think of is dynamically allocated per-CPU Given that this doesn't show up in production kernels, I will take door #2. Though I was hoping for some sort of interface that "just made it work" regardless of the size of user reads and the length and pattern of in-kernel prints, but that might be a bit much... Thanx, Paul --
No. || will always return 1 or 0. You only need the !! if you want to directly return the boolean value of a potentially 64-bit pointer. - Josh Triplett --
Even if one argument of || is long and the other int or some fool thing like that? (What, me paranoid???) Thanx, Paul --
What, you don't know exactly how C behaves in every strange corner case? ;) || always produces a result of type int, and it compares each of its two arguments to 0 independently; to the best of my knowledge the size of those arguments never matters. - Josh Triplett --
I used to, back when identifiers were only guaranteed to be I suppose I should read the spec. Thanx, Paul --
I actually like in code comments and 'documentation' more than Documentation/ stuff. Mostly because Documentation/ is: - far away from the code - therefore, more easily bitrotted - and easily forgotten --
I know!!! #ifdef JOSH_TRIPLETT * Documentation/RCU * http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?) * http://lwn.net/Articles/263130/ (What is RCU's Usage?) * http://lwn.net/Articles/264090/ (What is RCU's API? + references) #elif PETER_ZIJLSTRA * Documentation/RCU #endif (Sorry, couldn't resist!!!) Seriously, I know where all the documentation is, as I wrote most of it. These comments are for you guys. So, any thoughts on how I should resolve this? My default is, as always, a coin flip. ;-) Thanx, Paul --
I guess we could do the 'this is how the concept works and can be used like so and so' documentation in Documentation/ And the stuff that says 'this code does like so and so, because blah' should stay near the code. And in any case of doubt - stay near the code :-) I always view Documentation/ as end user stuff (be that a kernel programmer that needs to learn a new API, or userland folks or people wanting to know what a certain feature is about). --
Good point... #ifdef READER_LIKES_DOCUMENTATION_URLS_IN_COMMENTS * Documentation/RCU * http://lwn.net/Articles/262464/ (What is RCU, Fundamentally?) * http://lwn.net/Articles/263130/ (What is RCU's Usage?) * http://lwn.net/Articles/264090/ (What is RCU's API? + references) #else * Documentation/RCU #endif Of course, the C preprocessor would just remove the whole comment Documentation/RCU/whatisRCU.txt does in fact contain the three URLs listed above. And there is always Documentation/RCU/RTFP.txt for I confess to erring on the side of spamming all channels. Then again, I am a serial junk-mailer, so perhaps this is just me. Thanx, Paul --
I'm not sure if a bitmap is the right storage. If I understand the code
correctly, it contains two information:
1) If the bitmap is clear, then all cpus have completed whatever they
need to do.
A counter is more efficient than a bitmap. Especially: It would allow to
choose the optimal fan-out, independent from 32/64 bits.
2) The information if the current cpu must do something to complete the
current period.non
This is a local information, usually (always?) only the current cpu
needs to know if it must do something.
But this doesn't need to be stored in a shared structure, the
Do you have a description of the events between call_rcu() and the rcu
callback?
Is the following description correct?
__call_rcu() queues in RCU_NEXT_TAIL.
In the middle of the current grace period: rcu_check_quiescent_state()
calls rcu_next_callbacks_are_ready().
Entry now in RCU_NEXT_READY_TAIL
** 0.5 cycles: wait until all cpus have completed the current cycle.
rcu_process_gp_end() moves from NEXT_READY_TAIL to WAIT_TAIL
** full grace period
rcu_process_gp_end() moves from WAIT_TAIL to DONE_TAIL
Why this mb()? There was a grace period between the last read side
critical section that might have accessed the pointer.
The rcu internal code already does spin_lock()+spin_unlock(). Isn't that
The docbook entry is duplicated: They are in include/linux/rcupdate.h
and here.
What about removing one of them?
I would go one step further:
Even add call_rcu_sched() into rcupdate.h. Add a Kconfig bool
"RCU_NEEDS_SCHED" and automatically define either the extern or the #define.
--
Manfred
--
I am using the bitmap in force_quiescent_state() to work out who to check dynticks and who to send reschedule IPIs to. I could scan all of the per-CPU rcu_data structures, but am assuming that after a few jiffies there would typically be relatively few CPUs still needing to do a quiescent state. Given this assumption, on systems with large numbers of CPUs, scanning the bitmask greatly reduces the number of cache misses Yes, that is the sequence of events if grace periods are happening back to back and if the current CPU has not yet passed through a quiescent state. If RCU is idle, then there is no need to wait for a previous grace period to complete. If this CPU already passed through its quiescent state for the first grace period, and is not the CPU that starts the next grace period, then there will be an additional grace period to move from RCU_NEXT_TAIL to RCU_NEXT_READY_TAIL. If this CPU is the one starting The combination of spin_lock()+spin_unlock()+spin_lock()+spin_unlock() would suffice. But the pair of smp_mb()s suffice regardless of fast paths through the rest of the mechanism. Because rcu_process_callbacks() is on the slow path, the trival proof of ordering is more important than I have (and am) considering this. It would require an additional per-CPU data structure, would require an additional check in rcu_pending(), and require a bit more manipulation when callbacks become "done". However, but would simplify the requeuing code in rcu_do_batch() a bit. One interesting side effect would be that blimit would apply globally to both rcu and rcu_bh, and that a burst of (say) rcu callbacks could Another approach would be to have an rcupdate.h definition that was a wrapper around __call_rcu_sched(), and put the docbook stuff in rcupdate.h. It would appear that docbook was not created with the idea of alternative implementations in mind. ;-) Thanx, Paul --
It's an optimization question: What is rarer? force_quiescent_state() or
"normal" cpu_quiet calls.
You have optimized for force_quiescent_state(), I have optimized for
"normal" cpu_quiet calls. [ok, I admit: force_quiescent_state() is still
missing in my code].
Do you have any statistics?
--
Manfred
--
If the system is completely busy, then I would expect normal cpu_quiet() calls to be more common. But if the system were sized for peak workload, it would spend a fair amount of time with many of the CPUs idle. Power-conservation measures would hopefully push the idleness into single cores/dies/whatever which could then be powered down. A large fraction of the systems I see have utilizations well under 50%. And latency concerns would also focus on force_quiescent state. That said, I haven't had much to do with systems having more than 128 CPUs. Thanx, Paul --
Hello! Still experimental, not for inclusion. Updates from earlier version: o Handles dyntick idle state, including interrupts and NMIs, but while allowing dyntick-idle CPUs to remain idle (in contrast to the previous version, which simply whacked all non-responding CPUs with a resched IPI). o Made force_quiescent_state() more intelligent, so that it no longer whacks all CPUs whether they need it or not. o Cleaned up cpp code that determines size and shape of the rcu_node hierarchy. o Added debugfs tracing capability (was in previous patch, but forgot to mention it). Attached is an updated patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes mild rcutorture testing on x86 and ppc64, but is most definitely not ready for inclusion. It is OK for experimental work assuming sufficiently brave experimenters. See also Manfred Spraul's patch at http://lkml.org/lkml/2008/8/21/336 (or his earlier work from 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2). We will converge onto a common patch in the fullness of time, but are currently exploring different regions of the design space. This patch provides CONFIG_RCU_FANOUT, which controls the bushiness of the RCU hierarchy. Defaults to 32 on 32-bit machines and 64 on 64-bit machines. If CONFIG_NR_CPUS is less than CONFIG_RCU_FANOUT, there is no hierarchy. By default, the RCU initialization code will adjust CONFIG_RCU_FANOUT to balance the hierarchy, so strongly NUMA architectures may choose to set CONFIG_RCU_FANOUT_EXACT to disable this balancing, allowing the hierarchy to be exactly aligned to the underlying hardware. Up to two levels of hierarchy are permitted (in addition to the root node), allowing up to 16,384 CPUs on 32-bit systems and up to 262,144 CPUs on 64-bit systems. I just know that I am going to regret saying this, but this seems more than sufficient for the foreseeable future. (Some architectures might ...
Hello! Still experimental, not for inclusion. But getting better! Updates from v2: o Fixed a number of bugs uncovered by running rcutorture in parallel with onlining and offlining CPUs. Many of these were due to the fact that there can be a multiple-grace-period window during which RCU and the process scheduler disagree about whether a given CPU is offline. The solution was to make force_quiescent_state() check for RCU waiting on offlined CPUs, and then cleaning up all the locking gotchas that resulted from that change. o Upgraded tracing capability with additional statistics, for example, per-CPU counts of how often force_quiescent_state() responded on their behalf (because they were offline, in dyntick-idle state, or needed a resched IPI). Also abbreviated more severely to allow the system to run longer within the confines of an 80-character xterm. o Added sparse annotations so that it sparses cleanly. o Added an argument to force_quiescent_state() so that for normal callers, it checks for enough time having passed since the last try. Emergency callers (__call_rcu() with more than 10,000 RCU callbacks piled up on the local CPU) get their quiescent state forced unconditionally. o Added mapping from CPU to rcu_data structure to allow RCU to easily switch its attention from (say) the CPU being offlined to the currently running CPU, should the offlining kick off a new RCU grace period. o Made the trace buffer's size a function of the number of CPUs so that the rcudata debugfs file works correctly on 128-CPU machines. Attached is an updated patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes mild rcutorture testing on x86 and ppc64, including some 12-hour runs on 8-CPU machines and an hour thus far on a 128-CPU machine, but is most definitely not ready for inclusion. It is OK for experimental work assuming sufficiently brave experimenters. See also ...
Already done and available in the -tip tree, curtesy of Mathieu. --
Very cool!!! I see one of his patches at http://lkml.org/lkml/2008/4/17/342, but how do I find out which branch of -tip this is on? (I am learning git, but it is a slow process...) This would also simplify preemptable RCU's dyntick interface, removing the need for proofs. Thanx, Paul --
Not sure - my git-foo isn't good enough either :-(
All I can offer is that its available in tip/master (the collective
merge of all of tip's branches) as commit:
0d84b78a606f1562532cd576ee8733caf5a4aed3, which I found using
git-annotate include/linux/hardirq.h
How to find from which particular topic branch it came from, I too am
clueless.
---
commit 0d84b78a606f1562532cd576ee8733caf5a4aed3
Author: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Date: Mon May 12 21:21:07 2008 +0200
x86 NMI-safe INT3 and Page Fault
Implements an alternative iret with popf and return so trap and exception
handlers can return to the NMI handler without issuing iret. iret would cause
NMIs to be reenabled prematurely. x86_32 uses popf and far return. x86_64 has to
copy the return instruction pointer to the top of the previous stack, issue a
popf, loads the previous esp and issue a near return (ret).
It allows placing immediate values (and therefore optimized trace_marks) in NMI
code since returning from a breakpoint would be valid. Accessing vmalloc'd
memory, which allows executing module code or accessing vmapped or vmalloc'd
areas from NMI context, would also be valid. This is very useful to tracers like
LTTng.
This patch makes all faults, traps and exception safe to be called from NMI
context *except* single-stepping, which requires iret to restore the TF (trap
flag) and jump to the return address in a single instruction. Sorry, no kprobes
support in NMI handlers because of this limitation. We cannot single-step an
NMI handler, because iret must set the TF flag and return back to the
instruction to single-step in a single instruction. This cannot be emulated with
popf/lret, because lret would be single-stepped. It does not apply to immediate
values because they do not use single-stepping. This code detects if the TF
flag is set and uses the iret path for single-stepping, even if it ...If you're interested in knowing the topic it came from : it's required so a following patch can use a "popf; ret" instead of iret to return from trap handlers executed in NMI context. There is an architectural problem on x86 causing NMIs to be reactivated after the first iret encountered, which leads to NMI handler races if nmi handlers trap. This works around the problem by returning from the trap handlers without using the iret instruction. It's useful to Immediate Values which put a temporary breakpoint in the instruction stream when proceeding to code modification and also useful to LTTng (available in the -lttng tree) which writes tracing data to vmap'd memory buffers (which can cause a minor page fault). I'm glad to see NMI context detection is useful to others too ! -- Mathieu Desnoyers OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68 --
While an interesting detail, its not the answer to the question. Given a bunch of topic branches, and a branch that has all those topic merged, how, for any particular commit from the merge branch, do you find from which topic branch it originiated? IOW, the answer to the above question would have been a series of git commands that would have resulted in something like tip/tracing/nmisafe --
I guess that it turned out that there was a series of mutt commands that eventually got the answer. That said, a series of git commands would be quite nice. ;-) But it would appear that the series of git commands would need to come from someone with better git-foo than either Mathieu or myself. :-/ Thanx, Paul --
I just had a fast review. so my comments is nothing but cleanup.
Thanks, Lai.
I tried a mask like qsmaskinit before. The system came to deadlock
when I did on/offline cpus.
I didn't find out the whys for I bethought of these two problem:
problem 1:
----race condition 1:
<cpu_down>
synchronize_rcu <called from offline handler in other subsystem>
rcu_offline_cpu
-----race condition 2:
rcu_online_cpu
synchronize_rcu <called from online handler in other subsystem>
<cpu_up>
in these two condition, synchronize_rcu isblocked for ever for
synchronize_rcu have to wait a cpu in rnp->qsmask, but this
cpu don't run.
problem 2:
we need call rcu_offline_cpu() in these two cases in rcu_cpu_notify()
since qsmaskinit had changed by rcu_online_cpu()
case CPU_UP_CANCELED:
struct rcu_state has a field: struct rcu_data *rda[NR_CPUS]
so we can move these lines around __call_rcu into __call_rcu.
__call_rcu(struct rcu_head *head, struct rcu_state *rsp)
{
local_irq_save(flags);
struct rcu_data *rdp = rsp->rda[smp_processor_id()];
.....
local_irq_save(flags);
--
Can we disallow synchronize_rcu() from the cpu notifiers? Are there any
users that do a synchronize_rcu() from within the notifiers?
I don't see any other solution.
Something like qsmaskinit is needed - always enumerating all cpus just
doesn't scale.
Perhaps it's possible to rely on CPU_DYING, but I haven't figured out
yet how to handle read-side critical sections in CPU_DYING handlers.
Interrupts after CPU_DYING could be handled by rcu_irq_enter(),
rcu_irq_exit() [yes, they exist on x86: the arch code enables the local
interrupts in order to process the currently queued interrupts]
--
Manfred
--
I made force_quiescent_state() check for offline CPUs. (Well, actually it is rcu_implicit_offline_qs(), which is indirectly called from My feeling is that CPU online/offline will be quite rare, so it should be OK to clean up after the races in force_quiescent_state(), which in this version is called every three ticks in a given grace period. Yes, I did worry about the possibility of all CPUs being in dyntick-idle mode, and the solution for that is (1) don't let a CPU that has RCU callbacks pending go into dyntick-idle mode via rcu_needs_cpu() and (2) don't let a grace period start unless there is at least one callback that is not yet in the done state. But no, I am not certain that I have gotten this completely correct yet. Thanx, Paul --
If you add failing cpu offline calls, then the problem appears to be unsolvable: If I get it right, the offlining process looks like this: * one cpu in the system makes the CPU_DOWN_PREPARE notifier call. These calls can sleep (e.g. slab sleeps on semaphores). The cpu that goes offline is still alive, still doing arbitrary work. cpu_quiet calls on behalf of the cpu would be wrong. * stop_machine: all cpus schedule to a special kernel thread [1], only the dying cpu runs. * The cpu that goes offline calls the CPU_DYING notifiers. * __cpu_disable(): The cpu that goes offline check if it's possible to offline the cpu. At least on i386, this can fail. On success: * at least on i386: the cpu that goes offline handles outstanding interrupts. I'm not sure, perhaps even softirqs are handled. * the cpus stopps handling interrupts. * stop machine leaves, the remaining cpus continue their work. * The CPU_DEAD notifiers are called. They can sleep. On failure: * all cpus continue their work. call_rcu, synchronize_rcu(), ... * some time later: the CPU_DOWN_FAILED callbacks are called. Is that description correct? Then: - treating a cpu as always quiet after the rcu notifer was called with CPU_OFFLINE_PREPARE is wrong: the target cpu still runs normal code: user space, kernel space, interrupts, whatever. The target cpu still accepts interrupst, thus treating it as "normal" should work. __cpu_disable() success: - after CPU_DYING, a cpu is either in an interrupt or outside read-side critical sections. Parallel synchronize_rcu() calls are impossible until the cpu is dead. call_rcu() is probably possible. - The CPU_DEAD notifiers are called. a synchronize_rcu() call before the rcu notifier is called is possible. __cpu_disable() failure: - CPU_DYING is called, but the cpu remains fully alive. The system comes fully alive again. - some time later, CPU_DEAD is called. With the current CPU_DYING callback, it's impossible to be both deadlock-free and race-free with the given ...
As I understand it, this is the point where the dying CPU disables interrupts and removes itself from the online masks. Though I would feel better if there was an smp_mb() after the last local_irq_disable() Indeed! My current code doesn't declare them offline until the CPU_DEAD notifiers are called. And force_quiescent_state() does not consider them to be offline until after they have cleared their bit in cpu_online_map, which does not happen until the outgoing CPU has disabled interrupts, at least in x86. So my current code should be OK on x86. It -looks- like stop_cpu() expects to be called with irqs disabled, but I don't see what would be disabling irqs. (Don't kthreads normally start with irqs enabled?) Ah, I see it -- the stop_cpu() threads sequence through a state machine, and one of the states disables irqs for everyone. So the only problem would occur in architectures that re-enable irqs in the middle of __cpu_disable(), as x86 does (but x86 correctly orders the clearing of the cpu_online_mask bit, so is OK). This of course has the added benefit that irq handlers aren't running on a CPU that is marked offline. Checking other architectures: o ARM arch/arm/kernel/smp.c __cpu_disable() does not re-enable irqs, so is OK. ! arch/ia64/kernel/smpboot.c __cpu_disable() clears itself from the cpu_online mask before flushing pending irqs, which might include RCU read-side critical sections. I believe that the "cpu_clear(cpu, cpu_online_map)" must move to after the "fixup_irqs()". o arch/powerpc/kernel/smp.c __cpu_disable() does not disable irqs directly, but calls subarch-specific functions noted below. o arch/powerpc/platforms/powermac/smp.c smp_core99_cpu_disable() does not appear to re-enable irqs, so should be OK. ! arch/powerpc/kernel/smp.c generic_cpu_disable() clears itself from the cpu_online_mask before invoking fixup_irqs(), which momentarily enables irqs. I believe that the "cpu_clear(cpu, cpu_online_map)" must move to after ...
Yes, that would work:
Rule 1: after CPU_DEAD, a cpu is gone. The cpu is quiet, rcu callbacks
must be moved to other cpus, ...
Rule 2: if a cpu is not listed in cpu_online_mask, then it can be
considered as outside a read-side critical section.
The problem with rule 2 is that it means someone
[force_quiescent_state()] must poll the cpu_online_mask and look for
changes.
I'd really prefer a notifier. CPU_DYING is nearly the correct thing, it
only has to be moved down 3 lines ;-)
I made a mistake, get_online_cpus() stores current, not a cpu number.
Thus the described race it not possible. Perhaps there are other users
that could deadlock.
I don't know enough about the preempt algorithm, thus I can't confirm if
your proposal would work or not.
--
Manfred
--
But some later CPU_DYING notifier might decide that the CPU cannot be removed after all, which would mean bringing the CPU back. And then whatever the CPU was needed for might have actually happened in the Well, that is on my list of things to look into... Thanx, Paul --
CPU_DYING must not fail, the current code doesn't support that.
--
Manfred
--
First, only one of these race conditions can happen at a time, since only one online/offline action can be happening at a time. What I did to solve it was to make force_quiescent_state() check to see if the CPU currently blocking the grace period is offline. (The actual checking for offline is in rcu_implicit_offline_qs(), which is called indirectly from force_quiescent_state().) So when this race occurs, it is taken care of within three jiffies. This happened -many- times during my most recent test ("of=" in the Good catch!!! Fixed. The current code would work in this case, but grace periods would be unnecessarily extended until force_quiescent_state() Very good point!!! And then call_rcu() and call_rcu_bh() become --
It would be better if you hung rcu_irq_enter in the irq_enter() if statement that checks if the task was idle or not. This way it would be zero overhead for interruptions of non busy CPUs, keeping it out of many fast paths. Haven't read everything, sorry. -Andi --
So that I lose the #else above, and so that irq_enter() and irq_exit()
look something like the following (with additional adjustments to suit)?
Makes a lot of sense to me...
And it has the very nice side effect of allowing me to have a separate
rcu_irq_enter() and rcu_nmi_enter(), trivializing the RCU-dynticks
interface!!! Very cool, thank you very much!!!
Thanx, Paul
void irq_enter(void)
{
#ifdef CONFIG_NO_HZ
int cpu = smp_processor_id();
if (idle_cpu(cpu) && !in_interrupt())
tick_nohz_stop_idle(cpu);
#endif
__irq_enter();
#ifdef CONFIG_NO_HZ
if (idle_cpu(cpu)) {
rcu_irq_enter();
tick_nohz_update_jiffies();
}
#endif
}
void irq_exit(void)
{
account_system_vtime(current);
trace_hardirq_exit();
sub_preempt_count(IRQ_EXIT_OFFSET);
if (!in_interrupt() && local_softirq_pending())
invoke_softirq();
#ifdef CONFIG_NO_HZ
/* Make sure that timer wheel updates are propagated */
if (idle_cpu(smp_processor_id())) {
if (!in_interrupt() && !need_resched())
tick_nohz_stop_sched_tick(0);
rcu_irq_exit();
}
#endif
preempt_enable_no_resched();
}
--
Hello! Still experimental, not for inclusion. But ready for serious experimental use, in particular, experience on an actual >1000-CPU machine would be most welcome. Updates from v3: o The hierarchical-RCU implementation has been moved to its own "rcutree" set of files. This allows configuring three different implementations of RCU (CLASSIC_RCU, PREEMPT_RCU, and the new TREE_RCU). More importantly, it enables easy application of this patch to a wide variety of Linux versions. I hope that this implementation can completely replace Classic RCU, but in the meantime, this split makes for easier testing and review. o The stalled-CPU detection is now implemented and working, enabled by the CONFIG_RCU_CPU_STALL config parameter. Complaints are kprint()ed 3 seconds into the stall, and every 30 seconds thereafter. It also now attempts to force quiescent states. o The algorithm uses pre-fabricated masks rather than shifting on each access. o Review comments have been applied (thank you all!!!). For but one example, call_rcu() and call_rcu_bh() are now one-liners. o The rcu_pending() and rcu_needs_cpu() primitives are now much more aggressive about permitting CPUs to enter dynticks idle mode. Only CPUs that have RCU callbacks are kept out of dynticks idle mode. Attached is an updated patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes 10-hour concurrent rcutorture and online-offline testing on 128-CPU ppc64. It is OK for experimental work assuming only modestly brave experimenters (and perhaps even cowardly experiementers), but not yet ready for inclusion. See also Manfred Spraul's recent patches (or his earlier work from 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2). We will converge onto a common patch in the fullness of time, but are currently exploring different regions of the design space. That said, I have already gratefully stolen a ...
On Fri, 5 Sep 2008 08:29:30 -0700
The CONFIG_RCU_CPU_STALL identifier seems poorly-chosen to me - it
sounds like it will stall my CPU. Should it be
CONFIG_RCU_CPU_STALL_DETECTOR? If it's a debugging option then it
You forgot
o Adds yet another RCU flavour
Having alternative implementations of the same thing is a real cost in
Using NR_CPUS for anything at all is grossly, grossly inaccurate.
Vendors can and will ship kernels with NR_CPUS=1024 and their customers
can and will run those kernels on 4-cpu machines. Lots of customers.
That's a two-and-a-half-order-of-magnitude inaccuracy. It makes all
your above work meaningless.
So this is a 4096-byte structure on some setups.
did they have to be implemented in macros? (it's generally best to use
static inline void rcu_init_sched(void)
{
}
These are massive. But it seems they'll only ever be used once, in
I don't personally think such fugliness gains us enough.
static void dyntick_record_completed(struct rcu_state *rsp, int comp)
{
}
I dislike the L's here. They don't do anything and they have an
encapsulation-violation feeling about them. Do we really want code
sprinkled all over the palce whcih "knows" spefically which type was
used to implement these fields? Or do we want to stick a plain old "2"
]]
hm, does that work?
akpm:/usr/src/25> grep -r __releases Documentation
akpm:/usr/src/25>
Looks great! I don't understand a line of it!
It's a pretty big pill to swallow. Nice performance testing results
will help it to slide down.
--
CONFIG_RCU_CPU_STALL_DETECTOR it is! It is sufficiently lightweight to be included in production systems, and similar mechanisms have proven Agreed, but note well the quote from above: The hierarchical-RCU implementation has been moved to its own "rcutree" set of files. This allows configuring three different implementations of RCU (CLASSIC_RCU, PREEMPT_RCU, and the new TREE_RCU). More importantly, it enables easy application of this patch to a wide variety of Linux versions. I hope that this implementation can completely replace Classic RCU, but in the meantime, this split makes for easier testing and review. The idea is to avoid adding another RCU flavour (or flavor, for that I agree in principle, but this case is an exception. Suppose that we have NR_CPUS=1024 on a 4-CPU 64-bit machine. Since 64^2 is greater than 1024, we end up with a two-level hierarchy, with one rcu_node structure at the root and 16 rcu_node leaf structures, each of which takes up a single 128-byte cache line. There will be two such structures in the system, one for rcu and one for rcu_bh. So I do not believe that this will be a problem. One laptops, this is even less of an issue -- NR_CPUS=8 on my laptop, which would reduce to a pair rcu_node structures, one for rcu, the other for rcu_bh. Making the decision at runtime would bloat the code by much more than the extra data consumed. And introduce yet more races between online/offline and everything else. Besides, this way I am being bloat-compatible You lost me on this one. On a 64-bit system, I have 8 bytes each for lock, qsmask, qsmaskinit, grpmask, and parent, four bytes each for grplo and grphi, and another byte each for grpnum and level, for a total of 50 bytes for each struct rcu_node, which comes to a single cache line for most large system. Given the default CONFIG_RCU_FANOUT=64 and NR_CPUS=4096, we have a two-level hierarchy with one root rcu_node structure and 64 leaf rcu_node structures. This ...
Is it likely that anyone will ship kernels with NR_CPUS=8? What are ____cacheline_internodealigned_in_smp will expand this structure to OK, thanks. As it's effectively a bugfix, a full description of the bug would grease some wheels. --
Ubuntu Feisty ships with NR_CPUS=8, but yes, I imagine that this number
will increase. Perhaps a table for 64-bit CPUs:
Cachelines per Total
NR_CPUs Implementation Cachelines
1-64 1 2 [laptop distros]
65-128 3 6 [x86 distros, Power]
129-192 4 8
193-256 5 10
257-320 6 12
321-384 7 14
385-448 8 16
449-512 9 18 [SGI ca. 2004]
513-576 10 20
577-640 11 22
641-704 12 24
705-768 13 26
769-832 14 28
833-896 15 30
897-960 16 32
961-1024 17 34 [SGI ca. 2006?]
. . .
3967-4032 64 128
4033-4096 65 130 [SGI ca. 2008/9?]
4097-4160 67 134
4161-4224 68 136
And so on, limiting out at 262,144 CPUs, which should be at least
3-5 years in the future (famous last words). The "Cachelines per
Implementation" column is for each of rcu and rcu_bh, while the "Total
Cachelines" column is for the combination of both. As you can see, the
number of cachelines consumed by the rcu_nodes hierarchy is quite modest
as a function of the number of CPUs, even for pretty big numbers of CPUs.
Ah! I guess that I have not been paying sufficient attention.
OK, http://www.scalemp.com/ says that they go to 128 cores, so perhaps
256 hardware threads, and 1TB of memory. This is about 4GB per hardware
thread. My approach costs them 4096/64=64 bytes per hardware thread,
or about 2 millionths of a percent of memory.
I think that this is OK. If not, one option is to pick a smaller
Here is the gist of what I was told:
Above several hundred CPUs, the current RCU implementation
suffers from severe lock contention, rendering the machine
useless. Crude workarounds exist, but are expected to cause
their own problems at higher CPU counts.
I never have used a machine with that many CPUs, so am just passing on
what I heard. But given that Classic RCU was ...Hi Paul,
I've noticed that right now rcu_enter_nohz() can be nested within
rcu_irq_enter():
irq_exit() first calls tick_nohz_stop_sched_tick(), then rcu_irq_exit().
And tick_nohz_stop_sched_tick() can switch into nohz mode.
Is that intentional? Does rcupreempt support that? It broke my rcustate
code on x86-64.
I would prefer if something like the attached patch is applied. What do
you think?
Do you need the patch as well?
--
Manfred
Good question -- when I tried splitting irqs from NMIs, things broke badly, and this might well explain it. Thank you very much for the hint!!! --
I fear that this won't work. E.g. look at x86, smp_callin() [arch/x86/kernel/smpboot.c]: The boot code must enable local interrupts around calibrate_delay(), otherwise the NMI watchdog would complain. That means the first interrupts, the first read side critical sections can run way before the cpu bit is set within cpu_online_map. cpus are just started, we are not within stop_machine. Thus we cannot make any assumption about what the remaining cpus are doing. What about introducing a CPU_STARTING notifier call, similar to CPU_DYING: - called with disabled interrupts - called before interrupts are enabled - must not sleep - called on the new cpu. This might also be useful for something like kvm. I'm not sure if it's guaranteed that hardware_enable() runs early enough. Attached is a patch proposal. Note that it doesn't work correctly: On x86-64, I have seen that CPU_STARTING is called after CPU_ONLINE. Thus frozen_cpus could already be cleared, then _FROZEN would be wrong. -- Manfred
I would find that useful too. I had several cases where i had to add smp_call_function_single() with a second callback to the CPU_UP notifier, which always seemed ugly. -Andi -- ak@linux.intel.com --
I suppose another approach would be to make calibrate_delay() kick the watchdog timer every so often, but the effort explaining to people why their machine's bogomips had decreased would probably far exceed any possible benefit... So given a CPU_STARTING notifier, RCU could set a bit in a coming-online bitmap, which would be reset rcu_online_cpu(). Then RCU would consider a CPU offline only if it is marked offline in cpu_online_map -and- it is Then notify_cpu_starting() is invoked right before smp_callin() in start_secondary()? --
Hello! This patch fixes a long-standing performance bug in classic RCU that results in massive lock contention on the internal RCU lock on systems with more than a few hundred CPUs. Although this patch creates a separate flavor of RCU for easy of review and patch maintenance, it is intended to replace classic RCU. Still experimental, not for inclusion, but given that I am now finding more bugs in the rest of Linux than in this code, I suspect that it is getting close. Definitely ready for serious experimental use, especially Updates from v4: o Separated dynticks interface so that NMIs and irqs call separate functions, greatly simplifying it. In particular, this code no longer requires a proof of correctness. ;-) o Separated dynticks state out into its own per-CPU structure, avoiding the duplicated accounting. o The case where a dynticks-idle CPU runs an irq handler that invokes call_rcu() is now correctly handled, forcing that CPU out of dynticks-idle mode. o Review comments have been applied (thank you all!!!). For but one example, fixed the dynticks-ordering issue that Manfred pointed out, saving me much debugging. ;-) o Adjusted rcuclassic and rcupreempt to handle dynticks changes. Attached is an updated patch to Classic RCU that applies a hierarchy, greatly reducing the contention on the top-level lock for large machines. This passes 10-hour concurrent rcutorture and online-offline testing on 128-CPU ppc64 without dynticks enabled, and exposes some timekeeping bugs in presence of dynticks (exciting working on a system where "sleep 1" hangs until interrupted...). It is OK for experimental work, but not yet ready for inclusion. See also Manfred Spraul's recent patches (or his earlier work from 2004 at http://marc.info/?l=linux-kernel&m=108546384711797&w=2). We will converge onto a common patch in the fullness of time, but are currently exploring different regions of the design space. That said, I have already gratefully stolen quite a few of ...
Hi Paul,
I'm still comparing my implementation with your code:
- f is called once for each cpu in the system, correct?
- if at least one cpu is in nohz mode, this loop will be needed for
every grace period.
That means an O(NR_CPUS) loop with disabled local interrupts :-(
Is that correct?
Unfortunately, my solution is even worse:
My rcu_irq_exit() acquires a global spinlock when called on a nohz cpus.
A few cpus in cpu_idle, nohz, executing 50k network interrupts/sec would
cacheline-trash that spinlock.
I'm considering counting interrupts: if a nohz cpu executes more than a
few interrupts/tick, then add a timer that check rcu_pending().
Perhaps even wouldn't be enough: I remember that the initial unhandled
irq detection code broke miserably on large SGI systems:
An atomic_inc(&global_var) in the local timer interrupt (i.e.:
NR_CPUS*HZ calls/sec) caused so severe trashing that the system wouldn't
boot. IIRC that was with 512 cpus.
--
Manfred
--
Hello, Manfred! Not necessarily. If all CPUs corresponding to this rcu_state struct have checked in already, we don't even get to this loop -- see the The outer loop, yes. The inner loop only for those rcu_state structs With the definition of "O()" being the worst-case execution time, yes. But this worst case could only happen when the system was mostly idle, in which case the added overhead should not be too horribly bad. If the system was busy enough that each CPU ran at least one process during each grace period, then this function would not be invoked in the first place. If this does prove to be a problem in practice, I will rework force_quiescent_state() to run incrementally. But I would rather avoid both the added complexity and the resulting longer grace periods, so someone needs to bring me a real-world problem before I take that I tried putting a cpu_quiet() in my rcu_irq_exit() as well, and quickly /me runs off and checks to make sure that all of my dyntick entry/exit code restricts itself to per-CPU variables... Yep! (Whew!!!) Thanx, Paul --
No: "was mostly running cpu_idle()". A cpu_idle() cpu could execute lots
of irqs and softirqs.
So the worst case would be a system with 1 cpu/node for reserved for irq
handling.
The "idle" cpu would be always in no_hz mode, even though it might be
100% busy handling irqs.
The remaning cpus might be 100% busy handling user space.
And every quiescent state will end up in that O(NR_CPUS) loop.
--
Manfred
--
Good point! Indeed, if you had a 1024-CPU box acting as (say) a router/hub using the Linux-kernel protocol stacks with no user-mode processing, then you could indeed have the system mostly busy with no user-space code running, and thus no quiescent states. However, last I checked, almost all 1024-CPU boxes run HPC workloads mostly in user mode, so this scenario would not occur. However, again, if it does come up, I would add an additional level of state machine to the force_quiescent_state() family of functions, so that the scan would be done incrementally. Perhaps arranging for CPU groups to be scanned by CPUs within that group. But again, I don't want to take that step until I see someone actually needing it. Maybe the Vyatta guys will be there sooner than I think, but... Thanx, Paul --
Hi Paul,
Some further thoughts about design differences between your and my
implementation:
- rcutree's qsmaskinit is the worst-case list of cpus that could be in
rcu read side critical sections.
- rcustate's cpu_total is the accurate list of cpus that could be in rcu
read side critical sections.
Both variables are read rarely: for rcu_state, twice per grace period.
rcutree fixes up cpus that are "incorrectly" listed in qsmaskinit with
force_quiescent_state(). It forces rcutree to use a cpu bitmask for
qsmask and it forces rcutree to store the "done" information in a global
structure. Additionately, in the worst case force_quiescent_state() must
loop over all cpus.
rcustate can use per-cpu structures and a global atomic_t. There is no
loop over all cpus. That's a big advantage, thus I think it's worth the
effort to maintain an accurate list.
Unfortunately, I don't have an efficient implementation for the accurate
list.
Some random ideas:
- cpu_total is only read rarely. Thus it would be ok if the read
operation is expensive [e.g. collect data from multiple cachelines,
acquire spinlocks...]
- updates to cpu_total happen with every interrupt on an idle system
with no_hz.
Thus it must be very scalable, preferably per-cpu data.
And: Updates are far more frequent than grace periods.
- updates to cpu_total happen nearly never without no_hz.
Especially: far less frequent than grace periods.
What about adding an "invalid" flag to cpu_total? The "real" data is
stored in per-cpu structures.
- when a cpu enters/leaves nohz, then it invalidates the global
cpu_total and updates a per-cpu structure
- when the state machine needs the number of rcu-tracked cpus, then it
checks if the global cpu_total is valid.
If it's valid, then cpu_total is used directly. Otherwise the per-cpu
structures are enumerated and the new value is stored as cpu_total.
What do you think?
--
Manfred
--
That has been one of my biggest questions about your approach, that and the need to hit holdouts with resched IPI. (Though perhaps you have Agreed. I could use this approach as well, having each CPU set and clear its qsmaskinit bit on every exit from and entry to dynticks idle state, but see below... In your case, you would need to carefully keep state so that a CPU entering dynticks idle mode would know whether or not it needed to respond to the current grace period -- but you need this in any case, Yep! Hence my reluctance to add overhead to the dynticks side of the I use an analogous algorithm, as the qsinitmask values might change while setting up for the next quiescent state. The tough part of this was correctly handling races between setting up for the quiescent state and onlining/offlining CPUs (and, in your case, CPUs entering/leaving dynticks idle mode). I chose to use a global lock that excludes online/offline and starting a quiescent state (except in the case where there are so few CPUs that there is but one rcu_node structure, in which case that structure's lock suffices, so that onofflock need not be acquired). But although acquiring a global lock is reasonable for CPU online/offline (as there can be but one such operation at a time), it would be quite painful for the dynticks case. Of course, I might be able to avoid the need for this global lock if I were willing to acquire the interior-node rcu_node locks when setting up for the next grace period. But this would require me to put in a cleanup step after grace-period setup, as some set of dyntick operations might otherwise end the grace period before it is fully initialized. This could potentially result in two different CPUs setting up grace periods concurrently (yuck!). Worse yet, the grace period could end while the initialization was only halfway through the leaves, so that the CPU doing the initialization would need to recognize this and stop further initialization -- and, again, clean up ...
Hello! This patch fixes a long-standing performance bug in classic RCU that results in massive lock contention on the internal RCU lock on systems with more than a few hundred CPUs. Although this patch creates a separate flavor of RCU for easy of review and patch maintenance, it is intended to replace classic RCU. Still experimental, not for inclusion, but given that I am now finding more bugs in the rest of Linux than in this code, I suspect that it is getting close. Definitely ready for serious experimental use, especially in !CONFIG_NO_HZ configurations. In particular, experience on an actual 1000+ CPU machine would be most welcome, and now appears to be forthcoming! Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line): o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a changeset some time ago, and finally got around to retesting this option). o Fix some tracing bugs in rcupreempt that caused incorrect totals to be printed. o I now test with a more brutal random-selection online/offline script (attached). Probably more brutal than it needs to be on the people reading it as well, but so it goes. o A number of optimizations and usability improvements: o Make rcu_pending() ignore the grace-period timeout when there is no grace period in progress. o Make force_quiescent_state() avoid going for a global lock in the case where there is no grace period in progress. o Rearrange struct fields to improve struct layout. o Make call_rcu() initiate a grace period if RCU was idle, rather than waiting for the next scheduling clock interrupt. o Invoke rcu_irq_enter() and rcu_irq_exit() only when idle, as suggested by Andi Kleen. I still don't completely trust this change, and might back it out. o Make CONFIG_RCU_TRACE be the single config variable manipulated for all forms of RCU, instead of the prior confusion. o Document tracing files and formats for both rcupreempt and rcutree. Updates ...
could you please send this bit separately, even if rcutree isnt finished yet? Seems like a quite useful debug feature. Ingo --
Good point -- I guess I do need to implement the CLASSIC_RCU piece of this in any case. Thanx, Paul --
i'm wondering about those timekeeping bugs. Do you have an idea what's it about and does it affect mainline? Ingo --
Sad to say, they do affect mainline -- I can reproduce these problems in 2.6.27-rc7. Thomas has given me a couple of fixes that got rid of earlier problems in which jiffies would stop counting, and I believe has located the cause of at least one other problem. Thanx, Paul --
Hello! This patch fixes a long-standing performance bug in classic RCU that results in massive lock contention on the internal RCU lock on systems with more than a few hundred CPUs. Although this patch creates a separate flavor of RCU for easy of review and patch maintenance, it is intended to replace classic RCU. Still experimental, not for inclusion, but getting quite close. I expect to have it in shape for 2.6.29. Definitely ready for -serious- testing and abuse. In particular, experience on an actual 1000+ CPU machine would be most welcome, and still appears to be forthcoming... Updates from v6 (http://lkml.org/lkml/2008/9/23/448): o Fix a number of checkpatch.pl complaints. o Apply review comments from Ingo Molnar and Lai Jiangshan on the stall-detection code. o Fix several bugs in !CONFIG_SMP builds. o Fix a misspelled config-parameter name so that RCU now announces at boot time if stall detection is configured. o Run tests on numerous combinations of configurations parameters, which after the fixes above, now build and run correctly. Updates from v5 (http://lkml.org/lkml/2008/9/15/92, bad subject line): o Fix a compiler error in the !CONFIG_FANOUT_EXACT case (blew a changeset some time ago, and finally got around to retesting this option). o Fix some tracing bugs in rcupreempt that caused incorrect totals to be printed. o I now test with a more brutal random-selection online/offline script (attached). Probably more brutal than it needs to be on the people reading it as well, but so it goes. o A number of optimizations and usability improvements: o Make rcu_pending() ignore the grace-period timeout when there is no grace period in progress. o Make force_quiescent_state() avoid going for a global lock in the case where there is no grace period in progress. o Rearrange struct fields to improve struct layout. o Make call_rcu() initiate a grace period if RCU was idle, rather than waiting for the next ...
What about using CPU_DYING and CPU_STARTING?
Hmm - your code must loop multiple times over the cpus.
I've use a different approach: More forcing is only required for a nohz
cpu when it was hit within a long-running interrupt.
Thus I've added a '->kick_poller' flag, rcu_irq_exit() reports back when
the long-running interrupt completes. Never more than one loop over the
outstanding cpus is required.
--
Manfred
--
Because I don't want to tie RCU too tightly to the details of the online/offline implementation. It is too easy for someone to make a "simple" change and break things, especially given that the online/offline code still seems to be adjusting a bit. So I might well use CPU_DYING and CPU_STARTING, but I would still keep Do you send a reschedule IPI to CPUs that are not in dyntick idle mode, but who have failed to pass through a quiescent state? In my case, more forcing is required only for a nohz CPU in a long-running interrupt (as with your approach), for sending the aforementioned IPI, and for checking for offlined CPUs as noted above. Thanx, Paul --
Actually, I never send reschedule IPIs.
- For "usual" cpus, rcu_check_callbacks() checks for quiescent states.
There should be a set_need_resched() if a cpu holds up a grace period
for too long. I just haven't implemented it yet.
IMHO it doesn't make sense to perform a "for_each_cpu()
smd_send_reschedule()". rcu has a hook in each cpu, thus a
set_need_resched() by the per-cpu hook is faster/simpler.
- For nohz cpus, a poller function [schedule_work(), enabled interrupts]
peeks into the per-cpu data of the nohz cpu and checks if it is quiet or
if it passed through a quiescent state.
If it didn't, then it sets a cpu_data->kick_poller flag and
rcu_irq_exit() reports the grace period.
No need for an IPI either - rcu has a hook in the irq exit path.
Right now, I cheat if a nohz cpu is in a long-running nmi
[while(other_cpu_is_in_nmi()) cpu_relax()], but I think I can fix that
with an set_need_resched() in the rcu_nmi_exit().
--
Manfred
--
One indeed only sends resched IPIs to CPUs that are not responding on their own. And in the case of rcutree.c, the loop is not over the CPUs themselves, but rather over the leaf rcu_node structures (one per 64 CPUs) -- the per-CPU rcu_data structure is touched only for CPUs that I considered adding a cpu_quiet() on the irq exit path, but eventually decided that I should instead place the added overhead in the infrequently invoked force_quiescent_state() function. Could be argued either way, Hmmm... I don't see where the NMI exit path checks the TIF_NEED_RESCHED flag, but I could easily be missing something. I instead maintain separate counters for the NMI and irq entry/exit code, which are checked in the force_quiescent_state() path. Thanx, Paul --
rcu_irq_exit() is only called on idle cpus.
Good point.
I haven't looked at the issue yet.
Perhaps a smd_send_reschedule(smp_processor_id()) is necessary.
Btw, I found a bug in my state machine: Right now, the state machine
will lock up if all cpus are in nohz mode.
I'm not sure if it applies to your code as well.
--
Manfred
--
Only once per such CPU every grace period -- seems in the noise to me. But I should revisit, as I have changed things quite a bit since I Is legal to call that from an NMI handler? Looks to me that some x86 architectures do sequences of device-register reads and writes to send I avoid this problem by forbidding a CPU with an active RCU callback from entering nohz mode. Therefore, the only way that all CPUs can be in nohz mode is if there are no RCU callbacks in the system. In this case, RCU grace periods will never complete, but that is OK because there is no need for RCU grace periods. One leaves this all-nohz state when some irq handler either invokes call_rcu() or awakens some task. In the former case, rcu_irq_exit() will see the callback and invoke set_need_resched(), while in the latter case the normal dynticks mechanism will wake up some CPU. This is not a problem from NMI handlers, as NMI handlers are not permitted to invoke call_rcu(). Or much of anything else, for that matter. ;-) Thanx, Paul --
Another small point:
Does your implementation support rcu_check_callbacks() with cpu !=
smp_processor_id()?
I don't think my locking would support it properly.
Thus:
- cpu != smp_processor_id() doesn't work.
- stack space for a useless parameter.
- the explicit cpu parameter prevents the rcu code from using get_cpu_var().
What about modifying the rcu_check_callbacks() prototype? I'd propose to
remove the cpu parameter.
--
Manfred
--
That would work fine for rcutree.c. If I were to invoke rcu_check_callbacks() remotely, I would use something like smp_call_function() to make it happen. Hmmm... Looks like rcu_pending is also always called with its cpu parameter set to the current CPU, and same for rcu_needs_cpu(). And given that all the external uses of rcu_check_callbacks() are of the following form: if (rcu_pending(cpu)) rcu_check_callbacks(cpu, whatever); perhaps rcu_pending() should be an internal-to-RCU API invoked from rcu_check_callbacks(). Thoughts? Thanx, Paul --
From my point of view: Yes, change it.
In the long run, I'd like to move the stall detector code to rcupdate.c,
with an 'rcu_cpu_missing' callback. That one would need a cpu flag, but
that's a new function.
--
Manfred
--
Agreed. Perhaps a good change to make while introducing stall detection to preemptable RCU -- there would then be three examples, which should allow good generalization. Thanx, Paul --
Two implementations. IMHO the current rcu-classic code should be dropped
immediately when you add rcu-tree:
rcu-classic is buggy, as far as I can see long-running interrupts on
nohz cpus are not handled correctly. I don't think it makes sense to
keep it in the kernel in parallel to rcu-tree.
I would propose that rcu-tree replaces rcu-classic.
I'll continue to update rcu-state, I think that it will achieve lower
latency than rcu-tree [average/max time between call_rcu() and
destruction callback] and it doesn't have the irq disabled loop to find
the missing cpus.
If I find decent benchmarks where I can quantify the advantages, then
I'll propose to merge rcu-state as a third implementation in addition to
rcu-tree and rcu-preempt.
Paul: What do you think?
--
Manfred
--
In keeping with my reputation as a "conservative programmer", I would suggest that rcuclassic.c remain for a year or so. Distros branching off during this time should continue making rcuclassic.c be the default. Other uses should have rcutree.c as the default. At the end of the year, we remove rcuclassic.c. All that said, one attractive aspect of your suggestion is immediately removing rcuclassic.c would eliminate the need to do further work on it. ;-) Your benchmarking proposal for rcu-state makes sense to me. One other possible place for techniques from rcu-state may be in making preemptable RCU scale. This may take some time, as other parts of the RT kernel have their limitations, but sooner or later people are going to expect real-time response from even the largest machines. In addition, preemptable RCU has a number of shorter-term issues: 1. RCU-boosting mechanism. (I need to combine the best of Steve's and my mechanisms. The treercu.c effort has been sort of a warm-up exercise for RCU-boosting.) 2. Reducing the latency contribution of the preemptable RCU state machine (but note that moving this state machine out of the scheduling-clock irq handler means more stuff to boost). 3. Porting the simpler dynticks interface from rcutree to preemptable RCU. 4. Making the preemptable RCU tracing code use seqfile. Hmmm... Maybe it is (past) time for me to publish an RCU to-do list? Thanx, Paul --
How do you intend to handle nohz cpus? I would create a separate patch that removes rcuclassic.c. distros that want to keep rcuclassic could just revert that change. -- --
In which variant of RCU? My current thought is to apply the rcutree.c version to rcupreempt.c. If rcuclassic.c can be dropped, my thought would be to leave it alone -- it is unnecessarily awakening CPUs, but That does make a lot of sense. At least it would make my life simple. ;-) --
For rcuclassic.
As far as I can see, rcuclassic treats nohz cpus as always outside
rcu_read_lock():
rcu_start_batch() contains
>
> cpus_andnot(rcp->cpumask, cpu_online_map, nohz_cpu_mask);
>
As soon as all cpus from rcp->cpumask reported a grace period, the
callbacks are called.
That a bug, therefore I would drop rcuclassic as soon as rcutree is merged.
--
Manfred
--
If we were to keep rcuclassic for any length of time, I would modify rcu_pending() and rcu_check_callbacks() to invoke force_quiescent_state() Good point, I had forgotten that issue. Making this modification would cause the resulting rcuclassic to be just as suspect as is rcutree, I suppose. A strong argument for moving to rcutree.c quickly rather than slowly, I must admit! Thanx, Paul --
Hi Paul,
^^^^^^^^^^^^^^^^^^
=>
This can also happen when a CPU has just gone offline,
but RCU hasn't yet marked it as offline. However, it's impact
on delaying the grace period may not be high as in the
=>
This check is safe here since this callpath is invoked
from a softirq, and thus the system cannot do a stop_machine()
as yet. This implies that the cpu in question cannot go offline
=> Just wondering, can't NMI handlers queue callbacks? If yes,
=> ^^^^^^^^^^^^^^^^^^^
=> ----------------> [1]
>^^^^^^^^^^^^^^^^^^
This can also happen when a CPU has just gone offline,
but RCU hasn't yet marked it as offline. However, it's impact
on delaying the grace period may not be high as in the
This check is safe here since this callpath is invoked
from a softirq, and thus the system cannot do a stop_machine()
as yet. This implies that the cpu in question cannot go offline
^^^^^^^^^^^^^^^^^^^
----------------> [1]
If this condition is true, then,
rsp->gpnum != rsp->completed. Hence, we will always enter
the if() condition in print_other_cpu_stall() at
[1] (See above), and return without ratting our buddy.
That defeats the purpose of the stall check or I am
rdtp->dynticks is odd. Hence rdtp->dynticks + 1 should be even.
--
Thanks and Regards
gautham
--
The "/" command in "vi" works pretty well for me. ;-) These are the same as the ones in your earlier note, correct? --
Thank you for looking this over, and especially for noting several issues! Responses interspersed. Yep! I note that in the comment preceding the cpu_is_offline() above. Is that sufficient, or should I reiterate that point in another comment NMI handlers are forbidden to queue callbacks, as the current call_rcu() implementation is not NMI-safe. It would be possible to create an NMI-safe implementation, but there needs to be someone who needs it > > + * is already in a quiescent state courtesy of dynticks id
Because this line should instead be: rdtp->dynticks_nmi = (rdtp->dynticks_nmi + 1) & ~0x1; Well spotted, even if it did take me a good long time to figure out that this really was a bug in my code! ;-) That said, you would have to really work to exercise this one... Near as I can tell, you would need to wrap the ->dynticks counter, which would then cause the dynticks_nmi counter to appear to go backwards. And then you would have to prevent the newly onlined CPU from ever passing through a quiescent state, which would cause a failure in any case. Still, good to fix, even if I can't figure out how it would result in a failure. Real hardware and software tends to be -much- better than me at finding such failures! Thanx, Paul --
What about some generic files, with the same content for all rcu backends?
The implementation could be in rcupdate.c. At least counting the rcu
callbacks could be done from generic code, the grace periods could be
queried [Do all backends implement rcu_batches_completed?]
Attached is a hack that I use right now for myself.
Btw - on my 4-cpu system, the average latency from call_rcu() to the rcu
callback is 4-5 milliseconds, (CONFIG_HZ_1000).
--
Manfred
For the grace-period latencies, I use the attached kernel module. This module simply times synchronize_rcu() to obtain the grace-period performance. I have similar modules for other performance measures. (Please note that this code was written for my own use only, others may find the kernel-module parameter names to be quite obnoxious. But yes, nreaders=1 will give you one synchronize_rcu() -update- task. What can I say? I used this module to collect data for http://www.research.ibm.com/journal/sj/472/guniguntala.pdf, and was not considering use by others.) I would rather avoid decorating the RCU code with grace-period-latency measurement code, since this can be done independently with a kernel module, and I suspect that more people read the RCU code than measure the grace-period latency. All backends do implement rcu_batches_completed(), but its return Hmmm... I would expect that if you have some CPUs in dyntick idle mode. But if I run treercu on an CONFIG_HZ_250 8-CPU Power box, I see 2.5 jiffies per grace period if CPUs are kept out of dyntick idle mode, and 4 jiffies per grace period if CPUs are allowed to enter dyntick idle mode. Alternatively, if you were testing with multiple concurrent synchronize_rcu() invocations, you can also see longer grace-period latencies due to the fact that a new synchronize_rcu() must wait for an earlier grace period to complete before starting a new one. The 2.5 jiffies is expected when RCU is idle: the synchronize_rcu() starts a new grace period immediately, it takes up to a jiffy for the other CPUs do their scheduling-clock interrupt (which notices that the grace period started), another jiffy for all the CPUs to do their next scheduling-clock interrupt (which notices the quiescent state), and on average a half jiffy for the initiating CPU to notice that the grace period has completed. I considered trying to trim an additional jiffy off of this grace-period latency by having CPUs keep track of whether they were in a ...
That's the reason why I decided to measure the real latency, from
call_rcu() to the final callback. It includes the delays for waiting
until the current grace period completes, until the softirq is
scheduled, etc.
Probably one cpu was not in user space when the timer interrupt arrived.
I'll continue to investigate that. Unfortunately, my first attempt
failed: adding too many printk's results in too much time spent within
do_syslog(). And then the timer interrupt always arrives on the
spin_unlock_irqrestore in do_syslog()....
--
Manfred
--
I believe that I get very close to the same effect by timing a call to synchronize_rcu() in a kernel module. Repeating measurements and printing out cumulative statistics periodically reduces the heisenberg ;-) Thanx, Paul --
Hello! This patch fixes a long-standing performance bug in classic RCU that results in massive internal-to-RCU lock contention on systems with more than a few hundred CPUs. Although this patch creates a separate flavor of RCU for ease of review and patch maintenance, it is intended to replace classic RCU. The current patch passes more stress tests than does mainline, so the next version will be against -tip in preparation for 2.6.29. Definitely ready for -serious- testing and abuse, and probably would do OK in production. In particular, experience on an actual 1000+ CPU machine would be most welcome, and still appears to be forthcoming some day... Updates from v7 (http://lkml.org/lkml/2008/10/10/291): o Fixed a number of problems noted by Gautham Shenoy, including the cpu-stall-detection bug that he was having difficulty convincing me was real. ;-) o Changed cpu-stall detection to wait for ten seconds rather than three in order to reduce false positive, as suggested by Ingo Molnar. o Produced a design document (http://lwn.net/Articles/305782/). The act of writing this document uncovered a number of both theoretical and "here and now" bugs as noted below. o Fix dynticks_nesting accounting confusion, simplify WARN_ON() condition, fix kerneldoc comments, and add memory barriers in dynticks interface functions. o Add more data to tracing. o Remove unused "rcu_barrier" field from rcu_data structure. o Count calls to rcu_pending() from scheduling-clock interrupt to use as a surrogate timebase should jiffies stop counting. o Fix a theoretical race between force_quiescent_state() and grace-period initialization. Yes, initialization does have to go on for some jiffies for this race to occur, but given enough CPUs... Updates from v6 (http://lkml.org/lkml/2008/9/23/448): o Fix a number of checkpatch.pl complaints. o Apply review comments from Ingo Molnar and Lai Jiangshan on the stall-detection code. o Fix several bugs in !CONFIG_SMP ...
