RCU can only control the lifetime of allocated memory blocks, which
forces all the call structures to be allocated. This is expensive
compared to allocating them on the stack, which is the common case for
synchronous calls.
This patch takes a different approach. Rather than using RCU, the
queues are managed under rwlocks. Adding or removing from the queue
requires holding the lock for writing, but multiple CPUs can walk the
queues to process function calls under read locks. In the common
case, where the structures are stack allocated, the calling CPU need
only wait for its call to be done, take the lock for writing and
remove the call structure.
Lock contention - particularly write vs read - is reduced by using
multiple queues.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
kernel/smp.c | 140 +++++++++++++++++++++++-----------------------------------
1 file changed, 56 insertions(+), 84 deletions(-)
===================================================================
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -7,8 +7,6 @@
#include <linux/init.h>
#include <linux/module.h>
#include <linux/percpu.h>
-#include <linux/rcupdate.h>
-#include <linux/rculist.h>
#include <linux/smp.h>
#include <asm/atomic.h>
@@ -26,7 +24,7 @@
static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
struct ____cacheline_aligned queue {
struct list_head list;
- spinlock_t lock;
+ rwlock_t rwlock;
};
static __cacheline_aligned struct queue call_function_queues[NQUEUES];
@@ -40,7 +38,7 @@
struct call_single_data csd;
atomic_t refs;
cpumask_t cpumask;
- struct rcu_head rcu_head;
+ struct list_head cull_list;
};
struct call_single_queue {
@@ -98,15 +96,6 @@
csd_flag_wait(data);
}
-static void rcu_free_call_data(struct rcu_head *head)
-{
- struct call_function_data *data;
-
- data = container_of(head, struct call_function_data, rcu_head);
-
- kfree(data);
-}
-
/*
* Invoked by arch to handle an IPI for ...Could be reasonable. Still, it's adding like 4 or 5 more atomic operations, which will be approaching the cost of a slab allocation. Another approach to reduce slab allocation cost would be to do the call_rcu at the caller, in the wait case so we hit the CPU-local --
hm, is there any authorative data on what is cheaper on a big box, a full-blown MESI cache miss that occurs for every reader in this new fastpath, or a local SLAB/SLUB allocation+free that occurs with the current RCU approach? Ingo --
Hi Ingo, Christoph might have an idea about it. --
... thought of that missing Cc: line entry exactly 1.3 seconds after having sent the mail :) Christoph, any preferences/suggestions? Ingo --
I think it's just going to be a matter of benchmarking it and seeing. And small/medium systems are probably more important than huge ones unless there is a pathological scalability problem with one of the approaches (which there probably isn't seeing as there is already locking there). --
Its on the stack which is presumably hot so no cache miss? If its async then presumably we do not need to wait so its okay to call an allocator. Generally: The larger the box (longer cacheline acquisition latencies) and the higher the contention (cannot get cacheline because of contention) the better a slab allocation will be compared to a cacheline miss. RCU is problematic because it lets cachelines get cold. A hot cacheline that is used frequently read and written to by the same cpu is very good thing for performace. --
So on your these large boxes, read-only cachelines are preferentially ejected from the cache, so that one should write to per-CPU data occasionally to keep it resident? Or is the issue the long RCU grace periods which allow the structure being freed to age out of all relevant caches? (My guess would be the second.) Thanx, Paul --
The issue are the RCU grace period that are generally long enough to make the cacheline fall out of all caches. --
Would it make sense to push the freed-by-RCU memory further up the hierarchy, so that such memory is not mistaken for recently freed hot-in-cache memory? Thanx, Paul --
Right now my impression is that it is not well understood why the kmalloc makes the IPI that much slower. In theory a kmalloc shouldn't be all that slow, it's essentially just a "disable interrupts; unlink object from cpu cache; enable interrupts" with some window dressing. kfree() is similar. Does it bounce a cache line on freeing perhaps? -Andi --
I think it's just an assumption that it would be slower. Has anyone
measured it?
(Note: The measurements I posted do not cover this path, because it was
on a two cpu system, and it was always using the call-single path.)
J
--
It's likely slower than no kmalloc because Ah so it was already 25% slower even without kmalloc? I thought that was with already. That doesn't sound good. Any idea where that slowdown comes from? -Andi --
Just longer code path, I think. It calls the generic
smp_call_function_mask(), which then does a popcount on the cpu mask
(which it needs to do anyway), sees only one bit set, and then punts to
the smp_call_function_single() path. If we maintained a cpus_online
count, then we could fast-path the call to smp_call_function_single() in
the two core/cpu case more efficiently (would still need to scan the
mask to extract the cpu number).
Or alternatively, maybe it isn't actually worth special casing
smp_call_function_single() with a multi-queue smp_call_function_mask()
implementation?
J
--
I did IPI measurements quite some time ago and what I remember from them is that IPI latencies were in the low multiple thousands cycle But that is more in the a few tens of cycles (or maybe 1-2 hundreds if you have a NR_CPU==4096 kernel with really large cpumask) Doesn't really explain a 25% slowdown I would say. Are you sure there isn't a new cache miss in there or something? Actually it must be even multiple ones to account for such a slow down. -Andi -- ak@linux.intel.com --
It is indeed easy to focus on the wrong area when attempting to improve performance. Done it many times myself... :-/ Thanx, Paul --
That would mean passing a gfp flag like __GFP_COLD on free from RCU? Or how would that work at the higher levels? --
I was indeed thinking in terms of the free from RCU being specially marked. Thanx, Paul --
Isnt there some way to shorten the rcu periods significantly? Critical sections do not take that long after all. If the RCU periods are much shorter then the chance of cache hotness of the objects is increased. --
In theory, yes. However, the shorter the grace period, the greater the per-update overhead of grace-period detection -- the general approach is to use a per-CPU high-resolution timer to force RCU grace period processing every 100 microseconds or so. Also, by definition, the RCU grace period can be no shorter than the longest active RCU read-side critical section. Nevertheless, I have designed my current hierarchical RCU patch with expedited grace periods in mind, though more for the purpose of reducing latency of long strings of operations that involve How short does the grace period need to be to significantly increase the chance of an RCU-protected data element remaining in cache across an RCU grace period? The last time I calculated this, the knee of the curve was at a few tens of milliseconds, but to give you an idea of how long ago that was, the workload I used was TPC/A. Which might no longer be very representative. ;-) Thanx, Paul --
You could of course also drive the rcu state machine from Another thing that could be done is more often force a grace period by flipping the counters. --
True, and Jim Houston implemented something similar to this some years back: http://marc.theaimsgroup.com/?l=linux-kernel&m=109387402400673&w=2 This of course greatly increases rcu_read_unlock() overhead. But perhaps it is a good implementation for the workloads that Christoph is Yep. That is exactly what I was getting at with the high-resolution timer point above. This seems to be a reasonable compromise, as it allows someone to specify how quickly the grace periods happen dynamically. But I am not sure that this gets the grace periods to go fast enough to cover Christoph's use case -- he seems to be in a "faster is better" space rather than in an "at least this fast" space. Still, it would likely help in some important cases. Thanx, Paul --
If we combine these two cases, and flip the counter as soon as we've
enqueued one callback, unless we're already waiting for a grace period
to end - which gives us a longer window to collect callbacks.
And then the rcu_read_unlock() can do:
if (dec_and_zero(my_counter) && my_index == dying)
raise_softirq(RCU)
to fire off the callback stuff.
/me ponders - there must be something wrong with that...
Aaah, yes, the dec_and_zero is non trivial due to the fact that its a
distributed counter. Bugger..
--
Then lets make it per cpu. If we get the cpu ops in then dec_and_zero would be very cheap. --
Hmm, perhaps that might work for classic RCU, as that disables preemption and thus the counters should always be balanced. --
Unless you use a pair of global counters (like QRCU), you will still need to check a large number of counters for zero. I suppose that one approach would be to do something like QRCU, but with some smallish number of counter pairs, each of which is shared by a moderate group of CPUs. For example, for 4,096 CPUs, use 64 pairs of counters, each shared by 64 CPUs. My guess is that the rcu_read_lock() overhead would make this be a case of "Holy overhead, Batman!!!", but then again, I cannot claim to be an expert on 4,096-CPU machines. Thanx, Paul --
right - while the local count will be balanced and will always end up on zero, you have to check remote counts for zero as well. But after a counter flip, the dying counter will only reach zero once per cpu. So each cpu gets to tickle a softirq once per cycle. That softirq can then check all remote counters, and kick off the callback list when it finds them all zero. Of course, this scan is very expensive, n^2 at worst, each cpu triggering a full scan, until finally the last cpu is done. We could optimize this by keeping cpu masks of cpus found to have !0 counts - those who were found to have 0, will always stay zero, so we'll not have to look at them again. Another is making use of a scanning hierarchy. --
Which might not be so good from a powertop viewpoint, since grace periods only take a few milliseconds, and powertop likes to keep CPUs sleeping for seconds. But one could designate a set of CPUs that scanned nearby counters -- carefully chosed WRT the system in question One could keep an index of already-scanned counters, which would You still have to scan all of the leaves... Though you can divide the work over a set of CPUs, again, as long as those CPUs are chosen so as to balance the need to power down whole cores with the need to maintain good memory locality. Thanx, Paul --
The problem is that we need dec_and_zero on the sum of the per-CPU counters. Gets spendy. One can make a hierarchy, and propagate up. But still lots of cache misses. Thanx, Paul --
Let's be very careful before making rcu read locks costly. Any reduction in grace periods would be great, but IMO RCU should not be used in cases where performance depends on the freed data remaining in cache. --
Indeed! But if you were in a situation where read-side overhead was irrelevant (perhaps a mythical machine with zero-cost atomics and cache misses), then one approach would be to combine Oleg Nesterov's QRCU with the callback processing from Andrea Arcangeli's implementation from the 2001 timeframe. Of course, if your cache misses really were zero cost, then you wouldn't care about the data remaining in cache. So maybe a machine were cache misses to other CPUs' caches are free, but misses to main memory are horribly expensive? Anyway, the trick would be to adapt QRCU (http://lkml.org/lkml/2007/2/25/18) to store the index in the task structure (as opposed to returning it from rcu_read_lock()), and have a single global queue of callbacks, guarded by a global lock. Then rcu_read_unlock() can initiate callback processing if the counter decrements down to zero, and call_rcu() would also initiate a counter switch in the case where the non-current counter was zero -- and this operation would be guarded by the same lock that guards the callback queue. But I doubt that this would be satisfactory on 4,096-CPU machines. At least not in most cases. ;-) Thanx, Paul --
I think there was an AIM9 regression in the close/open tests when the struct file was switched to RCU. That test could be run with various intervals to figure out if a shorter RCU period is beneficial and how short an RCU period is needed to avoid the regression. --
Well, with some luck, I can get an RCU implementation that allows the grace-period duration to be varied, which would allow someone to check the AIM9 regression. Thanx, Paul --
