Hi, While working on the IO CPU affinity bits, I had to convert some more archs to support smp_call_function_single(). Nicks original patch for x86-64 also included improvements for smp_call_function(), so it isn't serialized by call_lock anymore (but still hits call_lock on every call). I've made further improvements and changes upon that while converting x86, going further and updating IA64 and PPC made me realize that we have a lot of shared code for the same functionality. So I came up with kernel/smp.c that basically just holds the new and abstracted code for issuing smp_call_function() and smp_call_function_single() along with the helpers and interrupt handlers associated. Perhaps a better place would be lib/smp.c, I'm open to suggestions. Even with half of this being new functionality, the diffstat still looks pretty nice: arch/ia64/kernel/smp.c | 171 ++--------- arch/powerpc/kernel/smp.c | 223 +------------- arch/powerpc/platforms/cell/interrupt.c | 1 arch/powerpc/platforms/ps3/smp.c | 7 arch/powerpc/platforms/pseries/xics.c | 6 arch/powerpc/sysdev/mpic.c | 2 arch/x86/kernel/entry_64.S | 3 arch/x86/kernel/i8259_64.c | 1 arch/x86/kernel/smp_32.c | 165 +++------- arch/x86/kernel/smp_64.c | 175 ++--------- arch/x86/kernel/smpboot_32.c | 4 arch/x86/kernel/smpcommon_32.c | 34 -- include/asm-powerpc/smp.h | 5 include/asm-x86/hw_irq_32.h | 1 include/asm-x86/hw_irq_64.h | 4 include/asm-x86/mach-default/entry_arch.h | 1 include/asm-x86/mach-default/irq_vectors.h | 1 include/linux/smp.h | 30 + kernel/Makefile | 2 kernel/smp.c | 316 +++++++++++++++++++++ 20 files changed, 533 insertions(+), 619 ...
This adds kernel/smp.c which contains helpers for IPI function calls. In
addition to supporting the existing smp_call_function() in a more efficient
manner, it also adds a more scalable variant called smp_call_function_single()
for calling a given function on a single CPU only.
The core of this is based on the x86-64 patch from Nick Piggin.
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
include/linux/smp.h | 30 +++++-
kernel/Makefile | 2 +-
kernel/smp.c | 316 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 345 insertions(+), 3 deletions(-)
create mode 100644 kernel/smp.c
diff --git a/include/linux/smp.h b/include/linux/smp.h
index 55232cc..19b217e 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -7,9 +7,19 @@
*/
#include <linux/errno.h>
+#include <linux/list.h>
+#include <linux/spinlock.h>
+#include <linux/cpumask.h>
extern void cpu_idle(void);
+struct call_single_data {
+ struct list_head list;
+ void (*func) (void *info);
+ void *info;
+ unsigned int flags;
+};
+
#ifdef CONFIG_SMP
#include <linux/preempt.h>
@@ -50,12 +60,27 @@ extern int __cpu_up(unsigned int cpunum);
extern void smp_cpus_done(unsigned int max_cpus);
/*
- * Call a function on all other processors
+ * Call a function on all other processors - arch exported functions
*/
int smp_call_function(void(*func)(void *info), void *info, int retry, int wait);
-
int smp_call_function_single(int cpuid, void (*func) (void *info), void *info,
int retry, int wait);
+void __smp_call_function_single(int cpuid, struct call_single_data *data);
+
+/*
+ * Generic helpers
+ */
+int generic_smp_call_function_single(int cpuid, void (*func) (void *info),
+ void *info, int wait,
+ void (*send_ipi)(int cpu));
+int generic_smp_call_function(void (*func) (void *info), void *info, int wait,
+ cpumask_t mask, void (*send_ipi)(cpumask_t));
+void generic_exec_single(int cpu, struct call_single_data *data,
+ ...This converts x86 to use the new helpers for smp_call_function() and
frieds, and adds support for smp_call_function_single().
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
arch/x86/kernel/smp_32.c | 165 +++++++++-------------------
arch/x86/kernel/smpboot_32.c | 4 +
arch/x86/kernel/smpcommon_32.c | 34 ------
include/asm-x86/hw_irq_32.h | 1 +
include/asm-x86/mach-default/entry_arch.h | 1 +
include/asm-x86/mach-default/irq_vectors.h | 1 +
6 files changed, 61 insertions(+), 145 deletions(-)
diff --git a/arch/x86/kernel/smp_32.c b/arch/x86/kernel/smp_32.c
index dc0cde9..8d12130 100644
--- a/arch/x86/kernel/smp_32.c
+++ b/arch/x86/kernel/smp_32.c
@@ -476,64 +476,29 @@ static void native_smp_send_reschedule(int cpu)
send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR);
}
-/*
- * Structure and data for smp_call_function(). This is designed to minimise
- * static memory requirements. It also looks cleaner.
- */
-static DEFINE_SPINLOCK(call_lock);
-
-struct call_data_struct {
- void (*func) (void *info);
- void *info;
- atomic_t started;
- atomic_t finished;
- int wait;
-};
-
void lock_ipi_call_lock(void)
{
- spin_lock_irq(&call_lock);
+ spin_lock_irq(&call_function_lock);
}
void unlock_ipi_call_lock(void)
{
- spin_unlock_irq(&call_lock);
+ spin_unlock_irq(&call_function_lock);
}
-static struct call_data_struct *call_data;
-
-static void __smp_call_function(void (*func) (void *info), void *info,
- int nonatomic, int wait)
+static void send_call_function_ipi(cpumask_t mask)
{
- struct call_data_struct data;
- int cpus = num_online_cpus() - 1;
-
- if (!cpus)
- return;
-
- data.func = func;
- data.info = info;
- atomic_set(&data.started, 0);
- data.wait = wait;
- if (wait)
- atomic_set(&data.finished, 0);
-
- call_data = &data;
- mb();
-
- /* Send a message to all other CPUs and wait for them to respond ...This converts x86-64 to use the new helpers for smp_call_function() and frieds, and adds support for smp_call_function_single(). Signed-off-by: Jens Axboe <jens.axboe@oracle.com> --- arch/x86/kernel/entry_64.S | 3 + arch/x86/kernel/i8259_64.c | 1 + arch/x86/kernel/smp_64.c | 175 ++++++++++--------------------------------- include/asm-x86/hw_irq_64.h | 4 +- 4 files changed, 46 insertions(+), 137 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S index c20c9e7..22caf56 100644 --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -713,6 +713,9 @@ END(invalidate_interrupt\num) ENTRY(call_function_interrupt) apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt END(call_function_interrupt) +ENTRY(call_function_single_interrupt) + apicinterrupt CALL_FUNCTION_SINGLE_VECTOR,smp_call_function_single_interrupt +END(call_function_single_interrupt) ENTRY(irq_move_cleanup_interrupt) apicinterrupt IRQ_MOVE_CLEANUP_VECTOR,smp_irq_move_cleanup_interrupt END(irq_move_cleanup_interrupt) diff --git a/arch/x86/kernel/i8259_64.c b/arch/x86/kernel/i8259_64.c index fa57a15..2b0b6d2 100644 --- a/arch/x86/kernel/i8259_64.c +++ b/arch/x86/kernel/i8259_64.c @@ -493,6 +493,7 @@ void __init native_init_IRQ(void) /* IPI for generic function call */ set_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt); + set_intr_gate(CALL_FUNCTION_SINGLE_VECTOR, call_function_single_interrupt); /* Low priority IPI to cleanup after moving an irq */ set_intr_gate(IRQ_MOVE_CLEANUP_VECTOR, irq_move_cleanup_interrupt); diff --git a/arch/x86/kernel/smp_64.c b/arch/x86/kernel/smp_64.c index 2fd74b0..6002274 100644 --- a/arch/x86/kernel/smp_64.c +++ b/arch/x86/kernel/smp_64.c @@ -295,111 +295,27 @@ void smp_send_reschedule(int cpu) send_IPI_mask(cpumask_of_cpu(cpu), RESCHEDULE_VECTOR); } -/* - * Structure and data for smp_call_function(). This is designed to minimise - * static memory requirements. It also ...
This converts ia64 to use the new helpers for smp_call_function() and
frieds, and adds support for smp_call_function_single().
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
arch/ia64/kernel/smp.c | 171 +++++++++++++-----------------------------------
1 files changed, 45 insertions(+), 126 deletions(-)
diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
index 4e446aa..fa26528 100644
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -61,24 +61,9 @@ static struct local_tlb_flush_counts {
static DEFINE_PER_CPU(unsigned int, shadow_flush_counts[NR_CPUS]) ____cacheline_aligned;
-/*
- * Structure and data for smp_call_function(). This is designed to minimise static memory
- * requirements. It also looks cleaner.
- */
-static __cacheline_aligned DEFINE_SPINLOCK(call_lock);
-
-struct call_data_struct {
- void (*func) (void *info);
- void *info;
- long wait;
- atomic_t started;
- atomic_t finished;
-};
-
-static volatile struct call_data_struct *call_data;
-
#define IPI_CALL_FUNC 0
#define IPI_CPU_STOP 1
+#define IPI_CALL_FUNC_SINGLE 2
#define IPI_KDUMP_CPU_STOP 3
/* This needs to be cacheline aligned because it is written to by *other* CPUs. */
@@ -89,13 +74,13 @@ extern void cpu_halt (void);
void
lock_ipi_calllock(void)
{
- spin_lock_irq(&call_lock);
+ spin_lock_irq(&call_function_lock);
}
void
unlock_ipi_calllock(void)
{
- spin_unlock_irq(&call_lock);
+ spin_unlock_irq(&call_function_lock);
}
static void
@@ -139,32 +124,11 @@ handle_IPI (int irq, void *dev_id)
switch (which) {
case IPI_CALL_FUNC:
- {
- struct call_data_struct *data;
- void (*func)(void *info);
- void *info;
- int wait;
-
- /* release the 'pointer lock' */
- data = (struct call_data_struct *) call_data;
- func = data->func;
- info = data->info;
- wait = data->wait;
-
- mb();
- ...This converts ppc to use the new helpers for smp_call_function() and
frieds, and adds support for smp_call_function_single().
Signed-off-by: Jens Axboe <jens.axboe@oracle.com>
---
arch/powerpc/kernel/smp.c | 223 ++++---------------------------
arch/powerpc/platforms/cell/interrupt.c | 1 +
arch/powerpc/platforms/ps3/smp.c | 7 +-
arch/powerpc/platforms/pseries/xics.c | 6 +-
arch/powerpc/sysdev/mpic.c | 2 +-
include/asm-powerpc/smp.h | 5 +-
6 files changed, 36 insertions(+), 208 deletions(-)
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index be35ffa..c94c80c 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -72,12 +72,8 @@ struct smp_ops_t *smp_ops;
static volatile unsigned int cpu_callin_map[NR_CPUS];
-void smp_call_function_interrupt(void);
-
int smt_enabled_at_boot = 1;
-static int ipi_fail_ok;
-
static void (*crash_ipi_function_ptr)(struct pt_regs *) = NULL;
#ifdef CONFIG_PPC64
@@ -99,12 +95,15 @@ void smp_message_recv(int msg)
{
switch(msg) {
case PPC_MSG_CALL_FUNCTION:
- smp_call_function_interrupt();
+ generic_smp_call_function_interrupt();
break;
case PPC_MSG_RESCHEDULE:
/* XXX Do we have to do this? */
set_need_resched();
break;
+ case PPC_MSG_CALL_FUNC_SINGLE:
+ generic_smp_call_function_single_interrupt();
+ break;
case PPC_MSG_DEBUGGER_BREAK:
if (crash_ipi_function_ptr) {
crash_ipi_function_ptr(get_irq_regs());
@@ -154,215 +153,47 @@ static void stop_this_cpu(void *dummy)
;
}
-/*
- * Structure and data for smp_call_function(). This is designed to minimise
- * static memory requirements. It also looks cleaner.
- * Stolen from the i386 version.
- */
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(call_lock);
-
-static struct call_data_struct {
- void (*func) (void *info);
- void *info;
- atomic_t started;
- atomic_t finished;
- int wait;
-} *call_data;
-
-/* delay ...i'm still wondering about the following fundamental bit: why not use per CPU kernel threads? That way you get a fast (lockless) IPI "for free" as SMP wakeups already do this. smp_call_function() is quirky and has deep limitations on atomicity, etc., so we are moving away from it and should not base more functionality on it. Ingo --
The kernel thread variant wont be any more lockless than the smp_call_function_single() approach, they both have to grab the destination queue lock. If you recall, I pushed forward on the kernel thread variant and even still have it online here: http://git.kernel.dk/?p=linux-2.6-block.git;a=shortlog;h=io-cpu-affinity-kthread which is pretty much identical to io-cpu-affinity, except it uses kernel threads for completion. The reason why I dropped the kthread approach is that it was slower. Time from signal to run was about 33% faster with IPI than with wake_up_process(). Doing benchmark runs, and the IPI approach won hands The patchset does not build on smp_call_function(), it merely cleans that stuff up instead of having essentially the same code in each arch. As more archs are converted, it'll remove lots more code. The block stuff builds on smp_call_function_single(), which doesn't suffer from any of the badness that smp_call_function() does. -- Jens Axboe --
It's obviously going to be faster. The kthread approach needs an IPI *and* a context switch on the destination CPU. For broadcast situations, it is also going to be faster, because it can use platform specific broadcast IPIs The only weird atomicity requirements of smp_call_function is due to its terribly unscalable design. Once you have the producer provide its own data (or accept -ENOMEM failures), then there is no problem. It's then even possible to use it from interrupt context with not too much Also there is nothing wrong with improving the speed of the existing implementation. The current smp_call_function is more or less unusable for any real work because it is totally serialised. Thanks for taking this patchset under your wing, Jens. --
with irq threads we'll have all irq context run in kthread context again. Could you show me how you measured the performance of the kthread approach versus the raw-IPI approach? we can do a million kthread context switches per CPU per second, so kthread context-switch cost cannot be a true performance limit, unless you micro-benchmarked this. Ingo --
There were 3 different indicators that the irq thread approach was slower: - Time from signal to actual run of the trigger was ~2usec vs ~3usec for IPI vs kthread. That was a microbenchmark. - Cache misses were higher with the kthread approach. - Actual performance in non-micro benchmarks was lower with the kthread approach. I'll defer to Alan for the actual numbers, most of this was done in private mails back and forth doing performance analysis. The initial testing was done with the IPI hack, then we moved to the kthread approach. Later the two were pitted against each other and the kthread part was definitely slower. It ended up using more system time than the IPI approach. So the kthread approach was than abandoned and all testing has been on the smp_call_function_single() branch since then. I very much wanted the kthread approach to work, since it's easier to work with. It's not for lack of will or trying... I'll be happy to supply you otherwise identical patches for this, the only difference At which point you wont be doing much else, so a cs microbenchmark is not really that interesting. -- Jens Axboe --
i'd love to be able to run/pull something simple that enables me to replicate the measurements you did on a generic PC [without having to hit any real IO hardware which would put any context switching effects down into the noise category]. Obviously since the kthread approach embedds IPI sending it can never be as fast as a pure IPI approach - but it should still be reasonably fast. 3 usecs versus 2 usecs in a microbenchmark sounds about right to me, but it would be better to make it a better-sounding 2.5 usecs versus 2 usecs or so :-) Ingo --
You can pull io-cpu-affinity or io-cpu-affinity-kthread from git://git.kernel.dk/linux-2.6-block.git - or just see the two attached If you have ideas to speedup the kthread approach, fire away :-) -- Jens Axboe
another stupid question: what should i run in user-space to replicate your "3 usecs versus 2 usecs" result? io-affinity-ipi.patch seems to have no self-benchmarking capability at first sight. (I'd rather not try and cook up anything myself - i'd like to reproduce the workload you think is relevant for your IO affinity purposes.) Best would be to have a Kconfig based self-test that just runs during bootup if i boot a bzImage. (laziness rules - and this way i could also track performance regressions more easily, by looking at historic serial logs.) Ingo --
I didn't do those numbers - Alan, when you timed the kthread vs ipi "wake up", what did you use? I'm guessing some hacked in test, perhaps you can pass that to Ingo? -- Jens Axboe --
Can you give me a clue as to which line needs the one-liner ... although
it builds cleanly for IA64 it panics with a dereference NULL fault. It
looks like we were in the list_add_tail(&data->list, &dst->list) code.
Console log looks like this:
ACPI: Core revision 20070126
Boot processor id 0x0/0xc018
Fixed BSP b0 value from CPU 1
Unable to handle kernel NULL pointer dereference (address 0000000000000000)
swapper[0]: Oops 8804682956800 [1]
Modules linked in:
Pid: 0, CPU 1, comm: swapper
psr : 00001010084a2010 ifs : 800000000000060e ip : [<a0000001000be1c0>] Not tainted (2.6.25-rc6-tiger-smp)
ip is at generic_exec_single+0xc0/0x200
unat: 0000000000000000 pfs : 000000000000060e rsc : 0000000000000003
rnat: 5555555555555555 bsps: 5555555555555555 pr : 0000000000002655
ldrs: 0000000000000000 ccv : 0000000000000000 fpsr: 0009804c0270433f
csd : 0000000000000000 ssd : 0000000000000000
b0 : a0000001000be180 b6 : e00000007fe131b0 b7 : e00000007fb1bde0
f6 : 000000000000000000000 f7 : 1003e6db6db6db6db6db7
f8 : 1003e00000000000c410a f9 : 1000fffff800000000000
f10 : 1003e000000000001ffff f11 : 000000000000000000000
r1 : a000000100c42f20 r2 : 0000000000000000 r3 : ffffffffffff5968
r8 : 00000010084a6010 r9 : a000000100a617c0 r10 : 0000000000000000
r11 : a000000100a617c0 r12 : e0000001c025fdf0 r13 : e0000001c0250000
r14 : e000000180016058 r15 : e000000180005970 r16 : 0000000000000002
r17 : a000000100a617c0 r18 : 0000000000004000 r19 : a00000010099ffb0
r20 : e0000001c01550c0 r21 : e0000001c01501c0 r22 : 0000000000000000
r23 : e000000180016064 r24 : a07ffffffff20498 r25 : 00000000000000c0
r26 : 00000000000000c1 r27 : 00000010084a6010 r28 : 0000000000000001
r29 : a00000010099ff80 r30 : 0000000000000000 r31 : e000000180005978
Call Trace:
[<a000000100011bb0>] show_stack+0x50/0xa0
sp=e0000001c025f9c0 bsp=e0000001c0250e68
[<a000000100012490>] show_regs+0x830/0x860
sp=e0000001c025fb90 ...My thinking was that it may need a bit of help to ensure that the init function for single call data works, kernel/smp.c:init_call_single_data() specifically. It may be run before Is it data or dst being NULL? Looking at your trace, it must be dst. Which makes me believe that it is indeed a missing call of the above function at the right time. So some help would be accepted there :) -- Jens Axboe --
The problem may be worse than that. I see that you have a per-cpu "call_single_queue" on which you do some list operations. On ia64 the per-cpu variables live at the same virtual address on each processor with a per-cpu TLB mapping directing them to a different physical address on each. This makes Linux lists very confused. -Tony --
Funky, how does accessing other CPU's per-cpu variables work on ia64 then? Perhaps I made some false assumptions. Unfortunately I have no access to any IA64 machines, so I either have to yank the ia64 bits (which is unfortunate, since Alan tests on those :-) or rely on a bit of help from you and/or others in getting that bit right. -- Jens Axboe --
The per-cpu memory is mapped at two different spots in the kernel virtual memory. When cpuA wants to access per-cpu memory that belongs to cpuB it can use the mappings that allow access to every percpu area (which may just be indexing by cpu number into a big block of memory that has all the per-cpu spaces ... or some more complex arithmetic and pointers for NUMA systems where the per-cpu memory ought to be allocated out of memory on the right node for the cpu that it refers to). When any cpu wants to access its own per-cpu data, it can do so via the per-cpu mapping (which is much more efficient from a code generation perspective at the per-cpu virtual area is the top 64K of virtual address space, so can be accessed with a small negative offset from register r0). This is why lists can be a problem ... since the same memory can be accessed via two different virtual addresses, things can get badly knotted when the two different addresses get used in different parts of the code. Then operations like "list_empty()" may give the wrong answer because the virtual address used for head->next isn't the same as that used for head ... but they both refer to the I'll see if I can figure out what is going wrong. -Tony --
