This series: - adds a simple debugfs profiling entry for cross-cpu tlb flushes - converts them to using smp_call_function_mask - unifies 32 and 64-bit tlb flushes - converts smp_call_function to using multiple queues (using the now freed vectors) - allows config-time adjustment of the number of queues - adds a kernel parameter to disable multi-queue in case it causes problems The main concern is whether using smp_call_function adds an unacceptible performance hit to cross-cpu tlb flushes. My limited measurements show a ~35% regression in latency for a particular flush; it would be interesting to try this on a wider range of hardware. I gather the effect tlb flush performance is very application specific as well, but I'm not sure what benchmarks show what effects. Trading off agains the latency of a given flush, the smp_function_call mechanism allows multiple requests to be queued, and so may improve throughput on a system-wide basis. So, I'd like people to try this out and see what performance effects it has. Thanks, J --
arch/x86/kernel/tlb_*.c are functionally identical, so unify them.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/Makefile | 2
arch/x86/kernel/tlb.c | 268 ++++++++++++++++++++++++++++++++++++++++++++++
arch/x86/kernel/tlb_32.c | 190 --------------------------------
arch/x86/kernel/tlb_64.c | 235 ----------------------------------------
4 files changed, 269 insertions(+), 426 deletions(-)
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -60,7 +60,7 @@
apm-y := apm_32.o
obj-$(CONFIG_APM) += apm.o
obj-$(CONFIG_X86_SMP) += smp.o
-obj-$(CONFIG_X86_SMP) += smpboot.o tsc_sync.o ipi.o tlb_$(BITS).o
+obj-$(CONFIG_X86_SMP) += smpboot.o tsc_sync.o ipi.o tlb.o
obj-$(CONFIG_X86_32_SMP) += smpcommon.o
obj-$(CONFIG_X86_64_SMP) += tsc_sync.o smpcommon.o
obj-$(CONFIG_X86_TRAMPOLINE) += trampoline_$(BITS).o
diff --git a/arch/x86/kernel/tlb.c b/arch/x86/kernel/tlb.c
new file mode 100644
--- /dev/null
+++ b/arch/x86/kernel/tlb.c
@@ -0,0 +1,268 @@
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/debugfs.h>
+
+#include <asm/tlbflush.h>
+#include <asm/timer.h>
+
+/* For UV tlb flush */
+#include <asm/uv/uv_hub.h>
+#include <asm/uv/uv_bau.h>
+#include <asm/genapic.h> /* for is_uv_system */
+
+/*
+ * Smarter SMP flushing macros.
+ * c/o Linus Torvalds.
+ *
+ * These mean you can really definitely utterly forget about
+ * writing to user space from interrupts. (Its not allowed anyway).
+ *
+ * Optimizations Manfred Spraul <manfred@colorfullife.com>
+ */
+
+#ifdef CONFIG_X86_32
+DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate)
+ ____cacheline_aligned = { &init_mm, 0, };
+
+static inline short get_mmu_state(void)
+{
+ return __get_cpu_var(cpu_tlbstate).state;
+}
+
+static inline struct mm_struct *get_active_mm(void)
+{
+ return ...Bring arch/x86/kernel/tlb_32.c and _64.c closer into alignment, so that unification is more straightforward. After this patch, the remaining differences come down to UV support and the distinction between percpu and pda variables. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/kernel/tlb_32.c | 39 +++++++++++++++++++-------------------- arch/x86/kernel/tlb_64.c | 12 +++++++----- 2 files changed, 26 insertions(+), 25 deletions(-) diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c --- a/arch/x86/kernel/tlb_32.c +++ b/arch/x86/kernel/tlb_32.c @@ -1,7 +1,7 @@ -#include <linux/cpu.h> +#include <linux/smp.h> #include <linux/interrupt.h> -#include <linux/smp.h> #include <linux/percpu.h> +#include <linux/module.h> #include <asm/tlbflush.h> @@ -46,25 +46,25 @@ * 1) switch_mm() either 1a) or 1b) * 1a) thread switch to a different mm * 1a1) cpu_clear(cpu, old_mm->cpu_vm_mask); - * Stop ipi delivery for the old mm. This is not synchronized with - * the other cpus, but smp_invalidate_interrupt ignore flush ipis - * for the wrong mm, and in the worst case we perform a superfluous - * tlb flush. + * Stop ipi delivery for the old mm. This is not synchronized with + * the other cpus, but tlb_invalidate() ignores flush ipis + * for the wrong mm, and in the worst case we perform a superfluous + * tlb flush. * 1a2) set cpu_tlbstate to TLBSTATE_OK * Now the smp_invalidate_interrupt won't call leave_mm if cpu0 * was in lazy tlb mode. * 1a3) update cpu_tlbstate[].active_mm - * Now cpu0 accepts tlb flushes for the new mm. + * Now cpu0 accepts tlb flushes for the new mm. * 1a4) cpu_set(cpu, new_mm->cpu_vm_mask); - * Now the other cpus will send tlb flush ipis. + * Now the other cpus will send tlb flush ipis. * 1a4) change cr3. * 1b) thread switch without mm change * cpu_tlbstate[].active_mm is correct, cpu0 already handles * flush ipis. * 1b1) set cpu_tlbstate to TLBSTATE_OK * 1b2) ...
Now that smp_call_function_mask exists and is scalable, there's no reason to have a special TLB flush IPI. This saves a mass of code. In the process, I removed a copy of a cpumask_t. The UV tlb flush code relies on that copy, so I propagated it down. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> Cc: Cliff Wickman <cpw@sgi.com> --- arch/x86/kernel/entry_64.S | 15 ---- arch/x86/kernel/irqinit_64.c | 10 --- arch/x86/kernel/tlb_64.c | 124 ++++++----------------------------------- arch/x86/kernel/tlb_uv.c | 5 - include/asm-x86/irq_vectors.h | 4 - include/asm-x86/uv/uv_bau.h | 2 6 files changed, 24 insertions(+), 136 deletions(-) diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S --- a/arch/x86/kernel/entry_64.S +++ b/arch/x86/kernel/entry_64.S @@ -869,21 +869,6 @@ apicinterrupt RESCHEDULE_VECTOR,smp_reschedule_interrupt END(reschedule_interrupt) - .macro INVALIDATE_ENTRY num -ENTRY(invalidate_interrupt\num) - apicinterrupt INVALIDATE_TLB_VECTOR_START+\num,smp_invalidate_interrupt -END(invalidate_interrupt\num) - .endm - - INVALIDATE_ENTRY 0 - INVALIDATE_ENTRY 1 - INVALIDATE_ENTRY 2 - INVALIDATE_ENTRY 3 - INVALIDATE_ENTRY 4 - INVALIDATE_ENTRY 5 - INVALIDATE_ENTRY 6 - INVALIDATE_ENTRY 7 - ENTRY(call_function_interrupt) apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt END(call_function_interrupt) diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c --- a/arch/x86/kernel/irqinit_64.c +++ b/arch/x86/kernel/irqinit_64.c @@ -187,16 +187,6 @@ */ alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt); - /* IPIs for invalidation */ - alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+0, invalidate_interrupt0); - alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+1, invalidate_interrupt1); - alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+2, invalidate_interrupt2); - alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+3, ...
Put tlb_flush_others() latency measurements in debugfs.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/tlb_64.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)
diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -7,6 +7,7 @@
#include <linux/kernel_stat.h>
#include <linux/mc146818rtc.h>
#include <linux/interrupt.h>
+#include <linux/debugfs.h>
#include <asm/mtrr.h>
#include <asm/pgalloc.h>
@@ -17,6 +18,8 @@
#include <asm/idle.h>
#include <asm/uv/uv_hub.h>
#include <asm/uv/uv_bau.h>
+
+#include <asm/timer.h>
#include <mach_ipi.h>
/*
@@ -157,15 +160,29 @@
add_pda(irq_tlb_count, 1);
}
+#ifdef CONFIG_DEBUG_FS
+static spinlock_t tlbflush_others_lock = __SPIN_LOCK_UNLOCKED(tlb_flush_others_lock);
+static u32 tlbflush_others_count;
+static u64 tlbflush_others_accum;
+static u8 tlbflush_others_enable;
+#else
+#define tlbflush_others_enable 0
+#endif
+
void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
unsigned long va)
{
int sender;
union smp_flush_state *f;
cpumask_t cpumask = *cpumaskp;
+ u8 timing_enabled = tlbflush_others_enable;
+ u64 uninitialized_var(start), end;
if (is_uv_system() && uv_flush_tlb_others(&cpumask, mm, va))
return;
+
+ if (timing_enabled)
+ rdtscll(start);
/* Caller has disabled preemption */
sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
@@ -194,6 +211,15 @@
f->flush_mm = NULL;
f->flush_va = 0;
spin_unlock(&f->tlbstate_lock);
+
+ if (timing_enabled) {
+ rdtscll(end);
+
+ spin_lock(&tlbflush_others_lock);
+ tlbflush_others_count++;
+ tlbflush_others_accum += cycles_2_ns(end - start);
+ spin_unlock(&tlbflush_others_lock);
+ }
}
static int __cpuinit init_smp_flush(void)
@@ -277,3 +303,15 @@
{
on_each_cpu(do_flush_tlb_all, NULL, 1);
}
+
+#ifdef ...Previously, the config option GENERIC_SMP_QUEUES was never actually
adjustable because a moderate amount of static boilerplate code had to
match.
This patch makes it truely adjustable up to the number of IPI vectors
we statically preallocate. The various IPI handlers are generated by
looping assembler macros, and their addresses are put into an array
for C code to use to actually set the vectors up.
The default number of queues is never more than NR_CPUS, and defaults
to NR_CPUS. Ideally this could be a runtime variable based on the
number of possible cpus, rather than a static compile-time options.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/Kconfig | 8 +++++++-
arch/x86/kernel/entry_32.S | 12 ++++++------
arch/x86/kernel/entry_64.S | 29 ++++++++++++++++-------------
arch/x86/kernel/irqinit_32.c | 16 ++++++----------
arch/x86/kernel/irqinit_64.c | 16 ++++++----------
include/asm-x86/hw_irq.h | 9 +--------
include/asm-x86/irq_vectors.h | 3 +++
include/asm-x86/mach-default/entry_arch.h | 29 ++++++++++++++++-------------
8 files changed, 61 insertions(+), 61 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -182,9 +182,15 @@
select USE_GENERIC_SMP_HELPERS
default y
+config MAX_GENERIC_SMP_QUEUES
+ int
+ default 8
+
config GENERIC_SMP_QUEUES
int
- default "8"
+ range 1 NR_CPUS if NR_CPUS <= MAX_GENERIC_SMP_QUEUES
+ range 1 MAX_GENERIC_SMP_QUEUES
+ default NR_CPUS
config X86_32_SMP
def_bool y
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -688,8 +688,7 @@
ENDPROC(common_interrupt)
CFI_ENDPROC
-#define __BUILD_INTERRUPT(name, func, nr) \
-ENTRY(name) \
+#define ...There was some concern that using multiple queues - and their
associated APIC vectors - may trigger bugs in various dubious APIC
implementations. This patch adds a command line option to force the
kernel to use a single queue, even if the architecture can support
more.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
Documentation/kernel-parameters.txt | 5 +++++
arch/x86/kernel/smp.c | 2 +-
arch/x86/xen/smp.c | 6 ++++--
include/linux/smp.h | 26 ++++++++++++++++++++++++++
kernel/smp.c | 14 +++++++++++++-
5 files changed, 49 insertions(+), 4 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1901,6 +1901,11 @@
simeth= [IA-64]
simscsi=
+ single_ipi_queue [X86]
+ Force the use of a single queue for smp function calls.
+ This means that only a single vector is used, which may avoid
+ bugs in some APIC implementations.
+
slram= [HW,MTD]
slub_debug[=options[,slabs]] [MM, SLUB]
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -129,7 +129,7 @@
void native_send_call_func_ipi(cpumask_t mask)
{
cpumask_t allbutself;
- unsigned queue = smp_processor_id() % CONFIG_GENERIC_SMP_QUEUES;
+ unsigned queue = smp_ipi_choose_queue();
allbutself = cpu_online_map;
cpu_clear(smp_processor_id(), allbutself);
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -373,7 +373,7 @@
static void xen_smp_send_call_function_ipi(cpumask_t mask)
{
int cpu;
- unsigned queue = smp_processor_id() % CONFIG_GENERIC_SMP_QUEUES;
+ unsigned queue = smp_ipi_choose_queue();
/*
* We can't afford to allocate N callfunc vectors * M cpu
@@ -411,10 +411,12 @@
unsigned queue;
...Now that smp_call_function_mask exists and is scalable, there's no
reason to have a special TLB flush IPI. This saves a mass of code.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
arch/x86/kernel/irqinit_32.c | 3 -
arch/x86/kernel/tlb_32.c | 86 +++++------------------------
include/asm-x86/irq_vectors.h | 1
include/asm-x86/mach-default/entry_arch.h | 1
4 files changed, 17 insertions(+), 74 deletions(-)
diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -120,9 +120,6 @@
*/
alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
- /* IPI for invalidation */
- alloc_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
-
/* IPI for generic function call */
alloc_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -1,14 +1,12 @@
-#include <linux/spinlock.h>
#include <linux/cpu.h>
#include <linux/interrupt.h>
+#include <linux/smp.h>
+#include <linux/percpu.h>
#include <asm/tlbflush.h>
DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate)
____cacheline_aligned = { &init_mm, 0, };
-
-/* must come after the send_IPI functions above for inlining */
-#include <mach_ipi.h>
/*
* Smarter SMP flushing macros.
@@ -20,10 +18,10 @@
* Optimizations Manfred Spraul <manfred@colorfullife.com>
*/
-static cpumask_t flush_cpumask;
-static struct mm_struct *flush_mm;
-static unsigned long flush_va;
-static DEFINE_SPINLOCK(tlbstate_lock);
+struct tlb_flush {
+ struct mm_struct *mm;
+ unsigned long va;
+};
/*
* We cannot call mmdrop() because we are in interrupt context,
@@ -87,37 +85,23 @@
* 2) Leave the mm if we are in the lazy tlb mode.
*/
-void smp_invalidate_interrupt(struct pt_regs *regs)
+static void ...This adds 8 queues for smp_call_function(), in order to avoid a bottleneck on a single global lock and list for function calls. When initiating a function call, the sender chooses a queue based on its own processor id (if there are more than 8 processors, they hash down to 8 queues). It then sends an IPI to the corresponding vector for that queue to each target CPU. The target CPUs use the vector number to determine which queue they should scan for work. This should give smp_call_function the same performance characteristics as the original x86-64 cross-cpu tlb flush code, which used the same scheme. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/x86/Kconfig | 4 ++++ arch/x86/kernel/entry_32.S | 25 ++++++++++++++----------- arch/x86/kernel/entry_64.S | 19 ++++++++++++++++--- arch/x86/kernel/irqinit_32.c | 11 ++++++++++- arch/x86/kernel/irqinit_64.c | 11 ++++++++++- arch/x86/kernel/smp.c | 10 +++++++--- arch/x86/xen/smp.c | 27 ++++++++++++++++++++++++++- include/asm-x86/hw_irq.h | 20 +++++++++----------- include/asm-x86/irq_vectors.h | 6 ++++-- include/asm-x86/mach-default/entry_arch.h | 16 +++++++++++++++- 10 files changed, 115 insertions(+), 34 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -181,6 +181,10 @@ depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64) select USE_GENERIC_SMP_HELPERS default y + +config GENERIC_SMP_QUEUES + int + default "8" config X86_32_SMP def_bool y diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S --- a/arch/x86/kernel/entry_32.S +++ b/arch/x86/kernel/entry_32.S @@ -688,18 +688,21 @@ ENDPROC(common_interrupt) CFI_ENDPROC +#define __BUILD_INTERRUPT(name, func, ...
This patch allows an architecture to have multiple smp_call_function queues, in order to reduce contention on a single queue and lock. By default there is still one queue, so this patch makes no functional change. However, an architecture can set CONFIG_GENERIC_SMP_QUEUES to enable a certain number of queues. It's expected it will will set up an IPI vector for each queue, and it should pass the appropriate queue number into generic_smp_call_function_interrupt. Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com> --- arch/alpha/kernel/smp.c | 2 - arch/arm/kernel/smp.c | 2 - arch/ia64/kernel/smp.c | 2 - arch/m32r/kernel/smp.c | 2 - arch/mips/kernel/smp.c | 2 - arch/parisc/kernel/smp.c | 2 - arch/powerpc/kernel/smp.c | 2 - arch/sparc64/kernel/smp.c | 2 - arch/x86/kernel/smp.c | 2 - arch/x86/mach-voyager/voyager_smp.c | 2 - arch/x86/xen/smp.c | 2 - include/linux/smp.h | 2 - kernel/smp.c | 70 ++++++++++++++++++++++++++++------- 13 files changed, 69 insertions(+), 25 deletions(-) diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c --- a/arch/alpha/kernel/smp.c +++ b/arch/alpha/kernel/smp.c @@ -588,7 +588,7 @@ break; case IPI_CALL_FUNC: - generic_smp_call_function_interrupt(); + generic_smp_call_function_interrupt(0); break; case IPI_CALL_FUNC_SINGLE: diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c --- a/arch/arm/kernel/smp.c +++ b/arch/arm/kernel/smp.c @@ -481,7 +481,7 @@ break; case IPI_CALL_FUNC: - generic_smp_call_function_interrupt(); + generic_smp_call_function_interrupt(0); break; case IPI_CALL_FUNC_SINGLE: diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c --- a/arch/ia64/kernel/smp.c +++ b/arch/ia64/kernel/smp.c @@ -114,7 +114,7 @@ ...
nice stuff! I suspect the extra cost might be worth it for two reasons: 1) we could optimize the cross-call implementation further 2) on systems where TLB flushes actually matter, the ability to overlap multiple TLB flushes to the same single CPU might improve workloads. FYI, i've created a new -tip topic for your patches, tip/x86/tlbflush. It's based on tip/irq/sparseirq (there are a good deal of dependencies with that topic). It would be nice to see some numbers on sufficiently SMP systems, using some mmap/munmap intense workload. Ingo --
i threw it into -tip testing for a while - triggered the lockdep warning
on 64-bit below.
Ingo
------------>
checking TSC synchronization [CPU#0 -> CPU#1]: passed.
=============================================
[ INFO: possible recursive locking detected ]
2.6.27-rc3-tip #1
---------------------------------------------
swapper/0 is trying to acquire lock:
(&call_function_queues[i].lock){....}, at: [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e
but task is already holding lock:
(&call_function_queues[i].lock){....}, at: [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e
other info that might help us debug this:
1 lock held by swapper/0:
#0: (&call_function_queues[i].lock){....}, at: [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e
stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.27-rc3-tip #1
Call Trace:
[<ffffffff802665c2>] validate_chain+0x53e/0xc26
[<ffffffff80263814>] ? trace_hardirqs_off_caller+0x21/0xa4
[<ffffffff802638a4>] ? trace_hardirqs_off+0xd/0xf
[<ffffffff802673d6>] __lock_acquire+0x72c/0x795
[<ffffffff802674cc>] lock_acquire+0x8d/0xba
[<ffffffff8026cbba>] ? ipi_call_lock_irq+0x25/0x2e
[<ffffffff808aa83c>] _spin_lock_irq+0x44/0x74
[<ffffffff8026cbba>] ? ipi_call_lock_irq+0x25/0x2e
[<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e
[<ffffffff808a4acf>] start_secondary+0x127/0x18e
Brought up 2 CPUs
Total of 2 processors activated (11733.31 BogoMIPS).
--
I think this might be a spurious "holding multiple locks in the same
class" bug. All the queue locks are presumably in the same class, and
ipi_call_lock_irq() wants to hold them all to lock out any IPIs.
Spurious because this is the only place which holds more than one queue
lock, and it always locks 0->N.
I guess the fix is to use an outer lock and use spin_lock_nested() (now
that it exists). Something along these lines?
J
diff -r 22ebc3296a6f kernel/smp.c
--- a/kernel/smp.c Mon Aug 18 15:12:14 2008 -0700
+++ b/kernel/smp.c Mon Aug 18 22:52:22 2008 -0700
@@ -18,6 +18,9 @@
#else
#define NQUEUES 1
#endif
+
+/* Hold queues_lock when taking more than one queue[].lock at once */
+static DEFINE_SPINLOCK(queues_lock);
static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
struct ____cacheline_aligned queue {
@@ -446,8 +449,10 @@
{
int i;
+ spin_lock_irq(&queues_lock);
+
for(i = 0; i < NQUEUES; i++)
- spin_lock_irq(&call_function_queues[i].lock);
+ spin_lock_nest_lock(&call_function_queues[i].lock, &queues_lock);
}
void ipi_call_unlock_irq(void)
@@ -455,7 +460,9 @@
int i;
for(i = 0; i < NQUEUES; i++)
- spin_unlock_irq(&call_function_queues[i].lock);
+ spin_unlock(&call_function_queues[i].lock);
+
+ spin_unlock_irq(&queues_lock);
}
--
because it adds an artificial extra spinlock for no good reason and weakens the lock dependency checking as well. Just add a lock class descriptor to each call_function_queue lock, so that lockdep can see that it's truly all in the correct order. I.e. dont turn lockdep off artificially. Ingo --
Are you sure? I thought this is exactly the case where
spin_lock_nest_lock() is supposed to be used? Admittedly it's very
simple, but the extra lock does two things: 1) it guarantees that the
queue locks can be taken in any order, and 2) it's a single lock on
which we can do spin_lock_irq(), rather than doing it in the loop for
each individual lock (which I think was bogus).
I don't see how it weakens lockdep in any way. Does it hide any
potential lock misuse?
J
--
spin_lock_nested() has existed for a long time, but you are now using spin_lock_nest_lock(), which is a totally different annotation --
Yes, typo. The patch itself uses spin_lock_nest_lock().
J
--
Unfortunately, I think the kmalloc fix for the RCU issue is going to
Really? I didn't see much conflict when rebasing onto current tip.git.
I've attached my test program: tlb-mash.c. Compile with "gcc -o
tlb-mash tlb-mash.c -lpthread" and run with ./tlb-mash X, where X is the
number of threads to run (2x cpus works well). It keeps running until
killed, with each thread repeatedly mprotecting a page within a shared
mapping.
J
yeah :-( Nick, is there any way to get rid of that kmalloc() in the async function call case? The whole premise of the smp_function_call() rewrite was that it's faster - and now it's measurably slower. At least we could/should perhaps standardize/generalize all the 'specific' IPI handlers into the smp_function_call() framework: if function address equals to a pre-cooked IPI entry point we could call that function without a kmalloc. As these are all hardwired, __builtin_is_constant_p() could come to the help as well. Hm? Ingo --
Not quite. smp_call_function_single is much faster, it is now scalable, and it is queueing (and only needs a single IPI to submit multiple requests if the target isn't keeping up). The rewrite is meant primarily to speed up call single (for really interesting things like block request completion migration). Before that, it was totally useless for anything remotely performance critical. A secondary goal was to make smp_call_function_mask at least somewhat scalable. smp_call_function_mask used to have to execute the entire call and wait-for-ack-from-all under a global lock (shared by call_function_single, mind you). Can't get a whole lot more serialised than that. I wanted to improve this to improve vmalloc flushing scalability. There wasn't much other performance critical stuff that used it. For that guy -- as I said, we could possibly look at retuning to a non queueing implementation to avoid the kmalloc... but I'm not so hopeful that it would bring TLB flushing to parity. And scalability would No, it's not just the function call but also payload, list entry for queue, scoreboard of CPUs have processed it, a lock, etc etc etc. smp_call_function is *always* going to be heavier than a hard wired special case, no matter how it is implemented. For such low level performance critical functionality, I miss the days when people were rabid about saving every cycle rather than every line of code ;) I'm especially sore about mmap because I have a customer with a database that uses a lot of mmap/munmap and those calls have slowed down something like 50%(!!) from 2.6.early to 2.6.late. Put another way: if TLB flushing were currently using smp_call_function, I would be very happy to submit a patch to have it use a hardwired call scheme even if it only gained 1% improvement (in a realistic case). Just let me reiterate that I would love anybody to make smp_call_function go faster, or unify special case TLB flushing *if it no longer makes sense to have*. --
oh, agreed that it all needs to speed up - please send testcases ... Ingo --
I'm not sure if I can send the testcase I have. But this issue is coming to the top of my list to investigate in mainline, so I'll check out how we're doing here soon. --
obviously having a repeatable testcase would jump-start any efforts to fix this. It doesnt have to be complex - just a .c thing kernel developers can run and get some quick number out of. A 50% regression shouldnt be too hard to reproduce even with a vastly simplified workload. Ingo --
no, i was thinking about really high level hardwiring, i.e. hardwiring the _function pointer_ knowledge into smp_function_call(). for example for the reschedule IPI, it would be hardwired on x86 to just call into the special IPI handler, via: smp_call_function_mask(target_mask, smp_send_reschedule, NULL, 0); Exactly same cost and call sequence as a direct hardwired-to-IPI function call (and the same underlying mechanism) - just consolidated around a single cross-call API. Same for all the other special cross-CPU handlers. That way some architectures would hardwire it, some wouldnt, etc. Ingo --
So it would magically use a different IPI vector, magically be callable from interrupt context, have a different function type etc.? Why would we do such a thing? If it doesn't walk like a duck and doesn't quack I think the right way is just to keep the "special" classes of IPI-ish functions (I guess reschedule and TLB flush importantly). The arch code is still free to implement these using their call function or a special vector or whatever. --
These are small mmaps right? Also in this case no kmalloc should be needed anyways. AFAIK mmap flushing hasn't changed much in 2.6 and it tends to batch well anyways in this case (unlike vmscan swapping). I would be careful to really optimize the real culprits which are likely elsewhere. -Andi --
It wasn't actually the TLB flushing side of it that was causing the slowdown IIRC. It's just all over the map. Notifier hooks; accounting statistics; 4lpt; cond_resched and low latency code causing functions to spill more to stack; cache misses from data structures increasing or becoming unaligned... Basically just lots of little straws that added up to kill the camel. I didn't even get to the bottom of the whole thing. But my point is that even 1% here and there eventually adds up to a big headache for someone. For some features it is obviously inevitable to slowdown, but in all other cases we should always be aiming to make the kernel faster rather than slower. --
Hmm, on a benchmark here a simple anonymous mmap+munmap is ~3800 cycles. Was it ever really that much faster? There is a great classical email floating around how such 1% regressions It's hard to catch such regressions later. I wonder if we really need some kind of mini benchmark collection that is regularly run and that checks latency of such micro operation and points out regressions when they happen. AFAIK the OpenSolaris people have something like that. --
> nice stuff! If this is getting seriously optimized it would be useful to address this at a slightly higher level in the generic MM. As in expose the queue on the TLB flush interfaces. While munmap and exit are pretty good at batching flushes vmscan.c is terrible and tends to do all the flushing on clearing pages synchronously. Back in 2.4 I at some point ran into a nasty livelock where one CPU would always flush the other and the other would always bounce some mm locks with the other and the CPUs were nearly 100% busy just doing while swapping. Fortunately that issue disappeared with 2.6, but I still fear a little it will come back some day. Also with these queued invalidates right now there is duplication of data structures with the delayed mmu gather flusher in asm-generic/tlb.h. Both have their own queue. If the low level flusher is queuing anyways it might be useful to do a x86 specific implementation that combines the both and frees the pages once the queue has been processed. Perhaps that could amortize part of the 35%? -Andi --
Don't pay attention to that number. It's only the extra latency of a HT
context->HT context function call + tlb flush. Which means 1) the tlb
is shared anyway, so the extra flush is redundant, 2) they're not really
concurrent, 3) it's going down the single-cpu call, rather than the
multi-cpu one, 4) it's only measuring the latency for a particular tlb
flush, and doesn't take into account any throughput improvements the
extra queueing may add.
I don't really know what the nett consequence of all that is, but it
means the number has almost no bearing on multicore results, esp with >2
cores.
J
--
A lot of flushes are synchronous. See the rest of my email that you snipped. -Andi --
Yep, sure, but I just wanted to be clear that the 35% number is almost
useless in isolation, and the first step is to do proper measurements.
J
--
