Re: [PATCH 0 of 9] x86/smp function calls: convert x86 tlb flushes to use function calls [POST 2]

Previous thread: [PATCH] OSS: duplicate AFMT_S24_LE by roel kluin on Monday, August 18, 2008 - 8:43 pm. (1 message)

Next thread: [PATCH] virtio_balloon: fix towards_target when deflating balloon by Anthony Liguori on Monday, August 18, 2008 - 3:15 pm. (5 messages)
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

This series:
 - adds a simple debugfs profiling entry for cross-cpu tlb flushes
 - converts them to using smp_call_function_mask
 - unifies 32 and 64-bit tlb flushes
 - converts smp_call_function to using multiple queues (using the now
   freed vectors)
 - allows config-time adjustment of the number of queues
 - adds a kernel parameter to disable multi-queue in case it causes
   problems

The main concern is whether using smp_call_function adds an
unacceptible performance hit to cross-cpu tlb flushes.  My limited
measurements show a ~35% regression in latency for a particular flush;
it would be interesting to try this on a wider range of hardware.  I
gather the effect tlb flush performance is very application specific
as well, but I'm not sure what benchmarks show what effects.

Trading off agains the latency of a given flush, the smp_function_call
mechanism allows multiple requests to be queued, and so may improve
throughput on a system-wide basis.

So, I'd like people to try this out and see what performance effects it
has.

Thanks,
	J


--

From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

arch/x86/kernel/tlb_*.c are functionally identical, so unify them.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/Makefile |    2
 arch/x86/kernel/tlb.c    |  268 ++++++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kernel/tlb_32.c |  190 --------------------------------
 arch/x86/kernel/tlb_64.c |  235 ----------------------------------------
 4 files changed, 269 insertions(+), 426 deletions(-)

diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -60,7 +60,7 @@
 apm-y				:= apm_32.o
 obj-$(CONFIG_APM)		+= apm.o
 obj-$(CONFIG_X86_SMP)		+= smp.o
-obj-$(CONFIG_X86_SMP)		+= smpboot.o tsc_sync.o ipi.o tlb_$(BITS).o
+obj-$(CONFIG_X86_SMP)		+= smpboot.o tsc_sync.o ipi.o tlb.o
 obj-$(CONFIG_X86_32_SMP)	+= smpcommon.o
 obj-$(CONFIG_X86_64_SMP)	+= tsc_sync.o smpcommon.o
 obj-$(CONFIG_X86_TRAMPOLINE)	+= trampoline_$(BITS).o
diff --git a/arch/x86/kernel/tlb.c b/arch/x86/kernel/tlb.c
new file mode 100644
--- /dev/null
+++ b/arch/x86/kernel/tlb.c
@@ -0,0 +1,268 @@
+#include <linux/smp.h>
+#include <linux/interrupt.h>
+#include <linux/module.h>
+#include <linux/debugfs.h>
+
+#include <asm/tlbflush.h>
+#include <asm/timer.h>
+
+/* For UV tlb flush */
+#include <asm/uv/uv_hub.h>
+#include <asm/uv/uv_bau.h>
+#include <asm/genapic.h>	/* for is_uv_system */
+
+/*
+ *	Smarter SMP flushing macros.
+ *		c/o Linus Torvalds.
+ *
+ *	These mean you can really definitely utterly forget about
+ *	writing to user space from interrupts. (Its not allowed anyway).
+ *
+ *	Optimizations Manfred Spraul <manfred@colorfullife.com>
+ */
+
+#ifdef CONFIG_X86_32
+DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate)
+			____cacheline_aligned = { &init_mm, 0, };
+
+static inline short get_mmu_state(void)
+{
+	return __get_cpu_var(cpu_tlbstate).state;
+}
+
+static inline struct mm_struct *get_active_mm(void)
+{
+	return ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

Bring arch/x86/kernel/tlb_32.c and _64.c closer into alignment, so
that unification is more straightforward.  After this patch, the
remaining differences come down to UV support and the distinction
between percpu and pda variables.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/tlb_32.c |   39 +++++++++++++++++++--------------------
 arch/x86/kernel/tlb_64.c |   12 +++++++-----
 2 files changed, 26 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -1,7 +1,7 @@
-#include <linux/cpu.h>
+#include <linux/smp.h>
 #include <linux/interrupt.h>
-#include <linux/smp.h>
 #include <linux/percpu.h>
+#include <linux/module.h>
 
 #include <asm/tlbflush.h>
 
@@ -46,25 +46,25 @@
  * 1) switch_mm() either 1a) or 1b)
  * 1a) thread switch to a different mm
  * 1a1) cpu_clear(cpu, old_mm->cpu_vm_mask);
- * 	Stop ipi delivery for the old mm. This is not synchronized with
- * 	the other cpus, but smp_invalidate_interrupt ignore flush ipis
- * 	for the wrong mm, and in the worst case we perform a superfluous
- * 	tlb flush.
+ *	Stop ipi delivery for the old mm. This is not synchronized with
+ *	the other cpus, but tlb_invalidate() ignores flush ipis
+ *	for the wrong mm, and in the worst case we perform a superfluous
+ *	tlb flush.
  * 1a2) set cpu_tlbstate to TLBSTATE_OK
  * 	Now the smp_invalidate_interrupt won't call leave_mm if cpu0
  *	was in lazy tlb mode.
  * 1a3) update cpu_tlbstate[].active_mm
- * 	Now cpu0 accepts tlb flushes for the new mm.
+ *	Now cpu0 accepts tlb flushes for the new mm.
  * 1a4) cpu_set(cpu, new_mm->cpu_vm_mask);
- * 	Now the other cpus will send tlb flush ipis.
+ *	Now the other cpus will send tlb flush ipis.
  * 1a4) change cr3.
  * 1b) thread switch without mm change
  *	cpu_tlbstate[].active_mm is correct, cpu0 already handles
  *	flush ipis.
  * 1b1) set cpu_tlbstate to TLBSTATE_OK
  * 1b2) ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

Now that smp_call_function_mask exists and is scalable, there's no
reason to have a special TLB flush IPI.  This saves a mass of code.

In the process, I removed a copy of a cpumask_t.  The UV tlb flush
code relies on that copy, so I propagated it down.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Cliff Wickman <cpw@sgi.com>
---
 arch/x86/kernel/entry_64.S    |   15 ----
 arch/x86/kernel/irqinit_64.c  |   10 ---
 arch/x86/kernel/tlb_64.c      |  124 ++++++-----------------------------------
 arch/x86/kernel/tlb_uv.c      |    5 -
 include/asm-x86/irq_vectors.h |    4 -
 include/asm-x86/uv/uv_bau.h   |    2
 6 files changed, 24 insertions(+), 136 deletions(-)

diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -869,21 +869,6 @@
 	apicinterrupt RESCHEDULE_VECTOR,smp_reschedule_interrupt
 END(reschedule_interrupt)
 
-	.macro INVALIDATE_ENTRY num
-ENTRY(invalidate_interrupt\num)
-	apicinterrupt INVALIDATE_TLB_VECTOR_START+\num,smp_invalidate_interrupt	
-END(invalidate_interrupt\num)
-	.endm
-
-	INVALIDATE_ENTRY 0
-	INVALIDATE_ENTRY 1
-	INVALIDATE_ENTRY 2
-	INVALIDATE_ENTRY 3
-	INVALIDATE_ENTRY 4
-	INVALIDATE_ENTRY 5
-	INVALIDATE_ENTRY 6
-	INVALIDATE_ENTRY 7
-
 ENTRY(call_function_interrupt)
 	apicinterrupt CALL_FUNCTION_VECTOR,smp_call_function_interrupt
 END(call_function_interrupt)
diff --git a/arch/x86/kernel/irqinit_64.c b/arch/x86/kernel/irqinit_64.c
--- a/arch/x86/kernel/irqinit_64.c
+++ b/arch/x86/kernel/irqinit_64.c
@@ -187,16 +187,6 @@
 	 */
 	alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
 
-	/* IPIs for invalidation */
-	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+0, invalidate_interrupt0);
-	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+1, invalidate_interrupt1);
-	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+2, invalidate_interrupt2);
-	alloc_intr_gate(INVALIDATE_TLB_VECTOR_START+3, ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

Put tlb_flush_others() latency measurements in debugfs.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/tlb_64.c |   38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/arch/x86/kernel/tlb_64.c b/arch/x86/kernel/tlb_64.c
--- a/arch/x86/kernel/tlb_64.c
+++ b/arch/x86/kernel/tlb_64.c
@@ -7,6 +7,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/mc146818rtc.h>
 #include <linux/interrupt.h>
+#include <linux/debugfs.h>
 
 #include <asm/mtrr.h>
 #include <asm/pgalloc.h>
@@ -17,6 +18,8 @@
 #include <asm/idle.h>
 #include <asm/uv/uv_hub.h>
 #include <asm/uv/uv_bau.h>
+
+#include <asm/timer.h>
 
 #include <mach_ipi.h>
 /*
@@ -157,15 +160,29 @@
 	add_pda(irq_tlb_count, 1);
 }
 
+#ifdef CONFIG_DEBUG_FS
+static spinlock_t tlbflush_others_lock = __SPIN_LOCK_UNLOCKED(tlb_flush_others_lock);
+static u32 tlbflush_others_count;
+static u64 tlbflush_others_accum;
+static u8 tlbflush_others_enable;
+#else
+#define tlbflush_others_enable 0
+#endif
+
 void native_flush_tlb_others(const cpumask_t *cpumaskp, struct mm_struct *mm,
 			     unsigned long va)
 {
 	int sender;
 	union smp_flush_state *f;
 	cpumask_t cpumask = *cpumaskp;
+	u8 timing_enabled = tlbflush_others_enable;
+	u64 uninitialized_var(start), end;
 
 	if (is_uv_system() && uv_flush_tlb_others(&cpumask, mm, va))
 		return;
+
+	if (timing_enabled)
+		rdtscll(start);
 
 	/* Caller has disabled preemption */
 	sender = smp_processor_id() % NUM_INVALIDATE_TLB_VECTORS;
@@ -194,6 +211,15 @@
 	f->flush_mm = NULL;
 	f->flush_va = 0;
 	spin_unlock(&f->tlbstate_lock);
+
+	if (timing_enabled) {
+		rdtscll(end);
+
+		spin_lock(&tlbflush_others_lock);
+		tlbflush_others_count++;
+		tlbflush_others_accum += cycles_2_ns(end - start);
+		spin_unlock(&tlbflush_others_lock);
+	}
 }
 
 static int __cpuinit init_smp_flush(void)
@@ -277,3 +303,15 @@
 {
 	on_each_cpu(do_flush_tlb_all, NULL, 1);
 }
+
+#ifdef ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

Previously, the config option GENERIC_SMP_QUEUES was never actually
adjustable because a moderate amount of static boilerplate code had to
match.

This patch makes it truely adjustable up to the number of IPI vectors
we statically preallocate.  The various IPI handlers are generated by
looping assembler macros, and their addresses are put into an array
for C code to use to actually set the vectors up.

The default number of queues is never more than NR_CPUS, and defaults
to NR_CPUS.  Ideally this could be a runtime variable based on the
number of possible cpus, rather than a static compile-time options.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/Kconfig                          |    8 +++++++-
 arch/x86/kernel/entry_32.S                |   12 ++++++------
 arch/x86/kernel/entry_64.S                |   29 ++++++++++++++++-------------
 arch/x86/kernel/irqinit_32.c              |   16 ++++++----------
 arch/x86/kernel/irqinit_64.c              |   16 ++++++----------
 include/asm-x86/hw_irq.h                  |    9 +--------
 include/asm-x86/irq_vectors.h             |    3 +++
 include/asm-x86/mach-default/entry_arch.h |   29 ++++++++++++++++-------------
 8 files changed, 61 insertions(+), 61 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -182,9 +182,15 @@
 	select USE_GENERIC_SMP_HELPERS
 	default y
 
+config MAX_GENERIC_SMP_QUEUES
+       int
+       default 8
+
 config GENERIC_SMP_QUEUES
        int
-       default "8"
+       range 1 NR_CPUS if NR_CPUS <= MAX_GENERIC_SMP_QUEUES
+       range 1 MAX_GENERIC_SMP_QUEUES
+       default NR_CPUS
 
 config X86_32_SMP
 	def_bool y
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -688,8 +688,7 @@
 ENDPROC(common_interrupt)
 	CFI_ENDPROC
 
-#define __BUILD_INTERRUPT(name, func, nr)	\
-ENTRY(name)					\
+#define ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

There was some concern that using multiple queues - and their
associated APIC vectors - may trigger bugs in various dubious APIC
implementations.  This patch adds a command line option to force the
kernel to use a single queue, even if the architecture can support
more.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 Documentation/kernel-parameters.txt |    5 +++++
 arch/x86/kernel/smp.c               |    2 +-
 arch/x86/xen/smp.c                  |    6 ++++--
 include/linux/smp.h                 |   26 ++++++++++++++++++++++++++
 kernel/smp.c                        |   14 +++++++++++++-
 5 files changed, 49 insertions(+), 4 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1901,6 +1901,11 @@
 	simeth=		[IA-64]
 	simscsi=
 
+	single_ipi_queue [X86]
+			Force the use of a single queue for smp function calls.
+			This means that only a single vector is used, which may avoid
+			bugs in some APIC implementations.
+
 	slram=		[HW,MTD]
 
 	slub_debug[=options[,slabs]]	[MM, SLUB]
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -129,7 +129,7 @@
 void native_send_call_func_ipi(cpumask_t mask)
 {
 	cpumask_t allbutself;
-	unsigned queue = smp_processor_id() % CONFIG_GENERIC_SMP_QUEUES;
+	unsigned queue = smp_ipi_choose_queue();
 
 	allbutself = cpu_online_map;
 	cpu_clear(smp_processor_id(), allbutself);
diff --git a/arch/x86/xen/smp.c b/arch/x86/xen/smp.c
--- a/arch/x86/xen/smp.c
+++ b/arch/x86/xen/smp.c
@@ -373,7 +373,7 @@
 static void xen_smp_send_call_function_ipi(cpumask_t mask)
 {
 	int cpu;
-	unsigned queue = smp_processor_id() % CONFIG_GENERIC_SMP_QUEUES;
+	unsigned queue = smp_ipi_choose_queue();
 
 	/*
 	 * We can't afford to allocate N callfunc vectors * M cpu
@@ -411,10 +411,12 @@
 	unsigned queue;
 
 ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

Now that smp_call_function_mask exists and is scalable, there's no
reason to have a special TLB flush IPI.  This saves a mass of code.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/irqinit_32.c              |    3 -
 arch/x86/kernel/tlb_32.c                  |   86 +++++------------------------
 include/asm-x86/irq_vectors.h             |    1
 include/asm-x86/mach-default/entry_arch.h |    1
 4 files changed, 17 insertions(+), 74 deletions(-)

diff --git a/arch/x86/kernel/irqinit_32.c b/arch/x86/kernel/irqinit_32.c
--- a/arch/x86/kernel/irqinit_32.c
+++ b/arch/x86/kernel/irqinit_32.c
@@ -120,9 +120,6 @@
 	 */
 	alloc_intr_gate(RESCHEDULE_VECTOR, reschedule_interrupt);
 
-	/* IPI for invalidation */
-	alloc_intr_gate(INVALIDATE_TLB_VECTOR, invalidate_interrupt);
-
 	/* IPI for generic function call */
 	alloc_intr_gate(CALL_FUNCTION_VECTOR, call_function_interrupt);
 
diff --git a/arch/x86/kernel/tlb_32.c b/arch/x86/kernel/tlb_32.c
--- a/arch/x86/kernel/tlb_32.c
+++ b/arch/x86/kernel/tlb_32.c
@@ -1,14 +1,12 @@
-#include <linux/spinlock.h>
 #include <linux/cpu.h>
 #include <linux/interrupt.h>
+#include <linux/smp.h>
+#include <linux/percpu.h>
 
 #include <asm/tlbflush.h>
 
 DEFINE_PER_CPU(struct tlb_state, cpu_tlbstate)
 			____cacheline_aligned = { &init_mm, 0, };
-
-/* must come after the send_IPI functions above for inlining */
-#include <mach_ipi.h>
 
 /*
  *	Smarter SMP flushing macros.
@@ -20,10 +18,10 @@
  *	Optimizations Manfred Spraul <manfred@colorfullife.com>
  */
 
-static cpumask_t flush_cpumask;
-static struct mm_struct *flush_mm;
-static unsigned long flush_va;
-static DEFINE_SPINLOCK(tlbstate_lock);
+struct tlb_flush {
+	struct mm_struct *mm;
+	unsigned long va;
+};
 
 /*
  * We cannot call mmdrop() because we are in interrupt context,
@@ -87,37 +85,23 @@
  * 2) Leave the mm if we are in the lazy tlb mode.
  */
 
-void smp_invalidate_interrupt(struct pt_regs *regs)
+static void ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

This adds 8 queues for smp_call_function(), in order to avoid a
bottleneck on a single global lock and list for function calls.  When
initiating a function call, the sender chooses a queue based on its
own processor id (if there are more than 8 processors, they hash down
to 8 queues).  It then sends an IPI to the corresponding vector for
that queue to each target CPU.  The target CPUs use the vector number
to determine which queue they should scan for work.

This should give smp_call_function the same performance
characteristics as the original x86-64 cross-cpu tlb flush code, which
used the same scheme.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/Kconfig                          |    4 ++++
 arch/x86/kernel/entry_32.S                |   25 ++++++++++++++-----------
 arch/x86/kernel/entry_64.S                |   19 ++++++++++++++++---
 arch/x86/kernel/irqinit_32.c              |   11 ++++++++++-
 arch/x86/kernel/irqinit_64.c              |   11 ++++++++++-
 arch/x86/kernel/smp.c                     |   10 +++++++---
 arch/x86/xen/smp.c                        |   27 ++++++++++++++++++++++++++-
 include/asm-x86/hw_irq.h                  |   20 +++++++++-----------
 include/asm-x86/irq_vectors.h             |    6 ++++--
 include/asm-x86/mach-default/entry_arch.h |   16 +++++++++++++++-
 10 files changed, 115 insertions(+), 34 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -181,6 +181,10 @@
 	depends on SMP && ((X86_32 && !X86_VOYAGER) || X86_64)
 	select USE_GENERIC_SMP_HELPERS
 	default y
+
+config GENERIC_SMP_QUEUES
+       int
+       default "8"
 
 config X86_32_SMP
 	def_bool y
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -688,18 +688,21 @@
 ENDPROC(common_interrupt)
 	CFI_ENDPROC
 
+#define __BUILD_INTERRUPT(name, func, ...
From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:23 am

This patch allows an architecture to have multiple smp_call_function
queues, in order to reduce contention on a single queue and lock.  By
default there is still one queue, so this patch makes no functional
change.

However, an architecture can set CONFIG_GENERIC_SMP_QUEUES to enable a
certain number of queues.  It's expected it will will set up an IPI
vector for each queue, and it should pass the appropriate queue number
into generic_smp_call_function_interrupt.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/alpha/kernel/smp.c             |    2 -
 arch/arm/kernel/smp.c               |    2 -
 arch/ia64/kernel/smp.c              |    2 -
 arch/m32r/kernel/smp.c              |    2 -
 arch/mips/kernel/smp.c              |    2 -
 arch/parisc/kernel/smp.c            |    2 -
 arch/powerpc/kernel/smp.c           |    2 -
 arch/sparc64/kernel/smp.c           |    2 -
 arch/x86/kernel/smp.c               |    2 -
 arch/x86/mach-voyager/voyager_smp.c |    2 -
 arch/x86/xen/smp.c                  |    2 -
 include/linux/smp.h                 |    2 -
 kernel/smp.c                        |   70 ++++++++++++++++++++++++++++-------
 13 files changed, 69 insertions(+), 25 deletions(-)

diff --git a/arch/alpha/kernel/smp.c b/arch/alpha/kernel/smp.c
--- a/arch/alpha/kernel/smp.c
+++ b/arch/alpha/kernel/smp.c
@@ -588,7 +588,7 @@
 			break;
 
 		case IPI_CALL_FUNC:
-			generic_smp_call_function_interrupt();
+			generic_smp_call_function_interrupt(0);
 			break;
 
 		case IPI_CALL_FUNC_SINGLE:
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -481,7 +481,7 @@
 				break;
 
 			case IPI_CALL_FUNC:
-				generic_smp_call_function_interrupt();
+				generic_smp_call_function_interrupt(0);
 				break;
 
 			case IPI_CALL_FUNC_SINGLE:
diff --git a/arch/ia64/kernel/smp.c b/arch/ia64/kernel/smp.c
--- a/arch/ia64/kernel/smp.c
+++ b/arch/ia64/kernel/smp.c
@@ -114,7 +114,7 @@
 ...
From: Ingo Molnar
Date: Monday, August 18, 2008 - 5:45 pm

nice stuff!

I suspect the extra cost might be worth it for two reasons: 1) we could 
optimize the cross-call implementation further 2) on systems where TLB 
flushes actually matter, the ability to overlap multiple TLB flushes to 
the same single CPU might improve workloads.

FYI, i've created a new -tip topic for your patches, tip/x86/tlbflush. 
It's based on tip/irq/sparseirq (there are a good deal of dependencies 
with that topic).

It would be nice to see some numbers on sufficiently SMP systems, using 
some mmap/munmap intense workload.

	Ingo
--

From: Ingo Molnar
Date: Monday, August 18, 2008 - 6:28 pm

i threw it into -tip testing for a while - triggered the lockdep warning 
on 64-bit below.

	Ingo

------------>
checking TSC synchronization [CPU#0 -> CPU#1]: passed.

=============================================
[ INFO: possible recursive locking detected ]
2.6.27-rc3-tip #1
---------------------------------------------
swapper/0 is trying to acquire lock:
 (&call_function_queues[i].lock){....}, at: [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e

but task is already holding lock:
 (&call_function_queues[i].lock){....}, at: [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e

other info that might help us debug this:
1 lock held by swapper/0:
 #0:  (&call_function_queues[i].lock){....}, at: [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e

stack backtrace:
Pid: 0, comm: swapper Not tainted 2.6.27-rc3-tip #1
Call Trace:
 [<ffffffff802665c2>] validate_chain+0x53e/0xc26
 [<ffffffff80263814>] ? trace_hardirqs_off_caller+0x21/0xa4
 [<ffffffff802638a4>] ? trace_hardirqs_off+0xd/0xf
 [<ffffffff802673d6>] __lock_acquire+0x72c/0x795
 [<ffffffff802674cc>] lock_acquire+0x8d/0xba
 [<ffffffff8026cbba>] ? ipi_call_lock_irq+0x25/0x2e
 [<ffffffff808aa83c>] _spin_lock_irq+0x44/0x74
 [<ffffffff8026cbba>] ? ipi_call_lock_irq+0x25/0x2e
 [<ffffffff8026cbba>] ipi_call_lock_irq+0x25/0x2e
 [<ffffffff808a4acf>] start_secondary+0x127/0x18e
Brought up 2 CPUs
Total of 2 processors activated (11733.31 BogoMIPS).
--

From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 11:18 pm

I think this might be a spurious "holding multiple locks in the same
class" bug.  All the queue locks are presumably in the same class, and
ipi_call_lock_irq() wants to hold them all to lock out any IPIs. 
Spurious because this is the only place which holds more than one queue
lock, and it always locks 0->N.

I guess the fix is to use an outer lock and use spin_lock_nested() (now
that it exists).  Something along these lines?

    J

diff -r 22ebc3296a6f kernel/smp.c
--- a/kernel/smp.c	Mon Aug 18 15:12:14 2008 -0700
+++ b/kernel/smp.c	Mon Aug 18 22:52:22 2008 -0700
@@ -18,6 +18,9 @@
 #else
 #define	NQUEUES	1
 #endif
+
+/* Hold queues_lock when taking more than one queue[].lock at once */
+static DEFINE_SPINLOCK(queues_lock);
 
 static DEFINE_PER_CPU(struct call_single_queue, call_single_queue);
 struct ____cacheline_aligned queue {
@@ -446,8 +449,10 @@
 {
 	int i;
 
+	spin_lock_irq(&queues_lock);
+
 	for(i = 0; i < NQUEUES; i++)
-		spin_lock_irq(&call_function_queues[i].lock);
+		spin_lock_nest_lock(&call_function_queues[i].lock, &queues_lock);
 }
 
 void ipi_call_unlock_irq(void)
@@ -455,7 +460,9 @@
 	int i;
 
 	for(i = 0; i < NQUEUES; i++)
-		spin_unlock_irq(&call_function_queues[i].lock);
+		spin_unlock(&call_function_queues[i].lock);
+
+	spin_unlock_irq(&queues_lock);
 }
 
 


--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 2:27 am

because it adds an artificial extra spinlock for no good reason and 
weakens the lock dependency checking as well.

Just add a lock class descriptor to each call_function_queue lock, so 
that lockdep can see that it's truly all in the correct order.

I.e. dont turn lockdep off artificially.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Tuesday, August 19, 2008 - 7:58 am

Are you sure?  I thought this is exactly the case where
spin_lock_nest_lock() is supposed to be used?  Admittedly it's very
simple, but the extra lock does two things: 1) it guarantees that the
queue locks can be taken in any order, and 2) it's a single lock on
which we can do spin_lock_irq(), rather than doing it in the loop for
each individual lock (which I think was bogus).

I don't see how it weakens lockdep in any way.  Does it hide any
potential lock misuse?

    J

--

From: Peter Zijlstra
Date: Tuesday, August 19, 2008 - 2:45 am

spin_lock_nested() has existed for a long time, but you are now using
spin_lock_nest_lock(), which is a totally different annotation



--

From: Jeremy Fitzhardinge
Date: Tuesday, August 19, 2008 - 7:58 am

Yes, typo.  The patch itself uses spin_lock_nest_lock().

    J
--

From: Jeremy Fitzhardinge
Date: Monday, August 18, 2008 - 10:37 pm

Unfortunately, I think the kmalloc fix for the RCU issue is going to


Really?  I didn't see much conflict when rebasing onto current tip.git. 

I've attached my test program: tlb-mash.c.  Compile with "gcc -o
tlb-mash tlb-mash.c -lpthread" and run with ./tlb-mash X, where X is the
number of threads to run (2x cpus works well).  It keeps running until
killed, with each thread repeatedly mprotecting a page within a shared
mapping.

    J
From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 2:31 am

yeah :-(

Nick, is there any way to get rid of that kmalloc() in the async 
function call case? The whole premise of the smp_function_call() rewrite 
was that it's faster - and now it's measurably slower.

At least we could/should perhaps standardize/generalize all the 
'specific' IPI handlers into the smp_function_call() framework: if 
function address equals to a pre-cooked IPI entry point we could call 
that function without a kmalloc. As these are all hardwired, 
__builtin_is_constant_p() could come to the help as well. Hm?

	Ingo
--

From: Nick Piggin
Date: Tuesday, August 19, 2008 - 2:56 am

Not quite. smp_call_function_single is much faster, it is now scalable,
and it is queueing (and only needs a single IPI to submit multiple requests
if the target isn't keeping up). The rewrite is meant primarily to speed up
call single (for really interesting things like block request completion
migration). Before that, it was totally useless for anything remotely
performance critical.

A secondary goal was to make smp_call_function_mask at least somewhat
scalable. smp_call_function_mask used to have to execute the entire
call and wait-for-ack-from-all under a global lock (shared by 
call_function_single, mind you). Can't get a whole lot more serialised
than that. I wanted to improve this to improve vmalloc flushing
scalability. There wasn't much other performance critical stuff that
used it.

For that guy -- as I said, we could possibly look at retuning to a non
queueing implementation to avoid the kmalloc... but I'm not so hopeful
that it would bring TLB flushing to parity. And scalability would

No, it's not just the function call but also payload, list entry for
queue, scoreboard of CPUs have processed it, a lock, etc etc etc.

smp_call_function is *always* going to be heavier than a hard wired
special case, no matter how it is implemented. For such low level
performance critical functionality, I miss the days when people were
rabid about saving every cycle rather than every line of code ;)

I'm especially sore about mmap because I have a customer with a
database that uses a lot of mmap/munmap and those calls have slowed
down something like 50%(!!) from 2.6.early to 2.6.late.

Put another way: if TLB flushing were currently using
smp_call_function, I would be very happy to submit a patch to have
it use a hardwired call scheme even if it only gained 1% improvement
(in a realistic case).

Just let me reiterate that I would love anybody to make
smp_call_function go faster, or unify special case TLB flushing *if
it no longer makes sense to have*.
--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 3:20 am

oh, agreed that it all needs to speed up - please send testcases ...

	Ingo
--

From: Nick Piggin
Date: Tuesday, August 19, 2008 - 4:08 am

I'm not sure if I can send the testcase I have. But this issue is
coming to the top of my list to investigate in mainline, so I'll
check out how we're doing here soon.

--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 4:44 am

obviously having a repeatable testcase would jump-start any efforts to 
fix this. It doesnt have to be complex - just a .c thing kernel 
developers can run and get some quick number out of. A 50% regression 
shouldnt be too hard to reproduce even with a vastly simplified 
workload.

	Ingo
--

From: Ingo Molnar
Date: Tuesday, August 19, 2008 - 3:24 am

no, i was thinking about really high level hardwiring, i.e. hardwiring 
the _function pointer_ knowledge into smp_function_call().

for example for the reschedule IPI, it would be hardwired on x86 to just 
call into the special IPI handler, via:

  smp_call_function_mask(target_mask, smp_send_reschedule, NULL, 0);

Exactly same cost and call sequence as a direct hardwired-to-IPI 
function call (and the same underlying mechanism) - just consolidated 
around a single cross-call API.

Same for all the other special cross-CPU handlers. That way some 
architectures would hardwire it, some wouldnt, etc.

	Ingo
--

From: Nick Piggin
Date: Tuesday, August 19, 2008 - 3:49 am

So it would magically use a different IPI vector, magically be callable
from interrupt context, have a different function type etc.? Why would
we do such a thing? If it doesn't walk like a duck and doesn't quack

I think the right way is just to keep the "special" classes of IPI-ish
functions (I guess reschedule and TLB flush importantly). The arch code
is still free to implement these using their call function or a special
vector or whatever.
--

From: Andi Kleen
Date: Tuesday, August 19, 2008 - 3:31 am

These are small mmaps right? Also in this case no kmalloc should
be needed anyways.

AFAIK mmap flushing hasn't changed much in 2.6 and it tends
to batch well anyways in this case (unlike vmscan swapping). I would be 
careful to really optimize the real culprits which are likely elsewhere.

-Andi
--

From: Nick Piggin
Date: Tuesday, August 19, 2008 - 4:04 am

It wasn't actually the TLB flushing side of it that was causing
the slowdown IIRC. It's just all over the map.

Notifier hooks; accounting statistics; 4lpt; cond_resched and
low latency code causing functions to spill more to stack; cache
misses from data structures increasing or becoming unaligned...

Basically just lots of little straws that added up to kill the
camel. I didn't even get to the bottom of the whole thing. But
my point is that even 1% here and there eventually adds up to a
big headache for someone. For some features it is obviously
inevitable to slowdown, but in all other cases we should always
be aiming to make the kernel faster rather than slower.
--

From: Andi Kleen
Date: Tuesday, August 19, 2008 - 4:20 am

Hmm, on a benchmark here a simple anonymous mmap+munmap is ~3800 cycles.
Was it ever really that much faster? 


There is a great classical email floating around how such 1% regressions

It's hard to catch such regressions later. I wonder if we really need
some kind of mini benchmark collection that is regularly run
and that checks latency of such micro operation and points
out regressions when they happen.
AFAIK the OpenSolaris people have something like that.

--

From: Andi Kleen
Date: Tuesday, August 19, 2008 - 12:32 am

> nice stuff!


If this is getting seriously optimized it would be useful to address
this at a slightly higher level in the generic MM.  As in expose
the queue on the TLB flush interfaces.

While munmap and exit are pretty good at batching flushes vmscan.c is terrible 
and tends to do all the flushing on clearing pages synchronously.

Back in 2.4 I at some point ran into a nasty livelock where 
one CPU would always flush the other and the other would always 
bounce some mm locks with the other and the CPUs were nearly 100% 
busy just doing while swapping. Fortunately that issue disappeared
with 2.6, but I still fear a little it will come back some day.

Also with these queued invalidates right now there is duplication
of data structures with the delayed mmu gather flusher in asm-generic/tlb.h.
Both have their own queue.

If the low level flusher is queuing anyways it might be useful
to do a x86 specific implementation that combines the both and
frees the pages once the queue has been processed. Perhaps
that could amortize part of the 35%? 

-Andi

--

From: Jeremy Fitzhardinge
Date: Tuesday, August 19, 2008 - 12:44 am

Don't pay attention to that number.  It's only the extra latency of a HT
context->HT context function call + tlb flush.  Which means 1) the tlb
is shared anyway, so the extra flush is redundant, 2) they're not really
concurrent, 3) it's going down the single-cpu call, rather than the
multi-cpu one, 4) it's only measuring the latency for a particular tlb
flush, and doesn't take into account any throughput improvements the
extra queueing may add.

I don't really know what the nett consequence of all that is, but it
means the number has almost no bearing on multicore results, esp with >2
cores.

    J
--

From: Andi Kleen
Date: Tuesday, August 19, 2008 - 12:48 am

A lot of flushes are synchronous. See the rest of my email that
you snipped.

-Andi
--

From: Jeremy Fitzhardinge
Date: Tuesday, August 19, 2008 - 1:04 am

Yep, sure, but I just wanted to be clear that the 35% number is almost
useless in isolation, and the first step is to do proper measurements.

    J
--

Previous thread: [PATCH] OSS: duplicate AFMT_S24_LE by roel kluin on Monday, August 18, 2008 - 8:43 pm. (1 message)

Next thread: [PATCH] virtio_balloon: fix towards_target when deflating balloon by Anthony Liguori on Monday, August 18, 2008 - 3:15 pm. (5 messages)