Re: [cpuops cmpxchg double V1 1/4] Generic support for this_cpu_cmpxchg_double

Previous thread: Re: [PATCH] [USB] UASP: USB Attached SCSI (UAS) protocol driver by Luben Tuikov on Tuesday, December 14, 2010 - 10:25 am. (2 messages)

Next thread: Re: [PATCH net-next-2.6] bnx2: remove cancel_work_sync() from remove_one by Michael Chan on Tuesday, December 14, 2010 - 10:48 am. (6 messages)
From: Christoph Lameter
Date: Tuesday, December 14, 2010 - 10:48 am

This patch series introduces this_cpu_cmpxchg_double().

x86 cpus support cmpxchg16b and cmpxchg8b that can be used to perform a
cmpxchg with two words instead of only one. That allows to put more state
into an atomic instruction.

this_cpu_cmpxchg_double() is then used in the slub allocator to avoid
interrupt disable/enable in both alloc and free fastpath.
this_cpu_cmpxchg_double works nicely with the per cpu data of the
allocator. Doing so significantly speeds up the fastpaths.

--

From: Christoph Lameter
Date: Tuesday, December 14, 2010 - 10:48 am

Support this_cpu_cmpxchg_double using the cmpxchg16b and cmpxchg8b instructions.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/include/asm/percpu.h |   47 ++++++++++++++++++++++++++++++++++
 arch/x86/lib/Makefile         |    1 
 arch/x86/lib/cmpxchg16b_emu.S |   58 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 106 insertions(+)

Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-12-14 10:31:33.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2010-12-14 10:39:04.000000000 -0600
@@ -451,6 +451,26 @@ do {									\
 #define irqsafe_cpu_cmpxchg_4(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 #endif /* !CONFIG_M386 */
 
+#ifdef CONFIG_X86_CMPXCHG64
+#define percpu_cmpxchg8b_double(pcp, o1, o2, n1, n2)			\
+({									\
+	char __ret;							\
+	typeof(o1) __o1 = o1;						\
+	typeof(o1) __n1 = n1;						\
+	typeof(o2) __o2 = o2;						\
+	typeof(o2) __n2 = n2;						\
+	typeof(o2) __dummy = n2;						\
+	asm("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t"		\
+		    : "=a"(__ret), "=m" (*pcp), "=d"(__dummy)		\
+		    :  "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2));	\
+	__ret;								\
+})
+
+#define __this_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2) percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2)	percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#define irqsafe_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2)	percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#endif /* CONFIG_X86_CMPXCHG64 */
+
 /*
  * Per cpu atomic 64 bit operations are only available under 64 bit.
  * 32 bit must fall back to generic operations.
@@ -486,6 +506,33 @@ do {									\
 #define this_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 #define irqsafe_cpu_cmpxchg_8(pcp, oval, nval)	percpu_cmpxchg_op(pcp, oval, nval)
 
+/*
+ * Pretty complex macro to generate cmpxchg16 ...
From: H. Peter Anvin
Date: Tuesday, December 14, 2010 - 5:46 pm

NAK on this.  This is acceptable for cmpxchg8b only because we don't
support SMP on 486s anymore.  x86-64 is another matter...
	
	-hpa
--

From: H. Peter Anvin
Date: Tuesday, December 14, 2010 - 5:56 pm

Hm, this is meant to be a CPU local operation, isn't it... it isn't very
clear from the comments or naming *in this file*.

Could you make it a little clearer in the local code, please?

	-hpa
--

From: Christoph Lameter
Date: Wednesday, December 15, 2010 - 9:12 am

Ok. But the code is correct as far as I can tell. Its not a global cmpxchg

ok.

--

From: Christoph Lameter
Date: Wednesday, December 15, 2010 - 9:20 am

Better comments:

Subject: Fixup comments for cmpxchg16b_cpuops_emu

Indicate that the sematics are not fullly atomic but just per cpu atomic.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/lib/cmpxchg16b_emu.S |    4 ++++
 1 file changed, 4 insertions(+)

Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
===================================================================
--- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S	2010-12-15 10:18:18.000000000 -0600
+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S	2010-12-15 10:19:11.000000000 -0600
@@ -30,6 +30,10 @@ CFI_STARTPROC
 # Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in
 # al not via the ZF. Caller will access al to get result.
 #
+# Note that this is only useful for a cpuops operation. Meaning that we
+# do *not* have a fully atomic operation but just an operation that is
+# *atomic* on a single cpu (as provided by the this_cpu_xx class of macros)
+#
 cmpxchg16b_cpuops_emu:
 	pushf
 	cli

--

From: H. Peter Anvin
Date: Wednesday, December 15, 2010 - 10:36 am

Could we rename it this_cpu_cmpxchg16b_emu or something like that?

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Christoph Lameter
Date: Wednesday, December 15, 2010 - 10:53 am

Ok. New fixup patch:



Subject: x86, percpu: Rename cmpxchg16b_cpuops_emu to this_cpu_cmpxchg16b_emu

Rename the function to make clear what it is used for. Also add some comments
to explain the nature of the atomicness.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 arch/x86/include/asm/percpu.h |    2 +-
 arch/x86/lib/cmpxchg16b_emu.S |   10 +++++++---
 2 files changed, 8 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
===================================================================
--- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S	2010-12-15 10:57:21.000000000 -0600
+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S	2010-12-15 11:47:50.000000000 -0600
@@ -23,14 +23,18 @@
  * %rcx : high 64 bits of new value
  * %al  : Operation successful
  */
-ENTRY(cmpxchg16b_cpuops_emu)
+ENTRY(this_cpu_cmpxchg16b_emu)
 CFI_STARTPROC

 #
 # Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in
 # al not via the ZF. Caller will access al to get result.
 #
-cmpxchg16b_cpuops_emu:
+# Note that this is only useful for a cpuops operation. Meaning that we
+# do *not* have a fully atomic operation but just an operation that is
+# *atomic* on a single cpu (as provided by the this_cpu_xx class of macros)
+#
+this_cpu_cmpxchg16b_emu:
 	pushf
 	cli

@@ -53,6 +57,6 @@ cmpxchg16b_cpuops_emu:

 CFI_ENDPROC

-ENDPROC(cmpxchg16b_cpuops)
+ENDPROC(this_cpu_cmpxchg16b_emu)


Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h	2010-12-15 11:48:09.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h	2010-12-15 11:48:26.000000000 -0600
@@ -520,7 +520,7 @@ do {									\
 	typeof(o2) __o2 = o2;						\
 	typeof(o2) __n2 = n2;						\
 	typeof(o2) __dummy;						\
-	alternative_io("call cmpxchg16b_cpuops_emu\n\t" P6_NOP4,		\
+	alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4,		\
 			"cmpxchg16b %%gs:(%%rsi)\n\tsetz ...
From: H. Peter Anvin
Date: Wednesday, December 15, 2010 - 9:32 am

Yes, that's what I meant... the code is correct, it's just not obvious
from someone who looks at the code that it is correct, because the
intended function of the code isn't obvious from that file by itself (in
fact, it looks almost exactly like the 486 cmpxchg ersatz code.)

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Christoph Lameter
Date: Wednesday, December 15, 2010 - 9:41 am

Confirmed. It is a copy of that 486 ersatz code ... But its ok for the use
case here.


--

From: Christoph Lameter
Date: Tuesday, December 14, 2010 - 10:48 am

Introduce this_cpu_cmpxchg_double. this_cpu_cmpxchg_double() allows the
comparision between two consecutive words and replaces them if there is
a match.

	bool this_cpu_cmpxchg(xx __percpu *p, old_word1, old_word2, new_word1, new_word2)

this_cpu_cmpxchg does not return the old value (difficult since there are two
words) but a boolean indicating if the operation was successful.

Also this_cpu_cmpxchg does take a per cpu pointer rather than a variable
reference like the other this_cpu_ops. This is because two words are compared.

The pointer passed must be double word aligned!

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 include/linux/percpu.h |  130 +++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 130 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h	2010-12-14 10:34:20.000000000 -0600
+++ linux-2.6/include/linux/percpu.h	2010-12-14 10:36:13.000000000 -0600
@@ -259,6 +259,27 @@ extern void __bad_size_call_parameter(vo
 	pscr2_ret__;							\
 })
 
+/*
+ * Special handling for cmpxchg_double. cmpxchg_double is passed a
+ * __percpu pointer and that pointer has to be aligned to a double
+ * word boundary
+ */
+#define __pcpu_double_call_return_int(stem, pcp, ...)			\
+({									\
+	int ret__;							\
+	__verify_pcpu_ptr(pcp);						\
+	VM_BUG_ON((unsigned long)(pcp) % (2 * sizeof(unsigned long)));	\
+	switch(sizeof(*pcp)) {						\
+	case 1: ret__ = stem##1(pcp, __VA_ARGS__);break;		\
+	case 2: ret__ = stem##2(pcp, __VA_ARGS__);break;		\
+	case 4: ret__ = stem##4(pcp, __VA_ARGS__);break;		\
+	case 8: ret__ = stem##8(pcp, __VA_ARGS__);break;		\
+	default:							\
+		__bad_size_call_parameter();break;			\
+	}								\
+	ret__;								\
+})
+
 #define __pcpu_size_call(stem, variable, ...)				\
 do {									\
 	__verify_pcpu_ptr(&(variable));					\
@@ -422,6 +443,82 @@ do {									\
 				__this_cpu_cmpxchg_, pcp, oval, ...
From: Tejun Heo
Date: Saturday, December 18, 2010 - 7:47 am

Hello, Christoph.


This is ugly. :-( I think we should have made this_cpu_*() ops take
pointers from the beginning.  Anyways, that's too late, so is it
completely impossible to make cmpxchg_double's take a scalar value?
It can take the pointer all the same, no?

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Saturday, December 18, 2010 - 7:51 am

Also, the ratonale behind taking a scalar value instead of pointer was
that the operation changes what they do depending on the type of the
parameter (the size of the parameter), which is true for the double
ops too, so I think it would be better if we stick with lvalues.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Tuesday, December 21, 2010 - 3:36 pm

It could take a scalar value like the others but we are then not operating
on the scalar alone but also on the following field.

--

From: H. Peter Anvin
Date: Tuesday, December 21, 2010 - 4:24 pm

I'm a bit confused on this one.  The standard cmpxchg() takes a scalar
and a pointer, and returns a scalar.  The equivalent for the "double"
variety would be to return a compound object, basically:

struct double_ulong {
	unsigned long v[2];
};

... which can be returned in registers on both i386 and x86-64.

It's a bit clumsy from a type perspective, but I'm not sure that that is
a bad thing.  Doing too much type genericity has caused us problems in
the past.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Tejun Heo
Date: Wednesday, December 22, 2010 - 2:14 am

Hello,


Yes, it's weird but the operation itself is weird enough and named
accordingly, so to me it seems like a much lesser problem than

Yeah, the above might be better too.  Is there any reason to use
cmpxchg_double on anything smaller?

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Thursday, December 23, 2010 - 5:16 pm

Really? How would that work? I tried with uint128 but could not get the

Yes. You may want to use cmpxchg_double on 32 bit entities for backwards
compatibilities sake or any other smaller unit size. But those could also
be realized using this_cpu_cmpxchg_<double the size> by just aggregating
the amount.

If we can indeed pass 128 bit entities (as claimed by hpa) via registers
then the logical choice would be to do

	this_cpu_cmpxchg_16(pcp, old, new)

instead of cmpxchg_double. All parameters would have to be bit.
Then we can avoid the strange cmpxchg_double semantics and can completely
avoid introducing those.




--

From: H. Peter Anvin
Date: Thursday, December 23, 2010 - 5:22 pm

There are two return registers; two machine registers can be returned in
registers.  [u]int128 is poorly implemented in a lot of gcc versions,
since it really hasn't been exercised.  However, two-word structures

I'm not sure it works with passing in a structure.

	-hpa
--

From: Christoph Lameter
Date: Friday, December 24, 2010 - 9:53 pm

Oh gosh. So we would be using a tight corner case for gcc that may only
work with certain versions of gcc? Note that the current version does only
return a boolean. There is no need for returning double words. I'd be

I think then we better leave it as is. The use case so far is minimal so
we could easily change that if we get to a better solution later.
--

From: H. Peter Anvin
Date: Friday, December 24, 2010 - 11:11 pm

A structure is not a corner case; a uint128 would be.

	-hpa

-- 
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel.  I don't speak on their behalf.

--

From: Tejun Heo
Date: Saturday, December 25, 2010 - 9:52 am

I wouldn't call it a corner case.  It's pretty clearly defined in the
ABI.  But that said, it might still be problematic on other
architectures when we try to apply it to different architectures.  Is
everyone against just taking a scalar for the first variable instead
of taking a pointer?  I'd be happier with that than the current one.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Saturday, December 25, 2010 - 4:55 pm

How about replacing that with two scalars? Macro will check that the
scalaers are properly aligned and that the second follows the first. Then
there is also better symmetry in the parameters.

bool this_cpu_cmpxchg_double(
          percpu_1, percpu_2
          old_1, old_2
          new_1, new_2
)

--

From: Tejun Heo
Date: Monday, December 27, 2010 - 3:52 am

Yeah, maybe.  The interface is a bit redundant but probably the most
fool proof.  I'm okay with either taking a single scalar or both.

Thanks.

-- 
tejun
--

From: Christoph Lameter
Date: Monday, January 3, 2011 - 3:43 pm

The single large 128 bit scalar does not work. Having to define an
additional structure it also a bit clumsy. I think its best to get another
patchset out that also duplicates the first parameter and makes the percpu
variable specs conform to the other this_cpu ops.

Still on vacation so it may take to the end of the week for me to get this
done.

--

From: Christoph Lameter
Date: Tuesday, December 14, 2010 - 10:48 am

The following patch will make the fastpaths lockless and will no longer
require interrupts to be disabled. Calling the free hook with irq disabled
will no longer be possible.

Move the slab_free_hook_irq() logic into slab_free_hook. Only disable
interrupts if the features are selected that require callbacks with
interrupts off and reenable after calls have been made.

Signed-off-by: Christoph Lameter <cl@linux.com>

---
 mm/slub.c |   29 +++++++++++++++++------------
 1 file changed, 17 insertions(+), 12 deletions(-)

Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c	2010-12-14 10:45:11.000000000 -0600
+++ linux-2.6/mm/slub.c	2010-12-14 10:46:40.000000000 -0600
@@ -805,14 +805,24 @@ static inline void slab_post_alloc_hook(
 static inline void slab_free_hook(struct kmem_cache *s, void *x)
 {
 	kmemleak_free_recursive(x, s->flags);
-}
 
-static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
-{
-	kmemcheck_slab_free(s, object, s->objsize);
-	debug_check_no_locks_freed(object, s->objsize);
-	if (!(s->flags & SLAB_DEBUG_OBJECTS))
-		debug_check_no_obj_freed(object, s->objsize);
+	/*
+	 * Trouble is that we may no longer disable interupts in the fast path
+	 * So in order to make the debug calls that expect irqs to be
+	 * disabled we need to disable interrupts temporarily.
+	 */
+#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
+	{
+		unsigned long flags;
+
+		local_irq_save(flags);
+		kmemcheck_slab_free(s, x, s->objsize);
+		debug_check_no_locks_freed(x, s->objsize);
+		if (!(s->flags & SLAB_DEBUG_OBJECTS))
+			debug_check_no_obj_freed(x, s->objsize);
+		local_irq_restore(flags);
+	}
+#endif
 }
 
 /*
@@ -1099,9 +1109,6 @@ static inline void slab_post_alloc_hook(
 
 static inline void slab_free_hook(struct kmem_cache *s, void *x) {}
 
-static inline void slab_free_hook_irq(struct kmem_cache *s,
-		void *object) {}
-
 #endif /* CONFIG_SLUB_DEBUG */
 
 ...
From: Christoph Lameter
Date: Tuesday, December 14, 2010 - 10:48 am

Use the this_cpu_cmpxchg_double functionality to implement a lockless
allocation algorithm on arches that support fast this_cpu_ops.

Each of the per cpu pointers is paired with a transaction id that ensures
that updates of the per cpu information can only occur in sequence on
a certain cpu.

A transaction id is a "long" integer that is comprised of an event number
and the cpu number. The event number is incremented for every change to the
per cpu state. This means that the cmpxchg instruction can verify for an
update that nothing interfered and that we are updating the percpu structure
for the processor where we picked up the information and that we are also
currently on that processor when we update the information.

This results in a significant decrease of the overhead in the fastpaths. It
also makes it easy to adopt the fast path for realtime kernels since this
is lockless and does not require the use of the current per cpu area
over the critical section. It is only important that the per cpu area is
current at the beginning of the critical section and at the end.

So there is no need even to disable preemption.

Test results show that the fastpath cycle count is reduced by up to ~ 40%
(alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree
adds a few cycles.

Sadly this does nothing for the slowpath which is where the main issues with
performance in slub are but the best case performance rises significantly.
(For that see the more complex slub patches that require cmpxchg_double)

Kmalloc: alloc/free test

Before:

10000 times kmalloc(8)/kfree -> 142 cycles
10000 times kmalloc(16)/kfree -> 142 cycles
10000 times kmalloc(32)/kfree -> 142 cycles
10000 times kmalloc(64)/kfree -> 142 cycles
10000 times kmalloc(128)/kfree -> 142 cycles
10000 times kmalloc(256)/kfree -> 140 cycles
10000 times kmalloc(512)/kfree -> 140 cycles
10000 times kmalloc(1024)/kfree -> 144 cycles
10000 times kmalloc(2048)/kfree -> 144 cycles
10000 times kmalloc(4096)/kfree ...
From: Tejun Heo
Date: Wednesday, December 15, 2010 - 9:51 am

The first two look good to me but I frankly don't have much idea about
the latter two.  Pekka, can you please ack those?  Alternatively, you
can later pull the percpu tree in and apply the allocator bits in your
tree, which I actually prefer.

Thanks.

-- 
tejun
--

From: Pekka Enberg
Date: Wednesday, December 15, 2010 - 9:55 am

Hi Tejun,


I'd prefer to pull the first two from your tree and put the allocator
changes to slab.git.

                        Pekka
--

From: Tejun Heo
Date: Wednesday, December 15, 2010 - 9:57 am

Hey,


Great, then I'll let you know after the percpu tree is frozen with the
new changes.

Thanks.

-- 
tejun
--

Previous thread: Re: [PATCH] [USB] UASP: USB Attached SCSI (UAS) protocol driver by Luben Tuikov on Tuesday, December 14, 2010 - 10:25 am. (2 messages)

Next thread: Re: [PATCH net-next-2.6] bnx2: remove cancel_work_sync() from remove_one by Michael Chan on Tuesday, December 14, 2010 - 10:48 am. (6 messages)