This patch series introduces this_cpu_cmpxchg_double(). x86 cpus support cmpxchg16b and cmpxchg8b that can be used to perform a cmpxchg with two words instead of only one. That allows to put more state into an atomic instruction. this_cpu_cmpxchg_double() is then used in the slub allocator to avoid interrupt disable/enable in both alloc and free fastpath. this_cpu_cmpxchg_double works nicely with the per cpu data of the allocator. Doing so significantly speeds up the fastpaths. --
Support this_cpu_cmpxchg_double using the cmpxchg16b and cmpxchg8b instructions.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
arch/x86/include/asm/percpu.h | 47 ++++++++++++++++++++++++++++++++++
arch/x86/lib/Makefile | 1
arch/x86/lib/cmpxchg16b_emu.S | 58 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 106 insertions(+)
Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-14 10:31:33.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-14 10:39:04.000000000 -0600
@@ -451,6 +451,26 @@ do { \
#define irqsafe_cpu_cmpxchg_4(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
#endif /* !CONFIG_M386 */
+#ifdef CONFIG_X86_CMPXCHG64
+#define percpu_cmpxchg8b_double(pcp, o1, o2, n1, n2) \
+({ \
+ char __ret; \
+ typeof(o1) __o1 = o1; \
+ typeof(o1) __n1 = n1; \
+ typeof(o2) __o2 = o2; \
+ typeof(o2) __n2 = n2; \
+ typeof(o2) __dummy = n2; \
+ asm("cmpxchg8b "__percpu_arg(1)"\n\tsetz %0\n\t" \
+ : "=a"(__ret), "=m" (*pcp), "=d"(__dummy) \
+ : "b"(__n1), "c"(__n2), "a"(__o1), "d"(__o2)); \
+ __ret; \
+})
+
+#define __this_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2) percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#define this_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2) percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#define irqsafe_cpu_cmpxchg_double_4(pcp, o1, o2, n1, n2) percpu_cmpxchg8b_double((pcp), o1, o2, n1, n2)
+#endif /* CONFIG_X86_CMPXCHG64 */
+
/*
* Per cpu atomic 64 bit operations are only available under 64 bit.
* 32 bit must fall back to generic operations.
@@ -486,6 +506,33 @@ do { \
#define this_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
#define irqsafe_cpu_cmpxchg_8(pcp, oval, nval) percpu_cmpxchg_op(pcp, oval, nval)
+/*
+ * Pretty complex macro to generate cmpxchg16 ...NAK on this. This is acceptable for cmpxchg8b only because we don't support SMP on 486s anymore. x86-64 is another matter... -hpa --
Hm, this is meant to be a CPU local operation, isn't it... it isn't very clear from the comments or naming *in this file*. Could you make it a little clearer in the local code, please? -hpa --
Ok. But the code is correct as far as I can tell. Its not a global cmpxchg ok. --
Better comments: Subject: Fixup comments for cmpxchg16b_cpuops_emu Indicate that the sematics are not fullly atomic but just per cpu atomic. Signed-off-by: Christoph Lameter <cl@linux.com> --- arch/x86/lib/cmpxchg16b_emu.S | 4 ++++ 1 file changed, 4 insertions(+) Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S =================================================================== --- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S 2010-12-15 10:18:18.000000000 -0600 +++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S 2010-12-15 10:19:11.000000000 -0600 @@ -30,6 +30,10 @@ CFI_STARTPROC # Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in # al not via the ZF. Caller will access al to get result. # +# Note that this is only useful for a cpuops operation. Meaning that we +# do *not* have a fully atomic operation but just an operation that is +# *atomic* on a single cpu (as provided by the this_cpu_xx class of macros) +# cmpxchg16b_cpuops_emu: pushf cli --
Could we rename it this_cpu_cmpxchg16b_emu or something like that? -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Ok. New fixup patch:
Subject: x86, percpu: Rename cmpxchg16b_cpuops_emu to this_cpu_cmpxchg16b_emu
Rename the function to make clear what it is used for. Also add some comments
to explain the nature of the atomicness.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
arch/x86/include/asm/percpu.h | 2 +-
arch/x86/lib/cmpxchg16b_emu.S | 10 +++++++---
2 files changed, 8 insertions(+), 4 deletions(-)
Index: linux-2.6/arch/x86/lib/cmpxchg16b_emu.S
===================================================================
--- linux-2.6.orig/arch/x86/lib/cmpxchg16b_emu.S 2010-12-15 10:57:21.000000000 -0600
+++ linux-2.6/arch/x86/lib/cmpxchg16b_emu.S 2010-12-15 11:47:50.000000000 -0600
@@ -23,14 +23,18 @@
* %rcx : high 64 bits of new value
* %al : Operation successful
*/
-ENTRY(cmpxchg16b_cpuops_emu)
+ENTRY(this_cpu_cmpxchg16b_emu)
CFI_STARTPROC
#
# Emulate 'cmpxchg16b %gs:(%rsi)' except we return the result in
# al not via the ZF. Caller will access al to get result.
#
-cmpxchg16b_cpuops_emu:
+# Note that this is only useful for a cpuops operation. Meaning that we
+# do *not* have a fully atomic operation but just an operation that is
+# *atomic* on a single cpu (as provided by the this_cpu_xx class of macros)
+#
+this_cpu_cmpxchg16b_emu:
pushf
cli
@@ -53,6 +57,6 @@ cmpxchg16b_cpuops_emu:
CFI_ENDPROC
-ENDPROC(cmpxchg16b_cpuops)
+ENDPROC(this_cpu_cmpxchg16b_emu)
Index: linux-2.6/arch/x86/include/asm/percpu.h
===================================================================
--- linux-2.6.orig/arch/x86/include/asm/percpu.h 2010-12-15 11:48:09.000000000 -0600
+++ linux-2.6/arch/x86/include/asm/percpu.h 2010-12-15 11:48:26.000000000 -0600
@@ -520,7 +520,7 @@ do { \
typeof(o2) __o2 = o2; \
typeof(o2) __n2 = n2; \
typeof(o2) __dummy; \
- alternative_io("call cmpxchg16b_cpuops_emu\n\t" P6_NOP4, \
+ alternative_io("call this_cpu_cmpxchg16b_emu\n\t" P6_NOP4, \
"cmpxchg16b %%gs:(%%rsi)\n\tsetz ...Yes, that's what I meant... the code is correct, it's just not obvious from someone who looks at the code that it is correct, because the intended function of the code isn't obvious from that file by itself (in fact, it looks almost exactly like the 486 cmpxchg ersatz code.) -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
Confirmed. It is a copy of that 486 ersatz code ... But its ok for the use case here. --
Introduce this_cpu_cmpxchg_double. this_cpu_cmpxchg_double() allows the
comparision between two consecutive words and replaces them if there is
a match.
bool this_cpu_cmpxchg(xx __percpu *p, old_word1, old_word2, new_word1, new_word2)
this_cpu_cmpxchg does not return the old value (difficult since there are two
words) but a boolean indicating if the operation was successful.
Also this_cpu_cmpxchg does take a per cpu pointer rather than a variable
reference like the other this_cpu_ops. This is because two words are compared.
The pointer passed must be double word aligned!
Signed-off-by: Christoph Lameter <cl@linux.com>
---
include/linux/percpu.h | 130 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 130 insertions(+)
Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2010-12-14 10:34:20.000000000 -0600
+++ linux-2.6/include/linux/percpu.h 2010-12-14 10:36:13.000000000 -0600
@@ -259,6 +259,27 @@ extern void __bad_size_call_parameter(vo
pscr2_ret__; \
})
+/*
+ * Special handling for cmpxchg_double. cmpxchg_double is passed a
+ * __percpu pointer and that pointer has to be aligned to a double
+ * word boundary
+ */
+#define __pcpu_double_call_return_int(stem, pcp, ...) \
+({ \
+ int ret__; \
+ __verify_pcpu_ptr(pcp); \
+ VM_BUG_ON((unsigned long)(pcp) % (2 * sizeof(unsigned long))); \
+ switch(sizeof(*pcp)) { \
+ case 1: ret__ = stem##1(pcp, __VA_ARGS__);break; \
+ case 2: ret__ = stem##2(pcp, __VA_ARGS__);break; \
+ case 4: ret__ = stem##4(pcp, __VA_ARGS__);break; \
+ case 8: ret__ = stem##8(pcp, __VA_ARGS__);break; \
+ default: \
+ __bad_size_call_parameter();break; \
+ } \
+ ret__; \
+})
+
#define __pcpu_size_call(stem, variable, ...) \
do { \
__verify_pcpu_ptr(&(variable)); \
@@ -422,6 +443,82 @@ do { \
__this_cpu_cmpxchg_, pcp, oval, ...Hello, Christoph. This is ugly. :-( I think we should have made this_cpu_*() ops take pointers from the beginning. Anyways, that's too late, so is it completely impossible to make cmpxchg_double's take a scalar value? It can take the pointer all the same, no? Thanks. -- tejun --
Also, the ratonale behind taking a scalar value instead of pointer was that the operation changes what they do depending on the type of the parameter (the size of the parameter), which is true for the double ops too, so I think it would be better if we stick with lvalues. Thanks. -- tejun --
It could take a scalar value like the others but we are then not operating on the scalar alone but also on the following field. --
I'm a bit confused on this one. The standard cmpxchg() takes a scalar
and a pointer, and returns a scalar. The equivalent for the "double"
variety would be to return a compound object, basically:
struct double_ulong {
unsigned long v[2];
};
... which can be returned in registers on both i386 and x86-64.
It's a bit clumsy from a type perspective, but I'm not sure that that is
a bad thing. Doing too much type genericity has caused us problems in
the past.
-hpa
--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.
--
Hello, Yes, it's weird but the operation itself is weird enough and named accordingly, so to me it seems like a much lesser problem than Yeah, the above might be better too. Is there any reason to use cmpxchg_double on anything smaller? Thanks. -- tejun --
Really? How would that work? I tried with uint128 but could not get the Yes. You may want to use cmpxchg_double on 32 bit entities for backwards compatibilities sake or any other smaller unit size. But those could also be realized using this_cpu_cmpxchg_<double the size> by just aggregating the amount. If we can indeed pass 128 bit entities (as claimed by hpa) via registers then the logical choice would be to do this_cpu_cmpxchg_16(pcp, old, new) instead of cmpxchg_double. All parameters would have to be bit. Then we can avoid the strange cmpxchg_double semantics and can completely avoid introducing those. --
There are two return registers; two machine registers can be returned in registers. [u]int128 is poorly implemented in a lot of gcc versions, since it really hasn't been exercised. However, two-word structures I'm not sure it works with passing in a structure. -hpa --
Oh gosh. So we would be using a tight corner case for gcc that may only work with certain versions of gcc? Note that the current version does only return a boolean. There is no need for returning double words. I'd be I think then we better leave it as is. The use case so far is minimal so we could easily change that if we get to a better solution later. --
A structure is not a corner case; a uint128 would be. -hpa -- H. Peter Anvin, Intel Open Source Technology Center I work for Intel. I don't speak on their behalf. --
I wouldn't call it a corner case. It's pretty clearly defined in the ABI. But that said, it might still be problematic on other architectures when we try to apply it to different architectures. Is everyone against just taking a scalar for the first variable instead of taking a pointer? I'd be happier with that than the current one. Thanks. -- tejun --
How about replacing that with two scalars? Macro will check that the
scalaers are properly aligned and that the second follows the first. Then
there is also better symmetry in the parameters.
bool this_cpu_cmpxchg_double(
percpu_1, percpu_2
old_1, old_2
new_1, new_2
)
--
Yeah, maybe. The interface is a bit redundant but probably the most fool proof. I'm okay with either taking a single scalar or both. Thanks. -- tejun --
The single large 128 bit scalar does not work. Having to define an additional structure it also a bit clumsy. I think its best to get another patchset out that also duplicates the first parameter and makes the percpu variable specs conform to the other this_cpu ops. Still on vacation so it may take to the end of the week for me to get this done. --
The following patch will make the fastpaths lockless and will no longer
require interrupts to be disabled. Calling the free hook with irq disabled
will no longer be possible.
Move the slab_free_hook_irq() logic into slab_free_hook. Only disable
interrupts if the features are selected that require callbacks with
interrupts off and reenable after calls have been made.
Signed-off-by: Christoph Lameter <cl@linux.com>
---
mm/slub.c | 29 +++++++++++++++++------------
1 file changed, 17 insertions(+), 12 deletions(-)
Index: linux-2.6/mm/slub.c
===================================================================
--- linux-2.6.orig/mm/slub.c 2010-12-14 10:45:11.000000000 -0600
+++ linux-2.6/mm/slub.c 2010-12-14 10:46:40.000000000 -0600
@@ -805,14 +805,24 @@ static inline void slab_post_alloc_hook(
static inline void slab_free_hook(struct kmem_cache *s, void *x)
{
kmemleak_free_recursive(x, s->flags);
-}
-static inline void slab_free_hook_irq(struct kmem_cache *s, void *object)
-{
- kmemcheck_slab_free(s, object, s->objsize);
- debug_check_no_locks_freed(object, s->objsize);
- if (!(s->flags & SLAB_DEBUG_OBJECTS))
- debug_check_no_obj_freed(object, s->objsize);
+ /*
+ * Trouble is that we may no longer disable interupts in the fast path
+ * So in order to make the debug calls that expect irqs to be
+ * disabled we need to disable interrupts temporarily.
+ */
+#if defined(CONFIG_KMEMCHECK) || defined(CONFIG_LOCKDEP)
+ {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ kmemcheck_slab_free(s, x, s->objsize);
+ debug_check_no_locks_freed(x, s->objsize);
+ if (!(s->flags & SLAB_DEBUG_OBJECTS))
+ debug_check_no_obj_freed(x, s->objsize);
+ local_irq_restore(flags);
+ }
+#endif
}
/*
@@ -1099,9 +1109,6 @@ static inline void slab_post_alloc_hook(
static inline void slab_free_hook(struct kmem_cache *s, void *x) {}
-static inline void slab_free_hook_irq(struct kmem_cache *s,
- void *object) {}
-
#endif /* CONFIG_SLUB_DEBUG */
...Use the this_cpu_cmpxchg_double functionality to implement a lockless allocation algorithm on arches that support fast this_cpu_ops. Each of the per cpu pointers is paired with a transaction id that ensures that updates of the per cpu information can only occur in sequence on a certain cpu. A transaction id is a "long" integer that is comprised of an event number and the cpu number. The event number is incremented for every change to the per cpu state. This means that the cmpxchg instruction can verify for an update that nothing interfered and that we are updating the percpu structure for the processor where we picked up the information and that we are also currently on that processor when we update the information. This results in a significant decrease of the overhead in the fastpaths. It also makes it easy to adopt the fast path for realtime kernels since this is lockless and does not require the use of the current per cpu area over the critical section. It is only important that the per cpu area is current at the beginning of the critical section and at the end. So there is no need even to disable preemption. Test results show that the fastpath cycle count is reduced by up to ~ 40% (alloc/free test goes from ~140 cycles down to ~80). The slowpath for kfree adds a few cycles. Sadly this does nothing for the slowpath which is where the main issues with performance in slub are but the best case performance rises significantly. (For that see the more complex slub patches that require cmpxchg_double) Kmalloc: alloc/free test Before: 10000 times kmalloc(8)/kfree -> 142 cycles 10000 times kmalloc(16)/kfree -> 142 cycles 10000 times kmalloc(32)/kfree -> 142 cycles 10000 times kmalloc(64)/kfree -> 142 cycles 10000 times kmalloc(128)/kfree -> 142 cycles 10000 times kmalloc(256)/kfree -> 140 cycles 10000 times kmalloc(512)/kfree -> 140 cycles 10000 times kmalloc(1024)/kfree -> 144 cycles 10000 times kmalloc(2048)/kfree -> 144 cycles 10000 times kmalloc(4096)/kfree ...
The first two look good to me but I frankly don't have much idea about the latter two. Pekka, can you please ack those? Alternatively, you can later pull the percpu tree in and apply the allocator bits in your tree, which I actually prefer. Thanks. -- tejun --
Hi Tejun,
I'd prefer to pull the first two from your tree and put the allocator
changes to slab.git.
Pekka
--
Hey, Great, then I'll let you know after the percpu tree is frozen with the new changes. Thanks. -- tejun --
