login
Header Space

 
 

Re: Linux 2.6.25-rc2

Previous thread: Re: data corruption with dmcrypt/LUKS by Christoph Anton Mitterer on Friday, February 15, 2008 - 5:25 pm. (1 message)

Next thread: USB regression (and other failures) in 2.6.2[45]* by Andrew Buehler on Friday, February 15, 2008 - 5:45 pm. (26 messages)
To: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 15, 2008 - 5:23 pm

Ok,
 this kernel is a winner.

Just to show how _much_ of a winner it is, it's been awarded a coveted 
"weasel" series name, which should tell you just how good it's going to 
be. It's a name revered in Linux kernel history, and as such this brings 
back the good old days where if you find a bug, you're almost certainly 
simply mistaken, and you probably just did something wrong.

But hey, you can try to prove me wrong.  I dare you.

The full shortlog is appended, but instead of a diffstat, I'm doing the my 
git "dirstat" output which gives a nicer high-level view. This was 
generated with

	git diff --cumulative --dirstat=2 --shortstat v2.6.25-rc1..v2.6.25-rc2

and actually gives a pretty good overview. In particular, it shows that 
almost exactly half of the updates are to drivers, with network drivers 
alone being a third of the whole patch. And of the remaining half, about 
half was architecture updates, notably to SH.

What a cool git feature, if I say so myself. It allows me an extra half 
hour a week of just sitting back and sipping my foofy tropical drink 
(bringing my total up to about 168.5 hours of drunken stupor a week), 
since now I don't even have to try to make sense of the diffstat manually 
any more. Anyway, here it is:

	   2.1% Documentation/
	   3.7% arch/cris/
	   7.0% arch/sh/configs/
	   4.4% arch/sh/kernel/
	   4.9% arch/sh/mm/
	  17.8% arch/sh/
	  23.8% arch/
	  33.5% drivers/net/
	   6.0% drivers/scsi/lpfc/
	   7.1% drivers/scsi/
	   4.5% drivers/sh/maple/
	  49.5% drivers/
	   8.1% fs/
	   2.5% include/linux/
	   4.5% include/
	   7.2% kernel/
	   2.0% net/
	 509 files changed, 14470 insertions(+), 6986 deletions(-)

which you have to admit looks very managerial, even if you have no clue 
what the heck it really contains (which is also very managerial). Now I 
just need to turn these reports into some colored powerpoint slides with 
pie charts, and then I can turn off my brain _completely_.

But if you actually care for anything but ...
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 5:38 pm

Sadly not for me:
[ 5282.056415] ------------[ cut here ]------------
[ 5282.059757] kernel BUG at lib/list_debug.c:33!
[ 5282.062055] invalid opcode: 0000 [1] SMP
[ 5282.062055] CPU 3
[ 5282.062055] Modules linked in: radeon drm w83792d ipv6 tuner
tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx tea5761
tvaudio msp3400 bttv videodev v4l1_compat ir_common compat_ioctl32
v4l2_common videobuf_dma_sg videobuf_core btcx_risc tveeprom usbhid
pata_amd i2c_nforce2 hid sg
[ 5282.062055] Pid: 12937, comm: sed Not tainted 2.6.25-rc2 #1
[ 5282.062055] RIP: 0010:[&lt;ffffffff803bffe4&gt;]
-&gt; then the output from the serial console stopped. I was in X, so I
could not see, if there was anything more on the real console.

(gdb) list *0xffffffff803bffe4
0xffffffff803bffe4 is in __list_add (lib/list_debug.c:33).
28              }
29              if (unlikely(prev-&gt;next != next)) {
30                      printk(KERN_ERR "list_add corruption.
prev-&gt;next should be "
31                              "next (%p), but was %p. (prev=%p).\n",
32                              next, prev-&gt;next, prev);
33                      BUG();
34              }
35              next-&gt;prev = new;
36              new-&gt;next = next;
37              new-&gt;prev = prev;

For more on this problem see http://marc.info/?l=linux-kernel&amp;m=120293042005445

I will now try 2.6.25-rc2-mm1.

Torsten
--
To: Torsten Kaiser <just.for.lkml@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 19, 2008 - 2:11 am

hm. Your crashes do seem to span multiple subsystems, but it always 
seems to be around the SLUB code. Could you try the patch below? The 
SLUB code has a new optimization and i'm not 100% sure about it. [the 
hack below switches the SLUB optimization off by disabling the CPU 
feature it relies on.]

	Ingo

-------------&gt;
 arch/x86/Kconfig |    4 ----
 1 file changed, 4 deletions(-)

Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -59,10 +59,6 @@ config HAVE_LATENCYTOP_SUPPORT
 config SEMAPHORE_SLEEPERS
 	def_bool y
 
-config FAST_CMPXCHG_LOCAL
-	bool
-	default y
-
 config MMU
 	def_bool y
 

--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 19, 2008 - 2:54 am

$ grep FAST_CMPXCHG_LOCAL */.config
linux-2.6.24-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc3-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc3-mm2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc6-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc8-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y

-rc2-mm1 still worked for me.

Did you mean the new SLUB_FASTPATH?
$ grep "define SLUB_FASTPATH" */mm/slub.c
linux-2.6.25-rc1/mm/slub.c:#define SLUB_FASTPATH
linux-2.6.25-rc2-mm1/mm/slub.c:#define SLUB_FASTPATH
linux-2.6.25-rc2/mm/slub.c:#define SLUB_FASTPATH

The 2.6.24-rc3+ mm-kernels did crash for me, but don't seem to contain this...

On the other hand:
From the crash in 2.6.25-rc2-mm1:
[59987.116182] RIP  [&lt;ffffffff8029f83d&gt;] kmem_cache_alloc_node+0x6d/0xa0

(gdb) list *0xffffffff8029f83d
0xffffffff8029f83d is in kmem_cache_alloc_node (mm/slub.c:1646).
1641                    if (unlikely(is_end(object) || !node_match(c, node))) {
1642                            object = __slab_alloc(s, gfpflags,
node, addr, c);
1643                            break;
1644                    }
1645                    stat(c, ALLOC_FASTPATH);
1646            } while (cmpxchg_local(&amp;c-&gt;freelist, object, object[c-&gt;offset])
1647
 != object);
1648    #else
1649            unsigned long flags;
1650

That code is part for SLUB_FASTPATH.

I'm willing to test the patch, but don't know how fast I can find the
time to do it, so my answer if your patch helps might be delayed until
the weekend.

Torsten
--
To: Torsten Kaiser <just.for.lkml@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 3:21 am

Mathieu, Christoph is on vacation and I'm not at all that familiar
with this cmpxchg_local() optimization, so if you could take a peek at
this bug report to see if you can spot something obviously wrong with
it, I would much appreciate that.
--
To: Pekka Enberg <penberg@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:02 am

Sure,

Initial thoughts :

I'd like to get the complete config causing this bug. I suspect either :

- A race between the lockless algo and an IRQ in a driver allocating
  memory.
- stat(c, ALLOC_FASTPATH); seems to be using a var++, therefore
  indicating it is not reentrant if IRQs are disabled. Since those are
  only stats, I guess it's ok, but still weird.
- CPU hotplug problem. 
  http://bugzilla.kernel.org/attachment.cgi?id=14877&amp;action=view shows
  last sysfs file:
  /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
  -- is this linked to a cpu up/down event ?

Since this shows mostly with network card drivers, I think the most
plausible cause would be an IRQ nesting over kmem_cache_alloc_node and
calling it.

Will dig further...

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 2:39 pm

# CONFIG_HOTPLUG_CPU is not set

I expect my system would get rather angry if I try to remove a cpu
while it is running. ;)

And as I don't use suspend, I wanted to stay away from any bugs that

Thanks to yanmin crashing his machine with hackbench I now have a much
better testcase than recompiling KDE.
Starting hackbench will result in an nearly instant crash, if there is
a floodping running against the system.

This new testcase also showed that 2.6.24-rc2-mm1 can also be crashed
in this way.

I will try the patches from this tread against this new testcase and
report the results.

Torsten
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:27 pm

On Tue, 19 Feb 2008 09:02:30 -0500

I wonder how SLUB_FASTPATH is supposed to work, since it is affected by
a classical ABA problem of lockless algo.

cmpxchg_local(&amp;c-&gt;freelist, object, object[c-&gt;offset]) can succeed,
while an interrupt came (on this cpu), and several allocations were done,
and one free was performed at the end of this interruption, so 'object'
was recycled.

c-&gt;freelist can then contain the previous value (object), but
object[c-&gt;offset] was changed by IRQ.

We then put back in freelist an already allocated object.

--
To: Eric Dumazet <dada1@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 7:32 pm

Right. The alloc fastpath assumes that the object is not modified while it 
is allocated...

Du! This shoots my nice scheme in the foot and there does not seem to be 
an easy way to fix it.  This means we also need to revert the 
page-&gt;end patch. Useless if we do not have the cmpxchg_local.



--
To: Eric Dumazet <dada1@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 4:03 pm

I think you are right. A way to fix this would use the fact that the
freelist is only useful to point to the first free object in a page. We
could change it to an offset rather than an address.

The freelist would become a counter of type "long" which increments
until it wraps at 2^32 or 2^64. A PAGE_MASK bitmask could then be used
to get low order bits which would get the page offset of the first free
object, while the high order bits would insure we can detect this type
of object reuse when doing a cmpxchg. Upon free, the freelist counter
should always be incremented; this would be provided by adding PAGE_SIZE
to the counter and setting the LSBs to the correct offset.

On 32 bits architectures, it would allow 2^(32-12) = 1048576
allocations-free pairs to be done within interrupt handlers interrupting
a cmpxchg_local before the counter wraps. On 64 bits archs, it would
give 2^(64-12) = 2^52 allocation-free.

If we assume kmalloc/kfree pairs can take up to 1000 cycles to execute
(see
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=068fbad288...)
(3GHz Pentium 4) (therefore 333ns)
an interrupt handler doing 1048576 kmalloc/kfree would run for 0.35
seconds. An interrupt handler taking that much time would lead to other
OS problems (potential scheduler problems), so this quantity of
available allocations before a wrap-around looks safe. However, making
sure preemption is disabled would be safer here, since we cannot bound
the number of allocations that can be done by other processes if we are
scheduled out in the middle of the cmpxchg_local loop.

A revert is indeed the right solution at this stage until a corrected
version is ready.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 7:34 pm

Urgh.... That sounds way too complicated. Do you have an experimental 
patch that would allow us to see the impact?

--
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 1:55 am

Yep, I have an experimental patch ready. It would need some thorough
testing though. I seems to run fine on an x86_32 with and without
preemption (therefore with and without the slub cmpxchg_local fastpath).

You should first revert commit 00e962c5408b9f2d0bebd2308673fe982cb9a5fe
(this is the slub fastpath revert itself, currently in 2.6.25-rc3)

And then apply :


Implement slub fastpath in terms of freebase and freeoffset

It allows the cmpxchg_local to detect object re-use by keeping a counter in the
freeoffset MSBs.

freeoffset &amp; PAGE_MASK is the offset of the freelist element within the page.
freeoffset &amp; ~PAGE_MASK is the counter to detect object re-use.
freebase is the address of the page in this the object is located. It is a
multiple of PAGE_SIZE.

Whenever an object is freed in the cpu slab cache, the counter is incremented.
Whenever the alloc/free slow paths are modifying the freeoffset or freebase, the
freeoffset counter is also incremented. It is used to make sure we know if
freebase has been modified in an interrupt nested over the fast path.

Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@polymtl.ca&gt;
CC: Christoph Lameter &lt;clameter@sgi.com&gt;
---
 include/linux/slub_def.h |   35 +++++++++++++++++++
 mm/slub.c                |   84 +++++++++++++++++++++++++++++++++--------------
 2 files changed, 93 insertions(+), 26 deletions(-)

Index: linux-2.6-lttng/include/linux/slub_def.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/slub_def.h	2008-02-27 20:40:48.000000000 -0500
+++ linux-2.6-lttng/include/linux/slub_def.h	2008-02-28 00:17:28.000000000 -0500
@@ -32,7 +32,18 @@ enum stat_item {
 	NR_SLUB_STAT_ITEMS };
 
 struct kmem_cache_cpu {
-	void **freelist;	/* Pointer to first free per cpu object */
+	unsigned long freeoffset;	/*
+					 * Offset of the first free per cpu
+				 	 * object. The MSBs are used as a
+					 * counter which must be incremented
+					 * each time an ...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 3:08 pm

What is the performance impact?




page_address(c-&gt;page) + (c-&gt;freeoffset &amp; (PAGE_MASK &lt;&lt; s-&gt;order)) ?


How is this going to show up as an atomic update? You modify multiple per 

The original code is wrong. It should be

if (is_end(c-&gt;freelist))

That use of set_freelist is safe since interrupts are disabled. So the 

How does this work? You replace the freeoffset with the the once 
incremented by the object offset??? The objects may not be in sequence on 
the freelist and the increment wont get you to the next object anyways 
since c-&gt;offset is usually 0.

--
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 7:25 pm

I guess it would be more :

page_address(c-&gt;page) | (c-&gt;freeoffset &amp; c-&gt;off_mask)

where c-&gt;off_mask = (PAGE_SIZE &lt;&lt; s-&gt;order) - 1;



Note the "from the fastpath point of view", which means that if an
interrupt nests over the cmpxchg_local fastpath and uses
set_freelist_ptr, all the operations will be executed before the
interrupt returns to the cmpxchg_local fastpath, therefore showing the
update as being done "atomically" from the cmpxchg_local fastpath point
of view.

Anyway, I am changing this for :


static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
{
        unsigned long offset;

        offset = c-&gt;freeoffset + c-&gt;off_mask + 1;
        offset &amp;= ~c-&gt;off_mask;
        offset |= (unsigned long)freelist &amp; c-&gt;off_mask;
        c-&gt;freeoffset = offset;
}

Which will insure c-&gt;freeoffset is always in a valid state, therefore

Should be ok now since we allow interrupts to nest over
set_freelist_ptr. But I think this is also protected by slab_lock, taken
both in the slow path and in deactivate_slab.

In any case, doing a 

                object = get_freelist_ptr(c);
                set_freelist_ptr(c, object[c-&gt;offset]);

Like deactivate_slab does should be either protected by disabling

Hrm, there should be a 1 to 1 mapping between the old code and the new
code :


-       } while (cmpxchg_local(&amp;c-&gt;freelist, object, object[c-&gt;offset])
-                                                               != object);
+               /* No need to increment the MSB counter here, because only
+                * object free will lead to counter re-use. */
+       } while (cmpxchg_local(&amp;c-&gt;freeoffset, freeoffset,
+               (freeoffset + (c-&gt;offset * sizeof(void *)))) != freeoffset);

With the difference that we use only offsets instead of the full object
address.

Ah! I think I found the issue. It should become :


                newoffset = freeo...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 8:57 pm

Hmmm.. The cmpxchg_local was intended to check for a change of slab too 
(addresses from different slabs are disjunct). 
That is no longer possible since the upper bits are used by the 
versioning? (Same issue in slab_alloc).

--
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 9:56 pm

Hrm, this is why I kept a separate "freebase", as a sort of

Here is the trick : if, on top of the cmpxchg_local, we have an
interrupt that calls slab_alloc or slab_free which causes a change of
slab, this interrupt will have to go through the slow path.

These slow paths use set_freelist_ptr() to update the c-&gt;freeoffset,
which also increments the sequence number. When we return from
interrupt, the cmpxchg_local loop will fail because the sequence number
has changed.

In short, the we also use the versioning to check for change of slab.

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 10:12 pm

Sounds good.

--
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 11:32 pm

Test results :

On a 3GHz Pentium 4, CONFIG_SMP=y, no preemption, single CPU, hyperthreading
Kernel 2.6.25-rc3 (complete config below)

Just to give away an overview of the results :

The "1. Kmalloc: Repeatedly allocate then free test", for a 32 bytes
allocation, takes :

10000 times kmalloc(32) -&gt; 191 cycles kfree -&gt; 179 cycles
10000 times kmalloc(32) -&gt; 93 cycles kfree -&gt; 172 cycles

For a speedup of 2 on the allocation and about the same speed for free.

For the alloc/free tests, we have a speedup of 3 for most of allocation
sizes. (as long as we are in the fast path)


* baseline : standard fast path, without freeoffset patch

Single thread testing                                            
=====================                                            
1. Kmalloc: Repeatedly allocate then free test                   

* baseline : standard fast path, without freeoffset patch

10000 times kmalloc(8) -&gt; 182 cycles kfree -&gt; 174 cycles         
10000 times kmalloc(16) -&gt; 185 cycles kfree -&gt; 175 cycles        
10000 times kmalloc(32) -&gt; 191 cycles kfree -&gt; 179 cycles        
10000 times kmalloc(64) -&gt; 208 cycles kfree -&gt; 184 cycles        
10000 times kmalloc(128) -&gt; 260 cycles kfree -&gt; 202 cycles       
10000 times kmalloc(256) -&gt; 367 cycles kfree -&gt; 277 cycles       
10000 times kmalloc(512) -&gt; 394 cycles kfree -&gt; 304 cycles       
10000 times kmalloc(1024) -&gt; 469 cycles kfree -&gt; 397 cycles      
10000 times kmalloc(2048) -&gt; 417 cycles kfree -&gt; 318 cycles      
10000 times kmalloc(4096) -&gt; 470 cycles kfree -&gt; 356 cycles      
10000 times kmalloc(8192) -&gt; 692 cycles kfree -&gt; 723 cycles      
10000 times kmalloc(16384) -&gt; 756 cycles kfree -&gt; 830 cycles     

* cmpxchg_local fast path with freeoffset patch

10000 times kmalloc(8) -&gt; 79 cycles kfree -&gt; 169 cycles          
10000 times kmalloc(16) -&gt; 83 cycles kfree -&gt; 169 cycles         
10000 times kmalloc(32) -&gt; 93 c...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 1:11 am

Then we do not need the page-&gt;end field anymore right? I will try 
to rediff your patch against current slab-mm and see how we can proceed 
from there.
--
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 9:28 am

Slub Freeoffset check overflow

Check for overflow of the freeoffset version number.

I just thought adding this check in CONFIG_SLUB_DEBUG makes sense. It's
really unlikely that enough interrupt handlers will nest over the slub
fast path, and each of them do about a million alloc/free on 32 bits or
a huge amount of alloc/free on 64 bits, but just in case, it seems good
to warn if we detect we are half-way to a version overflow.

Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@polymtl.ca&gt;
---
 mm/slub.c |   32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/mm/slub.c
===================================================================
--- linux-2.6-lttng.orig/mm/slub.c	2008-02-29 08:05:01.000000000 -0500
+++ linux-2.6-lttng/mm/slub.c	2008-02-29 08:16:13.000000000 -0500
@@ -1660,7 +1660,7 @@ static __always_inline void *slab_alloc(
  */
 
 #ifdef SLUB_FASTPATH
-	unsigned long freeoffset, newoffset;
+	unsigned long freeoffset, newoffset, resoffset;
 
 	c = get_cpu_slab(s, raw_smp_processor_id());
 	do {
@@ -1682,8 +1682,18 @@ static __always_inline void *slab_alloc(
 		newoffset = freeoffset;
 		newoffset &amp;= ~c-&gt;off_mask;
 		newoffset |= (unsigned long)object[c-&gt;offset] &amp; c-&gt;off_mask;
-	} while (cmpxchg_local(&amp;c-&gt;freeoffset, freeoffset, newoffset)
-			!= freeoffset);
+		resoffset = cmpxchg_local(&amp;c-&gt;freeoffset, freeoffset,
+				newoffset);
+#ifdef CONFIG_SLUB_DEBUG
+		/*
+		 * Just to be paranoid : warn if we detect that enough
+		 * allocations nested on top of us to get the counter to go
+		 * half-way to overflow. That would be insane to do that much
+		 * allocations in interrupt handers, but check it anyway.
+		 */
+		WARN_ON(resoffset - freeoffset &gt; -1UL &gt;&gt; 1);
+#endif
+	} while (resoffset != freeoffset);
 #else
 	unsigned long flags;
 
@@ -1822,7 +1832,7 @@ static __always_inline void slab_free(st
 	struct kmem_cache_cpu *c;
 
 #ifdef SLUB_FASTPAT...
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, March 4, 2008 - 2:17 am

Check for overflow of the freeoffset version number.

I just thought adding this check in CONFIG_SLUB_DEBUG makes sense. It's
really unlikely that enough interrupt handlers will nest over the slub
fast path, and each of them do about a million alloc/free on 32 bits or
a huge amount of alloc/free on 64 bits, but just in case, it seems good
to warn if we detect we are half-way to a version overflow.

Changelog :
- Mask out the LSB because of alloc fast path. See comment in source.

Signed-off-by: Mathieu Desnoyers &lt;mathieu.desnoyers@polymtl.ca&gt;
---
 mm/slub.c |   40 ++++++++++++++++++++++++++++++++++------
 1 file changed, 34 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/mm/slub.c
===================================================================
--- linux-2.6-lttng.orig/mm/slub.c	2008-03-04 00:59:01.000000000 -0500
+++ linux-2.6-lttng/mm/slub.c	2008-03-04 01:03:44.000000000 -0500
@@ -1660,7 +1660,7 @@ static __always_inline void *slab_alloc(
  */
 
 #ifdef SLUB_FASTPATH
-	unsigned long freeoffset, newoffset;
+	unsigned long freeoffset, newoffset, resoffset;
 
 	c = get_cpu_slab(s, raw_smp_processor_id());
 	do {
@@ -1682,8 +1682,22 @@ static __always_inline void *slab_alloc(
 		newoffset = freeoffset;
 		newoffset &amp;= ~c-&gt;off_mask;
 		newoffset |= (unsigned long)object[c-&gt;offset] &amp; c-&gt;off_mask;
-	} while (cmpxchg_local(&amp;c-&gt;freeoffset, freeoffset, newoffset)
-			!= freeoffset);
+		resoffset = cmpxchg_local(&amp;c-&gt;freeoffset, freeoffset,
+				newoffset);
+#ifdef CONFIG_SLUB_DEBUG
+		/*
+		 * Just to be paranoid : warn if we detect that enough free or
+		 * slow paths nested on top of us to get the counter to go
+		 * half-way to overflow. That would be insane to do that much
+		 * allocations/free in interrupt handers, but check it anyway.
+		 * Mask out the LSBs because alloc fast path does not increment
+		 * the sequence number, which may cause the overall values to go
+		 * backward.
+		 */
+		WARN_ON((resoffset &amp; ~c-&gt...
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, March 4, 2008 - 3:15 am

Could you rediff your work against the slab-mm branch of my VM tree on 
kernel.org?
--
To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 9:03 am

I guess you could replace page-&gt;end with the (expensive)
page_address(page) + 1 in new_slab.

You would also have to change slab_address() to use
page_address(page). Since this is used by check_valid_pointer, I am not
sure it's a good idea to use this slow operation there.

Or maybe you have a different alternative in mind ?

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 3:57 pm

page-&gt;end was introduced to have a unique pointer for each slab. The 
counter that you introduces can take over that role. So we could use a 
NULL pointer again as the end of the objects? Or some constant?

I am going to revert the change that introduced page-&gt;end for 2.6.25 since 
it seems that the new fastpath does not need this. It was specific to the 
old fastpath.

--
To: Eric Dumazet <dada1@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:38 pm

I think you may well be right. This looks like a good clue.

I'll do the revert. I wanted either a confirmation that reveting it 
actually fixes something, _or_ an actual bug description, and this seems 
to be a quite possible case of the latter.

			Linus
	
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:21 am

Hi Mathieu,


What is not re-entrant?


Yes, this can happen. Are you saying it is not safe to be in the
lockless path when an IRQ triggers?
--
To: Pekka Enberg <penberg@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>, Eric Dumazet <dada1@...>
Date: Tuesday, February 19, 2008 - 4:08 pm

incrementing the variable with a "++" when interrupts are not disabled.
It's not an atomic add and it's racy. The code within stat() does

It should be safe, but I think Eric pointed the correct problem in his
reply.

Thanks,

Mathieu

-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Wednesday, February 27, 2008 - 7:32 pm

Yes but that is only for used for statistics which can be racy. Note that 
the VM event statistics also can be racy.

--
To: Christoph Lameter <clameter@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Wednesday, February 27, 2008 - 9:57 pm

Doing ++ on a u32 _is_ atomic wrt interrupts on x86 and probably lots of
other architectures, so we're OK using unsigned there.  But on some other
architectures ++ on u32 is not atomic wrt interrutps, so they should use
atomic_t or some other arch-specific mechanism.

And guess what?  It's already all been done: local_t.
--
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Thursday, February 28, 2008 - 7:13 am

So you make implicit assumptions about how gcc compiles your ++? I am 
afraid you can't do that, gcc is absolutely free to turn variable++; 
statement to non-atomic sequence of instructions (memory-&gt;register load, 
increment register, for example).

-- 
Jiri Kosina
SUSE Labs

--
To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Thursday, February 28, 2008 - 4:14 am

hm, why should it be atomic wrt. irqs? There's nothing that keeps gcc 
from not doing an "incl memory_address" but do something like: "load 
memory address into regx, incl regx, ... store address into memory 
address" - and that's not atomic at that point.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Thursday, February 28, 2008 - 7:15 am

&gt; hm, why should it be atomic wrt. irqs? There's nothing that keeps gcc 

It isn't - we've got tty layer traces of serial counts going wrong that
rely on precisely that assumption.

Alan
--
To: Andrew Morton <akpm@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Wednesday, February 27, 2008 - 10:43 pm

local_t requires the disabling of preempt to work right.

The real solution here is cpu_alloc / cpu_ops. Per cpu operations work on 
an offset relative to the start of the per process cpu data area in some 
register. An increment can then be atomic vs. interrupt because it does 
the calculation of the address and the inc in one instruction. F.e.

gs: inc [percpu_offset]

Processor may change before and after without ill effects. So no preempt 
since the instruction will always reference the %%gs register that points 
to the percpu area of the currently executing processor.
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:38 am

Hi Mathieu,



Hmm. The barrier() in slab_free() looks fishy. The comment says it's
there to make sure we've retrieved c-&gt;freelist before c-&gt;page but then
it uses a _compiler barrier_ which doesn't affect the CPU and the
reads may still be re-ordered... Not sure if that matters here though.
--
To: Pekka Enberg <penberg@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:20 pm

No, no. The comment says that it's purely there to serialize an 
*interrupt*, and as such, a compiler-only barrier is sufficient (or the 
comment is wrong).

Interrupts are "totally ordered" within a cpu (of course, in theory a CPU 
might have speculative work etc reordering, but the CPU also guarantees 
that interrupt acts _as_if_ it was exact), so a compiler barrier is 
sufficient.

Of course, if we're talking about interrupts on another CPU, that's a 
different issue, but the fact is, in that case it's not about interrupts 
any more (might as well be other code just running normally on another 
CPU), and a barrier doesn't help, it needs real locking.

So that barrier is fine per se. Of course, the whole code (and/or just the 
comment!) may be buggered, but any CPU SMP-aware barriers shouldn't be 
relevant.

What's much more likely to be an issue is simply the fact that since the 
fastpath now accesses the per-cpu freelist without any locking, if there 
is *any* sequence what-so-ever that does it from another CPU and assumes 
the old locking behaviour, the list will be corrupted. And from a quick 
look-through, I certainly cannot guarantee that isn't the case.

There's still a lot of cases that do direct assignments to "c-&gt;freelist" 
without using a guaranteed atomic sequence. They *should* be safe if it's 
guaranteed that 
 (a) they always run with interrupts disabled
AND
 (b) 'c' is _always_ the "current CPU" list

but I can't quickly see that guarantee for either.

I'd happily just revert this thing, but it would be really good to have 
confirmation that it seems to matter. But Torsten's partial bisection 
seems to say that the quicklist thing went into -mm before the crash even 
started.

So:
 - it might be something else entirely
 - it might still be the local cmpxchg, just Torsten didn't happen to 
   notice it until later.
 - it might still be the local cmpxchg, but something else changed its 
   patterns to actually make it start triggering.

and in ...
To: Linus Torvalds <torvalds@...>
Cc: Pekka Enberg <penberg@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Ingo Molnar <mingo@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 3:27 pm

My new hackbench-testcase also killed 2.6.24-rc2-mm1, so I really

I tried the following three patches:

switching the barrier() for a smp_mb() in 2.6.25-rc2-mm1:
-&gt; crashed

reverting the FASTPATH-patch in 2.6.25-rc2:
-&gt; worked

only removed FAST_CMPXCHG_LOCAL from arch/x86/Kconfig
-&gt; worked

So all of these tests seem to confirm, that the bug is in the new SLUB fastpath.

Torsten
--
To: Linus Torvalds <torvalds@...>
Cc: Pekka Enberg <penberg@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:45 pm

yeah - my revert suggestions were all completely conditional on such 
type of test feedback.

Btw., i did trigger occasional SLUB crashes myself starting at around 
-rc1, on the order of one per 200-300 straight random bootups, and 
yesterday i did a 50-bootups series of a specific .config that crashed, 
to try to reproduce one of them but failed - so bisection was not an 
option and i had nothing concrete and repeatable to report either. I had 
a few complete lockups and only 3 usable backtraces - find them below.

Networking features in all of the backtraces - and so does the VFS. All 
of the crashes are on SMP - and given that 50% of the bootups are UP 
this gives us a 1:8 chance hint that this bug is SMP specific. (All the 
crashes are in distccd - that is what this build cluster does mainly so 
it's the main activity of the box - so they dont necessarily indicate 
anything workload specific.)

Earlier today i turned off local-cmpxchg and havent had a crash or hang 
since then - but at 200 bootups and 4-5 crashes in a week that's not 
conclusive yet. I think others might have workloads that trigger this 
bug more often.

	Ingo

----------------&gt;
mercury login: [  582.671916] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[  582.672334] 
[  582.672334] Pid: 3776, comm: distccd Not tainted (2.6.25-rc2 #5)
[  582.672334] EIP: 0060:[&lt;c0174fda&gt;] EFLAGS: 00010246 CPU: 0
[  582.672334] EIP is at kmem_cache_alloc+0x2a/0x90
[  582.672334] EAX: 00000000 EBX: 8000061c ECX: c069ed1c EDX: 01060002
[  582.672334] ESI: c0aeffc8 EDI: c1d11714 EBP: f6eddcdc ESP: f6eddcc4
[  582.672334]  DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[  582.672334] Process distccd (pid: 3776, ti=f6edc000 task=f508c000 task.ti=f6edc000)
[  582.672334] Stack: c06a3d48 f6eddce4 00000020 8000061c 0000066c c0aeffc8 f6eddcf8 c069ed1c 
[  582.672334]        00000000 00000020 8000061c f7ce6580 f7ce6580 f6eddd18 c045e7bb ffffffff 
[  582.672334]        00000000 f7f683e0 8000061c f52136c0 f7ce6580 f6eddd58 c0461de5 f5...
To: Linus Torvalds <torvalds@...>
Cc: Pekka Enberg <penberg@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:48 pm

i mean, today i've only done 200 randconfig bootups since i did the 
cmpxchg SLUB revert, and given the statistics of the bug (thousands of 
bootups and just 3 provable crashes) i cannot yet conclude that the bug 
is truly gone.

	Ingo
--
To: Pekka Enberg <penberg@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:55 am

find a fix patch for that below - most systems affected seem to be SMP 
ones.

If this (or my other patch) indeed solves the problem i'd still favor a 
full revert of the SLUB_FASTPATH (commit 1f84260c8ce3b1ce26d4), it looks 
quite un-cooked and quite un-tested for multiple independent reasons.

Sigh, why do i again have to be the messenger who brings the bad news to 
SLUB land, and again when poor Christoph went on vacation? :-/

	Ingo

--------------------------&gt;
Subject: SLUB: barrier fix
From: Ingo Molnar &lt;mingo@elte.hu&gt;

---
 mm/slub.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -1862,7 +1862,7 @@ static __always_inline void slab_free(st
 	debug_check_no_locks_freed(object, s-&gt;objsize);
 	do {
 		freelist = c-&gt;freelist;
-		barrier();
+		smp_mb();
 		/*
 		 * If the compiler would reorder the retrieval of c-&gt;page to
 		 * come before c-&gt;freelist then an interrupt could

-----------------&gt;
Subject: slub: fastpath optimization revert
From: Ingo Molnar &lt;mingo@elte.hu&gt;
Date: Tue Feb 19 15:46:37 CET 2008

revert:

  commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c
  Author: Christoph Lameter &lt;clameter@sgi.com&gt;
  Date:   Mon Jan 7 23:20:30 2008 -0800

      SLUB: Alternate fast paths using cmpxchg_local

it was causing problems (crashes) and was incomplete.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 mm/slub.c |   87 --------------------------------------------------------------
 1 file changed, 87 deletions(-)

Index: linux-x86.q/mm/slub.c
===================================================================
--- linux-x86.q.orig/mm/slub.c
+++ linux-x86.q/mm/slub.c
@@ -149,13 +149,6 @@ static inline void ClearSlabDebug(struct
 /* Enable to test recovery from slab corruption on boot */
 #undef SLUB_RESILIENCY_TEST
 
-/*
- * Currently fastpath is not support...
To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>, <yanmin_zhang@...>
Date: Tuesday, February 19, 2008 - 11:52 am

Torsten/Yamin, does this fix things for you? What about reverting commit 
1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c ("SLUB: Alternate fast paths 
using cmpxchg_local")?

			Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 8:36 pm

I'm busy in another issue and will test it ASAP. Sorry.

-yanmin


--
To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:08 pm

I tested it on my 3 x86-64 machines. The small fix to use smp_mb to replace
barrier in slab_free doesn't work. Kernel still crashed at the same place.

I will test the reverting patch.

-yanmin


--
To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Wednesday, February 20, 2008 - 2:53 am

Kernel with the reverting patch is ok.
I ran reboot/hackbench for more than 10 times on every one of my 3 x86-64 machines,
and kernel didn't crash.

-yanmin


--
To: <yanmin_zhang@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Wednesday, February 20, 2008 - 3:10 am

Great, Linus reverted the patch yesterday. Thanks for testing!
--
To: Pekka Enberg <penberg@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:57 am

the revert patch is below. (manually done due to other changes since 
1f84260c8ce3b1ce26d4 was commited, but trivial)

	Ingo

-----------------&gt;
Subject: slub: fastpath optimization revert
From: Ingo Molnar &lt;mingo@elte.hu&gt;
Date: Tue Feb 19 15:46:37 CET 2008

revert:

  commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c
  Author: Christoph Lameter &lt;clameter@sgi.com&gt;
  Date:   Mon Jan 7 23:20:30 2008 -0800

      SLUB: Alternate fast paths using cmpxchg_local

it was causing problems (crashes) and was incomplete.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 mm/slub.c |   87 --------------------------------------------------------------
 1 file changed, 87 deletions(-)

Index: linux-x86.q/mm/slub.c
===================================================================
--- linux-x86.q.orig/mm/slub.c
+++ linux-x86.q/mm/slub.c
@@ -149,13 +149,6 @@ static inline void ClearSlabDebug(struct
 /* Enable to test recovery from slab corruption on boot */
 #undef SLUB_RESILIENCY_TEST
 
-/*
- * Currently fastpath is not supported if preemption is enabled.
- */
-#if defined(CONFIG_FAST_CMPXCHG_LOCAL) &amp;&amp; !defined(CONFIG_PREEMPT)
-#define SLUB_FASTPATH
-#endif
-
 #if PAGE_SHIFT &lt;= 12
 
 /*
@@ -1514,11 +1507,6 @@ static void *__slab_alloc(struct kmem_ca
 {
 	void **object;
 	struct page *new;
-#ifdef SLUB_FASTPATH
-	unsigned long flags;
-
-	local_irq_save(flags);
-#endif
 	if (!c-&gt;page)
 		goto new_slab;
 
@@ -1541,9 +1529,6 @@ load_freelist:
 unlock_out:
 	slab_unlock(c-&gt;page);
 	stat(c, ALLOC_SLOWPATH);
-#ifdef SLUB_FASTPATH
-	local_irq_restore(flags);
-#endif
 	return object;
 
 another_slab:
@@ -1575,9 +1560,6 @@ new_slab:
 		c-&gt;page = new;
 		goto load_freelist;
 	}
-#ifdef SLUB_FASTPATH
-	local_irq_restore(flags);
-#endif
 	/*
 	 * No memory available.
 	 *
@@ -1619,34 +1601,6 @@ static __always_inline void *slab_alloc(
 {
 	void **object;
 	struct kmem_cache_cpu *c;
-
-/*
- * The SLUB_FASTPATH path is pr...
To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>, <yanmin_zhang@...>
Date: Tuesday, February 19, 2008 - 11:54 am

I am ok with this if someone can actually confirm it fixes things.

			Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 6:27 am

hm, it's bad for at least one other reason as well (which is probably 
unrelated to this crash):

 /*
  * Currently fastpath is not supported if preemption is enabled.
  */
 #if defined(CONFIG_FAST_CMPXCHG_LOCAL) &amp;&amp; !defined(CONFIG_PREEMPT)
 #define SLUB_FASTPATH
 #endif

such !PREEMPT exceptions tend to show "i didnt want to think too hard 
about the preemptible case so just turn it off" thinking.

Also, why isnt this "SLUB_FASTPATH" flag done in the Kconfig space?

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 9:02 am

Ingo, a comment in slub.c explains it :

/*
 * The SLUB_FASTPATH path is provisional and is currently disabled if the
 * kernel is compiled with preemption or if the arch does not support
 * fast cmpxchg operations. There are a couple of coming changes that will
 * simplify matters and allow preemption. Ultimately we may end up making
 * SLUB_FASTPATH the default.
 *
 * 1. The introduction of the per cpu allocator will avoid array lookups
 *    through get_cpu_slab(). A special register can be used instead.
 *
 * 2. The introduction of per cpu atomic operations (cpu_ops) means that
 *    we can realize the logic here entirely with per cpu atomics. The
 *    per cpu atomic ops will take care of the preemption issues.
 */


Eventually, I think only CONFIG_FAST_CMPXCHG_LOCAL will be needed (when
the code will support preemption). Therefore, this SLUB_FASTPATH define
seems to be only here temporarily.

I'm looking at the code right now.. more to come.


-- 
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F  BA06 3F25 A8FE 3BAE 9A68
--
To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:00 am

well the feature is not complete and there are no reasons given _why_ 
it's not complete ... and even if there's a reason it should have been 
deferred to the next merge window. We still have 10 year old "this is a 
temporary hack" comments in the kernel ;-)

"hardware does not support it" is a valid argument, "kernel developer 
had no time to implement it properly" is not ;-)

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 6:45 am

Hi,



Hmm, no idea. I think might have been some mix-up with merging the
patch. The one I saw was:

http://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24/2.6.24-mm1/broke...

But I don't remember giving out a Reviewed-by for it (and my mailbox
confirms that). Furthermore, somehow it turned into this when merged:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=1f8426...

In any case, if Torsten/someone can verify that reverting
1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c ("SLUB: Alternate fast paths
using cmpxchg_local") fixes these problems, I think we should just do
it and let Christoph sort it out when he gets back.

                               Pekka
--
To: Torsten Kaiser <just.for.lkml@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, February 18, 2008 - 7:54 pm

Is there any chance that you could try to bisect this, if it's repeatable 
enough for you? Even if you can't bisect it *all* the way, it would be 
really good to do a handful of bisection runs which should already 
hopefully narrow it down a bit more.

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 19, 2008 - 2:44 am

It's repeatable, but not in a really reliable way.
So to mark a kernel good I need to compile around 100 KDE packages,
and even then I'm not 100% sure, if it's good or if I was just lucky.

But I did a partly bisect against 2.6.24-rc6-mm1:
2.6.24-rc6 + mm-patches up to (including) git.nfsd -&gt; worked
2.6.24-rc6 + mm-patches up to (including) git.xfs -&gt; crashed

I think the only added patch between rc2-mm1 and rc3-mm2 in that range
where the iommu changes that I later ruled out.
That leaves some git trees as suspects:
git-ocfs2.patch
git-selinux.patch
git-s390.patch
git-sched.patch
git-sh.patch
git-scsi-misc.patch
git-unionfs.patch
git-v9fs.patch
git-watchdog.patch
git-wireless.patch
git-ipwireless_cs.patch
git-x86.patch
git-xfs.patch

(see http://marc.info/?l=linux-kernel&amp;m=120276641105256 )

Torsten
--
To: Torsten Kaiser <just.for.lkml@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 4:25 pm

There's the Bugzilla entry for it at
http://bugzilla.kernel.org/show_bug.cgi?id=9973

Please update it with the current information.

Thanks,
Rafael
--
To: Rafael J. Wysocki <rjw@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 5:32 pm

Crash for 2.6.25-rc2-mm1 added. That one had a complete stacktrace,
but the trace looks like others I already reported, so no real new
information... :-(

Torsten
--
To: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 3:14 pm

This is a multi-part message in MIME format.
--------------040606080104080609040505
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: quoted-printable

2.6.25-rc2 fails to bring up my openSUSE 10.3 PC because LVM
cannot find the volume group containing the root file system.
2.6.25-rc1 has the same problem, 2.6.24 works fine.

Some possibly relevant messages copied manually from the screen:

--------8&lt;--------8&lt;--------8&lt;--------8&lt;--------8&lt;--------8&lt;--------
Driver 'sd' needs updating - please use bus_type methods
sd 0:0:0:0: [sda] ...
  sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sd 1:0:0:0: [sdb] ...
  sda: sdb1
sd 0:0:0:0: [sdb] Attached SCSI disk
md: linear personality registered for level -1
   Volume group "system" not found
   Volume group "system" not found
Trying manual resume from /dev/sda2
Invoking userspace resume from /dev/sda2
boot/82-resume.userspace.sh: line 48: /proc/splash: No such file or direc=
tory
resume: could not stat configuration file
resume: libcrypt version: 1.2.4
Trying manual resume from /dev/sda2
PM: Starting manual resume from disk

Waiting for device /dev/system/root to appear: . Volume group "system" no=
t found
   Volume group "system" not found
[...]
   Volume group "system" not found
CPA self-test:
  4k 1024 large 223 gb 0 x 1247[c0000000-f7c00000] miss 0
  4k 189440 large 39 gb 0 x 189479[c0000000-f7c00000] miss 0
  4k 189440 large 39 gb 0 x 189479[c0000000-f7c00000] miss 0
ok.
=2E  Volume group "system" not found
[...]
   Volume group "system" not found
Could not find /dev/system/root.
Want me to fall back to /dev/system/root? (Y/n)
--------8&lt;--------8&lt;--------8&lt;--------8&lt;--------8&lt;--------8&lt;--------

Config attached. Architecture is x86 32bit. The two SATA disks are
hanging off an ICH6. More detail on request.

Anything obvious? Bisect?

--=20
Tilman Schmidt                          E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachrich...
To: Tilman Schmidt <tilman@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 4:12 pm

On Sat, 16 Feb 2008 20:14:30 +0100

Compile in SCSI disk support. Modular even if loaded in initrd it seems
to have broken somewhere.
--
To: Alan Cox <alan@...>
Cc: Tilman Schmidt <tilman@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 6:37 pm

# CONFIG_SYSFS_DEPRECATED is not set

I would suspect this.

Try to upgrade to at least lvm 2.02.29 (I guess this is the first version which 
understands the new sysfs layout).
--
To: Jiri Slaby <jirislaby@...>
Cc: Alan Cox <alan@...>, Tilman Schmidt <tilman@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, February 18, 2008 - 9:53 pm

IMHO That should be *set* by default until everyone has had time to
update their userspace software to cope with the changed sysfs layout.

Alasdair
-- 
agk@redhat.com
--
To: Alasdair G Kergon <agk@...>, Jiri Slaby <jirislaby@...>, Alan Cox <alan@...>, Tilman Schmidt <tilman@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Cc: Alexey Dobriyan <adobriyan@...>, Ingo Molnar <mingo@...>
Date: Tuesday, February 19, 2008 - 4:56 am

[added CCs from the other thread on this topic]


It *is* set by default.

The root cause of the trouble is that its semantics are changing.
At one point in time (sorry, don't remember which kernel release
exactly) I tested whether the openSUSE 10.3 userspace supported
a CONFIG_SYSFS_DEPRECATED=3Dn kernel and found that it did. From
then on, "make oldconfig" would carry that setting over to every
new kernel I built, which was fine while the meaning of this
setting - ie. the difference in sysfs layout it controlled -
stayed the same.

With commit edfaa7c36574f1bf09c65ad602412db9da5f96bf however, the
sysfs layout changed again, so the same CONFIG_SYSFS_DEPRECATED
setting now controls a different difference (argh) in sysfs
layout. That kind of situation is not handled very well by
"make oldconfig", which basically starts from the assumption
that a setting that was ok for the previous kernel version is
still ok for the new one.

I see two ways of avoiding that problem: either create a new
backward compatibility config setting for that new sysfs change,
or create a way of telling "make oldconfig" that the semantics of
CONFIG_SYSFS_DEPRECATED have changed and it should ask the user
for that again even if there is a previous setting.

HTH
T.

--=20
Tilman Schmidt                    E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)
To: Jiri Slaby <jirislaby@...>
Cc: Alan Cox <alan@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 8:57 pm

Bisection says:

edfaa7c36574f1bf09c65ad602412db9da5f96bf is first bad commit
commit edfaa7c36574f1bf09c65ad602412db9da5f96bf
Author: Kay Sievers &lt;kay.sievers@vrfy.org&gt;
Date:   Mon May 21 22:08:01 2007 +0200

     Driver core: convert block from raw kobjects to core devices

     This moves the block devices to /sys/class/block. It will create a
     flat list of all block devices, with the disks and partitions in one=

     directory. For compatibility /sys/block is created and contains syml=
inks
     to the disks.

Apparently, compatibility is in the eye of the beholder - in this

Setting

CONFIG_SCSI=3Dy
CONFIG_BLK_DEV_SD=3Dy


Setting

CONFIG_SYSFS_DEPRECATED=3Dy

does indeed fix the problem and allows me to boot successfully.

I'll have to investigate how to do that without breaking anything.

HTH
T.

--=20
Tilman Schmidt                          E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)
To: Tilman Schmidt <tilman@...>
Cc: Jiri Slaby <jirislaby@...>, Alan Cox <alan@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 9:22 pm

I faced the same problem, but resolved with ...

vgscan
vgchange -a y

Also, ensure you set "write_cache_state = 1" in /etc/lvm.conf before
running the above.

Let me know if this helps.


Thanks,
Jeff.
--