Re: Linux 2.6.25-rc2

Previous thread: Re: data corruption with dmcrypt/LUKS by Christoph Anton Mitterer on Friday, February 15, 2008 - 5:25 pm. (1 message)

Next thread: USB regression (and other failures) in 2.6.2[45]* by Andrew Buehler on Friday, February 15, 2008 - 5:45 pm. (26 messages)
To: Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 15, 2008 - 5:23 pm

Ok,
this kernel is a winner.

Just to show how _much_ of a winner it is, it's been awarded a coveted
"weasel" series name, which should tell you just how good it's going to
be. It's a name revered in Linux kernel history, and as such this brings
back the good old days where if you find a bug, you're almost certainly
simply mistaken, and you probably just did something wrong.

But hey, you can try to prove me wrong. I dare you.

The full shortlog is appended, but instead of a diffstat, I'm doing the my
git "dirstat" output which gives a nicer high-level view. This was
generated with

git diff --cumulative --dirstat=2 --shortstat v2.6.25-rc1..v2.6.25-rc2

and actually gives a pretty good overview. In particular, it shows that
almost exactly half of the updates are to drivers, with network drivers
alone being a third of the whole patch. And of the remaining half, about
half was architecture updates, notably to SH.

What a cool git feature, if I say so myself. It allows me an extra half
hour a week of just sitting back and sipping my foofy tropical drink
(bringing my total up to about 168.5 hours of drunken stupor a week),
since now I don't even have to try to make sense of the diffstat manually
any more. Anyway, here it is:

2.1% Documentation/
3.7% arch/cris/
7.0% arch/sh/configs/
4.4% arch/sh/kernel/
4.9% arch/sh/mm/
17.8% arch/sh/
23.8% arch/
33.5% drivers/net/
6.0% drivers/scsi/lpfc/
7.1% drivers/scsi/
4.5% drivers/sh/maple/
49.5% drivers/
8.1% fs/
2.5% include/linux/
4.5% include/
7.2% kernel/
2.0% net/
509 files changed, 14470 insertions(+), 6986 deletions(-)

which you have to admit looks very managerial, even if you have no clue
what the heck it really contains (which is also very managerial). Now I
just need to turn these reports into some colored powerpoint slides with
pie charts, and then I can turn off my brain _completely_.

But if you actually care for anything but ...

To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 5:38 pm

Sadly not for me:
[ 5282.056415] ------------[ cut here ]------------
[ 5282.059757] kernel BUG at lib/list_debug.c:33!
[ 5282.062055] invalid opcode: 0000 [1] SMP
[ 5282.062055] CPU 3
[ 5282.062055] Modules linked in: radeon drm w83792d ipv6 tuner
tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx tea5761
tvaudio msp3400 bttv videodev v4l1_compat ir_common compat_ioctl32
v4l2_common videobuf_dma_sg videobuf_core btcx_risc tveeprom usbhid
pata_amd i2c_nforce2 hid sg
[ 5282.062055] Pid: 12937, comm: sed Not tainted 2.6.25-rc2 #1
[ 5282.062055] RIP: 0010:[<ffffffff803bffe4>]
-> then the output from the serial console stopped. I was in X, so I
could not see, if there was anything more on the real console.

(gdb) list *0xffffffff803bffe4
0xffffffff803bffe4 is in __list_add (lib/list_debug.c:33).
28 }
29 if (unlikely(prev->next != next)) {
30 printk(KERN_ERR "list_add corruption.
prev->next should be "
31 "next (%p), but was %p. (prev=%p).\n",
32 next, prev->next, prev);
33 BUG();
34 }
35 next->prev = new;
36 new->next = next;
37 new->prev = prev;

For more on this problem see http://marc.info/?l=linux-kernel&m=120293042005445

I will now try 2.6.25-rc2-mm1.

Torsten
--

To: Torsten Kaiser <just.for.lkml@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 19, 2008 - 2:11 am

hm. Your crashes do seem to span multiple subsystems, but it always
seems to be around the SLUB code. Could you try the patch below? The
SLUB code has a new optimization and i'm not 100% sure about it. [the
hack below switches the SLUB optimization off by disabling the CPU
feature it relies on.]

Ingo

------------->
arch/x86/Kconfig | 4 ----
1 file changed, 4 deletions(-)

Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -59,10 +59,6 @@ config HAVE_LATENCYTOP_SUPPORT
config SEMAPHORE_SLEEPERS
def_bool y

-config FAST_CMPXCHG_LOCAL
- bool
- default y
-
config MMU
def_bool y

--

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 19, 2008 - 2:54 am

$ grep FAST_CMPXCHG_LOCAL */.config
linux-2.6.24-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc3-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc3-mm2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc6-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc8-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y

-rc2-mm1 still worked for me.

Did you mean the new SLUB_FASTPATH?
$ grep "define SLUB_FASTPATH" */mm/slub.c
linux-2.6.25-rc1/mm/slub.c:#define SLUB_FASTPATH
linux-2.6.25-rc2-mm1/mm/slub.c:#define SLUB_FASTPATH
linux-2.6.25-rc2/mm/slub.c:#define SLUB_FASTPATH

The 2.6.24-rc3+ mm-kernels did crash for me, but don't seem to contain this...

On the other hand:
From the crash in 2.6.25-rc2-mm1:
[59987.116182] RIP [<ffffffff8029f83d>] kmem_cache_alloc_node+0x6d/0xa0

(gdb) list *0xffffffff8029f83d
0xffffffff8029f83d is in kmem_cache_alloc_node (mm/slub.c:1646).
1641 if (unlikely(is_end(object) || !node_match(c, node))) {
1642 object = __slab_alloc(s, gfpflags,
node, addr, c);
1643 break;
1644 }
1645 stat(c, ALLOC_FASTPATH);
1646 } while (cmpxchg_local(&c->freelist, object, object[c->offset])
1647
!= object);
1648 #else
1649 unsigned long flags;
1650

That code is part for SLUB_FASTPATH.

I'm willing to test the patch, but don't know how fast I can find the
time to do it, so my answer if your patch helps might be delayed until
the weekend.

Torsten
--

To: Torsten Kaiser <just.for.lkml@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 3:21 am

Mathieu, Christoph is on vacation and I'm not at all that familiar
with this cmpxchg_local() optimization, so if you could take a peek at
this bug report to see if you can spot something obviously wrong with
it, I would much appreciate that.
--

To: Pekka Enberg <penberg@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:02 am

Sure,

Initial thoughts :

I'd like to get the complete config causing this bug. I suspect either :

- A race between the lockless algo and an IRQ in a driver allocating
memory.
- stat(c, ALLOC_FASTPATH); seems to be using a var++, therefore
indicating it is not reentrant if IRQs are disabled. Since those are
only stats, I guess it's ok, but still weird.
- CPU hotplug problem.
http://bugzilla.kernel.org/attachment.cgi?id=14877&action=view shows
last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
-- is this linked to a cpu up/down event ?

Since this shows mostly with network card drivers, I think the most
plausible cause would be an IRQ nesting over kmem_cache_alloc_node and
calling it.

Will dig further...

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 2:39 pm

# CONFIG_HOTPLUG_CPU is not set

I expect my system would get rather angry if I try to remove a cpu
while it is running. ;)

And as I don't use suspend, I wanted to stay away from any bugs that

Thanks to yanmin crashing his machine with hackbench I now have a much
better testcase than recompiling KDE.
Starting hackbench will result in an nearly instant crash, if there is
a floodping running against the system.

This new testcase also showed that 2.6.24-rc2-mm1 can also be crashed
in this way.

I will try the patches from this tread against this new testcase and
report the results.

Torsten

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:27 pm

On Tue, 19 Feb 2008 09:02:30 -0500

I wonder how SLUB_FASTPATH is supposed to work, since it is affected by
a classical ABA problem of lockless algo.

cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed,
while an interrupt came (on this cpu), and several allocations were done,
and one free was performed at the end of this interruption, so 'object'
was recycled.

c->freelist can then contain the previous value (object), but
object[c->offset] was changed by IRQ.

We then put back in freelist an already allocated object.

--

To: Eric Dumazet <dada1@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 7:32 pm

Right. The alloc fastpath assumes that the object is not modified while it
is allocated...

Du! This shoots my nice scheme in the foot and there does not seem to be
an easy way to fix it. This means we also need to revert the
page->end patch. Useless if we do not have the cmpxchg_local.

--

To: Eric Dumazet <dada1@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 4:03 pm

I think you are right. A way to fix this would use the fact that the
freelist is only useful to point to the first free object in a page. We
could change it to an offset rather than an address.

The freelist would become a counter of type "long" which increments
until it wraps at 2^32 or 2^64. A PAGE_MASK bitmask could then be used
to get low order bits which would get the page offset of the first free
object, while the high order bits would insure we can detect this type
of object reuse when doing a cmpxchg. Upon free, the freelist counter
should always be incremented; this would be provided by adding PAGE_SIZE
to the counter and setting the LSBs to the correct offset.

On 32 bits architectures, it would allow 2^(32-12) = 1048576
allocations-free pairs to be done within interrupt handlers interrupting
a cmpxchg_local before the counter wraps. On 64 bits archs, it would
give 2^(64-12) = 2^52 allocation-free.

If we assume kmalloc/kfree pairs can take up to 1000 cycles to execute
(see
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...)
(3GHz Pentium 4) (therefore 333ns)
an interrupt handler doing 1048576 kmalloc/kfree would run for 0.35
seconds. An interrupt handler taking that much time would lead to other
OS problems (potential scheduler problems), so this quantity of
available allocations before a wrap-around looks safe. However, making
sure preemption is disabled would be safer here, since we cannot bound
the number of allocations that can be done by other processes if we are
scheduled out in the middle of the cmpxchg_local loop.

A revert is indeed the right solution at this stage until a corrected
version is ready.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Wednesday, February 27, 2008 - 7:34 pm

Urgh.... That sounds way too complicated. Do you have an experimental
patch that would allow us to see the impact?

--

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 1:55 am

Yep, I have an experimental patch ready. It would need some thorough
testing though. I seems to run fine on an x86_32 with and without
preemption (therefore with and without the slub cmpxchg_local fastpath).

You should first revert commit 00e962c5408b9f2d0bebd2308673fe982cb9a5fe
(this is the slub fastpath revert itself, currently in 2.6.25-rc3)

And then apply :

Implement slub fastpath in terms of freebase and freeoffset

It allows the cmpxchg_local to detect object re-use by keeping a counter in the
freeoffset MSBs.

freeoffset & PAGE_MASK is the offset of the freelist element within the page.
freeoffset & ~PAGE_MASK is the counter to detect object re-use.
freebase is the address of the page in this the object is located. It is a
multiple of PAGE_SIZE.

Whenever an object is freed in the cpu slab cache, the counter is incremented.
Whenever the alloc/free slow paths are modifying the freeoffset or freebase, the
freeoffset counter is also incremented. It is used to make sure we know if
freebase has been modified in an interrupt nested over the fast path.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 35 +++++++++++++++++++
mm/slub.c | 84 +++++++++++++++++++++++++++++++++--------------
2 files changed, 93 insertions(+), 26 deletions(-)

Index: linux-2.6-lttng/include/linux/slub_def.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/slub_def.h 2008-02-27 20:40:48.000000000 -0500
+++ linux-2.6-lttng/include/linux/slub_def.h 2008-02-28 00:17:28.000000000 -0500
@@ -32,7 +32,18 @@ enum stat_item {
NR_SLUB_STAT_ITEMS };

struct kmem_cache_cpu {
- void **freelist; /* Pointer to first free per cpu object */
+ unsigned long freeoffset; /*
+ * Offset of the first free per cpu
+ * object. The MSBs are used as a
+ * counter which must be incremented
+ * each time an ...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 3:08 pm

What is the performance impact?

page_address(c->page) + (c->freeoffset & (PAGE_MASK << s->order)) ?

How is this going to show up as an atomic update? You modify multiple per

The original code is wrong. It should be

if (is_end(c->freelist))

That use of set_freelist is safe since interrupts are disabled. So the

How does this work? You replace the freeoffset with the the once
incremented by the object offset??? The objects may not be in sequence on
the freelist and the increment wont get you to the next object anyways
since c->offset is usually 0.

--

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 7:25 pm

I guess it would be more :

page_address(c->page) | (c->freeoffset & c->off_mask)

where c->off_mask = (PAGE_SIZE << s->order) - 1;

Note the "from the fastpath point of view", which means that if an
interrupt nests over the cmpxchg_local fastpath and uses
set_freelist_ptr, all the operations will be executed before the
interrupt returns to the cmpxchg_local fastpath, therefore showing the
update as being done "atomically" from the cmpxchg_local fastpath point
of view.

Anyway, I am changing this for :

static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
{
unsigned long offset;

offset = c->freeoffset + c->off_mask + 1;
offset &= ~c->off_mask;
offset |= (unsigned long)freelist & c->off_mask;
c->freeoffset = offset;
}

Which will insure c->freeoffset is always in a valid state, therefore

Should be ok now since we allow interrupts to nest over
set_freelist_ptr. But I think this is also protected by slab_lock, taken
both in the slow path and in deactivate_slab.

In any case, doing a

object = get_freelist_ptr(c);
set_freelist_ptr(c, object[c->offset]);

Like deactivate_slab does should be either protected by disabling

Hrm, there should be a 1 to 1 mapping between the old code and the new
code :

- } while (cmpxchg_local(&c->freelist, object, object[c->offset])
- != object);
+ /* No need to increment the MSB counter here, because only
+ * object free will lead to counter re-use. */
+ } while (cmpxchg_local(&c->freeoffset, freeoffset,
+ (freeoffset + (c->offset * sizeof(void *)))) != freeoffset);

With the difference that we use only offsets instead of the full object
address.

Ah! I think I found the issue. It should become :

newoffset = freeo...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 8:57 pm

Hmmm.. The cmpxchg_local was intended to check for a change of slab too
(addresses from different slabs are disjunct).
That is no longer possible since the upper bits are used by the
versioning? (Same issue in slab_alloc).

--

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 9:56 pm

Hrm, this is why I kept a separate "freebase", as a sort of

Here is the trick : if, on top of the cmpxchg_local, we have an
interrupt that calls slab_alloc or slab_free which causes a change of
slab, this interrupt will have to go through the slow path.

These slow paths use set_freelist_ptr() to update the c->freeoffset,
which also increments the sequence number. When we return from
interrupt, the cmpxchg_local loop will fail because the sequence number
has changed.

In short, the we also use the versioning to check for change of slab.

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 10:12 pm

Sounds good.

--

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Thursday, February 28, 2008 - 11:32 pm

Test results :

On a 3GHz Pentium 4, CONFIG_SMP=y, no preemption, single CPU, hyperthreading
Kernel 2.6.25-rc3 (complete config below)

Just to give away an overview of the results :

The "1. Kmalloc: Repeatedly allocate then free test", for a 32 bytes
allocation, takes :

10000 times kmalloc(32) -> 191 cycles kfree -> 179 cycles
10000 times kmalloc(32) -> 93 cycles kfree -> 172 cycles

For a speedup of 2 on the allocation and about the same speed for free.

For the alloc/free tests, we have a speedup of 3 for most of allocation
sizes. (as long as we are in the fast path)

* baseline : standard fast path, without freeoffset patch

Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test

* baseline : standard fast path, without freeoffset patch

10000 times kmalloc(8) -> 182 cycles kfree -> 174 cycles
10000 times kmalloc(16) -> 185 cycles kfree -> 175 cycles
10000 times kmalloc(32) -> 191 cycles kfree -> 179 cycles
10000 times kmalloc(64) -> 208 cycles kfree -> 184 cycles
10000 times kmalloc(128) -> 260 cycles kfree -> 202 cycles
10000 times kmalloc(256) -> 367 cycles kfree -> 277 cycles
10000 times kmalloc(512) -> 394 cycles kfree -> 304 cycles
10000 times kmalloc(1024) -> 469 cycles kfree -> 397 cycles
10000 times kmalloc(2048) -> 417 cycles kfree -> 318 cycles
10000 times kmalloc(4096) -> 470 cycles kfree -> 356 cycles
10000 times kmalloc(8192) -> 692 cycles kfree -> 723 cycles
10000 times kmalloc(16384) -> 756 cycles kfree -> 830 cycles

* cmpxchg_local fast path with freeoffset patch

10000 times kmalloc(8) -> 79 cycles kfree -> 169 cycles
10000 times kmalloc(16) -> 83 cycles kfree -> 169 cycles
10000 times kmalloc(32) -> 93 c...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 1:11 am

Then we do not need the page->end field anymore right? I will try
to rediff your patch against current slab-mm and see how we can proceed
from there.
--

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 9:28 am

Slub Freeoffset check overflow

Check for overflow of the freeoffset version number.

I just thought adding this check in CONFIG_SLUB_DEBUG makes sense. It's
really unlikely that enough interrupt handlers will nest over the slub
fast path, and each of them do about a million alloc/free on 32 bits or
a huge amount of alloc/free on 64 bits, but just in case, it seems good
to warn if we detect we are half-way to a version overflow.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
mm/slub.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/mm/slub.c
===================================================================
--- linux-2.6-lttng.orig/mm/slub.c 2008-02-29 08:05:01.000000000 -0500
+++ linux-2.6-lttng/mm/slub.c 2008-02-29 08:16:13.000000000 -0500
@@ -1660,7 +1660,7 @@ static __always_inline void *slab_alloc(
*/

#ifdef SLUB_FASTPATH
- unsigned long freeoffset, newoffset;
+ unsigned long freeoffset, newoffset, resoffset;

c = get_cpu_slab(s, raw_smp_processor_id());
do {
@@ -1682,8 +1682,18 @@ static __always_inline void *slab_alloc(
newoffset = freeoffset;
newoffset &= ~c->off_mask;
newoffset |= (unsigned long)object[c->offset] & c->off_mask;
- } while (cmpxchg_local(&c->freeoffset, freeoffset, newoffset)
- != freeoffset);
+ resoffset = cmpxchg_local(&c->freeoffset, freeoffset,
+ newoffset);
+#ifdef CONFIG_SLUB_DEBUG
+ /*
+ * Just to be paranoid : warn if we detect that enough
+ * allocations nested on top of us to get the counter to go
+ * half-way to overflow. That would be insane to do that much
+ * allocations in interrupt handers, but check it anyway.
+ */
+ WARN_ON(resoffset - freeoffset > -1UL >> 1);
+#endif
+ } while (resoffset != freeoffset);
#else
unsigned long flags;

@@ -1822,7 +1832,7 @@ static __always_inline void slab_free(st
struct kmem_cache_cpu *c;

#ifdef SLUB_FASTPAT...

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, March 4, 2008 - 2:17 am

Check for overflow of the freeoffset version number.

I just thought adding this check in CONFIG_SLUB_DEBUG makes sense. It's
really unlikely that enough interrupt handlers will nest over the slub
fast path, and each of them do about a million alloc/free on 32 bits or
a huge amount of alloc/free on 64 bits, but just in case, it seems good
to warn if we detect we are half-way to a version overflow.

Changelog :
- Mask out the LSB because of alloc fast path. See comment in source.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
mm/slub.c | 40 ++++++++++++++++++++++++++++++++++------
1 file changed, 34 insertions(+), 6 deletions(-)

Index: linux-2.6-lttng/mm/slub.c
===================================================================
--- linux-2.6-lttng.orig/mm/slub.c 2008-03-04 00:59:01.000000000 -0500
+++ linux-2.6-lttng/mm/slub.c 2008-03-04 01:03:44.000000000 -0500
@@ -1660,7 +1660,7 @@ static __always_inline void *slab_alloc(
*/

#ifdef SLUB_FASTPATH
- unsigned long freeoffset, newoffset;
+ unsigned long freeoffset, newoffset, resoffset;

c = get_cpu_slab(s, raw_smp_processor_id());
do {
@@ -1682,8 +1682,22 @@ static __always_inline void *slab_alloc(
newoffset = freeoffset;
newoffset &= ~c->off_mask;
newoffset |= (unsigned long)object[c->offset] & c->off_mask;
- } while (cmpxchg_local(&c->freeoffset, freeoffset, newoffset)
- != freeoffset);
+ resoffset = cmpxchg_local(&c->freeoffset, freeoffset,
+ newoffset);
+#ifdef CONFIG_SLUB_DEBUG
+ /*
+ * Just to be paranoid : warn if we detect that enough free or
+ * slow paths nested on top of us to get the counter to go
+ * half-way to overflow. That would be insane to do that much
+ * allocations/free in interrupt handers, but check it anyway.
+ * Mask out the LSBs because alloc fast path does not increment
+ * the sequence number, which may cause the overall values to go
+ * backward.
+ */
+ WARN_ON((resoffset & ~c-&gt...

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, March 4, 2008 - 3:15 am

Could you rediff your work against the slab-mm branch of my VM tree on
kernel.org?
--

To: Christoph Lameter <clameter@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 9:03 am

I guess you could replace page->end with the (expensive)
page_address(page) + 1 in new_slab.

You would also have to change slab_address() to use
page_address(page). Since this is used by check_valid_pointer, I am not
sure it's a good idea to use this slow operation there.

Or maybe you have a different alternative in mind ?

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Eric Dumazet <dada1@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Friday, February 29, 2008 - 3:57 pm

page->end was introduced to have a unique pointer for each slab. The
counter that you introduces can take over that role. So we could use a
NULL pointer again as the end of the objects? Or some constant?

I am going to revert the change that introduced page->end for 2.6.25 since
it seems that the new fastpath does not need this. It was specific to the
old fastpath.

--

To: Eric Dumazet <dada1@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:38 pm

I think you may well be right. This looks like a good clue.

I'll do the revert. I wanted either a confirmation that reveting it
actually fixes something, _or_ an actual bug description, and this seems
to be a quite possible case of the latter.

Linus

--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:21 am

Hi Mathieu,

What is not re-entrant?

Yes, this can happen. Are you saying it is not safe to be in the
lockless path when an IRQ triggers?
--

To: Pekka Enberg <penberg@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>, Eric Dumazet <dada1@...>
Date: Tuesday, February 19, 2008 - 4:08 pm

incrementing the variable with a "++" when interrupts are not disabled.
It's not an atomic add and it's racy. The code within stat() does

It should be safe, but I think Eric pointed the correct problem in his
reply.

Thanks,

Mathieu

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Wednesday, February 27, 2008 - 7:32 pm

Yes but that is only for used for statistics which can be racy. Note that
the VM event statistics also can be racy.

--

To: Christoph Lameter <clameter@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Wednesday, February 27, 2008 - 9:57 pm

Doing ++ on a u32 _is_ atomic wrt interrupts on x86 and probably lots of
other architectures, so we're OK using unsigned there. But on some other
architectures ++ on u32 is not atomic wrt interrutps, so they should use
atomic_t or some other arch-specific mechanism.

And guess what? It's already all been done: local_t.
--

To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Thursday, February 28, 2008 - 7:13 am

So you make implicit assumptions about how gcc compiles your ++? I am
afraid you can't do that, gcc is absolutely free to turn variable++;
statement to non-atomic sequence of instructions (memory->register load,
increment register, for example).

--
Jiri Kosina
SUSE Labs

--

To: Andrew Morton <akpm@...>
Cc: Christoph Lameter <clameter@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Thursday, February 28, 2008 - 4:14 am

hm, why should it be atomic wrt. irqs? There's nothing that keeps gcc
from not doing an "incl memory_address" but do something like: "load
memory address into regx, incl regx, ... store address into memory
address" - and that's not atomic at that point.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Thursday, February 28, 2008 - 7:15 am

> hm, why should it be atomic wrt. irqs? There's nothing that keeps gcc

It isn't - we've got tty layer traces of serial counts going wrong that
rely on precisely that assumption.

Alan
--

To: Andrew Morton <akpm@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Eric Dumazet <dada1@...>
Date: Wednesday, February 27, 2008 - 10:43 pm

local_t requires the disabling of preempt to work right.

The real solution here is cpu_alloc / cpu_ops. Per cpu operations work on
an offset relative to the start of the per process cpu data area in some
register. An increment can then be atomic vs. interrupt because it does
the calculation of the address and the inc in one instruction. F.e.

gs: inc [percpu_offset]

Processor may change before and after without ill effects. So no preempt
since the instruction will always reference the %%gs register that points
to the percpu area of the currently executing processor.
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:38 am

Hi Mathieu,

Hmm. The barrier() in slab_free() looks fishy. The comment says it's
there to make sure we've retrieved c->freelist before c->page but then
it uses a _compiler barrier_ which doesn't affect the CPU and the
reads may still be re-ordered... Not sure if that matters here though.
--

To: Pekka Enberg <penberg@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Ingo Molnar <mingo@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:20 pm

No, no. The comment says that it's purely there to serialize an
*interrupt*, and as such, a compiler-only barrier is sufficient (or the
comment is wrong).

Interrupts are "totally ordered" within a cpu (of course, in theory a CPU
might have speculative work etc reordering, but the CPU also guarantees
that interrupt acts _as_if_ it was exact), so a compiler barrier is
sufficient.

Of course, if we're talking about interrupts on another CPU, that's a
different issue, but the fact is, in that case it's not about interrupts
any more (might as well be other code just running normally on another
CPU), and a barrier doesn't help, it needs real locking.

So that barrier is fine per se. Of course, the whole code (and/or just the
comment!) may be buggered, but any CPU SMP-aware barriers shouldn't be
relevant.

What's much more likely to be an issue is simply the fact that since the
fastpath now accesses the per-cpu freelist without any locking, if there
is *any* sequence what-so-ever that does it from another CPU and assumes
the old locking behaviour, the list will be corrupted. And from a quick
look-through, I certainly cannot guarantee that isn't the case.

There's still a lot of cases that do direct assignments to "c->freelist"
without using a guaranteed atomic sequence. They *should* be safe if it's
guaranteed that
(a) they always run with interrupts disabled
AND
(b) 'c' is _always_ the "current CPU" list

but I can't quickly see that guarantee for either.

I'd happily just revert this thing, but it would be really good to have
confirmation that it seems to matter. But Torsten's partial bisection
seems to say that the quicklist thing went into -mm before the crash even
started.

So:
- it might be something else entirely
- it might still be the local cmpxchg, just Torsten didn't happen to
notice it until later.
- it might still be the local cmpxchg, but something else changed its
patterns to actually make it start triggering.

and in ...

To: Linus Torvalds <torvalds@...>
Cc: Pekka Enberg <penberg@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Ingo Molnar <mingo@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 3:27 pm

My new hackbench-testcase also killed 2.6.24-rc2-mm1, so I really

I tried the following three patches:

switching the barrier() for a smp_mb() in 2.6.25-rc2-mm1:
-> crashed

reverting the FASTPATH-patch in 2.6.25-rc2:
-> worked

only removed FAST_CMPXCHG_LOCAL from arch/x86/Kconfig
-> worked

So all of these tests seem to confirm, that the bug is in the new SLUB fastpath.

Torsten
--

To: Linus Torvalds <torvalds@...>
Cc: Pekka Enberg <penberg@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:45 pm

yeah - my revert suggestions were all completely conditional on such
type of test feedback.

Btw., i did trigger occasional SLUB crashes myself starting at around
-rc1, on the order of one per 200-300 straight random bootups, and
yesterday i did a 50-bootups series of a specific .config that crashed,
to try to reproduce one of them but failed - so bisection was not an
option and i had nothing concrete and repeatable to report either. I had
a few complete lockups and only 3 usable backtraces - find them below.

Networking features in all of the backtraces - and so does the VFS. All
of the crashes are on SMP - and given that 50% of the bootups are UP
this gives us a 1:8 chance hint that this bug is SMP specific. (All the
crashes are in distccd - that is what this build cluster does mainly so
it's the main activity of the box - so they dont necessarily indicate
anything workload specific.)

Earlier today i turned off local-cmpxchg and havent had a crash or hang
since then - but at 200 bootups and 4-5 crashes in a week that's not
conclusive yet. I think others might have workloads that trigger this
bug more often.

Ingo

---------------->
mercury login: [ 582.671916] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 582.672334]
[ 582.672334] Pid: 3776, comm: distccd Not tainted (2.6.25-rc2 #5)
[ 582.672334] EIP: 0060:[<c0174fda>] EFLAGS: 00010246 CPU: 0
[ 582.672334] EIP is at kmem_cache_alloc+0x2a/0x90
[ 582.672334] EAX: 00000000 EBX: 8000061c ECX: c069ed1c EDX: 01060002
[ 582.672334] ESI: c0aeffc8 EDI: c1d11714 EBP: f6eddcdc ESP: f6eddcc4
[ 582.672334] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 582.672334] Process distccd (pid: 3776, ti=f6edc000 task=f508c000 task.ti=f6edc000)
[ 582.672334] Stack: c06a3d48 f6eddce4 00000020 8000061c 0000066c c0aeffc8 f6eddcf8 c069ed1c
[ 582.672334] 00000000 00000020 8000061c f7ce6580 f7ce6580 f6eddd18 c045e7bb ffffffff
[ 582.672334] 00000000 f7f683e0 8000061c f52136c0 f7ce6580 f6eddd58 c0461de5 f5...

To: Linus Torvalds <torvalds@...>
Cc: Pekka Enberg <penberg@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 12:48 pm

i mean, today i've only done 200 randconfig bootups since i did the
cmpxchg SLUB revert, and given the statistics of the bug (thousands of
bootups and just 3 provable crashes) i cannot yet conclude that the bug
is truly gone.

Ingo
--

To: Pekka Enberg <penberg@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:55 am

find a fix patch for that below - most systems affected seem to be SMP
ones.

If this (or my other patch) indeed solves the problem i'd still favor a
full revert of the SLUB_FASTPATH (commit 1f84260c8ce3b1ce26d4), it looks
quite un-cooked and quite un-tested for multiple independent reasons.

Sigh, why do i again have to be the messenger who brings the bad news to
SLUB land, and again when poor Christoph went on vacation? :-/

Ingo

-------------------------->
Subject: SLUB: barrier fix
From: Ingo Molnar <mingo@elte.hu>

---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -1862,7 +1862,7 @@ static __always_inline void slab_free(st
debug_check_no_locks_freed(object, s->objsize);
do {
freelist = c->freelist;
- barrier();
+ smp_mb();
/*
* If the compiler would reorder the retrieval of c->page to
* come before c->freelist then an interrupt could

----------------->
Subject: slub: fastpath optimization revert
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Feb 19 15:46:37 CET 2008

revert:

commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c
Author: Christoph Lameter <clameter@sgi.com>
Date: Mon Jan 7 23:20:30 2008 -0800

SLUB: Alternate fast paths using cmpxchg_local

it was causing problems (crashes) and was incomplete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/slub.c | 87 --------------------------------------------------------------
1 file changed, 87 deletions(-)

Index: linux-x86.q/mm/slub.c
===================================================================
--- linux-x86.q.orig/mm/slub.c
+++ linux-x86.q/mm/slub.c
@@ -149,13 +149,6 @@ static inline void ClearSlabDebug(struct
/* Enable to test recovery from slab corruption on boot */
#undef SLUB_RESILIENCY_TEST

-/*
- * Currently fastpath is not support...

To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>, <yanmin_zhang@...>
Date: Tuesday, February 19, 2008 - 11:52 am

Torsten/Yamin, does this fix things for you? What about reverting commit
1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c ("SLUB: Alternate fast paths
using cmpxchg_local")?

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 8:36 pm

I'm busy in another issue and will test it ASAP. Sorry.

-yanmin

--

To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:08 pm

I tested it on my 3 x86-64 machines. The small fix to use smp_mb to replace
barrier in slab_free doesn't work. Kernel still crashed at the same place.

I will test the reverting patch.

-yanmin

--

To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Wednesday, February 20, 2008 - 2:53 am

Kernel with the reverting patch is ok.
I ran reboot/hackbench for more than 10 times on every one of my 3 x86-64 machines,
and kernel didn't crash.

-yanmin

--

To: <yanmin_zhang@...>
Cc: Ingo Molnar <mingo@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Wednesday, February 20, 2008 - 3:10 am

Great, Linus reverted the patch yesterday. Thanks for testing!
--

To: Pekka Enberg <penberg@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:57 am

the revert patch is below. (manually done due to other changes since
1f84260c8ce3b1ce26d4 was commited, but trivial)

Ingo

----------------->
Subject: slub: fastpath optimization revert
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Feb 19 15:46:37 CET 2008

revert:

commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c
Author: Christoph Lameter <clameter@sgi.com>
Date: Mon Jan 7 23:20:30 2008 -0800

SLUB: Alternate fast paths using cmpxchg_local

it was causing problems (crashes) and was incomplete.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/slub.c | 87 --------------------------------------------------------------
1 file changed, 87 deletions(-)

Index: linux-x86.q/mm/slub.c
===================================================================
--- linux-x86.q.orig/mm/slub.c
+++ linux-x86.q/mm/slub.c
@@ -149,13 +149,6 @@ static inline void ClearSlabDebug(struct
/* Enable to test recovery from slab corruption on boot */
#undef SLUB_RESILIENCY_TEST

-/*
- * Currently fastpath is not supported if preemption is enabled.
- */
-#if defined(CONFIG_FAST_CMPXCHG_LOCAL) && !defined(CONFIG_PREEMPT)
-#define SLUB_FASTPATH
-#endif
-
#if PAGE_SHIFT <= 12

/*
@@ -1514,11 +1507,6 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
struct page *new;
-#ifdef SLUB_FASTPATH
- unsigned long flags;
-
- local_irq_save(flags);
-#endif
if (!c->page)
goto new_slab;

@@ -1541,9 +1529,6 @@ load_freelist:
unlock_out:
slab_unlock(c->page);
stat(c, ALLOC_SLOWPATH);
-#ifdef SLUB_FASTPATH
- local_irq_restore(flags);
-#endif
return object;

another_slab:
@@ -1575,9 +1560,6 @@ new_slab:
c->page = new;
goto load_freelist;
}
-#ifdef SLUB_FASTPATH
- local_irq_restore(flags);
-#endif
/*
* No memory available.
*
@@ -1619,34 +1601,6 @@ static __always_inline void *slab_alloc(
{
void **object;
struct kmem_cache_cpu *c;
-
-/*
- * The SLUB_FASTPATH path is pr...

To: Ingo Molnar <mingo@...>
Cc: Mathieu Desnoyers <mathieu.desnoyers@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>, <yanmin_zhang@...>
Date: Tuesday, February 19, 2008 - 11:54 am

I am ok with this if someone can actually confirm it fixes things.

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 6:27 am

hm, it's bad for at least one other reason as well (which is probably
unrelated to this crash):

/*
* Currently fastpath is not supported if preemption is enabled.
*/
#if defined(CONFIG_FAST_CMPXCHG_LOCAL) && !defined(CONFIG_PREEMPT)
#define SLUB_FASTPATH
#endif

such !PREEMPT exceptions tend to show "i didnt want to think too hard
about the preemptible case so just turn it off" thinking.

Also, why isnt this "SLUB_FASTPATH" flag done in the Kconfig space?

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 9:02 am

Ingo, a comment in slub.c explains it :

/*
* The SLUB_FASTPATH path is provisional and is currently disabled if the
* kernel is compiled with preemption or if the arch does not support
* fast cmpxchg operations. There are a couple of coming changes that will
* simplify matters and allow preemption. Ultimately we may end up making
* SLUB_FASTPATH the default.
*
* 1. The introduction of the per cpu allocator will avoid array lookups
* through get_cpu_slab(). A special register can be used instead.
*
* 2. The introduction of per cpu atomic operations (cpu_ops) means that
* we can realize the logic here entirely with per cpu atomics. The
* per cpu atomic ops will take care of the preemption issues.
*/

Eventually, I think only CONFIG_FAST_CMPXCHG_LOCAL will be needed (when
the code will support preemption). Therefore, this SLUB_FASTPATH define
seems to be only here temporarily.

I'm looking at the code right now.. more to come.

--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--

To: Mathieu Desnoyers <mathieu.desnoyers@...>
Cc: Pekka Enberg <penberg@...>, Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 10:00 am

well the feature is not complete and there are no reasons given _why_
it's not complete ... and even if there's a reason it should have been
deferred to the next merge window. We still have 10 year old "this is a
temporary hack" comments in the kernel ;-)

"hardware does not support it" is a valid argument, "kernel developer
had no time to implement it properly" is not ;-)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Torsten Kaiser <just.for.lkml@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Mathieu Desnoyers <mathieu.desnoyers@...>, Christoph Lameter <clameter@...>
Date: Tuesday, February 19, 2008 - 6:45 am

Hi,

Hmm, no idea. I think might have been some mix-up with merging the
patch. The one I saw was:

http://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24/2....

But I don't remember giving out a Reviewed-by for it (and my mailbox
confirms that). Furthermore, somehow it turned into this when merged:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...

In any case, if Torsten/someone can verify that reverting
1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c ("SLUB: Alternate fast paths
using cmpxchg_local") fixes these problems, I think we should just do
it and let Christoph sort it out when he gets back.

Pekka
--

To: Torsten Kaiser <just.for.lkml@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, February 18, 2008 - 7:54 pm

Is there any chance that you could try to bisect this, if it's repeatable
enough for you? Even if you can't bisect it *all* the way, it would be
really good to do a handful of bisection runs which should already
hopefully narrow it down a bit more.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Tuesday, February 19, 2008 - 2:44 am

It's repeatable, but not in a really reliable way.
So to mark a kernel good I need to compile around 100 KDE packages,
and even then I'm not 100% sure, if it's good or if I was just lucky.

But I did a partly bisect against 2.6.24-rc6-mm1:
2.6.24-rc6 + mm-patches up to (including) git.nfsd -> worked
2.6.24-rc6 + mm-patches up to (including) git.xfs -> crashed

I think the only added patch between rc2-mm1 and rc3-mm2 in that range
where the iommu changes that I later ruled out.
That leaves some git trees as suspects:
git-ocfs2.patch
git-selinux.patch
git-s390.patch
git-sched.patch
git-sh.patch
git-scsi-misc.patch
git-unionfs.patch
git-v9fs.patch
git-watchdog.patch
git-wireless.patch
git-ipwireless_cs.patch
git-x86.patch
git-xfs.patch

(see http://marc.info/?l=linux-kernel&m=120276641105256 )

Torsten
--

To: Torsten Kaiser <just.for.lkml@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 4:25 pm

There's the Bugzilla entry for it at
http://bugzilla.kernel.org/show_bug.cgi?id=9973

Please update it with the current information.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 5:32 pm

Crash for 2.6.25-rc2-mm1 added. That one had a complete stacktrace,
but the trace looks like others I already reported, so no real new
information... :-(

Torsten
--

To: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 3:14 pm

This is a multi-part message in MIME format.
--------------040606080104080609040505
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: quoted-printable

2.6.25-rc2 fails to bring up my openSUSE 10.3 PC because LVM
cannot find the volume group containing the root file system.
2.6.25-rc1 has the same problem, 2.6.24 works fine.

Some possibly relevant messages copied manually from the screen:

--------8<--------8<--------8<--------8<--------8<--------8<--------
Driver 'sd' needs updating - please use bus_type methods
sd 0:0:0:0: [sda] ...
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sd 1:0:0:0: [sdb] ...
sda: sdb1
sd 0:0:0:0: [sdb] Attached SCSI disk
md: linear personality registered for level -1
Volume group "system" not found
Volume group "system" not found
Trying manual resume from /dev/sda2
Invoking userspace resume from /dev/sda2
boot/82-resume.userspace.sh: line 48: /proc/splash: No such file or direc=
tory
resume: could not stat configuration file
resume: libcrypt version: 1.2.4
Trying manual resume from /dev/sda2
PM: Starting manual resume from disk

Waiting for device /dev/system/root to appear: . Volume group "system" no=
t found
Volume group "system" not found
[...]
Volume group "system" not found
CPA self-test:
4k 1024 large 223 gb 0 x 1247[c0000000-f7c00000] miss 0
4k 189440 large 39 gb 0 x 189479[c0000000-f7c00000] miss 0
4k 189440 large 39 gb 0 x 189479[c0000000-f7c00000] miss 0
ok.
=2E Volume group "system" not found
[...]
Volume group "system" not found
Could not find /dev/system/root.
Want me to fall back to /dev/system/root? (Y/n)
--------8<--------8<--------8<--------8<--------8<--------8<--------

Config attached. Architecture is x86 32bit. The two SATA disks are
hanging off an ICH6. More detail on request.

Anything obvious? Bisect?

--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachrich...

To: Tilman Schmidt <tilman@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 4:12 pm

On Sat, 16 Feb 2008 20:14:30 +0100

Compile in SCSI disk support. Modular even if loaded in initrd it seems
to have broken somewhere.
--

To: Alan Cox <alan@...>
Cc: Tilman Schmidt <tilman@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Saturday, February 16, 2008 - 6:37 pm

# CONFIG_SYSFS_DEPRECATED is not set

I would suspect this.

Try to upgrade to at least lvm 2.02.29 (I guess this is the first version which
understands the new sysfs layout).
--

To: Jiri Slaby <jirislaby@...>
Cc: Alan Cox <alan@...>, Tilman Schmidt <tilman@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, February 18, 2008 - 9:53 pm

IMHO That should be *set* by default until everyone has had time to
update their userspace software to cope with the changed sysfs layout.

Alasdair
--
agk@redhat.com
--

To: Alasdair G Kergon <agk@...>, Jiri Slaby <jirislaby@...>, Alan Cox <alan@...>, Tilman Schmidt <tilman@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Cc: Alexey Dobriyan <adobriyan@...>, Ingo Molnar <mingo@...>
Date: Tuesday, February 19, 2008 - 4:56 am

[added CCs from the other thread on this topic]

It *is* set by default.

The root cause of the trouble is that its semantics are changing.
At one point in time (sorry, don't remember which kernel release
exactly) I tested whether the openSUSE 10.3 userspace supported
a CONFIG_SYSFS_DEPRECATED=3Dn kernel and found that it did. From
then on, "make oldconfig" would carry that setting over to every
new kernel I built, which was fine while the meaning of this
setting - ie. the difference in sysfs layout it controlled -
stayed the same.

With commit edfaa7c36574f1bf09c65ad602412db9da5f96bf however, the
sysfs layout changed again, so the same CONFIG_SYSFS_DEPRECATED
setting now controls a different difference (argh) in sysfs
layout. That kind of situation is not handled very well by
"make oldconfig", which basically starts from the assumption
that a setting that was ok for the previous kernel version is
still ok for the new one.

I see two ways of avoiding that problem: either create a new
backward compatibility config setting for that new sysfs change,
or create a way of telling "make oldconfig" that the semantics of
CONFIG_SYSFS_DEPRECATED have changed and it should ask the user
for that again even if there is a previous setting.

HTH
T.

--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)

To: Jiri Slaby <jirislaby@...>
Cc: Alan Cox <alan@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 8:57 pm

Bisection says:

edfaa7c36574f1bf09c65ad602412db9da5f96bf is first bad commit
commit edfaa7c36574f1bf09c65ad602412db9da5f96bf
Author: Kay Sievers <kay.sievers@vrfy.org>
Date: Mon May 21 22:08:01 2007 +0200

Driver core: convert block from raw kobjects to core devices

This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one=

directory. For compatibility /sys/block is created and contains syml=
inks
to the disks.

Apparently, compatibility is in the eye of the beholder - in this

Setting

CONFIG_SCSI=3Dy
CONFIG_BLK_DEV_SD=3Dy

Setting

CONFIG_SYSFS_DEPRECATED=3Dy

does indeed fix the problem and allows me to boot successfully.

I'll have to investigate how to do that without breaking anything.

HTH
T.

--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)

To: Tilman Schmidt <tilman@...>
Cc: Jiri Slaby <jirislaby@...>, Alan Cox <alan@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Sunday, February 17, 2008 - 9:22 pm

I faced the same problem, but resolved with ...

vgscan
vgchange -a y

Also, ensure you set "write_cache_state = 1" in /etc/lvm.conf before
running the above.

Let me know if this helps.

Thanks,
Jeff.
--

To: Jeff Chua <jeff.chua.linux@...>
Cc: Jiri Slaby <jirislaby@...>, Alan Cox <alan@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, February 18, 2008 - 6:35 am

Sorry, I'm not sure what to do with those two commands.
Running them once manually doesn't seem to change anything,

That was already set by default.

Thanks,
Tilman

--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)

To: <rusty@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Saturday, February 16, 2008 - 12:52 pm

1. Release Linux Kernel
2. Give it a weasel name
3. ??
ERROR: "LGUEST_PAGES_guest_gdt_desc" [drivers/lguest/lg.ko] undefined!
ERROR: "LGUEST_PAGES_host_gdt_desc" [drivers/lguest/lg.ko] undefined!
ERROR: "LGUEST_PAGES_host_cr3" [drivers/lguest/lg.ko] undefined!
...

.config see http://pastebin.ca/906358
--

To: Linux Kernel Mailing List <linux-kernel@...>
Cc: <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Jens Axboe <jens.axboe@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Saturday, February 16, 2008 - 2:10 am

Hi,

The softlockup is seen from 2.6.25-rc1-git{1,3} and is visible in the 2.6.24-rc2 kernel,
While booting up with the 2.6.25-rc1-git{1,3} and 2.6.25-rc2 kernel(s) on the powerbox

Loading st.ko module
BUG: soft lockup - CPU#1 stuck for 61s! [insmod:379]
NIP: c0000000001b0620 LR: c0000000001a5dcc CTR: 0000000000000040
REGS: c00000077caab8a0 TRAP: 0901 Not tainted (2.6.25-rc2-autotest)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 84004088 XER: 20000000
TASK = c00000077cb450a0[379] 'insmod' THREAD: c00000077caa8000 CPU: 1
GPR00: c00000077c9d4000 c00000077caabb20 c000000000538a40 000000000000000b
GPR04: ffc0000000000000 c00000077e0c0000 0000000000000036 000000000000000a
GPR08: 0040000000000000 c00000077c9d4250 c000000000000000 0000000000000000
GPR12: c00000077c9d4230 c000000000481d00
NIP [c0000000001b0620] .radix_tree_gang_lookup+0x100/0x1e4
LR [c0000000001a5dcc] .call_for_each_cic+0x50/0x10c
Call Trace:
[c00000077caabb20] [c0000000001a5e2c] .call_for_each_cic+0xb0/0x10c (unreliable)
[c00000077caabc60] [c00000000019dba4] .exit_io_context+0xf0/0x110
[c00000077caabcf0] [c000000000061e38] .do_exit+0x820/0x850
[c00000077caabda0] [c000000000061f34] .do_group_exit+0xcc/0xe8
[c00000077caabe30] [c00000000000872c] syscall_exit+0x0/0x40
Instruction dump:
7d296214 39290018 e8090000 7caa2038 39290008 2fa00000 409e0018 7caa4215
396b0001 418200cc 424000b8 4bffffdc <79691f24> 7d296214 e9690018 2fab0000
INFO: task insmod:387 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
insmod D 000000001000e144 12144 387 1
Call Trace:
[c00000077cb97600] [c0000000008fae80] 0xc0000000008fae80 (unreliable)
[c00000077cb977d0] [c000000000010c7c] .__switch_to+0x11c/0x154
[c00000077cb97860] [c000000000344498] .schedule+0x5d0/0x6b0
[c00000077cb97950] [c0000000003447d8] .schedule_timeout+0x3c/0xe8
[c00000077cb97a20] [c000000000343d34] .wait_for_common+0x150/0x22c
[c00000077cb97ae0] [c00000000008ef00] .__stop_machine_run+0xbc...

To: Kamalesh Babulal <kamalesh@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Jens Axboe <jens.axboe@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Sunday, February 17, 2008 - 4:08 pm

Can you update the Bugzilla entry at:
http://bugzilla.kernel.org/show_bug.cgi?id=9948
with the above information, please?

--
"Premature optimization is the root of all evil." - Donald Knuth
--

To: Kamalesh Babulal <kamalesh@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Sunday, February 17, 2008 - 3:29 pm

It's odd stuff. Could you perhaps try and add some printks to
block/cfq-iosched.c:call_for_each_cic(), like dumping the 'nr' return
from radix_tree_gang_lookup() and the pointer value of cics[i] in the
for() loop after the lookup?

How many SCSI devices are online?

--
Jens Axboe

--

To: Jens Axboe <jens.axboe@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 4:04 am

On Sun, 17 Feb 2008 20:29:13 +0100
I met the same issue on ia64/NUMA box.
seems cisc[]->key is NULL and index for radix_tree_gang_lookup() was always '1'.

Attached patch works well for me,
but I don't know much about cfq. please confirm.

Regards,
-Kame

==
cics[]->key can be NULL.
In that case, cics[]->dead_key has key value.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Index: linux-2.6.25-rc2/block/cfq-iosched.c
===================================================================
--- linux-2.6.25-rc2.orig/block/cfq-iosched.c
+++ linux-2.6.25-rc2/block/cfq-iosched.c
@@ -1171,7 +1171,11 @@ call_for_each_cic(struct io_context *ioc
break;

called += nr;
- index = 1 + (unsigned long) cics[nr - 1]->key;
+
+ if (!cics[nr - 1]->key)
+ index = 1 + (unsigned long) cics[nr - 1]->dead_key;
+ else
+ index = 1 + (unsigned long) cics[nr - 1]->key;

for (i = 0; i < nr; i++)
func(ioc, cics[i]);

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 4:36 am

Why does it keep repeating then? If ->key is NULL, the next lookup index
should be 1UL.

But I think the radix 'scan over entire tree' is a bit fragile. This
patch adds a parallel hlist for ease of properly browsing the members,

It doesn't make a lot of sense, I'm afraid.

block/blk-ioc.c | 35 +++++++++++++++--------------------
block/cfq-iosched.c | 37 +++++++++++--------------------------
include/linux/iocontext.h | 2 ++
3 files changed, 28 insertions(+), 46 deletions(-)

diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 80245dc..73c7002 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -17,17 +17,13 @@ static struct kmem_cache *iocontext_cachep;

static void cfq_dtor(struct io_context *ioc)
{
- struct cfq_io_context *cic[1];
- int r;
+ if (!hlist_empty(&ioc->cic_list)) {
+ struct cfq_io_context *cic;

- /*
- * We don't have a specific key to lookup with, so use the gang
- * lookup to just retrieve the first item stored. The cfq exit
- * function will iterate the full tree, so any member will do.
- */
- r = radix_tree_gang_lookup(&ioc->radix_root, (void **) cic, 0, 1);
- if (r > 0)
- cic[0]->dtor(ioc);
+ cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
+ cic_list);
+ cic->dtor(ioc);
+ }
}

/*
@@ -57,18 +53,16 @@ EXPORT_SYMBOL(put_io_context);

static void cfq_exit(struct io_context *ioc)
{
- struct cfq_io_context *cic[1];
- int r;
-
rcu_read_lock();
- /*
- * See comment for cfq_dtor()
- */
- r = radix_tree_gang_lookup(&ioc->radix_root, (void **) cic, 0, 1);
- rcu_read_unlock();

- if (r > 0)
- cic[0]->exit(ioc);
+ if (!hlist_empty(&ioc->cic_list)) {
+ struct cfq_io_context *cic;
+
+ cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
+ cic_list);
+ cic->exit(ioc);
+ }
+ rcu_read_unlock();
}

/* Called by the exitting task */
@@ -105,6 +99,7 @@ struct io_context *alloc_io_context(gfp_t gfp_f...

To: Jens Axboe <jens.axboe@...>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Friday, February 22, 2008 - 3:24 am

Even though io_contexts are fairly uncommon, adding more stuff to a data
structure was a pretty sad alternative to fixing a bug in
radix_tree_gang_lookup(), or to fixing a bug in a caller of it.

IOW: what exactly went wrong here??
--

To: Andrew Morton <akpm@...>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Friday, February 22, 2008 - 3:40 am

The cfq use of it, not the radix tree code! It juggled the keys and
wants to make sure that we see all users, modulo raced added ones (ok if

I could not convince myself that the current code would always do the
right thing. We should not have been seeing ->key == NULL entries in
there, it implied a double exit of that process. So I decided to fix it
by making the code a lot more readable (the patch in question deleted a
lot more than it added), at the cost of that hlist head + node.

--
Jens Axboe

--

To: Jens Axboe <jens.axboe@...>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 9:19 am

<snip>

Hi Jens,

Thanks for the patch. The patch works fine, machine boots up without the kernel panic.

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

To: Jens Axboe <jens.axboe@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 5:02 am

On Tue, 19 Feb 2008 09:36:34 +0100
Works well for me and my box booted !

Thanks,
-Kame

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 5:01 am

Super, I'll get it upstream. Thanks for testing and debugging!

--
Jens Axboe

--

To: Jens Axboe <jens.axboe@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 4:47 am

On Tue, 19 Feb 2008 09:36:34 +0100
when I inserted printk here
==
for (i = 0; i < nr; i++)
func(ioc, cics[i]);
printk("%d %lx\n", nr, index);
==
index was always "1" and nr was always 32.

will try. please wait a bit.

Thanks,
-Kame

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 4:58 am

Hang on, it returned 32? It should not return more than 16, since that
is what we have room for and asked for. Using ->dead_key when ->key is
NULL is correct btw, since that is the correct location in the tree once
the process has exited. But that should not happen until AFTER the

It boots here, so at least it passes normal sanity tests. It should
solve your problem as well, hopefully.

--
Jens Axboe

--

To: Jens Axboe <jens.axboe@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 5:07 am

On Tue, 19 Feb 2008 09:58:38 +0100
sorry. Of course, it was 16 ;(

your patch works well. thank you.

-Kame

--

To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linuxppc-dev@...>, Ingo Molnar <mingo@...>, Srivatsa Vaddagiri <vatsa@...>, Dhaval Giani <dhaval@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Tuesday, February 19, 2008 - 5:09 am

It's committed now and posted in the relevant bugzilla as well (#9948).

--
Jens Axboe

--

To: Linux Kernel Mailing List <linux-kernel@...>
Cc: <linux-ext4@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>
Date: Saturday, February 16, 2008 - 1:44 am

Hi,

The 2.6.25-rc2 kernel oopses while running dbench on ext3 filesystem
mounted with mount -o data=writeback,nobh option on the x86_64 box

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff80274972>] kmem_cache_alloc+0x3a/0x6c
PGD 1f6860067 PUD 1f5d64067 PMD 0
Oops: 0000 [1] SMP
CPU 3
Modules linked in:
Pid: 4271, comm: dbench Not tainted 2.6.25-rc2-autotest #1
RIP: 0010:[<ffffffff80274972>] [<ffffffff80274972>] kmem_cache_alloc+0x3a/0x6c
RSP: 0000:ffff8101fb041dc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810180033c00 RCX: ffffffff8027b269
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffffffff80632d70
RBP: 00000000000080d0 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8101feb36e50 R11: 0000000000000190 R12: 0000000000000001
R13: 0000000000000000 R14: ffff8101f8f38000 R15: 00000000ffffff9c
FS: 0000000000000000(0000) GS:ffff8101fff0f000(0063) knlGS:00000000f7e41460
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001f5620000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbench (pid: 4271, threadinfo ffff8101fb040000, task ffff8101fb180000)
Stack: 0000000000000001 ffff8101fb041ea8 0000000000000001 ffffffff8027b269
ffff8101fb041ea8 ffffffff80281fe8 0000000000000001 0000000000000000
ffff8101fb041ea8 00000000ffffff9c 000000000000000b 0000000000000001
Call Trace:
[<ffffffff8027b269>] get_empty_filp+0x55/0xf9
[<ffffffff80281fe8>] __path_lookup_intent_open+0x22/0x8f
[<ffffffff80282853>] open_namei+0x86/0x5a7
[<ffffffff8027d019>] vfs_stat_fd+0x3c/0x4a
[<ffffffff80279ab1>] do_filp_open+0x1c/0x3d
[<ffffffff80279c2c>] get_unused_fd_flags+0x79/0x111
[<ffffffff80279dce>] do_sys_open+0x46/0xca
[<ffffffff80221c82>] ia32_sysret+0x0/0xa

Code: 24 00 00 00 48 98 48 8b 9c c7 d8 02 00 00 48 8b 13 f6 c2 01 74 12...

To: Kamalesh Babulal <kamalesh@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, <linux-ext4@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>, Christoph Lameter <clameter@...>, <linux-mm@...>
Date: Monday, February 18, 2008 - 8:59 am

Looks to me like we broke slab. Christoph is offline until the 27th..
--

To: Andrew Morton <akpm@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-ext4@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>, Christoph Lameter <clameter@...>, <linux-mm@...>
Date: Monday, March 3, 2008 - 7:51 am

On Sat, 16 Feb 2008 11:14:46 +0530 Kamalesh Babulal

On Mon, Feb 18, 2008 at 2:59 PM, Andrew Morton

This is probably fixed by:

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...

As this is on the regression list, Kamalesh, can you please confirm
it's fixed now?

Pekka
--

To: Pekka Enberg <penberg@...>
Cc: Andrew Morton <akpm@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-ext4@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>, Christoph Lameter <clameter@...>, <linux-mm@...>
Date: Tuesday, March 4, 2008 - 12:03 am

Thanks, I tested the 2.6.25-rc3-git4 kernel and the oops is not reproducible. This commit seems to fix the kernel oops.

--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--

To: Andrew Morton <akpm@...>
Cc: Kamalesh Babulal <kamalesh@...>, Linux Kernel Mailing List <linux-kernel@...>, <linux-ext4@...>, Andy Whitcroft <apw@...>, Balbir Singh <balbir@...>, Christoph Lameter <clameter@...>, <linux-mm@...>
Date: Monday, February 18, 2008 - 10:25 am

Two x86-64 boxes here lock up here on 2.6.25-rc2, shortly after boot.
One running Fedora 8 + X (GNOME) and one a headless file server.
configs and lspci attached. Unable to capture any splatter so far.

Bisecting...

To: Jeff Garzik <jeff@...>
Cc: <akpm@...>, <apw@...>, <balbir@...>, <clameter@...>, <kamalesh@...>, <linux-ext4@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, February 18, 2008 - 12:11 pm

Sounds like it may be http://lkml.org/lkml/2008/2/17/78.

Suggest you try reverting that before doing the bisect.

Cheers,
FJP
--

To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Laszlo Attila Toth <panther@...>, David Miller <davem@...>
Date: Friday, February 15, 2008 - 10:08 pm

Here you go.

commit 45b503548210fe6f23e92b856421c2a3f05fd034
Author: Laszlo Attila Toth <panther@balabit.hu>
Date: Tue Feb 12 22:42:09 2008 -0800

[RTNETLINK]: Send a single notification on device state changes.

contains the following gem:

if (tb[IFLA_LINKMODE]) {
- write_lock_bh(&dev_base_lock);
- dev->link_mode = nla_get_u8(tb[IFLA_LINKMODE]);
- write_unlock_bh(&dev_base_lock);
+ if (dev->link_mode != nla_get_u8(tb[IFLA_LINKMODE])) {
+ write_lock_bh(&dev_base_lock);
+ dev->link_mode = nla_get_u8(tb[IFLA_LINKMODE]);
+ write_lock_bh(&dev_base_lock);
+ modified = 1;
+ }
}

and even with that fixed it breaks NetworkManager (on my test box it apparently
can't get the IP address using DHCP). Reverting this commit makes things
work again.

Well, it looks like this patch went to you untested and unreviewed, so may I
gently request that it be reverted from your tree, pretty please?

Thanks,
Rafael
--

Previous thread: Re: data corruption with dmcrypt/LUKS by Christoph Anton Mitterer on Friday, February 15, 2008 - 5:25 pm. (1 message)

Next thread: USB regression (and other failures) in 2.6.2[45]* by Andrew Buehler on Friday, February 15, 2008 - 5:45 pm. (26 messages)