Ok,
this kernel is a winner.Just to show how _much_ of a winner it is, it's been awarded a coveted
"weasel" series name, which should tell you just how good it's going to
be. It's a name revered in Linux kernel history, and as such this brings
back the good old days where if you find a bug, you're almost certainly
simply mistaken, and you probably just did something wrong.But hey, you can try to prove me wrong. I dare you.
The full shortlog is appended, but instead of a diffstat, I'm doing the my
git "dirstat" output which gives a nicer high-level view. This was
generated withgit diff --cumulative --dirstat=2 --shortstat v2.6.25-rc1..v2.6.25-rc2
and actually gives a pretty good overview. In particular, it shows that
almost exactly half of the updates are to drivers, with network drivers
alone being a third of the whole patch. And of the remaining half, about
half was architecture updates, notably to SH.What a cool git feature, if I say so myself. It allows me an extra half
hour a week of just sitting back and sipping my foofy tropical drink
(bringing my total up to about 168.5 hours of drunken stupor a week),
since now I don't even have to try to make sense of the diffstat manually
any more. Anyway, here it is:2.1% Documentation/
3.7% arch/cris/
7.0% arch/sh/configs/
4.4% arch/sh/kernel/
4.9% arch/sh/mm/
17.8% arch/sh/
23.8% arch/
33.5% drivers/net/
6.0% drivers/scsi/lpfc/
7.1% drivers/scsi/
4.5% drivers/sh/maple/
49.5% drivers/
8.1% fs/
2.5% include/linux/
4.5% include/
7.2% kernel/
2.0% net/
509 files changed, 14470 insertions(+), 6986 deletions(-)which you have to admit looks very managerial, even if you have no clue
what the heck it really contains (which is also very managerial). Now I
just need to turn these reports into some colored powerpoint slides with
pie charts, and then I can turn off my brain _completely_.But if you actually care for anything but ...
Sadly not for me:
[ 5282.056415] ------------[ cut here ]------------
[ 5282.059757] kernel BUG at lib/list_debug.c:33!
[ 5282.062055] invalid opcode: 0000 [1] SMP
[ 5282.062055] CPU 3
[ 5282.062055] Modules linked in: radeon drm w83792d ipv6 tuner
tea5767 tda8290 tuner_xc2028 tda9887 tuner_simple mt20xx tea5761
tvaudio msp3400 bttv videodev v4l1_compat ir_common compat_ioctl32
v4l2_common videobuf_dma_sg videobuf_core btcx_risc tveeprom usbhid
pata_amd i2c_nforce2 hid sg
[ 5282.062055] Pid: 12937, comm: sed Not tainted 2.6.25-rc2 #1
[ 5282.062055] RIP: 0010:[<ffffffff803bffe4>]
-> then the output from the serial console stopped. I was in X, so I
could not see, if there was anything more on the real console.(gdb) list *0xffffffff803bffe4
0xffffffff803bffe4 is in __list_add (lib/list_debug.c:33).
28 }
29 if (unlikely(prev->next != next)) {
30 printk(KERN_ERR "list_add corruption.
prev->next should be "
31 "next (%p), but was %p. (prev=%p).\n",
32 next, prev->next, prev);
33 BUG();
34 }
35 next->prev = new;
36 new->next = next;
37 new->prev = prev;For more on this problem see http://marc.info/?l=linux-kernel&m=120293042005445
I will now try 2.6.25-rc2-mm1.
Torsten
--
hm. Your crashes do seem to span multiple subsystems, but it always
seems to be around the SLUB code. Could you try the patch below? The
SLUB code has a new optimization and i'm not 100% sure about it. [the
hack below switches the SLUB optimization off by disabling the CPU
feature it relies on.]Ingo
------------->
arch/x86/Kconfig | 4 ----
1 file changed, 4 deletions(-)Index: linux/arch/x86/Kconfig
===================================================================
--- linux.orig/arch/x86/Kconfig
+++ linux/arch/x86/Kconfig
@@ -59,10 +59,6 @@ config HAVE_LATENCYTOP_SUPPORT
config SEMAPHORE_SLEEPERS
def_bool y-config FAST_CMPXCHG_LOCAL
- bool
- default y
-
config MMU
def_bool y--
$ grep FAST_CMPXCHG_LOCAL */.config
linux-2.6.24-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc3-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc3-mm2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc6-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.24-rc8-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc2-mm1/.config:CONFIG_FAST_CMPXCHG_LOCAL=y
linux-2.6.25-rc2/.config:CONFIG_FAST_CMPXCHG_LOCAL=y-rc2-mm1 still worked for me.
Did you mean the new SLUB_FASTPATH?
$ grep "define SLUB_FASTPATH" */mm/slub.c
linux-2.6.25-rc1/mm/slub.c:#define SLUB_FASTPATH
linux-2.6.25-rc2-mm1/mm/slub.c:#define SLUB_FASTPATH
linux-2.6.25-rc2/mm/slub.c:#define SLUB_FASTPATHThe 2.6.24-rc3+ mm-kernels did crash for me, but don't seem to contain this...
On the other hand:
From the crash in 2.6.25-rc2-mm1:
[59987.116182] RIP [<ffffffff8029f83d>] kmem_cache_alloc_node+0x6d/0xa0(gdb) list *0xffffffff8029f83d
0xffffffff8029f83d is in kmem_cache_alloc_node (mm/slub.c:1646).
1641 if (unlikely(is_end(object) || !node_match(c, node))) {
1642 object = __slab_alloc(s, gfpflags,
node, addr, c);
1643 break;
1644 }
1645 stat(c, ALLOC_FASTPATH);
1646 } while (cmpxchg_local(&c->freelist, object, object[c->offset])
1647
!= object);
1648 #else
1649 unsigned long flags;
1650That code is part for SLUB_FASTPATH.
I'm willing to test the patch, but don't know how fast I can find the
time to do it, so my answer if your patch helps might be delayed until
the weekend.Torsten
--
Mathieu, Christoph is on vacation and I'm not at all that familiar
with this cmpxchg_local() optimization, so if you could take a peek at
this bug report to see if you can spot something obviously wrong with
it, I would much appreciate that.
--
Sure,
Initial thoughts :
I'd like to get the complete config causing this bug. I suspect either :
- A race between the lockless algo and an IRQ in a driver allocating
memory.
- stat(c, ALLOC_FASTPATH); seems to be using a var++, therefore
indicating it is not reentrant if IRQs are disabled. Since those are
only stats, I guess it's ok, but still weird.
- CPU hotplug problem.
http://bugzilla.kernel.org/attachment.cgi?id=14877&action=view shows
last sysfs file:
/sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
-- is this linked to a cpu up/down event ?Since this shows mostly with network card drivers, I think the most
plausible cause would be an IRQ nesting over kmem_cache_alloc_node and
calling it.Will dig further...
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
# CONFIG_HOTPLUG_CPU is not set
I expect my system would get rather angry if I try to remove a cpu
while it is running. ;)And as I don't use suspend, I wanted to stay away from any bugs that
Thanks to yanmin crashing his machine with hackbench I now have a much
better testcase than recompiling KDE.
Starting hackbench will result in an nearly instant crash, if there is
a floodping running against the system.This new testcase also showed that 2.6.24-rc2-mm1 can also be crashed
in this way.I will try the patches from this tread against this new testcase and
report the results.Torsten
On Tue, 19 Feb 2008 09:02:30 -0500
I wonder how SLUB_FASTPATH is supposed to work, since it is affected by
a classical ABA problem of lockless algo.cmpxchg_local(&c->freelist, object, object[c->offset]) can succeed,
while an interrupt came (on this cpu), and several allocations were done,
and one free was performed at the end of this interruption, so 'object'
was recycled.c->freelist can then contain the previous value (object), but
object[c->offset] was changed by IRQ.We then put back in freelist an already allocated object.
--
Right. The alloc fastpath assumes that the object is not modified while it
is allocated...Du! This shoots my nice scheme in the foot and there does not seem to be
an easy way to fix it. This means we also need to revert the
page->end patch. Useless if we do not have the cmpxchg_local.--
I think you are right. A way to fix this would use the fact that the
freelist is only useful to point to the first free object in a page. We
could change it to an offset rather than an address.The freelist would become a counter of type "long" which increments
until it wraps at 2^32 or 2^64. A PAGE_MASK bitmask could then be used
to get low order bits which would get the page offset of the first free
object, while the high order bits would insure we can detect this type
of object reuse when doing a cmpxchg. Upon free, the freelist counter
should always be incremented; this would be provided by adding PAGE_SIZE
to the counter and setting the LSBs to the correct offset.On 32 bits architectures, it would allow 2^(32-12) = 1048576
allocations-free pairs to be done within interrupt handlers interrupting
a cmpxchg_local before the counter wraps. On 64 bits archs, it would
give 2^(64-12) = 2^52 allocation-free.If we assume kmalloc/kfree pairs can take up to 1000 cycles to execute
(see
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...)
(3GHz Pentium 4) (therefore 333ns)
an interrupt handler doing 1048576 kmalloc/kfree would run for 0.35
seconds. An interrupt handler taking that much time would lead to other
OS problems (potential scheduler problems), so this quantity of
available allocations before a wrap-around looks safe. However, making
sure preemption is disabled would be safer here, since we cannot bound
the number of allocations that can be done by other processes if we are
scheduled out in the middle of the cmpxchg_local loop.A revert is indeed the right solution at this stage until a corrected
version is ready.Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Urgh.... That sounds way too complicated. Do you have an experimental
patch that would allow us to see the impact?--
Yep, I have an experimental patch ready. It would need some thorough
testing though. I seems to run fine on an x86_32 with and without
preemption (therefore with and without the slub cmpxchg_local fastpath).You should first revert commit 00e962c5408b9f2d0bebd2308673fe982cb9a5fe
(this is the slub fastpath revert itself, currently in 2.6.25-rc3)And then apply :
Implement slub fastpath in terms of freebase and freeoffset
It allows the cmpxchg_local to detect object re-use by keeping a counter in the
freeoffset MSBs.freeoffset & PAGE_MASK is the offset of the freelist element within the page.
freeoffset & ~PAGE_MASK is the counter to detect object re-use.
freebase is the address of the page in this the object is located. It is a
multiple of PAGE_SIZE.Whenever an object is freed in the cpu slab cache, the counter is incremented.
Whenever the alloc/free slow paths are modifying the freeoffset or freebase, the
freeoffset counter is also incremented. It is used to make sure we know if
freebase has been modified in an interrupt nested over the fast path.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
CC: Christoph Lameter <clameter@sgi.com>
---
include/linux/slub_def.h | 35 +++++++++++++++++++
mm/slub.c | 84 +++++++++++++++++++++++++++++++++--------------
2 files changed, 93 insertions(+), 26 deletions(-)Index: linux-2.6-lttng/include/linux/slub_def.h
===================================================================
--- linux-2.6-lttng.orig/include/linux/slub_def.h 2008-02-27 20:40:48.000000000 -0500
+++ linux-2.6-lttng/include/linux/slub_def.h 2008-02-28 00:17:28.000000000 -0500
@@ -32,7 +32,18 @@ enum stat_item {
NR_SLUB_STAT_ITEMS };struct kmem_cache_cpu {
- void **freelist; /* Pointer to first free per cpu object */
+ unsigned long freeoffset; /*
+ * Offset of the first free per cpu
+ * object. The MSBs are used as a
+ * counter which must be incremented
+ * each time an ...
What is the performance impact?
page_address(c->page) + (c->freeoffset & (PAGE_MASK << s->order)) ?
How is this going to show up as an atomic update? You modify multiple per
The original code is wrong. It should be
if (is_end(c->freelist))
That use of set_freelist is safe since interrupts are disabled. So the
How does this work? You replace the freeoffset with the the once
incremented by the object offset??? The objects may not be in sequence on
the freelist and the increment wont get you to the next object anyways
since c->offset is usually 0.--
I guess it would be more :
page_address(c->page) | (c->freeoffset & c->off_mask)
where c->off_mask = (PAGE_SIZE << s->order) - 1;
Note the "from the fastpath point of view", which means that if an
interrupt nests over the cmpxchg_local fastpath and uses
set_freelist_ptr, all the operations will be executed before the
interrupt returns to the cmpxchg_local fastpath, therefore showing the
update as being done "atomically" from the cmpxchg_local fastpath point
of view.Anyway, I am changing this for :
static inline void set_freelist_ptr(struct kmem_cache_cpu *c, void **freelist)
{
unsigned long offset;offset = c->freeoffset + c->off_mask + 1;
offset &= ~c->off_mask;
offset |= (unsigned long)freelist & c->off_mask;
c->freeoffset = offset;
}Which will insure c->freeoffset is always in a valid state, therefore
Should be ok now since we allow interrupts to nest over
set_freelist_ptr. But I think this is also protected by slab_lock, taken
both in the slow path and in deactivate_slab.In any case, doing a
object = get_freelist_ptr(c);
set_freelist_ptr(c, object[c->offset]);Like deactivate_slab does should be either protected by disabling
Hrm, there should be a 1 to 1 mapping between the old code and the new
code :- } while (cmpxchg_local(&c->freelist, object, object[c->offset])
- != object);
+ /* No need to increment the MSB counter here, because only
+ * object free will lead to counter re-use. */
+ } while (cmpxchg_local(&c->freeoffset, freeoffset,
+ (freeoffset + (c->offset * sizeof(void *)))) != freeoffset);With the difference that we use only offsets instead of the full object
address.Ah! I think I found the issue. It should become :
newoffset = freeo...
Hmmm.. The cmpxchg_local was intended to check for a change of slab too
(addresses from different slabs are disjunct).
That is no longer possible since the upper bits are used by the
versioning? (Same issue in slab_alloc).--
Hrm, this is why I kept a separate "freebase", as a sort of
Here is the trick : if, on top of the cmpxchg_local, we have an
interrupt that calls slab_alloc or slab_free which causes a change of
slab, this interrupt will have to go through the slow path.These slow paths use set_freelist_ptr() to update the c->freeoffset,
which also increments the sequence number. When we return from
interrupt, the cmpxchg_local loop will fail because the sequence number
has changed.In short, the we also use the versioning to check for change of slab.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Sounds good.
--
Test results :
On a 3GHz Pentium 4, CONFIG_SMP=y, no preemption, single CPU, hyperthreading
Kernel 2.6.25-rc3 (complete config below)Just to give away an overview of the results :
The "1. Kmalloc: Repeatedly allocate then free test", for a 32 bytes
allocation, takes :10000 times kmalloc(32) -> 191 cycles kfree -> 179 cycles
10000 times kmalloc(32) -> 93 cycles kfree -> 172 cyclesFor a speedup of 2 on the allocation and about the same speed for free.
For the alloc/free tests, we have a speedup of 3 for most of allocation
sizes. (as long as we are in the fast path)* baseline : standard fast path, without freeoffset patch
Single thread testing
=====================
1. Kmalloc: Repeatedly allocate then free test* baseline : standard fast path, without freeoffset patch
10000 times kmalloc(8) -> 182 cycles kfree -> 174 cycles
10000 times kmalloc(16) -> 185 cycles kfree -> 175 cycles
10000 times kmalloc(32) -> 191 cycles kfree -> 179 cycles
10000 times kmalloc(64) -> 208 cycles kfree -> 184 cycles
10000 times kmalloc(128) -> 260 cycles kfree -> 202 cycles
10000 times kmalloc(256) -> 367 cycles kfree -> 277 cycles
10000 times kmalloc(512) -> 394 cycles kfree -> 304 cycles
10000 times kmalloc(1024) -> 469 cycles kfree -> 397 cycles
10000 times kmalloc(2048) -> 417 cycles kfree -> 318 cycles
10000 times kmalloc(4096) -> 470 cycles kfree -> 356 cycles
10000 times kmalloc(8192) -> 692 cycles kfree -> 723 cycles
10000 times kmalloc(16384) -> 756 cycles kfree -> 830 cycles* cmpxchg_local fast path with freeoffset patch
10000 times kmalloc(8) -> 79 cycles kfree -> 169 cycles
10000 times kmalloc(16) -> 83 cycles kfree -> 169 cycles
10000 times kmalloc(32) -> 93 c...
Then we do not need the page->end field anymore right? I will try
to rediff your patch against current slab-mm and see how we can proceed
from there.
--
Slub Freeoffset check overflow
Check for overflow of the freeoffset version number.
I just thought adding this check in CONFIG_SLUB_DEBUG makes sense. It's
really unlikely that enough interrupt handlers will nest over the slub
fast path, and each of them do about a million alloc/free on 32 bits or
a huge amount of alloc/free on 64 bits, but just in case, it seems good
to warn if we detect we are half-way to a version overflow.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
mm/slub.c | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)Index: linux-2.6-lttng/mm/slub.c
===================================================================
--- linux-2.6-lttng.orig/mm/slub.c 2008-02-29 08:05:01.000000000 -0500
+++ linux-2.6-lttng/mm/slub.c 2008-02-29 08:16:13.000000000 -0500
@@ -1660,7 +1660,7 @@ static __always_inline void *slab_alloc(
*/#ifdef SLUB_FASTPATH
- unsigned long freeoffset, newoffset;
+ unsigned long freeoffset, newoffset, resoffset;c = get_cpu_slab(s, raw_smp_processor_id());
do {
@@ -1682,8 +1682,18 @@ static __always_inline void *slab_alloc(
newoffset = freeoffset;
newoffset &= ~c->off_mask;
newoffset |= (unsigned long)object[c->offset] & c->off_mask;
- } while (cmpxchg_local(&c->freeoffset, freeoffset, newoffset)
- != freeoffset);
+ resoffset = cmpxchg_local(&c->freeoffset, freeoffset,
+ newoffset);
+#ifdef CONFIG_SLUB_DEBUG
+ /*
+ * Just to be paranoid : warn if we detect that enough
+ * allocations nested on top of us to get the counter to go
+ * half-way to overflow. That would be insane to do that much
+ * allocations in interrupt handers, but check it anyway.
+ */
+ WARN_ON(resoffset - freeoffset > -1UL >> 1);
+#endif
+ } while (resoffset != freeoffset);
#else
unsigned long flags;@@ -1822,7 +1832,7 @@ static __always_inline void slab_free(st
struct kmem_cache_cpu *c;#ifdef SLUB_FASTPAT...
Check for overflow of the freeoffset version number.
I just thought adding this check in CONFIG_SLUB_DEBUG makes sense. It's
really unlikely that enough interrupt handlers will nest over the slub
fast path, and each of them do about a million alloc/free on 32 bits or
a huge amount of alloc/free on 64 bits, but just in case, it seems good
to warn if we detect we are half-way to a version overflow.Changelog :
- Mask out the LSB because of alloc fast path. See comment in source.Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
---
mm/slub.c | 40 ++++++++++++++++++++++++++++++++++------
1 file changed, 34 insertions(+), 6 deletions(-)Index: linux-2.6-lttng/mm/slub.c
===================================================================
--- linux-2.6-lttng.orig/mm/slub.c 2008-03-04 00:59:01.000000000 -0500
+++ linux-2.6-lttng/mm/slub.c 2008-03-04 01:03:44.000000000 -0500
@@ -1660,7 +1660,7 @@ static __always_inline void *slab_alloc(
*/#ifdef SLUB_FASTPATH
- unsigned long freeoffset, newoffset;
+ unsigned long freeoffset, newoffset, resoffset;c = get_cpu_slab(s, raw_smp_processor_id());
do {
@@ -1682,8 +1682,22 @@ static __always_inline void *slab_alloc(
newoffset = freeoffset;
newoffset &= ~c->off_mask;
newoffset |= (unsigned long)object[c->offset] & c->off_mask;
- } while (cmpxchg_local(&c->freeoffset, freeoffset, newoffset)
- != freeoffset);
+ resoffset = cmpxchg_local(&c->freeoffset, freeoffset,
+ newoffset);
+#ifdef CONFIG_SLUB_DEBUG
+ /*
+ * Just to be paranoid : warn if we detect that enough free or
+ * slow paths nested on top of us to get the counter to go
+ * half-way to overflow. That would be insane to do that much
+ * allocations/free in interrupt handers, but check it anyway.
+ * Mask out the LSBs because alloc fast path does not increment
+ * the sequence number, which may cause the overall values to go
+ * backward.
+ */
+ WARN_ON((resoffset & ~c->...
Could you rediff your work against the slab-mm branch of my VM tree on
kernel.org?
--
I guess you could replace page->end with the (expensive)
page_address(page) + 1 in new_slab.You would also have to change slab_address() to use
page_address(page). Since this is used by check_valid_pointer, I am not
sure it's a good idea to use this slow operation there.Or maybe you have a different alternative in mind ?
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
page->end was introduced to have a unique pointer for each slab. The
counter that you introduces can take over that role. So we could use a
NULL pointer again as the end of the objects? Or some constant?I am going to revert the change that introduced page->end for 2.6.25 since
it seems that the new fastpath does not need this. It was specific to the
old fastpath.--
I think you may well be right. This looks like a good clue.
I'll do the revert. I wanted either a confirmation that reveting it
actually fixes something, _or_ an actual bug description, and this seems
to be a quite possible case of the latter.Linus
--
Hi Mathieu,
What is not re-entrant?
Yes, this can happen. Are you saying it is not safe to be in the
lockless path when an IRQ triggers?
--
incrementing the variable with a "++" when interrupts are not disabled.
It's not an atomic add and it's racy. The code within stat() doesIt should be safe, but I think Eric pointed the correct problem in his
reply.Thanks,
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Yes but that is only for used for statistics which can be racy. Note that
the VM event statistics also can be racy.--
Doing ++ on a u32 _is_ atomic wrt interrupts on x86 and probably lots of
other architectures, so we're OK using unsigned there. But on some other
architectures ++ on u32 is not atomic wrt interrutps, so they should use
atomic_t or some other arch-specific mechanism.And guess what? It's already all been done: local_t.
--
So you make implicit assumptions about how gcc compiles your ++? I am
afraid you can't do that, gcc is absolutely free to turn variable++;
statement to non-atomic sequence of instructions (memory->register load,
increment register, for example).--
Jiri Kosina
SUSE Labs--
hm, why should it be atomic wrt. irqs? There's nothing that keeps gcc
from not doing an "incl memory_address" but do something like: "load
memory address into regx, incl regx, ... store address into memory
address" - and that's not atomic at that point.Ingo
--
> hm, why should it be atomic wrt. irqs? There's nothing that keeps gcc
It isn't - we've got tty layer traces of serial counts going wrong that
rely on precisely that assumption.Alan
--
local_t requires the disabling of preempt to work right.
The real solution here is cpu_alloc / cpu_ops. Per cpu operations work on
an offset relative to the start of the per process cpu data area in some
register. An increment can then be atomic vs. interrupt because it does
the calculation of the address and the inc in one instruction. F.e.gs: inc [percpu_offset]
Processor may change before and after without ill effects. So no preempt
since the instruction will always reference the %%gs register that points
to the percpu area of the currently executing processor.
--
Hi Mathieu,
Hmm. The barrier() in slab_free() looks fishy. The comment says it's
there to make sure we've retrieved c->freelist before c->page but then
it uses a _compiler barrier_ which doesn't affect the CPU and the
reads may still be re-ordered... Not sure if that matters here though.
--
No, no. The comment says that it's purely there to serialize an
*interrupt*, and as such, a compiler-only barrier is sufficient (or the
comment is wrong).Interrupts are "totally ordered" within a cpu (of course, in theory a CPU
might have speculative work etc reordering, but the CPU also guarantees
that interrupt acts _as_if_ it was exact), so a compiler barrier is
sufficient.Of course, if we're talking about interrupts on another CPU, that's a
different issue, but the fact is, in that case it's not about interrupts
any more (might as well be other code just running normally on another
CPU), and a barrier doesn't help, it needs real locking.So that barrier is fine per se. Of course, the whole code (and/or just the
comment!) may be buggered, but any CPU SMP-aware barriers shouldn't be
relevant.What's much more likely to be an issue is simply the fact that since the
fastpath now accesses the per-cpu freelist without any locking, if there
is *any* sequence what-so-ever that does it from another CPU and assumes
the old locking behaviour, the list will be corrupted. And from a quick
look-through, I certainly cannot guarantee that isn't the case.There's still a lot of cases that do direct assignments to "c->freelist"
without using a guaranteed atomic sequence. They *should* be safe if it's
guaranteed that
(a) they always run with interrupts disabled
AND
(b) 'c' is _always_ the "current CPU" listbut I can't quickly see that guarantee for either.
I'd happily just revert this thing, but it would be really good to have
confirmation that it seems to matter. But Torsten's partial bisection
seems to say that the quicklist thing went into -mm before the crash even
started.So:
- it might be something else entirely
- it might still be the local cmpxchg, just Torsten didn't happen to
notice it until later.
- it might still be the local cmpxchg, but something else changed its
patterns to actually make it start triggering.and in ...
My new hackbench-testcase also killed 2.6.24-rc2-mm1, so I really
I tried the following three patches:
switching the barrier() for a smp_mb() in 2.6.25-rc2-mm1:
-> crashedreverting the FASTPATH-patch in 2.6.25-rc2:
-> workedonly removed FAST_CMPXCHG_LOCAL from arch/x86/Kconfig
-> workedSo all of these tests seem to confirm, that the bug is in the new SLUB fastpath.
Torsten
--
yeah - my revert suggestions were all completely conditional on such
type of test feedback.Btw., i did trigger occasional SLUB crashes myself starting at around
-rc1, on the order of one per 200-300 straight random bootups, and
yesterday i did a 50-bootups series of a specific .config that crashed,
to try to reproduce one of them but failed - so bisection was not an
option and i had nothing concrete and repeatable to report either. I had
a few complete lockups and only 3 usable backtraces - find them below.Networking features in all of the backtraces - and so does the VFS. All
of the crashes are on SMP - and given that 50% of the bootups are UP
this gives us a 1:8 chance hint that this bug is SMP specific. (All the
crashes are in distccd - that is what this build cluster does mainly so
it's the main activity of the box - so they dont necessarily indicate
anything workload specific.)Earlier today i turned off local-cmpxchg and havent had a crash or hang
since then - but at 200 bootups and 4-5 crashes in a week that's not
conclusive yet. I think others might have workloads that trigger this
bug more often.Ingo
---------------->
mercury login: [ 582.671916] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
[ 582.672334]
[ 582.672334] Pid: 3776, comm: distccd Not tainted (2.6.25-rc2 #5)
[ 582.672334] EIP: 0060:[<c0174fda>] EFLAGS: 00010246 CPU: 0
[ 582.672334] EIP is at kmem_cache_alloc+0x2a/0x90
[ 582.672334] EAX: 00000000 EBX: 8000061c ECX: c069ed1c EDX: 01060002
[ 582.672334] ESI: c0aeffc8 EDI: c1d11714 EBP: f6eddcdc ESP: f6eddcc4
[ 582.672334] DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
[ 582.672334] Process distccd (pid: 3776, ti=f6edc000 task=f508c000 task.ti=f6edc000)
[ 582.672334] Stack: c06a3d48 f6eddce4 00000020 8000061c 0000066c c0aeffc8 f6eddcf8 c069ed1c
[ 582.672334] 00000000 00000020 8000061c f7ce6580 f7ce6580 f6eddd18 c045e7bb ffffffff
[ 582.672334] 00000000 f7f683e0 8000061c f52136c0 f7ce6580 f6eddd58 c0461de5 f5...
i mean, today i've only done 200 randconfig bootups since i did the
cmpxchg SLUB revert, and given the statistics of the bug (thousands of
bootups and just 3 provable crashes) i cannot yet conclude that the bug
is truly gone.Ingo
--
find a fix patch for that below - most systems affected seem to be SMP
ones.If this (or my other patch) indeed solves the problem i'd still favor a
full revert of the SLUB_FASTPATH (commit 1f84260c8ce3b1ce26d4), it looks
quite un-cooked and quite un-tested for multiple independent reasons.Sigh, why do i again have to be the messenger who brings the bad news to
SLUB land, and again when poor Christoph went on vacation? :-/Ingo
-------------------------->
Subject: SLUB: barrier fix
From: Ingo Molnar <mingo@elte.hu>---
mm/slub.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)Index: linux/mm/slub.c
===================================================================
--- linux.orig/mm/slub.c
+++ linux/mm/slub.c
@@ -1862,7 +1862,7 @@ static __always_inline void slab_free(st
debug_check_no_locks_freed(object, s->objsize);
do {
freelist = c->freelist;
- barrier();
+ smp_mb();
/*
* If the compiler would reorder the retrieval of c->page to
* come before c->freelist then an interrupt could----------------->
Subject: slub: fastpath optimization revert
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Feb 19 15:46:37 CET 2008revert:
commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c
Author: Christoph Lameter <clameter@sgi.com>
Date: Mon Jan 7 23:20:30 2008 -0800SLUB: Alternate fast paths using cmpxchg_local
it was causing problems (crashes) and was incomplete.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/slub.c | 87 --------------------------------------------------------------
1 file changed, 87 deletions(-)Index: linux-x86.q/mm/slub.c
===================================================================
--- linux-x86.q.orig/mm/slub.c
+++ linux-x86.q/mm/slub.c
@@ -149,13 +149,6 @@ static inline void ClearSlabDebug(struct
/* Enable to test recovery from slab corruption on boot */
#undef SLUB_RESILIENCY_TEST-/*
- * Currently fastpath is not support...
Torsten/Yamin, does this fix things for you? What about reverting commit
1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c ("SLUB: Alternate fast paths
using cmpxchg_local")?Pekka
--
I'm busy in another issue and will test it ASAP. Sorry.
-yanmin
--
I tested it on my 3 x86-64 machines. The small fix to use smp_mb to replace
barrier in slab_free doesn't work. Kernel still crashed at the same place.I will test the reverting patch.
-yanmin
--
Kernel with the reverting patch is ok.
I ran reboot/hackbench for more than 10 times on every one of my 3 x86-64 machines,
and kernel didn't crash.-yanmin
--
Great, Linus reverted the patch yesterday. Thanks for testing!
--
the revert patch is below. (manually done due to other changes since
1f84260c8ce3b1ce26d4 was commited, but trivial)Ingo
----------------->
Subject: slub: fastpath optimization revert
From: Ingo Molnar <mingo@elte.hu>
Date: Tue Feb 19 15:46:37 CET 2008revert:
commit 1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c
Author: Christoph Lameter <clameter@sgi.com>
Date: Mon Jan 7 23:20:30 2008 -0800SLUB: Alternate fast paths using cmpxchg_local
it was causing problems (crashes) and was incomplete.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
mm/slub.c | 87 --------------------------------------------------------------
1 file changed, 87 deletions(-)Index: linux-x86.q/mm/slub.c
===================================================================
--- linux-x86.q.orig/mm/slub.c
+++ linux-x86.q/mm/slub.c
@@ -149,13 +149,6 @@ static inline void ClearSlabDebug(struct
/* Enable to test recovery from slab corruption on boot */
#undef SLUB_RESILIENCY_TEST-/*
- * Currently fastpath is not supported if preemption is enabled.
- */
-#if defined(CONFIG_FAST_CMPXCHG_LOCAL) && !defined(CONFIG_PREEMPT)
-#define SLUB_FASTPATH
-#endif
-
#if PAGE_SHIFT <= 12/*
@@ -1514,11 +1507,6 @@ static void *__slab_alloc(struct kmem_ca
{
void **object;
struct page *new;
-#ifdef SLUB_FASTPATH
- unsigned long flags;
-
- local_irq_save(flags);
-#endif
if (!c->page)
goto new_slab;@@ -1541,9 +1529,6 @@ load_freelist:
unlock_out:
slab_unlock(c->page);
stat(c, ALLOC_SLOWPATH);
-#ifdef SLUB_FASTPATH
- local_irq_restore(flags);
-#endif
return object;another_slab:
@@ -1575,9 +1560,6 @@ new_slab:
c->page = new;
goto load_freelist;
}
-#ifdef SLUB_FASTPATH
- local_irq_restore(flags);
-#endif
/*
* No memory available.
*
@@ -1619,34 +1601,6 @@ static __always_inline void *slab_alloc(
{
void **object;
struct kmem_cache_cpu *c;
-
-/*
- * The SLUB_FASTPATH path is pr...
I am ok with this if someone can actually confirm it fixes things.
Pekka
--
hm, it's bad for at least one other reason as well (which is probably
unrelated to this crash):/*
* Currently fastpath is not supported if preemption is enabled.
*/
#if defined(CONFIG_FAST_CMPXCHG_LOCAL) && !defined(CONFIG_PREEMPT)
#define SLUB_FASTPATH
#endifsuch !PREEMPT exceptions tend to show "i didnt want to think too hard
about the preemptible case so just turn it off" thinking.Also, why isnt this "SLUB_FASTPATH" flag done in the Kconfig space?
Ingo
--
Ingo, a comment in slub.c explains it :
/*
* The SLUB_FASTPATH path is provisional and is currently disabled if the
* kernel is compiled with preemption or if the arch does not support
* fast cmpxchg operations. There are a couple of coming changes that will
* simplify matters and allow preemption. Ultimately we may end up making
* SLUB_FASTPATH the default.
*
* 1. The introduction of the per cpu allocator will avoid array lookups
* through get_cpu_slab(). A special register can be used instead.
*
* 2. The introduction of per cpu atomic operations (cpu_ops) means that
* we can realize the logic here entirely with per cpu atomics. The
* per cpu atomic ops will take care of the preemption issues.
*/Eventually, I think only CONFIG_FAST_CMPXCHG_LOCAL will be needed (when
the code will support preemption). Therefore, this SLUB_FASTPATH define
seems to be only here temporarily.I'm looking at the code right now.. more to come.
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
well the feature is not complete and there are no reasons given _why_
it's not complete ... and even if there's a reason it should have been
deferred to the next merge window. We still have 10 year old "this is a
temporary hack" comments in the kernel ;-)"hardware does not support it" is a valid argument, "kernel developer
had no time to implement it properly" is not ;-)Ingo
--
Hi,
Hmm, no idea. I think might have been some mix-up with merging the
patch. The one I saw was:http://ftp.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.24/2....
But I don't remember giving out a Reviewed-by for it (and my mailbox
confirms that). Furthermore, somehow it turned into this when merged:http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...
In any case, if Torsten/someone can verify that reverting
1f84260c8ce3b1ce26d4c1d6dedc2f33a3a29c0c ("SLUB: Alternate fast paths
using cmpxchg_local") fixes these problems, I think we should just do
it and let Christoph sort it out when he gets back.Pekka
--
Is there any chance that you could try to bisect this, if it's repeatable
enough for you? Even if you can't bisect it *all* the way, it would be
really good to do a handful of bisection runs which should already
hopefully narrow it down a bit more.Linus
--
It's repeatable, but not in a really reliable way.
So to mark a kernel good I need to compile around 100 KDE packages,
and even then I'm not 100% sure, if it's good or if I was just lucky.But I did a partly bisect against 2.6.24-rc6-mm1:
2.6.24-rc6 + mm-patches up to (including) git.nfsd -> worked
2.6.24-rc6 + mm-patches up to (including) git.xfs -> crashedI think the only added patch between rc2-mm1 and rc3-mm2 in that range
where the iommu changes that I later ruled out.
That leaves some git trees as suspects:
git-ocfs2.patch
git-selinux.patch
git-s390.patch
git-sched.patch
git-sh.patch
git-scsi-misc.patch
git-unionfs.patch
git-v9fs.patch
git-watchdog.patch
git-wireless.patch
git-ipwireless_cs.patch
git-x86.patch
git-xfs.patch(see http://marc.info/?l=linux-kernel&m=120276641105256 )
Torsten
--
There's the Bugzilla entry for it at
http://bugzilla.kernel.org/show_bug.cgi?id=9973Please update it with the current information.
Thanks,
Rafael
--
Crash for 2.6.25-rc2-mm1 added. That one had a complete stacktrace,
but the trace looks like others I already reported, so no real new
information... :-(Torsten
--
This is a multi-part message in MIME format.
--------------040606080104080609040505
Content-Type: text/plain; charset=ISO-8859-15; format=flowed
Content-Transfer-Encoding: quoted-printable2.6.25-rc2 fails to bring up my openSUSE 10.3 PC because LVM
cannot find the volume group containing the root file system.
2.6.25-rc1 has the same problem, 2.6.24 works fine.Some possibly relevant messages copied manually from the screen:
--------8<--------8<--------8<--------8<--------8<--------8<--------
Driver 'sd' needs updating - please use bus_type methods
sd 0:0:0:0: [sda] ...
sda: sda1 sda2 sda3
sd 0:0:0:0: [sda] Attached SCSI disk
sd 1:0:0:0: [sdb] ...
sda: sdb1
sd 0:0:0:0: [sdb] Attached SCSI disk
md: linear personality registered for level -1
Volume group "system" not found
Volume group "system" not found
Trying manual resume from /dev/sda2
Invoking userspace resume from /dev/sda2
boot/82-resume.userspace.sh: line 48: /proc/splash: No such file or direc=
tory
resume: could not stat configuration file
resume: libcrypt version: 1.2.4
Trying manual resume from /dev/sda2
PM: Starting manual resume from diskWaiting for device /dev/system/root to appear: . Volume group "system" no=
t found
Volume group "system" not found
[...]
Volume group "system" not found
CPA self-test:
4k 1024 large 223 gb 0 x 1247[c0000000-f7c00000] miss 0
4k 189440 large 39 gb 0 x 189479[c0000000-f7c00000] miss 0
4k 189440 large 39 gb 0 x 189479[c0000000-f7c00000] miss 0
ok.
=2E Volume group "system" not found
[...]
Volume group "system" not found
Could not find /dev/system/root.
Want me to fall back to /dev/system/root? (Y/n)
--------8<--------8<--------8<--------8<--------8<--------8<--------Config attached. Architecture is x86 32bit. The two SATA disks are
hanging off an ICH6. More detail on request.Anything obvious? Bisect?
--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachrich...
On Sat, 16 Feb 2008 20:14:30 +0100
Compile in SCSI disk support. Modular even if loaded in initrd it seems
to have broken somewhere.
--
# CONFIG_SYSFS_DEPRECATED is not set
I would suspect this.
Try to upgrade to at least lvm 2.02.29 (I guess this is the first version which
understands the new sysfs layout).
--
IMHO That should be *set* by default until everyone has had time to
update their userspace software to cope with the changed sysfs layout.Alasdair
--
agk@redhat.com
--
[added CCs from the other thread on this topic]
It *is* set by default.
The root cause of the trouble is that its semantics are changing.
At one point in time (sorry, don't remember which kernel release
exactly) I tested whether the openSUSE 10.3 userspace supported
a CONFIG_SYSFS_DEPRECATED=3Dn kernel and found that it did. From
then on, "make oldconfig" would carry that setting over to every
new kernel I built, which was fine while the meaning of this
setting - ie. the difference in sysfs layout it controlled -
stayed the same.With commit edfaa7c36574f1bf09c65ad602412db9da5f96bf however, the
sysfs layout changed again, so the same CONFIG_SYSFS_DEPRECATED
setting now controls a different difference (argh) in sysfs
layout. That kind of situation is not handled very well by
"make oldconfig", which basically starts from the assumption
that a setting that was ok for the previous kernel version is
still ok for the new one.I see two ways of avoiding that problem: either create a new
backward compatibility config setting for that new sysfs change,
or create a way of telling "make oldconfig" that the semantics of
CONFIG_SYSFS_DEPRECATED have changed and it should ask the user
for that again even if there is a previous setting.HTH
T.--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)
Bisection says:
edfaa7c36574f1bf09c65ad602412db9da5f96bf is first bad commit
commit edfaa7c36574f1bf09c65ad602412db9da5f96bf
Author: Kay Sievers <kay.sievers@vrfy.org>
Date: Mon May 21 22:08:01 2007 +0200Driver core: convert block from raw kobjects to core devices
This moves the block devices to /sys/class/block. It will create a
flat list of all block devices, with the disks and partitions in one=directory. For compatibility /sys/block is created and contains syml=
inks
to the disks.Apparently, compatibility is in the eye of the beholder - in this
Setting
CONFIG_SCSI=3Dy
CONFIG_BLK_DEV_SD=3DySetting
CONFIG_SYSFS_DEPRECATED=3Dy
does indeed fix the problem and allows me to boot successfully.
I'll have to investigate how to do that without breaking anything.
HTH
T.--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)
I faced the same problem, but resolved with ...
vgscan
vgchange -a yAlso, ensure you set "write_cache_state = 1" in /etc/lvm.conf before
running the above.Let me know if this helps.
Thanks,
Jeff.
--
Sorry, I'm not sure what to do with those two commands.
Running them once manually doesn't seem to change anything,That was already set by default.
Thanks,
Tilman--=20
Tilman Schmidt E-Mail: tilman@imap.cc
Bonn, Germany
Diese Nachricht besteht zu 100% aus wiederverwerteten Bits.
Unge=F6ffnet mindestens haltbar bis: (siehe R=FCckseite)
1. Release Linux Kernel
2. Give it a weasel name
3. ??
ERROR: "LGUEST_PAGES_guest_gdt_desc" [drivers/lguest/lg.ko] undefined!
ERROR: "LGUEST_PAGES_host_gdt_desc" [drivers/lguest/lg.ko] undefined!
ERROR: "LGUEST_PAGES_host_cr3" [drivers/lguest/lg.ko] undefined!
....config see http://pastebin.ca/906358
--
Hi,
The softlockup is seen from 2.6.25-rc1-git{1,3} and is visible in the 2.6.24-rc2 kernel,
While booting up with the 2.6.25-rc1-git{1,3} and 2.6.25-rc2 kernel(s) on the powerboxLoading st.ko module
BUG: soft lockup - CPU#1 stuck for 61s! [insmod:379]
NIP: c0000000001b0620 LR: c0000000001a5dcc CTR: 0000000000000040
REGS: c00000077caab8a0 TRAP: 0901 Not tainted (2.6.25-rc2-autotest)
MSR: 8000000000009032 <EE,ME,IR,DR> CR: 84004088 XER: 20000000
TASK = c00000077cb450a0[379] 'insmod' THREAD: c00000077caa8000 CPU: 1
GPR00: c00000077c9d4000 c00000077caabb20 c000000000538a40 000000000000000b
GPR04: ffc0000000000000 c00000077e0c0000 0000000000000036 000000000000000a
GPR08: 0040000000000000 c00000077c9d4250 c000000000000000 0000000000000000
GPR12: c00000077c9d4230 c000000000481d00
NIP [c0000000001b0620] .radix_tree_gang_lookup+0x100/0x1e4
LR [c0000000001a5dcc] .call_for_each_cic+0x50/0x10c
Call Trace:
[c00000077caabb20] [c0000000001a5e2c] .call_for_each_cic+0xb0/0x10c (unreliable)
[c00000077caabc60] [c00000000019dba4] .exit_io_context+0xf0/0x110
[c00000077caabcf0] [c000000000061e38] .do_exit+0x820/0x850
[c00000077caabda0] [c000000000061f34] .do_group_exit+0xcc/0xe8
[c00000077caabe30] [c00000000000872c] syscall_exit+0x0/0x40
Instruction dump:
7d296214 39290018 e8090000 7caa2038 39290008 2fa00000 409e0018 7caa4215
396b0001 418200cc 424000b8 4bffffdc <79691f24> 7d296214 e9690018 2fab0000
INFO: task insmod:387 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
insmod D 000000001000e144 12144 387 1
Call Trace:
[c00000077cb97600] [c0000000008fae80] 0xc0000000008fae80 (unreliable)
[c00000077cb977d0] [c000000000010c7c] .__switch_to+0x11c/0x154
[c00000077cb97860] [c000000000344498] .schedule+0x5d0/0x6b0
[c00000077cb97950] [c0000000003447d8] .schedule_timeout+0x3c/0xe8
[c00000077cb97a20] [c000000000343d34] .wait_for_common+0x150/0x22c
[c00000077cb97ae0] [c00000000008ef00] .__stop_machine_run+0xbc...
Can you update the Bugzilla entry at:
http://bugzilla.kernel.org/show_bug.cgi?id=9948
with the above information, please?--
"Premature optimization is the root of all evil." - Donald Knuth
--
It's odd stuff. Could you perhaps try and add some printks to
block/cfq-iosched.c:call_for_each_cic(), like dumping the 'nr' return
from radix_tree_gang_lookup() and the pointer value of cics[i] in the
for() loop after the lookup?How many SCSI devices are online?
--
Jens Axboe--
On Sun, 17 Feb 2008 20:29:13 +0100
I met the same issue on ia64/NUMA box.
seems cisc[]->key is NULL and index for radix_tree_gang_lookup() was always '1'.Attached patch works well for me,
but I don't know much about cfq. please confirm.Regards,
-Kame==
cics[]->key can be NULL.
In that case, cics[]->dead_key has key value.Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: linux-2.6.25-rc2/block/cfq-iosched.c
===================================================================
--- linux-2.6.25-rc2.orig/block/cfq-iosched.c
+++ linux-2.6.25-rc2/block/cfq-iosched.c
@@ -1171,7 +1171,11 @@ call_for_each_cic(struct io_context *ioc
break;called += nr;
- index = 1 + (unsigned long) cics[nr - 1]->key;
+
+ if (!cics[nr - 1]->key)
+ index = 1 + (unsigned long) cics[nr - 1]->dead_key;
+ else
+ index = 1 + (unsigned long) cics[nr - 1]->key;for (i = 0; i < nr; i++)
func(ioc, cics[i]);--
Why does it keep repeating then? If ->key is NULL, the next lookup index
should be 1UL.But I think the radix 'scan over entire tree' is a bit fragile. This
patch adds a parallel hlist for ease of properly browsing the members,It doesn't make a lot of sense, I'm afraid.
block/blk-ioc.c | 35 +++++++++++++++--------------------
block/cfq-iosched.c | 37 +++++++++++--------------------------
include/linux/iocontext.h | 2 ++
3 files changed, 28 insertions(+), 46 deletions(-)diff --git a/block/blk-ioc.c b/block/blk-ioc.c
index 80245dc..73c7002 100644
--- a/block/blk-ioc.c
+++ b/block/blk-ioc.c
@@ -17,17 +17,13 @@ static struct kmem_cache *iocontext_cachep;static void cfq_dtor(struct io_context *ioc)
{
- struct cfq_io_context *cic[1];
- int r;
+ if (!hlist_empty(&ioc->cic_list)) {
+ struct cfq_io_context *cic;- /*
- * We don't have a specific key to lookup with, so use the gang
- * lookup to just retrieve the first item stored. The cfq exit
- * function will iterate the full tree, so any member will do.
- */
- r = radix_tree_gang_lookup(&ioc->radix_root, (void **) cic, 0, 1);
- if (r > 0)
- cic[0]->dtor(ioc);
+ cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
+ cic_list);
+ cic->dtor(ioc);
+ }
}/*
@@ -57,18 +53,16 @@ EXPORT_SYMBOL(put_io_context);static void cfq_exit(struct io_context *ioc)
{
- struct cfq_io_context *cic[1];
- int r;
-
rcu_read_lock();
- /*
- * See comment for cfq_dtor()
- */
- r = radix_tree_gang_lookup(&ioc->radix_root, (void **) cic, 0, 1);
- rcu_read_unlock();- if (r > 0)
- cic[0]->exit(ioc);
+ if (!hlist_empty(&ioc->cic_list)) {
+ struct cfq_io_context *cic;
+
+ cic = list_entry(ioc->cic_list.first, struct cfq_io_context,
+ cic_list);
+ cic->exit(ioc);
+ }
+ rcu_read_unlock();
}/* Called by the exitting task */
@@ -105,6 +99,7 @@ struct io_context *alloc_io_context(gfp_t gfp_f...
Even though io_contexts are fairly uncommon, adding more stuff to a data
structure was a pretty sad alternative to fixing a bug in
radix_tree_gang_lookup(), or to fixing a bug in a caller of it.IOW: what exactly went wrong here??
--
The cfq use of it, not the radix tree code! It juggled the keys and
wants to make sure that we see all users, modulo raced added ones (ok ifI could not convince myself that the current code would always do the
right thing. We should not have been seeing ->key == NULL entries in
there, it implied a double exit of that process. So I decided to fix it
by making the code a lot more readable (the patch in question deleted a
lot more than it added), at the cost of that hlist head + node.--
Jens Axboe--
<snip>
Hi Jens,
Thanks for the patch. The patch works fine, machine boots up without the kernel panic.
--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--
On Tue, 19 Feb 2008 09:36:34 +0100
Works well for me and my box booted !Thanks,
-Kame--
Super, I'll get it upstream. Thanks for testing and debugging!
--
Jens Axboe--
On Tue, 19 Feb 2008 09:36:34 +0100
when I inserted printk here
==
for (i = 0; i < nr; i++)
func(ioc, cics[i]);
printk("%d %lx\n", nr, index);
==
index was always "1" and nr was always 32.will try. please wait a bit.
Thanks,
-Kame--
Hang on, it returned 32? It should not return more than 16, since that
is what we have room for and asked for. Using ->dead_key when ->key is
NULL is correct btw, since that is the correct location in the tree once
the process has exited. But that should not happen until AFTER theIt boots here, so at least it passes normal sanity tests. It should
solve your problem as well, hopefully.--
Jens Axboe--
On Tue, 19 Feb 2008 09:58:38 +0100
sorry. Of course, it was 16 ;(your patch works well. thank you.
-Kame
--
It's committed now and posted in the relevant bugzilla as well (#9948).
--
Jens Axboe--
Hi,
The 2.6.25-rc2 kernel oopses while running dbench on ext3 filesystem
mounted with mount -o data=writeback,nobh option on the x86_64 boxBUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff80274972>] kmem_cache_alloc+0x3a/0x6c
PGD 1f6860067 PUD 1f5d64067 PMD 0
Oops: 0000 [1] SMP
CPU 3
Modules linked in:
Pid: 4271, comm: dbench Not tainted 2.6.25-rc2-autotest #1
RIP: 0010:[<ffffffff80274972>] [<ffffffff80274972>] kmem_cache_alloc+0x3a/0x6c
RSP: 0000:ffff8101fb041dc8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff810180033c00 RCX: ffffffff8027b269
RDX: 0000000000000000 RSI: 00000000000080d0 RDI: ffffffff80632d70
RBP: 00000000000080d0 R08: 0000000000000001 R09: 0000000000000000
R10: ffff8101feb36e50 R11: 0000000000000190 R12: 0000000000000001
R13: 0000000000000000 R14: ffff8101f8f38000 R15: 00000000ffffff9c
FS: 0000000000000000(0000) GS:ffff8101fff0f000(0063) knlGS:00000000f7e41460
CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
CR2: 0000000000000000 CR3: 00000001f5620000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbench (pid: 4271, threadinfo ffff8101fb040000, task ffff8101fb180000)
Stack: 0000000000000001 ffff8101fb041ea8 0000000000000001 ffffffff8027b269
ffff8101fb041ea8 ffffffff80281fe8 0000000000000001 0000000000000000
ffff8101fb041ea8 00000000ffffff9c 000000000000000b 0000000000000001
Call Trace:
[<ffffffff8027b269>] get_empty_filp+0x55/0xf9
[<ffffffff80281fe8>] __path_lookup_intent_open+0x22/0x8f
[<ffffffff80282853>] open_namei+0x86/0x5a7
[<ffffffff8027d019>] vfs_stat_fd+0x3c/0x4a
[<ffffffff80279ab1>] do_filp_open+0x1c/0x3d
[<ffffffff80279c2c>] get_unused_fd_flags+0x79/0x111
[<ffffffff80279dce>] do_sys_open+0x46/0xca
[<ffffffff80221c82>] ia32_sysret+0x0/0xaCode: 24 00 00 00 48 98 48 8b 9c c7 d8 02 00 00 48 8b 13 f6 c2 01 74 12...
Looks to me like we broke slab. Christoph is offline until the 27th..
--
On Sat, 16 Feb 2008 11:14:46 +0530 Kamalesh Babulal
On Mon, Feb 18, 2008 at 2:59 PM, Andrew Morton
This is probably fixed by:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commi...
As this is on the regression list, Kamalesh, can you please confirm
it's fixed now?Pekka
--
Thanks, I tested the 2.6.25-rc3-git4 kernel and the oops is not reproducible. This commit seems to fix the kernel oops.
--
Thanks & Regards,
Kamalesh Babulal,
Linux Technology Center,
IBM, ISTL.
--
Two x86-64 boxes here lock up here on 2.6.25-rc2, shortly after boot.
One running Fedora 8 + X (GNOME) and one a headless file server.
configs and lspci attached. Unable to capture any splatter so far.Bisecting...
Sounds like it may be http://lkml.org/lkml/2008/2/17/78.
Suggest you try reverting that before doing the bisect.
Cheers,
FJP
--
Here you go.
commit 45b503548210fe6f23e92b856421c2a3f05fd034
Author: Laszlo Attila Toth <panther@balabit.hu>
Date: Tue Feb 12 22:42:09 2008 -0800[RTNETLINK]: Send a single notification on device state changes.
contains the following gem:
if (tb[IFLA_LINKMODE]) {
- write_lock_bh(&dev_base_lock);
- dev->link_mode = nla_get_u8(tb[IFLA_LINKMODE]);
- write_unlock_bh(&dev_base_lock);
+ if (dev->link_mode != nla_get_u8(tb[IFLA_LINKMODE])) {
+ write_lock_bh(&dev_base_lock);
+ dev->link_mode = nla_get_u8(tb[IFLA_LINKMODE]);
+ write_lock_bh(&dev_base_lock);
+ modified = 1;
+ }
}and even with that fixed it breaks NetworkManager (on my test box it apparently
can't get the IP address using DHCP). Reverting this commit makes things
work again.Well, it looks like this patch went to you untested and unreviewed, so may I
gently request that it be reverted from your tree, pretty please?Thanks,
Rafael
--
| Andy Whitcroft | Re: 2.6.23-rc6-mm1 |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| James Bottomley | Re: Integration of SCST in the mainstream Linux kernel |
| Alan | Re: [RFC] Heads up on sys_fallocate() |
git: | |
| Natalie Protasevich | [BUG] New Kernel Bugs |
| Gerrit Renker | [PATCH 0/37] dccp: Feature negotiation - last call for comments |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Winkler, Tomas | RE: iwlwifi: fix build bug in "iwlwifi: fix LED stall" |
