I got the following warning while running sched-devel/latest. I was not doing any tracing at that point (for the glibc compilation issue). I suspect it was while I was using a KDE application (amarok) as the desktop froze for a few seconds at one point. Cheers, FJP ------------[ cut here ]------------ WARNING: at mm/slub.c:2443 kmem_cache_destroy+0x168/0x1a9() Modules linked in: snd_rtctimer i915 drm tun ppdev lp ac battery nfs lockd nfs_acl sunrpc bridge llc fuse lm85 hwmon_vid cpufreq_ondemand acpi_cpufreq freq_table firewire_sbp2 ipv6 snd_hda_intel snd_pcm_oss snd_mixer_oss snd_pcm snd_seq_dummy snd_seq_oss snd_seq_midi snd_rawmidi snd_seq_midi_event snd_seq snd_timer snd_seq_device psmouse snd parport_pc i2c_i801 button iTCO_wdt parport serio_raw i2c_core intel_agp joydev soundcore evdev pcspkr snd_page_alloc ext3 jbd mbcache dm_mirror dm_snapshot dm_mod ide_cd_mod ata_piix ata_generic sg sr_mod cdrom sd_mod piix usbhid hid floppy ahci libata scsi_mod dock ide_pci_generic ide_core firewire_ohci firewire_core crc_itu_t ehci_hcd uhci_hcd e1000e thermal processor fan Pid: 32428, comm: cat Not tainted 2.6.26-rc0-sched-devel.git-x86-latest.git #3 Call Trace: [<ffffffff80237c7b>] warn_on_slowpath+0x58/0x6b [<ffffffff80273676>] ? get_pageblock_migratetype+0x1b/0x1d [<ffffffff80274291>] ? free_hot_page+0xb/0xd [<ffffffff802742ab>] ? __free_pages+0x18/0x21 [<ffffffff80292170>] ? __free_slab+0xdc/0xe9 [<ffffffff80293c5d>] kmem_cache_destroy+0x168/0x1a9 [<ffffffff803974d8>] mon_text_release+0x89/0xae [<ffffffff80299254>] __fput+0xb9/0x161 [<ffffffff8029957b>] fput+0x14/0x16 [<ffffffff80296892>] filp_close+0x66/0x71 [<ffffffff80239eea>] put_files_struct+0x77/0xcb [<ffffffff80239f72>] __exit_files+0x34/0x39 [<ffffffff8023b1ec>] do_exit+0x256/0x683 [<ffffffff8023b694>] do_group_exit+0x7b/0x96 [<ffffffff8024409f>] get_signal_to_deliver+0x2c4/0x2f1 [<ffffffff8020b9bb>] do_notify_resume+0xd0/0x8d5 [<ffffffff80232e0d>] ? ...
i think this message is harmless to system stability. Cc:-ing USB folks as the kmem_cache_destroy() comes from drivers/usb/mon/mon_text.c. Context: SLUB recently started warning about caches that get destroyed with still outstanding allocations. Pekka, it might make sense to turn this SLUB warning into something more --
Oh absolutely. Something like this?
[PATCH] slub: improve kmem_cache_destroy() error message
As pointed out by Ingo, the SLUB warning of calling kmem_cache_destroy()
with cache that still has objects triggers in practice. So turn this
WARN_ON() into a nice SLUB specific error message to avoid people
confusing it to a SLUB bug.
Cc: Ingo Molnar <mingo@elte.hu>
Cc: Christoph Lameter <clameter@sgi.com>
Signed-off-by: Pekka Enberg <penberg@cs.helsinki.fi>
---
diff --git a/mm/slub.c b/mm/slub.c
index 7f8aaa2..c372e88 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2426,8 +2426,11 @@ void kmem_cache_destroy(struct kmem_cache *s)
if (!s->refcount) {
list_del(&s->list);
up_write(&slub_lock);
- if (kmem_cache_close(s))
- WARN_ON(1);
+ if (kmem_cache_close(s)) {
+ printk(KERN_ERR "SLUB %s: %s called for cache that "
+ "still has objects.\n", s->name, __func__);
+ dump_stack();
+ }
sysfs_slab_remove(s);
} else
up_write(&slub_lock);
--
Acked-by: Christoph Lameter <clameter@sgi.com> --
Thanks, Ingo, I'll look at it right away. It's difficult to believe that usbmon would do something so stupid but stranger things happened. Actually, I used an explicit SLAB cache and not kmalloc specifically because SLAB catched this condition. It would be strange if SLAB didn't and SLUB started catching it. Anyway we'll see soon enough. -- Pete --
Hi Pete, SLAB checks this too so you should get a similar error. Pekka --
I looked at this whole day today, but found nothing. The code analysis for usbmon shows nothing. Anyone wants to have a look? Reproduction does not work either. I tried various loads, loops of opening/closing while pushing events from USB, nothing. BTW, I tried to create memory pressure with mem=300MB, result: top: page allocation failure. order:4, mode:0xc0d0 Pid: 2337, comm: top Not tainted 2.6.25-ub #33 Call Trace: [<ffffffff8027450d>] __alloc_pages+0x32f/0x352 [<ffffffff80272252>] rmqueue_bulk+0x8a/0x9b [<ffffffff80273a3d>] __get_free_pages+0xe/0x4d [<ffffffff802ddd14>] show_stat+0x25/0x4f1 [<ffffffff80252f6f>] __lock_acquire+0xd95/0xda4 [<ffffffff80252f6f>] __lock_acquire+0xd95/0xda4 [<ffffffff802b349c>] seq_read+0x37/0x2a7 [<ffffffff80251a2a>] mark_held_locks+0x57/0x73 [<ffffffff802b349c>] seq_read+0x37/0x2a7 [<ffffffff80470fff>] mutex_lock_nested+0x23a/0x256 [<ffffffff80251bd4>] trace_hardirqs_on+0xf9/0x123 [<ffffffff8047100c>] mutex_lock_nested+0x247/0x256 [<ffffffff80470fff>] mutex_lock_nested+0x23a/0x256 [<ffffffff8047100c>] mutex_lock_nested+0x247/0x256 [<ffffffff802b3579>] seq_read+0x114/0x2a7 [<ffffffff802b3465>] seq_read+0x0/0x2a7 [<ffffffff802d74a7>] proc_reg_read+0x80/0x9b The error is not just advisory, the application (top) dies because /proc/stat cannot be read. I'm surprised that you people managed to get the warning at 2443, because things start unravel way earlier for me. -- Pete --
Hi Pete, I didn't look too closely but mon_text_fetch() does list_del() but no kmem_cache_free() which looks fishy. --
Yup, looks like a leak in the error path. So it's mon_text_read_t() -> mon_text_read_wait() -> mon_text_fetch() that removes it from the list but we can fail before we reach the kmem_cache_free() in mon_text_read_t() and thus lose track of the object. --
This is false, we cannot fail. There's only one "return" for non-NULL case and the kmem_cache_free is right in front of it. See my reply to the patch. -- Pete --
So something like the following totally untested patch.
Pekka
diff --git a/drivers/usb/mon/mon_text.c b/drivers/usb/mon/mon_text.c
index 5e3e4e9..c6fdc6e 100644
--- a/drivers/usb/mon/mon_text.c
+++ b/drivers/usb/mon/mon_text.c
@@ -449,6 +449,7 @@ static struct mon_event_text *mon_text_read_wait(struct mon_reader_text *rp,
if (file->f_flags & O_NONBLOCK) {
set_current_state(TASK_RUNNING);
remove_wait_queue(&rp->wait, &waita);
+ kmem_cache_free(rp->e_slab, ep);
return ERR_PTR(-EWOULDBLOCK);
}
/*
@@ -458,6 +459,7 @@ static struct mon_event_text *mon_text_read_wait(struct mon_reader_text *rp,
schedule();
if (signal_pending(current)) {
remove_wait_queue(&rp->wait, &waita);
+ kmem_cache_free(rp->e_slab, ep);
return ERR_PTR(-EINTR);
}
set_current_state(TASK_INTERRUPTIBLE);
--
The code looks like this:
while ((ep = mon_text_fetch(rp, mbus)) == NULL) {
if (file->f_flags & O_NONBLOCK) {
set_current_state(TASK_RUNNING);
remove_wait_queue(&rp->wait, &waita);
return ERR_PTR(-EWOULDBLOCK);
It's impossible to get inside the while() with non-null ep,
so your patch doesn't fix anything (even if kmem_cache_free
can survive a NULL object).
Any other ideas?
-- Pete
--
Yeah, I need a new pair of glasses. Thanks. Well, a race in mon_text_release() after the kmem_cache_free() loop and the call to kmem_cache_destroy() but looks like that can't happen either. Frans, is this easily triggerable? If so, can you try with CONFIG_SLAB and CONFIG_DEBUG_SLAB enabled instead? --
Not really. I've only seen it once two days ago and not at all today while I was doing the traces for the scheduling issue. I can give it a go, but I'm not yet ready to switch to current git on this box for longer periods (although so far I've actually had less issues than with .24-rc1; fingers crossed). If I get anything I'll follow up. --
I've run with the same kernel for about 6 hours today but not seen the WARNING again, so the chance that I'll be able to provide additional info is close to zero :-( Maybe I'll have more "luck" later in the .26 cycle when I start using -rc releases on a daily basis. --
SLUB now dumps all the remaining objects in the cache when this happens so you probably want to have CONFIG_SLUB_DEBUG_ON enabled just in case you hit this again. Thanks. --
