This message contains a list of some regressions from 2.6.23 which have been reported since 2.6.24-rc1 was released and for which there are no fixes in the mainline that I know of. If any of them have been fixed already, please let me know. If you know of any other unresolved regressions from 2.6.23, please let me know either and I'll add them to the list. Subject : EHCI causes system to resume instantly from S4 Submitter : Maxim Levitsky <maximlevitsky@gmail.com> References : http://lkml.org/lkml/2007/10/27/66 http://bugzilla.kernel.org/show_bug.cgi?id=9258 Handled-By : "Rafael J. Wysocki" <rjw@sisk.pl> David Brownell <david-b@pacbell.net> Alan Stern <stern@rowland.harvard.edu> Patch : Note: : the problem appears to heavily depend on hardware Subject : leds: ledtrig-timer calls sleeping function from invalid context Submitter : Márton Németh <nm127@freemail.hu> References : http://bugzilla.kernel.org/show_bug.cgi?id=9264 Handled-By : Richard Purdie <rpurdie@rpsys.net> Patch : http://bugzilla.kernel.org/attachment.cgi?id=13493&action=view Subject : PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object Submitter : Hans de Bruin <bruinjm@xs4all.nl> References : http://bugzilla.kernel.org/show_bug.cgi?id=9320 Handled-By : Robert Moore <Robert.Moore@intel.com> Tejun Heo <htejun@gmail.com> Fu Michael <michael.fu@intel.com> Patch : Subject : snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s Submitter : Roland Dreier <rdreier@cisco.com> References : http://lkml.org/lkml/2007/11/8/255 http://bugzilla.kernel.org/show_bug.cgi?id=9332 Handled-By : Patch : Subject : system hangs after a few minutes Submitter : Marcus Better <marcus@better.se> References : http://bugzilla.kernel.org/show_bug.cgi?id=9335 Handled-By : Andrew Morton <akpm@linux-foundation.org> Alan Stern <stern@rowland.harvard.edu&g...
Linus, Andrew, i need some help deciding what to do about this regression. The fixes for this have been tested and resolve the regression, but they change printk and other code that runs by default and is thus rather invasive so late in the v2.6.24 cycle. This bug should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug option) - although some claimed effect was on udelay()/mdelay() too. i've attached below the queue of 5 patches that fix this problem. They have been build and boot tested with more than 1000 random kernels in the past few days, so i certainly trust the core and x86 bits of this. what do you think? Right now i've got them queued up for 2.6.25 in both the scheduler-devel and the x86-devel git trees - but can submit them for 2.6.24 if it's better if we did them there. I've got no strong opinion either way. Ingo --------------------> Subject: x86: scale cyc_2_nsec according to CPU frequency From: "Guillaume Chazarain" <guichaz@yahoo.fr> scale the sched_clock() cyc_2_nsec scaling factor according to CPU frequency changes. [ mingo@elte.hu: simplified it and fixed it for SMP. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/tsc_32.c | 43 ++++++++++++++++++++++++++++++----- arch/x86/kernel/tsc_64.c | 57 ++++++++++++++++++++++++++++++++++++++--------- include/asm-x86/timer.h | 23 ++++++++++++++---- 3 files changed, 102 insertions(+), 21 deletions(-) Index: linux/arch/x86/kernel/tsc_32.c =================================================================== --- linux.orig/arch/x86/kernel/tsc_32.c +++ linux/arch/x86/kernel/tsc_32.c @@ -5,6 +5,7 @@ #include <linux/jiffies.h> #include <linux/init.h> #include <linux/dmi.h> +#include <linux/percpu.h> #include <asm/delay.h> #include <asm/tsc.h> @@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable); * * -johnstul@us.ibm.com "math is ...
On Mon, 10 Dec 2007 21:42:12 +0100 printk_clock() doesn't seem terribly important but what's this stuff about effects on udelay/mdelay? That can be serious if they're getting shortened. --
since udelay depends on loops_per_jiffy, which is fixed up time_cpufreq_notifier(), i dont see how it could be affected by frequency changes. (but that's the theory - practice might be different) Ingo --
Stefano Brivio reported udelay()/mdelay() effects in the b43 driver. (and it caused driver failures for him.) Stefano, could you please try to sum up your experiences with that issue? Is it reproducable, and the 5 patches i did fix it? (if yes, could you try to re-do the mdelay verifications perhaps, to make sure it's not some other effect interacting here. In theory sched-clock scaling has no effect on udelay behavior.) Ingo --
On Tue, 11 Dec 2007 00:04:25 +0100 Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in delays isn't that good when using my crappy unstable TSC (mdelay(2000) causes delays between 2 and 2.9 seconds) but it's not depending on frequency changes anymore. So I'd say it's fixed, but please tell me if you want me to do any other test so as to be sure it is. -- Ciao Stefano --
ok, just to make sure we are all synced up. I made 8 patches related to this problem category (and all the trickle effects). 3 are upstream already, 5 are pending for v2.6.25. One out of those 5 is an immaterial cleanup patch - which leaves us 4 patches to sort out. So i'd suggest for you to try latest -git - that will tell us whether udelay() is acceptable on your box right now. i've attached those 4 patches: x86-sched_clock-re-scheduler-fix-x86-regression-in-native-sched-clock.patch x86-cpu-clock-idle-event.patch sched-printk-recursion-fix.patch sched-printk-clock-fix.patch none of them is _supposed_ to have any effect on udelay(), but the interactions in this area are weird. [ note: CONFIG_PRINTK_TIME will be broken and only fixed in v2.6.25, so use some other time metric for determining mdelay quality. ] plus then there's this patch: http://lkml.org/lkml/2007/12/7/100 is it perhaps this one that fixed udelay for you? [ which would be much more expected, as this patch changes udelay ;-) ] Ingo
On Tue, 11 Dec 2007 10:01:20 +0100 Yes, it is (msleep(2000), as said, gives delays between 2 and 2.9s on my box, and drivers are happy). The commit which fixed this (it seems) is fa2dd441df28b9fdfc68f84ae66f1b507cfff0e4. I'll bisect and tell you more in the Will try it ASAP, again, in the next few days anyway. -- Ciao Stefano --
Ingo, it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock() based udelay), so how udelay can be affected by your proposed changes? Thanks. -- Guillaume --
was this needed for you to get stable udelay()? (that cpu_clock() based udelay patch was buggy, i got the units wrong. udelay does wacky conversions between various units. So i dropped it for the time being.) the last rollup you tested didnt show udelay problems, and it didnt include the sched_clock() based udelay patch. so it would be nice if you could re-examine exactly what is needed. Please try latest -git and the concatenation of the 4 patches below. What would be the best info is to see which (if any!) patches are needed against latest -git to get a stable udelay() on your box. Ingo ---------------------------------------> Linus, Andrew, i need some help deciding what to do about this regression. The fixes for this have been tested and resolve the regression, but they change printk and other code that runs by default and is thus rather invasive so late in the v2.6.24 cycle. This bug should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug option) - although some claimed effect was on udelay()/mdelay() too. i've attached below the queue of 5 patches that fix this problem. They have been build and boot tested with more than 1000 random kernels in the past few days, so i certainly trust the core and x86 bits of this. what do you think? Right now i've got them queued up for 2.6.25 in both the scheduler-devel and the x86-devel git trees - but can submit them for 2.6.24 if it's better if we did them there. I've got no strong opinion either way. Ingo --------------------> Subject: x86: scale cyc_2_nsec according to CPU frequency From: "Guillaume Chazarain" <guichaz@yahoo.fr> scale the sched_clock() cyc_2_nsec scaling factor according to CPU frequency changes. [ mingo@elte.hu: simplified it and fixed it for SMP. ] Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> --- arch/x86/kernel/tsc_32.c | 43 ++++++++++++++++++++++++++++++----- arch/x86/kernel/tsc_...
On Tue, 11 Dec 2007 00:34:33 +0100 I'm still quite concerned about this in dual/quad core scenarios; the frequency of both cores is the maximum of what linux sets each core to; this means that if you're THIS sensitive to that there still is quite a nasty issue there. I wonder if the various delay functions (maybe only in .25) should use the maximum observed loops_per_jiffie instead always (across cpus) to be super safe here. -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Do you mean that the cpufreq code can be confused about the actual frequency of the cores? That sounds like a big problem. Thanks for any insight. -- Guillaume --
On Tue, 11 Dec 2007 01:01:25 +0100 it'll get way worse going forward. (but even on todays systems, the tsc no longer represents frequency, but is some fixed clock totally unrelated to cpu frequency) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
X86_FEATURE_CONSTANT_TSC CPUs (all modern Intel CPUs) should be fine -
we dont do any TSC frequency fixups for them. The loops_per_jiffy fixup
looks like this:
if (!(freq->flags & CPUFREQ_CONST_LOOPS))
cpu_data(freq->cpu).loops_per_jiffy =
cpufreq_scale(loops_per_jiffy_ref,
ref_freq, freq->new);
i.e. X86_FEATURE_CONSTANT_TSC excluded. The sched_clock() scaling factor
is modified like this:
if (!(freq->flags & CPUFREQ_CONST_LOOPS)) {
tsc_khz = cpu_khz;
preempt_disable();
set_cyc2ns_scale(cpu_khz, smp_processor_id());
so here X86_FEATURE_CONSTANT_TSC is excluded again. So the whole
frequency scaling issue will become a pure legacy issue only with time.
Ingo
--Any specific report? The jumping sched_clock on frequency change caused some scheduling oddities for me, but CFS attenuated the effect. Thanks. -- Guillaume --
Here's one for you - I have a new Lenovo t61p with which to irritate everyone. suspend-to-ram is a wipeout, but suspend-to-disk works OK under 2.6.23. However under 2.6.24-rc1 and -rc4 the machine reboots right at the end of resume-from-disk. Am trying to do a git-disect on it but it seems that someone has been screwing with ata Kconfig and I'm hitting a pile of cant-find-root-disk bisection points and I can't immediately work out why. I'll try to find time to look at it again next week. --
It's http://bugzilla.kernel.org/show_bug.cgi?id=9258 , I think. Does it do that if you unload ehci-hcd before the hibernation? --
the way i solve such bisection problems is to have the patch like below applied by a "git-bisect run" scriptlet (and popped off after the test). This way all must-have drivers and kernel features are selected for that particular testbox, no matter what Kconfig complication there are. (except outright config option renaming but those are rare) Ingo Index: linux/arch/x86/Kconfig.needed =================================================================== --- /dev/null +++ linux/arch/x86/Kconfig.needed @@ -0,0 +1,88 @@ +config FORCE_MINIMAL_CONFIG + bool + default y + +select EXPERIMENTAL + +select EXT3_FS +select EXT3_FS_XATTR +select EXT3_FS_POSIX_ACL +select EXT3_FS_SECURITY +select BLOCK +select HOTPLUG +#select INOTIFY +#select INOTIFY_USER + +# so that capset() works (sudo, etc.): +select SECURITY +select SECURITY_CAPABILITIES + +select BINFMT_ELF +select MSDOS_PARTITION +select PARTITION_ADVANCED +select BSD_DISKLABEL + +select SYSFS +select SYSFS_DEPRECATED +select PROC_FS +select FUTEX + +select ATA +select SATA_AHCI +select ATA_PIIX +select PATA_AMD +select PATA_OLDPIIX +select BLK_DEV_SD + +select E100 +select E1000 +select NET_ETHERNET +select NET_PCI +select MII +select CRC32 + +select 8139TOO +select FORCEDETH + +select PACKET + +select NETPOLL +select NETCONSOLE +select NET_POLL_CONTROLLER +select INET +select NET +select UNIX +select NETDEVICES + +select SERIAL_8250 +select SERIAL_8250_CONSOLE +select MAGIC_SYSRQ + +select INPUT +select INPUT_MOUSEDEV +select INPUT_POLLDEV +select INPUT_KEYBOARD +select KEYBOARD_ATKBD +select SERIO +select SERIO_I8042 + +select VT +select VT_CONSOLE +select HW_CONSOLE +select VGA_CONSOLE +select EARLY_PRINTK +select PRINTK +select UNIX98_PTYS + +select USB +select USB_MOUSE +select USB_EHCI_HCD +select USB_OHCI_HCD +select USB_UHCI_HCD +select USB_SUPPORT + +select PCI + +select STANDALONE +select PREVENT_FIRMWARE_BUILD + Index: linux/lib/Kconfig ===========...
Hi. I don't think that this is a regression: I reported on RedHat bugzilla when I switched from F7 to F8 and I was using 2.6.23.8 at that time. It looks to me an HAL regression, but of course I may be wrong :-) as the reported bisected to a bad commit. https://bugzilla.redhat.com/show_bug.cgi?id=373041 By the way, I now switched to Fedrora Rawhide with a 2.6.24-rc4-git5 custom kernel and Gnome desktop and the problem is still present, even with gnome-power-manager. Hope this helps. Regards, Fabio --
to me this looks like an ABI regression - utilities should work without
change. Something changed in /sys output that caused HAL to think that
there are two batteries:
| The output of lshal shows that there are two UDI's with
| info.capabilities = { 'battery' }:
|
| udi = '/org/freedesktop/Hal/devices/acpi_BAT0'
| udi = '/org/freedesktop/Hal/devices/computer_power_supply_0'
whether it's a HAL bug or a kernel bug, the original state should be
restored and it should be worked out without breaking users of older HAL
versions.
grumble: way too many times do various system utilities break when i
upgrade the kernel on my laptop. Maybe a new debug mechanism: we should
start fingerprinting the exact /sys and /proc output and enforce that
it's immutable across kernel releases as long as the hardware is
unmodified?
Ingo
--"breaking users of older HAL versions" == "breaking machines". That would be neat. It would need to be executed on a lot of different machines. I wonder if there's something sneaky we can do here. Install the script in /lib/modules/$(uname -r) and then run it from the kernel when the fork count reaches 1000 ;) (hey, I've seen worse: /proc files which start with #!/bin/sh) --
Hm, that wouldn't allow us to add new attributes ... --
Matt, the above bug is still occuring en masse during random bootups: [ 67.828312] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 41.270905] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 82.378714] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 35.308126] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() [ 35.484155] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot() and the patch does not seem to be upstream yet. Ingo --
I was hoping for some discussion about whether it was the best fix.
The current kzalloc thing strikes me as a step backwards for all
allocators. We'd do better to have a single non-inline kzalloc
function rather than an extra branch in the normal kmalloc fast path.
But here's the patch again, with my sign-off:
Avoid calling page allocator with __GFP_ZERO, as we might be in atomic
context and this will make thing unhappy on highmem systems. Instead,
manually zero allocations from the page allocator.
Signed-off-by: Matt Mackall <mpm@selenic.com>
diff -r f7edf7226317 mm/slob.c
--- a/mm/slob.c Wed Dec 05 15:57:06 2007 -0600
+++ b/mm/slob.c Wed Dec 05 15:57:51 2007 -0600
@@ -223,6 +231,7 @@ static void *slob_new_page(gfp_t gfp, in
{
void *page;
+ gfp &= ~__GFP_ZERO;
#ifdef CONFIG_NUMA
if (node != -1)
page = alloc_pages_node(node, gfp, order);
@@ -457,6 +470,8 @@ void *__kmalloc_node(size_t size, gfp_t
page = virt_to_page(ret);
page->private = size;
}
+ if (unlikely((gfp & __GFP_ZERO) && ret))
+ memset(ret, 0, size);
return ret;
}
}
--
Mathematics is the supreme nostalgia of our time.
--I think this is fine, but didn't we fix the warning already? Calling page allocators with __GFP_ZERO should be fine, as long as __GFP_HIGHMEM isn't set, and slab/slub/slob/kmalloc cannot use GFP_HIGHMEM *anyway*. But I'll apply it anyway, because it looks "obviously correct" from the standpoint that the _other_
SLUB does not pass __GFP_ZERO to the page allocator. --
> standpoint that the _other_
Hi Linus, In case you didn't already merge this: Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi> --
While I think the whole GFP_ZERO thing is a bit broken here, this is an improvement on my original patch, so: -- Mathematics is the supreme nostalgia of our time. --
But what about:
(c) stop passing GFP_ZERO to kmalloc allocators
stop fiddling with flags inside allocators
remove ugly if (unlikely(gfp & GFP_ZERO)) from kmalloc allocators
make kzalloc and kcalloc non-inline functions that do the memset
That should:
- make both kmalloc and kzalloc/kcalloc faster (one less branch)
- reduce kernel size
GFP_ZERO is a bit of an abuse here, given that we don't even intend to
pass it to the underlying "GFP".
--
Mathematics is the supreme nostalgia of our time.
--It's a kmap_atomic() debugging patch which I wrote ages ago and whcih Ingo sucked into his tree. I don't _think_ this warning is present in your tree at all. http://lkml.org/lkml/2007/11/29/157 is where it starts. I had a lenghty back-and-forth with Christoph on this within the past couple of months and I cannot locate the thread and I don't recall what the upshot was and Christoph is still offline. Knocking out __GFP_ZERO at the point where the slab allocator(s) call the page allocator seems like a good approach to me. But I don't think we need to do anything for 2.6.24.. --
ok - Rafael, please strike this off the regressions list, there is no problem in .24 other than the double memset for some SLOB metadata. Ingo --
OK, dropped. Thanks, Rafael --
Good. Although we should perhaps look at that reported performance problem with SLUB. It looks like SLUB will do a memclear() for the area twice (first for the whole page, then for the thing it allocated) for the slow case. Maybe that exacerbates the problem. Linus --
i dont think the SLUB problem could be explained purely via a double
memset(). [which ought to be extremely fast anyway] We are talking about
a 10 times slowdown on a 64-way box of a workload that is fairly
common-sense. (tasks sending messages to each other via bog standard
means)
while i dont want to jump to conclusions without looking at some
profiles, i think the SLUB performance regression is indicative of the
following fallacy: "SLAB can be done significantly simpler while keeping
the same performance".
I couldnt point to any particular aspect of SLAB that i could
characterise as "needless bloat".
the SLUB concept is proudly outlined in init/Kconfig:
config SLUB
bool "SLUB (Unqueued Allocator)"
help
SLUB is a slab allocator that minimizes cache line usage
instead of managing queues of cached objects (SLAB approach).
Per cpu caching is realized using slabs of objects instead
of queues of objects. SLUB can use memory efficiently
and has enhanced diagnostics.
but that's not true anymore - the two concepts are pretty much
equivalent, after all the "performance tuning" that went on in SLUB.
(read: 'frantically try to catch up with SLAB in benchmarks')
so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and
SLUB have essentially the same linecount:
$ wc -l mm/slab.c mm/slub.c
4478 mm/slab.c
4125 mm/slub.c
(and while linecount != complexity, there is a strong relationship.)
With SLAB having 10 years more test coverage and tuning.
the messiest and most fragile aspect of SLAB that i can think of is its
bootstrap hacks - but that is an entirely unimportant detail in my
opinion. SLAB has been cleaned up significantly in the past few years by
Pekka Enberg & co, it's pretty pleasant and straightforward code these
days.
I think we should we make SLAB the default for v2.6.24 ...
Ingo
--Well this is double crap. First of all SLUB does not do memclear twice. There is no reason to assume that SLUB has the problem just because SLOB hat that. A "fix" for that nonexistent problem went into Linus tree. WTH is going on? SLUB was done because of a series of problem with the basic concepts of I agree, SLABs architecture is pretty tight and I was one of those who helped it along to be that way. However, SLAB is just fundamentally wrong for todays machine. The key problem today is cacheline fetch latency and that problem will increase significantly in the future. Sure under some circumstances that exploit the fact that SLAB sometimes gets its guesses on the cpu cache right SLAB can still win but the more processors and nodes we get the more it will become difficult to keep SLAB around and the more it will become If you guarantee that all the regression of SLAB vs. SLUB are addressed then thats fine but AFAICT that is not possible. Here is a list of some of the benefits of SLUB just in case we forgot: - SLUB is performance wise much faster than SLAB. This can be more than a factor of 10 (case of concurrent allocations / frees on multiple processors). See http://lkml.org/lkml/2007/10/27/245 - Single threaded allocation speed is up to double that of SLAB - Remote freeing of objectcs in a NUMA systems is typically 30% faster. - Debugging on SLAB is difficult. Requires recompile of the kernel and the resulting output is difficult to interpret. SLUB can apply debugging options to a subset of the slabcaches in order to allow the system to work with maximum speed. This is necessary to detect difficult to reproduce race conditions. - SLAB can capture huge amounts of memory in its queues. The problem gets worse the more processors and NUMA nodes are in the system. The amount of memory limits the number of per cpu objects one can configure. - SLAB requires a pass through all slab caches every 2 seconds to expire objects. This is a ...
huh? You got the ordering wrong ;-) SLUB needs to resolve all regressions relative to SLAB. (or at least have a really good which is of little help if it regresses on other workloads. As we've seen it, SLUB can be more than 10 times slower on hackbench. You can tune SLUB to use 2MB pages but of course that's not a production level that's not a fundamental property of SLAB. It would be an about 10 lines hack to enable SLAB debugging switchable-on runtime, with the boot flag well that's the nature of caches, but it could be improved: restrict the moment you start capturing more memory in SLUB's per cpu queues i do appreciate that :-) SLUB was rather easy to "port" to PREEMPT_RT: it did not need a single line of change. The SLAB portion is a lot scarier: dione:~linux-rt.q> diffstat patches/rt-slab-new.patch all of these are neat. How about renaming it to SLAB2 instead of SLUB? The "unqueued" bit is just stupid NIH syndrome. It's _of course_ queued because it has to. "It actually, this might be a bug. the DMA caches should be created right away and filled with a small amount of objects due to stupid 16MB limitations with certain hardware. Later on a GFP_DMA request might not be fulfillable. (because that zone is filled up pretty quickly) Ingo --
Yes, SLUB should be the way to go, but some issues are not yet solved. I had to switch back to SLAB on a production NUMA server, with 2 nodes and 8GB ram. Using a lot of sockets, so a large part of memory was used by kernel. SLUB kernel was hitting OOM after 2 or 3 days of uptime. SLAB kernel never hit this. Unfortunatly I dont have a test machine to reproduce the setup. Maybe the problem is not related to SLUB at all, but an underlying VM/NUMA bug. The /proc/buddyinfo showed that : Node 0 contained two zones (DMA and DMA32) total 4 GB Node 1 contained one zone (Normal) total 4 GB So Node 0 contained no (Normal) zone part of /proc/meminfo Slab: 3338512 kB SReclaimable: 789716 kB SUnreclaim: 2548796 kB I remember network interrupts were taken by CPU 1, so most allocations were done by CPU 1 (node 1), and many freeing were done on CPU 0 Hope this helps --
On Sat, Dec 08, 2007 at 08:52:11PM +0100, Ingo Molnar wrote: > so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and > SLUB have essentially the same linecount: > > $ wc -l mm/slab.c mm/slub.c > 4478 mm/slab.c > 4125 mm/slub.c > > (and while linecount != complexity, there is a strong relationship.) > > With SLAB having 10 years more test coverage and tuning. FWIW, the one thing slub does that slab doesn't that I find really nice is being enable to enable debugging at boot time rather than compile time. We don't get many people running benchmarks against the Fedora kernel, so any scalability differences between slub/slab probably won't reach us until we start shipping betas of the next RHEL based on the same kernel. Which leaves my only other gripe. It broke slabtop. There's an alternative implementation in Documentation/vm/slabinfo.c (why there not say, util-linux, home of current slabtop?) Dave -- http://www.codemonkey.org.uk --
yes, but that's largely due to "dont change SLAB because we've got SLUB" resistence to SLAB patches. It's a 2 minute hack to implement this for that's actually a _bad_ ABI regression. Rafael, could you please add the kernel should output /proc/slabinfo data with the same formatting, period. Ingo --
I can remember it being said that would be one of the things to-do before SLUB could become default. --
just to hammer this point home - and while Christoph Lameter isnt online
currently to defend SLUB, this issue did come up and i guess there must
be other MM hackers that are advocates of SLUB. (otherwise this code
couldnt be upstream)
I find SLUB a fantastically confusing concept, and that starts at the
name. "Unqueued" it says. Lets take a look at the actual code:
mm/slub.c:
static void __always_inline *slab_alloc(struct kmem_cache *s,
gfp_t gfpflags, int node, void *addr)
{
...
local_irq_save(flags);
c = get_cpu_slab(s, smp_processor_id());
...
else {
object = c->freelist;
c->freelist = object[c->offset];
}
local_irq_restore(flags);
so it has a "free list", which is clearly per cpu. Hang on! Isnt that
actually a per CPU queue? Which SLUB has not, we are told? The "U" in
SLUB. How on earth can an allocator in 2007 claim to have no queuing
(which is in essence caching)? Am i on crack with this? Did i miss
something really obvious?
Ingo
--Hi Ingo,
I think you did. The difference is explained in Christoph's announcement:
"A particular concern was the complex management of the numerous
object queues in SLAB. SLUB has no such queues. Instead we dedicate a
slab for each allocating CPU and use objects from a slab directly
instead of queueing them up."
Which, I think, is where SLUB gets its name from (the "unqueued" part).
Now, while SLAB code is "pleasant and straightforward code" (thanks,
btw) for UMA, it's really hairy for NUMA plus the "alien caches" eat
tons of memory (which is why Christoph wrote SLUB in the first place,
the current code in SLAB is mostly unfixable due to its *queuing*
nature).
I don't object changing the default to CONFIG_SLAB but it's not really
a long term strategy unless we want to have three kmalloc's in the
kernel: one for embedded, one for UMA, and one NUMA.
Pekka
--On Sun, 9 Dec 2007 10:20:19 +0200 "Pekka Enberg" <penberg@cs.helsinki.fi> wrote: .. and they make slab slower on numa systems for database workloads To be honest, a SLAB without the alien stuff might be one of the best performers today .... ;) (on database loads.. where slub is quite a disaster as everyone knows) -- If you want to reach me at my work email, use arjan@linux.intel.com For development, discussion and tips for power savings, visit http://www.lesswatts.org --
yes, i understand the initial announcement (and the Kconfig entry still
says the same), but that is not matched up by the reality i see in the
actual code - SLUB clearly uses a queue/list of objects (as cited in my
previous mail), for obvious performance reasons.
unless i'm missing something obvious (and i easily might), i see SLUB as
SLAB reimplemented with a different queueing model. Not "without
i'm curious about the real facts behind this "alien cache problem". I
heard about it and asked around and was told that there's some sort of
bad quadratic behavior of memory consumption on NUMA - but i cannot
actually see that in the code.
The alien caches feature of SLAB i see as a spread out clustered
index/cache of objects on other nodes. It's not increasing the average
per object memory consumption per se! The number of alien caches
increases with increasing number of nodes, but _of course_, as memory
size increases too so there's more stuff and a larger expected spreadout
of memory to keep track of. ("Fixing" that would be like reintroducing a
single runqueue for the scheduler, based on the argument that it's an
O(1) number of runqueues against O(N) number of runqueues - which would
be complete nonsense.)
so i see SLAB alien caches as an automatic self-partitioning mechanism
... which has complexities but which also has _obvious_ performance
benefits. Yes, it has some disadvantages like all caching schemes do -
there's more cached memory tied up in the allocator at any given moment
- but arguing against that would be like arguing against a 2MB L2 cache
purely on the basis that a 1MB L2 cache is smaller and hence more
space-efficient. Caches are there to cache stuff, and more caches ...
use more memory. It's all a question of proportion and tuning, but the
_design_ should be based on having as thorough caching as possible.
Ingo
--The "queue" that you are talking about is the freelist of a slab. It exist for each slab. SLAB uses a table of free entries there. The freelist is limited to a slab and thus the locality of the freelist is bound to the same page which avoids NUMA locality checks. --
Hi Ingo,
Sure, the emphasis is on "no *such* queues."
I mostly live in the legacy 32-bit UMA/UP land still so I cannot verify this
myself but the kind folks at SGI claim the following (again from the
announcement):
"On our systems with 1k nodes / processors we have several gigabytes
just tied up for storing references to objects for those queues This does
not include the objects that could be on those queues. One fears that the
whole memory of the machine could one day be consumed by those
queues."
The problem is that for each cache, you have an "per-node alien queues"
for each node (see struct kmem_cache nodelists -> struct kmem_list3 alien).
Moving slab metadata to struct page solves this but now you can only have
one "queue" thats part of the same struct.
Pekka
--Yes, you can find gigs tied up on systems that have 100 GB of RAM, or you can have gigs tied up if you over-size your caches. I'd like to see yes, it's what i referred to as "distributed, per node cache". It has no "quadratic overhead". It has SLAB memory spread out amongst nodes. I.e. 1 million pages are distributed amongst 1k nodes with 1000 pages per node with each node having 1 page. But that memory is not lost and it's disingenous to call it 'overhead' and it very much comes handy for performance _IF_ there's global workload that involves cross-node allocations. It's simply a cache that is mis-sized and mis-constructed on large node count systems but i bet it makes quite a performance difference on low-node-count systems. On high node-count systems it might make sense to reduce the amount of cross-node caching and to _structure_ the distributed NUMA SLAB cache in an intelligent way (perhaps along cpuset boundaries) - but a total, design level _elimination_ of this caching concept, using very misleading arguments, just looks stupid to me ... Ingo --
i think much of this could be achieved by delaying the creation of alien
caches up until the point a { node X } -> { node Y } alloc/free
relationship gets established. So if a system is partitioned along
cpusets, vast areas of the NxN matrix never gets populated with alien
caches.
Ingo
--err, umm, OK, I'll go hunt it down and take a closer look at it. --
Nasty one. Tejun and several diligent reporters are doing sterling work there and things have improved. I don't know whether any of Tejun's patches have been merged yet, but we'll probably be OK on this one. What is unclear (to me) is what actually caused those people's machines to break? --
Hello, (cc'ing Alan) I'm still trying to find out what's really going on. That drive is It's introduced by setting ATAPI transfer chunk size to actual transfer size which is the right thing to do generally. However, with the change, the ATAPI HSM should be ready to drain full extra transfer chunks which libata HSM wasn't doing. With that part changed, most regressions should go away. Unfortunately, simply adding that doesn't fix the case in bug 9346 and I'm still trying to find out why. The good news is that the drive works fine with proposed more extensive improvements to libata ATAPI which will probably be included into 2.6.25, so we at least have long term solution. If we fail to find out the solution in time, we always have the alternative of backing out the ATAPI transfer chunk size update. This will break some other cases which were fixed by the change but those won't be regressions at least and we can add transfer chunk size update with other changes to 2.6.25. Thanks. -- tejun --
Which will break far more controllers and drives than it fixes, so Great, make everyone else wait another three months for a working CD drive. The one off regression appears far less harmful than a revert. Tejun - instead of backing out important updates for 2.6.24 we should just blacklist that specific drive for now and sort it nicely in 2.6.25, not revert stuff and break everyone elses ATAPI devices. Alan --
Btw, Alan, that "math" is total and utter BULLSH*T, and you should know that. "The one off regression" is likely the tip of an iceberg. If something regresses for one person, for that one person who tested and noticed and made a bug-report, there's probably a thousand people who haven't even tested the development kernel, or who had problems and just went back to the previous version. In contrast, reverting something will be guaranteed to not have those kinds of issues, since the only people who could notice are people for who it never worked in the first place. There's no "silent mass of people" that can be affected. This is why regressions are so important. They don't trump _everything_, but basically ignoring and letting them slide is *much* more painful than just reverting it. The biggest reason to ignore a regression is if nobody can even figure out where it came from, or reverting simply isn't an option for some really deep and fundamental issue. That doesn't seem to be the case here. So we should revert unless there is some known acceptable real fix. Linus --
The one off regression is probably not one off, but this is IDE so actually its quite probable its a single broken firmware. The alternative is that you cripple just about every user of various other standards compliant devices and controllers whose hardware we finally fixed. Finally you need to remember that the 'regression' is caused by the fact we now do the _right_ thing both in terms of 'old IDE' and specs. Believe it or not I did actually think in quite some detail about this case, and the relative probabilities, and go back and re-review the old IDE code (whose behaviour we now follow) and the spec. I spend a measurable amount of my time reviewing code and weighing risks, regressions and progress for an enterprise Linux vendor, so its something I do every day of the week. To blindly argue regressions are critical is sometimes (as in this case) to argue that "this freeway is no longer compatible with a horse and cart" means the freeway should be turned back into a dirt road. The horse and cart happened to work by chance because the road was quiet that day. We clearly need to add a horse & cart lane in the long term, but for 2.6.24 it may well be the right thing to do just to blacklist that specific drive back to old behaviour until we can tidy it more nicely. Alan --
Alan, you're so full of shit that it's not even funny.
Have you even *read* the thread?
Tejun already reported that this apparently gets fixed _properly_ with the
more extensive cleanups and fixes that are pending for 2.6.25.
In other words, the stuff you call so critically important (yet we've been
able to live without it until now!) is apparently simply NOT YET READY.
It's breaking things.
In this case, Tejun seems to be right on the money. I also agree 100%
with him when he says
"Blacklist takes time to develop and temporary blacklist for just one
release doesn't sound like a good idea."
because if we create some blacklist for that one reported device, not only
is it likely going to be wrong (it's almost never just one firmware or one
chip that has a particular issue), but we tend to create thee blacklists
and later realize that we shouldn't have blacklisted things at all, we
should just have done things differently.
For examples of that, see the NCQ blacklist that was just _us_ doing
things wrong (over-reacting to things we shouldn't care about), and
there's currently another totally unrelated discussion on a very similar
thing wrt libata and the ACPI startup commands for an unused controller
.. and what the hell does that matter? If the code doesn't work, it
doesn't work, and you might as well point to some random scribblings done
by a three-year-old on toilet paper rather than any "specs".
Real life matters more. Regressions matter more.
We apparently do have a full fix, but it seems to be too invasive for
2.6.24, which means that the thing that currently DOES NOT WORK and
causes regressions should be reverted, so that 2.6.24 is at least no worse
than 2.6.23 (and all earlier kernels) in this respect.
And then we should just hope that the more complete fix that Tejun has
doesn't cause any issues on its own. I would suggest that if you care so
deeply about this issue, you press Fedora into putting Tejun's tree into
Fedora t...btw., how extensive are those cleanups and fixes in reality, is there a rollup somewhere one could take a look at? Those fixes and cleanups were deferred to v2.6.25 in the knowledge of having the current code included in v2.6.24 - but now that the current approach seems to regress, maybe those cleanups are still safe enough. (compared to an outright revert) Ingo --
The following git tree contains patches pending review for 2.6.25. http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=shortlog;h=improve-ATAPI... And we're getting close to fixing the regression. I don't think there's too much worry about this one. Just need a bit more time to test few more things. Thanks. -- tejun --
ah, i see, the joys of the kernel running BIOS written code (AML): http://bugzilla.kernel.org/attachment.cgi?id=13932&action=view cute! Ingo --
> Have you even *read* the thread? In detail, as it unfolds and while testing variants of Tejun's code on the hardware I have access to - none of which has this bug making it And as I keep pointing out but you keep ignoring - not doing it breaks The code without the changes doesn't work either. So pick your toilet Which as the distro bug lists for ATAPI will tell you - aint good. Still Linus, the kernel regresses all over the place every release. If it didn't do that you'd never get any changes in. Your kernel would fossilize like RHEL or SLES and you'd be spending weeks analysing each changeset for possible side effects, or - as happens by neccessity - adding code paths so a fix vital to one driver ceases to share core code with another driver - to reduce regression risk. Been there, done that Have fun. I trust you'll be fixing the other 11 I think it was listed regressions before 2.6.24 - or backing out every changeset that could be responsible ? No I thought not - because that wouldn't be sensible either. Alan --
