login
Header Space

 
 

Re: 2.6.24-rc4-git5: Reported regressions from 2.6.23

Previous thread: [PATCH] ecryptfs: make show_options reflect actual mount options by Eric Sandeen on Friday, December 7, 2007 - 10:06 pm. (1 message)

Next thread: [PATCH 2/4] docs: kernel-locking: Convert semaphore references by Daniel Walker on Friday, December 7, 2007 - 4:00 am. (1 message)
To: LKML <linux-kernel@...>
Cc: Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Friday, December 7, 2007 - 10:40 pm

This message contains a list of some regressions from 2.6.23 which have been
reported since 2.6.24-rc1 was released and for which there are no fixes in the
mainline that I know of.  If any of them have been fixed already, please let me
know.

If you know of any other unresolved regressions from 2.6.23, please let me know
either and I'll add them to the list.


Subject		: EHCI causes system to resume instantly from S4
Submitter	: Maxim Levitsky &lt;maximlevitsky@gmail.com&gt;
References	: http://lkml.org/lkml/2007/10/27/66
		  http://bugzilla.kernel.org/show_bug.cgi?id=9258
Handled-By	: "Rafael J. Wysocki" &lt;rjw@sisk.pl&gt;
		  David Brownell &lt;david-b@pacbell.net&gt;
		  Alan Stern &lt;stern@rowland.harvard.edu&gt;
Patch		: 
Note:		: the problem appears to heavily depend on hardware


Subject		: leds: ledtrig-timer calls sleeping function from invalid context
Submitter	: Márton Németh &lt;nm127@freemail.hu&gt;
References	: http://bugzilla.kernel.org/show_bug.cgi?id=9264
Handled-By	: Richard Purdie &lt;rpurdie@rpsys.net&gt;
Patch		: http://bugzilla.kernel.org/attachment.cgi?id=13493&amp;action=view


Subject		: PATA scan: ACPI Exception AE_AML_PACKAGE_LIMIT... is beyond end of object
Submitter	: Hans de Bruin &lt;bruinjm@xs4all.nl&gt;
References	: http://bugzilla.kernel.org/show_bug.cgi?id=9320
Handled-By	: Robert Moore &lt;Robert.Moore@intel.com&gt;
		  Tejun Heo &lt;htejun@gmail.com&gt;
		  Fu Michael &lt;michael.fu@intel.com&gt;
Patch		: 


Subject		: snd_hda_intel 2.6.24-rc2 bug: interrupts don't always work on Lenovo X60s
Submitter	: Roland Dreier &lt;rdreier@cisco.com&gt;
References	: http://lkml.org/lkml/2007/11/8/255
		  http://bugzilla.kernel.org/show_bug.cgi?id=9332
Handled-By	: 
Patch		: 


Subject		: system hangs after a few minutes
Submitter	: Marcus Better &lt;marcus@better.se&gt;
References	: http://bugzilla.kernel.org/show_bug.cgi?id=9335
Handled-By	: Andrew Morton &lt;akpm@linux-foundation.org&gt;
		  Alan Stern &lt;stern@rowland.harvard.edu&g...
To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, December 10, 2007 - 4:42 pm

Linus, Andrew, i need some help deciding what to do about this 
regression. The fixes for this have been tested and resolve the 
regression, but they change printk and other code that runs by default 
and is thus rather invasive so late in the v2.6.24 cycle. This bug 
should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug 
option) - although some claimed effect was on udelay()/mdelay() too.

i've attached below the queue of 5 patches that fix this problem. They 
have been build and boot tested with more than 1000 random kernels in 
the past few days, so i certainly trust the core and x86 bits of this.

what do you think? Right now i've got them queued up for 2.6.25 in both 
the scheduler-devel and the x86-devel git trees - but can submit them 
for 2.6.24 if it's better if we did them there. I've got no strong 
opinion either way.

	Ingo

--------------------&gt;
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" &lt;guichaz@yahoo.fr&gt;

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ mingo@elte.hu: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
---
 arch/x86/kernel/tsc_32.c |   43 ++++++++++++++++++++++++++++++-----
 arch/x86/kernel/tsc_64.c |   57 ++++++++++++++++++++++++++++++++++++++---------
 include/asm-x86/timer.h  |   23 ++++++++++++++----
 3 files changed, 102 insertions(+), 21 deletions(-)

Index: linux/arch/x86/kernel/tsc_32.c
===================================================================
--- linux.orig/arch/x86/kernel/tsc_32.c
+++ linux/arch/x86/kernel/tsc_32.c
@@ -5,6 +5,7 @@
 #include &lt;linux/jiffies.h&gt;
 #include &lt;linux/init.h&gt;
 #include &lt;linux/dmi.h&gt;
+#include &lt;linux/percpu.h&gt;
 
 #include &lt;asm/delay.h&gt;
 #include &lt;asm/tsc.h&gt;
@@ -80,13 +81,31 @@ EXPORT_SYMBOL_GPL(check_tsc_unstable);
  *
  *			-johnstul@us.ibm.com "math is ...
To: Ingo Molnar <mingo@...>
Cc: <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Monday, December 10, 2007 - 4:59 pm

On Mon, 10 Dec 2007 21:42:12 +0100

printk_clock() doesn't seem terribly important but what's this stuff about
effects on udelay/mdelay?  That can be serious if they're getting
shortened.

--
To: Andrew Morton <akpm@...>
Cc: <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Monday, December 10, 2007 - 6:45 pm

since udelay depends on loops_per_jiffy, which is fixed up 
time_cpufreq_notifier(), i dont see how it could be affected by 
frequency changes. (but that's the theory - practice might be different)

	Ingo
--
To: Andrew Morton <akpm@...>
Cc: <rjw@...>, <linux-kernel@...>, <torvalds@...>, Stefano Brivio <stefano.brivio@...>, Guillaume Chazarain <guichaz@...>
Date: Monday, December 10, 2007 - 7:04 pm

Stefano Brivio reported udelay()/mdelay() effects in the b43 driver. 
(and it caused driver failures for him.)

Stefano, could you please try to sum up your experiences with that 
issue? Is it reproducable, and the 5 patches i did fix it? (if yes, 
could you try to re-do the mdelay verifications perhaps, to make sure 
it's not some other effect interacting here. In theory sched-clock 
scaling has no effect on udelay behavior.)

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>, Guillaume Chazarain <guichaz@...>
Date: Monday, December 10, 2007 - 7:34 pm

On Tue, 11 Dec 2007 00:04:25 +0100

Sorry for disappearing. Anyway, yes, those patches fixed it. Precision in
delays isn't that good when using my crappy unstable TSC (mdelay(2000)
causes delays between 2 and 2.9 seconds) but it's not depending on frequency
changes anymore. So I'd say it's fixed, but please tell me if you want me
to do any other test so as to be sure it is.


--
Ciao
Stefano
--
To: Stefano Brivio <stefano.brivio@...>
Cc: Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>, Guillaume Chazarain <guichaz@...>
Date: Tuesday, December 11, 2007 - 5:01 am

ok, just to make sure we are all synced up. I made 8 patches related to 
this problem category (and all the trickle effects). 3 are upstream 
already, 5 are pending for v2.6.25. One out of those 5 is an immaterial 
cleanup patch - which leaves us 4 patches to sort out.

So i'd suggest for you to try latest -git - that will tell us whether 
udelay() is acceptable on your box right now.

i've attached those 4 patches:

 x86-sched_clock-re-scheduler-fix-x86-regression-in-native-sched-clock.patch
 x86-cpu-clock-idle-event.patch
 sched-printk-recursion-fix.patch
 sched-printk-clock-fix.patch

none of them is _supposed_ to have any effect on udelay(), but the 
interactions in this area are weird.

[ note: CONFIG_PRINTK_TIME will be broken and only fixed in v2.6.25, so 
  use some other time metric for determining mdelay quality. ]

plus then there's this patch:

  http://lkml.org/lkml/2007/12/7/100

is it perhaps this one that fixed udelay for you? [ which would be much 
more expected, as this patch changes udelay ;-) ]

	Ingo
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>, Guillaume Chazarain <guichaz@...>
Date: Tuesday, December 11, 2007 - 5:10 pm

On Tue, 11 Dec 2007 10:01:20 +0100

Yes, it is (msleep(2000), as said, gives delays between 2 and 2.9s on my
box, and drivers are happy).

The commit which fixed this (it seems) is
fa2dd441df28b9fdfc68f84ae66f1b507cfff0e4. I'll bisect and tell you more in the


Will try it ASAP, again, in the next few days anyway.


--
Ciao
Stefano
--
To: Stefano Brivio <stefano.brivio@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Monday, December 10, 2007 - 7:53 pm

Ingo,

it seems you dropped http://lkml.org/lkml/2007/12/7/100 (cpu_clock()
based udelay), so how udelay can be affected by your proposed changes?

Thanks.

-- 
Guillaume
--
To: Guillaume Chazarain <guichaz@...>
Cc: Stefano Brivio <stefano.brivio@...>, Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Tuesday, December 11, 2007 - 4:48 am

was this needed for you to get stable udelay()? (that cpu_clock() based 
udelay patch was buggy, i got the units wrong. udelay does wacky 
conversions between various units. So i dropped it for the time being.)

the last rollup you tested didnt show udelay problems, and it didnt 
include the sched_clock() based udelay patch.

so it would be nice if you could re-examine exactly what is needed. 
Please try latest -git and the concatenation of the 4 patches below. 
What would be the best info is to see which (if any!) patches are needed 
against latest -git to get a stable udelay() on your box.

	Ingo

---------------------------------------&gt;

Linus, Andrew, i need some help deciding what to do about this 
regression. The fixes for this have been tested and resolve the 
regression, but they change printk and other code that runs by default 
and is thus rather invasive so late in the v2.6.24 cycle. This bug 
should only affect CONFIG_PRINTK_TIME=y kernels (a non-default debug 
option) - although some claimed effect was on udelay()/mdelay() too.

i've attached below the queue of 5 patches that fix this problem. They 
have been build and boot tested with more than 1000 random kernels in 
the past few days, so i certainly trust the core and x86 bits of this.

what do you think? Right now i've got them queued up for 2.6.25 in both 
the scheduler-devel and the x86-devel git trees - but can submit them 
for 2.6.24 if it's better if we did them there. I've got no strong 
opinion either way.

	Ingo

--------------------&gt;
Subject: x86: scale cyc_2_nsec according to CPU frequency
From: "Guillaume Chazarain" &lt;guichaz@yahoo.fr&gt;

scale the sched_clock() cyc_2_nsec scaling factor according to
CPU frequency changes.

[ mingo@elte.hu: simplified it and fixed it for SMP. ]

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
---
 arch/x86/kernel/tsc_32.c |   43 ++++++++++++++++++++++++++++++-----
 arch/x86/kernel/tsc_...
To: Stefano Brivio <stefano.brivio@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>, Guillaume Chazarain <guichaz@...>
Date: Monday, December 10, 2007 - 7:56 pm

On Tue, 11 Dec 2007 00:34:33 +0100
I'm still quite concerned about this in dual/quad core scenarios;
the frequency of both cores is the maximum of what linux sets each core to;
this means that if you're THIS sensitive to that there still is quite a nasty issue there.

I wonder if the various delay functions (maybe only in .25) should use the maximum observed loops_per_jiffie instead always (across cpus) to be super safe here.

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To: Arjan van de Ven <arjan@...>
Cc: Stefano Brivio <stefano.brivio@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Monday, December 10, 2007 - 8:01 pm

Do you mean that the cpufreq code can be confused about the actual
frequency of the cores? That sounds like a big problem.

Thanks for any insight.

-- 
Guillaume
--
To: Guillaume Chazarain <guichaz@...>
Cc: Stefano Brivio <stefano.brivio@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Monday, December 10, 2007 - 9:06 pm

On Tue, 11 Dec 2007 01:01:25 +0100


it'll get way worse going forward.
(but even on todays systems, the tsc no longer represents frequency, but is some fixed clock totally unrelated to cpu frequency)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To: Arjan van de Ven <arjan@...>
Cc: Guillaume Chazarain <guichaz@...>, Stefano Brivio <stefano.brivio@...>, Andrew Morton <akpm@...>, <rjw@...>, <linux-kernel@...>, <torvalds@...>
Date: Tuesday, December 11, 2007 - 4:43 am

X86_FEATURE_CONSTANT_TSC CPUs (all modern Intel CPUs) should be fine - 
we dont do any TSC frequency fixups for them. The loops_per_jiffy fixup 
looks like this:

                if (!(freq-&gt;flags &amp; CPUFREQ_CONST_LOOPS))
                        cpu_data(freq-&gt;cpu).loops_per_jiffy =
                                cpufreq_scale(loops_per_jiffy_ref,
                                                ref_freq, freq-&gt;new);

i.e. X86_FEATURE_CONSTANT_TSC excluded. The sched_clock() scaling factor 
is modified like this:

                        if (!(freq-&gt;flags &amp; CPUFREQ_CONST_LOOPS)) {
                                tsc_khz = cpu_khz;
                                preempt_disable();
                                set_cyc2ns_scale(cpu_khz, smp_processor_id());

so here X86_FEATURE_CONSTANT_TSC is excluded again. So the whole 
frequency scaling issue will become a pure legacy issue only with time.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Monday, December 10, 2007 - 4:57 pm

Any specific report?
The jumping sched_clock on frequency change caused some
scheduling oddities for me, but CFS attenuated the effect.

Thanks.

-- 
Guillaume
--
To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 7:54 am

Here's one for you - I have a new Lenovo t61p with which to irritate
everyone. 

suspend-to-ram is a wipeout, but suspend-to-disk works OK under
2.6.23.

However under 2.6.24-rc1 and -rc4 the machine reboots right at the end of
resume-from-disk.

Am trying to do a git-disect on it but it seems that someone has been
screwing with ata Kconfig and I'm hitting a pile of cant-find-root-disk
bisection points and I can't immediately work out why.  I'll try to find
time to look at it again next week.

--
To: Andrew Morton <akpm@...>
Cc: LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 10:24 am

It's http://bugzilla.kernel.org/show_bug.cgi?id=9258 , I think.

Does it do that if you unload ehci-hcd before the hibernation?
--
To: Andrew Morton <akpm@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Sunday, December 9, 2007 - 8:05 am

the way i solve such bisection problems is to have the patch like below 
applied by a "git-bisect run" scriptlet (and popped off after the test). 
This way all must-have drivers and kernel features are selected for that 
particular testbox, no matter what Kconfig complication there are. 
(except outright config option renaming but those are rare)

	Ingo

Index: linux/arch/x86/Kconfig.needed
===================================================================
--- /dev/null
+++ linux/arch/x86/Kconfig.needed
@@ -0,0 +1,88 @@
+config FORCE_MINIMAL_CONFIG
+	bool
+	default y
+
+select EXPERIMENTAL
+
+select EXT3_FS
+select EXT3_FS_XATTR
+select EXT3_FS_POSIX_ACL
+select EXT3_FS_SECURITY
+select BLOCK
+select HOTPLUG
+#select INOTIFY
+#select INOTIFY_USER
+
+# so that capset() works (sudo, etc.):
+select SECURITY
+select SECURITY_CAPABILITIES
+
+select BINFMT_ELF
+select MSDOS_PARTITION
+select PARTITION_ADVANCED
+select BSD_DISKLABEL
+
+select SYSFS
+select SYSFS_DEPRECATED
+select PROC_FS
+select FUTEX
+
+select ATA
+select SATA_AHCI
+select ATA_PIIX
+select PATA_AMD
+select PATA_OLDPIIX
+select BLK_DEV_SD
+
+select E100
+select E1000
+select NET_ETHERNET
+select NET_PCI
+select MII
+select CRC32
+
+select 8139TOO
+select FORCEDETH
+
+select PACKET
+
+select NETPOLL
+select NETCONSOLE
+select NET_POLL_CONTROLLER
+select INET
+select NET
+select UNIX
+select NETDEVICES
+
+select SERIAL_8250
+select SERIAL_8250_CONSOLE
+select MAGIC_SYSRQ
+
+select INPUT
+select INPUT_MOUSEDEV
+select INPUT_POLLDEV
+select INPUT_KEYBOARD
+select KEYBOARD_ATKBD
+select SERIO
+select SERIO_I8042
+
+select VT
+select VT_CONSOLE
+select HW_CONSOLE
+select VGA_CONSOLE
+select EARLY_PRINTK
+select PRINTK
+select UNIX98_PTYS
+
+select USB
+select USB_MOUSE
+select USB_EHCI_HCD
+select USB_OHCI_HCD
+select USB_UHCI_HCD
+select USB_SUPPORT
+
+select PCI
+
+select STANDALONE
+select PREVENT_FIRMWARE_BUILD
+
Index: linux/lib/Kconfig
===========...
To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Saturday, December 8, 2007 - 2:53 am

Hi.



I don't think that this is a regression: I reported on RedHat bugzilla
when I switched from F7 to F8 and I was using 2.6.23.8 at that time.
It looks to me an HAL regression, but of course I may be wrong :-) as
the reported bisected to a bad commit.

https://bugzilla.redhat.com/show_bug.cgi?id=373041

By the way, I now switched to Fedrora Rawhide with a 2.6.24-rc4-git5
custom kernel and Gnome desktop and the problem is still present, even
with gnome-power-manager.

Hope this helps.
Regards,
Fabio
--
To: Fabio Comolli <fabio.comolli@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Greg KH <gregkh@...>, Len Brown <lenb@...>
Date: Saturday, December 8, 2007 - 4:28 am

to me this looks like an ABI regression - utilities should work without 
change. Something changed in /sys output that caused HAL to think that 
there are two batteries:

| The output of lshal shows that there are two UDI's with 
| info.capabilities = { 'battery' }:
|
| udi = '/org/freedesktop/Hal/devices/acpi_BAT0'
| udi = '/org/freedesktop/Hal/devices/computer_power_supply_0'

whether it's a HAL bug or a kernel bug, the original state should be 
restored and it should be worked out without breaking users of older HAL 
versions.

grumble: way too many times do various system utilities break when i 
upgrade the kernel on my laptop. Maybe a new debug mechanism: we should 
start fingerprinting the exact /sys and /proc output and enforce that 
it's immutable across kernel releases as long as the hardware is 
unmodified?

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Fabio Comolli <fabio.comolli@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Greg KH <gregkh@...>, Len Brown <lenb@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Saturday, December 8, 2007 - 5:23 am

"breaking users of older HAL versions" == "breaking machines".


That would be neat.  It would need to be executed on a lot of different
machines.

I wonder if there's something sneaky we can do here.  Install the script in
/lib/modules/$(uname -r) and then run it from the kernel when the fork
count reaches 1000 ;)

(hey, I've seen worse: /proc files which start with #!/bin/sh)
--
To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Fabio Comolli <fabio.comolli@...>, LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Greg KH <gregkh@...>, Len Brown <lenb@...>, Alexey Starikovskiy <astarikovskiy@...>
Date: Saturday, December 8, 2007 - 6:11 pm

Hm, that wouldn't allow us to add new attributes ...
--
To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Matt Mackall <mpm@...>
Date: Saturday, December 8, 2007 - 5:30 am

Matt, the above bug is still occuring en masse during random bootups:

[   67.828312] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[   41.270905] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[   82.378714] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[   35.308126] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
[   35.484155] WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()
WARNING: at arch/x86/mm/highmem_32.c:52 kmap_atomic_prot()

and the patch does not seem to be upstream yet.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>
Date: Saturday, December 8, 2007 - 12:37 pm

I was hoping for some discussion about whether it was the best fix.
The current kzalloc thing strikes me as a step backwards for all
allocators. We'd do better to have a single non-inline kzalloc
function rather than an extra branch in the normal kmalloc fast path.

But here's the patch again, with my sign-off:


Avoid calling page allocator with __GFP_ZERO, as we might be in atomic
context and this will make thing unhappy on highmem systems. Instead,
manually zero allocations from the page allocator.

Signed-off-by: Matt Mackall &lt;mpm@selenic.com&gt;

diff -r f7edf7226317 mm/slob.c
--- a/mm/slob.c	Wed Dec 05 15:57:06 2007 -0600
+++ b/mm/slob.c	Wed Dec 05 15:57:51 2007 -0600
@@ -223,6 +231,7 @@ static void *slob_new_page(gfp_t gfp, in
 {
 	void *page;
 
+	gfp &amp;= ~__GFP_ZERO;
 #ifdef CONFIG_NUMA
 	if (node != -1)
 		page = alloc_pages_node(node, gfp, order);
@@ -457,6 +470,8 @@ void *__kmalloc_node(size_t size, gfp_t 
 			page = virt_to_page(ret);
 			page-&gt;private = size;
 		}
+		if (unlikely((gfp &amp; __GFP_ZERO) &amp;&amp; ret))
+			memset(ret, 0, size);
 		return ret;
 	}
 }


-- 
Mathematics is the supreme nostalgia of our time.
--
To: Matt Mackall <mpm@...>
Cc: Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 1:47 pm

I think this is fine, but didn't we fix the warning already? Calling page 
allocators with __GFP_ZERO should be fine, as long as __GFP_HIGHMEM isn't 
set, and slab/slub/slob/kmalloc cannot use GFP_HIGHMEM *anyway*. 

But I'll apply it anyway, because it looks "obviously correct" from the 
standpoint that the _other_
To: Linus Torvalds <torvalds@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Thursday, December 13, 2007 - 6:03 pm

SLUB does not pass __GFP_ZERO to the page allocator.
--
To: Matt Mackall <mpm@...>
Cc: Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 1:54 pm

&gt; standpoint that the _other_
To: Linus Torvalds <torvalds@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 4:33 am

Hi Linus,


In case you didn't already merge this:

Reviewed-by: Pekka Enberg &lt;penberg@cs.helsinki.fi&gt;
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 3:00 pm

While I think the whole GFP_ZERO thing is a bit broken here, this is
an improvement on my original patch, so:


-- 
Mathematics is the supreme nostalgia of our time.
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Andrew Morton <akpm@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 2:33 pm

But what about:

(c) stop passing GFP_ZERO to kmalloc allocators
    stop fiddling with flags inside allocators
    remove ugly if (unlikely(gfp &amp; GFP_ZERO)) from kmalloc allocators
    make kzalloc and kcalloc non-inline functions that do the memset

That should:

- make both kmalloc and kzalloc/kcalloc faster (one less branch)
- reduce kernel size

GFP_ZERO is a bit of an abuse here, given that we don't even intend to
pass it to the underlying "GFP".

-- 
Mathematics is the supreme nostalgia of our time.
--
To: Linus Torvalds <torvalds@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 2:09 pm

It's a kmap_atomic() debugging patch which I wrote ages ago and whcih Ingo
sucked into his tree.  I don't _think_ this warning is present in your tree
at all.

http://lkml.org/lkml/2007/11/29/157 is where it starts.

I had a lenghty back-and-forth with Christoph on this within the past
couple of months and I cannot locate the thread and I don't recall what the
upshot was and Christoph is still offline.

Knocking out __GFP_ZERO at the point where the slab allocator(s) call the
page allocator seems like a good approach to me.

But I don't think we need to do anything for 2.6.24..
--
To: Andrew Morton <akpm@...>
Cc: Linus Torvalds <torvalds@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 3:58 am

ok - Rafael, please strike this off the regressions list, there is no 
problem in .24 other than the double memset for some SLOB metadata.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linus Torvalds <torvalds@...>, Matt Mackall <mpm@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 10:17 am

OK, dropped.

Thanks,
Rafael
--
To: Andrew Morton <akpm@...>
Cc: Matt Mackall <mpm@...>, Ingo Molnar <mingo@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 2:37 pm

Good. Although we should perhaps look at that reported performance problem 
with SLUB. It looks like SLUB will do a memclear() for the area twice 
(first for the whole page, then for the thing it allocated) for the slow 
case. Maybe that exacerbates the problem.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 3:52 pm

i dont think the SLUB problem could be explained purely via a double 
memset(). [which ought to be extremely fast anyway] We are talking about 
a 10 times slowdown on a 64-way box of a workload that is fairly 
common-sense. (tasks sending messages to each other via bog standard 
means)

while i dont want to jump to conclusions without looking at some 
profiles, i think the SLUB performance regression is indicative of the 
following fallacy: "SLAB can be done significantly simpler while keeping 
the same performance".

I couldnt point to any particular aspect of SLAB that i could 
characterise as "needless bloat".

the SLUB concept is proudly outlined in init/Kconfig:

  config SLUB
        bool "SLUB (Unqueued Allocator)"
        help
           SLUB is a slab allocator that minimizes cache line usage
           instead of managing queues of cached objects (SLAB approach).
           Per cpu caching is realized using slabs of objects instead
           of queues of objects. SLUB can use memory efficiently
           and has enhanced diagnostics.

but that's not true anymore - the two concepts are pretty much 
equivalent, after all the "performance tuning" that went on in SLUB. 
(read: 'frantically try to catch up with SLAB in benchmarks')

so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and 
SLUB have essentially the same linecount:

  $ wc -l mm/slab.c mm/slub.c
  4478 mm/slab.c
  4125 mm/slub.c

(and while linecount != complexity, there is a strong relationship.)

With SLAB having 10 years more test coverage and tuning.

the messiest and most fragile aspect of SLAB that i can think of is its 
bootstrap hacks - but that is an entirely unimportant detail in my 
opinion. SLAB has been cleaned up significantly in the past few years by 
Pekka Enberg &amp; co, it's pretty pleasant and straightforward code these 
days.

I think we should we make SLAB the default for v2.6.24 ...

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Friday, December 14, 2007 - 12:07 am

Well this is double crap. First of all SLUB does not do memclear twice. 
There is no reason to assume that SLUB has the problem just because SLOB 
hat that. A "fix" for that nonexistent problem went into Linus tree. WTH 
is going on?

SLUB was done because of a series of problem with the basic concepts of 

I agree, SLABs architecture is pretty tight and I was one of those who 
helped it along to be that way.

However, SLAB is just fundamentally wrong for todays machine. The key 
problem today is cacheline fetch latency and that problem will increase 
significantly in the future. Sure under some circumstances that exploit 
the fact that SLAB sometimes gets its guesses on the cpu cache right SLAB 
can still win but the more processors and nodes we get the more it will 
become difficult to keep SLAB around and the more it will become 

If you guarantee that all the regression of SLAB vs. SLUB are addressed 
then thats fine but AFAICT that is not possible.

Here is a list of some of the benefits of SLUB just in case we forgot:


- SLUB is performance wise much faster than SLAB. This can be more than a
  factor of 10 (case of concurrent allocations / frees on multiple
  processors). See http://lkml.org/lkml/2007/10/27/245

- Single threaded allocation speed is up to double that of SLAB

- Remote freeing of objectcs in a NUMA systems is typically 30% faster.

- Debugging on SLAB is difficult. Requires recompile of the kernel
  and the resulting output is difficult to interpret. SLUB can apply
  debugging options to a subset of the slabcaches in order to allow
  the system to work with maximum speed. This is necessary to detect
  difficult to reproduce race conditions.

- SLAB can capture huge amounts of memory in its queues. The problem
  gets worse the more processors and NUMA nodes are in the system. The 
  amount of memory limits the number of per cpu objects one can configure.

- SLAB requires a pass through all slab caches every 2 seconds to
  expire objects. This is a ...
To: Christoph Lameter <clameter@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Friday, December 14, 2007 - 8:49 am

huh? You got the ordering wrong ;-) SLUB needs to resolve all 
regressions relative to SLAB. (or at least have a really good 

which is of little help if it regresses on other workloads. As we've 
seen it, SLUB can be more than 10 times slower on hackbench. You can 
tune SLUB to use 2MB pages but of course that's not a production level 


that's not a fundamental property of SLAB. It would be an about 10 lines 
hack to enable SLAB debugging switchable-on runtime, with the boot flag 

well that's the nature of caches, but it could be improved: restrict 

the moment you start capturing more memory in SLUB's per cpu queues 


i do appreciate that :-) SLUB was rather easy to "port" to PREEMPT_RT: 
it did not need a single line of change. The SLAB portion is a lot 
scarier:

  dione:~linux-rt.q&gt; diffstat patches/rt-slab-new.patch

all of these are neat.

How about renaming it to SLAB2 instead of SLUB? The "unqueued" bit is 
just stupid NIH syndrome. It's _of course_ queued because it has to. "It 

actually, this might be a bug. the DMA caches should be created right 
away and filled with a small amount of objects due to stupid 16MB 
limitations with certain hardware. Later on a GFP_DMA request might not 
be fulfillable. (because that zone is filled up pretty quickly)

	Ingo
--
To: Christoph Lameter <clameter@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Friday, December 14, 2007 - 3:16 am

Yes, SLUB should be the way to go, but some issues are not yet solved.

I had to switch back to SLAB on a production NUMA server, with 2 nodes and 8GB 
ram. Using a lot of sockets, so a large part of memory was used by kernel.

SLUB kernel was hitting OOM after 2 or 3 days of uptime.
SLAB kernel never hit this.

Unfortunatly I dont have a test machine to reproduce the setup.

Maybe the problem is not related to SLUB at all, but an underlying VM/NUMA bug.

The /proc/buddyinfo showed that :

Node 0 contained two zones (DMA and DMA32) total 4 GB
Node 1 contained one zone (Normal) total 4 GB

So Node 0 contained no (Normal) zone

part of /proc/meminfo

Slab:          3338512 kB
SReclaimable:   789716 kB
SUnreclaim:    2548796 kB

I remember network interrupts were taken by CPU 1, so most allocations were 
done by CPU 1 (node 1), and many freeing were done on CPU 0

Hope this helps

--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, December 11, 2007 - 2:27 am

On Sat, Dec 08, 2007 at 08:52:11PM +0100, Ingo Molnar wrote:

 &gt; so even today's upstream kernel, which has 'ancient' SLUB code, SLAB and 
 &gt; SLUB have essentially the same linecount:
 &gt; 
 &gt;   $ wc -l mm/slab.c mm/slub.c
 &gt;   4478 mm/slab.c
 &gt;   4125 mm/slub.c
 &gt; 
 &gt; (and while linecount != complexity, there is a strong relationship.)
 &gt; 
 &gt; With SLAB having 10 years more test coverage and tuning.

FWIW, the one thing slub does that slab doesn't that I find really nice
is being enable to enable debugging at boot time rather than compile time.

We don't get many people running benchmarks against the Fedora kernel,
so any scalability differences between slub/slab probably won't reach us
until we start shipping betas of the next RHEL based on the same kernel.

Which leaves my only other gripe.  It broke slabtop.
There's an alternative implementation in Documentation/vm/slabinfo.c
(why there not say, util-linux, home of current slabtop?) 

	Dave

-- 
http://www.codemonkey.org.uk
--
To: Dave Jones <davej@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Cc: Rafael J. Wysocki <rjw@...>
Date: Tuesday, December 11, 2007 - 4:52 am

yes, but that's largely due to "dont change SLAB because we've got SLUB" 
resistence to SLAB patches. It's a 2 minute hack to implement this for 

that's actually a _bad_ ABI regression. Rafael, could you please add 

the kernel should output /proc/slabinfo data with the same formatting, 
period.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Dave Jones <davej@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Tuesday, December 11, 2007 - 3:03 pm

I can remember it being said that would be one of the things to-do
before SLUB could become default.

--
To: Linus Torvalds <torvalds@...>
Cc: Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Saturday, December 8, 2007 - 4:29 pm

just to hammer this point home - and while Christoph Lameter isnt online 
currently to defend SLUB, this issue did come up and i guess there must 
be other MM hackers that are advocates of SLUB. (otherwise this code 
couldnt be upstream)

I find SLUB a fantastically confusing concept, and that starts at the 
name. "Unqueued" it says. Lets take a look at the actual code:

mm/slub.c:

  static void __always_inline *slab_alloc(struct kmem_cache *s,
                  gfp_t gfpflags, int node, void *addr)
  {
  ...
          local_irq_save(flags);
          c = get_cpu_slab(s, smp_processor_id());
  ...
          else {
                  object = c-&gt;freelist;
                  c-&gt;freelist = object[c-&gt;offset];
          }
          local_irq_restore(flags);

so it has a "free list", which is clearly per cpu. Hang on! Isnt that 
actually a per CPU queue? Which SLUB has not, we are told? The "U" in 
SLUB. How on earth can an allocator in 2007 claim to have no queuing
(which is in essence caching)? Am i on crack with this? Did i miss
something really obvious?

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 4:20 am

Hi Ingo,


I think you did. The difference is explained in Christoph's announcement:

"A particular concern was the complex management of the numerous
object queues in SLAB. SLUB has no such queues. Instead we dedicate a
slab for each allocating CPU and use objects from a slab directly
instead of queueing them up."

Which, I think, is where SLUB gets its name from (the "unqueued" part).

Now, while SLAB code is "pleasant and straightforward code" (thanks,
btw) for UMA, it's really hairy for NUMA plus the "alien caches" eat
tons of memory (which is why Christoph wrote SLUB in the first place,
the current code in SLAB is mostly unfixable due to its *queuing*
nature).

I don't object changing the default to CONFIG_SLAB but it's not really
a long term strategy unless we want to have three kmalloc's in the
kernel: one for embedded, one for UMA, and one NUMA.

                        Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 11:59 am

On Sun, 9 Dec 2007 10:20:19 +0200
"Pekka Enberg" &lt;penberg@cs.helsinki.fi&gt; wrote:

.. and they make slab slower on numa systems for database workloads

To be honest, a SLAB without the alien stuff might be one of the best
performers today .... ;)
(on database loads.. where slub is quite a disaster as everyone knows)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--
To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 4:50 am

yes, i understand the initial announcement (and the Kconfig entry still 
says the same), but that is not matched up by the reality i see in the 
actual code - SLUB clearly uses a queue/list of objects (as cited in my 
previous mail), for obvious performance reasons.

unless i'm missing something obvious (and i easily might), i see SLUB as 
SLAB reimplemented with a different queueing model. Not "without 

i'm curious about the real facts behind this "alien cache problem". I 
heard about it and asked around and was told that there's some sort of 
bad quadratic behavior of memory consumption on NUMA - but i cannot 
actually see that in the code.

The alien caches feature of SLAB i see as a spread out clustered 
index/cache of objects on other nodes. It's not increasing the average 
per object memory consumption per se! The number of alien caches 
increases with increasing number of nodes, but _of course_, as memory 
size increases too so there's more stuff and a larger expected spreadout 
of memory to keep track of. ("Fixing" that would be like reintroducing a 
single runqueue for the scheduler, based on the argument that it's an 
O(1) number of runqueues against O(N) number of runqueues - which would 
be complete nonsense.)

so i see SLAB alien caches as an automatic self-partitioning mechanism 
... which has complexities but which also has _obvious_ performance 
benefits. Yes, it has some disadvantages like all caching schemes do - 
there's more cached memory tied up in the allocator at any given moment 
- but arguing against that would be like arguing against a 2MB L2 cache 
purely on the basis that a 1MB L2 cache is smaller and hence more 
space-efficient. Caches are there to cache stuff, and more caches ... 
use more memory. It's all a question of proportion and tuning, but the 
_design_ should be based on having as thorough caching as possible.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Pekka Enberg <penberg@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Thursday, December 13, 2007 - 6:07 pm

The "queue" that you are talking about is the freelist of a slab. It exist 
for each slab. SLAB uses a table of free entries there. The freelist is 
limited to a slab and thus the locality of the freelist is bound to 
the same page which avoids NUMA locality checks.


--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 5:18 am

Hi Ingo,


Sure, the emphasis is on "no *such* queues."


I mostly live in the legacy 32-bit UMA/UP land still so I cannot verify this
myself but the kind folks at SGI claim the following (again from the
announcement):

"On our systems with 1k nodes / processors we have several gigabytes
 just tied up for storing references to objects for those queues  This does
 not include the objects that could be on those queues. One fears that the
 whole memory of the machine could one day be consumed by those
 queues."

The problem is that for each cache, you have an "per-node alien queues"
for each node (see struct kmem_cache nodelists -&gt; struct kmem_list3 alien).
Moving slab metadata to struct page solves this but now you can only have
one "queue" thats part of the same struct.

                        Pekka
--
To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 7:51 am

Yes, you can find gigs tied up on systems that have 100 GB of RAM, or 
you can have gigs tied up if you over-size your caches. I'd like to see 

yes, it's what i referred to as "distributed, per node cache". It has no 
"quadratic overhead". It has SLAB memory spread out amongst nodes. I.e. 
1 million pages are distributed amongst 1k nodes with 1000 pages per 
node with each node having 1 page.

But that memory is not lost and it's disingenous to call it 'overhead' 
and it very much comes handy for performance _IF_ there's global 
workload that involves cross-node allocations. It's simply a cache that 
is mis-sized and mis-constructed on large node count systems but i bet 
it makes quite a performance difference on low-node-count systems.

On high node-count systems it might make sense to reduce the amount of 
cross-node caching and to _structure_ the distributed NUMA SLAB cache in 
an intelligent way (perhaps along cpuset boundaries) - but a total, 
design level _elimination_ of this caching concept, using very 
misleading arguments, just looks stupid to me ...

	Ingo
--
To: Pekka Enberg <penberg@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Matt Mackall <mpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Christoph Lameter <clameter@...>
Date: Sunday, December 9, 2007 - 8:34 am

i think much of this could be achieved by delaying the creation of alien 
caches up until the point a { node X } -&gt; { node Y } alloc/free 
relationship gets established. So if a system is partitioned along 
cpusets, vast areas of the NxN matrix never gets populated with alien 
caches.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Matt Mackall <mpm@...>
Date: Saturday, December 8, 2007 - 6:11 am

err, umm, OK, I'll go hunt it down and take a closer look at it.
--
To: Rafael J. Wysocki <rjw@...>
Cc: LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Tejun Heo <htejun@...>
Date: Saturday, December 8, 2007 - 5:52 am

Nasty one.  Tejun and several diligent reporters are doing sterling work
there and things have improved.  I don't know whether any of Tejun's
patches have been merged yet, but we'll probably be OK on this one.

What is unclear (to me) is what actually caused those people's machines to
break?

--
To: Andrew Morton <akpm@...>
Cc: Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Alan Cox <alan@...>
Date: Sunday, December 9, 2007 - 3:00 am

Hello, (cc'ing Alan)


I'm still trying to find out what's really going on.  That drive is

It's introduced by setting ATAPI transfer chunk size to actual
transfer size which is the right thing to do generally.  However, with
the change, the ATAPI HSM should be ready to drain full extra transfer
chunks which libata HSM wasn't doing.  With that part changed, most
regressions should go away.

Unfortunately, simply adding that doesn't fix the case in bug 9346 and
I'm still trying to find out why.  The good news is that the drive
works fine with proposed more extensive improvements to libata ATAPI
which will probably be included into 2.6.25, so we at least have long
term solution.

If we fail to find out the solution in time, we always have the
alternative of backing out the ATAPI transfer chunk size update.  This
will break some other cases which were fixed by the change but those
won't be regressions at least and we can add transfer chunk size
update with other changes to 2.6.25.

Thanks.

-- 
tejun
--
To: Tejun Heo <htejun@...>
Cc: Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 9:42 am

Which will break far more controllers and drives than it fixes, so

Great, make everyone else wait another three months for a working CD
drive. The one off regression appears far less harmful than a revert.

Tejun - instead of backing out important updates for 2.6.24 we should
just blacklist that specific drive for now and sort it nicely in 2.6.25,
not revert stuff and break everyone elses ATAPI devices.

Alan
--
To: Alan Cox <alan@...>
Cc: Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 2:41 pm

Btw, Alan, that "math" is total and utter BULLSH*T, and you should know 
that.

"The one off regression" is likely the tip of an iceberg. If something 
regresses for one person, for that one person who tested and noticed and 
made a bug-report, there's probably a thousand people who haven't even 
tested the development kernel, or who had problems and just went back to 
the previous version.

In contrast, reverting something will be guaranteed to not have those 
kinds of issues, since the only people who could notice are people for who 
it never worked in the first place. There's no "silent mass of people" 
that can be affected.

This is why regressions are so important. They don't trump _everything_, 
but basically ignoring and letting them slide is *much* more painful than 
just reverting it.

The biggest reason to ignore a regression is if nobody can even figure 
out where it came from, or reverting simply isn't an option for some 
really deep and fundamental issue. That doesn't seem to be the case here. 

So we should revert unless there is some known acceptable real fix.

		Linus

--
To: Linus Torvalds <torvalds@...>
Cc: Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 6:01 pm

The one off regression is probably not one off, but this is IDE so
actually its quite probable its a single broken firmware. 

The alternative is that you cripple just about every user of various
other standards compliant devices and controllers whose hardware we
finally fixed.

Finally you need to remember that the 'regression' is caused by the fact
we now do the _right_ thing both in terms of 'old IDE' and specs.

Believe it or not I did actually think in quite some detail about this
case, and the relative probabilities, and go back and re-review the old
IDE code (whose behaviour we now follow) and the spec. I spend a
measurable amount of my time reviewing code and weighing risks,
regressions and progress for an enterprise Linux vendor, so its something
I do every day of the week.

To blindly argue regressions are critical is sometimes (as in this case)
to argue that "this freeway is no longer compatible with a horse and
cart" means the freeway should be turned back into a dirt road.

The horse and cart happened to work by chance because the road was quiet
that day. We clearly need to add a horse &amp; cart lane in the long term,
but for 2.6.24 it may well be the right thing to do just to blacklist
that specific drive back to old behaviour until we can tidy it more
nicely.

Alan
--
To: Alan Cox <alan@...>
Cc: Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 9:57 pm

Alan, you're so full of shit that it's not even funny.

Have you even *read* the thread?

Tejun already reported that this apparently gets fixed _properly_ with the 
more extensive cleanups and fixes that are pending for 2.6.25.

In other words, the stuff you call so critically important (yet we've been 
able to live without it until now!) is apparently simply NOT YET READY. 
It's breaking things.

In this case, Tejun seems to be right on the money.  I also agree 100% 
with him when he says

   "Blacklist takes time to develop and temporary blacklist for just one
    release doesn't sound like a good idea."

because if we create some blacklist for that one reported device, not only 
is it likely going to be wrong (it's almost never just one firmware or one 
chip that has a particular issue), but we tend to create thee blacklists 
and later realize that we shouldn't have blacklisted things at all, we 
should just have done things differently.

For examples of that, see the NCQ blacklist that was just _us_ doing 
things wrong (over-reacting to things we shouldn't care about), and 
there's currently another totally unrelated discussion on a very similar 
thing wrt libata and the ACPI startup commands for an unused controller 

.. and what the hell does that matter? If the code doesn't work, it 
doesn't work, and you might as well point to some random scribblings done 
by a three-year-old on toilet paper rather than any "specs".

Real life matters more. Regressions matter more.

We apparently do have a full fix, but it seems to be too invasive for 
2.6.24, which means that the thing that currently DOES NOT WORK and 
causes regressions should be reverted, so that 2.6.24 is at least no worse 
than 2.6.23 (and all earlier kernels) in this respect.

And then we should just hope that the more complete fix that Tejun has 
doesn't cause any issues on its own. I would suggest that if you care so 
deeply about this issue, you press Fedora into putting Tejun's tree into 
Fedora t...
To: Linus Torvalds <torvalds@...>
Cc: Alan Cox <alan@...>, Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Monday, December 10, 2007 - 4:21 am

btw., how extensive are those cleanups and fixes in reality, is there a 
rollup somewhere one could take a look at? Those fixes and cleanups were 
deferred to v2.6.25 in the knowledge of having the current code included 
in v2.6.24 - but now that the current approach seems to regress, maybe 
those cleanups are still safe enough. (compared to an outright revert)

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Alan Cox <alan@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Monday, December 10, 2007 - 4:27 am

The following git tree contains patches pending review for 2.6.25.

http://git.kernel.org/?p=linux/kernel/git/tj/libata-dev.git;a=shortlog;h=improve-ATAPI...

And we're getting close to fixing the regression.  I don't think there's
too much worry about this one.  Just need a bit more time to test few
more things.

Thanks.

-- 
tejun
--
To: Tejun Heo <htejun@...>
Cc: Linus Torvalds <torvalds@...>, Alan Cox <alan@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>
Date: Monday, December 10, 2007 - 4:41 am

ah, i see, the joys of the kernel running BIOS written code (AML):

  http://bugzilla.kernel.org/attachment.cgi?id=13932&amp;action=view

cute!

	Ingo
--
To: Linus Torvalds <torvalds@...>
Cc: Tejun Heo <htejun@...>, Andrew Morton <akpm@...>, Rafael J. Wysocki <rjw@...>, LKML <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Sunday, December 9, 2007 - 11:38 pm

&gt; Have you even *read* the thread?

In detail, as it unfolds and while testing variants of Tejun's code on
the hardware I have access to - none of which has this bug making it

And as I keep pointing out but you keep ignoring - not doing it breaks

The code without the changes doesn't work either. So pick your toilet

Which as the distro bug lists for ATAPI will tell you - aint good. Still

Linus, the kernel regresses all over the place every release. If it
didn't do that you'd never get any changes in. Your kernel would
fossilize like RHEL or SLES and you'd be spending weeks analysing each
changeset for possible side effects, or - as happens by neccessity -
adding code paths so a fix vital to one driver ceases to share core code
with another driver - to reduce regression risk. Been there, done that

Have fun. I trust you'll be fixing the other 11 I think it was listed
regressions before 2.6.24  - or backing out every changeset that could be
responsible ?

No I thought not - because that wouldn't be sensible either.

Alan
--