[Bug #11220] Screen stays black after resume

Previous thread: Re: HPET regression in 2.6.26 versus 2.6.25 -- found another user with the same regression by David Witbrodt on Saturday, August 23, 2008 - 12:29 pm. (1 message)

Next thread: Re: HPET regression in 2.6.26 versus 2.6.25 -- found another user with the same regression by David Witbrodt on Saturday, August 23, 2008 - 12:47 pm. (3 messages)
From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:07 am

This message contains a list of some regressions from 2.6.26, for which there
are no fixes in the mainline I know of.  If any of them have been fixed already,
please let me know.

If you know of any other unresolved regressions from 2.6.26, please let me know
either and I'll add them to the list.  Also, please let me know if any of the
entries below are invalid.

Each entry from the list will be sent additionally in an automatic reply to
this message with CCs to the people involved in reporting and handling the
issue.


Listed regressions statistics:

  Date          Total  Pending  Unresolved
  ----------------------------------------
  2008-08-23      122       48          40
  2008-08-16      103       47          37
  2008-08-10       80       52          31
  2008-08-02       47       31          20


Unresolved regressions
----------------------

Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11414
Subject		: Random crashes with 2.6.27-rc3 on PPC
Submitter	: Michael Buesch <mb@bu3sch.de>
Date		: 2008-08-23 14:10 (1 days old)
References	: http://marc.info/?l=linux-kernel&m=121950076812616&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11410
Subject		: SLUB list_lock vs obj_hash.lock...
Submitter	: Daniel J Blueman <daniel.blueman@gmail.com>
Date		: 2008-08-22 21:48 (2 days old)
References	: http://marc.info/?l=linux-kernel&m=121944176609042&w=4


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11407
Subject		: suspend: unable to handle kernel paging request
Submitter	: Vegard Nossum <vegard.nossum@gmail.com>
Date		: 2008-08-21 17:28 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=121933974928881&w=4
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>
		  Pekka Enberg <penberg@cs.helsinki.fi>
		  Pavel Machek <pavel@suse.cz>


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11406
Subject		: patch "x86: MOVE PCI IO ECS code to x86/pci" breaks CPU hotplug
Submitter	: Jan Beulich <jbeulich@novell.com>
Date		: ...
From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:07 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11141
Subject		: no battery or DC status - Dell i1501
Submitter	: Gu Rui <chaos.proton@gmail.com>
Date		: 2008-07-21 19:43 (34 days old)
Handled-By	: Zhao Yakui <yakui.zhao@intel.com>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11207
Subject		: VolanoMark regression with 2.6.27-rc1
Submitter	: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
Date		: 2008-07-31 3:20 (24 days old)
References	: http://marc.info/?l=linux-kernel&m=121747464114335&w=4
Handled-By	: Zhang, Yanmin <yanmin_zhang@linux.intel.com>
		  Peter Zijlstra <a.p.zijlstra@chello.nl>
		  Dhaval Giani <dhaval@linux.vnet.ibm.com>
		  Miao Xie <miaox@cn.fujitsu.com>
Patch		: http://marc.info/?l=linux-kernel&m=121922991027344&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11191
Subject		: 2.6.26-git8: spinlock lockup in c1e_idle()
Submitter	: Mikhail Kshevetskiy <mikhail.kshevetskiy@gmail.com>
Date		: 2008-07-24 03:22 (31 days old)
References	: http://lkml.org/lkml/2008/7/23/317
Handled-By	: Thomas Gleixner <tglx@linutronix.de>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11209
Subject		: 2.6.27-rc1 process time accounting
Submitter	: Lukas Hejtmanek <xhejtman@ics.muni.cz>
Date		: 2008-07-31 10:43 (24 days old)
References	: http://marc.info/?l=linux-kernel&m=121750102917490&w=4
Handled-By	: Peter Zijlstra <a.p.zijlstra@chello.nl>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11210
Subject		: libata badness
Submitter	: Kumar Gala <galak@kernel.crashing.org>
Date		: 2008-07-31 18:53 (24 days old)
References	: http://marc.info/?l=linux-ide&m=121753059307310&w=4
Handled-By	: Ben Dooks <ben-linux@fluff.org>


--

From: Jeff Garzik
Date: Saturday, August 23, 2008 - 3:23 pm

From: Rafael J. Wysocki
Date: Sunday, August 24, 2008 - 2:04 pm

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11215
Subject		: INFO: possible recursive locking detected ps2_command
Submitter	: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date		: 2008-07-31 9:41 (24 days old)
References	: http://marc.info/?l=linux-kernel&m=121749737011637&w=4
Handled-By	: Peter Zijlstra <a.p.zijlstra@chello.nl>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11219
Subject		: KVM modules break emergency reboot
Submitter	: Zdenek Kabelac <zdenek.kabelac@gmail.com>
Date		: 2008-08-01 20:25 (23 days old)
References	: http://marc.info/?l=linux-kernel&m=121762241105336&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11220
Subject		: Screen stays black after resume
Submitter	: Nico Schottelius <nico@schottelius.org>
Date		: 2008-07-31 21:05 (24 days old)
References	: http://marc.info/?l=linux-kernel&m=121753882422899&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11230
Subject		: Kconfig no longer outputs a .config with freshly updated defconfigs
Submitter	: Josh Boyer <jwboyer@linux.vnet.ibm.com>
Date		: 2008-08-02 16:03 (22 days old)
References	: http://marc.info/?l=linux-kernel&m=121769306319391&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11254
Subject		: KVM: fix userspace ABI breakage
Submitter	: Adrian Bunk <bunk@kernel.org>
Date		: 21 Jul 2008 17:58:26 (0 days old)
References	: http://lkml.org/lkml/2008/7/21/197
Handled-By	: Adrian Bunk <bunk@kernel.org>
Patch		: http://lkml.org/lkml/2008/7/21/197


--

From: Adrian Bunk
Date: Sunday, August 24, 2008 - 12:27 pm

The discussion in Bugzilla whether it is a regression at all can be 
condensed to the following question:

Can a struct that is part of the 2.6.26 userspace headers be defined to 
be part of an "experimental ABI" and therefore be changed?

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Avi Kivity
Date: Monday, August 25, 2008 - 3:23 am

It is part of the experimental ABI.

However, as I'm going to apply your patch (as being the simplest fix, 
and as there is no measurable performance impact), the question is moot.

-- 
error compiling committee.c: too many arguments to function

--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11224
Subject		: Only three cores found on quad-core machine.
Submitter	: Dave Jones <davej@redhat.com>
Date		: 2008-08-01 18:15 (23 days old)
References	: http://marc.info/?l=linux-kernel&m=121761475224719&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11264
Subject		: Invalid op opcode in kernel/workqueue
Submitter	: Jean-Luc Coulon <jean.luc.coulon@gmail.com>
Date		: 2008-08-07 04:18 (17 days old)


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11271
Subject		: BUG: fealnx in 2.6.27-rc1
Submitter	: Jaswinder Singh <jaswinderlinux@gmail.com>
Date		: 2008-08-05 14:58 (19 days old)
References	: http://marc.info/?l=linux-netdev&m=121794762016830&w=4
		  http://lkml.org/lkml/2008/8/10/98
Handled-By	: Francois Romieu <romieu@fr.zoreil.com>


--

From: Jeff Garzik
Date: Saturday, August 23, 2008 - 3:26 pm

Jaswinder, does reverting 28cd4289abc2c8db90344ee4ff064a9bdf086fdf help?

That's the only material change to fealnx itself in years.

If not, any chance you could bisect this problem, and add more info to 
the bug?

--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11272
Subject		: BUG: parport_serial in 2.6.27-rc1 for NetMos Technology PCI 9835
Submitter	: Jaswinder Singh <jaswinderlinux@gmail.com>
Date		: 2008-08-05 15:12 (19 days old)
References	: http://marc.info/?l=linux-kernel&m=121794900319776&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11237
Subject		: corrupt PMD after resume
Submitter	: Alan Jenkins <alan-jenkins@tuffmail.co.uk>
Date		: 2008-08-02 9:51 (22 days old)
References	: http://marc.info/?l=linux-kernel&m=121767073424952&w=4
Handled-By	: Hugh Dickins <hugh@veritas.com>
		  Jeremy Fitzhardinge <jeremy@goop.org>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11276
Subject		: build error: CONFIG_OPTIMIZE_INLINING=y causes gcc 4.2 to do stupid things
Submitter	: Randy Dunlap <randy.dunlap@oracle.com>
Date		: 2008-08-06 17:18 (18 days old)
References	: http://marc.info/?l=linux-kernel&m=121804329014332&w=4
		  http://lkml.org/lkml/2008/7/22/353
Handled-By	: Bjorn Helgaas <bjorn.helgaas@hp.com>
Patch		: http://lkml.org/lkml/2008/7/22/364


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11279
Subject		: 2.6.27-rc0 Power Bugs with HP/Compaq Laptops
Submitter	: Matt Parnell <mparnell@gmail.com>
Date		: 2008-08-07 14:57 (17 days old)
References	: http://marc.info/?l=linux-kernel&m=121812108031685&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11282
Subject		: Please fix x86 defconfig regression
Submitter	: Andi Kleen <andi@firstfloor.org>
Date		: 2008-08-07 20:46 (17 days old)
References	: http://marc.info/?l=linux-kernel&m=121814188805666&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11308
Subject		: tbench regression on each kernel release from 2.6.22 -> 2.6.28
Submitter	: Christoph Lameter <cl@linux-foundation.org>
Date		: 2008-08-11 18:36 (13 days old)
References	: http://marc.info/?l=linux-kernel&m=121847986119495&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11335
Subject		: 2.6.27-rc2-git5 BUG: unable to handle kernel paging request
Submitter	: Randy Dunlap <randy.dunlap@oracle.com>
Date		: 2008-08-12 4:18 (12 days old)
References	: http://marc.info/?l=linux-kernel&m=121851477201960&w=4
		  http://lkml.org/lkml/2008/8/16/274
Handled-By	: Hugh Dickins <hugh@veritas.com>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11336
Subject		: 2.6.27-rc2:stall while mounting root fs
Submitter	: Torsten Kaiser <just.for.lkml@googlemail.com>
Date		: 2008-08-12 12:37 (12 days old)
References	: http://marc.info/?l=linux-kernel&m=121854484015909&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11334
Subject		: myri10ge: use ioremap_wc: compilation failure on ARM
Submitter	: Martin Michlmayr <tbm@cyrius.com>
Date		: 2008-08-10 11:25 (14 days old)
References	: http://marc.info/?l=linux-netdev&m=121836771727632&w=2


--

From: Martin Michlmayr
Date: Sunday, August 24, 2008 - 5:26 am

Yes, this is still there.
-- 
Martin Michlmayr
http://www.cyrius.com/
--

From: Rafael J. Wysocki
Date: Sunday, August 24, 2008 - 2:05 pm

Thanks for the update.

Rafael

--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11343
Subject		: SATA Cold Boot Problems with 2.6.27-rc[23] on nVidia 680i
Submitter	: Manny Maxwell <mannymax@mannymax.net>
Date		: 2008-08-14 4:16 (10 days old)
References	: http://marc.info/?l=linux-kernel&m=121868782917600&w=4


--

From: Jeff Garzik
Date: Saturday, August 23, 2008 - 3:34 pm

hmmmm.  Looking at changes between the two csets listed in the email 
(623fa57..8f616cd), all of them are driver-specific and unrelated to 
Manny's hardware except for

	commit 2486fa561a3192bbbec39c7feef87a1e07bd6342
	Author: Tejun Heo <tj@kernel.org>
	Date:   Thu Jul 31 07:52:40 2008 +0900

	    libata: update atapi disable handling

So you could try to revert that and see what happens.  But given that 
small range of changes, it really seems like something else, maybe in 
the PCI subsystem (random guess).

Looking at the entire kernel, nothing jumps out, either.  Its mostly fs 
updates (ext4, xfs), a networking update, an ARM update, and a libata 
update.

Also, some reset-related fixes just went in, so re-testing the latest 
-git would be helpful as well.

	Jeff



--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11340
Subject		: LTP overnight run resulted in unusable box
Submitter	: Alexey Dobriyan <adobriyan@gmail.com>
Date		: 2008-08-13 9:24 (11 days old)
References	: http://marc.info/?l=linux-kernel&m=121861951902949&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11342
Subject		: Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected
Submitter	: Alan D. Brunelle <Alan.Brunelle@hp.com>
Date		: 2008-08-13 23:03 (11 days old)
References	: http://marc.info/?l=linux-kernel&m=121866876027629&w=4
Handled-By	: Andrew Morton <akpm@linux-foundation.org>


--

From: Linus Torvalds
Date: Saturday, August 23, 2008 - 1:10 pm

This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but 
then the call chain shows that there is no interrupt going on.

Also, the bisection is senseless - there's a trivial change wrt 
"do_one_initcall()" that got merged, but everything else is trivial about 
lguest and has nothing to do with the whole CPU-init thing. But if it was 
that initcall one, then "git bisect" woul have pointed to it, not the 
merge. And the merge itself had no conflicts or anything else going on..

The fact that it came and went later also implies that it's probably just 
some timing-dependent thing or some subtle memory corruption, making the 
bisection result even less likely to be exact.

But I'm adding Arjan and Rusty to the Cc, because that merge was takign 
Rusty's branch, and the "do_one_initcall()" is Arjan's commit. Since 
undoing that merge apparently does fix it, I'm wondering if something 
there just does end up triggering the problem.

The do_one_commit() thing _is_ in the path of sys_init_module(), so it 
_is_ at least somewhat relevant from an oops standpoint. 

One thing the "do_one_commit()" thing does is to put more pressure on the 
stack due to that whole buffer for the printk's going on.

Alan, can you try 
 - seeing how consistent it is with one kernel (ie boot a known-bad kernel 
   a few times just to see if it really is 100% consistent)
 - try enabling 'initcall_debug' on the kernel command line, to (a) see 
   the new code actually do something and (b) see what it is actually 
   calling just before.

Hmm..

			Linus
--

From: Arjan van de Ven
Date: Saturday, August 23, 2008 - 1:15 pm

but it's 64 bit.. with 8Kb stack and separate irq stacks. I'd be surprised if we blow that this easily.
the trace is a tad long with a long ACPI call chain.

Wonder what gcc is in use?
(newer ones tend to be a ton better... but maybe Alex is using a really old one)
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 5:07 am

I'm running Ubuntu 8.04 w/ gcc:

gcc (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7)

Alan
--

From: Linus Torvalds
Date: Saturday, August 23, 2008 - 1:17 pm

Ahh, later in that thread there's another totally unrelated oops in 
debug_mutex_add_waiter().

I'd guess that it is really wild pointer corrupting memory, quite possibly 
due to a double free or something like that. Alan - it would be good to 
run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do 
that?

		Linus
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 5:03 am

I'll add those in - as to the repeatability: The "bad" kernels seem to
repeat quite reliably - not only in terms of counts (5 or 6 times in a
row before trying something else), but also in terms of the "what" -
either the original issue () or the other kernel with the later issue
(debug_mutex_add_waiter). That's /goodness/ in that it should help
narrow it down.

I'll make sure the kernel is still failing this morning, and then add in
DEBUG_PAGE_ALLOC and if that doesn't help, SLUB debugging...

Alan
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 5:22 am

Before adding any more debugging, this is the status of my kernel boots:
3 times in a row w/ this same error. (Primary problem is the same,
secondary stacks differ of course.)

Alan
From: Linus Torvalds
Date: Monday, August 25, 2008 - 11:00 am

Uhhuh! The previous "modprobe" uses stack like mad.  It could be 
"fuse_init()" that has done it, but looking at fuse, I seriously doubt it. 
It doesn't seem to do anything particularly bad.

So something has used over 6kB of stack, and it may well be the module 
loading code itself.


This really looks like

	ti->task->blocked_on = waiter;

where "ti->task" is NULL. You probably have almost everything enabled in 
order to turn "struct task_struct" that big, but judging by your register 
state it's really an offset off a NULL pointer, not some small integer.

Now, there is no way "ti->task" can _possibly_ be NULL. No way.

Well, except that "ti" is just below the stack, and if you had a stack 
overflow that overwrote it.

So I seriously do believe that you have run out of stack. If that is true, 
then it's quite likely that with DEBUG_PAGE_ALLOC you'll actually get a 
double fault, which in turn is fairly hard to debug (you look at it wrong 
and it turns into a triple fault which is going to just reboot your 
machine immediately).

Now, the stack oveflow probably happened a few calls earlier (and just 
left your thread_info corrupted), but there is more reason to believe you 

Here there is only 408 bytes left, which is _way_ too little, but it's 
also an optimistic measure. What the stack code usage code does is to just 
see how many zeroes it can find on the stack. If you have a big stack 
frame somewhere, it's quite possible that it actually used all your stack 
and then some, but left a bunch of zeroes around.

And the do_exit() oops is simply because once the thread_info is 
corrupted, all the basic thread data structures are crap, and yes, you're 
almost guaranteed to oops at that point.

Could you make your kernel image available somewhere, and we can take a 
look at it? Some versions of gcc are total pigs when it comes to stack 
usage, and your exact configuration matters too.  But yes, module loading 
is a bad case, for me "sys_init_module()" ...
From: Linus Torvalds
Date: Monday, August 25, 2008 - 11:09 am

I bet this one-liner will probably make your kernel work. It's not a full 
solution, but it will make the module-loading path lose _all_ of the above 
stack slots by just not inlining "load_module()" - the stack slots will 
still be used when the module is _loaded_, but by the time we actually 
callt he ->init function they will have been released since it's not all 
in the same crazy function any more.

I _seriously_ believe that we were better off back when gcc only inlined 
what we told it to inline, and never inlined on its own. The gcc inlining 
logic is pure and utter sh*t in an environment like the kernel where stack 
space is a valuable resource.

Anyway, Alan, even if this solves your particular problem, I'd still like 
to see your kernel image, so that I can hunt for other problems like 
this..

			Linus

---
 kernel/module.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/kernel/module.c b/kernel/module.c
index 08864d2..9db1191 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1799,7 +1799,7 @@ static void *module_alloc_update_bounds(unsigned long size)
 
 /* Allocate and load the module: note that size of section 0 is always
    zero, and we rely on this for optional sections. */
-static struct module *load_module(void __user *umod,
+static noinline struct module *load_module(void __user *umod,
 				  unsigned long len,
 				  const char __user *uargs)
 {
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 1:19 pm

Mine has:

Dump of assembler code for function sys_init_module:
0xffffffff802688c0 <sys_init_module+0>:	push   %rbp
0xffffffff802688c1 <sys_init_module+1>:	mov    %rsp,%rbp
0xffffffff802688c4 <sys_init_module+4>:	sub    $0x1c0,%rsp
0xffffffff802688cb <sys_init_module+11>:	mov    %r12,-0x20(%rbp)
0xffffffff802688cf <sys_init_module+15>:	mov    %rdi,%r12

so 448 bytes.

The kernel is up at: http://free.linux.hp.com/~adb/bug.11342/vmlinux (if
you would let me know when you are through with it so I can free up some
space there I'd appreciate it...)

By doing the patch you provided, sys_init_module now looks like:

Dump of assembler code for function sys_init_module:
0xffffffff8026aa20 <sys_init_module+0>:	push   %rbp
0xffffffff8026aa21 <sys_init_module+1>:	mov    %rsp,%rbp
0xffffffff8026aa24 <sys_init_module+4>:	sub    $0x20,%rsp
0xffffffff8026aa28 <sys_init_module+8>:	mov    %r14,0x18(%rsp)
0xffffffff8026aa2d <sys_init_module+13>:	mov    %rdi,%r14


So only 32 bytes. (But of course, load_module() exists, and now has
0x1d0 (464) bytes...)

With the patch you provide, I /was/ able to repeatedly boot OK (latest
tree, and I also ran the patch against the 26.27.rc3-based kernel I was
having problems with initially, and that booted OK as well).


--

From: Linus Torvalds
Date: Monday, August 25, 2008 - 1:43 pm

Yeah, your build seems to have consistently bigger stack usage, and that 
may be due to some config option, but most likely it's a compiler version 
issue.

But I think part of the reason is that you have frame pointers enabled: 
that makes the stack frames bigger not only because of the frame pointer 
save/restore, but also because you have more register pressure and thus 

I'm downloading it now, I'll probably be done by the time you get this 
email.


Right - the stack usage didn't go away, but the _lifetimes_ changed.

So now load_module() will still use almost 500 bytes of stack, and it will 
call other routines that use stack too, but the lifetime of that stack 
usage is no longer over the whole module loading and initialization part, 
it's purely over just the loading thing.

And since the deep callchain came much later (in the actual ->init 
routines), by the time we do that, we no longer now have the load_module 

I had actually already committed it, because it was correct regardless 
(and gcc really is a total ass for doing that inlining to begin with), but 
it's good to have verification that the behaviour you saw was literally 
about this thing.

I'll look at your vmlinux binary to see what else sucks from a stack depth 
standpoint, but one of the problems in this whole thing is that the 
stack usage is obviously both a static thing (with some functions using 
_way_ too much stack!) _and_ a dynamic thing (with the total stack use 
being not about any individual function, but the whole chain).

My patch obviously doesn't change the static stack usage, it just moves it 
around a bit so that it's no longer on that same deep path, so the dynamic 
stack usage is much less.

But I'll look at your vmlinux, see what stands out.

		Linus
--

From: Arjan van de Ven
Date: Monday, August 25, 2008 - 1:45 pm

I wonder if we ought to have a light version of "make checkstack" always run,
but in such a way that we make a file with "limits" on the stack usage for key
functions (and we can grow this list over time when we learn about critical ones)..
and either warn very loudly or even fail the build if we're way over what could work.

--

From: Linus Torvalds
Date: Monday, August 25, 2008 - 1:52 pm

Oops. I already see the problem.

Your .config has soem _huge_ CPU count, doesn't it?

checkstack.pl shows these things as the top problems:

	0xffffffff80266234 smp_call_function_mask [vmlinux]:    2736
	0xffffffff80234747 __build_sched_domains [vmlinux]:     2232
	0xffffffff8023523f __build_sched_domains [vmlinux]:     2232
	0xffffffff8021e884 setup_IO_APIC_irq [vmlinux]:         1616
	0xffffffff8021ee24 arch_setup_ht_irq [vmlinux]:         1600
	0xffffffff8021f144 arch_setup_msi_irq [vmlinux]:        1600
	0xffffffff8021e3b0 __assign_irq_vector [vmlinux]:       1592
	0xffffffff8021e626 __assign_irq_vector [vmlinux]:       1592
	0xffffffff8023257e move_task_off_dead_cpu [vmlinux]:    1592
	0xffffffff802326e8 move_task_off_dead_cpu [vmlinux]:    1592
	0xffffffff8025dbc5 tick_handle_oneshot_broadcast [vmlinux]:1544
	0xffffffff8025dcb4 tick_handle_oneshot_broadcast [vmlinux]:1544
	0xffffffff803f3dc4 store_scaling_governor [vmlinux]:    1376
	0xffffffff80279ef4 cpuset_write_resmask [vmlinux]:      1360
	0xffffffff803f465d cpufreq_add_dev [vmlinux]:           1352
	0xffffffff803f495b cpufreq_add_dev [vmlinux]:           1352
	0xffffffff803f3fc4 store_scaling_max_freq [vmlinux]:    1328
	0xffffffff803f4064 store_scaling_min_freq [vmlinux]:    1328
	0xffffffff803f44c4 cpufreq_update_policy [vmlinux]:     1328
	..

and sys_init_module is actually way way down the list. I bet the only 
reason it showed up at all was because dynamically it was such a deep 
callchain, and part of that callchain probably called some of those really 
nasty things.

Anyway, the reason smp_call_function_mask and friends have such _huge_ 
stack usages for you is that they contain a 'cpumask_t' on the stack.

For example, for me, usign a sane NR_CPU, the size of the stack frame for 
smp_call_function_mask is under 200 bytes.  For you, it's 2736 bytes.

How about you make CONFIG_NR_CPU's something _sane_? Like 16? Or do you 
really have four thousand CPU's in that system?

Oh, I guess you have ...
From: Linus Torvalds
Date: Monday, August 25, 2008 - 2:15 pm

In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in 
size. And they tend to call each other.

Quite frankly, I don't think we were really ready for 4k CPU's. I'm going 
to commit this patch to make sure others don't do that many CPU's by 
mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and 
also limits the number of CPU's that you _can_ select to "just" 512.

Right now, 4k cpu's is known broken because of the stack usage. I'm not 
willing to debug more of these kinds of stack smashers, they're really 
nasty to work with. I wonder how many other random failures these have 
been involved with?

This patch also makes the ifdef mess in Kconfig much cleaner and avoids 
duplicate definitions by just conditionally suppressing the question and 
giving higher defaults. 

We can enable MAXSMP and raise the CPU limits some time in the future. But 
that future is not going to be before 2.6.27 - the code simply isn't ready 
for it.

The reason I picked 512 CPU's as the limit is that we _used_ to limit 
things to 255. So it's higher than it used to be, but low enough to still 
feel safe. Considering that a 4k-bit CPU mask (512 bytes) _almost_ worked, 
the 512-bit (64 bytes) masks are almost certainly fine.

Still, sane people should limit their NR_CPUS to 8 or 16 or something like 
that. Very very few people really need the pain of big NR_CPUS. Not even 
"just" 512 CPU's.

Travis, Ingo and Thomas cc'd, since they were involved in the original 
commit (1184dc2ffe2c8fb9afb766d870850f2c3165ef25) that raised the limit.

		Linus

---
 arch/x86/Kconfig |   30 ++++++++----------------------
 1 files changed, 8 insertions(+), 22 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 68d91c8..ed92864 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -577,35 +577,29 @@ config SWIOTLB
 
 config IOMMU_HELPER
 	def_bool (CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU)
+
 config MAXSMP
 	bool "Configure Maximum ...
From: Ingo Molnar
Date: Tuesday, August 26, 2008 - 12:22 am

yeah, that's OK i guess - distros can still enable 4K support if they 
wish to. Someone interested in improving the stack footprint situation 
should dust off the max-stack-footprint tracer so that we can catch 
these things in a more structured way.

And i guess the next generation of 4K CPUs support should just get away 
from cpumask_t-on-kernel-stack model altogether, as the current model is 
not maintainable. We tried the on-kernel-stack variant, and it really 
does not work reliably. We can fix this in v2.6.28.

	Ingo
--

From: David Miller
Date: Tuesday, August 26, 2008 - 12:46 am

From: Ingo Molnar <mingo@elte.hu>

I recenetly did some work on sparc64 to use cpumask pointers
as much as possible.

The only case that didn't work was due to a limitation in
arch interfaces for the new generic smp_call_function() code.
It passes a cpumask_t instead of a pointer to one via
arch_send_call_function_ipi().

But other than that, the whole sparc64 SMP stuff uses cpumask_t
pointers only.

What it comes down to is that you have to do the "self cpu"
and other tests in the cross-call dispatch routines themselves,
instead of at the top-level working on cpumask_t objects.

Otherwise you have to modify cpumask_t objects and thus pluck
them onto the stack where they take up silly amounts of space.
--

From: Ingo Molnar
Date: Tuesday, August 26, 2008 - 12:53 am

What we did was this: we added MAXSMP which just revs up all the SMP 
tunables to the maximum, so that we can see any problems early in 
testing.

And we triggered problems, and we fixed a couple of regressions all 
around stack footprint. But we didnt catch all of them - some were gcc 
version dependent and configuration dependent. So i think it's safe to 
say that the whole concept of allowing such a large cpumask_t to be on 
the stack is fragile.

Hence, i think the best way forward is to change the whole cpumask_t 
concept and disallow explicit masks altogether. It's so easy to smack a 
cpumask_t variable on the stack and nothing really warns about it, and 
any function can become part of a nested call sequence.

So i think the dynamics of it has to be changed: we need a get/put API 
and we need to make on-stack cpumask illegal on the build level (in 
generic code at least). This has been Rusty's main argument early on i 
think, and i now concur.

	Ingo
--

From: Yinghai Lu
Date: Tuesday, August 26, 2008 - 1:36 am

wonder if could use "unsigned long *" directly.
so could dyn_array directly like

int cpumask_size;

unsigned long *online_cpu_map;
DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size,
PAGE_SIZE, NULL);

and after nr_cpu_ids is assigned, have
cpumask_size = (nr_cpu_ids + sizeof(unsigned long) - 1)/sizeof(unsigned long);

then we could NR_CPUS=4096 kernel to small system. ...

YH
--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 9:51 am

I would actually suggest something like this:

 - we continue to have a magic "cpumask_t".

 - we do different cases for big and small NR_CPUS:

	#if NR_CPUS <= BITS_PER_LONG

	/*
	 * Make it an array - that way passing it as an argument will
	 * always pass it as a pointer!
	 */
	typedef unsigned long cpumask_t[1];

	static inline void create_cpumask(cpumask_t *p)
	{
		*p = 0;
	}
	static inline void free_cpumask(cpumask_t *p)
	{
	}

	#else

	typedef unsigned long *cpumask_t;

	static inline void create_cpumask(cpumask_t *p)
	{
		*p = kcalloc(..);
	}

	static inline void free_cpumask(cpumask_t *p)
	{
		kfree(*p);
	}

	#endif

and now after you do this, you can just do something like

	cpumask_t mycpu;

	create_cpumask(&mycpu);
	..
	free_cpumask(&mycpu);

and in between, you can use 'cpumask' as a pointer, because even when it 
is an array directly allocated on the stack, the array can always 
degenerate into a pointer by C type rules!

And for the small-NR_CPUS case there is zero overhead.

			Linus
--

From: Yinghai Lu
Date: Tuesday, August 26, 2008 - 10:08 am

On Tue, Aug 26, 2008 at 9:51 AM, Linus Torvalds

that is good for local variables.

for global variables, need to allocate them in some point. may need one
int cpumask_size;

cpumask_t online_cpu_map;
DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size,
PAGE_SIZE, NULL);

or something like that.

YH
--

From: Rusty Russell
Date: Wednesday, September 24, 2008 - 6:50 pm

Hi Linus,

    This turns out to be awful in practice, mainly due to const.  Consider:

	#ifdef CONFIG_CPUMASK_OFFSTACK
	typedef unsigned long *cpumask_t;
	#else
	typedef unsigned long cpumask_t[1];
	#endif

	cpumask_t returns_cpumask(void);

That's obviously illegal if cpumask_t is an array.  So we need a typedef which 
says "really always a pointer".

	typedef unsigned long *cpumask_return_t;
	cpumask_return_t returns_cpumask(void);

But we usually want it to return a const ptr, and this doesn't work:

	const cpumask_return_t returns_cpumask(void);
	foo.c:12: warning: type qualifiers ignored on function return type

So now we need:
	typedef const unsigned long *cpumask_const_return_t;
	cpumask_const_return_t returns_cpumask(void);

OK, now consider a function which wants to take a const cpu bitmap:

	void cpus_copy(cpumask_t dst, const cpumask_t src);
	...
	cpus_copy(cpus, returns_cpumask());
	foo.c:34: warning: passing argument 2 of ‘cpus_copy’ discards qualifiers from 
pointer target type

Oops, that didn't work with the pointer version.  So we need another typedef:
	#ifdef CONFIG_CPUMASK_OFFSTACK
	typedef const unsigned long *cpumask_const_t;
	#else
	typedef const unsigned long cpumask_const_t[1];
	#endif

	void cpus_copy(cpumask_t dst, cpumask_const_t src);

We end up with this:
	#ifdef CONFIG_CPUMASK_OFFSTACK
	typedef unsigned long *cpumask_t;
	typedef const unsigned long *cpumask_const_t;
	#else
	typedef unsigned long cpumask_t[1];
	typedef const unsigned long cpumask_const_t[1];
	#endif
	typedef unsigned long *cpumask_return_t;
	typedef const unsigned long *cpumask_const_return_t;
	typedef unsigned long cpumask_data_t[1];

I can't see a neater way down this path, and I don't want to lose const.

I can see three alternatives:
1) An ONSTACK_CPUMASK(name) macro which declares "struct cpumask name[1]" or
   "struct cpumask *name".  Same idea as yours, without the typedef.
2) Use a normal struct for cpumask, make everyone use pointers, but ...
From: Ingo Molnar
Date: Thursday, September 25, 2008 - 1:55 am

since most of the important cpumasks in high-perf codepaths are already 
pre-constructed and embedded in some existing object (say task_struct), 
i think a variant of #3 is the best approach:

 - get rid of cpumask_t and use 'struct cpumask' everywhere.

 - do not expose normal kernel code to struct cpumask's definition, only
   declare the type via 'struct cpumask;' in sched.h - a'la 
   kmem_cache_t.

 - even hide the structure from sched.h - use an extra indirection for 
   struct cpumask *cpus_allowed in task_struct and be done with it.

 - have normal cpumask object alloc/free codepaths.

 - optimize any remaining important cases by hand, if needed. (the 
   scheduler mostly)

(wrt. #2, alloca() is a nightmare i think.)

	Ingo
--

From: Linus Torvalds
Date: Thursday, September 25, 2008 - 8:42 am

No. That's already broken. You cannot return a cpumask_t, regardless of 
interface. We must not do it regardless of how we pass those things 
around, since it generates _yet_ another temporary on the stack for the 
return slot for any kind of structure.

So all cpumask functions should always return pointers and/or take 
pointers to be filled in. That's true *regardless* of how we actually are 
to then allocate them.

So forget returning cpumasks. It's irrelevant.

What _is_ relevant is how we allocate them when we need temporary CPU 
masks. And _that_ is where my suggestion comes in. For small NR_CPUS, we 
really do want to allocate them on the stack, because calling kmalloc for 
a 4- or 8-byte allocation is just _stupid_.

So all your arguments are invalid, because you're looking at the wrong 
thing. The thing that I was talking about is converting current code that 
has

   random_function(..)
   {
	cpumask_t mask;

	.. do something with mask ...
   }

which has to be converted some way. And I think it needs to be converted 
in a way that does *not* force us to call kmalloc() for idiotically small 
values.

			Linus
--

From: Mike Travis
Date: Thursday, September 25, 2008 - 1:59 pm

Subject: [RFC 1/1] cpumask: Provide new cpumask API

Provide new cpumask interface API.  The relevant change is basically
cpumask becomes an opaque object.  I believe this results in the
minimum amount of editing while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.


    /* raw declaration */
    struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };

    /* cpumask_map_t used for declaring static cpumask maps */
    typedef struct __cpumask_data_s cpumask_map_t[1];

    /* cpumask_t used for function args and return pointers */
    typedef struct __cpumask_data_s *cpumask_t;

    /* cpumask_var_t used for local variable */
    typedef struct __cpumask_data_s	cpumask_var_t[1]; /* SMALL NR_CPUS */
    typedef struct __cpumask_data_s	*cpumask_var_t;	  /* LARGE NR_CPUS */

    /* replaces cpumask_t dst = (cpumask_t)src */
    void cpus_copy(cpumask_t dst, const cpumask_t src);

Remove the '*' indirection in all references to cpumask_t objects.  You can
change the reference to the cpumask object but not the cpumask object itself
without using the functions that operate on cpumask objects (f.e. the cpu_*
operators).  Functions can return a cpumask_t (which is a pointer to the
cpumask object) and only be passed a cpumask_t.

All uses of cpumask_t on the stack are changed to be cpumask_var_t. 
Allocation of local cpumask objects will follow...

All cpumask operators now operate using nr_cpu_ids instead of NR_CPUS.  All
variants of the cpumask operators which used nr_cpu_ids instead of NR_CPUS
are deleted.

All variants of functions which use the (old cpumask_t *) pointer are deleted
(f.e. set_cpus_allowed_ptr()).

Based on code from Rusty Russell <rusty@rustcorp.com.au> (THANKS!!)

Signed-of-by: Mike Travis <travis@sgi.com>

---
 include/linux/cpumask.h |  340 ++++++++++++++++++++++++------------------------
 1 file changed, 174 insertions(+), 166 deletions(-)

--- ...
From: Rusty Russell
Date: Thursday, September 25, 2008 - 10:25 pm

No, for large NR_CPUS, cpumask_t is a pointer as shown.  And we have numerous 

Right, but cpumask_t is used for far more than stack decls, thus the problems.

I can make a separate "cpumask_stack_t" and use your method tho.  I think that 

Yeah, got that.  But your suggestion to change cpumask_t turned out horribly 
ugly.

Cheers,
Rusty.
--

From: Mike Travis
Date: Thursday, September 25, 2008 - 10:53 pm

Hi Rusty,

I've gotten some good traction on the changes in the following patch.  About 30%
of the kernel is compiling right now and I'm picking up errors and warnings as
I'm going along.  I think it's doing most of what we need.  Attempting to hide
the cpumask struct definition caused all kinds of problems with the inline
functions and statically declaring cpumask's.

(The following patch is a combination of all the changes to cpumask.h with the
header from the first patch.  I'll send you a complete copy in separate email.)

Thanks,
Mike
--

Subject: [RFC 1/1] cpumask: Provide new cpumask API

Provide new cpumask interface API.  The relevant change is basically
cpumask_t becomes an opaque object.  I believe this results in the
minimum amount of editing while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.


    /* raw declaration */
    struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };

    /* cpumask_map_t used for declaring static cpumask maps */
    typedef struct __cpumask_data_s cpumask_map_t[1];

    /* cpumask_t used for function args and return pointers */
    typedef struct __cpumask_data_s *cpumask_t;

    /* cpumask_var_t used for local variable, definition follows */
    typedef struct __cpumask_data_s	cpumask_var_t[1]; /* SMALL NR_CPUS */
    typedef struct __cpumask_data_s	*cpumask_var_t;	  /* LARGE NR_CPUS */

    /* replaces cpumask_t dst = (cpumask_t)src */
    void cpus_copy(cpumask_t dst, const cpumask_t src);

Remove the '*' indirection in all references to cpumask_t objects.  You can
change the reference to the cpumask object but not the cpumask object itself
without using the functions that operate on cpumask objects (f.e. the cpu_*
operators).  Functions can return a cpumask_t (which is a pointer to the
cpumask object) and only be passed a cpumask_t.

All uses of cpumask_t on the stack are changed to be cpumask_var_t except
for pointers to static cpumask objects.  Allocation of ...
From: Ingo Molnar
Date: Saturday, September 27, 2008 - 12:16 pm

could you please send whatever .c changes you have already, so that we 
can have a look at how the end result will look like? Doesnt have to 
build, i'm just curious about how it looks like in practice, 
semantically.

	Ingo
--

From: Mike Travis
Date: Monday, September 29, 2008 - 7:33 am

I will, and the full "allyesconfig" does compile.  And it's basically a
benign change in that the functionality is still the same.  I'm currently
reordering it a bit to clean it up.

Thanks,
Mike
--

From: Ingo Molnar
Date: Tuesday, September 30, 2008 - 4:04 am

btw., are the resulting instructions also expected to be the same? If 
yes then you might want to verify it all by making sure the md5's of the 
.o's do not change.

(If that's not possible (gcc decides to compile it a bit differently) 
then no big deal, just wanted to mention the possibility.)

	Ingo
--

From: Mike Travis
Date: Tuesday, September 30, 2008 - 9:14 am

Well, not exactly... ;-)  It does institute the new API change that specifies
only pointers to cpumask's can be passed to functions and returned from
functions.  I really wanted the default cpumask_t to be a constant so those
instances where the passed in cpumask is used as a read/write temp variable
would be caught.  But it started getting messy.

One pain is:

	typedef struct __cpumask_s *cpumask_t;
	const cpumask_t xxx;

is not the same as:

	typedef const struct __cpumask_s *const_cpumask_t;
	const_cpumask_t xxx;

and I'm not exactly sure why.  It came up when I tried to declare
functions that returned a constant cpumask_t pointer (node_to_cpumask,
cpumask_of_cpu, etc.)

The other major change I'm contemplating is to remove "cpumask_t" completely
(maybe cpumask_ptr_t?).  This would force every instance of cpumask_t to be
examined.  (I found quite a few I had missed in my original edits when I
added the task struct temp cpumask's.)

Oh yeah, one question ... is "current" always valid?

Thanks,
Mike

--

From: Linus Torvalds
Date: Tuesday, September 30, 2008 - 9:46 am

Umm. The const has different 

One is

	typedef const struct __cpumask_s *const_cpumask_t;

which becomes

	(const struct __cpumask_s) *

while the other is

	const cpumask_t xxx

which is

	const (struct __cpumask_s *)

and if you look a bit more closely, you'll see that they are _obviously_ 
not the same thing at all.

Quite frankly, I personally do hate typedefs that end up being pointers, 
and used as pointers, without showing that in the source code.

When you do

	type_t a;

	fn(a);

I expect the code to essentially do a pass-by-value. But when the type_t 
is a pointer, that doesn't really work.

Your issue with 'const' is just another version of the same. You don't 
want the _pointer_ to be const, you want what it points _to_ to be const. 
But because you hid the pointerness inside the typedef, you simply cannot 
do that.

The problem with cpumask's, of course, is that for the "small mask" case, 
we really don't want it to be a pointer. So now it's sometimes a pointer 
and sometimes not. The typedef hides that, and I understand why it's a 
good idea, but I'm surprised you didn't understand what the implications 
were for 'const', and I'm now a bit more leery about this whole thing just 
because the typedef ends up hiding so much - it doesn't just hide the 
basic type, it hides a very basic *code* issue.

			Linus
--

From: Mike Travis
Date: Tuesday, September 30, 2008 - 11:02 am

Thanks, yes that explains what I should have figured out.  (Nice way to

I agree, and as I mentioned, Rusty was working towards an alternative
method of declaring cpumask's which does not hide this.

My goal was to create an (apparent) opaque handle for cpumask_t and modify

Perhaps then defining the cpumask as a list of unsigned longs (remove the
outer struct) would play more "naturally".  Lists by definition are always
referenced by pointers.  I have another version of the patchset that shows

Thanks!
Mike
--

From: Mike Travis
Date: Tuesday, September 30, 2008 - 3:22 pm

...

Here's an alternate proposal for the new cpumask API.  I have not yet begun
the tedious work of making it compile so there may still be some syntax
errors.  (Init/main.c compiles.)  But I believe it more closely adheres to
proper 'C' coding styles.

Thanks,
Mike
--
Subject: cpumask: Provide new cpumask API

[Copied from Documentation/cpumask.txt]

Introduction

Cpumask variables are used to represent (with a single bit) all the
CPU's in a system.  With the prolific increase in the number of CPU's
in a single system image (SSI), managing this large number of cpus has
become an ordeal.  When the limit of CPU's in a system was small, the
cpumask could fit within an integer.  Even with the increase to 128,
a bit pattern was well within a manageable size.  Now with systems
available that have 2048 cores, with hyper-threading there are 4096
cpu threads.  And the number of cpus in an SSI is growing, 16,384 is
right around the corner.  Even desktop systems with only 2 or 4 sockets,
a new generation of Intel processors that have 128 cores per socket will
increase the count of CPU threads tremendously.

Thus the handling of the cpumask that represents this 4096 limit needs
512 bytes, putting pressure on the amount of stack space needed to
accommodate temporary cpumask variables.

The primary goal of the cpumasks API is to accommodate these large
number of cpus without effecting the compactness of Linux for small,
compact systems.


The Changes

Provide new cpumask interface API.  The relevant change is basically
cpumask_t becomes an pointer to a constant list of unsigned longs
and two new typedef's are used for declaring cpumask_t variables,
cpumask_var_t and cpumask_map_t.  This should result in the minimum
amount of modifications while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.


    /* cpumask_t is used when pointing to a constant cpumask */
    typedef const unsigned long *cpumask_t;

    /* cpumask_map_t ...
From: Rusty Russell
Date: Tuesday, September 30, 2008 - 5:45 pm

Hiding a const here is not a good idea either, I think :(

Rusty.
--

From: Rusty Russell
Date: Tuesday, September 30, 2008 - 5:44 pm

Yes, this is why my version of the rework moved away from typedefs, except for 
the special case of "cpumask_var_t" for stack vars where this trick is really 
desired.

Everywhere else, the code becomes nice and clear: struct cpumask *.

Cheers,
Rusty.
--

From: Mike Travis
Date: Tuesday, August 26, 2008 - 12:11 pm

Iirc, it was the problem of basing percpu variables at zero that hit
problems with various gcc toolset versions.  I don't remember any
version problems with cpumask's on the stack, they all failed the


Removing cpumask_t's from the stack is fairly straight forward.  The
problem of changing all functions to expect a cpumask pointer via a
global change is much more problematic.  And of course all those
functions that return a cpumask value would need to be addressed.

Thanks,
Mike
--

From: Mike Travis
Date: Tuesday, August 26, 2008 - 12:06 pm

Yes, I had proposed either modifying, or supplementing a new
smp_call function to pass the cpumask_t as a pointer (similar
to set_cpus_allowed_ptr.)  But an ABI change such as this was
not well received at the time.

Thanks,
Mike
--

From: David Miller
Date: Tuesday, August 26, 2008 - 1:45 pm

From: Mike Travis <travis@sgi.com>

What it seems to come down to is that any cpumask_t not inside of
a dynamically allocated object should be marked const.

And that is something we can enforce at compile time.

Linus has just suggested dynamically allocating cpumask_t's
for such cases but I don't see that as the fix either.

Just mark them const and enforce that cpumask_t objects can only
be modified when they appear in dynamically allocated objects.

You really don't need to modify the ones that passed around functions
anyways.  The only code that wants to change bits in these things is
the cpu cross-call dispatch stuff, and that cpu choice logic can just
live where it belongs down in the cross-call dispatch code.

--

From: Jes Sorensen
Date: Friday, August 29, 2008 - 5:42 am

Dave and others,

Sorry if I jump into the middle of the thread. Stopped subscribing to
lkml for a while, so this is through the archives.

Ran into some of these issues with KVM too, and noticed just how much
we pass cpumask_t around in the smp_call functions :-( In fact, the
only arch that did pretty well on this front was sparc64.

I totally agree, that marking them const makes a ton of sense, but at
the same time I suggest we convert smp_call_function_mask() to take a
pointer to the cpumask_t. I whipped up the following patch, which cuts
down the amont of memcpy calls emitted quite a fair bit.

I have only tested this on ia64, but it boots, so it's obviously
perfect<tm> :-)

Comments, suggestions welcome.

I have a followup patch that makes virt/kvm/kvm_main.c use the new
interface.

Cheers,
Jes

From: Linus Torvalds
Date: Friday, August 29, 2008 - 9:14 am

Well, it probably boots because it doesn't really seem to _change_ much of 
anything.

Things like this:

	-static inline void arch_send_call_function_ipi(cpumask_t mask)
	+static inline void arch_send_call_function_ipi(cpumask_t *mask)
	 {
	-       smp_ops.send_call_func_ipi(mask);
	+       smp_ops.send_call_func_ipi(*mask);
	 }

will still do that stack allocation at the time of the call. You'd have to 
pass the thing all the way down as a pointer..

			Linus


--

From: David Miller
Date: Friday, August 29, 2008 - 1:04 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

True, but we have to get there one step at a time.

BTW, sparc64 already wants a pointer here, so it's completely ready for
this:

void arch_send_call_function_ipi(cpumask_t mask)
{
	xcall_deliver((u64) &xcall_call_function, 0, 0, &mask);
}
--

From: Jes Sorensen
Date: Monday, September 1, 2008 - 4:53 am

Hi Linus,

I realize that, but as I have been doing this work on ia64, I didn't
want to mess too much with the x86 code. The ia64 part of the patch does
change things :-)

If someone with more x86 knowledge would want to try and make that part
of the patch better, and more importantly test it, I'd be quite keen on
helping out.

Cheers,

--

From: Jes Sorensen
Date: Tuesday, September 2, 2008 - 7:27 am

Linus,

Ok, so here's a version which tries to do the right thing on x86 as
well. Build tested on x86_64, but don't have an easy way to test it
right now. It's booting on ia64.

Cheers,
Jes

From: Mike Travis
Date: Tuesday, August 26, 2008 - 12:03 pm

I would be most interested in any tools to analyze call-trees and
accumulated stack usages.  My current method of using kdb is really
time consuming.

Thanks!
Mike
--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 12:40 pm

Well, even just scripts/checkstack.pl is quite relevant.

The fact is, anything with a stack footprint of more than a hundred bytes 
is suspect. We _do_ have a lot of cases of several hundred bytes, and some 
of them are even very intentional.

For an example of _intentional_ and valid large stacks, look at 
do_sys_poll and do_select. They both have a big stack footprint in a 
normal kernel, and that's on purpose - it's not pretty, but they are very 
common and performance-sensitive functions, and using a big stack allows 
some basic allocations to be much cheaper by default.

Same goes for early_printk(), although I don't think the reasons are 
really very strong in that case.

Sadly, while those functions are _fairly_ high up, they aren't at the top, 
and we do have a lot of other functions that have huge stack footprints 
for totally bogus reasons. But the intentional ones are at least in the 
top ten.

But the kernel that Alan had problems with was different. The 
_intentional_ ones were way down in the noise.  do_sys_poll wasn't in the 
top ten, it was barely even in the top 50! (It was in fact #49, to be 
exact).

So look at the top ten in my kernel:

     1  ide_generic_init [vmlinux]:             1384
     2  idefloppy_ioctl [vmlinux]:              1208
     3  e1000_check_options [vmlinux]:  	1152
     4  do_sys_poll [vmlinux]:          	904
     5  ide_floppy_get_capacity [vmlinux]:      872
     6  do_select [vmlinux]:                    744
     7  early_printk [vmlinux]:         	720
     8  do_task_stat [vmlinux]:         	680
     9  mmc_ioctl [vmlinux]:                    648
    10  elf_kcore_store_hdr [vmlinux]:  	576

.. and in Alan's kernel:

     1  smp_call_function_mask [vmlinux]:       2736
     2  __build_sched_domains [vmlinux]:        2232
     3  setup_IO_APIC_irq [vmlinux]:            1616
     4  arch_setup_ht_irq [vmlinux]:            1600
     5  arch_setup_msi_irq [vmlinux]:   	1600
     6  __assign_irq_vector [vmlinux]:  ...
From: Mike Travis
Date: Tuesday, August 26, 2008 - 12:01 pm

Hi Linus,

[Sorry for the long winded response, but I felt that sufficient background
is needed to address this issue.... YOMV :-)]

The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is
critical to our upcoming SGI systems using what we have been calling
"UV".  This system will be capable of supporting 4096 cpu threads in a
single system image (and 16k cpus/2k nodes is right around the corner).
While obviously I cannot divulge too many details, it's sufficient to
say there are customers who not only require this extended capability,
but are extremely excited about it.

But the nature of some of these system environments is that they will
not accept a specially built kernel, but only a kernel that has been
built and certified (both from the application standpoint as well as
the security standpoint) by standard distributions.  And you probably
know how extensively these distributions test and certify for many known
defects and absolutely require that incoming source changes come from
the community supported source bases, primarily yours.

Due to the lead time required to accomplish these certifications,
the version of the distributions that will be available when this
system releases will be based on 2.6.27.  (They will allow patches
"post-2.6.27-rc.final" as long as those are committed in the source base.)
The two distributions that SGI supports for our customers is SLES
(SUSE Linux Enterprise Server) and RHEL (Red Hat Enterprise Linux).
[They, of course, are free to run any OS of their choosing, but SGI only
provides front line support for those two.]

I started last August to begin analyzing how to accomplish the above goals
and where exactly are the hot spots in the kernel that would require
attention.  It quickly became clear that cpumask_t and nodemask_t are
two variables that are very casually used (along with NR_CPUS), because
the assumption was that 64 "was more than sufficient" for an upper limit
and even extending it to 128 or 255 (254 was the ...
From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 12:09 pm

That's fine. You can do it. The default kernel will not, because it's 
clearly not safe.

I really don't care what you do to _your_ images. But I will not 
distribute a known-broken kernel, and I will not debug random stack 
overflows that happen in it.

If you want the default kernel to support 4k cores, we'll need to fix the 
stack usage.  I don't think that is impossible, but IT IS NOT GOING TO 
HAPPEN for 2.6.27.

And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by 
default, they are totally and utterly crazy.

But some SGI-specific binary that is meant for SGI machines only, and has 
been extensively tested with the setup used on SGI machines is a different 
thing.

			Linus
--

From: Mike Travis
Date: Tuesday, August 26, 2008 - 12:35 pm

Ok, thanks for the reply, and looking into this issue.  We will "strongly
encourage" our distros to base the relevant releases on 2.6.28. :-)  

[Supplying an SGI-specific kernel would not be acceptable to many of our
customers because of the certification issues I mentioned.] 

Mike
--

From: Dave Jones
Date: Tuesday, August 26, 2008 - 12:28 pm

On Tue, Aug 26, 2008 at 12:09:46PM -0700, Linus Torvalds wrote:

 > If you want the default kernel to support 4k cores, we'll need to fix the 
 > stack usage.  I don't think that is impossible, but IT IS NOT GOING TO 
 > HAPPEN for 2.6.27.
 > 
 > And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by 
 > default, they are totally and utterly crazy.

heh.  *picks through Fedora changelog*

* Thu Aug 14 2008 Dave Jones <davej@redhat.com>
- Bump max cpus supported on x86-64 to 4096. Just to see what happens.

I never did get to find out unfortunatly, because of the security fiasco
in Fedora infrastructure the last week or two.
 
 > But some SGI-specific binary that is meant for SGI machines only, and has 
 > been extensively tested with the setup used on SGI machines is a different 
 > thing.

Every extra kernel image a distro vendor ends up shipping has an associated cost.

* build time: It currently takes about 2 hours for a set of Fedora RPMs.
  For RHEL it'll be even worse due to the extra archs).
  Killing off -smp specific builds was a big win for us in this regard.
  Adding extra flavours is always painful.

* diskspace (distro kernels aren't small. With the associated debugging symbols,
   they take up a shitload of disk space really fast).

* Having everyone running the same kernel makes it much easier to test/debug.
  Our QA guys hate adding extra columns to their test matrix.

But yes, for this to be even remotely feasible, there has to be a negligable
performance cost associated with it, which right now, we clearly don't have.
Given that the number of people running 4096 CPU boxes even in a few years time
will still be tiny, punishing the common case is obviously absurd.

	Dave

-- 
http://www.codemonkey.org.uk
--

From: Mike Travis
Date: Tuesday, August 26, 2008 - 1:01 pm

Dave Jones wrote:

I did do some fairly extensive benchmarking between configs of NR_CPUS = 128 and
4096 and most performance hits were in the neighborhood of < 5% on systems with
8 cpus and 4GB of memory (our most common test system).  [But changing cpumask_t's
to be pointers instead of values will likely increase this.]  I've tried to be
very sensitive to this issue with all my previous changes, so convincing the distros
to set NR_CPUS=4096 would be as painless for them as possible. ;-)

Btw, huge count cpu systems I don't think are that far away.  I believe the nextgen
Larabbee chips will be geared towards HPC applications [instead of just GFX apps],
and putting 4 of these chips on a motherboard would add up to 512 cpu threads (1024
if they support hyperthreading.)

Thanks,
Mike
--

From: Nick Piggin
Date: Tuesday, August 26, 2008 - 11:54 pm

5% is a pretty nasty performance hit... what sort of benchmarks are we
talking about here?

I just made some pretty crazy changes to the VM to get "only" around 5
or so % performance improvement in some workloads.

What places are making heavy use of cpumasks that causes such a slowdown?
Hopefully callers can mostly be improved so they don't need to use cpumasks
for common cases.

Until then, it would be kind of sad for a distro to ship a generic x86
kernel and lose 5% performance because it is set to 4096 CPUs...

But if I misunderstand and you're talking about specific microbenchmarks to

It would be quite interesting if they make them cache coherent / MP capable.
Will they be?
--

From: David Miller
Date: Wednesday, August 27, 2008 - 12:05 am

From: Nick Piggin <nickpiggin@yahoo.com.au>

It's almost certainly from the cross-call dispatch call chain.

As just one example, just to do a TLB flush mm->cpu_vm_mask probably
gets passed around as an aggregate two or three times on the way down
to the APIC programming code on x86.  That's two or three 512 byte
copies on the stack :)

Look at the sparc64 SMP code for how I solved the problem there.
--

From: Nick Piggin
Date: Wednesday, August 27, 2008 - 12:47 am

Yeah, I see. That's stupid isn't it? (Well, I guess it was completely
sane when cpumasks were word sized ;))

Hopefully that accounts for a significant chunk...
--

From: David Miller
Date: Wednesday, August 27, 2008 - 1:44 am

From: Nick Piggin <nickpiggin@yahoo.com.au>

There is a lot of indirect costs that are hard to see as well.

Two things a lot of these cross-call dispatch paths do is:

1) Clear self-cpu

2) AND with cpus_online

#1 can normally be a simple bit clear, but some places can also
implement this with something like "cpus_andn(X, cpumask_of_cpu(cpu))"

It's simply easier to move those two things down to the bottom of
the APIC programming code, they just loop over the cpumask doing
an expensive APIC I/O operation anyways, might as well overlap it
with these "skip self-cpu" and "skip not-online cpus" checks.

And oh yeah we get the stack wastage fixed too, isn't what what we
were talking about? :-)

--

From: Mike Travis
Date: Wednesday, August 27, 2008 - 7:48 am

Yes, the most time consuming part was determining whether a kmalloc
could safely be used in the context of the function, and what to
do about the out-of-memory problem.  Pushing that down to something
like:  for_each_cpu_thats_online(cpu, *maskptr) would remove the need for
many of the temp masks.  A simple if (cpu != me) would take care of
excluding self.  It might have better interaction with cpu hotplug
as well, since the online map would be checked just before the call
to that cpu is made.

Thanks,
Mike
--

From: Mike Travis
Date: Wednesday, August 27, 2008 - 7:36 am

I will, thanks!

Mike
--

From: Mike Travis
Date: Wednesday, August 27, 2008 - 7:35 am

It's been a while now, I should go back and check my notes.  Many of the
BM's did not have any changes.  I believe the ones that were right on the

That's another study I did, and it seemed that maybe 95% of the functions
would not be affected by passing pointers to cpumasks instead of the cpumasks
themselves, because the data was processed by a cpu_xxx function that
uses a pointer.  Most commonly was to create a temp cpumask, using
cpus_and(temp_mask, callers_mask, cpu_online_map);  The speedup to use nr_cpu_ids
instead of NR_CPUS in the traversal functions helped quite a bit.  Using this
same method in the cpus_xxx functions would further speed up things.  (As
well as only allocating the cpumask sized by nr_cpu_ids instead of NR_CPUS

Yes, I was (at the time) trying to determine how many of the cpumask functions
were actually in play by user tasks, so I was zeroing in on those (cpusets,

There's not been a lot of info available yet, but I think the 128 cores will
share at least an L2 cache + memory controller.  How the APIC's interact is
also another big question.  And most likely some standard system controller
CPU will be needed, but that could be a tiny VIA processor... ;-)

Thanks,
Mike
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 2:30 pm

This probably all started when I was working on a software tool (aiod)
that was failing because somebody ELSE had 4,096 CPUs configured.
[[Seems that gcc had/has? it's MAX CPU value set to 1,024 (bits/sched.h
__CPU_SETSIZE), so when you issue system calls like sched_getaffinity,
it will "fail" for systems configured w/ 4,096 CPUs. I worked around it
by simply forgetting about the gcc values, and kept allocating larger
CPU masks until it worked.]]

I think you're right: the kernel as a whole may not be ready for 4,096
CPUs apparently...

Thanks for taking the time to look into this...

Alan
--

From: Christoph Lameter
Date: Monday, August 25, 2008 - 3:07 pm

Mike has been working diligently on getting all these cpumasks off the stack
for the last months and has created an infrastructure to do this. So I think
we are close. It might just be a matter of merging some more patches that are
still left in Ingo's tree.

--

From: Ingo Molnar
Date: Tuesday, August 26, 2008 - 12:59 am

hm, there are no such patches left that i know of - the only bits in 
-tip are the zero-based percpu, which was found to be a bit fragile in 
testing:

 earth4:~/tip> git-log-line --author=Travis linus..
 d379497: Zero based percpu: infrastructure to rebase the per cpu area to zero
 b3a0cb4: x86: extend percpu ops to 64 bit

[and it has no relevance to stack footprint.]

So i dont think the current cpumask_t approach will work. We simply 
should not get into an endless fight against the windmills that 
introduce on-stack cpumask_t again and again. We should just take the 
plunge once and do a clean alloc/free cpumask model. Most of the hotpath 
cpumasks are constant or pre-constructed, so they are not a real issue.

Plus, on the general question of stack footprint problems and the 
difficulty of debugging them, the worst-case stack footprint tracer i 
wrote for -rt some time ago should be dusted off as well and put into 
ftrace. David has something quite close to that for Sparc64 already.

	Ingo
--

From: Mike Travis
Date: Tuesday, August 26, 2008 - 12:48 pm

I'll start experimenting with globally changing cpumask_t to be a pointer,
and see what falls out.

Thanks,
Mike

--

From: Rusty Russell
Date: Monday, August 25, 2008 - 6:11 pm

Hmm, wants neatening anyway; I'll see if I can reduce stack usage side effect.

Your workaround is very random, and that scares me.  I think a huge number of 
CPUs needs a real solution (an actual cpumask allocator, then do something 
clever if we come across an actual fastpath).

Thanks,
Rusty.
--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 10:35 am

The thing is, the inlining thing is a separate issue.

Yes, the cpumasks were what made stack pressure so critical to begin with, 
but no, a 400-byte stack frame in a deep callchain isn't acceptable 
_regardless_ of any cpumask_t issues.

Gcc inlining is a total and utter pile of shit. And _that_ is the problem. 
I seriously think we shouldn't allow gcc to inline anything at all unless 
we tell it to. That's how it used to work, and quite frankly, that's how 
it _should_ work.

The downsides of inlining are big enough from both a debugging and a real 
code generation angle (eg stack usage like this), that the upsides 
(_somesimes_ smaller kernel, possibly slightly faster code) simply aren't 
relevant.

So the "noinline" was random, yes, but this is a real issue. Looking at 
checkstack output for a saner config (NR_CPUS=16), the top entries for me 
are things like

	ide_generic_init [vmlinux]:             1384
	idefloppy_ioctl [vmlinux]:              1208
	e1000_check_options [vmlinux]:  	1152
	...

which are "leaf" functions. They are broken as hell (the e1000 is 
apparently because it builds structs on the stack that should all be 
"static const", for example), but they are different from something like 
the module init sequence in that they are not going to affect anything 
else.

It would be interesting to see what "-fno-default-inline" does to the 
kernel. It really would get rid of a _lot_ of gcc version issues too. 
Inlining behavior of gcc has long been a problem for us.

			Linus
--

From: Adrian Bunk
Date: Tuesday, August 26, 2008 - 11:30 am

I added "-fno-inline-functions-called-once -fno-early-inlining" to 
KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel 
image by 2%.

And when David's "-fwhole-program --combine" will become ready the cost
of disallowing gcc to inline functions will most likely increase.

A debugging option (for better traces) to disallow gcc some inlining 
might make sense (and might even make sense for distributions to 
enable in their kernels), but when you go to use cases that require
really small kernels the cost is too high.

But if you don't trust gcc's inlining you should revert
commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's 
freedom regarding what to inline in 2.6.27 - what gcc 4.2 does in the 
case of the regression tracked as Bugzilla #11276 is really not funny 
(two callers -> function not inlined; gcc seems to emit the function 
although both callers later get removed (or at least should be removed) 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 11:40 am

You ignore the fact that it's really not just about debugging.

Inlining really isn't the great tool some people think it is. Especially 
not since gcc stack allocation is so horrid that it won't re-use stack 
slots etc (which I don't disagree with per se - it's _hard_ to re-use 
stack slots while still allowing code scheduling).

NOTE! I also would never claim that _our_ choices of "inline" are all that 
great, and we've often inlined too much or not inlined things that really 
could be inlined. But at least when a developer says "inline" (or forgets 
to say it), we have somebody to blame. When the compiler does insane 

Actually, that just allows gcc to _not_ inline. Which is probably ok.

(Well, it would be ok if gcc did it well enough, it obviously has some 
problems at times).

			Linus
--

From: Adrian Bunk
Date: Tuesday, August 26, 2008 - 1:21 pm

I had in mind that we anyway have to support it for tiny kernels.

I simply don't see that we add kconfig options for 5kB of code for
tiny kernels but remove something like this that can cause size

gcc's stack allocation has become better

Most LOCs of the kernel are not written by people like you or Al Viro or 
David Miller, and the average kernel developer is unlikely to do it as 
good as gcc.

For the average driver the choice is realistically between
"inline's randomly sprinkled across the driver" and
"no inline's, leave it to gcc".

And code evolves during the years from tiny with 1 caller to huge with 
many callers.

BTW:
I just ran checkstack on a (roughly) allyesconfig kernel, and we have a 

With the "gcc inline's static functions" you complain about we have
4-5 years of experience.

Suddenly allowing 4 release series of gcc to ignore any inline's is a 
completely new area for us. I'd generally agree with giving gcc more 
freedom here, but I'd rather do it right by removing tons of wrong 
inline's than doing one global change hoping that it will make things 
better.

And whether the "optimized inlining" actually makes the kernel bigger or 

cu
Adrian

[1] there are some rare exceptions

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 1:41 pm

I actually don't think that is true.

If we really were to decide to be stricter about it, and it makes a big 
size difference, we can probably also add a tool to warn about functions 


I agree that it has become better. But it still absolutely *sucks*.

For example, see the patch I just posted about e1000 stack usage. Even 
though the variables were all in completely separate scopes, they all got 
individual space on the stack over the whole lifetime of the function, 
causing an explosion of stack-space. As such, gcc used 500 bytes too much 
of stack, just because it didn't re-use the stackspace.

That was with gcc-4.3.0, and no, there were hardly any inlining issues 
involevd, although it is true that inlining actually did make it slightly 
worse in that case too (but since it was essentially a leaf function, that 
had little real life impact, since there were no deep callchains below it 
to care).

So the fact is, "better" simply is not "good enough". We still need to do 
a lot of optimizations _manually_, because gcc cannot see that it can 
re-use the stack-slots.

And sometimes those "optimizations" are actually performance 
pessimizations, because in order to make gcc not use all the stack at the 

Sure. But we do have tools. We do have checkstack.pl, it's just that it 
hasn't been an issue in a long time, so I suspect many people didn't even 
_realize_ we have it, and I certainly can attest to the fact that even 
people who remember it - like me - don't actually tend to run it all that 



Sure. And most of it isn't all that great.

But I do agree that lettign gcc make more decisions is _dangerous_. 
However, in this case, at least, the decisions it makes would at least 
make for less inlining, and thus less stack space explosion.

			Linus
--

From: Jamie Lokier
Date: Wednesday, August 27, 2008 - 9:21 am

Sounds like what's really desired here isn't more worry and
unpredictability, but for GCC+Binutils to gain the ability to
calculate the stack depth over all callchains (doesn't have to be
exact, just an upper bound; annotate recursions) in a way that's good
enough to do on every compile, complain if a depth is exceeded
statically (or it can't be proven), and to gain the

In my userspace code, I have macros tmp_alloc and tmp_free.  They must
be matched in the same function:

     unsigned char * recvbuf = tmp_alloc(1500);
     ....
     tmp_free(recvbuf);

When stack is plentiful, it maps to alloca() which is roughly
equivalent to using a stack variable.

When stack is constrained (as it is on my little devices), that maps
to xmalloc/free.  The kernel equivalent would be kmalloc GFP_ATOMIC
(perhaps).

With different macros to mine, it may be possible to map small
fixed-size requests exactly onto local variables, and large ones to
kmalloc().  A stab at it (not tested):

    #define LOCAL_ALLOC_THRESHOLD     128

    #define LOCAL_ALLOC(type, ptr)                                        \
        __typeof__(type) __attribute__((__unused__)) ptr##_local_struct;  \
        __typeof__(type) * ptr =                                          \
              ((__builtin_constant_p(sizeof(type))                        \
                && sizeof(type) <= LOCAL_ALLOC_THRESHOLD)                 \
               ? &ptr##_local_struct : kmalloc(sizeof(type), GFP_ATOMIC))

    #define LOCAL_FREE(ptr)                           \
        ((__builtin_constant_p(sizeof (*(ptr)))       \
          && sizeof(*(ptr)) <= LOCAL_ALLOC_THRESHOLD) \
         ? (void) 0 : kfree(ptr))

Would that be useful in the kernel?

I'm thinking if it were a commonly used pattern for temporary buffers,
unknown structures and arrays of macro-determined size, the "new
driver" author would be less likely to accidentally drop a big object
on the stack.

Obviously it would be nicer for GCC to code such a ...
From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 11:47 am

Btw, did you check with just "-fno-inline-functions-called-once"?

The -fearly-inlining decisions _should_ be mostly right. If gcc sees early 
that a function is so small (even without any constant propagation etc) 
that it can be inlined, it's probably right. 

The inline-functions-called-once thing is what causes even big functions 
to be inlined, and that's where you find the big downsides too (eg the 
stack usage).

			Linus
--

From: Jamie Lokier
Date: Tuesday, August 26, 2008 - 12:02 pm

That's a bit bizarre, though, isn't it?

A function which is only called from one place should, if everything
made sense, _never_ use more stack through being inlined.  Inlining
should just increase the opportunities that the called function's
local variables can share the same stack slots are the caller's dead
locals.

Whereas not inlining guarantees they occupy separate, immediately
adjacent regions of the stack, and shouldn't be increasing the total
numbers of local variables.

-- Jamie
--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 12:18 pm

But that's simply not true.

See the whole discussion.

The problem is that if you inline that function, the stack usage of the 
newly inlined function is now added to ALL THE OTHER paths too!

So the case we had in module loading was that yes, we had a function with 
a big stack footprint, but it was NOT in the deep path.

But by inlining it, it now moved the stack footprint "up" one level to 
another function, and now the big stack footprint really _was_ in the deep 
path, because the caller was involved in a much deeper chain. 

So inlining moves the code up the callchain, and that is a problem for the 
backtrace, but that's "just" a debugging issue. But it also moves the 
stack footprint up the callchain, and that can actually be a correctness 
issue.

Of course, a compiler doesn't _have_ to do that. A compiler _could_ have 
multiple different stack footprints for a single function, and do liveness 
analysis etc. But no sane compiler probably does that, because it's very 
painful indeed, and it's not even an issue if you aren't stack-limited 
(and being stack-limited is really just a kernel thing).

(Yeah, it can be an issue even if you have a big stack, in that you get 
worse cache behaviour, so a dense stack footprint _would_ help. But the 
complexity of stack liveness analysis is almost certainly not worth the 
relatively small gains it would get on some odd cases).

			Linus
--

From: Adrian Bunk
Date: Tuesday, August 26, 2008 - 1:59 pm

-fno-inline-functions-called-once alone costs me nearly 1% in code size.

And I'd expect it to become more with "-fwhole-program --combine".


If you think we have too many stacksize problems I'd suggest to consider 
removing the choice of 4k stacks on i386, sh and m68knommu instead of 
using -fno-inline-functions-called-once:

Now that 32bit x86 is no longer used for extreme highend configurations 
the only serious usecase for 4k stacks are AFAIK space savings on 
embedded archs.

4k stacks have caused us much pain [1], and the cases where gcc inlined 
too much were the easy ones.

I'm not saying that I'd like removing the choice of 4k stacks, but if we 
want to reduce the number of stack related problems that's IMHO the 

cu
Adrian

[1] AFAIR some callpaths in the kernel are still too big

BTW: In case anyone wonders about why I suggest removing 4k stacks:
     My position is that 4k stacks should either be enabled 
     unconditionally or no longer offered at all.
     And if we remove 4k stacks from 32bit x86 it's no longer 
     realistically maintainable for other architectures.

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 2:04 pm

Don't be silly. That makes the problem _worse_.

We're much better off with a 1% code-size reduction than forcing big 
stacks on people. The 4kB stack option is also a good way of saying "if it 
works with this, then 8kB is certainly safe".

And embedded people (the ones that might care about 1% code size) are the 
ones that would also want smaller stacks even more!

		Linus
--

From: Parag Warudkar
Date: Tuesday, August 26, 2008 - 3:54 pm

On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds

This is something I never understood - embedded devices are not going
to run more than a few processes and 4K*(Few Processes)
 IMHO is not worth a saving now a days even in embedded world given
falling memory prices. Or do I misunderstand?

Parag
--

From: David VomLehn
Date: Tuesday, August 26, 2008 - 4:00 pm

Embedded applications span a huge range of sizes, from the very small devices to 
which you refer, to quite complex devices. The cable settop boxes we develop have 
over a hundred interrupt sources, typically run 250-300 threads, and have 192+ 
MiB of memory. For all that, we are very cost sensitive and are under constant 
--
David VomLehn


--

From: Adrian Bunk
Date: Tuesday, August 26, 2008 - 4:45 pm

As you say correctly the term "embedded" gets used for many different 
devices.

And if you have 192+ MiB of memory you have so much that all these 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 4:47 pm

Well, by that argument, 1% of kernel size doesn't matter either..

1% of a kernel for an embedded device is roughly 10-30kB or so depending 
on how small you make the configuration. 

If that matters, then so should the difference of 3-8 processes' kernel 
stack usage when you have a 4k/8k stack choice.

And they _all_ will have at least 3-8 processes on them. Even the simplest 
ones will tend to have many more.

			Linus
--

From: Greg Ungerer
Date: Tuesday, August 26, 2008 - 5:53 pm

I have some simple devices (network access/routers) with 8MB of RAM,
at power up not really being configured to do anything running 25
processes. (Heck there is over 10 kernel processes running!). Configure
some interfaces and services and that will easily push past 40.
I'd be happy with a 160k saving :-)

The init memory being freed at the end of the kernel boot is 88k,
4k stacks could save more than that.

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
--

From: Parag Warudkar
Date: Tuesday, August 26, 2008 - 6:08 pm

So you really need to run all 25 processes on that 8Mb box?
(For reference even the NGW100 development board comes with 16Mb RAM).

Even if you do need those all 25 processes on the 8Mb box, fixing the
memory usage of those user space hogs is lot better than trying to
save 160Kb in kernel stacks.
Last I looked, user space wasn't particularly frugal with memory usage.

Parag
--

From: Greg Ungerer
Date: Tuesday, August 26, 2008 - 6:31 pm

Yes, of course. Considerable effort has been put into running
a minimal set of processes (that still for fills the required function

Lots of development boards are fitted with lots of RAM.

And the pressure will still be on in _real_ products to reduce
the RAM footprint as much as possible. There are exceptions but

Yep, been done too. You don't squeeze a lot into these smaller

Then you haven't looked in the right places :-)

There are plenty of choices for making things small in user space.
Simple stuff like using uClibc, busybox, etc.

In this specific example things like /bin/init is 10k, /bin/inetd
is 10k, /bin/crond is 11k, etc. (Ofcourse it is a shared uClibc setup,
uClibc is ~300k). And XIP can help out here too.

Regards
Greg



------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
--

From: Parag Warudkar
Date: Tuesday, August 26, 2008 - 7:16 pm

Well, sure  - but the industry as a whole seems to have gone the other
way - do more with more at the similar or lower price points!
By that definition of less is better we should try and make the kernel
memory pageable (or has someone already done that?) - Windows does it,
by default ;)

Parag
--

From: Bernd Petrovitsch
Date: Wednesday, August 27, 2008 - 1:44 am

On Tue, 2008-08-26 at 22:16 -0400, Parag Warudkar wrote:

"The industry as a whole" doesn't exist on that low level. You can't
compare the laptop and/or desktop computer market (where one may buy
today hardware that runs in 3 years with the next generation/release of
the OS and applications) with the e.g. "WLAN router" market where - from
the commercial point of view - every Euro counts (and where the
requirements for the lifetime of the device are long frozen before the

That doesn't help as in really small devices (like WLAN routers, cable
modems, etc.) you run without any means of paging/swapping. And even
binaries/read-only files are not necessarily executable in place (but
must be loaded into RAM). So you can't flush these pages.

And pageable kernel memory doesn't come for free - even if one only

Which is more a sign that it is probably a very bad idea.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


--

From: Parag Warudkar
Date: Tuesday, August 26, 2008 - 5:58 pm

On Tue, Aug 26, 2008 at 7:47 PM, Linus Torvalds

The savings part -financial ones- are not always realizable with the
way memory is priced/sized/fitted.
Savings in few Mb of Kernel stack are not necessarily going to allow
getting rid of a single memory chip of 64M or so.
Either that or embedded manufacturing/configurations are different
than the desktop world.

(If my device has 2 memory slots and my user space requires 100Mb
including kernel memory - I anyways have to put in 64Mx2 there to take
advantage of mass manufactured, general purpose memory - so no big
deal if I saved 1.2Mb in Kernel stack or not. And savings of 64Mb
Kernel memory are not feasible anyways to allow user space to work
with 64Mb.)

On the other hand reducing  user space memory usage on those devices
(not counting savings from kernel stack size) is a way more attractive
option.

And although you said in your later reply that Linux x86 with 4K
stacks should be more than usable - my experiences running a untainted
desktop/file server with 4K stack have been always disastrous XFS or
not.  It _might_ work for some well defined workloads but you would
not want to risk 4K stacks otherwise.

I understand the having 4K stack option as a non-default for very
specific workloads is a good idea but apart from that I think no one
else seems to bother with reducing stack sizes (by no one I mean other
OSes.)

Parag
--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 6:49 pm

Umm. How long?

4kB used to be the _only_ choice. And no, there weren't even irq stacks. 
So that 4kB was not just the whole kernel call-chain, it was also all the 
irq nesting above it.

And yes, we've gotten much worse over time, and no, I can't really suggest 
going back to that in general. The code bloat has certainly been 
accompanied by a stack bloat too.

But part of it is definitely gcc. Some versions of gcc used to be 
absolutely _horrid_ when it came to stack usage, especially with some 
flags, and especially with the crazy inlining that module-at-a-time 
caused.

But I'd be really happy if some embedded people tried to take some of that 
bloat back, and aim for 4kB stacks. Because it's definitely not 
unrealistic. At least it _shouldn't_ be. And a lot of the cases of us 
having structures on the stack is actually not worth it, and tends to be 
about being lazy rather than anything else.

			Linus
--

From: Parag Warudkar
Date: Tuesday, August 26, 2008 - 7:36 pm

On Tue, Aug 26, 2008 at 9:49 PM, Linus Torvalds

IIRC the last I tried 4K stacks with x86 was on 2.6.21 - Fedora 7
kernel, around June 07 time frame.

What about deep call chains? The problem with the uptake of 4K stacks
seems to be that is not reliably provable that it will work under all
circumstances.

Parag
--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 7:52 pm

Umm. Neither is 8k stacks. Nobody "proved" anything.

But yes, some subsystems have insanely deep call chains. And yes, things 
like the VFS recursion (for symlinks) makes that deeper yet for 
filesystems, although only on the lookup path. And that is exactly the 
kind of thing that can exacerbate the problem of the compiler artificially 
making for a bigger stack footprint of a function (*).

For things like the VFS layer, right now we allow a nesting level of 8, I 
think. If I remember correctly, it was 5 historically. Part of raising 
that depth, though, was that we actually moved the recursive part into 
fs/namei.c, and the nesting stack-depth was something pretty damn small 
when the filesystem used "follow_link" properly and let the VFS do it for 
it (ie the callchain to actually look up the link could be deep, but it 
would not recurse back, and instead just return a pointer, so that the 
actual _recursive_ part was just __do_follow_link() and is just a few 
words on the stack).

So yes, we do have some deep callchains, but they tend to be pretty well 
managed for _good_ code. The problems tend to be the areas with lots of 
indirection layers, and yeah, XFS, MD and ACPI all have those kinds of 
things.

In an embdedded world, many of those should be a non-issue, though. 

			Linus

(*) ie the function that _is_ on the deep chain doesn't actually need much 
of a stack footprint at all itself, but it may call a helper function that 
is _not_ in the deep chain, and if it gets inlined it may give its 
excessive stack footprint to the deep chain - and this is _exactly_ the 
problem that happened with inlining "load_module()".
--

From: Alan Cox
Date: Wednesday, August 27, 2008 - 1:32 am

On x86-32 with 8K stacks your IRQ paths share them so that is even harder
to prove (not that you can prove any of them) and the bugs are more
obscure and random.
--

From: Paul Mackerras
Date: Tuesday, August 26, 2008 - 11:01 pm

I think your memory is failing you.  In 2.4 and earlier, the kernel
stack was 8kB minus the size of the task_struct, which sat at the
start of the 8kB.  For instance, from include/asm-i386/processor.h for
2.4.29:

#define THREAD_SIZE (2*PAGE_SIZE)
#define alloc_task_struct() ((struct task_struct *) __get_free_pages(GFP_KERNEL,1))
#define free_task_struct(p) free_pages((unsigned long) (p), 1)

Paul.
--

From: Arjan van de Ven
Date: Wednesday, August 27, 2008 - 3:58 am

but was shared with interrupts; so out of the 6Kb left, you had still really only 4Kb for user context stack

--

From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 8:18 am

Yup, you're right.

			Linus
--

From: Adrian Bunk
Date: Wednesday, August 27, 2008 - 4:58 am

That was gcc 3.4.


cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Bernd Petrovitsch
Date: Wednesday, August 27, 2008 - 2:00 am

On Tue, 2008-08-26 at 20:58 -0400, Parag Warudkar wrote:

No, but you can put an additional service(s) on it and sales people have

They are different. Think of running the complete system acting as a
bridge, router and/or firewall (Kernel early 2.4 though) from 4MB flash
in 32MB RAM and - listing the outside visible services - having a
command-line interface, web-GUI (implying a http server) and and a
(net-)SNMP agent on it.
Running a glibc without thread support is win there (implying that there

As soon as product management realizes that there is space left on the
device, they get new ideas and/or customer requirements to run more

There is no question if save space here or there. You save it - sooner

The embedded world of really small devices usually doesn't run XFS (or
ext? or reiser* of jfs or NFS or ...) or stacks block devices on files

They probably gave the idea pretty soon because you need to
rework/improve large parts of the kernel + drivers (and that has two
major problems - it consumes a lot of man power for "no new features and
everything must be completely tested again"[0] and it adds new risks).
And that is practically impossible if one sells "stable driver APIs" for
3rd party (commercial) drivers because these must be changed too.

	Bernd

[0]: Let alone if you (or your customers) need certificates from some
     governmental agencys.
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


--

From: Parag Warudkar
Date: Wednesday, August 27, 2008 - 5:56 am

But not many embedded Linux arches support 4K stacks like Adrian
pointed out earlier.
So the same (lot of man power requirement) would apply to Linux.

Sure it will be good - but how reasonable it is to attempt it and how
reliably it will work under all conceived loads - those are the
questions.

Thanks

Parag
--

From: Bernd Petrovitsch
Date: Wednesday, August 27, 2008 - 6:17 am

What is an "embedded Linux arch"?

Of course. Look at the amount of work done by lots of people in that

If you "develop" an embedded system (which is partly system integration
of existing apps) to be installed in the field, you don't have that many
conceivable work loads compared to a desktop/server system. And you have
a fixed list of drivers and applications.
A usual approach is to run stress tests on several (or all)
subsystems/services/... in parallel and if the device survives it
functioning correctly, it is at least good enough.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


--

From: Jamie Lokier
Date: Wednesday, August 27, 2008 - 8:48 am

Hah!  Not in my line of embedded device.

32MB no-MMU ARM boards which people run new things and attach new
devices to rather often - without making new hardware.  Volume's too
low per individual application to get new hardware designed and made.

I'm seriously thinking of forwarding porting the 4 year old firmware
from 2.4.26 to 2.6.current, just to get new drivers and capabilities.
Backporting is tedious, so's feeling wretchedly far from the mainline

Per application.

Some little devices run hundreds of different applications and
customers expect to customise, script themselves, and attach different
devices (over USB).  The next customer in the chain expects the bits
you supplied to work in a variety of unexpected situations, even when
you advise that it probably won't do that.

Much like desktop/server Linux, but on a small device where silly
little things like 'create a process' are a stress for the dear little
thing.

(My biggest lesson: insist on an MMU next time!)

-- Jamie
--

From: Bernd Petrovitsch
Date: Wednesday, August 27, 2008 - 9:38 am

Yes, you may have several products on the same hardware with somewhat
differing requirements (or not). But that is much less than a general

That sounds reasonable (and I never meant maintaining the old system
infinitely. Actually once the thing is shipped it usually enters deep

ACK. But that also depends on amount local changes (and sorry, but not

Basically their problem. Yes, "they" actually think they get a Linux
system where they can do everything and it simply works.

Oh, that's obviously not a usual "WLAN-router style" of product (where

ACK. We avoid MMU-less hardware too - especially since there is enough
hardware with a MMU around.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


--

From: Jamie Lokier
Date: Wednesday, August 27, 2008 - 10:51 am

It is, but the idea that small embedded systems go through a 'all
components are known, drivers are known, test and if it passes it's

Sounds reasonable, but it's vetoed for anticipated time and cost,
compared with backporting on demand.  Fair enough, since 2.6.current
doesn't support ARM no-MMU last I heard ('soon'?).

On the other hand, the 2.6 anti-fragmentation patches, including
latest SLUB stuff, ironically meant to help big machines, sound really
appealing for my current problem and totally unrealistic to

I can't emphasise enough how much difference MMU makes to Linux userspace.

It's practically: MMU = standard Linux (with less RAM), have everything.
No-MMU = lots of familiar 'Linux' things not available or break.

-- Jamie
--

From: Bernd Petrovitsch
Date: Wednesday, August 27, 2008 - 12:30 pm

Not always but often enough. And yes, there is ARM-based embedded

That is to be expected;-)


ACK. And tell that a customer that everything is more effort and more
risk and not just "simply cross-compile it as it runs on my desktop
too".

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services

--

From: Greg Ungerer
Date: Wednesday, August 27, 2008 - 5:06 pm

And lots of things work in the usual way...

Of course the flip side is that for people who have platforms
without MMU they can run something more than the mostly "toy"
like operating systems typically available. There are plenty of
problem domains that the non-MMU limitations are not a problem for.
(Yours doesn't sound like one of them :-)

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
--

From: Greg Ungerer
Date: Wednesday, August 27, 2008 - 5:11 pm

But given you have hardware you can't change would you choose
to not run Linux, even with the limitations of non-MMU?

Hell no :-)

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
--

From: Bernd Petrovitsch
Date: Wednesday, August 27, 2008 - 1:34 am

Falling prices are no reason to increase the amount of available RAM (or
other hardware).
Especially if you (intend to) build >1E5 devices - where every Euro
counts.

	Bernd
-- 
Firmix Software GmbH                   http://www.firmix.at/
mobil: +43 664 4416156                 fax: +43 1 7890849-55
          Embedded Linux Development and Services


--

From: Adrian Bunk
Date: Tuesday, August 26, 2008 - 4:24 pm

You implicitely assume both would solve the same problem.

While 4kB stacks are something we anyway never got 100% working, the 
cases where gcc inlining functions causes a critical increase in stack 
usage are usually not that hard to find, and once found the fix is 
trivial.

We should anyway monitor stack usages better since we have frequent 
programming errors in this area, and problems caused by gcc can this
way be detected en passant.

You have a good point that aiming at 4kB makes 8kB a very safe choice.

But I do not think the problem you'd solve with 
-fno-inline-functions-called-once is big enough to warrant the size 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 4:51 pm

I'm just saying that your logic doesn't hold water.

If we can save kernel stack usage, then a 1% increase in kernel size is 

What? Don't be silly. 

Linux _historically_ always used 4kB stacks.

No, they are likely not usable on x86-64, but dammit, they should be more 

You continually try to see the inlining as a single solution to one 
problem (debuggability, stack, whatever).

The biggest problem with gcc inlining has always been that it has been 
_unpredictable_. It causes problems in many different ways. It has caused 
stability issues due to gcc versions doing random things. It causes the 
stack expansion. It makes stack traces harder for debugging, etc.

If it was any one thing, I wouldn't care. But it's exactly the fact that 
it causes all these problems in different areas.

			Linus
--

From: Adrian Bunk
Date: Tuesday, August 26, 2008 - 5:23 pm

From some tests the size increase seems to become bigger for smaller 
kernels, but I don't have any really good data.


An interesting question is why most of our architectures for embedded 
devices only offer bigger stacks:

The only architectures offering a 4kB stacks option are:
- m68knommu
- sh
- 32bit x86

The following architectures that are used in embedded devices 
always use 8kB stacks (or bigger) in your tree:
- arm
- avr32
- blackfin
- cris
- frv
- h8300
- m32r
- m68k
- mips
- mn10300 (has an #ifdef CONFIG_4KSTACKS but no kconfig option)
- powerpc


When did we get callpaths like like nfs+xfs+md+scsi reliably 

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 5:28 pm

XFS may never have been usable, but the rest, sure.

And you seem to be making this whole argument an excuse to SUCK, adn an 
excuse to let gcc crap even more on our stack space.

Why?

Why aren't you saying that we should be able to do better? Instead, you 
seem to asking us to do even worse than we do now?

			Linus
--

From: Adrian Bunk
Date: Wednesday, August 27, 2008 - 4:58 am

My main point is:
- getting 4kB stacks working reliably is a hard task
- having an eye on gcc increasing the stack usage, and fixing it if
  required, is relatively easy

If we should be able to do better at getting (and keeping) 4kB stacks 
working, then coping with possible inlining problems caused by gcc

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Paul Mundt
Date: Wednesday, August 27, 2008 - 9:00 am

Out of the architectures you've mentioned for 4k stacks, they also tend
to do IRQ stacks, which is something you seem to have overlooked.

In addition to that, debugging the runaway stack users on 4k tends to be
easier anyways since you end up blowing the stack a lot sooner. On sh
we've had pretty good luck with it, though most of our users are using
fairly deterministic workloads and continually profiling the footprint.
Anything that runs away or uses an insane amount of stack space needs to
be fixed well before that anyways, so catching it sooner is always
preferable. I imagine the same case is true for m68knommu (even sans IRQ
stacks).

Things might be more sensitive on x86, but it's certainly not something
that's a huge problem for the various embedded platforms to wire up,
whether they want to go the IRQ stack route or not.

In any event, lack of support for something on embedded architectures in
the kernel is more often due to apathy/utter indifference on the part of
the architecture maintainer rather than being indicative of any intrinsic
difficulty in supporting the thing in question. Most new "features" on the
lesser maintained architectures tend to end up there either out of peer
pressure or copying-and-pasting accidents rather than any sort of design.
;-)
--

From: Adrian Bunk
Date: Wednesday, August 27, 2008 - 10:35 am

No, I am aware of that, and on i386 IRQ stacks are only used with
4kB stacks.


CONFIG_DEBUG_STACKOVERFLOW should give you the same information, and if

How many platforms use 4kB stacks on sh?

Only 1 out of 34 defconfigs uses it.


arm or powerpc aren't exactly lesser maintained architectures.

4kB has shown to be a hard to achieve limit. After more than 4 years in 
mainline being available on i386 there are still cases where 4kB are not 
enough.

IMHO there seems to currently be a mismatch between it's maintainance 
cost and the actual number of users. That's in my opinion the main 
problem with it, no matter in which direction it gets resolved.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Paul Mundt
Date: Wednesday, August 27, 2008 - 5:32 pm

In some cases, yes. In the CONFIG_DEBUG_STACKOVERFLOW case the check is
only performed from do_IRQ(), which is sporadic at best, especially on
tickless. While it catches some things, it's not a complete solution in
and of iteslf.

In addition to this, there are even fewer platforms that support it than
there are platforms that do 4k stacks. At first glance, it looks like
it's only m32r, powerpc, sh, x86, and xtensa. Others support the Kconfig
option, but don't seem to realize that it's not an option that the kernel
does anything with by itself, and so don't actually do anything (ie,
The defconfigs tend to enable as much random stuff as people are
interested in for development and testing purposes. Most of these end up
being reference boards and are the basis for products, rather than
shipping products themselves. In the latter case, everything is gradually
tightened down, and 4k stack utilization in that case is the norm, rather
Indeed, which is why I find it bizarre that you would even bother
applying what was said to those platforms. Specifically I was referring
to the embedded platforms that don't do 4k stacks today. The fact they
don't support them today has much less to do with 4k being an
unattainable limit as it does with people simply not bothering to
Perhaps that's true on x86, but in general I take issue with that. On sh
we've had to do very little maintenance for it and most shipping products
are using it today (at least on MMU-Linux, we don't bother with it on
nommu). Most of the problems we ran in to with 4k stacks tended to be
stuff that we wanted to fix for 8k anyways. I suspect that this case is
true for the other embedded platforms also.

Note that on sh we also conditionalize IRQ stacks separately, so while
they are often used together, it's possible to use 4k stacks without
resorting to IRQ stacks (as m68knommu also seems to do).
--

From: David Miller
Date: Wednesday, August 27, 2008 - 5:46 pm

[Empty message]
From: Paul Mundt
Date: Wednesday, August 27, 2008 - 6:02 pm

Thanks for the pointer, I'll take a look at it!
--

From: Adrian Bunk
Date: Thursday, August 28, 2008 - 9:16 am

As far as I can see the only architectures that optionally offer 4kB 
stacks today are m68knommu, s390, sh and x86.



Most stack issues are not platform or architecture specific.

The maintainance effort therefore mostly depends on whether a non-zero 
number of architectures uses 4kB stacks.

And if something is considered to be important for small embedded 
systems, but not supported on ARM, MIPS or PowerPC, then that's 
a bit strange.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Greg Ungerer
Date: Wednesday, August 27, 2008 - 6:05 pm

Yep, definitely true for m68knommu in my experience. I haven't had
any problems with 4k stacks recently. But yes the workloads do tend
to be constrained - and almost never use any of the more exotic

Indeed :-)

Regards
Greg


------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
--

From: Alan Cox
Date: Wednesday, August 27, 2008 - 1:25 am

> You have a good point that aiming at 4kB makes 8kB a very safe choice.

Not really no - we use separate IRQ stacks in 4K but not 8K mode on
x86-32. That means you've actually got no more space if you are unlucky
with the timing of events. The 8K mode is merely harder to debug.

If 4K stacks really are not safe then x86-32 really really needs to
switch to using IRQ stacks in 8K stack mode as well.

Alan
--

From: Parag Warudkar
Date: Wednesday, August 27, 2008 - 5:52 am

By your logic though, XFS on x86 should work fine with 4K stacks -
many will attest that it does not and blows up due to stack issues.

I have first hand experiences of things blowing up with deep call
chains when using 4K stacks where 8K worked just fine on same
workload.

So there is definitely some other problem with 4K stacks.

Thanks
Parag
--

From: Alan Cox
Date: Wednesday, August 27, 2008 - 6:21 am

Nothing of the sort. If it blows up with a 4K stack it will almost
certainly blow up with an 8K stack *eventually* - when a heavy stack usage
coincides with a heavy stack using IRQ handler.

You won't catch it in simple testing, you won't catch it in trivial
simulation and it'll be incredibly hard to reproduce. Not the kind of bug
you want in a production system really. IRQ stacks make things much more
predictable.

Alan
--

From: Parag Warudkar
Date: Wednesday, August 27, 2008 - 9:24 am

I see - so if I end up having a workload on 8k where heavy stack using
IRQs and deep kernel call chains come at the same time - even 8K will
blow up.
So 4K will blow too except that it doesn't require IRQs also to use
heavy stack, just XFS is good enough :)

It then seems like the IRQs using lot of stack is not so much of a
problem in the current kernel as much as deeper call chains and stack
usage of normal non-irq path code is.
So 8k makes it possible for the deeper call chains of non-irq path to
survive since they get better part of the 8K to themselves and IRQs
can do with less almost always.

At least that's what I can derive from the fact that we do not have
lots of reports of 8K stack blowing up.

Thanks

Parag
--

From: Jeff Garzik
Date: Tuesday, August 26, 2008 - 12:55 pm

e1000_check_options builds a struct (singular) on the stack, really... 
struct e1000_option is reasonably small.

The problem, which has also shown itself in large ioctl-style switch{} 
statements, is that gcc will generate code such that the stack usage 
from independent code branches

	if {cond1} {
		char buster1[1000];
		foo(buster1);
	} else if (cond2) {
		char buster2[1000];
		foo(buster2);
	}

are added together, not noticed as mutually exclusive.

Of course, adding 'static const' as you noted is a reasonable 
workaround, but gcc is really annoying WRT stack allocation in this manner.

I had problems in the past, before struct ethtool_ops, with like ethtool 
ioctl switch statements using gobs of stack.  In fact, that was a big 
motivation for struct ethtool_ops.

	Jeff


--

From: Linus Torvalds
Date: Tuesday, August 26, 2008 - 1:06 pm

No it doesn't.

Look a bit more closely.

It builds a struct (singular) MANY MANY times. It also then builds up a 
huge e1000_opt_list[] array, even though it is const and should be static 
(and const).

I know. I wrote a patch to FIX it.

Here's the patch. It shrinks the stack from 1152 bytes to 192 bytes (the 
first version, that only did the e1000_option part, got it down to 600 
bytes). About half comes from not using multiple "e1000_option" 
structures, the other half comes from turning the "e1000_opt_list[]" 
arrays into "static const" instead, so that gcc doesn't copy them onto the 
stack.

Most of the patch is actually doing things like turning

	struct struct e1000_option opt = {

(which declares a _new_ e1000_option variable each time) into

	opt = (struct e1000_option) {

which just re-uses the single variable.

It becomes slightly larger than that, because some places the "opt = .." 
had to be moved around, since it's no longer a variable declaration, but a 
regular assignment.

The rest is just adding "const" to the right places, and turning

	struct e1000_opt_list speed_list[] = ..

into

	static const struct e1000_opt_list speed_list[] = ..

instead, and fixing the indentation to be more straightforward.

I have not tested the dang thing, but I think it's correct. And it turns 
stack usage from "totally horrible and broken" into "pretty reasonable".

		Linus

---
 drivers/net/e1000/e1000_param.c |   81 +++++++++++++++++++++-----------------
 1 files changed, 45 insertions(+), 36 deletions(-)

diff --git a/drivers/net/e1000/e1000_param.c b/drivers/net/e1000/e1000_param.c
index b9f90a5..213437d 100644
--- a/drivers/net/e1000/e1000_param.c
+++ b/drivers/net/e1000/e1000_param.c
@@ -208,7 +208,7 @@ struct e1000_option {
 		} r;
 		struct { /* list_option info */
 			int nr;
-			struct e1000_opt_list { int i; char *str; } *p;
+			const struct e1000_opt_list { int i; char *str; } *p;
 		} l;
 	} arg;
 };
@@ -242,7 +242,7 @@ static ...
From: Kok, Auke
Date: Tuesday, August 26, 2008 - 1:14 pm

totally cool patch afaics - if I still maintained this driver I'd have this tested
and merged right away :)

I suppose Jeff Kirsher is already doing so right now.

I suppose that he'll have to look at the other Intel ethernet drivers as well :)

Jeff, please add my:

Reveiewed-by: Auke Kok <auke-jan.h.kok@intel.com>

Cheers,


--

From: Jeff Kirsher
Date: Tuesday, August 26, 2008 - 3:04 pm

You suppose correctly.

Will do.

-- 
Cheers,
Jeff
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 5:44 am

With /just/ DEBUG_PAGE_ALLOC defined, I have seen two general panic types:

o  A new double fault w/ SMP_DEBUG_PAGEALLOC problem (prob4.txt)

o  The NULL pointer dereference @ 0x858 (prob4a.txt)

Enabling SLUB debugging to see what that shows

Alan
From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 6:13 am

Adding in SLUB debugging doesn't show anything new (I think). Example
boot log (w/ initcall_debug enabled) is at:

http://free.linux.hp.com/~adb/bug.11342/prob5.txt

This has happened 3 times in a row as well. Whilst this is being looked
at, I'm going to fast-forward ahead to the latest in Linus' tree, and
see if the problem is still occurring (I think Linus' point earlier
about some sort of rogue timing and/or corruption bug is spot on, but
it's probably better to see how close to "today's tree" I can reproduce
this). I'll also try kernels w/ the problematic merge patch backed out
to see if that still "fixes" (or more likely(?) just patches over the
real problem).

Alan
--

From: Linus Torvalds
Date: Monday, August 25, 2008 - 11:02 am

Yeah, that's a stack overflow.

Confirmed.

			Linus
--

From: Alan D. Brunelle
Date: Monday, August 25, 2008 - 7:05 am

I built a kernel @

commit 83097aca8567a0bd593534853b71fe0fa9a75d69
Author: Arjan van de Ven <arjan@linux.intel.com>
Date:   Sat Aug 23 21:45:21 2008 -0700

And it fails like the others do

o  http://free.linux.hp.com/~adb/bug.11342/prob6.txt

SMP_DEBUG_PAGEALLOC

o  http://free.linux.hp.com/~adb/bug.11342/prob6a.txt

[    7.591198] BUG: unable to handle kernel NULL pointer dereference at
0000000000000858



I then backed out /just/ the merge for

commit 1c89ac55017f982355c7761e1c912c88c941483d
Merge: 88fa08f... b1b135c...
Author: Linus Torvalds <torvalds@linux-foundation.org>
Date:   Tue Aug 12 08:40:19 2008 -0700

And the machine has booted fine 5 times in a row.



I've put the latest .config up at

http://free.linux.hp.com/~adb/bug.11342/config.txt


Is there /some/ way to break down the patches within the merged patch,
and I could by-hand bisect through those?

Here's what I did to take the latest tree, and back out that merge (to
get booting kernels):

git-diff
88fa08f67bee1a0c765237bdac106a32872f57d2..1c89ac55017f982355c7761e1c912c88c941483d
| patch -p1 -R
patching file Documentation/lguest/lguest.c
patching file arch/powerpc/Kconfig
patching file arch/x86/Kconfig
patching file arch/x86/mm/Makefile
patching file drivers/char/hvc_console.c
patching file drivers/lguest/page_tables.c
patching file include/linux/Kbuild
Hunk #1 succeeded at 358 (offset 2 lines).
patching file include/linux/init.h
patching file include/linux/mm.h
patching file init/main.c
patching file kernel/module.c
patching file kernel/stop_machine.c
patching file mm/Kconfig
patching file mm/util.c

Alan
--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11354
Subject		: AMD Elan regression with 2.6.27-rc3
Submitter	: Sean Young <sean@mess.org>
Date		: 2008-08-15 18:37 (9 days old)
References	: http://marc.info/?l=linux-kernel&m=121882578430056&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11355
Subject		: Regression in 2.6.27-rc2 when cross-building the kernel
Submitter	: Larry Finger <Larry.Finger@lwfinger.net>
Date		: 2008-08-16 2:38 (8 days old)
References	: http://marc.info/?l=linux-kernel&m=121885432118368&w=4


--

From: Rafael J. Wysocki
Date: Sunday, August 24, 2008 - 2:34 pm

As I wrote in the Bugzilla, I'm seeing a related problem.

Namely, I build kernels on one box, with 'make O=<target>', then I mount
<target> on another one over NFS, 'cd' to it and try to install the kernel
modules with 'make modules_install'.  This results in 'HOSTCC firmware/ihex2fw'
and 'fatal error: ...: Read-only file system'.  It's readily reproducible.

Commenting out line 1130 of Makefile
("$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.fwinst obj=firmware __fw_modinst")
obviously helps, so it looks like Makefile.fwinst needs fixing.

Thanks,
Rafael
--

From: David Woodhouse
Date: Monday, September 1, 2008 - 2:35 am

I don't like this much, but it should do the trick... please confirm.

diff --git a/firmware/Makefile b/firmware/Makefile
index aab12bf..e7130b5 100644
--- a/firmware/Makefile
+++ b/firmware/Makefile
@@ -166,15 +166,28 @@ $(patsubst %,$(obj)/%.gen.o, $(fw-external-y)): $(obj)/%.gen.o: $(fwdir)/%
 $(obj)/%: $(obj)/%.ihex | $(objtree)/$(obj)/$$(dir %)
 	$(call cmd,ihex)
 
+# Don't depend on ihex2fw if we're installing and it already exists.
+# Putting it after | in the dependencies doesn't seem sufficient when
+# we're installing after a cross-compile, because ihex2fw has dependencies
+# on stuff like /usr/lib/gcc/ppc64-redhat-linux/4.3.0/include/stddef.h and 
+# thus wants to be rebuilt. Which it can't be, if the prebuilt kernel tree
+# is exported read-only for someone to run 'make install'.
+ifeq ($(INSTALL):$(wildcard $(obj)/ihex2fw),install:$(obj)/ihex2fw)
+ihex2fw_dep :=
+else
+ihex2fw_dep := $(obj)/ihex2fw
+endif
+
+
 # .HEX is also Intel HEX, but where the offset and length in each record
 # is actually meaningful, because the firmware has to be loaded in a certain
 # order rather than as a single binary blob. Thus, we convert them into our
 # more compact binary representation of ihex records (<linux/ihex.h>)
-$(obj)/%.fw: $(obj)/%.HEX $(obj)/ihex2fw | $(objtree)/$(obj)/$$(dir %)
+$(obj)/%.fw: $(obj)/%.HEX $(ihex2fw_dep) | $(objtree)/$(obj)/$$(dir %)
 	$(call cmd,ihex2fw)
 
 # .H16 is our own modified form of Intel HEX, with 16-bit length for records.
-$(obj)/%.fw: $(obj)/%.H16 $(obj)/ihex2fw | $(objtree)/$(obj)/$$(dir %)
+$(obj)/%.fw: $(obj)/%.H16 $(ihex2fw_dep) | $(objtree)/$(obj)/$$(dir %)
 	$(call cmd,h16tofw)
 
 $(firmware-dirs):

-- 
dwmw2

--

From: Larry Finger
Date: Monday, September 1, 2008 - 9:51 am

Yes, the patch fixes the problem. Thanks.

Larry
--

From: David Woodhouse
Date: Monday, September 1, 2008 - 10:37 am

Ok, I have a small handful of fixes to send to Linus for 2.6.27; Unless
Sam (or someone else) comes up with a better answer, I'll make sure that
goes with them.

Thanks.

-- 
dwmw2

--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11356
Subject		: Linux 2.6.27-rc3 - build failure: undefined reference to `.lockdep_count_forward_deps'
Submitter	: Frans Pop <elendil@planet.nl>
Date		: 2008-08-16 19:11 (8 days old)
References	: http://marc.info/?l=linux-kernel&m=121891396320127&w=4


--

From: Frans Pop
Date: Saturday, August 23, 2008 - 11:13 pm

Fixed as per: http://marc.info/?l=linux-kernel&m=121898767530602&w=4
Adrian mentioned that he'd closed the bug, but apparently not.
--

From: Rafael J. Wysocki
Date: Sunday, August 24, 2008 - 2:10 pm

The bug is closed now.

Thanks,
Rafael



--

From: Adrian Bunk
Date: Monday, August 25, 2008 - 7:03 am

Sorry, I missed that Rafael had opened two bugs for two people reporting 
the same issue, and only closed the other one.

cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11357
Subject		: Can not boot up with zd1211rw USB-Wlan Stick
Submitter	: uwe <kender@freenet.de>
Date		: 2008-08-16 14:17 (8 days old)


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11358
Subject		: net: forcedeth call restore mac addr in nv_shutdown path
Submitter	: Yinghai Lu <yhlu.kernel@gmail.com>
Date		: 2008-08-17 3:30 (7 days old)
References	: http://marc.info/?l=linux-kernel&m=121894389018584&w=4
Handled-By	: Yinghai Lu <yhlu.kernel@gmail.com>
Patch		: http://marc.info/?l=linux-kernel&m=121894389018584&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11360
Subject		: mpc8xxx_wdt.c doesn't build modular
Submitter	: Dave Jones <davej@redhat.com>
Date		: 2008-08-17 08:07 (7 days old)
References	: http://lkml.org/lkml/2008/8/12/465
Handled-By	: Anton Vorontsov <avorontsov@ru.mvista.com>
Patch		: http://lkml.org/lkml/2008/8/13/344


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11361
Subject		: my servers with nvidia mcp55 nic don't work with msi in second kernel by kexec
Submitter	: Yinghai Lu <yhlu.kernel@gmail.com>
Date		: 2008-08-17 6:25 (7 days old)
References	: http://marc.info/?l=linux-kernel&m=121895439927053&w=4
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>
Patch		: http://marc.info/?l=linux-kernel&m=121917167232014&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11379
Subject		: char/tpm: tpm_infineon no longer loaded for HP 2510p laptop
Submitter	: Frans Pop <elendil@planet.nl>
Date		: 2008-08-18 13:40 (6 days old)
References	: http://marc.info/?l=linux-kernel&m=121906698213329&w=4
Handled-By	: Bjorn Helgaas <bjorn.helgaas@hp.com>


--

From: Frans Pop
Date: Saturday, August 23, 2008 - 11:18 pm

Fixed with:
commit 5e4c6564c95ce127beeefe75e15cd11c93487436
Author: Kay Sievers <kay.sievers@vrfy.org>
Date:   Thu Aug 21 15:28:56 2008 +0200

    pnp: fix "add acpi:* modalias entries"
--

From: Rafael J. Wysocki
Date: Sunday, August 24, 2008 - 2:12 pm

Thanks,  closed.

Rafael
--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11382
Subject		: e1000e: 2.6.27-rc1 corrupts EEPROM/NVM
Submitter	: David Vrabel <david.vrabel@csr.com>
Date		: 2008-08-08 10:47 (16 days old)
References	: http://marc.info/?l=linux-kernel&m=121819267211679&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11380
Subject		: lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16
Submitter	: Ingo Molnar <mingo@elte.hu>
Date		: 2008-08-20 6:44 (4 days old)
References	: http://marc.info/?l=linux-kernel&m=121921480931970&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11388
Subject		: 2.6.27-rc3 warns about MTRR range; only 3 of 16gb of memory is usable
Submitter	: Joshua Hoblitt <j_kernel@hoblitt.com>
Date		: 2008-08-20 17:38 (4 days old)


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11398
Subject		: hda_intel: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj.
Submitter	: Frans Pop <elendil@planet.nl>
Date		: 2008-08-21 17:17 (3 days old)


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11401
Subject		: pktcdvd: BUG, NULL pointer dereference in pkt_ioctl, bisected
Submitter	: Laurent Riffard <laurent.riffard@free.fr>
Date		: 2008-08-22 08:16 (2 days old)


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11404
Subject		: BUG: in 2.6.23-rc3-git7 in do_cciss_intr
Submitter	: rdunlap <randy.dunlap@oracle.com>
Date		: 2008-08-21 5:52 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=121929819616273&w=4
		  http://marc.info/?l=linux-kernel&m=121932889105368&w=4
Handled-By	: Miller, Mike (OS Dev) <Mike.Miller@hp.com>
		  James Bottomley <James.Bottomley@hansenpartnership.com>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11402
Subject		: skbuff bug?
Submitter	: Yinghai Lu <yhlu.kernel@gmail.com>
Date		: 2008-08-21 3:56 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=121929102707658&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11403
Subject		: 2.6.27-rc2 USB suspend regression
Submitter	: Jeremy Fitzhardinge <jeremy@goop.org>
Date		: 2008-08-20 20:48 (4 days old)
References	: http://marc.info/?l=linux-kernel&m=121926536103630&w=4
Handled-By	: Alan Stern <stern@rowland.harvard.edu>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11405
Subject		: 2.6.27-rc3 segfault on cold boot; not on warm boot.
Submitter	: David Greaves <david@dgreaves.com>
Date		: 2008-08-21 9:45 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=121931198904777&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11406
Subject		: patch "x86: MOVE PCI IO ECS code to x86/pci" breaks CPU hotplug
Submitter	: Jan Beulich <jbeulich@novell.com>
Date		: 2008-08-21 12:59 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=121932366326572&w=4
Handled-By	: Ingo Molnar <mingo@elte.hu>
		  Robert Richter <robert.richter@amd.com>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11410
Subject		: SLUB list_lock vs obj_hash.lock...
Submitter	: Daniel J Blueman <daniel.blueman@gmail.com>
Date		: 2008-08-22 21:48 (2 days old)
References	: http://marc.info/?l=linux-kernel&m=121944176609042&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11407
Subject		: suspend: unable to handle kernel paging request
Submitter	: Vegard Nossum <vegard.nossum@gmail.com>
Date		: 2008-08-21 17:28 (3 days old)
References	: http://marc.info/?l=linux-kernel&m=121933974928881&w=4
Handled-By	: Rafael J. Wysocki <rjw@sisk.pl>
		  Pekka Enberg <penberg@cs.helsinki.fi>
		  Pavel Machek <pavel@suse.cz>


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11409
Subject		: build issue #564 for v2.6.27-rc4 : undefined reference to `NS8390p_init'
Submitter	: Toralf Förster <toralf.foerster@gmx.de>
Date		: 2008-08-22 8:33 (2 days old)
References	: http://marc.info/?l=linux-kernel&m=121939410214677&w=4
Handled-By	: Alan Cox <alan@lxorguk.ukuu.org.uk>
Patch		: http://marc.info/?l=linux-kernel&m=121943097320451&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11413
Subject		: get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt()
Submitter	: Mikael Pettersson <mikpe@it.uu.se>
Date		: 2008-08-23 9:48 (1 days old)
References	: http://marc.info/?l=linux-kernel&m=121948503224161&w=4
Handled-By	: Ingo Molnar <mingo@elte.hu>
Patch		: http://marc.info/?l=linux-kernel&m=121950734922457&w=4


--

From: Rafael J. Wysocki
Date: Saturday, August 23, 2008 - 11:10 am

This message has been generated automatically as a part of a report
of recent regressions.

The following bug entry is on the current list of known regressions
from 2.6.26.  Please verify if it still should be listed and let me know
(either way).


Bug-Entry	: http://bugzilla.kernel.org/show_bug.cgi?id=11414
Subject		: Random crashes with 2.6.27-rc3 on PPC
Submitter	: Michael Buesch <mb@bu3sch.de>
Date		: 2008-08-23 14:10 (1 days old)
References	: http://marc.info/?l=linux-kernel&m=121950076812616&w=4


--

From: Linus Torvalds
Date: Sunday, August 24, 2008 - 10:48 am

It would be good to have some kind of bisection of this one, because it 
looks pretty odd. Also, google doesn't find anybody else seeing that 
"segfault at ffffffbf", even though it seems to be very consistent for 
David. So I don't think we'll be able to even _guess_ where it is without 
some more information about exactly when it started happening.

Since it's present in 2.6.26 too, it's clearly not a regression from that 
one, but perhaps more importantly, since it's apparently an old one I'd 
have expected more reports like this if it was some common problem. And 
the warm-vs-cold-boot thing makes me think it's some hardware setup issue. 

Possibly the disk controller, possibly the CPU (eg some MTRR/PAT 
setup issue or TLB thing). But the dmesg's are all from late enough at 
boot that I can't even tell what disk controller it is (except that it is 
SATA), nor can I tell what CPU it is.

But again, if it was some MTRR/PAT issue, I'd expect a _lot_ more reports 
of this. 

MD/XFS sounds unlikely, since they should have absolutely nothing that 
could possibly matter for cold/hot boot.

			Linus
--

From: David Greaves
Date: Sunday, August 24, 2008 - 12:23 pm

OK, that all makes sense.

Given that I'll manage at best 1 bisect/day with a reasonable chance of data
corruption and hardware intermittency screwing it all up I thought it best to
ask first in case there was another debug approach that could work. However
since it does indeed sounds somewhat hardware related and it's an isolated
problem for my wife (as opposed to a problem that others are having too) then I
think she deserves a new machine...

Thanks for the impetus to cheer her up ;)

David
PS if anyone really is interested then I am happy to try the bisection once I've
moved her to a new box; otherwise I'm happy to close this.
--

From: Linus Torvalds
Date: Sunday, August 24, 2008 - 5:51 pm

Well, regardless, I think it would be good to fill in the hardware info, 
especially wrt CPU data and the exact SATA controller you have.

There's another regression for SATA cold/hot boot issues, and while that 
one looks very different, and is probably not really related, it's still a 
good idea to try to see if we can match them up. See

	Bug-Entry       : http://bugzilla.kernel.org/show_bug.cgi?id=11343
	Subject         : SATA Cold Boot Problems with 2.6.27-rc[23] on nVidia 680i
	Submitter       : Manny Maxwell <mannymax@mannymax.net>
	Date            : 2008-08-14 4:16 (10 days old)
	References      : http://marc.info/?l=linux-kernel&m=121868782917600&w=4

which actually has a patch, and which seems to work fine in 2.6.26 (so not 
only is failure pattern different, the point were it starts is different). 

But regardless of the big differences, it does seem to point to some 
weakness in SATA initialization. But is it limited to _that_ particular 
SATA controller, or just a few ones? Or a generic issue? Without more 
reports to really find a pattern, I don't think we have a clue, and the 
two may be _totally_ unrelated in all ways, but it would be good to at 
least report and log the information you have..

Oh, I just noticed that your dmesg _does_ mention sata_sil and sata_via, 
so we know which of two drivers it would be, at least. Not the nVidia one.

However, there's been tons of changes in soem core functions: both the 
reset handling and the wait-for-ready has changed and caused lots of churn 

I think it would be good to try to bisect. It could be something that is 
really just limited to that particular machine (maybe it really is some 
flaky hardware that just triggers some timing changes), but more likely it 
isn't. So the more information, the better. So keep the thing open as long 
as somebody is willing to try to gather more info, by all means.

		Linus
--

From: Linus Torvalds
Date: Sunday, August 24, 2008 - 11:03 am

This one now has a suggested patch for Daniel to try from Vegard, but no 
reply yet:

	http://marc.info/?l=linux-kernel&m=121946972307110&w=4

Vegard, I think your patch is a bit odd, though. The result of your patch 
is

 - first loop:

	hlist_for_each_entry_safe(obj, node, tmp, &db->list, node) {
		hlist_del(&obj->node);
		hlist_add_head(&obj->node, &freelist);
	}

   and quite frankly, I don't see what the difference between that and a 
   something like a simple

	struct hlist_node *first = bd->list.first;
	if (first) {
		bd->list.first = NULL;
		first->pprev = &first;
	}

   really is?

I dunno. We don't have list splicing ops for the hlist things.

		Linus
--

From: Vegard Nossum
Date: Sunday, August 24, 2008 - 11:43 am

On Sun, Aug 24, 2008 at 8:03 PM, Linus Torvalds


Hm.

I haven't really used the hlists before, so my first instinct was to
do what is obvious. That's also why I put the XXX comment. Other than
that, I guess open-coding list ops is also not very good programming
practice? :-)

But... feel free to submit your own patch. Oh, what am I saying.


Vegard

-- 
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
	-- E. W. Dijkstra, EWD1036
--

From: Linus Torvalds
Date: Sunday, August 24, 2008 - 11:58 am

I do agree that the hlist versions aren't very nice in this regard. The 

Agreed. It would be better if the people who use hlists most (I think that 

Silly boy. Next you'll ask me to _test_ any patches I send out.

Anyway, I think your patch is likely fine, I just thought it looked a bit 
odd to have a loop to move a list from one head pointer to another.

But regardless, it would need some testing. Daniel?

			Linus
--

From: Daniel J Blueman
Date: Monday, August 25, 2008 - 6:03 am

Hi Linus, Vegard,

On Sun, Aug 24, 2008 at 7:58 PM, Linus Torvalds

This opens another lockdep report at boot-time [1] - promoting
pool_lock may not be the best fix?

We then see a new deadlock condition (on the pool_lock spinlock) [2],
which seemingly was avoided by taking the debug-bucket lock first.

We reproduce this by booting with debug_objects=1 and causing a lot of activity.

Daniel

--- [1]

[ INFO: possible irq lock inversion dependency detected ]
2.6.27-rc4-225c-debug #3
---------------------------------------------------------
rcu_sched_grace/9 just changed the state of lock:
 (pool_lock){-...}, at: [<ffffffff80466c2c>] free_object+0x7c/0xc0
but this lock was taken by another, hard-irq-safe lock in the past:
 (xtime_lock){++..}

and interrupts could create inverse lock ordering between them.


other info that might help us debug this:
no locks held by rcu_sched_grace/9.

the first lock's dependencies:
-> (pool_lock){-...} ops: 59 {
   initial-use  at:
                                       [<ffffffff8026f795>]
__lock_acquire+0x1a5/0x1160
                                       [<ffffffff802707e1>]
lock_acquire+0x91/0xc0
                                       [<ffffffff806a2bb1>] _spin_lock+0x41/0x80
                                       [<ffffffff804675fa>]
__debug_object_init+0x14a/0x3e0
                                       [<ffffffff804678df>]
debug_object_init+0x1f/0x30
                                       [<ffffffff80260eee>]
hrtimer_init+0x2e/0x50
                                       [<ffffffff80237571>]
init_rt_bandwidth+0x41/0x60
                                       [<ffffffff812f3810>]
sched_init+0x72/0x63d
                                       [<ffffffff812e17e2>]
start_kernel+0x19c/0x456
                                       [<ffffffff812e10a9>]
x86_64_start_reservations+0x99/0xb6
                                       [<ffffffff812e11d0>]
x86_64_start_kernel+0xe2/0xe9
                                       ...
From: Linus Torvalds
Date: Sunday, August 24, 2008 - 11:34 am

This one looks irritating.

It's bisected to 5b6155ee70e9c4d2ad7e6f514c8eee06e2711c3a ("pktcdvd: push 
BKL down into driver"), but the problem goes deeper than that.

The "unlocked" ioctl's do not get a "struct inode *" pointer, they _only_ 
get the "struct file *". And this is very much historical usage, where 
some internal functions only passed in the inode (good or not, whatever).

And ioctl_by_bdev() doesn't have a "struct file *" and has depended on 
passing in a NUMM "struct file *" and its own "struct inode *", and 
expects the ioctl's to just use that instead. But the unlocked ioctl just 
drops it on the floor, and uses just the (unusable) file pointer.

Grr.

And some other cases (like pkt_ioctl() itself) that simply pass in a 
_different_ inode than the file itself is attached to. It does

	blkdev_ioctl(pd->bdev->bd_inode, file, cmd, arg);

where "file" points to the pkt_ioctl thing, but "inode" points to the 
inode "behind" the pkt interface.

Double grr.

I really think the only sane model is to literally make "unlocked_ioctl()" 
have the same calling convention as the old "ioctl()" thing had, and pass 
in both file * and inode *. It was a stupid "cleanup" to try to have a 
simpler interface for the unlocked version. Having two different models, 
where we have actually _depended_ on the old model and then are trying to 
convert to a (weaker) new model, is not a good idea.

The alternative is to do this _only_ for the blkdev_ioctl's, and have 
those only take the "inode *", and then create a new fake "struct file *" 
to go with it, regardless of what "struct file" was passed in (exactly 
because the blockdev ones really think that the inode is the important 
part).

Hmm?

We need to fix this.

		Linus
--

From: Peter Osterlund
Date: Wednesday, August 27, 2008 - 1:17 pm

Why not just revert the offending change and try again during the next
merge window, assuming someone has figured out an acceptable way to
handle this mess by then?

-- 
Peter Osterlund - petero2@telia.com
http://web.telia.com/~u89404340
--

From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 1:40 pm

Well,, for 2.6.27 that's what we'll have to do. But there's actually a 
real problem here - the unlocked ioctl's (which we _should_ prefer) have a 
strictly weaker and worse interface. I also wonder if any other 
block_ioctl users were converted..

Anyway, I'll take your email as an ack for the revert.

		Linus
--

From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 1:45 pm

Well, doing

	git log -p v2.6.26.. -Sunlocked_ioctl

and looking for blkdev_ioctl, that does seem to be the only one. So 
hopefully no other case like this is lurking, although it is possible that 
non-block areas have similar issues.

			Linus
--

From: Christoph Hellwig
Date: Thursday, August 28, 2008 - 6:52 am

Actually both interfaces are a fscking disaster.  The right things to
pass is neither and inode nor a file but a struct block_device.  Al had
all this work done a while and it just needs rebasing to a current tree:

	http://git.kernel.org/?p=linux/kernel/git/viro/bdev.git;a=summary
--

From: Jens Axboe
Date: Tuesday, September 2, 2008 - 12:26 am

Completely agreed. Al, I remember talking to you about this at the
storage summit back in february. What are your current plans wrt moving
this forward?

-- 
Jens Axboe

--

From: Al Viro
Date: Tuesday, September 2, 2008 - 7:06 pm

Rebased, with nfs parts of fmode_t patch taken out (irrelevant for
bdev anyway and really better off in intent-killing queue).  Other
than that, it's a straight port...  Same place, same branch.
--

From: Jens Axboe
Date: Thursday, September 4, 2008 - 3:13 am

So what's your plan with this - 2.6.28?

-- 
Jens Axboe

--

From: Al Viro
Date: Sunday, September 14, 2008 - 10:30 pm

Yes.  The only nastiness is around drivers/ide - there it gets a bunch of
annoying conflicts from the ide-{disk,floppy}_ioctl.c splitoff.  Other than
that, it's trivially ported on top of current linux-next.  Merge order is
going to be interesting - depending on whether block merge happens before
or after ide one.

I'm going to put linux-next-based series on kernel.org tonight, before
going to Portland...
--

From: Alan Cox
Date: Wednesday, August 27, 2008 - 3:08 pm

Easier just to fix it. Its a case of building everything until it
compiles with the prototype change. Almost all stuff  will just take the
argument initially and not use it.

Anyone else plan to do it or shall I hit all the x86 cases and post a
patch ?

--

From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 3:38 pm

Well, I alrady reverted it, but if you actually fix unlocked_ioctl() to 
have the same calling convention as regular ioctl() then a lot of the 
noise from ioctl conversion goes away, and all that remains is literally 
just the BKL part.

Btw, why is unlocked_ioctl returning "long"? Does anybody depend on that 
too? That's another difference between the "unlocked" and the traditional 
version..

As to the "x86 cases", I think you should try to hit them all. Doing a 
"git grep unlocked_ioctl" gets 185 entries, and it looks like only 
something like 8 of them are non-x86 (3 in the arch/ directory, five in 
s390 drivers).

Of course, some of them may be drivers that aren't available on x86 for 
other reasons (ie the ARM embedded stuff), but regardless..

Anyway, the pure size of that patch makes me suspect that we might as well 
leave it until the next merge window, but if you do it and it's obviously 
totally mechanical, I'd be likely to just let it slip in early.

		Linus
--

From: David Miller
Date: Wednesday, August 27, 2008 - 3:43 pm

From: Linus Torvalds <torvalds@linux-foundation.org>

The return values want to be "long" sign extended all the way back
down to syscall dispatch, I think this is just an effort to add
some consistency here so that the int --> long extension eventually
can be eliminated once unlocked_ioctl is the only case left.
--

From: Alexey Dobriyan
Date: Wednesday, August 27, 2008 - 3:45 pm

Anybody doing this, don't forget to actually use "inode" instead of all those
dereferences:

	struct inode *inode = filp->f_path.dentry->d_inode;

--

From: Alan Cox
Date: Wednesday, August 27, 2008 - 3:28 pm

I don't know - a lot of syscall returns got defined as long and I guess

I'll take a crack at it tomorrow - but if its 185 entries then it
probably wants to go into -next instead.

Alan
--

From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 4:00 pm

Being more careful.. This:

	git grep 'unlocked_ioctl.*=' |
		sed 's/^.*=[ 	]*\([_a-zA-Z0-9]*\).*$/\1/' |
		uniq | wc

says that ther are 160 distinct cases. I'm not sure it catches everything 
exactly, but it will be reasonably close, at least.

I wonder if I could essentially automate something to do the conversion..

		Linus
--

From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 4:12 pm

[Empty message]
From: Linus Torvalds
Date: Wednesday, August 27, 2008 - 5:35 pm

What the hell. Here's a test patch. A largish part of it was generated 
through a stupid script that basically did a number of grep + 'sed' on a 
lot of files, and then the rest was fixed up manually after running "make 
allmodconfig".

I'm not going to guarantee anything, but it gets close. A starting point 
for somebody else, and considering that it is

 208 files changed, 370 insertions(+), 376 deletions(-)

this is definitely linux-next material.

The extra deletions are mainly because the passing of "inode" as an 
argument means that some functions don't need to look it up manually any 
more.

And yeah, I changed the return type to "int". There's no way the kernel 
can validly return anything bigger than that anyway. And this way all the 
ioctl functions have the same type, no confusion.

TOTALLY UNTESTED apart from the fact that it compiles.

			Linus

---
 arch/mips/sibyte/common/sb_tbprof.c   |    2 +-
 arch/parisc/kernel/perf.c             |    4 +-
 arch/sparc/kernel/apc.c               |    2 +-
 arch/x86/kernel/apm_32.c              |    2 +-
 arch/x86/kernel/cpu/mcheck/mce_64.c   |    2 +-
 arch/x86/kernel/cpu/mtrr/if.c         |    3 +-
 block/bsg.c                           |    2 +-
 block/compat_ioctl.c                  |   18 +++++++++++-----
 block/ioctl.c                         |    3 +-
 drivers/block/DAC960.c                |    2 +-
 drivers/block/cciss.c                 |    4 +-
 drivers/block/loop.c                  |    3 +-
 drivers/block/paride/pt.c             |    4 +-
 drivers/char/agp/agp.h                |    2 +-
 drivers/char/agp/compat_ioctl.c       |    2 +-
 drivers/char/agp/frontend.c           |    2 +-
 drivers/char/ds1302.c                 |    2 +-
 drivers/char/dsp56k.c                 |    2 +-
 drivers/char/efirtc.c                 |    4 +-
 drivers/char/ip2/ip2main.c            |    6 ++--
 drivers/char/ip27-rtc.c               |    4 +-
 drivers/char/ipmi/ipmi_devintf.c      |    2 +-
 ...
From: Linus Torvalds
Date: Sunday, August 24, 2008 - 11:52 am

Peter? Ingo? Alok?

This _looks_ like it might be due to "x86: merge the TSC cpu-freq code" 
thing by Alok, where we do this:

	+static struct notifier_block time_cpufreq_notifier_block = {
	+       .notifier_call  = time_cpufreq_notifier
	+};
	+
	+static int __init cpufreq_tsc(void)
	+{
	+       cpufreq_register_notifier(&time_cpufreq_notifier_block,
	+                               CPUFREQ_TRANSITION_NOTIFIER);
	+       return 0;
	+}

but that's just _insane_ if the CPU doesn't even support TSC to begin 
with. Also, in the actual time_cpufreq_notifier(), we do:

	if (cpu_has(&cpu_data(freq->cpu), X86_FEATURE_CONSTANT_TSC))
		return 0;

and this is stupid because:

 (a) if the CPU has no TSC at all, then it sure as hell won't have a 
     _constant_ one, so we'll actually continue into the function.

 (b) and why the hell is this done at run-time in the notifier, and not in 
     the "cpufreq_tsc" init function? If anybody mixes totally different 
     kinds of CPU's in SMP, they deserve whatever they want.

so why is the patch not something like the appended?

Sean, does this make any difference for you?

		Linus

---
 arch/x86/kernel/tsc.c |    4 ++++
 1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46af716..9bed5ca 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -325,6 +325,10 @@ static struct notifier_block time_cpufreq_notifier_block = {
 
 static int __init cpufreq_tsc(void)
 {
+	if (!cpu_has_tsc)
+		return 0;
+	if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+		return 0;
 	cpufreq_register_notifier(&time_cpufreq_notifier_block,
 				CPUFREQ_TRANSITION_NOTIFIER);
 	return 0;
--

From: Sean Young
Date: Sunday, August 24, 2008 - 3:50 pm

Yes, this patch fixes it.

Thanks
Sean
--

From: H. Peter Anvin
Date: Sunday, August 24, 2008 - 5:16 pm

I added this patch to x86/urgent.

	-hpa
--

From: Linus Torvalds
Date: Sunday, August 24, 2008 - 12:03 pm

Hmm. Wasn't this already confirmed to be fixed by commit
df60a8441866153d691ae69b77934904c2de5e0d? 

At least Adrian sent out an email saying "Confirmed, bug closed.", but 
bugzilla seems to disagree and still show it as open.

		Linus
--

From: Adrian Bunk
Date: Sunday, August 24, 2008 - 12:23 pm

There were two different reports, Rafael opened a bug for each, and I 
missed that there were two open bugs for the same issue.

The one I closed was #11344.


cu
Adrian

-- 

       "Is there not promise of rain?" Ling Tan asked suddenly out
        of the darkness. There had been need of rain for many days.
       "Only a promise," Lao Er said.
                                       Pearl S. Buck - Dragon Seed

--

From: Rafael J. Wysocki
Date: Sunday, August 24, 2008 - 2:40 pm

On Saturday, 23 of August 2008, Rafael J. Wysocki wrote:


Jeff, do you have the patches for these two in your queue?

Rafael
--

From: Benjamin Herrenschmidt
Date: Sunday, August 24, 2008 - 5:48 pm

This appears to be a gcc bug when -fno-omit-stack-pointer is used (which
we mostly don't need on ppc anyway except that another gcc stupidity makes
it mandatory for -pg which ftrace needs).

We're working on a two fold workaround: removing -fno-omit-stack-pointer
in all the cases where we don't really need it, and for when we do (ie,
CONFIG_FTRACE becaues of -pg), using -mno-sched-epilog which seems to
work around it.

The root cause in gcc hasn't been fully identified yet.

Ben.


--

From: Rafael J. Wysocki
Date: Monday, August 25, 2008 - 4:40 am

Thanks Ben.

I've already dropped it from the list of recent regressions.

Rafael
--

Previous thread: Re: HPET regression in 2.6.26 versus 2.6.25 -- found another user with the same regression by David Witbrodt on Saturday, August 23, 2008 - 12:29 pm. (1 message)

Next thread: Re: HPET regression in 2.6.26 versus 2.6.25 -- found another user with the same regression by David Witbrodt on Saturday, August 23, 2008 - 12:47 pm. (3 messages)