This message contains a list of some regressions from 2.6.26, for which there are no fixes in the mainline I know of. If any of them have been fixed already, please let me know. If you know of any other unresolved regressions from 2.6.26, please let me know either and I'll add them to the list. Also, please let me know if any of the entries below are invalid. Each entry from the list will be sent additionally in an automatic reply to this message with CCs to the people involved in reporting and handling the issue. Listed regressions statistics: Date Total Pending Unresolved ---------------------------------------- 2008-08-23 122 48 40 2008-08-16 103 47 37 2008-08-10 80 52 31 2008-08-02 47 31 20 Unresolved regressions ---------------------- Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11414 Subject : Random crashes with 2.6.27-rc3 on PPC Submitter : Michael Buesch <mb@bu3sch.de> Date : 2008-08-23 14:10 (1 days old) References : http://marc.info/?l=linux-kernel&m=121950076812616&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11410 Subject : SLUB list_lock vs obj_hash.lock... Submitter : Daniel J Blueman <daniel.blueman@gmail.com> Date : 2008-08-22 21:48 (2 days old) References : http://marc.info/?l=linux-kernel&m=121944176609042&w=4 Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11407 Subject : suspend: unable to handle kernel paging request Submitter : Vegard Nossum <vegard.nossum@gmail.com> Date : 2008-08-21 17:28 (3 days old) References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Pekka Enberg <penberg@cs.helsinki.fi> Pavel Machek <pavel@suse.cz> Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11406 Subject : patch "x86: MOVE PCI IO ECS code to x86/pci" breaks CPU hotplug Submitter : Jan Beulich <jbeulich@novell.com> Date : ...
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11141 Subject : no battery or DC status - Dell i1501 Submitter : Gu Rui <chaos.proton@gmail.com> Date : 2008-07-21 19:43 (34 days old) Handled-By : Zhao Yakui <yakui.zhao@intel.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11207 Subject : VolanoMark regression with 2.6.27-rc1 Submitter : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Date : 2008-07-31 3:20 (24 days old) References : http://marc.info/?l=linux-kernel&m=121747464114335&w=4 Handled-By : Zhang, Yanmin <yanmin_zhang@linux.intel.com> Peter Zijlstra <a.p.zijlstra@chello.nl> Dhaval Giani <dhaval@linux.vnet.ibm.com> Miao Xie <miaox@cn.fujitsu.com> Patch : http://marc.info/?l=linux-kernel&m=121922991027344&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11191 Subject : 2.6.26-git8: spinlock lockup in c1e_idle() Submitter : Mikhail Kshevetskiy <mikhail.kshevetskiy@gmail.com> Date : 2008-07-24 03:22 (31 days old) References : http://lkml.org/lkml/2008/7/23/317 Handled-By : Thomas Gleixner <tglx@linutronix.de> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11209 Subject : 2.6.27-rc1 process time accounting Submitter : Lukas Hejtmanek <xhejtman@ics.muni.cz> Date : 2008-07-31 10:43 (24 days old) References : http://marc.info/?l=linux-kernel&m=121750102917490&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11210 Subject : libata badness Submitter : Kumar Gala <galak@kernel.crashing.org> Date : 2008-07-31 18:53 (24 days old) References : http://marc.info/?l=linux-ide&m=121753059307310&w=4 Handled-By : Ben Dooks <ben-linux@fluff.org> --
FWIW, http://marc.info/?l=linux-kernel&m=121754161727539&w=4 So IMO handled-by is Kumar? --
OK --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11215 Subject : INFO: possible recursive locking detected ps2_command Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-07-31 9:41 (24 days old) References : http://marc.info/?l=linux-kernel&m=121749737011637&w=4 Handled-By : Peter Zijlstra <a.p.zijlstra@chello.nl> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11219 Subject : KVM modules break emergency reboot Submitter : Zdenek Kabelac <zdenek.kabelac@gmail.com> Date : 2008-08-01 20:25 (23 days old) References : http://marc.info/?l=linux-kernel&m=121762241105336&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11220 Subject : Screen stays black after resume Submitter : Nico Schottelius <nico@schottelius.org> Date : 2008-07-31 21:05 (24 days old) References : http://marc.info/?l=linux-kernel&m=121753882422899&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11230 Subject : Kconfig no longer outputs a .config with freshly updated defconfigs Submitter : Josh Boyer <jwboyer@linux.vnet.ibm.com> Date : 2008-08-02 16:03 (22 days old) References : http://marc.info/?l=linux-kernel&m=121769306319391&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11254 Subject : KVM: fix userspace ABI breakage Submitter : Adrian Bunk <bunk@kernel.org> Date : 21 Jul 2008 17:58:26 (0 days old) References : http://lkml.org/lkml/2008/7/21/197 Handled-By : Adrian Bunk <bunk@kernel.org> Patch : http://lkml.org/lkml/2008/7/21/197 --
The discussion in Bugzilla whether it is a regression at all can be
condensed to the following question:
Can a struct that is part of the 2.6.26 userspace headers be defined to
be part of an "experimental ABI" and therefore be changed?
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
It is part of the experimental ABI. However, as I'm going to apply your patch (as being the simplest fix, and as there is no measurable performance impact), the question is moot. -- error compiling committee.c: too many arguments to function --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11224 Subject : Only three cores found on quad-core machine. Submitter : Dave Jones <davej@redhat.com> Date : 2008-08-01 18:15 (23 days old) References : http://marc.info/?l=linux-kernel&m=121761475224719&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11264 Subject : Invalid op opcode in kernel/workqueue Submitter : Jean-Luc Coulon <jean.luc.coulon@gmail.com> Date : 2008-08-07 04:18 (17 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11271 Subject : BUG: fealnx in 2.6.27-rc1 Submitter : Jaswinder Singh <jaswinderlinux@gmail.com> Date : 2008-08-05 14:58 (19 days old) References : http://marc.info/?l=linux-netdev&m=121794762016830&w=4 http://lkml.org/lkml/2008/8/10/98 Handled-By : Francois Romieu <romieu@fr.zoreil.com> --
Jaswinder, does reverting 28cd4289abc2c8db90344ee4ff064a9bdf086fdf help? That's the only material change to fealnx itself in years. If not, any chance you could bisect this problem, and add more info to the bug? --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11272 Subject : BUG: parport_serial in 2.6.27-rc1 for NetMos Technology PCI 9835 Submitter : Jaswinder Singh <jaswinderlinux@gmail.com> Date : 2008-08-05 15:12 (19 days old) References : http://marc.info/?l=linux-kernel&m=121794900319776&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11237 Subject : corrupt PMD after resume Submitter : Alan Jenkins <alan-jenkins@tuffmail.co.uk> Date : 2008-08-02 9:51 (22 days old) References : http://marc.info/?l=linux-kernel&m=121767073424952&w=4 Handled-By : Hugh Dickins <hugh@veritas.com> Jeremy Fitzhardinge <jeremy@goop.org> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11276 Subject : build error: CONFIG_OPTIMIZE_INLINING=y causes gcc 4.2 to do stupid things Submitter : Randy Dunlap <randy.dunlap@oracle.com> Date : 2008-08-06 17:18 (18 days old) References : http://marc.info/?l=linux-kernel&m=121804329014332&w=4 http://lkml.org/lkml/2008/7/22/353 Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com> Patch : http://lkml.org/lkml/2008/7/22/364 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11279 Subject : 2.6.27-rc0 Power Bugs with HP/Compaq Laptops Submitter : Matt Parnell <mparnell@gmail.com> Date : 2008-08-07 14:57 (17 days old) References : http://marc.info/?l=linux-kernel&m=121812108031685&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11282 Subject : Please fix x86 defconfig regression Submitter : Andi Kleen <andi@firstfloor.org> Date : 2008-08-07 20:46 (17 days old) References : http://marc.info/?l=linux-kernel&m=121814188805666&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11308 Subject : tbench regression on each kernel release from 2.6.22 -&gt; 2.6.28 Submitter : Christoph Lameter <cl@linux-foundation.org> Date : 2008-08-11 18:36 (13 days old) References : http://marc.info/?l=linux-kernel&m=121847986119495&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11335 Subject : 2.6.27-rc2-git5 BUG: unable to handle kernel paging request Submitter : Randy Dunlap <randy.dunlap@oracle.com> Date : 2008-08-12 4:18 (12 days old) References : http://marc.info/?l=linux-kernel&m=121851477201960&w=4 http://lkml.org/lkml/2008/8/16/274 Handled-By : Hugh Dickins <hugh@veritas.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11336 Subject : 2.6.27-rc2:stall while mounting root fs Submitter : Torsten Kaiser <just.for.lkml@googlemail.com> Date : 2008-08-12 12:37 (12 days old) References : http://marc.info/?l=linux-kernel&m=121854484015909&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11334 Subject : myri10ge: use ioremap_wc: compilation failure on ARM Submitter : Martin Michlmayr <tbm@cyrius.com> Date : 2008-08-10 11:25 (14 days old) References : http://marc.info/?l=linux-netdev&m=121836771727632&w=2 --
Yes, this is still there. -- Martin Michlmayr http://www.cyrius.com/ --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11343 Subject : SATA Cold Boot Problems with 2.6.27-rc[23] on nVidia 680i Submitter : Manny Maxwell <mannymax@mannymax.net> Date : 2008-08-14 4:16 (10 days old) References : http://marc.info/?l=linux-kernel&m=121868782917600&w=4 --
hmmmm. Looking at changes between the two csets listed in the email (623fa57..8f616cd), all of them are driver-specific and unrelated to Manny's hardware except for commit 2486fa561a3192bbbec39c7feef87a1e07bd6342 Author: Tejun Heo <tj@kernel.org> Date: Thu Jul 31 07:52:40 2008 +0900 libata: update atapi disable handling So you could try to revert that and see what happens. But given that small range of changes, it really seems like something else, maybe in the PCI subsystem (random guess). Looking at the entire kernel, nothing jumps out, either. Its mostly fs updates (ext4, xfs), a networking update, an ARM update, and a libata update. Also, some reset-related fixes just went in, so re-testing the latest -git would be helpful as well. Jeff --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11340 Subject : LTP overnight run resulted in unusable box Submitter : Alexey Dobriyan <adobriyan@gmail.com> Date : 2008-08-13 9:24 (11 days old) References : http://marc.info/?l=linux-kernel&m=121861951902949&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11342 Subject : Linux 2.6.27-rc3: kernel BUG at mm/vmalloc.c - bisected Submitter : Alan D. Brunelle <Alan.Brunelle@hp.com> Date : 2008-08-13 23:03 (11 days old) References : http://marc.info/?l=linux-kernel&m=121866876027629&w=4 Handled-By : Andrew Morton <akpm@linux-foundation.org> --
This one makes no sense. It's triggering a BUG_ON(in_interrupt()), but then the call chain shows that there is no interrupt going on. Also, the bisection is senseless - there's a trivial change wrt "do_one_initcall()" that got merged, but everything else is trivial about lguest and has nothing to do with the whole CPU-init thing. But if it was that initcall one, then "git bisect" woul have pointed to it, not the merge. And the merge itself had no conflicts or anything else going on.. The fact that it came and went later also implies that it's probably just some timing-dependent thing or some subtle memory corruption, making the bisection result even less likely to be exact. But I'm adding Arjan and Rusty to the Cc, because that merge was takign Rusty's branch, and the "do_one_initcall()" is Arjan's commit. Since undoing that merge apparently does fix it, I'm wondering if something there just does end up triggering the problem. The do_one_commit() thing _is_ in the path of sys_init_module(), so it _is_ at least somewhat relevant from an oops standpoint. One thing the "do_one_commit()" thing does is to put more pressure on the stack due to that whole buffer for the printk's going on. Alan, can you try - seeing how consistent it is with one kernel (ie boot a known-bad kernel a few times just to see if it really is 100% consistent) - try enabling 'initcall_debug' on the kernel command line, to (a) see the new code actually do something and (b) see what it is actually calling just before. Hmm.. Linus --
but it's 64 bit.. with 8Kb stack and separate irq stacks. I'd be surprised if we blow that this easily. the trace is a tad long with a long ACPI call chain. Wonder what gcc is in use? (newer ones tend to be a ton better... but maybe Alex is using a really old one) --
I'm running Ubuntu 8.04 w/ gcc: gcc (GCC) 4.2.3 (Ubuntu 4.2.3-2ubuntu7) Alan --
Ahh, later in that thread there's another totally unrelated oops in debug_mutex_add_waiter(). I'd guess that it is really wild pointer corrupting memory, quite possibly due to a double free or something like that. Alan - it would be good to run with DEBUG_PAGE_ALLOC and SLUB debugging etc if you don't already do that? Linus --
I'll add those in - as to the repeatability: The "bad" kernels seem to repeat quite reliably - not only in terms of counts (5 or 6 times in a row before trying something else), but also in terms of the "what" - either the original issue () or the other kernel with the later issue (debug_mutex_add_waiter). That's /goodness/ in that it should help narrow it down. I'll make sure the kernel is still failing this morning, and then add in DEBUG_PAGE_ALLOC and if that doesn't help, SLUB debugging... Alan --
Before adding any more debugging, this is the status of my kernel boots: 3 times in a row w/ this same error. (Primary problem is the same, secondary stacks differ of course.) Alan
Uhhuh! The previous "modprobe" uses stack like mad. It could be "fuse_init()" that has done it, but looking at fuse, I seriously doubt it. It doesn't seem to do anything particularly bad. So something has used over 6kB of stack, and it may well be the module loading code itself. This really looks like ti->task->blocked_on = waiter; where "ti->task" is NULL. You probably have almost everything enabled in order to turn "struct task_struct" that big, but judging by your register state it's really an offset off a NULL pointer, not some small integer. Now, there is no way "ti->task" can _possibly_ be NULL. No way. Well, except that "ti" is just below the stack, and if you had a stack overflow that overwrote it. So I seriously do believe that you have run out of stack. If that is true, then it's quite likely that with DEBUG_PAGE_ALLOC you'll actually get a double fault, which in turn is fairly hard to debug (you look at it wrong and it turns into a triple fault which is going to just reboot your machine immediately). Now, the stack oveflow probably happened a few calls earlier (and just left your thread_info corrupted), but there is more reason to believe you Here there is only 408 bytes left, which is _way_ too little, but it's also an optimistic measure. What the stack code usage code does is to just see how many zeroes it can find on the stack. If you have a big stack frame somewhere, it's quite possible that it actually used all your stack and then some, but left a bunch of zeroes around. And the do_exit() oops is simply because once the thread_info is corrupted, all the basic thread data structures are crap, and yes, you're almost guaranteed to oops at that point. Could you make your kernel image available somewhere, and we can take a look at it? Some versions of gcc are total pigs when it comes to stack usage, and your exact configuration matters too. But yes, module loading is a bad case, for me "sys_init_module()" ...
I bet this one-liner will probably make your kernel work. It's not a full
solution, but it will make the module-loading path lose _all_ of the above
stack slots by just not inlining "load_module()" - the stack slots will
still be used when the module is _loaded_, but by the time we actually
callt he ->init function they will have been released since it's not all
in the same crazy function any more.
I _seriously_ believe that we were better off back when gcc only inlined
what we told it to inline, and never inlined on its own. The gcc inlining
logic is pure and utter sh*t in an environment like the kernel where stack
space is a valuable resource.
Anyway, Alan, even if this solves your particular problem, I'd still like
to see your kernel image, so that I can hunt for other problems like
this..
Linus
---
kernel/module.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/module.c b/kernel/module.c
index 08864d2..9db1191 100644
--- a/kernel/module.c
+++ b/kernel/module.c
@@ -1799,7 +1799,7 @@ static void *module_alloc_update_bounds(unsigned long size)
/* Allocate and load the module: note that size of section 0 is always
zero, and we rely on this for optional sections. */
-static struct module *load_module(void __user *umod,
+static noinline struct module *load_module(void __user *umod,
unsigned long len,
const char __user *uargs)
{
--
Mine has: Dump of assembler code for function sys_init_module: 0xffffffff802688c0 <sys_init_module+0>: push %rbp 0xffffffff802688c1 <sys_init_module+1>: mov %rsp,%rbp 0xffffffff802688c4 <sys_init_module+4>: sub $0x1c0,%rsp 0xffffffff802688cb <sys_init_module+11>: mov %r12,-0x20(%rbp) 0xffffffff802688cf <sys_init_module+15>: mov %rdi,%r12 so 448 bytes. The kernel is up at: http://free.linux.hp.com/~adb/bug.11342/vmlinux (if you would let me know when you are through with it so I can free up some space there I'd appreciate it...) By doing the patch you provided, sys_init_module now looks like: Dump of assembler code for function sys_init_module: 0xffffffff8026aa20 <sys_init_module+0>: push %rbp 0xffffffff8026aa21 <sys_init_module+1>: mov %rsp,%rbp 0xffffffff8026aa24 <sys_init_module+4>: sub $0x20,%rsp 0xffffffff8026aa28 <sys_init_module+8>: mov %r14,0x18(%rsp) 0xffffffff8026aa2d <sys_init_module+13>: mov %rdi,%r14 So only 32 bytes. (But of course, load_module() exists, and now has 0x1d0 (464) bytes...) With the patch you provide, I /was/ able to repeatedly boot OK (latest tree, and I also ran the patch against the 26.27.rc3-based kernel I was having problems with initially, and that booted OK as well). --
Yeah, your build seems to have consistently bigger stack usage, and that may be due to some config option, but most likely it's a compiler version issue. But I think part of the reason is that you have frame pointers enabled: that makes the stack frames bigger not only because of the frame pointer save/restore, but also because you have more register pressure and thus I'm downloading it now, I'll probably be done by the time you get this email. Right - the stack usage didn't go away, but the _lifetimes_ changed. So now load_module() will still use almost 500 bytes of stack, and it will call other routines that use stack too, but the lifetime of that stack usage is no longer over the whole module loading and initialization part, it's purely over just the loading thing. And since the deep callchain came much later (in the actual ->init routines), by the time we do that, we no longer now have the load_module I had actually already committed it, because it was correct regardless (and gcc really is a total ass for doing that inlining to begin with), but it's good to have verification that the behaviour you saw was literally about this thing. I'll look at your vmlinux binary to see what else sucks from a stack depth standpoint, but one of the problems in this whole thing is that the stack usage is obviously both a static thing (with some functions using _way_ too much stack!) _and_ a dynamic thing (with the total stack use being not about any individual function, but the whole chain). My patch obviously doesn't change the static stack usage, it just moves it around a bit so that it's no longer on that same deep path, so the dynamic stack usage is much less. But I'll look at your vmlinux, see what stands out. Linus --
I wonder if we ought to have a light version of "make checkstack" always run, but in such a way that we make a file with "limits" on the stack usage for key functions (and we can grow this list over time when we learn about critical ones).. and either warn very loudly or even fail the build if we're way over what could work. --
Oops. I already see the problem. Your .config has soem _huge_ CPU count, doesn't it? checkstack.pl shows these things as the top problems: 0xffffffff80266234 smp_call_function_mask [vmlinux]: 2736 0xffffffff80234747 __build_sched_domains [vmlinux]: 2232 0xffffffff8023523f __build_sched_domains [vmlinux]: 2232 0xffffffff8021e884 setup_IO_APIC_irq [vmlinux]: 1616 0xffffffff8021ee24 arch_setup_ht_irq [vmlinux]: 1600 0xffffffff8021f144 arch_setup_msi_irq [vmlinux]: 1600 0xffffffff8021e3b0 __assign_irq_vector [vmlinux]: 1592 0xffffffff8021e626 __assign_irq_vector [vmlinux]: 1592 0xffffffff8023257e move_task_off_dead_cpu [vmlinux]: 1592 0xffffffff802326e8 move_task_off_dead_cpu [vmlinux]: 1592 0xffffffff8025dbc5 tick_handle_oneshot_broadcast [vmlinux]:1544 0xffffffff8025dcb4 tick_handle_oneshot_broadcast [vmlinux]:1544 0xffffffff803f3dc4 store_scaling_governor [vmlinux]: 1376 0xffffffff80279ef4 cpuset_write_resmask [vmlinux]: 1360 0xffffffff803f465d cpufreq_add_dev [vmlinux]: 1352 0xffffffff803f495b cpufreq_add_dev [vmlinux]: 1352 0xffffffff803f3fc4 store_scaling_max_freq [vmlinux]: 1328 0xffffffff803f4064 store_scaling_min_freq [vmlinux]: 1328 0xffffffff803f44c4 cpufreq_update_policy [vmlinux]: 1328 .. and sys_init_module is actually way way down the list. I bet the only reason it showed up at all was because dynamically it was such a deep callchain, and part of that callchain probably called some of those really nasty things. Anyway, the reason smp_call_function_mask and friends have such _huge_ stack usages for you is that they contain a 'cpumask_t' on the stack. For example, for me, usign a sane NR_CPU, the size of the stack frame for smp_call_function_mask is under 200 bytes. For you, it's 2736 bytes. How about you make CONFIG_NR_CPU's something _sane_? Like 16? Or do you really have four thousand CPU's in that system? Oh, I guess you have ...
In fact, they contain multiple CPU-masks, each 4k-bits - 512 bytes - in size. And they tend to call each other. Quite frankly, I don't think we were really ready for 4k CPU's. I'm going to commit this patch to make sure others don't do that many CPU's by mistake. It marks MAXCPU's as being 'broken' so you cannot select it, and also limits the number of CPU's that you _can_ select to "just" 512. Right now, 4k cpu's is known broken because of the stack usage. I'm not willing to debug more of these kinds of stack smashers, they're really nasty to work with. I wonder how many other random failures these have been involved with? This patch also makes the ifdef mess in Kconfig much cleaner and avoids duplicate definitions by just conditionally suppressing the question and giving higher defaults. We can enable MAXSMP and raise the CPU limits some time in the future. But that future is not going to be before 2.6.27 - the code simply isn't ready for it. The reason I picked 512 CPU's as the limit is that we _used_ to limit things to 255. So it's higher than it used to be, but low enough to still feel safe. Considering that a 4k-bit CPU mask (512 bytes) _almost_ worked, the 512-bit (64 bytes) masks are almost certainly fine. Still, sane people should limit their NR_CPUS to 8 or 16 or something like that. Very very few people really need the pain of big NR_CPUS. Not even "just" 512 CPU's. Travis, Ingo and Thomas cc'd, since they were involved in the original commit (1184dc2ffe2c8fb9afb766d870850f2c3165ef25) that raised the limit. Linus --- arch/x86/Kconfig | 30 ++++++++---------------------- 1 files changed, 8 insertions(+), 22 deletions(-) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index 68d91c8..ed92864 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -577,35 +577,29 @@ config SWIOTLB config IOMMU_HELPER def_bool (CALGARY_IOMMU || GART_IOMMU || SWIOTLB || AMD_IOMMU) + config MAXSMP bool "Configure Maximum ...
yeah, that's OK i guess - distros can still enable 4K support if they wish to. Someone interested in improving the stack footprint situation should dust off the max-stack-footprint tracer so that we can catch these things in a more structured way. And i guess the next generation of 4K CPUs support should just get away from cpumask_t-on-kernel-stack model altogether, as the current model is not maintainable. We tried the on-kernel-stack variant, and it really does not work reliably. We can fix this in v2.6.28. Ingo --
From: Ingo Molnar <mingo@elte.hu> I recenetly did some work on sparc64 to use cpumask pointers as much as possible. The only case that didn't work was due to a limitation in arch interfaces for the new generic smp_call_function() code. It passes a cpumask_t instead of a pointer to one via arch_send_call_function_ipi(). But other than that, the whole sparc64 SMP stuff uses cpumask_t pointers only. What it comes down to is that you have to do the "self cpu" and other tests in the cross-call dispatch routines themselves, instead of at the top-level working on cpumask_t objects. Otherwise you have to modify cpumask_t objects and thus pluck them onto the stack where they take up silly amounts of space. --
What we did was this: we added MAXSMP which just revs up all the SMP tunables to the maximum, so that we can see any problems early in testing. And we triggered problems, and we fixed a couple of regressions all around stack footprint. But we didnt catch all of them - some were gcc version dependent and configuration dependent. So i think it's safe to say that the whole concept of allowing such a large cpumask_t to be on the stack is fragile. Hence, i think the best way forward is to change the whole cpumask_t concept and disallow explicit masks altogether. It's so easy to smack a cpumask_t variable on the stack and nothing really warns about it, and any function can become part of a nested call sequence. So i think the dynamics of it has to be changed: we need a get/put API and we need to make on-stack cpumask illegal on the build level (in generic code at least). This has been Rusty's main argument early on i think, and i now concur. Ingo --
wonder if could use "unsigned long *" directly. so could dyn_array directly like int cpumask_size; unsigned long *online_cpu_map; DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size, PAGE_SIZE, NULL); and after nr_cpu_ids is assigned, have cpumask_size = (nr_cpu_ids + sizeof(unsigned long) - 1)/sizeof(unsigned long); then we could NR_CPUS=4096 kernel to small system. ... YH --
I would actually suggest something like this:
- we continue to have a magic "cpumask_t".
- we do different cases for big and small NR_CPUS:
#if NR_CPUS <= BITS_PER_LONG
/*
* Make it an array - that way passing it as an argument will
* always pass it as a pointer!
*/
typedef unsigned long cpumask_t[1];
static inline void create_cpumask(cpumask_t *p)
{
*p = 0;
}
static inline void free_cpumask(cpumask_t *p)
{
}
#else
typedef unsigned long *cpumask_t;
static inline void create_cpumask(cpumask_t *p)
{
*p = kcalloc(..);
}
static inline void free_cpumask(cpumask_t *p)
{
kfree(*p);
}
#endif
and now after you do this, you can just do something like
cpumask_t mycpu;
create_cpumask(&mycpu);
..
free_cpumask(&mycpu);
and in between, you can use 'cpumask' as a pointer, because even when it
is an array directly allocated on the stack, the array can always
degenerate into a pointer by C type rules!
And for the small-NR_CPUS case there is zero overhead.
Linus
--
On Tue, Aug 26, 2008 at 9:51 AM, Linus Torvalds that is good for local variables. for global variables, need to allocate them in some point. may need one int cpumask_size; cpumask_t online_cpu_map; DEFINE_DYN_ARRAY(online_cpu_map, sizeof(unsigned long), cpumask_size, PAGE_SIZE, NULL); or something like that. YH --
Hi Linus,
This turns out to be awful in practice, mainly due to const. Consider:
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef unsigned long *cpumask_t;
#else
typedef unsigned long cpumask_t[1];
#endif
cpumask_t returns_cpumask(void);
That's obviously illegal if cpumask_t is an array. So we need a typedef which
says "really always a pointer".
typedef unsigned long *cpumask_return_t;
cpumask_return_t returns_cpumask(void);
But we usually want it to return a const ptr, and this doesn't work:
const cpumask_return_t returns_cpumask(void);
foo.c:12: warning: type qualifiers ignored on function return type
So now we need:
typedef const unsigned long *cpumask_const_return_t;
cpumask_const_return_t returns_cpumask(void);
OK, now consider a function which wants to take a const cpu bitmap:
void cpus_copy(cpumask_t dst, const cpumask_t src);
...
cpus_copy(cpus, returns_cpumask());
foo.c:34: warning: passing argument 2 of ‘cpus_copy’ discards qualifiers from
pointer target type
Oops, that didn't work with the pointer version. So we need another typedef:
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef const unsigned long *cpumask_const_t;
#else
typedef const unsigned long cpumask_const_t[1];
#endif
void cpus_copy(cpumask_t dst, cpumask_const_t src);
We end up with this:
#ifdef CONFIG_CPUMASK_OFFSTACK
typedef unsigned long *cpumask_t;
typedef const unsigned long *cpumask_const_t;
#else
typedef unsigned long cpumask_t[1];
typedef const unsigned long cpumask_const_t[1];
#endif
typedef unsigned long *cpumask_return_t;
typedef const unsigned long *cpumask_const_return_t;
typedef unsigned long cpumask_data_t[1];
I can't see a neater way down this path, and I don't want to lose const.
I can see three alternatives:
1) An ONSTACK_CPUMASK(name) macro which declares "struct cpumask name[1]" or
"struct cpumask *name". Same idea as yours, without the typedef.
2) Use a normal struct for cpumask, make everyone use pointers, but ...since most of the important cpumasks in high-perf codepaths are already pre-constructed and embedded in some existing object (say task_struct), i think a variant of #3 is the best approach: - get rid of cpumask_t and use 'struct cpumask' everywhere. - do not expose normal kernel code to struct cpumask's definition, only declare the type via 'struct cpumask;' in sched.h - a'la kmem_cache_t. - even hide the structure from sched.h - use an extra indirection for struct cpumask *cpus_allowed in task_struct and be done with it. - have normal cpumask object alloc/free codepaths. - optimize any remaining important cases by hand, if needed. (the scheduler mostly) (wrt. #2, alloca() is a nightmare i think.) Ingo --
No. That's already broken. You cannot return a cpumask_t, regardless of
interface. We must not do it regardless of how we pass those things
around, since it generates _yet_ another temporary on the stack for the
return slot for any kind of structure.
So all cpumask functions should always return pointers and/or take
pointers to be filled in. That's true *regardless* of how we actually are
to then allocate them.
So forget returning cpumasks. It's irrelevant.
What _is_ relevant is how we allocate them when we need temporary CPU
masks. And _that_ is where my suggestion comes in. For small NR_CPUS, we
really do want to allocate them on the stack, because calling kmalloc for
a 4- or 8-byte allocation is just _stupid_.
So all your arguments are invalid, because you're looking at the wrong
thing. The thing that I was talking about is converting current code that
has
random_function(..)
{
cpumask_t mask;
.. do something with mask ...
}
which has to be converted some way. And I think it needs to be converted
in a way that does *not* force us to call kmalloc() for idiotically small
values.
Linus
--
Subject: [RFC 1/1] cpumask: Provide new cpumask API
Provide new cpumask interface API. The relevant change is basically
cpumask becomes an opaque object. I believe this results in the
minimum amount of editing while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.
/* raw declaration */
struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };
/* cpumask_map_t used for declaring static cpumask maps */
typedef struct __cpumask_data_s cpumask_map_t[1];
/* cpumask_t used for function args and return pointers */
typedef struct __cpumask_data_s *cpumask_t;
/* cpumask_var_t used for local variable */
typedef struct __cpumask_data_s cpumask_var_t[1]; /* SMALL NR_CPUS */
typedef struct __cpumask_data_s *cpumask_var_t; /* LARGE NR_CPUS */
/* replaces cpumask_t dst = (cpumask_t)src */
void cpus_copy(cpumask_t dst, const cpumask_t src);
Remove the '*' indirection in all references to cpumask_t objects. You can
change the reference to the cpumask object but not the cpumask object itself
without using the functions that operate on cpumask objects (f.e. the cpu_*
operators). Functions can return a cpumask_t (which is a pointer to the
cpumask object) and only be passed a cpumask_t.
All uses of cpumask_t on the stack are changed to be cpumask_var_t.
Allocation of local cpumask objects will follow...
All cpumask operators now operate using nr_cpu_ids instead of NR_CPUS. All
variants of the cpumask operators which used nr_cpu_ids instead of NR_CPUS
are deleted.
All variants of functions which use the (old cpumask_t *) pointer are deleted
(f.e. set_cpus_allowed_ptr()).
Based on code from Rusty Russell <rusty@rustcorp.com.au> (THANKS!!)
Signed-of-by: Mike Travis <travis@sgi.com>
---
include/linux/cpumask.h | 340 ++++++++++++++++++++++++------------------------
1 file changed, 174 insertions(+), 166 deletions(-)
--- ...No, for large NR_CPUS, cpumask_t is a pointer as shown. And we have numerous Right, but cpumask_t is used for far more than stack decls, thus the problems. I can make a separate "cpumask_stack_t" and use your method tho. I think that Yeah, got that. But your suggestion to change cpumask_t turned out horribly ugly. Cheers, Rusty. --
Hi Rusty,
I've gotten some good traction on the changes in the following patch. About 30%
of the kernel is compiling right now and I'm picking up errors and warnings as
I'm going along. I think it's doing most of what we need. Attempting to hide
the cpumask struct definition caused all kinds of problems with the inline
functions and statically declaring cpumask's.
(The following patch is a combination of all the changes to cpumask.h with the
header from the first patch. I'll send you a complete copy in separate email.)
Thanks,
Mike
--
Subject: [RFC 1/1] cpumask: Provide new cpumask API
Provide new cpumask interface API. The relevant change is basically
cpumask_t becomes an opaque object. I believe this results in the
minimum amount of editing while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.
/* raw declaration */
struct __cpumask_data_s { DECLARE_BITMAP(bits, NR_CPUS); };
/* cpumask_map_t used for declaring static cpumask maps */
typedef struct __cpumask_data_s cpumask_map_t[1];
/* cpumask_t used for function args and return pointers */
typedef struct __cpumask_data_s *cpumask_t;
/* cpumask_var_t used for local variable, definition follows */
typedef struct __cpumask_data_s cpumask_var_t[1]; /* SMALL NR_CPUS */
typedef struct __cpumask_data_s *cpumask_var_t; /* LARGE NR_CPUS */
/* replaces cpumask_t dst = (cpumask_t)src */
void cpus_copy(cpumask_t dst, const cpumask_t src);
Remove the '*' indirection in all references to cpumask_t objects. You can
change the reference to the cpumask object but not the cpumask object itself
without using the functions that operate on cpumask objects (f.e. the cpu_*
operators). Functions can return a cpumask_t (which is a pointer to the
cpumask object) and only be passed a cpumask_t.
All uses of cpumask_t on the stack are changed to be cpumask_var_t except
for pointers to static cpumask objects. Allocation of ...could you please send whatever .c changes you have already, so that we can have a look at how the end result will look like? Doesnt have to build, i'm just curious about how it looks like in practice, semantically. Ingo --
I will, and the full "allyesconfig" does compile. And it's basically a benign change in that the functionality is still the same. I'm currently reordering it a bit to clean it up. Thanks, Mike --
btw., are the resulting instructions also expected to be the same? If yes then you might want to verify it all by making sure the md5's of the .o's do not change. (If that's not possible (gcc decides to compile it a bit differently) then no big deal, just wanted to mention the possibility.) Ingo --
Well, not exactly... ;-) It does institute the new API change that specifies only pointers to cpumask's can be passed to functions and returned from functions. I really wanted the default cpumask_t to be a constant so those instances where the passed in cpumask is used as a read/write temp variable would be caught. But it started getting messy. One pain is: typedef struct __cpumask_s *cpumask_t; const cpumask_t xxx; is not the same as: typedef const struct __cpumask_s *const_cpumask_t; const_cpumask_t xxx; and I'm not exactly sure why. It came up when I tried to declare functions that returned a constant cpumask_t pointer (node_to_cpumask, cpumask_of_cpu, etc.) The other major change I'm contemplating is to remove "cpumask_t" completely (maybe cpumask_ptr_t?). This would force every instance of cpumask_t to be examined. (I found quite a few I had missed in my original edits when I added the task struct temp cpumask's.) Oh yeah, one question ... is "current" always valid? Thanks, Mike --
Umm. The const has different One is typedef const struct __cpumask_s *const_cpumask_t; which becomes (const struct __cpumask_s) * while the other is const cpumask_t xxx which is const (struct __cpumask_s *) and if you look a bit more closely, you'll see that they are _obviously_ not the same thing at all. Quite frankly, I personally do hate typedefs that end up being pointers, and used as pointers, without showing that in the source code. When you do type_t a; fn(a); I expect the code to essentially do a pass-by-value. But when the type_t is a pointer, that doesn't really work. Your issue with 'const' is just another version of the same. You don't want the _pointer_ to be const, you want what it points _to_ to be const. But because you hid the pointerness inside the typedef, you simply cannot do that. The problem with cpumask's, of course, is that for the "small mask" case, we really don't want it to be a pointer. So now it's sometimes a pointer and sometimes not. The typedef hides that, and I understand why it's a good idea, but I'm surprised you didn't understand what the implications were for 'const', and I'm now a bit more leery about this whole thing just because the typedef ends up hiding so much - it doesn't just hide the basic type, it hides a very basic *code* issue. Linus --
Thanks, yes that explains what I should have figured out. (Nice way to I agree, and as I mentioned, Rusty was working towards an alternative method of declaring cpumask's which does not hide this. My goal was to create an (apparent) opaque handle for cpumask_t and modify Perhaps then defining the cpumask as a list of unsigned longs (remove the outer struct) would play more "naturally". Lists by definition are always referenced by pointers. I have another version of the patchset that shows Thanks! Mike --
...
Here's an alternate proposal for the new cpumask API. I have not yet begun
the tedious work of making it compile so there may still be some syntax
errors. (Init/main.c compiles.) But I believe it more closely adheres to
proper 'C' coding styles.
Thanks,
Mike
--
Subject: cpumask: Provide new cpumask API
[Copied from Documentation/cpumask.txt]
Introduction
Cpumask variables are used to represent (with a single bit) all the
CPU's in a system. With the prolific increase in the number of CPU's
in a single system image (SSI), managing this large number of cpus has
become an ordeal. When the limit of CPU's in a system was small, the
cpumask could fit within an integer. Even with the increase to 128,
a bit pattern was well within a manageable size. Now with systems
available that have 2048 cores, with hyper-threading there are 4096
cpu threads. And the number of cpus in an SSI is growing, 16,384 is
right around the corner. Even desktop systems with only 2 or 4 sockets,
a new generation of Intel processors that have 128 cores per socket will
increase the count of CPU threads tremendously.
Thus the handling of the cpumask that represents this 4096 limit needs
512 bytes, putting pressure on the amount of stack space needed to
accommodate temporary cpumask variables.
The primary goal of the cpumasks API is to accommodate these large
number of cpus without effecting the compactness of Linux for small,
compact systems.
The Changes
Provide new cpumask interface API. The relevant change is basically
cpumask_t becomes an pointer to a constant list of unsigned longs
and two new typedef's are used for declaring cpumask_t variables,
cpumask_var_t and cpumask_map_t. This should result in the minimum
amount of modifications while still allowing the inline cpumask
functions, and the ability to declare static cpumask objects.
/* cpumask_t is used when pointing to a constant cpumask */
typedef const unsigned long *cpumask_t;
/* cpumask_map_t ...Hiding a const here is not a good idea either, I think :( Rusty. --
Yes, this is why my version of the rework moved away from typedefs, except for the special case of "cpumask_var_t" for stack vars where this trick is really desired. Everywhere else, the code becomes nice and clear: struct cpumask *. Cheers, Rusty. --
Iirc, it was the problem of basing percpu variables at zero that hit problems with various gcc toolset versions. I don't remember any version problems with cpumask's on the stack, they all failed the Removing cpumask_t's from the stack is fairly straight forward. The problem of changing all functions to expect a cpumask pointer via a global change is much more problematic. And of course all those functions that return a cpumask value would need to be addressed. Thanks, Mike --
Yes, I had proposed either modifying, or supplementing a new smp_call function to pass the cpumask_t as a pointer (similar to set_cpus_allowed_ptr.) But an ABI change such as this was not well received at the time. Thanks, Mike --
From: Mike Travis <travis@sgi.com> What it seems to come down to is that any cpumask_t not inside of a dynamically allocated object should be marked const. And that is something we can enforce at compile time. Linus has just suggested dynamically allocating cpumask_t's for such cases but I don't see that as the fix either. Just mark them const and enforce that cpumask_t objects can only be modified when they appear in dynamically allocated objects. You really don't need to modify the ones that passed around functions anyways. The only code that wants to change bits in these things is the cpu cross-call dispatch stuff, and that cpu choice logic can just live where it belongs down in the cross-call dispatch code. --
Dave and others, Sorry if I jump into the middle of the thread. Stopped subscribing to lkml for a while, so this is through the archives. Ran into some of these issues with KVM too, and noticed just how much we pass cpumask_t around in the smp_call functions :-( In fact, the only arch that did pretty well on this front was sparc64. I totally agree, that marking them const makes a ton of sense, but at the same time I suggest we convert smp_call_function_mask() to take a pointer to the cpumask_t. I whipped up the following patch, which cuts down the amont of memcpy calls emitted quite a fair bit. I have only tested this on ia64, but it boots, so it's obviously perfect<tm> :-) Comments, suggestions welcome. I have a followup patch that makes virt/kvm/kvm_main.c use the new interface. Cheers, Jes
Well, it probably boots because it doesn't really seem to _change_ much of
anything.
Things like this:
-static inline void arch_send_call_function_ipi(cpumask_t mask)
+static inline void arch_send_call_function_ipi(cpumask_t *mask)
{
- smp_ops.send_call_func_ipi(mask);
+ smp_ops.send_call_func_ipi(*mask);
}
will still do that stack allocation at the time of the call. You'd have to
pass the thing all the way down as a pointer..
Linus
--
From: Linus Torvalds <torvalds@linux-foundation.org>
True, but we have to get there one step at a time.
BTW, sparc64 already wants a pointer here, so it's completely ready for
this:
void arch_send_call_function_ipi(cpumask_t mask)
{
xcall_deliver((u64) &xcall_call_function, 0, 0, &mask);
}
--
Hi Linus, I realize that, but as I have been doing this work on ia64, I didn't want to mess too much with the x86 code. The ia64 part of the patch does change things :-) If someone with more x86 knowledge would want to try and make that part of the patch better, and more importantly test it, I'd be quite keen on helping out. Cheers, --
Linus, Ok, so here's a version which tries to do the right thing on x86 as well. Build tested on x86_64, but don't have an easy way to test it right now. It's booting on ia64. Cheers, Jes
I would be most interested in any tools to analyze call-trees and accumulated stack usages. My current method of using kdb is really time consuming. Thanks! Mike --
Well, even just scripts/checkstack.pl is quite relevant.
The fact is, anything with a stack footprint of more than a hundred bytes
is suspect. We _do_ have a lot of cases of several hundred bytes, and some
of them are even very intentional.
For an example of _intentional_ and valid large stacks, look at
do_sys_poll and do_select. They both have a big stack footprint in a
normal kernel, and that's on purpose - it's not pretty, but they are very
common and performance-sensitive functions, and using a big stack allows
some basic allocations to be much cheaper by default.
Same goes for early_printk(), although I don't think the reasons are
really very strong in that case.
Sadly, while those functions are _fairly_ high up, they aren't at the top,
and we do have a lot of other functions that have huge stack footprints
for totally bogus reasons. But the intentional ones are at least in the
top ten.
But the kernel that Alan had problems with was different. The
_intentional_ ones were way down in the noise. do_sys_poll wasn't in the
top ten, it was barely even in the top 50! (It was in fact #49, to be
exact).
So look at the top ten in my kernel:
1 ide_generic_init [vmlinux]: 1384
2 idefloppy_ioctl [vmlinux]: 1208
3 e1000_check_options [vmlinux]: 1152
4 do_sys_poll [vmlinux]: 904
5 ide_floppy_get_capacity [vmlinux]: 872
6 do_select [vmlinux]: 744
7 early_printk [vmlinux]: 720
8 do_task_stat [vmlinux]: 680
9 mmc_ioctl [vmlinux]: 648
10 elf_kcore_store_hdr [vmlinux]: 576
.. and in Alan's kernel:
1 smp_call_function_mask [vmlinux]: 2736
2 __build_sched_domains [vmlinux]: 2232
3 setup_IO_APIC_irq [vmlinux]: 1616
4 arch_setup_ht_irq [vmlinux]: 1600
5 arch_setup_msi_irq [vmlinux]: 1600
6 __assign_irq_vector [vmlinux]: ...Hi Linus, [Sorry for the long winded response, but I felt that sufficient background is needed to address this issue.... YOMV :-)] The need to allow distros to set NR_CPUS=4096 (and NODES_SHIFT=9) is critical to our upcoming SGI systems using what we have been calling "UV". This system will be capable of supporting 4096 cpu threads in a single system image (and 16k cpus/2k nodes is right around the corner). While obviously I cannot divulge too many details, it's sufficient to say there are customers who not only require this extended capability, but are extremely excited about it. But the nature of some of these system environments is that they will not accept a specially built kernel, but only a kernel that has been built and certified (both from the application standpoint as well as the security standpoint) by standard distributions. And you probably know how extensively these distributions test and certify for many known defects and absolutely require that incoming source changes come from the community supported source bases, primarily yours. Due to the lead time required to accomplish these certifications, the version of the distributions that will be available when this system releases will be based on 2.6.27. (They will allow patches "post-2.6.27-rc.final" as long as those are committed in the source base.) The two distributions that SGI supports for our customers is SLES (SUSE Linux Enterprise Server) and RHEL (Red Hat Enterprise Linux). [They, of course, are free to run any OS of their choosing, but SGI only provides front line support for those two.] I started last August to begin analyzing how to accomplish the above goals and where exactly are the hot spots in the kernel that would require attention. It quickly became clear that cpumask_t and nodemask_t are two variables that are very casually used (along with NR_CPUS), because the assumption was that 64 "was more than sufficient" for an upper limit and even extending it to 128 or 255 (254 was the ...
That's fine. You can do it. The default kernel will not, because it's clearly not safe. I really don't care what you do to _your_ images. But I will not distribute a known-broken kernel, and I will not debug random stack overflows that happen in it. If you want the default kernel to support 4k cores, we'll need to fix the stack usage. I don't think that is impossible, but IT IS NOT GOING TO HAPPEN for 2.6.27. And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by default, they are totally and utterly crazy. But some SGI-specific binary that is meant for SGI machines only, and has been extensively tested with the setup used on SGI machines is a different thing. Linus --
Ok, thanks for the reply, and looking into this issue. We will "strongly encourage" our distros to base the relevant releases on 2.6.28. :-) [Supplying an SGI-specific kernel would not be acceptable to many of our customers because of the certification issues I mentioned.] Mike --
On Tue, Aug 26, 2008 at 12:09:46PM -0700, Linus Torvalds wrote: > If you want the default kernel to support 4k cores, we'll need to fix the > stack usage. I don't think that is impossible, but IT IS NOT GOING TO > HAPPEN for 2.6.27. > > And quite frankly, if some vendor like RedHat enables NR_CPUS=4096 by > default, they are totally and utterly crazy. heh. *picks through Fedora changelog* * Thu Aug 14 2008 Dave Jones <davej@redhat.com> - Bump max cpus supported on x86-64 to 4096. Just to see what happens. I never did get to find out unfortunatly, because of the security fiasco in Fedora infrastructure the last week or two. > But some SGI-specific binary that is meant for SGI machines only, and has > been extensively tested with the setup used on SGI machines is a different > thing. Every extra kernel image a distro vendor ends up shipping has an associated cost. * build time: It currently takes about 2 hours for a set of Fedora RPMs. For RHEL it'll be even worse due to the extra archs). Killing off -smp specific builds was a big win for us in this regard. Adding extra flavours is always painful. * diskspace (distro kernels aren't small. With the associated debugging symbols, they take up a shitload of disk space really fast). * Having everyone running the same kernel makes it much easier to test/debug. Our QA guys hate adding extra columns to their test matrix. But yes, for this to be even remotely feasible, there has to be a negligable performance cost associated with it, which right now, we clearly don't have. Given that the number of people running 4096 CPU boxes even in a few years time will still be tiny, punishing the common case is obviously absurd. Dave -- http://www.codemonkey.org.uk --
Dave Jones wrote: I did do some fairly extensive benchmarking between configs of NR_CPUS = 128 and 4096 and most performance hits were in the neighborhood of < 5% on systems with 8 cpus and 4GB of memory (our most common test system). [But changing cpumask_t's to be pointers instead of values will likely increase this.] I've tried to be very sensitive to this issue with all my previous changes, so convincing the distros to set NR_CPUS=4096 would be as painless for them as possible. ;-) Btw, huge count cpu systems I don't think are that far away. I believe the nextgen Larabbee chips will be geared towards HPC applications [instead of just GFX apps], and putting 4 of these chips on a motherboard would add up to 512 cpu threads (1024 if they support hyperthreading.) Thanks, Mike --
5% is a pretty nasty performance hit... what sort of benchmarks are we talking about here? I just made some pretty crazy changes to the VM to get "only" around 5 or so % performance improvement in some workloads. What places are making heavy use of cpumasks that causes such a slowdown? Hopefully callers can mostly be improved so they don't need to use cpumasks for common cases. Until then, it would be kind of sad for a distro to ship a generic x86 kernel and lose 5% performance because it is set to 4096 CPUs... But if I misunderstand and you're talking about specific microbenchmarks to It would be quite interesting if they make them cache coherent / MP capable. Will they be? --
From: Nick Piggin <nickpiggin@yahoo.com.au> It's almost certainly from the cross-call dispatch call chain. As just one example, just to do a TLB flush mm->cpu_vm_mask probably gets passed around as an aggregate two or three times on the way down to the APIC programming code on x86. That's two or three 512 byte copies on the stack :) Look at the sparc64 SMP code for how I solved the problem there. --
Yeah, I see. That's stupid isn't it? (Well, I guess it was completely sane when cpumasks were word sized ;)) Hopefully that accounts for a significant chunk... --
From: Nick Piggin <nickpiggin@yahoo.com.au> There is a lot of indirect costs that are hard to see as well. Two things a lot of these cross-call dispatch paths do is: 1) Clear self-cpu 2) AND with cpus_online #1 can normally be a simple bit clear, but some places can also implement this with something like "cpus_andn(X, cpumask_of_cpu(cpu))" It's simply easier to move those two things down to the bottom of the APIC programming code, they just loop over the cpumask doing an expensive APIC I/O operation anyways, might as well overlap it with these "skip self-cpu" and "skip not-online cpus" checks. And oh yeah we get the stack wastage fixed too, isn't what what we were talking about? :-) --
Yes, the most time consuming part was determining whether a kmalloc could safely be used in the context of the function, and what to do about the out-of-memory problem. Pushing that down to something like: for_each_cpu_thats_online(cpu, *maskptr) would remove the need for many of the temp masks. A simple if (cpu != me) would take care of excluding self. It might have better interaction with cpu hotplug as well, since the online map would be checked just before the call to that cpu is made. Thanks, Mike --
It's been a while now, I should go back and check my notes. Many of the BM's did not have any changes. I believe the ones that were right on the That's another study I did, and it seemed that maybe 95% of the functions would not be affected by passing pointers to cpumasks instead of the cpumasks themselves, because the data was processed by a cpu_xxx function that uses a pointer. Most commonly was to create a temp cpumask, using cpus_and(temp_mask, callers_mask, cpu_online_map); The speedup to use nr_cpu_ids instead of NR_CPUS in the traversal functions helped quite a bit. Using this same method in the cpus_xxx functions would further speed up things. (As well as only allocating the cpumask sized by nr_cpu_ids instead of NR_CPUS Yes, I was (at the time) trying to determine how many of the cpumask functions were actually in play by user tasks, so I was zeroing in on those (cpusets, There's not been a lot of info available yet, but I think the 128 cores will share at least an L2 cache + memory controller. How the APIC's interact is also another big question. And most likely some standard system controller CPU will be needed, but that could be a tiny VIA processor... ;-) Thanks, Mike --
This probably all started when I was working on a software tool (aiod) that was failing because somebody ELSE had 4,096 CPUs configured. [[Seems that gcc had/has? it's MAX CPU value set to 1,024 (bits/sched.h __CPU_SETSIZE), so when you issue system calls like sched_getaffinity, it will "fail" for systems configured w/ 4,096 CPUs. I worked around it by simply forgetting about the gcc values, and kept allocating larger CPU masks until it worked.]] I think you're right: the kernel as a whole may not be ready for 4,096 CPUs apparently... Thanks for taking the time to look into this... Alan --
Mike has been working diligently on getting all these cpumasks off the stack for the last months and has created an infrastructure to do this. So I think we are close. It might just be a matter of merging some more patches that are still left in Ingo's tree. --
hm, there are no such patches left that i know of - the only bits in -tip are the zero-based percpu, which was found to be a bit fragile in testing: earth4:~/tip> git-log-line --author=Travis linus.. d379497: Zero based percpu: infrastructure to rebase the per cpu area to zero b3a0cb4: x86: extend percpu ops to 64 bit [and it has no relevance to stack footprint.] So i dont think the current cpumask_t approach will work. We simply should not get into an endless fight against the windmills that introduce on-stack cpumask_t again and again. We should just take the plunge once and do a clean alloc/free cpumask model. Most of the hotpath cpumasks are constant or pre-constructed, so they are not a real issue. Plus, on the general question of stack footprint problems and the difficulty of debugging them, the worst-case stack footprint tracer i wrote for -rt some time ago should be dusted off as well and put into ftrace. David has something quite close to that for Sparc64 already. Ingo --
I'll start experimenting with globally changing cpumask_t to be a pointer, and see what falls out. Thanks, Mike --
Hmm, wants neatening anyway; I'll see if I can reduce stack usage side effect. Your workaround is very random, and that scares me. I think a huge number of CPUs needs a real solution (an actual cpumask allocator, then do something clever if we come across an actual fastpath). Thanks, Rusty. --
The thing is, the inlining thing is a separate issue. Yes, the cpumasks were what made stack pressure so critical to begin with, but no, a 400-byte stack frame in a deep callchain isn't acceptable _regardless_ of any cpumask_t issues. Gcc inlining is a total and utter pile of shit. And _that_ is the problem. I seriously think we shouldn't allow gcc to inline anything at all unless we tell it to. That's how it used to work, and quite frankly, that's how it _should_ work. The downsides of inlining are big enough from both a debugging and a real code generation angle (eg stack usage like this), that the upsides (_somesimes_ smaller kernel, possibly slightly faster code) simply aren't relevant. So the "noinline" was random, yes, but this is a real issue. Looking at checkstack output for a saner config (NR_CPUS=16), the top entries for me are things like ide_generic_init [vmlinux]: 1384 idefloppy_ioctl [vmlinux]: 1208 e1000_check_options [vmlinux]: 1152 ... which are "leaf" functions. They are broken as hell (the e1000 is apparently because it builds structs on the stack that should all be "static const", for example), but they are different from something like the module init sequence in that they are not going to affect anything else. It would be interesting to see what "-fno-default-inline" does to the kernel. It really would get rid of a _lot_ of gcc version issues too. Inlining behavior of gcc has long been a problem for us. Linus --
I added "-fno-inline-functions-called-once -fno-early-inlining" to
KBUILD_CFLAGS, and (with gcc 4.3) that increased the size of my kernel
image by 2%.
And when David's "-fwhole-program --combine" will become ready the cost
of disallowing gcc to inline functions will most likely increase.
A debugging option (for better traces) to disallow gcc some inlining
might make sense (and might even make sense for distributions to
enable in their kernels), but when you go to use cases that require
really small kernels the cost is too high.
But if you don't trust gcc's inlining you should revert
commit 3f9b5cc018566ad9562df0648395649aebdbc5e0 that increases gcc's
freedom regarding what to inline in 2.6.27 - what gcc 4.2 does in the
case of the regression tracked as Bugzilla #11276 is really not funny
(two callers -> function not inlined; gcc seems to emit the function
although both callers later get removed (or at least should be removed)
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
You ignore the fact that it's really not just about debugging. Inlining really isn't the great tool some people think it is. Especially not since gcc stack allocation is so horrid that it won't re-use stack slots etc (which I don't disagree with per se - it's _hard_ to re-use stack slots while still allowing code scheduling). NOTE! I also would never claim that _our_ choices of "inline" are all that great, and we've often inlined too much or not inlined things that really could be inlined. But at least when a developer says "inline" (or forgets to say it), we have somebody to blame. When the compiler does insane Actually, that just allows gcc to _not_ inline. Which is probably ok. (Well, it would be ok if gcc did it well enough, it obviously has some problems at times). Linus --
I had in mind that we anyway have to support it for tiny kernels.
I simply don't see that we add kconfig options for 5kB of code for
tiny kernels but remove something like this that can cause size
gcc's stack allocation has become better
Most LOCs of the kernel are not written by people like you or Al Viro or
David Miller, and the average kernel developer is unlikely to do it as
good as gcc.
For the average driver the choice is realistically between
"inline's randomly sprinkled across the driver" and
"no inline's, leave it to gcc".
And code evolves during the years from tiny with 1 caller to huge with
many callers.
BTW:
I just ran checkstack on a (roughly) allyesconfig kernel, and we have a
With the "gcc inline's static functions" you complain about we have
4-5 years of experience.
Suddenly allowing 4 release series of gcc to ignore any inline's is a
completely new area for us. I'd generally agree with giving gcc more
freedom here, but I'd rather do it right by removing tons of wrong
inline's than doing one global change hoping that it will make things
better.
And whether the "optimized inlining" actually makes the kernel bigger or
cu
Adrian
[1] there are some rare exceptions
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
I actually don't think that is true. If we really were to decide to be stricter about it, and it makes a big size difference, we can probably also add a tool to warn about functions I agree that it has become better. But it still absolutely *sucks*. For example, see the patch I just posted about e1000 stack usage. Even though the variables were all in completely separate scopes, they all got individual space on the stack over the whole lifetime of the function, causing an explosion of stack-space. As such, gcc used 500 bytes too much of stack, just because it didn't re-use the stackspace. That was with gcc-4.3.0, and no, there were hardly any inlining issues involevd, although it is true that inlining actually did make it slightly worse in that case too (but since it was essentially a leaf function, that had little real life impact, since there were no deep callchains below it to care). So the fact is, "better" simply is not "good enough". We still need to do a lot of optimizations _manually_, because gcc cannot see that it can re-use the stack-slots. And sometimes those "optimizations" are actually performance pessimizations, because in order to make gcc not use all the stack at the Sure. But we do have tools. We do have checkstack.pl, it's just that it hasn't been an issue in a long time, so I suspect many people didn't even _realize_ we have it, and I certainly can attest to the fact that even people who remember it - like me - don't actually tend to run it all that Sure. And most of it isn't all that great. But I do agree that lettign gcc make more decisions is _dangerous_. However, in this case, at least, the decisions it makes would at least make for less inlining, and thus less stack space explosion. Linus --
Sounds like what's really desired here isn't more worry and
unpredictability, but for GCC+Binutils to gain the ability to
calculate the stack depth over all callchains (doesn't have to be
exact, just an upper bound; annotate recursions) in a way that's good
enough to do on every compile, complain if a depth is exceeded
statically (or it can't be proven), and to gain the
In my userspace code, I have macros tmp_alloc and tmp_free. They must
be matched in the same function:
unsigned char * recvbuf = tmp_alloc(1500);
....
tmp_free(recvbuf);
When stack is plentiful, it maps to alloca() which is roughly
equivalent to using a stack variable.
When stack is constrained (as it is on my little devices), that maps
to xmalloc/free. The kernel equivalent would be kmalloc GFP_ATOMIC
(perhaps).
With different macros to mine, it may be possible to map small
fixed-size requests exactly onto local variables, and large ones to
kmalloc(). A stab at it (not tested):
#define LOCAL_ALLOC_THRESHOLD 128
#define LOCAL_ALLOC(type, ptr) \
__typeof__(type) __attribute__((__unused__)) ptr##_local_struct; \
__typeof__(type) * ptr = \
((__builtin_constant_p(sizeof(type)) \
&& sizeof(type) <= LOCAL_ALLOC_THRESHOLD) \
? &ptr##_local_struct : kmalloc(sizeof(type), GFP_ATOMIC))
#define LOCAL_FREE(ptr) \
((__builtin_constant_p(sizeof (*(ptr))) \
&& sizeof(*(ptr)) <= LOCAL_ALLOC_THRESHOLD) \
? (void) 0 : kfree(ptr))
Would that be useful in the kernel?
I'm thinking if it were a commonly used pattern for temporary buffers,
unknown structures and arrays of macro-determined size, the "new
driver" author would be less likely to accidentally drop a big object
on the stack.
Obviously it would be nicer for GCC to code such a ...Btw, did you check with just "-fno-inline-functions-called-once"? The -fearly-inlining decisions _should_ be mostly right. If gcc sees early that a function is so small (even without any constant propagation etc) that it can be inlined, it's probably right. The inline-functions-called-once thing is what causes even big functions to be inlined, and that's where you find the big downsides too (eg the stack usage). Linus --
That's a bit bizarre, though, isn't it? A function which is only called from one place should, if everything made sense, _never_ use more stack through being inlined. Inlining should just increase the opportunities that the called function's local variables can share the same stack slots are the caller's dead locals. Whereas not inlining guarantees they occupy separate, immediately adjacent regions of the stack, and shouldn't be increasing the total numbers of local variables. -- Jamie --
But that's simply not true. See the whole discussion. The problem is that if you inline that function, the stack usage of the newly inlined function is now added to ALL THE OTHER paths too! So the case we had in module loading was that yes, we had a function with a big stack footprint, but it was NOT in the deep path. But by inlining it, it now moved the stack footprint "up" one level to another function, and now the big stack footprint really _was_ in the deep path, because the caller was involved in a much deeper chain. So inlining moves the code up the callchain, and that is a problem for the backtrace, but that's "just" a debugging issue. But it also moves the stack footprint up the callchain, and that can actually be a correctness issue. Of course, a compiler doesn't _have_ to do that. A compiler _could_ have multiple different stack footprints for a single function, and do liveness analysis etc. But no sane compiler probably does that, because it's very painful indeed, and it's not even an issue if you aren't stack-limited (and being stack-limited is really just a kernel thing). (Yeah, it can be an issue even if you have a big stack, in that you get worse cache behaviour, so a dense stack footprint _would_ help. But the complexity of stack liveness analysis is almost certainly not worth the relatively small gains it would get on some odd cases). Linus --
-fno-inline-functions-called-once alone costs me nearly 1% in code size.
And I'd expect it to become more with "-fwhole-program --combine".
If you think we have too many stacksize problems I'd suggest to consider
removing the choice of 4k stacks on i386, sh and m68knommu instead of
using -fno-inline-functions-called-once:
Now that 32bit x86 is no longer used for extreme highend configurations
the only serious usecase for 4k stacks are AFAIK space savings on
embedded archs.
4k stacks have caused us much pain [1], and the cases where gcc inlined
too much were the easy ones.
I'm not saying that I'd like removing the choice of 4k stacks, but if we
want to reduce the number of stack related problems that's IMHO the
cu
Adrian
[1] AFAIR some callpaths in the kernel are still too big
BTW: In case anyone wonders about why I suggest removing 4k stacks:
My position is that 4k stacks should either be enabled
unconditionally or no longer offered at all.
And if we remove 4k stacks from 32bit x86 it's no longer
realistically maintainable for other architectures.
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
Don't be silly. That makes the problem _worse_. We're much better off with a 1% code-size reduction than forcing big stacks on people. The 4kB stack option is also a good way of saying "if it works with this, then 8kB is certainly safe". And embedded people (the ones that might care about 1% code size) are the ones that would also want smaller stacks even more! Linus --
On Tue, Aug 26, 2008 at 5:04 PM, Linus Torvalds This is something I never understood - embedded devices are not going to run more than a few processes and 4K*(Few Processes) IMHO is not worth a saving now a days even in embedded world given falling memory prices. Or do I misunderstand? Parag --
Embedded applications span a huge range of sizes, from the very small devices to which you refer, to quite complex devices. The cable settop boxes we develop have over a hundred interrupt sources, typically run 250-300 threads, and have 192+ MiB of memory. For all that, we are very cost sensitive and are under constant -- David VomLehn --
As you say correctly the term "embedded" gets used for many different
devices.
And if you have 192+ MiB of memory you have so much that all these
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
Well, by that argument, 1% of kernel size doesn't matter either.. 1% of a kernel for an embedded device is roughly 10-30kB or so depending on how small you make the configuration. If that matters, then so should the difference of 3-8 processes' kernel stack usage when you have a 4k/8k stack choice. And they _all_ will have at least 3-8 processes on them. Even the simplest ones will tend to have many more. Linus --
I have some simple devices (network access/routers) with 8MB of RAM, at power up not really being configured to do anything running 25 processes. (Heck there is over 10 kernel processes running!). Configure some interfaces and services and that will easily push past 40. I'd be happy with a 160k saving :-) The init memory being freed at the end of the kernel boot is 88k, 4k stacks could save more than that. Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
So you really need to run all 25 processes on that 8Mb box? (For reference even the NGW100 development board comes with 16Mb RAM). Even if you do need those all 25 processes on the 8Mb box, fixing the memory usage of those user space hogs is lot better than trying to save 160Kb in kernel stacks. Last I looked, user space wasn't particularly frugal with memory usage. Parag --
Yes, of course. Considerable effort has been put into running a minimal set of processes (that still for fills the required function Lots of development boards are fitted with lots of RAM. And the pressure will still be on in _real_ products to reduce the RAM footprint as much as possible. There are exceptions but Yep, been done too. You don't squeeze a lot into these smaller Then you haven't looked in the right places :-) There are plenty of choices for making things small in user space. Simple stuff like using uClibc, busybox, etc. In this specific example things like /bin/init is 10k, /bin/inetd is 10k, /bin/crond is 11k, etc. (Ofcourse it is a shared uClibc setup, uClibc is ~300k). And XIP can help out here too. Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
Well, sure - but the industry as a whole seems to have gone the other way - do more with more at the similar or lower price points! By that definition of less is better we should try and make the kernel memory pageable (or has someone already done that?) - Windows does it, by default ;) Parag --
On Tue, 2008-08-26 at 22:16 -0400, Parag Warudkar wrote: "The industry as a whole" doesn't exist on that low level. You can't compare the laptop and/or desktop computer market (where one may buy today hardware that runs in 3 years with the next generation/release of the OS and applications) with the e.g. "WLAN router" market where - from the commercial point of view - every Euro counts (and where the requirements for the lifetime of the device are long frozen before the That doesn't help as in really small devices (like WLAN routers, cable modems, etc.) you run without any means of paging/swapping. And even binaries/read-only files are not necessarily executable in place (but must be loaded into RAM). So you can't flush these pages. And pageable kernel memory doesn't come for free - even if one only Which is more a sign that it is probably a very bad idea. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services --
On Tue, Aug 26, 2008 at 7:47 PM, Linus Torvalds The savings part -financial ones- are not always realizable with the way memory is priced/sized/fitted. Savings in few Mb of Kernel stack are not necessarily going to allow getting rid of a single memory chip of 64M or so. Either that or embedded manufacturing/configurations are different than the desktop world. (If my device has 2 memory slots and my user space requires 100Mb including kernel memory - I anyways have to put in 64Mx2 there to take advantage of mass manufactured, general purpose memory - so no big deal if I saved 1.2Mb in Kernel stack or not. And savings of 64Mb Kernel memory are not feasible anyways to allow user space to work with 64Mb.) On the other hand reducing user space memory usage on those devices (not counting savings from kernel stack size) is a way more attractive option. And although you said in your later reply that Linux x86 with 4K stacks should be more than usable - my experiences running a untainted desktop/file server with 4K stack have been always disastrous XFS or not. It _might_ work for some well defined workloads but you would not want to risk 4K stacks otherwise. I understand the having 4K stack option as a non-default for very specific workloads is a good idea but apart from that I think no one else seems to bother with reducing stack sizes (by no one I mean other OSes.) Parag --
Umm. How long? 4kB used to be the _only_ choice. And no, there weren't even irq stacks. So that 4kB was not just the whole kernel call-chain, it was also all the irq nesting above it. And yes, we've gotten much worse over time, and no, I can't really suggest going back to that in general. The code bloat has certainly been accompanied by a stack bloat too. But part of it is definitely gcc. Some versions of gcc used to be absolutely _horrid_ when it came to stack usage, especially with some flags, and especially with the crazy inlining that module-at-a-time caused. But I'd be really happy if some embedded people tried to take some of that bloat back, and aim for 4kB stacks. Because it's definitely not unrealistic. At least it _shouldn't_ be. And a lot of the cases of us having structures on the stack is actually not worth it, and tends to be about being lazy rather than anything else. Linus --
On Tue, Aug 26, 2008 at 9:49 PM, Linus Torvalds IIRC the last I tried 4K stacks with x86 was on 2.6.21 - Fedora 7 kernel, around June 07 time frame. What about deep call chains? The problem with the uptake of 4K stacks seems to be that is not reliably provable that it will work under all circumstances. Parag --
Umm. Neither is 8k stacks. Nobody "proved" anything. But yes, some subsystems have insanely deep call chains. And yes, things like the VFS recursion (for symlinks) makes that deeper yet for filesystems, although only on the lookup path. And that is exactly the kind of thing that can exacerbate the problem of the compiler artificially making for a bigger stack footprint of a function (*). For things like the VFS layer, right now we allow a nesting level of 8, I think. If I remember correctly, it was 5 historically. Part of raising that depth, though, was that we actually moved the recursive part into fs/namei.c, and the nesting stack-depth was something pretty damn small when the filesystem used "follow_link" properly and let the VFS do it for it (ie the callchain to actually look up the link could be deep, but it would not recurse back, and instead just return a pointer, so that the actual _recursive_ part was just __do_follow_link() and is just a few words on the stack). So yes, we do have some deep callchains, but they tend to be pretty well managed for _good_ code. The problems tend to be the areas with lots of indirection layers, and yeah, XFS, MD and ACPI all have those kinds of things. In an embdedded world, many of those should be a non-issue, though. Linus (*) ie the function that _is_ on the deep chain doesn't actually need much of a stack footprint at all itself, but it may call a helper function that is _not_ in the deep chain, and if it gets inlined it may give its excessive stack footprint to the deep chain - and this is _exactly_ the problem that happened with inlining "load_module()". --
On x86-32 with 8K stacks your IRQ paths share them so that is even harder to prove (not that you can prove any of them) and the bugs are more obscure and random. --
I think your memory is failing you. In 2.4 and earlier, the kernel stack was 8kB minus the size of the task_struct, which sat at the start of the 8kB. For instance, from include/asm-i386/processor.h for 2.4.29: #define THREAD_SIZE (2*PAGE_SIZE) #define alloc_task_struct() ((struct task_struct *) __get_free_pages(GFP_KERNEL,1)) #define free_task_struct(p) free_pages((unsigned long) (p), 1) Paul. --
but was shared with interrupts; so out of the 6Kb left, you had still really only 4Kb for user context stack --
That was gcc 3.4.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
On Tue, 2008-08-26 at 20:58 -0400, Parag Warudkar wrote:
No, but you can put an additional service(s) on it and sales people have
They are different. Think of running the complete system acting as a
bridge, router and/or firewall (Kernel early 2.4 though) from 4MB flash
in 32MB RAM and - listing the outside visible services - having a
command-line interface, web-GUI (implying a http server) and and a
(net-)SNMP agent on it.
Running a glibc without thread support is win there (implying that there
As soon as product management realizes that there is space left on the
device, they get new ideas and/or customer requirements to run more
There is no question if save space here or there. You save it - sooner
The embedded world of really small devices usually doesn't run XFS (or
ext? or reiser* of jfs or NFS or ...) or stacks block devices on files
They probably gave the idea pretty soon because you need to
rework/improve large parts of the kernel + drivers (and that has two
major problems - it consumes a lot of man power for "no new features and
everything must be completely tested again"[0] and it adds new risks).
And that is practically impossible if one sells "stable driver APIs" for
3rd party (commercial) drivers because these must be changed too.
Bernd
[0]: Let alone if you (or your customers) need certificates from some
governmental agencys.
--
Firmix Software GmbH http://www.firmix.at/
mobil: +43 664 4416156 fax: +43 1 7890849-55
Embedded Linux Development and Services
--
But not many embedded Linux arches support 4K stacks like Adrian pointed out earlier. So the same (lot of man power requirement) would apply to Linux. Sure it will be good - but how reasonable it is to attempt it and how reliably it will work under all conceived loads - those are the questions. Thanks Parag --
What is an "embedded Linux arch"? Of course. Look at the amount of work done by lots of people in that If you "develop" an embedded system (which is partly system integration of existing apps) to be installed in the field, you don't have that many conceivable work loads compared to a desktop/server system. And you have a fixed list of drivers and applications. A usual approach is to run stress tests on several (or all) subsystems/services/... in parallel and if the device survives it functioning correctly, it is at least good enough. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services --
Hah! Not in my line of embedded device. 32MB no-MMU ARM boards which people run new things and attach new devices to rather often - without making new hardware. Volume's too low per individual application to get new hardware designed and made. I'm seriously thinking of forwarding porting the 4 year old firmware from 2.4.26 to 2.6.current, just to get new drivers and capabilities. Backporting is tedious, so's feeling wretchedly far from the mainline Per application. Some little devices run hundreds of different applications and customers expect to customise, script themselves, and attach different devices (over USB). The next customer in the chain expects the bits you supplied to work in a variety of unexpected situations, even when you advise that it probably won't do that. Much like desktop/server Linux, but on a small device where silly little things like 'create a process' are a stress for the dear little thing. (My biggest lesson: insist on an MMU next time!) -- Jamie --
Yes, you may have several products on the same hardware with somewhat differing requirements (or not). But that is much less than a general That sounds reasonable (and I never meant maintaining the old system infinitely. Actually once the thing is shipped it usually enters deep ACK. But that also depends on amount local changes (and sorry, but not Basically their problem. Yes, "they" actually think they get a Linux system where they can do everything and it simply works. Oh, that's obviously not a usual "WLAN-router style" of product (where ACK. We avoid MMU-less hardware too - especially since there is enough hardware with a MMU around. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services --
It is, but the idea that small embedded systems go through a 'all
components are known, drivers are known, test and if it passes it's
Sounds reasonable, but it's vetoed for anticipated time and cost,
compared with backporting on demand. Fair enough, since 2.6.current
doesn't support ARM no-MMU last I heard ('soon'?).
On the other hand, the 2.6 anti-fragmentation patches, including
latest SLUB stuff, ironically meant to help big machines, sound really
appealing for my current problem and totally unrealistic to
I can't emphasise enough how much difference MMU makes to Linux userspace.
It's practically: MMU = standard Linux (with less RAM), have everything.
No-MMU = lots of familiar 'Linux' things not available or break.
-- Jamie
--
Not always but often enough. And yes, there is ARM-based embedded That is to be expected;-) ACK. And tell that a customer that everything is more effort and more risk and not just "simply cross-compile it as it runs on my desktop too". Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services --
And lots of things work in the usual way... Of course the flip side is that for people who have platforms without MMU they can run something more than the mostly "toy" like operating systems typically available. There are plenty of problem domains that the non-MMU limitations are not a problem for. (Yours doesn't sound like one of them :-) Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
But given you have hardware you can't change would you choose to not run Linux, even with the limitations of non-MMU? Hell no :-) Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
Falling prices are no reason to increase the amount of available RAM (or other hardware). Especially if you (intend to) build >1E5 devices - where every Euro counts. Bernd -- Firmix Software GmbH http://www.firmix.at/ mobil: +43 664 4416156 fax: +43 1 7890849-55 Embedded Linux Development and Services --
You implicitely assume both would solve the same problem.
While 4kB stacks are something we anyway never got 100% working, the
cases where gcc inlining functions causes a critical increase in stack
usage are usually not that hard to find, and once found the fix is
trivial.
We should anyway monitor stack usages better since we have frequent
programming errors in this area, and problems caused by gcc can this
way be detected en passant.
You have a good point that aiming at 4kB makes 8kB a very safe choice.
But I do not think the problem you'd solve with
-fno-inline-functions-called-once is big enough to warrant the size
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
I'm just saying that your logic doesn't hold water. If we can save kernel stack usage, then a 1% increase in kernel size is What? Don't be silly. Linux _historically_ always used 4kB stacks. No, they are likely not usable on x86-64, but dammit, they should be more You continually try to see the inlining as a single solution to one problem (debuggability, stack, whatever). The biggest problem with gcc inlining has always been that it has been _unpredictable_. It causes problems in many different ways. It has caused stability issues due to gcc versions doing random things. It causes the stack expansion. It makes stack traces harder for debugging, etc. If it was any one thing, I wouldn't care. But it's exactly the fact that it causes all these problems in different areas. Linus --
From some tests the size increase seems to become bigger for smaller
kernels, but I don't have any really good data.
An interesting question is why most of our architectures for embedded
devices only offer bigger stacks:
The only architectures offering a 4kB stacks option are:
- m68knommu
- sh
- 32bit x86
The following architectures that are used in embedded devices
always use 8kB stacks (or bigger) in your tree:
- arm
- avr32
- blackfin
- cris
- frv
- h8300
- m32r
- m68k
- mips
- mn10300 (has an #ifdef CONFIG_4KSTACKS but no kconfig option)
- powerpc
When did we get callpaths like like nfs+xfs+md+scsi reliably
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
XFS may never have been usable, but the rest, sure. And you seem to be making this whole argument an excuse to SUCK, adn an excuse to let gcc crap even more on our stack space. Why? Why aren't you saying that we should be able to do better? Instead, you seem to asking us to do even worse than we do now? Linus --
My main point is:
- getting 4kB stacks working reliably is a hard task
- having an eye on gcc increasing the stack usage, and fixing it if
required, is relatively easy
If we should be able to do better at getting (and keeping) 4kB stacks
working, then coping with possible inlining problems caused by gcc
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
Out of the architectures you've mentioned for 4k stacks, they also tend to do IRQ stacks, which is something you seem to have overlooked. In addition to that, debugging the runaway stack users on 4k tends to be easier anyways since you end up blowing the stack a lot sooner. On sh we've had pretty good luck with it, though most of our users are using fairly deterministic workloads and continually profiling the footprint. Anything that runs away or uses an insane amount of stack space needs to be fixed well before that anyways, so catching it sooner is always preferable. I imagine the same case is true for m68knommu (even sans IRQ stacks). Things might be more sensitive on x86, but it's certainly not something that's a huge problem for the various embedded platforms to wire up, whether they want to go the IRQ stack route or not. In any event, lack of support for something on embedded architectures in the kernel is more often due to apathy/utter indifference on the part of the architecture maintainer rather than being indicative of any intrinsic difficulty in supporting the thing in question. Most new "features" on the lesser maintained architectures tend to end up there either out of peer pressure or copying-and-pasting accidents rather than any sort of design. ;-) --
No, I am aware of that, and on i386 IRQ stacks are only used with
4kB stacks.
CONFIG_DEBUG_STACKOVERFLOW should give you the same information, and if
How many platforms use 4kB stacks on sh?
Only 1 out of 34 defconfigs uses it.
arm or powerpc aren't exactly lesser maintained architectures.
4kB has shown to be a hard to achieve limit. After more than 4 years in
mainline being available on i386 there are still cases where 4kB are not
enough.
IMHO there seems to currently be a mismatch between it's maintainance
cost and the actual number of users. That's in my opinion the main
problem with it, no matter in which direction it gets resolved.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
In some cases, yes. In the CONFIG_DEBUG_STACKOVERFLOW case the check is only performed from do_IRQ(), which is sporadic at best, especially on tickless. While it catches some things, it's not a complete solution in and of iteslf. In addition to this, there are even fewer platforms that support it than there are platforms that do 4k stacks. At first glance, it looks like it's only m32r, powerpc, sh, x86, and xtensa. Others support the Kconfig option, but don't seem to realize that it's not an option that the kernel does anything with by itself, and so don't actually do anything (ie, The defconfigs tend to enable as much random stuff as people are interested in for development and testing purposes. Most of these end up being reference boards and are the basis for products, rather than shipping products themselves. In the latter case, everything is gradually tightened down, and 4k stack utilization in that case is the norm, rather Indeed, which is why I find it bizarre that you would even bother applying what was said to those platforms. Specifically I was referring to the embedded platforms that don't do 4k stacks today. The fact they don't support them today has much less to do with 4k being an unattainable limit as it does with people simply not bothering to Perhaps that's true on x86, but in general I take issue with that. On sh we've had to do very little maintenance for it and most shipping products are using it today (at least on MMU-Linux, we don't bother with it on nommu). Most of the problems we ran in to with 4k stacks tended to be stuff that we wanted to fix for 8k anyways. I suspect that this case is true for the other embedded platforms also. Note that on sh we also conditionalize IRQ stacks separately, so while they are often used together, it's possible to use 4k stacks without resorting to IRQ stacks (as m68knommu also seems to do). --
Thanks for the pointer, I'll take a look at it! --
As far as I can see the only architectures that optionally offer 4kB
stacks today are m68knommu, s390, sh and x86.
Most stack issues are not platform or architecture specific.
The maintainance effort therefore mostly depends on whether a non-zero
number of architectures uses 4kB stacks.
And if something is considered to be important for small embedded
systems, but not supported on ARM, MIPS or PowerPC, then that's
a bit strange.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
Yep, definitely true for m68knommu in my experience. I haven't had any problems with 4k stacks recently. But yes the workloads do tend to be constrained - and almost never use any of the more exotic Indeed :-) Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com --
> You have a good point that aiming at 4kB makes 8kB a very safe choice. Not really no - we use separate IRQ stacks in 4K but not 8K mode on x86-32. That means you've actually got no more space if you are unlucky with the timing of events. The 8K mode is merely harder to debug. If 4K stacks really are not safe then x86-32 really really needs to switch to using IRQ stacks in 8K stack mode as well. Alan --
By your logic though, XFS on x86 should work fine with 4K stacks - many will attest that it does not and blows up due to stack issues. I have first hand experiences of things blowing up with deep call chains when using 4K stacks where 8K worked just fine on same workload. So there is definitely some other problem with 4K stacks. Thanks Parag --
Nothing of the sort. If it blows up with a 4K stack it will almost certainly blow up with an 8K stack *eventually* - when a heavy stack usage coincides with a heavy stack using IRQ handler. You won't catch it in simple testing, you won't catch it in trivial simulation and it'll be incredibly hard to reproduce. Not the kind of bug you want in a production system really. IRQ stacks make things much more predictable. Alan --
I see - so if I end up having a workload on 8k where heavy stack using IRQs and deep kernel call chains come at the same time - even 8K will blow up. So 4K will blow too except that it doesn't require IRQs also to use heavy stack, just XFS is good enough :) It then seems like the IRQs using lot of stack is not so much of a problem in the current kernel as much as deeper call chains and stack usage of normal non-irq path code is. So 8k makes it possible for the deeper call chains of non-irq path to survive since they get better part of the 8K to themselves and IRQs can do with less almost always. At least that's what I can derive from the fact that we do not have lots of reports of 8K stack blowing up. Thanks Parag --
e1000_check_options builds a struct (singular) on the stack, really...
struct e1000_option is reasonably small.
The problem, which has also shown itself in large ioctl-style switch{}
statements, is that gcc will generate code such that the stack usage
from independent code branches
if {cond1} {
char buster1[1000];
foo(buster1);
} else if (cond2) {
char buster2[1000];
foo(buster2);
}
are added together, not noticed as mutually exclusive.
Of course, adding 'static const' as you noted is a reasonable
workaround, but gcc is really annoying WRT stack allocation in this manner.
I had problems in the past, before struct ethtool_ops, with like ethtool
ioctl switch statements using gobs of stack. In fact, that was a big
motivation for struct ethtool_ops.
Jeff
--
No it doesn't.
Look a bit more closely.
It builds a struct (singular) MANY MANY times. It also then builds up a
huge e1000_opt_list[] array, even though it is const and should be static
(and const).
I know. I wrote a patch to FIX it.
Here's the patch. It shrinks the stack from 1152 bytes to 192 bytes (the
first version, that only did the e1000_option part, got it down to 600
bytes). About half comes from not using multiple "e1000_option"
structures, the other half comes from turning the "e1000_opt_list[]"
arrays into "static const" instead, so that gcc doesn't copy them onto the
stack.
Most of the patch is actually doing things like turning
struct struct e1000_option opt = {
(which declares a _new_ e1000_option variable each time) into
opt = (struct e1000_option) {
which just re-uses the single variable.
It becomes slightly larger than that, because some places the "opt = .."
had to be moved around, since it's no longer a variable declaration, but a
regular assignment.
The rest is just adding "const" to the right places, and turning
struct e1000_opt_list speed_list[] = ..
into
static const struct e1000_opt_list speed_list[] = ..
instead, and fixing the indentation to be more straightforward.
I have not tested the dang thing, but I think it's correct. And it turns
stack usage from "totally horrible and broken" into "pretty reasonable".
Linus
---
drivers/net/e1000/e1000_param.c | 81 +++++++++++++++++++++-----------------
1 files changed, 45 insertions(+), 36 deletions(-)
diff --git a/drivers/net/e1000/e1000_param.c b/drivers/net/e1000/e1000_param.c
index b9f90a5..213437d 100644
--- a/drivers/net/e1000/e1000_param.c
+++ b/drivers/net/e1000/e1000_param.c
@@ -208,7 +208,7 @@ struct e1000_option {
} r;
struct { /* list_option info */
int nr;
- struct e1000_opt_list { int i; char *str; } *p;
+ const struct e1000_opt_list { int i; char *str; } *p;
} l;
} arg;
};
@@ -242,7 +242,7 @@ static ...totally cool patch afaics - if I still maintained this driver I'd have this tested and merged right away :) I suppose Jeff Kirsher is already doing so right now. I suppose that he'll have to look at the other Intel ethernet drivers as well :) Jeff, please add my: Reveiewed-by: Auke Kok <auke-jan.h.kok@intel.com> Cheers, --
You suppose correctly. Will do. -- Cheers, Jeff --
With /just/ DEBUG_PAGE_ALLOC defined, I have seen two general panic types: o A new double fault w/ SMP_DEBUG_PAGEALLOC problem (prob4.txt) o The NULL pointer dereference @ 0x858 (prob4a.txt) Enabling SLUB debugging to see what that shows Alan
Adding in SLUB debugging doesn't show anything new (I think). Example boot log (w/ initcall_debug enabled) is at: http://free.linux.hp.com/~adb/bug.11342/prob5.txt This has happened 3 times in a row as well. Whilst this is being looked at, I'm going to fast-forward ahead to the latest in Linus' tree, and see if the problem is still occurring (I think Linus' point earlier about some sort of rogue timing and/or corruption bug is spot on, but it's probably better to see how close to "today's tree" I can reproduce this). I'll also try kernels w/ the problematic merge patch backed out to see if that still "fixes" (or more likely(?) just patches over the real problem). Alan --
Yeah, that's a stack overflow. Confirmed. Linus --
I built a kernel @ commit 83097aca8567a0bd593534853b71fe0fa9a75d69 Author: Arjan van de Ven <arjan@linux.intel.com> Date: Sat Aug 23 21:45:21 2008 -0700 And it fails like the others do o http://free.linux.hp.com/~adb/bug.11342/prob6.txt SMP_DEBUG_PAGEALLOC o http://free.linux.hp.com/~adb/bug.11342/prob6a.txt [ 7.591198] BUG: unable to handle kernel NULL pointer dereference at 0000000000000858 I then backed out /just/ the merge for commit 1c89ac55017f982355c7761e1c912c88c941483d Merge: 88fa08f... b1b135c... Author: Linus Torvalds <torvalds@linux-foundation.org> Date: Tue Aug 12 08:40:19 2008 -0700 And the machine has booted fine 5 times in a row. I've put the latest .config up at http://free.linux.hp.com/~adb/bug.11342/config.txt Is there /some/ way to break down the patches within the merged patch, and I could by-hand bisect through those? Here's what I did to take the latest tree, and back out that merge (to get booting kernels): git-diff 88fa08f67bee1a0c765237bdac106a32872f57d2..1c89ac55017f982355c7761e1c912c88c941483d | patch -p1 -R patching file Documentation/lguest/lguest.c patching file arch/powerpc/Kconfig patching file arch/x86/Kconfig patching file arch/x86/mm/Makefile patching file drivers/char/hvc_console.c patching file drivers/lguest/page_tables.c patching file include/linux/Kbuild Hunk #1 succeeded at 358 (offset 2 lines). patching file include/linux/init.h patching file include/linux/mm.h patching file init/main.c patching file kernel/module.c patching file kernel/stop_machine.c patching file mm/Kconfig patching file mm/util.c Alan --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11354 Subject : AMD Elan regression with 2.6.27-rc3 Submitter : Sean Young <sean@mess.org> Date : 2008-08-15 18:37 (9 days old) References : http://marc.info/?l=linux-kernel&m=121882578430056&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11355 Subject : Regression in 2.6.27-rc2 when cross-building the kernel Submitter : Larry Finger <Larry.Finger@lwfinger.net> Date : 2008-08-16 2:38 (8 days old) References : http://marc.info/?l=linux-kernel&m=121885432118368&w=4 --
As I wrote in the Bugzilla, I'm seeing a related problem. Namely, I build kernels on one box, with 'make O=<target>', then I mount <target> on another one over NFS, 'cd' to it and try to install the kernel modules with 'make modules_install'. This results in 'HOSTCC firmware/ihex2fw' and 'fatal error: ...: Read-only file system'. It's readily reproducible. Commenting out line 1130 of Makefile ("$(Q)$(MAKE) -f $(srctree)/scripts/Makefile.fwinst obj=firmware __fw_modinst") obviously helps, so it looks like Makefile.fwinst needs fixing. Thanks, Rafael --
I don't like this much, but it should do the trick... please confirm. diff --git a/firmware/Makefile b/firmware/Makefile index aab12bf..e7130b5 100644 --- a/firmware/Makefile +++ b/firmware/Makefile @@ -166,15 +166,28 @@ $(patsubst %,$(obj)/%.gen.o, $(fw-external-y)): $(obj)/%.gen.o: $(fwdir)/% $(obj)/%: $(obj)/%.ihex | $(objtree)/$(obj)/$$(dir %) $(call cmd,ihex) +# Don't depend on ihex2fw if we're installing and it already exists. +# Putting it after | in the dependencies doesn't seem sufficient when +# we're installing after a cross-compile, because ihex2fw has dependencies +# on stuff like /usr/lib/gcc/ppc64-redhat-linux/4.3.0/include/stddef.h and +# thus wants to be rebuilt. Which it can't be, if the prebuilt kernel tree +# is exported read-only for someone to run 'make install'. +ifeq ($(INSTALL):$(wildcard $(obj)/ihex2fw),install:$(obj)/ihex2fw) +ihex2fw_dep := +else +ihex2fw_dep := $(obj)/ihex2fw +endif + + # .HEX is also Intel HEX, but where the offset and length in each record # is actually meaningful, because the firmware has to be loaded in a certain # order rather than as a single binary blob. Thus, we convert them into our # more compact binary representation of ihex records (<linux/ihex.h>) -$(obj)/%.fw: $(obj)/%.HEX $(obj)/ihex2fw | $(objtree)/$(obj)/$$(dir %) +$(obj)/%.fw: $(obj)/%.HEX $(ihex2fw_dep) | $(objtree)/$(obj)/$$(dir %) $(call cmd,ihex2fw) # .H16 is our own modified form of Intel HEX, with 16-bit length for records. -$(obj)/%.fw: $(obj)/%.H16 $(obj)/ihex2fw | $(objtree)/$(obj)/$$(dir %) +$(obj)/%.fw: $(obj)/%.H16 $(ihex2fw_dep) | $(objtree)/$(obj)/$$(dir %) $(call cmd,h16tofw) $(firmware-dirs): -- dwmw2 --
Yes, the patch fixes the problem. Thanks. Larry --
Ok, I have a small handful of fixes to send to Linus for 2.6.27; Unless Sam (or someone else) comes up with a better answer, I'll make sure that goes with them. Thanks. -- dwmw2 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11356 Subject : Linux 2.6.27-rc3 - build failure: undefined reference to `.lockdep_count_forward_deps' Submitter : Frans Pop <elendil@planet.nl> Date : 2008-08-16 19:11 (8 days old) References : http://marc.info/?l=linux-kernel&m=121891396320127&w=4 --
Fixed as per: http://marc.info/?l=linux-kernel&m=121898767530602&w=4 Adrian mentioned that he'd closed the bug, but apparently not. --
The bug is closed now. Thanks, Rafael --
Sorry, I missed that Rafael had opened two bugs for two people reporting
the same issue, and only closed the other one.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11357 Subject : Can not boot up with zd1211rw USB-Wlan Stick Submitter : uwe <kender@freenet.de> Date : 2008-08-16 14:17 (8 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11358 Subject : net: forcedeth call restore mac addr in nv_shutdown path Submitter : Yinghai Lu <yhlu.kernel@gmail.com> Date : 2008-08-17 3:30 (7 days old) References : http://marc.info/?l=linux-kernel&m=121894389018584&w=4 Handled-By : Yinghai Lu <yhlu.kernel@gmail.com> Patch : http://marc.info/?l=linux-kernel&m=121894389018584&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11360 Subject : mpc8xxx_wdt.c doesn't build modular Submitter : Dave Jones <davej@redhat.com> Date : 2008-08-17 08:07 (7 days old) References : http://lkml.org/lkml/2008/8/12/465 Handled-By : Anton Vorontsov <avorontsov@ru.mvista.com> Patch : http://lkml.org/lkml/2008/8/13/344 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11361 Subject : my servers with nvidia mcp55 nic don't work with msi in second kernel by kexec Submitter : Yinghai Lu <yhlu.kernel@gmail.com> Date : 2008-08-17 6:25 (7 days old) References : http://marc.info/?l=linux-kernel&m=121895439927053&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Patch : http://marc.info/?l=linux-kernel&m=121917167232014&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11379 Subject : char/tpm: tpm_infineon no longer loaded for HP 2510p laptop Submitter : Frans Pop <elendil@planet.nl> Date : 2008-08-18 13:40 (6 days old) References : http://marc.info/?l=linux-kernel&m=121906698213329&w=4 Handled-By : Bjorn Helgaas <bjorn.helgaas@hp.com> --
Fixed with:
commit 5e4c6564c95ce127beeefe75e15cd11c93487436
Author: Kay Sievers <kay.sievers@vrfy.org>
Date: Thu Aug 21 15:28:56 2008 +0200
pnp: fix "add acpi:* modalias entries"
--
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11382 Subject : e1000e: 2.6.27-rc1 corrupts EEPROM/NVM Submitter : David Vrabel <david.vrabel@csr.com> Date : 2008-08-08 10:47 (16 days old) References : http://marc.info/?l=linux-kernel&m=121819267211679&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11380 Subject : lockdep warning: cpu_add_remove_lock at:cpu_maps_update_begin+0x14/0x16 Submitter : Ingo Molnar <mingo@elte.hu> Date : 2008-08-20 6:44 (4 days old) References : http://marc.info/?l=linux-kernel&m=121921480931970&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11388 Subject : 2.6.27-rc3 warns about MTRR range; only 3 of 16gb of memory is usable Submitter : Joshua Hoblitt <j_kernel@hoblitt.com> Date : 2008-08-20 17:38 (4 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11398 Subject : hda_intel: IRQ timing workaround is activated for card #0. Suggest a bigger bdl_pos_adj. Submitter : Frans Pop <elendil@planet.nl> Date : 2008-08-21 17:17 (3 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11401 Subject : pktcdvd: BUG, NULL pointer dereference in pkt_ioctl, bisected Submitter : Laurent Riffard <laurent.riffard@free.fr> Date : 2008-08-22 08:16 (2 days old) --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11404 Subject : BUG: in 2.6.23-rc3-git7 in do_cciss_intr Submitter : rdunlap <randy.dunlap@oracle.com> Date : 2008-08-21 5:52 (3 days old) References : http://marc.info/?l=linux-kernel&m=121929819616273&w=4 http://marc.info/?l=linux-kernel&m=121932889105368&w=4 Handled-By : Miller, Mike (OS Dev) <Mike.Miller@hp.com> James Bottomley <James.Bottomley@hansenpartnership.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11402 Subject : skbuff bug? Submitter : Yinghai Lu <yhlu.kernel@gmail.com> Date : 2008-08-21 3:56 (3 days old) References : http://marc.info/?l=linux-kernel&m=121929102707658&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11403 Subject : 2.6.27-rc2 USB suspend regression Submitter : Jeremy Fitzhardinge <jeremy@goop.org> Date : 2008-08-20 20:48 (4 days old) References : http://marc.info/?l=linux-kernel&m=121926536103630&w=4 Handled-By : Alan Stern <stern@rowland.harvard.edu> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11405 Subject : 2.6.27-rc3 segfault on cold boot; not on warm boot. Submitter : David Greaves <david@dgreaves.com> Date : 2008-08-21 9:45 (3 days old) References : http://marc.info/?l=linux-kernel&m=121931198904777&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11406 Subject : patch "x86: MOVE PCI IO ECS code to x86/pci" breaks CPU hotplug Submitter : Jan Beulich <jbeulich@novell.com> Date : 2008-08-21 12:59 (3 days old) References : http://marc.info/?l=linux-kernel&m=121932366326572&w=4 Handled-By : Ingo Molnar <mingo@elte.hu> Robert Richter <robert.richter@amd.com> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11410 Subject : SLUB list_lock vs obj_hash.lock... Submitter : Daniel J Blueman <daniel.blueman@gmail.com> Date : 2008-08-22 21:48 (2 days old) References : http://marc.info/?l=linux-kernel&m=121944176609042&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11407 Subject : suspend: unable to handle kernel paging request Submitter : Vegard Nossum <vegard.nossum@gmail.com> Date : 2008-08-21 17:28 (3 days old) References : http://marc.info/?l=linux-kernel&m=121933974928881&w=4 Handled-By : Rafael J. Wysocki <rjw@sisk.pl> Pekka Enberg <penberg@cs.helsinki.fi> Pavel Machek <pavel@suse.cz> --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11409 Subject : build issue #564 for v2.6.27-rc4 : undefined reference to `NS8390p_init' Submitter : Toralf Förster <toralf.foerster@gmx.de> Date : 2008-08-22 8:33 (2 days old) References : http://marc.info/?l=linux-kernel&m=121939410214677&w=4 Handled-By : Alan Cox <alan@lxorguk.ukuu.org.uk> Patch : http://marc.info/?l=linux-kernel&m=121943097320451&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11413 Subject : get_rtc_time() triggers NMI watchdog in hpet_rtc_interrupt() Submitter : Mikael Pettersson <mikpe@it.uu.se> Date : 2008-08-23 9:48 (1 days old) References : http://marc.info/?l=linux-kernel&m=121948503224161&w=4 Handled-By : Ingo Molnar <mingo@elte.hu> Patch : http://marc.info/?l=linux-kernel&m=121950734922457&w=4 --
This message has been generated automatically as a part of a report of recent regressions. The following bug entry is on the current list of known regressions from 2.6.26. Please verify if it still should be listed and let me know (either way). Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11414 Subject : Random crashes with 2.6.27-rc3 on PPC Submitter : Michael Buesch <mb@bu3sch.de> Date : 2008-08-23 14:10 (1 days old) References : http://marc.info/?l=linux-kernel&m=121950076812616&w=4 --
It would be good to have some kind of bisection of this one, because it looks pretty odd. Also, google doesn't find anybody else seeing that "segfault at ffffffbf", even though it seems to be very consistent for David. So I don't think we'll be able to even _guess_ where it is without some more information about exactly when it started happening. Since it's present in 2.6.26 too, it's clearly not a regression from that one, but perhaps more importantly, since it's apparently an old one I'd have expected more reports like this if it was some common problem. And the warm-vs-cold-boot thing makes me think it's some hardware setup issue. Possibly the disk controller, possibly the CPU (eg some MTRR/PAT setup issue or TLB thing). But the dmesg's are all from late enough at boot that I can't even tell what disk controller it is (except that it is SATA), nor can I tell what CPU it is. But again, if it was some MTRR/PAT issue, I'd expect a _lot_ more reports of this. MD/XFS sounds unlikely, since they should have absolutely nothing that could possibly matter for cold/hot boot. Linus --
OK, that all makes sense. Given that I'll manage at best 1 bisect/day with a reasonable chance of data corruption and hardware intermittency screwing it all up I thought it best to ask first in case there was another debug approach that could work. However since it does indeed sounds somewhat hardware related and it's an isolated problem for my wife (as opposed to a problem that others are having too) then I think she deserves a new machine... Thanks for the impetus to cheer her up ;) David PS if anyone really is interested then I am happy to try the bisection once I've moved her to a new box; otherwise I'm happy to close this. --
Well, regardless, I think it would be good to fill in the hardware info, especially wrt CPU data and the exact SATA controller you have. There's another regression for SATA cold/hot boot issues, and while that one looks very different, and is probably not really related, it's still a good idea to try to see if we can match them up. See Bug-Entry : http://bugzilla.kernel.org/show_bug.cgi?id=11343 Subject : SATA Cold Boot Problems with 2.6.27-rc[23] on nVidia 680i Submitter : Manny Maxwell <mannymax@mannymax.net> Date : 2008-08-14 4:16 (10 days old) References : http://marc.info/?l=linux-kernel&m=121868782917600&w=4 which actually has a patch, and which seems to work fine in 2.6.26 (so not only is failure pattern different, the point were it starts is different). But regardless of the big differences, it does seem to point to some weakness in SATA initialization. But is it limited to _that_ particular SATA controller, or just a few ones? Or a generic issue? Without more reports to really find a pattern, I don't think we have a clue, and the two may be _totally_ unrelated in all ways, but it would be good to at least report and log the information you have.. Oh, I just noticed that your dmesg _does_ mention sata_sil and sata_via, so we know which of two drivers it would be, at least. Not the nVidia one. However, there's been tons of changes in soem core functions: both the reset handling and the wait-for-ready has changed and caused lots of churn I think it would be good to try to bisect. It could be something that is really just limited to that particular machine (maybe it really is some flaky hardware that just triggers some timing changes), but more likely it isn't. So the more information, the better. So keep the thing open as long as somebody is willing to try to gather more info, by all means. Linus --
This one now has a suggested patch for Daniel to try from Vegard, but no reply yet: http://marc.info/?l=linux-kernel&m=121946972307110&w=4 Vegard, I think your patch is a bit odd, though. The result of your patch is - first loop: hlist_for_each_entry_safe(obj, node, tmp, &db->list, node) { hlist_del(&obj->node); hlist_add_head(&obj->node, &freelist); } and quite frankly, I don't see what the difference between that and a something like a simple struct hlist_node *first = bd->list.first; if (first) { bd->list.first = NULL; first->pprev = &first; } really is? I dunno. We don't have list splicing ops for the hlist things. Linus --
On Sun, Aug 24, 2008 at 8:03 PM, Linus Torvalds Hm. I haven't really used the hlists before, so my first instinct was to do what is obvious. That's also why I put the XXX comment. Other than that, I guess open-coding list ops is also not very good programming practice? :-) But... feel free to submit your own patch. Oh, what am I saying. Vegard -- "The animistic metaphor of the bug that maliciously sneaked in while the programmer was not looking is intellectually dishonest as it disguises that the error is the programmer's own creation." -- E. W. Dijkstra, EWD1036 --
I do agree that the hlist versions aren't very nice in this regard. The Agreed. It would be better if the people who use hlists most (I think that Silly boy. Next you'll ask me to _test_ any patches I send out. Anyway, I think your patch is likely fine, I just thought it looked a bit odd to have a loop to move a list from one head pointer to another. But regardless, it would need some testing. Daniel? Linus --
Hi Linus, Vegard,
On Sun, Aug 24, 2008 at 7:58 PM, Linus Torvalds
This opens another lockdep report at boot-time [1] - promoting
pool_lock may not be the best fix?
We then see a new deadlock condition (on the pool_lock spinlock) [2],
which seemingly was avoided by taking the debug-bucket lock first.
We reproduce this by booting with debug_objects=1 and causing a lot of activity.
Daniel
--- [1]
[ INFO: possible irq lock inversion dependency detected ]
2.6.27-rc4-225c-debug #3
---------------------------------------------------------
rcu_sched_grace/9 just changed the state of lock:
(pool_lock){-...}, at: [<ffffffff80466c2c>] free_object+0x7c/0xc0
but this lock was taken by another, hard-irq-safe lock in the past:
(xtime_lock){++..}
and interrupts could create inverse lock ordering between them.
other info that might help us debug this:
no locks held by rcu_sched_grace/9.
the first lock's dependencies:
-> (pool_lock){-...} ops: 59 {
initial-use at:
[<ffffffff8026f795>]
__lock_acquire+0x1a5/0x1160
[<ffffffff802707e1>]
lock_acquire+0x91/0xc0
[<ffffffff806a2bb1>] _spin_lock+0x41/0x80
[<ffffffff804675fa>]
__debug_object_init+0x14a/0x3e0
[<ffffffff804678df>]
debug_object_init+0x1f/0x30
[<ffffffff80260eee>]
hrtimer_init+0x2e/0x50
[<ffffffff80237571>]
init_rt_bandwidth+0x41/0x60
[<ffffffff812f3810>]
sched_init+0x72/0x63d
[<ffffffff812e17e2>]
start_kernel+0x19c/0x456
[<ffffffff812e10a9>]
x86_64_start_reservations+0x99/0xb6
[<ffffffff812e11d0>]
x86_64_start_kernel+0xe2/0xe9
...This one looks irritating. It's bisected to 5b6155ee70e9c4d2ad7e6f514c8eee06e2711c3a ("pktcdvd: push BKL down into driver"), but the problem goes deeper than that. The "unlocked" ioctl's do not get a "struct inode *" pointer, they _only_ get the "struct file *". And this is very much historical usage, where some internal functions only passed in the inode (good or not, whatever). And ioctl_by_bdev() doesn't have a "struct file *" and has depended on passing in a NUMM "struct file *" and its own "struct inode *", and expects the ioctl's to just use that instead. But the unlocked ioctl just drops it on the floor, and uses just the (unusable) file pointer. Grr. And some other cases (like pkt_ioctl() itself) that simply pass in a _different_ inode than the file itself is attached to. It does blkdev_ioctl(pd->bdev->bd_inode, file, cmd, arg); where "file" points to the pkt_ioctl thing, but "inode" points to the inode "behind" the pkt interface. Double grr. I really think the only sane model is to literally make "unlocked_ioctl()" have the same calling convention as the old "ioctl()" thing had, and pass in both file * and inode *. It was a stupid "cleanup" to try to have a simpler interface for the unlocked version. Having two different models, where we have actually _depended_ on the old model and then are trying to convert to a (weaker) new model, is not a good idea. The alternative is to do this _only_ for the blkdev_ioctl's, and have those only take the "inode *", and then create a new fake "struct file *" to go with it, regardless of what "struct file" was passed in (exactly because the blockdev ones really think that the inode is the important part). Hmm? We need to fix this. Linus --
Why not just revert the offending change and try again during the next merge window, assuming someone has figured out an acceptable way to handle this mess by then? -- Peter Osterlund - petero2@telia.com http://web.telia.com/~u89404340 --
Well,, for 2.6.27 that's what we'll have to do. But there's actually a real problem here - the unlocked ioctl's (which we _should_ prefer) have a strictly weaker and worse interface. I also wonder if any other block_ioctl users were converted.. Anyway, I'll take your email as an ack for the revert. Linus --
Well, doing git log -p v2.6.26.. -Sunlocked_ioctl and looking for blkdev_ioctl, that does seem to be the only one. So hopefully no other case like this is lurking, although it is possible that non-block areas have similar issues. Linus --
Actually both interfaces are a fscking disaster. The right things to pass is neither and inode nor a file but a struct block_device. Al had all this work done a while and it just needs rebasing to a current tree: http://git.kernel.org/?p=linux/kernel/git/viro/bdev.git;a=summary --
Completely agreed. Al, I remember talking to you about this at the storage summit back in february. What are your current plans wrt moving this forward? -- Jens Axboe --
Rebased, with nfs parts of fmode_t patch taken out (irrelevant for bdev anyway and really better off in intent-killing queue). Other than that, it's a straight port... Same place, same branch. --
So what's your plan with this - 2.6.28? -- Jens Axboe --
Yes. The only nastiness is around drivers/ide - there it gets a bunch of
annoying conflicts from the ide-{disk,floppy}_ioctl.c splitoff. Other than
that, it's trivially ported on top of current linux-next. Merge order is
going to be interesting - depending on whether block merge happens before
or after ide one.
I'm going to put linux-next-based series on kernel.org tonight, before
going to Portland...
--
Easier just to fix it. Its a case of building everything until it compiles with the prototype change. Almost all stuff will just take the argument initially and not use it. Anyone else plan to do it or shall I hit all the x86 cases and post a patch ? --
Well, I alrady reverted it, but if you actually fix unlocked_ioctl() to have the same calling convention as regular ioctl() then a lot of the noise from ioctl conversion goes away, and all that remains is literally just the BKL part. Btw, why is unlocked_ioctl returning "long"? Does anybody depend on that too? That's another difference between the "unlocked" and the traditional version.. As to the "x86 cases", I think you should try to hit them all. Doing a "git grep unlocked_ioctl" gets 185 entries, and it looks like only something like 8 of them are non-x86 (3 in the arch/ directory, five in s390 drivers). Of course, some of them may be drivers that aren't available on x86 for other reasons (ie the ARM embedded stuff), but regardless.. Anyway, the pure size of that patch makes me suspect that we might as well leave it until the next merge window, but if you do it and it's obviously totally mechanical, I'd be likely to just let it slip in early. Linus --
From: Linus Torvalds <torvalds@linux-foundation.org> The return values want to be "long" sign extended all the way back down to syscall dispatch, I think this is just an effort to add some consistency here so that the int --> long extension eventually can be eliminated once unlocked_ioctl is the only case left. --
Anybody doing this, don't forget to actually use "inode" instead of all those dereferences: struct inode *inode = filp->f_path.dentry->d_inode; --
I don't know - a lot of syscall returns got defined as long and I guess I'll take a crack at it tomorrow - but if its 185 entries then it probably wants to go into -next instead. Alan --
Being more careful.. This: git grep 'unlocked_ioctl.*=' | sed 's/^.*=[ ]*\([_a-zA-Z0-9]*\).*$/\1/' | uniq | wc says that ther are 160 distinct cases. I'm not sure it catches everything exactly, but it will be reasonably close, at least. I wonder if I could essentially automate something to do the conversion.. Linus --
What the hell. Here's a test patch. A largish part of it was generated through a stupid script that basically did a number of grep + 'sed' on a lot of files, and then the rest was fixed up manually after running "make allmodconfig". I'm not going to guarantee anything, but it gets close. A starting point for somebody else, and considering that it is 208 files changed, 370 insertions(+), 376 deletions(-) this is definitely linux-next material. The extra deletions are mainly because the passing of "inode" as an argument means that some functions don't need to look it up manually any more. And yeah, I changed the return type to "int". There's no way the kernel can validly return anything bigger than that anyway. And this way all the ioctl functions have the same type, no confusion. TOTALLY UNTESTED apart from the fact that it compiles. Linus --- arch/mips/sibyte/common/sb_tbprof.c | 2 +- arch/parisc/kernel/perf.c | 4 +- arch/sparc/kernel/apc.c | 2 +- arch/x86/kernel/apm_32.c | 2 +- arch/x86/kernel/cpu/mcheck/mce_64.c | 2 +- arch/x86/kernel/cpu/mtrr/if.c | 3 +- block/bsg.c | 2 +- block/compat_ioctl.c | 18 +++++++++++----- block/ioctl.c | 3 +- drivers/block/DAC960.c | 2 +- drivers/block/cciss.c | 4 +- drivers/block/loop.c | 3 +- drivers/block/paride/pt.c | 4 +- drivers/char/agp/agp.h | 2 +- drivers/char/agp/compat_ioctl.c | 2 +- drivers/char/agp/frontend.c | 2 +- drivers/char/ds1302.c | 2 +- drivers/char/dsp56k.c | 2 +- drivers/char/efirtc.c | 4 +- drivers/char/ip2/ip2main.c | 6 ++-- drivers/char/ip27-rtc.c | 4 +- drivers/char/ipmi/ipmi_devintf.c | 2 +- ...
Peter? Ingo? Alok?
This _looks_ like it might be due to "x86: merge the TSC cpu-freq code"
thing by Alok, where we do this:
+static struct notifier_block time_cpufreq_notifier_block = {
+ .notifier_call = time_cpufreq_notifier
+};
+
+static int __init cpufreq_tsc(void)
+{
+ cpufreq_register_notifier(&time_cpufreq_notifier_block,
+ CPUFREQ_TRANSITION_NOTIFIER);
+ return 0;
+}
but that's just _insane_ if the CPU doesn't even support TSC to begin
with. Also, in the actual time_cpufreq_notifier(), we do:
if (cpu_has(&cpu_data(freq->cpu), X86_FEATURE_CONSTANT_TSC))
return 0;
and this is stupid because:
(a) if the CPU has no TSC at all, then it sure as hell won't have a
_constant_ one, so we'll actually continue into the function.
(b) and why the hell is this done at run-time in the notifier, and not in
the "cpufreq_tsc" init function? If anybody mixes totally different
kinds of CPU's in SMP, they deserve whatever they want.
so why is the patch not something like the appended?
Sean, does this make any difference for you?
Linus
---
arch/x86/kernel/tsc.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 46af716..9bed5ca 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -325,6 +325,10 @@ static struct notifier_block time_cpufreq_notifier_block = {
static int __init cpufreq_tsc(void)
{
+ if (!cpu_has_tsc)
+ return 0;
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
+ return 0;
cpufreq_register_notifier(&time_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
return 0;
--
I added this patch to x86/urgent. -hpa --
Hmm. Wasn't this already confirmed to be fixed by commit df60a8441866153d691ae69b77934904c2de5e0d? At least Adrian sent out an email saying "Confirmed, bug closed.", but bugzilla seems to disagree and still show it as open. Linus --
There were two different reports, Rafael opened a bug for each, and I
missed that there were two open bugs for the same issue.
The one I closed was #11344.
cu
Adrian
--
"Is there not promise of rain?" Ling Tan asked suddenly out
of the darkness. There had been need of rain for many days.
"Only a promise," Lao Er said.
Pearl S. Buck - Dragon Seed
--
On Saturday, 23 of August 2008, Rafael J. Wysocki wrote: Jeff, do you have the patches for these two in your queue? Rafael --
This appears to be a gcc bug when -fno-omit-stack-pointer is used (which we mostly don't need on ppc anyway except that another gcc stupidity makes it mandatory for -pg which ftrace needs). We're working on a two fold workaround: removing -fno-omit-stack-pointer in all the cases where we don't really need it, and for when we do (ie, CONFIG_FTRACE becaues of -pg), using -mno-sched-epilog which seems to work around it. The root cause in gcc hasn't been fully identified yet. Ben. --
Thanks Ben. I've already dropped it from the list of recent regressions. Rafael --
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: R |
