Hi I am getting stack overflows on my Sparc64 station. They happen when I copy to device-mapper snapshot origin device using small IO size (512 bytes) and simultaneously execute "lvs" command. The kernel is compiled with most debugging functions enabled. The stack trace is this: __ide_end_request __blk_end_request __end_that_request_first req_bio_endio bio_endio clone_endio dec_pending bio_endio clone_endio dec_pending bio_endio clone_endio dec_pending bio_endio end_bio_bh_io_sync end_buffer_read_sync __end_buffer_read_notouch unlock_buffer wake_up_bit __wake_up_bit __wake_up __wake_up_common wake_bit_function autoremove_wake_function default_wake_function try_to_wake_up task_rq_lock __spin_lock lock_acquire __lock_acquire *** crash, stack overflow --- observations: That loop bio_endio->clone_endio->dec_pending is repeating for each level of nested devices --- so for any architecture there exists a level at which it causes trouble. We need something to prevent recursion, maybe the similar trick that was done with avoing bio request function recursion (i.e. if bio_endio is called recursively, it just adds the bio to queue and lets the top level to call endio method). Wait queue waking looks like being written by a high-level maniac --- it contains 8 levels of calls (none of them inlined). 7 of these calls (until try_to_wake_up) do nothing but pass arguments to lower level call. And each of these calls allocate at least 192 bytes of stack space. All these 7 useless calls consume 1360 bytes of stack (and cause windows traps that needlessly damage performance). Would you agree to inline most of the calls to save stack? Or do you see another solution? Long-term consideration: Is it possible to implement interrupt stacks on sparc64? Functions on sparc eat stack much more aggressively than on other architectures (minimum stack size for a function is 192 bytes). Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> Some of them could be inlined but there are a few limiting factors here. Even spin lock acquisitions are function calls, limiting how much leaf function and tail call optimizations can be done. Also, wake_up_bit has this aggregate local variable "key" whose address is passed down to subsequent functions, which limits optimizations even further. I had a patch but at the time I wrote it (several years ago) I couldn't make it stable enough to put mainline, I may resurrect it. I just did a quick scan and I can't find the last copy I had, and things have changed enough that I'd probably work from scratch anyways. But the level of recursion possible by the current device layer is excessive and needs to be curtained irrespective of these generic wakeup and sparc64 interrupt stack issues. --
I inlined three of them, I think I can inline another two. So hopefully, Tail call optimization is not done at all if you compile kernel with stack I fixed that too. BTW. what's the purpose of having 192-byte stack frame? There are 16 8-byte registers being saved per function call, so 128-byte frame should be sufficient, shoudn't? The ABI specifies that some additional entries must be present even if unused, but I don't see reason for them. Would something bad happen if GCC started to generate 128-byte stacks? Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> The callee can pop the arguments into the area past the register window. So you have the 128 byte register window save area, 6 slots for incoming arguments, which gives us 176 bytes. The rest is for some miscellaneous stack frame state, which I don't remember the details of at the moment. I'd have to read the sparc backend of gcc to remember. --
I see ... the callee writes arguments into caller's stack frame, if it has variable number of arguments. That it misdesign, the callee should write registers arguments into it's own frame like on AMD64 (then this space would be allocated only if needed). But nothing can be done with it since ABI was specified :-( --
From: Mikulas Patocka <mpatocka@redhat.com> The callee can do this even for non-variable argument lists. It's like a set of pre-allocated stack slots for those incoming argument registers when reloading under register pressure. In my opinion it is better to put this onus on the callee because only the callee knows if it needs to pop these values onto the stack to alleviate register pressure. I think it might be possible for the compiler to only use 176 bytes. I'll take a look at the gcc sparc backend and the ABI specification to see if this is the case. --
Yes, it could be shrunk to 176 bytes. Maybe there could be some performance problems if the spills are cacheline-unaligned. Or better --- make special -mkernel-abi function to gcc that will drop this area at all and make 128-byte frames. In kernel it wouldn't matter that ABI is incompatible. Mikulas --
I took another few traces (to track the whole stack content) and there is another problem: nested interrupts. Does Sparc64 limit them somehow? sys_call_table timer_interrupt irq_exit do_softirq __do_softirq run_timer_softirq _spin_unlock sys_call_table handler_irq handler_fasteoi_irq handle_irq_event ide_intr ide_dma_intr task_end_request ide_end_request __ide_end_request ... Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> Two levels should be the deepest you will ever see, and this is equivalent to what you get on other platforms. That path occurs when softirq processing re-enabled HW interrupts when returning from the top-level interrupt. --
And what if network softirq happened here? How much stack does it consume? The whole overflowed stack trace has 75 functions, I was able to get rid of 9 by avoiding bio_endio recursion and 10 by turning simple functions into inlines. --- so is it enough or not enough for possible networking calls? Maybe a good thing would be to add a check for stack size to __do_softirq and handing the softirq to ksoftirqd if there's not enough space. Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> It should be OK, because the minimum stack of a (75 - 19) depth call I'd rather it spit out a WARN_ON() message and a backtrace. Otherwise it will be considered a feature and people won't fix these deep call chains. --
I meant if some fancy networking options can eat those 19 frames that I saved and crash again? I use the computer as a workstation, it doesn't have high network load and it doesn't use any features except basic If you think that process context+network processing+hardirqs can fit into 75 nested functions... I really have no idea how much the networking takes, given the amount of protocols and features and inability to test them all in one lab, it looks very scary. Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> I agree, it looks scary to me too. Next week I'll make one of my primary projects the implementation of IRQ stacks for sparc64. --
From: David Miller <davem@davemloft.net>
Next week came real fast :-)
Amazingly this patch worked the first time I booted it up on my
dual-cpu workstation (using SMP + PREEMPT). I'm worried that maybe
gcc can do some clever things and not allow my extremely simple
approach to work, but anyways it might work for you and it's worth
giving a try.
sparc64: Implement support for IRQ stacks.
Signed-off-by: David S. Miller <davem@davemloft.net>
diff --git a/arch/sparc64/Kconfig.debug b/arch/sparc64/Kconfig.debug
index 6a4d28a..df80962 100644
--- a/arch/sparc64/Kconfig.debug
+++ b/arch/sparc64/Kconfig.debug
@@ -41,4 +41,11 @@ config FRAME_POINTER
depends on MCOUNT
default y
+config IRQSTACKS
+ bool "Use separate kernel stacks when processing interrupts"
+ help
+ If you say Y here the kernel will use separate kernel stacks
+ for handling hard and soft interrupts. This can help avoid
+ overflowing the process kernel stacks.
+
endmenu
diff --git a/arch/sparc64/kernel/irq.c b/arch/sparc64/kernel/irq.c
index b441a26..24820fa 100644
--- a/arch/sparc64/kernel/irq.c
+++ b/arch/sparc64/kernel/irq.c
@@ -674,10 +674,42 @@ void ack_bad_irq(unsigned int virt_irq)
ino, virt_irq);
}
+#ifdef CONFIG_IRQSTACKS
+void *hardirq_stack[NR_CPUS];
+void *softirq_stack[NR_CPUS];
+
+static void *set_hardirq_stack(void)
+{
+ void *orig_sp, *sp = hardirq_stack[smp_processor_id()];
+
+ __asm__ __volatile__("mov %%sp, %0" : "=r" (orig_sp));
+ if (orig_sp < sp ||
+ orig_sp > (sp + THREAD_SIZE)) {
+ sp += THREAD_SIZE - 192 - STACK_BIAS;
+ __asm__ __volatile__("mov %0, %%sp" : : "r" (sp));
+ }
+
+ return orig_sp;
+}
+static void restore_hardirq_stack(void *orig_sp)
+{
+ __asm__ __volatile__("mov %0, %%sp" : : "r" (orig_sp));
+}
+#else
+static void *set_hardirq_stack(void)
+{
+ return NULL;
+}
+static void restore_hardirq_stack(void *orig_sp)
+{
+}
+#endif
+
void handler_irq(int irq, struct pt_regs *regs)
{
unsigned long pstate, ...For me it doesn't work. Locked up after "console: colour dummy device 80x25". --
From: Mikulas Patocka <mpatocka@redhat.com> Machine type, compiler, and config please so I can debug this :-) --
Ultra 5, 360MHz, 256MB, compiler gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21). .config is attached. Mikulas
From: Mikulas Patocka <mpatocka@redhat.com> Are you sure you didn't see a "Stack overflow" message on the screen? :-) That's what I get when I try to boot with your provided kernel config. The problem is that CONFIG_STACK_DEBUG doesn't understand the IRQ stacks at all. I'll see if I can tweak it to handle this. --
From: David Miller <davem@davemloft.net> This patch, on top of my original IRQSTACKS patch for sparc64, seems to get things working for me. diff --git a/arch/sparc64/lib/mcount.S b/arch/sparc64/lib/mcount.S index 9e4534b..24e6a3c 100644 --- a/arch/sparc64/lib/mcount.S +++ b/arch/sparc64/lib/mcount.S @@ -45,11 +45,47 @@ _mcount: sub %g3, STACK_BIAS, %g3 cmp %sp, %g3 bg,pt %xcc, 1f - sethi %hi(panicstring), %g3 + nop +#ifdef CONFIG_IRQSTACKS + lduh [%g6 + TI_CPU], %g1 + sethi %hi(hardirq_stack), %g3 + or %g3, %lo(hardirq_stack), %g3 + sllx %g1, 3, %g1 + ldx [%g3 + %g1], %g7 + sub %g7, STACK_BIAS, %g7 + cmp %sp, %g7 + bleu,pt %xcc, 2f + sethi %hi(THREAD_SIZE), %g3 + add %g7, %g3, %g7 + cmp %sp, %g7 + blu,pn %xcc, 1f +2: sethi %hi(softirq_stack), %g3 + or %g3, %lo(softirq_stack), %g3 + ldx [%g3 + %g1], %g7 + cmp %sp, %g7 + bleu,pt %xcc, 2f + sethi %hi(THREAD_SIZE), %g3 + add %g7, %g3, %g7 + cmp %sp, %g7 + blu,pn %xcc, 1f + nop +#endif + /* If we are already on panic stack, don't hop onto it + * again, we are already trying to output the stack overflow + * message. + */ sethi %hi(ovstack), %g7 ! cant move to panic stack fast enough or %g7, %lo(ovstack), %g7 - add %g7, OVSTACKSIZE, %g7 + add %g7, OVSTACKSIZE, %g3 + sub %g3, STACK_BIAS - 192, %g3 sub %g7, STACK_BIAS, %g7 + cmp %sp, %g7 + blu,pn %xcc, 2f + cmp %sp, %g3 + bleu,pn %xcc, 1f + nop +2: mov %g3, %sp + sethi %hi(panicstring), %g3 mov %g7, %sp call prom_printf or %g3, %lo(panicstring), %o0 --
Thanks, the patch works. --
From: Mikulas Patocka <mpatocka@redhat.com> Thanks for testing. I'm engaged in a few activities right now related to this: 1) I'm making a final version of this irqstacks patch which I'll get into Linus's tree and then submit to -stable. 2) I have a patch which I'm regression testing for gcc which gets rid of the 16-byte secondary-reload stack slot. This gcc is emitting 176 byte default stack frames on sparc64. 3) I'm talking with some folks familiar with the V9 ABI about the mandatory incoming argument stack slots. I think they can be eliminated in all cases except functions taking varargs. If it cannot be done universally for some reason, I'll add a -mkernel option that will eliminate them. This will emit 128 byte default stack frames when possible. So in the end we should hopefully have 128 byte stack frames coming out of gcc and IRQ stacks in the kernel. --
I think no, it just locked-up solid. There is a problem with console handover. See this dmesg that I get on boot. Notice the lines: (1) console handover: boot [earlyprom0] -> real [tty0] and (2) Console: switching to colour frame buffer device 128x48 At line (1), the kernel disables the PROM console. At line (2) it enables framebuffer. Between these lines, the kernel runs with no console at all. Everything that is printk'ed between these lines doesn't go to the screen. If the kernel hits oops at some point between (1) and (2), you don't see anything, it just appears as a lockup. I hit already three crashes that happened between these lines and didn't generate any output: this one with interrupt stacks that you have just fixed, CONFIG_LOCKDEP+CONFIG_DEBUG_PAGEALLOC crash that I will send you patch for, and then boot failure of 2.6.27-rc[12] because of bad memory migratetype. Is this migratetype crash a known problem? --- the problem is that starting with 2.6.27rc1, I'm getting crash with this backtrace: __list_add __free_pages_ok __free_pages __free_pages_bootmem __free_all_bootmem mem_init start_kernel_tlb_fixup_code --- the crash is due to migratetype == 5 in __free_one_page (inlined into __free_pages_ok) and because there are only 5 migratettypes, it attempts to add to a non-existent list. The trace can be obtained if I disable console handover in kernel/printk. But it should really be somehow rewritten so that the kernel can write crashes during boot on console without extra patching --- the PROM console is disabled just before the framebuffer is registered and not too early. Mikulas PROMLIB: Sun IEEE Boot Prom 'OBP 3.31.0 2001/07/25 20:36' PROMLIB: Root node compatible: Linux version 2.6.26-devel (root@slunicko) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #24 Wed Aug 13 01:25:13 CEST 2008 console [earlyprom0] enabled ARCH: SUN4U Ethernet address: 08:00:20:f5:03:81 Kernel: Using 3 locked TLB entries for main kernel ...
From: Mikulas Patocka <mpatocka@redhat.com> Yes, I know, this is such an incredible pain and it bothers me a lot as it makes diagnosing bugs that trigger in between these two points very difficult to diagnose. The VT layer should not register it's console until an upper level provider (such as an fbdev driver or the plain VGA console) really has The interrupt stacks one would show up on the console, because it uses prom_printf() to use the firmware console directly. Actually, I bet it got printed, but you didn't see it, because the framebuffer driver changed the console palette, resulting in the pixels the PROM console writes being black on the black We have another report of this, thanks for grabbing the extra Another way to capture this is to remove the CON_BOOT thing from the prom console struct in arch/sparc64/kernel/setup.c I am probably going to make the old "-p" boot command line option do this dynamically. --
From: Mikulas Patocka <mpatocka@redhat.com> Mikulas can you send me the .config you're using in 2.6.27 to trigger this? Thanks. --
From: David Miller <davem@davemloft.net> Meanwhile I tried to figure out how this can go wrong like this. The way this stuff works this early is very simple. The pageblock bitmaps get allocated by sparse_init() as it iterates over each mem section, via sparse_early_usemap_alloc(). These use the various bootmem allocators, which will zero initialize the bitmap. I added some debugging to sparse_early_usemap_alloc() to make sure the size was correct and that the pointer looked sane. What happens next is that memmap_init_zone() walks over each zone's page and initializes their pageblock migrate type to MIGRATE_MOVABLE which is "2". So given the simplicity of that stuff, I can only imagine that something is writing all over the bitmaps, clobbering them somehow. I'll try to reproduce this here so I can try to narrow down the cause a bit more, but so far my attempts have not been successful. --
Here it is. The computer is Ultra 5 with 512MB RAM. Mikulas # # Automatically generated make config: don't edit # Linux kernel version: 2.6.27-rc2 # Thu Aug 7 14:30:44 2008 # CONFIG_SPARC=y CONFIG_SPARC64=y CONFIG_GENERIC_TIME=y CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_64BIT=y CONFIG_MMU=y CONFIG_IOMMU_HELPER=y CONFIG_QUICKLIST=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y # CONFIG_ARCH_HAS_ILOG2_U32 is not set # CONFIG_ARCH_HAS_ILOG2_U64 is not set CONFIG_AUDIT_ARCH=y CONFIG_HAVE_SETUP_PER_CPU_AREA=y CONFIG_ARCH_NO_VIRT_TO_BUS=y CONFIG_OF=y CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config" # # General setup # CONFIG_EXPERIMENTAL=y CONFIG_BROKEN_ON_SMP=y CONFIG_INIT_ENV_ARG_LIMIT=32 CONFIG_LOCALVERSION="-devel" # CONFIG_LOCALVERSION_AUTO is not set CONFIG_SWAP=y CONFIG_SYSVIPC=y CONFIG_SYSVIPC_SYSCTL=y CONFIG_POSIX_MQUEUE=y # CONFIG_BSD_PROCESS_ACCT is not set # CONFIG_TASKSTATS is not set # CONFIG_AUDIT is not set CONFIG_IKCONFIG=y CONFIG_IKCONFIG_PROC=y CONFIG_LOG_BUF_SHIFT=18 # CONFIG_CGROUPS is not set # CONFIG_GROUP_SCHED is not set # CONFIG_SYSFS_DEPRECATED_V2 is not set # CONFIG_RELAY is not set CONFIG_NAMESPACES=y # CONFIG_UTS_NS is not set # CONFIG_IPC_NS is not set # CONFIG_USER_NS is not set # CONFIG_PID_NS is not set # CONFIG_BLK_DEV_INITRD is not set CONFIG_CC_OPTIMIZE_FOR_SIZE=y CONFIG_SYSCTL=y # CONFIG_EMBEDDED is not set CONFIG_UID16=y CONFIG_SYSCTL_SYSCALL=y CONFIG_SYSCTL_SYSCALL_CHECK=y CONFIG_KALLSYMS=y CONFIG_KALLSYMS_ALL=y # CONFIG_KALLSYMS_EXTRA_PASS is not set CONFIG_HOTPLUG=y CONFIG_PRINTK=y CONFIG_BUG=y CONFIG_ELF_CORE=y CONFIG_COMPAT_BRK=y CONFIG_BASE_FULL=y CONFIG_FUTEX=y CONFIG_ANON_INODES=y CONFIG_EPOLL=y CONFIG_SIGNALFD=y CONFIG_TIMERFD=y CONFIG_EVENTFD=y CONFIG_SHMEM=y CONFIG_VM_EVENT_COUNTERS=y CONFIG_SLUB_DEBUG=y # CONFIG_SLAB is not set CONFIG_SLUB=y # CONFIG_SLOB is not ...
From: Mikulas Patocka <mpatocka@redhat.com>
Thanks, I've tried a lot of things to try and reproduce this myself but to
no avail.
Let's try to track down exactly where the corruption happens. Please try
the debug patch below on your box. It will spit out a message something
like:
[ 0.000000] BUG: Bogus migrate type 5
[ 0.000000] BUG: Usemap for section 0 corrupted at sparse_init+0x17c/0x218[mm/sparse.c:498]
The theory is that some other piece of code is either not allocating a large
enough buffer or overwriting past the end of a validly sized other buffer
for some reason, and thus clobbering this pageblock flags bitmap.
Wherever this debugging check first triggers should give us some idea
of who the culprit might be.
diff --git a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
index 217de3e..26b018f 100644
--- a/arch/sparc64/mm/init.c
+++ b/arch/sparc64/mm/init.c
@@ -1643,6 +1643,8 @@ void __init setup_per_cpu_areas(void)
{
}
+extern void sparse_validate_usemap(const char *file, int line);
+
void __init paging_init(void)
{
unsigned long end_pfn, shift, phys_base;
@@ -1788,7 +1790,9 @@ void __init paging_init(void)
#ifndef CONFIG_NEED_MULTIPLE_NODES
max_mapnr = last_valid_pfn;
#endif
+ sparse_validate_usemap(__FILE__, __LINE__);
kernel_physical_mapping_init();
+ sparse_validate_usemap(__FILE__, __LINE__);
{
unsigned long max_zone_pfns[MAX_NR_ZONES];
@@ -1798,12 +1802,15 @@ void __init paging_init(void)
max_zone_pfns[ZONE_NORMAL] = end_pfn;
free_area_init_nodes(max_zone_pfns);
+ sparse_validate_usemap(__FILE__, __LINE__);
}
printk("Booting Linux...\n");
central_probe();
+ sparse_validate_usemap(__FILE__, __LINE__);
cpu_probe();
+ sparse_validate_usemap(__FILE__, __LINE__);
}
int __init page_in_phys_avail(unsigned long paddr)
diff --git a/init/main.c b/init/main.c
index 0bc7e16..80771f5 100644
--- a/init/main.c
+++ b/init/main.c
@@ -536,6 +536,8 @@ void __init __weak ...Hi So I tried the patch and found out that the corruption happens in setup_command_line --- the first strcpy call corrupted the migratetype map. Examining the problem further, it turned out that Johannes Weiner committed new bootmem allocator to 2.6.27-rc1 and the allocator is broken. This is the minimal sequence that jams the allocator: void *p, *q, *r; p = alloc_bootmem(PAGE_SIZE); q = alloc_bootmem(64); free_bootmem(p, PAGE_SIZE); p = alloc_bootmem(PAGE_SIZE); r = alloc_bootmem(64); --- after this sequence (assuming that the allocator was empty or page-aligned before), pointer "q" will be equal to pointer "r". What's hapenning inside the allocator: p = alloc_bootmem(PAGE_SIZE); in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000... q = alloc_bootmem(64); in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000... free_bootmem(p, PAGE_SIZE); in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000... p = alloc_bootmem(PAGE_SIZE); in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000... r = alloc_bootmem(64); and now: it finds bit "2", as a place where to allocate (sidx) it hits the condition if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx)) start_off = ALIGN(bdata->last_end_off, align); --- you can see that the condition is true, so it assigns start_off = ALIGN(bdata->last_end_off, align); --- that is PAGE_SIZE --- and allocates over already allocated block. This patch fixes it (kernels 2.6.27-rc2 and 2.6.27-rc3 boot ok after the patch). Johannes, please review the patch and submit it to Linus. With the patch it tries to continue at the end of previous allocation only if the previous allocation ended in the middle of the page. Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> --- mm/bootmem.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: ...
From: Mikulas Patocka <mpatocka@redhat.com> Date: Thu, 14 Aug 2008 19:11:19 -0400 (EDT) [ Adding Alexander back to the CC: as he is seeing this same Ok, I was just looking at Alexander's debugging dump from my patch and in his case it pointed to kernel_physical_mapping_init() and I couldn't find any obvious problems there. --
It is working! Thanks a lot David and Mikulas! alexb@sparky ~ $ zgrep LOCKDEP /proc/config.gz CONFIG_LOCKDEP_SUPPORT=y CONFIG_LOCKDEP=y CONFIG_DEBUG_LOCKDEP=y alexb@sparky ~ $ uname -a Linux sparky 2.6.27-rc3-00171-gb635ace-dirty #3 PREEMPT Fri Aug 15 14:52:40 MSD 2008 sparc64 sun4u TI UltraSparc IIi (Sabre) GNU/Linux --
From: "Alexander Beregalov" <a.beregalov@gmail.com> Thanks for your testing and patience :) --
Hi Mikulas, Yes, taking last_end_off into account when it's page-aligned is bogus as the whole merging thing is about partial pages. Acked-by: Johannes Weiner <hannes@saeurebad.de> --
Are you sure? What about this:
ide-io.c:ide_intr
if (drive->unmask)
local_irq_enable_in_hardirq();
or this:
kernel/irq/handle.c:handle_IRQ_event
if (!(action->flags & IRQF_DISABLED))
local_irq_enable_in_hardirq();
--- how is number of nested interrupts here supposed to be limited?
If these things are not limited, you get at most as many nested handlers
as there are hardware interrupts, which means crash.
--
From: Mikulas Patocka <mpatocka@redhat.com> It means i386 and every other platform potentially has the same exact problem. What point wrt. sparc64 are you trying to make here? :-) --
The difference is that i386 takes minimum 4 bytes per stack frame and sparc64 192 bytes per stack frame. So this problem will kill sparc64 sooner. But yes, it is general problem and should be solved in arch-independent code. Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> I agree on both counts. Although I'm curious what the average stack frame sizes look like on x86_64 and i386, and also how this area appears on powerpc. One mitigating factor on sparc64 is that typically when there are lots of devices with interrupts there are also lots of cpus, and we evenly distribute the IRQ targetting amongst the available cpus on sparc64. This is probably why, in practice, these problems tend to not surface often. In any event, with the work you've accomplished and my implementation of IRQ stacks for sparc64 we should be able to get things in much better shape. --
From: David Miller <davem@davemloft.net> I also one to mention in passing that another thing we can do to help deep call stack sizes is to make call chains more tail-call friendly when possible. --
... and remove -fno-optimize-sibling-calls?: Makefile: ifdef CONFIG_FRAME_POINTER KBUILD_CFLAGS += -fno-omit-frame-pointer -fno-optimize-sibling-calls else KBUILD_CFLAGS += -fomit-frame-pointer endif --- maybe it could be better to remove it, instead of some inlining that I made. Or do you see a situation when for debugging purpose, user would want -fno-optimize-sibling-calls? Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> Yes for debugging and other things it has to stay. --
If you want it to stay, then it doesn't make sense to make functions tail-call-friendly --- because it should not crash with or without debugging. Mikulas --
From: Mikulas Patocka <mpatocka@redhat.com> On the contrary, of course it makes sense to do so. When debugging is disabled, the kernel will run faster. We have to fix the stack usage in either case, but from a performance standpoint when debugging is disabled the tail-call friendly layout is still highly desirable. --
I agree, but performance is different problem than stack overflows. I put all the patches for this overflow problem here: http://people.redhat.com/mpatocka/patches/kernel-stack-overflow Mikulas --
If I look at an old oops that I have in my log on i386: it's 1104 stack
I created this to help with nested irqs:
--- linux-2.6.26-rc5-devel.orig/include/linux/interrupt.h 2008-06-20
23:34:04.000000000 +0200
+++ linux-2.6.26-rc5-devel/include/linux/interrupt.h 2008-06-20
23:36:03.000000000 +0200
@@ -95,7 +95,7 @@
#ifdef CONFIG_LOCKDEP
# define local_irq_enable_in_hardirq() do { } while (0)
#else
-# define local_irq_enable_in_hardirq() local_irq_enable()
+# define local_irq_enable_in_hardirq() do { if (hardirq_count() <= (1 <<
HARDIRQ_SHIFT)) local_irq_enable(); } while (0)
#endif
extern void disable_irq_nosync(unsigned int irq);
Mikulas
--
