Bootmem allocator broken [was: console handover badness]

Previous thread: [2.6 patch] sh/kernel/ cleanups by Adrian Bunk on Tuesday, June 17, 2008 - 5:36 pm. (1 message)

Next thread: [git patches] net driver fixes for .26 by Jeff Garzik on Tuesday, June 17, 2008 - 9:20 pm. (2 messages)
From: Mikulas Patocka
Date: Tuesday, June 17, 2008 - 5:47 pm

Hi

I am getting stack overflows on my Sparc64 station. They happen when I 
copy to device-mapper snapshot origin device using small IO size (512 
bytes) and simultaneously execute "lvs" command. The kernel is compiled 
with most debugging functions enabled. The stack trace is this:

__ide_end_request
__blk_end_request
__end_that_request_first
req_bio_endio
bio_endio
clone_endio
dec_pending
bio_endio
clone_endio
dec_pending
bio_endio
clone_endio
dec_pending
bio_endio
end_bio_bh_io_sync
end_buffer_read_sync
__end_buffer_read_notouch
unlock_buffer
wake_up_bit
__wake_up_bit
__wake_up
__wake_up_common
wake_bit_function
autoremove_wake_function
default_wake_function
try_to_wake_up
task_rq_lock
__spin_lock
lock_acquire
__lock_acquire
*** crash, stack overflow

--- observations:

That loop bio_endio->clone_endio->dec_pending is repeating for each level 
of nested devices --- so for any architecture there exists a level at 
which it causes trouble. We need something to prevent recursion, maybe the 
similar trick that was done with avoing bio request function recursion 
(i.e. if bio_endio is called recursively, it just adds the bio to queue 
and lets the top level to call endio method).

Wait queue waking looks like being written by a high-level maniac --- it 
contains 8 levels of calls (none of them inlined). 7 of these calls (until 
try_to_wake_up) do nothing but pass arguments to lower level call. And 
each of these calls allocate at least 192 bytes of stack space. All these 
7 useless calls consume 1360 bytes of stack (and cause windows traps that 
needlessly damage performance). Would you agree to inline most of the 
calls to save stack? Or do you see another solution?

Long-term consideration: Is it possible to implement interrupt stacks on 
sparc64? Functions on sparc eat stack much more aggressively than on other 
architectures (minimum stack size for a function is 192 bytes).

Mikulas
--

From: David Miller
Date: Tuesday, June 17, 2008 - 9:01 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Some of them could be inlined but there are a few limiting
factors here.

Even spin lock acquisitions are function calls, limiting how
much leaf function and tail call optimizations can be done.

Also, wake_up_bit has this aggregate local variable "key" whose
address is passed down to subsequent functions, which limits
optimizations even further.


I had a patch but at the time I wrote it (several years ago) I
couldn't make it stable enough to put mainline, I may resurrect it.

I just did a quick scan and I can't find the last copy I had, and
things have changed enough that I'd probably work from scratch
anyways.

But the level of recursion possible by the current device layer is
excessive and needs to be curtained irrespective of these generic
wakeup and sparc64 interrupt stack issues.
--

From: Mikulas Patocka
Date: Wednesday, June 18, 2008 - 8:24 pm

I inlined three of them, I think I can inline another two. So hopefully, 

Tail call optimization is not done at all if you compile kernel with stack 

I fixed that too.

BTW. what's the purpose of having 192-byte stack frame? There are 16 
8-byte registers being saved per function call, so 128-byte frame should 
be sufficient, shoudn't? The ABI specifies that some additional entries 
must be present even if unused, but I don't see reason for them. Would 
something bad happen if GCC started to generate 128-byte stacks?

Mikulas
--

From: David Miller
Date: Wednesday, June 18, 2008 - 8:59 pm

From: Mikulas Patocka <mpatocka@redhat.com>

The callee can pop the arguments into the area past the
register window.

So you have the 128 byte register window save area, 6
slots for incoming arguments, which gives us 176 bytes.
The rest is for some miscellaneous stack frame state,
which I don't remember the details of at the moment.
I'd have to read the sparc backend of gcc to remember.
--

From: Mikulas Patocka
Date: Wednesday, June 18, 2008 - 10:17 pm

I see ... the callee writes arguments into caller's stack frame, if it has 
variable number of arguments. That it misdesign, the callee should write 
registers arguments into it's own frame like on AMD64 (then this space 
would be allocated only if needed).
But nothing can be done with it since ABI was specified :-(

--

From: David Miller
Date: Wednesday, June 18, 2008 - 11:37 pm

From: Mikulas Patocka <mpatocka@redhat.com>

The callee can do this even for non-variable argument lists.

It's like a set of pre-allocated stack slots for those incoming
argument registers when reloading under register pressure.

In my opinion it is better to put this onus on the callee because only
the callee knows if it needs to pop these values onto the stack to
alleviate register pressure.

I think it might be possible for the compiler to only use 176 bytes.
I'll take a look at the gcc sparc backend and the ABI specification
to see if this is the case.
--

From: Mikulas Patocka
Date: Thursday, June 19, 2008 - 6:01 am

Yes, it could be shrunk to 176 bytes. Maybe there could be some 
performance problems if the spills are cacheline-unaligned. Or better --- 
make special -mkernel-abi function to gcc that will drop this area at all 
and make 128-byte frames. In kernel it wouldn't matter that ABI is 
incompatible.

Mikulas
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 8:47 am

I took another few traces (to track the whole stack content) and there is 
another problem: nested interrupts. Does Sparc64 limit them somehow?

sys_call_table
timer_interrupt
irq_exit
do_softirq
__do_softirq
run_timer_softirq
_spin_unlock
sys_call_table
handler_irq
handler_fasteoi_irq
handle_irq_event
ide_intr
ide_dma_intr
task_end_request
ide_end_request
__ide_end_request
...

Mikulas
--

From: David Miller
Date: Friday, June 20, 2008 - 10:26 am

From: Mikulas Patocka <mpatocka@redhat.com>

Two levels should be the deepest you will ever see, and this is
equivalent to what you get on other platforms.

That path occurs when softirq processing re-enabled HW interrupts when
returning from the top-level interrupt.
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 1:34 pm

And what if network softirq happened here? How much stack does it consume?

The whole overflowed stack trace has 75 functions, I was able to get rid 
of 9 by avoiding bio_endio recursion and 10 by turning simple functions 
into inlines. --- so is it enough or not enough for possible networking 
calls?

Maybe a good thing would be to add a check for stack size to __do_softirq 
and handing the softirq to ksoftirqd if there's not enough space.

Mikulas
--

From: David Miller
Date: Friday, June 20, 2008 - 1:37 pm

From: Mikulas Patocka <mpatocka@redhat.com>

It should be OK, because the minimum stack of a (75 - 19) depth call

I'd rather it spit out a WARN_ON() message and a backtrace.

Otherwise it will be considered a feature and people won't fix
these deep call chains.
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 2:26 pm

I meant if some fancy networking options can eat those 19 frames that I 
saved and crash again? I use the computer as a workstation, it doesn't 
have high network load and it doesn't use any features except basic 

If you think that process context+network processing+hardirqs can fit into 
75 nested functions... I really have no idea how much the networking 
takes, given the amount of protocols and features and inability to test 
them all in one lab, it looks very scary.

Mikulas
--

From: David Miller
Date: Friday, June 20, 2008 - 2:41 pm

From: Mikulas Patocka <mpatocka@redhat.com>

I agree, it looks scary to me too.

Next week I'll make one of my primary projects the implementation of
IRQ stacks for sparc64.
--

From: David Miller
Date: Friday, June 20, 2008 - 9:51 pm

From: David Miller <davem@davemloft.net>

Next week came real fast :-)

Amazingly this patch worked the first time I booted it up on my
dual-cpu workstation (using SMP + PREEMPT).  I'm worried that maybe
gcc can do some clever things and not allow my extremely simple
approach to work, but anyways it might work for you and it's worth
giving a try.

sparc64: Implement support for IRQ stacks.

Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/arch/sparc64/Kconfig.debug b/arch/sparc64/Kconfig.debug
index 6a4d28a..df80962 100644
--- a/arch/sparc64/Kconfig.debug
+++ b/arch/sparc64/Kconfig.debug
@@ -41,4 +41,11 @@ config FRAME_POINTER
 	depends on MCOUNT
 	default y
 
+config IRQSTACKS
+	bool "Use separate kernel stacks when processing interrupts"
+	help
+	  If you say Y here the kernel will use separate kernel stacks
+	  for handling hard and soft interrupts.  This can help avoid
+	  overflowing the process kernel stacks.
+
 endmenu
diff --git a/arch/sparc64/kernel/irq.c b/arch/sparc64/kernel/irq.c
index b441a26..24820fa 100644
--- a/arch/sparc64/kernel/irq.c
+++ b/arch/sparc64/kernel/irq.c
@@ -674,10 +674,42 @@ void ack_bad_irq(unsigned int virt_irq)
 	       ino, virt_irq);
 }
 
+#ifdef CONFIG_IRQSTACKS
+void *hardirq_stack[NR_CPUS];
+void *softirq_stack[NR_CPUS];
+
+static void *set_hardirq_stack(void)
+{
+	void *orig_sp, *sp = hardirq_stack[smp_processor_id()];
+
+	__asm__ __volatile__("mov %%sp, %0" : "=r" (orig_sp));
+	if (orig_sp < sp ||
+	    orig_sp > (sp + THREAD_SIZE)) {
+		sp += THREAD_SIZE - 192 - STACK_BIAS;
+		__asm__ __volatile__("mov %0, %%sp" : : "r" (sp));
+	}
+
+	return orig_sp;
+}
+static void restore_hardirq_stack(void *orig_sp)
+{
+	__asm__ __volatile__("mov %0, %%sp" : : "r" (orig_sp));
+}
+#else
+static void *set_hardirq_stack(void)
+{
+	return NULL;
+}
+static void restore_hardirq_stack(void *orig_sp)
+{
+}
+#endif
+
 void handler_irq(int irq, struct pt_regs *regs)
 {
 	unsigned long pstate, ...
From: Mikulas Patocka
Date: Saturday, June 21, 2008 - 12:42 pm

For me it doesn't work. Locked up after "console: colour dummy device 
80x25".

--

From: David Miller
Date: Sunday, June 22, 2008 - 12:03 am

From: Mikulas Patocka <mpatocka@redhat.com>

Machine type, compiler, and config please so I can debug this :-)
--

From: Mikulas Patocka
Date: Sunday, June 22, 2008 - 6:48 am

Ultra 5, 360MHz, 256MB, compiler gcc version 4.1.2 20061115 (prerelease) 
(Debian 4.1.1-21). .config is attached.

Mikulas
From: David Miller
Date: Monday, August 11, 2008 - 11:30 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Are you sure you didn't see a "Stack overflow" message on the
screen? :-)

That's what I get when I try to boot with your provided
kernel config.

The problem is that CONFIG_STACK_DEBUG doesn't understand
the IRQ stacks at all.

I'll see if I can tweak it to handle this.
--

From: David Miller
Date: Tuesday, August 12, 2008 - 1:22 am

From: David Miller <davem@davemloft.net>

This patch, on top of my original IRQSTACKS patch for
sparc64, seems to get things working for me.

diff --git a/arch/sparc64/lib/mcount.S b/arch/sparc64/lib/mcount.S
index 9e4534b..24e6a3c 100644
--- a/arch/sparc64/lib/mcount.S
+++ b/arch/sparc64/lib/mcount.S
@@ -45,11 +45,47 @@ _mcount:
 	sub		%g3, STACK_BIAS, %g3
 	cmp		%sp, %g3
 	bg,pt		%xcc, 1f
-	 sethi		%hi(panicstring), %g3
+	 nop
+#ifdef CONFIG_IRQSTACKS
+	lduh		[%g6 + TI_CPU], %g1
+	sethi		%hi(hardirq_stack), %g3
+	or		%g3, %lo(hardirq_stack), %g3
+	sllx		%g1, 3, %g1
+	ldx		[%g3 + %g1], %g7
+	sub		%g7, STACK_BIAS, %g7
+	cmp		%sp, %g7
+	bleu,pt		%xcc, 2f
+	 sethi		%hi(THREAD_SIZE), %g3
+	add		%g7, %g3, %g7
+	cmp		%sp, %g7
+	blu,pn		%xcc, 1f
+2:	 sethi		%hi(softirq_stack), %g3
+	or		%g3, %lo(softirq_stack), %g3
+	ldx		[%g3 + %g1], %g7
+	cmp		%sp, %g7
+	bleu,pt		%xcc, 2f
+	 sethi		%hi(THREAD_SIZE), %g3
+	add		%g7, %g3, %g7
+	cmp		%sp, %g7
+	blu,pn		%xcc, 1f
+	 nop
+#endif
+	/* If we are already on panic stack, don't hop onto it
+	 * again, we are already trying to output the stack overflow
+	 * message.
+	 */
 	sethi		%hi(ovstack), %g7		! cant move to panic stack fast enough
 	 or		%g7, %lo(ovstack), %g7
-	add		%g7, OVSTACKSIZE, %g7
+	add		%g7, OVSTACKSIZE, %g3
+	sub		%g3, STACK_BIAS - 192, %g3
 	sub		%g7, STACK_BIAS, %g7
+	cmp		%sp, %g7
+	blu,pn		%xcc, 2f
+	 cmp		%sp, %g3
+	bleu,pn		%xcc, 1f
+	 nop
+2:	mov		%g3, %sp
+	sethi		%hi(panicstring), %g3
 	mov		%g7, %sp
 	call		prom_printf
 	 or		%g3, %lo(panicstring), %o0
--

From: Mikulas Patocka
Date: Tuesday, August 12, 2008 - 5:53 pm

Thanks, the patch works.

--

From: David Miller
Date: Tuesday, August 12, 2008 - 5:59 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Thanks for testing.  I'm engaged in a few activities right
now related to this:

1) I'm making a final version of this irqstacks patch which
   I'll get into Linus's tree and then submit to -stable.

2) I have a patch which I'm regression testing for gcc which
   gets rid of the 16-byte secondary-reload stack slot.  This
   gcc is emitting 176 byte default stack frames on sparc64.

3) I'm talking with some folks familiar with the V9 ABI about
   the mandatory incoming argument stack slots.  I think they
   can be eliminated in all cases except functions taking
   varargs.  If it cannot be done universally for some
   reason, I'll add a -mkernel option that will eliminate
   them.  This will emit 128 byte default stack frames when
   possible.

So in the end we should hopefully have 128 byte stack frames
coming out of gcc and IRQ stacks in the kernel.
--

From: Mikulas Patocka
Date: Tuesday, August 12, 2008 - 6:11 pm

I think no, it just locked-up solid. There is a problem with console 
handover. See this dmesg that I get on boot.

Notice the lines:
(1) console handover: boot [earlyprom0] -> real [tty0]
and
(2) Console: switching to colour frame buffer device 128x48

At line (1), the kernel disables the PROM console. At line (2) it enables 
framebuffer. Between these lines, the kernel runs with no console at all. 
Everything that is printk'ed between these lines doesn't go to the screen.

If the kernel hits oops at some point between (1) and (2), you don't see 
anything, it just appears as a lockup.

I hit already three crashes that happened between these lines and didn't 
generate any output: this one with interrupt stacks that you have just 
fixed, CONFIG_LOCKDEP+CONFIG_DEBUG_PAGEALLOC crash that I will send you 
patch for, and then boot failure of 2.6.27-rc[12] because of bad memory 
migratetype. Is this migratetype crash a known problem? --- the problem is 
that starting with 2.6.27rc1, I'm getting crash with this backtrace:
__list_add
__free_pages_ok
__free_pages
__free_pages_bootmem
__free_all_bootmem
mem_init
start_kernel_tlb_fixup_code
--- the crash is due to migratetype == 5 in __free_one_page (inlined into 
__free_pages_ok) and because there are only 5 migratettypes, it attempts 
to add to a non-existent list.

The trace can be obtained if I disable console handover in kernel/printk. 
But it should really be somehow rewritten so that the kernel can write 
crashes during boot on console without extra patching --- the PROM console 
is disabled just before the framebuffer is registered and not too early.

Mikulas


PROMLIB: Sun IEEE Boot Prom 'OBP 3.31.0 2001/07/25 20:36'
PROMLIB: Root node compatible: 
Linux version 2.6.26-devel (root@slunicko) (gcc version 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)) #24 Wed Aug 13 01:25:13 CEST 2008
console [earlyprom0] enabled
ARCH: SUN4U
Ethernet address: 08:00:20:f5:03:81
Kernel: Using 3 locked TLB entries for main kernel ...
From: David Miller
Date: Tuesday, August 12, 2008 - 6:22 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Yes, I know, this is such an incredible pain and it bothers
me a lot as it makes diagnosing bugs that trigger in between
these two points very difficult to diagnose.

The VT layer should not register it's console until an upper level
provider (such as an fbdev driver or the plain VGA console) really has

The interrupt stacks one would show up on the console, because it
uses prom_printf() to use the firmware console directly.

Actually, I bet it got printed, but you didn't see it, because
the framebuffer driver changed the console palette, resulting in
the pixels the PROM console writes being black on the black

We have another report of this, thanks for grabbing the extra

Another way to capture this is to remove the CON_BOOT thing from
the prom console struct in arch/sparc64/kernel/setup.c

I am probably going to make the old "-p" boot command line option do
this dynamically.
--

From: David Miller
Date: Tuesday, August 12, 2008 - 6:40 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Mikulas can you send me the .config you're using in 2.6.27 to trigger
this?

Thanks.
--

From: David Miller
Date: Wednesday, August 13, 2008 - 1:50 am

From: David Miller <davem@davemloft.net>

Meanwhile I tried to figure out how this can go wrong like this.

The way this stuff works this early is very simple.

The pageblock bitmaps get allocated by sparse_init() as it iterates over
each mem section, via sparse_early_usemap_alloc().  These use the
various bootmem allocators, which will zero initialize the bitmap.

I added some debugging to sparse_early_usemap_alloc() to make sure
the size was correct and that the pointer looked sane.

What happens next is that memmap_init_zone() walks over each zone's
page and initializes their pageblock migrate type to MIGRATE_MOVABLE
which is "2".

So given the simplicity of that stuff, I can only imagine that something
is writing all over the bitmaps, clobbering them somehow.

I'll try to reproduce this here so I can try to narrow down the cause
a bit more, but so far my attempts have not been successful.
--

From: Mikulas Patocka
Date: Wednesday, August 13, 2008 - 5:46 am

Here it is. The computer is Ultra 5 with 512MB RAM.

Mikulas

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.27-rc2
# Thu Aug  7 14:30:44 2008
#
CONFIG_SPARC=y
CONFIG_SPARC64=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_64BIT=y
CONFIG_MMU=y
CONFIG_IOMMU_HELPER=y
CONFIG_QUICKLIST=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_AUDIT_ARCH=y
CONFIG_HAVE_SETUP_PER_CPU_AREA=y
CONFIG_ARCH_NO_VIRT_TO_BUS=y
CONFIG_OF=y
CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_BROKEN_ON_SMP=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION="-devel"
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
# CONFIG_AUDIT is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=18
# CONFIG_CGROUPS is not set
# CONFIG_GROUP_SCHED is not set
# CONFIG_SYSFS_DEPRECATED_V2 is not set
# CONFIG_RELAY is not set
CONFIG_NAMESPACES=y
# CONFIG_UTS_NS is not set
# CONFIG_IPC_NS is not set
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
# CONFIG_BLK_DEV_INITRD is not set
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_KALLSYMS=y
CONFIG_KALLSYMS_ALL=y
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_COMPAT_BRK=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not ...
From: David Miller
Date: Wednesday, August 13, 2008 - 8:25 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Thanks, I've tried a lot of things to try and reproduce this myself but to
no avail.

Let's try to track down exactly where the corruption happens.  Please try
the debug patch below on your box.  It will spit out a message something
like:

[    0.000000] BUG: Bogus migrate type 5
[    0.000000] BUG: Usemap for section 0 corrupted at sparse_init+0x17c/0x218[mm/sparse.c:498]

The theory is that some other piece of code is either not allocating a large
enough buffer or overwriting past the end of a validly sized other buffer
for some reason, and thus clobbering this pageblock flags bitmap.

Wherever this debugging check first triggers should give us some idea
of who the culprit might be.

diff --git a/arch/sparc64/mm/init.c b/arch/sparc64/mm/init.c
index 217de3e..26b018f 100644
--- a/arch/sparc64/mm/init.c
+++ b/arch/sparc64/mm/init.c
@@ -1643,6 +1643,8 @@ void __init setup_per_cpu_areas(void)
 {
 }
 
+extern void sparse_validate_usemap(const char *file, int line);
+
 void __init paging_init(void)
 {
 	unsigned long end_pfn, shift, phys_base;
@@ -1788,7 +1790,9 @@ void __init paging_init(void)
 #ifndef CONFIG_NEED_MULTIPLE_NODES
 	max_mapnr = last_valid_pfn;
 #endif
+	sparse_validate_usemap(__FILE__, __LINE__);
 	kernel_physical_mapping_init();
+	sparse_validate_usemap(__FILE__, __LINE__);
 
 	{
 		unsigned long max_zone_pfns[MAX_NR_ZONES];
@@ -1798,12 +1802,15 @@ void __init paging_init(void)
 		max_zone_pfns[ZONE_NORMAL] = end_pfn;
 
 		free_area_init_nodes(max_zone_pfns);
+		sparse_validate_usemap(__FILE__, __LINE__);
 	}
 
 	printk("Booting Linux...\n");
 
 	central_probe();
+	sparse_validate_usemap(__FILE__, __LINE__);
 	cpu_probe();
+	sparse_validate_usemap(__FILE__, __LINE__);
 }
 
 int __init page_in_phys_avail(unsigned long paddr)
diff --git a/init/main.c b/init/main.c
index 0bc7e16..80771f5 100644
--- a/init/main.c
+++ b/init/main.c
@@ -536,6 +536,8 @@ void __init __weak ...
From: Mikulas Patocka
Date: Thursday, August 14, 2008 - 4:11 pm

Hi

So I tried the patch and found out that the corruption happens in 
setup_command_line --- the first strcpy call corrupted the migratetype 
map.

Examining the problem further, it turned out that Johannes Weiner 
committed new bootmem allocator to 2.6.27-rc1 and the allocator is broken.

This is the minimal sequence that jams the allocator:

void *p, *q, *r;
p = alloc_bootmem(PAGE_SIZE);
q = alloc_bootmem(64);
free_bootmem(p, PAGE_SIZE);
p = alloc_bootmem(PAGE_SIZE);
r = alloc_bootmem(64);

--- after this sequence (assuming that the allocator was empty or 
page-aligned before), pointer "q" will be equal to pointer "r".

What's hapenning inside the allocator:
p = alloc_bootmem(PAGE_SIZE);
in allocator: last_end_off == PAGE_SIZE, bitmap contains bits 10000...
q = alloc_bootmem(64);
in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 11000...
free_bootmem(p, PAGE_SIZE);
in allocator: last_end_off == PAGE_SIZE + 64, bitmap contains 01000...
p = alloc_bootmem(PAGE_SIZE);
in allocator: last_end_off == PAGE_SIZE, bitmap contains 11000...
r = alloc_bootmem(64);
and now:
it finds bit "2", as a place where to allocate (sidx)
it hits the condition
if (bdata->last_end_off && PFN_DOWN(bdata->last_end_off) + 1 == sidx))
start_off = ALIGN(bdata->last_end_off, align);
--- you can see that the condition is true, so it assigns start_off = 
ALIGN(bdata->last_end_off, align); --- that is PAGE_SIZE --- and allocates 
over already allocated block.

This patch fixes it (kernels 2.6.27-rc2 and 2.6.27-rc3 boot ok after the 
patch). Johannes, please review the patch and submit it to Linus.

With the patch it tries to continue at the end of previous allocation only 
if the previous allocation ended in the middle of the page.

Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>

---
 mm/bootmem.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: ...
From: David Miller
Date: Thursday, August 14, 2008 - 4:25 pm

From: Mikulas Patocka <mpatocka@redhat.com>
Date: Thu, 14 Aug 2008 19:11:19 -0400 (EDT)

[ Adding Alexander back to the CC: as he is seeing this same

Ok, I was just looking at Alexander's debugging dump from my patch
and in his case it pointed to kernel_physical_mapping_init() and
I couldn't find any obvious problems there.


--

From: Alexander Beregalov
Date: Friday, August 15, 2008 - 4:09 am

It is working! Thanks a lot David and Mikulas!

alexb@sparky ~ $ zgrep LOCKDEP /proc/config.gz
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_LOCKDEP=y
CONFIG_DEBUG_LOCKDEP=y
alexb@sparky ~ $ uname -a
Linux sparky 2.6.27-rc3-00171-gb635ace-dirty #3 PREEMPT Fri Aug 15
14:52:40 MSD 2008 sparc64 sun4u TI UltraSparc IIi (Sabre) GNU/Linux
--

From: David Miller
Date: Friday, August 15, 2008 - 2:13 pm

From: "Alexander Beregalov" <a.beregalov@gmail.com>

Thanks for your testing and patience :)
--

From: Johannes Weiner
Date: Thursday, August 14, 2008 - 4:40 pm

Hi Mikulas,


Yes, taking last_end_off into account when it's page-aligned is bogus as
the whole merging thing is about partial pages.


Acked-by: Johannes Weiner <hannes@saeurebad.de>

--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 2:14 pm

Are you sure? What about this:
ide-io.c:ide_intr
         if (drive->unmask)
                 local_irq_enable_in_hardirq();

or this:
kernel/irq/handle.c:handle_IRQ_event
         if (!(action->flags & IRQF_DISABLED))
                 local_irq_enable_in_hardirq();


--- how is number of nested interrupts here supposed to be limited?

If these things are not limited, you get at most as many nested handlers 
as there are hardware interrupts, which means crash.

--

From: David Miller
Date: Friday, June 20, 2008 - 2:20 pm

From: Mikulas Patocka <mpatocka@redhat.com>

It means i386 and every other platform potentially has the same exact
problem.

What point wrt. sparc64 are you trying to make here? :-)


--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 2:25 pm

The difference is that i386 takes minimum 4 bytes per stack frame and 
sparc64 192 bytes per stack frame. So this problem will kill sparc64 
sooner.

But yes, it is general problem and should be solved in arch-independent 
code.

Mikulas
--

From: David Miller
Date: Friday, June 20, 2008 - 2:44 pm

From: Mikulas Patocka <mpatocka@redhat.com>

I agree on both counts.  Although I'm curious what the average stack
frame sizes look like on x86_64 and i386, and also how this area
appears on powerpc.

One mitigating factor on sparc64 is that typically when there are lots
of devices with interrupts there are also lots of cpus, and we evenly
distribute the IRQ targetting amongst the available cpus on sparc64.

This is probably why, in practice, these problems tend to not surface
often.

In any event, with the work you've accomplished and my implementation
of IRQ stacks for sparc64 we should be able to get things in much
better shape.
--

From: David Miller
Date: Friday, June 20, 2008 - 2:47 pm

From: David Miller <davem@davemloft.net>

I also one to mention in passing that another thing we can do to
help deep call stack sizes is to make call chains more tail-call
friendly when possible.
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 3:22 pm

... and remove -fno-optimize-sibling-calls?:

Makefile:
ifdef CONFIG_FRAME_POINTER
KBUILD_CFLAGS   += -fno-omit-frame-pointer -fno-optimize-sibling-calls
else
KBUILD_CFLAGS   += -fomit-frame-pointer
endif

--- maybe it could be better to remove it, instead of some inlining that I 
made. Or do you see a situation when for debugging purpose, user would 
want -fno-optimize-sibling-calls?

Mikulas
--

From: David Miller
Date: Friday, June 20, 2008 - 3:28 pm

From: Mikulas Patocka <mpatocka@redhat.com>

Yes for debugging and other things it has to stay.
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 3:36 pm

If you want it to stay, then it doesn't make sense to make functions 
tail-call-friendly --- because it should not crash with or without 
debugging.

Mikulas
--

From: David Miller
Date: Friday, June 20, 2008 - 3:47 pm

From: Mikulas Patocka <mpatocka@redhat.com>

On the contrary, of course it makes sense to do so.

When debugging is disabled, the kernel will run faster.

We have to fix the stack usage in either case, but from a
performance standpoint when debugging is disabled the
tail-call friendly layout is still highly desirable.
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 5:37 pm

I agree, but performance is different problem than stack overflows.

I put all the patches for this overflow problem here:
http://people.redhat.com/mpatocka/patches/kernel-stack-overflow

Mikulas
--

From: Mikulas Patocka
Date: Friday, June 20, 2008 - 3:33 pm

If I look at an old oops that I have in my log on i386: it's 1104 stack 

I created this to help with nested irqs:
--- linux-2.6.26-rc5-devel.orig/include/linux/interrupt.h       2008-06-20 
23:34:04.000000000 +0200
+++ linux-2.6.26-rc5-devel/include/linux/interrupt.h    2008-06-20 
23:36:03.000000000 +0200
@@ -95,7 +95,7 @@
  #ifdef CONFIG_LOCKDEP
  # define local_irq_enable_in_hardirq() do { } while (0)
  #else
-# define local_irq_enable_in_hardirq() local_irq_enable()
+# define local_irq_enable_in_hardirq() do { if (hardirq_count() <= (1 << 
HARDIRQ_SHIFT)) local_irq_enable(); } while (0)
  #endif

  extern void disable_irq_nosync(unsigned int irq);

Mikulas
--

Previous thread: [2.6 patch] sh/kernel/ cleanups by Adrian Bunk on Tuesday, June 17, 2008 - 5:36 pm. (1 message)

Next thread: [git patches] net driver fixes for .26 by Jeff Garzik on Tuesday, June 17, 2008 - 9:20 pm. (2 messages)