Re: Random crashes with 2.6.27-rc3 on PPC

Previous thread: [PATCH 2/2] ide-cd: add a debug_mask module parameter by Borislav Petkov on Saturday, August 23, 2008 - 6:08 am. (4 messages)

Next thread: GA-MA790FX-DS5 SATA ahci NCQ erros on Jmicron 20360/20363 (JMB363) kernel 2.6.25-2 Debian/Lenny by Sergey Spiridonov on Saturday, August 23, 2008 - 7:32 am. (9 messages)
From: Michael Buesch
Date: Saturday, August 23, 2008 - 7:10 am

I am seeing random crashes of both the kernel and userland applications
on a PowerBook running a 2.6.27-rc3-based kernel (wireless-testing.git).

The crashes appeared only recently. They might have been introduced
with the merge of 2.6.27-rc1 into wireless-testing, but I'm not sure
about that; it's just a guess. I still need to do more testing (also
on vanilla upstream kernels).

The crashes are completely random and look like a hardware fault.
However, I cannot reproduce them on 2.6.25.9 (a kernel I still had
installed, so I tried it), so they are most likely _not_ caused by
faulty hardware.

The crashes are hard to reproduce; they happen about every 20 minutes
while compiling a kernel tree (gcc segfaults). Sometimes the kernel
oopses in random places with pointer dereference faults.

Is this a known issue?
I'm going to bisect it, but that will take a lot of time: a crash shows
up only about every 20 minutes, so one test round takes about an hour.
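The one-hour figure can be made a little more concrete. Purely as an assumption for illustration, treat the crashes as a Poisson process with the 20-minute mean interval reported above; then a bad kernel survives a test of t minutes with probability exp(-t/20), and a bisection over N candidate commits needs about ceil(log2 N) rounds. The helper names and the commit count below are hypothetical, not taken from this thread.

```python
import math


def p_false_good(test_minutes: float, mean_interval_minutes: float) -> float:
    """Chance that a bad kernel shows no crash during a test run,
    assuming (hypothetically) Poisson-distributed crashes."""
    return math.exp(-test_minutes / mean_interval_minutes)


def bisect_steps(num_commits: int) -> int:
    """Rounds needed to isolate one commit among num_commits candidates,
    i.e. ceil(log2(num_commits)) computed with integer arithmetic."""
    if num_commits < 1:
        raise ValueError("need at least one candidate commit")
    return (num_commits - 1).bit_length()


if __name__ == "__main__":
    # 20-minute mean interval from the report; one hour per test round.
    print(f"P(bad kernel passes a 60 min round) = {p_false_good(60, 20):.3f}")
    commits = 10_000  # hypothetical size of the good..bad range
    steps = bisect_steps(commits)
    print(f"{steps} rounds, roughly {steps} hours at one hour per round")
```

Under these assumptions a one-hour round still lets a bad kernel pass about 5% of the time (exp(-3) is roughly 0.05), and the hypothetical 10,000 commits would need 14 rounds, i.e. around 14 hours of testing.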

The kernel configuration is the following:

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.27-rc3
# Fri Aug 22 18:57:55 2008
#
# CONFIG_PPC64 is not set

#
# Processor support
#
CONFIG_6xx=y
# CONFIG_PPC_85xx is not set
# CONFIG_PPC_8xx is not set
# CONFIG_40x is not set
# CONFIG_44x is not set
# CONFIG_E200 is not set
CONFIG_PPC_FPU=y
CONFIG_ALTIVEC=y
CONFIG_PPC_STD_MMU=y
CONFIG_PPC_STD_MMU_32=y
# CONFIG_PPC_MM_SLICES is not set
# CONFIG_SMP is not set
CONFIG_PPC32=y
CONFIG_WORD_SIZE=32
CONFIG_PPC_MERGE=y
CONFIG_MMU=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_TIME_VSYSCALL=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_HARDIRQS=y
# CONFIG_HAVE_SETUP_PER_CPU_AREA is not ...
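The frame-pointer options discussed in the replies (CONFIG_FRAME_POINTER, CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER) can be looked up in a .config like the one above with a few lines of code. This is a minimal sketch, assuming only the two line forms shown here (`CONFIG_X=value` and `# CONFIG_X is not set`); `parse_kconfig` is a hypothetical helper, not a script shipped with the kernel.

```python
import re
from typing import Dict

# "CONFIG_FOO=value" and "# CONFIG_FOO is not set"; other comments are ignored.
_SET = re.compile(r"^(CONFIG_[A-Za-z0-9_]+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_[A-Za-z0-9_]+) is not set$")


def parse_kconfig(text: str) -> Dict[str, str]:
    """Map option names to values; options marked 'is not set' map to 'n'."""
    options: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        m = _SET.match(line)
        if m:
            # String options are stored quoted in .config; drop the quotes.
            options[m.group(1)] = m.group(2).strip('"')
            continue
        m = _UNSET.match(line)
        if m:
            options[m.group(1)] = "n"
    return options
```

For example, `parse_kconfig(open(".config").read()).get("CONFIG_FRAME_POINTER", "n")` reports whether frame pointers are enabled; options missing from the file are treated as unset.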
From: Benjamin Herrenschmidt
Date: Saturday, August 23, 2008 - 3:52 pm

Random guess:

CONFIG_FRAME_POINTER=y
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y

Not sure what those two do together. Check whether you have any file
compiled with -fno-omit-frame-pointer and, if you do, try to change things
so that you don't. We found some miscompiles when that flag is set,
typically exposed by FTRACE (which you don't have enabled) but possibly
by other things as well.
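Checking which files were compiled with -fno-omit-frame-pointer can be done without a rebuild, assuming the tree was built with kbuild, which records each object's compiler command line in a hidden `.<object>.cmd` file next to it. A minimal sketch, assuming the first line of each such file has the form `cmd_<object> := <command>`; `objects_built_with` is a hypothetical helper.

```python
import os
from typing import Iterator, Tuple

FLAG = "-fno-omit-frame-pointer"


def objects_built_with(build_dir: str, flag: str = FLAG) -> Iterator[Tuple[str, str]]:
    """Yield (cmd_file, object) for kbuild .cmd files whose recorded
    command line contains `flag` as a separate word."""
    for root, _dirs, files in os.walk(build_dir):
        for name in files:
            if not (name.startswith(".") and name.endswith(".o.cmd")):
                continue
            path = os.path.join(root, name)
            with open(path, errors="replace") as f:
                first = f.readline()  # assumed: "cmd_<obj> := <command line>"
            target, sep, command = first.partition(" := ")
            if not sep or not target.startswith("cmd_"):
                continue
            # Split into words so -fomit-frame-pointer is not a false match.
            if flag in command.split():
                yield path, target[len("cmd_"):]
```

For example, `for _, obj in objects_built_with("/path/to/build"): print(obj)` lists the affected objects. Rebuilding with `make V=1` and searching the printed command lines is an alternative that needs no helper.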

Ben.


--

From: Michael Buesch
Date: Sunday, August 24, 2008 - 12:23 am

OK, thanks for the suggestion.
I could reproduce the crash with 2.6.26, so this is not a regression between
2.6.26 and 2.6.27-rcX.
I'm currently running longer tests on 2.6.25 again to make sure it really
isn't hardware-related.


NO_NO_OMIT is a brain screwer, btw :)
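The double negative can be untangled with a small truth table built from the Makefile logic quoted later in this thread: the top-level `ifdef CONFIG_FRAME_POINTER` block and the `ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)` block in kernel/Makefile. The sketch assumes that the truncated kernel/Makefile block appends -fno-omit-frame-pointer to sched.o only, after the global flags, and that gcc honours the last of -fomit-frame-pointer / -fno-omit-frame-pointer on the command line. The helper names are hypothetical.

```python
from typing import List, Optional


def frame_pointer_flags(obj: str, frame_pointer: bool, sched_no_no_omit: bool) -> List[str]:
    """Frame-pointer-related CFLAGS one object gets (sketch)."""
    flags: List[str] = []
    # Top-level Makefile: global choice driven by CONFIG_FRAME_POINTER.
    if frame_pointer:
        flags += ["-fno-omit-frame-pointer", "-fno-optimize-sibling-calls"]
    else:
        flags += ["-fomit-frame-pointer"]
    # kernel/Makefile (assumed): unless SCHED_NO_NO_OMIT_FRAME_POINTER=y,
    # sched.o additionally gets -fno-omit-frame-pointer, appended last.
    if obj == "kernel/sched.o" and not sched_no_no_omit:
        flags += ["-fno-omit-frame-pointer"]
    return flags


def keeps_frame_pointer(flags: List[str]) -> Optional[bool]:
    """Last of -fomit/-fno-omit wins; None if neither flag is present."""
    for flag in reversed(flags):
        if flag in ("-fomit-frame-pointer", "-fno-omit-frame-pointer"):
            return flag == "-fno-omit-frame-pointer"
    return None


if __name__ == "__main__":
    for fp in (False, True):
        for nno in (False, True):
            sched = keeps_frame_pointer(frame_pointer_flags("kernel/sched.o", fp, nno))
            other = keeps_frame_pointer(frame_pointer_flags("kernel/fork.o", fp, nno))
            print(f"FRAME_POINTER={fp!s:5} NO_NO_OMIT={nno!s:5} "
                  f"sched.o keeps FP={sched!s:5} others keep FP={other}")
```

In this model SCHED_NO_NO_OMIT_FRAME_POINTER=y only means "do not force a frame pointer onto sched.o"; with CONFIG_FRAME_POINTER=y, as Ben quotes from the configuration, every object is built with -fno-omit-frame-pointer either way.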

-- 
Greetings Michael.
--

From: Michael Buesch
Date: Sunday, August 24, 2008 - 6:44 am

Thanks for your random guess.
The following workaround seems to fix the crashes on powerpc.
However, this patch is clearly not what we want for other architectures,
as they might need -fno-omit-frame-pointer to function properly.

I reproduced the random crashes of kernel and userspace applications
(without the following patch) on vanilla 2.6.26 and 2.6.27-rc{1-4}
kernels. I did _not_ try a 2.6.25 kernel with -fno-omit-frame-pointer, so
I don't know whether it would also crash.

I'm currently running more tests on a patched 2.6.27-rc4 kernel, and it
hasn't crashed yet. I have already done 5 complete kernel tree compilations.
It should have crashed by now, but it didn't :)

The compiler is:
gcc (GCC) 4.1.3 20070929 (prerelease) (Ubuntu 4.1.2-16ubuntu2)


Index: linux-2.6/Makefile
===================================================================
--- linux-2.6.orig/Makefile	2008-08-24 11:49:53.000000000 +0200
+++ linux-2.6/Makefile	2008-08-24 12:16:42.000000000 +0200
@@ -523,13 +523,13 @@ endif
 
 # Force gcc to behave correct even for buggy distributions
 # Arch Makefiles may override this setting
 KBUILD_CFLAGS += $(call cc-option, -fno-stack-protector)
 
 ifdef CONFIG_FRAME_POINTER
-KBUILD_CFLAGS	+= -fno-omit-frame-pointer -fno-optimize-sibling-calls
+KBUILD_CFLAGS	+= -fno-optimize-sibling-calls
 else
 KBUILD_CFLAGS	+= -fomit-frame-pointer
 endif
 
 ifdef CONFIG_DEBUG_INFO
 KBUILD_CFLAGS	+= -g
Index: linux-2.6/kernel/Makefile
===================================================================
--- linux-2.6.orig/kernel/Makefile	2008-08-24 11:50:23.000000000 +0200
+++ linux-2.6/kernel/Makefile	2008-08-24 12:15:54.000000000 +0200
@@ -92,13 +92,13 @@ obj-$(CONFIG_SMP) += sched_cpupri.o
 ifneq ($(CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER),y)
 # According to Alan Modra <alan@linuxcare.com.au>, the -fno-omit-frame-pointer is
 # needed for x86 only.  Why this used to be enabled for all architectures is beyond
 # me.  I suspect most platforms don't need this, but until we know ...
From: Andreas Schwab
Date: Sunday, August 24, 2008 - 7:46 am

This has a better chance to be accepted. :-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 8b5a7d3..f9a2e48 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -394,7 +394,7 @@ config LOCKDEP
 	bool
 	depends on DEBUG_KERNEL && TRACE_IRQFLAGS_SUPPORT && STACKTRACE_SUPPORT && LOCKDEP_SUPPORT
 	select STACKTRACE
-	select FRAME_POINTER if !X86 && !MIPS
+	select FRAME_POINTER if !X86 && !MIPS && !PPC
 	select KALLSYMS

CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER is already enabled on powerpc.

Andreas.

-- 
Andreas Schwab, SuSE Labs, schwab@suse.de
SuSE Linux Products GmbH, Maxfeldstraße 5, 90409 Nürnberg, Germany
PGP key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
--

From: Michael Buesch
Date: Sunday, August 24, 2008 - 8:00 am

This is not what my patch does.
Your patch always forces FRAME_POINTER off, at least as far as lockdep is
concerned. What about other parts of the kernel that enable FRAME_POINTER?

I think this should be fixed in the Makefile by removing
-fno-omit-frame-pointer from the flags on PPC (probably depending on the
compiler version).

Otherwise, if somebody else decides to select FRAME_POINTER in some other
code, the bug will reappear.
I'm also not sure it's desirable to always force FRAME_POINTER off.
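Michael's objection can be illustrated with a toy model of Kconfig `select` next to a Makefile-level filter. This is a sketch under stated assumptions: selectors are modelled as a map from each selecting symbol to the architectures on which its `select FRAME_POINTER if ...` condition is false, `NEW_DEBUG_OPTION` is a hypothetical future selector, `filter_out` imitates GNU make's `$(filter-out ...)` for exact words only (no `%` patterns), and `-O2` stands in for the rest of KBUILD_CFLAGS. The compiler-version check Michael mentions is left out because the thread does not say which versions are affected.

```python
from typing import Dict, List, Set

# LOCKDEP's select condition before and after the proposed Kconfig change.
LOCKDEP_BEFORE: Dict[str, Set[str]] = {"LOCKDEP": {"X86", "MIPS"}}
LOCKDEP_AFTER: Dict[str, Set[str]] = {"LOCKDEP": {"X86", "MIPS", "PPC"}}


def frame_pointer_selected(arch: str, enabled: List[str],
                           selectors: Dict[str, Set[str]]) -> bool:
    """True if any enabled symbol selects FRAME_POINTER on this arch."""
    return any(sym in enabled and arch not in excluded
               for sym, excluded in selectors.items())


def filter_out(unwanted: List[str], words: List[str]) -> List[str]:
    """Exact-match subset of GNU make's $(filter-out ...)."""
    return [w for w in words if w not in unwanted]


def ppc_cflags(frame_pointer: bool) -> List[str]:
    """Global CFLAGS with the flag filtered out on PPC afterwards,
    no matter which option turned FRAME_POINTER on."""
    flags = ["-O2"]  # placeholder for the rest of KBUILD_CFLAGS
    if frame_pointer:
        flags += ["-fno-omit-frame-pointer", "-fno-optimize-sibling-calls"]
    else:
        flags += ["-fomit-frame-pointer"]
    return filter_out(["-fno-omit-frame-pointer"], flags)


if __name__ == "__main__":
    # The Kconfig change stops LOCKDEP from selecting FRAME_POINTER on PPC...
    print(frame_pointer_selected("PPC", ["LOCKDEP"], LOCKDEP_BEFORE))  # True
    print(frame_pointer_selected("PPC", ["LOCKDEP"], LOCKDEP_AFTER))   # False
    # ...but a hypothetical new selector turns it back on.
    extra = dict(LOCKDEP_AFTER, NEW_DEBUG_OPTION=set())
    print(frame_pointer_selected("PPC", ["LOCKDEP", "NEW_DEBUG_OPTION"], extra))  # True
    # The Makefile-level filter removes the flag either way.
    print(ppc_cflags(frame_pointer=True))  # ['-O2', '-fno-optimize-sibling-calls']
```

The Kconfig route has to be repeated for every new selector, while the filter acts on the final flag list; the trade-off is that it also strips the flag where it is genuinely needed.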

-- 
Greetings Michael.
--

From: Benjamin Herrenschmidt
Date: Sunday, August 24, 2008 - 3:39 pm

Unfortunately, that won't solve the FTRACE problem.


--

From: Benjamin Herrenschmidt
Date: Sunday, August 24, 2008 - 3:37 pm

Well, -pg requires it, even on powerpc, so that won't work for
ftrace. Any chance you can try the workaround that Segher proposed,
though?

http://penguinppc.de/~segher/0001-powerpc-Workaround-for-the-ftrace-problem.patch

His workaround only kicks in with CONFIG_FTRACE; that would have to be
fixed, of course. Also, I suspect the files that have -pg in a
CFLAGS_REMOVE section should also have -fno-omit-frame-pointer there.

Thanks!

Ben.


--

From: Tony Breeds
Date: Monday, September 1, 2008 - 11:50 pm

This bug is causing random crashes
(http://bugzilla.kernel.org/show_bug.cgi?id=11414).  -fno-omit-frame-pointer is
only needed on powerpc when -pg is also supplied.  This patch ensures that
CONFIG_FRAME_POINTER is only selected by ftrace.  When CONFIG_FTRACE is enabled
we also pass -mno-sched-epilog to work around the codegen bug.
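The per-object effect of this patch can be sketched as follows. The sketch assumes, as the description says, that CONFIG_FRAME_POINTER on powerpc is turned on only by ftrace; that CONFIG_FTRACE adds -pg globally; and that kbuild filters each object's `CFLAGS_REMOVE_<obj>` list out of its compiler flags. The removal lists come from the arch/powerpc/kernel/Makefile hunk below; `object_cflags` is a hypothetical helper and `-O2` stands in for the rest of KBUILD_CFLAGS.

```python
from typing import Dict, List

# Removal lists from the arch/powerpc/kernel/Makefile hunk (FTRACE case).
# ftrace.o's entry really applies only with CONFIG_DYNAMIC_FTRACE; this
# sketch does not model that distinction.
CFLAGS_REMOVE: Dict[str, List[str]] = {
    "cputable.o": ["-pg", "-mno-sched-epilog"],
    "prom_init.o": ["-pg", "-mno-sched-epilog"],
    "btext.o": ["-pg", "-mno-sched-epilog"],
    "ftrace.o": ["-pg", "-mno-sched-epilog"],
}


def object_cflags(obj: str, ftrace: bool) -> List[str]:
    """Frame-pointer and tracing flags one object would get (sketch)."""
    flags = ["-O2"]  # placeholder for the rest of KBUILD_CFLAGS
    if ftrace:
        # Assumed: FTRACE selects FRAME_POINTER and adds -pg; the
        # arch/powerpc/Makefile hunk adds -mno-sched-epilog.
        flags += ["-fno-omit-frame-pointer", "-fno-optimize-sibling-calls",
                  "-pg", "-mno-sched-epilog"]
        # Assumed kbuild behaviour: CFLAGS_REMOVE_<obj> is filtered out.
        removed = CFLAGS_REMOVE.get(obj, [])
        flags = [f for f in flags if f not in removed]
    else:
        # Per the patch description, nothing else selects FRAME_POINTER.
        flags += ["-fomit-frame-pointer"]
    return flags


if __name__ == "__main__":
    for obj in ("sched.o", "prom_init.o"):
        for ftrace in (False, True):
            print(f"{obj:12} FTRACE={ftrace!s:5} {' '.join(object_cflags(obj, ftrace))}")
```

In this model the early-boot objects keep -fno-omit-frame-pointer but lose -mno-sched-epilog; Ben's earlier suggestion of also listing -fno-omit-frame-pointer in those CFLAGS_REMOVE lines is the variant that would change that.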

Patch based on work by:
	Andreas Schwab <schwab@suse.de>
	Segher Boessenkool <segher@kernel.crashing.org>

Signed-off-by: Tony Breeds <tony@bakeyournoodle.com>
---
 arch/powerpc/Makefile                    |    5 +++++
 arch/powerpc/kernel/Makefile             |    7 ++++---
 arch/powerpc/platforms/powermac/Makefile |    2 +-
 lib/Kconfig.debug                        |    6 +++---
 4 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index 9155c93..c6be19e 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -116,6 +116,11 @@ ifeq ($(CONFIG_6xx),y)
 KBUILD_CFLAGS		+= -mcpu=powerpc
 endif
 
+# Work around a gcc code-gen bug with -fno-omit-frame-pointer.
+ifeq ($(CONFIG_FTRACE),y)
+KBUILD_CFLAGS		+= -mno-sched-epilog
+endif
+
 cpu-as-$(CONFIG_4xx)		+= -Wa,-m405
 cpu-as-$(CONFIG_6xx)		+= -Wa,-maltivec
 cpu-as-$(CONFIG_POWER4)		+= -Wa,-maltivec
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 64f5948..946daea 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -14,12 +14,13 @@ endif
 
 ifdef CONFIG_FTRACE
 # Do not trace early boot code
-CFLAGS_REMOVE_cputable.o = -pg
-CFLAGS_REMOVE_prom_init.o = -pg
+CFLAGS_REMOVE_cputable.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_prom_init.o = -pg -mno-sched-epilog
+CFLAGS_REMOVE_btext.o = -pg -mno-sched-epilog
 
 ifdef CONFIG_DYNAMIC_FTRACE
 # dynamic ftrace setup.
-CFLAGS_REMOVE_ftrace.o = -pg
+CFLAGS_REMOVE_ftrace.o = -pg -mno-sched-epilog
 endif
 
 endif
diff --git a/arch/powerpc/platforms/powermac/Makefile b/arch/powerpc/platforms/powermac/Makefile
index ...