Re: edac_core: crashes on shutdown

Previous thread: [PATCH] spi: spidev: Add 32 bit compat ioctl() by Bernhard Walle on Wednesday, December 1, 2010 - 3:51 am. (1 message)

Next thread: [Patch] net: kill an RCU warning in inet_fill_link_af() by Amerigo Wang on Wednesday, December 1, 2010 - 4:14 am. (10 messages)
From: Tobias Karnat
Date: Wednesday, December 1, 2010 - 4:01 am

Hi,

I have a problem with the edac_core module in 2.6.36.1, it crashes on shutdown.
If I remove it manually, it does not crash.

I compiled the edac modules from 2.6.35 against the headers of 2.6.36.1 and they are fine.

-Tobias

Screenshot of the crash:

http://oi51.tinypic.com/r9mu55.jpg

Used edac modules:

i82975x_edac
edac_core

Module option:

options edac_core check_pci_errors=1

Kernel config:

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.36.1
# Wed Dec  1 00:37:32 2010
#
CONFIG_64BIT=y
# CONFIG_X86_32 is not set
CONFIG_X86_64=y
CONFIG_X86=y
CONFIG_INSTRUCTION_DECODER=y
CONFIG_OUTPUT_FORMAT="elf64-x86-64"
CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig"
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_NEED_DMA_MAP_STATE=y
CONFIG_NEED_SG_DMA_LENGTH=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_GPIO=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
# CONFIG_RWSEM_GENERIC_SPINLOCK is not ...
From: Borislav Petkov
Date: Wednesday, December 1, 2010 - 5:39 am

What does that mean? Are you using an official kernel version or

Please try to catch the whole oops. I can see rIP but it'd be much more
helpful to see the calltrace and the whole register dump which are cut
off at the bottom of your pic.

Thanks.

-- 
Regards/Gruss,
Boris.
--

From: Tobias Karnat
Date: Wednesday, December 1, 2010 - 6:24 am

I'm using the official 2.6.36.1 kernel.
The problem happens with the modules from the kernel 2.6.36.1

But as a workaround for now,
I'm using the edac modules from 2.6.35 compiled for 2.6.36.1

Yes, here you go:

http://i55.tinypic.com/igyhpt.jpg

-Tobias



--

From: Borislav Petkov
Date: Wednesday, December 1, 2010 - 7:33 am

Unfortunately, this is still incomplete because it is missing the Code: tag and
the end trace tag which looks like this

---[ end trace ... ]---

and this is most probably so because we're halting and cannot print out
all the oops. Two things you can do:

1. Connect the machine to another machine over serial- or netconsole.
Google for instructions how.

2. If you can't do 1. you can try enabling CONFIG_BOOT_PRINTK_DELAY and
boot the kernel with 'boot_delay=10', for example, so that a 10ms delay
is inserted after each printk line. This might let us see more on the
console before halt and catch the whole oops.

Thanks.

-- 
Regards/Gruss,
Boris.
--

From: Tobias Karnat
Date: Wednesday, December 1, 2010 - 10:46 am

Yes, boot_delay=10 has helped and I also removed the nvidia driver:

http://oi56.tinypic.com/eas0pg.jpg

-Tobias



--

From: Borislav Petkov
Date: Wednesday, December 1, 2010 - 12:35 pm

Hmm, I see two backtraces now. I'm still not sure whether the first
one is actually the first one that happens since it is cut off at the
beginning.

If it is the first one, it should contain something like

[ ... ] Pid: ... , comm: ... Not tainted <kernel version> <DMI system name>

the important part being the "Not tainted" string. Can you check that
please?

Also, can you disable the EDAC subsystem in Kconfig and retry?
I'm trying to rule out some other issue since the backtraces are
inconclusive wrt EDAC.

Thanks.

-- 
Regards/Gruss,
    Boris.
--

From: Tobias Karnat
Date: Thursday, December 2, 2010 - 2:03 am

I only can get an oops with the end cut off or two oopses with the
beginning cut off.

But here is an oops with the "Not tainted" string:


Yes, maybe on the weekend, but as I said replacing the modules with
compiled from older src does work fine.

-Tobias

--

From: Florian Mickler
Date: Thursday, December 2, 2010 - 7:51 am

On Thu, 02 Dec 2010 10:03:33 +0100

What edac modules are you using? 

Regards,
Flo
--

From: Florian Mickler
Date: Thursday, December 2, 2010 - 7:52 am

On Thu, 2 Dec 2010 15:51:41 +0100

hm.. just found it in your first message,  sorry... ;)
--

From: Borislav Petkov
Date: Thursday, December 2, 2010 - 8:21 am

Well, thanks for the photos. I don't have an idea what might cause this
workqueue corruption I'm seeing, all reg/unreg paths look ok. The only
change that came in between .35 and .36.1 I can think of being relevant
is 00740c58541b6087d78418cebca1fcb86dc6077d. You could try backing that
one out to see whether it fixes the issue.

-- 
Regards/Gruss,
Boris.
--

From: Tobias Karnat
Date: Thursday, December 2, 2010 - 9:21 am

Yes, reverting this fixed the issue!

But why?

-Tobias


--

From: Florian Mickler
Date: Thursday, December 2, 2010 - 10:02 am

On Thu, 02 Dec 2010 17:21:12 +0100

It doesnt work because op_state is set to OP_OFFLINE in edac_mc_del_mc
before calling edac_mc_workq_teardown. Or am I seeing things?

<snip>
577 struct mem_ctl_info *edac_mc_del_mc(struct device *dev)
578 {
<snip>
592         /* marking MCI offline */
593         mci->op_state = OP_OFFLINE;
594 
595         del_mc_from_global_list(mci);
596         mutex_unlock(&mem_ctls_mutex);
597 
598         /* flush workq processes and remove sysfs */
599         edac_mc_workq_teardown(mci);
<snip>

Probably a better check in _teardown is on a bool that get's set in
edac_mc_workq_setup...?
--

From: Borislav Petkov
Date: Thursday, December 2, 2010 - 10:07 am

Good observation Florian, see my other mail I just sent.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--

From: Borislav Petkov
Date: Thursday, December 2, 2010 - 10:06 am

Dang, I know why. This whole ->op_state fumbling is pretty fragile and
needs de-fragilizing :o).

Please try out the one below after re-reverting
00740c58541b6087d78418cebca1fcb86dc6077d (i.e., ontop of .36.1).

Thanks.

--
From: Borislav Petkov <borislav.petkov@amd.com>
Date: Thu, 2 Dec 2010 17:48:35 +0100
Subject: [PATCH] EDAC: Fix workqueue-related crashes

00740c58541b6087d78418cebca1fcb86dc6077d changed edac_core to
un-/register a workqueue item only if a lowlevel driver supplies a
polling routine. Normally, when we remove a polling low-level driver,
we go and teardown the workqueue and cancel all the queued work.
However, the workqueue unreg happens based on the ->op_state setting,
and edac_mc_del_mc() sets this to OP_OFFLINE _before_ we cancel the work
item, leading to NULL ptr oops on the workqueue list.

Fix it by putting the unreg stuff in proper order.

Cc: <stable@kernel.org> #36.x
Reported-by: Tobias Karnat <tobias.karnat@googlemail.com>
LKML-Reference: <1291201307.3029.21.camel@Tobias-Karnat>
Signed-off-by: Borislav Petkov <borislav.petkov@amd.com>
---
 drivers/edac/edac_mc.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c
index 6b21e25..6d2e34d 100644
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@ -578,14 +578,16 @@ struct mem_ctl_info *edac_mc_del_mc(struct device *dev)
 		return NULL;
 	}
 
-	/* marking MCI offline */
-	mci->op_state = OP_OFFLINE;
-
 	del_mc_from_global_list(mci);
 	mutex_unlock(&mem_ctls_mutex);
 
-	/* flush workq processes and remove sysfs */
+	/* flush workq processes */
 	edac_mc_workq_teardown(mci);
+
+	/* marking MCI offline */
+	mci->op_state = OP_OFFLINE;
+
+	/* remove from sysfs */
 	edac_remove_sysfs_mci_device(mci);
 
 	edac_printk(KERN_INFO, EDAC_MC,
-- 
1.7.3.1.50.g1e633


-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew ...
From: Tobias Karnat
Date: Thursday, December 2, 2010 - 11:05 am

This patch fixed it.
I have rebooted five times and it does not crash anymore.

Thank you.

Btw, are there any information available regarding the NMI option?

parm:	edac_op_state:EDAC Error Reporting state: 0=Poll,1=NMI (int)

In edac.txt NMI is listed under FUTURE HARDWARE SCANNING.

-Tobias

--

From: Borislav Petkov
Date: Thursday, December 2, 2010 - 11:37 am

Thanks for testing and taking pictures :). I'll send it to Linus before

Well, looking at <arch/x86/kernel/traps.c:mem_parity_error()> this
should already work. But it is kinda of a hack, if I'm reading Doug
correctly: http://lkml.org/lkml/2010/9/21/144

And yes, using some kind of an interrupt is much better than polling but
I don't know whether there's a single interrupt source for the error
types all edac drivers can decode and report. In the amd64_edac case,
we're piggybacking on MCE, for example. This was actually the initial
reason for 00740c58541b6087d78418cebca1fcb86dc6077d and dropping polling
from that driver.

Doug, any additions?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--

From: Tobias Karnat
Date: Thursday, December 2, 2010 - 3:34 pm

This thread has many unknown abbreviations for me,

I thought as you wrote later, that this would be an option to use an
interrupt instead of polling for edac in general.

But as far as I understand this, it is used for PCI bus errors?

And should be replaced for detection of PCI SERR and/or PCIE AER,

From Intel Architectures Software Developer's Manual Part 3A:

"Starting with 45nm Intel 64 processor with CPUID signature
DisplayFamily_DisplayModel encoding of 06H_1AH (...), the processor can
report information on corrected machine-check errors and deliver a
programmable interrupt for software to respond to MC errors, referred to
as corrected machine-check error interrupt (CMCI)."

Seems to be unlikely for me on an Intel 975X Mainboard.

-Tobias

--

From: Borislav Petkov
Date: Friday, December 3, 2010 - 4:58 am

Yep, reportedly, memory parity errors can be reported through an NMI.
However, I don't know whether all chipset vendors still do that since

Yep, this all pertains to PCI/PCIe error reporting but don't
ask me about specifics. You might get some more info from
http://lwn.net/Articles/193468/ - it is not in the current kernel
sources though. Also, EDAC has a generic polling routine for PCI errors,

Hmm, I don't know, let's ask the driver author :)

Arvind?

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--

From: Florian Mickler
Date: Thursday, December 2, 2010 - 11:14 am

On Thu, 2 Dec 2010 18:06:10 +0100

Yes. That should work. Once we stopped the workqueue and removed it
from the global list, do we actually need to set it to OP_OFFLINE?

Also 00740c585 did fix a hang in edac_mc.c... could this also
happen in the edac_device_del_device/edac_pci_del_device functions? 

Regards,
Flo

--

From: Borislav Petkov
Date: Thursday, December 2, 2010 - 11:51 am

I think yes, because we seem to protect ourselves in the actual
edac_mc_workq_function() on exit, if we overlap the work items
cancellation with the execution of the delayed work at the same time on
a different cpu. Besides, it is a single assignment and it does cost us

Nope, because there we don't check ->op_state when we cancel the work
items in the respective _teardown() functions - we simply cancel them
unconditionally.

-- 
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632
--

From: Florian Mickler
Date: Thursday, December 2, 2010 - 12:54 pm

On Thu, 2 Dec 2010 19:51:23 +0100

true. I wonder if the flush workqueue waits for the work-function

But shouldn't we check ->op_state for those as well? Why don't we hang
for those functions in similar cases as your original patch fixed?

--

Previous thread: [PATCH] spi: spidev: Add 32 bit compat ioctl() by Bernhard Walle on Wednesday, December 1, 2010 - 3:51 am. (1 message)

Next thread: [Patch] net: kill an RCU warning in inet_fill_link_af() by Amerigo Wang on Wednesday, December 1, 2010 - 4:14 am. (10 messages)