Hi, I have a problem with the edac_core module in 2.6.36.1, it crashes on shutdown. If I remove it manually, it does not crash. I compiled the edac modules from 2.6.35 against the headers of 2.6.36.1 and they are fine. -Tobias Screenshot of the crash: http://oi51.tinypic.com/r9mu55.jpg Used edac modules: i82975x_edac edac_core Module option: options edac_core check_pci_errors=1 Kernel config: # # Automatically generated make config: don't edit # Linux kernel version: 2.6.36.1 # Wed Dec 1 00:37:32 2010 # CONFIG_64BIT=y # CONFIG_X86_32 is not set CONFIG_X86_64=y CONFIG_X86=y CONFIG_INSTRUCTION_DECODER=y CONFIG_OUTPUT_FORMAT="elf64-x86-64" CONFIG_ARCH_DEFCONFIG="arch/x86/configs/x86_64_defconfig" CONFIG_GENERIC_CMOS_UPDATE=y CONFIG_CLOCKSOURCE_WATCHDOG=y CONFIG_GENERIC_CLOCKEVENTS=y CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y CONFIG_LOCKDEP_SUPPORT=y CONFIG_STACKTRACE_SUPPORT=y CONFIG_HAVE_LATENCYTOP_SUPPORT=y CONFIG_MMU=y CONFIG_ZONE_DMA=y CONFIG_NEED_DMA_MAP_STATE=y CONFIG_NEED_SG_DMA_LENGTH=y CONFIG_GENERIC_ISA_DMA=y CONFIG_GENERIC_IOMAP=y CONFIG_GENERIC_BUG=y CONFIG_GENERIC_BUG_RELATIVE_POINTERS=y CONFIG_GENERIC_HWEIGHT=y CONFIG_GENERIC_GPIO=y CONFIG_ARCH_MAY_HAVE_PC_FDC=y # CONFIG_RWSEM_GENERIC_SPINLOCK is not ...
What does that mean? Are you using an official kernel version or Please try to catch the whole oops. I can see rIP but it'd be much more helpful to see the calltrace and the whole register dump which are cut off at the bottom of your pic. Thanks. -- Regards/Gruss, Boris. --
I'm using the official 2.6.36.1 kernel. The problem happens with the modules from the kernel 2.6.36.1 But as a workaround for now, I'm using the edac modules from 2.6.35 compiled for 2.6.36.1 Yes, here you go: http://i55.tinypic.com/igyhpt.jpg -Tobias --
Unfortunately, this is still incomplete because it is missing the Code: tag and the end trace tag which looks like this ---[ end trace ... ]--- and this is most probably so because we're halting and cannot print out all the oops. Two things you can do: 1. Connect the machine to another machine over serial- or netconsole. Google for instructions how. 2. If you can't do 1. you can try enabling CONFIG_BOOT_PRINTK_DELAY and boot the kernel with 'boot_delay=10', for example, so that a 10ms delay is inserted after each printk line. This might let us see more on the console before halt and catch the whole oops. Thanks. -- Regards/Gruss, Boris. --
Yes, boot_delay=10 has helped and I also removed the nvidia driver: http://oi56.tinypic.com/eas0pg.jpg -Tobias --
Hmm, I see two backtraces now. I'm still not sure whether the first
one is actually the first one that happens since it is cut off at the
beginning.
If it is the first one, it should contain something like
[ ... ] Pid: ... , comm: ... Not tainted <kernel version> <DMI system name>
the important part being the "Not tainted" string. Can you check that
please?
Also, can you disable the EDAC subsystem in Kconfig and retry?
I'm trying to rule out some other issue since the backtraces are
inconclusive wrt EDAC.
Thanks.
--
Regards/Gruss,
Boris.
--
I only can get an oops with the end cut off or two oopses with the beginning cut off. But here is an oops with the "Not tainted" string: Yes, maybe on the weekend, but as I said replacing the modules with compiled from older src does work fine. -Tobias --
On Thu, 02 Dec 2010 10:03:33 +0100 What edac modules are you using? Regards, Flo --
On Thu, 2 Dec 2010 15:51:41 +0100 hm.. just found it in your first message, sorry... ;) --
Well, thanks for the photos. I don't have an idea what might cause this workqueue corruption I'm seeing, all reg/unreg paths look ok. The only change that came in between .35 and .36.1 I can think of being relevant is 00740c58541b6087d78418cebca1fcb86dc6077d. You could try backing that one out to see whether it fixes the issue. -- Regards/Gruss, Boris. --
Yes, reverting this fixed the issue! But why? -Tobias --
On Thu, 02 Dec 2010 17:21:12 +0100
It doesnt work because op_state is set to OP_OFFLINE in edac_mc_del_mc
before calling edac_mc_workq_teardown. Or am I seeing things?
<snip>
577 struct mem_ctl_info *edac_mc_del_mc(struct device *dev)
578 {
<snip>
592 /* marking MCI offline */
593 mci->op_state = OP_OFFLINE;
594
595 del_mc_from_global_list(mci);
596 mutex_unlock(&mem_ctls_mutex);
597
598 /* flush workq processes and remove sysfs */
599 edac_mc_workq_teardown(mci);
<snip>
Probably a better check in _teardown is on a bool that get's set in
edac_mc_workq_setup...?
--
Good observation Florian, see my other mail I just sent. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 --
Dang, I know why. This whole ->op_state fumbling is pretty fragile and needs de-fragilizing :o). Please try out the one below after re-reverting 00740c58541b6087d78418cebca1fcb86dc6077d (i.e., ontop of .36.1). Thanks. -- From: Borislav Petkov <borislav.petkov@amd.com> Date: Thu, 2 Dec 2010 17:48:35 +0100 Subject: [PATCH] EDAC: Fix workqueue-related crashes 00740c58541b6087d78418cebca1fcb86dc6077d changed edac_core to un-/register a workqueue item only if a lowlevel driver supplies a polling routine. Normally, when we remove a polling low-level driver, we go and teardown the workqueue and cancel all the queued work. However, the workqueue unreg happens based on the ->op_state setting, and edac_mc_del_mc() sets this to OP_OFFLINE _before_ we cancel the work item, leading to NULL ptr oops on the workqueue list. Fix it by putting the unreg stuff in proper order. Cc: <stable@kernel.org> #36.x Reported-by: Tobias Karnat <tobias.karnat@googlemail.com> LKML-Reference: <1291201307.3029.21.camel@Tobias-Karnat> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> --- drivers/edac/edac_mc.c | 10 ++++++---- 1 files changed, 6 insertions(+), 4 deletions(-) diff --git a/drivers/edac/edac_mc.c b/drivers/edac/edac_mc.c index 6b21e25..6d2e34d 100644 --- a/drivers/edac/edac_mc.c +++ b/drivers/edac/edac_mc.c @@ -578,14 +578,16 @@ struct mem_ctl_info *edac_mc_del_mc(struct device *dev) return NULL; } - /* marking MCI offline */ - mci->op_state = OP_OFFLINE; - del_mc_from_global_list(mci); mutex_unlock(&mem_ctls_mutex); - /* flush workq processes and remove sysfs */ + /* flush workq processes */ edac_mc_workq_teardown(mci); + + /* marking MCI offline */ + mci->op_state = OP_OFFLINE; + + /* remove from sysfs */ edac_remove_sysfs_mci_device(mci); edac_printk(KERN_INFO, EDAC_MC, -- 1.7.3.1.50.g1e633 -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew ...
This patch fixed it. I have rebooted five times and it does not crash anymore. Thank you. Btw, are there any information available regarding the NMI option? parm: edac_op_state:EDAC Error Reporting state: 0=Poll,1=NMI (int) In edac.txt NMI is listed under FUTURE HARDWARE SCANNING. -Tobias --
Thanks for testing and taking pictures :). I'll send it to Linus before Well, looking at <arch/x86/kernel/traps.c:mem_parity_error()> this should already work. But it is kinda of a hack, if I'm reading Doug correctly: http://lkml.org/lkml/2010/9/21/144 And yes, using some kind of an interrupt is much better than polling but I don't know whether there's a single interrupt source for the error types all edac drivers can decode and report. In the amd64_edac case, we're piggybacking on MCE, for example. This was actually the initial reason for 00740c58541b6087d78418cebca1fcb86dc6077d and dropping polling from that driver. Doug, any additions? -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 --
This thread has many unknown abbreviations for me, I thought as you wrote later, that this would be an option to use an interrupt instead of polling for edac in general. But as far as I understand this, it is used for PCI bus errors? And should be replaced for detection of PCI SERR and/or PCIE AER, From Intel Architectures Software Developer's Manual Part 3A: "Starting with 45nm Intel 64 processor with CPUID signature DisplayFamily_DisplayModel encoding of 06H_1AH (...), the processor can report information on corrected machine-check errors and deliver a programmable interrupt for software to respond to MC errors, referred to as corrected machine-check error interrupt (CMCI)." Seems to be unlikely for me on an Intel 975X Mainboard. -Tobias --
Yep, reportedly, memory parity errors can be reported through an NMI. However, I don't know whether all chipset vendors still do that since Yep, this all pertains to PCI/PCIe error reporting but don't ask me about specifics. You might get some more info from http://lwn.net/Articles/193468/ - it is not in the current kernel sources though. Also, EDAC has a generic polling routine for PCI errors, Hmm, I don't know, let's ask the driver author :) Arvind? -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 --
On Thu, 2 Dec 2010 18:06:10 +0100 Yes. That should work. Once we stopped the workqueue and removed it from the global list, do we actually need to set it to OP_OFFLINE? Also 00740c585 did fix a hang in edac_mc.c... could this also happen in the edac_device_del_device/edac_pci_del_device functions? Regards, Flo --
I think yes, because we seem to protect ourselves in the actual edac_mc_workq_function() on exit, if we overlap the work items cancellation with the execution of the delayed work at the same time on a different cpu. Besides, it is a single assignment and it does cost us Nope, because there we don't check ->op_state when we cancel the work items in the respective _teardown() functions - we simply cancel them unconditionally. -- Regards/Gruss, Boris. Advanced Micro Devices GmbH Einsteinring 24, 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen Registergericht Muenchen, HRB Nr. 43632 --
On Thu, 2 Dec 2010 19:51:23 +0100 true. I wonder if the flush workqueue waits for the work-function But shouldn't we check ->op_state for those as well? Why don't we hang for those functions in similar cases as your original patch fixed? --
