the patch below fixes a bootup-crash bug merged via today's SCSI git
merge:
commit df3d80f5a5c74168be42788364d13cf6c83c7b9c
Merge: 3d06f7a... c8e91b0...
Author: Linus Torvalds <torvalds@woody.linux-foundation.org>
Date: Mon Oct 15 08:19:33 2007 -0700
Merge master.kernel.org:/pub/scm/linux/kernel/git/jejb/scsi-misc-2.6
--------------------->
Subject: scsi: fix crash in gdth_timeout()
From: Ingo Molnar <mingo@elte.hu>
fix the following bootup crash in gdth_timeout():
[ 40.739625] BUG: spinlock bad magic on CPU#1, swapper/1
[ 40.744822] lock: 788a6e40, .magic: 784d8cec, .owner: <none>/-1, .owner_cpu: 2018346022
[ 40.752881] [<78104f2c>] show_trace_log_lvl+0x1a/0x2f
[ 40.757994] [<781058d1>] show_trace+0x12/0x14
[ 40.762414] [<781058e9>] dump_stack+0x16/0x18
[ 40.766833] [<782f1683>] spin_bug+0xa1/0xa9
[ 40.771079] [<782f17b0>] _raw_spin_lock+0x1e/0xe4
[ 40.775846] [<78643f9f>] _spin_lock_irqsave+0x53/0x64
[ 40.781461] [<784b224a>] gdth_timeout+0x1c/0xa6
[ 40.786054] [<7812632f>] run_timer_softirq+0x142/0x1a4
[ 40.791254] [<78123fb8>] __do_softirq+0x7b/0xf1
[ 40.795847] [<78105fb8>] do_softirq+0x61/0xc7
[ 40.800267] =======================
the bug is that list_first_entry() assumes a non-empty list.
A further problem is probably that the GDTH timer is not stopped by a
failed GDTH probe?
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
drivers/scsi/gdth.c | 3 +++
1 file changed, 3 insertions(+)
Index: linux/drivers/scsi/gdth.c
===================================================================
--- linux.orig/drivers/scsi/gdth.c
+++ linux/drivers/scsi/gdth.c
@@ -3793,6 +3793,9 @@ static void gdth_timeout(ulong data)
gdth_ha_str *ha;
ulong flags;
+ if (list_empty(&gdth_instances))
+ return;
+
ha = list_first_entry(&gdth_instances, gdth_ha_str, list);
spin_lock_irqsave(&ha->smp_lock, flags);
-
Indeed. Maybe this is a better fix? That driver is pretty messy, and this should have been found ealier. James? Boaz? Linus --- drivers/scsi/gdth.c | 4 ++++ 1 files changed, 4 insertions(+), 0 deletions(-) diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c index e8010a7..3ac080e 100644 --- a/drivers/scsi/gdth.c +++ b/drivers/scsi/gdth.c @@ -5213,6 +5213,10 @@ static int __init gdth_init(void) #endif /* CONFIG_PCI */ TRACE2(("gdth_detect() %d controller detected\n", gdth_ctr_count)); + + if (list_empty(&gdth_instances)) + return -ENODEV; + #ifdef GDTH_STATISTICS TRACE2(("gdth_detect(): Initializing timer !\n")); init_timer(&gdth_timer); -
FWIW, the gdth driver was "super-messy". With this latest SCSI pull, that severity has been successfully downgraded to "messy" :) IMO some easy-to-fix breakage was inevitable with such a large volume of fundamental changes. Honestly, the driver is probably rarely run by people that lack the hardware, I bet... Jeff -
It was all "flight by instruments only". I called for HW testers and none came forward. All these changes, apart from "successful downgrade to messy" where also needed in order to push important changes to scsi. But a little bird said that QEMU might simulate this HW. SO I guess it is QEMU time for me. Boaz -
heh. Incidentally i was thinking about using KVM for automated testing. Important pieces of hardware should get an in-KVM simulator/emulator, that way developers who do not own that hardware can do functionality testing too. So basically the highest-quality drivers would have an "inverse driver" in KVM, which simulates the hardware. (that model is evidently useful to the hardware maker even for new hardware: it can then also be used to test the Linux compatibility and Linux performance of future planned releases of the hardware, etc.) Ingo -
Using emulators to test device drivers is almost certain to be pointless. The problem with device drivers tends to be timing issues, odd hardware interactions, and lots of strange (and sometimes undocumented) behaviour and dependencies (eg things like "you have to wait 50us after setting the reset bit until the hardware has actually reset"). These are all things that you'd generally not catch in emulation - because the emulation by necessity is only going to be a very weak picture of the real thing. So I suspect you can find the easy stuff, but only by writing insanely complex device model descriptions in the emulator environment itself, and only for those device models that have actually been written. Can it be donein theory? Sure. Practically? Not so sure. Would it help? I suspect the effort to do the device model would be many times bigger than *any* conceivable effort to just fix the driver bugs as they get found through other means. So we could perhaps have *really* good emulated hardware for a few models of hw out there, but likely they'd be fewer and less varied platforms than most kernel developers end up having hidden under their desk anyway.. Linus -
On Mon, 15 Oct 2007 12:38:06 -0700 (PDT) For some things. I do it a bit because you can use it to fake failures that are tricky to do in the real world. It won't tell you the driver works but its suprisingly good for testing for races (forcing IRQ delivery at specific points), buggy hardware you don't posess, and things like media failures and timeouts your real hardware refuses to do. Alan -
Heh. I do agree that you likely find bugs, even if quite often it's exactly because the behaviour is something that will never happen on real hardware. But failure testing is very useful - I forget who it was who debugged some driver by taking a CD and just scrathing it mercilessly to induce read errors ;) Having a really *bad* HW emulator can certainly work that way too, even if it also would probably end up hitting just a few of the potential error paths.. Linus -
something like that wont enable 100% coverage (or even reasonable coverage for most hardware), so it's no replacement for actual hard testing, but it could push out the domain of minimally tested code quite a bit and increase the quality of the kernel. Races are always tough and so are bugs on the side of the hardware, but it's the silly boot-time crash showstoppers and "device does not work anymore" mistakes that causes us to lose most of the testers and early adopters. I'm not really worried about the 1% of bugs that are tough to find and fix (because they are actually fun to find and fix), i'm worried about the 99% easy and boring bugs - because they annoy users just as much as the exciting bugs do. If we fix them faster then there's more time (and more tester stamina) left for the harder to find bugs. so i think adding redundancy in form of a simplified hw emulator can certainly not hurt and fundamentally increases robustness - and it will definitely reduce the chance for a whole host of stupid bugs that are not in the hardware but are in the ~4 million lines of Linux driver codebase. Limits and scalability would also be testable: "if i put 32 of these networking cards into a Linux box, will the Linux driver blow up". not that i think this is realistic for any significant portion of the hardware currently - unless some hw maker starts doing it. But KVM will have a good portion of the core PC platform emulated (APIC, etc.) - and that's a nice step forward already. Ingo -
indeed - with this patch instead of the one i did the CONFIG_SCSI_GDTH=y bzImage boots up fine too, and no crash: Calling initcall 0x80a417e0: gdth_init+0x0/0xb20() GDT-HA: Storage RAID Controller Driver. Version: 3.05 GDT-HA: Found 0 PCI Storage RAID Controllers initcall 0x80a417e0: gdth_init+0x0/0xb20() returned -19. initcall 0x80a417e0 ran for 5 msecs: gdth_init+0x0/0xb20() Tested-by: Ingo Molnar <mingo@elte.hu> Ingo -
It was: http://marc.info/?t=119238793200002 and that patch was my best guess for fixing this as well ... do we have someone with actual hardware to confirm it yet? (apparently QEMU uses gdth as some type of emulated controller). James -
