Hi,
I have upgraded a system to kernel 2.6.24. After that, it failed to
boot with the usual message saying that the rootfs on device /dev/sda1
cannot be mounted (a RAID1 run by the controller below).
With 2.6.23.12, everything is working fine.
# lspci -v:
03:01.0 RAID bus controller: Intel Corporation RAID Controller
Subsystem: Intel Corporation Unknown device 01db
Flags: bus master, 66MHz, slow devsel, latency 64, IRQ 17
Memory at ddffc000 (32-bit, prefetchable) [size=16K]
[virtual] Expansion ROM at deef0000 [disabled] [size=32K]
Capabilities: [80] Power Management version 2
# GDT-related dmesg output (2.6.23.12):
GDT-HA: Storage RAID Controller Driver. Version: 3.05
ACPI: PCI Interrupt 0000:03:01.0[A] -> GSI 24 (level, low) -> IRQ 17
GDT-HA: Found 1 PCI Storage RAID Controllers
Configuring GDT-PCI HA at 3/1 IRQ 17
GDT-HA 0: Name: SRCU42L
scsi0 : SRCU42L
sd 0:0:0:0: [sda] 143299800 512-byte hardware sectors (73369 MB)
sd 0:0:0:0: [sda] Assuming Write Enabled
sd 0:0:0:0: [sda] Assuming drive cache: write through
sd 0:0:0:0: [sda] 143299800 512-byte hardware sectors (73369 MB)
sd 0:0:0:0: [sda] Assuming Write Enabled
sd 0:0:0:0: [sda] Assuming drive cache: write through
sda: sda1 sda2 < sda5 >
sd 0:0:0:0: [sda] Attached SCSI disk
# cat /boot/config-2.6.24 |grep GDT
CONFIG_SCSI_GDTH=y
Any ideas?
http://www.kernel.org/diff/diffview.cgi?file=%2Fpub%2Flinux%2Fkernel%2Fv2.6%2Fpatch-2.6.24.bz2
shows huge drivers/scsi/gdth* related changes.
Can't test at the moment. System went production.
Regards,
Sven
Hi Sven!

CCing the scsi mailing list.

Yes, the gdth driver went through open heart surgery in kernel 2.6.24. The bad
thing about it is that none of the three coders who did it had any HW to
work on. One of them is me. We cried out for testers for a long time, but
no one came forward.

Could you test patches for us? The first thing would be to enable debug
output with the patch below.

If you absolutely need a 2.6.24 kernel plus gdth in a production system, you
could check out the 2.6.23 driver and compile it. The old driver will work
the same in 2.6.24. It will not, however, even compile in 2.6.25-rcX.

If anyone wants to send me a card that uses the gdth driver, I will be very
happy to debug this card and return it once I'm done.

Boaz
---
git-diff --stat -p
drivers/scsi/gdth.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index c825239..eca72c4 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -188,6 +188,7 @@ static int __gdth_queuecommand(gdth_ha_str *ha, struct scsi_cmnd *scp,
struct gdth_cmndinfo *cmndinfo);
static void gdth_scsi_done(struct scsi_cmnd *scp);
+#define DEBUG_GDTH 1
#ifdef DEBUG_GDTH
static unchar DebugState = DEBUG_GDTH;
--
> Yes the gdth driver passed an open heart surgery in kernel 2.6.24. The b...
> ... came forward.

All I'd like to ask is WHY!? WHY such a big open heart surgery?

> ... put patch below.

The machine is still in production. It will be replaced by a 64bit system
with an AACRAID card some time in the future. Then, I could maybe test
patches on the old machine. But unfortunately, it's not my machine. It's
just administrated by me. And I don't know what exactly is the future ...

I see. Thanks for the hint. Would be an alternative. Actually, I don't
need 2.6.24 - but if something security related is fixed, then I'm not
able to move on to 2.6.24. And I'm using Gentoo. There's not really a ...

I don't think it's a separate card that I could send you. It's some
19" rack server with such a card on-board, I think.

So I don't see much opportunity for me to test patches and stuff :-(
Thanks,
Perhaps someone else then. Anyone with gdth HW that can test patches?

Your lspci said: "Intel Corporation RAID Controller"
Matthew, is there a gdth card lying around in an Intel lab near you?

James, do we need to mark gdth BROKEN for 2.6.24 and higher?

Boaz
--
Hi, Is bisecting down the existing chain and finding the bad commit I say revert it for the time being. --
Most certainly. If you are willing, please... I'm looking for someone
responsive.
first - enable debug prints (see the original post)
and send me the prints.
second - bisection could be great, yes.
It could be reverted for 2.6.24.x maintenance releases but for 2.6.25-xxx
it cannot as it will not compile, and the fix to that is what you see in code.
Boaz
--
<snip>
With the kind help of:
Joerg Dorchain: <joerg@dorchain.net>
Jon Chelton <jchelton@ffpglobal.com>
Stefan Priebe <s.priebe@allied-internet.ag>
who let me take up their machines, their effort and their time,
I hope I'm able to fix the gdth driver for the 2.6.24 kernel and forward.
Actually it was a simple miss by Christoph, but with my inexperience
it took me a bisection and a while to get it.
Both Joerg and Stefan were able to boot with these patches and work
on their machines. Jon Chelton's disk array should also work; we're testing.
Submitted 2 patches. They should also be included after some testing
into the 2.6.24.x stable releases. (Will be posted after some more testing)
[PATCH 1/2] gdth: scan for scsi devices
simple but most fatal.
[PATCH 2/2] gdth: bugfix for the Timer at exit crash
James please inspect and comment on this patch. It was not yet tested
by the original bug submitter.
In the original gdth series Christoph has forgotten to add the call to
scsi_scan_host(). Jeff's alternative patches did do the scan. After everything
was probed, the code would loop on all found cards and scan. However I like
to individually scan at each probe, because I think this way it is more ready
for the hot-plug API where the discovery is done outside of the driver, and the
probe is called on single host at a time. Is that right? please comment.
Test away.
Many thanks to Joerg, Jon && Stefan, Cheers.
Boaz
--
The patch: "gdth: switch to modern scsi host registration" missed one
simple fact when moving away from scsi_module.c. That is to call
scsi_scan_host() on the probed host.

With this the gdth driver from 2.6.24 is again able to see drives and
boot.

Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
Tested-by: Joerg Dorchain <joerg@dorchain.net>
Tested-by: Stefan Priebe <s.priebe@allied-internet.ag>
Tested-by: Jon Chelton <jchelton@ffpglobal.com>
---
drivers/scsi/gdth.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index b253b8c..8eb78be 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -4838,6 +4838,9 @@ static int __init gdth_isa_probe_one(ulong32 isa_bios)
if (error)
goto out_free_coal_stat;
list_add_tail(&ha->list, &gdth_instances);
+
+ scsi_scan_host(shp);
+
return 0;
out_free_coal_stat:
@@ -4965,6 +4968,9 @@ static int __init gdth_eisa_probe_one(ushort eisa_slot)
if (error)
goto out_free_coal_stat;
list_add_tail(&ha->list, &gdth_instances);
+
+ scsi_scan_host(shp);
+
return 0;
out_free_ccb_phys:
@@ -5102,6 +5108,9 @@ static int __init gdth_pci_probe_one(gdth_pci_str *pcistr, int ctr)
if (error)
goto out_free_coal_stat;
list_add_tail(&ha->list, &gdth_instances);
+
+ scsi_scan_host(shp);
+
return 0;
out_free_coal_stat:
--
1.5.3.3
--
Doh, someone please hand me a brown paper bag. My first series of patches had this but it got dropped when I rebased it over various janitor cleanups. The patch looks obviously correct, thanks. --
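For readers following along, here is a rough userspace sketch of why the missing call was fatal. It is a toy model and not kernel code: every toy_* name is made up for illustration, and it only assumes what the patch description says, namely that registering a host does not by itself discover the disks behind it, while a per-host scan does.

```c
/* Userspace toy model, not kernel code. Every toy_* name is hypothetical. */
#include <assert.h>
#include <stddef.h>

struct toy_host {
    int registered;     /* set once the host is known to the midlayer */
    int attached_devs;  /* disks physically behind the controller */
    int visible_devs;   /* disks the upper layers have discovered */
};

/* Registering the host alone discovers nothing. */
static void toy_add_host(struct toy_host *h)
{
    assert(h != NULL);
    h->registered = 1;
}

/* The scan is the step that makes the attached disks visible. */
static void toy_scan_host(struct toy_host *h)
{
    assert(h != NULL);
    if (h->registered)
        h->visible_devs = h->attached_devs;
}

/* Probe as in the regression: host registered, scan forgotten. */
static int toy_probe_without_scan(struct toy_host *h)
{
    toy_add_host(h);
    return h->visible_devs;
}

/* Probe as in the fix: scan right after registering, one host at a time. */
static int toy_probe_with_scan(struct toy_host *h)
{
    toy_add_host(h);
    toy_scan_host(h);
    return h->visible_devs;
}
```

With one disk attached, the first probe reports no visible disks (so no /dev/sda1 to mount as root), while the second reports one.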
gdth_exit would first remove all cards then stop the timer
and would not sync with the timer function. This caused a crash
in gdth_timer() when module was unloaded.
del_timer_sync the timer before we delete the cards.
NOT YET TESTED
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
drivers/scsi/gdth.c | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 8eb78be..103280e 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -3793,6 +3793,8 @@ static void gdth_timeout(ulong data)
gdth_ha_str *ha;
ulong flags;
+ BUG_ON(list_empty(&gdth_instances));
+
ha = list_first_entry(&gdth_instances, gdth_ha_str, list);
spin_lock_irqsave(&ha->smp_lock, flags);
@@ -5146,8 +5148,6 @@ static void gdth_remove_one(gdth_ha_str *ha)
ha->sdev = NULL;
}
- gdth_flush(ha);
-
if (shp->irq)
free_irq(shp->irq,ha);
@@ -5245,14 +5245,15 @@ static void __exit gdth_exit(void)
{
gdth_ha_str *ha;
- list_for_each_entry(ha, &gdth_instances, list)
- gdth_remove_one(ha);
+ unregister_chrdev(major,"gdth");
+ unregister_reboot_notifier(&gdth_notifier);
#ifdef GDTH_STATISTICS
- del_timer(&gdth_timer);
+ del_timer_sync(&gdth_timer);
#endif
- unregister_chrdev(major,"gdth");
- unregister_reboot_notifier(&gdth_notifier);
+
+ list_for_each_entry(ha, &gdth_instances, list)
+ gdth_remove_one(ha);
}
module_init(gdth_init);
--
1.5.3.3
--
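The ordering problem this patch addresses can be sketched in userspace roughly as follows. This is a single-threaded toy model with hypothetical toy_* names, not the driver code: the "tick" is simulated by an explicit call inside the window, and the model does not capture the waiting part of del_timer_sync (which also waits for a handler that is already running).

```c
/* Userspace toy model, not kernel code. Every toy_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_state {
    int  cards;        /* entries on the instance list */
    bool timer_armed;  /* true while the periodic timer may still fire */
    bool crashed;      /* set if the timer ran against an empty list */
};

/* Timer callback: it uses the first card, so it needs at least one. */
static void toy_timer_fire(struct toy_state *s)
{
    assert(s != NULL);
    if (!s->timer_armed)
        return;
    if (s->cards == 0)
        s->crashed = true;  /* the real driver would oops here */
}

static void toy_remove_all_cards(struct toy_state *s) { s->cards = 0; }
static void toy_stop_timer(struct toy_state *s) { s->timer_armed = false; }

/* Old order: cards are removed first, the timer can fire in between. */
static bool toy_exit_old_order(struct toy_state *s)
{
    toy_remove_all_cards(s);
    toy_timer_fire(s);      /* simulated tick inside the window */
    toy_stop_timer(s);
    return s->crashed;
}

/* Fixed order: stop the timer first, then remove the cards. */
static bool toy_exit_fixed_order(struct toy_state *s)
{
    toy_stop_timer(s);
    toy_timer_fire(s);      /* same tick, now a no-op */
    toy_remove_all_cards(s);
    return s->crashed;
}
```

Starting from one card and an armed timer, the old order ends with crashed set and the fixed order does not.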
Hello! I've tested this patch now - and it works fine. Now rmmod, halt and reboot also works. Stefan Priebe --
This is great news, Stefan. Thank you very much for all your time and effort; without it we could not have fixed all this. Boaz --
Do you have a git tree with the latest pieces? --
No, scsi-misc I guess ;)

I could put it here:
git://git.bhalevy.com/open-osd gdth branch
Give me an hour.

Boaz
--
James, please put this patch in rc-fixes also. It has now been tested by a few
people, and it solves a reproducible problem in the unloading of the driver.

It has not yet been confirmed by Andrew's reporter with the:
+ if (list_empty(&gdth_instances))
+ return;
at gdth_timer() in the -mm tree. In my patch I have converted the if() to a
BUG_ON because now it should not happen. But I figure it is not worse than
what there is now, which is nothing.

With your recommendation I will push both patches to the stable branches.
People have emailed me requesting it.

Thanks
Boaz
--
This piece doesn't look right. gdth_flush() forces the internal cache to disk backing. If you remove it, you're taking the chance that the machine will be powered off without a writeback which can cause data corruption. James --
Yes. I have more problems reported, with exit, and am just sending one more patch that puts this back in. Which was tested. So I will resend this one plus one new one. Boaz --
The gdth driver would do a register_reboot_notifier(&gdth_notifier);
to a gdth_halt() function, which would then redo half of what gdth_exit
does, and wrongly so, and crash.

Are we guaranteed in today's kernel that a module's .exit function will be
called on a halt or reboot? If so then there is no need for duplication
and gdth_halt() should go.

Submitted a patch that replaces the previous one I submitted with a deeper fix.
[PATCH] gdth: bugfix for the at-exit problems

If you ask me, this whole gdth_flush() is a crackup. sd and scsi-ml are doing
SCSI FLUSH commands whenever needed. The controller has no business caching
data in memory longer than what is stated in the standard. RAID controller or
no RAID controller. Virtual or not virtual device. Data on plate means data
on plate. What if there is a power outage? What can the driver do then?

Boaz
--
gdth_exit would first remove all cards then stop the timer
and would not sync with the timer function. This caused a crash
in gdth_timer() when module was unloaded.
So del_timer_sync the timer before we delete the cards.
also a reboot notifier function was registered but is
not needed anymore. And would crash. So remove gdth_halt
and the reboot notifier registration.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
drivers/scsi/gdth.c | 64 +++++++-------------------------------------------
1 files changed, 9 insertions(+), 55 deletions(-)
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 8eb78be..8469fe6 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -183,7 +183,6 @@ static int gdth_ioctl(struct inode *inode, struct file *filep,
unsigned int cmd, unsigned long arg);
static void gdth_flush(gdth_ha_str *ha);
-static int gdth_halt(struct notifier_block *nb, ulong event, void *buf);
static int gdth_queuecommand(Scsi_Cmnd *scp,void (*done)(Scsi_Cmnd *));
static int __gdth_queuecommand(gdth_ha_str *ha, struct scsi_cmnd *scp,
struct gdth_cmndinfo *cmndinfo);
@@ -418,12 +417,6 @@ static inline void gdth_set_sglist(struct scsi_cmnd *cmd,
#include "gdth_proc.h"
#include "gdth_proc.c"
-/* notifier block to get a notify on system shutdown/halt/reboot */
-static struct notifier_block gdth_notifier = {
- gdth_halt, NULL, 0
-};
-static int notifier_disabled = 0;
-
static gdth_ha_str *gdth_find_ha(int hanum)
{
gdth_ha_str *ha;
@@ -3793,6 +3786,8 @@ static void gdth_timeout(ulong data)
gdth_ha_str *ha;
ulong flags;
+ BUG_ON(list_empty(&gdth_instances));
+
ha = list_first_entry(&gdth_instances, gdth_ha_str, list);
spin_lock_irqsave(&ha->smp_lock, flags);
@@ -4668,45 +4663,6 @@ static void gdth_flush(gdth_ha_str *ha)
}
}
-/* shutdown routine */
-static int gdth_halt(struct notifier_block *nb, ulong event, void *buf)
-{
- gdth_ha_str *ha;
-#ifndef ...

No. The __exit section is actually discardable if you promise never to remove the module.

James
--
I don't understand, please explain. What does a driver need to do if it needs a consistent shutdown routine? --
It needs to register a reboot notifier, which gdth does. However, the notifier is only called on reboot, so it also needs to clean up correctly on module exit as well. The alternative for GDTH would be to process the SCSI SYNCHRONIZE CACHE command. That's done by a shutdown notifier from sd, so the correct thing would always get done; however it does mean the driver has to be in a condition to process the last sync cache command. For the quick fix, just keep the current infrastructure and put back the gdth_flush() command where it can be effective. James --
Totally untested.
---
From: Boaz Harrosh <bharrosh@panasas.com>
Subject: [PATCH] gdth: bugfix for the at-exit problems
gdth_exit would first remove all cards then stop the timer
and would not sync with the timer function. This caused a crash
in gdth_timer() when module was unloaded.
So del_timer_sync the timer before we delete the cards.
also the reboot notifier function would crash. So unify
the exit and halt functions with a gdth_shutdown() that's
called by both.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
drivers/scsi/gdth.c | 99 ++++++++++++++++++++------------------------------
1 files changed, 40 insertions(+), 59 deletions(-)
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 8eb78be..7bb9b45 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -183,7 +183,6 @@ static int gdth_ioctl(struct inode *inode, struct file *filep,
unsigned int cmd, unsigned long arg);
static void gdth_flush(gdth_ha_str *ha);
-static int gdth_halt(struct notifier_block *nb, ulong event, void *buf);
static int gdth_queuecommand(Scsi_Cmnd *scp,void (*done)(Scsi_Cmnd *));
static int __gdth_queuecommand(gdth_ha_str *ha, struct scsi_cmnd *scp,
struct gdth_cmndinfo *cmndinfo);
@@ -418,12 +417,6 @@ static inline void gdth_set_sglist(struct scsi_cmnd *cmd,
#include "gdth_proc.h"
#include "gdth_proc.c"
-/* notifier block to get a notify on system shutdown/halt/reboot */
-static struct notifier_block gdth_notifier = {
- gdth_halt, NULL, 0
-};
-static int notifier_disabled = 0;
-
static gdth_ha_str *gdth_find_ha(int hanum)
{
gdth_ha_str *ha;
@@ -3793,6 +3786,8 @@ static void gdth_timeout(ulong data)
gdth_ha_str *ha;
ulong flags;
+ BUG_ON(list_empty(&gdth_instances));
+
ha = list_first_entry(&gdth_instances, gdth_ha_str, list);
spin_lock_irqsave(&ha->smp_lock, flags);
@@ -4668,45 +4663,6 @@ static void gdth_flush(gdth_ha_str *ha)
}
}
-/* shutdown routine */
-static ...

Static police alert! Just make it static and move it into

I'm not sure you can do this, aren't reboot notifiers called with the rwsem held? In which case the unregister which also takes the rwsem will hang the system.

James
--
On Wed, Feb 13 2008 at 19:36 +0200, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:

Right, forgot the static. But I use it in gdth_init(), so it must be external. Unless you promise me that gdth_init() will never ever be called after a call to shutdown. Anyway, the hot-plug patch changes all that. This is only

Hmm, can't remove a notifier from within the notifier. Thanks James for the catch, it's what happens when you don't test your own patches.

Will send a new version for review. Please note that this is a bugfix patch on top of 2.6.24. It is not needed for Jeff's hot-plug path. There will be one more bugfix patch for a crash at the user-mode ioctl code.

Boaz
--
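The hang James describes can be illustrated with a small userspace sketch. It is only a model under the assumption stated in his mail, that the notifier chain is walked with its lock held and that unregistering needs the same lock. All toy_* names are hypothetical, and instead of really blocking, the model records that a real lock would never be granted.

```c
/* Userspace toy model, not kernel code. Every toy_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_chain {
    bool locked;          /* stands in for the chain's lock */
    int  registered;      /* number of registered handlers (0 or 1 here) */
    bool would_deadlock;  /* set where a real lock would block forever */
};

typedef void (*toy_handler)(struct toy_chain *);

/* Unregistering needs the chain lock. */
static void toy_unregister(struct toy_chain *ch)
{
    assert(ch != NULL);
    if (ch->locked) {
        ch->would_deadlock = true;  /* lock is held by our own caller */
        return;
    }
    ch->locked = true;
    ch->registered = 0;
    ch->locked = false;
}

/* Walking the chain holds the lock while the handler runs. */
static void toy_call_chain(struct toy_chain *ch, toy_handler cb)
{
    assert(ch != NULL);
    ch->locked = true;
    if (ch->registered)
        cb(ch);
    ch->locked = false;
}

/* Handler that tries to unregister itself from inside the chain walk. */
static void toy_self_removing_handler(struct toy_chain *ch)
{
    toy_unregister(ch);
}

/* Handler that only does its work and leaves registration alone. */
static void toy_plain_handler(struct toy_chain *ch)
{
    (void)ch;
}
```

Running the self-removing handler through toy_call_chain() sets would_deadlock, whereas unregistering after the walk has returned succeeds.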
This is a bugfix for the 2.6.24.x stable releases.
gdth_exit would first remove all cards then stop the timer
and would not sync with the timer function. This caused a crash
in gdth_timer() when module was unloaded.
So del_timer_sync the timer before we delete the cards.
also the reboot notifier function would crash. So unify
the exit and halt functions with a gdth_shutdown() that's
called by both.
Signed-off-by: Boaz Harrosh <bharrosh@panasas.com>
---
drivers/scsi/gdth.c | 90 ++++++++++++++++++++------------------------------
1 files changed, 36 insertions(+), 54 deletions(-)
diff --git a/drivers/scsi/gdth.c b/drivers/scsi/gdth.c
index 8eb78be..3828b23 100644
--- a/drivers/scsi/gdth.c
+++ b/drivers/scsi/gdth.c
@@ -183,7 +183,6 @@ static int gdth_ioctl(struct inode *inode, struct file *filep,
unsigned int cmd, unsigned long arg);
static void gdth_flush(gdth_ha_str *ha);
-static int gdth_halt(struct notifier_block *nb, ulong event, void *buf);
static int gdth_queuecommand(Scsi_Cmnd *scp,void (*done)(Scsi_Cmnd *));
static int __gdth_queuecommand(gdth_ha_str *ha, struct scsi_cmnd *scp,
struct gdth_cmndinfo *cmndinfo);
@@ -418,12 +417,6 @@ static inline void gdth_set_sglist(struct scsi_cmnd *cmd,
#include "gdth_proc.h"
#include "gdth_proc.c"
-/* notifier block to get a notify on system shutdown/halt/reboot */
-static struct notifier_block gdth_notifier = {
- gdth_halt, NULL, 0
-};
-static int notifier_disabled = 0;
-
static gdth_ha_str *gdth_find_ha(int hanum)
{
gdth_ha_str *ha;
@@ -3793,6 +3786,8 @@ static void gdth_timeout(ulong data)
gdth_ha_str *ha;
ulong flags;
+ BUG_ON(list_empty(&gdth_instances));
+
ha = list_first_entry(&gdth_instances, gdth_ha_str, list);
spin_lock_irqsave(&ha->smp_lock, flags);
@@ -4668,45 +4663,6 @@ static void gdth_flush(gdth_ha_str *ha)
}
}
-/* shutdown routine */
-static int gdth_halt(struct notifier_block *nb, ulong event, void *buf)
-{
- ...

The patch looks fine now, thanks. Can we actually get a tester just to make sure there's nothing I missed?

James
--
Yes, and the tester reported a breakage. We are on it.

Apparently, you cannot do a full deallocation of resources at the reboot notifier, nor would you want to, I guess. But you can do the flush. The exit call is never called on a reboot and the card access is valid to the end. Please comment?

So I pretty much reverted that patch, but did leave some cleanups. Also we found the other problems reported with user-mode tools and cat /proc/sys/gdth/0, so 2 patches are on the way, with the above reverted. Give us a few hours to test everything.

Thanks
Boaz
--
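The split described in this mail (flush on both paths, full teardown only on module exit) could be sketched in userspace like this. It is a toy model with hypothetical toy_* names, not the actual gdth code, and it assumes only what is said above: the reboot notifier may flush but should not free resources, and the exit path is not run on reboot.

```c
/* Userspace toy model, not kernel code. Every toy_* name is hypothetical. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_card {
    int  dirty_blocks;    /* data still sitting in the controller cache */
    bool resources_held;  /* IRQ, memory and so on still allocated */
};

/* Shared helper: write the cache back. Harmless if called twice. */
static void toy_flush(struct toy_card *c)
{
    assert(c != NULL);
    c->dirty_blocks = 0;
}

/* Reboot or halt path: flush only, the card stays usable to the end. */
static void toy_on_reboot(struct toy_card *c)
{
    toy_flush(c);
}

/* Module exit path: flush first, then release everything. */
static void toy_on_exit(struct toy_card *c)
{
    toy_flush(c);
    c->resources_held = false;
}
```

Either path leaves no dirty blocks behind; only the exit path gives up the resources.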
Why would we think that the controller does not support this command

Why would it not be ready? What do other drivers do? The driver is ready until the very last module's .exit. Is that good

Just did. But if needed I would prefer to emulate the SCSI SYNCHRONIZE CACHE

Boaz
--
Um, because the controller isn't a SCSI device. It's an emulated device which means the SCSI commands are processed in the driver. It does look like the driver<->HBA communication is some sort of translated dialect

shutdown is called as part of device removal and module unload ... usually from scsi_remove_host(). So you can't tear down command

I think such a change, though desirable, would be too large to count as a bug fix.

James
--
Well, for crappy legacy driver that's the way, but it's not really recommended. As soon as a driver uses the proper driver models, e.g. gdth for pci using Jeff's pci hotplug patches it can just implement the ->shutdown method that is called before shutdown/kexec
