Hi,
I have a newish machine - maybe 3 months old - which unreliably finds its disk drives at each boot. Probably 40% of the time when booting, 1 or more drives will be missing. So far I've been unable to determine what might be wrong. Sometimes it finds all 5 drives, sometimes only 4 or 3 drives. It's not always the same drives that are missing.
Shown below are two successive warm boots. The machine had been running with all 5 drives working. The first time through it missed both /dev/sdd and /dev/sde while the very next time it found all 5 drives. Each of the 5 drives has been missing one or more times at different boots. (I.e. - it's not always drive sdd for instance.)
Every time I boot the AMI BIOS screen says all 5 drives are there. I have found that if I drop into BIOS before going to grub that selecting each drive one at a time and reading through its setup seems to make the boot more reliable, but not 100%. Probably 80% reliable.
I don't know what info is required to look at this. I'm running the newest Gentoo kernel but it's been happening with all kernels I've tried since I built the machine. I'm attaching the kernel config as well as dmesg from the last boot which had all the drives. The only difference in dmesg when the drives don't show up (that I've spotted) is that the missing drive just isn't in dmesg. (I.e. no error messages that I could spot.)
Let me know what I might try or what other info you might want.
The motherboard is an Asus Rampage II Extreme with an i7-980x 6 core/12 thread processor. sda, sdb & sdc are part of a RAID1, sdd & sde are part of a RAID0.
Thanks,
Mark
mark@c2stable ~ $ uname -a
Linux c2stable 2.6.34-gentoo-r1 #1 SMP PREEMPT Fri Jul 2 10:04:52 PDT 2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel GNU/Linux
mark@c2stable ~ $
mark@c2stable ~ $ ls -la /dev/sd*
brw-rw---- 1 root disk 8, 0 Jul 2 2010 /dev/sda
brw-rw---- 1 root disk 8, 1 Jul 2 2010 /dev/sda1
brw-rw---- 1 root disk 8, 2 Jul 2 ...
<SNIP> As a follow-up here's dmesg on a boot where neither sdd or sde showed up. - Mark mark@c2stable ~ $ ls -al /dev/sd* brw-rw---- 1 root disk 8, 0 Jul 3 2010 /dev/sda brw-rw---- 1 root disk 8, 1 Jul 3 2010 /dev/sda1 brw-rw---- 1 root disk 8, 2 Jul 3 2010 /dev/sda2 brw-rw---- 1 root disk 8, 3 Jul 3 2010 /dev/sda3 brw-rw---- 1 root disk 8, 4 Jul 3 2010 /dev/sda4 brw-rw---- 1 root disk 8, 5 Jul 3 2010 /dev/sda5 brw-rw---- 1 root disk 8, 6 Jul 3 2010 /dev/sda6 brw-rw---- 1 root disk 8, 16 Jul 3 2010 /dev/sdb brw-rw---- 1 root disk 8, 17 Jul 3 2010 /dev/sdb1 brw-rw---- 1 root disk 8, 18 Jul 3 2010 /dev/sdb2 brw-rw---- 1 root disk 8, 19 Jul 3 2010 /dev/sdb3 brw-rw---- 1 root disk 8, 20 Jul 3 2010 /dev/sdb4 brw-rw---- 1 root disk 8, 21 Jul 3 2010 /dev/sdb5 brw-rw---- 1 root disk 8, 22 Jul 3 2010 /dev/sdb6 brw-rw---- 1 root disk 8, 32 Jul 3 2010 /dev/sdc brw-rw---- 1 root disk 8, 33 Jul 3 2010 /dev/sdc1 brw-rw---- 1 root disk 8, 34 Jul 3 2010 /dev/sdc2 brw-rw---- 1 root disk 8, 35 Jul 3 2010 /dev/sdc3 brw-rw---- 1 root disk 8, 36 Jul 3 2010 /dev/sdc4 brw-rw---- 1 root disk 8, 37 Jul 3 2010 /dev/sdc5 brw-rw---- 1 root disk 8, 38 Jul 3 2010 /dev/sdc6 mark@c2stable ~ $ dmesg Linux version 2.6.34-gentoo-r1 (root@c2stable) (gcc version 4.4.3 (Gentoo 4.4.3-r2 p1.2) ) #1 SMP PREEMPT Fri Jul 2 10:04:52 PDT 2010 Command line: root=/dev/md5 BIOS-provided physical RAM map: BIOS-e820: 0000000000000000 - 000000000009fc00 (usable) BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved) BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved) BIOS-e820: 0000000000100000 - 00000000bf780000 (usable) BIOS-e820: 00000000bf780000 - 00000000bf798000 (ACPI data) BIOS-e820: 00000000bf798000 - 00000000bf7dc000 (ACPI NVS) BIOS-e820: 00000000bf7dc000 - 00000000c0000000 (reserved) BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved) BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved) BIOS-e820: 0000000100000000 - ...
(cc'ing linux-ide) Can you please *attach* full logs of a successful boot and several failing boots? Thanks. -- tejun --
OK, I enabled printk timing. Here are two boots. The first one had /dev/sde come up missing:
mark@c2stable ~ $ ls -al /dev/sd*
brw-rw---- 1 root disk 8, 0 Jul 3 2010 /dev/sda
brw-rw---- 1 root disk 8, 1 Jul 3 2010 /dev/sda1
brw-rw---- 1 root disk 8, 2 Jul 3 2010 /dev/sda2
brw-rw---- 1 root disk 8, 3 Jul 3 2010 /dev/sda3
brw-rw---- 1 root disk 8, 4 Jul 3 2010 /dev/sda4
brw-rw---- 1 root disk 8, 5 Jul 3 2010 /dev/sda5
brw-rw---- 1 root disk 8, 6 Jul 3 2010 /dev/sda6
brw-rw---- 1 root disk 8, 16 Jul 3 2010 /dev/sdb
brw-rw---- 1 root disk 8, 17 Jul 3 2010 /dev/sdb1
brw-rw---- 1 root disk 8, 18 Jul 3 2010 /dev/sdb2
brw-rw---- 1 root disk 8, 19 Jul 3 2010 /dev/sdb3
brw-rw---- 1 root disk 8, 20 Jul 3 2010 /dev/sdb4
brw-rw---- 1 root disk 8, 21 Jul 3 2010 /dev/sdb5
brw-rw---- 1 root disk 8, 22 Jul 3 2010 /dev/sdb6
brw-rw---- 1 root disk 8, 32 Jul 3 2010 /dev/sdc
brw-rw---- 1 root disk 8, 33 Jul 3 2010 /dev/sdc1
brw-rw---- 1 root disk 8, 34 Jul 3 2010 /dev/sdc2
brw-rw---- 1 root disk 8, 35 Jul 3 2010 /dev/sdc3
brw-rw---- 1 root disk 8, 36 Jul 3 2010 /dev/sdc4
brw-rw---- 1 root disk 8, 37 Jul 3 2010 /dev/sdc5
brw-rw---- 1 root disk 8, 38 Jul 3 2010 /dev/sdc6
brw-rw---- 1 root disk 8, 48 Jul 3 2010 /dev/sdd
brw-rw---- 1 root disk 8, 49 Jul 3 2010 /dev/sdd1
mark@c2stable ~ $
I then did two warm boots and got the same problem so I shut down completely and did a cold boot which worked:
mark@c2stable ~ $ ls -al /dev/sd*
brw-rw---- 1 root disk 8, 0 Jul 3 2010 /dev/sda
brw-rw---- 1 root disk 8, 1 Jul 3 2010 /dev/sda1
brw-rw---- 1 root disk 8, 2 Jul 3 2010 /dev/sda2
brw-rw---- 1 root disk 8, 3 Jul 3 2010 /dev/sda3
brw-rw---- 1 root disk 8, 4 Jul 3 2010 /dev/sda4
brw-rw---- 1 root disk 8, 5 Jul 3 2010 /dev/sda5
brw-rw---- 1 root disk 8, 6 Jul 3 2010 /dev/sda6
brw-rw---- 1 root disk 8, 16 Jul 3 2010 /dev/sdb
brw-rw---- 1 root disk 8, 17 Jul 3 2010 /dev/sdb1
brw-rw---- 1 root disk 8, 18 Jul 3 2010 ...
Can you please apply the attached patch, reproduce the problem and post the kernel log? Thanks. -- tejun
I'm sorry. What am I patching? I'm not a kernel developer - not even a programmer - so I'll need some help with this. What's the command I should use? c2stable src # ls -la /usr/src/ total 32 drwxr-xr-x 8 root root 4096 Jul 2 09:56 . drwxr-xr-x 14 root root 4096 Apr 15 07:46 .. -rw-r--r-- 1 root root 0 Mar 24 18:37 .keep lrwxrwxrwx 1 root root 22 Jul 2 09:56 linux -> linux-2.6.34-gentoo-r1 drwxr-xr-x 23 root root 4096 Jun 16 07:23 linux-2.6.32-gentoo-r7 drwxr-xr-x 24 root root 4096 Jun 16 08:42 linux-2.6.34-gentoo drwxr-xr-x 24 root root 4096 Jul 3 15:30 linux-2.6.34-gentoo-r1 drwxr-xr-x 21 root root 4096 Jun 27 13:12 linux-2.6.34-rc3 drwxr-xr-x 20 root root 4096 Jun 15 08:05 linux-2.6.34-rc5 drwxr-xr-x 20 root root 4096 Jun 27 13:13 linux-2.6.35-rc3 c2stable src # Thanks, Mark --
Hello, Hmm... $ cd /usr/src/linux && patch -p1 < PATCH_FILE should do it. You know how to build and install the compiled kernel, right? Thanks. -- tejun --
OK - thanks. The patch seemed to install correctly. I then did
make clean
make && make modules_install
and then a Gentoo command:
modules-rebuild -X rebuild
to pick up any package modules that need to be rebuilt when I use a new kernel. (X drivers, vmware, etc.)
The kernel boots fine:
mark@c2stable ~ $ uname -a
Linux c2stable 2.6.34-gentoo-r1 #4 SMP PREEMPT Tue Jul 6 10:35:15 PDT 2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel GNU/Linux
mark@c2stable ~ $
I don't know what I'm looking for in the dmesg files but I do see one message about a SATA Link being down. Files attached.
Cheers,
Mark
Hello, Umm... the patched module isn't being loaded. If patched, it should be reporting whether it's hard or soft resetting and some other debug messages. One common mistake is not updating initrd with patched modules. Is "modules-rebuild -X rebuild" the command for initrd update? Thanks. -- tejun --
I don't use an initrd. I don't know what happened with the patch but clearly it wasn't in there. I wasn't confident so I used the --dry-run option. Maybe I forgot to remove it or something. Sorry. It's in now.
OK - I don't know if this was your intention but since adding this patch I've not had a single drive missing failure. I've cold booted about 8 times and warm booted at least 20 times. Every one has come up fine. I've even gone so far as to turn off the UPS and sit for 5 minutes before cold booting. Still nothing fails right now.
I've had this sort of statistical thing happen before where it hasn't failed for days, maybe even weeks, but then it starts failing and fails every time for a while. Over the past few days working with you I've never had to reboot more than twice to get you a file. Now I've tried 30 times this morning and I've come up with nothing.
I will continue to watch the machine and send you the failing dmesg file whenever I finally get it. For now I can only attach the passing file showing the patch is now included.
modules-rebuild is just a little Gentoo script that rebuilds a list of modules that I've previously set up. Each time I build a new kernel I rebuild some modules as well as mesa:
xorg-drivers xorg-input-evdev xorg-input-keyboard xorg-input-mouse xf86-video-ati xf86-video-fbdev xf86-video-vmware vmware-modules mesa
Clearly I shouldn't need both evdev as well as keyboard/mouse but there have been problems with hald so I've been flipping back and forth a bit. mesa is just superstition but I _think_ it's helped a couple of times.
Thanks!
- Mark
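Mark's own numbers give a feel for how unlikely such a clean run would be by chance. A rough sketch, assuming each boot fails independently with the roughly 40% rate he estimated in his first message (he notes himself that failures come in streaks, so independence is only an approximation); the function name clean_run_probability is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: probability of N consecutive good boots if each boot
# fails independently with probability P (independence is an assumption).
# Usage: clean_run_probability P N
clean_run_probability() {
    # (1 - P)^N, printed in scientific notation
    awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2e\n", (1 - p) ^ n }'
}
```

Calling `clean_run_probability 0.4 30` covers the 30 boots mentioned above; under these assumptions the result is far below one in a million, which is why a run of this length looks like a real change in behaviour rather than luck.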
Hello, It seems like SIDPR is a bit more unreliable than the current code can handle and the added delay and read could have affected the result. Eh... weird. Can you please apply the attached patch instead? The only difference is it will print out two SControl values instead of one. ie. "XXX SControl after resume = AAA BBB, tries=T". Can you please try to boot multiple times and see if AAA and BBB differ anytime? If that happens, please attach the boot log. Also, if you see one with T > 1, please attach that one too. Thanks. -- tejun
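A minimal sketch for sifting saved boot logs for the two cases asked about above, assuming the debug line looks exactly like the quoted format ("XXX SControl after resume = AAA BBB, tries=T"). The function name scan_scontrol is hypothetical, and the real output of the debug patch may differ in detail.

```shell
#!/bin/sh
# Hypothetical helper: print debug lines where the two SControl values
# differ or where more than one try was needed.
# Assumes the format "XXX SControl after resume = AAA BBB, tries=T".
# Usage: scan_scontrol LOGFILE...
scan_scontrol() {
    awk '
    /SControl after resume = / {
        # Keep only the text after the "= " separator: "AAA BBB, tries=T"
        rest = $0
        sub(/.*SControl after resume = /, "", rest)
        # f[1]=AAA  f[2]=BBB  f[3]="tries"  f[4]=T
        split(rest, f, /[ ,=]+/)
        if (f[1] != f[2] || f[4] + 0 > 1)
            print FILENAME ": " $0
    }' "$@"
}
```

Run against a directory of saved logs, for example `scan_scontrol dmesg-*.log`, it prints only the boots worth attaching and stays silent when every line shows matching values and tries=1.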
Certainly. Is there a way to reverse the previous patch? c2stable linux # patch -p1 --dry-run <~mark/Downloads/resume-dbg-1.patch patching file drivers/ata/libata-core.c Hunk #1 succeeded at 3798 (offset 86 lines). Hunk #2 succeeded at 3833 with fuzz 2 (offset 94 lines). Hunk #3 FAILED at 6109. 1 out of 3 hunks FAILED -- saving rejects to file drivers/ata/libata-core.c.rej c2stable linux # I assume this is failing because your patch is over the plain kernel, not the one I've patched? Thanks, Mark --
$ patch -R -p1 < ~mark/Downloads/resume-dbg.patch $ patch -p1 < ~mark/Downloads/resume-dbg-1.patch -- tejun --
Thanks. Building the new kernel now. I'll start trying to save the data you're looking for. Cheers, Mark --
4 warm reboots. All 4 said 300 300. However, only the 4th one showed an extra attempt at running the patch code, and it also showed Tries = 2. I'm attaching boot #1 and boot #4 for now. I've saved them all if you need or just want them. Please note that in all 4 cases all drives were found. Nothing has gone missing in any test yet after adding either of these patches. I've not tried cold boots yet. That's next. Cheers, Mark
Hello, Hmm... just in case you're being lucky, please keep an eye on it over several days and report the result. I think all that's necessary is slight modification to the resume logic but let's watch a bit first. Thanks. -- tejun --
I've tried two cold boots so far. One of them had that same extra Tries = 2 at the same place. I'll do a couple more just to see what happens. I'm happy to watch it as long as it takes and will certainly save any results if and when a drive isn't found. This problem has run hot and cold for a few months. Sometimes I go a week where every boot is good. These last two weeks were finally bad enough to get me to report it. I also want to investigate actually getting the BIOS AHCI setting working, but I think I'll leave that alone for a few days so as to not upset this experiment. Performance on the machine isn't critical right now, assuming AHCI helps. I'll get back to you when I've got something new to report. Cheers, Mark --
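Catching the rare bad boot is easier if the check is automated. A minimal sketch, assuming the five drive names used in this thread; the function name missing_drives and any log path are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: list expected drive nodes that are absent.
# Usage: missing_drives DEVDIR name...
# Prints one missing name per line; returns non-zero if any is missing.
missing_drives() {
    devdir=$1
    shift
    status=0
    for name in "$@"; do
        # -e rather than -b so the function can also be exercised against
        # an ordinary directory; on a real system the nodes are block devices.
        if [ ! -e "$devdir/$name" ]; then
            echo "$name"
            status=1
        fi
    done
    return $status
}
```

Called from a local startup script as `missing_drives /dev sda sdb sdc sdd sde || dmesg > /root/dmesg-missing.log` (the log path is only an example), it would save the kernel log on exactly those boots where a drive failed to appear.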
Tejun, With about 10-12 days of testing, 1-2 boots/day, I've not had a single boot failure since adding the patch. Only twice has it said tries=2. Every other time it's tries=1. The machine seems to work fine either way. Thanks, Mark --
Hello, Hmmm... can you please test the attached patch instead? It seems likely that the root cause is not flakiness of SIDPR but incorrect locking in libata EH code. Thanks. -- tejun
Hey Tejun: I guess this is the same patch that you sent me to fix my issue with missing drives. Good news: I've been through about 10 reboots now and no problems. Based on my prior experience, I'd say with the old setup, 10 clean boots in a row was probably less than a 1% event. So, it seems that this has fixed my problem. Thanks! --
Hello, Yeap, it's the same one. I'm forwarding the patch upstream now but, Mark, please let me know the test result. Thanks. -- tejun --
OK, I was able to get into the machine remotely for a few minutes. I think the patch applied correctly and the machine cold booted cleanly. (My wife powered it up for me.) I'll do more boots later when you confirm everything looks reasonable. I don't see any print statement in the patch so I don't know what to look for. I'm attaching a dmesg file for you to review. Assuming I did this right then everything seems good so far. Cheers, Mark c2stable linux # patch --verbose -p1 <~mark/Downloads/ata_piix-sidpr-lock.patch Hmm... Looks like a unified diff to me... The text leading up to this was: -------------------------- |diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c |index 7409f98..3971bc0 100644 |--- a/drivers/ata/ata_piix.c |+++ b/drivers/ata/ata_piix.c -------------------------- Patching file drivers/ata/ata_piix.c using Plan A... Hunk #1 succeeded at 158. Hunk #2 succeeded at 952. Hunk #3 succeeded at 968. Hunk #4 succeeded at 1573. done c2stable linux #
Tejun, Looks like I had a failure today. First one in weeks and only the 3rd or 4th boot with this newer patch file. One of the two drives making a RAID0 wasn't found so /dev/md11 (constructed from /dev/sdd and /dev/sde) couldn't be started. I did a cold reboot and the drive was found. If it matters, and it probably doesn't, the failure came on a boot which had a scheduled fsck to do of /dev/md5 - my main / drive. I don't see how that would make a difference but I figure why leave the info out. That's why the times are so much larger in the dmesg file. (I think) dmesg attached. I patched the Gentoo kernel if it makes a difference, same as I did with the earlier patch. mark@c2stable ~ $ uname -a Linux c2stable 2.6.34-gentoo-r2 #1 SMP PREEMPT Sun Jul 18 14:09:48 PDT 2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel GNU/Linux mark@c2stable ~ $ Sorry, Mark
That's unfortunate. FYI I have continued to be trouble free, but my processor is a bit weaker than Mark's, although I would find it surprising that this would cause a problem. Also, FYI Mark, I have 12GB of Corsair RAM, and have it bumped up to the Intel XMP profile. --
Hello, Hmmm... that's weird. Can you please make sure the patch is actually applied? Adding a printk("XXX patch applied!\n") near other changes usually is easy enough. Also, can you please apply resume-dbg-1.patch too and reproduce the failure and post log? Thanks. -- tejun --
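Before rebuilding, one quick way to confirm that the source tree really contains the change (as opposed to a --dry-run that left it untouched) is to look for an identifier the patch introduces. A minimal sketch; the helper name patch_applied is hypothetical, and sidpr_lock is the field added by the ata_piix patch quoted later in this thread. This only checks the source: a printk that shows up in dmesg, as suggested above, is what demonstrates that the running kernel was built from it.

```shell
#!/bin/sh
# Hypothetical helper: succeed only if MARKER occurs in FILE.
# Usage: patch_applied FILE MARKER
patch_applied() {
    # -F: treat the marker as a fixed string; -q: no output, exit status only
    grep -qF -- "$2" "$1"
}
```

From the top of the kernel tree this would be used as `patch_applied drivers/ata/ata_piix.c sidpr_lock && echo "patch is in the source"`.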
Hi Tejun,
I'm finally home and trying to get back to this. I'm really a bad
programmer so I don't know what I've done wrong but it seems patch
isn't happy with me.
c2stable linux # patch --dry-run -p1 <../ata_piix-sidpr-lock.patch
patching file drivers/ata/ata_piix.c
patch: **** malformed patch at line 13:
c2stable linux #
Here's the change I tried to make to a copy of the file:
c2stable linux # cat ../ata_piix-sidpr-lock.patch
diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index 7409f98..3971bc0 100644
--- a/drivers/ata/ata_piix.c
+++ b/drivers/ata/ata_piix.c
@@ -158,6 +158,7 @@ struct piix_map_db {
struct piix_host_priv {
const int *map;
u32 saved_iocfg;
+ spinlock_t sidpr_lock; /* FIXME: remove once locking in EH is fixed */
+ printk("MWK - ata_sidpr patch applied!\n");
void __iomem *sidpr;
};
@@ -951,12 +952,15 @@ static int piix_sidpr_scr_read(struct ata_link *link,
unsigned int reg, u32 *val)
{
struct piix_host_priv *hpriv = link->ap->host->private_data;
+ unsigned long flags;
if (reg >= ARRAY_SIZE(piix_sidx_map))
return -EINVAL;
+ spin_lock_irqsave(&hpriv->sidpr_lock, flags);
piix_sidpr_sel(link, reg);
*val = ioread32(hpriv->sidpr + PIIX_SIDPR_DATA);
+ spin_unlock_irqrestore(&hpriv->sidpr_lock, flags);
return 0;
}
@@ -964,12 +968,15 @@ static int piix_sidpr_scr_write(struct ata_link *link,
unsigned int reg, u32 val)
{
struct piix_host_priv *hpriv = link->ap->host->private_data;
+ unsigned long flags;
if (reg >= ARRAY_SIZE(piix_sidx_map))
return -EINVAL;
+ spin_lock_irqsave(&hpriv->sidpr_lock, flags);
piix_sidpr_sel(link, reg);
iowrite32(val, hpriv->sidpr + PIIX_SIDPR_DATA);
+ spin_unlock_irqrestore(&hpriv->sidpr_lock, flags);
return 0;
}
@@ -1566,6 +1573,7 @@ ...
Whenever the patch file was saved on this system, line 13 of it was split (probably by an email client). Whenever I see this, I just join (merge) that line and the next one and try again... sometimes
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--
Randy,
Could very well be what happened. I added line 13 (the printk) by hand
<SNIP - ORIGINAL PATCH FILE>
struct piix_host_priv {
const int *map;
u32 saved_iocfg;
+ spinlock_t sidpr_lock; /* FIXME: remove once locking in EH is fixed */
void __iomem *sidpr;
};
<SNIP - MY CHANGE BY HAND>
struct piix_host_priv {
const int *map;
u32 saved_iocfg;
+ spinlock_t sidpr_lock; /* FIXME: remove once locking in EH is fixed */
+ printk("MWK - ata_sidpr patch applied!\n");
void __iomem *sidpr;
};
Maybe I should have just put it on the same line as the previous
spinlock command?
I'll play with it and see if I can get it working.
Thanks,
Mark
--
Ah, so you added a line to a patch file. That means that the patch block header must be changed from something like this:
@@ -964,12 +968,15 @@
to like this:
@@ -964,12 +968,16 @@
Any text following the second "@@" is just a comment so it does not matter. The ,12 ,15 ,16 are all line counts. The ",12" is the number of lines in the before version of the patch. The ",15" or ",16" is the number of lines in the after version of the patch, so you would need to increase it by 1 if you added one line.
Or you can just put the printk on the same line as another part of the patch and it won't matter. :)
---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--
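The counting rule described above can be checked mechanically. A minimal sketch, assuming a git-style unified diff (each file section starts with a "diff " line); the function name hunk_counts is hypothetical. For every hunk it prints the header as written, followed by the counts implied by the hunk body, so a hand-edited hunk shows up as a mismatch.

```shell
#!/bin/sh
# Hypothetical helper: for each hunk in a unified diff, print the header
# as written and the line counts implied by the hunk body.
#   before = context lines + removed (-) lines
#   after  = context lines + added (+) lines
# Usage: hunk_counts PATCHFILE...
hunk_counts() {
    awk '
    function report() {
        if (header != "")
            printf "%s => before=%d after=%d\n", header, before, after
    }
    /^diff / { report(); header = ""; next }   # a new file section begins
    /^@@ /   { report()
               header = $1 " " $2 " " $3 " " $4
               before = after = 0
               next }
    header == "" { next }     # file header lines (index, ---, +++)
    /^\\/  { next }           # "\ No newline at end of file"
    /^-/   { before++; next }
    /^[+]/ { after++;  next }
           { before++; after++ }   # context line
    END    { report() }
    ' "$@"
}
```

For the hunk Mark edited, this would report something like `@@ -158,6 +158,7 @@ => before=6 after=8`: the body now holds 8 "after" lines while the header still claims 7, so the header would need to read `@@ -158,6 +158,8 @@`.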
Thanks Randy. This helped. OK - so I'm going to hang fire on this problem right now. As part of this process I've updated to a 2.6.35-gentoo kernel and I'm not seeing the problem at all so far. I've booted about 25 times over the last 3 days - warm and cold boots mixed - and so far no failures. While I've had times in the past where earlier kernels didn't exhibit the problem for days I don't remember a time where I got this many good boots in a row. As my problem didn't seem to be exactly the same as Paul's, and since Tejun expressed surprise that his fix for Paul's issue didn't fix mine, maybe my problem is actually something different and (hopefully) already fixed. I'll continue to use the machine and watch for any failures. If I get one I'll reengage the list and go into debug mode. Cheers, Mark --
Tejun asked Mark to add a printk, and Mark added it directly to the patch. Mark, just apply the original patch as-is, and then add the printk to the source code in ata_piix.c. You should add it somewhere like the piix_init_one function, e.g. right before the "Save IOCFG" comment around line 1575. --
Hey Paul. Glad it worked for you as it did for me. Was your hardware in any way similar to mine? - Mark --
What processor? I've got the Core i7-980X. I've wondered if this problem showed up for me because possibly the 12 thread processor moves through some part of boot setup pretty quickly. Mine is an Asus Rampage II Extreme with five 500GB drives. Whatever the reason great thanks to Tejun for fixing it. I keep wanting to add an 'r' to his last name to get 'Hero'! Cheers, Mark --
Oh, yeah, my processor is the i7 975. That's 4 cores 8 threads, right? Yes, nice work Tejun, thanks. I'll have to run custom kernels until this --
Please provide the make/model of the PC. If it's whitebox or DIY please provide make/model of PSU, mobo and CPU. How many USB peripherals are powered by the PC? Are you powering a water cooling loop pump from the PC's power supply? Is this PC in a temperature controlled environment (A/C)? -- Stan --
Built it myself.
Asus Rampage II Extreme motherboard
12GB Crucial DRAM currently installed (Holds 24GB)
Intel Core i7-980X CPU @ 3.33GHz
Palit nVidia 9500GT-based graphics card
Sony Nec Optiarc AD-7241S-0B 24X Dual Layer DVD+/-RW SATA Drive
(5x) WD5002ABYS RE3 Enterprise Class 500GB hard drives
No external devices other than monitor, mouse, keyboard and the USB interface to the UPS are attached. No USB, 1394 or eSATA are attached at this time.
It's all powered by:
Corsair CMPSU-750TX 750-Watt TX Series 80 Plus Certified Power Supply
Air cooled using the stock Intel fan that came with the processor and sitting in a home office environment.
The machine draws (steady state) about 250-275W according to both the UPS it's hooked to as well as my trusty Kill-a-Watt. What it might draw transient at power on while drives are spinning up I wouldn't hazard a guess but it does seem to be well below the rating of the supply. The PSU actually has something like 8 or 10 SATA power connections, not that that means anything. I'm using 6. (1 CDRW, 5 drives)
Note two things:
1) All the drives are always reported by BIOS at boot time. Now, that doesn't guarantee that the drives spin up. It may only mean they can be read by BIOS, but they are there as far as I can tell. They show up in the boot screens and in BIOS itself if I drop in to play with settings.
2) Whatever state the machine comes up in - drives recognized or not - it will run forever in that state under some pretty heavy loads so it isn't like the PSU can't completely do the job. It could possibly be marginal though.
QUESTION: There are some settings in BIOS for delaying the drive. (Or something. I'm using the machine and not in BIOS) There were settings from 0 to 35 seconds if I remember correctly. Possibly I should try setting each drive to a different value to stagger power up?
If you need more info or have other ideas please let me know.
Thanks,
Mark
--
If that PSU meets published specs you shouldn't need delayed spin up with Your answers here should have pretty much eliminated hardware issues as the cause, unless that particular mobo has BIOS or other issues I'm unaware of. I've found it's always best to ask about hardware with this kind of report just to eliminate possibilities. All that gear is good quality stuff. If the problem is due to hardware, it's because one of your components is defective, but we don't see evidence of that at this point. Also, TTBOMK, if a SATA drive motor doesn't spin up, the drive firmware won't report the drive as ready upstream, thus the BIOS won't list the drive. -- Stan --
I've not dropped into BIOS yet as the machine is in use but from the Asus manual it appears the delay is not on a drive by drive basis so I An off-list response suggested possibly setting some drive jumpers on non-boot drives to power up in standby. Apparently the kernel will then spin up those drives later? If I cannot stagger the drives in BIOS then I will likely try that. Technically I guess I only need /boot on sda to get the kernel booted. The mdadm RAID1 on sda/sdb/sdc could start slightly later, and technically the RAID0 on sdd/sde could start very late as there are only VMWare images on that drive. Cheers, Mark --
OK, I don't know if this is related but so far I cannot get the machine to boot if I set BIOS SATA configuration to AHCI. I believe that I have AHCI support as well as SATA support built into the kernel but when I set the BIOS to AHCI the machine just hangs saying it finds no medium. I assume that means no hard drive. I have to set SATA support to enhanced and then IDE to get the machine to boot at all.
I did notice that my earlier kernel had the deprecated ATA/ATAPI support selected so I removed that from the kernel but it didn't change the results.
The lshw listings below show the SATA controllers. The 20360/363 is (apparently) the eSATA controller going to the front panel. Nothing is attached there. The chipset supposedly handles 6 SATA ports - they seem to be arranged x4 & x2. TTBOMK I have the CDROM and the 3-drive RAID1 on the 4 port controller and the RAID0, which fails to be recognized more often - on the 2 port controller. I don't know how to prove that though.
I'm unclear how mature the SATA support is for this chipset. Is there a chance that this is some bit that's not being set reliably?
Also, I misspoke earlier about the graphics adapter. It's actually an ATI Radeon 5770 in this machine. The 9500GT is in another machine.
Thanks,
Mark
c2stable ~ # lshw -short -class storage
H/W path         Device  Class    Description
=========================================================
/0/100/1c.4/0            storage  20360/20363 Serial ATA Controller
/0/100/1c.4/0.1          storage  20360/20363 Serial ATA Controller
/0/100/1f.2      scsi2   storage  82801JI (ICH10 Family) 4 port SATA IDE Controller
/0/100/1f.5      scsi5   storage  82801JI (ICH10 Family) 2 port SATA IDE Controller
c2stable ~ # lshw -short | grep SATA
/0/100/1f.2      scsi2   storage  82801JI (ICH10 Family) 4 port SATA IDE Controller
/0/100/1f.5      scsi5   storage  82801JI (ICH10 Family) 2 port SATA IDE ...
If it wasn't for all that, I'd have suspected a small power issue. I recently ran into an odd problem of how my PSU has its rails split up between different cables. If I had all the hardware in the box hooked up with the normal "black" cables from my PSU I'd get errors from one of my 2TB WD Greens. But if I moved the drives, or the GPU to the "red" plug (modular psu), everything worked. I assume the rails on my psu were split up absolutely retardedly, causing the main 12v rail to be shared between the cpu, all add-on devices, AND the non-modular PCI-E cable for the GPU (8800GTS), causing that rail's voltage to drop below spec. But it sounds like maybe your SATA chipset might not be supported in AHCI mode (which you really want), and might have issues in IDE mode. But given that it's an Intel chipset you'd think it'd have perfect support :o So I don't know. -- Thomas Fjellstrom tfjellstrom@strangesoft.net --
Hello, That's odd. Can you please attach kernel boot log w/ ahci mode? Booting a recent live CD and saving boot log from there should do the trick. Thanks. -- tejun --
Hi Tejun, Thanks for the help. I tried a Gentoo install CD from last March. It's a 2.6.31 type kernel. Problem is the buffer depths are not big enough to capture the complete dmesg contents and I don't know a command line option to make it larger on the fly. If you know of one that's got a larger buffer - or a command to increase it at boot time - then let me know and I'll try again. I'm attaching what I was able to catch for both AHCI and IDE settings in BIOS for the storage configuration option. It seems to me that even in AHCI mode the machine does see all the hard drives. Maybe there's something about my boot partition that's having problems in AHCI mode only? If it sees /dev/sda then why wouldn't it find grub and at least show a grub menu? Thanks, Mark
On some machines changing from IDE to AHCI messes up the boot order selection in the BIOS - you may have to switch to AHCI, save settings, reboot, go back into the BIOS and then make sure the disk boot order settings are correct (whatever drive grub is installed on needs to be first). --
That's a very interesting idea Robert. Thanks. I'll have to be careful. I have 5 identical drives in the machine so figuring out which is which when I'm in BIOS might be a bit of a trick. I'll give it a go and report back. Cheers, Mark --
You can install grub on the mbr of all disks and not worry about the order. Especially if you work with UUIDs. Sander -- Humilis IT Services and Solutions http://www.humilis.net --
The ahci driver is working fine. If the system can't boot w/ ahci mode, the problem is probably on the bios side and most likely caused by getting the boot device wrong as Robert suggested. Just choosing one after another until it boots should do the trick. Thanks. -- tejun --
