Re: Drives missing at boot

Previous thread: [PATCH] Break out types from <linux/list.h> to <linux/list_types.h>. by Chris Metcalf on Friday, July 2, 2010 - 10:41 am. (11 messages)

Next thread: [PATCH] trivial: fix typos concerning "already" by Uwe Kleine-König on Friday, July 2, 2010 - 11:41 am. (2 messages)
From: Mark Knecht
Date: Friday, July 2, 2010 - 10:56 am

Hi,
   I have a newish machine - maybe 3 months old - which unreliably
finds its disk drives at each boot. On roughly 40% of boots, 1 or
more drives are missing.

   So far I've been unable to determine what might be wrong. Sometimes
it finds all 5 drives, sometimes only 4 or 3 drives. It's not always
the same drives that are missing. Shown below are two successive warm
boots. The machine had been running with all 5 drives working. The
first time through it missed both /dev/sdd and /dev/sde while the very
next time it found all 5 drives. Each of the 5 drives has been missing
one or more times at different boots. (I.e. - it's not always drive
sdd for instance.)

   Every time I boot the AMI BIOS screen says all 5 drives are there.
I have found that if I drop into BIOS before going to grub, selecting
each drive one at a time and reading through its setup seems to make
the boot more reliable, but not 100%. Probably 80% reliable.

   I don't know what info is required to look at this. I'm running the
newest Gentoo kernel but it's been happening with all kernels I've
tried since I built the machine. I'm attaching the kernel config as
well as dmesg from the last boot which had all the drives. The only
difference in dmesg when the drives don't show up (that I've spotted)
is that the missing drive just isn't in dmesg. (I.e. no error messages
that I could spot.)

   Let me know what I might try or what other info you might want. The
motherboard is an Asus Rampage II Extreme with an i7-980x 6 core/12
thread processor. sda, sdb & sdc are part of a RAID1, sdd & sde are
part of a RAID0.

Thanks,
Mark

mark@c2stable ~ $ uname -a
Linux c2stable 2.6.34-gentoo-r1 #1 SMP PREEMPT Fri Jul 2 10:04:52 PDT
2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel
GNU/Linux
mark@c2stable ~ $
mark@c2stable ~ $ ls -la /dev/sd*
brw-rw---- 1 root disk 8,  0 Jul  2  2010 /dev/sda
brw-rw---- 1 root disk 8,  1 Jul  2  2010 /dev/sda1
brw-rw---- 1 root disk 8,  2 Jul  2  ...
From: Mark Knecht
Date: Saturday, July 3, 2010 - 8:20 am

<SNIP>

As a follow-up, here's dmesg on a boot where neither sdd nor sde showed up.

- Mark

mark@c2stable ~ $ ls -al /dev/sd*
brw-rw---- 1 root disk 8,  0 Jul  3  2010 /dev/sda
brw-rw---- 1 root disk 8,  1 Jul  3  2010 /dev/sda1
brw-rw---- 1 root disk 8,  2 Jul  3  2010 /dev/sda2
brw-rw---- 1 root disk 8,  3 Jul  3  2010 /dev/sda3
brw-rw---- 1 root disk 8,  4 Jul  3  2010 /dev/sda4
brw-rw---- 1 root disk 8,  5 Jul  3  2010 /dev/sda5
brw-rw---- 1 root disk 8,  6 Jul  3  2010 /dev/sda6
brw-rw---- 1 root disk 8, 16 Jul  3  2010 /dev/sdb
brw-rw---- 1 root disk 8, 17 Jul  3  2010 /dev/sdb1
brw-rw---- 1 root disk 8, 18 Jul  3  2010 /dev/sdb2
brw-rw---- 1 root disk 8, 19 Jul  3  2010 /dev/sdb3
brw-rw---- 1 root disk 8, 20 Jul  3  2010 /dev/sdb4
brw-rw---- 1 root disk 8, 21 Jul  3  2010 /dev/sdb5
brw-rw---- 1 root disk 8, 22 Jul  3  2010 /dev/sdb6
brw-rw---- 1 root disk 8, 32 Jul  3  2010 /dev/sdc
brw-rw---- 1 root disk 8, 33 Jul  3  2010 /dev/sdc1
brw-rw---- 1 root disk 8, 34 Jul  3  2010 /dev/sdc2
brw-rw---- 1 root disk 8, 35 Jul  3  2010 /dev/sdc3
brw-rw---- 1 root disk 8, 36 Jul  3  2010 /dev/sdc4
brw-rw---- 1 root disk 8, 37 Jul  3  2010 /dev/sdc5
brw-rw---- 1 root disk 8, 38 Jul  3  2010 /dev/sdc6
mark@c2stable ~ $ dmesg
Linux version 2.6.34-gentoo-r1 (root@c2stable) (gcc version 4.4.3
(Gentoo 4.4.3-r2 p1.2) ) #1 SMP PREEMPT Fri Jul 2 10:04:52 PDT 2010
Command line: root=/dev/md5
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
 BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 00000000bf780000 (usable)
 BIOS-e820: 00000000bf780000 - 00000000bf798000 (ACPI data)
 BIOS-e820: 00000000bf798000 - 00000000bf7dc000 (ACPI NVS)
 BIOS-e820: 00000000bf7dc000 - 00000000c0000000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ffe00000 - 0000000100000000 (reserved)
 BIOS-e820: 0000000100000000 - ...
From: Tejun Heo
Date: Saturday, July 3, 2010 - 9:01 am

(cc'ing linux-ide)


Can you please *attach* full logs of a successful boot and several
failing boots?

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Saturday, July 3, 2010 - 9:06 am

Certainly. Which logs? dmesg or something else?

Thanks,
Mark
--

From: Tejun Heo
Date: Saturday, July 3, 2010 - 9:13 am

Hello,


dmesg output preferably with printk timestamp enabled.

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Saturday, July 3, 2010 - 9:42 am

OK, I enabled printk timing.

Here are two boots. The first one had /dev/sde come up missing:

mark@c2stable ~ $ ls -al /dev/sd*
brw-rw---- 1 root disk 8,  0 Jul  3  2010 /dev/sda
brw-rw---- 1 root disk 8,  1 Jul  3  2010 /dev/sda1
brw-rw---- 1 root disk 8,  2 Jul  3  2010 /dev/sda2
brw-rw---- 1 root disk 8,  3 Jul  3  2010 /dev/sda3
brw-rw---- 1 root disk 8,  4 Jul  3  2010 /dev/sda4
brw-rw---- 1 root disk 8,  5 Jul  3  2010 /dev/sda5
brw-rw---- 1 root disk 8,  6 Jul  3  2010 /dev/sda6
brw-rw---- 1 root disk 8, 16 Jul  3  2010 /dev/sdb
brw-rw---- 1 root disk 8, 17 Jul  3  2010 /dev/sdb1
brw-rw---- 1 root disk 8, 18 Jul  3  2010 /dev/sdb2
brw-rw---- 1 root disk 8, 19 Jul  3  2010 /dev/sdb3
brw-rw---- 1 root disk 8, 20 Jul  3  2010 /dev/sdb4
brw-rw---- 1 root disk 8, 21 Jul  3  2010 /dev/sdb5
brw-rw---- 1 root disk 8, 22 Jul  3  2010 /dev/sdb6
brw-rw---- 1 root disk 8, 32 Jul  3  2010 /dev/sdc
brw-rw---- 1 root disk 8, 33 Jul  3  2010 /dev/sdc1
brw-rw---- 1 root disk 8, 34 Jul  3  2010 /dev/sdc2
brw-rw---- 1 root disk 8, 35 Jul  3  2010 /dev/sdc3
brw-rw---- 1 root disk 8, 36 Jul  3  2010 /dev/sdc4
brw-rw---- 1 root disk 8, 37 Jul  3  2010 /dev/sdc5
brw-rw---- 1 root disk 8, 38 Jul  3  2010 /dev/sdc6
brw-rw---- 1 root disk 8, 48 Jul  3  2010 /dev/sdd
brw-rw---- 1 root disk 8, 49 Jul  3  2010 /dev/sdd1
mark@c2stable ~ $
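
A listing like the one above can also be checked mechanically rather than by eye. What follows is a minimal Python sketch, not something posted in the thread. The function name `missing_disks` and the `EXPECTED` list (the five disks sda through sde described in the first message) are assumptions made for illustration.

```python
import re

# Assumption: the five whole-disk names described in the first message.
EXPECTED = ["sda", "sdb", "sdc", "sdd", "sde"]


def missing_disks(listing, expected=EXPECTED):
    """Return the expected whole-disk names absent from `ls /dev/sd*` output.

    Partition nodes such as /dev/sda1 are ignored: the trailing \\b only
    matches when the device name is not followed by a digit.
    """
    found = set(re.findall(r"/dev/(sd[a-z]+)\b", listing))
    return [disk for disk in expected if disk not in found]
```

Fed the listing from the failing boot above, which stops at /dev/sdd1, this would report sde as the only missing disk.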

I then did two warm boots and got the same problem so I shut down
completely and did a cold boot which worked:

mark@c2stable ~ $ ls -al /dev/sd*
brw-rw---- 1 root disk 8,  0 Jul  3  2010 /dev/sda
brw-rw---- 1 root disk 8,  1 Jul  3  2010 /dev/sda1
brw-rw---- 1 root disk 8,  2 Jul  3  2010 /dev/sda2
brw-rw---- 1 root disk 8,  3 Jul  3  2010 /dev/sda3
brw-rw---- 1 root disk 8,  4 Jul  3  2010 /dev/sda4
brw-rw---- 1 root disk 8,  5 Jul  3  2010 /dev/sda5
brw-rw---- 1 root disk 8,  6 Jul  3  2010 /dev/sda6
brw-rw---- 1 root disk 8, 16 Jul  3  2010 /dev/sdb
brw-rw---- 1 root disk 8, 17 Jul  3  2010 /dev/sdb1
brw-rw---- 1 root disk 8, 18 Jul  3  2010 ...
From: Tejun Heo
Date: Sunday, July 4, 2010 - 11:30 pm

Can you please apply the attached patch, reproduce the problem and
post the kernel log?

Thanks.

-- 
tejun
From: Mark Knecht
Date: Monday, July 5, 2010 - 9:56 am

I'm sorry. What am I patching? I'm not a kernel developer - not even a
programmer - so I'll need some help with this. What's the command I
should use?

c2stable src # ls -la /usr/src/
total 32
drwxr-xr-x  8 root root 4096 Jul  2 09:56 .
drwxr-xr-x 14 root root 4096 Apr 15 07:46 ..
-rw-r--r--  1 root root    0 Mar 24 18:37 .keep
lrwxrwxrwx  1 root root   22 Jul  2 09:56 linux -> linux-2.6.34-gentoo-r1
drwxr-xr-x 23 root root 4096 Jun 16 07:23 linux-2.6.32-gentoo-r7
drwxr-xr-x 24 root root 4096 Jun 16 08:42 linux-2.6.34-gentoo
drwxr-xr-x 24 root root 4096 Jul  3 15:30 linux-2.6.34-gentoo-r1
drwxr-xr-x 21 root root 4096 Jun 27 13:12 linux-2.6.34-rc3
drwxr-xr-x 20 root root 4096 Jun 15 08:05 linux-2.6.34-rc5
drwxr-xr-x 20 root root 4096 Jun 27 13:13 linux-2.6.35-rc3
c2stable src #

Thanks,
Mark
--

From: Tejun Heo
Date: Monday, July 5, 2010 - 11:33 pm

Hello,


Hmm...

$ cd /usr/src/linux && patch -p1 < PATCH_FILE

should do it.  You know how to build and install the compiled kernel,
right?

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Tuesday, July 6, 2010 - 11:13 am

OK - thanks. The patch seemed to install correctly. I then did

make clean
make && make modules_install

and then a Gentoo command:

modules-rebuild -X rebuild

to pick up any package modules that need to be rebuilt when I use a
new kernel. (X drivers, vmware, etc.)

The kernel boots fine:

mark@c2stable ~ $ uname -a
Linux c2stable 2.6.34-gentoo-r1 #4 SMP PREEMPT Tue Jul 6 10:35:15 PDT
2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel
GNU/Linux
mark@c2stable ~ $

I don't know what I'm looking for in the dmesg files but I do see one
message about a SATA Link being down.

Files attached.

Cheers,
Mark
From: Tejun Heo
Date: Tuesday, July 6, 2010 - 10:50 pm

Hello,


Umm... the patched module isn't being loaded.  If patched, it should
be reporting whether it's hard or soft resetting and some other debug
messages.  One common mistake is not updating initrd with patched
modules.  Is "modules-rebuild -X rebuild" the command for initrd
update?

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Wednesday, July 7, 2010 - 8:34 am

I don't use an initrd.

I don't know what happened with the patch but clearly it wasn't in
there. I wasn't confident so I used the --dry-run option. Maybe I
forgot to remove it or something. Sorry. It's in now.

OK - I don't know if this was your intention but since adding this
patch I've not had a single drive missing failure. I've cold booted
about 8 times and warm booted at least 20 times. Every one has come up
fine. I've even gone so far as to turn off the UPS and sit for 5
minutes before cold booting. Still nothing fails right now.

I've had this sort of statistical thing happen before where it hasn't
failed for days, maybe even weeks, but then it starts failing and
fails every time for a while. Over the past few days working with you
I've never had to reboot more than twice to get you a file. Now I've
tried 30 times this morning and I've come up with nothing.

I will continue to watch the machine and send you the failing dmesg
file whenever I finally get it. For now I can only attach the passing
file showing the patch is now included.

modules-rebuild is just a little Gentoo script that rebuilds a list of
modules that I've previously set up. Each time I build a new kernel I
rebuild some modules as well as mesa:

xorg-drivers
xorg-input-evdev
xorg-input-keyboard
xorg-input-mouse
xf86-video-ati
xf86-video-fbdev
xf86-video-vmware
vmware-modules
mesa

Clearly I shouldn't need both evdev as well as keyboard/mouse but
there's been problems with hald so I've been flipping back and forth a
bit. mesa is just superstition but I _think_ it's helped a couple of
times.

Thanks!

- Mark
From: Tejun Heo
Date: Wednesday, July 7, 2010 - 8:48 am

Hello,


It seems like SIDPR is a bit more unreliable than the current code can
handle and the added delay and read could have affected the result.
Eh... weird.  Can you please apply the attached patch instead?  The
only difference is it will print out two SControl values instead of
one, i.e. "XXX SControl after resume = AAA BBB, tries=T".  Can you
please try to boot multiple times and see if AAA and BBB differ
anytime?  If that happens, please attach the boot log.  Also, if you
see one with T > 1, please attach that one too.
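
Checking many saved boot logs for these two conditions by hand is tedious, so a small filter may help. The following Python sketch is an illustration only: it assumes the debug line has exactly the shape quoted above ("XXX SControl after resume = AAA BBB, tries=T"), and the name `interesting_lines` is hypothetical. The real patch output may differ in spacing or case, in which case the pattern would need adjusting.

```python
import re

# Assumed line shape, copied from the message above; not verified against
# the actual debug patch.
PATTERN = re.compile(r"XXX SControl after resume = (\S+) (\S+), tries=(\d+)")


def interesting_lines(log_text):
    """Return (first, second, tries) for lines worth reporting.

    A line is reported when the two SControl values differ or when more
    than one try was needed.
    """
    hits = []
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match is None:
            continue
        first, second = match.group(1), match.group(2)
        tries = int(match.group(3))
        if first != second or tries > 1:
            hits.append((first, second, tries))
    return hits
```

The values are compared as strings, so no assumption is made about whether they are printed in hex or decimal.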

Thanks.

-- 
tejun
From: Mark Knecht
Date: Wednesday, July 7, 2010 - 9:15 am

Certainly. Is there a way to reverse the previous patch?

c2stable linux # patch -p1 --dry-run <~mark/Downloads/resume-dbg-1.patch
patching file drivers/ata/libata-core.c
Hunk #1 succeeded at 3798 (offset 86 lines).
Hunk #2 succeeded at 3833 with fuzz 2 (offset 94 lines).
Hunk #3 FAILED at 6109.
1 out of 3 hunks FAILED -- saving rejects to file drivers/ata/libata-core.c.rej
c2stable linux #

I assume this is failing because your patch is over the plain kernel,
not the one I've patched?

Thanks,
Mark
--

From: Tejun Heo
Date: Wednesday, July 7, 2010 - 9:19 am

$ patch -R -p1 < ~mark/Downloads/resume-dbg.patch
$ patch -p1 < ~mark/Downloads/resume-dbg-1.patch

-- 
tejun
--

From: Mark Knecht
Date: Wednesday, July 7, 2010 - 9:27 am

Thanks. Building the new kernel now. I'll start trying to save the
data you're looking for.

Cheers,
Mark
--

From: Mark Knecht
Date: Wednesday, July 7, 2010 - 10:06 am

4 warm reboots. All 4 said 300 300. However, only the 4th one showed an
extra attempt at running the patch code, and it also showed Tries =
2. I'm attaching boot #1 and boot #4 for now. I've saved them all if
you need or just want them.

Please note that in all 4 cases all drives were found. Nothing has gone
missing in any test yet after adding either of these patches.

I've not tried cold boots yet. That's next.

Cheers,
Mark
From: Tejun Heo
Date: Wednesday, July 7, 2010 - 10:26 am

Hello,


Hmm... just in case you're being lucky, please keep an eye on it over
several days and report the result.  I think all that's necessary is
slight modification to the resume logic but let's watch a bit first.

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Wednesday, July 7, 2010 - 10:32 am

I've tried two cold boots so far. One of them had that same extra
Tries = 2 at the same place. I'll do a couple more just to see what
happens.

I'm happy to watch it as long as it takes and will certainly save any
results if and when a drive isn't found. This problem has run hot and
cold for a few months. Sometimes I go a week where every boot is good.
These last two weeks were finally bad enough to get me to report it.

I also want to investigate actually getting the BIOS AHCI setting
working, but I think I'll leave that alone for a few days so as to not
upset this experiment. Performance on the machine isn't critical right
now, assuming AHCI helps.

I'll get back to you when I've got something new to report.

Cheers,
Mark
--

From: Mark Knecht
Date: Monday, July 19, 2010 - 12:31 pm

Tejun,
   With about 10-12 days of testing, 1-2 boots/day, I've not had a
single boot failure since adding the patch. Only twice has it said
tries=2. Every other time it's tries=1. The machine seems to work fine
either way.

Thanks,
Mark
--

From: Tejun Heo
Date: Monday, July 19, 2010 - 2:01 pm

Hello,


Hmmm... can you please test the attached patch instead?  It seems
likely that the root cause is not flakiness of SIDPR but incorrect
locking in libata EH code.

Thanks.

-- 
tejun
From: Paul Check
Date: Monday, July 19, 2010 - 8:14 pm

Hey Tejun: I guess this is the same patch that you sent me to fix my issue
with missing drives.  Good news: I've been through about 10 reboots now
and no problems.  Based on my prior experience, I'd say with the old
setup, 10 clean boots in a row was probably less than a 1% event.  So, it
seems that this has fixed my problem.

Thanks!
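
The "less than a 1% event" estimate can be sanity-checked with a little arithmetic. The sketch below is not from the thread. It assumes each boot fails independently with a fixed probability, which is a simplification, since Mark reports that his failures came in streaks. The function names are hypothetical.

```python
def clean_streak_probability(p_fail, boots):
    """Chance of `boots` clean boots in a row, assuming independent boots."""
    return (1.0 - p_fail) ** boots


def min_failure_rate(streak_probability, boots):
    """Per-boot failure rate at which a clean streak has the given chance."""
    return 1.0 - streak_probability ** (1.0 / boots)


# With the roughly 40% failure rate Mark describes in his first message,
# ten clean boots in a row would be about a 0.6% event.
ten_clean = clean_streak_probability(0.40, 10)

# Conversely, ten clean boots being a 1% event corresponds to a per-boot
# failure rate of about 37%.
implied_rate = min_failure_rate(0.01, 10)
```

Under the same assumption, a longer streak such as 30 clean boots at a 40% failure rate would be far less likely still, which is why a long clean run is suggestive even though it is not proof.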



--

From: Tejun Heo
Date: Tuesday, July 20, 2010 - 7:14 am

Hello,


Yeap, it's the same one.  I'm forwarding the patch upstream now but,
Mark, please let me know the test result.

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Tuesday, July 20, 2010 - 7:53 am

Tejun,
   I'm traveling but back tonight. I'll try it then.

Thanks,
Mark
--

From: Mark Knecht
Date: Tuesday, July 20, 2010 - 9:16 am

OK, I was able to get into the machine remotely for a few minutes. I
think the patch applied correctly and the machine cold booted cleanly.
(My wife powered it up for me.)

I'll do more boots later when you confirm everything looks reasonable.
I don't see any print statement in the patch so I don't know what to
look for. I'm attaching a dmesg file for you to review.

Assuming I did this right then everything seems good so far.

Cheers,
Mark

c2stable linux # patch --verbose -p1 <~mark/Downloads/ata_piix-sidpr-lock.patch
Hmm...  Looks like a unified diff to me...
The text leading up to this was:
--------------------------
|diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
|index 7409f98..3971bc0 100644
|--- a/drivers/ata/ata_piix.c
|+++ b/drivers/ata/ata_piix.c
--------------------------
Patching file drivers/ata/ata_piix.c using Plan A...
Hunk #1 succeeded at 158.
Hunk #2 succeeded at 952.
Hunk #3 succeeded at 968.
Hunk #4 succeeded at 1573.
done
c2stable linux #
From: Mark Knecht
Date: Wednesday, July 21, 2010 - 1:54 pm

Tejun,
   Looks like I had a failure today. First one in weeks and only the
3rd or 4th boot with this newer patch file. One of the two drives
making a RAID0 wasn't found so /dev/md11 (constructed from /dev/sdd
and /dev/sde) couldn't be started. I did a cold reboot and the drive
was found.

   If it matters, and it probably doesn't, the failure came on a boot
which had a scheduled fsck of /dev/md5 - my main / drive. I
don't see how that would make a difference but I figure why leave the
info out. That's why the times are so much larger in the dmesg file.
(I think)

   dmesg attached. I patched the Gentoo kernel if it makes a
difference, same as I did with the earlier patch.

mark@c2stable ~ $ uname -a
Linux c2stable 2.6.34-gentoo-r2 #1 SMP PREEMPT Sun Jul 18 14:09:48 PDT
2010 x86_64 Intel(R) Core(TM) i7 CPU X 980 @ 3.33GHz GenuineIntel
GNU/Linux
mark@c2stable ~ $

Sorry,
Mark
From: Paul Check
Date: Wednesday, July 21, 2010 - 2:22 pm

That's unfortunate. FYI I have continued to be trouble free, but my
processor is a bit weaker than Mark's, although I would find it surprising
that this would cause a problem.  Also, FYI Mark, I have 12GB of Corsair
RAM, and have it bumped up to the Intel XMP profile.



--

From: Tejun Heo
Date: Thursday, July 22, 2010 - 5:39 am

Hello,


Hmmm... that's weird.  Can you please make sure the patch is actually
applied?  Adding a printk("XXX patch applied!\n") near other changes
usually is easy enough.  Also, can you please apply resume-dbg-1.patch
too and reproduce the failure and post log?

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Monday, August 2, 2010 - 3:07 pm

Hi Tejun,
   I'm finally home and trying to get back to this. I'm really a bad
programmer so I don't know what I've done wrong but it seems patch
isn't happy with me.

c2stable linux # patch --dry-run -p1 <../ata_piix-sidpr-lock.patch
patching file drivers/ata/ata_piix.c
patch: **** malformed patch at line 13:

c2stable linux #

   Here's the change I tried to make to a copy of the file:

c2stable linux # cat ../ata_piix-sidpr-lock.patch
diff --git a/drivers/ata/ata_piix.c b/drivers/ata/ata_piix.c
index 7409f98..3971bc0 100644
--- a/drivers/ata/ata_piix.c
+++ b/drivers/ata/ata_piix.c
@@ -158,6 +158,7 @@ struct piix_map_db {
 struct piix_host_priv {
        const int *map;
        u32 saved_iocfg;
+       spinlock_t sidpr_lock;  /* FIXME: remove once locking in EH is fixed */
+       printk("MWK - ata_sidpr patch applied!\n");
        void __iomem *sidpr;
 };

@@ -951,12 +952,15 @@ static int piix_sidpr_scr_read(struct ata_link *link,
                               unsigned int reg, u32 *val)
 {
        struct piix_host_priv *hpriv = link->ap->host->private_data;
+       unsigned long flags;

        if (reg >= ARRAY_SIZE(piix_sidx_map))
                return -EINVAL;

+       spin_lock_irqsave(&hpriv->sidpr_lock, flags);
        piix_sidpr_sel(link, reg);
        *val = ioread32(hpriv->sidpr + PIIX_SIDPR_DATA);
+       spin_unlock_irqrestore(&hpriv->sidpr_lock, flags);
        return 0;
 }

@@ -964,12 +968,15 @@ static int piix_sidpr_scr_write(struct ata_link *link,
                                unsigned int reg, u32 val)
 {
        struct piix_host_priv *hpriv = link->ap->host->private_data;
+       unsigned long flags;

        if (reg >= ARRAY_SIZE(piix_sidx_map))
                return -EINVAL;

+       spin_lock_irqsave(&hpriv->sidpr_lock, flags);
        piix_sidpr_sel(link, reg);
        iowrite32(val, hpriv->sidpr + PIIX_SIDPR_DATA);
+       spin_unlock_irqrestore(&hpriv->sidpr_lock, flags);
        return 0;
 }

@@ -1566,6 +1573,7 @@ ...
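
The read and write hunks above take `sidpr_lock` around two steps: `piix_sidpr_sel()` followed by the access to `PIIX_SIDPR_DATA`. Assuming SIDPR behaves as an index/data register pair, where the select step decides which register the next data access touches, the two steps have to happen as one unit. The toy Python model below illustrates that idea only. It is not kernel code, the class and method names are hypothetical, and `threading.Lock` merely stands in for the kernel spinlock (it does not model disabling interrupts).

```python
import threading


class IndexDataPort:
    """Toy model of an index/data register pair (hypothetical names)."""

    def __init__(self, registers):
        self._registers = dict(registers)
        self._index = None
        self._lock = threading.Lock()

    def select(self, reg):
        # Step 1: choose which register the next data access refers to.
        self._index = reg

    def read_data(self):
        # Step 2: read whatever register is currently selected.
        return self._registers[self._index]

    def read_locked(self, reg):
        # Select and read under one lock so no other caller can change
        # the selection in between.
        with self._lock:
            self.select(reg)
            return self.read_data()


port = IndexDataPort({"SStatus": 0x123, "SControl": 0x300})

# Unprotected sequence: a second caller selects SControl after the first
# caller selected SStatus, so the first caller's read returns the wrong
# register's value.
port.select("SStatus")
port.select("SControl")          # interleaved access from elsewhere
stale = port.read_data()         # value of SControl, not SStatus

# Protected sequence: selection and read cannot be separated.
correct = port.read_locked("SStatus")
```

The interleaving is written out sequentially here so the effect is deterministic. In a real system it would only happen when two contexts race.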
From: Randy Dunlap
Date: Tuesday, August 3, 2010 - 11:41 am

Whenever the patch file was saved on this system, line 13 of it was
split (probably by an email client).  Whenever I see this, I just
join (merge) that line and the next one and try again... sometimes


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--

From: Mark Knecht
Date: Tuesday, August 3, 2010 - 11:47 am

Randy,
   Could very well be what happened. I added line 13 (the printk) by hand

<SNIP - ORIGINAL PATCH FILE>
struct piix_host_priv {
       const int *map;
       u32 saved_iocfg;
+       spinlock_t sidpr_lock;  /* FIXME: remove once locking in EH is fixed */
        void __iomem *sidpr;
};

<SNIP - MY CHANGE BY HAND>
struct piix_host_priv {
       const int *map;
       u32 saved_iocfg;
+       spinlock_t sidpr_lock;  /* FIXME: remove once locking in EH is fixed */
+       printk("MWK - ata_sidpr patch applied!\n");
       void __iomem *sidpr;
};

Maybe I should have just put it on the same line as the previous
spinlock command?

I'll play with it and see if I can get it working.

Thanks,
Mark
--

From: Randy Dunlap
Date: Tuesday, August 3, 2010 - 11:55 am

Ah, so you added a line to a patch file.  That means that the patch
block header must be changed from something like this:

@@ -964,12 +968,15 @@

to like this:

@@ -964,12 +968,16 @@

Any text following the second "@@" is just a comment so it does not matter.

The ,12  ,15   ,16   are all line counts. The ",12" is the number of lines
in the before version of the patch.  The ",15" or ",16" is the number of lines
in the after version of the patch, so you would need to increase it by 1 if
you added one line.   Or you can just put the printk on the same line as another
part of the patch and it won't matter.  :)
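
The counting rule described above can be written down as a short Python sketch. This is an illustration, not a tool mentioned in the thread, and the function name `recount_hunk_header` is hypothetical. It assumes a unified diff: lines starting with "+" count only toward the after side, lines starting with "-" count only toward the before side, and everything else (context lines, including blank ones) counts toward both.

```python
import re

HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,\d+)? \+(\d+)(?:,\d+)? @@(.*)$")


def recount_hunk_header(header, body_lines):
    """Rebuild a hunk header so its line counts match the hunk body."""
    match = HUNK_HEADER.match(header)
    if match is None:
        raise ValueError(f"not a unified-diff hunk header: {header!r}")
    old_start, new_start, trailing = match.groups()
    # "\ No newline at end of file" markers are not counted on either side.
    lines = [line for line in body_lines if not line.startswith("\\")]
    old_count = sum(1 for line in lines if not line.startswith("+"))
    new_count = sum(1 for line in lines if not line.startswith("-"))
    return f"@@ -{old_start},{old_count} +{new_start},{new_count} @@{trailing}"
```

Applied to the first hunk of the hand-edited patch (six lines on the before side, eight on the after side once the added printk is counted), this would turn `@@ -158,6 +158,7 @@` into `@@ -158,6 +158,8 @@`. Fixing the count only makes the patch apply. A printk call placed inside a struct definition still would not compile.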


---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
--

From: Mark Knecht
Date: Wednesday, August 4, 2010 - 9:16 am

Thanks Randy. This helped.

OK - so I'm going to hang fire on this problem right now. As part of
this process I've updated to a 2.6.35-gentoo kernel and I'm not seeing
the problem at all so far. I've booted about 25 times over the last 3
days - warm and cold boots mixed - and so far no failures. While I've
had times in the past where earlier kernels didn't exhibit the problem
for days I don't remember a time where I got this many good boots in a
row.

As my problem didn't seem to be exactly the same as Paul's, and since
Tejun expressed surprise that his fix for Paul's issue didn't fix
mine, maybe my problem is actually something different and (hopefully)
already fixed.

I'll continue to use the machine and watch for any failures. If I get
one I'll reengage the list and go into debug mode.

Cheers,
Mark
--

From: Mark Knecht
Date: Tuesday, August 3, 2010 - 11:53 am

Thanks Jim. I'll give it a shot.

Cheers,
Mark
--

From: Jim Paris
Date: Tuesday, August 3, 2010 - 11:49 am

Tejun asked Mark to add a printk, and Mark added it directly to the
patch.  Mark, just apply the original patch as-is, and then add the
printk to the source code in ata_piix.c.  You should add it somewhere
like the piix_init_one function, e.g. right before the "Save IOCFG"
comment around line 1575.

--

From: Mark Knecht
Date: Tuesday, July 20, 2010 - 1:52 pm

Hey Paul. Glad it worked for you as it did for me.

Was your hardware in any way similar to mine?

- Mark
--

From: Paul Check
Date: Tuesday, July 20, 2010 - 2:19 pm

I missed your hardware. I have an ASUS P6T Deluxe V2 board with 4 drives.


--

From: Mark Knecht
Date: Tuesday, July 20, 2010 - 2:26 pm

What processor? I've got the Core i7-980X. I've wondered if this
problem showed up for me because possibly the 12 thread processor
moves through some part of boot setup pretty quickly.

Mine is an Asus Rampage II Extreme with five 500GB drives.

Whatever the reason great thanks to Tejun for fixing it. I keep
wanting to add an 'r' to his last name to get 'Hero'!

Cheers,
Mark
--

From: Paul Check
Date: Tuesday, July 20, 2010 - 4:05 pm

Oh, yeah, my processor is the i7 975. That's 4 cores 8 threads, right? 
Yes, nice work Tejun, thanks.  I'll have to run custom kernels until this


--

From: Stan Hoeppner
Date: Saturday, July 3, 2010 - 11:56 am

Please provide the make/model of the PC.  If it's whitebox or DIY please
provide make/model of PSU, mobo and CPU.  How many USB peripherals are powered
by the PC?  Are you powering a water cooling loop pump from the PC's power
supply?  Is this PC in a temperature controlled environment (A/C)?

-- 
Stan
--

From: Mark Knecht
Date: Saturday, July 3, 2010 - 12:21 pm

Built it myself.

Asus Rampage II Extreme motherboard
12GB Crucial DRAM currently installed (Holds 24GB)
Intel Core i7-980X CPU @ 3.33GHz
Palit nVidia 9500GT-based graphics card
Sony Nec Optiarc AD-7241S-0B 24X Dual Layer DVD+/-RW SATA Drive
(5x) WD5002ABYS RE3 Enterprise Class 500GB hard drives

No external devices other than monitor, mouse, keyboard and the USB
interface to the UPS are attached. No USB, 1394 or eSATA are attached
at this time.

It's all powered by:

Corsair CMPSU-750TX 750-Watt TX Series 80 Plus Certified Power Supply

Air cooled using the stock Intel fan that came with the processor and
sitting in a home office environment.

The machine draws (steady state) about 250-275W according to both the
UPS it's hooked to as well as my trusty Kill-a-Watt. What it might
draw transient at power on while drives are spinning up I wouldn't
hazard a guess but it does seem to be well below the rating of the
supply. The PSU actually has something like 8 or 10 SATA power
connections, not that that means anything. I'm using 6. (1 CDRW, 5
drives)

Note two things:

1) All the drives are always reported by BIOS at boot time. Now, that
doesn't guarantee that the drives spin up. It may only mean they can
be read by BIOS, but they are there as far as I can tell. They show up
in the boot screens and in BIOS itself if I drop in to play with
settings.

2) Whatever state the machine comes up in - drives recognized or not -
it will run forever in that state under some pretty heavy loads so it
isn't like the PSU can't completely do the job. It could possibly be
marginal though.

QUESTION: There are some settings in BIOS for delaying the drive. (Or
something. I'm using the machine and not in BIOS) There were settings
from 0 to 35 seconds if I remember correctly. Possibly I should try
setting each drive to a different value to stagger power up?

If you need more info or have other ideas please let me know.

Thanks,
Mark
--

From: Stan Hoeppner
Date: Saturday, July 3, 2010 - 12:42 pm

If that PSU meets published specs you shouldn't need delayed spin up with

Your answers here should have pretty much eliminated hardware issues as the
cause, unless that particular mobo has BIOS or other issues I'm unaware of.

I've found it's always best to ask about hardware with this kind of report
just to eliminate possibilities.  All that gear is good quality stuff.  If the
problem is due to hardware, it's because one of your components is defective,
but we don't see evidence of that at this point.

Also, TTBOMK, if a SATA drive motor doesn't spin up, the drive firmware won't
report the drive as ready upstream, thus the BIOS won't list the drive.

-- 
Stan


--

From: Mark Knecht
Date: Saturday, July 3, 2010 - 12:57 pm

I've not dropped into BIOS yet as the machine is in use but from the
Asus manual it appears the delay is not on a drive by drive basis so I

An off-list response suggested possibly setting some drive jumpers on
non-boot drives to power up in standby. Apparently the kernel will
then spin up those drives later? If I cannot stagger the drives in
BIOS then I will likely try that. Technically I guess I only need
/boot on sda to get the kernel booted. The mdadm RAID1 on sda/sdb/sdc
could start slightly later, and technically the RAID0 on sdd/sde could
start very late as there are only VMWare images on that drive.

Cheers,
Mark
--

From: Mark Knecht
Date: Saturday, July 3, 2010 - 3:31 pm

OK, I don't know if this is related but so far I cannot get the
machine to boot if I set BIOS SATA configuration to AHCI. I believe
that I have AHCI support as well as SATA support built into the kernel
but when I set the BIOS to AHCI the machine just hangs saying it finds
no medium. I assume that means no hard drive. I have to set SATA
support to enhanced and then IDE to get the machine to boot at all.

I did notice that my earlier kernel had the deprecated ATA/ATAPI
support selected so I removed that from the kernel but it didn't
change the results.

The lshw listings below show the SATA controllers. The 20360/363 is
(apparently) the eSATA controller going to the front panel. Nothing is
attached there. The chipset supposedly handles 6 SATA ports - they
seem to be arranged x4 & x2. TTBOMK I have the CDROM and the 3-drive
RAID1 on the 4 port controller and the RAID0 - which fails to be
recognized more often - on the 2 port controller. I don't know how to
prove that though.

I'm unclear how mature the SATA support is for this chipset. Is there
a chance that this is some bit that's not being set reliably?

Also, I misspoke earlier about the graphics adapter. It's actually an
ATI Radeon 5770 in this machine. The 9500GT is in another machine.

Thanks,
Mark

c2stable ~ # lshw -short -class storage
H/W path               Device     Class       Description
=========================================================
/0/100/1c.4/0                     storage     20360/20363 Serial ATA Controller
/0/100/1c.4/0.1                   storage     20360/20363 Serial ATA Controller
/0/100/1f.2            scsi2      storage     82801JI (ICH10 Family) 4 port SATA IDE Controller
/0/100/1f.5            scsi5      storage     82801JI (ICH10 Family) 2 port SATA IDE Controller
c2stable ~ # lshw -short | grep SATA
/0/100/1f.2            scsi2      storage     82801JI (ICH10 Family) 4 port SATA IDE Controller
/0/100/1f.5            scsi5      storage     82801JI (ICH10 Family) 2 port SATA IDE ...
From: Thomas Fjellstrom
Date: Saturday, July 3, 2010 - 6:25 pm

If it wasn't for all that, I'd have suspected a small power issue.

I recently ran into an odd problem of how my PSU has its rails split up 
between different cables. If I had all the hardware in the box hooked up 
with the normal "black" cables from my PSU I'd get errors from one of my 2TB
WD Greens. But if I moved the drives, or the GPU to the "red" plug (modular
PSU), everything worked. I assume the rails on my PSU were split up
very poorly, causing the main 12v rail to be shared between the
CPU, all add-on devices, AND the non-modular PCI-E cable for the GPU
(8800GTS), causing that rail's voltage to drop below spec.

But it sounds like maybe your SATA chipset might not be supported in AHCI
mode (which you really want), and might have issues in IDE mode. But given
that it's an Intel chipset you'd think it'd have perfect support :o So I
don't know.


-- 
Thomas Fjellstrom
tfjellstrom@strangesoft.net
--

From: Tejun Heo
Date: Sunday, July 4, 2010 - 11:19 pm

Hello,


That's odd.  Can you please attach kernel boot log w/ ahci mode?
Booting a recent live CD and saving boot log from there should do the
trick.

Thanks.

-- 
tejun
--

From: Mark Knecht
Date: Monday, July 5, 2010 - 9:48 am

Hi Tejun,
   Thanks for the help.

   I tried a Gentoo install CD from last March. It's a 2.6.31 type
kernel. Problem is the buffer depths are not big enough to capture the
complete dmesg contents and I don't know a command line option to make
it larger on the fly. If you know of one that's got a larger buffer -
or a command to increase it at boot time - then let me know and I'll
try again.

   I'm attaching what I was able to catch for both AHCI and IDE
settings in BIOS for the storage configuration option.

   It seems to me that even in AHCI mode the machine does see all the
hard drives. Maybe there's something about my boot partition that's
having problems in AHCI mode only? If it sees /dev/sda then why
wouldn't it find grub and at least show a grub menu?

Thanks,
Mark
From: Robert Hancock
Date: Monday, July 5, 2010 - 4:59 pm

On some machines changing from IDE to AHCI messes up the boot order 
selection in the BIOS - you may have to switch to AHCI, save settings, 
reboot, go back into the BIOS and then make sure the disk boot order 
settings are correct (whatever drive grub is installed on needs to be 
first).
--

From: Mark Knecht
Date: Monday, July 5, 2010 - 9:16 pm

That's a very interesting idea Robert. Thanks.

I'll have to be careful. I have 5 identical drives in the machine so
figuring out which is which when I'm in BIOS might be a bit of a
trick.

I'll give it a go and report back.

Cheers,
Mark
--

From: Stan Hoeppner
Date: Monday, July 5, 2010 - 11:13 pm

Makes you yearn for the good old days of SCSI don't it? :(

-- 
Stan
--

From: Sander
Date: Tuesday, July 6, 2010 - 4:26 am

You can install grub on the mbr of all disks and not worry about the
order. Especially if you work with UUIDs.

	Sander

-- 
Humilis IT Services and Solutions
http://www.humilis.net
--

From: Tejun Heo
Date: Monday, July 5, 2010 - 11:32 pm

The ahci driver is working fine.  If the system can't boot w/ ahci
mode, the problem is probably on the bios side and most likely caused
by getting the boot device wrong as Robert suggested.  Choosing one
after another until it boots should do the trick.

Thanks.

-- 
tejun
--
