Re: 2.6.23-rc7-mm1 AHCI ATA errors -- won't boot

Previous thread: [PATCH] make lockdep happy with r/o bind mounts by Dave Hansen on Monday, September 24, 2007 - 10:46 am. (3 messages)

Next thread: IDE broken on Pegasos PPC platform by Chuck Ebbert on Monday, September 24, 2007 - 12:10 pm. (7 messages)
From: Berck E. Nash
Date: Monday, September 24, 2007 - 11:02 am

Greetings,

I get a few million of these on boot-- the system never actually boots.
Works fine in 2.6.23-rc7.

[   50.456012] ata2.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
[   50.462484] ata2.00: irq_stat 0x40000001
[   50.466441] ata2.00: cmd e5/00:00:00:00:00/00:00:00:00:00/a0 tag 0 cdb 0x0 data 0
[   50.466442]          res 51/04:00:01:01:80/00:00:00:00:00/a0 Emask 0x1 (device error)
[   50.481914] ata2.00: status: {DRDY ERR }
[   50.485876] ata2.00: error: {ABRT }
[   50.489533] ata2.00: configured for UDMA/133
[   50.493839] ata2: EH complete
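
For anyone reading along, the "cmd" and "res" lines in the log above can be decoded by hand. What follows is only a minimal Python sketch, assuming the usual libata layout (the first byte of "cmd" is the command opcode, the first two bytes of "res" are the Status and Error registers) and the standard ATA bit assignments. The helper names are made up for illustration and are not part of any kernel tool.

```python
# Bits of the ATA Status and Error registers that libata names in its logs.
STATUS_BITS = [(0x80, "BSY"), (0x40, "DRDY"), (0x20, "DF"), (0x08, "DRQ"), (0x01, "ERR")]
ERROR_BITS = [(0x80, "ICRC"), (0x40, "UNC"), (0x10, "IDNF"), (0x04, "ABRT")]

# Only the two opcodes that show up in this thread.
OPCODES = {0xE5: "CHECK POWER MODE", 0xEC: "IDENTIFY DEVICE"}


def flags(value, table):
    """Return the names of all bits from `table` that are set in `value`."""
    return [name for mask, name in table if value & mask]


def first_two_bytes(field):
    """Split 'aa/bb:cc:...' and return (aa, bb) as integers."""
    head, rest = field.split("/")[:2]
    return int(head, 16), int(rest.split(":")[0], 16)


# Values copied from the log lines above.
opcode, _feature = first_two_bytes("e5/00:00:00:00:00/00:00:00:00:00/a0")
status, error = first_two_bytes("51/04:00:01:01:80/00:00:00:00:00/a0")

print(OPCODES.get(opcode, hex(opcode)))
print(flags(status, STATUS_BITS), flags(error, ERROR_BITS))
```

With the values from the log this names the command as CHECK POWER MODE and the result as DRDY and ERR with ABRT set, which is the same thing the "status:" and "error:" lines report: the device aborted the command.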

I've attached the entire dmesg and lspci.

Berck
From: Jeff Garzik
Date: Monday, September 24, 2007 - 6:37 pm

Are you "git-friendly"?  A few quick kernel compiles and reboots would 
help us narrow down the problem, given that it's a reproducible regression.

The first step would be to clone the "upstream" branch of 
git://git.kernel.org/pub/scm/linux/kernel/git/jgarzik/libata-dev.git

and see if the problem is reproducible there.  If yes, then you have 
narrowed down the problem to something my ATA devel tree has introduced 
into -mm.

Once the blame has been squarely fixed upon me :) you can use git-bisect 
to locate the precise change that broke your setup.

Info at http://kerneltrap.org/node/11753 or 
http://www.kernel.org/pub/software/scm/git/docs/v1.3.3/howto/isolate-bugs-with-bisect.txt
or "man git-bisect"

	Jeff
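
For readers new to bisecting, the idea behind git-bisect is a binary search between a known-good and a known-bad revision. The Python sketch below is a simplified model only: it assumes a strictly linear history and a hypothetical is_bad() test (standing in for "compile, reboot, see whether the ATA errors appear"), whereas git itself works on the full commit graph.

```python
def first_bad(commits, is_bad):
    """Find the first bad commit in a linear history.

    `commits` is ordered oldest to newest, commits[0] is known good and
    commits[-1] is known bad.  `is_bad` is a hypothetical callback that
    builds and boots one revision and reports whether it fails.
    """
    good, bad = 0, len(commits) - 1
    while bad - good > 1:
        mid = (good + bad) // 2
        if is_bad(commits[mid]):
            bad = mid      # breakage is at mid or earlier
        else:
            good = mid     # breakage is after mid
    return commits[bad]


# Toy history: 100 numbered revisions, breakage introduced at revision 37.
tested = []

def fails(rev):
    tested.append(rev)
    return rev >= 37

history = list(range(100))
culprit = first_bad(history, fails)
print(culprit, "found after", len(tested), "test boots")
```

The point of the halving is that the number of kernels to build and boot grows with the logarithm of the number of commits, so even a few hundred changes need fewer than ten reboots.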


-

From: Berck E. Nash
Date: Tuesday, September 25, 2007 - 11:14 am

Nope, you're off the hook.  The libata tree works great, so it must be
something else in -mm conflicting.
-

From: Jens Axboe
Date: Tuesday, September 25, 2007 - 11:21 am

Can you try 2.6.23-rc8 plus this patch:

http://brick.kernel.dk/git-block.patch.bz2

and see if that works?

-- 
Jens Axboe

-

From: Berck E. Nash
Date: Tuesday, September 25, 2007 - 11:28 am

Whoops, sorry!  I just lied.  I'm a git newbie, and failed to actually
get the "upstream" branch the first time, so rc8 is clean, but it fails
when I actually pull the upstream branch.  I'll git bisect and get back
to you.

Berck
-

From: Jens Axboe
Date: Tuesday, September 25, 2007 - 11:32 am

OK, you probably realize this, but you can forget about the git-block
testing for now then.

-- 
Jens Axboe

-

From: Berck E. Nash
Date: Tuesday, September 25, 2007 - 12:29 pm

Okay, here's the problem:

268fe6f9f15551be9abedd44a237392675d529d5 is first bad commit
commit 268fe6f9f15551be9abedd44a237392675d529d5
Author: Jeff Garzik <jeff@garzik.org>
Date:   Fri Sep 21 07:09:36 2007 -0400

    [libata] SCSI: simple TEST UNIT READY simulation

    It's trivial to ping the device, and that's a much more sane behavior
    than no-op.

df6d21f7ce56a4e796f8f856c1f647b0395ab4df M      drivers

Berck
-

From: Jeff Garzik
Date: Tuesday, September 25, 2007 - 1:40 pm

Thanks for debugging!

Can you tell me something about this device?

[   49.045635] ata2.00: ATA-6: Config  Disk, RGL10364, max UDMA/133
[   49.051677] ata2.00: 640 sectors, multi 1: LBA
[   49.056321] ata2.00: configured for UDMA/133

It seems like it does not support the 'check power mode' command.

Can you post a text file attachment, containing the output of 'hdparm 
--Istdout' ?

	Jeff



-

From: Berck E. Nash
Date: Tuesday, September 25, 2007 - 3:07 pm

No problem.  The device in question is a Western Digital Raptor WD360GD
36.7GB 10,000 RPM Serial ATA150 Hard Drive.

hdparm output attached.

Berck
-

From: Berck E. Nash
Date: Tuesday, September 25, 2007 - 3:46 pm

Whoops, it really is attached this time.

From: Jeff Garzik
Date: Tuesday, September 25, 2007 - 6:21 pm

Does the attached patch change behavior at all?  You should be able to 
apply it on top of libata-dev.git#upstream or -mm.

If there are still problems, an updated dmesg (w/ the attached patch) 
and output from enabling ATA_DEBUG (include/linux/libata.h) would be 
very helpful.

Thanks!

	Jeff


From: Berck E. Nash
Date: Tuesday, September 25, 2007 - 7:25 pm

Still broken, dmesg with ATA_DEBUG defined, attached.
From: Jeff Garzik
Date: Tuesday, September 25, 2007 - 7:33 pm

Great, this will be useful output.  It will probably be a couple days 
before my next patch.  In the meantime, you can extract the bad commit 
to a patch

	git-diff-tree -p 268fe6f9f15551be9abedd44a237392675d529d5 > \
		/tmp/patch

and then revert it locally in your kernel tree

	patch -sp1 -R < /tmp/patch

to temporarily work around this.

I will definitely make sure this is either fixed or reverted before it 
goes upstream to Linus.

Thanks,

	Jeff


-

From: Jeff Garzik
Date: Tuesday, September 25, 2007 - 9:40 pm

Would it also be possible for you to send along 'hdparm --Istdout' 
output for your config disk thingy, /dev/sdd ?

	Jeff



-

From: Bernd Schmidt
Date: Wednesday, September 26, 2007 - 3:03 am

One of these appears in my system as well (ASUS P5W-DH Deluxe 
mainboard).  Here's the hdparm output:

/dev/sdb:
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 3030 3030 3030 305f 5f5f 5f5f
5f5f 5f5f 5f30 5f41 0003 3e00 0004 5247
4c31 3033 3634 436f 6e66 6967 2020 4469
736b 2020 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8001
0000 2f00 4000 0200 0000 0007 3fff 0010
003f fc10 00fb 0101 0280 0000 0000 0407
0003 0078 0078 0078 0078 0000 0000 0000
0000 0000 0000 0000 0201 0000 0000 0000
007e 001b 0068 5060 4000 0000 1000 4000
407f 0000 0000 0000 fffe 0000 c0fe 0000
0000 0000 0000 0000 0001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0001 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0017 2040
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 baa5

Since about 2.6.17 or 2.6.18, it has been causing long delays while booting:
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: qc timeout (cmd 0xec)
ata2.00: failed to IDENTIFY (I/O error, err_mask=0x5)
ata2: port is slow to respond, please be patient (Status 0x80)
ata2: COMRESET failed (errno=-16)
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: ATA-6: Config  Disk, RGL10364, max UDMA/133
ata2.00: 640 sectors, multi 1: LBA
ata2.00: configured for UDMA/133
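
The hex dump above is the raw IDENTIFY DEVICE data, and decoding it reproduces what the kernel prints in the boot log. A minimal Python sketch, assuming the standard ATA word layout (serial number in words 10-19, firmware revision in words 23-26, model in words 27-46, LBA28 sector count in words 60-61) and using only the first 64 words of the dump. The helper name ata_string is made up for illustration.

```python
# First 64 words of the 'hdparm --Istdout' output posted above.
IDENTIFY_HEX = """
0040 3fff c837 0010 0000 0000 003f 0000
0000 0000 3030 3030 3030 305f 5f5f 5f5f
5f5f 5f5f 5f30 5f41 0003 3e00 0004 5247
4c31 3033 3634 436f 6e66 6967 2020 4469
736b 2020 2020 2020 2020 2020 2020 2020
2020 2020 2020 2020 2020 2020 2020 8001
0000 2f00 4000 0200 0000 0007 3fff 0010
003f fc10 00fb 0101 0280 0000 0000 0407
"""

words = [int(w, 16) for w in IDENTIFY_HEX.split()]


def ata_string(words, first, last):
    """Decode an ATA string field: two ASCII characters per 16-bit word,
    high byte first, padded with spaces."""
    raw = b"".join(w.to_bytes(2, "big") for w in words[first:last + 1])
    return raw.decode("ascii").strip()


serial = ata_string(words, 10, 19)
firmware = ata_string(words, 23, 26)
model = ata_string(words, 27, 46)
lba28_sectors = words[60] | (words[61] << 16)

print(repr(model), repr(firmware), lba28_sectors)
```

The decoded model and firmware strings and the sector count line up with the "ATA-6: Config  Disk, RGL10364" and "640 sectors" lines in the boot log.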


Bernd
-

From: Berck E. Nash
Date: Wednesday, September 26, 2007 - 7:22 am

And yup, same problem with the painful boot delays since 2.6.18.  Tejun
indicated that a fix would get merged with 2.6.23, but that didn't
happen.  Here's hoping something makes it into .24!

Berck
-

From: Tejun Heo
Date: Wednesday, September 26, 2007 - 6:05 pm

Yeah, it is the sil4726 virtual device which is really crappy as an ATA
device.  About the fix, I thought PMP support would fix it but the
controller on P5W-DH doesn't support PMP.  It can only talk to the
virtual device or the device attached to the first port depending on how
the PMP chip is configured.  It seems we'll have to blacklist the
mainboard and skip or use a modified reset sequence on the affected port,
so that's why the fix was delayed.  I'm currently on the road but I'll
look into it when I get back (next week).

Thanks.

-- 
tejun

-

From: Berck E. Nash
Date: Wednesday, September 26, 2007 - 7:19 am

Sure, just don't ask me what it is!  (I've generally assumed that
writing to it would be a bad idea.)

Berck
From: Jeff Garzik
Date: Wednesday, October 3, 2007 - 2:18 pm

FWIW I haven't had time to debug this, so I'm going to simply revert the 
patch, and make sure it does not make it into 2.6.24.

	Jeff



-
