I could not agree more.
CC'ing the relevant mailing lists to see if someone out there has any idea
what more we could do as this has been affecting you (more so than myself,
but I would still like to get some sort of resolution as well, as it still
happens to me too):
Similar, but not the same issue:
Sep 17 20:20:05 p34 kernel: [1422169.440538] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 17 20:20:05 p34 kernel: [1422169.440549] ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
Sep 17 20:20:05 p34 kernel: [1422169.440551] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 17 20:20:05 p34 kernel: [1422169.440556] ata5.00: status: { DRDY }
Sep 17 20:20:05 p34 kernel: [1422169.440561] ata5: hard resetting link
Sep 17 20:20:06 p34 kernel: [1422169.744980] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 17 20:20:06 p34 kernel: [1422169.770448] ata5.00: configured for UDMA/133
Sep 17 20:20:06 p34 kernel: [1422169.770461] ata5: EH complete
(2.6.23.3) above
--
About ata1:0 problem, as reported in the bugzilla bug: I would try to disable NCQ to see if it helps. Your disks firmware might not fully support it. You can either add the parameter "libata.force=noncq" when loading your kernel, or set queue_depth to 1 for all the Seagate drives behind the Marvell MV88SX6081 controller. About ata5:0 , someone - in user space probably - is trying to do a SMART ENABLE operation, but the device ignores it. I don't know which device you are using, but I assume it does not support ATA SMART feature set. Timeout is an acceptable but not a nice way to answer, a cancel would have been better; check if there is a firmware upgrade for your device. Gwendal. --
I disabled NCQ and same thing...Just says DMA freeze instead of NCQ freeze... ----- Original Message ----- From: "Gwendal Grignou" <gwendal@google.com> To: "Justin Piszcz" <jpiszcz@lucidpixels.com> Cc: "Brian Rademacher" <rad@radfiles.net>; <linux-ide@vger.kernel.org>; <linux-raid@vger.kernel.org>; <linux-kernel@vger.kernel.org> Sent: Tuesday, September 23, 2008 12:14 PM --
You certainly called the SMART issue, I was wondering why a new distribution install on some older hardware was getting all the errors, clearly the Fedora "smartd" doesn't check SMART capability before trying to enable the feature. Oddly the drive on which I see this does reply to SMART requests, so the firmware must be "semi-functional." Not a problem, in my case the drive is just used for testing handling of hot swap, and has no data of any value. -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark --
Can you post full kernel log including the boot messages and the error messages? Also, please attach the output of hdparm -I on the drive which fails the smart command. (cc'ing Bruce, hi!) Bruce, this is the second report I see about drive timing out SMART ENABLE OPERATIONS. Does anything ring a bell? Thanks. -- tejun --
Don't know if this is the original poster's problem, but if the drive
is spun down, then enabling SMART or trying to read SMART attributes
causes the drive to spin up and the command is delayed until this has
occurred.
The fix is to increase the timeout given to scsi_execute() in
drivers/ata/libata-scsi.c.
ie, current code (2.6.26.5) is:
/* Good values for timeout and retries? Values below
from scsi_ioctl_send_command() for default case... */
cmd_result = scsi_execute(scsidev, scsi_cmd, data_dir, argbuf, argsize,
sensebuf, (10*HZ), 5, 0);
Should be changed to:
/* Good values for timeout and retries? Values below
from scsi_ioctl_send_command() for default case... */
cmd_result = scsi_execute(scsidev, scsi_cmd, data_dir, argbuf, argsize,
sensebuf, (30*HZ), 5, 0);
Using a 1TB Hitachi hard drive, this command times out because it
takes this drive about 15 seconds to spin up. Virtutally all hard
drives spin up in less than 30 sec, but perhaps make this higher in
case there are slower drives out there?
Cheers,
Tom
--
Velociraptor 10k drive here (2.6.26.5):
Sep 30 15:55:06 p34 kernel: [420781.333179] ata6.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x6 frozen
Sep 30 15:55:06 p34 kernel: [420781.333189] ata6.00: cmd
b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
Sep 30 15:55:06 p34 kernel: [420781.333190] res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 30 15:55:06 p34 kernel: [420781.333194] ata6.00: status: { DRDY }
Sep 30 15:55:06 p34 kernel: [420781.333200] ata6: hard resetting link
Sep 30 15:55:06 p34 kernel: [420781.638589] ata6: SATA link up 3.0 Gbps (SStatus
123 SControl 300)
Sep 30 15:55:06 p34 kernel: [420781.662166] ata6.00: configured for UDMA/133
Sep 30 15:55:06 p34 kernel: [420781.669416] sd 5:0:0:0: [sdf] Write Protect is
off
Sep 30 15:55:06 p34 kernel: [420781.669416] sd 5:0:0:0: [sdf] Mode Sense: 00 3a
00 00
Sep 30 15:55:06 p34 kernel: [420781.669416] sd 5:0:0:0: [sdf] Write cache:
enabled, read cache: enabled, doesn't support DPO or FUA
Nothing wrong with the disk, it just happens... :( Linux/kernel bug?
It happens on multiple controllers, Intel, SiI, Marvell, does not seem to
matter.
SMART Self-test log structure revision number 1
Num Test_Description Status Remaining LifeTime(hours) LBA
_of_first_error
# 1 Short offline Completed without error 00% 2761 -
# 2 Short offline Completed without error 00% 2737 -
# 3 Extended offline Completed without error 00% 2714 -
# 4 Short offline Completed without error 00% 2689 -
# 5 Extended offline Completed without error 00% 2514 -
# 6 Short offline Completed without error 00% 2306 -
# 7 Short offline Completed without error 00% 2282 -
# 8 Short offline Completed without error 00% 2258 -
# 9 Short offline Completed without error 00% 2234 -
#10 Extended ...Hello Justin , I take it you've tried differant drive manufacturers ? Or even a differant drive of same manuf. ? Seeing as you've moved this same drive(?) across several chipsets & possibly mother boards , Leads me to beleive that the difficulty is either with the driver or the drive (if it is always the same drive or drive model) . Hth , JimL -- +------------------------------------------------------------------+ | James W. Laferriere | System Techniques | Give me VMS | | Network&System Engineer | 2133 McCullam Ave | Give me Linux | | babydr@baby-dragons.com | Fairbanks, AK. 99701 | only on AXP | +------------------------------------------------------------------+ --
It also occurs (I have seen it) on WD 750GiB drives on a different motherboard Other people have the same problem with Seagate and / other drives. This also occurs on Raptor 150s. Justin. --
This morning (On P965 board this time-- and on RAID1, not RAID5)
[469680.004654] ata2.00: status: { DRDY }
[469680.004660] ata2: hard resetting link
[469680.309567] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[469680.333461] ata2.00: configured for UDMA/133
[469680.333477] ata2: EH complete
[469680.333461] sd 1:0:0:0: [sdb] 586072368 512-byte hardware sectors (300069 MB)
[469680.340461] sd 1:0:0:0: [sdb] Write Protect is off
[469680.340461] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[469680.345461] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
=== START OF INFORMATION SECTION ===
Device Model: WDC WD3000GLFS-01F8U0
Serial Number: XX-XXXXXXXXXXXX
Firmware Version: 03.03V01
User Capacity: 300,069,052,416 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Wed Oct 1 07:11:23 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Perfectly good disk (and yes, I have swapped out cables as well), even tried
different a differnet server/psu (earlier version of the same motherboard)
as well):
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x000f 200 200 051 Pre-fail Always - 0
3 Spin_Up_Time 0x0003 204 196 021 Pre-fail Always - 2758
4 Start_Stop_Count 0x0032 100 100 000 Old_age Always - 36
5 Reallocated_Sector_Ct 0x0033 200 200 140 Pre-fail Always - 0
7 Seek_Error_Rate 0x000e 200 200 000 Old_age Always - 0
9 ...You're leaving the good part out. Please always attach full kernel log. Thanks. -- tejun --
Full log below: [ 0.000000] Linux version 2.6.26.5 (root@p34.internal.lan) (gcc version 4.3.1 (Debian 4.3.1-2) ) #1 SMP Tue Sep 9 04:07:56 EDT 2008 [ 0.000000] Command line: auto BOOT_IMAGE=2.6.26.5-1 ro root=902 [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000008f000 (usable) [ 0.000000] BIOS-e820: 000000000008f000 - 00000000000a0000 (reserved) [ 0.000000] BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved) [ 0.000000] BIOS-e820: 0000000000100000 - 00000000cf58f000 (usable) [ 0.000000] BIOS-e820: 00000000cf58f000 - 00000000cf59c000 (reserved) [ 0.000000] BIOS-e820: 00000000cf59c000 - 00000000cf639000 (usable) [ 0.000000] BIOS-e820: 00000000cf639000 - 00000000cf694000 (ACPI NVS) [ 0.000000] BIOS-e820: 00000000cf694000 - 00000000cf697000 (ACPI data) [ 0.000000] BIOS-e820: 00000000cf697000 - 00000000cf6ef000 (ACPI NVS) [ 0.000000] BIOS-e820: 00000000cf6ef000 - 00000000cf6f1000 (ACPI data) [ 0.000000] BIOS-e820: 00000000cf6f1000 - 00000000cf6f2000 (usable) [ 0.000000] BIOS-e820: 00000000cf6f2000 - 00000000cf6ff000 (ACPI data) [ 0.000000] BIOS-e820: 00000000cf6ff000 - 00000000cf700000 (usable) [ 0.000000] BIOS-e820: 00000000cf700000 - 00000000d0000000 (reserved) [ 0.000000] BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved) [ 0.000000] BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved) [ 0.000000] BIOS-e820: 0000000100000000 - 000000022c000000 (usable) [ 0.000000] Entering add_active_range(0, 0, 143) 0 entries of 256 used [ 0.000000] Entering add_active_range(0, 256, 849295) 1 entries of 256 used [ 0.000000] Entering add_active_range(0, 849308, 849465) 2 entries of 256 used [ 0.000000] Entering add_active_range(0, 849649, 849650) 3 entries of 256 used [ 0.000000] Entering add_active_range(0, 849663, 849664) 4 entries of 256 used [ 0.000000] Entering add_active_range(0, 1048576, 2277376) 5 entries of 256 used [ ...
Are there any specific / useful / debugging options one could turn on/enable to obtain more information during these events that would produce meaningful/useful output? Justin. --
I think that the log stated a lot. b- --
What do these signifiers mean (they are always the same, no matter the
controller used OR the disk in question (happens across 12 disks and 3
different controllers)):
[420781.333179] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[420781.333189] ata6.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
^^ ^^(b0/d8)^^ ^^(4f:c2)
[420781.333190] res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4
(timeout) ^^ 40:00:ff
[420781.333194] ata6.00: status: { DRDY }
[420781.333200] ata6: hard resetting link
[420781.638589] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[420781.662166] ata6.00: configured for UDMA/133
[420781.662166] ata6: EH complete
(at the time there was little to no I/O occuring on this block device, but
disks on the raid5 volume were being accessed at the time, so there was
system activity, mainly disk reads 300-500KiB/s over ethernet)
^^ ^^ (40:00: but no ff)
The rest of the messages are the same. Is there any correlation
that can be made here? When this happens to others, is it
always the same codes as shown above or do they change? If they
do not change, how come they vary between users who have this
Can anything be said about these errors, can we classify them into groups?
Or are they just random? It does not appear to happen more or less with
one filesystem or another either, one guy is using ext3, I am using XFS--
certainly something much deeper..
Justin.
--
They are the send and received task file (command blocks) for the failed -- -- "Alan, I'm getting a bit worried about you." -- Linus Torvalds --
Belatedly, the drive was retired after I got thinking about it, I'm just calling it an old bad drive and leaving it at that. I'm not goinf to re-install a dubious 20GB drive, it's out of the hot-swap carrier, cleared with dban, and in the recycle bin. You seem to have far better -- Bill Davidsen <davidsen@tmr.com> "Woe unto the statesman who makes war without a reason that will still be valid when the war is over..." Otto von Bismark --
