Re: exception Emask 0x0 SAct 0x1 / SErr 0x0 action 0x2 frozen

Previous thread: [PATCH v2] serial: set correct baud_base for Oxford Semiconductor Ltd EXSYS EX-41092 Dual 16950 Serial adapter by Niels de Vos on Monday, September 22, 2008 - 6:11 am. (2 messages)

Next thread: [PATCH] AMD IOMMU: revert "x86, AMD IOMMU: honor iommu=off instead of amd_iommu=off" by FUJITA Tomonori on Monday, September 22, 2008 - 6:35 am. (11 messages)
From: Justin Piszcz
Date: Monday, September 22, 2008 - 6:19 am

I could not agree more.

CC'ing the relevant mailing lists to see if someone out there has any idea 
what more we could do as this has been affecting you (more so than myself, 
but I would still like to get some sort of resolution as well, as it still 
happens to me too):

Similar, but not the same issue:

Sep 17 20:20:05 p34 kernel: [1422169.440538] ata5.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
Sep 17 20:20:05 p34 kernel: [1422169.440549] ata5.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
Sep 17 20:20:05 p34 kernel: [1422169.440551]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 17 20:20:05 p34 kernel: [1422169.440556] ata5.00: status: { DRDY }
Sep 17 20:20:05 p34 kernel: [1422169.440561] ata5: hard resetting link
Sep 17 20:20:06 p34 kernel: [1422169.744980] ata5: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
Sep 17 20:20:06 p34 kernel: [1422169.770448] ata5.00: configured for UDMA/133
Sep 17 20:20:06 p34 kernel: [1422169.770461] ata5: EH complete

(2.6.23.3) above

--

From: Justin Piszcz
Date: Monday, September 22, 2008 - 6:26 am

From: Gwendal Grignou
Date: Tuesday, September 23, 2008 - 11:14 am

About ata1:0 problem, as reported in the bugzilla bug: I would try to
disable NCQ to see if it helps. Your disks firmware might not fully
support it.

You can either add the parameter "libata.force=noncq" when loading
your kernel, or set queue_depth to 1 for all the Seagate drives behind
the Marvell MV88SX6081 controller.

About ata5:0 , someone - in user space probably - is trying to do a
SMART ENABLE operation, but the device ignores it. I don't know which
device you are using, but I assume it does not support ATA SMART
feature set. Timeout is an acceptable but not a nice way to answer, a
cancel would have been better; check if there is a firmware upgrade
for your device.

Gwendal.

--

From: Brian Rademacher
Date: Tuesday, September 23, 2008 - 1:59 pm

I disabled NCQ and same thing...Just says DMA freeze instead of NCQ 
freeze...

----- Original Message ----- 
From: "Gwendal Grignou" <gwendal@google.com>
To: "Justin Piszcz" <jpiszcz@lucidpixels.com>
Cc: "Brian Rademacher" <rad@radfiles.net>; <linux-ide@vger.kernel.org>; 
<linux-raid@vger.kernel.org>; <linux-kernel@vger.kernel.org>
Sent: Tuesday, September 23, 2008 12:14 PM

--

From: Bill Davidsen
Date: Thursday, September 25, 2008 - 8:17 am

You certainly called the SMART issue, I was wondering why a new 
distribution install on some older hardware was getting all the errors, 
clearly the Fedora "smartd" doesn't check SMART capability before trying 
to enable the feature. Oddly the drive on which I see this does reply to 
SMART requests, so the firmware must be "semi-functional." Not a 
problem, in my case the drive is just used for testing handling of hot 
swap, and has no data of any value.

-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 


--

From: Tejun Heo
Date: Monday, September 29, 2008 - 1:13 am

Can you post full kernel log including the boot messages and the error
messages?  Also, please attach the output of hdparm -I on the drive
which fails the smart command.

(cc'ing Bruce, hi!) Bruce, this is the second report I see about drive
timing out SMART ENABLE OPERATIONS.  Does anything ring a bell?

Thanks.

-- 
tejun
--

From: Tom Mortensen
Date: Tuesday, September 30, 2008 - 1:47 pm

Don't know if this is the original poster's problem, but if the drive
is spun down, then enabling SMART or trying to read SMART attributes
causes the drive to spin up and the command is delayed until this has
occurred.

The fix is to increase the timeout given to scsi_execute() in
drivers/ata/libata-scsi.c.

ie, current code (2.6.26.5) is:

        /* Good values for timeout and retries?  Values below
           from scsi_ioctl_send_command() for default case... */
        cmd_result = scsi_execute(scsidev, scsi_cmd, data_dir, argbuf, argsize,
                                  sensebuf, (10*HZ), 5, 0);

Should be changed to:

        /* Good values for timeout and retries?  Values below
           from scsi_ioctl_send_command() for default case... */
        cmd_result = scsi_execute(scsidev, scsi_cmd, data_dir, argbuf, argsize,
                                  sensebuf, (30*HZ), 5, 0);

Using a 1TB Hitachi hard drive, this command times out because it
takes this drive about 15 seconds to spin up.  Virtutally all hard
drives spin up in less than 30 sec, but perhaps make this higher in
case there are slower drives out there?

Cheers,
Tom

--

From: Justin Piszcz
Date: Tuesday, September 30, 2008 - 2:18 pm

Velociraptor 10k drive here (2.6.26.5):

Sep 30 15:55:06 p34 kernel: [420781.333179] ata6.00: exception Emask 0x0 SAct
0x0 SErr 0x0 action 0x6 frozen
Sep 30 15:55:06 p34 kernel: [420781.333189] ata6.00: cmd
b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
Sep 30 15:55:06 p34 kernel: [420781.333190]          res
40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
Sep 30 15:55:06 p34 kernel: [420781.333194] ata6.00: status: { DRDY }
Sep 30 15:55:06 p34 kernel: [420781.333200] ata6: hard resetting link
Sep 30 15:55:06 p34 kernel: [420781.638589] ata6: SATA link up 3.0 Gbps (SStatus
123 SControl 300)
Sep 30 15:55:06 p34 kernel: [420781.662166] ata6.00: configured for UDMA/133
Sep 30 15:55:06 p34 kernel: [420781.669416] sd 5:0:0:0: [sdf] Write Protect is
off
Sep 30 15:55:06 p34 kernel: [420781.669416] sd 5:0:0:0: [sdf] Mode Sense: 00 3a
00 00
Sep 30 15:55:06 p34 kernel: [420781.669416] sd 5:0:0:0: [sdf] Write cache:
enabled, read cache: enabled, doesn't support DPO or FUA

Nothing wrong with the disk, it just happens... :(  Linux/kernel bug?
It happens on multiple controllers, Intel, SiI, Marvell, does not seem to
matter.

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA
_of_first_error
# 1  Short offline       Completed without error       00%      2761         -
# 2  Short offline       Completed without error       00%      2737         -
# 3  Extended offline    Completed without error       00%      2714         -
# 4  Short offline       Completed without error       00%      2689         -
# 5  Extended offline    Completed without error       00%      2514         -
# 6  Short offline       Completed without error       00%      2306         -
# 7  Short offline       Completed without error       00%      2282         -
# 8  Short offline       Completed without error       00%      2258         -
# 9  Short offline       Completed without error       00%      2234         -
#10  Extended ...
From: Mr. James W. Laferriere
Date: Tuesday, September 30, 2008 - 8:50 pm

Hello Justin ,

 	I take it you've tried differant drive manufacturers ?
 	Or even a differant drive of same manuf. ?
 	Seeing as you've moved this same drive(?) across several chipsets & 
possibly mother boards ,  Leads me to beleive that the difficulty is either with 
the driver or the drive (if it is always the same drive or drive model) .

 		Hth ,  JimL
-- 
+------------------------------------------------------------------+
| James   W.   Laferriere | System    Techniques | Give me VMS     |
| Network&System Engineer | 2133    McCullam Ave |  Give me Linux  |
| babydr@baby-dragons.com | Fairbanks, AK. 99701 |   only  on  AXP |
+------------------------------------------------------------------+
--

From: Justin Piszcz
Date: Wednesday, October 1, 2008 - 1:06 am

It also occurs (I have seen it) on WD 750GiB drives on a different motherboard
Other people have the same problem with Seagate and / other drives.
This also occurs on Raptor 150s.

Justin.


--

From: Justin Piszcz
Date: Wednesday, October 1, 2008 - 4:12 am

This morning (On P965 board this time-- and on RAID1, not RAID5)

[469680.004654] ata2.00: status: { DRDY }
[469680.004660] ata2: hard resetting link
[469680.309567] ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[469680.333461] ata2.00: configured for UDMA/133
[469680.333477] ata2: EH complete
[469680.333461] sd 1:0:0:0: [sdb] 586072368 512-byte hardware sectors (300069 MB)
[469680.340461] sd 1:0:0:0: [sdb] Write Protect is off
[469680.340461] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[469680.345461] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD3000GLFS-01F8U0
Serial Number:    XX-XXXXXXXXXXXX
Firmware Version: 03.03V01
User Capacity:    300,069,052,416 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Wed Oct  1 07:11:23 2008 EDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

Perfectly good disk (and yes, I have swapped out cables as well), even tried
different a differnet server/psu (earlier version of the same motherboard) 
as well):

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
   1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       0
   3 Spin_Up_Time            0x0003   204   196   021    Pre-fail  Always       -       2758
   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       36
   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
   7 Seek_Error_Rate         0x000e   200   200   000    Old_age   Always       -       0
   9 ...
From: Tejun Heo
Date: Friday, October 3, 2008 - 7:27 pm

You're leaving the good part out.  Please always attach full kernel
log.

Thanks.

-- 
tejun
--

From: Justin Piszcz
Date: Saturday, October 4, 2008 - 1:11 am

Full log below:

[    0.000000] Linux version 2.6.26.5 (root@p34.internal.lan) (gcc version 4.3.1 (Debian 4.3.1-2) ) #1 SMP Tue Sep 9 04:07:56 EDT 2008
[    0.000000] Command line: auto BOOT_IMAGE=2.6.26.5-1 ro root=902
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000008f000 (usable)
[    0.000000]  BIOS-e820: 000000000008f000 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000cf58f000 (usable)
[    0.000000]  BIOS-e820: 00000000cf58f000 - 00000000cf59c000 (reserved)
[    0.000000]  BIOS-e820: 00000000cf59c000 - 00000000cf639000 (usable)
[    0.000000]  BIOS-e820: 00000000cf639000 - 00000000cf694000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000cf694000 - 00000000cf697000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cf697000 - 00000000cf6ef000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000cf6ef000 - 00000000cf6f1000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cf6f1000 - 00000000cf6f2000 (usable)
[    0.000000]  BIOS-e820: 00000000cf6f2000 - 00000000cf6ff000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000cf6ff000 - 00000000cf700000 (usable)
[    0.000000]  BIOS-e820: 00000000cf700000 - 00000000d0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000f0000000 - 00000000f8000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fff00000 - 0000000100000000 (reserved)
[    0.000000]  BIOS-e820: 0000000100000000 - 000000022c000000 (usable)
[    0.000000] Entering add_active_range(0, 0, 143) 0 entries of 256 used
[    0.000000] Entering add_active_range(0, 256, 849295) 1 entries of 256 used
[    0.000000] Entering add_active_range(0, 849308, 849465) 2 entries of 256 used
[    0.000000] Entering add_active_range(0, 849649, 849650) 3 entries of 256 used
[    0.000000] Entering add_active_range(0, 849663, 849664) 4 entries of 256 used
[    0.000000] Entering add_active_range(0, 1048576, 2277376) 5 entries of 256 used
[    ...
From: Justin Piszcz
Date: Saturday, October 4, 2008 - 5:23 pm

Are there any specific / useful / debugging options one could turn 
on/enable to obtain more information during these events that would 
produce meaningful/useful output?

Justin.

--

From: berk walker
Date: Saturday, October 4, 2008 - 5:29 pm

I think that the log stated a lot.

b-

--

From: Justin Piszcz
Date: Friday, October 10, 2008 - 12:13 pm

What do these signifiers mean (they are always the same, no matter the
controller used OR the disk in question (happens across 12 disks and 3 
different controllers)):

[420781.333179] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x6 frozen
[420781.333189] ata6.00: cmd b0/d8:00:00:4f:c2/00:00:00:00:00/00 tag 0
                              ^^ ^^(b0/d8)^^ ^^(4f:c2)
[420781.333190]          res 40/00:ff:00:00:00/00:00:00:00:00/00 Emask 0x4
(timeout)                    ^^ 40:00:ff
[420781.333194] ata6.00: status: { DRDY }
[420781.333200] ata6: hard resetting link
[420781.638589] ata6: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
[420781.662166] ata6.00: configured for UDMA/133
[420781.662166] ata6: EH complete

(at the time there was little to no I/O occuring on this block device, but 
disks on the raid5 volume were being accessed at the time, so there was 
system activity, mainly disk reads 300-500KiB/s over ethernet)

               ^^ ^^ (40:00: but no ff)

           The rest of the messages are the same.  Is there any correlation
           that can be made here?  When this happens to others, is it
           always the same codes as shown above or do they change?  If they
           do not change, how come they vary between users who have this

Can anything be said about these errors, can we classify them into groups?
Or are they just random? It does not appear to happen more or less with 
one filesystem or another either, one guy is using ext3, I am using XFS-- 
certainly something much deeper..

Justin.
--

From: Alan Cox
Date: Friday, October 10, 2008 - 12:27 pm

They are the send and received task file (command blocks) for the failed


-- 
--
	"Alan, I'm getting a bit worried about you."
				-- Linus Torvalds
--

From: Bill Davidsen
Date: Wednesday, October 1, 2008 - 8:09 am

Belatedly, the drive was retired after I got thinking about it, I'm just 
calling it an old bad drive and leaving it at that. I'm not goinf to 
re-install a dubious 20GB drive, it's out of the hot-swap carrier, 
cleared with dban, and in the recycle bin. You seem to have far better 


-- 
Bill Davidsen <davidsen@tmr.com>
  "Woe unto the statesman who makes war without a reason that will still
  be valid when the war is over..." Otto von Bismark 


--

Previous thread: [PATCH v2] serial: set correct baud_base for Oxford Semiconductor Ltd EXSYS EX-41092 Dual 16950 Serial adapter by Niels de Vos on Monday, September 22, 2008 - 6:11 am. (2 messages)

Next thread: [PATCH] AMD IOMMU: revert "x86, AMD IOMMU: honor iommu=off instead of amd_iommu=off" by FUJITA Tomonori on Monday, September 22, 2008 - 6:35 am. (11 messages)