Mark, David, Doug, Tejun, Alan, Jeff, LKML,
I'm afraid that there may be some problem with SMART + libata in the
2.6.22 kernel. An hour ago I discovered that I missed a month of
correspondence (some LKML, some private) about this problem which Alan,
Tejun, Jeff, Mark and others copied to me -- it was automatically shoved
into one of my mailboxes by my mail client. Sorry about that. So I am
trying to catch up to see if there is some real problem or not.

Here is a typical bug report that worries me:
http://article.gmane.org/gmane.linux.utilities.smartmontools/4712

Here is another similar report:
http://thread.gmane.org/gmane.linux.utilities.smartmontools/4713

And another report:
impression that the problem may be a very simple one, namely that starting
with 2.6.22 one needs to run a command to enable SMART when a box is first
booted -- the kernel no longer does this as part of the init/setup of the
disks. But that is NOT consistent with the first two reports above, which
show 'SMART ENABLED'.

Here are some of the earlier threads that I completely missed:
http://www.ussg.iu.edu/hypermail/linux/kernel/0706.1/0849.html
http://www.mail-archive.com/linux-kernel@vger.kernel.org/msg164863.html

Before I go off half-cocked, could anyone shed some light on this? Is
there a real problem here or just something dumb?

Cheers,
Bruce
-
I have done some more debugging on this one. An easy way to reproduce the
problem is to use 'smartctl -H /dev/sdb'. If I enable debugging with '-r
ioctl,2', I find the following difference between outputs using 2.6.21.1
(works OK) and 2.6.22 (fails):

--- sm-2.6.21.1b.log 2007-07-09 23:47:28.000000000 +0300
+++ sm-2.6.22.log 2007-07-09 23:39:56.000000000 +0300
@@ -11,7 +11,7 @@
status=0x0
[ata pass-through(16): 85 08 0e 00 00 00 01 00 00 00 00 00 00 00 ec 00 ]
scsi_status=0x0, host_status=0x0, driver_status=0x0
- info=0x0 duration=0 milliseconds resid=0
+ info=0x0 duration=4 milliseconds resid=0
Incoming data, len=512 [only first 256 bytes shown]:
00 5a 0c ff 3f 37 c8 10 00 00 00 00 00 3f 00 00 00
10 00 00 00 00 20 20 20 20 20 20 20 20 20 20 20 20
@@ -97,11 +97,11 @@
scsi_status=0x2, host_status=0x0, driver_status=0x8
info=0x1 duration=48 milliseconds resid=0
>>> Sense buffer, len=22:
- 00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00
- 10 00 4f 00 c2 00 50
+ 00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 01 00 00
+ 10 00 00 00 00 00 50
status=2: [desc] sense_key=0 asc=0 ascq=0
Values from ATA status return descriptor are:
- 00 09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50
+ 00 09 0c 00 00 00 01 00 00 00 00 00 00 00 50
REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS returned 0

REPORT-IOCTL: DeviceFD=3 Command=SMART STATUS CHECK
@@ -110,9 +110,13 @@
info=0x1 duration=40 milliseconds resid=0
>>> Sense buffer, len=22:
00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00
- 10 00 4f 00 c2 00 50
+ 10 00 00 00 00 00 50 ...
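
For anyone reading the sense buffers in the diff above by hand, here is a
minimal sketch (Python, not smartmontools code) of how the ATA return
descriptor can be located inside descriptor-format sense data. It assumes
that the first token on each dump line is an offset printed by the hex
dumper rather than data; the helper names are made up for illustration.

```python
ATA_RETURN_DESCRIPTOR = 0x09  # descriptor type seen at byte 8 in the logs


def parse_hex_dump(lines):
    """Convert 'offset b0 b1 ...' dump lines to bytes.

    Assumption: the first token on each line is the dump offset, not data.
    """
    data = bytearray()
    for line in lines:
        tokens = line.split()
        data.extend(int(tok, 16) for tok in tokens[1:])
    return bytes(data)


def find_descriptor(sense, wanted_type):
    """Return the first sense descriptor of the given type, or None."""
    # 0x72 and 0x73 are the descriptor-format response codes.
    if len(sense) < 8 or (sense[0] & 0x7F) not in (0x72, 0x73):
        return None
    # Byte 7 gives the number of descriptor bytes after the 8-byte header.
    end = min(len(sense), 8 + sense[7])
    pos = 8
    while pos + 2 <= end:
        desc_len = sense[pos + 1]      # payload length after the 2 header bytes
        desc_end = pos + 2 + desc_len
        if desc_end > end:
            return None                # descriptor runs past the buffer
        if sense[pos] == wanted_type:
            return sense[pos:desc_end]
        pos = desc_end
    return None


# Sense buffer from the 2.6.21.1 log (the working case).
sense_ok = parse_hex_dump([
    "00 72 00 00 00 00 00 00 0e 09 0c 00 00 00 00 00 00",
    "10 00 4f 00 c2 00 50",
])
desc_ok = find_descriptor(sense_ok, ATA_RETURN_DESCRIPTOR)
print(" ".join(f"{b:02x}" for b in desc_ok))
```

Run on the 2.6.22 buffer instead, the same walk finds a descriptor of the
same type and length, so the difference between the two kernels is in the
register bytes inside it, not in the framing.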
The other system with the Maxtor disk fails in a slightly different way
(it correctly returns the c2 byte but not in the correct location):

[ 162.896173] ata_qc_complete before: 00 00 00 40
[ 162.896179] ata_qc_complete 16: 00 c2 00 50

My earlier 'git bisect' suggested that this problem surfaced after the
patch

1e999736cafdffc374f22eed37b291129ef82e4e is first bad commit
commit 1e999736cafdffc374f22eed37b291129ef82e4e
Author: Alan Cox <alan@lxorguk.ukuu.org.uk>
Date: Wed Apr 11 00:23:13 2007 +0100

libata: HPA support

I have now done some further tests to see what is happening.
It turned out that after commenting out the call (at line 1956 in
drivers/ata/libata-core.c in 2.6.22)

    if (ata_id_hpa_enabled(dev->id))
        dev->n_sectors = ata_hpa_resize(dev);

'smartctl -H' worked again without problems. This applied to both of the
systems where I see the problem. The disks in both systems support HPA but
nothing is hidden. Next I commented out only the call to
ata_read_native_max_address_ext() in ata_hpa_resize(). This was enough
to remove the problem (as was expected).

So, the question is: why does calling ata_read_native_max_address_ext()
when booting the system cause SMART RETURN STATUS to fail much later?

--
Kai
-
Please try the patch in the following message.
http://article.gmane.org/gmane.linux.ide/20799/raw
--
tejun
-
This solves the 'smartctl -H' problem on both of my systems (one with Nvidia
CK804 and one with MCP51).

Tested-by: Kai Makisara <Kai.Makisara@kolumbus.fi>

Thanks for pointing out the patch.
--
Kai
-
This patch also solved the problem I reported here:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=428975

Thanks!
Bye Klaus
-
Tejun: thanks for pointing out this patch.
Kai, Klaus: thanks for testing the patch!
Petr: thanks for fixing the SMART 2.6.22 problems!
Jeff: two users (Kai, Klaus) both saw the SMART STATUS problem disappear
when they tested this libata patch. I hope you stick it into your own
source tree.

thread_exit(:-);

Cheers,
Bruce
-
Kai,
Thanks for the analysis.
Some background documents for those interested:
The SCSI to ATA Translation (SAT) draft:
http://www.t10.org/ftp/t10/drafts/sat/sat-r09.pdf
in which the relevant section is 12.2.6 (page 110)
table 93.
A modern descriptor-based SCSI sense buffer is being used
to convey the "ATA (status) return descriptor" back from
the ATA device after the command has been completed.
My SAT code in smartmontools requests this descriptor
so it should be returned irrespective of whether the
ATA command succeeded or failed.

Now from the ATA side the command being executed is
"SMART RETURN STATUS B0h/DAh, non-data". For
reference I use this draft from www.t13.org :
D1699r3f-ATA8-ACS.pdf . See that command's
_description_ section. That explains that 4f and c2
in the LBA field indicate the disk is healthy. "threshold
exceeded" is indicated by putting f4 and 2c in the same
positions. [Whoever specified that must have hated people
with dyslexia.] No ATA command error is indicated ("abort"
is the only one listed for that ATA command) in the reports
that I have seen.

So when smartmontools sees 0 and 0 in those positions it
pulls out the red card for that device. My guess is that
libata in lk 2.6.22 is corrupting those FIS device to
host register values.

Doug Gilbert
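
To make the register check described above concrete, here is a small
sketch (Python, illustrative only; this is not the smartmontools source)
that decodes the 14-byte ATA return descriptor and applies the 4f/c2
versus f4/2c rule. The byte offsets assume the SAT ATA return descriptor
layout (error at byte 3, then count, LBA low, LBA mid and LBA high as
byte pairs, then device and status). The function names and the "unknown"
label are assumptions of the sketch. The two sample descriptors are the
ones in Kai's 2.6.21.1 and 2.6.22 logs, without the leading dump offset.

```python
from collections import namedtuple

AtaRegs = namedtuple(
    "AtaRegs", "error count lba_low lba_mid lba_high device status")


def decode_ata_return(desc):
    """Pick the (7:0) register bytes out of an ATA return descriptor."""
    if len(desc) != 14 or desc[0] != 0x09 or desc[1] != 0x0C:
        raise ValueError("not a 14-byte ATA return descriptor")
    # Bytes 4, 6, 8 and 10 hold the (15:8) halves, which only matter for
    # 48-bit commands, so this sketch ignores them.
    return AtaRegs(error=desc[3], count=desc[5], lba_low=desc[7],
                   lba_mid=desc[9], lba_high=desc[11],
                   device=desc[12], status=desc[13])


def smart_return_status(regs):
    """Classify the SMART RETURN STATUS result from LBA mid/high."""
    signature = (regs.lba_mid, regs.lba_high)
    if signature == (0x4F, 0xC2):
        return "healthy"
    if signature == (0xF4, 0x2C):
        return "threshold exceeded"
    # Neither signature: the sketch only reports that it cannot tell.
    return "unknown"


desc_2_6_21 = bytes.fromhex("09 0c 00 00 00 00 00 00 00 4f 00 c2 00 50")
desc_2_6_22 = bytes.fromhex("09 0c 00 00 00 01 00 00 00 00 00 00 00 50")

for name, desc in (("2.6.21.1", desc_2_6_21), ("2.6.22", desc_2_6_22)):
    regs = decode_ata_return(desc)
    print(name, smart_return_status(regs),
          f"count={regs.count:#04x} status={regs.status:#04x}")
```

With the 2.6.22 descriptor the status byte is still 0x50, but LBA mid and
LBA high come back as 0 and the count register reads 1, which matches
neither signature.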
-
Kai, Doug: thank you very much for tracking down the source of this
problem.

Jeff: OK, from what I am reading here I think that this is a genuine
libata/kernel bug. But I'm out of my depth here, so the ball is in your
court. Hopefully you'll understand what's going on and how to fix it.

Cheers,
Bruce
-
This is mine and although it's a 'real' problem, it is something that's easy to
hack around by having the suspend script turn on smart after it is resumed. (Of
course I can't use resume until a skge wol bug is fixed so I won't see/test this
unless asked to.)

The smart init scripts run '-s on' when the system boots anyway for my system -
this problem only occurs for me during suspend/resume. Maybe smartd should
detect that as Alan says.

Please let me know if there's anything else you need.
David
-
OK, that should be easy to do. So let's forget about the 'SMART disabled'
issue. This is easy to fix in multiple ways and is not an LKML issue.

David: can you reproduce the more serious problem
http://article.gmane.org/gmane.linux.utilities.smartmontools/4712 reported
by Jan Dvorak?

Jeff: this is the problem that really has me concerned.

Jan: what happens if you replace '-d ata' with '-d sat'? This option
should be available in the 5.37 release of smartmontools that you are
using unless the Suse package maintainer is playing games with the version
numbers.

Unfortunately I don't think this will fix the problem, as the bug report
by Klaus Fuerstberger
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=428975 is using '-d sat'.

Jeff: the fact that both links given above are reporting the same bug in
two different settings, and the fact that the bug goes away when reverting
2.6.22 to 2.6.21 still has me concerned.

Cheers,
Bruce
-
Relevant lspci and dmesg output would be useful... that gives enhanced
error diagnostics.

Jeff
-
Sorry, I haven't seen that problem.
David
-
Here is another similar report:
http://article.gmane.org/gmane.linux.utilities.smartmontools/4704/match=...
Again, this indicates that SMART is enabled. But it's not clear what the
kernel version here is. The report indicates that the problem started
with an FC7 kernel upgrade.

Bruce
-
That was me, and the kernel in question is 2.6.21-1.3194.fc7. I tried
Jeff's noacpi suggestion, and here is the outcome. I am sure it comes
as no surprise that his patch to support the boot-time parameter
libata.noacpi is not included in this kernel:

Kernel command line: ro root=/dev/vg0/fc-root rhgb selinux=0 nodmraid libata.noacpi=1
Unknown boot option `libata.noacpi=1': ignoring

However, the module option is there:
# modinfo libata
filename: /lib/modules/2.6.21-1.3194.fc7/kernel/drivers/ata/libata.ko
version: 2.20
license: GPL
description: Library module for ATA devices
author: Jeff Garzik
srcversion: 44DAFFD701701A15EB2D574
depends: scsi_mod
vermagic: 2.6.21-1.3194.fc7 SMP mod_unload 686 4KSTACKS
parm: atapi_enabled:Enable discovery of ATAPI devices (0=off, 1=on) (int)
parm: atapi_dmadir:Enable ATAPI DMADIR bridge support (0=off, 1=on) (int)
parm: fua:FUA support (0=off, 1=on) (int)
parm: ignore_hpa:Ignore HPA (0=keep BIOS setting 1=ignore it) (int)
parm: ata_probe_timeout:Set ATA probing timeout (seconds) (int)
parm: noacpi:Disables the use of ACPI in suspend/resume when set (int)

And when used via:
# cat /etc/modprobe.d/libata
options libata noacpi=1

I still see the same problem:
smartctl version 5.37 [i686-redhat-linux-gnu] Copyright (C) 2002-6 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Model Family: Maxtor DiamondMax 10 family (ATA/133 and SATA/150)
Device Model: Maxtor 6L250S0
Serial Number: L50A1B8H
Firmware Version: BANC1G10
User Capacity: 251,000,193,024 bytes
Device is: In smartctl database [for details use: -P show]
ATA Version is: 7
ATA Standard is: ATA/ATAPI-7 T13 1532D revision 0
Local Time is: Mon Jul 9 23:39:25 2007 BST
SMART support is: Available - device has SMART capability.
SM...
So it's a bug in the sata_nv.c port driver.
That looks a bit strange, because the driver goes to some effort
to prevent these kinds of commands from ever being issued "in ADMA mode",
precisely because there's no way to do a tf_read in that mode.

Mmm.. buggy somewhere in there.
-
On the base point, libata has never enabled SMART on its own. That's
always up to the BIOS, etc.

It's possible that the recent addition of ACPI support will cause disks
to be in different modes than previously expected. ACPI supplies ATA
taskfiles to be pushed to the disk, and who knows what's in there...

Jeff
-
On Sun, 8 Jul 2007, Jeff Garzik wrote:
Is there a simple way I can have affected users test this? Is there a
kernel boot flag or sysctl setting or something else they can use to
disable the ACPI stuff to see if the problem then goes away?

Cheers,
Bruce
-
The 'noacpi' module option.
Jeff
-
OK, thanks.
Klaus, Jan: could you please see if your problem with 2.6.22 goes away
with noacpi passed as a flag to libata?

Jeff: I will add the noacpi test suggestion into the Debian bug report
here http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=428975 to try to
ensure that Klaus sees it.

Cheers,
Bruce
-