I have 200 servers that crash at random, whether or not they are under load. The only thing consistent across all machines is this pattern:
Mar 6 01:14:28 ftf-32 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 6 01:14:28 ftf-32 kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Mar 6 01:14:28 ftf-32 kernel: ide: failed opcode was: unknown
Mar 6 01:14:28 ftf-32 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 6 01:14:28 ftf-32 kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Mar 6 01:14:28 ftf-32 kernel: ide: failed opcode was: unknown
Mar 6 01:14:30 ftf-32 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 6 01:14:30 ftf-32 kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Mar 6 01:14:30 ftf-32 kernel: ide: failed opcode was: unknown
Mar 6 01:14:30 ftf-32 kernel: hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
Mar 6 01:14:30 ftf-32 kernel: hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
Mar 6 01:14:30 ftf-32 kernel: ide: failed opcode was: unknown
Mar 6 01:14:30 ftf-32 kernel: hdb: DMA disabled
Mar 6 01:14:30 ftf-32 kernel: ide0: reset: success
This repeats three more times until the machine crashes. Any ideas? Thanks in advance!
This has been happening since 2.6.13. Here is the dmesg.
Linux version 2.6.15-1.1833_FC4 (bhcompile@hs20-bc1-1.build.redhat.com) (gcc version 4.0.2 20051125 (Red Hat 4.0.2-8)) #1 Wed Mar 1 23:41:37 EST 2006
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000e8000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000007bff0000 (usable)
BIOS-e820: 000000007bff0000 - 000000007bff8000 (ACPI data)
BIOS-e820: 000000007bff8000 - 000000007c000000 (ACPI NVS)
BIOS-e820: 00000000fec00000 - 00000000fec01000 (reserved)
BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
BIOS-e820: 00000000ffee0000 - 00000000fff00000 (reserved)
BIOS-e820: 00000000fffc0000 - 0000000100000000 (reserved)
1087MB HIGHMEM available.
896MB LOWMEM available.
found SMP MP-table at 000fbc70
Using x86 segment limits to approximate NX protection
On node 0 totalpages: 507888
DMA zone: 4096 pages, LIFO batch:0
DMA32 zone: 0 pages, LIFO batch:0
Normal zone: 225280 pages, LIFO batch:31
HighMem zone: 278512 pages, LIFO batch:31
DMI 2.3 present.
ACPI: RSDP (v000 AMI ) @ 0x000faa60
ACPI: RSDT (v001 AMIINT SiS740XX 0x00001000 MSFT 0x0100000b) @ 0x7bff0000
ACPI: FADT (v001 AMIINT SiS740XX 0x00000011 MSFT 0x0100000b) @ 0x7bff0030
ACPI: MADT (v001 AMIINT SiS740XX 0x00001000 MSFT 0x0100000b) @ 0x7bff00c0
ACPI: DSDT (v001 SiS 746 0x00000100 INTL 0x02002024) @ 0x00000000
ACPI: PM-Timer IO Port: 0x808
ACPI: Local APIC disabled (-2); pass 'lapic' to re-enable.
LAPIC disabled (-2)
Allocating PCI resources starting at 80000000 (gap: 7c000000:82c00000)
Built 1 zonelists
Kernel command line: ro root=LABEL=/
mapped APIC to ffffd000 (00000000)
Initializing CPU#0
CPU 0 irqstacks, hard=c0414000 soft=c0413000
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 2007.124 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 80x25
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 2008932k/2031552k available (2149k kernel code, 21396k reserved, 777k data, 196k init, 1114048k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 4016.88 BogoMIPS (lpj=8033766)
Security Framework v1.0.0 initialized
SELinux: Initializing.
SELinux: Starting in permissive mode
selinux_register_security: Registering secondary module capability
Capability LSM initialized as secondary
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383fbff c1cbfbff 00000000 00000000 00000000 00000000 00000000
CPU: After vendor identify, caps: 0383fbff c1cbfbff 00000000 00000000 00000000 00000000 00000000
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
CPU: After all inits, caps: 0383f3ff c1cbfbff 00000000 00000020 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
mtrr: v2.0 (20020519)
CPU: AMD Sempron(tm) 2800+ stepping 01
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Checking 'hlt' instruction... OK.
ACPI: setting ELCR to 0200 (from 0c00)
Local APIC disabled by default; use 'lapic' to enable it.
checking if image is initramfs... it is
Freeing initrd memory: 1073k freed
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfdb31, last bus=2
PCI: Using configuration type 1
ACPI: Subsystem revision 20050902
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (0000:00)
PCI: Probing PCI hardware (bus 00)
Uncovering SIS963 that hid as a SIS503 (compatible=0)
Enabling SiS 96x SMBus.
Boot video device is 0000:01:00.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: Power Resource [URP1] (off)
ACPI: Power Resource [URP2] (off)
ACPI: Power Resource [FDDP] (off)
ACPI: Power Resource [LPTP] (off)
ACPI: PCI Interrupt Link [LNKA] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKB] (IRQs 3 4 5 6 7 10 *11 12 14 15)
ACPI: PCI Interrupt Link [LNKC] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKD] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKE] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKF] (IRQs 3 4 5 6 7 *10 11 12 14 15)
ACPI: PCI Interrupt Link [LNKG] (IRQs 3 4 5 6 7 10 11 12 14 15) *0, disabled.
ACPI: PCI Interrupt Link [LNKH] (IRQs 3 4 5 6 7 *10 11 12 14 15)
Linux Plug and Play Support v0.97 (c) Adam Belay
pnp: PnP ACPI init
pnp: PnP ACPI: found 12 devices
usbcore: registered new driver usbfs
usbcore: registered new driver hub
PCI: Using ACPI for IRQ routing
PCI: If a device doesn't work, try "pci=routeirq". If it helps, post a report
PCI: Ignore bogus resource 6 [0:0] of 0000:01:00.0
PCI: Bridge: 0000:00:01.0
IO window: b000-bfff
MEM window: cfd00000-cfefffff
PREFETCH window: bfa00000-cfbfffff
apm: BIOS not found.
audit: initializing netlink socket (disabled)
audit(1141720362.988:1): initialized
highmem bounce pool size: 64 pages
Total HugeTLB memory allocated, 0
VFS: Disk quotas dquot_6.5.1
Dquot-cache hash table entries: 1024 (order 0, 4096 bytes)
SELinux: Registering netfilter hooks
Initializing Cryptographic API
ksign: Installing public key data
Loading keyring
- Added public key BD239ECA35D39A38
- User ID: Red Hat, Inc. (Kernel Module GPG key)
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
pci_hotplug: PCI Hot Plug PCI Core version: 0.5
isapnp: Scanning for PnP cards...
isapnp: No Plug & Play device found
Real Time Clock Driver v1.12
Linux agpgart interface v0.101 (c) Dave Jones
agpgart: Detected SiS 741 chipset
agpgart: AGP aperture is 64M @ 0xd0000000
PNP: PS/2 Controller [PNP0303:PS2K] at 0x60,0x64 irq 1
PNP: PS/2 controller doesn't have AUX irq; using default 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 4 ports, IRQ sharing enabled
serial8250: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
serial8250: ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
00:07: ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
RAMDISK driver initialized: 16 RAM disks of 16384K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
SIS5513: IDE controller at PCI slot 0000:00:02.5
SIS5513: chipset revision 0
SIS5513: not 100% native mode: will probe irqs later
SIS5513: SiS 962/963 MuTIOL IDE UDMA133 controller
ide0: BM-DMA at 0xff00-0xff07, BIOS settings: hda:DMA, hdb:DMA
ide1: BM-DMA at 0xff08-0xff0f, BIOS settings: hdc:DMA, hdd:DMA
Probing IDE interface ide0...
hda: Maxtor 7B250R0, ATA DISK drive
hdb: Maxtor 7B250R0, ATA DISK drive
ide0 at 0x1f0-0x1f7,0x3f6 on irq 14
Probing IDE interface ide1...
hdc: Maxtor 7B250R0, ATA DISK drive
hdd: Maxtor 7B250R0, ATA DISK drive
ide1 at 0x170-0x177,0x376 on irq 15
hda: max request size: 1024KiB
hda: 490234752 sectors (251000 MB) w/16384KiB Cache, CHS=30515/255/63, UDMA(133)
hda: cache flushes supported
hda: hda1 hda2
hdb: max request size: 1024KiB
hdb: 490234752 sectors (251000 MB) w/16384KiB Cache, CHS=30515/255/63, UDMA(133)
hdb: cache flushes supported
hdb: hdb1
hdc: max request size: 1024KiB
hdc: 490234752 sectors (251000 MB) w/16384KiB Cache, CHS=30515/255/63, UDMA(133)
hdc: cache flushes supported
hdc: hdc1 hdc2
hdd: max request size: 1024KiB
hdd: 490234752 sectors (251000 MB) w/16384KiB Cache, CHS=30515/255/63, UDMA(133)
hdd: cache flushes supported
hdd: hdd1
ide-floppy driver 0.99.newide
usbcore: registered new driver libusual
usbcore: registered new driver hiddev
usbcore: registered new driver usbhid
drivers/usb/input/hid-core.c: v2.6:USB HID core driver
mice: PS/2 mouse device common for all mice
md: md driver 0.90.3 MAX_MD_DEVS=256, MD_SB_DISKS=27
md: bitmap version 4.39
NET: Registered protocol family 2
input: AT Translated Set 2 keyboard as /class/input/input0
IP route cache hash table entries: 65536 (order: 6, 262144 bytes)
TCP established hash table entries: 262144 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 8, 1310720 bytes)
TCP: Hash tables configured (established 262144 bind 65536)
TCP reno registered
TCP bic registered
Initializing IPsec netlink socket
NET: Registered protocol family 1
NET: Registered protocol family 17
Using IPI Shortcut mode
ACPI wakeup devices:
PCI0 PS2K UAR1 USB1 USB2 EHCI LAN MDM AUD
ACPI: (supports S0 S1 S4 S5)
Freeing unused kernel memory: 196k freed
Write protecting the kernel read-only data: 337k
EXT3-fs: INFO: recovery required on readonly filesystem.
EXT3-fs: write access will be enabled during recovery.
kjournald starting. Commit interval 5 seconds
EXT3-fs: hda1: orphan cleanup on readonly fs
ext3_orphan_cleanup: deleting unreferenced inode 332002
ext3_orphan_cleanup: deleting unreferenced inode 458818
ext3_orphan_cleanup: deleting unreferenced inode 458816
ext3_orphan_cleanup: deleting unreferenced inode 458858
ext3_orphan_cleanup: deleting unreferenced inode 458847
ext3_orphan_cleanup: deleting unreferenced inode 458922
ext3_orphan_cleanup: deleting unreferenced inode 458846
EXT3-fs: hda1: 7 orphan inodes deleted
EXT3-fs: recovery complete.
EXT3-fs: mounted filesystem with ordered data mode.
SELinux: Disabled at runtime.
SELinux: Unregistering netfilter hooks
Floppy drive(s): fd0 is 1.44M
FDC 0 is a post-1991 82077
r8169 Gigabit Ethernet driver 2.2LK-NAPI loaded
ACPI: PCI Interrupt Link [LNKB] enabled at IRQ 11
PCI: setting IRQ 11 as level-triggered
ACPI: PCI Interrupt 0000:00:09.0[A] -> Link [LNKB] -> GSI 11 (level, low) -> IRQ 11
eth0: Identified chip type is 'RTL8169s/8110s'.
eth0: RTL8169 at 0xf881cf00, 00:40:f4:d0:91:07, IRQ 11
sis900.c: v1.08.08 Jan. 22 2005
ACPI: PCI Interrupt Link [LNKD] enabled at IRQ 10
PCI: setting IRQ 10 as level-triggered
ACPI: PCI Interrupt 0000:00:04.0[A] -> Link [LNKD] -> GSI 10 (level, low) -> IRQ 10
0000:00:04.0: Realtek RTL8201 PHY transceiver found at address 1.
0000:00:04.0: Using transceiver found at address 1 as default
eth1: SiS 900 PCI Fast Ethernet at 0xdc00, IRQ 10, 00:0b:6a:e5:90:07.
i2c-sis96x version 1.0.0
sis96x_smbus 0000:00:02.1: SiS96x SMBus base address: 0x0c00
ACPI: PCI Interrupt Link [LNKH] enabled at IRQ 10
ACPI: PCI Interrupt 0000:00:03.2[D] -> Link [LNKH] -> GSI 10 (level, low) -> IRQ 10
ehci_hcd 0000:00:03.2: EHCI Host Controller
PCI: cache line size of 64 is not supported by device 0000:00:03.2
ehci_hcd 0000:00:03.2: new USB bus registered, assigned bus number 1
ehci_hcd 0000:00:03.2: irq 10, io mem 0xcfffb000
ehci_hcd 0000:00:03.2: USB 2.0 started, EHCI 1.00, driver 10 Dec 2004
hub 1-0:1.0: USB hub found
hub 1-0:1.0: 6 ports detected
ohci_hcd: 2005 April 22 USB 1.1 'Open' Host Controller (OHCI) Driver (PCI)
ACPI: PCI Interrupt Link [LNKE] enabled at IRQ 10
ACPI: PCI Interrupt 0000:00:03.0[A] -> Link [LNKE] -> GSI 10 (level, low) -> IRQ 10
ohci_hcd 0000:00:03.0: OHCI Host Controller
ohci_hcd 0000:00:03.0: new USB bus registered, assigned bus number 2
ohci_hcd 0000:00:03.0: irq 10, io mem 0xcfff9000
hub 2-0:1.0: USB hub found
hub 2-0:1.0: 3 ports detected
ACPI: PCI Interrupt Link [LNKF] enabled at IRQ 10
ACPI: PCI Interrupt 0000:00:03.1[B] -> Link [LNKF] -> GSI 10 (level, low) -> IRQ 10
ohci_hcd 0000:00:03.1: OHCI Host Controller
ohci_hcd 0000:00:03.1: new USB bus registered, assigned bus number 3
ohci_hcd 0000:00:03.1: irq 10, io mem 0xcfffa000
hub 3-0:1.0: USB hub found
hub 3-0:1.0: 3 ports detected
ACPI: Power Button (FF) [PWRF]
ACPI: Power Button (CM) [PWRB]
ibm_acpi: ec object not found
md: Autodetecting RAID arrays.
md: autorun ...
md: ... autorun DONE.
device-mapper: 4.4.0-ioctl (2005-01-12) initialised: dm-devel@redhat.com
eth1: Media Link Off
r8169: eth0: link up
EXT3 FS on hda1, internal journal
kjournald starting. Commit interval 5 seconds
EXT3 FS on hda2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS on hdb1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS on hdc2, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
kjournald starting. Commit interval 5 seconds
EXT3 FS on hdd1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Adding 4192924k swap on /dev/hdc1. Priority:-1 extents:1 across:4192924k
ip_tables: (C) 2000-2002 Netfilter core team
Netfilter messages via NETLINK v0.30.
ip_conntrack version 2.4 (8192 buckets, 65536 max) - 232 bytes per conntrack
NET: Registered protocol family 10
lo: Disabled Privacy Extensions
IPv6 over IPv4 tunneling driver
eth0: no IPv6 routers present
Removing netfilter NETLINK layer.
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide: failed opcode was: unknown
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide: failed opcode was: unknown
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide: failed opcode was: unknown
hda: dma_intr: status=0x51 { DriveReady SeekComplete Error }
hda: dma_intr: error=0x84 { DriveStatusError BadCRC }
ide: failed opcode was: unknown
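With 200 machines, it may help to count these errors per host and drive instead of reading logs by hand. The following is a minimal Python sketch, not a definitive tool. It assumes the syslog line format shown above, and the names `decode_status` and `tally_crc_errors` are made up for illustration. The status bit names are the ones the old IDE driver prints; the BadCRC test (bits 0x80 and 0x04 both set in the error register) matches the `error=0x84` lines in this post.

```python
import re
from collections import Counter

# Names the legacy Linux IDE driver prints for ATA status register bits.
STATUS_BITS = {
    0x80: "Busy",
    0x40: "DriveReady",
    0x20: "DeviceFault",
    0x10: "SeekComplete",
    0x08: "DataRequest",
    0x04: "CorrectedError",
    0x02: "Index",
    0x01: "Error",
}

# Matches syslog lines such as:
#   Mar 6 01:14:28 ftf-32 kernel: hda: dma_intr: error=0x84 { ... }
LINE_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]+\s+(?P<host>\S+)\s+kernel:\s+"
    r"(?P<drive>hd[a-z]):\s+dma_intr:\s+"
    r"(?P<reg>status|error)=0x(?P<val>[0-9a-fA-F]{2})"
)


def decode_status(value):
    """Return the flag names set in an ATA status byte."""
    if value & 0x80:
        # While Busy is set, the other bits are not meaningful.
        return ["Busy"]
    return [name for bit, name in STATUS_BITS.items() if value & bit]


def tally_crc_errors(lines):
    """Count BadCRC error lines per (host, drive)."""
    counts = Counter()
    for line in lines:
        m = LINE_RE.match(line)
        if not m or m.group("reg") != "error":
            continue
        value = int(m.group("val"), 16)
        # 0x80 (interface CRC) together with 0x04 (abort) is reported as BadCRC.
        if value & 0x80 and value & 0x04:
            counts[(m.group("host"), m.group("drive"))] += 1
    return counts
```

Feeding the lines of each machine's /var/log/messages (path assumed) to `tally_crc_errors` gives a per-drive count, which should show whether the errors cluster on particular hosts, drives, or channels.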
IDE errors symptomatic of overheating or dying
The IDE errors you're posting are symptomatic of either dying hard drives (about to fail hard), in which case the kernel shouldn't be dying, or overheating hard drives, in which case the CPU or memory will also be dying.
Check the cooling in the machines, check that the rooms they're in aren't overly hot or static (no air flow), and set up a serial console or similar to catch the final oops or panic message. Equipped with the panic message, it'll be possible to track down where in the kernel it's dying; if it's seemingly random, you've got hardware issues.
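If wiring a serial console to 200 machines is impractical, one "or similar" option is the kernel's netconsole module, which sends kernel messages over UDP to another host. The sketch below is only an illustration of the receiving side. It assumes netconsole is available in your kernel and configured to send to this host on UDP port 6666 (the commonly used target port); the function names and the log file name are hypothetical.

```python
import socket
import time


def format_record(sender_ip, payload, timestamp):
    """Turn one received datagram into a timestamped log line."""
    text = payload.decode("utf-8", errors="replace").rstrip("\n")
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))
    return f"{stamp} {sender_ip} {text}"


def listen(bind_addr="0.0.0.0", port=6666, logfile="netconsole.log"):
    """Append every datagram received on the port to a log file, forever."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((bind_addr, port))
    # Line buffering, so the last lines before a crash reach the file promptly.
    with open(logfile, "a", buffering=1) as out:
        while True:
            payload, (sender_ip, _sender_port) = sock.recvfrom(65535)
            out.write(format_record(sender_ip, payload, time.time()) + "\n")
```

Running `listen()` on one collector box and pointing the crashing machines at it would record the final oops or panic text along with the sender's address, as long as the network stack is still alive when the kernel dies.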
I don't agree
I don't agree. The IDE controller is listed as SiS 5513. My home desktop (a ten-year-old computer) has the same controller; it works fine with 2.6.15 but gives the same error when the kernel is upgraded to 2.6.26 (Ubuntu 6.06 -> 8.04). There seems to be an incompatibility with the IDE drivers in the newer kernels; I have tested up to 2.6.53.
what does SMART say?
Try:
smartctl -a /dev/hda
BTW - Ubuntu 8.04 uses libata, doesn't it? So your drive should be /dev/sda, but in that case the error messages should be completely different. Weird...
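To check SMART on many machines, the attribute table can be pulled out with a script. This is a rough Python sketch that assumes smartmontools is installed and that `smartctl -A` prints its usual ten-column attribute table; the function names are made up for illustration. Attribute names vary by drive vendor, but many drives report `UDMA_CRC_Error_Count` and `Temperature_Celsius`, which are the two most relevant to BadCRC errors and to the overheating theory.

```python
import subprocess


def parse_smart_attributes(text):
    """Map attribute name -> raw value from `smartctl -A` output."""
    attrs = {}
    for line in text.splitlines():
        # Columns: ID# NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        fields = line.split(None, 9)
        if len(fields) == 10 and fields[0].isdigit():
            attrs[fields[1]] = fields[9].strip()
    return attrs


def read_smart_attributes(device="/dev/hda"):
    """Run smartctl on one device and parse its attribute table."""
    # smartctl uses its exit status as a bitmask, so a non-zero code is not
    # necessarily a failure to run; do not raise on it.
    result = subprocess.run(
        ["smartctl", "-A", device],
        capture_output=True,
        text=True,
        check=False,
    )
    return parse_smart_attributes(result.stdout)
```

For example, `read_smart_attributes("/dev/hda").get("UDMA_CRC_Error_Count")` returns the raw counter as a string, or `None` if the drive does not report that attribute.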
upgrading
If you upgrade rather than reinstall Ubuntu, it should retain the existing configuration, i.e. keep using the same device names.