Hi I'm seeing the following errors when writing to an ext4 file system: qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002. end_request: I/O error, dev sda, sector 7812771880 Aborting journal on device sda-8. EXT4-fs (sda): delayed block allocation failed for inode 752 at logical offset 1982464 with max blocks 17588 with error -30 This should not happen!! Data will be lost EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal EXT4-fs (sda): Remounting filesystem read-only EXT4-fs (sda): delayed block allocation failed for inode 756 at logical offset 177656 with max blocks 1205 with error -30 EXT4-fs error (device sda) in ext4_da_write_end: IO failure This should not happen!! Data will be lost EXT4-fs (sda): ext4_da_writepages: jbd2_start: 1205 pages, ino 756; err -30 EXT4-fs (sda): ext4_da_writepages: jbd2_start: 17588 pages, ino 752; err -30 EXT4-fs (sda): delayed block allocation failed for inode 755 at logical offset 669696 with max blocks 3268 with error -30 This should not happen!! Data will be lost EXT4-fs (sda): ext4_da_writepages: jbd2_start: 3268 pages, ino 755; err -30 This happened while stress-testing the machine [1] with stress -d 5 --hdd-bytes 10G --hdd-noclean This command is supposed to fill the disk and then exit with "no space left". Instead, the above errors occured when the fs was 100% full. There's nothing in the logs of the Raid system. Is this a known issue? Who's to blame, qla2xxx or ext4? Thanks Andre [1] Vanilla 2.6.34-rc4, a 7.3T ext4 fs on a FC Infortrend hardware raid, connected to a qlogic hba. Let me know if you need more info. -- The only person who always got his work done by Friday was Robinson Crusoe
ext4 is just the messenger here, I think. --
Yes, or a problem with the qla2xxx driver. However, it's a bit surprising that the hardware survived the stress test for more than one day and went south just at the moment the fs was completely filled. My first guess was that the problem might be caused by a reported device size which is slightly larger than the actual size. But then ddrescue -i 7440Gi /dev/zero /dev/sda would have resulted in a similar error. This command completed without errors though. Thanks Andre -- The only person who always got his work done by Friday was Robinson Crusoe
Hm, that's true. Well, the error was on sector 7812771880 of the device ... can you reliably write directly to that sector w/o the filesystem in between? --
The driver here is simply reporting that the a command's timeout value had been exceeded and the mid-layer began error-recovery with a request to 'abort' the command. The abort-request completed successfully: qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002. I can't explain why the storage did not complete the request in the allotted time. -- av --
Ah, that's valuable information, thanks. The underlying Infortrend Raid System is rather old but worked without any problems for several years. We recently replaced its 400G disks by new 2T WD disks. Maybe the new disks have longer response times, could that be the reason? And is there a way to increase the timeout value? Andre -- The only person who always got his work done by Friday was Robinson Crusoe
To update the default timeout value (30 seconds) for commands
submitted to /dev/sdn to 60 seconds:
$ echo 60 > /sys/block/sdn/device/timeout
-- av
--
I will re-run the stress test with a 60 seconds timeout value and follow up if this did not help. Thanks again Andre -- The only person who always got his work done by Friday was Robinson Crusoe
That will not help if the command is "SYNCHRONIZE_CACHE", as that ignores device settings, but uses scsi default timeout (30s), which is far too small for SATA based raid units. Scsi maintainers ignored that and a couple of other patches I wrote to improve error handling with Infortrend units. Will send the patches again soon. Also, if the abort command succeeds, it the command should be re-queued and should not result in an error. I think my patches also would increase verbosity to point out what exactly happened (possibly a wrong return code in the qla2xxx driver, although that should activate the next step in error handling, I need to find some to go through the code...). Altogether filesystem unrelated. The filesystem just might be the reason for a synchronize-cache, e.g. barriers, etc. Greetings from Tübingen, Bernd --
Please CC me when you do so. The machine I am having trouble with is only our fallback server. I can use it freely for testing and am willing to give your patches a try. Thanks Andre -- The only person who always got his work done by Friday was Robinson Crusoe
There is actually not much to test, as the patches had been the only solution to stabilize a large Lustre environment with dozens of Infortrend Raids. I spent months to debug Infortrend Raids, scsi stack and the LSI MPT fusion driver. Nowadays some patches are also used for DDN customers. I'm just always out of time to forward port it to more recent linux-git and to resend. Cheers, Bernd --
Bernd Schubert wrote: Interesting that it only happens at enospc time though - the fs would be sending barriers in general usage. Although, as ext4 fills, it does do more syncing/flushing to try to eke out that last bit of space.... hmm.... Andre, just for fun, you might try mounting -o nodelalloc, fill it again, and see if you see the same behavior. If not, you might do it again w/ defaults, and capture blktrace data, which I think will trace barrier requests as well. (or, maybe also try testing with -o nobarrier) I still don't think it's likely a filesystem problem but maybe you can pinpoint the fs behavior that triggers it. --
Will do. This will take some time though as I destroyed the fs by writing directly to the device, so I had to recreate it and start from scratch. I'm currently filling the new fs using a 60s device timeout as I'll try to reproduce the problem using different timeout values and the ext4 options you suggest. If I can find a reliable reproducer, I'll run blktrace and post the results. Regards Andre -- The only person who always got his work done by Friday was Robinson Crusoe
Here are some results. Prior to running the tests I wrote a bunch of 10G files and then filled the fs completely with 2T files containing zeros. Each of the tests below consisted of three runs of - remove 5 of the above 10G files to make 50G space available - run stress -d 5 --hdd-bytes 10G --hdd-noclean until it dies - run fsck if any fs errors occured Summary: Increasing the device timeout to 60s _or_ disabling barriers makes the problem go away. Deactivating delayed allocation makes the problem worse. - device timeout 60s, default ext4 parameters No problems at all, all three runs OK - device timeout 30s, default ext4 parameters 1. OK 2. dmesg: qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ea270b 2002. end_request: I/O error, dev sda, sector 7812889640 Aborting journal on device sda-8. EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal EXT4-fs (sda): Remounting filesystem read-only fsck: Inode 287, i_blocks is 4294918568, should be 416. Fix? yes Inode 288, i_size is 2198897426432, should be 2199023251456. Fix? yes Inode 288, i_blocks is 4294721960, should be 416. Fix? yes 3. qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6a8 2002. qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6dc 2002. end_request: I/O error, dev sda, sector 7812690136 Aborting journal on device sda-8. EXT4-fs error (device sda) in ext4_free_blocks: Journal has aborted EXT4-fs error (device sda) in ext4_ext_remove_space: Journal has aborted EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted EXT4-fs error (device sda) in ext4_ext_truncate: Journal has aborted EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted EXT4-fs error (device sda) in ext4_orphan_del: Journal has aborted EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted EXT4-fs error (device sda) in ext4_delete_inode: Journal has ...
On 04/20/2010 10:37 AM, Andre Noll wrote: Hm, so something about barriers being issued is causing timeout problems on the device...? since nodelalloc made it -worse- I suppose that rules out the problems with fs flushing near ensopc, that should only be happening with delalloc. To be honest I'm not sure what I would do with the blktrace data now, but perhaps someone might be interested to see it; I'd at least gather it. -Eric --
I think interesting at this point would be the exact model of the Infortrend device. There are some completely broken models (IMHO), which have two controllers for redundancy. Now with enabled write-back cache, it can happen that those units run into some kind of firmware bug. It then takes about 2h to flush 2GB of write-back cache. The telnet interface will show the status of the cache. More recent IFT dual-controller units do not suffer from this bug anymore, but as Andre said, they are using an old unit... Thanks, Bernd --
Here's the system information as reported by the telnet interface: CPU Type PPC750FX Total Cache Size 2048MB DDR(ECC) Firmware Version 3.42I.03 Bootrecord Version 1.23A FW Upgradability Rev. C Serial Number 6912121 Battery Backup Unit Present Base Board Rev. ID 0 Base Board ID 81 ID of NVRAM Defaults A16F-G2221 V6.10 This is a 4 year old system (which does not support Raid6). It has only Hey, I saw this once on a different (newer) infortrend system. However, it might still be hapening on this system as well and cause the timeout problems. Guess I'll have to check if there's a more recent firmware for this system.. Thanks Andre -- The only person who always got his work done by Friday was Robinson Crusoe
I think the dual-controller models that work fine have have SAS-interlink. Infortrend never confirmed the issue, but I guess it is related to cache- coherency between both controllers. There are also other cache related firmware bugs, when it fails to flush the cache at all. Scsi commands then time out, it enters recovery, properly responds to scsi commands, resumes normal operation and fails those commands again. Even with software raid out of several of those hardware raids, this fail-recover-fail loop prevents suitable operation. Also part of my scsi patches to limit number of recoveries within a time limit. This issue should be fixed with recent firmware version, though. But depending on your model, At least worth a try. Cheers, Bernd --
OK, will send it tomorrow. Thanks Andre -- The only person who always got his work done by Friday was Robinson Crusoe
Here we go.
mount | grep sda:
/dev/sda on /mnt type ext4 (ro,relatime,barrier=1,data=ordered)
command:
ls stress.* | head -n 5 | xargs rm && stress -d 5 --hdd-noclean --hdd-bytes 10G
dmesg:
EXT4-fs (sda): mounted filesystem with ordered data mode
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d1 2002.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d3 2002.
end_request: I/O error, dev sda, sector 7812689752
Aborting journal on device sda-8.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d8 2002.
EXT4-fs (sda): delayed block allocation failed for inode 4198406 at logical offset 0 with max blocks 32768 with error -30
This should not happen!! Data will be lost
EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82da 2002.
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82db 2002.
EXT4-fs (sda): delayed block allocation failed for inode 4198435 at logical offset 0 with max blocks 32768 with error -30
This should not happen!! Data will be lost
EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
EXT4-fs (sda): Remounting filesystem read-only
EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198406; err -30
EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198435; err -30
qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82df 2002.
EXT4-fs (sda): delayed block allocation failed for inode 4198433 at logical offset 0 with max blocks 32768 with error -30
This should not happen!! Data will be lost
EXT4-fs (sda): delayed block allocation failed for inode 4198437 at logical offset 0 with max blocks 32768 with error -30
This should not happen!! Data will be lost
...2Gb cache is really huge. barriers=0 , result in less disk wcache activity, but more real IO And nodelaloc result in more real IO due, so imho this is looks like device issue. about nodelalloc: It is unlikely to see "This should not happen!! Data will be lost" because this message appear from writepage so may happens only when you rewrite an existing file(below i_size). BTW, you already noted that you have performed some stress on the device without filesystem. What was they doing? IMHO you will able to reproduce the issue without fs by performing --
stress is a simple, yet useful program which imposes certain types of stress on a machine. With the above command line options, it simply writes 5 files in parallel, each 10G large, in an endless loop until the file system is full (or becomes read-only due to errors). It helped me more than once to identify hardware or software problems, Really? This is a four year old el-cheapo hardware raid system with 16 SATA slots. You can easily spend twice the money and get much more Yes, I think we all agree that the problem is not ext4-related but is most likely an issue with the infortrend hardware. However, ext4 Nope, this definitely occured while stress was writing new files and I only ran ddrescue /dev/sda /dev/null once to make sure everything is readable. This completed with no problems, so I created an ext4 file system and used the above stress command which resulted in write errors. I then used ddrescue again to rewrite the sector on which the error occured. This also succeeded which indicates a transient problem, i.e. no problem with the particular sector. Regards Andre -- The only person who always got his work done by Friday was Robinson Crusoe
Yes, that does not result in any errors. Andre -- The only person who always got his work done by Friday was Robinson Crusoe
These look like hardware errors; it may not be the qla2xxx driver's
fault, but actually the fault of the hard disk....
- Ted
--
