Re: ext4: (2.6.34-rc4): This should not happen!! Data will be lost

Previous thread: [Bug 15792] New: ext4_inode_inode->i_flags modification is racy by bugzilla-daemon on Thursday, April 15, 2010 - 3:13 pm. (10 messages)

Next thread: Be Officially Inform(Final Notification) by Western Union Foreign Payment Office on Friday, April 16, 2010 - 8:17 am. (1 message)
From: Andre Noll
Date: Friday, April 16, 2010 - 5:35 am

Hi

I'm seeing the following errors when writing to an ext4 file system:

	qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.
	end_request: I/O error, dev sda, sector 7812771880
	Aborting journal on device sda-8.
	EXT4-fs (sda): delayed block allocation failed for inode 752 at logical offset 1982464 with max blocks 17588 with error -30

	This should not happen!!  Data will be lost
	EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted
	EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
	EXT4-fs (sda): Remounting filesystem read-only
	EXT4-fs (sda): delayed block allocation failed for inode 756 at logical offset 177656 with max blocks 1205 with error -30
	EXT4-fs error (device sda) in ext4_da_write_end: IO failure

	This should not happen!!  Data will be lost
	EXT4-fs (sda): ext4_da_writepages: jbd2_start: 1205 pages, ino 756; err -30

	EXT4-fs (sda): ext4_da_writepages: jbd2_start: 17588 pages, ino 752; err -30

	EXT4-fs (sda): delayed block allocation failed for inode 755 at logical offset 669696 with max blocks 3268 with error -30

	This should not happen!!  Data will be lost
	EXT4-fs (sda): ext4_da_writepages: jbd2_start: 3268 pages, ino 755; err -30

This happened while stress-testing the machine [1] with

	stress -d 5 --hdd-bytes 10G --hdd-noclean

This command is supposed to fill the disk and then exit with "no space
left". Instead, the above errors occured when the fs was 100% full.
There's nothing in the logs of the Raid system.

Is this a known issue? Who's to blame, qla2xxx or ext4?

Thanks
Andre

[1] Vanilla 2.6.34-rc4, a 7.3T ext4 fs on a FC Infortrend hardware
raid, connected to a qlogic hba. Let me know if you need more info.
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Eric Sandeen
Date: Friday, April 16, 2010 - 8:02 am

ext4 is just the messenger here, I think.


--

From: Andre Noll
Date: Friday, April 16, 2010 - 8:30 am

Yes, or a problem with the qla2xxx driver. However, it's a bit
surprising that the hardware survived the stress test for more than
one day and went south just at the moment the fs was completely filled.

My first guess was that the problem might be caused by a reported
device size which is slightly larger than the actual size. But then

	ddrescue -i 7440Gi /dev/zero /dev/sda

would have resulted in a similar error. This command completed without
errors though.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Eric Sandeen
Date: Friday, April 16, 2010 - 8:43 am

Hm, that's true.

Well, the error was on sector 7812771880 of the device ...

can you reliably write directly to that sector w/o the filesystem in
between?


--

From: Andrew Vasquez
Date: Friday, April 16, 2010 - 8:52 am

The driver here is simply reporting that the a command's timeout value
had been exceeded and the mid-layer began error-recovery with a request
to 'abort' the command.  The abort-request completed successfully:

	qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 fa6a73 2002.

I can't explain why the storage did not complete the request in the
allotted time.

-- av
--

From: Andre Noll
Date: Friday, April 16, 2010 - 9:08 am

Ah, that's valuable information, thanks. The underlying Infortrend
Raid System is rather old but worked without any problems for several
years. We recently replaced its 400G disks by new 2T WD disks. Maybe
the new disks have longer response times, could that be the reason? And
is there a way to increase the timeout value?

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Andrew Vasquez
Date: Friday, April 16, 2010 - 9:36 am

To update the default timeout value (30 seconds) for commands
submitted to /dev/sdn to 60 seconds:

        $ echo 60 > /sys/block/sdn/device/timeout

-- av
--

From: Andre Noll
Date: Friday, April 16, 2010 - 10:07 am

I will re-run the stress test with a 60 seconds timeout value and follow
up if this did not help.

Thanks again
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Bernd Schubert
Date: Saturday, April 17, 2010 - 9:55 am

That will not help if the command is "SYNCHRONIZE_CACHE", as that ignores 
device settings, but uses scsi default timeout (30s), which is far too small 
for SATA based raid units. Scsi maintainers ignored that and a couple of other 
patches I wrote to improve error handling with Infortrend units. Will send the 
patches again soon.
Also, if the abort command succeeds, it the command should be re-queued and 
should not result in an error. I think my patches also would increase 
verbosity to point out what exactly happened (possibly a wrong return code in 
the qla2xxx driver, although that should activate the next step in error 
handling, I need to find some to go through the code...).

Altogether filesystem unrelated. The filesystem just might be the reason for a 
synchronize-cache, e.g. barriers, etc.

Greetings from Tübingen,
Bernd 
--

From: Andre Noll
Date: Saturday, April 17, 2010 - 11:19 am

Please CC me when you do so. The machine I am having trouble with is
only our fallback server. I can use it freely for testing and am willing
to give your patches a try.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Bernd Schubert
Date: Saturday, April 17, 2010 - 11:43 am

There is actually not much to test, as the patches had been the only solution 
to stabilize a large Lustre environment with dozens of Infortrend Raids. I 
spent months to debug Infortrend Raids, scsi stack and the LSI MPT fusion 
driver. Nowadays some patches are also used for DDN customers. 
I'm just always out of time to forward port it to more recent linux-git and to 
resend.


Cheers,
Bernd

--

From: Eric Sandeen
Date: Saturday, April 17, 2010 - 1:45 pm

Bernd Schubert wrote:

Interesting that it only happens at enospc time though - the fs would
be sending barriers in general usage.

Although, as ext4 fills, it does do more syncing/flushing to try to
eke out that last bit of space.... hmm....

Andre, just for fun, you might try mounting -o nodelalloc, fill it again,
and see if you see the same behavior.

If not, you might do it again w/ defaults, and capture blktrace data, which
I think will trace barrier requests as well.

(or, maybe also try testing with -o nobarrier)

I still don't think it's likely a filesystem problem but maybe you can
pinpoint the fs behavior that triggers it.


--

From: Andre Noll
Date: Saturday, April 17, 2010 - 3:38 pm

Will do. This will take some time though as I destroyed the fs by
writing directly to the device, so I had to recreate it and start from
scratch. I'm currently filling the new fs using a 60s device timeout as

I'll try to reproduce the problem using different timeout values and the
ext4 options you suggest. If I can find a reliable reproducer, I'll run
blktrace and post the results.

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Andre Noll
Date: Tuesday, April 20, 2010 - 8:37 am

Here are some results. Prior to running the tests I wrote a bunch of
10G files and then filled the fs completely with 2T files containing
zeros.  Each of the tests below consisted of three runs of

	- remove 5 of the above 10G files to make 50G space available
	- run stress -d 5 --hdd-bytes 10G --hdd-noclean until it dies
	- run fsck if any fs errors occured

Summary: Increasing the device timeout to 60s _or_ disabling barriers
makes the problem go away. Deactivating delayed allocation makes the
problem worse.

- device timeout 60s, default ext4 parameters
	No problems at all, all three runs OK

- device timeout 30s, default ext4 parameters
	1. OK
	2. dmesg:
		qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ea270b 2002.
		end_request: I/O error, dev sda, sector 7812889640
		Aborting journal on device sda-8.
		EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
		EXT4-fs (sda): Remounting filesystem read-only

	fsck:
			Inode 287, i_blocks is 4294918568, should be 416.  Fix? yes
			Inode 288, i_size is 2198897426432, should be 2199023251456.  Fix? yes
			Inode 288, i_blocks is 4294721960, should be 416.  Fix? yes

	3.
		qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6a8 2002.
		qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2ece6dc 2002.
		end_request: I/O error, dev sda, sector 7812690136
		Aborting journal on device sda-8.
		EXT4-fs error (device sda) in ext4_free_blocks: Journal has aborted
		EXT4-fs error (device sda) in ext4_ext_remove_space: Journal has aborted
		EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
		EXT4-fs error (device sda) in ext4_ext_truncate: Journal has aborted
		EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
		EXT4-fs error (device sda) in ext4_orphan_del: Journal has aborted
		EXT4-fs error (device sda) in ext4_reserve_inode_write: Journal has aborted
		EXT4-fs error (device sda) in ext4_delete_inode: Journal has ...
From: Eric Sandeen
Date: Tuesday, April 20, 2010 - 9:51 am

On 04/20/2010 10:37 AM, Andre Noll wrote:

Hm, so something about barriers being issued is causing timeout
problems on the device...?

since nodelalloc made it -worse- I suppose that rules out the problems
with fs flushing near ensopc, that should only be happening with delalloc.

To be honest I'm not sure what I would do with the blktrace data now,
but perhaps someone might be interested to see it; I'd at least gather it.

-Eric
--

From: Bernd Schubert
Date: Tuesday, April 20, 2010 - 10:26 am

I think interesting at this point would be the exact model of the Infortrend 
device. There are some completely broken models (IMHO), which have two 
controllers for redundancy. Now with enabled write-back cache, it can happen 
that those units run into some kind of firmware bug. It then takes about 2h to 
flush 2GB of write-back cache. The telnet interface will show the status of 
the cache. More recent IFT dual-controller units do not suffer from this bug 
anymore, but as Andre said, they are using an old unit...


Thanks,
Bernd



--

From: Andre Noll
Date: Tuesday, April 20, 2010 - 11:35 am

Here's the system information as reported by the telnet interface:

	CPU Type             PPC750FX        
	Total Cache Size     2048MB DDR(ECC) 
	Firmware Version     3.42I.03        
	Bootrecord Version   1.23A           
	FW Upgradability     Rev. C          
	Serial Number        6912121         
	Battery Backup Unit  Present         
	Base Board Rev. ID   0               
	Base Board ID        81              
	ID of NVRAM Defaults A16F-G2221 V6.10

This is a 4 year old system (which does not support Raid6). It has only

Hey, I saw this once on a different (newer) infortrend system. However,
it might still be hapening on this system as well and cause the timeout
problems.

Guess I'll have to check if there's a more recent firmware for this
system..

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Bernd Schubert
Date: Tuesday, April 20, 2010 - 1:09 pm

I think the dual-controller models that work fine have have SAS-interlink. 
Infortrend never confirmed the issue, but I guess it is related to cache-
coherency between both controllers. 
There are also other cache related firmware bugs, when it fails to flush the 
cache at all. Scsi commands then time out, it enters recovery, properly 
responds to scsi commands, resumes normal operation and fails those commands 
again. Even with software raid out of several of those hardware raids, this 
fail-recover-fail loop prevents suitable operation. Also part of my scsi 
patches to limit number of recoveries within a time limit. This issue should 
be fixed with recent firmware version, though. But depending on your model, 

At least worth a try.

Cheers,
Bernd
--

From: Andre Noll
Date: Tuesday, April 20, 2010 - 10:46 am

OK, will send it tomorrow.

Thanks
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Andre Noll
Date: Thursday, April 22, 2010 - 1:21 am

Here we go.

mount | grep sda:

        /dev/sda on /mnt type ext4 (ro,relatime,barrier=1,data=ordered)

command:

        ls stress.* | head -n 5 | xargs rm && stress -d 5 --hdd-noclean --hdd-bytes 10G

dmesg:

        EXT4-fs (sda): mounted filesystem with ordered data mode
        qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d1 2002.
        qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d3 2002.
        end_request: I/O error, dev sda, sector 7812689752
        Aborting journal on device sda-8.
        qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82d8 2002.
        EXT4-fs (sda): delayed block allocation failed for inode 4198406 at logical offset 0 with max blocks 32768 with error -30

        This should not happen!!  Data will be lost
        EXT4-fs error (device sda) in ext4_da_writepages: Journal has aborted
        qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82da 2002.
        qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82db 2002.
        EXT4-fs (sda): delayed block allocation failed for inode 4198435 at logical offset 0 with max blocks 32768 with error -30

        This should not happen!!  Data will be lost
        EXT4-fs error (device sda): ext4_journal_start_sb: Detected aborted journal
        EXT4-fs (sda): Remounting filesystem read-only
        EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198406; err -30

        EXT4-fs (sda): ext4_da_writepages: jbd2_start: 32768 pages, ino 4198435; err -30

        qla2xxx 0000:06:09.0: scsi(0:0:0): Abort command issued -- 1 2f82df 2002.
        EXT4-fs (sda): delayed block allocation failed for inode 4198433 at logical offset 0 with max blocks 32768 with error -30

        This should not happen!!  Data will be lost
        EXT4-fs (sda): delayed block allocation failed for inode 4198437 at logical offset 0 with max blocks 32768 with error -30

        This should not happen!!  Data will be lost
    ...
From: Dmitry Monakhov
Date: Wednesday, April 21, 2010 - 1:57 am

2Gb cache is really huge.
barriers=0 , result in less disk wcache activity, but more real IO
And nodelaloc result in more real IO due, so imho this is looks like
device issue.
about nodelalloc: It is unlikely to see "This should not happen!! 
Data will be lost" because this message appear from writepage
so may happens only when you rewrite an existing file(below i_size).
BTW, you already noted that you have performed some stress on the device
without filesystem. What was they doing?
IMHO you will able to reproduce the issue without fs by performing
--

From: Andre Noll
Date: Wednesday, April 21, 2010 - 6:47 am

stress is a simple, yet useful program which imposes certain types of
stress on a machine. With the above command line options, it simply
writes 5 files in parallel, each 10G large, in an endless loop until
the file system is full (or becomes read-only due to errors). It
helped me more than once to identify hardware or software problems,

Really? This is a four year old el-cheapo hardware raid system with
16 SATA slots. You can easily spend twice the money and get much more

Yes, I think we all agree that the problem is not ext4-related but
is most likely an issue with the infortrend hardware. However, ext4

Nope, this definitely occured while stress was writing new files and

I only ran ddrescue /dev/sda /dev/null once to make sure everything
is readable. This completed with no problems, so I created an ext4
file system and used the above stress command which resulted in write
errors. I then used ddrescue again to rewrite the sector on which
the error occured. This also succeeded which indicates a transient
problem, i.e. no problem with the particular sector.

Regards
Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: Andre Noll
Date: Friday, April 16, 2010 - 8:57 am

Yes, that does not result in any errors.

Andre
-- 
The only person who always got his work done by Friday was Robinson Crusoe
From: tytso
Date: Friday, April 16, 2010 - 7:29 am

These look like hardware errors; it may not be the qla2xxx driver's
fault, but actually the fault of the hard disk....

       	   	    	      	     	  - Ted
--

Previous thread: [Bug 15792] New: ext4_inode_inode->i_flags modification is racy by bugzilla-daemon on Thursday, April 15, 2010 - 3:13 pm. (10 messages)

Next thread: Be Officially Inform(Final Notification) by Western Union Foreign Payment Office on Friday, April 16, 2010 - 8:17 am. (1 message)