Re: Regular ext4 error warning with HD in USB dock

Previous thread: test by Vasiliy G Tolstov on Monday, December 27, 2010 - 3:11 pm. (1 message)

Next thread: linux-next: build warning after merge of the sound-asoc tree by Stephen Rothwell on Monday, December 27, 2010 - 4:32 pm. (2 messages)
From: Con Kolivas
Date: Monday, December 27, 2010 - 3:53 pm

Hi all

Since I moved my internal HD into a USB dock externally and mount the ext4 
filesystem on it, I regularly get the following errors after it has been 
mounted for a while (see timecode). It doesn't seem to matter which recent 
kernel I use. 


[1048401.773270] EXT4-fs (sde8): mounted filesystem with writeback data mode. 
Opts: (null)
[1048702.736011] EXT4-fs (sde8): error count: 3
[1048702.736016] EXT4-fs (sde8): initial error at 1289053677: 
ext4_journal_start_sb:251
[1048702.736018] EXT4-fs (sde8): last error at 1289080948: ext4_put_super:719

I've forced an fs check numerous times but nothing comes up. There was only 
one occasion where the filesystem mounted RO but that was months ago and has 
since been fine.

This does not appear to cause any actual problems and happens it seems at 
_precisely_ the moment the hard drive decides to go into sleep mode as I hear 
it spin down. hdparm is unable to extract details from the hard drive when it 
is plugged into the usb dock.



The options used on this are as follows:
tune2fs 1.41.12 (17-May-2010)
Filesystem volume name:   backup
Last mounted on:          /media/backup
Filesystem UUID:          4eff5eb7-89fb-4db1-8cfd-f09db598a00b
Filesystem magic number:  0xEF53
Filesystem revision #:    1 (dynamic)
Filesystem features:      has_journal ext_attr resize_inode dir_index filetype 
needs_recovery extent flex_bg sparse_super large_file huge_file uninit_bg 
dir_nlink extra_isize
Filesystem flags:         signed_directory_hash 
Default mount options:    journal_data_writeback
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
Inode count:              22593536
Block count:              90355576
Reserved block count:     0
Free blocks:              30330219
Free inodes:              21828725
First block:              0
Block size:               4096
Fragment size:            4096
Reserved GDT blocks:      1002
Blocks per group:         32768
Fragments per group:   ...
From: Ted Ts'o
Date: Monday, December 27, 2010 - 7:53 pm

That's actually not an error.  It's a report which is generated every
24 hours, indicating that there has been 3 errors since the last time
the error count has been cleared, with the first error taking place at
Sat Nov 6 10:27:57 2010 (US/Eastern) in the function
ext4_journal_start_sb(), at line 251, and the most recent error taking
place at Sat Nov 6 18:02:28 2010 (US/Eastern), in the function
ext4_put_super() at line 719.  This is a new feature which was added
in 2.6.36.

This information (which is stored in the ext4 superblock) gets cleared
by e2fsck version 1.41.13 or newer (1.41.14 is the most recently
released version).  If you are using an older version of e2fsck, this
information won't get cleared, which is not a disaster, although it
could potentially get annoying.  The solution is to simply to upgrade
to the newest version of e2fsprogs if you are using the 2.6.36 or
newer kernel.

The idea behind this feature is to make sure that if there was a file
system corruption detected, it wouldn't be lost due to log files
getting overflowed.  It's also helpful when users report problems to
distribution support lines, since it makes it more likely support
personnel will be able to determine that there has been some file
system corrutpion issues, perhaps caused by hardware problems, or some
other cause, including kernel bugs.

Hope this helps explain what you're seeing!

						- Ted
--

From: Rogier Wolff
Date: Tuesday, December 28, 2010 - 1:19 am

Nice. But the issue you're not mentioning is: What errors could have 
happened on the 6th of november? Should Con worry about those errors?

OK, the chances are that he has rebooted since november, and that an
older fsck fixed the errors, but not cleared the "fs errror log". Would
these errors have triggered a "remount-readonly" if the fs was mounted 
like that?

I don't reboot that often: 

obelix:~> uptime
 09:16:19 up 175 days, 19:04, 19 users,  load average: 110.10, 110.61, 111.22

(and yes, the load is quite high on that machine. I won't polute this
thread about that....)

	Roger. 

-- 
** R.E.Wolff@BitWizard.nl ** http://www.BitWizard.nl/ ** +31-15-2600998 **
**    Delftechpark 26 2628 XH  Delft, The Netherlands. KVK: 27239233    **
*-- BitWizard writes Linux device drivers for any device you may have! --*
Q: It doesn't work. A: Look buddy, doesn't work is an ambiguous statement. 
Does it sit on the couch all day? Is it unemployed? Please be specific! 
Define 'it' and what it isn't doing. --------- Adapted from lxrbot FAQ
--

From: Con Kolivas
Date: Tuesday, December 28, 2010 - 2:09 am

Indeed the FS was remounted as RO at the time. I had to power down the USB 
device and power it up again after umounting the FS and then doing an FSCK 
before it was usable again, and the machine had about a month of uptime when 
it happened. It has not caused a problem since.

Thanks,
Con
-- 
-ck
--

From: Amir Goldstein
Date: Tuesday, December 28, 2010 - 3:30 am

Ted,

I would like to use this opportunity to remind you about my
record_journal_errstr()
implementation, see:
https://github.com/amir73il/ext4-snapshots/blob/next3-stable/fs/next3/super.c#L157

It records the initial errors messages (which I found to be the most
interesting),
in a message buffer on the unused space after the journal super block
(3K on a 4K block fs).

fsck prints out those messages and clears the buffer.
In under a year of Next3 fs in beta and production, it has helped me many times
to analyse bugs post-mortem and find the problem.

If there is demand, I can post the patch for Ext4.

--

From: Ric Wheeler
Date: Saturday, January 1, 2011 - 10:20 am

I do think that this sounds like a useful addition - should be very useful in 
doing post mortem analysis...

Thanks!

Ric

--

From: Ted Ts'o
Date: Tuesday, December 28, 2010 - 7:15 am

It looks like those errors were probably caused by I/O errors while
writing to the journal.  Maybe due to a flakey USB connection,
perhaps?  It would be useful to look at the /var/log/messagfes around

Yes, errors that get recorded into the superblock are also errors that
would cause a read-only remount if the file system is set that way.

      	      			       	    - Ted
--

From: torn5
Date: Tuesday, December 28, 2010 - 3:41 am

This is going to be a faq...
I suppose the datetime is encoded (what format is that?) in that long 
number after "at".
May I suggest the datetime gets decoded in the printing?
Also may I suggest that the error happens immediately after mount and 
not after 300 seconds from mount?

I just subscribed to this list exactly to report the same kind of error.

Last week I was doing reliability tests for open-iscsi and this error 
drove me crazy. I spent days in tests where I thought I could reproduce 
an error in open-iscsi by disconnecting and reconnecting the network; I 
even reported this to the open-iscsi mailing list, but in fact it was an 
old error of the filesystem and it was not getting cleared by my older 
e2fsck 1.41.11 .

If that happened immediately after mount or if it spitted the datetime 
in human readable format, I would have immediately guessed it was due to 
an existing "filesystem problem" (even though I was running fsck.ext4 -f 
prior to each mount) but instead I thought it was due to my 5 minutes of 
"networks disconnection tests" I was doing after each mount. DOH!

Anyway thanks for your work: excellent filesystem.

PS: I have a question for you regarding ext4 behaviour with SCSI 
commands resubmissions... ok I am opening another thread for that.
--

From: Ted Ts'o
Date: Tuesday, December 28, 2010 - 7:32 am

Number of seconds since January 1st, 1970 midnight GMT.  (Standard
Unix-time format).  It's hard for the kernel to decode it since it
doesn't know what time-zone you are in, or what your local legislature
has done vis-a-vis arbitrary and cabritious adjustments to the
start/stop date of daylight savings time, etc.

It's easy enough to decode it using perl:


Hmmm.... maybe.  The idea behind the 300 second delay was for the root
file system, to let e2fsck have a chance to fix the file system before
reporting the errors at mount time.

						- Ted


     	     	       
--

From: Ben Pfaff
Date: Tuesday, December 28, 2010 - 8:02 am

The "date" program can decode these values too:
        date -d @1289053677
-- 
Ben Pfaff 
http://benpfaff.org

--

From: torn5
Date: Tuesday, December 28, 2010 - 8:20 am

I suggest to decode it in GMT then.
Like this it's not possible to guess it's a "time" value, and hence that 
it's an *old* error. I thought it was a pointer to something, and so I 
thought the error was happening now, or more likely (since it was 
triggered after 300 seconds) that it was some inconsistency in the 
filesystem being detected now, in some kind of lazy check.
--

Previous thread: test by Vasiliy G Tolstov on Monday, December 27, 2010 - 3:11 pm. (1 message)

Next thread: linux-next: build warning after merge of the sound-asoc tree by Stephen Rothwell on Monday, December 27, 2010 - 4:32 pm. (2 messages)