Re: [PATCH, RFC] ext4: Store basic fs error information in the superblock

Previous thread: [PATCH, RFC] ext4: Store basic fs error information in the superblock by Theodore Ts'o on Wednesday, June 23, 2010 - 8:21 pm. (4 messages)

Next thread: ext4 scalability testing results by Eric Whitney on Friday, June 25, 2010 - 2:54 pm. (1 message)
From: Amir G.
Date: Thursday, June 24, 2010 - 5:09 am

Hi Ted,

I saw your patch to store fs error information in the superblock.
I think it is a very useful feature and I have implemented something similar in
next3_snapshot_journal_error.patch and e2fs_next3_message_buffer.patch
(attached).

There is one big problem I encountered with this feature:
If the file system error behavior is set to "abort" or "remount-ro",
journal recovery on the next mount will most likely overwrite the
superblock that holds the error information.

To solve this problem I stored the error message buffer in the
journal superblock and copied it to the filesystem superblock on
journal recovery (both on mount and in fsck).
fsck also displays the error buffer and clears it.
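
The scheme described above can be modelled roughly as follows. This is
a toy Python sketch, not code from the attached patches: the class
names, the error_buffer field, and the MAX_MESSAGES capacity are all
assumptions made for illustration.

```python
from dataclasses import dataclass, field

MAX_MESSAGES = 4  # hypothetical capacity of the error buffer


@dataclass
class JournalSuperblock:
    # Hypothetical field: messages logged while the fs was failing.
    error_buffer: list = field(default_factory=list)


@dataclass
class FsSuperblock:
    error_buffer: list = field(default_factory=list)


def log_error(jsb, message):
    """Record an error in the journal superblock; keep only the first few."""
    if len(jsb.error_buffer) < MAX_MESSAGES:
        jsb.error_buffer.append(message)


def recover_journal(jsb, fsb):
    """After replay, copy the error buffer into the filesystem superblock."""
    fsb.error_buffer = list(jsb.error_buffer)


def fsck(jsb, fsb):
    """Return the stored messages for display, then clear both copies."""
    messages = list(fsb.error_buffer)
    jsb.error_buffer.clear()
    fsb.error_buffer.clear()
    return messages
```

Because the buffer lives in the journal superblock, it is not part of
what replay writes back, so the messages are still there to copy once
recovery has finished.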

This feature helped me hunt down some rare bugs that happened on beta
sites, which I had to analyse post-mortem.
fsck simply gives me the first few error messages logged since the
last time fsck was run.

Amir.


--

From: tytso
Date: Thursday, June 24, 2010 - 6:27 am

True, thanks for pointing that out; the simplest way to solve this for
my purposes is to snapshot those superblock fields and restore them
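
A minimal sketch of the snapshot-and-restore idea, assuming the restore
happens right after journal replay (the sentence above is cut off in
the archive, so the timing is an assumption). The superblock is modelled
as a plain dict; s_first_error_time is the field named in this thread,
while s_error_count and s_free_blocks are illustrative names only.

```python
# Fields that must survive journal replay (illustrative list).
ERROR_FIELDS = ("s_first_error_time", "s_error_count")


def replay_journal(superblock, journaled_superblock):
    """Model replay: the journaled copy replaces the on-disk superblock."""
    superblock.clear()
    superblock.update(journaled_superblock)


def replay_preserving_errors(superblock, journaled_superblock):
    """Snapshot the error fields, replay, then put the snapshot back."""
    saved = {k: superblock[k] for k in ERROR_FIELDS if k in superblock}
    replay_journal(superblock, journaled_superblock)
    superblock.update(saved)
```

Everything other than the listed error fields still takes the journaled
value, so replay semantics are otherwise unchanged.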

That's an interesting approach, although as you point out it only
works on file systems with a 4k block size.  Your design seems to be
focused on recording only the most recent logs, which makes sense in a
debugging environment.  My assumption was that the most recent
problems would probably be recorded in /var/log/messages, although if
the problem occurred on a single-disk system, that assumption probably
wouldn't hold true.  I wonder if a better solution for this
particular use case is a much larger ring buffer, and a hook into the
printk system which is guaranteed to record *everything*, even after a
panic or after the journal has been aborted and the file system has
been remounted read-only.
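
For reference, the ring-buffer behaviour mentioned above (keep the most
recent messages, drop the oldest when full) can be sketched like this.
It is a toy Python model only: the MessageRing class is hypothetical,
and neither the printk hook nor persistence across a panic is modelled.

```python
from collections import deque


class MessageRing:
    """Fixed-capacity log that silently drops the oldest entry when full."""

    def __init__(self, capacity):
        self._ring = deque(maxlen=capacity)

    def append(self, message):
        self._ring.append(message)

    def dump(self):
        """Return the retained messages, oldest first."""
        return list(self._ring)
```

A larger capacity simply widens the window of recent history that
survives; the first error is lost as soon as the ring wraps.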

For the patch I wrote, my intention was as a supplement to
/var/log/messages --- where s_first_error_time might be from long
after /var/log/messages had rolled over.  So I was trying to solve a
somewhat different problem.  (Hmm, actually, it would probably be good
to save both details about the first as well as the most recent error.)
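
Keeping both the first and the most recent error could look roughly
like the sketch below. The dict keys first_error and last_error and the
(time, location) tuple are assumptions for illustration, not the layout
used by the patch.

```python
def record_error(superblock, when, where):
    """Set first_error once; always overwrite last_error."""
    entry = (when, where)
    if superblock.get("first_error") is None:
        superblock["first_error"] = entry
    superblock["last_error"] = entry
    return superblock
```

The first slot answers "when did this file system first go bad", even
long after the logs have rolled over, while the last slot shows whether
errors are still occurring.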

   	     	     	       	     	     - Ted
--

From: Amir G.
Date: Friday, June 25, 2010 - 6:16 pm

I guess that should work.
I wonder why the ERROR_FS flag is not snapshotted on mount.

Sounds like a good feature which would be hard to implement...
BTW, I think that if the file system error behavior is set to
"remount-ro", a file system with ERROR_FS set should be mounted
read-only at mount time. This is the only way to prevent a file system
from becoming further corrupted, and I don't see why the existing
error behavior options offer no way to enforce this.

One thing that is missing from the error info is its severity level.
If I had to save just one error record, it would be the first error
after fsck (i.e. the transition from a healthy to a sick file system),
but I would override it if a message of higher severity occurs.
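
The single-slot policy described above can be sketched as follows. The
numeric severity scale (higher means worse) and the (severity, message)
tuple are assumptions for illustration.

```python
def record_error(slot, severity, message):
    """Keep the first error since fsck unless a more severe one arrives.

    `slot` is None on a healthy file system, else (severity, message).
    """
    if slot is None or severity > slot[0]:
        return (severity, message)
    return slot
```

An error of equal or lower severity leaves the slot untouched, so the
healthy-to-sick transition is kept unless something worse replaces it.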

Amir.
--
