Re: fat: excessive log spamming due to corrupted fs

Previous thread: [PATCH] blk-cgroup: Fix RCU correctness warning in cfq_init_queue() by Vivek Goyal on Thursday, April 22, 2010 - 8:54 am. (14 messages)

Next thread: [PATCH 4/4] scheduler: kill paranoia check in synchronize_sched_expedited() by Tejun Heo on Thursday, April 22, 2010 - 9:09 am. (1 message)
From: Johannes Stezenbach
Date: Thursday, April 22, 2010 - 8:59 am

Hi,

my office mate has a 1GB USB stick with a currupted vfat fs.

(I don't really know when and how it got currupted.  I
just mounted it, copied a file onto it, unmounted, unplugged.
After that corruption showed up when accessing the newly
copied file.  I'm running 2.6.33.1.)

Mounting still worked but when accessing the new file the kernel
log was filled up with

Apr 22 16:30:18 zzz kernel: FAT: Filesystem error (dev sdb1)
Apr 22 16:30:18 zzz kernel:    fat_get_cluster: invalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel:    File system has been set read-only
Apr 22 16:30:18 zzz kernel: FAT: Filesystem error (dev sdb1)
Apr 22 16:30:18 zzz kernel:    fat_get_cluster: invalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel: FAT: Filesystem error (dev sdb1)
Apr 22 16:30:18 zzz kernel:    fat_get_cluster: invalid cluster chain (i_pos 34568)
...
Apr 22 16:30:18 zzz kernel: FAT: Filesystem error lidAT: Filesysalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel: FAT: Filesystem ealid cluster chain (i_pos 3AT: Fiesysalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel: FAT: Filesystem ealid cluster chain (i_pos 3AT: Fesystalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel: FAT: Filesystem alid cluster chain (i_pos 3AT: Fiesyalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel: FAT: Filesystem ealid cluster chain (i_pos AT: Fiesystalid cluster chain (i_pos 34568)
Apr 22 16:30:18 zzz kernel: FAT: Filesystealid cluster chain (i_pos 3AT: Fiesystalid cluster chain (i_pos 34568)
...
(~10000 lines)


It seems that fat_fs_error() generates corrupted output
(on an Athlon 4850e dual core), and the excessive amounts
of output are IMHO not useful.


BTW: dosfsck refused to fix it.
BTW2: when my office mate plugged it into her MacBook it
caused MacOS to crash ;-)


Johannes
--

From: OGAWA Hirofumi
Date: Tuesday, April 27, 2010 - 2:48 am

It seems, userland or readahead or read directory entires didn't stop
with EIO (dir is intended though, to salvage as many files as
possible). I'll think about using the ratelimit for fs corruption
report.

I have no idea about message corruption, vfat just call vprintf() for
it. I'll see current vprintf() locking stuff.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--

From: Johannes Stezenbach
Date: Tuesday, April 27, 2010 - 3:14 am

I think multiple printk per line is prone to corruption on SMP, see the
comment about KERN_CONT in kernel.h.  But maybe it only happens
for /var/log/kern.log and my xconsole when generating too much
output too quickly, "dmesg -s 10000000 | less" did not show
the corruption (but only shows ~2700 lines out of the ~10000).
But I think fat should use vprintf() to a buffer and then one printk()
instead of multiple printk + vprintk.

In xconsole it looks like this:

Apr 27 12:06:45 zzz kernel: <systerrouster chain (i_pos 34568)
Apr 27 12:06:45 zzz kernel: <systeerroruster chain (i_pos 34568)
Apr 27 12:06:45 zzz kernel: <systeerroruster chain (i_pos 34568)
Apr 27 12:06:45 zzz kernel: <3systerroruster chain (i_pos 34568)
Apr 27 12:06:45 zzz kernel: uster chain (i_pos 34568)


Thanks
Johannes
--

From: OGAWA Hirofumi
Date: Tuesday, April 27, 2010 - 4:21 am

I think KERN_CONT issue doesn't explain this corruption (i.e. preempted
at middle of vprintk()), so even if this is one vprintk(), this will not
be fixed. One of possibility is buffer overflow of printk, so truncated
the message, but of course, I'm not sure at least for now, and I'm not
checking current printk stuff yet.

-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--

From: OGAWA Hirofumi
Date: Tuesday, April 27, 2010 - 9:41 am

Um, please check syslog at same time with dmesg. If there is corruption
in syslog, but there is not in dmesg. It sounds like userland problem.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--

From: Johannes Stezenbach
Date: Tuesday, April 27, 2010 - 10:36 am

I guess klogd cannot handle that much output.  If you'd add some
rate limiting that might fix the log corruption, too.

Thanks
Johannes
--

From: OGAWA Hirofumi
Date: Tuesday, April 27, 2010 - 12:11 pm

I guess it meant the corruption is in only syslog. Ok, it explains this
problem.

Thanks.
-- 
OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
--

Previous thread: [PATCH] blk-cgroup: Fix RCU correctness warning in cfq_init_queue() by Vivek Goyal on Thursday, April 22, 2010 - 8:54 am. (14 messages)

Next thread: [PATCH 4/4] scheduler: kill paranoia check in synchronize_sched_expedited() by Tejun Heo on Thursday, April 22, 2010 - 9:09 am. (1 message)