Hello,
I got a bunch of these in dmesg:
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323880: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323882: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0

The kernel is 2.4.35 SMP, dual-processor. The SCSI driver is Fusion MPT SCSI
Host driver 2.05.16.

The device is /dev/sda2, the root fs.

One line per directory dropped into dmesg each night (I think
during updatedb) before I noticed.

debugfs: ncheck 323888
Inode Pathname
323888 /usr/share/doc/logcheck-1.1.1
debugfs: ncheck 323882
Inode Pathname
323882 /usr/share/doc/dev86-0.15.5
debugfs: ncheck 323880
Inode Pathname
323880 /usr/share/doc/mod_put-1.3

The hardware _should_ be solid, although I can never rule out disk-level
corruption with 100% certainty.

Does this ring any bells for anyone, short of block-level corruption?
-- v --
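
[Editor's sketch, not part of the original message.] The step from the dmesg lines to the debugfs `ncheck` requests above can be automated. The following is a minimal sketch, assuming the log lines have exactly the format quoted in this message; the names `BAD_ENTRY`, `SAMPLE` and `bad_directories` are made up for illustration.

```python
import re

# Matches the ext3_readdir "bad entry" lines in the format quoted in the thread.
BAD_ENTRY = re.compile(
    r"EXT3-fs error \(device (?P<dev>.+?)\): ext3_readdir: "
    r"bad entry in directory #(?P<dir>\d+): (?P<reason>.+?) - "
    r"offset=(?P<offset>\d+), inode=(?P<inode>\d+), "
    r"rec_len=(?P<rec_len>\d+), name_len=(?P<name_len>\d+)"
)

# The three lines from the report, used here as sample input.
SAMPLE = "\n".join(
    "EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory "
    f"#{ino}: rec_len is smaller than minimal - "
    "offset=0, inode=0, rec_len=0, name_len=0"
    for ino in (323880, 323888, 323882)
)


def bad_directories(log_text):
    """Return the sorted, de-duplicated directory inode numbers found in log_text."""
    return sorted({int(m.group("dir")) for m in BAD_ENTRY.finditer(log_text)})


if __name__ == "__main__":
    # Print one debugfs request per affected directory inode.
    for ino in bad_directories(SAMPLE):
        print(f"ncheck {ino}")
```

Collecting the inode numbers into a set first means a directory that is reported every night shows up only once.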
-
Interesting. Can you look (using debugfs) at the contents of the
/usr/share/doc/ directory? It seems like parts of it have been zeroed.

Honza
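
[Editor's sketch, not part of the original message.] Why a zeroed directory block would give exactly the reported values (offset=0, inode=0, rec_len=0, name_len=0) can be illustrated with a small model of the on-disk directory entry. This assumes the usual ext2/ext3 layout with an 8-byte little-endian header (32-bit inode, 16-bit rec_len, 8-bit name_len, 8-bit file type) and a 4096-byte block; the thread does not state the block size. Only the first reason string is quoted from the report; the others are descriptive placeholders, and the kernel performs further checks that are omitted here.

```python
import struct

BLOCK_SIZE = 4096  # assumption: the thread does not state the block size


def dir_rec_len(name_len):
    """Smallest valid entry size: 8-byte header plus name, rounded up to 4 bytes."""
    return (name_len + 8 + 3) & ~3


def check_dir_entry(block, offset=0):
    """Return None if the entry at offset looks sane, else a reason string."""
    inode, rec_len, name_len, _file_type = struct.unpack_from("<IHBB", block, offset)
    if rec_len < dir_rec_len(1):
        # The case reported in the thread: even a 1-character name needs 12 bytes.
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len is not a multiple of 4"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for the name"
    if offset + rec_len > len(block):
        return "entry runs past the end of the block"
    return None


if __name__ == "__main__":
    zeroed = bytes(BLOCK_SIZE)  # a directory block that has been zeroed out
    print(check_dir_entry(zeroed))
```

An all-zero block decodes to inode=0, rec_len=0, name_len=0 at offset 0, so the very first check fails. That matches the reported lines and is consistent with a zeroed first block, though it does not by itself say how the block came to be zeroed.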
-
Unfortunately, no. I removed those directories because they were the only
ones causing problems and I wasn't able to reboot for a proper fsck
immediately. The rm -rf command gave no errors (to stdout or dmesg), and a
read-only fsck right after that gave no errors on the directory structure.

Sorry for the sparse details, but when you have these kinds of problems on
live servers, you tend to forget about debuggability...

-- v --
-
Yes, I can understand that :). It's just that now it's hard to find
out what really happened. Anyway, thanks for your report.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-
If we are really lucky or unlucky it will happen again.
A zeroed-out block just might be a kernel problem (SMP race, whatever) -
random corruption would more likely be a hardware problem. There were no I/O
errors either. But 2.4 ext3 has been pretty extensively tested, so I don't
suppose that's likely either. And judging from
http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.3[45] there
haven't been many changes to ext3 lately either.

-- v --
-
Hi Ville,
Thanks for your report. Unfortunately, I've rechecked the recent changelogs
and see nothing related either. At least, to keep a record of the
incident, would you please post some info about your config (CPU, RAM,
chipset, .config, gcc, and any patches you may have applied)?
Maybe some of this info will bring back old bad memories for some people.

Also, do you know if this server has ECC memory? I would more readily
bet on side effects of one random bit flip in memory than on some
massive block corruption.

I vaguely remember very old reports of people sometimes observing
zeroed-out blocks during writes, which were attributed to chipset bugs
if my memory serves me. But I would rule this out, as recent chipsets
look more stable than those of 5-10 years ago!

Regards,
Willy
-
Willy,
The machine is a virtual machine on a VMware ESX 3.0.1 host.
/proc/cpuinfo shows two of these:
Dual
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 8
cpu MHz : 2333.014
cache size : 64 KB

It has 864MB of memory.
.config is at:
http://v.iki.fi/~vherva/tmp/2.4.35-config
The kernel is plain vanilla 2.4.35 from kernel.org, no patches.

gcc 2.96-129:
cat /proc/version
Linux version 2.4.35 (root) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 SMP Thu Aug 9 10:35:37 EEST 2007

Memory is ECC.
The server is an HP ProLiant ML370 with an 82801BA/CA/DB/EB chipset. I've had my
share of chipset bugs with older Via chipsets, but I think it's very unlikely
in this case.

This could very well be a VMware bug, but I wanted to know if this rings
bells for someone.

-- v --
-
Hi Ville,
It could also be a problem with the host OS, drivers, hardware, etc...
Cheers,
Willy
-
The box was virtualized a while ago and 2.4.32-rc1 and earlier 2.4 kernels compiled
with the same compiler ran very solidly for years. It was UP before

Yes, pretty much anything. There's no solid evidence of anything, only
guesses of what might be more likely and what might be less likely...

If it happens again, I'll try to debug more.
-- v --
