Re: 2.4.35 SMP: ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal

Previous thread: What can be done to reduce the huge number of build fixes required to release an MM tree? by Miles Lane on Tuesday, September 18, 2007 - 8:23 am. (3 messages)

Next thread: alternatives in modules by Denys Vlasenko on Tuesday, September 18, 2007 - 8:36 am. (2 messages)
To: <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 8:23 am

Hello,

I got a bunch of these in dmesg:

EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323880: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323888: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
EXT3-fs error (device sd(8,2)): ext3_readdir: bad entry in directory #323882: rec_len is smaller than minimal - offset=0, inode=0, rec_len=0, name_len=0
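
(For readers following along: the message text suggests a sanity check on each on-disk directory entry as it is read. Below is a minimal Python sketch of that kind of check. It is modeled on the ext2/ext3 directory entry layout, not copied from the kernel source. The helper names dir_rec_len and check_dir_entry and the 4096-byte block size are assumptions for illustration only. It shows why an all-zero entry, as in the log lines above with inode=0, rec_len=0, name_len=0, trips the "smaller than minimal" test first.)

```python
import struct

# On-disk ext2/ext3 directory entry header, little-endian:
# inode (u32), rec_len (u16), name_len (u8), file_type (u8), then the name.
DIRENT_HEADER = struct.Struct("<IHBB")


def dir_rec_len(name_len):
    """Smallest legal rec_len for a name of name_len bytes (4-byte aligned)."""
    return (name_len + DIRENT_HEADER.size + 3) & ~3


def check_dir_entry(block, offset):
    """Return an error string for the entry at offset, or None if it looks sane.

    Illustrative only: a simplified stand-in for the kernel's check.
    """
    _inode, rec_len, name_len, _ftype = DIRENT_HEADER.unpack_from(block, offset)
    if rec_len < dir_rec_len(1):
        return "rec_len is smaller than minimal"
    if rec_len % 4 != 0:
        return "rec_len % 4 != 0"
    if rec_len < dir_rec_len(name_len):
        return "rec_len is too small for name_len"
    if offset + rec_len > len(block):
        return "directory entry across blocks"
    return None


# Hypothetical all-zero directory block (block size assumed to be 4096).
zeroed_block = bytes(4096)
print(check_dir_entry(zeroed_block, 0))
```

(With a zeroed block, rec_len reads as 0, which is below the 12-byte minimum for even a one-character name, so the first test fails before any of the others are reached.)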

The kernel is 2.4.35 SMP, dual-processor. The scsi driver is Fusion MPT SCSI
Host driver 2.05.16.

The device is /dev/sda2, root fs.

One line per directory had been logged to dmesg each night (during
updatedb, I think) before I noticed.

debugfs: ncheck 323888
Inode Pathname
323888 /usr/share/doc/logcheck-1.1.1
debugfs: ncheck 323882
Inode Pathname
323882 /usr/share/doc/dev86-0.15.5
debugfs: ncheck 323880
Inode Pathname
323880 /usr/share/doc/mod_put-1.3

The hardware _should_ be solid, although I can never rule out disk-level
corruption with 100% certainty.

Does this ring any bells for anyone, short of block-level corruption?

-- v --

v@iki.fi

-

To: Ville Herva <v@...>
Cc: <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 11:12 am

Interesting. Can you look (using debugfs) at the contents of the
/usr/share/doc/ directory? It seems like parts of it have been zeroed.

Honza
-

To: Jan Kara <jack@...>
Cc: <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 12:09 pm

Unfortunately, no. I removed those directories because they were the only
ones causing problems and I wasn't able to reboot for a proper fsck
immediately. The rm -rf command gave no errors (to stdout or dmesg), and a
read-only fsck right after that gave no errors on the directory structure.

Sorry for the sparse details, but when you have this kind of problem on
live servers, you tend to forget about debuggability...

-- v --

v@iki.fi

-

To: Ville Herva <v@...>
Cc: <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 12:22 pm

Yes, I can understand that :). It's just that now it's hard to find
out what has really happened. Anyway, thanks for your report.

Honza
--
Jan Kara <jack@suse.cz>
SuSE CR Labs
-

To: Jan Kara <jack@...>
Cc: <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 12:33 pm

If we are really lucky or unlucky it will happen again.

A zeroed-out block just might be a kernel problem (SMP race, whatever) -
random corruption would more likely be a hardware problem. There were no I/O
errors either. But 2.4 ext3 has been pretty extensively tested, so I don't
suppose that's likely either. And judging from
http://www.kernel.org/pub/linux/kernel/v2.4/ChangeLog-2.4.3[45] there
haven't been many changes to ext3 lately either.

-- v --

v@iki.fi

-

To: Ville Herva <v@...>
Cc: Jan Kara <jack@...>, <linux-kernel@...>
Date: Tuesday, September 18, 2007 - 5:47 pm

Hi Ville,

Thanks for your report. Unfortunately, I've rechecked the recent changelogs
and see nothing related either. At least, in order to keep a record of the
incident, would you please post some info about your config (CPU, RAM,
chipset, .config, gcc, and any patches you may have applied)?
Maybe some of this info will bring back old bad memories for some people.

Also, do you know if this server has ECC memory? I would sooner
bet on side effects of one random bit flip in memory than on some
massive block corruption.

I vaguely remember very old reports of people sometimes observing
zeroed-out blocks during writes, which were attributed to chipset bugs,
if my memory serves me. But I would rule this out, as recent chipsets
look more stable than those of 5-10 years ago!

Regards,
Willy

-

To: Willy Tarreau <w@...>
Cc: Jan Kara <jack@...>, <linux-kernel@...>
Date: Thursday, September 20, 2007 - 8:45 am

Willy,

The machine is a virtual machine on a VMware ESX 3.0.1 host.

/proc/cpuinfo shows two of these:
Dual
model : 15
model name : Intel(R) Xeon(R) CPU E5345 @ 2.33GHz
stepping : 8
cpu MHz : 2333.014
cache size : 64 KB

It has 864MB of memory.

.config is at:
http://v.iki.fi/~vherva/tmp/2.4.35-config
The kernel is plain vanilla 2.4.35 from kernel.org, no patches.

gcc 2.96-129:
cat /proc/version
Linux version 2.4.35 (root) (gcc version 2.96 20000731 (Red Hat Linux 7.2 2.96-129.7.2)) #1 SMP Thu Aug 9 10:35:37 EEST 2007

Memory is ECC.

The server is an HP Proliant ML370 with an 82801BA/CA/DB/EB chipset. I've had
my share of chipset bugs with older Via chipsets, but I don't think that's
very likely in this case.

This could very well be a VMware bug, but I wanted to know if this rings
bells for someone.

-- v --

v@iki.fi

-

To: Ville Herva <v@...>
Cc: Jan Kara <jack@...>, <linux-kernel@...>
Date: Thursday, September 20, 2007 - 9:20 am

Hi Ville,

It could also be a problem with the host OS, drivers, hardware, etc...

Cheers,
Willy

-

To: Willy Tarreau <w@...>
Cc: Jan Kara <jack@...>, <linux-kernel@...>
Date: Thursday, September 20, 2007 - 9:25 am

The box was virtualized a while ago, and 2.4.32-rc1 and earlier 2.4 kernels
compiled with the same compiler ran very solidly for years. It was UP before.

Yes, pretty much anything. There's no solid evidence of anything, only
guesses about what might be more likely and what might be less likely...

If it happens again, I'll try to debug more.

-- v --

v@iki.fi

-
