Re: File corruption when using kernels 2.6.18+

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Robert Hancock <hancockr@...>
Cc: Pekka Enberg <penberg@...>, Neil Romig <neil@...>, <linux-kernel@...>, <hyoshiok@...>, Andrew Morton <akpm@...>
Date: Wednesday, October 3, 2007 - 11:39 pm

On Wed, 3 Oct 2007, Robert Hancock wrote:

The Intel-optimized memcpy doesn't use the SSE registers, just regular 
32-bit integer nontemporal stores (movnti). The reason is that the SSE 
state save is too expensive to be worth it.

So it's not that. Also, considering that it was a single-bit error in all 
the cases I saw, I wouldn't expect it to be a cache coherency problem, 
which I'd expect to corrupt a whole cacheline or possibly at least a whole 
access.

That said, bit corruption can be just about anything. It's certainly not 
impossible that it's a CPU bug.

But my first guess would be slightly dodgy motherboard, possibly coupled 
with a chipset that simply isn't very tolerant to any timing errors. If 
the motherboard traces to the DDR aren't impedance-matched, or if the 
traces don't have the same length, or if the capacitors that are supposed 
to handle spikes in burst current aren't up to snuff, you'll just get 
noisy lines.

And at some point, noisy lines means that you go from reliable operation 
to "oh, that bit didn't make it correctly".

Lowering the front-side bus frequency or altering the memory timings can 
help (ie doing things like running DDR-333 at DDR-266). Making sure that 
your power supply isn't even close to its limits is good. And choosing a 
motherboard and chipsets from a reliable manufacturer is more than a good 
idea.

The reason why it's interesting that the errors seemed to happen in the 
same byte-lane is that I think it's common policy to route data lines on 
the same layer, and matching trace length per group is very important, 
because you do signal clocking per-group, afaik. But on the other hand, 
multiple layers on the board are expensive, so people try to minimize 
them, and maybe you end up routing through a via to another layer - which 
then makes timing and capacitance harder.

Or there aren't ground lines close enough, or the data lines are too close 
to other lines and you get cross-talk etc etc.

No, I've not done board design, and I don't know what I'm talking about, 
but look at the interesting zig-zagging the data (and address) lines often 
do on the board. It often looks totally crazy ("why doesn't that line just 
go straight?"), but the thing is that the groups all need to have the same 
length, but the pins are all at different points, so you can't make the 
lines straight, or some of them would be much shorter than others.

And if something is border-line, it may work all of the time - *until* you 
hit specific patterns that cause lots of lines to wiggle around, and then 
a capacitor won't handle the extra current draw from switching, or 
cross-talk between lines hits you, and what used to work doesn't work any 
more.

I wish we all had ECC memory. That gets rid of a lot of worries.

			Linus
-
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: File corruption when using kernels 2.6.18+, Robert Hancock, (Wed Oct 3, 10:59 pm)
Re: File corruption when using kernels 2.6.18+, Linus Torvalds, (Wed Oct 3, 11:39 pm)