Re: ECC and DMA to/from disk controllers

To: Bruce Allen <ballen@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Bruce Allen <bruce.allen@...>
Date: Monday, September 10, 2007 - 9:09 pm

It depends where the data got corrupted. Normally transfers over the PCI
or PCI Express bus are protected by parity (or, I assume, a CRC or
something similar on PCI-E), so errors there would get detected. Such
errors are quite rare unless the motherboard or expansion card is faulty
or badly designed with timing problems.

However, it's conceivable that data could get corrupted inside the
controller, or inside the chipset. This seems quite rare however, except
in the presence of design flaws (like some VIA southbridges that had
nasty problems with losing data if PCI bus masters kept the CPU off the

I don't know of any controller that works in this way. This would greatly
increase CPU overhead since the CPU would need to perform this CRC
calculation.
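
To give a rough feel for the per-byte work a software checksum implies,
here is a minimal sketch of a bitwise CRC-32. The choice of CRC is an
assumption for illustration only (the common reflected polynomial
0xEDB88320, as used by zlib and Ethernet); nothing in this thread says
which CRC such a scheme would use. The names crc32_bitwise and
crc32_self_check are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bitwise CRC-32 (reflected polynomial 0xEDB88320, assumed for illustration).
 * Every input byte costs eight shift/xor steps on the CPU. */
uint32_t crc32_bitwise(const uint8_t *buf, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;

    for (size_t i = 0; i < len; i++) {
        crc ^= buf[i];
        for (int bit = 0; bit < 8; bit++) {
            if (crc & 1u)
                crc = (crc >> 1) ^ 0xEDB88320u;
            else
                crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

/* Sanity check against the standard CRC-32 check value for "123456789". */
void crc32_self_check(void)
{
    assert(crc32_bitwise((const uint8_t *)"123456789", 9) == 0xCBF43926u);
}
```

A table-driven version cuts this to roughly one lookup per byte, but the
CPU still has to touch every byte that was transferred by DMA, which is
the overhead in question.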

--
Robert Hancock Saskatoon, SK, Canada
To email, remove "nospam" from hancockr@nospamshaw.ca
Home Page: http://www.roberthancock.com/

-

To: Alan Cox <alan@...>, linux-os (Dick Johnson) <linux-os@...>, Robert Hancock <hancockr@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Bruce Allen <bruce.allen@...>
Date: Wednesday, September 12, 2007 - 11:37 pm

Alan, Robert, Dick,

Thank you all for the informed and helpful responses!

Alan, I'll pass your comments on to Peter Kelemen. Not sure if he follows
LKML. I think he'll be interested in your characterization of the error
types. I'll point him to the thread. (I think Peter and his
collaborators are fairly aware of the undetected error rates in standard
Ethernet TCP/IP traffic, which as I recall is about one undetected
single-bit error per 4 TB transferred. I am pretty sure they have ruled
this out since they have checksums computed after any network transfers.)

Robert, Dick, if I have understood correctly, in response to my specific
question, RAID controllers on PCI cards will DMA data into memory over a
PCI bus using one parity bit per 32 data bits for protection. This does
provide some protection against errors in the data transfer, but much less
protection than typical RAM ECC which has one ECC byte for each eight data
bytes. As I recall, many older motherboards disabled parity on the PCI
bus, so even this protection may be inactive in many cases. From a few
minutes of on-line research, I have the impression that PCI-e has better
ECC protection against address/data errors than PCI but I am not certain.
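
To make the difference concrete: one parity bit per 32 data bits is
about 3% overhead and can only detect an odd number of flipped bits,
while one ECC byte per eight data bytes is 12.5% overhead. The sketch
below is a simplification under stated assumptions. It computes even
parity over the 32 data bits only (as far as I know, real PCI parity
also covers the four command/byte-enable lines), and the names parity32,
parity_error and parity_demo are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Returns 1 if the word has an odd number of set bits, else 0.
 * This is the value of an even-parity bit sent alongside the word. */
unsigned parity32(uint32_t w)
{
    w ^= w >> 16;
    w ^= w >> 8;
    w ^= w >> 4;
    w ^= w >> 2;
    w ^= w >> 1;
    return w & 1u;
}

/* Returns 1 if the received word disagrees with the parity bit sent with it. */
int parity_error(uint32_t received, unsigned sent_parity)
{
    return parity32(received) != sent_parity;
}

/* One flipped bit is caught; two flipped bits cancel out and go unnoticed. */
void parity_demo(void)
{
    uint32_t sent = 0xDEADBEEFu;
    unsigned p = parity32(sent);

    assert(parity_error(sent ^ 0x00000010u, p));   /* single-bit error: detected */
    assert(!parity_error(sent ^ 0x00000011u, p));  /* double-bit error: missed */
}
```

So even when parity checking is enabled, any corruption that flips an
even number of bits in the same word passes unnoticed.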

Thanks again!

Cheers,
Bruce
-
