Re: ECC and DMA to/from disk controllers

Previous thread: [PATCH] ipconfig.c: De-clutter IP configuration report by Maciej W. Rozycki on Monday, September 10, 2007 - 8:09 am. (3 messages)

Next thread: [PATCH] sb1250-mac.c: De-typedef, de-volatile, de-etc... by Maciej W. Rozycki on Monday, September 10, 2007 - 8:20 am. (12 messages)
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Bruce Allen <ballen@...>, Bruce Allen <bruce.allen@...>
Date: Monday, September 10, 2007 - 8:19 am

Dear LKML,

Apologies in advance for potential mis-use of LKML, but I don't know where
else to ask.

An ongoing study of datasets of several petabytes has shown that there
can be 'silent data corruption' at rates much larger than one might
naively expect from the expected error rates in RAID arrays and the
expected probability of single-bit uncorrected errors in hard disks.

The origin of this data corruption is still unknown. See for example
http://cern.ch/Peter.Kelemen/talk/2007/kelemen-2007-C5-Silent_Corruption...

In thinking about this, I began to wonder about the following. Suppose
that a (possibly RAID) disk controller correctly reads data from disk and
has correct data in the controller memory and buffers. However when that
data is DMA'd into system memory some errors occur (cosmic rays,
electrical noise, etc). Am I correct that these errors would NOT be
detected, even on a 'reliable' server with ECC memory? In other words the
ECC bits would be calculated in server memory based on incorrect data from
the disk.
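
To make the question concrete, here is a minimal sketch in Python. It
uses a toy Hamming(7,4) code as a stand-in for the memory's ECC (real
server memory typically uses a wider SECDED code, and all function names
here are purely illustrative, not taken from any real driver or
controller). It shows that a bit flipped before the check bits are
computed yields a self-consistent stored word, while a bit flipped after
encoding is located by the syndrome.

```python
def hamming74_encode(nibble):
    """Encode 4 data bits (int 0..15) as a 7-bit Hamming codeword.

    The result is a list of bits for positions 1..7, laid out as
    p1, p2, d0, p4, d1, d2, d3.
    """
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]  # covers positions 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]  # covers positions 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]  # covers positions 5, 6, 7
    return [p1, p2, d[0], p4, d[1], d[2], d[3]]


def hamming74_syndrome(code):
    """Return 0 for a consistent codeword, else the 1-based position
    of a single flipped bit."""
    syndrome = 0
    for position, bit in enumerate(code, start=1):
        if bit:
            syndrome ^= position
    return syndrome


def flip_before_encode(nibble, bit):
    """Hypothetical DMA error: the data is damaged in flight, then the
    memory side computes check bits over whatever arrived."""
    stored = hamming74_encode(nibble ^ (1 << bit))
    return hamming74_syndrome(stored)


def flip_after_encode(nibble, position):
    """Ordinary memory error: the word is encoded correctly and one
    stored bit (1-based position) flips afterwards."""
    stored = hamming74_encode(nibble)
    stored[position - 1] ^= 1
    return hamming74_syndrome(stored)


if __name__ == "__main__":
    print("flip before encode, syndrome:", flip_before_encode(0b1011, 2))
    print("flip after encode, syndrome:", flip_after_encode(0b1011, 5))
```

In this toy model the first case always gives syndrome 0, i.e. the ECC
has nothing to complain about, which is the scenario I am worried about.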

The alternative is that disk controllers (or at least ones that are meant
to be reliable) DMA both the data AND the ECC byte into system memory.
So that if an error occurs in this transfer, then it would most likely be
picked up and corrected by the ECC mechanism. But I don't think that
'this is how it works'. Could someone knowledgeable please confirm or
contradict?

Cheers,
Bruce
-

To: Bruce Allen <ballen@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Bruce Allen <bruce.allen@...>
Date: Monday, September 10, 2007 - 2:05 pm

In a typical system, there are usually hardware data transfer
paths that are not under the protection of any ECC mechanism.
One example is "bus mastering" DMA itself. If the bus-interface
state-machine is improperly designed (read timing problems), data
transfer may be unreliable. Of course serial-ATA, SCSI, and
other external buses have a modicum of protection, but early
IDE did not. There are many file-systems that have been corrupted
by incorrect cables, bad motherboard or chip designs, or using
UDMA when the hardware won't reliably work.

That said, the reliability of data transfer buses is pretty
good because they don't need to store data for long periods
of time, like RAM. The probability of a bit upset due to
a nuclear event is very low in a bus where something
is driving the bus, keeping the data valid, during the time
that something else is reading the bus. Nuclear events
generally upset RAM because the data are stored in very
small charges and femtoamperes of spurious current can
alter logic states.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.30 BogoMips).
My book : http://www.AbominableFirebug.com/
-

To: Bruce Allen <ballen@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Bruce Allen <ballen@...>, Bruce Allen <bruce.allen@...>
Date: Monday, September 10, 2007 - 9:54 am

It's almost entirely device specific at every level. Some general
information and comment, however:

- Drives normally do error correction and shouldn't be fooled very often
by bad bits.
- The ECC level on the drive processors and memory cache varies by vendor.
Good luck getting any information on this, although maybe if you are CERN
sized they will talk.

After the drive we cross the cable. For SATA this is pretty good, and
UDMA data transfer is CRC protected. For PATA the data is protected but
the command block is not, so on PATA there is a minute chance you send
the CRC protected block to the wrong place.

Once it's crossing the PCI bus and main memory and CPU cache, what is
protected and how much is entirely down to the system you are running.
Note that a lot of systems won't report ECC errors unless you ask.

If you have hardware RAID controllers it's all vendor specific, including
CPU cache etc. on the card.

The next usual mess is network transfers. The TCP checksum strength is
questionable for such workloads but the ethernet one is pretty good.
Unfortunately lots of high performance people use checksum offload which
removes much of the end to end protection and leads to problems with iffy
cards and the like. This is well studied and known to be very problematic
but also CPU bugs as near OOM we will be paging hard and any L2 cache page
out/page table race from software or hardware would fit what it describes,
especially the transient nature
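
On the checksum strength point above, a small sketch in Python may help.
It compares the 16-bit ones'-complement Internet checksum (the kind TCP
uses, per RFC 1071) with CRC-32 from zlib on a payload in which two
adjacent 16-bit words have been exchanged. The helper names are made up
for illustration. Because the ones'-complement sum does not depend on
word order, it cannot see the swap, while CRC-32 does.

```python
import zlib


def internet_checksum(data: bytes) -> int:
    """16-bit ones'-complement checksum over data, zero-padded to an
    even length, in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold the carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def swap_words(data: bytes, i: int, j: int) -> bytes:
    """Return a copy of data with 16-bit words i and j exchanged."""
    buf = bytearray(data)
    buf[2 * i:2 * i + 2] = data[2 * j:2 * j + 2]
    buf[2 * j:2 * j + 2] = data[2 * i:2 * i + 2]
    return bytes(buf)


if __name__ == "__main__":
    payload = bytes(range(16))
    damaged = swap_words(payload, 0, 1)
    print("payload changed:      ", damaged != payload)
    print("Internet checksum hit:",
          internet_checksum(damaged) != internet_checksum(payload))
    print("CRC-32 hit:           ",
          zlib.crc32(damaged) != zlib.crc32(payload))
```

This only illustrates one class of error the weaker checksum misses; it
says nothing about which errors actually occur on a given card or link.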

Type III wrong block on PATA fits with the fact the block number isn't
protected and also the limits on the cache quality of drives/drive
firmware bugs.

For drivers/ide there are *lots* of problems with error handling so that
might be implicated (would want to do old v new ide tests on the same h/w
which would be very intriguing).

Stale data from disk cache I've seen reported, also offsets from FIFO
hardware bugs (The LOTR render farm hit the latter and had to avoid UDMA
to avoid a hardware bug)

Chunks of zero sounds like caches again, woul...

To: Alan Cox <alan@...>
Cc: Bruce Allen <ballen@...>, Linux Kernel Mailing List <linux-kernel@...>, Bruce Allen <bruce.allen@...>
Date: Friday, September 14, 2007 - 5:32 am

* Alan Cox (alan@lxorguk.ukuu.org.uk) [20070910 14:54]:

Alan,

Do you have any contacts? We're in contact directly with the

All our data is based on system-local probes (i.e. no network

Thanks, it's new information. I was planning to extend fsprobe
with locality information inside the buffers so that we can catch

We tried to “force” these corruptions out from their hiding
places on targeted systems, but we failed miserably. Currently we

That's interesting, I'll think about how to expose this.
Currently a single pass writes data only once, so I don't think

They seem to be popping more frequently on ARECA-based boxes. The
“software” is a running target as we gradually upgrade the

Most of our workhorses are 3ware controllers, the CPU nodes
usually have Intel SATA chips.

The fsprobe utility we run in the background on practically all
our boxes is available at http://cern.ch/Peter.Kelemen/fsprobe/ .
We have it deployed on several thousand machines to gather data.
I know that some other HEP institutes looked at it, but I have no
information on who's running it on how many boxes, let alone what
it found. I would be very much interested in whatever findings
people have.
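
For readers who have not seen this kind of probe, here is a minimal
Python sketch of the general write-then-verify idea. It is NOT the
fsprobe source and makes no claim about how fsprobe works internally;
the block size, function names, and block layout are all assumptions
chosen for illustration. Each block carries its own index in a header,
so a block that is intact but landed in the wrong place can be told
apart from one whose contents were damaged.

```python
import hashlib
import os
import struct

BLOCK = 4096  # assumed block size for this sketch


def make_block(index: int, seed: int) -> bytes:
    """Build a deterministic block whose 16-byte header records its own
    index and the seed of the pass that wrote it."""
    header = struct.pack("<QQ", index, seed)
    body = hashlib.sha256(header).digest()
    return (header + body * (BLOCK // len(body)))[:BLOCK]


def write_probe(path: str, nblocks: int, seed: int) -> None:
    """Write nblocks pattern blocks and ask the OS to flush them."""
    with open(path, "wb") as f:
        for i in range(nblocks):
            f.write(make_block(i, seed))
        f.flush()
        os.fsync(f.fileno())


def verify_probe(path: str, nblocks: int, seed: int):
    """Return a list of (index, kind) for each block that differs from
    what was written; kind is "misplaced" or "corrupt"."""
    problems = []
    with open(path, "rb") as f:
        for i in range(nblocks):
            got = f.read(BLOCK)
            if got == make_block(i, seed):
                continue
            kind = "corrupt"
            if len(got) == BLOCK:
                found_index, found_seed = struct.unpack("<QQ", got[:16])
                # An intact block from another position in the same pass.
                if (found_seed == seed and found_index < nblocks
                        and got == make_block(found_index, seed)):
                    kind = "misplaced"
            problems.append((i, kind))
    return problems


if __name__ == "__main__":
    import tempfile
    path = os.path.join(tempfile.mkdtemp(), "probe.bin")
    write_probe(path, 8, seed=1)
    print(verify_probe(path, 8, seed=1))
```

Note that a read-back done right after the write may be served from the
page cache rather than the disk, so a sketch like this would need
something extra (for example dropping caches or reading much later) to
exercise the full path to the platters.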

Peter

--
.+'''+. .+'''+. .+'''+. .+'''+. .+''
Kelemen Péter / \ / \ Peter.Kelemen@cern.ch
.+' `+...+' `+...+' `+...+' `+...+'
-
