Re: [PATCH] Update Documentation/md.txt to mention journaling won't help dirty+degraded case.

From: Pavel Machek
Date: Thursday, March 12, 2009 - 2:21 am

Not all block devices are suitable for all filesystems. In fact, some
block devices are so broken that reliable operation is pretty much
impossible. Document stuff ext2/ext3 needs for reliable operation.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..9c3d729
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortunately, none of the cheap USB/SD flash cards I have seen
+	behave like this, and they are unsuitable for all Linux filesystems
+	I know.
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite the next 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, the filesystem
+		won't notice that data in the "sectors" _around_ the
+		one you were trying to write to got trashed.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA ...
From: Jochen Voß
Date: Thursday, March 12, 2009 - 4:40 am

Hi,

   ^^^^
Shouldn't this be "Ext2"?

All the best,
Jochen
-- 
http://seehuhn.de/
--

From: Rob Landley
Date: Thursday, March 12, 2009 - 12:13 pm

I vaguely recall that when a write error _does_ occur, the behavior is to 
remount the filesystem read only?  (Is this VFS or per-fs?)

Is there any kind of hotplug event associated with this?

I'm aware write errors shouldn't happen, and by the time they do it's too late 



Somebody corrected me, it's not "the next" it's "the surrounding".

(Writes aren't always cleanly at the start of an erase block, so critical data 


These days instead of "atomic" it's better to think in terms of "barriers".  
Requesting a flush blocks until all the data written _before_ that point has 
made it to disk.  This wait may be arbitrarily long on a busy system with lots 
of disk transactions happening in parallel (perhaps because Firefox decided to 
garbage collect and is spending the next 30 seconds swapping itself back in to 


And here we're talking about ext2.  Does neither one know about write 
barriers, or does this just apply to ext2?  (What about ext4?)

Also I remember a historical problem that not all disks honor write barriers, 
because actual data integrity makes for horrible benchmark numbers.  Dunno how 
current that is with SATA, Alan Cox would probably know.

Rob
--

From: Pavel Machek
Date: Monday, March 16, 2009 - 5:28 am

This is not about barriers (that should be different topic). Atomic
write means that either whole sector is written, or nothing at all is
written. Because raid5 needs to update both master data and parity at

This document is about ext2. Ext3 can support barriers in

Sounds like broken disk, then. We should blacklist those.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rob Landley
Date: Monday, March 16, 2009 - 12:26 pm

Care to elaborate?  (When a filesystem is mounted RO, I'm not sure what 

Fun.

When "please do not turn off your playstation until game save completes" 
honestly seems like the best solution for making the technology reliable, 
something is wrong with the technology.


Good point, but I thought that's what journaling was for?

I'm aware that any flash filesystem _must_ be journaled in order to work 
sanely, and must be able to view the underlying erase granularity down to the 
bare metal, through any remapping the hardware's doing.  Possibly what's 
really needed is a "flash is weird" section, since flash filesystems can't be 
mounted on arbitrary block devices.

Although an "-O erase_size=128" option so they _could_ would be nice.  There's 
"mtdram" which seems to be the only remaining use for ram disks, but why there 
isn't an "mtdwrap" that works with arbitrary underlying block devices, I have 

It wasn't just one brand of disk cheating like that, and you'd have to ask him 
(or maybe Jens Axboe or somebody) whether the problem is still current.  I've 
been off in embedded-land for a few years now...

Rob
--

From: Pavel Machek
Date: Monday, March 23, 2009 - 3:45 am

Ok, can you suggest a patch? I believe remount-ro is already

Well, fsync() error reporting does not really work properly, but I
guess it will save you for the remount-ro case. So the data will be in


I believe journaling operates on assumption that "either whole sector

I don't think that works. Compactflash (etc) cards basically randomly
remap the data, so you can't really run flash filesystem over
compactflash/usb/SD card -- you don't know the details of remapping.
									Pavel
--

From: Goswin von Brederlow
Date: Monday, March 30, 2009 - 8:06 am

Actually raid5 should have no problem with a power failure during
normal operations of the raid. The parity block should get marked out
of sync, then the new data block should be written, then the new

The real problem comes in degraded mode. In that case the data block
(if present) and parity block must be written at the same time
atomically. If the system crashes after writing one but before writing
the other then the data block on the missing drive changes its
contents. And for example with a chunk size of 1MB and 16 disks that
could be 15MB away from the block you actually do change. And you can
not recover that after a crash as you need both the original and
changed contents of the block.

So writing one sector has the risk of corrupting another (for the FS)
totally unconnected sector. No amount of journaling will help
there. The raid5 would need to do journaling or use battery backed
cache.
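
The degraded-mode failure described above can be illustrated with a toy model. This is a minimal Python sketch, not MD's actual code: it assumes a hypothetical three-disk stripe (two data chunks plus XOR parity) with one data disk missing, and the names xor_blocks and reconstruct_d1 are invented for the illustration.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID-5 style parity arithmetic)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# Hypothetical stripe: d0 and d1 are data chunks, parity = d0 XOR d1.
# The disk holding d1 has failed, so d1 exists only implicitly.
d0_old = b"AAAA"
d1 = b"BBBB"
parity_old = xor_blocks(d0_old, d1)

def reconstruct_d1(d0_on_disk: bytes, parity_on_disk: bytes) -> bytes:
    """What the degraded array returns when the filesystem reads d1."""
    return xor_blocks(d0_on_disk, parity_on_disk)

# The filesystem rewrites d0 only; the array must update d0 and parity together.
d0_new = b"CCCC"
parity_new = xor_blocks(d0_new, d1)

both_written = reconstruct_d1(d0_new, parity_new)   # both writes reached the disks
crash_between = reconstruct_d1(d0_new, parity_old)  # crash after the data write only

print(both_written == d1)    # d1 is intact
print(crash_between == d1)   # d1 has changed although nobody wrote to it
```

In this model the filesystem never touched d1, so its journal holds no record that could restore it, which is the point made above.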

MfG
        Goswin
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 2:26 am

From: Pavel Machek
Date: Monday, August 24, 2009 - 2:31 am

Running journaling filesystem such as ext3 over flashdisk or degraded
RAID array is a bad idea: journaling guarantees no longer apply and
you will get data corruption on powerfail.

We can't solve it easily, but we should certainly warn the users. I
actually lost data because I did not understand these limitations...

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..80fa886
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,52 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+	An inherent problem with using flash as a normal block device
+	is that the flash erase size is bigger than most filesystem
+	sector sizes.  So when you request a write, it may erase and
+	rewrite some 64k, 128k, or even a couple megabytes on the
+	really _big_ ones.
+
+	If you lose power in the middle of that, the filesystem won't
+	notice that data in the "sectors" _around_ the one you were
+	trying to write to got trashed.
+
+	RAID-4/5/6 in degraded mode has same problem.
+
+
+Don't damage the old data on a ...
From: Florian Weimer
Date: Monday, August 24, 2009 - 4:19 am

You should make clear that the file lists per-file-system rules and

Isn't this by design?  In other words, if the metadata doesn't survive
non-atomic writes, wouldn't it be an ext3 bug?

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
--

From: Theodore Tso
Date: Monday, August 24, 2009 - 6:01 am

The only one that falls into that category is the one about not being
able to handle failed writes, and the way most failures take place,
they generally fail the ATOMIC-WRITES criterion in any case.  That is,
when a write fails, an attempt to read from that sector will generally
result in either (a) an error, or (b) data other than what was there

Part of the problem here is that "atomic-writes" is confusing; it
doesn't mean what many people think it means.  The assumption which
many naive filesystem designers make is that writes succeed or they
don't.  If they don't succeed, they don't change the previously
existing data in any way.  

So in the case of journalling, the assumption which gets made is that
when the power fails, the disk either writes a particular disk block,
or it doesn't.  The problem here is that, as with humans and animals,
death is not an event, it is a process.  When the power fails, the
system doesn't just stop functioning; the power on the +5 and +12 volt
rails starts dropping to zero, and different components fail at
different times.  Specifically, DRAM, being the most voltage sensitive,
tends to fail before the DMA subsystem, the PCI bus, and the hard drive
fail.  So as a result, garbage can get written out to disk as part of
the failure.  That's just the way hardware works.

Now consider a file system which does logical journalling.  It has
written to the journal, using a compact encoding, "the i_blocks field
is now 25, and i_size is 13000", and the journal transaction has
committed.  So now, it's time to update the inode on disk; but at that
precise moment, the power fails, and garbage is written to the
inode table.  Oops!  The entire sector containing the inode is
trashed.  But the only thing which is recorded in the journal is the
new value of i_blocks and i_size.  So a journal replay won't help file
systems that do logical block journalling.
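
The difference between the two journalling styles can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the sector is modelled as a dict of inode fields, i_mode and i_uid stand in for "everything else in the sector", and replay_logical and replay_physical are hypothetical helpers, not any real filesystem's replay code.

```python
def replay_logical(sector_on_disk: dict, record: dict) -> dict:
    """Logical replay: apply only the logged field changes to what is on disk."""
    repaired = dict(sector_on_disk)
    repaired.update(record)
    return repaired

def replay_physical(sector_on_disk: dict, logged_image: dict) -> dict:
    """Physical replay: overwrite the whole sector with the logged image."""
    return dict(logged_image)

# What the inode sector should contain after the update.
intended = {"i_blocks": 25, "i_size": 13000, "i_mode": 0o100644, "i_uid": 1000}

logical_record = {"i_blocks": 25, "i_size": 13000}  # compact encoding of the change
physical_record = dict(intended)                    # full copy of the sector

# Power fails during the in-place update: every field is now garbage.
trashed = {name: 0xDEADBEEF for name in intended}

after_logical = replay_logical(trashed, logical_record)
after_physical = replay_physical(trashed, physical_record)

print(after_physical == intended)  # the full image restores the sector
print(after_logical == intended)   # i_mode and i_uid are still garbage
```

The logged fields come back in both cases; only the physical image also repairs the fields that were never part of the transaction.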

Is that a file system "bug"?  Well, it's better to call that a
mismatch between the assumptions made of ...
From: Artem Bityutskiy
Date: Monday, August 24, 2009 - 7:55 am

Hi Theodore,

thanks for the insightful writing.

On 08/24/2009 04:01 PM, Theodore Tso wrote:


There is a thing called eMMC (embedded MMC) in the embedded world. You
may consider it as a non-removable MMC. This thing is a block device from
the Linux POV, and you may mount ext3 on top of it. And people do this.

The device seems to have a decent FTL, and does not look bad.

However, there are subtle things which mortals never think about. In
case of eMMC - power cuts may make some sectors unreadable - eMMC returns
ECC errors on reads. Namely, the sectors which were being written at
the very moment when the power cut happened may become unreadable.
And this makes ext3 refuse to mount the file-system, and makes
fsck.ext3 refuse the file-system. This should be fixable in
SW, but we did not find time to do this so far.

Anyway, my point is that documenting subtle things like this is a very
good thing to do, just because nowadays we are trying to use existing
software with flash-based storage devices, which may violate these
subtle assumptions, or introduce other ones.

Probably, Pavel did too good a job in generalizing things, and it could be
better to make a doc about HDD vs SSD or HDD vs Flash-based-storage.
Not sure. But the idea to document subtle FS assumption is good, IMO.

-- 
Best Regards,
Artem Bityutskiy (Артём Битюцкий)
--

From: Rob Landley
Date: Monday, August 24, 2009 - 3:30 pm

The standard procedure for this seems to be to cc: Jonathan Corbet on the 
discussion, make puppy eyes at him, and subscribe to Linux Weekly News.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 12:52 pm

Yep, and at that point you lost data. You had "silent data corruption"
from fs point of view, and that's bad. 

It will be probably very bad on XFS, probably okay on Ext3, and
certainly okay on Ext2: you do filesystem check, and you should be
able to repair any damage. So yes, physical journaling is good, but

If those filesystem assumptions were not documented, I'd call it

Actually, ext2 should be able to survive that, no? Error writing ->

Well... there's very big difference between harddrives and flash

There's a difference. In case of cosmic rays, hardware is clearly
buggy. I have one machine with bad DRAM (about 1 error in 2 days),
and I still use it. I will not complain if ext3 trashes that.

In case of degraded raid-5, even with perfect hardware, and with
ext3 on top of that, you'll get silent data corruption. Nice, eh?

Clearly, Linux is buggy there. It could be argued it is raid-5's

Well well well. Before I pulled that flash card, I assumed that doing
so is safe, because flashcard is presented as block device and ext3
should cope with sudden disk disconnects.

And I was wrong wrong wrong. (No one told me at the university. I guess
I should want my money back).

Plus note that it is not only my trashy laptop and one trashy MMC
card; every USB thumb drive I have seen is affected. (OTOH USB disks should
be safe AFAICT).

Ext3 is unsuitable for flash cards and RAID arrays, plain and
simple. It is not documented anywhere :-(. [ext2 should work better --

Can you suggest better patch? I'm not saying we should redesign ext3,


I hold ext2/ext3 to higher standards than other filesystems in the
tree. I'd not use XFS/VFAT etc. 

I would not want people to migrate towards XFS/VFAT, and yes I believe
XFSs/VFATs/... requirements should be documented, too. (But I know too
little about those filesystems).

If you can suggest better wording, please help me. But... those
requirements are non-trivial, commonly not met and the result is data
loss. It has to be documented ...
From: Ric Wheeler
Date: Monday, August 24, 2009 - 1:24 pm

I don't see why you think that. In general, fsck (for any fs) only 
checks metadata. If you have silent data corruption that corrupts things 
that are fixable by fsck, you most likely have silent corruption hitting 
things users care about like their data blocks inside of files. Fsck 
will not fix (or notice) any of that, that is where things like full 
data checksums can help.
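
As a rough illustration of what full data checksums buy you, here is a minimal Python sketch. It is not how any particular filesystem or array implements this: the in-memory files dict stands in for file contents, and checksum, record_checksums and scrub are hypothetical helper names. Only hashlib.sha256 is a real library call.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Content checksum stored alongside the data when it is written."""
    return hashlib.sha256(data).hexdigest()

def record_checksums(files: dict) -> dict:
    """Remember a checksum for every file while the data is known to be good."""
    return {name: checksum(data) for name, data in files.items()}

def scrub(files: dict, recorded: dict) -> list:
    """Re-read everything and return the names whose contents changed silently."""
    return sorted(name for name, data in files.items()
                  if checksum(data) != recorded.get(name))

files = {"a.txt": b"hello", "b.txt": b"world"}
recorded = record_checksums(files)

# Silent corruption inside a data block: metadata is untouched, so a
# metadata-only check has nothing to complain about.
files["b.txt"] = b"w\x00rld"

print(scrub(files, recorded))
```

A scrub like this detects the damage but cannot repair it; repair still needs a second copy or a backup.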

Also note (from first hand experience), unless you check and validate 
your data, you can have data corruptions that will not get flagged as IO 
I think that we need to help people understand the full spectrum of data 
concerns, starting with reasonable best practices that will help most 
people suffer *less* (not no) data loss. And make very sure that they 
are not falsely assured that by following any specific script that they 
can skip backups, remote backups, etc :-)

Nothing in our code in any part of the kernel deals well with every 

I think that the example and the response are both off base. If your 
head ever touches the platter, you won't be reading from a huge part of 
your drive ever again (usually, you have 2 heads per platter, 3-4 
platters, impact would kill one head and a corresponding percentage of 
your data).

No file system will recover that data although you might be able to 
scrape out some remaining useful bits and bytes.

More common causes of silent corruption would be bad DRAM in things like 
the drive write cache, hot spots (that cause adjacent track data 
errors), etc.  Note in this last case, your most recently written data 

It is hard for anyone to see the real data without looking in detail at 
large numbers of parts. Back at EMC, we looked at failures for lots of 
parts so we got a clear grasp on trends.  I do agree that flash/SSD 
parts are still very young so we will have interesting and unexpected 

Nothing is perfect. It is still a trade off between storage utilization 
(how much storage we give users for say 5 2TB drives), performance and 


I think that ...
From: Pavel Machek
Date: Monday, August 24, 2009 - 1:52 pm

Ok, but in case of data corruption, at least your filesystem does not

I can reproduce data loss with ext3 on flashcard in about 40
seconds. I'd not call that "odd event". It would be nice to handle
that, but that is hard. So ... can we at least get that documented


_Maybe_ SSDs, being HDD replacements are better. I don't know.

_All_ flash cards (MMC, USB, SD) had the problems. You don't need to
get clear grasp on trends. Those cards just don't meet ext3

"Nothing is perfect"?! That's design decision/problem in raid5/ext3. I
believe that should be at least documented. (And understand why ZFS is

And I still use my zaurus with crappy DRAM.

I would not trust raid5 array with my data, for multiple
reasons. The fact that degraded raid5 breaks ext3 assumptions should

The papers show failures in "once a year" range. I have "twice a
minute" failure scenario with flashdisks.

Not sure how often "degraded raid5 breaks ext3 atomicity" would bite,
but I bet it would be on "once a day" scale.

We should document those.
								Pavel
--

From: Ric Wheeler
Date: Monday, August 24, 2009 - 2:08 pm

Even worse, your data is potentially gone and you have not noticed 
it...  This is why array vendors and archival storage products do 
periodic scans of all stored data (read all the bytes, compared to a 

Part of documenting best practices is to put down very specific things 
that do/don't work. What I worry about is producing too much detail to 
be of use for real end users.

I have to admit that I have not paid enough attention to the specifics 
of your ext3 + flash card issue - is it the ftl stuff doing out of order 

Your statement is overly broad - ext3 on a commercial RAID array that 
does RAID5 or RAID6, etc has no issues that I know of.


Again, you say RAID5 without enough specifics.  Are you pointing just at 

Documentation is fine with sufficient, hard data....

ric


--

From: Pavel Machek
Date: Monday, August 24, 2009 - 2:25 pm

Well, I was trying to write for kernel audience. Someone can turn that

The problem is that flash cards destroy whole erase block on unplug,

Pull them hot.

[Some people try -osync to avoid data loss on flash cards... that will

If your commercial RAID array is battery backed, maybe. But I was

Degraded MD RAID5 on anything, including SATA, and including

Degraded MD RAID5 does not work by design; whole stripe will be
damaged on powerfail or reset or kernel bug, and ext3 can not cope
with that kind of damage. [I don't see why statistics should be
necessary for that; the same way we don't need statistics to see that
ext2 needs fsck after powerfail.]
									Pavel
--

From: Ric Wheeler
Date: Monday, August 24, 2009 - 3:05 pm

Kernel people who don't do storage or file systems will still need a 
summary - making very specific proposals based on real data and analysis 

Even if you unmount the file system? Why isn't this an issue with ext2?

Sounds like you want to suggest very specifically that journalled file 
systems are not appropriate for low end flash cards (which seems quite 

Pulling hot any device will cause data loss for recent data, even 

Many people in the real world who use RAID5 (for better or worse) use 

Degraded is one faulted drive while MD is doing a rebuild? And then you 
hot unplug it or power cycle? I think that would certainly cause failure 
What you are describing is a double failure and RAID5 is not double 
failure tolerant regardless of the file system type....

I don't want to be overly negative since getting good documentation is 
certainly very useful. We just need to document things correctly 
based on real data.

Ric


--

From: Pavel Machek
Date: Monday, August 24, 2009 - 3:41 pm

No, I'm talking hot unplug here. It is the issue with ext2, but ext2

Right. But in ext3 case you basically lose whole filesystem, because


You get single disk failure then powerfail (or reset or kernel
panic). I would not call that double failure. I agree that it will
mean problems for most filesystems.

Anyway, even if that can be called a double failure, this limitation
should be clearly documented somewhere.

...and that's exactly what I'm trying to fix.
								Pavel
--

From: Zan Lynx
Date: Monday, August 24, 2009 - 3:22 pm

Are you sure he isn't talking about how RAID must write all the data 
chunks to make a complete stripe and if there is a power-loss, some of 
the chunks may be written and some may not?

As I read Pavel's point he is saying that the incomplete write can be 
detected by the incorrect parity chunk, but degraded RAID-5 has no 
working parity chunk so the incomplete write would go undetected.

I know this is a RAID failure mode. However, I actually thought this was 
a problem even for an intact RAID-5. AFAIK, RAID-5 does not generally 
read the complete stripe and perform verification unless that is 
requested, because doing so would hurt performance and lose the entire 
point of the RAID-5 rotating parity blocks.
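
To make the detection argument concrete, here is a small Python sketch of a parity check over one stripe. It is a toy under stated assumptions, not MD's verification code: the stripe is two data chunks plus XOR parity, a missing disk is represented by None, and stripe_is_consistent is a hypothetical helper.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """XOR two equal-length chunks byte by byte."""
    return bytes(x ^ y for x, y in zip(a, b))

def stripe_is_consistent(chunks):
    """True/False for an intact stripe; None when a chunk is missing.

    With every chunk present, data XOR parity must be all zeroes. With one
    chunk missing, parity is needed to reconstruct it, so nothing is left
    over to check against.
    """
    if any(chunk is None for chunk in chunks):
        return None
    acc = bytes(len(chunks[0]))
    for chunk in chunks:
        acc = xor_blocks(acc, chunk)
    return not any(acc)

d0_old, d0_new, d1 = b"AAAA", b"CCCC", b"BBBB"
parity_old = xor_blocks(d0_old, d1)
parity_new = xor_blocks(d0_new, d1)

complete_write = stripe_is_consistent([d0_new, d1, parity_new])
torn_write = stripe_is_consistent([d0_new, d1, parity_old])
torn_degraded = stripe_is_consistent([d0_new, None, parity_old])

print(complete_write, torn_write, torn_degraded)
```

Even on the intact array the check only says that the stripe is inconsistent; it does not say which chunk is the stale one, and it only runs when something asks for verification.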

-- 
Zan Lynx
zlynx@acm.org

"Knowledge is Power.  Power Corrupts.  Study Hard.  Be Evil."
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 3:44 pm

Not sure; is not RAID expected to verify the array after unclean
shutdown?

									Pavel
--

From: Ric Wheeler
Date: Monday, August 24, 2009 - 5:34 pm

Not usually - that would take multiple hours of verification, roughly 
equivalent to doing a RAID rebuild since you have to read each sector of 
every drive (although you would do this at full speed if the array was 
offline, not throttled like we do with rebuilds).

That is part of the thing that scrubbing can do.

Note that once you find a bad bit of data, it is really useful to be 
able to map that back into a humanly understandable object/repair 
action. For example, map the bad data range back to metadata which would 
translate into a fsck run or a list of impacted files or directories....

Ric

--

From: david
Date: Monday, August 24, 2009 - 4:42 pm

a write to raid 5 doesn't need to write to all drives, but it does need to 
write to two drives (the drive you are modifying and the parity drive)

if you are not degraded and only succeed on one write you will detect the 
corruption later when you try to verify the data.

if you are degraded and only succeed on one write, then the entire stripe 
gets corrupted.

but this is a double failure (one drive + unclean shutdown)

if you have battery-backed cache you will finish the writes when you 
reboot.

if you don't have battery-backed cache (or are using software raid and 
crashed in the middle of sending the writes to the drive) you lose, but 
unless you disable write buffers and do sync writes (which nobody is going 
to do because of the performance problems) you will lose data in an 
unclean shutdown anyway.

--

From: Theodore Tso
Date: Monday, August 24, 2009 - 3:39 pm

Sure --- but name **any** filesystem that can deal with the fact that
128k or 256k worth of data might disappear when you pull out the flash

It's not just high end RAID arrays that have battery backups; I happen
to use a mid-range hardware RAID card that comes with a battery
backup.   It's just a matter of choosing your hardware carefully.

If your concern is that with Linux MD, you could potentially lose an
entire stripe in RAID 5 mode, then you should say that explicitly; but
again, this isn't a filesystem specific claim; it's true for all
filesystems.  I don't know of any file system that can survive having
a RAID stripe-shaped-hole blown into the middle of it due to a power
failure.

I'll note, BTW, that AIX uses a journal to protect against these sorts
of problems with software raid; this also means that with AIX, you
also don't have to rebuild a RAID 1 device after an unclean shutdown,
like you have to do with Linux MD.  This was on the EVMS's team
development list to implement for Linux, but it got canned after LVM
won out, lo those many years ago.  C'est la vie; but it's a problem which
is solvable at the RAID layer, and which is traditionally and
historically solved in competent RAID implementations.

							- Ted
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 4:00 pm

First... I consider myself quite competent in the os level, yet I did
not realize what flash does and what that means for data
integrity. That means we need some documentation, or maybe we should
refuse to mount those devices r/w or something.

Then to answer your question... ext2. You expect to run fsck after
unclean shutdown, and you expect to have to solve some problems with
it. So the way ext2 deals with the flash media actually matches what
the user expects. (*)

OTOH in ext3 case you expect consistent filesystem after unplug; and

Again, ext2 handles that in a way user expects it.

At least I was taught "ext2 needs fsck after powerfail; ext3 can

Yep, we should add journal to RAID; or at least write "Linux MD
*needs* an UPS" in big and bold letters. I'm trying to do the second
part.

(Attached is current version of the patch).

[If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are
generally unsafe to use without UPS/reliable connection/no kernel
bugs... then I may try to push that. I was not sure... maybe some
filesystem _can_ handle this kind of issues?]

								Pavel

(*) Ok, now... user expects to run fsck, but very advanced users may
not expect old data to be damaged. Certainly I was not advanced enough
user few months ago.

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..d1ef4d0
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,57 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so. Not all filesystems require all of these
+to be satisfied for safe operation.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems ...
From: david
Date: Monday, August 24, 2009 - 5:02 pm

the problem is that people have been preaching that journaling filesystems 
eliminate all data loss for no cost (or at worst for minimal cost).

they don't, they never did.

they address one specific problem (metadata inconsistency), but they do 
not address data loss, and never did (and for the most part the filesystem 
developers never claimed to)

depending on how much data gets lost, you may or may not be able to 
recover enough to continue to use the filesystem, and when your block 
device takes actions in larger chunks than the filesystem asked it to, 
it's very possible for seemingly unrelated data to be lost as well.
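
The "larger chunks than the filesystem asked for" effect can be shown with a toy flash card in Python. This is a deliberately naive model and an assumption, not a description of any real controller: it erases and reprograms the whole erase block in place, uses tiny illustrative sizes, and ToyFlashCard is a made-up class.

```python
SECTOR_SIZE = 16              # toy value; real cards expose 512-byte sectors
SECTORS_PER_ERASE_BLOCK = 8   # toy value; real erase blocks are 64k, 128k or more

class ToyFlashCard:
    """Naive card that rewrites a whole erase block for every sector write."""

    def __init__(self, n_sectors: int):
        # Give every sector distinct, recognisable contents.
        self.sectors = [bytes([i]) * SECTOR_SIZE for i in range(n_sectors)]

    def write(self, index: int, data: bytes, power_fails: bool = False) -> None:
        start = index - index % SECTORS_PER_ERASE_BLOCK
        block = range(start, start + SECTORS_PER_ERASE_BLOCK)
        staged = {i: self.sectors[i] for i in block}  # copy held in card RAM
        staged[index] = data
        for i in block:
            self.sectors[i] = b"\xff" * SECTOR_SIZE   # erase the whole block
        if power_fails:
            return                                    # staged copy is lost
        for i in block:
            self.sectors[i] = staged[i]               # reprogram the block

card = ToyFlashCard(16)
before = list(card.sectors)

# The filesystem asks for one sector; power is lost mid-operation.
card.write(3, b"N" * SECTOR_SIZE, power_fails=True)

damaged = [i for i in range(16) if card.sectors[i] != before[i]]
print(damaged)
```

In this model the filesystem asked for sector 3 only, yet every sector sharing its erase block is gone, while the next erase block is untouched.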

this is true for every single filesystem, nothing special about ext3

people somehow have the expectation that ext3 does the data equivalent of 
solving world hunger, it doesn't, it never did, and it never claimed to.

bashing it because it doesn't isn't fair. bashing XFS because it doesn't 
also isn't fair.

personally I don't consider the two filesystems to be significantly 
different in terms of the data loss potential. I think people are more 
aware of the potentials with XFS than with ext3, but I believe that the 
risk of loss is really about the same (and pretty much for the same 

you were taught wrong. the people making these claims for ext3 didn't 
understand what ext3 does and doesn't do.

--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 2:32 am

Well, in case of flashcard and degraded MD Raid5, ext3 does _not_
address metadata inconsistency problem. And that's why I'm trying to
fix the documentation. Current ext3 documentation says:

#Journaling Block Device layer
#-----------------------------
#The Journaling Block Device layer (JBD) isn't ext3 specific.  It was designed
#to add journaling capabilities to a block device.  The ext3 filesystem code
#will inform the JBD of modifications it is performing (called a transaction).
#The journal supports the transactions start and stop, and in case of a crash,
#the journal can replay the transactions to quickly put the partition back into
#a consistent state.

There's no mention that this does not work on flash cards and degraded



Cool. So... can we fix the documentation?
									Pavel
--

From: Ric Wheeler
Date: Monday, August 24, 2009 - 5:06 pm

So, would you be happy if ext3 fsck was always run on reboot (at least 
for flash devices)?


--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 2:34 am

For flash devices, MD Raid 5 and anything else that needs it; yes that
would make me happy ;-).

									Pavel
--

From: david
Date: Tuesday, August 25, 2009 - 8:34 am

the thing is that fsck would not fix the problem.

it may (if the data lost was metadata) detect the problem and tell you how 
many files you have lost, but if the data lost was all in a data file you 
would not detect it with a fsck

the only way you would detect the missing data is to read all the files on 
the filesystem and detect that the data you are reading is wrong.

but how can you tell if the data you are reading is wrong?

on a flash drive, your read can return garbage, but how do you know that 
garbage isn't the contents of the file?

on a degraded raid5 array you have no way to test data integrity, so when 
the missing drive is replaced, the rebuild algorithm will calculate the 
appropriate data to make the parity calculations work out and write 
garbage to that drive.
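
The rebuild point can be sketched in a few lines of Python. This is only an illustration under assumptions: a hypothetical stripe of three data blocks plus one XOR parity block, with made-up contents; `xor_blocks` and the variable names are invented here and are not taken from the MD driver.

```python
from functools import reduce

def xor_blocks(*blocks: bytes) -> bytes:
    """XOR equal-length blocks byte by byte (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# Healthy stripe: three data blocks plus parity (illustrative contents).
d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
parity = xor_blocks(d0, d1, d2)

# The disk holding d2 fails: the array is degraded and d2 now exists
# only implicitly, as the XOR of the surviving blocks.
assert xor_blocks(d0, d1, parity) == d2

# Interrupted stripe update: the new d0 reaches the disk, but power is
# lost before the matching parity update is written.
d0_torn = b"ZZZZ"

# A replacement disk arrives; the rebuild recomputes d2 from what is on disk.
d2_rebuilt = xor_blocks(d0_torn, d1, parity)

# The stripe is self-consistent again, yet d2 (which nobody wrote to)
# has been replaced with garbage.
parity_consistent = xor_blocks(d0_torn, d1, d2_rebuilt) == parity
print(parity_consistent, d2_rebuilt == d2)   # prints: True False
```

The rebuilt stripe passes any parity check, which is why nothing at the array level can flag the damage afterwards.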

David Lang
--

From: Rik van Riel
Date: Tuesday, August 25, 2009 - 8:32 pm

Sorry, but that just shows your naivete.

Metadata takes up such a small part of the disk that fscking
it and finding it to be OK is absolutely no guarantee that
the data on the filesystem has not been horribly mangled.

Personally, what I care about is my data.

The metadata is just a way to get to my data, while the data
is actually important.

-- 
All rights reversed.
--

From: Pavel Machek
Date: Wednesday, August 26, 2009 - 4:17 am

Personally, I care about metadata consistency, and ext3 documentation
suggests that journal protects its integrity. Except that it does not
on broken storage devices, and you still need to run fsck there.

How you protect your data is another question, but the ext3
documentation does not claim that the journal protects it, so that's up
to the user I guess.
									Pavel
--

From: david
Date: Wednesday, August 26, 2009 - 4:29 am

as the ext3 authors have stated many times over the years, you still need 
to run fsck periodically anyway.

what the journal gives you is a reasonable chance of skipping it when the 
system crashes and you want to get it back up ASAP.

--

From: Pavel Machek
Date: Wednesday, August 26, 2009 - 6:10 am

Where is that documented? I very much agree with that, but when suse10
switched periodic fsck off, I could not find any docs to show that it
is a bad idea.
								Pavel
--

From: david
Date: Wednesday, August 26, 2009 - 6:43 am

linux-kernel mailing list archives.

--

From: Theodore Tso
Date: Wednesday, August 26, 2009 - 11:02 am

Probably from some 6-8 years ago, in e-mail postings that I made.  My
argument has always been that PC-class hardware is crap, and it's a
Really Good Idea to periodically check the metadata because corruption
there can end up causing massive data loss.  The main problem is that
doing it at reboot time really hurt system availability, and "after 20
reboots (plus or minus)" resulted in fsck checks at wildly varying
intervals depending on how often people reboot.

What I've been recommending for some time is that people use LVM, and
run fsck on a snapshot every week or two, at some convenient time when
the system load is at a minimum.  There is an e2croncheck script in
the e2fsprogs sources, in the contrib directory; it's short enough
that I'll attach it here.

Is it *necessary*?  In a world where hardware is perfect, no.  In a
world where people don't bother buying ECC memory because it's 10%
more expensive, and PC builders use the cheapest possible parts --- I
think it's a really good idea.

						- Ted

P.S.  Patches so that this shell script takes a config file, and/or
parses /etc/fstab to automatically figure out which filesystems should
be checked, are greatly appreciated.  Getting distros to start
including this in their e2fsprogs packaging scripts would also be
greatly appreciated.

#!/bin/sh
#
# e2croncheck -- run e2fsck automatically out of /etc/cron.weekly
#
# This script is intended to be run by the system administrator 
# periodically from the command line, or to be run once a week
# or so by the cron daemon to check a mounted filesystem (normally
# the root filesystem, but it could be used to check other filesystems
# that are always mounted when the system is booted).
#
# Make sure you customize "VG" so it is your LVM volume group name, 
# "VOLUME" so it is the name of the filesystem's logical volume, 
# and "EMAIL" to be your e-mail address
#
# Written by Theodore Ts'o, Copyright 2007, 2008, 2009.
#
# This file may be redistributed under the terms ...
--

From: Eric Sandeen
Date: Wednesday, August 26, 2009 - 11:28 pm

Aside ... can we default mkfs.ext3 to not set a mandatory fsck interval 
then? :)

--

From: Pavel Machek
Date: Monday, November 9, 2009 - 1:53 am

Well, in SUSE11-or-so, the distro stopped periodic fscks, silently :-(. I
believed that it was a really bad idea at that point, but because I
could not find a piece of documentation recommending them, I lost the
argument.
									Pavel

--

From: Theodore Tso
Date: Monday, November 9, 2009 - 7:05 am

It's an engineering trade-off.  If you have perfect memory that
never has cosmic-ray hiccups, and hard drives that never write data to
the wrong place, etc. then you don't need periodic fsck's.

If you do have imperfect hardware, the question then is how imperfect
your hardware is, and how frequently it introduces errors.  If you
check too frequently, though, users get upset, especially when it
happens at the most inconvenient time (when you're trying to recover
from unscheduled downtime by rebooting); if you check too infrequently
then it doesn't help you too much since too much data gets damaged
before fsck notices.

So these days, what I strongly recommend is that people use LVM
snapshots, and schedule weekly checks during some low usage period
(e.g., 3am on Saturdays), using something like the e2croncheck shell
script.

						- Ted
--

From: Andreas Dilger
Date: Monday, November 9, 2009 - 8:58 am

There was another script written to do this that handled e2fsck, reiserfsck
and xfs_check, detecting all volume groups automatically, along with e.g.
validating that the snapshot volume doesn't exist before starting the check
(which may indicate that the previous e2fsck is still running), and not
running while on battery power.

The last version was in the thread "forced fsck (again?)" dated 2008-01-28.
Would it be better to use that one?  In that thread we discussed not clobbering
the last checked time as e2croncheck does, so the admin can see how long it
was since the filesystem was last checked.

Maybe it makes more sense to get the lvcheck script included into the
util-linux-ng or lvm2 packages, and have it added automatically to the
cron.weekly directory?  Then the distros could disable the at-boot checking
safely, while still being able to detect corruption caused by
cables/RAM/drives/software.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.

--

From: Pavel Machek
Date: Sunday, August 30, 2009 - 12:03 am

That's not where fs documentation belongs :-(.
									Pavel
--

From: Theodore Tso
Date: Wednesday, August 26, 2009 - 5:28 am

Caring about metadata consistency and not data is just weird, I'm
sorry.  I can't imagine anyone who actually *cares* about what they
have stored, whether it's digital photographs of a child taking a first
step, or their thesis research, caring more about the metadata
than the data.  Giving advice that pretends that most users have that
priority is Just Wrong.

That's why what we should document is that people should avoid broken
storage devices, and give advice on how to use RAID properly.  At the end
of the day, getting people to switch from ext2 to ext3 on some
misguided notion that this way, they'll know when their metadata is
safe (at least in the power failure case; but not the system hangs and
you have to reboot case), and getting them to ignore the question of
why are they using a broken storage device in the first place, is
Documentation malpractice.

						- Ted
--

From: Rob Landley
Date: Wednesday, August 26, 2009 - 11:06 pm

I thought the reason for that was that if your metadata is horked, further 
writes to the disk can trash unrelated existing data because it's lost track 
of what's allocated and what isn't.  So back when the assumption was "what's 
written stays written", then keeping the metadata sane was still darn 
important to prevent normal operation from overwriting unrelated existing 
data.

Then Pavel notified us of a situation where interrupted writes to the disk can 
trash unrelated existing data _anyway_, because the flash block size on the 16 
gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks 
it's 4k or smaller.  It seems like what _broke_ was the assumption that the 
filesystem block size >= the disk block size, and nobody noticed for a while.  
(Except the people making jffs2 and friends, anyway.)

Today we have cheap plentiful USB keys that act like hard drives, except that 
their write block size isn't remotely the same as hard drives', but they 
pretend it is, and then the block wear levelling algorithms fuzz things 
further.  (Gee, a drive controller lying about drive geometry, the scsi crowd 
should feel right at home.)

Now Pavel's coming back with a second situation where RAID stripes (under 
certain circumstances) seem to have similar granularity issues, again breaking 
what seems to be the same assumption.  Big media use big chunks for data, and 
media is getting bigger.  It doesn't seem like this problem is going to 
diminish in future.

I agree that it seems like a good idea to have BIG RED WARNING SIGNS about 
those kinds of media and how _any_ journaling filesystem doesn't really help 
here.  So specifically documenting "These kinds of media lose unrelated random 
data if writes to them are interrupted, journaling filesystems can't help with 
this and may actually hide the problem, and even an fsck will only find 
corrupted metadata not lost file contents" seems kind of useful.

That said, ext3's assumption that filesystem block size ...
--

From: david
Date: Wednesday, August 26, 2009 - 11:54 pm

actually, you don't know if your USB key works that way or not. Pavel has 
some that do, that doesn't mean that all flash drives do

when you do a write to a flash drive you have to do the following items

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. write the updated data to the flash

5. update the flash translation layer to point reads at the new location 
instead of the old location.

now if the flash drive does things in this order you will not lose any 
previously written data.

if the flash drive does step 5 before it does step 4, then you have a 
window where a crash can lose data (and no, btrfs won't survive having a 
large chunk of data just disappear any better)

it's possible that some super-cheap flash drives skip having a flash 
translation layer entirely, on those the process would be

1. read the old data into ram

2. merge the new write into the data in ram

3. erase the old data

4. write the new data

this obviously has a significant data loss window.
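
the two orderings above can be compared with a toy model. This Python sketch is a simplification under assumptions, not a description of any real controller: the flash is a list of eraseblocks, the translation layer is a dict, and `write_sector`, `PowerFail` and `demo` are names invented for the illustration.

```python
class PowerFail(Exception):
    """Raised to simulate power loss part-way through a write."""

ERASED = None  # contents of an erased eraseblock

def write_sector(flash, ftl, logical, sector, value, *, safe, fail_before=None):
    """Update one sector of a logical eraseblock.

    flash: list of physical eraseblocks (each a list of sectors, or ERASED)
    ftl:   dict mapping logical eraseblock -> physical index
    fail_before: name of the step at which power is lost, if any
    """
    def step(name):
        if fail_before == name:
            raise PowerFail(name)

    old = ftl[logical]
    merged = list(flash[old])          # read the old eraseblock
    merged[sector] = value             # merge the incoming write
    if safe:
        new = flash.index(ERASED)      # allocate an empty eraseblock
        step("write")
        flash[new] = merged            # write the updated data
        step("remap")
        ftl[logical] = new             # point reads at the new location
        flash[old] = ERASED            # old copy is erased only afterwards
    else:
        flash[old] = ERASED            # erase the old data first
        step("write")
        flash[old] = merged            # write the new data in place

def demo(safe, fail_before="write"):
    """Run one sector update with power lost at `fail_before`; return what a
    later read of the logical eraseblock would see."""
    flash = [["a", "b", "c", "d"], ERASED]
    ftl = {0: 0}
    try:
        write_sector(flash, ftl, 0, 1, "B", safe=safe, fail_before=fail_before)
    except PowerFail:
        pass
    return flash[ftl[0]]

print(demo(safe=True))    # ['a', 'b', 'c', 'd']: the old data survives
print(demo(safe=False))   # None: the whole eraseblock is gone
```

in the write-then-remap ordering the interrupted write is lost but the neighbouring sectors are untouched; in the erase-first ordering every sector sharing the eraseblock goes with it.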

but if the device doesn't have a flash translation layer, then repeated 
writes to any one sector will kill the drive fairly quickly. (updates to 
the FAT would kill the sectors the FAT, journal, root directory, or 
superblock lives in due to the fact that every change to the disk requires 

I think an update to the documentation is a good thing (especially after 
learning that a raid 6 array that has lost a single disk can still be 
corrupted during a powerfail situation), but I also agree that Pavel's 

I thought that that assumption was in the VFS layer, not in any particular 
filesystem


David Lang
--

From: Rob Landley
Date: Thursday, August 27, 2009 - 12:34 am

Pretty much all the ones that present a USB disk interface to the outside 
world and then thus have to do hardware levelling.  Here's Valerie Aurora on 
the topic:



That's what something like jffs2 will do, sure.  (And note that mounting those 
suckers is slow while it reads the whole disk to figure out what order to put 
the chunks in.)

However, your average consumer level device A) isn't very smart, B) is judged 
almost entirely by price/capacity ratio and thus usually won't even hide 
capacity for bad block remapping.  You expect them to have significant hidden 

I've never seen one that presented a USB disk interface that _didn't_ do this.  
(Not that this observation means much.)  Neither the windows nor the Macintosh 
world is calling for this yet.  Even the Linux guys barely know about it.  And 
these are the same kinds of manufacturers that NOPed out the flush commands to 

Yup.  It's got enough of one to get past the warranty, but beyond that they're 

The VFS layer cares about how to talk to the backing store?  I thought that 
was the filesystem driver's job...


Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: david
Date: Friday, August 28, 2009 - 7:37 am

I am not saying that all devices get this right (not by any means), but I 
_am_ saying that devices with wear-leveling _can_ avoid this problem 
entirely

you do not need to do a log-structured filesystem. all you need to do is 
to always write to a new block rather than re-writing a block in place.

even if the disk only does a 12-block rotation for its wear leveling, 
that is enough for it to not lose other data when you write. to lose 
data you have to be updating a block in place by erasing the old one 
first. _anything_ that writes the data to a new location before it erases 
the old location will prevent you from losing other data.

I'm all for documenting that this problem can and does exist, but I'm not 
in agreement with documentation that states that _all_ flash drives have 
this problem because (with wear-leveling in a flash translation layer on 
the device) it's not inherent to the technology. so even if all existing 
flash devices had this problem, there could be one released tomorrow that 
didn't.

this is like the problem that flash SSDs had last year that could cause 
them to stall for up to a second on write-heavy workloads. it went from a 
problem that almost every drive for sale had (and something that was 
generally accepted as being a characteristic of SSDs), to being extinct in 
about one product cycle after the problem was identified.

I think this problem will also disappear rapidly once it's publicised.

so what's needed is for someone to come up with a way to test this, let 
people test the various devices, find out how broad the problem is, and 
publicise the results.

personally, I expect that the better disk-replacements will not have a 
problem with this.

I would also be surprised if the larger thumb drives had this problem.

if a flash eraseblock can be used 100k times, then if you use FAT on a 16G 
drive and write 1M files and update the FAT after each file (like you 
would with a camera), the block the FAT is on will die after ...
--

From: Pavel Machek
Date: Sunday, August 30, 2009 - 12:19 am

That would need two erases per single sector written, no? Erase is in
millisecond range, so the performance would be just way too bad :-(.
									Pavel

--

From: david
Date: Sunday, August 30, 2009 - 5:48 am

no, it only needs one erase

if you don't have a pool of pre-erased blocks, then you need to do an 
erase of the new block you are allocating (before step 4)

if you do have a pool of pre-erased blocks, then you don't have to do any 
erase of the data blocks until after step 5 and you do the erase when you 
add the old data block to the pool of pre-erased blocks later.

in either case the requirements of wear leveling require that the flash 
translation layer update its records to show that an additional write 
took place.

what appears to be happening on some cheap devices is that they do the 
following instead

1. allocate an empty eraseblock to put the data on

2. read the old eraseblock

3. merge the incoming write to the eraseblock

4. erase the old eraseblock

5. write the updated data to the flash

I don't know where in (or after) this process they update the 
wear-leveling/flash translation layer info.

with this algorithm, if the device loses power between step 4 and step 5 
you lose all the data on the eraseblock.

with deferred erasing of blocks, the safer algorithm is actually the 
faster one (up until you run out of your pool of available eraseblocks, at 
which time it slows down to the same speed as the unreliable one).

most flash drives are fairly slow to write to in any case.

even the Intel X25M drives are in the same ballpark as rotating media for 
writes. as far as I know only the X25E SSD drives are faster to write to 
than rotating media, and most flash drives are _far_ slower.

David Lang
--

From: Rob Landley
Date: Wednesday, August 26, 2009 - 10:27 pm

Hence wanting documentation properly explaining the situation, yes.

Often the people writing the documentation aren't the people who know the most 
about the situation, but the people who found out they NEED said 
documentation, and post errors until they get sufficient corrections.

In which case "you're wrong, it's actually _this_" is helpful, and "you're 

Are you saying ext3 should default to data=journal then?

It seems that the default journaling only handles the metadata, and people 
seem to think that journaled filesystems exist for a reason.

There seems to be a lot of "the guarantees you think a journal provides aren't 
worth anything, so the fact there are circumstances under which it doesn't 
provide them isn't worth telling anybody about" in this thread.  So we 
shouldn't bother with journaled filesystems?  I'm not sure what the intended 
argument is here...

I have no clue what the finished documentation on this issue should look like 
either.  But I want to read it.

Rob
--

From: Theodore Tso
Date: Monday, August 24, 2009 - 5:08 pm

But if the 256k hole is in data blocks, fsck won't find a problem,
even with ext2.

And if the 256k hole is the inode table, you will *still* suffer
massive data loss.  Fsck will tell you how badly screwed you are, but
it doesn't "fix" the disk; most users don't consider questions of the
form "directory entry <precious-thesis-data> points to trashed inode,

You don't get a consistent filesystem with ext2, either.  And if your
claim is that several hundred lines of fsck output detailing the
filesystem's destruction somehow makes things all better, I suspect
most users would disagree with you.

In any case, depending on where the flash was writing at the time of
the unplug, the data corruption could be silent anyway.

Maybe this came as a surprise to you, but anyone who has used a
compact flash in a digital camera knows that you ***have*** to wait
until the led has gone out before trying to eject the flash card.  I
remember seeing all sorts of horror stories from professional
photographers about how they lost an important wedding day's worth of
pictures with the attendant commercial loss, on various digital
photography forums.  It tends to be the sort of mistake that digital
photographers only make once.

(It's worse with people using Digital SLR's shooting in raw mode,
since it can take upwards of 30 seconds or more to write out a 12-30MB
raw image, and if you eject at the wrong time, you can trash the
contents of the entire CF card; in the worst case, the Flash
Translation Layer data can get corrupted, and the card is completely
ruined; you can't even reformat it at the filesystem level, but have
to get a special Windows program from the CF manufacturer to --maybe--
reset the FTL layer.  Early CF cards were especially vulnerable to
this; more recent CF cards are better, but it's a known failure mode
of CF cards.)

						- Ted
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 2:42 am

Well it will fix the disk in the end. And no, "directory entry
<precious-thesis-data> points to trashed inode, may I delete directory
entry?" is not _terribly_ helpful, but it is slightly helpful and

It actually comes as a surprise to me. Actually yes and no. I know that
digital cameras use VFAT, so pulling a CF card out of one may do bad
things, unless I run fsck.vfat afterwards. If a digital camera were using
ext3, I'd expect it to be safely pullable at any time.

Would an IBM Microdrive make any difference there?

Anyway, it was not known to me. Rather than claiming "everyone knows"
(when clearly very few people really understand all the details), can
we simply document that?
									Pavel
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 6:37 am

I really think that the expectation is that all OS's (windows, mac, even your ipod) 
teach you not to hot unplug a device with any file system. Users have an 
"eject" or "safe unload" in windows, your iPod tells you not to power off or 
disconnect, etc.

I don't object to making that general statement - "Don't hot unplug a device 
with an active file system or actively used raw device" - but would object to 
the overly general statement about ext3 not working on flash, RAID5 not working, 
etc...

ric



--

From: Alan Cox
Date: Tuesday, August 25, 2009 - 6:42 am


The overall general statement for all media and all OS's should be

"Do you have a backup, have you tested it recently"

--

From: Rob Landley
Date: Wednesday, August 26, 2009 - 8:16 pm

It might be nice to know when you _needed_ said backup, and when you shouldn't 
re-backup bad data over it, because your data corruption actually got detected 
before then.

And maybe a pony.

Rob
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 2:15 pm

You can object any way you want, but running ext3 on flash or MD RAID5
is stupid:

* ext2 would be faster

* ext2 would provide better protection against powerfail.

"ext3 works on flash and MD RAID5, as long as you do not have
powerfail" seems to be the accurate statement, and if you don't need
to protect against powerfails, you can just use ext2.
								Pavel
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 3:42 pm

Not true - that is true today with or without journals as we have discussed in 
great detail. Including specifically ext2.

Basically, any file system (Linux, windows, OSX, etc) that writes into the page 

Not true in the slightest, you continue to ignore the ext2/3/4 developers 

Strange how your personal preference is totally out of sync with the entire 
enterprise class user base.

ric


--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 3:51 pm

No, not ext3 on SATA disk with barriers on and proper use of
fsync(). I actually tested that.
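
For readers wondering what "proper use of fsync()" looks like from an application, here is a minimal Python sketch of one common pattern (write to a temporary file, fsync it, rename it over the target, then fsync the directory). It is an illustration, not the test I ran; `durable_write` and the ".tmp" suffix are names chosen for the example, and the whole thing only helps if the block device honours flushes, which is the property under discussion.

```python
import os

def durable_write(path: str, data: bytes) -> None:
    """Replace `path` with `data`, returning only after the kernel has been
    asked to push both the contents and the rename to stable storage."""
    tmp = path + ".tmp"                      # hypothetical temp-file naming
    fd = os.open(tmp, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:                          # os.write may write partially
            view = view[os.write(fd, view):]
        os.fsync(fd)                         # flush the file's data
    finally:
        os.close(fd)
    os.replace(tmp, path)                    # atomic rename over the target
    # fsync the containing directory so the rename itself is durable
    # (opening a directory read-only like this works on Linux).
    dirfd = os.open(os.path.dirname(os.path.abspath(path)), os.O_RDONLY)
    try:
        os.fsync(dirfd)
    finally:
        os.close(dirfd)
```

If fsync() returns success and the device then drops or corrupts the data anyway, no ordering of calls in the application can make up for it.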

Yes, I should be able to hotunplug SATA drives and expect the data

I know I will lose data. Both ext2 and ext3 will lose data on
flashdisk. (That's what I'm trying to document). But... what is the
benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least
protects you against kernel panic. MD RAID5 is in software, so... that

Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly
what I'm trying to document here.
								Pavel
--

From: david
Date: Tuesday, August 25, 2009 - 4:03 pm

the block device can lose data, it has absolutely nothing to do with the 

a MD raid array that's degraded to the point where there is no redundancy 
is dangerous, but I don't think that any of the enterprise users would be 
surprised.

I think they will be surprised that it's possible that a prior failed 
write that hasn't been scrubbed can cause data loss when the array later 
degrades.

David Lang
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 4:29 pm

Cool, so Ted's "raid5 has highly undesirable properties" is actually
pretty accurate. Some raid person should write more detailed README,
I'd say...
									Pavel

--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 4:03 pm

You can and will lose data (even after fsync) with any type of storage at some 
rate. What you are missing here is that data loss needs to be measured in hard 
numbers - say percentage of installed boxes that have config X that lose data.

Strangely enough, this is what high end storage companies do for a living, 
configure, deploy and then measure results.

A long winded way of saying that just because you can induce data failure by 
recreating an event that happens almost never (power loss while rebuilding a 
RAID5 group specifically) does not mean that this makes RAID5 with ext3 unreliable.

What does happen all of the time is single bad sector IO's and (less often, but 
more than your scenario) complete drive failures. In both cases, MD RAID5 will 
repair that damage before a second failure (including a power failure) happens 
99.99% of the time.

I can promise you that hot unplugging and replugging a S-ATA drive will also 
lose you data if you are actively writing to it (ext2, 3, whatever).

Your micro data loss benchmark is not a valid reflection of the wider 
experience and I fear that you will cause people to lose more data, not less, 

Faster recovery time on any normal kernel crash or power outage.  Data loss 

Using MD RAID5 will save more people from commonly occurring errors (sector and 
disk failures) than will lose it because of your rebuild interrupted by a power 
failure worry.

What you are trying to do is to document a belief you have that is not borne out 
by real data across actual user boxes running real work loads.

Unfortunately, getting that data is hard work and one of the things that we as a 
community do especially poorly.  All of the data (secret data from my past and 
published data by NetApp, Google, etc) that I have seen would directly 
contradict your assertions and you will cause harm to our users with this.

Ric


--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 4:26 pm

I'm talking "by design" here.

I will lose data even on SATA drive that is properly powered on if I

I can promise you that running S-ATA drive will also lose you data,
even if you are not actively writing to it. Just wait 10 years; so
what is your point?

But ext3 is _designed_ to preserve fsynced data on SATA drive, while
it is _not_ designed to preserve fsynced data on MD RAID5.


No, because you'll actually repair the ext2 with fsck after the kernel
crash or power outage. Data loss will not be equivalent; in particular
you'll not lose data written _after_ power outage to ext2.
									Pavel
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 4:40 pm

You are dead wrong.

For RAID5 arrays, you assume that you have a hard failure and a power outage 
before you can rebuild the RAID (order of hours at full tilt).

The failure rate of S-ATA drives is a few percent of the 
installed base per year. Some drives will fail faster than that (bad parts, bad 
environmental conditions, etc).

Why don't you hold all of your most precious data on that single S-ATA drive for 
five years on one box and put a second copy on a small RAID5 with ext3 for the 
same period?

Repeat experiment until you get up to something like google scale or the other 
papers on failures in national labs in the US and then we can have an informed 

I lost a s-ata drive 24 hours after installing it in a new box. If I had MD 
RAID5, I would not have lost any.

My point is that you fail to take into account the rate of failures of a given 




As Ted (who wrote fsck for ext*) said, you will lose data in both.  Your 
argument is not based on fact.

You need to actually prove your point, not just state it as fact.

ric
--

From: david
Date: Tuesday, August 25, 2009 - 4:48 pm

me too, in fact just after I copied data from a raid array to it so that I 
could rebuild the raid array differently :-(

David Lang
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 4:53 pm

I'm not interested in discussing statistics with you. I'd rather discuss
fsync() and storage design issues.

ext3 is designed to work on single SATA disks, and it is not designed
to work on flash cards/degraded MD RAID5s, as Ted acknowledged.

Because that fact is non obvious to the users, I'd like to see it
documented, and now have nice short writeup from Ted.

If you want to argue that ext3/MD RAID5/no UPS combination is still
less likely to fail than single SATA disk given part failure
probabilities, go ahead and present nice statistics. It's just that I'm
not interested in them.
									Pavel

--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:11 pm

That is a proven fact and a well published one. If you choose to ignore 
published work (and common sense) that RAID makes you lose data less than 
non-RAID, why should anyone care what you write?

Ric

--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:31 pm

I will let Ted clarify his text on his own, but the quoted text says "... have 
potential...".

Why not ask Neil if he designed MD to not work properly with ext3?

Ric

--

From: Theodore Tso
Date: Tuesday, August 25, 2009 - 6:00 pm

So let me clarify by saying the following things.   

1) Filesystems are designed to expect that storage devices have
certain properties.  These include returning the same data that you
wrote, and that an error when writing a sector, or a power failure
when writing sector, should not be amplified to cause collateral
damage with previously successfully written sectors.

2) Degraded RAID 5/6 arrays do not meet these properties.
Neither do cheap flash drives.  This increases the chances you can
lose, bigtime.  

3) Does that mean that you shouldn't use ext3 on RAID drives?  Of
course not!  First of all, Ext3 still saves you against kernel panics
and hangs caused by device driver bugs or other kernel hangs.  You
will lose less data, and avoid needing to run a long and painful fsck
after a forced reboot, compared to if you used ext2.  You are making
an assumption that the only time running the journal takes place is
after a power failure.  But if the system hangs, and you need to hit
the Big Red Switch, or if you are using the system in a Linux High
Availability setup and the ethernet card fails, so the STONITH ("shoot
the other node in the head") system forces a hard reset of the system,
or you get a kernel panic which forces a reboot, in all of these cases
ext3 will save you from a long fsck, and it will do so safely.

Secondly, what's the probability of a failure that causes the RAID array to
become degraded, followed by a power failure, versus a power failure
while the RAID array is not running in degraded mode?  Hopefully you
are running with the RAID array in full, proper running order a much
larger percentage of the time than running with the RAID array in
degraded mode.  If not, the bug is with the system administrator!

If you are someone who tends to run for long periods of time in
degraded mode --- then better get a UPS.  And certainly if you want to
avoid the chances of failure, periodically scrubbing the disks so you
detect hard drive failures early, instead of ...
From: Pavel Machek
Date: Tuesday, August 25, 2009 - 6:16 pm

Actually... ext3 + MD RAID5 will still have a problem on kernel
panic. MD RAID5 is implemented in software, so if kernel panics, you
can still get inconsistent data in your array.

I mostly agree with the rest.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Theodore Tso
Date: Tuesday, August 25, 2009 - 7:55 pm

Only if the MD RAID array is running in degraded mode (and again, if
the system is in this state for a long time, the bug is in the system
administrator).  And even then, it depends on how the kernel dies.  If
the system hangs due to some deadlock, or we get an OOPS that kills a
process while still holding some locks, and that leads to a deadlock,
it's likely the low-level MD driver can still complete the stripe
write, and no data will be lost.  If the kernel ties itself in knots
due to running out of memory, and the OOM handler is invoked, someone
hitting the reset button to force a reboot will also be fine.

If the RAID array is degraded, and we get an oops in interrupt
handler, such that the system is immediately halted --- then yes, data
could get lost.  But there are many system crashes where the software
RAID's ability to complete a stripe write would not be compromised.

       	       	  	     	    	  	- Ted
--

From: Ric Wheeler
Date: Wednesday, August 26, 2009 - 6:37 am

Just to add some real world data, Bianca Schroeder published a really good paper 
that looks at failures in national labs which has actual measured disk failures:

http://www.cs.cmu.edu/~bianca/fast07.pdf

Her numbers showed various rates of failures, but depending on the box, drive 
type, etc, they lost between 1-6% of the install drives each year.

There is also a good paper from Google:

http://labs.google.com/papers/disk_failures.html

Both of the above are largely linux boxes.

And several other FAST papers on failures in commercial RAID boxes, most notably 
by NetApp.

If reading papers is not at the top of your list of things to do, just skim 
through and look for the tables on disk failures, etc. which have great 
measurements of what really failed in these systems...

Ric





--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 6:15 pm

I agree with the whole write up outside of the above - degraded RAID 
does meet this requirement unless you have a second (or third, counting 
the split write) failure during the rebuild.

Note that the window of exposure during a RAID rebuild is linear with 
the size of your disk and how much you detune the rebuild...
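
As a rough illustration of that linear relationship, here is a minimal
sketch. It assumes the rebuild streams the whole member disk sequentially
at a constant rate; the function name and the sample figures (a 1 TB disk,
50 MB/s versus a throttled 10 MB/s) are illustrative assumptions, not
measurements from this thread.

```python
def rebuild_window_hours(disk_bytes, rebuild_bytes_per_sec):
    """Hours the array stays degraded while one member disk is rebuilt.

    Assumes a purely sequential rebuild at a constant rate.
    """
    if rebuild_bytes_per_sec <= 0:
        raise ValueError("rebuild rate must be positive")
    return disk_bytes / rebuild_bytes_per_sec / 3600.0


TB = 10**12
MB = 10**6

# Hypothetical figures: same disk, full-speed versus detuned rebuild.
full_speed = rebuild_window_hours(1 * TB, 50 * MB)
detuned = rebuild_window_hours(1 * TB, 10 * MB)

print(f"full speed: {full_speed:.1f} h, detuned: {detuned:.1f} h")
```

Doubling the disk size or halving the rebuild rate doubles the time during
which a second failure would hurt.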


--

From: Theodore Tso
Date: Tuesday, August 25, 2009 - 7:58 pm

The argument is that if the degraded RAID array is running in this
state for a long time, and the power fails while the software RAID is
in the middle of writing out a stripe, such that the stripe isn't
completely written out, we could lose all of the data in that stripe.

In other words, a power failure in the middle of writing out a stripe
in a degraded RAID array counts as a second failure.

To me, this isn't a particularly interesting or newsworthy point,
since a competent system administrator who cares about his data and/or
his hardware will (a) have a UPS, and (b) be running with a hot spare
and/or will immediately replace a failed drive in a RAID array.

       	    	       	       	 	      - Ted
--

From: Ric Wheeler
Date: Wednesday, August 26, 2009 - 3:39 am

I agree that this is not an interesting (or likely) scenario, certainly 
when compared to the much more frequent failures that RAID will protect 
against which is why I object to the document as Pavel suggested. It 
will steer people away from using RAID and directly increase their 
chances of losing their data if they use just a single disk.

Ric
--

From: Pavel Machek
Date: Wednesday, August 26, 2009 - 4:12 am

So instead of fixing or at least documenting a known software deficiency
in the Linux MD stack, you'll try to suppress that information so that
more people use raid5 setups?

Perhaps the better documentation will push them to RAID1, or maybe
make them buy a UPS?
									Pavel
--

From: david
Date: Wednesday, August 26, 2009 - 4:28 am

people aren't objecting to better documentation, they are objecting to 
misleading documentation.

for flash drives the danger is very straightforward (although even then 
you have to note that it depends heavily on the firmware of the device, 
some will lose lots of data, some won't lose any)

a good thing to do here would be for someone to devise a test to show this 
problem, and then gather the results of lots of people performing this 
test to see what the commonalities are.

you are generalizing that since you have lost data on flash drives, all 
flash drives are dangerous.

what if it turns out that only one manufacturer is doing things wrong? you 
will have discouraged people from using flash drives for no reason. 
(potentially causing them to lose data because they are scared away from 
using flash drives and don't implement anything better)

to be safe, all that a flash drive needs to do is to not change the FTL 
pointers until the data has fully been recorded in its new location. this 
is probably a trivial firmware change.


for raid arrays, we are still learning the nuances of what actually can 
happen. consider the comment that Ric made a few hours ago, when he pointed 
out that with raid 5 you won't trash the entire stripe (which is what I 
thought happened from prior comments), but instead run the risk of losing 
two relatively definable chunks of data:

1. the block you are writing (which you can lose anyway)

2. the block that would live on the disk that is missing.

that drastically lessens the impact of the problem
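
A minimal sketch of that failure mode, assuming a toy 4-disk RAID5 stripe
(three data blocks plus XOR parity) with one data disk missing. The block
size, the byte patterns and the helper name xor_blocks are illustrative
assumptions; this models only the parity arithmetic, not the MD driver.

```python
def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


BLOCK = 16
d0 = bytes([0x11]) * BLOCK
d1 = bytes([0x22]) * BLOCK
d2 = bytes([0x33]) * BLOCK          # lives on the disk that has failed
parity = xor_blocks(d0, d1, d2)

# Degraded but clean: d2 can still be recomputed from the survivors.
assert xor_blocks(d0, d1, parity) == d2

# Update d0. The array has to write the new data block AND new parity.
new_d0 = bytes([0xAA]) * BLOCK
new_parity = xor_blocks(new_d0, d1, d2)

# Power fails after the data block is written but before the parity is.
disk0_after_crash = new_d0
parity_after_crash = parity          # stale

rebuilt_d2 = xor_blocks(disk0_after_crash, d1, parity_after_crash)

print("d1 untouched:", d1 == bytes([0x22]) * BLOCK)
print("d2 recoverable:", rebuilt_d2 == d2)
```

In this model d1, which sits on a surviving disk and was not being written,
is still readable directly. What is at risk is the block being written and
the block that only existed implicitly through parity.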

I would like to see someone explain what would happen on raid 6, and I 
think that the possibilities that Neil talked about where he said that it 
was possible to try the various combinations and see which ones agree with 
each other would be a good thing to implement if he can do so.

but the super simplified statement you keep trying to make is 
significantly overstating and oversimplifying the problem.

David Lang
--


Actually Ric is. He's trying hard to make RAID5 look better than it


Do the flash manufacturers claim they do not cause collateral damage
during powerfail? If not, they probably are dangerous.

Anyway, you wanted a test, and one is attached. It normally takes like

Offer better docs? You are right that it does not lose whole stripe,
it merely loses random block on same stripe, but result for journaling
filesystem is similar.
									Pavel



I object to misleading and dangerous documentation that you have 
proposed. I spend a lot of time working in data integrity, talking and 
writing about it so I care deeply that we don't misinform people.

In this thread, I put out a draft that is accurate several times and you 
have failed to respond to it.

The big picture that you don't agree with is:

(1) RAID (specifically MD RAID) will dramatically improve data integrity 
for real users. This is not a statement of opinion, this is a statement 
of fact that has been shown to be true in large scale deployments with 
commodity hardware.

(2) RAID5 protects you against a single failure and your test case 
purposely injects a double failure.

(3) How to configure MD reliably should be documented in MD 
documentation, not in each possible FS or raw device application

(4) Data loss occurs in non-journalling file systems and journalling 
file systems when you suffer double failures or hot unplug storage, 
especially inexpensive FLASH parts.

ric



--


Most people would be surprised that press of reset button is 'failure'

It does not happen on inexpensive DISK parts,  so people do not expect
that and it is worth pointing out.
								Pavel

--


Pavel, you have no information and an attitude of not wanting to listen to 
anyone who has real experience or facts. Not just me, but also Ted and others.

Totally pointless to reply to you further.

Ric

--


For the record, I've been able to follow Pavel's arguments, and I've been able 
to follow Ted's arguments.  But as far as I can tell, you're arguing about a 
different topic than the rest of us.

There's a difference between:

A) This filesystem was corrupted because the underlying hardware is permanently 
damaged, no longer functioning as it did when it was new, and never will 
again.

B) We had a transient glitch that ate the filesystem.  The underlying hardware 
is as good as new, but our data is gone.

You can argue about whether or not "new" was ever any good, but Linux has run 
on PC-class hardware from day 1.  Sure PC-class hardware remains crap in many 
different ways, but this is not a _new_ problem.  Refusing to work around what 
people actually _have_ and insisting we get a better class of user instead 
_is_ a new problem, kind of a disturbing one.

USB keys are the modern successor to floppy drives, and even now 
Documentation/blockdev/floppy.txt is still full of some of the torturous 
workarounds implemented for that over the past 2 decades.  The hardware 
existed, and instead of turning up their nose at it they made it work as best 
they could.

Perhaps what's needed for the flash thing is a userspace package, the way 
mdutils made floppies a lot more usable than the kernel managed at the time.  
For the flash problem perhaps some FUSE thing a bit like mtdblock might be 
nice, a translation layer remapping an arbitrary underlying block device into 
larger granularity chunks and being sure to do the "write the new one before 
you erase the old one" trick that so many hardware-only flash devices _don't_, 
and then maybe even use Pavel's crash tool to figure out the write granularity 
of various sticks and ship it with a whitelist people can email updates to so 
we don't have to guess large.  (Pressure on the USB vendors to give us a "raw 
view" extension bypassing the "pretend to be a hard drive, with remapping" 
hardware in future devices would be nice too, ...

no other OS avoids this problem either.

I actually don't see how you can do this from userspace, because when you 
write to the device you have _no_ idea where on the device your data will 
actually land.

writing in larger chunks may or may not help, (if you do a 128K write, 
and the device is emulating 512b blocks on top of 128K eraseblocks, 
depending on the current state of the flash translation layer, you could 
end up writing to many different eraseblocks, up to the theoretical max of 
256)
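
The arithmetic behind that upper bound, as a small sketch. It assumes the
worst case in which the flash translation layer maps every emulated sector
of the write to a different eraseblock, and the best case in which the
write is aligned and lands contiguously. The function name is an
illustrative assumption; real firmware behaviour is not visible from the
host.

```python
def eraseblocks_touched(write_bytes, sector_bytes, eraseblock_bytes):
    """Return (best_case, worst_case) number of eraseblocks for one write."""
    def ceil_div(a, b):
        return -(-a // b)

    best = ceil_div(write_bytes, eraseblock_bytes)   # aligned, contiguous
    worst = ceil_div(write_bytes, sector_bytes)      # every sector elsewhere
    return best, worst


best, worst = eraseblocks_touched(128 * 1024, 512, 128 * 1024)
print(best, worst)
```

For a 128K write on 512-byte sectors and 128K eraseblocks the range runs
from one eraseblock to 256, which is why writing in larger chunks from the
host does not by itself bound the damage.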

David Lang
--


It certainly is not easy. Self-correcting codes could probably be
used, but that would be very special, very slow, and very
non-standard. (Basically... we could design filesystem so that it
would survive damage to an arbitrary 512K on disk -- using
self-correcting codes in CD-like manner). I'm not sure if it would be
practical.

								Pavel
--


I had no trouble following what Ric was arguing about.

Ric never said "use only the best devices and you won't have problems".

Ric was arguing the exact opposite - ALL devices are crap if you define
crap as "can lose data".  What he is saying is you need to UNDERSTAND
your devices and their behavior and you must act accordingly.

PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS.

We understand he was clueless, but user error is still user error!

And Ric said do not stigmatize whole classes of A) devices, B) raid,

We have been trying forever to deal with device problems and as
Ric kept trying to explain we do understand them.  The problem is
not "can we be better" it is "at what cost".  As they keep saying
"fast", "cheap", "safe"... pick any 2.  Adding software solutions
to solve it will always turn "fast" to "slow".

Most people will choose some risk they can manage (such as

Saw it. I am not an MD guy so I will not say anything bad about it
except all the "journal" crud.  It really is only pandering to Pavel
because ALL filesystems can be screwed and that is what they really
need to know.  The journal stuff distracts those who are not running
a journaling filesystem, even if your description is correct except
that as we fs people keep saying, fsck is meaningless and again will
only give you a false sense of security that your data is OK.

jim
--


And if you include meteor strike and flooding in your operating criteria you 
can come up with quite a straw man argument.  It still doesn't mean "X is 


I think he understands he was clueless too, that's why he investigated the 

I don't care what "Pavel says", so you can leave the ad hominem at the door, 
thanks.

The kernel presents abstractions, such as block device nodes.  Sometimes 
implementation details bubble through those abstractions.  Presumably, we 
agree on that so far.

I was once asked to write what became Documentation/rbtree.txt, which got 
merged.  I've also read maybe half of Documentation/RCU.  Neither technique is 
specific to Linux, but this doesn't seem to have been an objection at the time.

The technique, "journaling", is widely perceived as eliminating the need for 
fsck (and thus the potential for filesystem corruption) in the case of unclean 
shutdowns.  But there are easily reproducible cases where the technique, 
"journaling", does not do this.  Thus journaling, as a concept, has 
limitations which are _not_ widely understood by the majority of people who 
purchase and use USB flash keys.

The kernel doesn't currently have any documentation on journaling theory where 
mention of journaling's limitations could go.  It does have a section on its 
internal Journaling API in Documentation/DocBook/filesystems.tmpl which links 
to two papers (both about ext3, even though reiserfs was merged first and IBM's 
JFS was implemented before either) from 1998 and 2000 respectively.  The 2000 
paper brushes against disk granularity answering a question starting at 72m, 
21s, and brushes against software raid and write ordering starting at the 72m 
32s mark.  But it never directly addresses either issue...

Sigh, I'm well into tl;dr territory here, aren't I?

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--


See, this is exactly the problem we have with all the proposed
documentation.  The reader (you) did not get what the writer (me)
was trying to say.  That does not say either of us was wrong in
what we thought was meant, simply that we did not communicate.

What I meant was we did not want to accept Pavel's incorrect

We don't have any problem with documenting abstractions.  But they
must be written as abstracts and accurate, not as IMO blogs.

It is not "he means well, so we will just accept it".  The rule
for kernel docs should be the same as for code.  If it is not
correct in all cases or causes problems, we don't accept it.

jim
--


That's why I've mostly stopped bothering with this thread.  I could respond to 
Ric Wheeler's latest (what do write barriers have to do with whether or not 
a multi-sector stripe is guaranteed to be atomically updated during a panic or 
power failure?) but there's just no point.

The LWN article on the topic is out, and incomplete as it is I expect it's the 
best documentation anybody will actually _read_.

Rob
--


The point of that post was that the failure that you and Pavel both 
attribute to RAID and journalled fs happens whenever the storage cannot 
promise to do atomic writes of a logical FS block (prevent torn 
pages/split writes/etc). I gave a specific example of why this happens 
even with simple, single disk systems.

Further, if  you have the write cache enabled on your local S-ATA/SAS 
drives and do not have working barriers (as is the case with MD 
RAID5/6), you have a hard promise of data loss on power outage and these 
split writes are not going to be the cause of your issues.

You can verify this by testing. Or, try to find people that do storage 

--


ext3 does not expect atomic write of 4K block, according to Ted. So

Would anyone (probably privately?) share the lwn link?
								Pavel
--


I am not sure what you mean by "expect."

ext3 (and other file systems) certainly expect that acknowledged writes 
will still be there after a crash.

With your disk write cache on (and no working barriers or non-volatile 
write cache), this will always require a repair via fsck or leave you 
with corrupted data or metadata.

ext4, btrfs and zfs all do checksumming of writes, but this is a 
detection mechanism.
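
A minimal sketch of why a checksum detects a torn write but cannot repair
it. This uses CRC32 from Python's zlib purely as a stand-in; the block
layout, the 4096-byte block size and the helper names seal and verify are
illustrative assumptions and do not describe the on-disk format of any of
the filesystems named above.

```python
import zlib


def seal(payload: bytes) -> bytes:
    """Append a 4-byte big-endian CRC32 of the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def verify(block: bytes) -> bool:
    """True if the trailing CRC32 matches the payload."""
    payload, stored = block[:-4], block[-4:]
    return zlib.crc32(payload).to_bytes(4, "big") == stored


old = seal(b"A" * 4096)
new = seal(b"B" * 4096)

# Torn write: only the first 512-byte sector of the new block reached the
# media before power was lost; the rest still holds the old contents.
torn = new[:512] + old[512:]

print(verify(old), verify(new), verify(torn))
```

The checksum says the block is bad. Getting good contents back still needs
a second copy or a repair tool.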

Repair of the partial write is done on detection (if you have another 
copy in btrfs or zfs) or by repair (ext4's fsck).

For what it's worth, this is the same story with databases (DB2, Oracle, 
etc). They spend a lot of energy trying to detect partial writes from 
the application level's point of view and their granularity is often 

--


On Sat, 5 Sep 2009 12:28:10 +0200

	http://lwn.net/SubscriberLink/349970/9875eff987190551/

assuming you've not already gotten one from elsewhere.

jon
--


I agree; it's very nicely written, balanced, and doesn't scare users
unduly.

						- Ted
--


Apparently because most people haven't read Documentation/md.txt:

  Boot time assembly of degraded/dirty arrays
  -------------------------------------------

  If a raid5 or raid6 array is both dirty and degraded, it could have
  undetectable data corruption.  This is because the fact that it is
  'dirty' means that the parity cannot be trusted, and the fact that it
  is degraded means that some datablocks are missing and cannot reliably
  be reconstructed (due to no parity).

And so on for several more paragraphs.  Perhaps the documentation needs to be 
extended to note that "journaling will not help here, because the lost data 
blocks render entire stripes unreconstructable"...

Hmmm, I'll take a stab at it.  (I'm not addressing the raid 0 issues brought 
up elsewhere in this thread because I don't comfortably understand the current 
state of play...)

Rob
--

From: Rob Landley
Date: Wednesday, September 2, 2009 - 3:49 pm

From: Rob Landley <rob@landley.net>

Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
explaining that using a journaling filesystem can't overcome this problem.

Signed-off-by: Rob Landley <rob@landley.net>
---

 Documentation/md.txt |   17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..52b8450 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
 
    md-mod.start_dirty_degraded=1
 
+Note that Journaling filesystems do not effectively protect data in this
+case, because the update granularity of the RAID is larger than the journal
+was designed to expect.  Reconstructing data via parity information involves
+matching together corresponding stripes, and updating only some of these
+stripes renders the corresponding data in all the unmatched stripes
+meaningless.  Thus seemingly unrelated data in other parts of the filesystem
+(stored in the unmatched stripes) can become unreadable after a partial
+update, but the journal is only aware of the parts it modified, not the
+"collateral damage" elsewhere in the filesystem which was affected by those
+changes.
+
+Thus successful journal replay proves nothing in this context, and even a
+full fsck only shows whether or not the filesystem's metadata was affected.
+(A proper solution to this problem would involve adding journaling to the RAID
+itself, at least during degraded writes.  In the meantime, try not to allow
+a system to shut down uncleanly with its RAID both dirty and degraded, it
+can handle one but not both.)
 
 Superblock formats
 ------------------


-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: Pavel Machek
Date: Thursday, September 3, 2009 - 2:08 am

I like it! Not sure if I know enough about MD to add ack, but...

Acked-by: Pavel Machek <pavel@ucw.cz>

								Pavel

--

From: Ric Wheeler
Date: Thursday, September 3, 2009 - 5:05 am

NACK.

Now you have moved the inaccurate documentation about journalling file systems 
into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or 
without journals) create an invalid stripe

(2) partial writes can be prevented in most cases by running with write cache 
disabled or working barriers

(3) fsck can (for journalling fs or non journalling fs) detect and fix your file 
system. It won't give you back the data in that stripe, but you will get the 
rest of your metadata and data back and usable.

You don't need MD in the picture to test this - take fsfuzzer or just dd and 
zero out a RAID stripe width of data from a file system. If you hit data blocks, 
your fsck (for ext2) or mount (for any journalling fs) will not see an error. If 
metadata, fsck in both cases when run will try to fix it as best as it can.

Also note that partial writes (similar to torn writes) can happen for multiple 
reasons on non-RAID systems and leave the same kind of damage.

Side note, proposing a half sketched out "fix" for partial stripe writes in 
documentation is not productive. Much better to submit a fully thought out 
proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group 
and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck 
repairs were impacted.

Try the same with ext2 and ext3.

Regards,

Ric

--

From: Pavel Machek
Date: Thursday, September 3, 2009 - 5:31 am

Given how long experience with storage you claim, you should know that

....and understand by now that statistics are irrelevant for design
problems.

Ouch and trying to silence people by telling them to fix the problem
instead of documenting it is not nice either.
								Pavel
--


so let's get broader testing (including testing the SSDs as well as the 

I think that every single one of them will tell you to not unplug the 
drive while writing to it. in fact, I'll bet they all tell you to not 

Ok, help me understand this.

I copy these two files to a system, change them to point at the correct 
device, run them and unplug the drive while it's running.

when I plug the device back in, how do I tell if it lost something 
unexpected? since you are writing from urandom I have no idea what data 
_should_ be on the drive, so how can I detect that a data block has been 
corrupted?

David Lang

I have mirror on disk you are not unplugging. See cmp || exit lines.

The test continues until it detects corruption.
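
The attached scripts are not reproduced in this archive. The following is
only a sketch of the mirror-and-compare idea described above, using two
ordinary files in a temporary directory in place of the device under test
and the safe mirror. All names, sizes and the simulated damage are
illustrative assumptions.

```python
import os
import tempfile


def write_pair(test_path, mirror_path, nbytes):
    """Write the same random data to both paths and fsync each one."""
    data = os.urandom(nbytes)
    for path in (test_path, mirror_path):
        with open(path, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())


def matches(path_a, path_b, chunk=65536):
    """Byte-for-byte comparison, like `cmp` returning success."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            ca, cb = fa.read(chunk), fb.read(chunk)
            if ca != cb:
                return False
            if not ca:
                return True


workdir = tempfile.mkdtemp()
test_file = os.path.join(workdir, "under_test.img")
mirror_file = os.path.join(workdir, "mirror.img")

write_pair(test_file, mirror_file, 64 * 1024)
clean_ok = matches(test_file, mirror_file)

# Simulate collateral damage: flip every bit in one 512-byte sector that
# the test had already written successfully.
with open(test_file, "r+b") as f:
    f.seek(4096)
    sector = f.read(512)
    f.seek(4096)
    f.write(bytes(b ^ 0xFF for b in sector))

damaged_ok = matches(test_file, mirror_file)
print(clean_ok, damaged_ok)
```

Because the reference copy lives on storage that is never unplugged, the
random payload does not need to be known in advance: any mismatch after
replugging is damage.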
								Pavel


--

From: Ric Wheeler
Date: Wednesday, August 26, 2009 - 5:01 am

I am against documenting unlikely scenarios out of context that will 
lead people to do the wrong thing.

ric


--

From: Theodore Tso
Date: Wednesday, August 26, 2009 - 5:23 am

First of all, it's not a "known software deficiency"; you can't do
anything about a degraded RAID array, other than to replace the failed
disk.  Secondly, what we should document is things like "don't use
crappy flash devices", "don't let the RAID array run in degraded mode
for a long time" and "if you must (which is a bad idea), better have a
UPS or a battery-backed hardware RAID".  What we should *not* document
is

"ext3 is worthless for RAID 5 arrays" (simply wrong)

and

"ext2 is better than ext3 because it forces you to run a long, slow
fsck after each boot, and that helps you to catch filesystem
corruptions when the storage devices goes bad" (Second part of the
statement is true, but it's still bad general advice, and it's
horribly misleading)

and

"ext2 and ext3 have this surprising dependency that disks act like
disks".  (alarmist)

      	       	    	 	    	       	    - Ted

--

From: Pavel Machek
Date: Sunday, August 30, 2009 - 12:01 am

AFAICT, you mount a block device, not a disk. Many block devices fail the
test. And since users (and block device developers) do not know in
detail how disks behave, it is hard to blame them... ("you may corrupt
the sector you are writing to and ext3 handles that ok" was a surprise to
me, for example).

					
								Pavel
--

From: Rob Landley
Date: Wednesday, August 26, 2009 - 10:19 pm

Or panic, hang, the drive failed because the system is overheating because the 
air conditioner suddenly died and the server room is now an oven.  (Yup, 

I'm a bit concerned by the argument that we don't need to document serious 
pitfalls because every Linux system has a sufficiently competent administrator 
they already know stuff that didn't even come up until the second or third day 
it was discussed on lkml.


I worked at a company that retested their UPSes a year after installing them 
and found that _none_ of them supplied more than 15 seconds charge, and when 
they dismantled them the batteries had physically bloated inside their little 
plastic cases.  (Same company as the dead air conditioner, possibly 
overheating was involved but the little _lights_ said everything was ok.)

That was by no means the first UPS I'd seen die, the suckers have a higher 
failure rate than hard drives in my experience.  This is a device where the 
batteries get constantly charged and almost never tested because if it _does_ 
fail you just rebooted your production server, so a lot of smaller companies 

Here's hoping they shut the system down properly to install the new drive in 
the raid then, eh?  Not accidentally pull the plug before it's finished running 
the ~7 minutes of shutdown scripts in the last Red Hat Enterprise I messed 
with...

Does this situation apply during the rebuild?  I.E. once a hot spare has been 
supplied, is the copy to the new drive linear, or will it write dirty pages to 
the new drive out of order, even before the reconstruction's gotten that far, 
_and_ do so in an order that doesn't open this race window of the data being 
unable to be reconstructed?

If "degraded array" just means "don't have a replacement disk yet", then it 
sounds like what Pavel wants to document is "don't write to a degraded array 
at all, because power failures can cost you data due to write granularity 
being larger than filesystem block size".  (Which still comes as news to some ...
From: Theodore Tso
Date: Thursday, August 27, 2009 - 5:24 am

I'm not convinced that information which needs to be known by System
Administrators is best documented in the kernel Documentation
directory.  Should there be a HOWTO document on stuff like that?
Sure, if someone wants to put something like that together, having
free documentation about ways to set up your storage stack in a sane
way is not a bad thing.  

It should be noted that these sorts of issues are discussed in various
books targeted at System Administrators, and in Usenix's System
Administration tutorials.  The computer industry is highly
specialized, and so just because an OS kernel hacker might not be
familiar with these issues, doesn't mean that professionals whose job
it is to run data centers don't know about these things!  Similarly,
you could be a whiz at Linux's networking stack, but you might not
know about certain pitfalls in configuring a Cisco router using IOS;
does that mean we should have an IOS tutorial in the kernel

Sure, but the fact that we don't currently say much about storage
stacks doesn't mean we should accept a patch that might actively

Sounds like they were using really cheap UPS's; certainly not the kind
I would expect to find in a data center.  And if a company's system
administrator is using the cheapest possible consumer-grade UPS's,
then yes, they might have a problem.  Even an educational institution
like MIT, where I was a network administrator some 15 years ago, had
proper UPS's, *and* we had a diesel generator which kicked in after 15
seconds --- and we tested the diesel generator every Friday morning,

Even my home RAID array uses hot-plug SATA disks, so I can replace a
failed disk without shutting down my system.  (And yes, I have a
backup battery for the hardware RAID, and the firmware runs periodic
tests on it; the hardware RAID card also will send me e-mail if a RAID
array drive fails and it needs to use my hot-spare.  At that point, I
order a new hard drive, secure in the knowledge that the system can
still suffer another ...
From: Ric Wheeler
Date: Thursday, August 27, 2009 - 6:10 am

One thing that does need fixing for some MD configurations is to stress again 
that we need to make sure that barrier operations are properly supported or 
users will need to disable the write cache on devices with volatile write caches.

Ric

--

From: Jeff Garzik
Date: Thursday, August 27, 2009 - 9:54 am

Agreed; chime in on Christoph's linux-vfs thread if people have input.

I quickly glanced at MD and DM.  Currently, upstream, we see a lot of

         if (unlikely(bio_barrier(bio))) {
                 bio_endio(bio, -EOPNOTSUPP);
                 return 0;
         }

in DM and MD make_request functions.

Only md/raid1 supports barriers at present, it seems.  None of the other 
MD drivers support barriers.

DM has some barrier code...  but the above code was pasted from DM's 
make_request function, so I am guessing that DM's barrier stuff is 
incomplete and disabled at present.

I've been mentioning this issue for years... glad some people finally 
noticed :)

	Jeff



--

From: Alasdair G Kergon
Date: Thursday, August 27, 2009 - 11:09 am

That code is from the new request-based multipath implementation in 2.6.31
which doesn't yet.

But bio-based dm does support barriers now.  (Just missing some patches to
complete the dm-raid1 support that are still under review IIRC.)

Alasdair
--

From: Michael Tokarev
Date: Wednesday, September 2, 2009 - 9:17 am

Only for raid1 there's no requirement for inter-drive ordering.  Hence
only raid1 supports barriers (and gained that support very recently,
in 1 or 2 kernel releases).  For the rest, including raid0 and linear,
inter-drive ordering is necessary to implement barriers.  Or md should
have its own queue (flushing) mechanisms.

/mjt
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 3:02 am

It is not only for system administrators; I was trying to find out if

ext3 documentation states that journal protects fs integrity on
powerfail. If you don't want to talk about storage stacks, perhaps
that should be removed?

Now... You mocked me for 'ext3 expects disks to behave like disks
(alarmist)'. I actually believe that should be written somewhere. ext3
depends on fairly subtle storage disk characteristics, and many common
configs just do not meet the expectations (missing barriers is most
common, followed by collateral damage).

Maybe not documenting that was okay 10 years ago, but with all the USB
sticks and raid arrays around, it's just sloppy. Because those
characteristics are not documented, storage stack authors do not know
what they have to guarantee, and the result is bad. See for example
nbd -- it does not propagate barriers and is therefore unsafe.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Henrique de Moraes Holschuh
Date: Tuesday, August 25, 2009 - 7:53 pm

Can we get a proper scrub function (full rewrite of all component
disks), please?  Not every disk out there will stop a streaming read to

Debian got this right :-)

-- 
  "One disk to rule them all, One disk to find them. One disk to bring
  them all and in the darkness grind them. In the Land of Redmond
  where the shadows lie." -- The Silicon Valley Tarot
  Henrique Holschuh
--

From: Pavel Machek
Date: Thursday, September 3, 2009 - 2:47 am

Yes. Unfortunately, different filesystems expect different properties
from block devices. ext3 will work with write cache enabled/barriers
enabled, while ext2 needs write cache disabled.

The requirements are also quite surprising; AFAICT ext3 can handle
disk writing garbage to single sector during powerfail, while xfs can
not handle that.

Now, how do you expect users to know these subtle details when it is
not documented anywhere? And why are you fighting against documenting

As was uncovered, MD RAID does not properly support barriers,

Trust me, 99% of sysadmins are not competent by your definition. So

ext3 greatly contributes to administrator incompetency:

# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.

...it does not mention that (non-default!) barrier=1 is needed to make
this reliable, nor does it mention that there are certain requirements for
this to work. It just says that journal will magically help you.

And you wonder why people expect magic from your filesystem?

								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rik van Riel
Date: Tuesday, August 25, 2009 - 8:50 pm

The reality in your document does not match up with the reality
out there in the world.  That sounds like a good reason not to
have your (incorrect) document out there, confusing people.

-- 
All rights reversed.
--

From: Rob Landley
Date: Wednesday, August 26, 2009 - 8:53 pm

On google scale anvil lightning can fry your machine out of a clear sky.

However, there are still a few non-enterprise users out there, and knowing 
that specific usage patterns don't behave like they expect might be useful to 

Actually, that's _exactly_ what he's talking about.

When writing to a degraded raid or a flash disk, journaling is essentially 
useless.  If you get a power failure, kernel panic, somebody tripping over a 
USB cable, and so on, your filesystem will not be protected by journaling.  
Your data won't be trashed _every_ time, but the likelihood is much greater 
than experience with journaling in other contexts would suggest.

Worse, the journaling may be counterproductive by _hiding_ many errors that 
fsck would promptly detect, so when the error is detected it may not be 
associated with the event that caused it.  It also may not be noticed until 
good backups of the data have been overwritten or otherwise cycled out.

You seem to be arguing that Linux is no longer used anywhere but the 
enterprise, so issues affecting USB flash keys or cheap software-only RAID 
aren't worth documenting?

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: Ric Wheeler
Date: Thursday, August 27, 2009 - 4:43 am

You are missing the broader point of both papers. They (and people like 
me when back at EMC) look at large numbers of machines and try to fix 
what actually breaks when run in the real world and causes data loss. 
The motherboards, S-ATA controllers, disk types are the same class of 
parts that I have in my desktop box today.

The advantage of google, national labs, etc is that they have large 
numbers of systems and can draw conclusions that are meaningful to our 
broad user base.

Specifically, in using S-ATA drives (just like ours, maybe slightly more 
reliable) they see up to 7% of those drives fail each year.  All users 
have "soft" drive failures like single remapped sectors.

These errors happen extremely commonly and are what RAID deals with well.

What does not happen commonly is that during the RAID rebuild (kicked 
off only after a drive is kicked out), you push the power button or have 
a second failure (power outage).

We will have more users lose data if they decide to use ext2 instead of 
ext3 and use only single disk storage.

We have real numbers that show that is true. Injecting double faults 
into a system that handles single faults is frankly not that interesting.

You can get better protection from these double faults if you move to 
"cloud" like storage configs where each box is fault tolerant, but you 
also spread your data over multiple boxes in multiple locations.

Regards,


--

From: Rob Landley
Date: Thursday, August 27, 2009 - 1:51 pm

No, I'm dismissing the papers (some of which I read when they first came out 
and got slashdotted) as irrelevant to the topic at hand.

Pavel has two failure modes which he can trivially reproduce.  The USB stick 
one is reproducible on a laptop by jostling said stick.  I myself used to have 
a literal USB keychain, and the weight of keys dangling from it pulled it out 
of the USB socket fairly easily if I wasn't careful.  At the time nobody had 
told me a journaling filesystem was not a reasonable safeguard here.

Presumably the degraded raid one can be reproduced under an emulator, with no 
hardware directly involved at all, so talking about hardware failure rates 
ignores the fact that he's actually discussing a _software_ problem.  It may 
happen in _response_ to hardware failures, but the damage he's attempting to 
document happens entirely in software.

These failure modes can cause data loss which journaling can't help, but which 
journaling might (or might not) conceivably hide so you don't immediately 
notice it.  They share a common underlying assumption that the storage 
device's update granularity is less than or equal to the filesystem's block 
size, which is not actually true of all modern storage devices.  The fact he's 
only _found_ two instances where this assumption bites doesn't mean there 
aren't more waiting to be found, especially as more new storage media types 
get introduced.
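
To make the degraded-RAID case concrete, here is a minimal Python sketch. It
is an illustration only, not code from MD: the 3-disk layout, the block
contents and the helper name xor_blocks are all made up. It shows how a write
to one block can change the reconstructed contents of a different block when
parity is stale.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical stripe on a 3-disk RAID5: two data blocks plus one parity block.
d0 = b"AAAA"                 # on the failed disk; the filesystem is not touching it
d1 = b"BBBB"                 # on a healthy disk; the filesystem is rewriting it
parity = xor_blocks(d0, d1)  # on a healthy disk

# Degraded array: d0 now exists only as (d1 XOR parity).
d0_before = xor_blocks(d1, parity)

# Power fails after the new d1 reached its disk but before parity was updated.
d1 = b"CCCC"
d0_after = xor_blocks(d1, parity)  # reconstructed from stale parity

print(d0_before == d0)  # True: reconstruction worked before the crash
print(d0_after == d0)   # False: a block nobody wrote to has changed
```

The damaged block belongs to whatever happens to share the stripe, so the
journal, which only knows about blocks it wrote, has no record of it.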

Pavel's response was to attempt to document this.  Not that journaling is 
_bad_, but that it doesn't protect against this class of problem.

Your response is to talk about google clusters, cloud storage, and cite 
academic papers of statistical hardware failure rates.  As I understand the 
discussion, that's not actually the issue Pavel's talking about, merely one 
potential trigger for it.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: Ric Wheeler
Date: Thursday, August 27, 2009 - 3:00 pm

From: david
Date: Friday, August 28, 2009 - 7:49 am

I don't think anyone is disagreeing with the statement that journaling 
doesn't protect against this class of problems, but Pavel's statements 
didn't say that. he stated that ext3 is more dangerous than ext2.

David Lang
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 3:05 am

Well, if you use 'common' fsck policy, ext3 _is_ more dangerous.

But I'm not pushing that to documentation, I'm trying to push info
everyone agrees with. (check the patches).
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rob Landley
Date: Saturday, August 29, 2009 - 1:22 pm

The filesystem itself isn't more dangerous, but it may provide a false sense of 
security when used on storage devices it wasn't designed for.

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--


from this discussion (and the similar discussion on lwn.net) there appears 
to be confusion/disagreement over what fsck does and what the results of 
not running it are.

it has been stated here that fsck cannot fix broken data, all it tries to 
do is to clean up metadata, but it would probably help to get a clear 
statement of what exactly that means.

I know that it:

finds entries that don't actually have data and deletes them

finds entries where multiple files share data blocks and duplicates the 
(bad for one file) data to separate them

finds blocks that have been orphaned (allocated, but no directory pointer 
to them) and creates entries in lost+found

but if a fsck does not get run on a filesystem that has been damaged, what 
additional damage can be done?

can it overwrite data that could have been saved?

can it cause new files that are created (or new data written to existing, 
but uncorrupted files) to be lost?

or is it just a matter of not knowing about existing corruption?

David Lang

--

From: Theodore Tso
Date: Thursday, September 3, 2009 - 12:27 pm

Let me give you my formulation of fsck which may be helpful.  Fsck can
not fix broken data; and (particularly in fsck -y mode) may not even
recover the maximal amount of lost data caused by metadata corruption.
(This is why sometimes an expert using debugfs can recover more data
than fsck -y, and if you have some really precious data, like ten
years' worth of Ph.D. research that you've never bothered to back
up[1], the first thing you should do is buy a new hard drive and make a
sector-by-sector copy of the disk and *then* run fsck.  A new
terabyte hard drive costs $100; how much is your data worth to you?)

[1] This isn't hypothetical; while I was at MIT this sort of thing
actually happened more than once --- which brings up the philosophical
question of whether someone who is that stupid about not doing backups
on critical data *deserves* to get a Ph.D. degree.  :-)

Fsck's primary job is to make sure that further writes to the
filesystem, whether you are creating new files or removing directory
hierarchies, etc., will not cause *additional* data loss due to meta
data corruption in the file system.  Its secondary goals are to
preserve as much data as possible, and to make sure that file system
metadata is valid (i.e., so that a block pointer contains a valid
block address, so that an attempt to read a file won't cause an I/O
error when the filesystem attempts to seek to a non-existent sector
on disk).

For some filesystems, invalid, corrupt metadata can actually cause a
system panic or oops message, so it's not necessarily safe to mount a
filesystem with corrupt metadata read-only without risking the need to
reboot the machine in question.  More recently, there are folks who
have been filing security bugs when they detect such cases, so there
are fewer examples of such cases, but historically it was a good idea
to run fsck because otherwise it's possible the kernel might oops or

Consider the case where there are data blocks in use by inodes,
containing precious data, ...

So your argument basically is

'our abs brakes are broken, but lets not tell anyone; our car is still
safer than a horse'.

and

'while we know our abs brakes are broken, they are not major factor in
accidents, so lets not tell anyone'.

Sorry, but I'd expect slightly higher moral standards. If we can
document it in a way that's non-scary, and does not push people to
single disks (horses), please go ahead; but you have to mention that
md raid breaks journalling assumptions (our abs brakes really are
broken).
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


You continue to ignore the technical facts that everyone (both MD and ext3) 
people put in front of you.

If you have a specific bug in MD code, please propose a patch.

Ric

--


Interesting. So, what's technically wrong with the patch below?

									Pavel
---

From: Theodore Tso <tytso@mit.edu>

Document that many devices are too broken for filesystems to protect
data in case of powerfail.

Signed-of-by: Pavel Machek <pavel@ucw.cz> 

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..2f3eec1
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that high highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
+arrays.  These devices have the property of potentially corrupting
+blocks being written at the time of the power failure, and worse yet,
+amplifying the region where blocks are corrupted such that additional
+sectors are also damaged during the power failure.
+        
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used.  Regular backups when using these devices is also a
+Very Good Idea.
+        
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption.  An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) Degraded array or single disk failure "near" the powerfail is
+neccessary for this property of RAID arrays to bite.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


You mean apart from ".... that high highly undesirable ...." ??
                               ^^^^^^^^^^^

And the phrase "Regular backups when using these devices ...." should
be "Regular backups when using any devices .....".
                               ^^^
If you have a device failure near a power fail on a raid5 you might
lose some blocks of data.  If you have a device failure near (or not
near) a power failure on raid0 or jbod etc you will certainly lose lots
of blocks of data.

I think it would be better to say:

   ".... and degraded DM/MD RAID 4/5/6(*) arrays..."
             ^^^^^^^^
with
(*) If device failure causes the array to become degraded during or
immediately after the power failure, the same problem can result.

And "necessary" only has the one 'c' :-)

--


Ok, I still believe kernel documentation should be ... well... in
kernel, not in LWN article, so I fixed the patch according to your
comments.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..14d0324
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that have highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and degraded DM/MD RAID
+4/5/6 (*) arrays.  These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+        
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used.  Regular backups when using any devices, and these
+devices in particular is also a Very Good Idea.
+        
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption.  An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) If device failure causes the array to become degraded during or
+immediately after the power failure, the same problem can result.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


My suggestion was that you stop trying to document your assertion of an issue 
and actually suggest fixes in code or implementation. I really don't think that 
you have properly diagnosed your specific failure or done sufficient. However, 
if you put a full analysis and suggested code out to the MD devel lists, we can 
debate technical implementation as we normally do.

As Ted quite clearly stated, documentation on how RAID works, how to configure 
it, etc, is best put in RAID documentation.  What you claim as a key issue is an 
issue for all file systems (including ext2).

The only note that I would put in ext3/4 etc documentation would be:

"Reliable storage is important for any file system. Single disks (or FLASH or 
SSD) do fail on a regular basis.

To reduce your risk of data loss, it is advisable to use RAID which can overcome 
these common issues. If using MD software RAID, see the RAID documentation on 
how best to configure your storage.

With or without RAID, it is always important to back up your data to an external 
device and keep copies of that backup off site."


--


I don't think I should be required to rewrite linux md layer in order

Uh, how clever, instead of documenting that our md raid code does not
always work as expected, you document that components fail. Newspeak
101?

You even failed to mention little design problem with flash and
eraseblock size... and the fact that you don't need flash to fail to
get data loss.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


NACK.  I didn't write this patch, and it's disingenuous for you to try
to claim that I authored it.

You took text I wrote from the *middle* of an e-mail discussion and
you ignored multiple corrections to typo's that I made --- typo's that
I would have corrected if I had ultimately decided to post this as a
patch, which I did NOT.

While Neil Brown's corrections are minimally necessary so the text is
at least technically *correct*, it's still not the right advice to
give system administrators.  It's better than the fear-mongering
patches you had proposed earlier, but what would be better *still* is
telling people why running with degraded RAID arrays is bad, and to
give them further tips about how to use RAID arrays safely.

To use your ABS brakes analogy, just because it's not safe to rely on
ABS brakes if the "check brakes" light is on, that doesn't justify
writing something alarmist which claims that ABS brakes don't work
100% of the time, don't use ABS brakes, they're broken!!!!

The first part of it is true, since ABS brakes can suffer mechanical
failure.  But what we should be telling drivers is, "if the 'check
brakes' light comes on, don't keep driving with it, go to a garage and
get it fixed!!!".  Similarly, if you get a notice that your RAID is
running in degraded mode, you've already suffered one failure; you
won't survive another failure, so fix that issue ASAP!

If you're really paranoid, you could decide to "pull over to the side
of the road"; that is, you could stop writing to the RAID array as
soon as possible, and then get the RAID array rebuilt before
proceeding.  That can reduce the chances of a second failure.  But in
the real world, there are costs associated with taking a production
server off-line, and the prudent system administrator has to do a
risk-reward tradeoff.  A better approach might be to have the array
configured with a hot spare, and to regularly scrub the array, and
configure the RAID array with either a battery backup or a UPS.  ...

Well, you did write original text, so I wanted to give you

Maybe this belongs to Doc*/filesystems, and more detailed RAID

If it only was this simple. We don't have 'check brakes' (aka
'journalling ineffective') warning light. If we had that, I would not
have problem.

It is rather that your ABS brakes are ineffective if 'check engine'
(RAID degraded) is lit. And yes, running with 'check engine' for
extended periods may be bad idea, but I know people that do
that... and I still hope their brakes work (and believe they should

'your RAID array is degraded' is very counter intuitive way to say
'...and btw your journalling is no longer effective, either'.

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


Why should this be placed in *kernel* documentation anyway? The "dangers 
of RAID", the hints that "backups are a good idea" - isn't that something 
for howtos for sysadmins? No end-user will ever look into Documentation/ 
anyway. The sysadmins should know what they're doing and see the upsides 
and downsides of RAID and journalling filesystems. And they'll turn to 
howtos and tutorials to find out. And maybe seek *reference* documentation 
in Documentation/ - but I don't think Storage-101 should be covered in 
a mostly hidden place like Documentation/.

Christian.
-- 
BOFH excuse #212:

Of course it doesn't work. We've performed a software upgrade.
--


The fact that two kernel subsystems (MD RAID, journaling filesystems)
do not work well together is surprising and should be documented near
the source.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


the 'RAID degraded' warning says that _anything_ you put on that block 
device is at risk. it doesn't matter if you are using a filesystem with a 
journal, one without, or using the raw device directly.

David Lang
--


The easiest way to lose your data in Linux - with RAID, without RAID, 
S-ATA or SAS - is to run with the write cache enabled.

If you compare the size of even a large RAID stripe it will be measured 
in KB and as this thread has mentioned already, you stand to have damage 
to just one stripe (or even just a disk sector or two).

If you lose power with the write caches enabled on that same 5 drive 
RAID set, you could lose as much as 5 * 32MB of freshly written data on  
a power loss (16-32MB write caches are common on s-ata disks these days).
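
As a rough worked example of that comparison (a sketch only: the 64KB chunk
size is an assumption, not a figure from this thread, and the ratio depends
entirely on it):

```python
KIB = 1024
MIB = 1024 * KIB

drives = 5
cache_per_drive = 32 * MIB  # upper end of the 16-32MB range quoted above
chunk_size = 64 * KIB       # assumed MD chunk size, not from the thread

cache_exposure = drives * cache_per_drive  # unflushed data at risk
stripe_data = (drives - 1) * chunk_size    # data blocks in one RAID5 stripe

print(cache_exposure // MIB, "MiB sitting in volatile write caches")
print(stripe_data // KIB, "KiB of data in one stripe")
print(cache_exposure // stripe_data, "times larger")
```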

For MD5 (and MD6), you really must run with the write cache disabled 
until we get barriers to work for those configurations.

It would be interesting for Pavel to retest with the write cache 
enabled/disabled on his power loss scenarios with multi-drive RAID.

Regards,

Ric



--


Ric Wheeler wrote:

This is fundamentally wrong.  Many filesystems today use either barriers
or flushes (if barriers are not supported), and the times when disk drives

I highly doubt barriers will ever be supported on anything but simple
raid1, because it's impossible to guarantee ordering across multiple
drives.  Well, it *is* possible to have write barriers with journalled
(and/or with battery-backed-cache) raid[456].

Note that even if raid[456] does not support barriers, write cache
flushes still work.

/mjt
--


Unfortunately not - if you mount a file system with write cache enabled 
and see "barriers disabled" messages in /var/log/messages, this is 
exactly what happens.

File systems issue write barrier operations that in turn issue cache 
flush (ATA_FLUSH_EXT) commands or their SCSI equivalent.

MD5 and MD6 do not pass these operations on currently and there is no 
other file system level mechanism that somehow bypasses the IO stack to 
invalidate or flush the cache.

Note that some devices have non-volatile write caches (specifically 

I think that you are confused - barriers are implemented using cache 
flushes.

Ric


--


While most common filesystems do have barrier support it is:

 - not actually enabled for the two most common filesystems
 - the support for write barriers and cache flushing tends to be buggy

All currently working barrier implementations on Linux are built upon
queue drains and cache flushes, plus sometimes setting the FUA bit.
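
A toy model of why the flush matters for a journal commit, sketched in
Python. This is not kernel code; the Disk class, the block numbers and the
commit_transaction helper are hypothetical and only model a drive that
acknowledges writes from a volatile cache and may write them back in any
order.

```python
class Disk:
    """Toy disk with a volatile write cache; names are illustrative only."""

    def __init__(self):
        self.media = {}  # what survives power loss
        self.cache = {}  # acknowledged writes not yet on media

    def write(self, lba, data):
        self.cache[lba] = data  # acknowledged immediately

    def flush(self):
        self.media.update(self.cache)  # force everything cached onto media
        self.cache.clear()

    def power_fail(self, reached_media):
        # The drive may write cached sectors back in any order;
        # `reached_media` lists the ones that happened to make it.
        for lba in reached_media:
            if lba in self.cache:
                self.media[lba] = self.cache[lba]
        self.cache.clear()


JOURNAL, COMMIT = 10, 11  # made-up block numbers


def commit_transaction(disk, use_flush):
    disk.write(JOURNAL, "journal data")
    if use_flush:
        disk.flush()  # journal data is durable before the commit is issued
    disk.write(COMMIT, "commit record")


# No flush: the drive wrote back only the commit record before power died.
unsafe = Disk()
commit_transaction(unsafe, use_flush=False)
unsafe.power_fail(reached_media=[COMMIT])

# With flush: the same unlucky write-back order cannot lose the journal data.
safe = Disk()
commit_transaction(safe, use_flush=True)
safe.power_fail(reached_media=[COMMIT])

print(JOURNAL in unsafe.media, COMMIT in unsafe.media)  # False True
print(JOURNAL in safe.media, COMMIT in safe.media)      # True True
```

In the unsafe case replay would trust a commit record whose journal blocks
never reached the media, which is the failure a dropped barrier allows.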

--


Or just missing - I think that MD5/6 simply drop the requests at present.

I wonder if it would be worth having MD probe for write cache enabled & warn if 

--


In my opinion even that is too weak.  We know how to control the cache
settings on all common disks (that is scsi and ata), so we should always
disable the write cache unless we know that the whole stack (filesystem,
raid, volume managers) supports barriers.  And even then we should make
sure the filesystems actually use barriers everywhere that's needed,
which we failed at for years.

--


..

That stack does not know that my MD device has full battery backup,
so it bloody well better NOT prevent me from enabling the write caches.

In fact, MD should have nothing to do with that.  I do like/prefer the
way that XFS currently does it:  disables barriers and logs the event,
but otherwise doesn't try to enforce policy upon me from kernel space.

Cheers
--


No one is going to prevent you from doing it.  That question is one of
sane defaults.  And always safe, but slower if you have advanced
equipment is a much better default than unsafe by default on most of
the install base.

--


I've always agreed with "be safe first" and have worked where
we always shut write cache off unless we knew it had battery.

But before we make disabling cache the default, this is the impact:

- users will see it as a performance regression

- trashy OS vendors who never disable cache will benchmark
   better than "out of the box" linux.

Because as we all know, users don't read release notes.

Been there, done that, felt the pain.

jim
--


Just to add some support to this, all of the external RAID arrays that I know of 
normally run with write cache disabled on the component drives. In addition, 
many of them will disable their internal write cache if/when they detect that 
they have lost their UPS.

I think that if we had done this kind of sane default earlier for MD levels that 
do not handle barriers, we would not have left some people worried about our 
software RAID.

To be clear, if a sophisticated user wants to override this default, that should 
be supported. It is not (in my opinion) a safe default behaviour.

Ric

--


Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones?
-- 
Krzysztof Halasa
--


Which drives the various vendors ship changes with specific products. Usually, they 
ship drives that have carefully vetted firmware, etc. but they are close to the 
same drives you buy on the open market.

Seagate has a huge slice of the market,

ric

--


But they aren't the same, are they? If they are not, the fact they can
run well with the write-through cache doesn't mean the off-the-shelf
ones can do as well.

Are they SATA (or PATA) at all? SCSI etc. are usually different
animals, though there are SCSI and SATA models which differ only in
electronics.

Do you have battery-backed write-back RAID cache (which acknowledges
flushes before the data is written out to disks)? PC can't do that.
-- 
Krzysztof Halasa
--


Storage vendors have a wide range of options, but what you get today is a 
collection of s-ata (not much any more), sas or fc.


We (red hat) have all kinds of different raid boxes...

ric


--


I have no doubt about it, but are those you know equipped with
battery-backed write-back cache? Are they using SATA disks?

We can _at_best_ compare non-battery-backed RAID using SATA disks with
what we typically have in a PC.
-- 
Krzysztof Halasa
--

From: Ric Wheeler
Date: Thursday, September 3, 2009 - 7:15 am

The whole thread above is about software MD using commodity drives (S-ATA or 
SAS) without battery backed write cache.

We have that (and I have it personally) and do test it.

You must disable the write cache on these commodity drives *if* the MD RAID 
level does not support barriers properly.

This will greatly reduce errors after a power loss (both in degraded state and 
non-degraded state), but it will not eliminate data loss entirely. You simply 
cannot do that with any storage device!

Note that even without MD raid, the file system issues IO's in file system block 
size (4096 bytes normally) and most commodity storage devices use a 512-byte 
sector size, which means that we have to update eight 512-byte sectors.

Drives can (and do) have multiple platters and surfaces and it is perfectly 
normal to have contiguous logical ranges of sectors map to non-contiguous 
sectors physically. Imagine a 4KB write stripe that straddles two adjacent 
tracks on one platter (requiring a seek) or mapped across two surfaces 
(requiring a head switch). Also, a remapped sector can require more or less a 
full surface seek from where ever you are to the remapped sector area of the drive.

These are all examples where, after a power loss, even a local (non-MD) 
device can end up with a partial update of that 4KB write range of sectors. 
Note that unlike RAID/MD, local storage has no parity on the server to detect 
this partial write.
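
A minimal Python sketch of such a partial update. It is an illustration only:
torn_write is a made-up helper, and it assumes sectors are written in
ascending order, which a real drive does not guarantee.

```python
SECTOR = 512
BLOCK = 4096


def torn_write(old: bytes, new: bytes, sectors_done: int) -> bytes:
    """Block contents if power fails after `sectors_done` sectors were written.

    Assumes sectors are written in ascending order, which a real drive
    does not guarantee.
    """
    assert len(old) == len(new) == BLOCK
    cut = sectors_done * SECTOR
    return new[:cut] + old[cut:]


old = b"o" * BLOCK
new = b"n" * BLOCK
on_disk = torn_write(old, new, sectors_done=3)

print(BLOCK // SECTOR)                    # 8 sectors per file system block
print(on_disk != old and on_disk != new)  # True: a mix of both versions
```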

This is why new file systems like btrfs and zfs do checksumming of data and 
metadata. This won't prevent partial updates during a write, but can at least 
detect them and try to do some kind of recovery.
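
The detection side can be sketched in a few lines of Python. This is a sketch
under assumptions, not how btrfs or zfs are implemented: zlib.crc32 stands in
for whatever checksum the file system uses, and the checksum is assumed to be
stored apart from the block it describes.

```python
import zlib

BLOCK = 4096
old = b"\x00" * BLOCK
new = b"\xff" * BLOCK
torn = new[:1536] + old[1536:]  # three of eight 512-byte sectors updated

# Checksums recorded elsewhere, separately from the data block itself.
old_sum = zlib.crc32(old)
new_sum = zlib.crc32(new)


def verify(block: bytes, expected: int) -> bool:
    """True if the block read back matches the recorded checksum."""
    return zlib.crc32(block) == expected


print(verify(new, new_sum))   # True: a complete write checks out
print(verify(torn, old_sum))  # False
print(verify(torn, new_sum))  # False: torn block matches neither version
```

Whichever checksum was recorded, the torn block fails verification on read,
so the damage is at least noticed rather than silently returned.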

In other words, this is not just an MD issue, it is entirely possible even with 
non-MD devices.

Also, when you enable the write cache (MD or not) you are buffering multiple 
MB's of data that can go away on power loss. That is far greater (10x) than the 
exposure that the partial RAID rewrite case worries about.

ric
--

From: Florian Weimer
Date: Thursday, September 3, 2009 - 7:26 am

Database software often attempts to deal with this phenomenon
(sometimes called "torn page writes").  For example, you can make sure
that the first time you write to a database page, you keep a full copy
in your transaction log.  If the machine crashes, the log is replayed,
first completely overwriting the partially-written page.  Only after
that, you can perform logical/incremental logging.

The log itself has to be protected with a different mechanism, so that
you don't try to replay bad data.  But you haven't committed to this
data yet, so it is fine to skip bad records.
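
A minimal Python sketch of that scheme, assuming a CRC per log record as the
"different mechanism" protecting the log. The record layout and the names
make_record and replay are made up for illustration and do not describe any
particular database.

```python
import zlib


def make_record(page_no: int, image: bytes) -> bytes:
    """Log record: page number + full page image, protected by a CRC."""
    payload = page_no.to_bytes(4, "big") + image
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def replay(log, pages):
    """Overwrite pages with every intact full-page image found in the log."""
    for record in log:
        payload, stored = record[:-4], int.from_bytes(record[-4:], "big")
        if zlib.crc32(payload) != stored:
            continue  # damaged record: never committed to, so skip it
        page_no = int.from_bytes(payload[:4], "big")
        pages[page_no] = payload[4:]


PAGE = 8192
pages = {0: b"a" * PAGE, 1: b"x" * PAGE}
new_page0 = b"b" * PAGE

log = [make_record(0, new_page0)]

# A record for page 1 that was itself damaged before the crash.
damaged = bytearray(make_record(1, b"y" * PAGE))
damaged[100] ^= 0xFF
log.append(bytes(damaged))

# Crash in the middle of the in-place write of page 0: a torn page.
pages[0] = new_page0[:512] + pages[0][512:]

replay(log, pages)
print(pages[0] == new_page0)    # True: torn page fully overwritten
print(pages[1] == b"x" * PAGE)  # True: bad log record was skipped
```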

Therefore, sub-page corruption is a fundamentally different issue from
super-page corruption.

BTW, older textbooks will tell you that mirroring requires that you
read from two copies of the data and compare it (and have some sort of
tie breaker if you need availability).  And you also have to re-read
data you've just written to disk, to make sure it's actually there and
hit the expected sectors.  We can't even do this anymore, thanks to
disk caches.  And it doesn't seem to be necessary in most cases.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
--

From: Ric Wheeler
Date: Thursday, September 3, 2009 - 8:09 am

Yes - databases worry a lot about this. Another technique that they tend to use 
is to have state bits at the beginning and end of their logical pages. For 
example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as 
you update.

If the bits don't match, that is a quick level indication of a torn write.
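
A minimal sketch of that check in Python, using a whole byte at each end for
simplicity. The page size and the helper names stamp and is_torn are
assumptions for illustration only.

```python
def stamp(body: bytes, toggle: int) -> bytes:
    """Wrap a page body with the same toggle byte at both ends."""
    return bytes([toggle]) + body + bytes([toggle])


def is_torn(page: bytes) -> bool:
    """A page whose first and last bytes disagree was only partly written."""
    return page[0] != page[-1]


BODY = 8190  # made-up body size giving an 8192-byte page
old = stamp(b"o" * BODY, 0)
new = stamp(b"n" * BODY, 1)

# Power fails after the first 4096 bytes of the new page reached the disk.
torn = new[:4096] + old[4096:]

print(is_torn(old), is_torn(new), is_torn(torn))  # False False True
```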

Even with the above scheme, you can still have data loss of course - you just 
need an IO error in the log and in your db table that was recently updated. Not 
entirely unlikely, especially if you use write cache enabled storage and don't 

We have to be careful to keep our terms clear since the DB pages are (usually) 
larger than the FS block size which in turn is larger than non-RAID storage 
sector size. At the FS level, we send down multiples of fs blocks (not 
blocked/aligned at RAID stripe levels, etc).

In any case, we can get sub-FS block level "torn writes" even with a local S-ATA 

We can do something like this with the built in RAID in btrfs. If you detect an 
IO error (or bad checksum) on a read, btrfs knows how to request/grab another copy.

Also note that the SCSI T10 DIF/DIX has baked in support for applications to 
layer on extra data integrity (look for MKP's slide decks). This is really neat 
since you can intercept bad IO's on the way down and prevent overwriting good data.

ric

--

From: Krzysztof Halasa
Date: Thursday, September 3, 2009 - 4:50 pm

Yes. However, you mentioned external RAID arrays disable disk caches.
That's why I asked if they are using SATA or SCSI/etc. disks, and if

The cache is flushed with working barriers. I guess it should be
superior to disabled WB cache, in both performance and expected disk
lifetime.
-- 
Krzysztof Halasa
--

From: Ric Wheeler
Date: Thursday, September 3, 2009 - 5:39 pm

Sorry for the confusion - they disable the write caches on the component 
drives normally, but have their own write cache which is not disabled in 

True - barriers (especially on big, slow s-ata drives) usually give you 
an overall win. On SAS drives it seems to make less of an impact, but then
you always need to benchmark your workload on anything to get the only 
numbers that really matter :-)

ric

--

From: Mark Lord
Date: Friday, September 4, 2009 - 2:21 pm

Ric Wheeler wrote:
..

Rather than further trying to cripple Linux on the notebook,
(it's bad enough already)..

How about instead, *fixing* the MD layer to properly support barriers?
That would be far more useful, productive, and better for end-users.

Cheers
--

From: Ric Wheeler
Date: Friday, September 4, 2009 - 2:29 pm

People using MD on notebooks (not sure there are that many using RAID5 

Fixing MD would be great - not sure that it would end up still faster 
(look at md1 devices with working barriers compared to md1 with
write cache disabled).

In the meantime, if you are using MD to make your data more reliable, I
would still strongly urge you to disable the write cache when you see 
"barriers disabled" messages spit out in /var/log/messages :-)

ric

--

From: Mark Lord
Date: Saturday, September 5, 2009 - 5:57 am

..

There's no inherent reason for it to be slower, except possibly
drives with b0rked FUA support.

So the first step is to fix MD to pass barriers to the LLDs
for most/all RAID types. 

Then, if it has performance issues, those can be addressed
by more application of little grey cells.  :)

Cheers
--

From: Ric Wheeler
Date: Saturday, September 5, 2009 - 6:40 am

The performance issue with MD is that the "simple" answer is to not only 
pass on those downstream barrier ops, but also to block and wait until 
all of those dependent barrier ops complete before ack'ing the IO.

When you do that implementation at least, you will see a very large 
performance impact and I am not sure that you would see any degradation 
vs just turning off the write caches.

Sounds like we should actually do some testing and measure. I
do think that it will vary with the class of device quite a lot just 
like we see with single disk barriers vs write cache disabled on SAS vs 
S-ATA, etc...

ric

--

From: NeilBrown
Date: Saturday, September 5, 2009 - 2:43 pm

Having MD "pass barriers" to LLDs isn't really very useful.
The barrier needs to act with respect to all addresses of the device,
and once you pass it down, it can only act with respect to addresses
on that device.
What any striping RAID level needs to do when it sees a barrier
is:
   suspend all future writes
   drain and flush all queues
   submit the barrier write
   drain and flush all queues
   unsuspend writes

I guess "drain and flush all queues" can be done with an empty barrier
so maybe that is exactly what you meant.
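
The five steps above can be sketched as a toy, single-threaded simulation.
Everything here (the `Member` and `StripedArray` classes, the three-stage
queue/cache/media model) is an assumption made for illustration and is not
how the MD code is structured.

```python
class Member:
    """One component device: request queue -> volatile cache -> stable media."""

    def __init__(self):
        self.queue = []   # accepted, not yet sent to the drive
        self.cache = []   # in the drive's volatile write cache
        self.media = []   # stable on the platters

    def submit(self, data):
        self.queue.append(data)

    def drain(self):
        self.cache.extend(self.queue)
        self.queue.clear()

    def flush(self):
        self.media.extend(self.cache)
        self.cache.clear()


class StripedArray:
    def __init__(self, n_members):
        self.members = [Member() for _ in range(n_members)]
        self.suspended = False
        self.held = []    # writes that arrived while suspended

    def write(self, member, data):
        if self.suspended:
            self.held.append((member, data))
        else:
            self.members[member].submit(data)

    def _drain_and_flush_all(self):
        for m in self.members:
            m.drain()
            m.flush()

    def barrier_write(self, member, data):
        self.suspended = True               # suspend all future writes
        self._drain_and_flush_all()         # drain and flush all queues
        self.members[member].submit(data)   # submit the barrier write
        self._drain_and_flush_all()         # drain and flush all queues
        self.suspended = False              # unsuspend writes
        held, self.held = self.held, []
        for m, d in held:
            self.write(m, d)
```

The point of the sketch is that both flushes touch every member, not just the
one receiving the barrier write, which is where the cost comes from.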

The double flush which (I think) is required by the barrier semantic
is unfortunate.  I wonder if it would actually make things slower than
necessary.


--

From: Pavel Machek
Date: Monday, September 7, 2009 - 4:45 am

Yes, but ext3 was designed to handle the partial write  (according to


Yes, that's what barriers are for. Except that they are not there on
MD0/MD5/MD6. They actually work on local sata drives...

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Theodore Tso
Date: Monday, September 7, 2009 - 6:10 am

I'm not sure what made you think that I said that.  In practice things
usually work out, as a consequence of the fact that ext3 uses physical

Yes, but ext3 does not enable barriers by default (the patch has been
submitted but akpm has balked because he doesn't like the performance
degradation and doesn't believe that Chris Mason's "workload of doom"
is a common case).  Note though that it is possible for dirty blocks
to remain in the track buffer for *minutes* without being written to
spinning rust platters without a barrier.

See Chris Mason's report of this phenomenon here:

	http://lkml.org/lkml/2009/3/30/297

Here's Chris Mason "barrier test" which will corrupt ext3 filesystems
50% of the time after a power drop if the filesystem is mounted with
barriers disabled (which is the default; use the mount option
barrier=1 to enable barriers):

	http://lkml.indiana.edu/hypermail/linux/kernel/0805.2/1518.html

(Yes, ext4 has barriers enabled by default.)

							- Ted
--


Or migrate to ext4, which does use barriers by default, as well as
journal-level checksumming.  :-)

As far as changing the default to enable barriers for ext3, you'll
need to talk to akpm about that; he's the one who has been against it
in the past.

					- Ted
--


So what I recommend for server class machines is to either turn off
the automatic fsck's (it's the default, but it's documented and there
are supported ways of turning it off --- that's hardly developers
"ramming" it down user's throats), or more preferably, to use LVM, and

You can do this with ext3/ext4 today, now.  Just take a look at
e2croncheck in the contrib directory of e2fsprogs.  Changing it to not

Hmm, why are you running on battery so often?  I make a point of
running connected to the AC mains whenever possible, because a LiOn
battery only has about 200 full-cycle charge/discharges in it, and
given the cost of LiOn batteries, basically each charge/discharge
cycle costs a dollar each.  So I only run on batteries when I
absolutely have to, and in practice it's rare that I dip below 30% or

So e2fsck would fix the cross-linking.  We do need to have some better
tools to do forced rewrite of sectors that have gone bad in a HDD.  It
can be done by using badblocks -n, but translating the sector number
emitted by the device driver (which for some drivers is relative to
the beginning of the partition, and for others is relative to the
beginning of the disk).  It is possible to run badblocks -w on the
whole disk, of course, but it's better to just run it on the specific

Well, it actually is a problem.  And there may be other problems
hiding that you're not aware of.  Running "badblocks -b 4096 -n" may
discover other blocks that have failed, and you can then decide
whether you want to let fsck fix things up.  If you don't, though,
it's probably not fair to blame ext3 or e2fsck for any future
failures (not that it's likely to stop you :-).

	      	   	       	       	   - Ted
--


For people using e2croncheck, where you can check it when the system
is idle and without needing to do a power cycle, I'd recommend once a

Some distributions will allow you to cancel an fsck; either by using
^C, or hitting escape.  That's a matter for the boot scripts, which
are distribution specific.  Ubuntu has a way of doing this, for
example, if I recall correctly --- although since I've started using
e2croncheck, I've never had an issue with an e2fsck taking place on
bootup.  Also, with ext4, fscks are so much faster that even before I
upgraded to using an SSD, it's never been an issue for me.  It's

Complain to your distribution.  :-)

Or this is Linux and open source; fix it yourself, and submit the
patches back to your distribution.  If all you want to do is whine,
then maybe Rob's choice is the best way, go switch to the velvet-lined
closed system/jail which is the Macintosh.  :-)

(I created e2croncheck to solve my problem; if that isn't good enough
for you, I encourage you to find/create your own fixes.)

							- Ted
--


frequently they are exactly the same drives, with exactly the same 
firmware.

you disable the write caches on the drives themselves, but you add a large 

it depends on what raid array you use, some use SATA, some use SAS/SCSI

David Lang
--


I was thinking about that as well. Having us disable the write cache when we 
know it is not supported (like in the MD5 case) would certainly be *much* safer 
for almost everyone.

We would need to have a way to override the write cache disabling for people who 
either know that they have a non-volatile write cache (unlikely as it would 
probably be to put MD5 on top of a hardware RAID/external array, but some of the 
new SSD's claim to have non-volatile write cache).

It would also be very useful to have all of our top tier file systems enable 
barriers by default, provide consistent barrier on/off mount options and log a 
nice warning when not enabled....

ric

--


I've done this when the hardware raid only supported raid 5 but I wanted
raid 6. I've also done it when I had enough disks to need more than one 
hardware raid card to talk to them all, but wanted one logical drive for 

most people are not willing to live with unbuffered write performance. 
they care about their data, but they also care about performance, and 
since performance is what they see on an ongoing basis, they tend to care
more about performance.

given that we don't even have barriers enabled by default on ext3 due to 
the performance hit, what makes you think that disabling buffers entirely 
is going to be acceptable to people?

David Lang
--


We do (and have for a number of years) enable barriers by default for XFS and 
reiserfs. In SLES, ext3 has default barriers as well.

Ric

--


I'm not sure what you mean by unbuffered write support; the only
common use of that term is for userspace I/O using the read/write
system calls directly, in comparison to buffered I/O which uses
the stdio library.

But rest assured that the use of barriers and cache flushes in fsync does not
completely disable caching (or "buffering"); it just flushes
the disk write cache in case we either commit a log buffer that needs to
be on disk, or perform an fsync where we really do want to have data
on disk instead of lying to the application about the status of the
I/O completion.  Which btw could be interpreted as a violation of the
POSIX rules.
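
As a small illustration of the distinction drawn above, here is a sketch of a
write that goes through a userspace buffer and is then pushed down with
fsync. The helper name `durable_write` is hypothetical, and whether the data
really reaches stable media still depends on the filesystem and on working
barriers or cache flushes, which is the subject of this thread.

```python
import os


def durable_write(path: str, data: bytes) -> None:
    """Write data and ask the kernel to push it to the device before returning."""
    with open(path, "wb") as f:   # buffered, like stdio
        f.write(data)             # may only reach the userspace buffer
        f.flush()                 # userspace buffer -> kernel page cache
        os.fsync(f.fileno())      # page cache -> device (and, with working
                                  # barriers, out of the drive's write cache)
```

Writes that are not followed by fsync continue to be cached normally; only
the points where the application asks for durability pay for the flush.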

--


as I understood it, the proposal that I responded to was to change the 
kernel to detect if barriers are enabled for the entire stack or not, and 
if not disable the write caches on the drives.

there are definitely times when that is the correct thing to do, but I
am not sure that it is the correct thing to do by default.

David Lang
--


If you are using one with journal, you'll still need to run fsck at
boot time, to make sure metadata is still consistent... Protection
provided by journaling is not effective in this configuration.

(You have the point that pretty much all users of the blockdevice will
be affected by powerfail degraded mode.)
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--


But we do; competently designed (and in the case of software RAID,
competently packaged) RAID subsystems send notifications to the system
administrator when there is a hard drive failure.  Some hardware RAID
systems will send a page to the system administrator.  A mid-range
Areca card has a separate ethernet port so it can send e-mail to the
administrator, even if the OS is hosed for some reason.

And it's not a matter of journalling being ineffective; the much bigger deal
is, "your data is at risk"; perhaps because the file system metadata
may become subject to corruption, but more critically, because the
file data may become subject to corruption.  Metadata becoming subject
to corruption is important primarily because it leads to data becoming
corrupted; metadata is the tail; the user's data is the dog.

So we *do* have the warning light; the problem is that just as some
people may not realize that "check brakes" means, "YOU COULD DIE",
some people may not realize that "hard drive failure; RAID array
degraded" could mean, "YOU COULD LOSE DATA".

Fortunately, for software RAID, this is easily solved; if you are so
concerned, why don't you submit a patch to mdadm adjusting the e-mail
sent to the system administrator when the array is in a degraded
state, such that it states, "YOU COULD LOSE DATA".  I would gently
suggest to you this would be ***far*** more effective than a patch to
kernel documentation.

						- Ted
--


In the case of a degraded array, could the kernel be more proactive
(or maybe even mdadm) and have the filesystem remount itself withOUT
journalling enabled?  This seems on the surface to be possible, but I
don't know the internal particulars that might prevent/allow it.
--


This is a misconception - with or without journalling, you are open to a second
failure during a RAID rebuild.

Also note that by default, ext3 does not mount with barriers turned on.

Even if you mount with barriers, MD5 does not handle barriers, so you stand to 
lose a lot of data if you have a power outage.

Ric

--

From: Ron Johnson
Date: Monday, August 31, 2009 - 2:01 pm

On 2009-08-31 13:01, Ric Wheeler wrote:

Pardon me for asking for such a seemingly obvious question, but what 
(besides "Message-Digest algorithm 5") is MD5?

(I've always seen "multiple drive" written in the lower case "md".)

-- 
Brawndo's got what plants crave.  It's got electrolytes!
--


also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 =

Why would I want to disable the filesystem journal in that case?

-- 
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
"i can stand brute force, but brute reason is quite unbearable. there
 is something unfair about its use. it is hitting below the
 intellect."
                                                        -- oscar wilde

I misspoke w.r.t journalling, the idea I was trying to get across was
to remount with -o sync while running on a degraded array, but given
some of the other comments in this thread I'm not even sure that would
help.  the idea was to make writes as safe as possible (at the cost of
speed) when running on a degraded array, and to have the transition be
as hands-free as possible, just have the kernel (or mdadm) by default
remount.
--


Much better, I'd think, to "just" have it scream out DANGER!! WILL 
ROBINSON!! DANGER!! to syslog and to an email hook.

-- 
Brawndo's got what plants crave.  It's got electrolytes!
--


also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.09.01.0026 =

I don't see how that is any more necessary with a degraded array
than it is when you have a fully working array. Sync just ensures
that the data are written and not cached, but that has absolutely
nothing to do with the underlying storage. Or am I failing to see
the link?

-- 
`. `'`   http://people.debian.org/~madduck    http://vcs-pkg.org
  `-  Debian - when you have better things to do than fixing systems
"how do you feel about women's rights?"
"i like either side of them."
                                                       -- groucho marx

Well, my MMC/uSD cards do not have ethernet ports to remind me that
they are unreliable :-(.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Florian Weimer
Date: Friday, August 28, 2009 - 12:11 am

In RAID 1 mode, it should read both copies and error out on
mismatch. 8-)

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
--

From: NeilBrown
Date: Friday, August 28, 2009 - 12:23 am

Despite your smiley:

  no it shouldn't, and no one is making any claims about raid1 being
  unsafe, only raid4/5/6.

NeilBrown

--

From: david
Date: Tuesday, August 25, 2009 - 4:46 pm

substitute 'degraded MD RAID 5' for 'MD RAID 5' and you have a point here.
although the language you are using is pretty harsh. you make it sound 
like this is a problem with ext3 when the filesystem has nothing to do 
with it. the problem is that a degraded raid 5 array can be corrupted by 

by the way, while you are thinking about failures that can happen from a 
failed write corrupting additional blocks, think about the nightmare that 
can happen if those blocks are in the journal.

the 'repair' of ext2 by a fsck is actually much less than you are thinking 
that it is.

David Lang
--

From: Theodore Tso
Date: Tuesday, August 25, 2009 - 9:11 am

It seems that you are really hung up on whether or not the filesystem
metadata is consistent after a power failure, when I'd argue that
storage devices that don't have good powerfail properties have much
bigger problems (such as the potential for silent
data corruption, or even if fsck will fix a trashed inode table with
ext2, massive data loss).  So instead of your suggested patch, it
might be better simply to have a file in Documentation/filesystems
that states something along the lines of:

"There are storage devices that have highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and software RAID 5/6
arrays without journals, as well as hardware RAID 5/6 devices without
battery backups.  These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
adjacent sectors are also damaged during the power failure.

Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used.  Regular backups when using these devices is also a
Very Good Idea.

Otherwise, file systems placed on these devices can suffer silent data
and file system corruption.  A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption."

My big complaint is that you seem to think that ext3 somehow let you
down, but I'd argue that the real issue is that the storage device let
you down.  Any journaling filesystem will have the properties that you
seem to be complaining about, so the fact that your patch only
documents this as assumptions made by ext2 and ext3 is unfair; it also
applies to xfs, jfs, reiserfs, reiser4, etc.  Furthermore, most users
are even ...
From: Pavel Machek
Date: Tuesday, August 25, 2009 - 3:21 pm

In FTL case, damaged sectors are not necessarily adjacent. Otherwise

Ok, would you be against adding:

"Running non-journalled filesystem on these may be desirable, as

Yes, it applies to all journalling filesystems; it is just that I was 
clever/paranoid enough to avoid anything non-ext3.

ext3 docs still say:
# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the

Ok, works for me.

---

From: Theodore Tso <tytso@mit.edu>

Document that many devices are too broken for filesystems to protect
data in case of powerfail.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..e1a46dd
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,19 @@
+There are storage devices that have highly undesirable properties
+when they are disconnected or suffer power failures while writes are
+in progress; such devices include flash devices and software RAID 5/6
+arrays without journals, as well as hardware RAID 5/6 devices without
+battery backups.  These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+        
+Users who use such storage devices are well advised to take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used.  Regular backups when using these devices is also a
+Very Good Idea.
+        
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption.  A forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
\ No newline at end of ...
From: david
Date: Tuesday, August 25, 2009 - 3:33 pm

is it under all conditions, or only when you have already lost redundancy?

prior discussions make me think this was only if the redundancy is already 
lost.

also, the talk about software RAID 5/6 arrays without journals will be 
confusing (after all, if you are using ext3/XFS/etc you are using a 
journal, aren't you?)

you then go on to talk about hardware raid 5/6 without battery backup. I
think that you are being too specific here. any array without battery
backup can lead to 'interesting' situations when you lose power.

in addition, even with a single drive you will lose some data on power
loss (unless you do sync mounts with disabled write caches), full data 
journaling can help protect you from this, but the default journaling just 
protects the metadata.

David Lang
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 3:40 pm

I'm not so sure now.

Let's say you are writing to the (healthy) RAID5 and have a powerfail.

So now data blocks do not correspond to the parity block. You don't
yet have the corruption, but you already have a problem.
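
A small worked example of why stale parity is "already a problem", assuming a
toy three-disk RAID 5 stripe with two data blocks and one XOR parity block.
The block values and function names are made up for illustration.

```python
def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks (RAID 5 style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


def demo():
    d0, d1 = b"\x11\x11", b"\x22\x22"
    parity = xor_blocks(d0, d1)          # consistent stripe: 0x33 0x33

    # Power fails after the new d0 reaches disk but before parity is updated.
    d0_new = b"\x55\x55"
    stale_parity = parity

    # Later the disk holding d1 dies; d1 is rebuilt from what is left.
    rebuilt_d1 = xor_blocks(d0_new, stale_parity)
    return d1, rebuilt_d1
```

Here `d1` was never written during the interrupted update, yet the rebuilt
copy differs from the original: the damage only becomes visible once
redundancy is lost.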


Slightly confusing, yes. Should I just say "MD RAID 5" and avoid
talking about hardware RAID arrays, where that's really

"Data loss" here means "damaging data that were already fsynced". That
will not happen on single disk (with barriers on etc), but will happen
on RAID5 and flash.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: david
Date: Tuesday, August 25, 2009 - 3:59 pm

you need to, otherwise you are claiming that all linux software raid 
implementations will lose data on powerfail, which I don't think is the

it's the same combination of problems (non-redundant array and write lost 
to powerfail/reboot), just in a different order.

recommending a scrub of the raid after an unclean shutdown would make
sense, along with a warning that if you lose all redundancy before the
scrub is completed and there was a write failure in the unscrubbed portion 

what about dm raid?


this definition of data loss wasn't clear prior to this. you need to 
define this, and state that the reason that flash and raid arrays can 
suffer from this is that both of them deal with blocks of storage larger 
than the data block (eraseblock or raid stripe) and there are conditions 
that can cause the loss of the entire eraseblock or raid stripe which can 
affect data that was previously safe on disk (and if power had been lost 
before the latest write, the prior data would still be safe)

note that this doesn't necessarily affect all flash disks. if the disk
doesn't replace the old block in the FTL until the data has all been
successfully copied to the new eraseblock you don't have this problem.
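
The ordering described above can be sketched with a toy translation layer.
Real FTLs are proprietary and far more complex; the `TinyFTL` class, its
fields and the `fail_before_remap` switch are purely hypothetical.

```python
class TinyFTL:
    """Toy flash translation layer: logical eraseblock -> physical eraseblock."""

    def __init__(self):
        self.blocks = {0: ["old-A", "old-B"]}  # physical eraseblock contents
        self.map = {0: 0}                      # logical -> physical
        self.next_free = 1

    def update(self, logical, index, value, fail_before_remap=False):
        old_phys = self.map[logical]
        new_copy = list(self.blocks[old_phys])
        new_copy[index] = value
        new_phys = self.next_free
        self.next_free += 1
        self.blocks[new_phys] = new_copy   # 1. write the complete new copy
        if fail_before_remap:
            return                         # simulated power loss
        self.map[logical] = new_phys       # 2. only then switch the mapping
        del self.blocks[old_phys]          # 3. finally erase the old block

    def read(self, logical):
        return self.blocks[self.map[logical]]
```

With this order, a power loss at any point leaves the mapping pointing at a
complete eraseblock, so neighbouring sectors that were already safe stay safe.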

some (possibly all) cheap thumb drives don't do this, but I would expect
the expensive SATA SSDs to do things in the right order.

do this right and you are properly documenting a failure mode that most 
people don't understand, but go too far and you are crying wolf.

David Lang
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 4:37 pm

I actually think it was. write() syscall does not guarantee anything,


I'd expect SATA SSDs to have that solved, yes. Again, Ted does not say

Ok, latest version is below, can you suggest improvements? (And yes,
details when exactly RAID-5 misbehaves should be noted somewhere. I
don't know enough about RAID arrays, can someone help?)
									Pavel

---
There are storage devices that have highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and MD RAID 4/5/6
arrays.  These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
additional sectors are also damaged during the power failure.
        
Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used.  Regular backups when using these devices is also a
Very Good Idea.
        
Otherwise, file systems placed on these devices can suffer silent data
and file system corruption.  A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 4:48 pm

I would strike the entire mention of MD devices since it is your assertion, not 
a proven fact. You will cause more data loss from common events (single sector 
errors, complete drive failure) by steering people away from more reliable 
storage configurations because of a really rare edge case (power failure during 

All users who care about data integrity - including those who do not use MD5 but 

This is very misleading. All storage "can" have silent data loss, you are making 
a statement without specifics about frequency.

FSCK can repair the file system metadata, but will not detect any data loss or 
corruption in the data blocks allocated to user files. To detect data loss 
properly, you need to checksum (or digitally sign) all objects stored in a file 
system and verify them on a regular basis.

Also helps to keep a separate list of those objects on another device so that 
when the metadata does take a hit, you can enumerate your objects and verify 
that you have not lost anything.
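
A minimal sketch of the checksum-and-enumerate approach, assuming SHA-256 over
regular files under one directory tree. The function names are illustrative,
and in practice the manifest would be stored on a different device, as
suggested above.

```python
import hashlib
import os


def file_digest(path: str) -> str:
    """SHA-256 of a file's contents, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: str) -> dict:
    """Map each file's path (relative to root) to its digest."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = file_digest(full)
    return manifest


def verify(root: str, manifest: dict):
    """Return (missing, changed) lists of relative paths."""
    missing, changed = [], []
    for rel, digest in sorted(manifest.items()):
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            missing.append(rel)      # object lost, e.g. after metadata damage
        elif file_digest(full) != digest:
            changed.append(rel)      # silent data corruption (or an edit)
    return missing, changed
```

Running `verify` on a schedule catches what fsck cannot: files whose metadata
is intact but whose contents have changed, and files that have vanished.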

ric


--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 5:06 pm

That actually is a fact. That's how MD RAID 5 is designed. And btw

I'm not sure what's rare about power failures. Unlike single sector
errors, my machine actually has a button that produces exactly that
event. Running degraded raid5 arrays for extended periods may be
slightly unusual configuration, but I suspect people should just do
that for testing. (And from the discussion, people seem to think that

substitute with "can (by design)"?

Now, if you can suggest useful version of that document meeting your
criteria?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:12 pm

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 5:20 pm

So what? He clearly knows how it works.

Instead of arguing he's wrong, will you simply label everything as

Look, I don't need full drive failure for this to happen. I can just
remove one disk from array. I don't need power failure, I can just
press the power button. I don't even need to rebuild anything, I can
just write to degraded array.

Given that all events are under my control, statistics make little
sense here.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: david
Date: Tuesday, August 25, 2009 - 5:26 pm

if you are intentionally causing several low-probability things to happen 
at once you increase the risk of corruption

note that you also need a write to take place, and be interrupted in just 
the right way.

David Lang
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:28 pm

You are deliberately causing a double failure - pressing the power button after 
pulling a drive is exactly that scenario.

Pull your single (non-MD5) disk out while writing (hot unplug from the S-ATA 
side, leaving power on) and run some tests to verify your assertions...

ric

--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 5:38 pm

Exactly. And now I'm trying to get that documented, so that people

I actually did that some time ago with pulling SATA disk (I actually
pulled both SATA *and* power -- that was the way hotplug envelope
worked; that's more harsh test than what you suggest, so that should
be ok). Write test was fsync heavy, with logging to separate drive,
checking that all the data where fsync succeeded are indeed
accessible. I uncovered a few bugs in ext* that jack fixed, I uncovered
some libata weirdness that is not yet fixed AFAIK, but with all the
patches applied I could not break that single SATA disk.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:45 pm

The problem I have is that the way you word it steers people away from RAID5 and 
better data integrity. Your intentions are good, but your text is going to do 
considerable harm.

Most people don't intentionally drop power (or have a power failure) during RAID 


Fsync heavy workloads with working barriers will tend to keep the write cache 
pretty empty (two barrier flushes per fsync) so this is not too surprising.

Drive behaviour depends on a lot of things though - how the firmware prioritizes 
writes over reads, etc.

ric
--

From: Pavel Machek
Date: Wednesday, August 26, 2009 - 4:21 am

The example I have seen went like this:

Drive in raid 5 failed; hot spare was available (no idea about
UPS). System apparently locked up trying to talk to the failed drive,
or maybe admin just was not patient enough, so he just powercycled the
array. He lost the array.

So while most people will not aggressively powercycle the RAID array,
drive failure still provokes little-tested error paths, and getting
unclean shutdown is quite easy in such case.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ric Wheeler
Date: Wednesday, August 26, 2009 - 4:58 am

Then what we need to document is do not power cycle an array during a 
rebuild, right?

If it wasn't the admin that timed out and the box really was hung (no 
drive activity lights, etc), you will need to power cycle/reboot but 
then you should not have this active rebuild issuing writes either...

In the end, there are cascading failures that will defeat any data 
protection scheme, but that does not mean that the value of that scheme 
is zero. We need to get more people to use RAID (including MD5) and 
try to enhance it as we go. Just using a single disk is not a good thing...

ric

--

From: Theodore Tso
Date: Wednesday, August 26, 2009 - 5:40 am

Well, the software RAID layer could be improved so that it implements
scrubbing by default (i.e., have the md package install a cron job to
implement a periodic scrub pass automatically).  The MD code could
also regularly check to make sure the hot spare is OK; the other
possibility is that hot spare, which hadn't been used in a long time,

Yep; the solution is to improve the storage devices.  It is *not* to
encourage people to think RAID is not worth it, or that somehow ext2
is better than ext3 because it runs fsck's all the time at boot up.
That's just crazy talk.

						- Ted
--

From: Ric Wheeler
Date: Wednesday, August 26, 2009 - 6:11 am

Actually, MD does this scan already (not automatically, but you can set up a 
simple cron job to kick off a periodic "check"). It is a delicate balance to get 
the frequency of the scrubbing correct.
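
A minimal sketch of the kind of cron-driven "check" described above. It
assumes the array exposes the md sysfs attribute sync_action (for
example /sys/block/md0/md/sync_action) and that the caller runs as
root. The function name start_md_check and the sysfs_root parameter are
made up for this illustration; sysfs_root exists only so the logic can
be tried against a scratch directory.

```python
import os

def start_md_check(md_name, sysfs_root="/sys/block"):
    """Request a read-only consistency check ("scrub") of one MD array.

    Returns True if the check was requested, False if the array was busy.
    """
    path = os.path.join(sysfs_root, md_name, "md", "sync_action")
    with open(path) as f:
        state = f.read().strip()
    if state != "idle":
        # A resync, recovery or earlier check is still running; leave it alone.
        return False
    with open(path, "w") as f:
        f.write("check\n")
    return True
```

A monthly cron entry could call start_md_check("md0"). Skipping busy
arrays is a design choice of this sketch, so that a scrub never
competes with a rebuild.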

On one hand, you want to make sure that you detect errors in a timely fashion, 
certainly detection of single sector errors before you might develop a second 
sector level error on another drive.

On the other hand, running scans/scrubs continually impacts the performance of 
your real workload and can potentially impact your components' life span by 
subjecting them to a heavy workload.

Rule of thumb from my experience is that most people settle in with a scan 

Agreed....

ric
--

From: david
Date: Wednesday, August 26, 2009 - 6:44 am

debian defaults to doing this once a month (first sunday of each month), 
on some of my systems this scrub takes almost a week to complete.

David Lang
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 2:38 am

From: Rik van Riel
Date: Tuesday, August 25, 2009 - 9:24 pm

I recommend a sledgehammer.

If you want to lose your data, you might as well have some fun.

No need to bore yourself to tears by simulating events that are
unlikely to happen simultaneously to careful system administrators.

-- 
All rights reversed.
--

From: Pavel Machek
Date: Wednesday, August 26, 2009 - 4:22 am

Sledgehammer is hardware problem, and I'm demonstrating
software/documentation problem we have here.
								Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rik van Riel
Date: Wednesday, August 26, 2009 - 7:45 am

So your argument is that a sledgehammer is a hardware
problem, while a broken hard disk and a power failure
are software/documentation issues?

I'd argue that the broken hard disk and power failure
are hardware issues, too.

-- 
All rights reversed.
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 2:39 am

No one told me that degraded md raid5 is dangerous. That's documentation
issue #1. Maybe I just pulled the disk for fun.

ext3 docs told me that journal protects me against fs corruption
during power fails. It does not in this particular case. Seems like
docs issue #2. Maybe I just hit the reset button because it was there.

Randomly hitting power button may be stupid, but should not result in
filesystem corruption on reasonably working filesystem/storage stack.

									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ron Johnson
Date: Saturday, August 29, 2009 - 4:47 am

You're kidding, right?

Or are you being too effectively sarcastic?

-- 
Obsession with "preserving cultural heritage" is a racist impediment
to moral, physical and intellectual progress.
--

From: jim owens
Date: Saturday, August 29, 2009 - 9:12 am

No he is not... and that is exactly why Ted and Ric have been
fighting so hard against his scare the children documentation.

In 20 years, I have not found a way to educate those who think
"I know computers so it must work the way I want and expect."

Tremendous amounts of information and recommendations are out
there on the web, in books, classes, etc.  But people don't
research before using or understand before they have a problem.


Pavel, *THE KERNEL IS NOT BUGGY* end of story!

Everyone experienced in storage understands the "in the
edge case that Pavel hit, you will lose your data", and we
take our responsibility to tell people what works and does
not work very seriously.  And we try very hard to reduce the
amount of edge case data losses.

But as Ric and Ted and many others keep trying to explain:

- There is no such thing as "never fails" data storage.

- The goal of journal file systems is not what you think.

- The goal of raid is not what you think.

- We do not want the vast majority of computer users who
   are not kernel engineers to stop using the technology
   that in 99.99 percent of the use cases keeps their data
   as safe as we can reasonably make it, just because they
   read Pavel's 0.01 percent scary and inaccurate case.

And the worst part is this 0.01 percent case problem
is really "I did not know what I was doing".

jim

--

From: david
Date: Tuesday, August 25, 2009 - 4:56 pm

change this to say 'degraded MD RAID 4/5/6 arrays'

also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly 
suspect that they do)

then you need to add a note that if the array becomes degraded before a 
scrub cycle happens previously hidden damage (that would have been 


re-word this something like

In addition to the standard risk of corrupting the blocks being written at 
the time of the power failure, additional blocks (in the same flash 

David Lang
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 5:12 pm

I'd prefer not to talk about scrubbing and such details here. Better

Actually I don't think so. I believe SATA disks do not corrupt even
the sector they are writing to -- they just have big enough
capacitors. And yes I believe ext3 depends on that.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: david
Date: Tuesday, August 25, 2009 - 5:20 pm

I disagree with that, the way you are wording this makes it sound as if 
raid isn't worth it. if you are going to say that raid is risky you need 

you are incorrect on this.

ext3 (like every other filesystem) just accepts the risk (zfs makes some 
attempt to detect such corruption)

David Lang
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 5:39 pm

Ok, would this help? I don't really want to go to scrubbing details.

(*) Degraded array or single disk failure "near" the powerfail is

I'd like Ted to comment on this. He wrote the original document, and
I'd prefer not to introduce mistakes.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: david
Date: Tuesday, August 25, 2009 - 6:17 pm

that sounds reasonable

David Lang
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:26 pm

Then you should punt the MD discussion to the MD documentation entirely.

I would suggest:

"Users of any file system that have a single media (SSD, flash or normal disk) 
can suffer from catastrophic and complete data loss if that single media fails. 
To reduce your exposure to data loss after a single point of failure, consider 
using either hardware or properly configured software RAID. See the 
documentation on MD RAID for how to configure it.

To ensure proper fsync() semantics, you will need to have a storage device that 
supports write barriers or have a non-volatile write cache. If not, best 

Pavel, no S-ATA drive has capacitors to hold up during a power failure (or even 
enough power to destage their write cache). I know this from direct, personal 
knowledge having built RAID boxes at EMC for years. In fact, almost all RAID 
boxes require that the write cache be hardwired to off when used in their arrays.

Drives fail partially on a very common basis - look at your remapped sector 
count with smartctl.

RAID (including MD RAID5) will protect you from this most common error as it 
will protect you from complete drive failure which is also an extremely common 
event.

Your scenario is really, really rare - doing a full rebuild after a complete 
drive failure (takes a matter of hours, depends on the size of the disk) and 
having a power failure during that rebuild.

Of course adding a UPS to any storage system (including MD RAID system) helps 
make it more reliable, specifically in your scenario.

The more important point is that having any RAID (MD1, MD5 or MD6) will greatly 
reduce your chance of data loss if configured correctly. With ext3, ext2 or zfs.

Ric

--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 5:44 pm

I never claimed they have enough power to flush entire cache -- read
the paragraph again. I do believe the disks have enough capacitors to
finish writing single sector, and I do believe ext3 depends on that.

									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Ric Wheeler
Date: Tuesday, August 25, 2009 - 5:50 pm

Some scary terms that drive people mention (and measure):

"high fly writes"
"over powered seeks"
"adjacent track erasure"

If you do get a partial track written, the data integrity bits that the data is 
embedded in will flag it as invalid and give you an IO error on the next read. 
Note that the damage is not persistent, it will get repaired (in place) on the 
next write to that sector.

Also it is worth noting that ext2/3/4 write file system "blocks" not single 
sectors. Each ext3 IO is 8 distinct disk sector writes and those can span tracks 
on a drive which require a seek which all consume power.

On power loss, a disk will immediately park the heads...

ric

--

From: david
Date: Tuesday, August 25, 2009 - 6:19 pm

keep in mind that in a powerfail situation the data being sent to the 
drive may be corrupt (the ram gets flaky while a DMA to the drive copies 
the bad data to the drive, which writes it before the power loss gets bad 
enough for the drive to decide there is a problem and shutdown)

you just plain cannot count on writes that are in flight when a powerfail 
happens to do predictable things, let alone what you consider sane or 
proper.

David Lang
--

From: Pavel Machek
Date: Wednesday, August 26, 2009 - 4:25 am

From what I see, this kind of failure is rather harder to reproduce
than the software problems. And at least SGI machines were designed to
avoid this...

Anyway, I'd like to hear from ext3 people... what happens on read
errors in journal? That's what you'd expect to see in situation above.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Theodore Tso
Date: Wednesday, August 26, 2009 - 5:37 am

On a power failure, what normally happens is that the random garbage
gets written into the disk drive's last dying gasp, since the memory
starts going insane and sends garbage to the disk.  So the disk
successfully completes the write, but the sector contains garbage.
Since HDD's tend to be last thing to die, being less sensitive to
voltage drops than the memory or DMA controller, my experience is that
you don't get a read error after the system comes up, you just get
garbage written into the journal.

The ext3 journalling code waits until all of the journal data is
written, and only then writes the commit block.  On restart, we look
for the last valid commit block.  So if the power failure is before we
write the commit block, we replay the journal up until the previous
commit block.  If the power failure is while we are writing the commit
block, garbage will be written out instead of the commit block, and so
it falls back to the previous case.

We do not allow any updates to the filesystem metadata to take place
until the commit block has been written; therefore the filesystem
stays consistent.
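
The replay rule described above can be modelled in a few lines. This is
a toy sketch, not the jbd on-disk format: the record tuples and the
name replay_journal are invented for illustration, and "garbage written
during power failure" is represented by any record that does not parse.

```python
def replay_journal(journal, disk):
    """Apply only those transactions whose commit record is intact.

    journal: list of records; ("data", txid, block_no, payload) or
             ("commit", txid). Anything else is treated as garbage.
    disk:    dict mapping block number -> payload, updated in place.
    Returns the number of transactions replayed.
    """
    pending = {}      # blocks logged by the transaction currently being read
    current = None    # id of that transaction, if any
    replayed = 0
    for rec in journal:
        is_tuple = isinstance(rec, tuple)
        if is_tuple and len(rec) == 4 and rec[0] == "data":
            _, txid, block_no, payload = rec
            if current is None:
                current = txid
            if txid != current:
                break             # previous transaction never committed
            pending[block_no] = payload
        elif is_tuple and len(rec) == 2 and rec[0] == "commit" and rec[1] == current:
            disk.update(pending)  # commit seen: these blocks are safe to apply
            pending, current = {}, None
            replayed += 1
        else:
            break                 # garbage or torn record: stop replaying
    return replayed


# Transaction 1 committed; transaction 2 was cut off by the power failure.
journal = [
    ("data", 1, 10, "inode table v1"),
    ("commit", 1),
    ("data", 2, 10, "inode table v2"),
    "garbage from the dying gasp",
]
disk = {10: "stale"}
count = replay_journal(journal, disk)
```

After the call, disk[10] holds the contents from transaction 1 and the
half-written transaction 2 is ignored, which is the "falls back to the
previous case" behaviour.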

If the journal *does* develop read errors, then the boot-time fsck will
require a manual fsck, and so the boot operation will get stopped so a
system administrator can provide manual intervention.  The best bet
for the sysadmin is to replay as much of the journal as she can, and then
let fsck fix any resulting filesystem inconsistencies.  In practice,
though, I've not experienced or seen any reports of this happening
from a power failure; usually it happens if the laptop gets dropped or
the hard drive suffers some other kind of hardware failure.

    	       	       	  	       - Ted
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 11:49 pm

...and that should result in consistent fs with no data loss, because
read error is essentially the same as garbage given back, right?

...plus, this is significant difference from logical-logging
filesystems, no?

Should this go to Documentation/, somewhere?

								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rik van Riel
Date: Tuesday, August 25, 2009 - 9:20 pm

Not necessarily.  Say you wrote out the entire stripe
in a 5 disk RAID 5 array, but only 3 data blocks and
the parity block got written out before power failure.

If the disk with the 4th (unwritten) data block were
to fail and get taken out of the RAID 5 array, the
degradation of the array could actually undo your data
corruption.

With RAID 5 and incomplete writes, you just don't know.

This kind of thing could go wrong at any level in the
system, with any kind of RAID 5 setup.

Of course, on a single disk system without RAID you can
still get incomplete writes, for the exact same reasons.

RAID 5 does not make things worse.  It will protect your
data against certain failure modes, but not against others.

With or without RAID, you still need to make backups.

-- 
All rights reversed.
--

From: Pavel Machek
Date: Tuesday, August 25, 2009 - 3:27 pm

Document things ext2 expects from the storage subsystem, and the fact
that it can not handle barriers. Also remove journaling description, as
that's really ext3 material.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..e300ca8 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,17 @@ enough 4-character names to make up unique directory entries, so they
 have to be 8 character filenames, even then we are fairly close to
 running out of unique filenames.
 
+Requirements
+============
+
+Ext2 expects disk/storage subsystem not to return write errors.
+
+It also needs write caching to be disabled for reliable fsync
+operation; ext2 does not know how to issue barriers as of
+2.6.31. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie.  It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout.  In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem.  This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash.  If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem.  If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded ...
From: Rob Landley
Date: Wednesday, August 26, 2009 - 8:34 pm

Suppose a small office makes nightly backups to an offsite server via rsync.  If 
a thunderstorm goes by causing their system to reboot twice in a 15 minute 
period, would they rather notice the filesystem corruption immediately upon 

Yup.  Hopefully btrfs will cope less badly?  They keep talking about 

I doubt the cupholder crowd is going to stop treating USB sticks as magical 
any time soon, but I also wonder how many of them even remember Linux _exists_ 

Professionals have horror stories about this issue, therefore documenting it 
is _less_ important?

Ok...

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: David Woodhouse
Date: Thursday, August 27, 2009 - 1:46 am

This just goes to show why having this "translation layer" done in
firmware on the device itself is a _bad_ idea. We're much better off
when we have full access to the underlying flash and the OS can actually
see what's going on. That way, we can actually debug, fix and recover

It's a known failure mode of _everything_ that uses flash to pretend to
be a block device. As I see it, there are no SSD devices which don't
lose data; there are only SSD devices which haven't lost your data
_yet_.

There's no fundamental reason why it should be this way; it just is.

(I'm kind of hoping that the shiny new expensive ones that everyone's
talking about right now, that I shouldn't really be slagging off, are
actually OK. But they're still new, and I'm certainly not trusting them
with my own data _quite_ yet.)

-- 
dwmw2

--

From: david
Date: Friday, August 28, 2009 - 7:46 am

so what sort of test would be needed to identify if a device has this 
problem?

people can do ad-hoc tests by pulling the devices in use and then checking 
the entire device, but something better should be available.

it seems to me that there are two things needed to define the tests.

1. a predictable write load so that it's easy to detect data getting lost

2. some statistical analysis to decide how many device pulls are needed 
(under the write load defined in #1) to make the odds high that the 
problem will be revealed.
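
A rough sketch of item 1, a predictable write load whose damage is easy
to detect. Everything here is an assumption made for illustration: the
4096-byte block size, the helper names, and the use of a plain file
descriptor as a stand-in for the device under test. The "expected" map
would have to be logged on a separate, trusted drive, and the block
that was in flight at the moment of the pull should be judged
separately, since either its old or its new contents are acceptable.

```python
import hashlib
import os

BLOCK = 4096  # assumed test block size, not a property of any device

def pattern(block_no, generation):
    """Deterministic BLOCK-sized payload derived from (block_no, generation)."""
    seed = hashlib.sha256(f"{block_no}:{generation}".encode()).digest()
    return (seed * (BLOCK // len(seed)))[:BLOCK]

def write_block(fd, block_no, generation):
    """Write one block and return only after fsync() has reported success."""
    os.pwrite(fd, pattern(block_no, generation), block_no * BLOCK)
    os.fsync(fd)

def find_damage(fd, expected):
    """Return the sorted block numbers whose contents no longer match.

    expected: dict block_no -> last generation for which fsync() succeeded.
    """
    bad = []
    for block_no, generation in sorted(expected.items()):
        data = os.pread(fd, BLOCK, block_no * BLOCK)
        if data != pattern(block_no, generation):
            bad.append(block_no)
    return bad
```

Because every block's contents can be recomputed from its number and
generation, damage to blocks that were not being written at the time of
the pull shows up directly in the list returned by find_damage().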

with this we could have people test various devices and report if the test 
detects unrelated data being lost (or businesses, and I think the tech 
hardware sites would jump into this given some sort of accepted test)

for USB devices there may be a way to use the power management functions 
to cut power to the device without requiring it to physically be pulled, 
if this is the case (even if this only works on some specific chipsets), 
it would drastically speed up the testing

David Lang
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 3:09 am

This is really so easy to reproduce that such a speedup is not
necessary. Just try the scripts :-).
									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: david
Date: Saturday, August 29, 2009 - 9:27 am

so if it doesn't get corrupted after 5 unplugs does that mean that that 
particular device doesn't have a problem? or does it just mean you got 
lucky?

would 10 successful unplugs mean that it's safe?

what about 20?

we need to get this beyond anecdotal evidence mode, to something that 
(even if not perfect, as you can get 100 'heads' in a row with an honest 
coin) gives you pretty good assurances that a particular device is either 
good or bad.
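
One way to put numbers on this, as a sketch under two assumptions that
are not established anywhere in this thread: each unplug corrupts data
independently of the others, and with the same fixed probability. The
function names are invented for the example.

```python
import math

def pulls_needed(p_corrupt, confidence):
    """Smallest n such that a device which corrupts data with probability
    p_corrupt per pull shows at least one corruption in n pulls with
    probability >= confidence, i.e. 1 - (1 - p)^n >= confidence."""
    if not (0 < p_corrupt < 1 and 0 < confidence < 1):
        raise ValueError("both arguments must lie strictly between 0 and 1")
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_corrupt))

def max_undetected_rate(clean_pulls, confidence):
    """Largest per-pull corruption probability that is still consistent,
    at the given confidence, with having seen only clean pulls.
    Solves (1 - p)^n = 1 - confidence for p."""
    if clean_pulls < 1 or not (0 < confidence < 1):
        raise ValueError("need at least one pull and 0 < confidence < 1")
    return 1 - (1 - confidence) ** (1 / clean_pulls)
```

Under those assumptions, 5 clean unplugs only rule out (at 95%
confidence) devices that corrupt data on more than roughly 45% of
pulls, and 20 clean unplugs bring that bound down to roughly 14%.
Catching a device that fails on 10% of pulls with 95% confidence takes
29 pulls.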

David Lang
--

From: Pavel Machek
Date: Saturday, August 29, 2009 - 2:33 pm

From: Greg Freemyer
Date: Monday, August 24, 2009 - 2:11 pm

I agree it should be documented, but the ext3 atomicity issue is only
an issue on unexpected shutdown while the array is degraded.  I surely
hope most people running raid5 are not seeing that level of unexpected
shutdown, let alone in a degraded array.

If they are, the atomicity issue pretty strongly says they should not
be using raid5 in that environment.  At least not for any filesystem I
know.  Having writes to LBA n corrupt LBA n+128 as an example is
pretty hard to design around from a fs perspective.
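
Why a write to one block can corrupt a different block on a degraded
array follows from the parity arithmetic. The sketch below is a toy
model: four-byte strings stand in for disk blocks, and the helper names
are invented. Disk 3 is already missing, so its block exists only
implicitly, as the XOR of the surviving data blocks and the parity.

```python
from functools import reduce

def xor_blocks(*blocks):
    """Byte-wise XOR of equal-length blocks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def read_missing_block(surviving_data, parity):
    """Degraded read: reconstruct the absent block from what is left."""
    return xor_blocks(parity, *surviving_data)

# One stripe of a 5-disk RAID 5; the disk holding block 3 has failed.
data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = xor_blocks(*data)
before = read_missing_block(data[:3], parity)

# The filesystem rewrites block 0 only. The array must update block 0
# *and* the parity, but power fails after the data write alone.
torn = [b"aaaa", data[1], data[2]]
after = read_missing_block(torn, parity)   # parity is still the old one
```

Before the crash the degraded read of block 3 returns b"DDDD". After
it, the same read returns b"dddd", although nothing ever wrote to block
3. The filesystem journal only covers the blocks it wrote itself, so it
cannot notice or repair this.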

Greg
--

From: Rob Landley
Date: Tuesday, August 25, 2009 - 1:56 pm

Right now, people think that a degraded raid 5 is equivalent to raid 0.  As 
this thread demonstrates, in the power failure case it's _worse_, due to write 
granularity being larger than the filesystem sector size.  (Just like flash.)

Knowing that, some people might choose to suspend writes to their raid until 
it's finished recovery.  Perhaps they'll set up a system where a degraded raid 
5 gets remounted read only until recovery completes, and then writes go to a 
new blank hot spare disk using all that volume snapshotting or unionfs stuff 
people have been working on.  (The big boys already have hot spare disks 
standing by on a lot of these systems, ready to power up and go without human 
intervention.  Needing two for actual reliability isn't that big a deal.)

Or maybe the raid guys might want to tweak the recovery logic so it's not 
entirely linear, but instead prioritizes dirty pages over clean ones.  So if 
somebody dirties a page halfway through a degraded raid 5, skip ahead to 
recover that chunk to the new disk first (yes leaving holes, it's not that 
hard to track), and _then_ let the write go through.

But unless people know the issue exists, they won't even start thinking about 

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: david
Date: Tuesday, August 25, 2009 - 2:08 pm

if you've got the drives available you should be running raid 6 not raid 5 
so that you have to lose two drives before you lose your error checking.

in my opinion that's a far better use of a drive than a hot spare.

David Lang
--

From: Rob Landley
Date: Tuesday, August 25, 2009 - 11:52 am

It's not quite that simple anymore.

These days, most modern drives add an "overcoat", which is a vapor deposition 
layer of carbon (I.E. diamond) on top of the magnetic media, and then add a 
nanolayer of some kind of nonmagnetic lubricant on top of that.  That protects 
the magnetic layer from physical contact with the head; it takes a pretty 
solid whack to chip through diamond and actually gouge your disk:

  http://www.datarecoverylink.com/understanding_magnetic_media.html

You can also do fun things with various nitrides (carbon nitride, silicon 
nitride, titanium nitride) which are pretty darn tough too, although I dunno 
about their suitability to hard drives:

  http://www.physical-vapor-deposition.com/

So while it _is_ possible to whack your drive and scratch the platter, merely 
"touching" won't do it.  (Laptops wouldn't be feasible if they couldn't cope 
with a little jostling while running.)  In the case of repeated small whacks, 
your heads may actually go first.  (I vaguely recall the little aerofoil wing 
thingy holding up the disk touches first, and can get ground down by repeated 
contact with the diamond layer (despite the lubricant, that just buys time) so 
it gets shorter and shorter and can't reliably keep the head above the disk 
rather than in contact with it.  But I'm kind of stale myself here, not sure 
that's still current.)

Here's a nice youtube video of a 2007 defcon talk from a hard drive recovery 
professional, "What's that Clicking Noise", series starts here:
  http://www.youtube.com/watch?v=vCapEFNZAJ0

And here's that guy's web page:
  http://www.myharddrivedied.com/presentations/index.html

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: Florian Weimer
Date: Tuesday, August 25, 2009 - 7:43 am

Hmm.  What does "not being able to handle failed writes" actually
mean?  AFAICS, there are two possible answers: "all bets are off", or

Right.  And a lot of database systems make the same assumption.
Oracle Berkeley DB cannot deal with partial page writes at all, and
PostgreSQL assumes that it's safe to flip a few bits in a sector
without proper WAL (it doesn't care if the changes actually hit the
disk, but the write shouldn't make the sector unreadable or put random


I think the general idea is to protect valuable data with WAL.  You
overwrite pages on disk only after you've made a backup copy into WAL.
After a power loss event, you replay the log and overwrite all garbage
that might be there.  For the WAL, you rely on checksum and sequence
numbers.  This still doesn't help against write failures where the
system continues running (because the fsync() during checkpointing
isn't guaranteed to report errors), but it should deal with the power
failure case.  But this assumes that the file system protects its own
data structure in a similar way.  Is this really too much to demand?
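
A compact sketch of the WAL idea described above: every record carries
a sequence number and a checksum, and replay after a power loss
re-applies intact records in order and stops at the first torn or
garbled one. The record layout and function names are assumptions made
for this example and do not reflect the actual log format of PostgreSQL
or Berkeley DB.

```python
import struct
import zlib

# Assumed record header: sequence number, page number, payload length, CRC32.
HEADER = struct.Struct("<QIII")

def _checksum(seq, page_no, payload):
    return zlib.crc32(struct.pack("<QI", seq, page_no) + payload)

def encode_record(seq, page_no, payload):
    """Serialize one log record holding a full copy of a page."""
    return HEADER.pack(seq, page_no, len(payload),
                       _checksum(seq, page_no, payload)) + payload

def replay(log, pages):
    """Overwrite pages from the log; return how many records were applied."""
    offset, expected_seq, applied = 0, None, 0
    while offset + HEADER.size <= len(log):
        seq, page_no, length, crc = HEADER.unpack_from(log, offset)
        start = offset + HEADER.size
        payload = log[start:start + length]
        if len(payload) != length:
            break                      # record was cut short by the crash
        if _checksum(seq, page_no, payload) != crc:
            break                      # garbage landed in the log
        if expected_seq is not None and seq != expected_seq:
            break                      # stale record from an older log cycle
        pages[page_no] = payload       # overwrite whatever is on "disk"
        expected_seq = seq + 1
        offset = start + length
        applied += 1
    return applied
```

Replay overwrites the on-disk pages unconditionally, so garbage left in
a data page by the power failure is replaced, provided the log record
for that page survived intact.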

Partial failures are extremely difficult to deal with because of their
asynchronous nature.  I've come to accept that, but it's still
disappointing.

-- 
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99
--

From: Theodore Tso
Date: Monday, August 24, 2009 - 6:50 am

So I got confused when I quoted your note, which I had assumed was
exactly what Pavel had written in his documentation.  In fact, what he
had written was this:

+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+....

So he had explicitly stated that he only cared about the whole sector
being written (or not written) in the power fail case, and not any
other.  I'd suggest changing ATOMIC-WRITES to
ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
the old data on a failed write", is also singularly misleading.

    	     	  	 	    	 	    - Ted
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 11:48 am

Ok, something like this?

Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during
powerfail.


									Pavel

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 11:39 am

Ok, I added "Not all filesystems require all of these
to be satisfied for safe operation" sentence there.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Greg Freemyer
Date: Monday, August 24, 2009 - 6:21 am

Can someone clarify if this is true in raid-6 with just a single disk
failure?  I don't see why it would be.

And if not can the above text be changed to reflect raid 4/5 with a
single disk failure and raid 6 with a double disk failure are the
modes that have atomicity problems.

Greg
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 11:44 am

I don't know enough about raid-6, but... I said "degraded mode" above,
and you can read it as double failure in raid-6 case ;-). I'd prefer
to avoid too many details here.

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Rob Landley
Date: Monday, August 24, 2009 - 2:11 pm

Acked-by: Rob Landley <rob@landley.net>


It's coming up on 2.6.31, has it learned anything since or should that version 

Possible rewording of this paragraph:

  Ext3 handles trash getting written into sectors during powerfail
  surprisingly well.  It's not foolproof, but it is resilient.  Incomplete
  journal entries are ignored, and journal replay of complete entries will
  often "repair" garbage written into the inode table.  The data=journal
  option extends this behavior to file and directory data blocks as well
  (without which your dentries can still be badly corrupted by a power fail
  during a write).

(I'm not entirely sure about that last bit, but clarifying it one way or the 
other would be nice because I can't tell from reading it which it is.  My 
_guess_ is that directories are just treated as files with an attitude and an 
extra caching layer...?)

Rob
-- 
Latency is more important than throughput. It's that simple. - Linus Torvalds
--

From: Pavel Machek
Date: Monday, August 24, 2009 - 2:33 pm

Thanks, applied, it looks better than what I wrote. I removed the ()
part, as I'm not sure about it...
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Jan Kara
Date: Tuesday, August 25, 2009 - 11:45 am

No, they did not. We were discussing how to be able to enable / disable
sending barriers, someone said he'd implement it but it somehow never got
beyond an initial attempt.
  Actually, after recent sync cleanups (and when my O_SYNC cleanups get
merged) it should be pretty easy because every filesystem now has ->fsync()
and ->sync_fs() callback so we just have to add sending barriers to these
two functions and implement possibility to set via sysfs that barriers on the
block device should be ignored.
  I've put it to my todo list but if someone else has time for this, I
certainly would not mind :). It would be a nice beginner project...

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
--

From: Pavel Machek
Date: Monday, March 16, 2009 - 5:30 am

Updated version here.



diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..710d119
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+	Unfortunately, none of the cheap USB/SD flash cards I've seen
+	do behave like this, and are thus unsuitable for all Linux
+	filesystems I know.
+
+		An inherent problem with using flash as a normal block
+		device is that the flash erase size is bigger than
+		most filesystem sector sizes.  So when you request a
+		write, it may erase and rewrite some 64k, 128k, or
+		even a couple megabytes on the really _big_ ones.
+
+		If you lose power in the middle of that, filesystem
+		won't notice that data in the "sectors" _around_ the
+		one your were trying to write to got trashed.
+
+	Because RAM tends to fail faster than rest of system during 
+	powerfail, special hw killing DMA transfers may be necessary;
+	otherwise, disks may write garbage during powerfail.
+	Not sure how common that problem is on generic PC machines.
+
+	Note that atomic write is very hard to guarantee for ...
From: Theodore Tso
Date: Monday, March 16, 2009 - 12:03 pm

Some of what is here are bugs, and some are legitimate long-term
interfaces (for example, the question of losing I/O errors when two
processes are writing to the same file, or to a directory entry, and
errors aren't or in some cases, can't, be reflected back to userspace).

I'm a little concerned that some of this reads a bit too much like a
rant (and I know Pavel was very frustrated when he tried to use a
flash card with a sucky flash card socket) and it will get used the
wrong way by apologists, because it mixes areas where "we suck, we
should do better", which are bug reports, and "Posix or the
underlying block device layer makes it hard", and simply states them
as fundamental design requirements, when that's probably not true.

There's a lot of work that we could do to make I/O errors get better
reflected to userspace by fsync().  So stating things as bald
requirements goes a little too far, IMHO.  We can surely do

The last half of this sentence "because success on fsync was already
returned when data hit the journal", obviously doesn't apply to all
filesystems, since some filesystems, like ext2, don't journal data.
Even for ext3, it only applies in the case of data=journal mode.  
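
For reference, the userspace side of the fsync() contract discussed
here can be sketched as follows. This is only an illustration: the
helper name durable_write is hypothetical, and the code can only
report errors that the kernel actually passes back to the caller.

```python
import os


def durable_write(path, data):
    """Write data to path and return only after fsync() has succeeded.

    Any error the kernel reports from write() or fsync() surfaces as
    OSError (for example with errno EIO). Errors that the kernel never
    reports back cannot be detected here.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):
            # write() may be short, so keep going until everything is queued
            offset += os.write(fd, data[offset:])
        os.fsync(fd)  # ask the kernel to push the data to stable storage
    finally:
        os.close(fd)
```

A caller that gets OSError from this helper knows the data may not be
on the media; a caller that gets no exception only knows that no error
was reported to it.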

There are other issues here, such as fsync() only reports an I/O
problem to one caller, and in some cases I/O errors aren't propagated
up the storage stack.  The latter is clearly just a bug that should be
fixed; the former is more of an interface limitation.  But you don't
talk about them in this section, and I think it would be good to have a
more extended discussion about I/O errors when writing data blocks,


The characteristic you describe here is not an issue about whether
the whole sector is either written or nothing happens to the data ---
but rather, or at least in addition to that, there is also the issue
that when there is a flash card failure --- particularly one caused
by a sucky flash card reader design causing the SD card to disconnect
from the laptop in the middle of a ...
From: Pavel Machek
Date: Monday, March 23, 2009 - 11:23 am

Well, I guess there's a thin line between error and "legitimate
long-term interfaces". I still believe that fsync() is broken by

It started as a rant; obviously I'd like to get away from that and get
it into a suitable format for inclusion. (Not being a native speaker does
not help here.)

But I do believe that we should get this documented; many common
storage subsystems are broken, and can cause data loss. We should at

Well, I guess that can be refined later. Heck, I'm not able to tell
which are simple bugs likely to be fixed soon, and which are
fundamental issues that are unlikely to be fixed sooner than 2030. I
guess it is fair to document them ASAP, and then fix those that can be

If the fsync() can be fixed... that would be great. But I'm not sure




...

Ok, added to ext3 specific section. New version is attached. Feel free
to help here; my goal is to get this documented, I'm not particularly
attached to wording etc...

Signed-off-by: Pavel Machek <pavel@ucw.cz>
									Pavel

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..0de456d
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,49 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+	Fortunately writes failing are very uncommon on traditional 
+	spinning disks, as they have spare sectors they use when write
+	fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write ...
From: Sitsofe Wheeler
Date: Monday, March 16, 2009 - 12:40 pm

When you say Linux filesystems do you mean "filesystems originally
designed on Linux" or do you mean "filesystems that Linux supports"?
Additionally whatever the answer, people are going to need help
answering the "which is the least bad?" question and saying what's not
good without offering alternatives is only half helpful... People need
to put SOMETHING on these cheap (and not quite so cheap) devices... The
last recommendation I heard was that until btrfs/logfs/nilfs arrive
people are best off sticking with FAT -
http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that

The document makes it sound like nearly everything bar battery backed
hardware RAIDed SCSI disks (with perfect firmware) is bad  - is this
the intent?

-- 
Sitsofe | http://sucs.org/~sits/
--

From: Rob Landley
Date: Monday, March 16, 2009 - 2:43 pm

Actually, the best filesystem for USB flash devices is probably UDF.  (Yes,
the DVD filesystem turns out to be writeable if you put it on writeable
media.  The ISO spec requires write support, so any OS that supports DVDs also
supports this.)

The reasons for this are:

A) It's the only filesystem other than FAT that's supported out of the box by
Windows, Mac, _and_ Linux for hotpluggable media.

B) It doesn't have the horrible limitations of FAT (such as a max filesize of 
2 gigabytes).

C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over 
patents on it.

However, when it comes to cutting the power on a mounted filesystem (either by 
yanking the device or powering off the machine) without losing your data, 
without warning, they all suck horribly.

If you yank a USB flash disk in the middle of a write, and the device has 
decided to wipe a 2 megabyte erase sector that's behind a layer of wear 
levelling and thus consists of a series of random sectors scattered all over 
the disk, you're screwed no matter what filesystem you use.  You know the 
vinyl "record scratch" sound?  Imagine that, on a digital level.  Bad Things 

SCSI disks?  They still make those?

Everything fails, it's just a question of how.  Rotational media combined with 
journaling at least fails in fairly understandable ways, so ext3 on sata is 
reasonable.

Flash gets into trouble when it presents the _interface_ of rotational media 
(a USB block device with normal 512 byte read/write sectors, which never wear 
out) which doesn't match what the hardware's actually doing (erase block sizes 
of up to several megabytes at a time, hidden behind a block remapping layer 
for wear leveling).
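
A minimal sketch of that mismatch, assuming a toy device that rewrites
a whole erase block in place for every 512-byte sector write. The class
NaiveFlash, the 64 KiB erase block size, and the power_fails_after
parameter are all made up for illustration, and wear levelling is not
modelled at all.

```python
SECTOR_SIZE = 512
SECTORS_PER_ERASE_BLOCK = 128   # assumed 64 KiB erase block, for illustration
ERASED = b"\xff" * SECTOR_SIZE


class NaiveFlash:
    """Toy device: every sector write erases and reprograms a whole erase block."""

    def __init__(self, erase_blocks):
        count = erase_blocks * SECTORS_PER_ERASE_BLOCK
        self.sectors = [b"\x00" * SECTOR_SIZE for _ in range(count)]

    def write_sector(self, lba, data, power_fails_after=None):
        assert len(data) == SECTOR_SIZE
        start = lba - lba % SECTORS_PER_ERASE_BLOCK
        end = start + SECTORS_PER_ERASE_BLOCK
        # 1. copy the whole erase block into controller RAM and patch one sector
        staged = self.sectors[start:end]
        staged[lba - start] = data
        # 2. erase the whole block on the media
        self.sectors[start:end] = [ERASED] * SECTORS_PER_ERASE_BLOCK
        # 3. program it back sector by sector; power may vanish part-way
        for offset, sector in enumerate(staged):
            if power_fails_after is not None and offset >= power_fails_after:
                return  # everything from this offset on stays erased
            self.sectors[start + offset] = sector


dev = NaiveFlash(erase_blocks=1)
dev.write_sector(100, b"B" * SECTOR_SIZE)                      # completes normally
dev.write_sector(5, b"C" * SECTOR_SIZE, power_fails_after=10)  # interrupted
```

After the interrupted write, sector 5 holds the new data, but sector
100, which the filesystem never asked to touch, reads back as erased.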

For devices that have built in flash that DON'T pretend to be a conventional 
block device, but instead expose their flash erase granularity and let the OS 
do the wear levelling itself, we have special flash filesystems that can be 
reasonably reliable.  It's just that ext3 isn't one of ...
From: Kyle Moffett
Date: Monday, March 16, 2009 - 9:55 pm

The really nice SSDs actually reserve ~15-30% of their internal
block-level storage and actually run their own log-structured virtual
disk in hardware.  From what I understand the Intel SSDs are that way.
Real-time garbage collection is tricky, but if you require (for
example) a max of ~80% utilization then you can provide good latency
and bandwidth guarantees.  There's usually something like a
log-structured virtual-to-physical sector map as well.  If designed
properly with automatic hardware checksumming, such a system can
actually provide atomic writes and barriers with virtually no impact
on performance.

With firmware-level hardware knowledge and the ability to perform
extremely efficient parallel reads of flash blocks, such a
log-structured virtual block device can be many times more efficient
than a general purpose OS running a log-structured filesystem.  The
result is that for an ordinary ext3-esque filesystem with 4k blocks
you can treat the SSD as though it is an atomic-write seek-less block
device.
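
As a rough sketch of how a checksummed, log-structured sector map can
give atomic writes: each write is appended as a record carrying a CRC,
and recovery replays the log only up to the first record that fails
its check. The record layout and the function names below are
assumptions for illustration, not a description of any real SSD
firmware.

```python
import struct
import zlib

# Assumed record layout: logical sector number, payload length, CRC32
HEADER = struct.Struct("<III")


def _checksum(lba, payload):
    # The CRC covers the sector number, the length, and the payload
    return zlib.crc32(struct.pack("<II", lba, len(payload)) + payload)


def append_record(log, lba, payload):
    """Append one sector write to the log (a bytearray)."""
    log.extend(HEADER.pack(lba, len(payload), _checksum(lba, payload)))
    log.extend(payload)


def replay(log):
    """Rebuild the logical-sector map, stopping at the first torn record."""
    sector_map = {}
    pos = 0
    while pos + HEADER.size <= len(log):
        lba, length, crc = HEADER.unpack_from(log, pos)
        body = pos + HEADER.size
        payload = bytes(log[body:body + length])
        if len(payload) != length or _checksum(lba, payload) != crc:
            break  # incomplete or corrupt record: the old mapping stays valid
        sector_map[lba] = payload
        pos = body + length
    return sector_map
```

If power is lost while a record is being appended, replay() simply
does not see that record, so the sector keeps its previous contents.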

Now if only I had the spare cash to go out and buy one of the shiny
Intel ones for my laptop... :-)

Cheers,
Kyle Moffett
--

From: Pavel Machek
Date: Monday, March 23, 2009 - 4:00 am

"Linux filesystems I know" :-). No filesystem that Linux supports,

In my opinion, people should just AVOID those devices. I don't plan

Battery backed RAID should be ok, as should a plain single SATA drive.
									Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

From: Robert Hancock
Date: Friday, August 28, 2009 - 6:33 pm

I've heard rumors of disks that claim to support cache flushes but 
really just ignore them, but have never heard any specifics of model 
numbers, etc. which are known to do this, so it may just be legend. If 
we do have such knowledge then we should really be blacklisting those 
drives and warning the user that we can't ensure data integrity. (Even 
powering down the system would be unsafe in this case.)
--

From: Alan Cox
Date: Saturday, August 29, 2009 - 6:04 am

This should not be the case for any vaguely modern drive. The standard
requires that the drive flush the cache when sent the command, and the
size of caches on modern drives rather requires it.

Alan
--

From: Greg Freemyer
Date: Monday, March 16, 2009 - 12:45 pm

On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote:

I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the
SSD with the old data in it.
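
The four steps above can be sketched as a toy model. The class
RemappingFlash and the power_fails_before_commit flag are hypothetical
names used only to illustrate the idea; real devices differ in many
details.

```python
class RemappingFlash:
    """Toy model of out-of-place updates with a logical-to-physical map."""

    def __init__(self, physical_blocks):
        self.physical = [None] * physical_blocks
        self.free = list(range(physical_blocks))
        self.mapping = {}  # logical block number -> physical block number

    def write(self, logical, data, power_fails_before_commit=False):
        new = self.free.pop()        # step 2: pick an unused erase block
        self.physical[new] = data    # step 3: write the updated data there
        if power_fails_before_commit:
            # The map was never updated, so the new block is unreferenced.
            # Returning it to the free list models reclaiming it at next start.
            self.free.append(new)
            return
        old = self.mapping.get(logical)
        self.mapping[logical] = new  # step 4: the single commit point
        if old is not None:
            self.free.append(old)    # the old block can now be reused

    def read(self, logical):
        return self.physical[self.mapping[logical]]
```

Because the mapping update is the only step that makes the new data
visible, an interruption before it leaves read() returning the old
contents.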

If it is not done that way, then I can see your issue.  (I love the
potential performance of SSDs, but I'm beginning to hate the
implementations and spec writing.)

Greg
-- 
Greg Freemyer
Head of EDD Tape Extraction and Processing team
Litigation Triage Solutions Specialist
http://www.linkedin.com/in/gregfreemyer
First 99 Days Litigation White Paper -
http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf

The Norcross Group
The Intersection of Evidence & Technology
http://www.norcrossgroup.com
--

From: Pavel Machek
Date: Monday, March 16, 2009 - 2:48 pm

The really expensive ones (Intel SSD) apparently work like that, but I
have never seen one of those. USB sticks and SD cards I tried behave like I
described above.
								Pavel
-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
