Not all block devices are suitable for all filesystems. In fact, some
block devices are so broken that reliable operation is pretty much
impossible. Document stuff ext2/ext3 needs for reliable operation.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..9c3d729
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,47 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly, because success
+on fsync was already returned when data hit the journal.
+
+        Fortunately writes failing are very uncommon on traditional
+        spinning disks, as they have spare sectors they use when write
+        fails.
+
+Sector writes are atomic (ATOMIC-SECTORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+        Unfortunately, none of the cheap USB/SD flash cards I've seen do
+        behave like this, and are unsuitable for all linux filesystems
+        I know.
+
+                An inherent problem with using flash as a normal block
+                device is that the flash erase size is bigger than
+                most filesystem sector sizes. So when you request a
+                write, it may erase and rewrite the next 64k, 128k, or
+                even a couple megabytes on the really _big_ ones.
+
+                If you lose power in the middle of that, filesystem
+                won't notice that data in the "sectors" _around_ the
+                one you were trying to write to got trashed.
+
+        Because RAM tends to fail faster than rest of system during
+        powerfail, special hw killing DMA ...
Hi, ^^^^ Shouldn't this be "Ext2"? All the best, Jochen -- http://seehuhn.de/ --
Thanks, fixed. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I vaguely recall that the behavior of when a write error _does_ occur is to remount the filesystem read only? (Is this VFS or per-fs?) Is there any kind of hotplug event associated with this?

I'm aware write errors shouldn't happen, and by the time they do it's too late

Somebody corrected me, it's not "the next" it's "the surrounding".

(Writes aren't always cleanly at the start of an erase block, so critical data

These days instead of "atomic" it's better to think in terms of "barriers". Requesting a flush blocks until all the data written _before_ that point has made it to disk. This wait may be arbitrarily long on a busy system with lots of disk transactions happening in parallel (perhaps because Firefox decided to garbage collect and is spending the next 30 seconds swapping itself back in to

And here we're talking about ext2. Does neither one know about write barriers, or does this just apply to ext2? (What about ext4?)

Also I remember a historical problem that not all disks honor write barriers, because actual data integrity makes for horrible benchmark numbers. Dunno how current that is with SATA, Alan Cox would probably know.

Rob --
This is not about barriers (that should be different topic). Atomic write means that either whole sector is written, or nothing at all is written. Because raid5 needs to update both master data and parity at

This document is about ext2. Ext3 can support barriers in

Sounds like broken disk, then. We should blacklist those.

Pavel --
Care to elaborate? (When a filesystem is mounted RO, I'm not sure what

Fun. When "please do not turn off your playstation until game save completes" honestly seems like the best solution for making the technology reliable, something is wrong with the technology.

Good point, but I thought that's what journaling was for?

I'm aware that any flash filesystem _must_ be journaled in order to work sanely, and must be able to view the underlying erase granularity down to the bare metal, through any remapping the hardware's doing. Possibly what's really needed is a "flash is weird" section, since flash filesystems can't be mounted on arbitrary block devices.

Although an "-O erase_size=128" option so they _could_ would be nice. There's "mtdram" which seems to be the only remaining use for ram disks, but why there isn't an "mtdwrap" that works with arbitrary underlying block devices, I have

It wasn't just one brand of disk cheating like that, and you'd have to ask him (or maybe Jens Axboe or somebody) whether the problem is still current. I've been off in embedded-land for a few years now...

Rob --
Ok, can you suggest a patch? I believe remount-ro is already

Well, fsync() error reporting does not really work properly, but I guess it will save you for the remount-ro case. So the data will be in

I believe journaling operates on assumption that "either whole sector

I don't think that works. Compactflash (etc) cards basically randomly remap the data, so you can't really run flash filesystem over compactflash/usb/SD card -- you don't know the details of remapping.

Pavel --
Actually raid5 should have no problem with a power failure during
normal operations of the raid. The parity block should get marked out
of sync, then the new data block should be written, then the new
The real problem comes in degraded mode. In that case the data block
(if present) and parity block must be written at the same time
atomically. If the system crashes after writing one but before writing
the other then the data block on the missing drive changes its
contents. And for example with a chunk size of 1MB and 16 disks that
could be 15MB away from the block you actually do change. And you can
not recover that after a crash as you need both the original and
changed contents of the block.
So writing one sector has the risk of corrupting another (for the FS)
totally unconnected sector. No amount of journaling will help
there. The raid5 would need to do journaling or use battery backed
cache.
MfG
Goswin
--
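The degraded-mode failure described above can be made concrete with a toy model. This is only a sketch under stated assumptions: a single stripe with three data chunks and one XOR parity chunk, four-byte chunks, and helper names (xor_blocks, degraded_write_hole) that are invented for illustration and have nothing to do with the MD code.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte strings together (RAID-5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def degraded_write_hole():
    """Return (original, reconstructed) contents of the chunk on the dead disk."""
    # Hypothetical stripe: three data chunks plus parity.
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks([d0, d1, d2])

    # Disk 2 dies. Its chunk now exists only implicitly, as d0 ^ d1 ^ parity.
    assert xor_blocks([d0, d1, parity]) == d2

    # The filesystem rewrites d0. The data write lands, but the system
    # crashes before the matching parity write reaches the disk.
    new_d0 = b"ZZZZ"
    stale_parity = parity

    # After reboot the missing chunk is computed from new data and stale
    # parity, so a block nobody wrote to has changed its contents.
    reconstructed_d2 = xor_blocks([new_d0, d1, stale_parity])
    return d2, reconstructed_d2


if __name__ == "__main__":
    original, rebuilt = degraded_write_hole()
    print("stored on dead disk:", original, "reconstructed:", rebuilt)
```

The filesystem only ever asked for d0 to change; the damage shows up in d2, which is why no journal above the array can repair it.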
Thanks, I updated my notes. Pavel --
Running journaling filesystem such as ext3 over flashdisk or degraded
RAID array is a bad idea: journaling guarantees no longer apply and
you will get data corruption on powerfail.

We can't solve it easily, but we should certainly warn the users. I
actually lost data because I did not understand these limitations...

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..80fa886
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,52 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems can't handle that correctly.
+
+        Fortunately writes failing are very uncommon on traditional
+        spinning disks, as they have spare sectors they use when write
+        fails.
+
+Don't cause collateral damage to adjacent sectors on a failed write (NO-COLLATERALS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Unfortunately, cheap USB/SD flash cards I've seen do have this bug,
+and are thus unsuitable for all filesystems I know.
+
+        An inherent problem with using flash as a normal block device
+        is that the flash erase size is bigger than most filesystem
+        sector sizes. So when you request a write, it may erase and
+        rewrite some 64k, 128k, or even a couple megabytes on the
+        really _big_ ones.
+
+        If you lose power in the middle of that, filesystem won't
+        notice that data in the "sectors" _around_ the one you were
+        trying to write to got trashed.
+
+        RAID-4/5/6 in degraded mode has same problem.
+
+
+Don't damage the old data on a ...
You should make clear that the file lists per-file-system rules and Isn't this by design? In other words, if the metadata doesn't survive non-atomic writes, wouldn't it be an ext3 bug? -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 --
The only one that falls into that category is the one about not being able to handle failed writes, and the way most failures take place, they generally fail the ATOMIC-WRITES criterion in any case. That is, when a write fails, an attempt to read from that sector will generally result in either (a) an error, or (b) data other than what was there

Part of the problem here is that "atomic-writes" is confusing; it doesn't mean what many people think it means. The assumption which many naive filesystem designers make is that writes succeed or they don't. If they don't succeed, they don't change the previously existing data in any way. So in the case of journalling, the assumption which gets made is that when the power fails, the disk either writes a particular disk block, or it doesn't.

The problem here is as with humans and animals, death is not an event, it is a process. When the power fails, the system doesn't just stop functioning; the power on the +5 and +12 volt rails starts dropping to zero, and different components fail at different times. Specifically, DRAM, being the most voltage sensitive, tends to fail before the DMA subsystem, the PCI bus, and the hard drive fail. So as a result, garbage can get written out to disk as part of the failure. That's just the way hardware works.

Now consider a file system which does logical journalling. It has written to the journal, using a compact encoding, "the i_blocks field is now 25, and i_size is 13000", and the journal transaction has committed. So now, it's time to update the inode on disk; but at that precise moment, the power fails, and garbage is written to the inode table. Oops! The entire sector containing the inode is trashed. But the only thing which is recorded in the journal is the new value of i_blocks and i_size. So a journal replay won't help file systems that do logical block journalling.

Is that a file system "bug"? Well, it's better to call that a mismatch between the assumptions made of ...
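The difference between logical and physical journal replay in the inode example can be sketched in a few lines. This is a toy model, not how any real journal is encoded: the "sector" is a dict, the fields other than i_blocks and i_size are invented, and replay_logical / replay_physical are hypothetical helper names.

```python
def replay_logical(sector, journal):
    """Apply field-level updates on top of whatever the sector holds now."""
    repaired = dict(sector)
    repaired.update(journal)
    return repaired


def replay_physical(sector, journal_image):
    """Overwrite the sector with the full block image kept in the journal."""
    return dict(journal_image)


def demo():
    """Return (expected, after_logical_replay, after_physical_replay)."""
    # Toy inode sector; i_mode and i_uid values are made up for illustration.
    good = {"i_mode": 0o100644, "i_uid": 1000, "i_blocks": 24, "i_size": 12000}
    update = {"i_blocks": 25, "i_size": 13000}
    expected = {**good, **update}

    # Power fails during the in-place write: every field is garbage.
    trashed = {"i_mode": 0xDEAD, "i_uid": 0xBEEF, "i_blocks": 0, "i_size": 0}

    logical = replay_logical(trashed, update)      # only two fields restored
    physical = replay_physical(trashed, expected)  # whole sector restored
    return expected, logical, physical


if __name__ == "__main__":
    expected, logical, physical = demo()
    print("logical replay ok: ", logical == expected)
    print("physical replay ok:", physical == expected)
```

In this model the logical replay restores i_blocks and i_size but leaves the other fields trashed, while the physical replay rewrites the whole sector.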
Hi Theodore, thanks for the insightful writing.

On 08/24/2009 04:01 PM, Theodore Tso wrote:

There is a thing called eMMC (embedded MMC) in the embedded world. You may consider it as a non-removable MMC. This thing is a block device from the Linux POV, and you may mount ext3 on top of it. And people do this. The device seems to have a decent FTL, and does not look bad.

However, there are subtle things which mortals never think about. In case of eMMC, power cuts may make some sectors unreadable: eMMC returns ECC errors on reads. Namely, the sectors which were being written at the very moment when the power cut happened may become unreadable. And this makes ext3 refuse mounting the file-system, and makes fsck.ext3 refuse the file-system. This should be fixable in SW, but we did not find time to do this so far.

Anyway, my point is that documenting subtle things like this is a very good thing to do, just because nowadays we are trying to use existing software with flash-based storage devices, which may violate these subtle assumptions, or introduce other ones.

Probably, Pavel did too good a job in generalizing things, and it could be better to make a doc about HDD vs SSD or HDD vs Flash-based-storage. Not sure. But the idea to document subtle FS assumptions is good, IMO.

-- Best Regards, Artem Bityutskiy (Артём Битюцкий) --
The standard procedure for this seems to be to cc: Jonathan Corbet on the discussion, make puppy eyes at him, and subscribe to Linux Weekly News. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
Yep, and at that point you lost data. You had "silent data corruption" from fs point of view, and that's bad. It will be probably very bad on XFS, probably okay on Ext3, and certainly okay on Ext2: you do filesystem check, and you should be able to repair any damage. So yes, physical journaling is good, but

If those filesystem assumptions were not documented, I'd call it

Actually, ext2 should be able to survive that, no? Error writing ->

Well... there's very big difference between harddrives and flash

There's a difference. In case of cosmic rays, hardware is clearly buggy. I have one machine with bad DRAM (about 1 error in 2 days), and I still use it. I will not complain if ext3 trashes that. In case of degraded raid-5, even with perfect hardware, and with ext3 on top of that, you'll get silent data corruption. Nice, eh? Clearly, Linux is buggy there. It could be argued it is raid-5's

Well well well. Before I pulled that flash card, I assumed that doing so is safe, because flashcard is presented as block device and ext3 should cope with sudden disk disconnects. And I was wrong wrong wrong. (No one told me at the university. I guess I should want my money back).

Plus note that it is not only my trashy laptop and one trashy MMC card; every USB thumb drive I've seen is affected. (OTOH USB disks should be safe AFAICT).

Ext3 is unsuitable for flash cards and RAID arrays, plain and simple. It is not documented anywhere :-(. [ext2 should work better --

Can you suggest better patch? I'm not saying we should redesign ext3,

I hold ext2/ext3 to higher standards than other filesystem in tree. I'd not use XFS/VFAT etc.

I would not want people to migrate towards XFS/VFAT, and yes I believe XFSs/VFATs/... requirements should be documented, too. (But I know too little about those filesystems).

If you can suggest better wording, please help me. But... those requirements are non-trivial, commonly not met and the result is data loss. It has to be documented ...
I don't see why you think that. In general, fsck (for any fs) only checks metadata. If you have silent data corruption that corrupts things that are fixable by fsck, you most likely have silent corruption hitting things users care about like their data blocks inside of files. Fsck will not fix (or notice) any of that, that is where things like full data checksums can help. Also note (from first hand experience), unless you check and validate your data, you can have data corruptions that will not get flagged as IO I think that we need to help people understand the full spectrum of data concerns, starting with reasonable best practices that will help most people suffer *less* (not no) data loss. And make very sure that they are not falsely assured that by following any specific script that they can skip backups, remote backups, etc :-) Nothing in our code in any part of the kernel deals well with every I think that the example and the response are both off base. If your head ever touches the platter, you won't be reading from a huge part of your drive ever again (usually, you have 2 heads per platter, 3-4 platters, impact would kill one head and a corresponding percentage of your data). No file system will recover that data although you might be able to scrape out some remaining useful bits and bytes. More common causes of silent corruption would be bad DRAM in things like the drive write cache, hot spots (that cause adjacent track data errors), etc. Note in this last case, your most recently written data It is hard for anyone to see the real data without looking in detail at large numbers of parts. Back at EMC, we looked at failures for lots of parts so we got a clear grasp on trends. I do agree that flash/SSD parts are still very young so we will have interesting and unexpected Nothing is perfect. It is still a trade off between storage utilization (how much storage we give users for say 5 2TB drives), performance and I think that ...
Ok, but in case of data corruption, at least your filesystem does not

I can reproduce data loss with ext3 on flashcard in about 40 seconds. I'd not call that "odd event". It would be nice to handle that, but that is hard. So ... can we at least get that documented

_Maybe_ SSDs, being HDD replacements are better. I don't know. _All_ flash cards (MMC, USB, SD) had the problems. You don't need to get clear grasp on trends. Those cards just don't meet ext3

"Nothing is perfect"?! That's design decision/problem in raid5/ext3. I believe that should be at least documented. (And understand why ZFS is

And I still use my zaurus with crappy DRAM. I would not trust raid5 array with my data, for multiple reasons. The fact that degraded raid5 breaks ext3 assumptions should

The papers show failures in "once a year" range. I have "twice a minute" failure scenario with flashdisks. Not sure how often "degraded raid5 breaks ext3 atomicity" would bite, but I bet it would be on "once a day" scale. We should document those.

Pavel --
Even worse, your data is potentially gone and you have not noticed it... This is why array vendors and archival storage products do periodic scans of all stored data (read all the bytes, compared to a

Part of documenting best practices is to put down very specific things that do/don't work. What I worry about is producing too much detail to be of use for real end users.

I have to admit that I have not paid enough attention to the specifics of your ext3 + flash card issue - is it the ftl stuff doing out of order

Your statement is overly broad - ext3 on a commercial RAID array that does RAID5 or RAID6, etc has no issues that I know of.

Again, you say RAID5 without enough specifics. Are you pointing just at

Documentation is fine with sufficient, hard data....

ric --
Well, I was trying to write for kernel audience. Someone can turn that

The problem is that flash cards destroy whole erase block on unplug,

Pull them hot. [Some people try -osync to avoid data loss on flash cards... that will

If your commercial RAID array is battery backed, maybe. But I was

Degraded MD RAID5 on anything, including SATA, and including

Degraded MD RAID5 does not work by design; whole stripe will be damaged on powerfail or reset or kernel bug, and ext3 can not cope with that kind of damage. [I don't see why statistics should be necessary for that; the same way we don't need statistics to see that ext2 needs fsck after powerfail.]

Pavel --
Kernel people who don't do storage or file systems will still need a summary - making very specific proposals based on real data and analysis

Even if you unmount the file system? Why isn't this an issue with ext2?

Sounds like you want to suggest very specifically that journalled file systems are not appropriate for low end flash cards (which seems quite

Pulling hot any device will cause data loss for recent data loss, even

Many people in the real world who use RAID5 (for better or worse) use

Degraded is one faulted drive while MD is doing a rebuild? And then you hot unplug it or power cycle? I think that would certainly cause failure

What you are describing is a double failure and RAID5 is not double failure tolerant regardless of the file system type....

I don't want to be overly negative since getting good documentation is certainly very useful. We just need to document things correctly based on real data.

Ric --
No, I'm talking hot unplug here. It is the issue with ext2, but ext2

Right. But in ext3 case you basically lose whole filesystem, because

You get single disk failure then powerfail (or reset or kernel panic). I would not call that double failure. I agree that it will mean problems for most filesystems. Anyway, even if that can be called a double failure, this limitation should be clearly documented somewhere.

...and that's exactly what I'm trying to fix.

Pavel --
Are you sure he isn't talking about how RAID must write all the data chunks to make a complete stripe and if there is a power-loss, some of the chunks may be written and some may not?

As I read Pavel's point he is saying that the incomplete write can be detected by the incorrect parity chunk, but degraded RAID-5 has no working parity chunk so the incomplete write would go undetected.

I know this is a RAID failure mode. However, I actually thought this was a problem even for an intact RAID-5. AFAIK, RAID-5 does not generally read the complete stripe and perform verification unless that is requested, because doing so would hurt performance and lose the entire point of the RAID-5 rotating parity blocks.

-- Zan Lynx zlynx@acm.org "Knowledge is Power. Power Corrupts. Study Hard. Be Evil." --
Not sure; is not RAID expected to verify the array after unclean shutdown? Pavel --
Not usually - that would take multiple hours of verification, roughly equivalent to doing a RAID rebuild since you have to read each sector of every drive (although you would do this at full speed if the array was offline, not throttled like we do with rebuilds). That is part of the thing that scrubbing can do. Note that once you find a bad bit of data, it is really useful to be able to map that back into a humanly understandable object/repair action. For example, map the bad data range back to metadata which would translate into a fsck run or a list of impacted files or directories.... Ric --
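A scrub pass of the kind mentioned above, plus the mapping from a bad range back to something a human can act on, might look roughly like this. It is a sketch only: the array is modeled as a list of (data chunks, stored parity) pairs, and extent_map is a hypothetical table from file name to the stripe numbers the file occupies; a real implementation would have to get that from the filesystem.

```python
from functools import reduce


def parity_of(chunks):
    """XOR parity over equal-length data chunks."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*chunks))


def scrub(stripes):
    """Return indices of stripes whose stored parity does not match the data."""
    return [
        index
        for index, (chunks, stored_parity) in enumerate(stripes)
        if parity_of(chunks) != stored_parity
    ]


def affected_files(bad_stripes, extent_map):
    """Map bad stripe numbers to file names via a (hypothetical) extent map."""
    bad = set(bad_stripes)
    return sorted(name for name, used in extent_map.items() if bad & set(used))


if __name__ == "__main__":
    stripes = [
        ([b"ab", b"cd"], parity_of([b"ab", b"cd"])),  # consistent
        ([b"ef", b"gh"], b"\x00\x00"),                # parity never updated
    ]
    bad = scrub(stripes)
    print("inconsistent stripes:", bad)
    print("files to check:", affected_files(bad, {"a.txt": {0}, "b.txt": {1}}))
```

Note that a parity mismatch only says the stripe is inconsistent; it does not say which chunk is the wrong one.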
a write to raid 5 doesn't need to write to all drives, but it does need to write to two drives (the drive you are modifying and the parity drive)

if you are not degraded and only succeed on one write you will detect the corruption later when you try to verify the data.

if you are degraded and only succeed on one write, then the entire stripe gets corrupted.

but this is a double failure (one drive + unclean shutdown)

if you have battery-backed cache you will finish the writes when you reboot.

if you don't have battery-backed cache (or are using software raid and crashed in the middle of sending the writes to the drive) you lose, but unless you disable write buffers and do sync writes (which nobody is going to do because of the performance problems) you will lose data in an unclean shutdown anyway. --
Sure --- but name **any** filesystem that can deal with the fact that 128k or 256k worth of data might disappear when you pull out the flash

It's not just high end RAID arrays that have battery backups; I happen to use a mid-range hardware RAID card that comes with a battery backup. It's just a matter of choosing your hardware carefully.

If your concern is that with Linux MD, you could potentially lose an entire stripe in RAID 5 mode, then you should say that explicitly; but again, this isn't a filesystem specific claim; it's true for all filesystems. I don't know of any file system that can survive having a RAID stripe-shaped-hole blown into the middle of it due to a power failure.

I'll note, BTW, that AIX uses a journal to protect against these sorts of problems with software raid; this also means that with AIX, you also don't have to rebuild a RAID 1 device after an unclean shutdown, like you have to do with Linux MD. This was on the EVMS's team development list to implement for Linux, but it got canned after LVM won out, lo those many years ago. C'est la vie; but it's a problem which is solvable at the RAID layer, and which is traditionally and historically solved in competent RAID implementations.

- Ted --
First... I consider myself quite competent in the os level, yet I did not realize what flash does and what that means for data integrity. That means we need some documentation, or maybe we should refuse to mount those devices r/w or something.

Then to answer your question... ext2. You expect to run fsck after unclean shutdown, and you expect to have to solve some problems with it. So the way ext2 deals with the flash media actually matches what the user expects. (*)

OTOH in ext3 case you expect consistent filesystem after unplug; and

Again, ext2 handles that in a way user expects it.

At least I was taught "ext2 needs fsck after powerfail; ext3 can

Yep, we should add journal to RAID; or at least write "Linux MD *needs* an UPS" in big and bold letters. I'm trying to do the second part.

(Attached is current version of the patch).

[If you'd prefer patch saying that MMC/USB flash/Linux MD arrays are generally unsafe to use without UPS/reliable connection/no kernel bugs... then I may try to push that. I was not sure... maybe some filesystem _can_ handle this kind of issues?]

Pavel

(*) Ok, now... user expects to run fsck, but very advanced users may not expect old data to be damaged. Certainly I was not advanced enough user few months ago.

diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt
new file mode 100644
index 0000000..d1ef4d0
--- /dev/null
+++ b/Documentation/filesystems/expectations.txt
@@ -0,0 +1,57 @@
+Linux block-backed filesystems can only work correctly when several
+conditions are met in the block layer and below (disks, flash
+cards). Some of them are obvious ("data on media should not change
+randomly"), some are less so. Not all filesystems require all of these
+to be satisfied for safe operation.
+
+Write errors not allowed (NO-WRITE-ERRORS)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Writes to media never fail. Even if disk returns error condition
+during write, filesystems ...
the problem is that people have been preaching that journaling filesystems eliminate all data loss for no cost (or at worst for minimal cost).

they don't, they never did.

they address one specific problem (metadata inconsistency), but they do not address data loss, and never did (and for the most part the filesystem developers never claimed to)

depending on how much data gets lost, you may or may not be able to recover enough to continue to use the filesystem, and when your block device takes actions in larger chunks than the filesystem asked it to, it's very possible for seemingly unrelated data to be lost as well.

this is true for every single filesystem, nothing special about ext3

people somehow have the expectation that ext3 does the data equivalent of solving world hunger, it doesn't, it never did, and it never claimed to. bashing it because it doesn't isn't fair. bashing XFS because it doesn't also isn't fair.

personally I don't consider the two filesystems to be significantly different in terms of the data loss potential. I think people are more aware of the potentials with XFS than with ext3, but I believe that the risk of loss is really about the same (and pretty much for the same

you were taught wrong. the people making these claims for ext3 didn't understand what ext3 does and doesn't do. --
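What "the block device takes actions in larger chunks than the filesystem asked it to" means for flash can be illustrated with a deliberately naive model. Everything here is an assumption made for the sketch: a controller that erases a whole erase block and then reprograms it from a RAM copy, a toy erase block of four 512-byte sectors (the thread mentions 64k up to megabytes), and the helper names write_sector and make_flash. Real FTLs remap and are more complicated.

```python
SECTOR = 512
ERASE_BLOCK = 4 * SECTOR  # toy size, chosen only to keep the example small


def make_flash(sectors=8):
    """Build a flash image where sector i is filled with byte value i."""
    return bytearray(b"".join(bytes([i]) * SECTOR for i in range(sectors)))


def write_sector(flash, sector_no, data, power_fails=False):
    """Naive controller model: erase the whole erase block, then reprogram it."""
    if len(data) != SECTOR:
        raise ValueError("data must be exactly one sector")
    start = (sector_no * SECTOR // ERASE_BLOCK) * ERASE_BLOCK
    end = start + ERASE_BLOCK

    staged = bytearray(flash[start:end])      # copy held in controller RAM
    flash[start:end] = b"\xff" * ERASE_BLOCK  # erase step

    if power_fails:
        return  # RAM copy is lost; neighbouring sectors stay erased

    offset = sector_no * SECTOR - start
    staged[offset:offset + SECTOR] = data
    flash[start:end] = staged                 # reprogram step


if __name__ == "__main__":
    flash = make_flash()
    write_sector(flash, 5, b"\xbb" * SECTOR, power_fails=True)
    # Sectors 4, 6 and 7 were never written by the filesystem.
    for n in (4, 6, 7):
        print("sector", n, "first byte:", hex(flash[n * SECTOR]))
```

In this model the filesystem asked for one sector to change, and on power failure it loses three neighbours it never touched.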
Well, in case of flashcard and degraded MD Raid5, ext3 does _not_ address metadata inconsistency problem. And that's why I'm trying to fix the documentation.

Current ext3 documentation says:

#Journaling Block Device layer
#-----------------------------
#The Journaling Block Device layer (JBD) isn't ext3 specific. It was designed
#to add journaling capabilities to a block device. The ext3 filesystem code
#will inform the JBD of modifications it is performing (called a transaction).
#The journal supports the transactions start and stop, and in case of a crash,
#the journal can replay the transactions to quickly put the partition back into
#a consistent state.

There's no mention that this does not work on flash cards and degraded

Cool. So... can we fix the documentation?

Pavel --
So, would you be happy if ext3 fsck was always run on reboot (at least for flash devices)? --
For flash devices, MD Raid 5 and anything else that needs it; yes that would make me happy ;-). Pavel --
the thing is that fsck would not fix the problem. it may (if the data lost was metadata) detect the problem and tell you how many files you have lost, but if the data lost was all in a data file you would not detect it with a fsck the only way you would detect the missing data is to read all the files on the filesystem and detect that the data you are reading is wrong. but how can you tell if the data you are reading is wrong? on a flash drive, your read can return garbage, but how do you know that garbage isn't the contents of the file? on a degraded raid5 array you have no way to test data integrity, so when the missing drive is replaced, the rebuild algorithm will calculate the appropriate data to make the parity calculations work out and write garbage to that drive. David Lang --
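The point that a metadata-only check cannot see damaged file contents, while a full read against stored checksums can, is easy to show with a toy. This is a sketch under loud assumptions: the "filesystem" is a dict, the only metadata is a recorded size, a SHA-256 of the contents is stored at write time (which ext3 does not do), and store / metadata_check / data_scrub are invented names.

```python
import hashlib


def checksum(data):
    """Content checksum recorded at write time (an assumption of this toy)."""
    return hashlib.sha256(data).hexdigest()


def store(files):
    """Build a toy filesystem: name -> recorded size, data, recorded checksum."""
    return {
        name: {"size": len(data), "data": data, "sum": checksum(data)}
        for name, data in files.items()
    }


def metadata_check(fs):
    """fsck-like pass: looks only at metadata (here, the recorded size)."""
    return [name for name, f in fs.items() if f["size"] != len(f["data"])]


def data_scrub(fs):
    """Read every byte and compare against the checksum stored at write time."""
    return [name for name, f in fs.items() if checksum(f["data"]) != f["sum"]]


if __name__ == "__main__":
    fs = store({"a": b"hello", "b": b"world"})
    fs["b"]["data"] = b"w0rld"  # silent corruption, same length
    print("metadata check flags:", metadata_check(fs))
    print("data scrub flags:    ", data_scrub(fs))
```

Without a checksum (or some other independent copy) recorded before the damage, the garbage read back is indistinguishable from valid file contents, which is the situation described above.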
Sorry, but that just shows your naivete. Metadata takes up such a small part of the disk that fscking it and finding it to be OK is absolutely no guarantee that the data on the filesystem has not been horribly mangled. Personally, what I care about is my data. The metadata is just a way to get to my data, while the data is actually important. -- All rights reversed. --
Personally, I care about metadata consistency, and ext3 documentation suggests that journal protects its integrity. Except that it does not on broken storage devices, and you still need to run fsck there. How do you protect your data is another question, but ext3 documentation does not claim journal to protect them, so that's up to the user I guess. Pavel --
as the ext3 authors have stated many times over the years, you still need to run fsck periodically anyway. what the journal gives you is a reasonable chance of skipping it when the system crashes and you want to get it back up ASAP. --
Where is that documented? I very much agree with that, but when suse10 switched periodic fsck off, I could not find any docs to show that it is bad idea. Pavel --
Probably from some 6-8 years ago, in e-mail postings that I made. My argument has always been that PC-class hardware is crap, and it's a Really Good Idea to periodically check the metadata because corruption there can end up causing massive data loss. The main problem is that doing it at reboot time really hurt system availability, and "after 20 reboots (plus or minus)" resulted in fsck checks at wildly varying intervals depending on how often people reboot.

What I've been recommending for some time is that people use LVM, and run fsck on a snapshot every week or two, at some convenient time when the system load is at a minimum. There is an e2croncheck script in the e2fsprogs sources, in the contrib directory; it's short enough that I'll attach it here.

Is it *necessary*? In a world where hardware is perfect, no. In a world where people don't bother buying ECC memory because it's 10% more expensive, and PC builders use the cheapest possible parts --- I think it's a really good idea.

- Ted

P.S. Patches so that this shell script takes a config file, and/or parses /etc/fstab to automatically figure out which filesystems should be checked, are greatly appreciated. Getting distro's to start including this in their e2fsprogs packaging scripts would also be greatly appreciated.

#!/bin/sh
#
# e2croncheck -- run e2fsck automatically out of /etc/cron.weekly
#
# This script is intended to be run by the system administrator
# periodically from the command line, or to be run once a week
# or so by the cron daemon to check a mounted filesystem (normally
# the root filesystem, but it could be used to check other filesystems
# that are always mounted when the system is booted).
#
# Make sure you customize "VG" so it is your LVM volume group name,
# "VOLUME" so it is the name of the filesystem's logical volume,
# and "EMAIL" to be your e-mail address
#
# Written by Theodore Ts'o, Copyright 2007, 2008, 2009.
#
# This file may be redistributed under the terms ...
Aside ... can we default mkfs.ext3 to not set a mandatory fsck interval then? :) --
Well, in SUSE11-or-so, the distro stopped periodic fscks, silently :-(. I believed that it was a really bad idea at that point, but because I could not find a piece of documentation recommending them, I lost the argument. Pavel --
It's an engineering trade-off. If you have perfect memory that never has cosmic-ray hiccups, and hard drives that never write data to the wrong place, etc. then you don't need periodic fsck's. If you do have imperfect hardware, the question then is how imperfect your hardware is, and how frequently it introduces errors. If you check too frequently, though, users get upset, especially when it happens at the most inconvenient time (when you're trying to recover from unscheduled downtime by rebooting); if you check too infrequently then it doesn't help you too much since too much data gets damaged before fsck notices. So these days, what I strongly recommend is that people use LVM snapshots, and schedule weekly checks during some low usage period (e.g., 3am on Saturdays), using something like the e2croncheck shell script. - Ted --
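The snapshot-then-check sequence recommended above can be sketched roughly as follows. This is a dry run in Python that only builds the command lines and never executes them; the volume group, volume name, snapshot size and "-snap" suffix are hypothetical placeholders, and the real e2croncheck script may well use different options.

```python
from typing import List


def snapshot_check_commands(vg: str, volume: str,
                            snap_size: str = "100m") -> List[List[str]]:
    """Build (but do not run) the commands for one snapshot fsck pass.

    vg, volume and snap_size are illustrative values chosen by the caller.
    """
    snap = f"{volume}-snap"          # hypothetical snapshot name
    snap_dev = f"/dev/{vg}/{snap}"
    return [
        # 1. take an LVM snapshot of the mounted volume
        ["lvcreate", "-s", "-L", snap_size, "-n", snap, f"{vg}/{volume}"],
        # 2. force a read-only check of the snapshot, not the live volume
        ["e2fsck", "-f", "-n", snap_dev],
        # 3. drop the snapshot again
        ["lvremove", "-f", f"{vg}/{snap}"],
    ]


if __name__ == "__main__":
    for cmd in snapshot_check_commands("vg0", "root"):
        print(" ".join(cmd))
```

Run from cron at a quiet time, a wrapper around this would execute the three commands in order and mail the admin if the e2fsck step reports problems.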
There was another script written to do this that handled e2fsck, reiserfsck and xfs_check, detecting all volume groups automatically, along with e.g. validating that the snapshot volume doesn't exist before starting the check (which may indicate that the previous e2fsck is still running), and not running while on battery power. The last version was in the thread "forced fsck (again?)" dated 2008-01-28. Would it be better to use that one? In that thread we discussed not clobbering the last checked time as e2croncheck does, so the admin can see how long it was since the filesystem was last checked. Maybe it makes more sense to get the lvcheck script included into the util-linux-ng or lvm2 packages, and have it added automatically to the cron.weekly directory? Then the distros could disable the at-boot checking safely, while still being able to detect corruption caused by cables/RAM/drives/software. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. --
That's not where fs documentation belongs :-(. Pavel --
Caring about metadata consistency and not data is just weird, I'm sorry. I can't imagine anyone who actually *cares* about what they have stored, whether it's digital photographs of a child taking a first step, or their thesis research, caring more about the metadata than the data. Giving advice that pretends that most users have that priority is Just Wrong. That's why what we should document is that people should avoid broken storage devices, along with advice on how to use RAID properly. At the end of the day, getting people to switch from ext2 to ext3 on some misguided notion that this way, they'll know when their metadata is safe (at least in the power failure case; but not the system hangs and you have to reboot case), and getting them to ignore the question of why are they using a broken storage device in the first place, is Documentation malpractice. - Ted --
I thought the reason for that was that if your metadata is horked, further writes to the disk can trash unrelated existing data because it's lost track of what's allocated and what isn't. So back when the assumption was "what's written stays written", then keeping the metadata sane was still darn important to prevent normal operation from overwriting unrelated existing data. Then Pavel notified us of a situation where interrupted writes to the disk can trash unrelated existing data _anyway_, because the flash block size on the 16 gig flash key I bought retail at Fry's is 2 megabytes, and the filesystem thinks it's 4k or smaller. It seems like what _broke_ was the assumption that the filesystem block size >= the disk block size, and nobody noticed for a while. (Except the people making jffs2 and friends, anyway.) Today we have cheap plentiful USB keys that act like hard drives, except that their write block size isn't remotely the same as hard drives', but they pretend it is, and then the block wear levelling algorithms fuzz things further. (Gee, a drive controller lying about drive geometry, the scsi crowd should feel right at home.) Now Pavel's coming back with a second situation where RAID stripes (under certain circumstances) seem to have similar granularity issues, again breaking what seems to be the same assumption. Big media use big chunks for data, and media is getting bigger. It doesn't seem like this problem is going to diminish in future. I agree that it seems like a good idea to have BIG RED WARNING SIGNS about those kind of media and how _any_ journaling filesystem doesn't really help here. So specifically documenting "These kinds of media lose unrelated random data if writes to them are interrupted, journaling filesystems can't help with this and may actually hide the problem, and even an fsck will only find corrupted metadata not lost file contents" seems kind of useful. That said, ext3's assumption that filesystem block size ...
actually, you don't know if your USB key works that way or not. Pavel has some that do, that doesn't mean that all flash drives do.

when you do a write to a flash drive you have to do the following items

1. allocate an empty eraseblock to put the data on
2. read the old eraseblock
3. merge the incoming write to the eraseblock
4. write the updated data to the flash
5. update the flash translation layer to point reads at the new location instead of the old location.

now if the flash drive does things in this order you will not lose any previously written data. if the flash drive does step 5 before it does step 4, then you have a window where a crash can lose data (and no, btrfs won't survive any better if a large chunk of data just disappears)

it's possible that some super-cheap flash drives skip having a flash translation layer entirely, on those the process would be

1. read the old data into ram
2. merge the new write into the data in ram
3. erase the old data
4. write the new data

this obviously has a significant data loss window. but if the device doesn't have a flash translation layer, then repeated writes to any one sector will kill the drive fairly quickly. (updates to the FAT would kill the sectors the FAT, journal, root directory, or superblock lives in due to the fact that every change to the disk requires

I think an update to the documentation is a good thing (especially after learning that a raid 6 array that has lost a single disk can still be corrupted during a powerfail situation), but I also agree that Pavel's

I thought that that assumption was in the VFS layer, not in any particular filesystem

David Lang
--
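To make the ordering argument above concrete, here is a toy Python model of the five-step write. It is only a sketch under simplifying assumptions: one logical eraseblock, two physical ones, and a power cut injected between the last two steps. The class and function names are made up for illustration and do not describe any real device firmware.

```python
class PowerLoss(Exception):
    """Simulates power being cut part-way through a write."""


class ToyFlash:
    """One logical eraseblock backed by two physical ones (toy model)."""

    def __init__(self, sectors=4):
        self.blocks = [["old"] * sectors, [None] * sectors]  # [in use, spare]
        self.mapping = 0  # translation layer: which physical block reads hit

    def read(self):
        return list(self.blocks[self.mapping])

    def write_sector(self, index, value, map_first=False, crash=False):
        spare = 1 - self.mapping          # 1. allocate an empty eraseblock
        merged = self.read()              # 2. read the old eraseblock
        merged[index] = value             # 3. merge the incoming write
        if map_first:
            self.mapping = spare          # step 5 done before step 4
            if crash:
                raise PowerLoss()
            self.blocks[spare] = merged   # 4. write the merged data
        else:
            self.blocks[spare] = merged   # 4. write the merged data
            if crash:
                raise PowerLoss()
            self.mapping = spare          # 5. repoint reads at the new block


def read_after_crash(map_first):
    """Cut power between the last two steps and return what a reader sees."""
    flash = ToyFlash()
    try:
        flash.write_sector(2, "new", map_first=map_first, crash=True)
    except PowerLoss:
        pass
    return flash.read()
```

With the steps in the listed order, `read_after_crash(False)` still returns the four old sectors: the interrupted write is lost, but nothing else is. With step 5 moved ahead of step 4, `read_after_crash(True)` returns the contents of the never-written spare block, so the neighbouring sectors are gone too.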
Pretty much all the ones that present a USB disk interface to the outside world and thus have to do hardware levelling. Here's Valerie Aurora on the topic:

That's what something like jffs2 will do, sure. (And note that mounting those suckers is slow while it reads the whole disk to figure out what order to put the chunks in.) However, your average consumer level device A) isn't very smart, B) is judged almost entirely by price/capacity ratio and thus usually won't even hide capacity for bad block remapping. You expect them to have significant hidden

I've never seen one that presented a USB disk interface that _didn't_ do this. (Not that this observation means much.)

Neither the Windows nor the Macintosh world is calling for this yet. Even the Linux guys barely know about it. And these are the same kinds of manufacturers that NOPed out the flush commands to

Yup. It's got enough of one to get past the warranty, but beyond that they're

The VFS layer cares about how to talk to the backing store? I thought that was the filesystem driver's job...

Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
I am not saying that all devices get this right (not by any means), but I _am_ saying that devices with wear-leveling _can_ avoid this problem entirely

you do not need to do a log-structured filesystem. all you need to do is to always write to a new block rather than re-writing a block in place. even if the disk only does a 12-block rotation for its wear leveling, that is enough for it to not lose other data when you write. to lose data you have to be updating a block in place by erasing the old one first. _anything_ that writes the data to a new location before it erases the old location will prevent you from losing other data.

I'm all for documenting that this problem can and does exist, but I'm not in agreement with documentation that states that _all_ flash drives have this problem because (with wear-leveling in a flash translation layer on the device) it's not inherent to the technology. so even if all existing flash devices had this problem, there could be one released tomorrow that didn't.

this is like the problem that flash SSDs had last year that could cause them to stall for up to a second on write-heavy workloads. it went from a problem that almost every drive for sale had (and something that was generally accepted as being a characteristic of SSDs), to being extinct in about one product cycle after the problem was identified. I think this problem will also disappear rapidly once it's publicised.

so what's needed is for someone to come up with a way to test this, let people test the various devices, find out how broad the problem is, and publicise the results.

personally, I expect that the better disk-replacements will not have a problem with this. I would also be surprised if the larger thumb drives had this problem. if a flash eraseblock can be used 100k times, then if you use FAT on a 16G drive and write 1M files and update the FAT after each file (like you would with a camera), the block the FAT is on will die after ...
That would need two erases per single sector written, no? Erase is in the millisecond range, so the performance would be just way too bad :-(. Pavel --
no, it only needs one erase

if you don't have a pool of pre-erased blocks, then you need to do an erase of the new block you are allocating (before step 4)

if you do have a pool of pre-erased blocks, then you don't have to do any erase of the data blocks until after step 5 and you do the erase when you add the old data block to the pool of pre-erased blocks later.

in either case the requirements of wear leveling require that the flash translation layer update its records to show that an additional write took place.

what appears to be happening on some cheap devices is that they do the following instead

1. allocate an empty eraseblock to put the data on
2. read the old eraseblock
3. merge the incoming write to the eraseblock
4. erase the old eraseblock
5. write the updated data to the flash

I don't know where in (or after) this process they update the wear-leveling/flash translation layer info.

with this algorithm, if the device loses power between step 4 and step 5 you lose all the data on the eraseblock.

with deferred erasing of blocks, the safer algorithm is actually the faster one (up until you run out of your pool of available eraseblocks, at which time it slows down to the same speed as the unreliable one).

most flash drives are fairly slow to write to in any case. even the Intel X25M drives are in the same ballpark as rotating media for writes. as far as I know only the X25E SSD drives are faster to write to than rotating media, and most of them are _far_ slower.

David Lang
--
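The point about deferred erasing can be illustrated with a very small Python model. It counts, for each write, how many erases have to finish before the write can complete. The assumptions are deliberately crude and are mine, not the thread's: the pool of pre-erased blocks is never refilled in the background, and every erase costs the same.

```python
def erases_on_write_path(num_writes, pool_size):
    """Erases that must complete before each write, with a pre-erased pool.

    Toy assumption: the pool is never replenished between writes.
    """
    pool = pool_size
    counts = []
    for _ in range(num_writes):
        if pool > 0:
            pool -= 1         # grab a pre-erased block; old block erased later
            counts.append(0)
        else:
            counts.append(1)  # pool exhausted: erase first, then write
    return counts


def erases_erase_first(num_writes):
    """The erase-then-rewrite scheme always pays one erase per write."""
    return [1] * num_writes
```

For example, `erases_on_write_path(5, 3)` gives `[0, 0, 0, 1, 1]`, while `erases_erase_first(5)` gives `[1, 1, 1, 1, 1]`: the safer scheme is no slower, and is faster until the pool runs dry.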
Hence wanting documentation properly explaining the situation, yes. Often the people writing the documentation aren't the people who know the most about the situation, but the people who found out they NEED said documentation, and post errors until they get sufficient corrections. In which case "you're wrong, it's actually _this_" is helpful, and "you're

Are you saying ext3 should default to data=journal then? It seems that the default journaling only handles the metadata, and people seem to think that journaled filesystems exist for a reason. There seems to be a lot of "the guarantees you think a journal provides aren't worth anything, so the fact there are circumstances under which it doesn't provide them isn't worth telling anybody about" in this thread. So we shouldn't bother with journaled filesystems? I'm not sure what the intended argument is here... I have no clue what the finished documentation on this issue should look like either. But I want to read it.

Rob --
But if the 256k hole is in data blocks, fsck won't find a problem, even with ext2. And if the 256k hole is the inode table, you will *still* suffer massive data loss. Fsck will tell you how badly screwed you are, but it doesn't "fix" the disk; most users don't consider questions of the form "directory entry <precious-thesis-data> points to trashed inode, You don't get a consistent filesystem with ext2, either. And if your claim is that several hundred lines of fsck output detailing the filesystem's destruction somehow makes things all better, I suspect most users would disagree with you. In any case, depending on where the flash was writing at the time of the unplug, the data corruption could be silent anyway. Maybe this came as a surprise to you, but anyone who has used a compact flash in a digital camera knows that you ***have*** to wait until the led has gone out before trying to eject the flash card. I remember seeing all sorts of horror stories from professional photographers about how they lost an important wedding's day worth of pictures with the attendant commercial loss, on various digital photography forums. It tends to be the sort of mistake that digital photographers only make once. (It's worse with people using Digital SLR's shooting in raw mode, since it can take upwards of 30 seconds or more to write out a 12-30MB raw image, and if you eject at the wrong time, you can trash the contents of the entire CF card; in the worst case, the Flash Translation Layer data can get corrupted, and the card is completely ruined; you can't even reformat it at the filesystem level, but have to get a special Windows program from the CF manufacturer to --maybe-- reset the FTL layer. Early CF cards were especially vulnerable to this; more recent CF cards are better, but it's a known failure mode of CF cards.) - Ted --
Well it will fix the disk in the end. And no, "directory entry <precious-thesis-data> points to trashed inode, may I delete directory entry?" is not _terribly_ helpful, but it is slightly helpful and

It actually comes as a surprise to me. Actually yes and no. I know that digital cameras use VFAT, so pulling the CF card out of one may do bad things, unless I run fsck.vfat afterwards. If a digital camera was using ext3, I'd expect the card to be safely pullable at any time. Would an IBM Microdrive make any difference there? Anyway, it was not known to me. Rather than claiming "everyone knows" (when clearly very few people really understand all the details), can we simply document that? Pavel --
I really think the expectation is that all OSes (Windows, Mac, even your iPod) teach you not to hot unplug a device with any file system. Users have an "eject" or "safe unload" in Windows, your iPod tells you not to power off or disconnect, etc. I don't object to making that general statement - "Don't hot unplug a device with an active file system or actively used raw device" - but would object to the overly general statement about ext3 not working on flash, RAID5 not working, etc... ric --
On Tue, 25 Aug 2009 09:37:12 -0400 The overall general statement for all media and all OS's should be "Do you have a backup, have you tested it recently" --
It might be nice to know when you _needed_ said backup, and when you shouldn't re-backup bad data over it, because your data corruption actually got detected before then. And maybe a pony. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
You can object any way you want, but running ext3 on flash or MD RAID5 is stupid:

* ext2 would be faster
* ext2 would provide better protection against powerfail.

"ext3 works on flash and MD RAID5, as long as you do not have powerfail" seems to be the accurate statement, and if you don't need to protect against powerfails, you can just use ext2.

Pavel
--
Not true - that is true today with or without journals as we have discussed in great detail. Including specifically ext2. Basically, any file system (Linux, windows, OSX, etc) that writes into the page Not true in the slightest, you continue to ignore the ext2/3/4 developers Strange how your personal preference is totally out of sync with the entire enterprise class user base. ric --
No, not ext3 on SATA disk with barriers on and proper use of fsync(). I actually tested that. Yes, I should be able to hotunplug SATA drives and expect the data

I know I will lose data. Both ext2 and ext3 will lose data on flashdisk. (That's what I'm trying to document). But... what is the benefit of ext3 journaling on MD RAID5? (On flash, ext3 at least protects you against kernel panic. MD RAID5 is in software, so... that

Perhaps no one told them MD RAID5 is dangerous? You see, that's exactly what I'm trying to document here. Pavel --
the block device can lose data, it has absolutely nothing to do with the

an MD raid array that's degraded to the point where there is no redundancy is dangerous, but I don't think that any of the enterprise users would be surprised. I think they will be surprised that it's possible that a prior failed write that hasn't been scrubbed can cause data loss when the array later degrades. David Lang --
Cool, so Ted's "raid5 has highly undesirable properties" is actually pretty accurate. Some raid person should write a more detailed README, I'd say... Pavel --
You can and will lose data (even after fsync) with any type of storage at some rate. What you are missing here is that data loss needs to be measured in hard numbers - say percentage of installed boxes that have config X that lose data. Strangely enough, this is what high end storage companies do for a living: configure, deploy and then measure results.

A long-winded way of saying that just because you can induce data failure by recreating an event that happens almost never (power loss while rebuilding a RAID5 group specifically) does not mean that this makes RAID5 with ext3 unreliable. What does happen all of the time is single bad sector IO's and (less often, but more than your scenario) complete drive failures. In both cases, MD RAID5 will repair that damage before a second failure (including a power failure) happens 99.99% of the time.

I can promise you that hot unplugging and replugging a S-ATA drive will also lose you data if you are actively writing to it (ext2, 3, whatever). Your micro data loss benchmark is not a valid reflection of the wider experience and I fear that you will cause people to lose more data, not less,

Faster recovery time on any normal kernel crash or power outage. Data loss

Using MD RAID5 will save more people from commonly occurring errors (sector and disk failures) than will lose data because of your "rebuild interrupted by a power failure" worry. What you are trying to do is to document a belief you have that is not borne out by real data across actual user boxes running real work loads. Unfortunately, getting that data is hard work and one of the things that we as a community do especially poorly. All of the data (secret data from my past and published data by NetApp, Google, etc) that I have seen would directly contradict your assertions and you will cause harm to our users with this.

Ric
--
I'm talking "by design" here. I will lose data even on a SATA drive that is properly powered on if I

I can promise you that running a S-ATA drive will also lose you data, even if you are not actively writing to it. Just wait 10 years; so what is your point? But ext3 is _designed_ to preserve fsynced data on a SATA drive, while it is _not_ designed to preserve fsynced data on MD RAID5.

No, because you'll actually repair the ext2 with fsck after the kernel crash or power outage. Data loss will not be equivalent; in particular you'll not lose data written _after_ power outage to ext2. Pavel --
You are dead wrong. For RAID5 arrays, you assume that you have a hard failure and a power outage before you can rebuild the RAID (order of hours at full tilt). The failure rate of S-ATA drives is at the rate of a few percent of the installed base in a year. Some drives will fail faster than that (bad parts, bad environmental conditions, etc). Why don't you hold all of your most precious data on that single S-ATA drive for five years on one box and put a second copy on a small RAID5 with ext3 for the same period? Repeat experiment until you get up to something like Google scale or the other papers on failures in national labs in the US and then we can have an informed

I lost a S-ATA drive 24 hours after installing it in a new box. If I had MD RAID5, I would not have lost any. My point is that you fail to take into account the rate of failures of a given

As Ted (who wrote fsck for ext*) said, you will lose data in both. Your argument is not based on fact. You need to actually prove your point, not just state it as fact. ric --
me too, in fact just after I copied data from a raid array to it so that I could rebuild the raid array differently :-( David Lang --
I'm not interested in discussing statistics with you. I'd rather discuss fsync() and storage design issues. ext3 is designed to work on single SATA disks, and it is not designed to work on flash cards/degraded MD RAID5s, as Ted acknowledged. Because that fact is non-obvious to the users, I'd like to see it documented, and now have a nice short writeup from Ted. If you want to argue that the ext3/MD RAID5/no UPS combination is still less likely to fail than a single SATA disk given part failure probabilities, go ahead and present nice statistics. It's just that I'm not interested in them. Pavel --
That is a proven fact and a well published one. If you choose to ignore published work (and common sense) that RAID makes you lose data less than non-RAID, why should anyone care what you write? Ric --
http://lkml.org/lkml/2009/8/25/312 Pavel --
I will let Ted clarify his text on his own, but the quoted text says "... have potential...". Why not ask Neil if he designed MD to not work properly with ext3? Ric --
So let me clarify by saying the following things.

1) Filesystems are designed to expect that storage devices have certain properties. These include returning the same data that you wrote, and that an error when writing a sector, or a power failure when writing a sector, should not be amplified to cause collateral damage to previously successfully written sectors.

2) Degraded RAID 5/6 filesystems do not meet these properties. Neither do cheap flash drives. This increases the chances you can lose, bigtime.

3) Does that mean that you shouldn't use ext3 on RAID drives? Of course not! First of all, Ext3 still saves you against kernel panics and hangs caused by device driver bugs or other kernel hangs. You will lose less data, and avoid needing to run a long and painful fsck after a forced reboot, compared to if you used ext2. You are making an assumption that the only time running the journal takes place is after a power failure. But if the system hangs, and you need to hit the Big Red Switch, or if you are using the system in a Linux High Availability setup and the ethernet card fails, so the STONITH ("shoot the other node in the head") system forces a hard reset of the system, or you get a kernel panic which forces a reboot, in all of these cases ext3 will save you from a long fsck, and it will do so safely.

Secondly, what's the probability of a failure that causes the RAID array to become degraded, followed by a power failure, versus a power failure while the RAID array is not running in degraded mode? Hopefully you are running with the RAID array in full, proper running order a much larger percentage of the time than running with the RAID array in degraded mode. If not, the bug is with the system administrator!

If you are someone who tends to run for long periods of time in degraded mode --- then better get a UPS. And certainly if you want to avoid the chances of failure, periodically scrubbing the disks so you detect hard drive failures early, instead of ...
Actually... ext3 + MD RAID5 will still have a problem on kernel panic. MD RAID5 is implemented in software, so if the kernel panics, you can still get inconsistent data in your array. I mostly agree with the rest. Pavel --
Only if the MD RAID array is running in degraded mode (and again, if
the system is in this state for a long time, the bug is in the system
administrator). And even then, it depends on how the kernel dies. If
the system hangs due to some deadlock, or we get an OOPS that kills a
process while still holding some locks, and that leads to a deadlock,
it's likely the low-level MD driver can still complete the stripe
write, and no data will be lost. If the kernel ties itself in knots
due to running out of memory, and the OOM handler is invoked, someone
hitting the reset button to force a reboot will also be fine.
If the RAID array is degraded, and we get an oops in an interrupt
handler, such that the system is immediately halted --- then yes, data
could get lost. But there are many system crashes where the software
RAID's ability to complete a stripe write would not be compromised.
- Ted
--
Just to add some real world data, Bianca Schroeder published a really good paper that looks at failures in national labs which has actual measured disk failures: http://www.cs.cmu.edu/~bianca/fast07.pdf Her numbers showed various rates of failures, but depending on the box, drive type, etc, they lost between 1-6% of the installed drives each year. There is also a good paper from Google: http://labs.google.com/papers/disk_failures.html Both of the above are largely Linux boxes. And several other FAST papers on failures in commercial RAID boxes, most notably by NetApp. If reading papers is not at the top of your list of things to do, just skim through and look for the tables on disk failures, etc. which have great measurements of what really failed in these systems... Ric --
I agree with the whole write up outside of the above - degraded RAID does meet this requirement unless you have a second (or third, counting the split write) failure during the rebuild. Note that the window of exposure during a RAID rebuild is linear with the size of your disk and how much you detune the rebuild... --
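The "linear with the size of your disk" remark amounts to simple arithmetic, sketched below in Python. The disk size and rebuild rates in the usage note are hypothetical round numbers picked for illustration, not measurements from any real array.

```python
def rebuild_hours(disk_bytes, rebuild_bytes_per_sec):
    """Rough window of exposure: time to resync one whole member disk.

    Assumes a constant rebuild rate, which real arrays under load
    will not have.
    """
    return disk_bytes / rebuild_bytes_per_sec / 3600.0


if __name__ == "__main__":
    one_tb = 1e12                      # hypothetical 1 TB member disk
    for rate in (50e6, 10e6):          # hypothetical full-speed vs throttled
        print(f"{rate / 1e6:.0f} MB/s -> {rebuild_hours(one_tb, rate):.1f} h")
```

Under these made-up numbers a 1 TB disk takes about 5.6 hours at 50 MB/s and about 27.8 hours when the rebuild is throttled to 10 MB/s; doubling the disk size doubles both figures.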
The argument is that if the degraded RAID array is running in this
state for a long time, and the power fails while the software RAID is
in the middle of writing out a stripe, such that the stripe isn't
completely written out, we could lose all of the data in that stripe.
In other words, a power failure in the middle of writing out a stripe
in a degraded RAID array counts as a second failure.
To me, this isn't a particularly interesting or newsworthy point,
since a competent system administrator who cares about his data and/or
his hardware will (a) have a UPS, and (b) be running with a hot spare
and/or will immediately replace a failed drive in a RAID array.
- Ted
--
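The "power failure counts as a second failure" argument can be shown with plain XOR parity. The Python sketch below assumes a toy three-data-disk stripe with made-up four-byte blocks; it is not how the MD driver is implemented, only the arithmetic behind the failure mode.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks (RAID5-style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def torn_degraded_write():
    """Return (original d2, d2 as rebuilt after an interrupted stripe update)."""
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"   # illustrative data blocks
    parity = xor_blocks(d0, d1, d2)

    # The disk holding d2 fails. While parity is consistent, d2 can
    # still be rebuilt from the surviving blocks.
    assert xor_blocks(d0, d1, parity) == d2

    # Now d0 is updated. The new d0 reaches its disk, but power fails
    # before the matching parity update is written.
    d0 = b"ZZZZ"

    # After reboot, d2 is rebuilt from new d0, d1 and stale parity.
    rebuilt_d2 = xor_blocks(d0, d1, parity)
    return d2, rebuilt_d2


if __name__ == "__main__":
    original, rebuilt = torn_degraded_write()
    print(original, rebuilt, original == rebuilt)
```

The rebuilt block no longer matches the original even though nothing ever wrote to it, which is the collateral damage to previously written data being discussed in this thread.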
I agree that this is not an interesting (or likely) scenario, certainly when compared to the much more frequent failures that RAID will protect against which is why I object to the document as Pavel suggested. It will steer people away from using RAID and directly increase their chances of losing their data if they use just a single disk. Ric --
So instead of fixing or at least documenting a known software deficiency in the Linux MD stack, you'll try to suppress that information so that people use more raid5 setups? Perhaps the better documentation will push them to RAID1, or maybe make them buy a UPS? Pavel --
people aren't objecting to better documentation, they are objecting to misleading documentation.

for flash drives the danger is very straightforward (although even then you have to note that it depends heavily on the firmware of the device, some will lose lots of data, some won't lose any)

a good thing to do here would be for someone to devise a test to show this problem, and then gather the results of lots of people performing this test to see what the commonalities are.

you are generalizing that since you have lost data on flash drives, all flash drives are dangerous. what if it turns out that only one manufacturer is doing things wrong? you will have discouraged people from using flash drives for no reason. (potentially causing them to lose data because they are scared away from using flash drives and don't implement anything better)

to be safe, all that a flash drive needs to do is to not change the FTL pointers until the data has fully been recorded in its new location. this is probably a trivial firmware change.

for raid arrays, we are still learning the nuances of what actually can happen. take the comment that Rik made a few hours ago, when he pointed out that with raid 5 you won't trash the entire stripe (which is what I thought happened from prior comments), but instead run the risk of losing two relatively definable chunks of data

1. the block you are writing (which you can lose anyway)
2. the block that would live on the disk that is missing.

that drastically lessens the impact of the problem

I would like to see someone explain what would happen on raid 6, and I think that the possibilities that Neil talked about where he said that it was possible to try the various combinations and see which ones agree with each other would be a good thing to implement if he can do so.

but the super simplified statement you keep trying to make is significantly overstating and oversimplifying the problem.

David Lang
--
Actually Ric is. He's trying hard to make RAID5 look better than it

Do the flash manufacturers claim they do not cause collateral damage during powerfail? If not, they probably are dangerous. Anyway, you wanted a test, and one is attached. It normally takes like

Offer better docs? You are right that it does not lose the whole stripe, it merely loses a random block on the same stripe, but the result for a journaling filesystem is similar. Pavel
I object to the misleading and dangerous documentation that you have proposed. I spend a lot of time working in data integrity, talking and writing about it, so I care deeply that we don't misinform people.

Several times in this thread I put out a draft that is accurate, and you have failed to respond to it.

The big picture that you don't agree with is:

(1) RAID (specifically MD RAID) will dramatically improve data integrity for real users. This is not a statement of opinion, this is a statement of fact that has been shown to be true in large scale deployments with commodity hardware.

(2) RAID5 protects you against a single failure and your test case purposely injects a double failure.

(3) How to configure MD reliably should be documented in MD documentation, not in each possible FS or raw device application

(4) Data loss occurs in non-journalling file systems and journalling file systems when you suffer double failures or hot unplug storage, especially inexpensive FLASH parts.

ric
--
Most people would be surprised that press of reset button is 'failure' It does not happen on inexpensive DISK parts, so people do not expect that and it is worth pointing out. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Pavel, you have no information and an attitude of not wanting to listen to anyone who has real experience or facts. Not just me, but also Ted and others. Totally pointless to reply to you further. Ric --
For the record, I've been able to follow Pavel's arguments, and I've been able to follow Ted's arguments. But as far as I can tell, you're arguing about a different topic than the rest of us. There's a difference between: A) This filesystem was corrupted because the underlying hardware is permanently damaged, no longer functioning as it did when it was new, and never will again. B) We had a transient glitch that ate the filesystem. The underlying hardware is as good as new, but our data is gone. You can argue about whether or not "new" was ever any good, but Linux has run on PC-class hardware from day 1. Sure PC-class hardware remains crap in many different ways, but this is not a _new_ problem. Refusing to work around what people actually _have_ and insisting we get a better class of user instead _is_ a new problem, kind of a disturbing one. USB keys are the modern successor to floppy drives, and even now Documentation/blockdev/floppy.txt is still full of some of the torturous workarounds implemented for that over the past 2 decades. The hardware existed, and instead of turning up their nose at it they made it work as best they could. Perhaps what's needed for the flash thing is a userspace package, the way mdutils made floppies a lot more usable than the kernel managed at the time. For the flash problem perhaps some FUSE thing a bit like mtdblock might be nice, a translation layer remapping an arbitrary underlying block device into larger granularity chunks and being sure to do the "write the new one before you erase the old one" trick that so many hardware-only flash devices _don't_, and then maybe even use Pavel's crash tool to figure out the write granularity of various sticks and ship it with a whitelist people can email updates to so we don't have to guess large. (Pressure on the USB vendors to give us a "raw view" extension bypassing the "pretend to be a hard drive, with remapping" hardware in future devices would be nice too, ...
no other OS avoids this problem either. I actually don't see how you can do this from userspace, because when you write to the device you have _no_ idea where on the device your data will actually land. writing in larger chunks may or may not help, (if you do a 128K write, and the device is emulating 512b blocks on top of 128K eraseblocks, depending on the current state of the flash translation layer, you could end up writing to many different eraseblocks, up to the theoretical max of 256) David Lang --
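The "theoretical max of 256" above is simple arithmetic, sketched here for concreteness. The function name is made up, and the best case assumes an aligned write whose sectors all stay contiguous; the number of eraseblocks on the device is ignored.

```python
def eraseblocks_touched(write_bytes, sector_bytes, eraseblock_bytes):
    """Return (best, worst) count of eraseblocks one write may touch."""
    sectors = write_bytes // sector_bytes
    # Best case: aligned write, sectors stay contiguous on flash.
    best = -(-write_bytes // eraseblock_bytes)   # ceiling division
    # Worst case: the translation layer sends every emulated sector
    # to a different eraseblock.
    worst = sectors
    return best, worst


# 128K write, 512-byte emulated sectors, 128K eraseblocks
print(eraseblocks_touched(128 * 1024, 512, 128 * 1024))
```

Since userspace cannot see the current mapping, it has no way to know where between those two bounds a given write will fall.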
It certainly is not easy. Self-correcting codes could probably be used, but that would be very special, very slow, and very non-standard. (Basically... we could design filesystem so that it would survive damage of arbitrarily 512K on disk -- using self-correcting codes in CD-like manner). I'm not sure if it would be practical. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I had no trouble following what Ric was arguing about.

Ric never said "use only the best devices and you won't have problems".

Ric was arguing the exact opposite - ALL devices are crap if you define crap as "can lose data". What he is saying is you need to UNDERSTAND your devices and their behavior and you must act accordingly.

PAVEL DID NOT ACT ACCORDING TO HIS DEVICE LIMITATIONS.

We understand he was clueless, but user error is still user error!

And Ric said do not stigmatize whole classes of A) devices, B) raid,

We have been trying forever to deal with device problems and as Ric kept trying to explain we do understand them. The problem is not "can we be better" it is "at what cost". As they keep saying "fast", "cheap", "safe"... pick any 2. Adding software solutions to solve it will always turn "fast" to "slow".

Most people will choose some risk they can manage (such as

Saw it. I am not an MD guy so I will not say anything bad about it except all the "journal" crud. It really is only pandering to Pavel because ALL filesystems can be screwed and that is what they really need to know. The journal stuff distracts those who are not running a journaling filesystem, even if your description is correct except that as we fs people keep saying, fsck is meaningless and again will only give you a false sense of security that your data is OK.

jim
--
And if you include meteor strike and flooding in your operating criteria you can come up with quite a straw man argument. It still doesn't mean "X is I think he understands he was clueless too, that's why he investigated the I don't care what "Pavel says", so you can leave the ad hominem at the door, thanks. The kernel presents abstractions, such as block device nodes. Sometimes implementation details bubble through those abstractions. Presumably, we agree on that so far. I was once asked to write what became Documentation/rbtree.txt, which got merged. I've also read maybe half of Documentation/RCU. Neither technique is specific to Linux, but this doesn't seem to have been an objection at the time. The technique, "journaling", is widely perceived as eliminating the need for fsck (and thus the potential for filesystem corruption) in the case of unclean shutdowns. But there are easily reproducible cases where the technique, "journaling", does not do this. Thus journaling, as a concept, has limitations which are _not_ widely understood by the majority of people who purchase and use USB flash keys. The kernel doesn't currently have any documentation on journaling theory where mention of journaling's limitations could go. It does have a section on its internal Journaling API in Documentation/DocBook/filesystems.tmpl which links to two papers (both about ext3, even though reiserfs was merged first and IBM's JFS was implemented before either) from 1998 and 2000 respectively. The 2000 paper brushes against disk granularity answering a question starting at 72m, 21s, and brushes against software raid and write ordering starting at the 72m 32s mark. But it never directly addresses either issue... Sigh, I'm well into tl;dr territory here, aren't I? Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
See, this is exactly the problem we have with all the proposed documentation. The reader (you) did not get what the writer (me) was trying to say. That does not say either of us was wrong in what we thought was meant, simply that we did not communicate. What I meant was we did not want to accept Pavel's incorrect We don't have any problem with documenting abstractions. But they must be written as abstracts and accurate, not as IMO blogs. It is not "he means well, so we will just accept it". The rule for kernel docs should be the same as for code. If it is not correct in all cases or causes problems, we don't accept it. jim --
That's why I've mostly stopped bothering with this thread. I could respond to Ric Wheeler's latest (what does write barriers have to do with whether or not a multi-sector stripe is guaranteed to be atomically updated during a panic or power failure?) but there's just no point. The LWN article on the topic is out, and incomplete as it is I expect it's the best documentation anybody will actually _read_. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
The point of that post was that the failure that you and Pavel both attribute to RAID and journalled fs happens whenever the storage cannot promise to do atomic writes of a logical FS block (prevent torn pages/split writes/etc). I gave a specific example of why this happens even with simple, single disk systems. Further, if you have the write cache enabled on your local S-ATA/SAS drives and do not have working barriers (as is the case with MD RAID5/6), you have a hard promise of data loss on power outage and these split writes are not going to be the cause of your issues. You can verify this by testing. Or, try to find people that do storage --
ext3 does not expect atomic write of 4K block, according to Ted. So Would anyone (probably privately?) share the lwn link? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I am not sure what you mean by "expect." ext3 (and other file systems) certainly expect that acknowledged writes will still be there after a crash. With your disk write cache on (and no working barriers or non-volatile write cache), this will always require a repair via fsck or leave you with corrupted data or metadata. ext4, btrfs and zfs all do checksumming of writes, but this is a detection mechanism. Repair of the partial write is done on detection (if you have another copy in btrfs or xfs) or by repair (ext4's fsck). For what it's worth, this is the same story with databases (DB2, Oracle, etc). They spend a lot of energy trying to detect partial writes from the application level's point of view and their granularity is often --
On Sat, 5 Sep 2009 12:28:10 +0200 http://lwn.net/SubscriberLink/349970/9875eff987190551/ assuming you've not already gotten one from elsewhere. jon --
Thanks, and thanks for nice article! Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I agree; it's very nicely written, balanced, and doesn't scare users unduly. - Ted --
Apparently because most people haven't read Documentation/md.txt: Boot time assembly of degraded/dirty arrays ------------------------------------------- If a raid5 or raid6 array is both dirty and degraded, it could have undetectable data corruption. This is because the fact that it is 'dirty' means that the parity cannot be trusted, and the fact that it is degraded means that some datablocks are missing and cannot reliably be reconstructed (due to no parity). And so on for several more paragraphs. Perhaps the documentation needs to be extended to note that "journaling will not help here, because the lost data blocks render entire stripes unreconstructable"... Hmmm, I'll take a stab at it. (I'm not addressing the raid 0 issues brought up elsewhere in this thread because I don't comfortably understand the current state of play...) Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
From: Rob Landley <rob@landley.net>
Add more warnings to the "Boot time assembly of degraded/dirty arrays" section,
explaining that using a journaling filesystem can't overcome this problem.
Signed-off-by: Rob Landley <rob@landley.net>
---
Documentation/md.txt | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/Documentation/md.txt b/Documentation/md.txt
index 4edd39e..52b8450 100644
--- a/Documentation/md.txt
+++ b/Documentation/md.txt
@@ -75,6 +75,23 @@ So, to boot with a root filesystem of a dirty degraded raid[56], use
md-mod.start_dirty_degraded=1
+Note that Journaling filesystems do not effectively protect data in this
+case, because the update granularity of the RAID is larger than the journal
+was designed to expect.  Reconstructing data via parity information involves
+matching together corresponding stripes, and updating only some of these
+stripes renders the corresponding data in all the unmatched stripes
+meaningless. Thus seemingly unrelated data in other parts of the filesystem
+(stored in the unmatched stripes) can become unreadable after a partial
+update, but the journal is only aware of the parts it modified, not the
+"collateral damage" elsewhere in the filesystem which was affected by those
+changes.
+
+Thus successful journal replay proves nothing in this context, and even a
+full fsck only shows whether or not the filesystem's metadata was affected.
+(A proper solution to this problem would involve adding journaling to the RAID
+itself, at least during degraded writes. In the meantime, try not to allow
+a system to shut down uncleanly with its RAID both dirty and degraded, it
+can handle one but not both.)
Superblock formats
------------------
--
Latency is more important than throughput. It's that simple. - Linus Torvalds
--
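The "dirty and degraded" case can be sketched with plain XOR parity. This is a toy model with three data blocks and one parity block, not MD's actual on-disk layout or code; all names are invented for illustration.

```python
from functools import reduce


def xor_blocks(*blocks):
    """XOR equal-length byte strings together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def degraded_torn_write():
    """Rebuild a missing block before and after an interrupted stripe update."""
    d0, d1, d2 = b"AAAA", b"BBBB", b"CCCC"
    parity = xor_blocks(d0, d1, d2)

    # The disk holding d1 is gone (degraded array); d1 is rebuilt from the rest.
    before = xor_blocks(d0, d2, parity)

    # New data for d0 reaches its disk, but power fails before the matching
    # parity update is written: the stripe is now dirty as well as degraded.
    d0 = b"ZZZZ"
    after = xor_blocks(d0, d2, parity)
    return before, after


before, after = degraded_torn_write()
print(before == b"BBBB", after == b"BBBB")
```

In this model the block that changes under the filesystem is d1, which was never written to, so nothing in a journal describing the d0 update refers to it.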
I like it! Not sure if I know enough about MD to add ack, but... Acked-by: Pavel Machek <pavel@ucw.cz> Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
NACK.

Now you have moved the inaccurate documentation about journalling file systems into the MD documentation.

Repeat after me:

(1) partial writes to a RAID stripe (with or without file systems, with or without journals) create an invalid stripe

(2) partial writes can be prevented in most cases by running with write cache disabled or working barriers

(3) fsck can (for journalling fs or non journalling fs) detect and fix your file system. It won't give you back the data in that stripe, but you will get the rest of your metadata and data back and usable.

You don't need MD in the picture to test this - take fsfuzzer or just dd and zero out a RAID stripe width of data from a file system. If you hit data blocks, your fsck (for ext2) or mount (for any journalling fs) will not see an error. If you hit metadata, fsck in both cases, when run, will try to fix it as best as it can.

Also note that partial writes (similar to torn writes) can happen for multiple reasons on non-RAID systems and leave the same kind of damage.

Side note, proposing a half sketched out "fix" for partial stripe writes in documentation is not productive. Much better to submit a fully thought out proposal or actual patches to demonstrate the issue.

Rob, you should really try to take a few disks, build a working MD RAID5 group and test your ideas. Try it with and without the write cache enabled.

Measure and report, say after 20 power losses, how file integrity and fsck repairs were impacted.

Try the same with ext2 and ext3.

Regards,

Ric
--
Given how long experience with storage you claim, you should know that ....and understand by now that statistics are irrelevant for design problems. Ouch and trying to silence people by telling them to fix the problem instead of documenting it is not nice either. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
so let's get broader testing (including testing the SSDs as well as the I think that every single one of them will tell you to not unplug the drive while writing to it. in fact, I'll bet they all tell you to not Ok, help me understand this. I copy these two files to a system, change them to point at the correct device, run them and unplug the drive while it's running. when I plug the device back in, how do I tell if it lost something unexpected? since you are writing from urandom I have no idea what data _should_ be on the drive, so how can I detect that a data block has been corrupted? David Lang
I have mirror on disk you are not unplugging. See cmp || exit lines. The test continues until it detects corruption. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
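The attached scripts are not reproduced in this thread, so the following is not Pavel's test, only a sketch of the idea behind the cmp step: keep a known-good mirror, compare it with the device after the unplug, and ignore the blocks that were being written at the time. The function names and the 512-byte block size are assumptions.

```python
def differing_blocks(mirror, device, block_size=512):
    """Return indices of blocks where the device no longer matches the mirror."""
    length = max(len(mirror), len(device))
    bad = []
    for offset in range(0, length, block_size):
        if mirror[offset:offset + block_size] != device[offset:offset + block_size]:
            bad.append(offset // block_size)
    return bad


def collateral_damage(bad_blocks, in_flight):
    """Blocks that changed although nothing was writing to them."""
    return [b for b in bad_blocks if b not in in_flight]


mirror = bytes(2048)
device = bytearray(mirror)
device[600] = 1      # damage in block 1, which was not being written
device[1600] = 1     # torn write in block 3, which was in flight
bad = differing_blocks(mirror, bytes(device))
print(collateral_damage(bad, in_flight={3}))
```

Damage confined to the in-flight blocks is the ordinary torn-write case; anything left over after filtering is the collateral damage under discussion.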
I am against documenting unlikely scenarios out of context that will lead people to do the wrong thing. ric --
First of all, it's not a "known software deficiency"; you can't do
anything about a degraded RAID array, other than to replace the failed
disk. Secondly, what we should document is things like "don't use
crappy flash devices", "don't let the RAID array run in degraded mode
for a long time" and "if you must (which is a bad idea), better have a
UPS or a battery-backed hardware RAID". What we should *not* document
is
"ext3 is worthless for RAID 5 arrays" (simply wrong)
and
"ext2 is better than ext3 because it forces you to run a long, slow
fsck after each boot, and that helps you to catch filesystem
corruptions when the storage devices goes bad" (Second part of the
statement is true, but it's still bad general advice, and it's
horribly misleading)
and
"ext2 and ext3 have this surprising dependency that disks act like
disks". (alarmist)
- Ted
--
AFAICT, you mount block device, not disk. Many block devices fail the test. And since users (and block device developers) do not know in detail how disks behave, it is hard to blame them... ("you may corrupt sector you are writing to and ext3 handles that ok" was surprise for me, for example). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Or panic, hang, the drive failed because the system is overheating because the air conditioner suddenly died and the server room is now an oven. (Yup, I'm a bit concerned by the argument that we don't need to document serious pitfalls because every Linux system has a sufficiently competent administrator they already know stuff that didn't even come up until the second or third day it was discussed on lkml. I worked at a company that retested their UPSes a year after installing them and found that _none_ of them supplied more than 15 seconds charge, and when they dismantled them the batteries had physically bloated inside their little plastic cases. (Same company as the dead air conditioner, possibly overheating was involved but the little _lights_ said everything was ok.) That was by no means the first UPS I'd seen die, the suckers have a higher failure rate than hard drives in my experience. This is a device where the batteries get constantly charged and almost never tested because if it _does_ fail you just rebooted your production server, so a lot of smaller companies Here's hoping they shut the system down properly to install the new drive in the raid then, eh? Not accidentally pull the plug before it's finished running the ~7 minutes of shutdown scripts in the last Red Hat Enterprise I messed with... Does this situation apply during the rebuild? I.E. once a hot spare has been supplied, is the copy to the new drive linear, or will it write dirty pages to the new drive out of order, even before the reconstruction's gotten that far, _and_ do so in an order that doesn't open this race window of the data being unable to be reconstructed? If "degraded array" just means "don't have a replacement disk yet", then it sounds like what Pavel wants to document is "don't write to a degraded array at all, because power failures can cost you data due to write granularity being larger than filesystem block size". (Which still comes as news to some ...
I'm not convinced that information which needs to be known by System Administrators is best documented in the kernel Documentation directory. Should there be a HOWTO document on stuff like that? Sure, if someone wants to put something like that together, having free documentation about ways to set up your storage stack in a sane way is not a bad thing. It should be noted that these sorts of issues are discussed in various books targetted at System Administrators, and in Usenix's System Administration tutorials. The computer industry is highly specialized, and so just because an OS kernel hacker might not be familiar with these issues, doesn't mean that professionals whose job it is to run data centers don't know about these things! Similarly, you could be a whiz at Linux's networking stack, but you might not know about certain pitfalls in configuring a Cisco router using IOS; does that mean we should have an IOS tutorial in the kernel Sure, but the fact that we don't currently say much about storage stacks doesn't mean we should accept a patch that might actively Sounds like they were using really cheap UPS's; certainly not the kind I would expect to find in a data center. And if company's system administrator is using the cheapest possible consumer-grade UPS's, then yes, they might have a problem. Even an educational institution like MIT, where I was an network administrator some 15 years ago, had proper UPS's, *and* we had a diesel generator which kicked in after 15 seconds --- and we tested the diesel generator every Friday morning, Even my home RAID array uses hot-plug SATA disks, so I can replace a failed disk without shutting down my system. (And yes, I have a backup battery for the hardware RAID, and the firmware runs periodic tests on it; the hardware RAID card also will send me e-mail if a RAID array drive fails and it needs to use my hot-spare. At that point, I order a new hard drive, secure in the knowledge that the system can still suffer another ...
One thing that does need fixing for some MD configurations is to stress again that we need to make sure that barrier operations are properly supported or users will need to disable the write cache on devices with volatile write caches. Ric --
Agreed; chime in on Christoph's linux-vfs thread if people have input.
I quickly glanced at MD and DM. Currently, upstream, we see a lot of
	if (unlikely(bio_barrier(bio))) {
		bio_endio(bio, -EOPNOTSUPP);
		return 0;
	}
in DM and MD make_request functions.
Only md/raid1 supports barriers at present, it seems. None of the other
MD drivers support barriers.
DM has some barrier code... but the above code was pasted from DM's
make_request function, so I am guessing that DM's barrier stuff is
incomplete and disabled at present.
I've been mentioning this issue for years... glad some people finally
noticed :)
Jeff
--
That code is from the new request-based multipath implementation in 2.6.31 which doesn't yet. But bio-based dm does support barriers now. (Just missing some patches to complete the dm-raid1 support that are still under review IIRC.) Alasdair --
Not even md/raid0? Ouch :-(. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Only for raid1 is there no requirement for inter-drive ordering. Hence only raid1 supports barriers (and gained that support very recently, within the last 1 or 2 kernel releases). For the rest, including raid0 and linear, inter-drive ordering is necessary to implement barriers. Or md should have its own queue (flushing) mechanisms.

/mjt
--
It is not only for system administrators; I was trying to find out if ext3 documentation states that journal protects fs integrity on powerfail.

If you don't want to talk about storage stacks, perhaps that should be removed?

Now... You mocked me for 'ext3 expects disks to behave like disks (alarmist)'. I actually believe that should be written somewhere.

ext3 depends on fairly subtle storage disk characteristics, and many common configs just do not meet the expectations (missing barriers is most common, followed by collateral damage).

Maybe not documenting that was okay 10 years ago, but with all the USB sticks and raid arrays around, it's just sloppy. Because those characteristics are not documented, storage stack authors do not know what they have to guarantee, and the result is bad.

See for example nbd -- it does not propagate barriers and is therefore unsafe.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
Can we get a proper scrub function (full rewrite of all component disks), please? Not every disk out there will stop a streaming read to Debian got this right :-) -- "One disk to rule them all, One disk to find them. One disk to bring them all and in the darkness grind them. In the Land of Redmond where the shadows lie." -- The Silicon Valley Tarot Henrique Holschuh --
Yes. Unfortunately, different filesystems expect different properties from block devices. ext3 will work with write cache enabled/barriers enabled, while ext2 needs write cache disabled.

The requirements are also quite surprising; AFAICT ext3 can handle a disk writing garbage to a single sector during powerfail, while xfs can not handle that.

Now, how do you expect users to know these subtle details when it is not documented anywhere? And why are you fighting against documenting

As was uncovered, MD RAID does not properly support barriers,

Trust me, 99% of sysadmins are not competent by your definition.

So ext3 greatly contributes to administrator incompetency:

# The journal supports the transactions start and stop, and in case of a
# crash, the journal can replay the transactions to quickly put the
# partition back into a consistent state.

...it does not mention that (non-default!) barrier=1 is needed to make this reliable, nor does it mention that there are certain requirements for this to work. It just says that the journal will magically help you.

And you wonder why people expect magic from your filesystem?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
The reality in your document does not match up with the reality out there in the world. That sounds like a good reason not to have your (incorrect) document out there, confusing people. -- All rights reversed. --
On google scale anvil lightning can fry your machine out of a clear sky. However, there are still a few non-enterprise users out there, and knowing that specific usage patterns don't behave like they expect might be useful to Actually, that's _exactly_ what he's talking about. When writing to a degraded raid or a flash disk, journaling is essentially useless. If you get a power failure, kernel panic, somebody tripping over a USB cable, and so on, your filesystem will not be protected by journaling. Your data won't be trashed _every_ time, but the likelihood is much greater than experience with journaling in other contexts would suggest. Worse, the journaling may be counterproductive by _hiding_ many errors that fsck would promptly detect, so when the error is detected it may not be associated with the event that caused it. It also may not be noticed until good backups of the data have been overwritten or otherwise cycled out. You seem to be arguing that Linux is no longer used anywhere but the enterprise, so issues affecting USB flash keys or cheap software-only RAID aren't worth documenting? Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
You are missing the broader point of both papers. They (and people like me when back at EMC) look at large numbers of machines and try to fix what actually breaks when run in the real world and causes data loss. The motherboards, S-ATA controllers, disk types are the same class of parts that I have in my desktop box today.

The advantage of google, national labs, etc is that they have large numbers of systems and can draw conclusions that are meaningful to our broad user base.

Specifically, in using S-ATA drives (just like ours, maybe slightly more reliable) they see up to 7% of those drives fail each year. All users have "soft" drive failures like single remapped sectors.

These errors happen extremely commonly and are what RAID deals with well.

What does not happen commonly is that during the RAID rebuild (kicked off only after a drive is kicked out), you push the power button or have a second failure (power outage).

We will have more users lose data if they decide to use ext2 instead of ext3 and use only single disk storage.

We have real numbers that show that is true. Injecting double faults into a system that handles single faults is frankly not that interesting.

You can get better protection from these double faults if you move to "cloud" like storage configs where each box is fault tolerant, but you also spread your data over multiple boxes in multiple locations.

Regards,
--
No, I'm dismissing the papers (some of which I read when they first came out and got slashdotted) as irrelevant to the topic at hand. Pavel has two failure modes which he can trivially reproduce. The USB stick one is reproducible on a laptop by jostling said stick. I myself used to have a literal USB keychain, and the weight of keys dangling from it pulled it out of the USB socket fairly easily if I wasn't careful. At the time nobody had told me a journaling filesystem was not a reasonable safeguard here. Presumably the degraded raid one can be reproduced under an emulator, with no hardware directly involved at all, so talking about hardware failure rates ignores the fact that he's actually discussing a _software_ problem. It may happen in _response_ to hardware failures, but the damage he's attempting to document happens entirely in software. These failure modes can cause data loss which journaling can't help, but which journaling might (or might not) conceivably hide so you don't immediately notice it. They share a common underlying assumption that the storage device's update granularity is less than or equal to the filesystem's block size, which is not actually true of all modern storage devices. The fact he's only _found_ two instances where this assumption bites doesn't mean there aren't more waiting to be found, especially as more new storage media types get introduced. Pavel's response was to attempt to document this. Not that journaling is _bad_, but that it doesn't protect against this class of problem. Your response is to talk about google clusters, cloud storage, and cite academic papers of statistical hardware failure rates. As I understand the discussion, that's not actually the issue Pavel's talking about, merely one potential trigger for it. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
I don't think anyone is disagreeing with the statement that journaling doesn't protect against this class of problems, but Pavel's statements didn't say that. he stated that ext3 is more dangerous than ext2. David Lang --
Well, if you use 'common' fsck policy, ext3 _is_ more dangerous. But I'm not pushing that to documentation, I'm trying to push info everyone agrees with. (check the patches). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
The filesystem itself isn't more dangerous, but it may provide a false sense of security when used on storage devices it wasn't designed for. Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
Agreed. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
From this discussion (and the similar discussion on lwn.net) there appears to be confusion/disagreement over what fsck does and what the results of not running it are.

It has been stated here that fsck cannot fix broken data, all it tries to do is to clean up metadata, but it would probably help to get a clear statement of what exactly that means.

I know that it:

  finds entries that don't actually have data and deletes them

  finds entries where multiple files share data blocks and duplicates the (bad for one file) data to separate them

  finds blocks that have been orphaned (allocated, but no directory pointer to them) and creates entries in lost+found

But if a fsck does not get run on a filesystem that has been damaged, what additional damage can be done?

Can it overwrite data that could have been saved?

Can it cause new files that are created (or new data written to existing, but uncorrupted files) to be lost?

Or is it just a matter of not knowing about existing corruption?

David Lang
--
Let me give you my formulation of fsck which may be helpful. Fsck cannot fix broken data; and (particularly in fsck -y mode) may not even recover the maximal amount of lost data caused by metadata corruption. (This is why sometimes an expert using debugfs can recover more data than fsck -y, and if you have some really precious data, like ten years' worth of Ph.D. research that you've never bothered to back up[1], the first thing you should do is buy a new hard drive and make a sector-by-sector copy of the disk and *then* run fsck. A new terabyte hard drive costs $100; how much is your data worth to you?) [1] This isn't hypothetical; while I was at MIT this sort of thing actually happened more than once --- which brings up the philosophical question of whether someone who is that stupid about not doing backups on critical data *deserves* to get a Ph.D. degree. :-) Fsck's primary job is to make sure that further writes to the filesystem, whether you are creating new files or removing directory hierarchies, etc., will not cause *additional* data loss due to metadata corruption in the file system. Its secondary goals are to preserve as much data as possible, and to make sure that file system metadata is valid (i.e., so that a block pointer contains a valid block address, so that an attempt to read a file won't cause an I/O error when the filesystem attempts to seek to a non-existent sector on disk). For some filesystems, invalid, corrupt metadata can actually cause a system panic or oops message, so it's not necessarily safe to mount a filesystem with corrupt metadata read-only without risking the need to reboot the machine in question. 
More recently, there are folks who have been filing security bugs when they detect such cases, so there are fewer examples of such cases, but historically it was a good idea to run fsck because otherwise it's possible the kernel might oops or Consider the case where there are data blocks in use by inodes, containing precious data, ...
So your argument basically is 'our ABS brakes are broken, but let's not tell anyone; our car is still safer than a horse', and 'while we know our ABS brakes are broken, they are not a major factor in accidents, so let's not tell anyone'. Sorry, but I'd expect slightly higher moral standards. If we can document it in a way that's non-scary, and does not push people to single disks (horses), please go ahead; but you have to mention that md raid breaks journalling assumptions (our ABS brakes really are broken). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
You continue to ignore the technical facts that everyone (both MD and ext3 people) put in front of you. If you have a specific bug in MD code, please propose a patch. Ric --
Interesting. So, what's technically wrong with the patch below? Pavel
---
From: Theodore Tso <tytso@mit.edu>

Document that many devices are too broken for filesystems to protect data in case of powerfail.

Signed-of-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..2f3eec1
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that high highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and DM/MD RAID 4/5/6 (*)
+arrays. These devices have the property of potentially corrupting
+blocks being written at the time of the power failure, and worse yet,
+amplifying the region where blocks are corrupted such that additional
+sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using these devices is also a
+Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) Degraded array or single disk failure "near" the powerfail is
+neccessary for this property of RAID arrays to bite.
-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
You mean apart from ".... that high highly undesirable ...." ??
^^^^^^^^^^^
And the phrase "Regular backups when using these devices ...." should
be "Regular backups when using any devices .....".
^^^
If you have a device failure near a power fail on a raid5 you might
lose some blocks of data. If you have a device failure near (or not
near) a power failure on raid0 or jbod etc you will certainly lose lots
of blocks of data.
I think it would be better to say:
".... and degraded DM/MD RAID 4/5/6(*) arrays..."
^^^^^^^^
with
(*) If device failure causes the array to become degraded during or
immediately after the power failure, the same problem can result.
And "necessary" only has the one 'c' :-)
--
Ok, I still believe kernel documentation should be ... well... in kernel, not in LWN article, so I fixed the patch according to your comments.

Signed-off-by: Pavel Machek <pavel@ucw.cz>

diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..14d0324
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,21 @@
+There are storage devices that have highly undesirable properties when
+they are disconnected or suffer power failures while writes are in
+progress; such devices include flash devices and degraded DM/MD RAID
+4/5/6 (*) arrays. These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using any devices, and these
+devices in particular is also a Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. An forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
+
+(*) If device failure causes the array to become degraded during or
+immediately after the power failure, the same problem can result.
Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
My suggestion was that you stop trying to document your assertion of an issue and actually suggest fixes in code or implementation. I really don't think that you have properly diagnosed your specific failure or done sufficient. However, if you put a full analysis and suggested code out to the MD devel lists, we can debate technical implementation as we normally do. As Ted quite clearly stated, documentation on how RAID works, how to configure it, etc, is best put in RAID documentation. What you claim as a key issue is an issue for all file systems (including ext2). The only note that I would put in ext3/4 etc documentation would be: "Reliable storage is important for any file system. Single disks (or FLASH or SSD) do fail on a regular basis. To reduce your risk of data loss, it is advisable to use RAID which can overcome these common issues. If using MD software RAID, see the RAID documentation on how best to configure your storage. With or without RAID, it is always important to back up your data to an external device and keep copies of that backup off site." --
I don't think I should be required to rewrite linux md layer in order Uh, how clever, instead of documenting that our md raid code does not always work as expected, you document that components fail. Newspeak 101? You even failed to mention little design problem with flash and eraseblock size... and the fact that you don't need flash to fail to get data loss. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
NACK. I didn't write this patch, and it's disingenuous for you to try to claim that I authored it. You took text I wrote from the *middle* of an e-mail discussion and you ignored multiple corrections to typos that I made --- typos that I would have corrected if I had ultimately decided to post this as a patch, which I did NOT. While Neil Brown's corrections are minimally necessary so the text is at least technically *correct*, it's still not the right advice to give system administrators. It's better than the fear-mongering patches you had proposed earlier, but what would be better *still* is telling people why running with degraded RAID arrays is bad, and to give them further tips about how to use RAID arrays safely. To use your ABS brakes analogy, just because it's not safe to rely on ABS brakes if the "check brakes" light is on, that doesn't justify writing something alarmist which claims that ABS brakes don't work 100% of the time, don't use ABS brakes, they're broken!!!! The first part of it is true, since ABS brakes can suffer mechanical failure. But what we should be telling drivers is, "if the 'check brakes' light comes on, don't keep driving with it, go to a garage and get it fixed!!!". Similarly, if you get a notice that your RAID is running in degraded mode, you've already suffered one failure; you won't survive another failure, so fix that issue ASAP! If you're really paranoid, you could decide to "pull over to the side of the road"; that is, you could stop writing to the RAID array as soon as possible, and then get the RAID array rebuilt before proceeding. That can reduce the chances of a second failure. But in the real world, there are costs associated with taking a production server off-line, and the prudent system administrator has to do a risk-reward tradeoff. A better approach might be to have the array configured with a hot spare, and to regularly scrub the array, and configure the RAID array with either a battery backup or a UPS. ...
Well, you did write original text, so I wanted to give you Maybe this belongs to Doc*/filesystems, and more detailed RAID If it only was this simple. We don't have 'check brakes' (aka 'journalling ineffective') warning light. If we had that, I would not have problem. It is rather that your ABS brakes are ineffective if 'check engine' (RAID degraded) is lit. And yes, running with 'check engine' for extended periods may be bad idea, but I know people that do that... and I still hope their brakes work (and believe they should 'your RAID array is degraded' is very counter intuitive way to say '...and btw your journalling is no longer effective, either'. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Why should this be placed in *kernel* documentation anyway? The "dangers of RAID", the hints that "backups are a good idea" - isn't that something for howtos for sysadmins? No end-user will ever look into Documentation/ anyway. The sysadmins should know what they're doing and see the upsides and downsides of RAID and journalling filesystems. And they'll turn to howtos and tutorials to find out. And maybe seek *reference* documentation in Documentation/ - but I don't think Storage-101 should be covered in a mostly hidden place like Documentation/. Christian. -- BOFH excuse #212: Of course it doesn't work. We've performed a software upgrade. --
The fact that two kernel subsystems (MD RAID, journaling filesystems) do not work well together is surprising and should be documented near the source. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
The 'RAID degraded' warning says that _anything_ you put on that block device is at risk. It doesn't matter if you are using a filesystem with a journal, one without, or using the raw device directly. David Lang --
The easiest way to lose your data in Linux - with RAID, without RAID, S-ATA or SAS - is to run with the write cache enabled. For comparison, even a large RAID stripe is measured in KB, and as this thread has already mentioned, you stand to have damage to just one stripe (or even just a disk sector or two). If you lose power with the write caches enabled on a 5 drive RAID set, you could lose as much as 5 * 32MB of freshly written data (16-32MB write caches are common on s-ata disks these days). For MD5 (and MD6), you really must run with the write cache disabled until we get barriers to work for those configurations. It would be interesting for Pavel to retest with the write cache enabled/disabled on his power loss scenarios with multi-drive RAID. Regards, Ric --
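The comparison above is simple arithmetic, sketched here. The 5 drives and 32MB caches come from the message; the 64 KiB chunk size and single parity drive are assumptions picked only to give the stripe a size, and the function names are hypothetical.

```python
KIB = 1024
MIB = 1024 * KIB


def cache_exposure_bytes(drives, cache_mib_per_drive):
    """Worst case: every drive loses a full volatile write cache."""
    return drives * cache_mib_per_drive * MIB


def stripe_bytes(drives, chunk_kib, parity_drives=1):
    """Data held in one full stripe (assumed chunk size, RAID5-style parity)."""
    return (drives - parity_drives) * chunk_kib * KIB


if __name__ == "__main__":
    cache = cache_exposure_bytes(drives=5, cache_mib_per_drive=32)
    stripe = stripe_bytes(drives=5, chunk_kib=64)
    print(f"write cache exposure: {cache // MIB} MiB")
    print(f"one stripe:           {stripe // KIB} KiB")
    print(f"ratio:                {cache // stripe}x")
```

The exact ratio depends entirely on the assumed chunk size; the point is only that unflushed cache contents are counted in MB while a stripe is counted in KB.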
Ric Wheeler wrote: This is fundamentally wrong. Many filesystems today use either barriers or flushes (if barriers are not supported), and the times when disk drives I highly doubt barriers will ever be supported on anything but simple raid1, because it's impossible to guarantee ordering across multiple drives. Well, it *is* possible to have write barriers with journalled (and/or with battery-backed-cache) raid[456]. Note that even if raid[456] does not support barriers, write cache flushes still work. /mjt --
Unfortunately not - if you mount a file system with write cache enabled and see "barriers disabled" messages in /var/log/messages, this is exactly what happens. File systems issue write barrier operations that in turn issue cache flush commands (ATA_FLUSH_EXT or its SCSI equivalent). MD5 and MD6 do not currently pass these operations on, and there is no other file system level mechanism that somehow bypasses the IO stack to invalidate or flush the cache. Note that some devices have non-volatile write caches (specifically I think that you are confused - barriers are implemented using cache flushes. Ric --
While most common filesystems do have barrier support:
- it is not actually enabled for the two most common filesystems
- the support for write barriers and cache flushing tends to be buggy
All currently working barrier implementations on Linux are built upon queue drains and cache flushes, plus sometimes setting the FUA bit. --
Or just missing - I think that MD5/6 simply drop the requests at present. I wonder if it would be worth having MD probe for write cache enabled & warn if --
In my opinion even that is too weak. We know how to control the cache settings on all common disks (that is scsi and ata), so we should always disable the write cache unless we know that the whole stack (filesystem, raid, volume managers) supports barriers. And even then we should make sure the filesystem does actually use barriers everywhere that's needed, which we failed at for years. --
.. That stack does not know that my MD device has full battery backup, so it bloody well better NOT prevent me from enabling the write caches. In fact, MD should have nothing to do with that. I do like/prefer the way that XFS currently does it: disables barriers and logs the event, but otherwise doesn't try to enforce policy upon me from kernel space. Cheers --
No one is going to prevent you from doing it. The question is one of sane defaults. Always safe, but slower if you have advanced equipment, is a much better default than unsafe by default on most of the install base. --
I've always agreed with "be safe first" and have worked where we always shut write cache off unless we knew it had battery. But before we make disabling cache the default, this is the impact:
- users will see it as a performance regression
- trashy OS vendors who never disable cache will benchmark better than "out of the box" linux.
Because as we all know, users don't read release notes. Been there, done that, felt the pain. jim --
Just to add some support to this, all of the external RAID arrays that I know of normally run with write cache disabled on the component drives. In addition, many of them will disable their internal write cache if/when they detect that they have lost their UPS. I think that if we had done this kind of sane default earlier for MD levels that do not handle barriers, we would not have left some people worried about our software RAID. To be clear, if a sophisticated user wants to override this default, that should be supported. It is not (in my opinion) a safe default behaviour. Ric --
Do they use "off the shelf" SATA (or PATA) disks, and if so, which ones? -- Krzysztof Halasa --
Which drives the various vendors ship changes with specific products. Usually, they ship drives that have carefully vetted firmware, etc. but they are close to the same drives you buy on the open market. Seagate has a huge slice of the market, ric --
But they aren't the same, are they? If they are not, the fact they can run well with the write-through cache doesn't mean the off-the-shelf ones can do as well. Are they SATA (or PATA) at all? SCSI etc. are usually different animals, though there are SCSI and SATA models which differ only in electronics. Do you have battery-backed write-back RAID cache (which acknowledges flushes before the data is written out to disks)? PC can't do that. -- Krzysztof Halasa --
Storage vendors have a wide range of options, but what you get today is a collection of s-ata (not much any more), sas or fc. We (red hat) have all kinds of different raid boxes... ric --
I have no doubt about it, but are those you know equipped with battery-backed write-back cache? Are they using SATA disks? We can _at_best_ compare non-battery-backed RAID using SATA disks with what we typically have in a PC. -- Krzysztof Halasa --
The whole thread above is about software MD using commodity drives (S-ATA or SAS) without battery backed write cache. We have that (and I have it personally) and do test it. You must disable the write cache on these commodity drives *if* the MD RAID level does not support barriers properly. This will greatly reduce errors after a power loss (both in degraded state and non-degraded state), but it will not eliminate data loss entirely. You simply cannot do that with any storage device! Note that even without MD raid, the file system issues IO's in file system block size (4096 bytes normally) and most commodity storage devices use a 512 byte sector size which means that we have to update 8 512-byte sectors. Drives can (and do) have multiple platters and surfaces and it is perfectly normal to have contiguous logical ranges of sectors map to non-contiguous sectors physically. Imagine a 4KB write that straddles two adjacent tracks on one platter (requiring a seek) or is mapped across two surfaces (requiring a head switch). Also, a remapped sector can require more or less a full surface seek from wherever you are to the remapped sector area of the drive. These are all examples of how, after a power loss, even a local (non-MD) device can end up with a partial update of that 4KB range of sectors. Note that unlike RAID/MD, local storage has no parity on the server to detect this partial write. This is why new file systems like btrfs and zfs do checksumming of data and metadata. This won't prevent partial updates during a write, but can at least detect them and try to do some kind of recovery. In other words, this is not just an MD issue, it is entirely possible even with non-MD devices. Also, when you enable the write cache (MD or not) you are buffering multiple MB's of data that can go away on power loss. That is a far greater (10x) exposure than the partial RAID rewrite case worries about. ric --
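A toy model of the partial update described above, and of how a stored checksum can detect (not prevent) it. This is only a sketch: `zlib.crc32` stands in for whatever checksum a real filesystem uses, and the helper names are hypothetical, not taken from btrfs or zfs.

```python
import zlib

SECTOR = 512   # device sector size in bytes
BLOCK = 4096   # filesystem block size in bytes (8 sectors)


def torn_write(old_block, new_block, sectors_written):
    """Simulate power loss after only `sectors_written` of 8 sectors hit the media."""
    cut = sectors_written * SECTOR
    return new_block[:cut] + old_block[cut:]


def is_intact(block, expected_crc):
    """Compare the block contents against the checksum stored for it."""
    return zlib.crc32(block) == expected_crc


if __name__ == "__main__":
    old = bytes(BLOCK)            # previous contents: all zero bytes
    new = bytes([1]) * BLOCK      # intended contents
    crc = zlib.crc32(new)         # checksum recorded for the new contents
    on_disk = torn_write(old, new, sectors_written=3)
    print("intact after partial write:", is_intact(on_disk, crc))
```

Without the stored checksum, nothing on a single local disk distinguishes the half-old, half-new block from valid data.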
Database software often attempts to deal with this phenomenon (sometimes called "torn page writes"). For example, you can make sure that the first time you write to a database page, you keep a full copy in your transaction log. If the machine crashes, the log is replayed, first completely overwriting the partially-written page. Only after that, you can perform logical/incremental logging. The log itself has to be protected with a different mechanism, so that you don't try to replay bad data. But you haven't committed to this data yet, so it is fine to skip bad records. Therefore, sub-page corruption is a fundamentally different issue from super-page corruption. BTW, older textbooks will tell you that mirroring requires that you read from two copies of the data and compare it (and have some sort of tie breaker if you need availability). And you also have to re-read data you've just written to disk, to make sure it's actually there and hit the expected sectors. We can't even do this anymore, thanks to disk caches. And it doesn't seem to be necessary in most cases. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 --
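The full-page logging idea above can be sketched roughly as follows. This is a minimal illustration, not how any particular database implements it: the record layout, the use of `zlib.crc32` to protect log records, and the function names are all assumptions.

```python
import zlib


def make_log_record(page_no, page_image):
    """Log a full copy of the page, with a checksum protecting the record itself."""
    return {"page_no": page_no, "image": page_image, "crc": zlib.crc32(page_image)}


def replay(log, pages):
    """Overwrite possibly torn pages with the full images kept in the log.

    Records that fail their checksum were never committed, so they are skipped.
    """
    for record in log:
        if zlib.crc32(record["image"]) != record["crc"]:
            continue
        pages[record["page_no"]] = record["image"]
    return pages


if __name__ == "__main__":
    good = b"N" * 8192
    torn = good[:512] + b"O" * (8192 - 512)        # page 0 was only partly written
    bad_record = make_log_record(1, b"X" * 8192)
    bad_record["image"] = b"Y" + bad_record["image"][1:]  # log record itself damaged
    pages = replay([make_log_record(0, good), bad_record], {0: torn})
    print("page 0 repaired:", pages[0] == good, "| page 1 replayed:", 1 in pages)
```

Incremental (logical) records would only be applied after this step, once every page they refer to is known to be whole.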
Yes - databases worry a lot about this. Another technique that they tend to use is to have state bits at the beginning and end of their logical pages. For example, the first byte and last byte toggle together from 1 to 0 to 1 to 0 as you update. If the bits don't match, that is a quick level indication of a torn write. Even with the above scheme, you can still have data loss of course - you just need an IO error in the log and in your db table that was recently updated. Not entirely unlikely, especially if you use write cache enabled storage and don't We have to be careful to keep our terms clear since the DB pages are (usually) larger than the FS block size which in turn is larger than non-RAID storage sector size. At the FS level, we send down multiples of fs blocks (not blocked/aligned at RAID stripe levels, etc). In any case, we can get sub-FS block level "torn writes" even with a local S-ATA We can do something like this with the built in RAID in btrfs. If you detect an IO error (or bad checksum) on a read, btrfs knows how to request/grab another copy. Also note that the SCSI T10 DIF/DIX has baked in support for applications to layer on extra data integrity (look for MKP's slide decks). This is really neat since you can intercept bad IO's on the way down and prevent overwriting good data. ric --
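The matching state bits mentioned above might look like this in a toy form. The one-byte flags and function names are assumptions for illustration; real databases use their own page header and trailer layouts.

```python
def stamp_page(payload, generation):
    """Wrap the payload in a leading and trailing flag byte that toggle together."""
    flag = bytes([generation & 1])
    return flag + payload + flag


def looks_torn(page):
    """A mismatch between first and last byte is a quick hint of a torn write."""
    return page[0] != page[-1]


if __name__ == "__main__":
    old = stamp_page(b"A" * 4094, generation=0)
    new = stamp_page(b"B" * 4094, generation=1)
    on_disk = new[:512] + old[512:]   # only the first sector of the update landed
    print("torn:", looks_torn(on_disk), "| clean page torn:", looks_torn(new))
```

This is only a quick check: it catches an update that stopped somewhere between the first and last byte of the page, and as noted above it does nothing about IO errors that hit both the log and the table.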
Yes. However, you mentioned external RAID arrays disable disk caches. That's why I asked if they are using SATA or SCSI/etc. disks, and if The cache is flushed with working barriers. I guess it should be superior to disabled WB cache, in both performance and expected disk lifetime. -- Krzysztof Halasa --
Sorry for the confusion - they disable the write caches on the component drives normally, but have their own write cache which is not disabled in True - barriers (especially on big, slow s-ata drives) usually give you an overall win. SAS drives it seems to make less of an impact, but then you always need to benchmark your workload on anything to get the only numbers that really matter :-) ric --
Ric Wheeler wrote: .. Rather than further trying to cripple Linux on the notebook, (it's bad enough already).. How about instead, *fixing* the MD layer to properly support barriers? That would be far more useful, productive, and better for end-users. Cheers --
People using MD on notebooks (not sure there are that many using RAID5 Fixing MD would be great - not sure that it would end up still faster (look at md1 devices with working barriers compared to md1 with write cache disabled). In the meantime, if you are using MD to make your data more reliable, I would still strongly urge you to disable the write cache when you see "barriers disabled" messages spit out in /var/log/messages :-) ric --
.. There's no inherent reason for it to be slower, except possibly drives with b0rked FUA support. So the first step is to fix MD to pass barriers to the LLDs for most/all RAID types. Then, if it has performance issues, those can be addressed by more application of little grey cells. :) Cheers --
The performance issue with MD is that the "simple" answer is to not only pass on those downstream barrier ops, but also to block and wait until all of those dependent barrier ops complete before ack'ing the IO. When you do that implementation at least, you will see a very large performance impact and I am not sure that you would see any degradation vs just turning off the write caches. Sounds like we should actually do some testing and measure. I do think that it will vary with the class of device quite a lot, just like we see with single disk barriers vs write cache disabled on SAS vs S-ATA, etc... ric --
Having MD "pass barriers" to LLDs isn't really very useful. The barrier needs to act with respect to all addresses of the device, and once you pass it down, it can only act with respect to addresses on that device. What any striping RAID level needs to do when it sees a barrier is:
- suspend all future writes
- drain and flush all queues
- submit the barrier write
- drain and flush all queues
- unsuspend writes
I guess "drain and flush all queues" can be done with an empty barrier so maybe that is exactly what you meant. The double flush which (I think) is required by the barrier semantic is unfortunate. I wonder if it would actually make things slower than necessary. --
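The five steps listed above can be sketched as a toy, single-threaded model. None of this is MD code: the `Member` and `StripedArray` classes and their fields are hypothetical, and real queues and caches are asynchronous.

```python
class Member:
    """One member disk: a request queue, a volatile write cache, and the media."""

    def __init__(self):
        self.queue = []
        self.cache = []
        self.media = []

    def drain_and_flush(self):
        self.cache.extend(self.queue)   # drain: everything queued reaches the drive
        self.queue.clear()
        self.media.extend(self.cache)   # flush: the drive empties its write cache
        self.cache.clear()


class StripedArray:
    def __init__(self, members):
        self.members = [Member() for _ in range(members)]
        self.suspended = False
        self.held = []                  # writes arriving while suspended

    def write(self, member, data):
        if self.suspended:
            self.held.append((member, data))
        else:
            self.members[member].queue.append(data)

    def _drain_and_flush_all(self):
        for m in self.members:
            m.drain_and_flush()

    def barrier_write(self, member, data):
        self.suspended = True                      # suspend all future writes
        self._drain_and_flush_all()                # drain and flush all queues
        self.members[member].queue.append(data)    # submit the barrier write
        self._drain_and_flush_all()                # drain and flush all queues
        self.suspended = False                     # unsuspend writes
        held, self.held = self.held, []
        for m, d in held:
            self.write(m, d)
```

The model makes the cost visible: every member is flushed twice per barrier, including members the barrier write never touches, which is the double flush the message calls unfortunate.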
Yes, but ext3 was designed to handle the partial write (according to Yes, that's what barriers are for. Except that they are not there on MD0/MD5/MD6. They actually work on local sata drives... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I'm not sure what made you think that I said that. In practice things usually work out, as a consequence of the fact that ext3 uses physical Yes, but ext3 does not enable barriers by default (the patch has been submitted but akpm has balked because he doesn't like the performance degradation and doesn't believe that Chris Mason's "workload of doom" is a common case). Note though that it is possible for dirty blocks to remain in the track buffer for *minutes* without being written to spinning rust platters without a barrier. See Chris Mason's report of this phenomenon here: http://lkml.org/lkml/2009/3/30/297 Here's Chris Mason's "barrier test" which will corrupt ext3 filesystems 50% of the time after a power drop if the filesystem is mounted with barriers disabled (which is the default; use the mount option barrier=1 to enable barriers): http://lkml.indiana.edu/hypermail/linux/kernel/0805.2/1518.html (Yes, ext4 has barriers enabled by default.) - Ted --
Or migrate to ext4, which does use barriers by default, as well as journal-level checksumming. :-) As far as changing the default to enable barriers for ext3, you'll need to talk to akpm about that; he's the one who has been against it in the past. - Ted --
So what I recommend for server class machines is to either turn off the automatic fsck's (it's the default, but it's documented and there are supported ways of turning it off --- that's hardly developers "ramming" it down user's throats), or more preferably, to use LVM, and You can do this with ext3/ext4 today, now. Just take a look at e2croncheck in the contrib directory of e2fsprogs. Changing it to not Hmm, why are you running on battery so often? I make a point of running connected to the AC mains whenever possible, because a LiOn battery only has about 200 full-cycle charge/discharges in it, and given the cost of LiOn batteries, basically each charge/discharge cycle costs a dollar each. So I only run on batteries when I absolutely have to, and in practice it's rare that I dip below 30% or So e2fsck would fix the cross-linking. We do need to have some better tools to do forced rewrite of sectors that have gone bad in a HDD. It can be done by using badblocks -n, but translating the sector number emitted by the device driver (which for some drivers is relative to the beginning of the partition, and for others is relative to the beginning of the disk). It is possible to run badblocks -w on the whole disk, of course, but it's better to just run it on the specific Well, it actually is a problem. And there may be other problems hiding that you're not aware of. Running "badblocks -b 4096 -n" may discover other blocks that have failed, and you can then decide whether you want to let fsck fix things up. If you don't, though, it's probably not fair to blame ext3 or e2fsck for any future failures (not that it's likely to stop you :-). - Ted --
For people using e2croncheck, where you can check it when the system is idle and without needing to do a power cycle, I'd recommend once a Some distributions will allow you to cancel an fsck; either by using ^C, or hitting escape. That's a matter for the boot scripts, which are distribution specific. Ubuntu has a way of doing this, for example, if I recall correctly --- although since I've started using e2croncheck, I've never had an issue with an e2fsck taking place on bootup. Also, with ext4, fscks are so much faster that even before I upgraded to using an SSD, it's never been an issue for me. It's Complain to your distribution. :-) Or this is Linux and open source; fix it yourself, and submit the patches back to your distribution. If all you want to do is whine, then maybe Rob's choice is the best way, go switch to the velvet-lined closed system/jail which is the Macintosh. :-) (I created e2croncheck to solve my problem; if that isn't good enough for you, I encourage you to find/create your own fixes.) - Ted --
Frequently they are exactly the same drives, with exactly the same firmware. You disable the write caches on the drives themselves, but you add a large It depends on what raid array you use, some use SATA, some use SAS/SCSI David Lang --
I was thinking about that as well. Having us disable the write cache when we know it is not supported (like in the MD5 case) would certainly be *much* safer for almost everyone. We would need to have a way to override the write cache disabling for people who either know that they have a non-volatile write cache (unlikely as it would probably be to put MD5 on top of a hardware RAID/external array, but some of the new SSD's claim to have non-volatile write cache). It would also be very useful to have all of our top tier file systems enable barriers by default, provide consistent barrier on/off mount options and log a nice warning when not enabled.... ric --
I've done this when the hardware raid only supported raid 5 but I wanted raid 6. I've also done it when I had enough disks to need more than one hardware raid card to talk to them all, but wanted one logical drive for most people are not willing to live with unbuffered write performance. They care about their data, but they also care about performance, and since performance is what they see on an ongoing basis, they tend to care more about performance. Given that we don't even have barriers enabled by default on ext3 due to the performance hit, what makes you think that disabling buffers entirely is going to be acceptable to people? David Lang --
We do (and have for a number of years) enable barriers by default for XFS and reiserfs. In SLES, ext3 has default barriers as well. Ric --
I'm not sure what you mean with unbuffered write support, the only common use of that term is for userspace I/O using the read/write system calls directly in comparison to buffered I/O which uses the stdio library. But rest assured that the use of barriers and cache flushes in fsync does not completely disable caching (or "buffering"), it just flushes the disk write cache in case we either commit a log buffer that needs to be on disk, or perform an fsync where we really do want to have data on disk instead of lying to the application about the status of the I/O completion. Which btw could be interpreted as a violation of the Posix rules. --
as I understood it, the proposal that I responded to was to change the kernel to detect if barriers are enabled for the entire stack or not, and if not disable the write caches on the drives. there are definitely times when that is the correct thing to do, but I am not sure that it is the correct thing to do by default. David Lang --
If you are using one with journal, you'll still need to run fsck at boot time, to make sure metadata is still consistent... Protection provided by journaling is not effective in this configuration. (You have the point that pretty much all users of the blockdevice will be affected by powerfail degraded mode.) Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
But we do; competently designed (and in the case of software RAID, competently packaged) RAID subsystems send notifications to the system administrator when there is a hard drive failure. Some hardware RAID systems will send a page to the system administrator. A mid-range Areca card has a separate ethernet port so it can send e-mail to the administrator, even if the OS is hosed for some reason. And it's not a matter of journalling being ineffective; the much bigger deal is, "your data is at risk"; perhaps because the file system metadata may become subject to corruption, but more critically, because the file data may become subject to corruption. Metadata becoming subject to corruption is important primarily because it leads to data becoming corrupted; metadata is the tail; the user's data is the dog. So we *do* have the warning light; the problem is that just as some people may not realize that "check brakes" means, "YOU COULD DIE", some people may not realize that "hard drive failure; RAID array degraded" could mean, "YOU COULD LOSE DATA". Fortunately, for software RAID, this is easily solved; if you are so concerned, why don't you submit a patch to mdadm adjusting the e-mail sent to the system administrator when the array is in a degraded state, such that it states, "YOU COULD LOSE DATA". I would gently suggest to you this would be ***far*** more effective than a patch to kernel documentation. - Ted --
In the case of a degraded array, could the kernel be more proactive (or maybe even mdadm) and have the filesystem remount itself withOUT journalling enabled? This seems on the surface to be possible, but I don't know the internal particulars that might prevent/allow it. --
This is a misconception - with or without journalling, you are open to a second failure during a RAID rebuild. Also note that by default, ext3 does not mount with barriers turned on. Even if you mount with barriers, MD5 does not handle barriers, so you stand to lose a lot of data if you have a power outage. Ric --
On 2009-08-31 13:01, Ric Wheeler wrote: Pardon me for asking such a seemingly obvious question, but what (besides "Message-Digest algorithm 5") is MD5? (I've always seen "multiple drive" written in the lower case "md".) -- Brawndo's got what plants crave. It's got electrolytes! --
also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.08.31.1949 ...]: Why would I want to disable the filesystem journal in that case? -- `. `'` http://people.debian.org/~madduck http://vcs-pkg.org `- Debian - when you have better things to do than fixing systems "i can stand brute force, but brute reason is quite unbearable. there is something unfair about its use. it is hitting below the intellect." -- oscar wilde
I misspoke w.r.t journalling, the idea I was trying to get across was to remount with -o sync while running on a degraded array, but given some of the other comments in this thread I'm not even sure that would help. the idea was to make writes as safe as possible (at the cost of speed) when running on a degraded array, and to have the transition be as hands-free as possible, just have the kernel (or mdadm) by default remount. --
Much better, I'd think, to "just" have it scream out DANGER!! WILL ROBINSON!! DANGER!! to syslog and to an email hook. -- Brawndo's got what plants crave. It's got electrolytes! --
also sprach Jesse Brandeburg <jesse.brandeburg@gmail.com> [2009.09.01.0026 ...]: I don't see how that is any more necessary with a degraded array than it is when you have a fully working array. Sync just ensures that the data are written and not cached, but that has absolutely nothing to do with the underlying storage. Or am I failing to see the link? -- `. `'` http://people.debian.org/~madduck http://vcs-pkg.org `- Debian - when you have better things to do than fixing systems "how do you feel about women's rights?" "i like either side of them." -- groucho marx
Well, my MMC/uSD cards do not have ethernet ports to remind me that they are unreliable :-(. Pavel --
In RAID 1 mode, it should read both copies and error out on mismatch. 8-) -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 --
Despite your smiley: no it shouldn't, and no one is making any claims about raid1 being unsafe, only raid4/5/6. NeilBrown --
substitute 'degraded MD RAID 5' for 'MD RAID 5' and you have a point here. although the language you are using is pretty harsh. you make it sound like this is a problem with ext3 when the filesystem has nothing to do with it. the problem is that a degraded raid 5 array can be corrupted by by the way, while you are thinking about failures that can happen from a failed write corrupting additional blocks, think about the nightmare that can happen if those blocks are in the journal. the 'repair' of ext2 by a fsck is actually much less than you are thinking that it is. David Lang --
It seems that you are really hung up on whether or not the filesystem metadata is consistent after a power failure, when I'd argue that storage devices that don't have good powerfail properties have much bigger problems (such as the potential for silent data corruption, or even if fsck will fix a trashed inode table with ext2, massive data loss). So instead of your suggested patch, it might be better simply to have a file in Documentation/filesystems that states something along the lines of: "There are storage devices that have highly undesirable properties when they are disconnected or suffer power failures while writes are in progress; such devices include flash devices and software RAID 5/6 arrays without journals, as well as hardware RAID 5/6 devices without battery backups. These devices have the property of potentially corrupting blocks being written at the time of the power failure, and worse yet, amplifying the region where blocks are corrupted such that adjacent sectors are also damaged during the power failure. Users who use such storage devices are well advised to take countermeasures, such as the use of Uninterruptible Power Supplies, and making sure the flash device is not hot-unplugged while the device is being used. Regular backups when using these devices are also a Very Good Idea. Otherwise, file systems placed on these devices can suffer silent data and file system corruption. A forced use of fsck may detect metadata corruption resulting in file system corruption, but will not suffice to detect data corruption." My big complaint is that you seem to think that ext3 somehow let you down, but I'd argue that the real issue is that the storage device let you down. Any journaling filesystem will have the properties that you seem to be complaining about, so the fact that your patch only documents this as assumptions made by ext2 and ext3 is unfair; it also applies to xfs, jfs, reiserfs, reiser4, etc.
Further more, most users are even ...
In FTL case, damaged sectors are not necessarily adjacent. Otherwise Ok, would you be against adding: "Running non-journalled filesystem on these may be desirable, as Yes, it applies to all journalling filesystems; it is just that I was clever/paranoid enough to avoid anything non-ext3. ext3 docs still says: # The journal supports the transactions start and stop, and in case of a # crash, the journal can replay the transactions to quickly put the Ok, works for me.
---
From: Theodore Tso <tytso@mit.edu>
Document that many devices are too broken for filesystems to protect data in case of powerfail.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
diff --git a/Documentation/filesystems/dangers.txt b/Documentation/filesystems/dangers.txt
new file mode 100644
index 0000000..e1a46dd
--- /dev/null
+++ b/Documentation/filesystems/dangers.txt
@@ -0,0 +1,19 @@
+There are storage devices that have highly undesirable properties
+when they are disconnected or suffer power failures while writes are
+in progress; such devices include flash devices and software RAID 5/6
+arrays without journals, as well as hardware RAID 5/6 devices without
+battery backups. These devices have the property of potentially
+corrupting blocks being written at the time of the power failure, and
+worse yet, amplifying the region where blocks are corrupted such that
+additional sectors are also damaged during the power failure.
+
+Users who use such storage devices are well advised to take
+countermeasures, such as the use of Uninterruptible Power Supplies,
+and making sure the flash device is not hot-unplugged while the device
+is being used. Regular backups when using these devices are also a
+Very Good Idea.
+
+Otherwise, file systems placed on these devices can suffer silent data
+and file system corruption. A forced use of fsck may detect metadata
+corruption resulting in file system corruption, but will not suffice
+to detect data corruption.
\ No newline at end of ...
is it under all conditions, or only when you have already lost redundancy? prior discussions make me think this was only if the redundancy is already lost. also, the talk about software RAID 5/6 arrays without journals will be confusing (after all, if you are using ext3/XFS/etc you are using a journal, aren't you?) you then go on to talk about hardware raid 5/6 without battery backup. I think that you are being too specific here. any array without battery backup can lead to 'interesting' situations when you lose power. in addition, even with a single drive you will lose some data on power loss (unless you do sync mounts with disabled write caches), full data journaling can help protect you from this, but the default journaling just protects the metadata. David Lang --
I'm not so sure now. Let's say you are writing to the (healthy) RAID5 and have a powerfail. So now data blocks do not correspond to the parity block. You don't yet have the corruption, but you already have a problem. Slightly confusing, yes. Should I just say "MD RAID 5" and avoid talking about hardware RAID arrays, where that's really "Data loss" here means "damaging data that were already fsynced". That will not happen on single disk (with barriers on etc), but will happen on RAID5 and flash. Pavel --
you need to, otherwise you are claiming that all linux software raid implementations will lose data on powerfail, which I don't think is the it's the same combination of problems (non-redundant array and write lost to powerfail/reboot), just in a different order. recommending a scrub of the raid after an unclean shutdown would make sense, along with a warning that if you lose all redundancy before the scrub is completed and there was a write failure in the unscrubbed portion what about dm raid? this definition of data loss wasn't clear prior to this. you need to define this, and state that the reason that flash and raid arrays can suffer from this is that both of them deal with blocks of storage larger than the data block (eraseblock or raid stripe) and there are conditions that can cause the loss of the entire eraseblock or raid stripe which can affect data that was previously safe on disk (and if power had been lost before the latest write, the prior data would still be safe) note that this doesn't necessarily affect all flash disks. if the disk doesn't replace the old block in the FTL until the data has all been successfully copied to the new eraseblock you don't have this problem. some (possibly all) cheap thumb drives don't do this, but I would expect the expensive SATA SSDs to do things in the right order. do this right and you are properly documenting a failure mode that most people don't understand, but go too far and you are crying wolf. David Lang --
I actually think it was. write() syscall does not guarantee anything,
I'd expect SATA SSDs to have that solved, yes. Again, Ted does not say
Ok, latest version is below, can you suggest improvements? (And yes,
details when exactly RAID-5 misbehaves should be noted somewhere. I
don't know enough about RAID arrays, can someone help?)
Pavel
---
There are storage devices that have highly undesirable properties
when they are disconnected or suffer power failures while writes are
in progress; such devices include flash devices and MD RAID 4/5/6
arrays. These devices have the property of potentially
corrupting blocks being written at the time of the power failure, and
worse yet, amplifying the region where blocks are corrupted such that
additional sectors are also damaged during the power failure.
Users who use such storage devices are well advised to take
countermeasures, such as the use of Uninterruptible Power Supplies,
and making sure the flash device is not hot-unplugged while the device
is being used. Regular backups when using these devices are also a
Very Good Idea.
Otherwise, file systems placed on these devices can suffer silent data
and file system corruption. A forced use of fsck may detect metadata
corruption resulting in file system corruption, but will not suffice
to detect data corruption.
--
I would strike the entire mention of MD devices since it is your assertion, not a proven fact. You will cause more data loss from common events (single sector errors, complete drive failure) by steering people away from more reliable storage configurations because of a really rare edge case (power failure during All users who care about data integrity - including those who do not use MD5 but This is very misleading. All storage "can" have silent data loss, you are making a statement without specifics about frequency. FSCK can repair the file system metadata, but will not detect any data loss or corruption in the data blocks allocated to user files. To detect data loss properly, you need to checksum (or digitally sign) all objects stored in a file system and verify them on a regular basis. It also helps to keep a separate list of those objects on another device so that when the metadata does take a hit, you can enumerate your objects and verify that you have not lost anything. ric --
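The checksum-and-verify approach described in the message above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function names (sha256_of, build_manifest, verify_manifest) are hypothetical, SHA-256 is an arbitrary choice of checksum, and the manifest is assumed to be stored on a different device than the data it describes.

```python
import hashlib
import os


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Map each file path (relative to root) to its checksum.

    The result is meant to be saved on *another* device, so the list of
    objects survives even if this file system's metadata is damaged.
    """
    manifest = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            manifest[os.path.relpath(full, root)] = sha256_of(full)
    return manifest


def verify_manifest(root, manifest):
    """Return {relative path: problem} for every lost or altered object."""
    problems = {}
    for rel, expected in manifest.items():
        full = os.path.join(root, rel)
        if not os.path.isfile(full):
            problems[rel] = "missing"
        elif sha256_of(full) != expected:
            problems[rel] = "checksum mismatch"
    return problems
```

Run periodically, verify_manifest() catches both cases mentioned above: silently corrupted file data (mismatch) and objects that vanished after a metadata repair (missing), neither of which fsck reports.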
That actually is a fact. That's how MD RAID 5 is designed. And btw I'm not sure what's rare about power failures. Unlike single sector errors, my machine actually has a button that produces exactly that event. Running degraded raid5 arrays for extended periods may be slightly unusual configuration, but I suspect people should just do that for testing. (And from the discussion, people seem to think that substitute with "can (by design)"? Now, if you can suggest useful version of that document meeting your criteria? Pavel --
So what? He clearly knows how it works. Instead of arguing he's wrong, will you simply label everything as Look, I don't need full drive failure for this to happen. I can just remove one disk from array. I don't need power failure, I can just press the power button. I don't even need to rebuild anything, I can just write to degraded array. Given that all events are under my control, statistics make little sense here. Pavel --
if you are intentionally causing several low-probability things to happen at once you increase the risk of corruption note that you also need a write to take place, and be interrupted in just the right way. David Lang --
You are deliberately causing a double failure - pressing the power button after pulling a drive is exactly that scenario. Pull your single (non-MD5) disk out while writing (hot unplug from the S-ATA side, leaving power on) and run some tests to verify your assertions... ric --
Exactly. And now I'm trying to get that documented, so that people I actually did that some time ago with pulling SATA disk (I actually pulled both SATA *and* power -- that was the way hotplug envelope worked; that's more harsh test than what you suggest, so that should be ok). Write test was fsync heavy, with logging to separate drive, checking that all the data where fsync succeeded are indeed accessible. I uncovered few bugs in ext* that jack fixed, I uncovered some libata weirdness that is not yet fixed AFAIK, but with all the patches applied I could not break that single SATA disk. Pavel --
The problem I have is that the way you word it steers people away from RAID5 and better data integrity. Your intentions are good, but your text is going to do considerable harm. Most people don't intentionally drop power (or have a power failure) during RAID Fsync heavy workloads with working barriers will tend to keep the write cache pretty empty (two barrier flushes per fsync) so this is not too surprising. Drive behaviour depends on a lot of things though - how the firmware prioritizes writes over reads, etc. ric --
The example I have seen went like this: a drive in raid 5 failed; a hot spare was available (no idea about UPS). System apparently locked up trying to talk to the failed drive, or maybe admin just was not patient enough, so he just powercycled the array. He lost the array. So while most people will not aggressively powercycle the RAID array, drive failure still provokes little-tested error paths, and getting unclean shutdown is quite easy in such case. Pavel --
Then what we need to document is do not power cycle an array during a rebuild, right? If it wasn't the admin that timed out and the box really was hung (no drive activity lights, etc), you will need to power cycle/reboot but then you should not have this active rebuild issuing writes either... In the end, there are cascading failures that will defeat any data protection scheme, but that does not mean that the value of that scheme is zero. We need to get more people to use RAID (including MD5) and try to enhance it as we go. Just using a single disk is not a good thing... Ric --
Well, the software RAID layer could be improved so that it implements scrubbing by default (i.e., have the md package install a cron job to implement a periodic scrub pass automatically). The MD code could also regularly check to make sure the hot spare is OK; the other possibility is that hot spare, which hadn't been used in a long time, Yep; the solution is to improve the storage devices. It is *not* to encourage people to think RAID is not worth it, or that somehow ext2 is better than ext3 because it runs fsck's all the time at boot up. That's just crazy talk. - Ted --
Actually, MD does this scan already (not automatically, but you can set up a simple cron job to kick off a periodic "check"). It is a delicate balance to get the frequency of the scrubbing correct. On one hand, you want to make sure that you detect errors in a timely fashion, certainly detection of single sector errors before you might develop a second sector level error on another drive. On the other hand, running scans/scrubs continually impacts the performance of your real workload and can potentially impact your components' life span by subjecting them to a heavy workload. The rule of thumb, from my experience, is that most people settle in with a scan Agreed.... ric --
debian defaults to doing this once a month (first sunday of each month), on some of my systems this scrub takes almost a week to complete. David Lang --
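For readers who want to see what such a periodic "check" amounts to, here is a hedged Python sketch. It assumes the md sysfs files sync_action and mismatch_cnt under /sys/block/<array>/md/, as described in the kernel's md documentation; the function names and the array name md0 used in the note below are hypothetical, and a real cron job would more likely use the distribution's own script.

```python
import os


def request_md_check(md_name, sysfs_root="/sys/block"):
    """Ask the md driver to start a read-only "check" pass on one array.

    Assumes the md sysfs layout <sysfs_root>/<md_name>/md/sync_action.
    Needs root privileges on a real system.
    """
    action_path = os.path.join(sysfs_root, md_name, "md", "sync_action")
    with open(action_path, "w") as action:
        action.write("check\n")
    return action_path


def read_mismatch_count(md_name, sysfs_root="/sys/block"):
    """Read the mismatch counter that the last check/repair pass reported."""
    path = os.path.join(sysfs_root, md_name, "md", "mismatch_cnt")
    with open(path) as counter:
        return int(counter.read().strip())
```

A monthly cron entry could call request_md_check("md0") and a later job could alert if read_mismatch_count("md0") is non-zero; how often to run it is exactly the performance-versus-detection trade-off discussed above.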
Ok, I guess you are right here. Pavel --
I recommend a sledgehammer. If you want to lose your data, you might as well have some fun. No need to bore yourself to tears by simulating events that are unlikely to happen simultaneously to careful system administrators. -- All rights reversed. --
Sledgehammer is hardware problem, and I'm demonstrating software/documentation problem we have here. Pavel --
So your argument is that a sledgehammer is a hardware problem, while a broken hard disk and a power failure are software/documentation issues? I'd argue that the broken hard disk and power failure are hardware issues, too. -- All rights reversed. --
No one told me that degraded md raid5 is dangerous. That's documentation issue #1. Maybe I just pulled the disk for fun. ext3 docs told me that journal protects me against fs corruption during power fails. It does not in this particular case. Seems like docs issue #2. Maybe I just hit the reset button because it was there. Randomly hitting power button may be stupid, but should not result in filesystem corruption on reasonably working filesystem/storage stack. Pavel --
You're kidding, right? Or are you being too effectively sarcastic? -- Obsession with "preserving cultural heritage" is a racist impediment to moral, physical and intellectual progress. --
No he is not... and that is exactly why Ted and Ric have been fighting so hard against his scare the children documentation. In 20 years, I have not found a way to educate those who think "I know computers so it must work the way I want and expect." Tremendous amounts of information and recommendations are out there on the web, in books, classes, etc. But people don't research before using or understand before they have a problem. Pavel, *THE KERNEL IS NOT BUGGY* end of story! Everyone experienced in storage understands the "in the edge case that Pavel hit, you will lose your data", and we take our responsibility to tell people what works and does not work very seriously. And we try very hard to reduce the amount of edge case data losses. But as Ric and Ted and many others keep trying to explain:
- There is no such thing as "never fails" data storage.
- The goal of journal file systems is not what you think.
- The goal of raid is not what you think.
- We do not want the vast majority of computer users who are not kernel engineers to stop using the technology that in 99.99 percent of the use cases keeps their data as safe as we can reasonably make it, just because they read Pavel's 0.01 percent scary and inaccurate case.
And the worst part is this 0.01 percent case problem is really "I did not know what I was doing". jim --
change this to say 'degraded MD RAID 4/5/6 arrays' also find out if DM RAID 4/5/6 arrays suffer the same problem (I strongly suspect that they do) then you need to add a note that if the array becomes degraded before a scrub cycle happens previously hidden damage (that would have been re-word this something like In addition to the standard risk of corrupting the blocks being written at the time of the power failure, additional blocks (in the same flash David Lang --
I'd prefer not to talk about scrubbing and such details here. Better Actually I don't think so. I believe SATA disks do not corrupt even the sector they are writing to -- they just have big enough capacitors. And yes I believe ext3 depends on that. Pavel --
I disagree with that, the way you are wording this makes it sound as if raid isn't worth it. if you are going to say that raid is risky you need you are incorrect on this. ext3 (like every other filesystem) just accepts the risk (zfs makes some attempt to detect such corruption) David Lang --
Ok, would this help? I don't really want to go to scrubbing details. (*) Degraded array or single disk failure "near" the powerfail is I'd like Ted to comment on this. He wrote the original document, and I'd prefer not to introduce mistakes. Pavel --
Then you should punt the MD discussion to the MD documentation entirely. I would suggest: "Users of any file system that have a single media (SSD, flash or normal disk) can suffer from catastrophic and complete data loss if that single media fails. To reduce your exposure to data loss after a single point of failure, consider using either hardware or properly configured software RAID. See the documentation on MD RAID for how to configure it. To ensure proper fsync() semantics, you will need to have a storage device that supports write barriers or have a non-volatile write cache. If not, best Pavel, no S-ATA drive has capacitors to hold up during a power failure (or even enough power to destage their write cache). I know this from direct, personal knowledge having built RAID boxes at EMC for years. In fact, almost all RAID boxes require that the write cache be hardwired to off when used in their arrays. Drives fail partially on a very common basis - look at your remapped sector count with smartctl. RAID (including MD RAID5) will protect you from this most common error as it will protect you from complete drive failure which is also an extremely common event. Your scenario is really, really rare - doing a full rebuild after a complete drive failure (takes a matter of hours, depends on the size of the disk) and having a power failure during that rebuild. Of course adding a UPS to any storage system (including MD RAID system) helps make it more reliable, specifically in your scenario. The more important point is that having any RAID (MD1, MD5 or MD6) will greatly reduce your chance of data loss if configured correctly. With ext3, ext2 or zfs. Ric --
I never claimed they have enough power to flush entire cache -- read the paragraph again. I do believe the disks have enough capacitors to finish writing single sector, and I do believe ext3 depends on that. Pavel --
Some scary terms that drive people mention (and measure): "high fly writes" "over powered seeks" "adjacent track erasure" If you do get a partial track written, the data integrity bits that the data is embedded in will flag it as invalid and give you an IO error on the next read. Note that the damage is not persistent, it will get repaired (in place) on the next write to that sector. Also it is worth noting that ext2/3/4 write file system "blocks" not single sectors. Each ext3 IO is 8 distinct disk sector writes and those can span tracks on a drive which require a seek which all consume power. On power loss, a disk will immediately park the heads... ric --
keep in mind that in a powerfail situation the data being sent to the drive may be corrupt (the ram gets flaky while a DMA to the drive copies the bad data to the drive, which writes it before the power loss gets bad enough for the drive to decide there is a problem and shutdown) you just plain cannot count on writes that are in flight when a powerfail happens to do predictable things, let alone what you consider sane or proper. David Lang --
From what I see, this kind of failure is rather harder to reproduce than the software problems. And at least SGI machines were designed to avoid this... Anyway, I'd like to hear from ext3 people... what happens on read errors in journal? That's what you'd expect to see in situation above. Pavel --
On a power failure, what normally happens is that the random garbage
gets written into the disk drive's last dying gasp, since the memory
starts going insane and sends garbage to the disk. So the disk
successfully completes the write, but the sector contains garbage.
Since HDD's tend to be last thing to die, being less sensitive to
voltage drops than the memory or DMA controller, my experience is that
you don't get a read error after the system comes up, you just get
garbage written into the journal.
The ext3 journalling code waits until all of the journal blocks are
written, and only then writes the commit block. On restart, we look
for the last valid commit block. So if the power failure is before we
write the commit block, we replay the journal up until the previous
commit block. If the power failure is while we are writing the commit
block, garbage will be written out instead of the commit block, and so
it falls back to the previous case.
We do not allow any updates to the filesystem metadata to take place
until the commit block has been written; therefore the filesystem
stays consistent.
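The replay rule described above can be sketched as a toy model in Python. This is not ext3's or jbd's on-disk format: the record shapes ("data", block_no, payload) and ("commit", txid), the function name replay_journal, and the dict standing in for the disk are all assumptions made for illustration.

```python
def replay_journal(journal, disk):
    """Replay a toy journal onto a copy of `disk` (dict: block_no -> bytes).

    Blocks of a transaction are applied only once its commit record has
    been seen. Anything that is not a well-formed record (for example
    garbage written during a power failure) ends the scan, and the
    blocks of the unfinished transaction are thrown away.
    """
    replayed = dict(disk)
    pending = {}
    for record in journal:
        is_tuple = isinstance(record, tuple)
        if is_tuple and len(record) == 3 and record[0] == "data":
            _, block_no, payload = record
            pending[block_no] = payload
        elif is_tuple and len(record) == 2 and record[0] == "commit":
            # Commit block found: the whole transaction becomes visible.
            replayed.update(pending)
            pending = {}
        else:
            # Garbage where a record should be: stop at the previous commit.
            break
    return replayed
```

Whether the power fails before the commit block is written or while it is being written, the model ends up in the same place: the state as of the last valid commit record.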
If the journal *does* develop read errors, then fsck will
require a manual fsck, and so the boot operation will get stopped so a
system administrator can provide manual intervention. The best bet
for the sysadmin is to replay as much of the journal as she can, and then
let fsck fix any resulting filesystem inconsistencies. In practice,
though, I've not experienced or seen any reports of this happening
from a power failure; usually it happens if the laptop gets dropped or
the hard drive suffers some other kind of hardware failure.
- Ted
--
...and that should result in consistent fs with no data loss, because read error is essentially the same as garbage given back, right? ...plus, this is significant difference from logical-logging filesystems, no? Should this go to Documentation/, somewhere? Pavel --
Not necessarily. Say you wrote out the entire stripe in a 5 disk RAID 5 array, but only 3 data blocks and the parity block got written out before power failure. If the disk with the 4th (unwritten) data block were to fail and get taken out of the RAID 5 array, the degradation of the array could actually undo your data corruption. With RAID 5 and incomplete writes, you just don't know. This kind of thing could go wrong at any level in the system, with any kind of RAID 5 setup. Of course, on a single disk system without RAID you can still get incomplete writes, for the exact same reasons. RAID 5 does not make things worse. It will protect your data against certain failure modes, but not against others. With or without RAID, you still need to make backups. -- All rights reversed. --
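The 5-disk example in the message above can be worked through with XOR parity in a few lines of Python. This is a sketch of the arithmetic only, assuming one stripe of four data blocks plus one parity block; the helper names xor_blocks and reconstruct are hypothetical and no real RAID implementation is being modelled.

```python
from functools import reduce


def xor_blocks(*blocks):
    """Byte-wise XOR of equally sized blocks (RAID 5 style parity)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def reconstruct(stripe, missing_index):
    """Rebuild the block at missing_index from all other blocks in the stripe."""
    survivors = [blk for i, blk in enumerate(stripe) if i != missing_index]
    return xor_blocks(*survivors)


# Full-stripe rewrite interrupted by a power failure: three new data blocks
# and the new parity reached the disks, the fourth data block is still old.
new_data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
stale_fourth = b"dddd"
new_parity = xor_blocks(*new_data)
on_disk = [new_data[0], new_data[1], new_data[2], stale_fourth, new_parity]

# Losing the disk holding the stale block "undoes" the damage: the rebuild
# yields the new contents that never made it to that disk.
rebuilt_fourth = reconstruct(on_disk, 3)

# Losing a disk whose block was written correctly is the bad case: the
# rebuild mixes in the stale block and returns something else entirely.
rebuilt_first = reconstruct(on_disk, 0)
```

Here rebuilt_fourth equals new_data[3], while rebuilt_first differs from new_data[0], which is the "you just don't know" outcome: the same inconsistent stripe can heal or damage data depending on which disk drops out.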
Document things ext2 expects from storage filesystems, and the fact that it can not handle barriers. Also remove journaling description, as that's really ext3 material.
Signed-off-by: Pavel Machek <pavel@ucw.cz>
diff --git a/Documentation/filesystems/ext2.txt b/Documentation/filesystems/ext2.txt
index 67639f9..e300ca8 100644
--- a/Documentation/filesystems/ext2.txt
+++ b/Documentation/filesystems/ext2.txt
@@ -338,27 +339,17 @@ enough 4-character names to make up unique directory entries, so they have to be 8 character filenames, even then we are fairly close to running out of unique filenames.
+Requirements
+============
+
+Ext2 expects disk/storage subsystem not to return write errors.
+
+It also needs write caching to be disabled for reliable fsync
+operation; ext2 does not know how to issue barriers as of
+2.6.31. hdparm -W0 disables it on SATA disks.
+
 Journaling
-----------
-
-A journaling extension to the ext2 code has been developed by Stephen
-Tweedie. It avoids the risks of metadata corruption and the need to
-wait for e2fsck to complete after a crash, without requiring a change
-to the on-disk ext2 layout. In a nutshell, the journal is a regular
-file which stores whole metadata (and optionally data) blocks that have
-been modified, prior to writing them into the filesystem. This means
-it is possible to add a journal to an existing ext2 filesystem without
-the need for data conversion.
-
-When changes to the filesystem (e.g. a file is renamed) they are stored in
-a transaction in the journal and can either be complete or incomplete at
-the time of a crash. If a transaction is complete at the time of a crash
-(or in the normal case where the system does not crash), then any blocks
-in that transaction are guaranteed to represent a valid filesystem state,
-and are copied into the filesystem. If a transaction is incomplete at
-the time of the crash, then there is no guarantee of consistency for
-the blocks in that transaction so they are discarded ...
Suppose a small office makes nightly backups to an offsite server via rsync. If a thunderstorm goes by causing their system to reboot twice in a 15 minute period, would they rather notice the filesystem corruption immediately upon Yup. Hopefully btrfs will cope less badly? They keep talking about I doubt the cupholder crowd is going to stop treating USB sticks as magical any time soon, but I also wonder how many of them even remember Linux _exists_ Professionals have horror stories about this issue, therefore documenting it is _less_ important? Ok... Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
This just goes to show why having this "translation layer" done in firmware on the device itself is a _bad_ idea. We're much better off when we have full access to the underlying flash and the OS can actually see what's going on. That way, we can actually debug, fix and recover It's a known failure mode of _everything_ that uses flash to pretend to be a block device. As I see it, there are no SSD devices which don't lose data; there are only SSD devices which haven't lost your data _yet_. There's no fundamental reason why it should be this way; it just is. (I'm kind of hoping that the shiny new expensive ones that everyone's talking about right now, that I shouldn't really be slagging off, are actually OK. But they're still new, and I'm certainly not trusting them with my own data _quite_ yet.) -- dwmw2 --
so what sort of test would be needed to identify if a device has this problem? people can do ad-hoc tests by pulling the devices in use and then checking the entire device, but something better should be available. it seems to me that there are two things needed to define the tests.

1. a predictable write load so that it's easy to detect data getting lost
2. some statistical analysis to decide how many device pulls are needed (under the write load defined in #1) to make the odds high that the problem will be revealed.

with this we could have people test various devices and report if the test detects unrelated data being lost (or businesses, and I think the tech hardware sites would jump into this given some sort of accepted test) for USB devices there may be a way to use the power management functions to cut power to the device without requiring it to physically be pulled, if this is the case (even if this only works on some specific chipsets), it would drastically speed up the testing David Lang --
This is really so easy to reproduce, that such speedup is not necessary. Just try the scripts :-). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
so if it doesn't get corrupted after 5 unplugs does that mean that that particular device doesn't have a problem? or does it just mean you got lucky? would 10 successful unplugs mean that it's safe? what about 20? we need to get this beyond anecdotal evidence mode, to something that (even if not perfect, as you can get 100 'heads' in a row with an honest coin) gives you pretty good assurances that a particular device is either good or bad. David Lang --
I'd say 20 means it's safe. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
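How much assurance a run of clean unplugs gives depends on how often a bad device actually corrupts data per unplug. As a rough sketch, assuming each unplug under the write load corrupts data independently with probability p (an assumption, since nobody in the thread measured p), a bad device survives n unplugs unnoticed with probability (1 - p)^n. The function names below are made up for this illustration.

```python
import math

def clean_run_probability(p_corrupt, unplugs):
    """Chance that a bad device shows no corruption in `unplugs` tries,
    assuming independent unplugs that each corrupt with p_corrupt."""
    return (1.0 - p_corrupt) ** unplugs

def unplugs_needed(p_corrupt, miss_chance):
    """Smallest number of unplugs so that a bad device escapes detection
    with probability at most miss_chance."""
    return math.ceil(math.log(miss_chance) / math.log(1.0 - p_corrupt))

# A device that corrupts data on half of all unplugs is caught quickly...
pulls_for_coin_flip_device = unplugs_needed(0.5, 0.05)
# ...but one that corrupts on 1 unplug in 10 needs a longer test.
pulls_for_rare_failure = unplugs_needed(0.1, 0.05)
# Chance that such a 1-in-10 device still looks clean after 20 unplugs.
miss_after_20 = clean_run_probability(0.1, 20)
```

Under these assumptions, 20 clean unplugs are convincing for a device that fails often, and much less so for one that fails rarely.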
I agree it should be documented, but the ext3 atomicity issue is only an issue on unexpected shutdown while the array is degraded. I surely hope most people running raid5 are not seeing that level of unexpected shutdown, let alone in a degraded array. If they are, the atomicity issue pretty strongly says they should not be using raid5 in that environment. At least not for any filesystem I know. Having writes to LBA n corrupt LBA n+128 as an example is pretty hard to design around from a fs perspective. Greg --
Right now, people think that a degraded raid 5 is equivalent to raid 0. As this thread demonstrates, in the power failure case it's _worse_, due to write granularity being larger than the filesystem sector size. (Just like flash.) Knowing that, some people might choose to suspend writes to their raid until it's finished recovery. Perhaps they'll set up a system where a degraded raid 5 gets remounted read only until recovery completes, and then writes go to a new blank hot spare disk using all that volume snapshotting or unionfs stuff people have been working on. (The big boys already have hot spare disks standing by on a lot of these systems, ready to power up and go without human intervention. Needing two for actual reliability isn't that big a deal.) Or maybe the raid guys might want to tweak the recovery logic so it's not entirely linear, but instead prioritizes dirty pages over clean ones. So if somebody dirties a page halfway through a degraded raid 5, skip ahead to recover that chunk to the new disk first (yes leaving holes, it's not that hard to track), and _then_ let the write go through. But unless people know the issue exists, they won't even start thinking about Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
if you've got the drives available you should be running raid 6 not raid 5 so that you have to lose two drives before you lose your error checking. in my opinion that's a far better use of a drive than a hot spare. David Lang --
It's not quite that simple anymore. These days, most modern drives add an "overcoat", which is a vapor deposition layer of carbon (I.E. diamond) on top of the magnetic media, and then add a nanolayer of some kind of nonmagnetic lubricant on top of that. That protects the magnetic layer from physical contact with the head; it takes a pretty solid whack to chip through diamond and actually gouge your disk: http://www.datarecoverylink.com/understanding_magnetic_media.html You can also do fun things with various nitrides (carbon nitride, silicon nitride, titanium nitride) which are pretty darn tough too, although I dunno about their suitability to hard drives: http://www.physical-vapor-deposition.com/ So while it _is_ possible to whack your drive and scratch the platter, merely "touching" won't do it. (Laptops wouldn't be feasible if they couldn't cope with a little jostling while running.) In the case of repeated small whacks, your heads may actually go first. (I vaguely recall the little aerofoil wing thingy holding up the disk touches first, and can get ground down by repeated contact with the diamond layer (despite the lubricant, that just buys time) so it gets shorter and shorter and can't reliably keep the head above the disk rather than in contact with it. But I'm kind of stale myself here, not sure that's still current.) Here's a nice youtube video of a 2007 defcon talk from a hard drive recovery professional, "What's that Clicking Noise", series starts here: http://www.youtube.com/watch?v=vCapEFNZAJ0 And here's that guy's web page: http://www.myharddrivedied.com/presentations/index.html Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
Hmm. What does "not being able to handle failed writes" actually mean? AFAICS, there are two possible answers: "all bets are off", or Right. And a lot of database systems make the same assumption. Oracle Berkeley DB cannot deal with partial page writes at all, and PostgreSQL assumes that it's safe to flip a few bits in a sector without proper WAL (it doesn't care if the changes actually hit the disk, but the write shouldn't make the sector unreadable or put random I think the general idea is to protect valuable data with WAL. You overwrite pages on disk only after you've made a backup copy into WAL. After a power loss event, you replay the log and overwrite all garbage that might be there. For the WAL, you rely on checksum and sequence numbers. This still doesn't help against write failures where the system continues running (because the fsync() during checkpointing isn't guaranteed to report errors), but it should deal with the power failure case. But this assumes that the file system protects its own data structure in a similar way. Is this really too much to demand? Partial failures are extremely difficult to deal with because of their asynchronous nature. I've come to accept that, but it's still disappointing. -- Florian Weimer <fweimer@bfk.de> BFK edv-consulting GmbH http://www.bfk.de/ Kriegsstraße 100 tel: +49-721-96201-1 D-76133 Karlsruhe fax: +49-721-96201-99 --
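The write-ahead-log idea described above (copy the page into the log first, then after power loss replay the log over whatever garbage is on disk, trusting only records whose checksum and sequence number check out) can be sketched as follows. The record layout and the `make_record` and `replay` names are assumptions made for this example; this is not how PostgreSQL or Berkeley DB lay out their logs.

```python
import struct
import zlib

# Assumed record header: sequence number, page number, payload length, CRC32.
HEADER = struct.Struct("<IIII")

def make_record(seq, page_no, payload):
    """Serialize one log record with a checksum over header fields and payload."""
    crc = zlib.crc32(struct.pack("<III", seq, page_no, len(payload)) + payload)
    return HEADER.pack(seq, page_no, len(payload), crc) + payload

def replay(log, pages, first_seq=0):
    """Apply intact records in order; stop at the first torn or out-of-order one.

    Returns the number of records applied."""
    offset, expected = 0, first_seq
    while offset + HEADER.size <= len(log):
        seq, page_no, length, crc = HEADER.unpack_from(log, offset)
        payload = log[offset + HEADER.size:offset + HEADER.size + length]
        body = struct.pack("<III", seq, page_no, length) + payload
        if seq != expected or len(payload) != length or zlib.crc32(body) != crc:
            break  # torn write or stale data: ignore the rest of the log
        pages[page_no] = payload  # overwrite whatever garbage is on "disk"
        offset += HEADER.size + length
        expected += 1
    return expected - first_seq

# Two complete records, then a third that power failure cut short.
log = make_record(0, 0, b"page0-v2") + make_record(1, 1, b"page1-v2")
log += make_record(2, 0, b"page0-v3")[:-3]
pages = {0: b"\x00garbage", 1: b"page1-v1"}
applied = replay(log, pages)
```

As the message notes, this only covers the power-failure case; it does nothing for write errors that occur while the system keeps running.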
So I got confused when I quoted your note, which I had assumed was
exactly what Pavel had written in his documentation. In fact, what he
had written was this:
+Don't damage the old data on a failed write (ATOMIC-WRITES)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Either whole sector is correctly written or nothing is written during
+powerfail.
+
+....
So he had explicitly stated that he only cared about the whole sector
being written (or not written) in the power fail case, and not any
other. I'd suggest changing ATOMIC-WRITES to
ATOMIC-WRITE-ON-POWERFAIL, since the one-line summary, "Don't damage
the old data on a failed write", is also singularly misleading.
- Ted
--
Ok, something like this?

Don't damage the old data on a powerfail (ATOMIC-WRITES-ON-POWERFAIL)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Either whole sector is correctly written or nothing is written during
powerfail.

Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Ok, I added "Not all filesystems require all of these to be satisfied for safe operation" sentence there. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Can someone clarify if this is true in raid-6 with just a single disk failure? I don't see why it would be. And if not, can the above text be changed to reflect that raid 4/5 with a single disk failure and raid 6 with a double disk failure are the modes that have atomicity problems? Greg --
I don't know enough about raid-6, but... I said "degraded mode" above, and you can read it as double failure in raid-6 case ;-). I'll prefer to avoid too many details here. -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Acked-by: Rob Landley <rob@landley.net> It's coming up on 2.6.31, has it learned anything since or should that version Possible rewording of this paragraph: Ext3 handles trash getting written into sectors during powerfail surprisingly well. It's not foolproof, but it is resilient. Incomplete journal entries are ignored, and journal replay of complete entries will often "repair" garbage written into the inode table. The data=journal option extends this behavior to file and directory data blocks as well (without which your dentries can still be badly corrupted by a power fail during a write). (I'm not entirely sure about that last bit, but clarifying it one way or the other would be nice because I can't tell from reading it which it is. My _guess_ is that directories are just treated as files with an attitude and an extra caching layer...?) Rob -- Latency is more important than throughput. It's that simple. - Linus Torvalds --
Thanks, applied, it looks better than what I wrote. I removed the () part, as I'm not sure about it... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
No, they did not. We were discussing how to be able to enable / disable sending barriers, someone said he'd implement it but it somehow never got beyond an initial attempt. Actually, after recent sync cleanups (and when my O_SYNC cleanups get merged) it should be pretty easy because every filesystem now has ->fsync() and ->sync_fs() callback so we just have to add sending barriers to these two functions and implement the possibility to set via sysfs that barriers on the block device should be ignored. I've put it to my todo list but if someone else has time for this, I certainly would not mind :). It would be a nice beginner project... Honza -- Jan Kara <jack@suse.cz> SUSE Labs, CR --
Updated version here. diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..710d119 --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,47 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly, because success +on fsync was already returned when data hit the journal. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Sector writes are atomic (ATOMIC-SECTORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Either whole sector is correctly written or nothing is written during +powerfail. + + Unfortunately, none of the cheap USB/SD flash cards I've seen + do behave like this, and are thus unsuitable for all Linux + filesystems I know. + + An inherent problem with using flash as a normal block + device is that the flash erase size is bigger than + most filesystem sector sizes. So when you request a + write, it may erase and rewrite some 64k, 128k, or + even a couple megabytes on the really _big_ ones. + + If you lose power in the middle of that, filesystem + won't notice that data in the "sectors" _around_ the + one your were trying to write to got trashed. + + Because RAM tends to fail faster than rest of system during + powerfail, special hw killing DMA transfers may be necessary; + otherwise, disks may write garbage during powerfail. + Not sure how common that problem is on generic PC machines. + + Note that atomic write is very hard to guarantee for ...
Some of what is here are bugs, and some are legitimate long-term interfaces (for example, the question of losing I/O errors when two processes are writing to the same file, or to a directory entry, and errors aren't or in some cases, can't, be reflected back to userspace). I'm a little concerned that some of this reads a bit too much like a rant (and I know Pavel was very frustrated when he tried to use a flash card with a sucky flash card socket) and it will get used the wrong way by apologists, because it mixes areas where "we suck, we should do better", which are bug reports, and "Posix or the underlying block device layer makes it hard", and simply states them as fundamental design requirements, when that's probably not true. There's a lot of work that we could do to make I/O errors get better reflected to userspace by fsync(). So stating things as bald requirements I think goes a little too far IMHO. We can surely do The last half of this sentence "because success on fsync was already returned when data hit the journal", obviously doesn't apply to all filesystems, since some filesystems, like ext2, don't journal data. Even for ext3, it only applies in the case of data=journal mode. There are other issues here, such as fsync() only reports an I/O problem to one caller, and in some cases I/O errors aren't propagated up the storage stack. The latter is clearly just a bug that should be fixed; the former is more of an interface limitation. But you don't talk about that in this section, and I think it would be good to have a more extended discussion about I/O errors when writing data blocks, The characteristic you describe here is not an issue about whether the whole sector is either written or nothing happens to the data --- but rather, or at least in addition to that, there is also the issue that when there is a flash card failure --- particularly one caused by a sucky flash card reader design causing the SD card to disconnect from the laptop in the middle of a ...
Well, I guess there's thin line between error and "legitimate long-term interfaces". I still believe that fsync() is broken by It started as a rant, obviously I'd like to get away from that and get it into suitable format for inclusion. (Not being native speaker does not help here). But I do believe that we should get this documented; many common storage subsystems are broken, and can cause data loss. We should at Well, I guess that can be refined later. Heck, I'm not able to tell which are simple bugs likely to be fixed soon, and which are fundamental issues that are unlikely to be fixed sooner than 2030. I guess it is fair to document them ASAP, and then fix those that can be If the fsync() can be fixed... that would be great. But I'm not sure ... Ok, added to ext3 specific section. New version is attached. Feel free to help here; my goal is to get this documented, I'm not particularly attached to wording etc... Signed-off-by: Pavel Machek <pavel@ucw.cz> Pavel diff --git a/Documentation/filesystems/expectations.txt b/Documentation/filesystems/expectations.txt new file mode 100644 index 0000000..0de456d --- /dev/null +++ b/Documentation/filesystems/expectations.txt @@ -0,0 +1,49 @@ +Linux block-backed filesystems can only work correctly when several +conditions are met in the block layer and below (disks, flash +cards). Some of them are obvious ("data on media should not change +randomly"), some are less so. + +Write errors not allowed (NO-WRITE-ERRORS) +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Writes to media never fail. Even if disk returns error condition +during write, filesystems can't handle that correctly. + + Fortunately writes failing are very uncommon on traditional + spinning disks, as they have spare sectors they use when write + fails. + +Don't cause collateral damage to adjacent sectors on a failed write ...
When you say Linux filesystems do you mean "filesystems originally designed on Linux" or do you mean "filesystems that Linux supports"? Additionally whatever the answer, people are going to need help answering the "which is the least bad?" question and saying what's not good without offering alternatives is only half helpful... People need to put SOMETHING on these cheap (and not quite so cheap) devices... The last recommendation I heard was that until btrfs/logfs/nilfs arrive people are best off sticking with FAT - http://marc.info/?l=linux-kernel&m=122398315223323&w=2 . Perhaps that The document makes it sound like nearly everything bar battery backed hardware RAIDed SCSI disks (with perfect firmware) is bad - is this the intent? -- Sitsofe | http://sucs.org/~sits/ --
Actually, the best filesystem for USB flash devices is probably UDF. (Yes, the DVD filesystem turns out to be writeable if you put it on a writeable media. The ISO spec requires write support, so any OS that supports DVDs also supports this.) The reasons for this are: A) It's the only filesystem other than FAT that's supported out of the box by windows, mac, _and_ Linux for hotpluggable media. B) It doesn't have the horrible limitations of FAT (such as a max filesize of 2 gigabytes). C) Microsoft doesn't claim to own it, and thus hasn't sued anybody over patents on it. However, when it comes to cutting the power on a mounted filesystem (either by yanking the device or powering off the machine) without losing your data, without warning, they all suck horribly. If you yank a USB flash disk in the middle of a write, and the device has decided to wipe a 2 megabyte erase sector that's behind a layer of wear levelling and thus consists of a series of random sectors scattered all over the disk, you're screwed no matter what filesystem you use. You know the vinyl "record scratch" sound? Imagine that, on a digital level. Bad Things SCSI disks? They still make those? Everything fails, it's just a question of how. Rotational media combined with journaling at least fails in fairly understandable ways, so ext3 on sata is reasonable. Flash gets into trouble when it presents the _interface_ of rotational media (a USB block device with normal 512 byte read/write sectors, which never wear out) which doesn't match what the hardware's actually doing (erase block sizes of up to several megabytes at a time, hidden behind a block remapping layer for wear leveling). For devices that have built in flash that DON'T pretend to be a conventional block device, but instead expose their flash erase granularity and let the OS do the wear levelling itself, we have special flash filesystems that can be reasonably reliable. It's just that ext3 isn't one of ...
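The mismatch described above, a device that presents small sectors while actually erasing and rewriting whole erase blocks, can be modelled in a few lines. This is a deliberately naive sketch: the 8-sector erase block, the `naive_write` read-erase-reprogram routine and the `PowerFail` exception are assumptions for illustration, and real controllers also remap blocks for wear leveling.

```python
SECTORS_PER_ERASE_BLOCK = 8   # assumed toy geometry
ERASED = b"\xff" * 4          # erased flash reads back as all ones

class PowerFail(Exception):
    """Raised to simulate the device losing power mid-operation."""

def naive_write(flash, sector, data, fail_after=None):
    """Rewrite one sector by erasing and reprogramming its whole erase block."""
    start = sector - sector % SECTORS_PER_ERASE_BLOCK
    block = flash[start:start + SECTORS_PER_ERASE_BLOCK]  # copy held in device RAM
    block[sector - start] = data
    for i in range(SECTORS_PER_ERASE_BLOCK):
        flash[start + i] = ERASED                         # erase the whole block
    for i, content in enumerate(block):
        if fail_after is not None and i >= fail_after:
            raise PowerFail                               # RAM copy is gone now
        flash[start + i] = content                        # reprogram sector by sector

# 16 sectors; the host only asks for sector 2 to be rewritten.
flash = [f"{i:04d}".encode() for i in range(16)]
try:
    naive_write(flash, 2, b"NEW!", fail_after=3)
except PowerFail:
    pass
lost = [i for i, content in enumerate(flash) if content == ERASED]
```

Here the requested sector 2 did get its new contents, but the sectors listed in `lost` were never touched by the host and are gone anyway, and no filesystem journal knows anything about them.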
The really nice SSDs actually reserve ~15-30% of their internal block-level storage and actually run their own log-structured virtual disk in hardware. From what I understand the Intel SSDs are that way. Real-time garbage collection is tricky, but if you require (for example) a max of ~80% utilization then you can provide good latency and bandwidth guarantees. There's usually something like a log-structured virtual-to-physical sector map as well. If designed properly with automatic hardware checksumming, such a system can actually provide atomic writes and barriers with virtually no impact on performance. With firmware-level hardware knowledge and the ability to perform extremely efficient parallel reads of flash blocks, such a log-structured virtual block device can be many times more efficient than a general purpose OS running a log-structured filesystem. The result is that for an ordinary ext3-esque filesystem with 4k blocks you can treat the SSD as though it is an atomic-write seek-less block device. Now if only I had the spare cash to go out and buy one of the shiny Intel ones for my laptop... :-) Cheers, Kyle Moffett --
"Linux filesystems I know" :-). No filesystem that Linux supports, According to me, people should just AVOID those devices. I don't plan Battery backed RAID should be ok, as should be plain single SATA drive. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
I've heard rumors of disks that claim to support cache flushes but really just ignore them, but have never heard any specifics of model numbers, etc. which are known to do this, so it may just be legend. If we do have such knowledge then we should really be blacklisting those drives and warning the user that we can't ensure data integrity. (Even powering down the system would be unsafe in this case.) --
This should not be the case for any vaguely modern drive. The standard requires the drive flushes the cache if sent the command and the size of caches on modern drives rather require it. Alan --
On Thu, Mar 12, 2009 at 5:21 AM, Pavel Machek <pavel@ucw.cz> wrote: I had *assumed* that SSDs worked like:

1) write request comes in
2) new unused erase block area marked to hold the new data
3) updated data written to the previously unused erase block
4) mapping updated to replace the old erase block with the new one

If it were done that way, a failure in the middle would just leave the SSD with the old data in it. If it is not done that way, then I can see your issue. (I love the potential performance of SSDs, but I'm beginning to hate the implementations and spec writing.) Greg -- Greg Freemyer Head of EDD Tape Extraction and Processing team Litigation Triage Solutions Specialist http://www.linkedin.com/in/gregfreemyer First 99 Days Litigation White Paper - http://www.norcrossgroup.com/forms/whitepapers/99%20Days%20whitepaper.pdf The Norcross Group The Intersection of Evidence & Technology http://www.norcrossgroup.com --
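The four steps Greg assumes can be sketched as a toy remapping layer. Everything here is hypothetical (the `RemappingFlash` class, the in-memory mapping, treating an unmapped block as free again after power loss); it only shows why writing to a fresh block and switching the mapping last leaves the old data readable if power fails in the middle.

```python
class PowerFail(Exception):
    """Raised to simulate power loss before the mapping update."""

class RemappingFlash:
    def __init__(self, logical_blocks, physical_blocks):
        self.physical = [b""] * physical_blocks
        self.mapping = {n: n for n in range(logical_blocks)}   # logical -> physical
        self.free = list(range(logical_blocks, physical_blocks))

    def read(self, logical):
        return self.physical[self.mapping[logical]]

    def write(self, logical, data, fail_before_commit=False):
        target = self.free.pop()          # step 2: pick an unused erase block
        self.physical[target] = data      # step 3: write the new data there
        if fail_before_commit:
            # Assumption: on restart, blocks nothing maps to count as free.
            self.free.append(target)
            raise PowerFail
        old = self.mapping[logical]
        self.mapping[logical] = target    # step 4: switch the mapping
        self.free.append(old)             # the old block can be reused later

dev = RemappingFlash(logical_blocks=4, physical_blocks=6)
dev.write(1, b"v1")
try:
    dev.write(1, b"v2", fail_before_commit=True)
except PowerFail:
    pass
after_powerfail = dev.read(1)   # still the old data
dev.write(1, b"v3")
after_retry = dev.read(1)
```

The sketch assumes the mapping update itself is atomic and durable, which is exactly the part a real device has to get right.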
The really expensive ones (Intel SSD) apparently work like that, but I've never seen one of those. USB sticks and SD cards I tried behave like I described above. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
