Sure, and there is TileFS too. But why wouldn't it be possible to do this on the current fs infrastructure, using just a smart fsck, working incrementally on some sub-dir? Thanks! -- Al -
Several data structures are file system wide and require finding every allocated file and block to check that they are correct. In particular, block and inode bitmaps can't be checked per subdirectory. http://infohost.nmt.edu/~val/review/chunkfs.pdf -VAL -
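A minimal Python sketch of why the block bitmap is a whole-filesystem property. The toy model here (a dict of inode number to block list, a set standing in for the on-disk bitmap) and the name `check_block_bitmap` are hypothetical illustrations, not ext3 structures or e2fsprogs code.

```python
def check_block_bitmap(inodes, bitmap):
    """Compare an allocation bitmap with the blocks that inodes claim.

    inodes: dict of inode number -> list of block numbers (toy model).
    bitmap: set of block numbers marked as allocated.
    Returns (leaked, unmarked, shared).
    """
    owners = {}
    for ino, blocks in inodes.items():
        for blk in blocks:
            owners.setdefault(blk, []).append(ino)
    referenced = set(owners)
    leaked = bitmap - referenced      # marked allocated, claimed by nobody seen
    unmarked = referenced - bitmap    # claimed by an inode, marked free
    shared = {blk: inos for blk, inos in owners.items() if len(inos) > 1}
    return leaked, unmarked, shared


# Hypothetical filesystem: inodes 11 and 13 both claim block 101.
all_inodes = {11: [100, 101], 12: [102], 13: [101, 103]}
bitmap = {100, 101, 102, 103, 104}

full = check_block_bitmap(all_inodes, bitmap)
# Looking only at the inodes under one subdirectory (here just inode 11):
subdir_only = check_block_bitmap({11: [100, 101]}, bitmap)
```

With every inode visited, block 104 shows up as leaked and block 101 as doubly claimed. The subdirectory-only run reports 102, 103 and 104 as "leaked" and misses the shared block entirely, so neither of its answers can be trusted.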
Ok, but let's look at this a bit more opportunistic / optimistic. Even after a black-out shutdown, the corruption is pretty minimal, using ext3fs at least. So let's take advantage of this fact and do an optimistic fsck, to assure integrity per-dir, and assume no external corruption. Then we release this checked dir to the wild (optionally ro), and check the next. Once we find external inconsistencies we either fix it unconditionally, based on some preconfigured actions, or present the user with options. All this could be per-dir or using some form of on-the-fly file-block-zoning. And there probably is a lot more to it, but it should conceptually be Thanks! -- Al -
On Wed, 9 Jan 2008 14:52:14 +0300 You can't play fast and loose with data integrity. Besides, if we looked at things optimistically, we would conclude You will really want to read this paper, if you haven't already. -- All Rights Reversed -
And that's the reality, because people are mostly optimistic and feel extremely tempted to just force-mount a dirty ext3fs, instead of waiting hours-on-end for a complete fsck, which mostly comes back with some benign Well not ever, but most people probably fsck during scheduled shutdowns, or Definitely a good read, but attacking the problem from a completely different POV. BTW: Dropped some cc's due to bounces. Thanks! -- Al -
After an unclean shutdown, assuming you have decent hardware that doesn't lie about when blocks hit iron oxide, you shouldn't have any So what can you check? The *only* thing you can check is whether or not the directory syntax looks sane, whether the inode structure looks sane, and whether or not the blocks reported as belonging to an inode look sane. What is very hard to check is whether or not the link count on the inode is correct. Suppose the link count is 1, but there are actually two directory entries pointing at it. Now when someone unlinks the file through one of the directory entries, the link count will go to zero, and the blocks will start to get reused, even though the inode is still accessible via another pathname. Oops. Data Loss. This is why doing incremental, on-line fsck'ing is *hard*. You're not going to find this while doing each directory one at a time, and if the filesystem is changing out from under you, it gets worse. And it's not just the hard link count. There is a similar issue with the block allocation bitmap. Detecting the case where two files are simultaneously claiming the same block can't be done if you are doing it incrementally, and if the filesystem is changing out from under you, it's impossible, unless you also have the filesystem telling you every single change while it is happening, and you keep an insane amount of bookkeeping. One thing that you *might* be able to do is to mount a filesystem readonly, and check it in the background while you allow users to access it read-only. There are a few caveats, however ---- (1) some filesystem errors may cause the data to be corrupt, or in the worst case, could cause the system to panic (that would arguably be a filesystem/kernel bug, but we've not necessarily done as much testing here as we should.) (2) if there were any filesystem errors found, you would need to completely unmount the filesystem to flush the inode cache and remount it before it would be safe to remount the filesystem read/write. You can't ...
We could buffer this, and only actually overwrite when we are completely Ok, you have a point, so how about we change the implementation detail a bit, from external fsck to internal fsck, leveraging the internal fs bookkeeping, while allowing immediate but controlled read/write access. Thanks for more thoughts! -- Al -
What hardware is crappy here? Let's say... internal hdd in thinkpad x60? What are ext3 expectations of disk (is there doc somewhere)? For example... if disk does not lie, but powerfail during write damages the sector -- is ext3 still going to work properly? If disk does not lie, but powerfail during write may cause random numbers to be returned on read -- can fsck handle that? What about disk that kills 5 sectors around sector being written during powerfail; can ext3 survive that? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Nope. However the few disks that did this rapidly got firmware updates most of the time. and fsck knows about writing sectors to remove read generally. Note btw that for added fun there is nothing that guarantees the blocks around a block on the media are sequentially numbered. They usually are but you never know. Alan -
Ok, should something like this be added to the documentation? It would be cool to be able to include a few examples (modern SATA disks support barriers so are safe, any IDE from 1989 is unsafe), but I do not know enough about hw...

Signed-off-by: Pavel Machek <pavel@suse.cz>

diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index b45f3c1..adfcc9d 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -183,6 +183,18 @@ mke2fs: create a ext3 partition with th
 debugfs: ext2 and ext3 file system debugger.
 ext2online: online (mounted) ext2 and ext3 filesystem resizer
+Requirements
+============
+
+Ext3 needs disk that does not do write-back caching or disk that
+supports barriers and Linux configuration that can use them.
+
+* if disk damages the sector being written during powerfail, ext3
+  can't cope with that. Fortunately, such disks got firmware updates
+  to fix this long time ago.
+
+* if disk writes random data during powerfail, ext3 should survive
+  that most of the time.
 References
 ==========

-- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
ext3 is not the only filesystem that will have trouble due to volatile write caches. We see problems often enough with XFS due to volatile write caches that it's in our FAQ: http://oss.sgi.com/projects/xfs/faq.html#wcache Cheers, Dave. -- Dave Chinner Principal Engineer SGI Australian Software Group -
Nice FAQ, yep. Perhaps you should move parts of it to Documentation/ , and I could then make ext3 FAQ point to it? I had write cache enabled on my main computer. Oops. I guess that means we do need better documentation. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
AFAIK no drive saves the cache. The worst case cache flush for drives is several seconds with no retries and a couple of minutes if something really bad happens. This is why the kernel has some knowledge of barriers and uses them to issue flushes when needed. -
Indeed, you are right, which is supported by actual measurements:
http://sr5tech.com/write_back_cache_experiments.htm
Sorry for implying that anybody has engineered a drive that can do
such a nice thing with writeback cache.
The "disk motor as a generator" tale may not be purely folklore. When
an IDE drive is not in writeback mode, something special needs to be done
to ensure the last write to media is not a scribble.
A small UPS can make writeback mode actually reliable, provided the
system is smart enough to take the drives out of writeback mode when
the line power is off.
Regards,
Daniel
-
On Tue, 15 Jan 2008 20:24:27 -0500 We've had mount -o barrier=1 for ext3 for a while now, it makes writeback caching safe. XFS has this on by default, as does reiserfs. -chris -
Maybe ext3 should do barriers by default? Having ext3 in "let's corrupt data by default" mode... seems like a bad idea. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
Problem is, ext3 has barriers off by default so it's not saving most people. And then if you turn them on, but have your filesystem on an lvm device, lvm strips them out again. -Eric -
Is it? I guess I should try to measure it. (Linux already does writeback caching, with 2GB of memory. I wonder how important the disk's 2MB of cache can be). Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
It serves essentially the same purpose as the 'async' option in /etc/exports (i.e. we declare it "done" when the other end of the wire says it's caught the data, not when it's actually committed), with similar latency wins. Of course, it's impedance-matching for bursty traffic - the 2M doesn't do much at all if you're streaming data to it. For what it's worth, the 80G Seagate drive in my laptop claims it has 8M, so it probably does 4 times as much good as 2M. ;)
I doubt "impedance-matching" is useful here. SATA link is fast/low latency, and kernel already does buffering with main memory... Hmm... what is the way to measure that? Tar decompress kernel few times with cache on / cache off? Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
In fact it will hit every filesystem. A write-back cache that can't be forced to write back by the filesystem will cause corruption on uncontained power loss, period. -
Hi Pavel, Along with this effort, could you let me know if the world actually cares about online fsck? Now we know how to do it I think, but is it worth the effort? Regards, Daniel -
On Tue, 15 Jan 2008 20:44:38 -0500 With a filesystem that is compartmentalized and checksums metadata, I believe that an online fsck is absolutely worth having. Instead of the filesystem resorting to mounting the whole volume read-only on certain errors, part of the filesystem can be offlined while an fsck runs. This could even be done automatically in many situations. -- All rights reversed. -
In ext4 we store per-group state flags in each group, and the group descriptor is checksummed (to detect spurious flags), so it should be relatively straightforward to store an "error" flag in a single group and have it become read-only. As a starting point, it would be worthwhile to check instances of ext4_error() to see how many of them can be targeted at a specific group. I'd guess most of them could be (corrupt inodes, directory and indirect blocks, incorrect bitmaps). Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. -
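The per-group error flag idea above could be sketched roughly as follows. Everything in this block is hypothetical: the descriptor layout, the `GROUP_ERROR` bit and the function names are made up for illustration, and `zlib.crc32` merely stands in for whatever checksum the real group descriptor uses.

```python
import struct
import zlib

GROUP_ERROR = 0x1  # hypothetical flag bit, not a real ext4 constant


def pack_descriptor(group_no, flags):
    """Toy descriptor: group number, flags, then a CRC over both."""
    body = struct.pack("<IH", group_no, flags)
    return body + struct.pack("<I", zlib.crc32(body))


def read_flags(raw):
    """Return the flags, or None if the checksum does not match."""
    body = raw[:6]
    (stored,) = struct.unpack("<I", raw[6:])
    if zlib.crc32(body) != stored:
        return None
    _, flags = struct.unpack("<IH", body)
    return flags


def group_is_writable(raw):
    """Allow writes only if the descriptor verifies and carries no error flag.

    Treating an unverifiable descriptor as read-only is an assumption of
    this sketch, not something stated in the thread.
    """
    flags = read_flags(raw)
    return flags is not None and not (flags & GROUP_ERROR)


healthy = pack_descriptor(3, 0)
flagged = pack_descriptor(3, GROUP_ERROR)

# A flipped bit that clears the error flag without updating the checksum.
corrupted = bytearray(flagged)
corrupted[4] ^= 0x1
corrupted = bytes(corrupted)
```

The checksum is what makes the flag trustworthy in both directions: a stray bit flip can neither silently set the error flag nor silently clear it.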
ext3's "let's fsck on every 20 mounts" is a good idea, but it can be annoying when developing. Having the option to fsck while the filesystem is online takes that annoyance away. So yes, it would be very useful for me... For long-running servers, this may be less of a problem... but OTOH their filesystems are not checked at all as long as the servers are online... so online fsck is actually important there, too, but for other reasons. So yes, it is very useful for the world. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html -
I'm sure everyone on cc: knows this, but for the record you can change ext3's fsck on N mounts or every N days to something that makes sense for your use case. Usually I just turn it off entirely and run fsck by hand when I'm worried: # tune2fs -c 0 -i 0 /dev/whatever -VAL -
Most users seem to care deeply about "things just work". That is why ntfs-3g also took the online fsck path some time ago. NTFS support had a very bad reputation on Linux, thus the new code was written with rigid sanity checks and extensive automatic regression testing. One of the consequences is that we're detecting way too many inconsistencies left behind by the Windows and other NTFS drivers, hardware faults, device drivers. To better utilize the non-existing developer resources, it was obvious to suggest the already existing Windows fsck (chkdsk) in such cases. Simple and safe, as most people like us who never used Windows would think. However years of experience show that depending on several factors chkdsk may start or not, may report the real problems or not, but on the other hand it may report bogus issues, it may run long or just forever, and it may even remove completely valid files. So one could perhaps even consider suggestions to run chkdsk a call to play Russian roulette. Thankfully NTFS has some level of metadata redundancy with signatures and weak "checksums" which make it possible to correct some common and obvious corruptions on the fly. Similarly to ZFS, Windows Server 2008 also has self-healing NTFS: http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668-4e15-b7ad-4df... Szaka -- NTFS-3G: http://ntfs-3g.org -
I guess that is enough votes to justify going ahead and trying an implementation of the reverse mapping ideas I posted. But of course more votes for this is better. If online incremental fsck is something people want, then please speak up here and that will very definitely help make it happen. On the walk-before-run principle, it would initially just be filesystem checking, not repair. But even this would help, by setting per-group checked flags that offline fsck could use to do a much quicker repair pass. And it will let you know when a volume needs to be taken offline without having to build in planned downtime just in case, which already eats a bunch of nines. Regards, Daniel -
I think that you have to keep in mind the way disks (and other media) fail. You can get media failures after a successful write or errors that pop up as the media ages. Not to mention the way most people run with write cache enabled and no write barriers enabled - a sure recipe for corruption. Of course, there are always software errors to introduce corruption even when we get everything else right ;-) From what I see, media errors are the number one cause of corruption in file systems. It is critical that fsck (and any other tools) continue after an IO error since they are fairly common (just assume that sector is lost and do your best as you continue on). ric -
Hi Ted, In this case I am listening to Chicken Little carefully and really do believe the sky will fall if we fail to come up with an incremental online fsck some time in the next few years. I realize the challenge verges on insane, but I have been slowly chewing away at this question for some time. Val proposes to simplify the problem by restricting the scope of block pointers and hard links. Best of luck with that, the concept of fault isolation domains has a nice ring to it. I prefer to stick close to tried and true Ext3 and not change the basic algorithms. Rather than restricting pointers, I propose to add a small amount of new metadata to accelerate global checking. The idea is to be able to build per-group reverse maps very quickly, to support mapping physical blocks back to inodes that own them, and mapping inodes back to the directories that reference them. I see on-the-fly filesystem reverse mapping as useful for more than just online fsck. For example it would be nice to be able to work backwards efficiently from a list of changed blocks such as ddsnap produces to a list of file level changes. The amount of metadata required to support efficient on-the-fly reverse mapping is surprisingly small: 2K per block group per terabyte, in a fixed location at the base of each group. This is consistent with my goal of producing code that is mergeable for Ext4 and backportable to Ext3. Building a block reverse map for a given group is easy and efficient. The first pass walks across the inode table and already maps most of the physical blocks for typical usage patterns, because most files only have direct pointers. Index blocks discovered in the first pass go onto a list to be processed by subsequent passes, which may discover additional index blocks. Just keep pushing the index blocks back onto the list and the algorithm terminates when the list is empty. This builds a reverse map for the group including references to external ...
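The worklist pass described above (walk the inode table, queue index blocks, repeat until the queue is empty) might look roughly like this in Python. The data model is a hypothetical toy, not the Ext3 on-disk format: each inode is a dict with "direct" and "index" block lists, and each index block is a dict with "data" and "index" lists. The sketch ignores group boundaries and the extra per-group metadata, and it assumes index blocks form a tree with no cycles.

```python
from collections import deque


def build_reverse_map(inode_table, index_blocks):
    """Map every physical block back to the inode that owns it."""
    reverse = {}
    pending = deque()

    # First pass: direct pointers cover most blocks for typical files.
    for ino, inode in inode_table.items():
        for blk in inode["direct"]:
            reverse[blk] = ino
        for idx in inode["index"]:
            reverse[idx] = ino          # the index block itself is owned too
            pending.append((idx, ino))

    # Subsequent passes: each index block may reveal further index blocks.
    # Assumes no cycles; a real checker would have to guard against them.
    while pending:
        idx, ino = pending.popleft()
        entry = index_blocks[idx]
        for blk in entry["data"]:
            reverse[blk] = ino
        for child in entry["index"]:
            reverse[child] = ino
            pending.append((child, ino))

    return reverse


# Hypothetical group: inode 2 has only direct blocks, inode 7 has a
# two-level chain of index blocks (70 -> 71).
inode_table = {
    2: {"direct": [50, 51], "index": []},
    7: {"direct": [60], "index": [70]},
}
index_blocks = {
    70: {"data": [61, 62], "index": [71]},
    71: {"data": [63], "index": []},
}
reverse_map = build_reverse_map(inode_table, index_blocks)
```

A deque is used so that index blocks are processed in discovery order, which matches the "passes" in the description; any queue or stack would terminate the same way.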
If you have /home/usera, /home/userb, and /home/userc, the vast majority of fs screw-ups can't be detected by only looking at one sub-dir. For example, you can't tell definitively that all blocks referenced by an inode under /home/usera are properly only allocated to one file until you *also* look at the inodes under user[bc]. Heck, you can't even tell if the link count for a file is correct unless you walk the entire filesystem - you can find a file with a link count of 3 in the inode, and you find one reference under usera, and a second under userb - you can't tell if the count is one too high or not until you walk through userc and actually see (or fail to see) a third directory entry referencing it.
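The link-count example can be made concrete with a small Python sketch. The directory contents, inode numbers and function names below are hypothetical; directories are modelled as plain dicts of name to inode number, and the "." and ".." entries are ignored.

```python
from collections import Counter


def count_references(directories):
    """Count directory entries pointing at each inode."""
    refs = Counter()
    for entries in directories.values():
        for ino in entries.values():
            refs[ino] += 1
    return refs


def link_count_errors(directories, link_counts):
    """Return {inode: (stored, found)} where the stored count is wrong.

    Only meaningful if `directories` covers the whole filesystem.
    """
    refs = count_references(directories)
    return {
        ino: (stored, refs[ino])
        for ino, stored in link_counts.items()
        if refs[ino] != stored
    }


directories = {
    "/home/usera": {"a.txt": 20},
    "/home/userb": {"b.txt": 20},
    "/home/userc": {"notes": 21},
}
link_counts = {20: 3, 21: 1}  # inode 20 claims three links

# After usera and userb only two references have been seen, which proves
# nothing: the third might still be waiting under userc.
partial = count_references(
    {d: directories[d] for d in ("/home/usera", "/home/userb")}
)

# Only the complete walk shows the stored count is one too high.
errors = link_count_errors(directories, link_counts)
```

The partial count for inode 20 is 2 both in this broken filesystem and in a healthy one where userc holds the third link, which is exactly why the per-directory check cannot decide.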
