Has there been some thought about an incremental fsck?
You know, somehow fencing a sub-dir to do an online fsck?
Thanks for some thoughts!
--
Al
--
Is that anything like a cluster fsck? ]:>
--
While an _incremental_ fsck isn't so easy for existing filesystem types,
what is pretty easy to automate is making a read-only snapshot of a
filesystem via LVM/DM and then running e2fsck against that. The kernel
and filesystem have hooks to flush the changes from cache and make the
on-disk state consistent.

You can then set the ext[234] superblock mount count and last check
time via tune2fs if all is well, or schedule an outage if there are
inconsistencies found.

There is a copy of this script at:
http://osdir.com/ml/linux.lvm.devel/2003-04/msg00001.html

Note that it might need some tweaks to run with DM/LVM2 commands/output,
but is mostly what is needed.

Cheers, Andreas
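
A minimal sketch of the snapshot-and-check sequence described above, in
Python. It only builds the command lines and does not run them. The volume
group, volume and snapshot names, the 1G snapshot size, and the helper names
are hypothetical, and this is not the script linked above. It assumes
e2fsck's documented convention that exit status 0 means no errors were found.

```python
def snapshot_fsck_commands(vg, lv, snap="fsck_snap", size="1G"):
    """Return the command lines for a snapshot-based check as argv lists.

    Nothing is executed here; the caller decides how and when to run them.
    All names and the snapshot size are illustrative assumptions.
    """
    origin = f"/dev/{vg}/{lv}"
    snapshot = f"/dev/{vg}/{snap}"
    return {
        # Create a snapshot of the live volume.
        "create": ["lvcreate", "-s", "-L", size, "-n", snap, origin],
        # Forced check that answers "no" to every question, so the
        # snapshot is never modified.
        "check": ["e2fsck", "-f", "-n", snapshot],
        # If the check was clean: reset mount count and last-check time
        # on the real filesystem.
        "mark_clean": ["tune2fs", "-C", "0", "-T", "now", origin],
        # Drop the snapshot in either case.
        "remove": ["lvremove", "-f", snapshot],
    }


def next_step(e2fsck_exit_status):
    """Decide what to do after the check, based on e2fsck's exit status."""
    return "mark_clean" if e2fsck_exit_status == 0 else "schedule_outage"
```

The commands would be run in the order create, check, then mark_clean or an
outage notice, then remove.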
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
You can do this now with ddsnap (an out-of-tree device mapper target)
either by checking a local snapshot or a replicated snapshot on a
different machine, see:

Doing the check on a remote machine seems attractive because the fsck
does not create a load on the server.

Regards,
Daniel
--
On Wed, 9 Jan 2008 00:22:55 +0300
Search for "chunkfs"
--
All rights reversed.
--
If you have /home/usera, /home/userb, and /home/userc, the vast majority of
fs screw-ups can't be detected by only looking at one sub-dir. For example,
you can't tell definitively that all blocks referenced by an inode under
/home/usera are properly only allocated to one file until you *also* look at
the inodes under user[bc]. Heck, you can't even tell if the link count for
a file is correct unless you walk the entire filesystem - you can find a file
with a link count of 3 in the inode, and you find one reference under usera,
and a second under userb - you can't tell if the count is one too high or
not until you walk through userc and actually see (or fail to see) a third
directory entry referencing it.
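
A toy illustration of the link-count argument above, in Python. The
dictionaries stand in for on-disk structures and the function name is
hypothetical; the only point is that the comparison is meaningful only once
every directory has been counted.

```python
from collections import Counter


def link_count_mismatches(stored_counts, directories):
    """Compare stored link counts with the references actually seen.

    stored_counts: {inode number: link count recorded in the inode}
    directories:   {directory path: [inode numbers its entries point at]}
    Returns {inode: (stored, seen)} for every inode where the two differ.
    """
    seen = Counter()
    for entries in directories.values():
        seen.update(entries)
    return {
        ino: (stored, seen[ino])
        for ino, stored in stored_counts.items()
        if seen[ino] != stored
    }


stored = {42: 3}
whole_fs = {"/home/usera": [42], "/home/userb": [42], "/home/userc": [42]}
partial = {"/home/usera": [42], "/home/userb": [42]}

print(link_count_mismatches(stored, whole_fs))
print(link_count_mismatches(stored, partial))
```

With only usera and userb walked, inode 42 looks wrong, although the full
walk shows it is fine; a partial walk cannot tell the two cases apart.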
Several data structures are file system wide and require finding every
allocated file and block to check that they are correct. In
particular, block and inode bitmaps can't be checked per subdirectory.

http://infohost.nmt.edu/~val/review/chunkfs.pdf

-VAL
--
Ok, but let's look at this a bit more opportunistically / optimistically.
Even after a black-out shutdown, the corruption is pretty minimal, using
ext3fs at least. So let's take advantage of this fact and do an optimistic
fsck, to assure integrity per-dir, and assume no external corruption. Then
we release this checked dir to the wild (optionally ro), and check the next.
Once we find external inconsistencies we either fix it unconditionally,
based on some preconfigured actions, or present the user with options.

All this could be per-dir or using some form of on-the-fly file-block-zoning.
And there probably is a lot more to it, but it should conceptually be
Thanks!
--
Al
--
After an unclean shutdown, assuming you have decent hardware that
doesn't lie about when blocks hit iron oxide, you shouldn't have any

So what can you check? The *only* thing you can check is whether or
not the directory syntax looks sane, whether the inode structure looks
sane, and whether or not the blocks reported as belonging to an inode
look sane.

What is very hard to check is whether or not the link count on the
inode is correct. Suppose the link count is 1, but there are actually
two directory entries pointing at it. Now when someone unlinks the
file through one of the directory entries, the link count will go
to zero, and the blocks will start to get reused, even though the
inode is still accessible via another pathname. Oops. Data Loss.

This is why doing incremental, on-line fsck'ing is *hard*. You're not
going to find this while doing each directory one at a time, and if
the filesystem is changing out from under you, it gets worse. And
it's not just the hard link count. There is a similar issue with the
block allocation bitmap. Detecting the case where two files are
simultaneously claiming the same block can't be done if you are doing
it incrementally, and if the filesystem is changing out from under you,
it's impossible, unless you also have the filesystem telling you every
single change while it is happening, and you keep an insane amount of
bookkeeping.

One thing that you *might* be able to do is to mount a filesystem readonly,
and check it in the background while you allow users to access it
read-only. There are a few caveats, however -- (1) some filesystem
errors may cause the data to be corrupt, or in the worst case, could
cause the system to panic (that would arguably be a
filesystem/kernel bug, but we've not necessarily done as much testing
here as we should.) (2) if there were any filesystem errors found,
you would need to completely unmount the filesystem to flush the inode
cache and remount it before it would be safe to remount the filesystem
read/write. You can't jus...
Hi Ted,
In this case I am listening to Chicken Little carefully and really do
believe the sky will fall if we fail to come up with an incremental
online fsck some time in the next few years. I realize the challenge
verges on insane, but I have been slowly chewing away at this question
for some time.

Val proposes to simplify the problem by restricting the scope of block
pointers and hard links. Best of luck with that, the concept of fault
isolation domains has a nice ring to it. I prefer to stick close to
tried and true Ext3 and not change the basic algorithms.

Rather than restricting pointers, I propose to add a small amount of new
metadata to accelerate global checking. The idea is to be able to
build per-group reverse maps very quickly, to support mapping physical
blocks back to inodes that own them, and mapping inodes back to the
directories that reference them.

I see on-the-fly filesystem reverse mapping as useful for more than just
online fsck. For example it would be nice to be able to work backwards
efficiently from a list of changed blocks such as ddsnap produces to a
list of file level changes.

The amount of metadata required to support efficient on-the-fly reverse
mapping is surprisingly small: 2K per block group per terabyte, in a
fixed location at the base of each group. This is consistent with my
goal of producing code that is mergeable for Ext4 and backportable to
Ext3.

Building a block reverse map for a given group is easy and efficient.
The first pass walks across the inode table and already maps most of
the physical blocks for typical usage patterns, because most files only
have direct pointers. Index blocks discovered in the first pass go
onto a list to be processed by subsequent passes, which may discover
additional index blocks. Just keep pushing the index blocks back onto
the list and the algorithm terminates when the list is empty. This
builds a reverse map for the group including references to external
groups.
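
The worklist pass described above can be sketched in Python roughly as
follows. The dictionary layouts and the function name are toy assumptions,
not the Ext3 on-disk format or any posted code, and the visited set is an
added guard so the loop still terminates if a corrupt filesystem contains
index blocks that point at each other.

```python
from collections import deque


def build_block_reverse_map(inode_table, index_blocks):
    """Map each physical block number to the inode that owns it.

    inode_table:  {ino: {"direct": [block, ...], "index": [block, ...]}}
    index_blocks: {block: {"data": [block, ...], "index": [block, ...]}}
    """
    owner = {}
    pending = deque()

    # First pass: walk the inode table. Direct pointers are mapped
    # immediately; index blocks are queued for later passes.
    for ino, inode in inode_table.items():
        for block in inode["direct"]:
            owner[block] = ino
        for block in inode["index"]:
            pending.append((block, ino))

    # Later passes: each index block may reveal data blocks and further
    # index blocks. The loop ends when the list is empty.
    visited = set()
    while pending:
        block, ino = pending.popleft()
        if block in visited:
            continue  # already processed; avoids looping forever
        visited.add(block)
        owner[block] = ino
        contents = index_blocks[block]
        for data_block in contents["data"]:
            owner[data_block] = ino
        for child in contents["index"]:
            pending.append((child, ino))
    return owner
```

A real implementation would also have to notice a block claimed by two
inodes; this sketch simply lets the later claim overwrite the earlier one.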
...
What hardware is crappy here? Let's say... internal hdd in thinkpad
x60?

What are ext3 expectations of disk (is there doc somewhere)? For
example... if disk does not lie, but powerfail during write damages
the sector -- is ext3 still going to work properly?

If disk does not lie, but powerfail during write may cause random
numbers to be returned on read -- can fsck handle that?

What about disk that kills 5 sectors around sector being written during
powerfail; can ext3 survive that?

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--
I think that you have to keep in mind the way disks (and other media)
fail. You can get media failures after a successful write or errors that
pop up as the media ages.

Not to mention the way most people run with write cache enabled and no
write barriers enabled - a sure recipe for corruption.

Of course, there are always software errors to introduce corruption even
when we get everything else right ;-)

From what I see, media errors are the number one cause of corruption in
file systems. It is critical that fsck (and any other tools) continue
after an IO error since they are fairly common (just assume that sector
is lost and do your best as you continue on).

ric
--
Nope. However the few disks that did this rapidly got firmware updates
most of the time. and fsck knows about writing sectors to remove read
generally. Note btw that for added fun there is nothing that guarantees
the blocks around a block on the media are sequentially numbered. They
usually are but you never know.

Alan
--
Ok, should something like this be added to the documentation?
It would be cool to be able to include a few examples (modern SATA disks
support barriers so are safe, any IDE from 1989 is unsafe), but I do
not know enough about hw...

Signed-off-by: Pavel Machek <pavel@suse.cz>
diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt
index b45f3c1..adfcc9d 100644
--- a/Documentation/filesystems/ext3.txt
+++ b/Documentation/filesystems/ext3.txt
@@ -183,6 +183,18 @@ mke2fs: create a ext3 partition with th
 debugfs: ext2 and ext3 file system debugger.
 ext2online: online (mounted) ext2 and ext3 filesystem resizer
 
+Requirements
+============
+
+Ext3 needs disk that does not do write-back caching or disk that
+supports barriers and Linux configuration that can use them.
+
+* if disk damages the sector being written during powerfail, ext3
+ can't cope with that. Fortunately, such disks got firmware updates
+ to fix this long time ago.
+
+* if disk writes random data during powerfail, ext3 should survive
+ that most of the time.
 
 References
 ==========

--
Hi Pavel,
Along with this effort, could you let me know if the world actually
cares about online fsck? Now we know how to do it I think, but is it
worth the effort?

Regards,
Daniel
--
Most users seem to care deeply about "things just work". This is why
ntfs-3g also took the online fsck path some time ago.

NTFS support had a very bad reputation on Linux, thus the new code was
written with rigid sanity checks and extensive automatic regression
testing. One of the consequences is that we're detecting way too many
inconsistencies left behind by the Windows and other NTFS drivers,
hardware faults, device drivers.

To better utilize the non-existing developer resources, it was obvious to
suggest the already existing Windows fsck (chkdsk) in such cases. Simple
and safe as most people like us would think who never used Windows.

However years of experience show that depending on several factors chkdsk
may start or not, may report the real problems or not, but on the other
hand it may report bogus issues, it may run long or just forever, and it
may even remove completely valid files. So one could perhaps even consider
suggestions to run chkdsk a call to play Russian roulette.

Thankfully NTFS has some level of metadata redundancy with signatures and
weak "checksums" which make it possible to correct some common and obvious
corruptions on the fly.

Similarly to ZFS, Windows Server 2008 also has self-healing NTFS:
http://technet2.microsoft.com/windowsserver2008/en/library/6f883d0d-3668...

Szaka
--
NTFS-3G: http://ntfs-3g.org
--
I guess that is enough votes to justify going ahead and trying an
implementation of the reverse mapping ideas I posted. But of course
more votes for this is better. If online incremental fsck is
something people want, then please speak up here and that will very
definitely help make it happen.

On the walk-before-run principle, it would initially just be
filesystem checking, not repair. But even this would help, by setting
per-group checked flags that offline fsck could use to do a much
quicker repair pass. And it will let you know when a volume needs to
be taken offline without having to build in planned downtime just in
case, which already eats a bunch of nines.

Regards,
Daniel
--
ext3's "lets fsck on every 20 mounts" is a good idea, but it can be
annoying when developing. Having the option to fsck while the filesystem
is online takes that annoyance away.

So yes, it would be very useful for me...

For long-running servers, this may be less of a problem... but OTOH
their filesystems are not checked at all as long as servers are
online... so online fsck is actually important there, too, but for
other reasons.

So yes, it is very useful for the world.
Pavel
--
I'm sure everyone on cc: knows this, but for the record you can change
ext3's fsck on N mounts or every N days to something that makes sense
for your use case. Usually I just turn it off entirely and run fsck
by hand when I'm worried:

# tune2fs -c 0 -i 0 /dev/whatever

-VAL
--
On Tue, 15 Jan 2008 20:44:38 -0500
With a filesystem that is compartmentalized and checksums metadata,
I believe that an online fsck is absolutely worth having.

Instead of the filesystem resorting to mounting the whole volume
read-only on certain errors, part of the filesystem can be offlined
while an fsck runs. This could even be done automatically in many
situations.

--
All rights reversed.
--
In ext4 we store per-group state flags in each group, and the group
descriptor is checksummed (to detect spurious flags), so it should
be relatively straightforward to store an "error" flag in a single
group and have it become read-only.

As a starting point, it would be worthwhile to check instances of
ext4_error() to see how many of them can be targeted at a specific
group. I'd guess most of them could be (corrupt inodes, directory
and indirect blocks, incorrect bitmaps).

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.
--
ext3 is not the only filesystem that will have trouble due to
volatile write caches. We see problems often enough with XFS
due to volatile write caches that it's in our FAQ:

http://oss.sgi.com/projects/xfs/faq.html#wcache

Cheers,
Dave.
--
Dave Chinner
Principal Engineer
SGI Australian Software Group
--
In fact it will hit every filesystem. A write-back cache that can't
be forced to write back by the filesystem will cause corruption on
uncontained power loss, period.
--
Nice FAQ, yep. Perhaps you should move parts of it to Documentation/,
and I could then make ext3 FAQ point to it?

I had write cache enabled on my main computer. Oops. I guess that
means we do need better documentation.
Pavel
--
Is it?
I guess I should try to measure it. (Linux already does writeback
caching, with 2GB of memory. I wonder how important disk's 2MB of
cache can be).
Pavel
--
It serves essentially the same purpose as the 'async' option in /etc/exports
(i.e. we declare it "done" when the other end of the wire says it's caught
the data, not when it's actually committed), with similar latency wins. Of
course, it's impedance-matching for bursty traffic - the 2M doesn't do much
at all if you're streaming data to it. For what it's worth, the 80G Seagate
drive in my laptop claims it has 8M, so it probably does 4 times as much
good as 2M. ;)
I doubt "impedance-matching" is useful here. SATA link is fast/low
latency, and kernel already does buffering with main memory...

Hmm... what is the way to measure that? Tar decompress kernel a few
times with cache on / cache off?
Pavel
--
AFAIK no drive saves the cache. The worst case cache flush for drives is
several seconds with no retries and a couple of minutes if something
really bad happens.

This is why the kernel has some knowledge of barriers and uses them to
issue flushes when needed.
--
Problem is, ext3 has barriers off by default so it's not saving most people.
And then if you turn them on, but have your filesystem on an lvm device,
lvm strips them out again.

-Eric
--
Indeed, you are right, which is supported by actual measurements:
http://sr5tech.com/write_back_cache_experiments.htm
Sorry for implying that anybody has engineered a drive that can do
such a nice thing with writeback cache.

The "disk motor as a generator" tale may not be purely folklore. When
an IDE drive is not in writeback mode, something special needs to be done
to ensure the last write to media is not a scribble.

A small UPS can make writeback mode actually reliable, provided the
system is smart enough to take the drives out of writeback mode when
the line power is off.

Regards,
Daniel
--
No it doesn't. The last write _is_ a scribble. Systems that make atomic
updates to disk drives use a shadow update mechanism and write the master
sector twice. If the power fails in the middle of writing one, it will
almost certainly be unreadable due to a CRC failure, and the other one
will have either the old or new master block contents.

And I think there's a problem with drives that, upon sensing the
unreadable sector, assign an alternate even though the sector is fine, and
you eventually run out of spares.

Incidentally, while this primitive behavior applies to IDE (ATA et al)
drives, that isn't the only thing people put filesystems on. Many
important filesystems go on higher level storage subsystems that contain
IDE drives and cache memory and batteries. A device like this _does_ make
sure that all data that it says has been written is actually retrievable
even if there's a subsequent power outage, even while giving the
performance of writeback caching.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
--
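
A minimal sketch of the shadow update mechanism described above (write the
master sector twice, tell a torn copy apart by its CRC), in Python. The
sector layout used here, an 8-byte sequence number followed by the payload
and a trailing CRC32, and all function names are assumptions made for
illustration, not any particular system's format. A real drive would return
a read error for the torn copy; here that is modelled as a CRC mismatch.

```python
import struct
import zlib

SECTOR = 512


def pack_master(seq, payload):
    """Build one master-sector copy: sequence number, payload, CRC32."""
    if len(payload) > SECTOR - 12:
        raise ValueError("payload too large for one sector")
    body = (struct.pack("<Q", seq) + payload).ljust(SECTOR - 4, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def unpack_master(sector):
    """Return (seq, payload) or None if the copy fails its CRC check."""
    if len(sector) != SECTOR:
        return None
    body = sector[:-4]
    (crc,) = struct.unpack("<I", sector[-4:])
    if zlib.crc32(body) != crc:
        return None
    (seq,) = struct.unpack("<Q", body[:8])
    return seq, body[8:]


def read_master(copy_a, copy_b):
    """Pick the newest copy that passes its CRC check."""
    valid = [m for m in (unpack_master(copy_a), unpack_master(copy_b)) if m]
    if not valid:
        raise ValueError("both master copies are unreadable")
    return max(valid, key=lambda m: m[0])
```

The writer updates one copy, waits for it to reach the media, and only then
updates the other, so at most one copy can be torn by a power failure.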
Have you observed that in the wild? A former engineer of a disk drive
company suggests to me that the capacitors on the board provide enough
power to complete the last sector, even to park the head.

Regards,
Daniel
--
The problem isn't with the disk drive; it's from the DRAM, which tends
to be much more voltage sensitive than the hard drives --- so it's
quite likely that you could end up DMA'ing garbage from the memory.
In fact, the fact that the disk drive lasts longer due to capacitors
on the board, rotational inertia of the platters, etc., is part of the
problem.

It was observed in the wild by SGI, many years ago on their hardware.
They later added extra capacitors on the motherboard and a powerfail
interrupt which caused Irix to run around frantically shutting
down DMA's for a controlled shutdown. Of course, PC-class hardware
has none of this. My source for this was Jim Mostek, one of the
original Linux XFS porters. He had given me source code to a test
program that would show this; it basically zeroed out a region of disk,
then started writing a series of patterns on that part of the disk, and
then you kicked out the power cord, and then saw if there was any garbage
on the disk. If you saw something that wasn't one of the patterns
being written to the disk, then you knew you had a problem. I can't
find the program any more, but it wouldn't be hard to write.

I do know that I have seen reports from many ext2 users in the field
that could only be explained by the hard drive scribbling garbage onto
the inode table. Ext3 solves this problem because of its physical
block journaling.

- Ted
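
The checking half of such a test program might look roughly like this in
Python. It is a guess at the idea, not the lost SGI program: the sector
size, the single-byte fill patterns and the function names are all
assumptions. The writer side, which is not shown, would zero the region and
then keep overwriting it with the known patterns until the power is cut.

```python
SECTOR = 512


def make_pattern(byte_value):
    """One sector filled with a single repeated byte."""
    return bytes([byte_value]) * SECTOR


def find_garbage(region, patterns):
    """Return the indexes of sectors that are neither zero nor a known pattern.

    region:   bytes read back from the test area after the power failure
    patterns: the sector-sized patterns the writer was cycling through
    """
    allowed = {bytes(SECTOR)} | set(patterns)
    bad = []
    for offset in range(0, len(region), SECTOR):
        if region[offset:offset + SECTOR] not in allowed:
            bad.append(offset // SECTOR)
    return bad
```

Any sector index reported here holds data that was never written by the
test, which is the symptom described above.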
--
Even if true (which I doubt), this is not implemented.
A modern drive can have 16-32 MB of write cache. Worst case, those
I can tell you directly that when you drop power to a drive, you will
lose write cache data if the write cache is enabled. With barriers
enabled, our testing shows that file systems survive power failures
which routinely caused corruption without them ;-)

ric
--
We weren't actually talking about writing out the cache. While that was
part of an earlier thread which ultimately conceded that disk drives most
probably do not use the spinning disk energy to write out the cache, the
claim was then made that the drive at least survives long enough to finish
writing the sector it was writing, thereby maintaining the integrity of
the data at the drive level. People often say that a disk drive
guarantees atomic writes at the sector level even in the face of a power
failure.

But I heard some years ago from a disk drive engineer that that is a myth
just like the rotational energy thing. I added that to the discussion,
but admitted that I haven't actually seen a disk drive write a partial
sector.

Ted brought up the separate issue of the host sending garbage to the disk
device because its own power is failing at the same time, which makes the
integrity at the disk level moot (or even undesirable, as you'd rather
write a bad sector than a good one with the wrong data).

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
--
Did he work for Maxtor, by any chance? :-/
A disk drive whose power is cut needs to have enough residual power to
park its heads (or *massive* data loss will occur), and at that point it
might as well keep enough on hand to finish an in-progress sector write.

There are two possible sources of onboard temporary power: a large
enough capacitor, or the rotational energy of the platters (an
electrical motor also being a generator.) I don't care which one they
use, but they need to do something.

-hpa
--
I believe the power for that comes from a third source: a spring. Parking
the heads is too important to leave to active circuits.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
--
Well, it would be impossible or at least very hard to see that in
practice, right? My understanding is that drives do sector-level
checksums, so if there was a partially written sector, the checksum
would be bogus and the drive would return an error when you tried to

Yep, exactly. It would be interesting to see if this happens on
modern hardware; all of the evidence I've had for this is years old at
this point.

- Ted
--
There is extensive per sector error correction on each sector written.
What you would see in this case (or many, many other possible ways
drives can corrupt media) is a "media error" on the next read.

You would never get back the partially written contents of that sector
at the host.

Having our tools (fsck especially) be resilient in the face of media
errors is really critical. Although I don't think the scenario of a
partially written sector is common, media errors in general are common

See the NetApp paper from Sigmetrics 2007 for some interesting analysis...
ric
--
Agreed.
Jeff
--
I just had a talk with a colleague, John Palmer, who worked on disk drive
design for about 5 years in the '90s and he gave me a very confident,
credible explanation of some of the things we've been wondering about disk
drive power loss in this thread, complete with demonstrations of various
generations of disk drives, dismantled.

First of all, it is plain to see that there is no spring capable of
parking the head, and there is no capacitor that looks big enough to
possibly supply the energy to park the head, in any of the models I looked
at. Since parking of the heads is essential, we can only conclude that
the myth of the kinetic energy of the disks being used for that (turned
into electricity by the drive motor) is true. The energy required is not
just to move the heads to the parking zone, but to latch them there as
well.

The myth is probably just that that energy is used for anything else; it's
really easy to build a dumb circuit to park the heads using that power;
keeping a computer running is something else.

The drive does drop a write in the middle of the sector if it is writing
at the time of power loss. The designers were too conservative to keep
writing as power fails -- there's no telling what damage you might do. So
the drive cuts the power to the heads at the first sign of power loss. If
a write was in progress, this means there is one garbage sector on the
disk. It can't be read.

Trying to finish writing the sector is something I can imagine some drive
model somewhere trying to do, but if even _some_ take the conservative
approach, everyone has to design for it, so it doesn't matter.

A device might then reassign that sector the next time you try to write to
it (after failing to read it), thinking the medium must be bad. But there
are various algorithms for deciding when to reassign a sector, so it might
not too.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
--
...
I have a Seagate Barracuda 7200.9 80 Gbyte SATA drive that I
use for experiments. I can permanently destroy an EXT3 file-system
at least 50% of the time by disconnecting the data cable while
a `dd` write to a file is in progress. Something bad happens
making partition information invalid. I have to re-partition
to reuse the drive.

If I try the same experiment by disconnecting power to the drive
the file is no good (naturally), but the rest of the file-system
is fine.

My theory is that the destination offset is present in every
SATA access and some optimization code within the drive sets
the heads to track zero and writes before any CRC or checksum
is done to find out if it was the correct offset with the
correct data!

Cheers,
Dick Johnson
Penguin : Linux version 2.6.22.1 on an i686 machine (5588.29 BogoMips).
My book : http://www.AbominableFirebug.com/
--
Does turning off writeback cache on disk help? This is quite serious,
I'd say...

Pavel
--
PC class hardware has a power good signal which drops just before the
rest.
--
No, I haven't. It's hearsay, and from about 3 years ago.
As for parking the head, that's hard to believe, since it's so easy and
more reliable to use a spring and an electromagnet.

--
Bryan Henderson IBM Almaden Research Center
San Jose CA Filesystems
--
You are assuming drives can't tell the difference between stray data loss
and sectors that can't be recovered by rewriting and reuse. I was under
the impression modern drives could do this?

Alan
--
On Tue, 15 Jan 2008 20:24:27 -0500
We've had mount -o barrier=1 for ext3 for a while now, it makes
writeback caching safe. XFS has this on by default, as does reiserfs.

-chris
--
Maybe ext3 should do barriers by default? Having ext3 in "lets corrupt
data by default" mode... seems like a bad idea.
Pavel
--
We could buffer this, and only actually overwrite when we are completely
Ok, you have a point, so how about we change the implementation detail a bit,
from external fsck to internal fsck, leveraging the internal fs bookkeeping,
while allowing immediate but controlled read/write access.

Thanks for more thoughts!
--
Al
--
On Wed, 9 Jan 2008 14:52:14 +0300
You can't play fast and loose with data integrity.
Besides, if we looked at things optimistically, we would conclude
You will really want to read this paper, if you haven't already.
--
All Rights Reversed
--
And that's the reality, because people are mostly optimistic and feel
extremely tempted to just force-mount a dirty ext3fs, instead of waiting
hours-on-end for a complete fsck, which mostly comes back with some benign

Well not ever, but most people probably fsck during scheduled shutdowns, or

Definitely a good read, but attacking the problem from a completely different
POV.

BTW: Dropped some cc's due to bounces.
Thanks!
--
Al
