Sorry, that last send was totally stuffed up w.r.t. mail headers. So for the benefit of proper cc:'s here it comes again... A collection of patches to make ext3 & 4 use barriers by default, and to call blkdev_issue_flush on fsync if they are enabled. --
I can't think of any valid reason for ext3 to not use barriers when they are available; I believe this is necessary for filesystem integrity in the face of a volatile write cache on storage. An administrator who trusts that the cache is sufficiently battery- backed (and power supplies are sufficiently redundant, etc...) can always turn it back off again. SuSE has carried such a patch for quite some time now. Also document the mount option while we're at it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Acked-by: Jan Kara <jack@suse.cz> --- Documentation/filesystems/ext3.txt | 12 ++++++++++-- fs/ext3/super.c | 11 +++++++++-- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/ext3.txt b/Documentation/filesystems/ext3.txt index b45f3c1..daab1f5 100644 --- a/Documentation/filesystems/ext3.txt +++ b/Documentation/filesystems/ext3.txt @@ -52,8 +52,16 @@ commit=nrsec (*) Ext3 can be told to sync all its data and metadata Setting it to very large values will improve performance. -barrier=1 This enables/disables barriers. barrier=0 disables - it, barrier=1 enables it. +barrier=<0|1(*)> This enables/disables the use of write barriers in + the jbd code. barrier=0 disables, barrier=1 enables. + This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. orlov (*) This enables the new Orlov block allocator. It is enabled by default. diff --git a/fs/ext3/super.c b/fs/ext3/super.c index fe3119a..9c30dc7 100644 --- a/fs/ext3/super.c +++ b/fs/ext3/super.c @@ -555,6 +555,7 @@ static int ext3_show_options(struct seq_file *seq, ...
Yes please. Safe defaults are the way to go. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
To ensure that bits are truly on-disk after an fsync,
ext3 should call blkdev_issue_flush if barriers are supported.
Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
fs/ext3/fsync.c | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/fs/ext3/fsync.c b/fs/ext3/fsync.c
index d336341..f6167ec 100644
--- a/fs/ext3/fsync.c
+++ b/fs/ext3/fsync.c
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/jbd.h>
+#include <linux/blkdev.h>
#include <linux/ext3_fs.h>
#include <linux/ext3_jbd.h>
@@ -45,6 +46,7 @@
int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT3_SB(inode->i_sb)->s_journal;
int ret = 0;
J_ASSERT(ext3_journal_current_handle() == NULL);
@@ -85,6 +87,9 @@ int ext3_sync_file(struct file * file, struct dentry *dentry, int datasync)
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+
+ if (journal && (journal->j_flags & JFS_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}
out:
return ret;
-- 1.5.3.6
--
blkdev_issue_flush uses BIO_RW_BARRIER, which becomes REQ_HARDBARRIER. REQ_HARDBARRIER does _not_ mean force the data to disk. It is a barrier request, which when optimised for the highest performance can return a long time before the data is physically on the disk. (It can even do nothing, correctly). In many cases, block drivers implement a barrier using a hardware flush command, so it turns out the same. Then your patch is effective. (It is however sending unnecessary hardware commands sometimes, because fsync's journal write (when there is one) will also issue a flush command, and another is not required; it may reduce performance.) But that is not what REQ_HARDBARRIER means: block drivers are allowed to implement barriers in other ways which are more efficient. I'm not sure if this problem is theoretical only, or if any block drivers actually do take advantage of this. A good fix would be to add REQ_HARDFLUSH, with a precise meaning: don't return until the hardware reports the data safely committed. It would be very little change to most (if not all) current drivers. -- Jamie --
I can't think of any valid reason for ext4 to not use barriers when they are available; I believe this is necessary for filesystem integrity in the face of a volatile write cache on storage. An administrator who trusts that the cache is sufficiently battery- backed (and power supplies are sufficiently redundant, etc...) can always turn it back off again. SuSE has carried such a patch for ext3 for quite some time now. Also document the mount option while we're at it. Signed-off-by: Eric Sandeen <sandeen@redhat.com> --- Documentation/filesystems/ext4.txt | 12 ++++++++++-- fs/ext4/super.c | 11 +++++++++-- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/Documentation/filesystems/ext4.txt b/Documentation/filesystems/ext4.txt index 560f88d..0c5086d 100644 --- a/Documentation/filesystems/ext4.txt +++ b/Documentation/filesystems/ext4.txt @@ -139,8 +139,16 @@ commit=nrsec (*) Ext4 can be told to sync all its data and metadata Setting it to very large values will improve performance. -barrier=1 This enables/disables barriers. barrier=0 disables - it, barrier=1 enables it. +barrier=<0|1(*)> This enables/disables the use of write barriers in + the jbd code. barrier=0 disables, barrier=1 enables. + This also requires an IO stack which can support + barriers, and if jbd gets an error on a barrier + write, it will disable again with a warning. + Write barriers enforce proper on-disk ordering + of journal commits, making volatile disk write caches + safe to use, at some performance penalty. If + your disks are battery-backed in one way or another, + disabling barriers may safely improve performance. orlov (*) This enables the new Orlov block allocator. It is enabled by default. diff --git a/fs/ext4/super.c b/fs/ext4/super.c index 52dd067..77b036a 100644 --- a/fs/ext4/super.c +++ b/fs/ext4/super.c @@ -671,6 +671,7 @@ static int ext4_show_options(struct seq_file *seq, struct vfsmount *vfs) ...
To ensure that bits are truly on-disk after an fsync,
we should call blkdev_issue_flush if barriers are supported.
Inspired by an old thread on barriers, by reiserfs & xfs
which do the same, and by a patch SuSE ships with their kernel
Signed-off-by: Eric Sandeen <sandeen@redhat.com>
---
fs/ext4/fsync.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 1c8ba48..a45c373 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -27,6 +27,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/jbd2.h>
+#include <linux/blkdev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
@@ -45,6 +46,7 @@
int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
{
struct inode *inode = dentry->d_inode;
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
int ret = 0;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -85,6 +87,8 @@ int ext4_sync_file(struct file * file, struct dentry *dentry, int datasync)
.nr_to_write = 0, /* sys_fsync did this */
};
ret = sync_inode(inode, &wbc);
+ if (journal && (journal->j_flags & JBD2_BARRIER))
+ blkdev_issue_flush(inode->i_sb->s_bdev, NULL);
}
out:
return ret;
-- 1.5.3.6
--
This patch isn't necessary, and in fact will cause a double flush. When you call fsync(), it calls ext4_force_commit(), and we do a the equivalent of a blkdev_issue_flush() today (which is what happenes when you do a submit_bh(WRITE_BARRIER, bh), which is what setting set_ordered_mode(bh) ends up causing. - Ted --
ISTR fsync() on ext3 did not always force a commit, if in-place data writes did not change any metadata. Has this been fixed in ext4 but not ext3 then? Does WRITE_BARRIER always cause a flush? It does not have to according to Documentation/block/barrier.txt. There are caveats about tagged queuing "not yet implemented" in the text, but can we rely on that? The documentation is older than the current implementation; those caveats might no longer apply. -- Jamie --
I think that might still be true, but I'm still looking through it (in the background...) I tried blktrace to see what was going on but I'm not sure what an "NB" in the RWBS field means, anyone know? -Eric --
Eric already knows this now, but for the benefit of anyone else that may be curious - it's an empty (data-less) barrier. 'N' is basically 'no data' (eg not a read nor a write) and 'B' is barrier. -- Jens Axboe --
It does, if you use ordered tags then that assumes write through caching (or ordered tag + drain + flush after completion). -- Jens Axboe --
Oh. That's really unclear from the opening paragraph of barrier.txt, which _defines_ what I/O barriers are for, and does not mention flushing: I/O barrier requests are used to guarantee ordering around the barrier requests. Unless you're crazy enough to use disk drives for implementing synchronization constructs (wow, sounds interesting...), the ordering is meaningful only for write requests for things like journal checkpoints. All requests queued before a barrier request must be finished (made it to the physical medium) before the barrier request is started, and all requests queued after the barrier request must be started only after the barrier request is finished (again, made it to the physical medium). So I assumed the reason flush is talked about later was only because most devices don't offer an alternative. Later in barrier.txt, in the section about flushing, it says: the reason you use I/O barriers is mainly to protect filesystem integrity when power failure or some other events abruptly stop the drive from operating and possibly make the drive lose data in its cache. So, I/O barriers need to guarantee that requests actually get written to non-volatile medium in order Woa! Nothing about flushing being important, just "to guarantee ... in order". Thus flushing looks like an implementation detail - all we could do at the time. It does not seem to be the _point_ of WRITE_BARRIER (according to the text), which is to ensure journalling integrity by ordering writes. Really, the main reason I was confused was that I imagine some SCSI-like devices letting you do partially ordered writes to write-back cache - with their cache preserving ordering constraints the same way as some CPU or database caches. (Perhaps I've been thinking about CPUs too long!) Anyway, moving on.... Let's admit I am wrong about that :-) And get back to my idea. Ignoring actual disks for a moment ( ;-) ), there are some I/O scheduling ...
It may not mention flushing explicitly, but it does not have to since
the flushing is one way to implement what the above describes. Note how
it says that request must have made it to physical medium before
allowing others to continue? That means you have to either write through
Yeah, that is precisely what it is and why it does not mention flushing
Right, that is what ordered tags give you. But if you have write back
caching enabled, then you get a completion event before the data is
actually on disk. Perhaps that is OK for some cases, perhaps not. The
Linux barrier implementation has always guarenteed that the data is
actually on platter before considering a barrier write done, as
There are some good ideas in there, I'll punt that to the fs people.
Ignoring actual disk drives makes things easier :-). While SCSI has
ordered IO with write back caching, SATA does not. You basically have to
drain and flush there, or use one of the other variants for getting data
in disk - see blkdev.h:
/*
* Hardbarrier is supported with one of the following methods.
*
* NONE : hardbarrier unsupported
* DRAIN : ordering by draining is enough
* DRAIN_FLUSH : ordering by draining w/ pre and post flushes
* DRAIN_FUA : ordering by draining w/ pre flush and FUA
* write
* TAG : ordering by tag is enough
* TAG_FLUSH : ordering by tag w/ pre and post flushes
* TAG_FUA : ordering by tag w/ pre flush and FUA write
--
Jens Axboe
--
On Fri, 16 May 2008 14:02:46 -0500 Last time this came up lots of workloads slowed down by 30% so I dropped the patches in horror. I just don't think we can quietly go and slow everyone's machines down by this much. The overhead of journalling is already pretty horrid. If we were seeing a significant number of "hey, my disk got wrecked" reports which attributable to this then yes, perhaps we should change the default. But I've never seen _any_, although I've seen claims that others have seen reports. There are no happy solutions here, and I'm inclined to let this dog remain asleep and continue to leave it up to distributors to decide what their default should be. Do we know which distros are enabling barriers by default? --
I actually did a bit of research and found the old thread, honestly. I But if journali[zi]ng guarantees are thrown out the window by volatile caches on disk, why bother with the half-solution? Slower while you run, worthless when you lose power? Sounds like the worst of both Hm, how would we know, really? What does it look like? It'd totally depend on what got lost... When do you find out? Again depends what you're doing, I think. I'll admit that I don't have any good evidence of my own. I'll go off and do some plug-pull-testing and a benchmark or two. But, drive caches are only getting bigger, I assume this can't help. I have a hard time seeing how speed at the cost of correctness is the SuSE does (via patch for ext3). Red Hat & Fedora don't, and install by default on lvm which won't pass barriers anyway. So maybe it's hypocritical to send this patch from redhat.com :) And as another "who uses barriers" datapoint, reiserfs & xfs both have them on by default. I suppose alternately I could send another patch to remove "remember that ext3/4 by default offers higher data integrity guarantees than most." from Documentation/filesystems/ext4.txt ;) -Eric --
On Fri, 16 May 2008 15:53:31 -0500 Yeah, it's all so handwavy. The only thing which isn't handwavy is We could add a big scary printk at mount time and provide a document? --
Can I suggest making /proc/mounts say "barrier=0" when journal is not enabled, instead of omitting the option. Boot logs are too large to pay close attention to unless it's really obvious. (2.4 kernels _do_ have a similar message about "data integrity not guaranteed" with USB drivers - I never understood what it was getting it, and why it was removed for 2.6). However, if I saw barrier=0 in /proc/mounts it would at least prompt me to look it up and then making an informed decision. Personally I had assumed barriers were enabled by default with ext3, as some distros do that, the 2.4 patches did that, and: I *have* experienced corruption following power loss without barriers, and none with barriers. When I mentioned that turning off write cache or using barriers is a solution to a programmer working on the same project, she said "oh, yes, we've had reports of disk corruption too - thanks for the advice", and the advice worked, so I am not the only one. (In the interests of perspective, that's with ext3 on patched 2.4 kernels on a ARM device, but still - the barriers seem to work). On a related note, there is advice floating about the net to run with IDE write cache turned off if you're running a database and care about integrity. That has much worse performance than barriers. I guess the patch which fixes fsync is particularly useful for those database users, as it means they can run with write cache enabled and depend on fsync() to give equivalent integrity now. (Enabling journalling is not really relevant to this). -- Jamie --
Right now, ext3_show_options has the scheme: /* * Show an option if * - it's set to a non-default value OR * - if the per-sb default is different from the global default */ so only non-default is shown, so today barrier=0 is not shown. I suppose that could be changed... FWIW, my patch would show barrier=0 if it's manually mounted that way (against new proposed defaults), or if we are running w/o barriers due ... and I've seen hand-waving about shortened drive life running this way? but who really knows.... Thanks, --
Yes, I suggest that, even if barrier=0 remains the default. I suggest showing barrier=1 no matter what the default, too, since if the default is changed, no option will become ambiguous in bug reports, Speaking of failed barrier I/O, it should be possible to fall back to "disable cache, write commit block, enable cache" on old drives which I've no idea. It makes sense: disabling write cache increases the number of seeks for many loads. In any case, with barriers implemented properly you get basically the same level of integrity as disabling the write cache, without the substantial performance hit. -- Jamie --
You have to pull the plug quite a lot, while there is data in write cache, and when the data is something you will notice later. Checking filesystem is hard. Something systematic would be good - for which you will want an electronically controlled power switch. I have seen corruption which I believe is from lack of barriers, and hasn't occurred since I implemented them (or disabled write cache). But it's hard to be sure that was the real cause. If you just want to test the block I/O layer and drive itself, don't use the filesystem, but write a program which just access the block device, continuously writing with/without barriers every so often, and after power cycle read back to see what was and wasn't written. I think there may be drives which won't show any effect - if they have enough internal power (from platter inertia) to write everything in the cache before losing it. If you want to test, the worst case is to queue many small writes at seek positions acrosss the disk, so that flushing the disk's write cache takes the longest time. A good pattern might be take numbers 0..2^N-1 (e.g. 0..255), for each number reverse the bit order (0, 128, 64, 192...) and do writes at those block positions, scaling up to the range of the whole disk. The idea is if the disk just caches the last few queued, they will always be quite spread out. However, a particular disk's cache algorithm may bound the write cache size by predicted seek time needed to flush it, rather than bytes, I agree. The MacOS X folks decided that speed is most important for fsync(). fsync() does not guarantee commit to platter. *But* they added an fcntl() for applications to request a commit to platter, which SQLite at least uses. I don't know if MacOS X uses barriers for filesystem SuSE did the same for 2.4: they patched the kernel to add barriers to 2.4. (I'm using an improved version of that patch in my embedded Neither did they patch for barriers in 2.4, but we can't hold ...
Right, that was the plan. I wasn't really going to stand there and pull the plug. :) I'd like to get to "out of $NUMBER power-loss events Well, I think it is worth testing through the filesystem, different ... and those with flux capacitors. ;) I've heard of this mechanism but I'm not sure I believe it is present in any modern drive. Not sure I suppose we could go about it 2 ways; come up with something diabolical and try very hard to break it (I think we know that we can) or do something more realistic (like untarring & building a kernel?) and see FWIW... md also only does it on raid1... but lvm with a single device Well it all depends on what you're testing (and the hardware you're testing it on). Between ext3 & xfs you can find plenty of tests which will show either one or the other as faster. And most benchmark results It certainly does come into play in benchmarking scenarios... -Eric --
I agree, but intentional tests on the block device will show the drives characteristcs on power failure much sooner and more If you do the large-seek write pattern I suggested, and timing confirms the drive is queueing many of them in cache, it will be I would suggest one of the metadata intensive filesystem tests, Fwiw, if it were implemented in MD, the difference between barriers and flushes could be worth having for performance, even when underlying devices implement barriers with flush. It would be good, especially for MT, to optimising away those operations in unnecessary cases on underlying device request queues, as well as the main MD queue. An example is WRITE-BARRIER, usually implemented as FLUSH, WRITE, FLUSH, can actually report completion when the WRITE is finished, and doesn't need to issue the second FLUSH at all for a long time, until there's a subsequent WRITE on that drive (and on the same partition, fwiw.) I'm increasingly thinking that decomposing WRITE-BARRIER to three requests, FLUSH+WRITE+FLUSH, should be done at the generic I/O request level, because that is the best place to optimise away, merge, or delay unnecessary flushes, and to relax request ordering around them if we ever do that. (BARRIER would remain only as an op for extra performance only on drivers which can do barriers another way; many drivers would never see or handle it). Since we need FLUSH for proper fsync() anyway, that would simplify drivers too. Then the MD patch could just implement FLUSH, which is probably simpler than implementing WRITE-BARRIER. -- Jamie --
I suspect the real reason why we get away with it so much with ext3 is that the journal is usually contiguous on disk, hence, when you write to the journal, it's highly unlikely that commit block will be written and the blocks before the commit block have not. In addition, because we are doing physical block journalling, it repairs a huge amount of damage during the journal replay. So as we are writing the journal, the disk drive sees a large contiguous write stream, followed by singleton writes where the disk blocks end up on disk. The most important reason, though, is that the blocks which are dirty don't get flushed out to disk right away! They don't have to, since they are in the journal, and the journal replay will write the correct data to disk. Before the journal commit, the buffer heads are pinned, so they can't be written out. After the journal commit, the buffer heads may be written out, but they don't get written out right away; the kernel will only write them out when the periodic buffer cache flush takes place, *or* if the journal need to wrap, at which point if there are pending writes to an old commit that haven't been superceded by another journal commit, the jbd layer has to force them out. But the point is this is done in an extremely lazy fashion. As a result, it's very tough to create a situation where a hard drive will reorder write requests aggressively enough that we would see a potential problem. I suspect if we want to demonstrate the problem, we would need to do a number of things: * create a highly fragmented journal inode, forcing the jbd layer to seek all over the disk while writing out the journal blocks during a commit * make the journal small, forcing the journal to wrap very often * run with journal=data mode, to put maximal stress on the journal * make the workload one which creates and deletes large number of files scattered all over the directory hierarchy, so that we limit the number of blocks which are ...
yup. Plus with a commit only happening once per few seconds, the time window for a corrupting power outage is really really small, in relative terms. All these improbabilities multiply. Journal wrapping could cause the commit block to get written before its data blocks. We could improve that by changing the journal layout a bit: wrap the entire commit rather than just its tail. Such a change might be intrusive though. The other possible scenario is when the fs thinks the metadata blocks have been checkpointed, so it reuses the journal space, only they weren't really checkpointed. But that would require that we'd gone what he said. --
Or we backport jbd2's checksum support in the commit block to jbd;
with the checksum support, if the commit block is written out-of-order
with the rest of the blocks in the commit, the commit will simply not
be recognized as valid, and we'll use the previous commit block as the
last valid commit.
So with ext4, the only problems we should actually have w/o a barrier
are:
* Writes that were supposed to happen only *after* the journal
commit is written get reordered *before* the journal commit.
(But normally these writes, while _allowed_ after a journal
commit are not forced by the kernel.)
* In data=ordered mode, data blocks that *should* have been
written out before the journal commit, get reordered until
*after* the journal commit.
And in both cases, where the crash has to happen sometime *very*
shortly after commit record has been forced out.
Thinking about this some more, the most likely way I can think of some
problems happening would be an unexpected power failure that happened
exactly right as an unmount (or filesystem snapshot) was taking place.
That's one of the few cases I can think of where, a journal commit
write is followed immediately by the metadata writes. And in
data=ordered more, that sequence would be data writes, followed
immediately by journal writes, followed immediately by metadata
writes.
So if you want to demonstrate that this really *could* happen in real
life, without an artificially contrived, horribly fragmented journal
inode, here's a worst case scenario I would try to arrange.
(1) Pre-fill the disk 100% with some string, such as "my secret love
letters".
(2) In data=ordered mode, unpack a very large tarball, ideally with
the files ad directory ordered maximally pessimized so that files are
created in directory #1, then directory #6, then directory #4, then
directory #1, then directory #2, etc. (AKA the anti-Reiser4 benchmark
tarball, because Hans would do kernel unpack ...The jbd2 journal checksum support, along with the 64-bit journal block layout is designed to be 100% compatible with jbd, so it would be trivial to "back port" jbd2. Just copy jbd2 into jbd, and add the option to set (or set by default) the JBD_FEATURE_COMPAT_CHECKSUM feature in the journal. I was going to suggest complete removal of jbd2 at that point, but the ordered-mode rewrite is still pending and should live in ext4 for a reasonably long while before going back to ext3. Cheers, Andreas -- Andreas Dilger Sr. Staff Engineer, Lustre Group Sun Microsystems of Canada, Inc. --
Never mind, I was confused when I wrote this; I somehow thought we were setting ordered mode on a per queue basis, instead of on a per-buffer-head basis. Also, looking more closely on the jbd2 implementation, it looks like using the async_commit option, which relies on the checksum for more efficient commit, completely disables any barrier support. That's because the only place we go into ordered more is if we are writing a synchronous journal commit. If async journal commit is enabled, then we don't write a barrier at all, which leaves us in potential trouble with if data blocks end up getting reordered with respect to the journal commit in data=ordered more. I *think* what we need to do is to issue an empty barrier request between the data blocks and the journal writes in data=ordered mode, and still issue a WRITE_BARRIER request when writing the commit block, but to not actually wait for the write to complete. I think if we do that, we should be safe, and hopefully by not waiting for the commit block to complete, the performance hit shouldn't be as bad as previously reported. - Ted --
That is a great optimisation to make the filesystem safe against power fails without barriers. Upon replay, the filesystem is consistent. But does it break fsync()? Consistency isn't enough for fsync(). Returning from fsync() means "this data is committed now". -- Jamie --
Well, the barriers happen like so (even if we actually only do one barrier in submit_bh, it turns into two) write log blocks flush #1 write commit block flush #2 write metadata blocks I'd agree with Ted, there's a fairly small chance of things get reordered around flush #1. flush #2 is likely to have lots of reordering though. It should be easy to create situations where the metadata for a transaction is written before the log blocks ever see the disk. EMC did a ton of automated testing around this when Jens and I did the initial barrier implementations, and they were able to trigger corruptions in fsync heavy workloads with randomized power offs. I'll dig up the workload they used. -chris --
True, but even with a very heavy fsync() workload, a commit doesn't cause the metadata blocks to be written until we have to do a journal truncate operation. A heavy fsync() workload would increase how quickly we would use up the journal and need to do a journal truncate, I could imagine a mode which forces a barrier operation for commits triggered by fsync()'s, but not commits that occur due to a natural closing of transactions. I'm not sure it's worth it, though, since many of the benchmarks that we care about (like Postmark) do use fsync() fairly heavily. The really annoying thing is that what is really needed is a way to make write barriers cheaper; we don't need to do a synchronous flush, but unfortunately for most drives there isn't any other way of keeping disk writes from getting reordered. - Ted --
The workload we used was to run our existing Centera application on a rack of boxes. The application is a bit special in that it does a digital signature on each file and never returns success for the client until an fsync is done on the server (kind of like synchronous NFS). What we did for our test was to pound away on a rack of these boxes (say 32 boxes, each with 4 large ATA or S-ATA drives) and then drop power to the whole rack. All of our data file systems were reiserfs, some of the system partitions were ext2. The test would be marked as passed if we could reboot all of the boxes and have the client validate that the digital signature of all files written and ack'ed were valid. We also looked for issues seen during the reboot (fsck grumbles, corrupted directories, etc). I didn't run the tests personally, but seem to recall that without barriers we routinely saw file system corruption on that reboot. The hard thing is to figure out how to test this kind of scenario without dropping power. To expose the failure mode, it might be sufficient to drop power to a drive with smartctl (or, if you have hot swap bays, just pull them). Just a personal note, my last day at EMC was this past Friday. Monday, I start working for Red Hat (focused on file systems) so I will have to figure out to get this kind of test going without all of my big EMC toys ;-) ric --
Fwiw, you can test correctness by running a nested VM-in-VM guest which simulates disk write cache and flush operations (which flush to the host kernel). I believe recent QEMU/KVM has this as an option. Data written by the innermost guest will reside in the middle kernel's write cache. Disk cache flush commands from the innermost kernel will cause the middle kernel to write dirty sectors to the outer kernel. Killing the outermost host process is roughly equivalent to pulling the plug on a real machine, but faster and without hurting real hardware. This might not tell you much about performance, but you should be able to run a lot of repeatable filesystem barrier tests this way. -- Jamie --
I wonder, does sending any kind of reset commands to drives cause them to discard their write cache? If the hot swap power switch is accessible from software, that would be a nice trick :-) -- Jamie --
Except when it crosses an MD disk boundary. Then it's really likely. We could also ask if there's _any_ possibility, when they are a merged single I/O, of them not getting written in the expected order? What about when FUA is set, does that imply any order? But it's all moot: Checksumming is the way forward here, no doubt. Checksumming makes the multi-sector write "atomic or corrupt". That's the same expectation as a commit sector provides by itself, but generalised to the whole journal entry. -- Jamie --
I'm not sure how good such exact numbers would do. Surely power down behaviour that would depend on the exact disk/controller/system combination? Some might be better at getting data out at power less, some might be worse. To get a good picture, you would probably need to do such tests over a wide range of systems and put all the data together, but even then you couldn't be sure you covered all well. For a distributor it would probably make more sense to do such tests as part of system certification and then modify the defaults to match all tested systems. But that doesn't really work for good mainline defaults. I suspect for mainline the only good default is to be very conservative. -Andi --
Given how rarely people have reported problems, I think it's a really good idea to understand what exactly our exposure is for $COMMON_HARDWARE. And I suspect the biggest question isn't the hardware, but the workload. Here are the questions that I think are worth asking: * How often can we get corruption on a common desktop workload? Given that we're mostly kernel developers, and kernbench is probably worst case for desktops, that's useful. * What is the performance hit on a common desktop workload (let's use kernbench for consistency). * How often can we get corruption on a hard-core enterprise application with lots of fsync()'s? (i.e., postmark, et. al) * What is the performance hit on a an fsync()-heavy workload? I have a feeling that the likelihood of corruption when running kernbench is minimal, but the performance hit is probably minimal as well. And that the corruption for potential is higher for an fsync-heavy workload, but that's also where we are seeing the (reported) 30% hit. The other thing which we should consider is that I suspect we can do much better for ext4 given that we have journal checksums. As Chris pointed out, right now, with barriers turned on, we are doing this: write log blocks flush #1 write commit block flush #2 write metadata blocks If we don't mind mixing bh and bio functions, we could change it to this for ext4 (when journal checksumming is enabled) write log blocks write commit block flush (via submitting an empty barrier block I/O request) write metadata blocks This should hopefully reduce the performance hit by half, since we're eliminating one of the flushes. Even more interesting would be moving the flush until right before we attempt to write the metadata blocks, and allowing data writes which don't require metadata updates through. That should be safe, even in data=ordered mode. The point is we should think about ways that we can optimize barrier mode for ext4. If we do this, then it may be that ...
Theodore Tso wrote: I'll propose that very close to 0% of users will ever report "having barriers off seems to have corrupted my disk on power loss!" even if that's exactly what happened. And it'd be very tricky to identify in a post-mortem. Instead we'd probably see other weird things caught down the road during some later fsck or during filesystem use, and then suggest that they go check their cables, run memtest86 or something... Perhaps it's not the intent of this reply, Ted, but various other bits of this thread have struck me as trying to rationalize away the problem. If the discussion were about proper locking to avoid corruption, would we really be saying well, gosh, it's a *really* small window, and *most* people won't hit it very often, and proper locking would slow things down.... So I think that as you suggest, looking for ways to make barriers less painful is the far better route, rather than sacrificing correctness for speed by turning them off by default when we know there is a chance for problems. People running journaling filesystems most likely expect to be safe from this sort of thing, not most of the time, but all of the time. -Eric --
Not really. It's a matter of understanding how big the problem is. We know what the cost of the solution is, and it's really large. It's a tradeoff, and it is unobvious where the ideal answer lies, If it slowed really really important workloads by 30% then we'd be running around with our hair on fire fixing that up. But fixing this one is nowhere near as easy as fixing some locking Well. Reducing the cost would of course make the decision easy. --
I think one mistake we (myself included) have made all along with the barrier
code is intermixing discussions about the cost of the solution with
discussions about needing barriers at all. Everyone thinks the barriers are
slow because we also think running without barriers is mostly safe.
Barriers are actually really fast, at least when you compare them to running
with the writecache off. Making them faster in general may be possible, but
they are somewhat pushed off to the side right now because so few people are
running them.
Here's a test workload that corrupts ext3 50% of the time on power fail
testing for me. The machine in this test is my poor dell desktop (3ghz, dual
core, 2GB of ram), and the power controller is me walking over and ripping
the plug out the back.
In other words, this is not a big automated setup doing randomized power fails
on 64 nodes over 16 hours and many TB of data. The data working set for this
script is 32MB, and it takes about 10 minutes per run.
The workload has 4 parts:
1) A directory tree full of empty files with very long names (160 chars)
2) A process hogging a significant percent of system ram. This must be
enough to force constant metadata writeback due to memory pressure, and is
controlled with -p size_in_mb
3) A process constantly writing, fsyncing and truncating to zero a single 64k
file
4) A process constantly renaming the files with very long names from (1)
between long-named-file.0 and long-named-file.1
The idea was to simulate a loaded mailserver, and to find the corruptions by
reading through the directory tree and finding files long-named-file.0 and
long-named-file.1 existing at the same time. In practice, it is faster to
just run fsck -f on the FS after a crash.
In order to consistently cause corruptions, the size of the directory from
(1) needs to be at least as large as the ext3 log. This is controlled with
the -s command line option. Smaller sizes may work for the impatient, ...Here's a new version that still gets about corruptions 50% of the time, but does it with fewer files by using longer file names (240 chars instead of 160 chars). I tested this one with a larger FS (40GB instead of 2GB) and larger log (128MB instead of 32MB). barrier-test -s 32 -p 1500 was still able to get a 50% corruption rate on the larger FS. -chris
Hmm, this is worse than I'd have expected :( If it is that bad, I think we should really enable them by default... I can give your script a try on my test machine when I get back (which is next week). Honza -- Jan Kara <jack@suse.cz> SuSE CR Labs --
That would be great, I'd like to confirm that my machine isn't the only one on the planet so easily broken ;) I was also able to trigger corruptions on XFS (one run out of two), so it is unlikely I'm seeing bugs in the ext3 replay or logging code. -chris --
Hi Chris, Just to clarify, was this test with barriers on or off for XFS? I'm wondering if it was with barriers on, then we have a reproducible bug in log replay. Or whether you were just confirming the problems with barriers off on another filesystem. Thanks muchly, Tim. --
On Monday 19 May 2008, Timothy Shimmin wrote: Barriers were off on xfs, the corruptions were not xfs' fault. -chris --
I ran this twice, killing power after 'renames ready'. The first time it was fine, the second time I got: centera:~/e2fsprogs-1.40.9/e2fsck # ./e2fsck -f /dev/sdb1 e2fsck 1.40.9 (27-Apr-2008) /dev/sdb1: recovering journal Pass 1: Checking inodes, blocks, and sizes Pass 2: Checking directory structure Problem in HTREE directory inode 2395569: node (281) has bad max hash Problem in HTREE directory inode 2395569: node (849) has bad max hash Problem in HTREE directory inode 2395569: node (1077) has bad max hash Problem in HTREE directory inode 2395569: node (1718) has bad max hash Problem in HTREE directory inode 2395569: node (4609) has bad max hash Problem in HTREE directory inode 2395569: node (4864) has bad max hash Problem in HTREE directory inode 2395569: node (5092) has bad max hash Problem in HTREE directory inode 2395569: node (5148) has bad max hash Problem in HTREE directory inode 2395569: node (5853) has bad max hash Problem in HTREE directory inode 2395569: node (7588) has bad max hash Problem in HTREE directory inode 2395569: node (8663) has bad max hash Invalid HTREE directory inode 2395569 (/barrier-test). Clear HTree index<y>? yes Pass 3: Checking directory connectivity Pass 3A: Optimizing directories Duplicate entry '0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a0650419c70f3beadaa9a2c2f4999745a9c02066a.0' in /barrier-test (2395569) found. Clear? yes and tons of 'duplicate entry' errors after that. And then Pass 4: Checking reference counts Inode 168 ref count is 0, should be 1. Fix? yes Unattached zero-length inode 255. Clear? yes Inode 1221 ref count is 0, should be 1. Fix? yes Inode 1253 ref count is 1, should be 2. Fix? yes Inode 2692 ref count is 0, should be 1. Fix? yes Inode 3465 ref count is 0, should be 1. Fix? yes and lots of those too. So definitely easy to trigger. Test ...
Great, thanks Jens. So, one compromise may be to change the barriers on ext3 to look like the patch Ted just sent out for ext4. It should be mostly safe to skip the barrier between the log blocks and the commit block since the drive is likely to do those sequentially anyway. A little extra logic could be added to detect log wrapping and force an extra barrier in that case. Reiserfs saw some significant performance gains when I changed the code from: write log blocks barrier wait on log blocks write commit barrier wait on commit to write log blocks barrier write commit barrier wait on all of them Both were tested with the great big emc power failure machine and both passed. In the event of an IO error on log blocks, we should zero out the commit. -chris --
Ok, Andrew, is this enough to get barrier patch applied and stop corrupting data in default config, or do you want some more testing? I guess 20% benchmark regression is bad, but seldom and impossible to debug data corruption is worse... Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
It is for me; I think we have to enable barriers for ext3/4, and then
work to improve the overhead in ext4/jbd2. It's probably true that
the vast majority of systems don't run under conditions similar to
what Chris used to demonstrate the problem, but the default has to be
filesystem safety. People who are sure they are extremely unlikely to
run into this problem can turn barriers off (and I suspect they won't
see that much difference in the most common workloads anyway).
- Ted
--
It is 20%? I recall 30% from a few years ago, but that's vague and it might have changed. Has much quantitative testing been done recently? I might have missed it. If we do make this change I think it should be accompanied by noisy printks so that as many people as possible know about the decision which we just made for them. afaik there is no need to enable this feature if the machine (actually the disks) are on a UPS, yes? --
As long as your power supply (or your UPS) doesn't go boom, I suppose so. It is too bad that there is no way to determine no-barrier safety from software. (maybe apcupsd could do something... ;) I guess it's levels of confidence. I agree that a user education campaign is probably in order... maybe if this thread is long enough to make LWN it'll raise some awareness. :) -Eric --
Yes, as long as you're confident that there won't be a kernel bug/regression causing a lockup while the server is under severe memory pressure while doing lots of fsync's, file creations, renames, etc. And as long as your 100% confident that UPS's will never fail, janitors will never accidentally hit the Emergency Power Office switch, and so on. - Ted --
That won't cause corruption, will it? The problem is purely one of Well. The product of two very small numbers is... --
Another unquantified very small number. You have a UPS, but the first time you discover it's not wired up properly is when it fails. This happened recently in a customer's office... The UPS power wiring melted, leaving a big stinking hole, and all the servers went down - quickly. Another is misconfiguring the UPS daemon so it continues writing to disk for 30 minutes until the battery runs down... Like the chance of two disks going wrong in a RAID at the same time... that would *never* happen would it? *tip-toes out of the room* -- Jamie --
Can a kernel lockup cause this kind of corruption? Will a system reboot wipe the disk's write cache? I had imagined only power loss would prevent the disk from writing it's cache eventually; is that wrong? -- Jamie --
I favour saying barrier=1 in /proc/mounts, even though it will become the new default. At least for a while. -- Jamie --
Nice, htree as a canary for disk corruption. This makes sense since directory data is the only verifiable structure at the logical data level and htree offers the only large scale, verifiable structure. Thanks for the lovely test methodology example. Let me additionally offer this tool: http://code.google.com/p/zumastor/source/browse/trunk/ddsnap/tests/devspam.c?r=1564 devspam The idea is to write an efficiently verifiable pattern across a range of a file, including a mix of position-dependent codes and a user supplied code. In read mode, devspam will check that the position dependent codes are correct and match the user supplied code. This can be easily extended to a "check that all the user supplied codes are the same" mode, which would help detect consistency failure in regular data files much as htree does with directories. Hmm, this probably wants to incorporate a sequence number as well, to detect corruption under a random block update load as you have triggered with htree. I used this tool to exorcise the majority of bugs in ddsnap. It is a wonderful canary, not only catching bugs early but showing where where they occurred. From what I have seen, Sun seems to rely mostly on MD5 checksums for detecting corruption in ZFS. We should do more of that too. Regards, Daniel --
Ooh. Might it be possible for fsck to identify certain problems as likely caused by power loss in misordered writes? If metadata sectors had an update generation number, to be compared with journal entries, perhaps that would be more feasible. -- Jamie --
Out of curiosity, exactly *what* semantics did MacOS X give fsync(), then? Did it simply start the process of staging writes to disk, but not wait for the writes to hit the platter before returning? That's basically the equivalent of ext3's barrier=0. - Ted --
I haven't read the code and don't use MacOS myself.
From its fcntl() man page:
Note that while fsync() will flush all data from the host to the
drive (i.e. the "permanent storage device"), the drive itself may
not physically write the data to the platters for quite some time
and it may be written in an out-of-order sequence.
Specifically, if the drive loses power or the OS crashes, the
application may find that only some or none of their data was
written. The disk drive may also re-order the data so that later
writes may be present while earlier writes are not.
This is not a theoretical edge case. This scenario is easily
reproduced with real world workloads and drive power failures.
For applications that require tighter guarantess about the
integrity of their data, MacOS X provides the F_FULLFSYNC
fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered
data to permanent storage. Applications such as databases that
require a strict ordering of writes should use F_FULLFSYNC to
ensure their data is written in the order they expect. Please see
fcntl(2) for more detail.
Some notable things:
1. Para 2 says "if the drive loses power __or the OS crashes__".
Does this mean some drives will abandon cached writes when reset
despite retaining power?
2. Para 3 to be re-read by the skeptical.
3. Para 4 perpetuates the confused idea that write ordering is what
it's all about, for things like databases. In fact, sometimes
ordering barriers are all that's needed and flush is unnecessary
performance baggage. But sometimes an fsync() which only
guarantees ordering is insufficient. An "ideal"
database-friendly block layer would offer both.
I doubt if common unix mail transports use F_FULLSYNC on Darwin
instead of fsync(), before reporting a mail received safely, but they
probably should. I recall SQLite does use it (unless I'm confusing
it ...As far as I know that made its way first into MySQL 4.1.9 (2005-01), then into SQLite (2005-02), then into PostgreSQL (2005-04), so the database coders all picked that up around the same time and have been using it for years now. http://www.mail-archive.com/sqlite-users@sqlite.org/msg06502.html shows part of that timeline, and even includes some useful comments on the underlying Darwin implementation from that mailing list. That suggests one reason Apple is able to make this work is that their selection process for hard drives requires the drive honors the request to flush its cache they send. The implied caveat there is that even their F_FULLSYNC won't necessarily work with external Firewire/USB drives where that's out of their control. -- * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD --
Apart from batching larger journal commits, I thought of another way which might improve write performance with barriers. Basically, right now a barrier prevents any write I/Os from being moved before or after it in the elevator, as well as issuing commands itself. It may be that relaxing barriers to allow moving writes when it's fine will allow the elevator to schedule writes to reduce seeking? A clear example of when it's fine is data=ordered, overwriting data in a file. Those data writes don't have to be ordered against other filesystem activity which is triggering journal commits. So they could be moved around barrier ops, if the elevator deems that would reduce seeks or to coalesce ops. Similar arguments apply to moving writes past fsync() flushes, but the rules are a bit different. -- Jamie --
There used to be a couple of reports of people doing some write workload use automatic power switch to reset turn on again repeat for 24 hours getting consistent corruption after some time <24hours and those SUSE did for some releases, although reiserfs used to be the default there (but that one uses barriers too if the disk supports them) BTW for a lot of distros which default to LVM install it won't make any difference because DM right now doesn't pass barriers through. I have a patch queued for that for the one-disk case at least, but it didn't make .26-rc. -Andi --
I didn't think checksumming avoided barriers completely. Just the barrier before the commit block, not the barrier after. -chris --
Funny thing, I was looking in this over the weekend. It currently
avoids barriers entirely if journal_async_commit is enabled (which is
not the default); if enabled, it effectively forces barrier=0. This
is IMHO a bug.
I'm working on a patch where "barrier=1" will use a barrier before and
after, and "barrier=1,journal_async_commit" will use a barrier after.
- Ted
--
And here it is.... Comments, please. And Eric, if you have a chance to benchmark this to see how this patch works out, comparing "barrier=0, "barrier=1", and "barrier=1,journal_async_commit", as we had discussed earlier on the ext4, I'd be very much obliged. As I mentioned earlier, adding blkdev_issue_flush() to ext[34]/fsync.c is I believe not necessary. We should be doing the right thing in the commit.c file. In the future, if we want some extra bonus points, in the case where writes have taken place to the inode, but no metadata operations have taken place (the common case when a database is writing to a pre-existing tablespace), it would be nice if fsync() could notice this case, force out the datablocks the old-fashioned way without forcing an journal commit, and then calling blkdev_issue_flush(). That should give us some nice performance wins for database workloads; unfortunately it probably won't do us any good on mailserver workloads. Regards, - Ted From ce99d82266d003e9b2284a1f46bd6f0e3a87c066 Mon Sep 17 00:00:00 2001 From: Theodore Ts'o <tytso@mit.edu> Date: Mon, 19 May 2008 22:40:52 -0400 Subject: [PATCH] ext4: Fix use of write barrier in commit logic We were previously using set_buffer_ordered() and then calling submit_bh() to write the commit block, which resulted in this flush behavior: write log blocks blkdev_issue_flush() // flush #1 write commit block blkdev_issue_flush() // flush #2 write metadata blocks And if the journal_async_commit mount option was used, the commit was written without using set_buffer_ordered(), which resulted in no flushes happening at all. Change the commit code to use blkdev_issue_flush() explicitly, and to omit flush #1 if journal_async_commit is enabled, but to always perform the second flush if write barriers are enabled. This change should hopefully reduce the overhead of using write barriers when an ext4 filesystem is mounted using "barrier=1,journal_async_commit", while still maintaining ...
Last time I looked, the database write + fsync case did not result in a journal commit in some cases, and therefore no blkdev_issue_flush(). The problem was one of correctness. Has this been fixed? A database had no way to issue a hard disk barrier in these cases without doing weird stuff like forcing metadata changes prior to fsync (e.g. toggling permissions bits), which causes an intolerable two disk seeks per fsync. That is _way_ expensive. There is also no way in ext3 to _just_ fdatasync() (no metadata even if it has changed), with disk barrier/flush. Imho a good place to call blkdev_issue_flush() is in the VFS, after it's written all the data blocks, unless the filesystem has a better override. That would work with most filesystems automatically. Request queue optimisations may trivially remove redundant flushes. -- Jamie --
Just a couple of comments about barriers that might be worth throwing in the mix. From what I have seen, running with barriers is almost always a win over running with write cache disabled for medium to large files (large files with a S-ATA drive go about twice as fast in my testing). For very small files, running with the barrier or disabling the write cache are a lot closer in performance. When we are looking for ways to batch, it is also critical to keep in mind the latency of the storage. The current ext3 transaction batching code makes running multi-threaded workloads on high speed media (like non-volatile disk arrays) run much slower. (Josef had some patches which helped fix this in a similar way that XFS deals with this.) One other note is that moving to a FLASH based device will not make the issue of barriers go away since (most? all?) FLASH parts you are likely to see have their own write cache which is used to buffer up writes in order to make the erase cycle less painful. That means that we have a fairly large *volatile* write cache which needs to be flushed/controlled just like we do with S-ATA/SCSI/etc ;-) ric --
A little optimisation note.
You don't need the barrier after in some cases, or it can be deferred
until a better time. E.g. when the disk write cache is probably empty
(some time after write-idle), barrier flushes may take the same time
as NOPs.
This sequence:
#1 write metadata to journal
#1 write commit block (checksummed)
BARRIER
#1 write metadata in place
... time passes ...
#2 write metadata to journal
#2 write commit block (checksummed)
BARRIER
#2 write metadata in place
... time passes ...
#3 write metadata to journal
#3 write commit block (checksummed)
BARRIER
#3 write metadata in place
Can be rewritten as:
#1 write metadata to journal
#1 write commit block (checksummed)
... time passes ...
#2 write metadata to journal
#2 write commit block (checksummed)
... time passes ...
#3 write metadata to journal
#3 write commit block (checksummed)
... time passes ...
BARRIER (probably instant).
#1 write metadata in place
#2 write metadata in place
#3 write metadata in place
Provided some conditions hold. All the metadata and all the journal
writes being non-overlapping I/O ranges would be sufficient.
What's more, barriers can be deferred past data=ordered in-place data
writes, although that's not always an optimisation.
-- Jamie
--
I hesitate to get too fancy here, if the disk is idle we probably won't notice This is true, and would be a fairly good performance boost. It fits nicely with the jbd trick of avoiding writes of a metadata block if a later transaction has logged it. But, it complicates the decision about when you're allowed to dirty a metadata block for writeback. It used to be dirty-after-commit and it would change to dirty-after-barrier. I suspect that is some significant surgery into jbd. Also, since a commit isn't really done until the barrier is done, you can't reuse blocks freed by the committing transaction until after the barrier, which means changes in the deletion handling code. Maybe I'm a wimp, but these are the two parts of write ahead logging I always It might be really interesting to have a i'm-about-to-barrier-find-some-io-to-run call. Something along the lines of draining the dirty pages when the drive is woken up in laptop mode. There's lots of fun with page lock vs journal lock ordering, but Jan has a handle on that I think. -chris --
I think you're right, but it's hard to be sure. One of the problems with barrier-implemented-as-flush-all is that it flushes data=ordered data, even when that's not wanted, and there can be a lot of data in the disk's write cache, spread over many seeks. Then it's good to delay barrier-flushes to batch metadata commits, but good to issue the barrier-flushes prior to large batches of data=ordered data, so the latter can be survive in the disk write cache for seek optimisations with later requests which aren't yet known. All this sounds complicated at the JBD layer, and IMHO much simpler at Rather than tracking when it's "allowed" to dirty a metadata block, it will be simpler to keep a flag saying "barrier needed", and just issue the barrier prior to writing a metadata block, if the flag is set. So metadata write scheduling doesn't need to be changed at all. That will be quite simple. You might still change the scheduling, but only as a performance heuristic in any way which turns out to be easy. Really, that flag should live in the request elevator instead, where it could do more good. I.e. WRITE_BARRIER wouldn't actually issue a barrier op to disk after writing. It would just set a request elevator flag, so a barrier op is issued prior to the next WRITE. That road opens some nice optimisations on software RAID, which aren't Good point. In this case, re-allocating time isn't the problem: actually writing to them is. Writes to recycled block require to be ordered after commits which recycled them. As above, just issue the barrier prior to the next write which needs to be ordered - effectively it's glued on the front of the write op. This comes for free with no change to deletion code (wow :-) if the only operations are WRITE_BARRIER (= flush before and after or I'm suspecting the opposite might be better. I'm-about-to-barrier-please-move-the-barrier-in-front-of-unordered-writes. The more writes you _don't_ flush synchronously, the ...
Jens and I talked about tossing the barriers completely and just doing FUA for all metadata writes. For drives with NCQ, we'll get something close to optimal because the higher layer elevators are already doing most of the hard work. Either way, you do want the flush to cover all the data=ordered writes, at least all the ordered writes from the transaction you're about to commit. Telling the difference between data=ordered from an old transaction or from the running transaction gets into pushing ordering down to the lower levels Adding explicit ordering into the IO path is really interesting. We toss a bunch of IO down to the lower layers with information about dependencies and let the lower layers figure it out. James had a bunch of ideas here, but I'm afraid the only people that understood it were James and the whiteboard he was scribbling on. The trick is to code the ordering in such a way that an IO failure breaks the chain, and that the filesystem has some sensible chance to deal with all these requests that have failed because an earlier write failed. Also, once we go down the ordering road, it is really tempting to forget that ordering does ensure consistency but doesn't tell us the write actually happened. fsync and friends need to hook into the dependency chain to wait for the barrier instead of waiting for the commit. But, back to the short term for a second, what we need are some benchmarks for barriers on and off and some guidance from the ext34 maintainers about turning them on by default. We shouldn't be pushing this FS integrity decision off on the distros. My test prog is definitely a worst case, but I'm pretty confident that most mail server workloads end up doing similar IO. A 16MB or 32MB disk cache is common these days, and that is a very sizable percentage of the jbd log size. I think the potential for corruptions on power failure is only growing over time. -chris --
Will need benchmarking, but might just work. Still need barriers for reasonable performance + integrity on systems without NCQ, like the kit on my desk. Without barriers I have to turn That's right. You can keep it quite simple: just distinguish barriers from flushes. Or make it more detailed, distinguishing different I think simply distinguishing barrier from flush is relatively Yeah, if you're going to do it _properly_ that's the way :-) But the I/O failure strategy is really a different problem, and shouldn't necessarily distract from barrier I/O elevator scheduling. The I/O failure strategy applies just the same even with no barriers and no disk cache at all. Even there, you want a failed write to True, but fortunately it's quite a simple thing. fsync et al need to ask for flush, nothing else does. You simply don't need consistency from journalling writes without fsync. Journal writes without fsync happen at times determined by arcane kernel daemons. Applications and even users don't have any say, and therefore no particular Mail servers seem like the most vulnerable systems to me too. They are often designed to relay a successful fsync() result across the network. RAID doesn't protect against power-fails with flushless fsync, and SMTP does not redundantly hold the same email at different locations for recovery. (The most solid email services will use a distributed database which is not vulnerable to these problems. But That depends on the disk implementation. Another use of the 32MB is to hold write-through NCQ commands in flight while they are sorted. If the trend is towards that, perhaps it's quite resistant to power failure. Still, I'm on the side of safety - having witnessed these power-fail corruptions myself. The type I've seen will get more common, as the trend in home "appliances" is towards pulling the plug to turn them off, and to have these larger storage devices in them. -- Jamie --
Hi! Do you remember seeing "my disk was wrecked" for reiserfs? I believe I got some corruption, but it was a) long time ago b) took half a year of hard powerdowns to provoke c) I did not report it anywhere because ... it just was not reproducible. Pavel -- (english) http://www.livejournal.com/~pavelmachek (cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html --
Here's a recent one, just FYI. http://marc.info/?l=ext3-users&m=121205834608161&w=2 -Eric --
