Hello, This patchset contains five patches to convert block drivers implementing REQ_HARDBARRIER to support REQ_FLUSH/FUA. 0001-block-loop-implement-REQ_FLUSH-FUA-support.patch 0002-virtio_blk-implement-REQ_FLUSH-FUA-support.patch 0003-lguest-replace-VIRTIO_F_BARRIER-support-with-VIRTIO_.patch 0004-md-implment-REQ_FLUSH-FUA-support.patch 0005-dm-implement-REQ_FLUSH-FUA-support.patch I'm fairly sure about conversions 0001-0003. 0004 should be okay although multipath wasn't tested. 0005, I'm not quite sure about. It works fine for the tests I've done but there are many other targets and code paths that I didn't test. So, please be careful with the last patch. I think it would be best to route the last two through the respective md/dm trees after the core part is merged and pulled into those trees. The nice thing about the conversion is that in many cases it replaces postflush with FUA writes which can be handled by request queues lower in the chain. For md/dm, this replaces an array wide cacheflush with FUA writes to only affected member devices. After this patchset, the followings remain to be converted. * blktrace * scsi_error.c for some reason tests REQ_HARDBARRIER. I think this part can be dropped altogether but am not sure. * drbd and xen. I have no idea. These patches are on top of block#for-2.6.36-post (c047ab2dddeeafbd6f7c00e45a13a5c4da53ea0b) + block-replace-barrier-with-sequenced-flush patchset[1] + block-fix-incorrect-bio-request-flag-conversion-in-md patch[2] and available in the following git tree. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua and contain the following changes. Documentation/lguest/lguest.c | 36 +++----- drivers/block/loop.c | 18 ++-- drivers/block/virtio_blk.c | 26 ++--- drivers/md/dm-crypt.c | 2 drivers/md/dm-io.c | 20 ---- drivers/md/dm-log.c | 2 drivers/md/dm-raid1.c | 8 - ...
From: Tejun Heo <tj@kernle.org>
VIRTIO_F_BARRIER is deprecated. Replace it with VIRTIO_F_FLUSH/FUA
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
---
Documentation/lguest/lguest.c | 36 ++++++++++++++++--------------------
1 files changed, 16 insertions(+), 20 deletions(-)
diff --git a/Documentation/lguest/lguest.c b/Documentation/lguest/lguest.c
index e9ce3c5..3be47d4 100644
--- a/Documentation/lguest/lguest.c
+++ b/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue *vq)
off = out->sector * 512;
/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,19 @@ static void blk_request(struct virtqueue *vq)
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ /* Honor FUA by syncing everything. */
+ if (ret >= 0 && (out->type & VIRTIO_BLK_T_FUA)) {
+ ret = fdatasync(vblk->fd);
+ verbose("FUA fdatasync: %i\n", ret);
+ }
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1706,6 @@ static void blk_request(struct virtqueue *vq)
}
}
- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* ...VIRTIO_F_BARRIER is deprecated. Replace it with VIRTIO_F_FLUSH
support.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Rusty Russell <rusty@rustcorp.com.au>
Cc: Christoph Hellwig <hch@lst.de>
---
FUA support dropped as suggested by Christoph. Rusty, can you please
ack this version too? I tested it with the updated virtio_blk and it
works fine.
Thanks.
Documentation/lguest/lguest.c | 29 +++++++++--------------------
1 file changed, 9 insertions(+), 20 deletions(-)
Index: block/Documentation/lguest/lguest.c
===================================================================
--- block.orig/Documentation/lguest/lguest.c
+++ block/Documentation/lguest/lguest.c
@@ -1639,15 +1639,6 @@ static void blk_request(struct virtqueue
off = out->sector * 512;
/*
- * The block device implements "barriers", where the Guest indicates
- * that it wants all previous writes to occur before this write. We
- * don't have a way of asking our kernel to do a barrier, so we just
- * synchronize all the data in the file. Pretty poor, no?
- */
- if (out->type & VIRTIO_BLK_T_BARRIER)
- fdatasync(vblk->fd);
-
- /*
* In general the virtio block driver is allowed to try SCSI commands.
* It'd be nice if we supported eject, for example, but we don't.
*/
@@ -1679,6 +1670,13 @@ static void blk_request(struct virtqueue
/* Die, bad Guest, die. */
errx(1, "Write past end %llu+%u", off, ret);
}
+
+ wlen = sizeof(*in);
+ *in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
+ } else if (out->type & VIRTIO_BLK_T_FLUSH) {
+ /* Flush */
+ ret = fdatasync(vblk->fd);
+ verbose("FLUSH fdatasync: %i\n", ret);
wlen = sizeof(*in);
*in = (ret >= 0 ? VIRTIO_BLK_S_OK : VIRTIO_BLK_S_IOERR);
} else {
@@ -1702,15 +1700,6 @@ static void blk_request(struct virtqueue
}
}
- /*
- * OK, so we noted that it was pretty poor to use an fdatasync as a
- * barrier. But Christoph Hellwig points out that we need a sync
- * *afterwards* as well: "Barriers ...From: Tejun Heo <tj@kernle.org> This patch converts dm to support REQ_FLUSH/FUA instead of now deprecated REQ_HARDBARRIER. For common parts, * Barrier retry logic dropped from dm-io. * Empty WRITE_BARRIERs are replaced with WRITE_FLUSHes. * It's now guaranteed that all FLUSH bio's which are passed onto dm targets are zero length. bio_empty_barrier() tests are replaced with REQ_FLUSH tests. * Dropped unlikely() around REQ_FLUSH tests. Flushes are not unlikely enough to be marked with unlikely(). For bio-based dm, * Preflush is handled as before but postflush is dropped and replaced with passing down REQ_FUA to member request_queues. This replaces one array wide cache flush w/ member specific FUA writes. * __split_and_process_bio() now calls __clone_and_map_flush() directly for flushes and guarantees all FLUSH bio's going to targets are zero length. * -EOPNOTSUPP retry logic dropped. For request-based dm, * Nothing much changes. It just needs to handle FLUSH requests as before. It would be beneficial to advertise FUA capability so that it can propagate FUA flags down to member request_queues instead of sequencing it as WRITE + FLUSH at the top queue. Lightly tested linear, stripe, raid1, snap and crypt targets. Please proceed with caution as I'm not familiar with the code base. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: dm-devel@redhat.com Cc: Christoph Hellwig <hch@lst.de> --- drivers/md/dm-crypt.c | 2 +- drivers/md/dm-io.c | 20 +---- drivers/md/dm-log.c | 2 +- drivers/md/dm-raid1.c | 8 +- drivers/md/dm-region-hash.c | 16 ++-- drivers/md/dm-snap-persistent.c | 2 +- drivers/md/dm-snap.c | 6 +- drivers/md/dm-stripe.c | 2 +- drivers/md/dm.c | 180 +++++++++++++++++++------------------- 9 files changed, 113 insertions(+), 125 deletions(-) diff --git a/drivers/md/dm-crypt.c b/drivers/md/dm-crypt.c index ...
On Mon, Aug 16 2010 at 12:52pm -0400, What tree does this patch apply to? I know it doesn't apply to That logic wasn't just about retries (at least not in the latest kernel). With commit 708e929513502fb0 the -EOPNOTSUPP checking also serves to optimize the barrier+discard case (when discards aren't Can you expand on that TODO a bit? What is the mechanism to propagate FUA down to a DM device's members? I'm only aware of propagating member devices' features up to the top-level DM device's request-queue (not the opposite). Are you saying that establishing the FUA capability on the top-level DM device's request_queue is sufficient? If so then why not make the This is concerning... if we're to offer more comprehensive review I think we need more detail on what guided your changes rather than details of what the resulting changes are. Mike --
Hello, (from the head message) These patches are on top of block#for-2.6.36-post (c047ab2dddeeafbd6f7c00e45a13a5c4da53ea0b) + block-replace-barrier-with-sequenced-flush patchset[1] + block-fix-incorrect-bio-request-flag-conversion-in-md patch[2] and available in the following git tree. git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua [1] http://thread.gmane.org/gmane.linux.kernel/1022363 [2] http://thread.gmane.org/gmane.linux.kernel/1023435 With the patch applied, there's no second flush. Those requests would now be REQ_FLUSH + REQ_DISCARD. The first can't be avoided anyway and there won't be the second flush to begin with, so I don't think this Yeah, I think it would be enough to always advertise FLUSH|FUA if the member devices support FLUSH (regardless of FUA support). The reason Yeap, I want you to be concerned. :-) This was the first time I looked at the dm code and there are many different disjoint code paths and I couldn't fully follow or test all of them, so it definitely needs a I'll try to explain it. If you have any further questions, please let me know. * For common part (dm-io, dm-log): * Empty REQ_HARDBARRIER is converted to empty REQ_FLUSH. * REQ_HARDBARRIER w/ data is converted either to data + REQ_FLUSH + REQ_FUA or data + REQ_FUA. The former is the safe equivalent conversion but there could be cases where ther latter is enough. * For bio based dm: * Unlike REQ_HARDBARRIER, REQ_FLUSH/FUA doesn't have any ordering requirements. Remove assumptions of ordering and/or draining. A related question: Is dm_wait_for_completion() used in process_flush() safe against starvation under continuous influx of other commands? * As REQ_FLUSH/FUA doesn't require any ordering of requests before or after it, on array devices, the latter part - REQ_FUA - can be handled like other writes. ie. REQ_FLUSH needs to be broadcasted to all devices but once that is complete the ...
In fact the pre-flush is completely superflous for discards, but that's a separate discussion and should not be changed as part of this patchset but rather explicitly. --
On Tue, Aug 17 2010 at 5:33am -0400, Makes sense, but your patches still need to be refreshed against the latest (2.6.36-rc1) upstream code. Numerous changes went in to DM You'll need Mikulas (bio-based) and NEC (request-based, Kiyoshi and Jun'ichi) to give it serious review. NOTE: NEC has already given some preliminary feedback to hch in the "[PATCH, RFC 2/2] dm: support REQ_FLUSH directly" thread: https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html OK, so you folded dm_flush() directly into process_flush() -- the code that was dm_flush() only needs to be called once now. As for your specific dm_wait_for_completion() concern -- I'll defer to Mikulas. But I'll add: we haven't had any reported starvation issues with DM's existing barrier support. DM uses a mempool for its clones, so it should naturally throttle (without starvation) when memory gets bio-based DM already split the barrier out from the data (in process_barrier). You've renamed process_barrier to process_flush and OK, so seems 1 is done, 2 is still TODO. Looking at your tree it seems 2 would be as simple as using the following in dm_init_request_based_queue (on the most current upstream dm.c): blk_queue_flush(q, REQ_FLUSH | REQ_FUA); (your current patch only sets REQ_FLUSH in alloc_dev). --
Hello, Sure thing. The block part isn't fixed yet and so the RFC tag. Once the block layer part is settled, it probably should be pulled into Oh, you already cc'd them. Great. Hello, guys, the original thread is Hmmm... I think both issues don't exist in this incarnation of I see but single pending flush and steady write streams w/o saturating the mempool would be able to stall dm_wait_for_completeion(), no? Eh Yeah and threw in WARN_ON() there to make sure REQ_FLUSH + data bios Oh, I was talking about the other way around. Passing REQ_FUA in bio->bi_rw down to member request_queues. Sometimes while Yeah, but for that direction, just adding REQ_FUA to blk_queue_flush() should be enough. I'll add it. Thanks. -- tejun --
On Tue, Aug 17 2010 at 12:51pm -0400, Why base your work on a partial 2.6.36 tree? Why not rebase to linus' 2.6.36-rc1? Once we get the changes in a more suitable state (across the entire tree) we can split the individual changes out to their respective trees. Without a comprehensive tree I fear this code isn't going to get tested or reviewed properly. For example: any review of DM changes, against stale DM code, is wasted Seems we need to change __blk_rq_prep_clone to propagate REQ_FUA just like REQ_DISCARD: http://git.kernel.org/linus/3383977 Doesn't seem like we need to do the same for REQ_FLUSH (at least not for rq-based DM's benefit) because dm.c:setup_clone already special cases flush requests and sets REQ_FLUSH in cmd_flags. Mike --
Hello, Because the block tree contains enough changes which are not in Yeap, sure, it will happen all in a good time, but I don't really agree that review against block tree would be complete waste of effort. Things usually don't change that drastically unless dm is being rewritten as we speak. Anyways, I'll soon post a rebased / Oooh, right. I for some reason thought block layer was already doing Hmmm... but in general, I think it would be better to make __blk_rq_prep_clone() to copy all command related bits. If some command bits shouldn't be copied, the caller should take care of them. Thanks. -- tejun --
Hi Tejun, Mike, Your understanding is correct, dm_wait_for_completion() for flush will stall in such cases for request-based dm. That's why I mentioned below in https://www.redhat.com/archives/dm-devel/2010-August/msg00026.html. In other words, current request-based device-mapper can't handle other requests while a flush request is in progress. In flush request handling, request-based dm uses dm_wait_for_completion() to wait for the completion of cloned flush requests, depending on the fact that there should be only flush requests in flight owning to the block layer sequencing. It's not a separate issue and we need to resolve it at least. I'm still considering how I can fix the request-based dm. Thanks, Kiyoshi Ueda --
Hello, I see. bio based implementation also uses dm_wait_for_completion() but it also has DMF_QUEUE_IO_TO_THREAD to plug all the follow up bio's while flush is in progress, which sucks for throughput but Right, I thought you were talking about REQ_FLUSHes not sycnhronized against barrier write. Anyways, yeah, it's a problem. I don't think not being able to handle multiple flushes concurrently would be a major issue. The problem is not being able to process other bios/requests while a flush is in progress. All that's necessary is making the completion detection a bit more fine grained so that it counts the number of in flight flush bios/requests and completes when it reaches zero instead of waiting for all outstanding commands. Shouldn't be too hard. Thanks. -- tejun --
Phillipp and Lars, can you look at the changes in http://www.spinics.net/lists/linux-fsdevel/msg35884.html and http://www.spinics.net/lists/linux-fsdevel/msg36001.html and see how drbd can be adapted to it? It's one of two drivers still missing, and because it's not quite intuitive what it's doing it's very hard for Tejun and me to help. --
Yes of course. I'm currently on vacation, actually, and Phil is likely very busy, but we should have a patche ready within the next few days anyways. DRBD® and LINBIT® are registered trademarks of LINBIT, Austria. --
Hi Christoph, I was able to finish the greater part of the necessary work today. You can find it there: git://git.drbd.org/linux-2.6-drbd.git flush-fua Of course that is based on git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua REQ_HARDBARRIER is not yet completely eliminated from drbd, but I am very confident that I will be able to finish that by tomorrow. Best, Phil DRBD(R) and LINBIT(R) are registered trademarks of LINBIT, Austria. --
Hi Jens,
Christoph Hellwig pointed out that drbd is the only driver
that still uses/supports barriers instead of FUE in linux-next.
I revitalized three patches that remove the barrier bits from DRBD...
The following changes since commit 8825f7c3e5c7b251b49fc594658a96f59417ee16:
drbd: Silenced an assert (2010-10-22 15:55:22 +0200)
are available in the git repository at:
git://git.drbd.org/linux-2.6-drbd.git for-jens
Philipp Reisner (3):
drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
drivers/block/drbd/drbd_actlog.c | 16 +---
drivers/block/drbd/drbd_int.h | 20 +---
drivers/block/drbd/drbd_main.c | 7 +-
drivers/block/drbd/drbd_nl.c | 8 +-
drivers/block/drbd/drbd_proc.c | 1 -
drivers/block/drbd/drbd_receiver.c | 212 +++++-------------------------------
drivers/block/drbd/drbd_req.c | 14 ---
drivers/block/drbd/drbd_worker.c | 21 ----
8 files changed, 40 insertions(+), 259 deletions(-)
--
This looks good. You probably don't actually need the MD_NO_FUA flag - REQ_FUA writes will always work - they just ignored by drivers not implementing cache control. --
REQ_HARDBARRIER is dead now, so remove the leftovers. What's left
at this point is:
- various checks inside the block layer.
- sanity checks in bio based drivers.
- now unused bio_empty_barrier helper.
- Xen blockfront use of BLKIF_OP_WRITE_BARRIER - it's dead for a while,
but Xen really needs to sort out it's barrier situaton.
- setting of ordered tags in uas - dead code copied from old scsi
drivers.
- scsi different retry for barriers - it's dead and should have been
removed when flushes were converted to FS requests.
- blktrace handling of barriers - removed. Someone who knows blktrace
better should add support for REQ_FLUSH and REQ_FUA, though.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/block/blk-core.c
===================================================================
--- linux-2.6.orig/block/blk-core.c 2010-10-23 17:03:30.119004105 +0200
+++ linux-2.6/block/blk-core.c 2010-10-23 17:03:40.797255816 +0200
@@ -1204,13 +1204,6 @@ static int __make_request(struct request
int where = ELEVATOR_INSERT_SORT;
int rw_flags;
- /* REQ_HARDBARRIER is no more */
- if (WARN_ONCE(bio->bi_rw & REQ_HARDBARRIER,
- "block: HARDBARRIER is deprecated, use FLUSH/FUA instead\n")) {
- bio_endio(bio, -EOPNOTSUPP);
- return 0;
- }
-
/*
* low level driver can indicate that it wants pages above a
* certain limit bounced to low memory (ie for highmem, or even
Index: linux-2.6/block/elevator.c
===================================================================
--- linux-2.6.orig/block/elevator.c 2010-10-23 17:03:44.740255886 +0200
+++ linux-2.6/block/elevator.c 2010-10-23 17:05:30.715017097 +0200
@@ -429,7 +429,7 @@ void elv_dispatch_sort(struct request_qu
q->nr_sorted--;
boundary = q->end_sector;
- stop_flags = REQ_SOFTBARRIER | REQ_HARDBARRIER | REQ_STARTED;
+ stop_flags = REQ_SOFTBARRIER | REQ_STARTED;
list_for_each_prev(entry, &q->queue_head) {
struct request *pos = list_entry_rq(entry);
@@ -691,7 +691,7 @@ ...Oops. Meant to take that out. Thanks. Acked-by: Matthew Wilcox <willy@linux.intel.com> -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Thanks Christoph, applied for 2.6.37 (but with a few days of brewing). -- Jens Axboe --
Btw, in case it wasn't clear - this needs to be applied only after the drbd patches that remove the drbd use of REQ_HARDBARRIER. --
So, what's the plan? I'd really hate to see 2.6.37 getting released with those stale barrier bits left in. --
It's queued up, I'll be pushing this out this week:
git://git.kernel.dk/linux-2.6-block.git for-linus
Christoph Hellwig (1):
block: remove REQ_HARDBARRIER
Daniel J Blueman (1):
ioprio: fix RCU locking around task dereference
Jens Axboe (7):
Merge branch 'for-jens' of git://git.drbd.org/linux-2.6-drbd into for-2.6.37/drivers
block: check for proper length of iov entries in blk_rq_map_user_iov()
block: take care not to overflow when calculating total iov length
block: limit vec count in bio_kmalloc() and bio_alloc_map_data()
bio: take care not overflow page count when mapping/copying user data
cciss: fix proc warning on attempt to remove non-existant directory
Merge branch 'for-2.6.37/drivers' into for-linus
Lars Ellenberg (6):
drbd: consolidate explicit drbd_md_sync into drbd_create_new_uuid
drbd: tag a few error messages with "assert failed"
drbd: fix potential deadlock on detach
drbd: fix potential data divergence after multiple failures
drbd: fix a misleading printk
drbd: rate limit an error message
Mike Snitzer (1):
block: read i_size with i_size_read()
Philipp Reisner (4):
drbd: Silenced an assert
drbd: Removed the BIO_RW_BARRIER support form the receiver/epoch code
drbd: REQ_HARDBARRIER -> REQ_FUA transition for meta data accesses
drbd: Removed checks for REQ_HARDBARRIER on incomming BIOs
Sergey Senozhatsky (1):
ioprio: rcu_read_lock/unlock protect find_task_by_vpid call (V2)
Stephen M. Cameron (5):
cciss: fix board status waiting code
cciss: Use kernel provided PCI state save and restore functions
cciss: limit commands allocated on reset_devices
cciss: use usleep_range not msleep for small sleeps
cciss: remove controllers supported by hpsa
Vasiliy Kulikov (1):
block: ioctl: fix information leak to userland
block/blk-core.c | 11 +--
block/blk-map.c ...I'll update blktrace to be current there, btw. -- Jens Axboe --
Looks good to me. Reviewed-by: Tejun Heo <tj@kernel.org> Thanks. -- tejun --
