Hello, This patchset replaces the current barrier implementation with sequenced flush which doesn't impose any restriction on ordering around the flush requests. This patchst is result of the following discussion thread. http://thread.gmane.org/gmane.linux.file-systems/43877 In summary, filesystems can take over the ordering of requests around commit writes and the block layer should just supply a mechanism to perform the commit writes themselves. This would greatly lessen tha stall caused by queue dumping and draining used by the current barrier implementation for request ordering. This patchset converts barrier mechanism to sequenced flush/fua mechanism in the following steps. 1. Kill the mostly unused ORDERED_BY_TAG support. 2. Deprecate REQ_HARDBARRIER support. All hard barrier requests are failed with -EOPNOTSUPP. 3. Drop barrier ordering by queue draining mechanism. 4. Rename barrier to flush and implement new interface based on REQ_FLUSH and REQ_FUA as suggested by Christoph. blkdev_issue_flush() is converted to use the new mechanism but all the filesystems still use the deprecated REQ_HARDBARRIER which always fails. Each filesystem needs to be updated to enforce request ordering themselves and then to use REQ_FLUSH/FUA mechanism. loop, md, dm, etc... haven't been converted yet and REQ_FLUSH/FUA doesn't work with them yet. I'll convert most of them soonish if this patchset is generally agreed upon. This patchset contains the following patches. 0001-block-loop-queue-ordered-mode-should-be-DRAIN_FLUSH.patch 0002-block-kill-QUEUE_ORDERED_BY_TAG.patch 0003-block-deprecate-barrier-and-replace-blk_queue_ordere.patch 0004-block-remove-spurious-uses-of-REQ_HARDBARRIER.patch 0005-block-misc-cleanups-in-barrier-code.patch 0006-block-drop-barrier-ordering-by-queue-draining.patch 0007-block-rename-blk-barrier.c-to-blk-flush.c.patch 0008-block-rename-barrier-ordered-to-flush.patch ...
With ordering requirements dropped, barrier and ordered are misnomers.
Now all block layer does is sequencing FLUSH and FUA. Rename them to
flush.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
block/blk-core.c | 21 +++++-----
block/blk-flush.c | 98 +++++++++++++++++++++++------------------------
block/blk.h | 4 +-
include/linux/blkdev.h | 26 ++++++------
4 files changed, 73 insertions(+), 76 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 82bd6d9..efe391b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -136,7 +136,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
{
struct request_queue *q = rq->q;
- if (&q->bar_rq != rq) {
+ if (&q->flush_rq != rq) {
if (error)
clear_bit(BIO_UPTODATE, &bio->bi_flags);
else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -160,13 +160,12 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
if (bio->bi_size == 0)
bio_endio(bio, error);
} else {
-
/*
- * Okay, this is the barrier request in progress, just
- * record the error;
+ * Okay, this is the sequenced flush request in
+ * progress, just record the error;
*/
- if (error && !q->orderr)
- q->orderr = error;
+ if (error && !q->flush_err)
+ q->flush_err = error;
}
}
@@ -520,7 +519,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
init_timer(&q->unplug_timer);
setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
INIT_LIST_HEAD(&q->timeout_list);
- INIT_LIST_HEAD(&q->pending_barriers);
+ INIT_LIST_HEAD(&q->pending_flushes);
INIT_WORK(&q->unplug_work, blk_unplug_work);
kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1758,11 +1757,11 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
static void blk_account_io_done(struct request *req)
{
/*
- * Account IO completion. bar_rq isn't accounted as a ...Btw, I think this one should just go away. It's only used by ide in an attempt to make ordered sequences atomic, which isn't needed for the new design. --
Hello, Yeap, agreed. I couldn't really understand why the the sequence needed to be atomic for ide in the first place so just left it alone. Do you understand why it tried to be atomic? Thanks. -- tejun --
I think initial drafs of the barrier specification talked about atomic sequences. Except for that I can't think of any reason. --
Hello, Hmm... alright, I'll rip it out. Thanks. -- tejun --
Filesystems will take all the responsibilities for ordering requests around commit writes and will only indicate how the commit writes themselves should be handled by block layers. This patch drops barrier ordering by queue draining from block layer. Ordering by draining implementation was somewhat invasive to request handling. List of notable changes follow. * Each queue has 1 bit color which is flipped on each barrier issue. This is used to track whether a given request is issued before the current barrier or not. REQ_ORDERED_COLOR flag and coloring implementation in __elv_add_request() are removed. * Requests which shouldn't be processed yet for draining were stalled by returning -EAGAIN from blk_do_ordered() according to the test result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq(). This logic is removed. * Draining completion logic in elv_completed_request() removed. * All barrier sequence requests were queued to request queue and then trckled to lower layer according to progress and thus maintaining request orders during requeue was necessary. This is replaced by queueing the next request in the barrier sequence only after the current one is complete from blk_ordered_complete_seq(), which removes the need for multiple proxy requests in struct request_queue and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of elv_insert(). * As barriers no longer have ordering constraints, there's no need to dump the whole elevator onto the dispatch queue on each barrier. Insert barriers at the front instead. * If other barrier requests come to the front of the dispatch queue while one is already in progress, they are stored in q->pending_barriers and restored to dispatch queue one-by-one after each barrier completion from blk_ordered_complete_seq(). Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> --- block/blk-barrier.c | 220 ...
REQ_HARDBARRIER is deprecated. Remove spurious uses in the following
users. Please note that other than osdblk, all other uses were
already spurious before deprecation.
* osdblk: osdblk_rq_fn() won't receive any request with
REQ_HARDBARRIER set. Remove the test for it.
* pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
anything. Removed.
* aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
spurious. Removed.
* sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
spurious. Removed.
* scsi_tcq: The ordered tag path wasn't being used anyway. Removed.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: Peter Osterlund <petero2@telia.com>
---
drivers/block/osdblk.c | 3 +--
drivers/block/pktcdvd.c | 1 -
drivers/scsi/aic7xxx_old.c | 21 ++-------------------
drivers/scsi/libsas/sas_scsi_host.c | 13 +------------
include/scsi/scsi_tcq.h | 6 +-----
5 files changed, 5 insertions(+), 39 deletions(-)
diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 72d6246..87311eb 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
break;
/* filter out block requests we don't understand */
- if (rq->cmd_type != REQ_TYPE_FS &&
- !(rq->cmd_flags & REQ_HARDBARRIER)) {
+ if (rq->cmd_type != REQ_TYPE_FS) {
blk_end_request_all(rq, 0);
continue;
}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index b1cbeb5..0166ea1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
rq->timeout = 60*HZ;
rq->cmd_type = REQ_TYPE_BLOCK_PC;
- rq->cmd_flags |= REQ_HARDBARRIER;
if (cgc->quiet)
rq->cmd_flags |= REQ_QUIET;
diff --git ...Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA requests. Deprecate barrier. All REQ_HARDBARRIERs are failed with -EOPNOTSUPP and blk_queue_ordered() is replaced with simpler blk_queue_flush(). blk_queue_flush() takes combinations of REQ_FLUSH and FUA. If a device has write cache and can flush it, it should set REQ_FLUSH. If the device can handle FUA writes, it should also set REQ_FUA. All blk_queue_ordered() users are converted. * ORDERED_DRAIN is mapped to 0 which is the default value. * ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH. * ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA. Signed-off-by: Tejun Heo <tj@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Nick Piggin <npiggin@kernel.dk> Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Jeremy Fitzhardinge <jeremy@xensource.com> Cc: Chris Wright <chrisw@sous-sol.org> Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> Cc: Boaz Harrosh <bharrosh@panasas.com> Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com> Cc: David S. Miller <davem@davemloft.net> Cc: Alasdair G Kergon <agk@redhat.com> Cc: Pierre Ossman <drzeus@drzeus.cx> Cc: Stefan Weinhuber <wein@de.ibm.com> --- block/blk-barrier.c | 29 ---------------------------- block/blk-core.c | 6 +++- block/blk-settings.c | 20 +++++++++++++++++++ drivers/block/brd.c | 1 - drivers/block/loop.c | 2 +- drivers/block/osdblk.c | 2 +- drivers/block/ps3disk.c | 2 +- drivers/block/virtio_blk.c | 25 ++++++++--------------- drivers/block/xen-blkfront.c | 43 +++++++++++------------------------------ drivers/ide/ide-disk.c | 13 +++++------ drivers/md/dm.c | 2 +- drivers/mmc/card/queue.c | 1 - drivers/s390/block/dasd.c | 1 - drivers/scsi/sd.c | 16 +++++++------- include/linux/blkdev.h | 6 +++- 15 files changed, 67 insertions(+), 102 deletions(-) diff --git a/block/blk-barrier.c ...
Without ordering requirements, barrier and ordering are minomers.
Rename block/blk-barrier.c to block/blk-flush.c. Rename of symbols
will follow.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
block/Makefile | 2 +-
block/blk-barrier.c | 248 ---------------------------------------------------
block/blk-flush.c | 248 +++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 249 insertions(+), 249 deletions(-)
delete mode 100644 block/blk-barrier.c
create mode 100644 block/blk-flush.c
diff --git a/block/Makefile b/block/Makefile
index 0bb499a..f627e4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -3,7 +3,7 @@
#
obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
- blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
+ blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
deleted file mode 100644
index e8b2e5c..0000000
--- a/block/blk-barrier.c
+++ /dev/null
@@ -1,248 +0,0 @@
-/*
- * Functions related to barrier IO handling
- */
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/bio.h>
-#include <linux/blkdev.h>
-#include <linux/gfp.h>
-
-#include "blk.h"
-
-static struct request *queue_next_ordseq(struct request_queue *q);
-
-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
-{
- if (!q->ordseq)
- return 0;
- return 1 << ffz(q->ordseq);
-}
-
-static struct request *blk_ordered_complete_seq(struct request_queue *q,
- unsigned seq, int error)
-{
- struct request *next_rq = NULL;
-
- if (error && !q->orderr)
- q->orderr = error;
-
- BUG_ON(q->ordseq & seq);
- q->ordseq |= seq;
-
- if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
- /* not complete yet, queue the next ordered sequence */
- next_rq = ...Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon which will render
the advantage of tag ordering moot. Kill ORDERED_BY_TAG. The
following users are affected.
* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated. Removed.
* xen-blkfront: ORDERED_TAG case dropped.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
---
block/blk-barrier.c | 35 +++++++----------------------------
drivers/block/brd.c | 2 +-
drivers/block/virtio_blk.c | 9 ---------
drivers/block/xen-blkfront.c | 8 +++-----
drivers/scsi/sd.c | 4 +---
include/linux/blkdev.h | 17 +----------------
6 files changed, 13 insertions(+), 62 deletions(-)
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f0faefc..c807e9c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -26,10 +26,7 @@ int blk_queue_ordered(struct request_queue *q, unsigned ordered)
if (ordered != QUEUE_ORDERED_NONE &&
ordered != QUEUE_ORDERED_DRAIN &&
ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
- ordered != QUEUE_ORDERED_DRAIN_FUA &&
- ordered != QUEUE_ORDERED_TAG &&
- ordered != QUEUE_ORDERED_TAG_FLUSH &&
- ordered != QUEUE_ORDERED_TAG_FUA) {
+ ordered != QUEUE_ORDERED_DRAIN_FUA) {
printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
return -EINVAL;
}
@@ -155,21 +152,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
* For an empty barrier, there's no actual BAR request, which
* in turn makes POSTFLUSH unnecessary. Mask them off.
*/
- if (!blk_rq_sectors(rq)) {
+ if (!blk_rq_sectors(rq))
q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
...Hello Tejun, Have you seen Hannes Reinecke's and my measurements in http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and http://marc.info/?l=linux-scsi&m=128111995217405&w=2 correspondingly? If yes, what else evidences do you need to see that the tag ordering is a big performance win? Vlad --
It's not tag odering that is a win but big queue depth. That's what you measured and what I fully agree on. I haven't been able to get out of Hannes what he actually measured. And if you'd actually look at the patchset allowing deep queues is exactly what it allows us, and while I haven't done testing on this patchset but only on my previous version it does get us back to use the full potential of large arrays exactly because of that. --
Make the following cleanups in preparation of barrier/flush update.
* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
block/blk.h.
* blk_do_ordered() now returns pointer to struct request, with %NULL
meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
later". The third case will be dropped with further changes.
* In the initialization of proxy barrier request, data direction is
already set by init_request_from_bio(). Drop unnecessary explicit
REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
flag setting.
* add_request() is collapsed into __make_request().
These changes don't make any functional difference.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
block/blk-barrier.c | 32 ++++++++++++++------------------
block/blk-core.c | 21 ++++-----------------
block/blk.h | 7 +++++--
include/linux/blkdev.h | 1 -
4 files changed, 23 insertions(+), 38 deletions(-)
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index ed0aba5..f1be85b 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -110,9 +110,9 @@ static void queue_flush(struct request_queue *q, unsigned which)
elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
}
-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static inline struct request *start_ordered(struct request_queue *q,
+ struct request *rq)
{
- struct request *rq = *rqp;
unsigned skip = 0;
q->orderr = 0;
@@ -149,11 +149,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
/* initialize proxy request and queue it */
blk_rq_init(q, rq);
- if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
- rq->cmd_flags |= REQ_WRITE;
+ init_request_from_bio(rq, q->orig_bar_rq->bio);
if (q->ordered & QUEUE_ORDERED_DO_FUA)
rq->cmd_flags |= REQ_FUA;
- init_request_from_bio(rq, q->orig_bar_rq->bio);
rq->end_io = bar_end_io;
elv_insert(q, rq, ...loop implements FLUSH using fsync but was incorrectly setting its ordered mode to DRAIN. Change it to DRAIN_FLUSH. In practice, this doesn't change anything as loop doesn't make use of the block layer ordered implementation. Signed-off-by: Tejun Heo <tj@kernel.org> --- drivers/block/loop.c | 2 +- 1 files changed, 1 insertions(+), 1 deletions(-) diff --git a/drivers/block/loop.c b/drivers/block/loop.c index f3c636d..c3a4a2e 100644 --- a/drivers/block/loop.c +++ b/drivers/block/loop.c @@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode, lo->lo_queue->unplug_fn = loop_unplug; if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync) - blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN); + blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH); set_capacity(lo->lo_disk, size); bd_set_size(bdev, size << 9); -- 1.7.1 --
Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags. REQ_FLUSH means the
device cache should be flushed before executing the request. REQ_FUA
means that the data in the request should be on non-volatile media on
completion.
Block layer will choose the correct way of implementing the semantics
and execute it. The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests. Devices will never see REQ_FLUSH and/or FUA
which it doesn't support.
* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
blk-flush.c.
* REQ_FLUSH w/o data can also be directly passed to drivers without
sequencing but some drivers assume that zero length requests don't
have rq->bio which isn't true for these requests requiring the use
of proxy requests.
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
block/blk-core.c | 2 +-
block/blk-flush.c | 85 ++++++++++++++++++++++++++----------------------
block/blk.h | 3 ++
include/linux/blkdev.h | 38 +--------------------
4 files changed, 52 insertions(+), 76 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index efe391b..c00ace2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,7 +1204,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
spin_lock_irq(q->queue_lock);
- if (bio->bi_rw & REQ_HARDBARRIER) {
+ if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
where = ELEVATOR_INSERT_FRONT;
goto get_rq;
}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dd87322..452c552 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,5 +1,5 @@
/*
- * Functions related to barrier IO handling
+ * Functions to sequence FLUSH and FUA writes.
*/
#include <linux/kernel.h>
#include <linux/module.h>
@@ -9,6 +9,15 @@
#include "blk.h"
+/* FLUSH/FUA sequences ...The patchset looks functionally correct to me, and with a small patch to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the underlying qemu shows that we actually get the flush requests where we should. No performance or power fail testing done yet. But I do not like the transition very much. The new WRITE_FUA_FLUSH request is exactly what filesystems expect from a current barrier request, so I'd rather move to that functionality without breaking stuff inbetween. So if it was to me I'd keep patches 1, 2, 4 and 5 from your series, than a main one to relax barrier semantics, then have the renaming patches 7 and 8, and possible keep patch 11 separate from the main implementation change, and if absolutely also a separate one to introduce REQ_FUA and REQ_FLUSH in the bio interface, but keep things working while doing this. Then we can patches do disable the reiserfs barrier "optimization" as the very first one, and DM/MD support which I'm currently working on as the last one and we can start doing the heavy testing. --
Hello, Christoph. There are two reason to avoid changing the meaning of REQ_HARDBARRIER and just deprecate it. One is to avoid breaking filesystems' expectations underneath it. Please note that there are out-of-tree filesystems too. I think it would be too dangerous to relax REQ_HARDBARRIER. Another is that pseudo block layer drivers (loop, virtio_blk, md/dm...) have assumptions about REQ_HARDBARRIER behavior and things would be broken in obscure ways between REQ_HARDBARRIER semantics change and updates to each of those drivers, so I don't really think Oops, I've already converted loop, virtio_blk/lguest and am working on md/dm right now too. I'm almost done with md and now doing dm. :-) Maybe we should post them right now so that we don't waste too much time trying to solve the same problems? Thanks. -- tejun --
Note that the renaming patch would include a move from REQ_HARDBARRIER
to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
compile. And while out of tree filesystems do exist they it's their
problem to keep up with kernel changes. They decide not to be part
Here's the dm patch. It only handles normal bio based dm yet, which
I understand and can test. request based dm (multipath) still needs
work.
Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c 2010-08-13 16:11:04.207010218 +0200
+++ linux-2.6/drivers/md/dm-crypt.c 2010-08-13 16:11:10.048003862 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
struct dm_crypt_io *io;
struct crypt_config *cc;
- if (unlikely(bio_empty_barrier(bio))) {
+ if (bio_empty_flush(bio)) {
cc = ti->private;
bio->bi_bdev = cc->dev->bdev;
return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c 2010-08-13 16:11:04.213011894 +0200
+++ linux-2.6/drivers/md/dm-io.c 2010-08-13 16:11:10.049003792 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
*/
for (i = 0; i < num_regions; i++) {
*dp = old_pages;
- if (where[i].count || (rw & REQ_HARDBARRIER))
+ if (where[i].count || (rw & REQ_FLUSH))
do_region(rw, i, where + i, dp, io);
}
@@ -412,8 +412,8 @@ retry:
}
set_current_state(TASK_RUNNING);
- if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
- rw &= ~REQ_HARDBARRIER;
+ if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) {
+ rw &= ~REQ_FLUSH;
goto retry;
}
Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c 2010-08-13 16:11:04.220013431 +0200
+++ linux-2.6/drivers/md/dm-raid1.c 2010-08-13 16:11:10.054018319 +0200
@@ -670,7 +670,7 @@ static ...Hello, Do you want to change the whole thing in a single commit? That would be a pretty big invasive patch touching multiple subsystems. Also, I don't know what to do about drdb and would like to leave its conversion to the maintainer (in separate patches). Here's the combined patch I've been working on. I've verified loop and virtio_blk/loop. I just (like five mins ago) got dm/dm conversion compiling, so I'm sure they're broken. The neat part is that thanks to the separation between REQ_FLUSH and FUA handling, bio mangling drivers only have to sequence the pre-flush and pass FUA directly to lower layers which in many cases saves an array-wide cache flush cycle. After getting this patch working, the only remaining bits would be blktrace and drdb. Thanks. Documentation/lguest/lguest.c | 36 +++----- drivers/block/loop.c | 18 ++-- drivers/block/virtio_blk.c | 26 ++--- drivers/md/dm-io.c | 20 ---- drivers/md/dm-log.c | 2 drivers/md/dm-raid1.c | 8 - drivers/md/dm-snap-persistent.c | 2 drivers/md/dm.c | 176 +++++++++++++++++++-------------------- drivers/md/linear.c | 4 drivers/md/md.c | 117 +++++--------------------- drivers/md/md.h | 23 +---- drivers/md/multipath.c | 4 drivers/md/raid0.c | 4 drivers/md/raid1.c | 178 +++++++++++++--------------------------- drivers/md/raid1.h | 2 drivers/md/raid10.c | 6 - drivers/md/raid5.c | 18 +--- include/linux/virtio_blk.h | 6 + 18 files changed, 244 insertions(+), 406 deletions(-) Index: block/drivers/block/loop.c =================================================================== --- block.orig/drivers/block/loop.c +++ block/drivers/block/loop.c @@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset; ...
We can just stop draining in the block layer in the first patch, then stop doing the stuff in md/dm/etc in the following and then do the final renaming patches. It would still be less patches then now, but keep things working through the whole transition, which would really I'd suggest not adding FUA support to virtio yet. Just using the flush feature gives you a fully working barrier implementation. Eventually we might want to add a flag in the block queue to send REQ_FLUSH|REQ_FUA request through to virtio directly so that we can avoid separate pre- and post flushes, but I really want to benchmark if We only need the special md_flush_request handling for empty REQ_FLUSH requests. REQ_WRITE | REQ_FLUSH just need the Once we only handle empty barriers here we can directly call bio_endio instead of first scheduling a work queue.Once we only handle empty barriers here we can directly call bio_endio and the super wakeup I kept bio_empty_barrier as bio_empty_flush, which actually is a quite We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need No need for the special error handling here, flush requests can now AFAICS this is only needed for the old barrier code, no need for this My suspicion is that we can get rif of all that special casing here and just use blk_rq_prep_clone once it's been updated to propagate REQ_FLUSH, similar to the DISCARD flag. I also suspect that there is absolutely no need to the barrier work queue once we stop waiting for outstanding request. But then again There's not need to use a separate flush_bio here. __split_and_process_bio does the right thing for empty REQ_FLUSH requests. See my patch for how to do this differenty. And yeah, my version has been tested. --
Hello, Christoph. I'm not really convinced that would help much. If bisecting can point to the conversion as the culprit for whatever kind of failure, wouldn't that be enough? No matter what we do the conversion will be a single step thing. If we make the filesystems enforce the ordering first and then relax ordering in the block layer, bisection would still just point at the later patch. The same goes for md/dm, the best we can find out would be whether the conversion is correct or not anyway. I'm not against restructuring the patchset if it makes more sense but it just feels like it would be a bit pointless effort (and one which would require much tighter coordination among different trees) at this I wrote this in the other mail but I think it would make difference if the backend storag is md/dm especially if it's shared by multiple VMs. Hmm, not really, the WRITE should happen after all the data in cache are committed to NV media, meaning that empty FLUSH should already Hmm... maybe. The reason why I removed bio_empty_flush() was that except for the front-most sequencer (block layer for all the request based ones and the front-most make_request for bio based ones), it doesn't make sense to see REQ_FLUSH + data bios. They should be sequenced at the front-most stage anyway, so I didn't have much use for them. Those code paths couldn't deal with REQ_FLUSH + data bios I wasn't sure about that part. You removed store_flush_error(), but DM_ENDIO_REQUEUE should still have higher priority than other The same question as before. I think we still need to prioritize Yeah, probably. It was an accidental edit to begin with and I left But how do you make sure REQ_FLUSHes for preflush finish before starting the write? Thanks. -- tejun --
The filesystems already enforce the ordering, except reiserfs which What other trees do you mean? The conversions of the 8 filesystems that actually support barriers need to go through this tree anyway if we want to be able to test it. Also the changes in the filesystem are absolutely minimal - it's basically just s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch kill BH_Orderd, The current bio_empty_barrier is only used in dm, and indeed only makes sense for make_request-based drivers. But I think it's a rather useful helper for them. Either way, it's not a big issue and either way is Hmm, okay. I see how the special flush_bio makes the waiting easier, let's see if Mike or other in the DM team have a better idea. --
Hi, I was mostly thinking about dm/md, drdb and stuff, but you're talking about filesystem conversion patches being routed through block tree, I might just resequence it to finish this part of discussion but what does that really buy us? It's not really gonna help bisection. Bisection won't be able to tell anything in higher resolution than "the new implementation doesn't work". If you show me how it would IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm core layer to retry the whole bio later), it trumps all other failures and the bio is retried later. That was why DM_ENDIO_REQUEUE was prioritized over other error codes, which actually is sort of incorrect in that once a FLUSH fails, it _MUST_ be reported to upper layers as FLUSH failure implies data already lost. So, DM_ENDIO_REQUEUE actually should have lower priority than other failures. But, then again, the error codes still need to be Yeah, it would be better if it can be sequenced w/o using a work but let's leave it for later. Thanks. -- tejun --
I think we really need all the conversions in one tree, block layer, remapping drivers and filesystems. Btw, I've done the conversion for all filesystems and I'm running tests It's not bisecting to find bugs in the barrier conversion. We can't easily bisect it down anyway. The problem is when we try to bisect other problems and get into the middle of the series barriers suddenly are gone. Which is not very helpful for things like data integrity I think that's something we better leave to the DM team. --
Hello, I don't know. If filesystem changes are really trivial maybe, but Ah, okay, hmmm.... alright, I'll resequence the patches. If the filesystem changes can be put into a single tree somehow, we can keep Sure, but we shouldn't be ripping out the code to do that. Thanks. -- tejun --
Hello, Sorry but I'm doing it. It just doesn't make much sense. I can't relax the ordering for REQ_HARDBARRIER without breaking the remapping drivers. So, to keep things working, I'll have to 1. relax the ordering 2. implement new REQ_FLUSH/FUA based interface and 3. use them in the filesystems in the same patch. That's just wrong. And I don't think md/dm changes can or should go through the block tree. They're way too invasive for that. It's a new implementation and barrier won't work (fail gracefully) for several commits during the transition. I don't think there's a better way around it. Thanks. -- tejun --
Hi Tejun, Christoph,
I think that's correct and changing the priority of DM_ENDIO_REQUEUE
for REQ_FLUSH down to the lowest should be fine.
(I didn't know that FLUSH failure implies data loss possibility.)
But the patch is not enough, you have to change target drivers, too.
E.g. As for multipath, you need to change
drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
like the REQ_DISCARD support included in 2.6.36-rc1.
By the way, if these patch-set with the change above are included,
even one path failure for REQ_FLUSH on multipath configuration will
be reported to upper layer as error, although it's retried using
other paths currently.
Then, if an upper layer won't take correct recovery action for the error,
it would be seen as a regression for users. (e.g. Frequent EXT3-error
resulting in read-only mount on multipath configuration.)
Although I think the explicit error is fine rather than implicit data
corruption, please check upper layers carefully so that users won't see
such errors as much as possible.
Thanks,
Kiyoshi Ueda
--
Hello, At least on ATA, FLUSH failure implies that data is already lost, so I'll take a look but is there an easy to test mpath other than having Argh... then it will have to discern why FLUSH failed. It can retry for transport errors but if it got aborted by the device it should report upwards. Maybe just turn off barrier support in mpath for now? Thanks. -- tejun --
On Mon, Aug 23 2010 at 8:14am -0400, It is easy enough to make a single path use mpath. Just verify/modify /etc/multipath.conf so that your device isn't blacklisted. multipathd will even work with a scsi-debug device. You obviously won't get path failover but you'll see the path get marked Yes, we discussed this issue of needing to train dm-multipath to know if there was a transport failure or not (at LSF). But I'm not sure when Hannes intends to repost his work in this area (updated to account for I think we'd prefer to have a device fail rather than jeopardize data integrity. Clearly not ideal but... --
Hi Tejun, Yes, checking whether it's a transport error in lower layer is the right solution. (Since I know it's not available yet, I just hoped if upper layers had some other options.) Anyway, only reporting errors for REQ_FLUSH to upper layer without such a solution would make dm-multipath almost unusable in real world, If it's possible, it could be a workaround for a short term. But how can you do that? I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags. Underlying devices of a mpath device may have write-back cache and it may be enabled. So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it becomes a device which has write-back cache but doesn't support flush. Then, upper layer can do nothing to ensure cache flush? Thanks, Kiyoshi Ueda --
On Tue, Aug 24 2010 at 12:59pm -0400, Seems clear that we must fix mpath to receive the SCSI errors, in some form, so it can decide if a retry is required/valid or not. Such error processing was a big selling point for the transition from bio-based to request-based multipath; so it's unfortunate that this Hannes already proposed some patches: https://patchwork.kernel.org/patch/61282/ https://patchwork.kernel.org/patch/61283/ https://patchwork.kernel.org/patch/61596/ This work was discussed at LSF, see "Error Handling - Hannes Reinecke" here: http://lwn.net/Articles/400589/ I thought James, Alasdair and others offered some guidance on what he'd like to see... Unfortunately, even though I was at this LSF session, I can't recall any specific consensus on how Hannes' work should be refactored (to avoid adding SCSI sense processing code directly in dm-mpath). Maybe James, Hannes or others remember? Was it enough to just have the SCSI sense processing code split out in a I'll have to review this thread again to understand why mpath's existing retry logic is broken behavior. mpath is used with more capable SCSI devices so I'm missing why a failed FLUSH implies data loss. Mike --
Hello, SBC doesn't specify the failure behavior, so it could be that retrying flush could be safe. But for most disk type devices, flush failure usually indicates that the device exhausted all the options to commit some of pending data to NV media - ie. even remapping failed for whatever reason. Even if retry is safe, it's more likely to simply delay notification of failure. In ATA, the situation is clearer, when a device actively fails a flush, the drive reports the first failed sector it failed to commit and the next flush will continue _after_ the sector - IOW, data is already lost. <speculation> I think there's no reason mpath should be tasked with retrying flush failure. That's upto the SCSI EH. If the command failed in 'safe' transient way - ie. device busy or whatnot, SCSI EH can and does retry the command. There are several FAILFAST bits already and SCSI EH can avoid retrying transport errors for mpath (maybe it already does that?) and just need to be able to tell upper layer that the failure was a fast one and upper layer is responsible for retrying? Is there any reason to pass the whole sense information upwards? </speculation> Anyways, flush failure is different from read/write failures. Read/writes can always be retried cleanly. They are stateless. I don't know how SCSI devices would actually behavior but it's a bit scary to retry SYNCHRONIZE_CACHE a device failed and report success upwards. Thanks. -- tejun --
Hi Tejun, Right. If the error is safe/needed to retry using other paths, mpath should retry even if REQ_FLUSH. Otherwise, only one path failure may result in system down. Just passing any REQ_FLUSH error upwards regardless the error type will make such situations, and users will feel the behavior as I'm not sure how long will it take. Anyway, as you said, the flush error handling of dm-mpath is already broken if data loss really happens on any storage used by dm-mpath. Although it's a serious issue and quick fix is required, I think you may leave the old behavior in your patch-set, since it's a separate issue. Thanks, Kiyoshi Ueda --
On Wed, Aug 25 2010 at 4:00am -0400, Right, there are hardware configurations that lend themselves to FLUSH retries mattering, namely: 1) a SAS drive with 2 ports and a writeback cache 2) theoretically possible: SCSI array that is mpath capable but advertises cache as writeback (WCE=1) The SAS case is obviously a more concrete example of why FLUSH retries are worthwhile in mpath. But I understand (and agree) that we'd be better off if mpath could differentiate between failures rather than blindly retrying on failures like it does today (fails path and retries if additional paths I'm not seeing where anything is broken with current mpath. If a multipathed LUN is WCE=1 then it should be fair to assume the cache is mirrored or shared across ports. Therefore retrying the SYNCHRONIZE CACHE is needed. Do we still have fear that SYNCHRONIZE CACHE can silently drop data? Seems unlikely especially given what Tejun shared from SBC. It seems that at worst, with current mpath, we retry when it doesn't make sense (e.g. target failure). Mike --
Hi Mike,
Do we have any proof to wipe that fear?
If retrying on flush failure is safe on all storages used with multipath
(e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
the real world.
But I'm afraid if there is a storage where something like below can happen:
- a flush command is returned as error to mpath because a part of
cache has physically broken at the time or so, then that part of
data loses and the size of the cache is shrunk by the storage.
- mpath retries the flush command using other path.
- the flush command is returned as success to mpath.
- mpath passes the result, success, to upper layer, but some of
the data already lost.
Thanks,
Kiyoshi Ueda
--
On Fri, Aug 27 2010 at 5:47am -0400, That does seem like a valid concern. But I'm not seeing why its unique to SYNCHRONIZE CACHE. Any IO that fails on the target side should be passed up once the error gets to DM. Mike --
Hi Kiyoshi, On Mon, Aug 30 2010 at 2:13am -0400, I reached out to Fred Knight on this, to get a more insight from a pure SCSI SBC perspective, and he shared the following: ----- End forwarded message ----- Seems like verifying/improving the handling of CHECK CONDITION is a more pressing concern than silent data loss purely due to SYNCHRONIZE CACHE retries. Without proper handling we could completely miss these deferred errors. But how to effectively report such errors to upper layers is unclear to me given that a particular SCSI command can carry error information for IO that was already acknowledged successful (e.g. to the FS). drivers/scsi/scsi_error.c's various calls to scsi_check_sense() illustrate Linux's current CHECK CONDITION handling. I need to look closer at how deferred errors propagate to upper layers. After an initial look it seems scsi_error.c does handle retrying commands where appropriate. I believe Hannes has concerns/insight here. Mike --
Quite. We _should_ be handling deferred errors correctly;
if you check drivers/scsi/scsi_lib.c:scsi_io_completion()
you'll find this:
if (host_byte(result) == DID_RESET) {
/* Third party bus reset or reset for error recovery
* reasons. Just retry the command and see what
* happens.
*/
action = ACTION_RETRY;
} else if (sense_valid && !sense_deferred) {
...
} else {
description = "Unhandled error code";
action = ACTION_FAIL;
}
ie for deferred errors we're already aborting the command. Not sure
if I agree with this bit in drivers/scsi/scsi_lib.c:
static int scsi_check_sense(struct scsi_cmnd *scmd)
{
struct scsi_device *sdev = scmd->device;
struct scsi_sense_hdr sshdr;
if (! scsi_command_normalize_sense(scmd, &sshdr))
return FAILED; /* no valid sense data */
if (scsi_sense_is_deferred(&sshdr))
return NEEDS_RETRY;
I doubt we can resolve the situation by retrying the command, which
will be the wrong command to retry anyway. I would rather
have those retry 'SUCCESS' and add another case in scsi_io_completion()
to notify us about the deferred error.
I'll be sending a patch.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
Ah. No. That is actually correct. SPC-3 states:
If the task terminates with CHECK CONDITION status and the sense data
describes a deferred error, the command for the terminated task shall
not have been processed.
So we're good after all and I would just add this patch:
diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index fb841e3..efb4609 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -912,7 +912,10 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int
good_bytes)
break;
}
} else {
- description = "Unhandled error code";
+ if (sense_deferred)
+ description = "Deferred error";
+ else
+ description = "Unhandled error code";
action = ACTION_FAIL;
}
to make the whole situation more transparent.
Cheers,
Hannes
--
Dr. Hannes Reinecke zSeries & Storage
hare@suse.de +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--
In SCSI there are conditions when a command, including FLUSH (SYNC_CACHE), failed which don't imply lost data. For them the caller expected to retry the failed command. Most common cases are Unit Attentions and TASK QUEUE FULL status. Vlad --
I generally agree with the patchset, but I believe this particular move is a really bad move. I'm not mentioning the obvious that a common functionality (enforcing requests ordering in this case) should be handled by a common library, but not internally by a zillion file systems Linux has. The worst in this move is that it would hide all the requests ordering semantic inside file systems in, most likely, a very much unclear way. That would lead that if I or someone else decide to implement the "hardware offload" of requests ordering (ORDERED requests), I or he/she would not be able to see any improvement until at least one file system be changed to be able to use it. Worse, if the implementor can't demonstrate the improvement, how can he encourage file systems developers to update their file systems? Which, basically, would mean that only a person with *BOTH* deep storage and file systems internals knowledge can do the job. How many do you know such people? Both storage and file systems topics are very wide and tricky, so nearly always people specialize in one of them, not both. Thus, this move would basically mean that the proper ordered queuing would probably never be implemented in Linux. I believe, much better would be to create a common interface, which file systems would use to enforce requests order, when they need it. Advantages of this approach: 1. The ordering requirements of file systems would be clear. 2. They would be handled in one place by a common code. 3. Any storage level expert can try to implement ordered queuing without a deep dive into file systems design and implementation. I already suggested such interface in http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Internally for the moment it can be implemented using existing REQ_FLUSH/FUA/etc. and waiting for all the requests in the group to finish. As a nice side effect, if a device doesn't support FUA, it would be possible to issue SYNC_CACHE command(s) only for ...
I/O ordering is still handled mostly by common code, that is the pagecache and the buffercache, although a few filesystems like XFS and btrfs have their own implementation of the second one. The current ordered semantics of barriers have only successfull implemented by a complete queue drain, and not effectively been used by filesystems. This patchset removes the bogus global ordering enforced by the block layer whenever a filesystems wants to be able to use cache flushes, and because of that allows deeper outstanding queue depth I/O with less latency. Now I know you in particular are a fan of scsi ordered tags. And as I told you before I'm open to review such an implementation if it shows us any advantages. Adding it after this patch is in fact not any more complicated than before, I'd almost be tempted it's easier as you don't have to plug it into the complex state machine we used for barriers, and more importantly we drop the requirement for the barrier sequence to be atomic, which in fact made implementing barriers using tagged queues impossible with the current scsi layer. As far as playing with ordered tags it's just adding a new flag for it on the bio that gets passed down to the driver. For a final version you'd need a queue-level feature if it's supported, but you don't even need that for the initial work. Then you can implement a variant of blk_do_flush that does away with queueing additional requests once finish but queues all two or three at the same time with your new ordered flag set, at which point you are back to the level or ordered tag usage that the old code allows. You're still left with all the hard problems of actually implementing error handling for it and using it higher up in the filesystem and generic page cache code. I'd really love to see your results, up to the point of just trying that once I get a little spare time. But my theory is that it won't help us - the problem with ordered tags is that they enforce global ordering while we ...
But how about file systems doing internal local order-by-drain? Without converting them to use ordered commands it would be impossible to show full potential of them and to make the conversion one would need deep internal FS knowledge. That's my point. But if there's a trivial way to The local ordering vs global ordering is relevant only if you have several applications/threads load. But how about a single application/thread? Another point, for which, AFAIU, the ORDERED commands were invented, is that they make ordering on the _another_ side of the link _after_ all link/transfer latencies. This is why it's hard to see advantage of them on local disks. Vlad --
Hello, I still think the benefit of ordering by tag would be marginal at best, and what have you guys measured there? Under the current framework, there's no easy way to measure full ordered-by-tag implementation. The mechanism for filesystems to communicate the ordering information (which would be a partially ordered graph) just isn't there and there is no way the current usage of ordering-by-tag only for barrier sequence can achieve anything close to that level of difference. Ripping out the original ordering by tag mechanism doesn't amount to much. The use of ordering-by-tag was pretty half-assed there anyway. If you think exporting full ordering information from filesystem to the lower layers is worthwhile, please go ahead. It would be very interesting to see how much actual difference it can make compared to ordering-by-filesystem and if it's actually better and the added complexity is manageable, there's no reason not to do that. Thank you. -- tejun --
Hello, Basically, I measured how iSCSI link utilization depends from amount of queued commands and queued data size. This is why I made it as a table. From it you can see which improvement you will have removing queue draining after 1, 2, 4, etc. commands depending of commands sizes. For instance, on my previous XFS rm example, where rm of 4 files took 3.5 minutes with nobarrier option, I could see that XFS was sending 1-3 32K commands in a row. From my table you can see that if it sent all them at once without draining, it would have about 150-200% speed increase. Vlad --
Hello, You compared barrier off/on. Of course, it will make a big difference. I think good part of that gain should be realized by the currently proposed patchset which removes draining. What's needed to be demonstrated is the difference between ordered-by-waiting and ordered-by-tag. We've never had code to do that properly. The original ordered-by-tag we had only applied tag ordering to two or three command sequences inside a barrier, which doesn't amount to much (and could even be harmful as it imposes draining of all simple commands inside the device only to reduce issue latencies for a few commands). You'll need to hook into filesystem and somehow export the ordering information down to the driver so that whatever needs ordering is sent out as ordered commands. As I've wrote multiple times, I'm pretty skeptical it will bring much. Ordered tag mandates draining inside the device just like the original barrier implementation. Sure, it's done at a lower layer and command issue latencies will be reduced thanks to that but ordered-by-waiting doesn't require _any_ draining at all. The whole pipeline can be kept full all the time. I'm often wrong tho, so please feel free to go ahead and prove me wrong. :-) Thanks. -- tejun --
Actually, I thought about ordered tag writes, too. But eventually I had to give up on this for a simple reason: Ordered tag controls the ordering on the SCSI _TARGET_. But for a meaningful implementation we need to control the ordering all the way down from ->queuecommand(). Which means we have three areas we need to cover here: - driver (ie between ->queuecommand() and passing it off to the firmware) - firmware - fabric Sadly, the latter two are really hard to influence. And, what's more, with the new/modern CNAs with multiple queues and possible multiple routes to the target it becomes impossible to guarantee ordering. So using ordered tags for FibreChannel is not going to work, which makes implementing it a bit of a pointless exercise for me. Cheers, Hannes -- Dr. Hannes Reinecke zSeries & Storage hare@suse.de +49 911 74053 688 SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg GF: Markus Rex, HRB 16746 (AG Nürnberg) --
The situation is, actually, much better than you think. An SCSI transport should provide an in-order delivery of commands. In some transports it is required (e.g. iSCSI), in some - optional (e.g. FC). For FC "an application client may determine if a device server supports the precise delivery function by using the MODE SENSE and MODE SELECT commands to examine and set the enable precise delivery checking (EPDC) bit in the Fibre Channel Logical Unit Control page" (Fibre Channel Protocol for SCSI (FCP)). You can find more details in FCP section "Precise delivery of SCSI commands". Regarding multiple queues, in case of a multipath access to a device SCSI requires either each path be a separate I_T nexus, where order of commands is maintained, or a transport required to maintain in-order commands delivery among multiple paths in a single I_T nexus (session) as it is done in iSCSI's MC/S and, most likely, wide SAS ports. So, everything is in the specs. We only need to use it properly. How it can be done on the drivers level as well as how errors recovery can be done using ACA and UA_INTLCK facilities I wrote few weeks ago in the "[RFC] relaxed barrier semantics" thread. Vlad --
FYI: One issue with this series is that make_request based drivers not have to access all REQ_FLUSH and REQ_FUA requests. We'll either need to add handling to empty REQ_FLUSH requests to all of them or figure out a way to prevent them getting sent. That is assuming they'll simply ignore REQ_FLUSH/REQ_FUA on normal writes. --
Hello, Can you be a bit more specific? In most cases, request based drivers should be fine. They sit behind the front most request_queue which would discompose REQ_FLUSH/FUAs into appropriate command sequence. For the request based drivers, it's not different from the original REQ_HARDBARRIER mechanism, it'll just see flushes and optionally FUA writes. Thanks. -- tejun --
I said make_request based drivers, that is drivers taking bios. These get bios directly from __generic_make_request and need to deal with REQ_FLUSH/FUA themselves. We have quite a few more than just dm/md of this kind: arch/powerpc/sysdev/axonram.c: blk_queue_make_request(bank->disk->queue, axon_ram_make_request); drivers/block/aoe/aoeblk.c: blk_queue_make_request(d->blkq, aoeblk_make_request); drivers/block/brd.c: blk_queue_make_request(brd->brd_queue, brd_make_request); drivers/block/drbd/drbd_main.c: blk_queue_make_request(q, drbd_make_request_26); drivers/block/loop.c: blk_queue_make_request(lo->lo_queue, loop_make_request); drivers/block/pktcdvd.c: blk_queue_make_request(q, pkt_make_request); drivers/block/ps3vram.c: blk_queue_make_request(queue, ps3vram_make_request); drivers/block/umem.c: blk_queue_make_request(card->queue, mm_make_request); drivers/s390/block/dcssblk.c: blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request); drivers/s390/block/xpram.c: blk_queue_make_request(xpram_queues[i], xpram_make_request); drivers/staging/zram/zram_drv.c:blk_queue_make_request(zram->queue, zram_make_request); --
Hello, Will work on these. Thanks. -- tejun --
FYI: here's a little writeup to document the new cache flushing scheme, intended to replace Documentation/block/barriers.txt. Any good suggestion for a filename in the kernel tree? --- Explicit volatile write cache control ===================================== Introduction ------------ Many storage devices, especially in the consumer market, come with volatile write back caches. That means the devices signal I/O completion to the operating system before data actually has hit the physical medium. This behavior obviously speeds up various workloads, but it means the operating system needs to force data out to the physical medium when it performs a data integrity operation like fsync, sync or an unmount. The Linux block layer provides a two simple mechanism that lets filesystems control the caching behavior of the storage device. These mechanisms are a forced cache flush, and the Force Unit Access (FUA) flag for requests. Explicit cache flushes ---------------------- The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from the filesystem and will make sure the volatile cache of the storage device has been flushed before the actual I/O operation is started. The explicit guarantees write requests that have completed before the bio was submitted actually are on the physical medium before this request has started. In addition the REQ_FLUSH flag can be set on an otherwise empty bio structure, which causes only an explicit cache flush without any dependent I/O. It is recommend to use the blkdev_issue_flush() helper for a pure cache flush. Forced Unit Access ----------------- The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the filesystem and will make sure that I/O completion for this requests is not signaled before the data has made it to non-volatile storage on the physical medium. Implementation details for filesystems -------------------------------------- Filesystem can simply set the REQ_FLUSH and ...
I was thinking that we might be better off using the "durable writes" term more since it is well documented (at least in the database world, where it is the "D" Should we mention that users can also disable the write cache on the target device? It might also be worth mentioning that storage needs to be properly configured - i.e., an internal hardware RAID card with battery backing needs can expose itself as a writethrough cache *only if* it actually has control over all of the backend disks and can flush/disable their write caches. Maybe that is too much detail, but I know that people have lost data with some of these setups. The rest of the write up below sounds good, thanks for pulling this together! --
sata_lies.txt? Ok, maybe writeback_cache.txt? -chris --
writeback_cache.txt is certainly the least confusing :) ric --
Hello, The term is very foreign to people outside of enterprise / database loop. writeback-cache.txt or write-cache-control.txt sounds good It might be useful to give several example configurations with different cache configurations. I don't have much experience with battery backed arrays but aren't they suppose to report write through cache automatically? Thanks. -- tejun --
They usually do. I have one that doesn't, but SYNCHRONIZE CACHE on it is so fast that it effectively must be a no-op. --
Arrays are not a problem in general - they normally have internally, redundant batteries to hold up the cache. The issue is when you have an internal hardware RAID card with a large cache. Those cards sit in your server and the batteries on the card protect its internal cache, but do not have the capacity to hold up the drives behind it. Normally, those drives should have their write cache disabled, but sometimes (especially with S-ATA disks) this is not done. ric --
I haven't seen it. I don't care particularly about this case, but once it a while people want to disable flushing for testing or because they really don't care. What about adding a sysfs attribue to every request_queue that allows disabling the cache flushing feature? Compared to the barrier option this controls the feature at the right level and makes it available to everyone instead of beeing duplicated. After a while we can then simply ignore the barrier/nobarrier options. --
Hello, Yeah, that sounds reasonable. blk_queue_flush() can be called anytime without locking anyway, so it should be really easy to implement too. Thanks. -- tejun --
I don't think we can simply call blk_queue_flush - we must ensure to never set more bits than the device allows. We'll just need two sets of flags in the request queue, with the sysfs file checking that it never allows more flags than blk_queue_flush. I'll prepare a patch for this on top of the current series. --
Agree, that would be fine. -- Jens Axboe --
Hi Jens, There are actually two distinct problems: (1) arrays with a non-volatile write cache (battery backed, navram, whatever) that do not NOOP a SYNC_CACHE command. I know of one brand that seems to do this, but it is not a common brand. If we do not issue flushes for write through caches, I think that we will avoid this in any case. (2) hardware raid cards with internal buffer memory and on-card battery backup (they sit in your server, disks sit in jbod like expansion shelves). These are fine if the drives in those shelves have write cache disabled. ric --
Yes - this is certainly one way to do it. Note that this will not work if the card advertises itself as a write through cache (and we end up not sending down the SYNC_CACHE commands). At least one hardware RAID card (I unfortunately cannot mention the brand) did not do this command forwarding. ric --
On Mon, Aug 23, 2010 at 11:19:13AM -0400, Ric Wheeler wrote: Actually some of such cards keep write cache on the drives enabled and issue FLUSH CACHE commands to the drives. E.g., 3ware 9690SA behaves like this at least with SATA drives (the FLUSH CACHE commands can be seen after enabling performance monitoring - they often end up in the "10 commands having the largest latency" table). This can actually be safe if the card waits for the FLUSH CACHE completion before making the write cache data in its battery-backed memory available for reuse (and the drive implements the FLUSH CACHE command correctly).
Hello, A bit nit picky but flash devices can also have writeback caches and the term physical medium sounds a bit off for those cases. Maybe just It may be worthwhile to explain the sequence of operations when REQ_FLUSH + data + REQ_FUA is executed. It can be extrapolated from the previous two descriptions but I think giving examples of different sequences depending on FLUSH/FUA configuration would be helpful to help understanding the overall picture of things. Other than those, looks good to me. Thanks. -- tejun --
Below is an updated version of the documentation. It fixes several typos Zach Brown noticed and replaces all references to a physical medium with the term non-volatile storage. I haven't added any examples yet as I need to figure how they fit into the rest of the document. --- Explicit volatile write cache control ===================================== Introduction ------------ Many storage devices, especially in the consumer market, come with volatile write back caches. That means the devices signal I/O completion to the operating system before data actually has hit the non-volatile storage. This behavior obviously speeds up various workloads, but it means the operating system needs to force data out to the non-volatile storage when it performs a data integrity operation like fsync, sync or an unmount. The Linux block layer provides two simple mechanism that lets filesystems control the caching behavior of the storage device. These mechanisms are a forced cache flush, and the Force Unit Access (FUA) flag for requests. Explicit cache flushes ---------------------- The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from the filesystem and will make sure the volatile cache of the storage device has been flushed before the actual I/O operation is started. This explicitly guarantees that previously completed write requests are on non-volatile storage before the flagged bio starts. In addition the REQ_FLUSH flag can be set on an otherwise empty bio structure, which causes only an explicit cache flush without any dependent I/O. It is recommend to use the blkdev_issue_flush() helper for a pure cache flush. Forced Unit Access ----------------- The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the filesystem and will make sure that I/O completion for this requests is only signaled after the data has been commited to non-volatile storage. Implementation details for ...
We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so take these calls into common code. Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that it's old name doesn't apply anymore.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c 2010-08-17 15:34:27.864004351 +0200
+++ linux-2.6/block/blk-flush.c 2010-08-17 16:12:53.504253827 +0200
@@ -74,16 +74,11 @@ static void post_flush_end_io(struct req
blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}
-static void queue_flush(struct request_queue *q, struct request *rq,
- rq_end_io_fn *end_io)
+static void init_flush_request(struct request *rq, struct gendisk *disk)
{
- blk_rq_init(q, rq);
rq->cmd_type = REQ_TYPE_FS;
rq->cmd_flags = REQ_FLUSH;
- rq->rq_disk = q->orig_flush_rq->rq_disk;
- rq->end_io = end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+ rq->rq_disk = disk;
}
static struct request *queue_next_fseq(struct request_queue *q)
@@ -91,29 +86,28 @@ static struct request *queue_next_fseq(s
struct request *orig_rq = q->orig_flush_rq;
struct request *rq = &q->flush_rq;
+ blk_rq_init(q, rq);
+
switch (blk_flush_cur_seq(q)) {
case QUEUE_FSEQ_PREFLUSH:
- queue_flush(q, rq, pre_flush_end_io);
+ init_flush_request(rq, orig_rq->rq_disk);
+ rq->end_io = pre_flush_end_io;
break;
-
case QUEUE_FSEQ_DATA:
- /* initialize proxy request, inherit FLUSH/FUA and queue it */
- blk_rq_init(q, rq);
init_request_from_bio(rq, orig_rq->bio);
rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
rq->end_io = flush_data_end_io;
-
- elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
break;
-
case QUEUE_FSEQ_POSTFLUSH:
- queue_flush(q, rq, post_flush_end_io);
+ init_flush_request(rq, ...Nit: it's "its" above, not "it's". If in doubt, if it's "it is" (or "it has") it's "it's" but if it could be "his" or "hers" it's "its". I'm guessing English isn't your first language (a) because of your .de address and (b) because it's better than most British people's, but still, it's a common mistake. If I can remember any of the German I studied all those years ago, "its" is roughly equivalent to "sein", and "it's" to "es ist". Cheers, John. --
