[PATCH 09/11] block: implement REQ_FLUSH/FUA based interface for FLUSH/FUA requests

Previous thread: [PATCH 11/11] block: use REQ_FLUSH in blkdev_issue_flush() by Tejun Heo on Thursday, August 12, 2010 - 5:41 am. (1 message)

Next thread: BUG: using smp_processor_id() in preemptible [00000000] code by Sergey Senozhatsky on Thursday, August 12, 2010 - 5:51 am. (1 message)
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Hello,

This patchset replaces the current barrier implementation with
sequenced flush which doesn't impose any restriction on ordering
around the flush requests.  This patchst is result of the following
discussion thread.

  http://thread.gmane.org/gmane.linux.file-systems/43877

In summary, filesystems can take over the ordering of requests around
commit writes and the block layer should just supply a mechanism to
perform the commit writes themselves.  This would greatly lessen tha
stall caused by queue dumping and draining used by the current barrier
implementation for request ordering.

This patchset converts barrier mechanism to sequenced flush/fua
mechanism in the following steps.

1. Kill the mostly unused ORDERED_BY_TAG support.

2. Deprecate REQ_HARDBARRIER support.  All hard barrier requests are
   failed with -EOPNOTSUPP.

3. Drop barrier ordering by queue draining mechanism.

4. Rename barrier to flush and implement new interface based on
   REQ_FLUSH and REQ_FUA as suggested by Christoph.

blkdev_issue_flush() is converted to use the new mechanism but all the
filesystems still use the deprecated REQ_HARDBARRIER which always
fails.  Each filesystem needs to be updated to enforce request
ordering themselves and then to use REQ_FLUSH/FUA mechanism.

loop, md, dm, etc... haven't been converted yet and REQ_FLUSH/FUA
doesn't work with them yet.  I'll convert most of them soonish if this
patchset is generally agreed upon.

This patchset contains the following patches.

  0001-block-loop-queue-ordered-mode-should-be-DRAIN_FLUSH.patch
  0002-block-kill-QUEUE_ORDERED_BY_TAG.patch
  0003-block-deprecate-barrier-and-replace-blk_queue_ordere.patch
  0004-block-remove-spurious-uses-of-REQ_HARDBARRIER.patch
  0005-block-misc-cleanups-in-barrier-code.patch
  0006-block-drop-barrier-ordering-by-queue-draining.patch
  0007-block-rename-blk-barrier.c-to-blk-flush.c.patch
  0008-block-rename-barrier-ordered-to-flush.patch
  ...
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

With ordering requirements dropped, barrier and ordered are misnomers.
Now all block layer does is sequencing FLUSH and FUA.  Rename them to
flush.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/blk-core.c       |   21 +++++-----
 block/blk-flush.c      |   98 +++++++++++++++++++++++------------------------
 block/blk.h            |    4 +-
 include/linux/blkdev.h |   26 ++++++------
 4 files changed, 73 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 82bd6d9..efe391b 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -136,7 +136,7 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 {
 	struct request_queue *q = rq->q;
 
-	if (&q->bar_rq != rq) {
+	if (&q->flush_rq != rq) {
 		if (error)
 			clear_bit(BIO_UPTODATE, &bio->bi_flags);
 		else if (!test_bit(BIO_UPTODATE, &bio->bi_flags))
@@ -160,13 +160,12 @@ static void req_bio_endio(struct request *rq, struct bio *bio,
 		if (bio->bi_size == 0)
 			bio_endio(bio, error);
 	} else {
-
 		/*
-		 * Okay, this is the barrier request in progress, just
-		 * record the error;
+		 * Okay, this is the sequenced flush request in
+		 * progress, just record the error;
 		 */
-		if (error && !q->orderr)
-			q->orderr = error;
+		if (error && !q->flush_err)
+			q->flush_err = error;
 	}
 }
 
@@ -520,7 +519,7 @@ struct request_queue *blk_alloc_queue_node(gfp_t gfp_mask, int node_id)
 	init_timer(&q->unplug_timer);
 	setup_timer(&q->timeout, blk_rq_timed_out_timer, (unsigned long) q);
 	INIT_LIST_HEAD(&q->timeout_list);
-	INIT_LIST_HEAD(&q->pending_barriers);
+	INIT_LIST_HEAD(&q->pending_flushes);
 	INIT_WORK(&q->unplug_work, blk_unplug_work);
 
 	kobject_init(&q->kobj, &blk_queue_ktype);
@@ -1758,11 +1757,11 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 static void blk_account_io_done(struct request *req)
 {
 	/*
-	 * Account IO completion.  bar_rq isn't accounted as a ...
From: Christoph Hellwig
Date: Tuesday, August 17, 2010 - 6:26 am

Btw, I think this one should just go away.  It's only used by
ide in an attempt to make ordered sequences atomic, which isn't
needed for the new design.
--

From: Tejun Heo
Date: Tuesday, August 17, 2010 - 9:23 am

Hello,


Yeap, agreed.  I couldn't really understand why the the sequence
needed to be atomic for ide in the first place so just left it alone.
Do you understand why it tried to be atomic?

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Tuesday, August 17, 2010 - 10:08 am

I think initial drafs of the barrier specification talked about atomic
sequences.  Except for that I can't think of any reason.

--

From: Tejun Heo
Date: Tuesday, August 17, 2010 - 11:23 pm

Hello,


Hmm... alright, I'll rip it out.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Filesystems will take all the responsibilities for ordering requests
around commit writes and will only indicate how the commit writes
themselves should be handled by block layers.  This patch drops
barrier ordering by queue draining from block layer.  Ordering by
draining implementation was somewhat invasive to request handling.
List of notable changes follow.

* Each queue has 1 bit color which is flipped on each barrier issue.
  This is used to track whether a given request is issued before the
  current barrier or not.  REQ_ORDERED_COLOR flag and coloring
  implementation in __elv_add_request() are removed.

* Requests which shouldn't be processed yet for draining were stalled
  by returning -EAGAIN from blk_do_ordered() according to the test
  result between blk_ordered_req_seq() and blk_blk_ordered_cur_seq().
  This logic is removed.

* Draining completion logic in elv_completed_request() removed.

* All barrier sequence requests were queued to request queue and then
  trckled to lower layer according to progress and thus maintaining
  request orders during requeue was necessary.  This is replaced by
  queueing the next request in the barrier sequence only after the
  current one is complete from blk_ordered_complete_seq(), which
  removes the need for multiple proxy requests in struct request_queue
  and the request sorting logic in the ELEVATOR_INSERT_REQUEUE path of
  elv_insert().

* As barriers no longer have ordering constraints, there's no need to
  dump the whole elevator onto the dispatch queue on each barrier.
  Insert barriers at the front instead.

* If other barrier requests come to the front of the dispatch queue
  while one is already in progress, they are stored in
  q->pending_barriers and restored to dispatch queue one-by-one after
  each barrier completion from blk_ordered_complete_seq().

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/blk-barrier.c       |  220 ...
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

REQ_HARDBARRIER is deprecated.  Remove spurious uses in the following
users.  Please note that other than osdblk, all other uses were
already spurious before deprecation.

* osdblk: osdblk_rq_fn() won't receive any request with
  REQ_HARDBARRIER set.  Remove the test for it.

* pktcdvd: use of REQ_HARDBARRIER in pkt_generic_packet() doesn't mean
  anything.  Removed.

* aic7xxx_old: Setting MSG_ORDERED_Q_TAG on REQ_HARDBARRIER is
  spurious.  Removed.

* sas_scsi_host: Setting TASK_ATTR_ORDERED on REQ_HARDBARRIER is
  spurious.  Removed.

* scsi_tcq: The ordered tag path wasn't being used anyway.  Removed.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: James Bottomley <James.Bottomley@suse.de>
Cc: Peter Osterlund <petero2@telia.com>
---
 drivers/block/osdblk.c              |    3 +--
 drivers/block/pktcdvd.c             |    1 -
 drivers/scsi/aic7xxx_old.c          |   21 ++-------------------
 drivers/scsi/libsas/sas_scsi_host.c |   13 +------------
 include/scsi/scsi_tcq.h             |    6 +-----
 5 files changed, 5 insertions(+), 39 deletions(-)

diff --git a/drivers/block/osdblk.c b/drivers/block/osdblk.c
index 72d6246..87311eb 100644
--- a/drivers/block/osdblk.c
+++ b/drivers/block/osdblk.c
@@ -310,8 +310,7 @@ static void osdblk_rq_fn(struct request_queue *q)
 			break;
 
 		/* filter out block requests we don't understand */
-		if (rq->cmd_type != REQ_TYPE_FS &&
-		    !(rq->cmd_flags & REQ_HARDBARRIER)) {
+		if (rq->cmd_type != REQ_TYPE_FS) {
 			blk_end_request_all(rq, 0);
 			continue;
 		}
diff --git a/drivers/block/pktcdvd.c b/drivers/block/pktcdvd.c
index b1cbeb5..0166ea1 100644
--- a/drivers/block/pktcdvd.c
+++ b/drivers/block/pktcdvd.c
@@ -753,7 +753,6 @@ static int pkt_generic_packet(struct pktcdvd_device *pd, struct packet_command *
 
 	rq->timeout = 60*HZ;
 	rq->cmd_type = REQ_TYPE_BLOCK_PC;
-	rq->cmd_flags |= REQ_HARDBARRIER;
 	if (cgc->quiet)
 		rq->cmd_flags |= REQ_QUIET;
 
diff --git ...
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Barrier is deemed too heavy and will soon be replaced by FLUSH/FUA
requests.  Deprecate barrier.  All REQ_HARDBARRIERs are failed with
-EOPNOTSUPP and blk_queue_ordered() is replaced with simpler
blk_queue_flush().

blk_queue_flush() takes combinations of REQ_FLUSH and FUA.  If a
device has write cache and can flush it, it should set REQ_FLUSH.  If
the device can handle FUA writes, it should also set REQ_FUA.

All blk_queue_ordered() users are converted.

* ORDERED_DRAIN is mapped to 0 which is the default value.
* ORDERED_DRAIN_FLUSH is mapped to REQ_FLUSH.
* ORDERED_DRAIN_FLUSH_FUA is mapped to REQ_FLUSH | REQ_FUA.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
Cc: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Cc: Boaz Harrosh <bharrosh@panasas.com>
Cc: Geert Uytterhoeven <Geert.Uytterhoeven@sonycom.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Alasdair G Kergon <agk@redhat.com>
Cc: Pierre Ossman <drzeus@drzeus.cx>
Cc: Stefan Weinhuber <wein@de.ibm.com>
---
 block/blk-barrier.c          |   29 ----------------------------
 block/blk-core.c             |    6 +++-
 block/blk-settings.c         |   20 +++++++++++++++++++
 drivers/block/brd.c          |    1 -
 drivers/block/loop.c         |    2 +-
 drivers/block/osdblk.c       |    2 +-
 drivers/block/ps3disk.c      |    2 +-
 drivers/block/virtio_blk.c   |   25 ++++++++---------------
 drivers/block/xen-blkfront.c |   43 +++++++++++------------------------------
 drivers/ide/ide-disk.c       |   13 +++++------
 drivers/md/dm.c              |    2 +-
 drivers/mmc/card/queue.c     |    1 -
 drivers/s390/block/dasd.c    |    1 -
 drivers/scsi/sd.c            |   16 +++++++-------
 include/linux/blkdev.h       |    6 +++-
 15 files changed, 67 insertions(+), 102 deletions(-)

diff --git a/block/blk-barrier.c ...
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Without ordering requirements, barrier and ordering are minomers.
Rename block/blk-barrier.c to block/blk-flush.c.  Rename of symbols
will follow.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/Makefile      |    2 +-
 block/blk-barrier.c |  248 ---------------------------------------------------
 block/blk-flush.c   |  248 +++++++++++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 249 insertions(+), 249 deletions(-)
 delete mode 100644 block/blk-barrier.c
 create mode 100644 block/blk-flush.c

diff --git a/block/Makefile b/block/Makefile
index 0bb499a..f627e4b 100644
--- a/block/Makefile
+++ b/block/Makefile
@@ -3,7 +3,7 @@
 #
 
 obj-$(CONFIG_BLOCK) := elevator.o blk-core.o blk-tag.o blk-sysfs.o \
-			blk-barrier.o blk-settings.o blk-ioc.o blk-map.o \
+			blk-flush.o blk-settings.o blk-ioc.o blk-map.o \
 			blk-exec.o blk-merge.o blk-softirq.o blk-timeout.o \
 			blk-iopoll.o blk-lib.o ioctl.o genhd.o scsi_ioctl.o
 
diff --git a/block/blk-barrier.c b/block/blk-barrier.c
deleted file mode 100644
index e8b2e5c..0000000
--- a/block/blk-barrier.c
+++ /dev/null
@@ -1,248 +0,0 @@
-/*
- * Functions related to barrier IO handling
- */
-#include <linux/kernel.h>
-#include <linux/module.h>
-#include <linux/bio.h>
-#include <linux/blkdev.h>
-#include <linux/gfp.h>
-
-#include "blk.h"
-
-static struct request *queue_next_ordseq(struct request_queue *q);
-
-/*
- * Cache flushing for ordered writes handling
- */
-unsigned blk_ordered_cur_seq(struct request_queue *q)
-{
-	if (!q->ordseq)
-		return 0;
-	return 1 << ffz(q->ordseq);
-}
-
-static struct request *blk_ordered_complete_seq(struct request_queue *q,
-						unsigned seq, int error)
-{
-	struct request *next_rq = NULL;
-
-	if (error && !q->orderr)
-		q->orderr = error;
-
-	BUG_ON(q->ordseq & seq);
-	q->ordseq |= seq;
-
-	if (blk_ordered_cur_seq(q) != QUEUE_ORDSEQ_DONE) {
-		/* not complete yet, queue the next ordered sequence */
-		next_rq = ...
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Nobody is making meaningful use of ORDERED_BY_TAG now and queue
draining for barrier requests will be removed soon which will render
the advantage of tag ordering moot.  Kill ORDERED_BY_TAG.  The
following users are affected.

* brd: converted to ORDERED_DRAIN.
* virtio_blk: ORDERED_TAG path was already marked deprecated.  Removed.
* xen-blkfront: ORDERED_TAG case dropped.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Nick Piggin <npiggin@kernel.dk>
Cc: Michael S. Tsirkin <mst@redhat.com>
Cc: Jeremy Fitzhardinge <jeremy@xensource.com>
Cc: Chris Wright <chrisw@sous-sol.org>
---
 block/blk-barrier.c          |   35 +++++++----------------------------
 drivers/block/brd.c          |    2 +-
 drivers/block/virtio_blk.c   |    9 ---------
 drivers/block/xen-blkfront.c |    8 +++-----
 drivers/scsi/sd.c            |    4 +---
 include/linux/blkdev.h       |   17 +----------------
 6 files changed, 13 insertions(+), 62 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index f0faefc..c807e9c 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -26,10 +26,7 @@ int blk_queue_ordered(struct request_queue *q, unsigned ordered)
 	if (ordered != QUEUE_ORDERED_NONE &&
 	    ordered != QUEUE_ORDERED_DRAIN &&
 	    ordered != QUEUE_ORDERED_DRAIN_FLUSH &&
-	    ordered != QUEUE_ORDERED_DRAIN_FUA &&
-	    ordered != QUEUE_ORDERED_TAG &&
-	    ordered != QUEUE_ORDERED_TAG_FLUSH &&
-	    ordered != QUEUE_ORDERED_TAG_FUA) {
+	    ordered != QUEUE_ORDERED_DRAIN_FUA) {
 		printk(KERN_ERR "blk_queue_ordered: bad value %d\n", ordered);
 		return -EINVAL;
 	}
@@ -155,21 +152,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
 	 * For an empty barrier, there's no actual BAR request, which
 	 * in turn makes POSTFLUSH unnecessary.  Mask them off.
 	 */
-	if (!blk_rq_sectors(rq)) {
+	if (!blk_rq_sectors(rq))
 		q->ordered &= ~(QUEUE_ORDERED_DO_BAR |
 ...
From: Vladislav Bolkhovitin
Date: Friday, August 13, 2010 - 5:56 am

Hello Tejun,


Have you seen Hannes Reinecke's and my measurements in 
http://marc.info/?l=linux-scsi&m=128110662528485&w=2 and 
http://marc.info/?l=linux-scsi&m=128111995217405&w=2 correspondingly?

If yes, what else evidences do you need to see that the tag ordering is 
a big performance win?

Vlad
--

From: Christoph Hellwig
Date: Friday, August 13, 2010 - 6:06 am

It's not tag odering that is a win but big queue depth.  That's what you
measured and what I fully agree on.  I haven't been able to get out of
Hannes what he actually measured.

And if you'd actually look at the patchset allowing deep queues is
exactly what it allows us, and while I haven't done testing on this
patchset but only on my previous version it does get us back to use
the full potential of large arrays exactly because of that.
--

From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Make the following cleanups in preparation of barrier/flush update.

* blk_do_ordered() declaration is moved from include/linux/blkdev.h to
  block/blk.h.

* blk_do_ordered() now returns pointer to struct request, with %NULL
  meaning "try the next request" and ERR_PTR(-EAGAIN) "try again
  later".  The third case will be dropped with further changes.

* In the initialization of proxy barrier request, data direction is
  already set by init_request_from_bio().  Drop unnecessary explicit
  REQ_WRITE setting and move init_request_from_bio() above REQ_FUA
  flag setting.

* add_request() is collapsed into __make_request().

These changes don't make any functional difference.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 block/blk-barrier.c    |   32 ++++++++++++++------------------
 block/blk-core.c       |   21 ++++-----------------
 block/blk.h            |    7 +++++--
 include/linux/blkdev.h |    1 -
 4 files changed, 23 insertions(+), 38 deletions(-)

diff --git a/block/blk-barrier.c b/block/blk-barrier.c
index ed0aba5..f1be85b 100644
--- a/block/blk-barrier.c
+++ b/block/blk-barrier.c
@@ -110,9 +110,9 @@ static void queue_flush(struct request_queue *q, unsigned which)
 	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 }
 
-static inline bool start_ordered(struct request_queue *q, struct request **rqp)
+static inline struct request *start_ordered(struct request_queue *q,
+					    struct request *rq)
 {
-	struct request *rq = *rqp;
 	unsigned skip = 0;
 
 	q->orderr = 0;
@@ -149,11 +149,9 @@ static inline bool start_ordered(struct request_queue *q, struct request **rqp)
 
 		/* initialize proxy request and queue it */
 		blk_rq_init(q, rq);
-		if (bio_data_dir(q->orig_bar_rq->bio) == WRITE)
-			rq->cmd_flags |= REQ_WRITE;
+		init_request_from_bio(rq, q->orig_bar_rq->bio);
 		if (q->ordered & QUEUE_ORDERED_DO_FUA)
 			rq->cmd_flags |= REQ_FUA;
-		init_request_from_bio(rq, q->orig_bar_rq->bio);
 		rq->end_io = bar_end_io;
 
 		elv_insert(q, rq, ...
From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

loop implements FLUSH using fsync but was incorrectly setting its
ordered mode to DRAIN.  Change it to DRAIN_FLUSH.  In practice, this
doesn't change anything as loop doesn't make use of the block layer
ordered implementation.

Signed-off-by: Tejun Heo <tj@kernel.org>
---
 drivers/block/loop.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index f3c636d..c3a4a2e 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -832,7 +832,7 @@ static int loop_set_fd(struct loop_device *lo, fmode_t mode,
 	lo->lo_queue->unplug_fn = loop_unplug;
 
 	if (!(lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
-		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN);
+		blk_queue_ordered(lo->lo_queue, QUEUE_ORDERED_DRAIN_FLUSH);
 
 	set_capacity(lo->lo_disk, size);
 	bd_set_size(bdev, size << 9);
-- 
1.7.1

--

From: Tejun Heo
Date: Thursday, August 12, 2010 - 5:41 am

Now that the backend conversion is complete, export sequenced
FLUSH/FUA capability through REQ_FLUSH/FUA flags.  REQ_FLUSH means the
device cache should be flushed before executing the request.  REQ_FUA
means that the data in the request should be on non-volatile media on
completion.

Block layer will choose the correct way of implementing the semantics
and execute it.  The request may be passed to the device directly if
the device can handle it; otherwise, it will be sequenced using one or
more proxy requests.  Devices will never see REQ_FLUSH and/or FUA
which it doesn't support.

* QUEUE_ORDERED_* are removed and QUEUE_FSEQ_* are moved into
  blk-flush.c.

* REQ_FLUSH w/o data can also be directly passed to drivers without
  sequencing but some drivers assume that zero length requests don't
  have rq->bio which isn't true for these requests requiring the use
  of proxy requests.

Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Christoph Hellwig <hch@infradead.org>
---
 block/blk-core.c       |    2 +-
 block/blk-flush.c      |   85 ++++++++++++++++++++++++++----------------------
 block/blk.h            |    3 ++
 include/linux/blkdev.h |   38 +--------------------
 4 files changed, 52 insertions(+), 76 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index efe391b..c00ace2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1204,7 +1204,7 @@ static int __make_request(struct request_queue *q, struct bio *bio)
 
 	spin_lock_irq(q->queue_lock);
 
-	if (bio->bi_rw & REQ_HARDBARRIER) {
+	if (bio->bi_rw & (REQ_FLUSH | REQ_FUA)) {
 		where = ELEVATOR_INSERT_FRONT;
 		goto get_rq;
 	}
diff --git a/block/blk-flush.c b/block/blk-flush.c
index dd87322..452c552 100644
--- a/block/blk-flush.c
+++ b/block/blk-flush.c
@@ -1,5 +1,5 @@
 /*
- * Functions related to barrier IO handling
+ * Functions to sequence FLUSH and FUA writes.
  */
 #include <linux/kernel.h>
 #include <linux/module.h>
@@ -9,6 +9,15 @@
 
 #include "blk.h"
 
+/* FLUSH/FUA sequences ...
From: Christoph Hellwig
Date: Friday, August 13, 2010 - 4:48 am

The patchset looks functionally correct to me, and with a small patch
to make use of WRITE_FUA_FLUSH survives xfstests, and instrumenting the
underlying qemu shows that we actually get the flush requests where we should.

No performance or power fail testing done yet.

But I do not like the transition very much.  The new WRITE_FUA_FLUSH
request is exactly what filesystems expect from a current barrier
request, so I'd rather move to that functionality without breaking stuff
inbetween.

So if it was to me I'd keep patches 1, 2, 4 and 5 from your series, than
a main one to relax barrier semantics, then have the renaming patches 7
and 8, and possible keep patch 11 separate from the main implementation
change, and if absolutely also a separate one to introduce REQ_FUA and
REQ_FLUSH in the bio interface, but keep things working while doing
this.

Then we can patches do disable the reiserfs barrier "optimization" as
the very first one, and DM/MD support which I'm currently working on
as the last one and we can start doing the heavy testing.

--

From: Tejun Heo
Date: Friday, August 13, 2010 - 6:48 am

Hello, Christoph.



There are two reason to avoid changing the meaning of REQ_HARDBARRIER
and just deprecate it.  One is to avoid breaking filesystems'
expectations underneath it.  Please note that there are out-of-tree
filesystems too.  I think it would be too dangerous to relax
REQ_HARDBARRIER.

Another is that pseudo block layer drivers (loop, virtio_blk,
md/dm...) have assumptions about REQ_HARDBARRIER behavior and things
would be broken in obscure ways between REQ_HARDBARRIER semantics
change and updates to each of those drivers, so I don't really think

Oops, I've already converted loop, virtio_blk/lguest and am working on
md/dm right now too.  I'm almost done with md and now doing dm. :-)
Maybe we should post them right now so that we don't waste too much
time trying to solve the same problems?

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Friday, August 13, 2010 - 7:38 am

Note that the renaming patch would include a move from REQ_HARDBARRIER
to REQ_FLUSH_FUA, so things just using REQ_HARDBARRIER will fail to
compile.  And while out of tree filesystems do exist they it's their
problem to keep up with kernel changes.  They decide not to be part


Here's the dm patch.  It only handles normal bio based dm yet, which
I understand and can test.  request  based dm (multipath) still needs
work.


Index: linux-2.6/drivers/md/dm-crypt.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-crypt.c	2010-08-13 16:11:04.207010218 +0200
+++ linux-2.6/drivers/md/dm-crypt.c	2010-08-13 16:11:10.048003862 +0200
@@ -1249,7 +1249,7 @@ static int crypt_map(struct dm_target *t
 	struct dm_crypt_io *io;
 	struct crypt_config *cc;
 
-	if (unlikely(bio_empty_barrier(bio))) {
+	if (bio_empty_flush(bio)) {
 		cc = ti->private;
 		bio->bi_bdev = cc->dev->bdev;
 		return DM_MAPIO_REMAPPED;
Index: linux-2.6/drivers/md/dm-io.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-io.c	2010-08-13 16:11:04.213011894 +0200
+++ linux-2.6/drivers/md/dm-io.c	2010-08-13 16:11:10.049003792 +0200
@@ -364,7 +364,7 @@ static void dispatch_io(int rw, unsigned
 	 */
 	for (i = 0; i < num_regions; i++) {
 		*dp = old_pages;
-		if (where[i].count || (rw & REQ_HARDBARRIER))
+		if (where[i].count || (rw & REQ_FLUSH))
 			do_region(rw, i, where + i, dp, io);
 	}
 
@@ -412,8 +412,8 @@ retry:
 	}
 	set_current_state(TASK_RUNNING);
 
-	if (io->eopnotsupp_bits && (rw & REQ_HARDBARRIER)) {
-		rw &= ~REQ_HARDBARRIER;
+	if (io->eopnotsupp_bits && (rw & REQ_FLUSH)) {
+		rw &= ~REQ_FLUSH;
 		goto retry;
 	}
 
Index: linux-2.6/drivers/md/dm-raid1.c
===================================================================
--- linux-2.6.orig/drivers/md/dm-raid1.c	2010-08-13 16:11:04.220013431 +0200
+++ linux-2.6/drivers/md/dm-raid1.c	2010-08-13 16:11:10.054018319 +0200
@@ -670,7 +670,7 @@ static ...
From: Tejun Heo
Date: Friday, August 13, 2010 - 7:51 am

Hello,



Do you want to change the whole thing in a single commit?  That would
be a pretty big invasive patch touching multiple subsystems.  Also, I
don't know what to do about drdb and would like to leave its
conversion to the maintainer (in separate patches).


Here's the combined patch I've been working on.  I've verified loop
and virtio_blk/loop.  I just (like five mins ago) got dm/dm conversion
compiling, so I'm sure they're broken.  The neat part is that thanks
to the separation between REQ_FLUSH and FUA handling, bio mangling
drivers only have to sequence the pre-flush and pass FUA directly to
lower layers which in many cases saves an array-wide cache flush
cycle.

After getting this patch working, the only remaining bits would be
blktrace and drdb.

Thanks.

 Documentation/lguest/lguest.c   |   36 +++-----
 drivers/block/loop.c            |   18 ++--
 drivers/block/virtio_blk.c      |   26 ++---
 drivers/md/dm-io.c              |   20 ----
 drivers/md/dm-log.c             |    2
 drivers/md/dm-raid1.c           |    8 -
 drivers/md/dm-snap-persistent.c |    2
 drivers/md/dm.c                 |  176 +++++++++++++++++++--------------------
 drivers/md/linear.c             |    4
 drivers/md/md.c                 |  117 +++++---------------------
 drivers/md/md.h                 |   23 +----
 drivers/md/multipath.c          |    4
 drivers/md/raid0.c              |    4
 drivers/md/raid1.c              |  178 +++++++++++++---------------------------
 drivers/md/raid1.h              |    2
 drivers/md/raid10.c             |    6 -
 drivers/md/raid5.c              |   18 +---
 include/linux/virtio_blk.h      |    6 +
 18 files changed, 244 insertions(+), 406 deletions(-)

Index: block/drivers/block/loop.c
===================================================================
--- block.orig/drivers/block/loop.c
+++ block/drivers/block/loop.c
@@ -477,17 +477,17 @@ static int do_bio_filebacked(struct loop
 	pos = ((loff_t) bio->bi_sector << 9) + lo->lo_offset;

 ...
From: Christoph Hellwig
Date: Saturday, August 14, 2010 - 3:36 am

We can just stop draining in the block layer in the first patch, then
stop doing the stuff in md/dm/etc in the following and then do the
final renaming patches.  It would still be less patches then now, but
keep things working through the whole transition, which would really

I'd suggest not adding FUA support to virtio yet.  Just using the flush
feature gives you a fully working barrier implementation.

Eventually we might want to add a flag in the block queue to send
REQ_FLUSH|REQ_FUA request through to virtio directly so that we can
avoid separate pre- and post flushes, but I really want to benchmark if

We only need the special md_flush_request handling for
empty REQ_FLUSH requests.  REQ_WRITE | REQ_FLUSH just need the

Once we only handle empty barriers here we can directly call bio_endio
instead of first scheduling a work queue.Once we only handle empty
barriers here we can directly call bio_endio and the super wakeup

I kept bio_empty_barrier as bio_empty_flush, which actually is a quite



We never send flush requests as REQ_TYPE_BLOCK_PC anymore, so no need


No need for the special error handling here, flush requests can now

AFAICS this is only needed for the old barrier code, no need for this


My suspicion is that we can get rif of all that special casing here
and just use blk_rq_prep_clone once it's been updated to propagate
REQ_FLUSH, similar to the DISCARD flag.

I also suspect that there is absolutely no need to the barrier work
queue once we stop waiting for outstanding request.  But then again

There's not need to use a separate flush_bio here.
__split_and_process_bio does the right thing for empty REQ_FLUSH
requests.  See my patch for how to do this differenty.  And yeah,
my version has been tested.

--

From: Tejun Heo
Date: Tuesday, August 17, 2010 - 2:59 am

Hello, Christoph.


I'm not really convinced that would help much.  If bisecting can point
to the conversion as the culprit for whatever kind of failure,
wouldn't that be enough?  No matter what we do the conversion will be
a single step thing.  If we make the filesystems enforce the ordering
first and then relax ordering in the block layer, bisection would
still just point at the later patch.  The same goes for md/dm, the
best we can find out would be whether the conversion is correct or not
anyway.

I'm not against restructuring the patchset if it makes more sense but
it just feels like it would be a bit pointless effort (and one which
would require much tighter coordination among different trees) at this

I wrote this in the other mail but I think it would make difference if
the backend storag is md/dm especially if it's shared by multiple VMs.

Hmm, not really, the WRITE should happen after all the data in cache
are committed to NV media, meaning that empty FLUSH should already


Hmm... maybe.  The reason why I removed bio_empty_flush() was that
except for the front-most sequencer (block layer for all the request
based ones and the front-most make_request for bio based ones), it
doesn't make sense to see REQ_FLUSH + data bios.  They should be
sequenced at the front-most stage anyway, so I didn't have much use
for them.  Those code paths couldn't deal with REQ_FLUSH + data bios


I wasn't sure about that part.  You removed store_flush_error(), but
DM_ENDIO_REQUEUE should still have higher priority than other


The same question as before.  I think we still need to prioritize


Yeah, probably.  It was an accidental edit to begin with and I left

But how do you make sure REQ_FLUSHes for preflush finish before
starting the write?

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Tuesday, August 17, 2010 - 6:19 am

The filesystems already enforce the ordering, except reiserfs which

What other trees do you mean?  The conversions of the 8 filesystems
that actually support barriers need to go through this tree anyway
if we want to be able to test it.  Also the changes in the filesystem
are absolutely minimal - it's basically just
s/WRITE_BARRIER/WRITE_FUA_FLUSH/ after my initial patch kill BH_Orderd,


The current bio_empty_barrier is only used in dm, and indeed only makes
sense for make_request-based drivers.  But I think it's a rather useful
helper for them.  Either way, it's not a big issue and either way is


Hmm, okay.  I see how the special flush_bio makes the waiting easier,
let's see if Mike or other in the DM team have a better idea.

--

From: Tejun Heo
Date: Tuesday, August 17, 2010 - 9:41 am

Hi,


I was mostly thinking about dm/md, drdb and stuff, but you're talking
about filesystem conversion patches being routed through block tree,

I might just resequence it to finish this part of discussion but what
does that really buy us?  It's not really gonna help bisection.
Bisection won't be able to tell anything in higher resolution than
"the new implementation doesn't work".  If you show me how it would

IIUC, when any of flushes get DM_ENDIO_REQUEUE (which tells the dm
core layer to retry the whole bio later), it trumps all other failures
and the bio is retried later.  That was why DM_ENDIO_REQUEUE was
prioritized over other error codes, which actually is sort of
incorrect in that once a FLUSH fails, it _MUST_ be reported to upper
layers as FLUSH failure implies data already lost.  So,
DM_ENDIO_REQUEUE actually should have lower priority than other
failures.  But, then again, the error codes still need to be

Yeah, it would be better if it can be sequenced w/o using a work but
let's leave it for later.

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Tuesday, August 17, 2010 - 9:59 am

I think we really need all the conversions in one tree, block layer,
remapping drivers and filesystems.

Btw, I've done the conversion for all filesystems and I'm running tests

It's not bisecting to find bugs in the barrier conversion.  We can't
easily bisect it down anyway.  The problem is when we try to bisect
other problems and get into the middle of the series barriers suddenly
are gone.  Which is not very helpful for things like data integrity

I think that's something we better leave to the DM team.

--

From: Tejun Heo
Date: Tuesday, August 17, 2010 - 11:35 pm

Hello,


I don't know.  If filesystem changes are really trivial maybe, but


Ah, okay, hmmm.... alright, I'll resequence the patches.  If the
filesystem changes can be put into a single tree somehow, we can keep

Sure, but we shouldn't be ripping out the code to do that.

Thanks.

-- 
tejun
--

From: Tejun Heo
Date: Wednesday, August 18, 2010 - 1:11 am

Hello,


Sorry but I'm doing it.  It just doesn't make much sense.  I can't
relax the ordering for REQ_HARDBARRIER without breaking the remapping
drivers.  So, to keep things working, I'll have to 1. relax the
ordering 2. implement new REQ_FLUSH/FUA based interface and 3. use
them in the filesystems in the same patch.  That's just wrong.  And I
don't think md/dm changes can or should go through the block tree.
They're way too invasive for that.  It's a new implementation and
barrier won't work (fail gracefully) for several commits during the
transition.  I don't think there's a better way around it.

Thanks.

-- 
tejun
--

From: Kiyoshi Ueda
Date: Friday, August 20, 2010 - 1:26 am

Hi Tejun, Christoph,


I think that's correct and changing the priority of DM_ENDIO_REQUEUE
for REQ_FLUSH down to the lowest should be fine.
(I didn't know that FLUSH failure implies data loss possibility.)

But the patch is not enough, you have to change target drivers, too.
E.g. As for multipath, you need to change
     drivers/md/dm-mpath.c:do_end_io() to return error for REQ_FLUSH
     like the REQ_DISCARD support included in 2.6.36-rc1.


By the way, if these patch-set with the change above are included,
even one path failure for REQ_FLUSH on multipath configuration will
be reported to upper layer as error, although it's retried using
other paths currently.
Then, if an upper layer won't take correct recovery action for the error,
it would be seen as a regression for users. (e.g. Frequent EXT3-error
resulting in read-only mount on multipath configuration.)

Although I think the explicit error is fine rather than implicit data
corruption, please check upper layers carefully so that users won't see
such errors as much as possible.

Thanks,
Kiyoshi Ueda
--

From: Tejun Heo
Date: Monday, August 23, 2010 - 5:14 am

Hello,


At least on ATA, FLUSH failure implies that data is already lost, so

I'll take a look but is there an easy to test mpath other than having

Argh... then it will have to discern why FLUSH failed.  It can retry
for transport errors but if it got aborted by the device it should
report upwards.  Maybe just turn off barrier support in mpath for now?

Thanks.

-- 
tejun
--

From: Mike Snitzer
Date: Monday, August 23, 2010 - 7:17 am

On Mon, Aug 23 2010 at  8:14am -0400,

It is easy enough to make a single path use mpath.  Just verify/modify
/etc/multipath.conf so that your device isn't blacklisted.

multipathd will even work with a scsi-debug device.

You obviously won't get path failover but you'll see the path get marked

Yes, we discussed this issue of needing to train dm-multipath to know if
there was a transport failure or not (at LSF).  But I'm not sure when
Hannes intends to repost his work in this area (updated to account for

I think we'd prefer to have a device fail rather than jeopardize data
integrity.  Clearly not ideal but...
--

From: Kiyoshi Ueda
Date: Tuesday, August 24, 2010 - 3:24 am

Hi Tejun,


Yes, checking whether it's a transport error in lower layer is
the right solution.
(Since I know it's not available yet, I just hoped if upper layers
 had some other options.)

Anyway, only reporting errors for REQ_FLUSH to upper layer without
such a solution would make dm-multipath almost unusable in real world,

If it's possible, it could be a workaround for a short term.
But how can you do that?

I think it's not enough to just drop REQ_FLUSH flag from q->flush_flags.
Underlying devices of a mpath device may have write-back cache and
it may be enabled.
So if a mpath device doesn't set REQ_FLUSH flag in q->flush_flags, it
becomes a device which has write-back cache but doesn't support flush.
Then, upper layer can do nothing to ensure cache flush?

Thanks,
Kiyoshi Ueda
--

From: Mike Snitzer
Date: Tuesday, August 24, 2010 - 10:52 am

On Tue, Aug 24 2010 at 12:59pm -0400,

Seems clear that we must fix mpath to receive the SCSI errors, in some
form, so it can decide if a retry is required/valid or not.

Such error processing was a big selling point for the transition from
bio-based to request-based multipath; so it's unfortunate that this

Hannes already proposed some patches:
https://patchwork.kernel.org/patch/61282/
https://patchwork.kernel.org/patch/61283/
https://patchwork.kernel.org/patch/61596/

This work was discussed at LSF, see "Error Handling - Hannes Reinecke"
here: http://lwn.net/Articles/400589/

I thought James, Alasdair and others offered some guidance on what he'd
like to see...

Unfortunately, even though I was at this LSF session, I can't recall any
specific consensus on how Hannes' work should be refactored (to avoid
adding SCSI sense processing code directly in dm-mpath).  Maybe James,
Hannes or others remember?

Was it enough to just have the SCSI sense processing code split out in a

I'll have to review this thread again to understand why mpath's existing
retry logic is broken behavior.  mpath is used with more capable SCSI
devices so I'm missing why a failed FLUSH implies data loss.

Mike
--

From: Tejun Heo
Date: Tuesday, August 24, 2010 - 11:14 am

Hello,


SBC doesn't specify the failure behavior, so it could be that retrying
flush could be safe.  But for most disk type devices, flush failure
usually indicates that the device exhausted all the options to commit
some of pending data to NV media - ie. even remapping failed for
whatever reason.  Even if retry is safe, it's more likely to simply
delay notification of failure.

In ATA, the situation is clearer, when a device actively fails a
flush, the drive reports the first failed sector it failed to commit
and the next flush will continue _after_ the sector - IOW, data is
already lost.

<speculation>
I think there's no reason mpath should be tasked with retrying flush
failure.  That's upto the SCSI EH.  If the command failed in 'safe'
transient way - ie. device busy or whatnot, SCSI EH can and does retry
the command.  There are several FAILFAST bits already and SCSI EH can
avoid retrying transport errors for mpath (maybe it already does
that?) and just need to be able to tell upper layer that the failure
was a fast one and upper layer is responsible for retrying?  Is there
any reason to pass the whole sense information upwards?
</speculation>

Anyways, flush failure is different from read/write failures.
Read/writes can always be retried cleanly.  They are stateless.  I
don't know how SCSI devices would actually behavior but it's a bit
scary to retry SYNCHRONIZE_CACHE a device failed and report success
upwards.

Thanks.

-- 
tejun
--

From: Kiyoshi Ueda
Date: Wednesday, August 25, 2010 - 1:00 am

Hi Tejun,


Right.
If the error is safe/needed to retry using other paths, mpath should
retry even if REQ_FLUSH.  Otherwise, only one path failure may result
in system down.
Just passing any REQ_FLUSH error upwards regardless the error type
will make such situations, and users will feel the behavior as

I'm not sure how long will it take.
Anyway, as you said, the flush error handling of dm-mpath is already
broken if data loss really happens on any storage used by dm-mpath.
Although it's a serious issue and quick fix is required, I think
you may leave the old behavior in your patch-set, since it's
a separate issue.

Thanks,
Kiyoshi Ueda
--

From: Mike Snitzer
Date: Wednesday, August 25, 2010 - 8:28 am

On Wed, Aug 25 2010 at  4:00am -0400,

Right, there are hardware configurations that lend themselves to FLUSH
retries mattering, namely:
1) a SAS drive with 2 ports and a writeback cache
2) theoretically possible: SCSI array that is mpath capable but
   advertises cache as writeback (WCE=1)

The SAS case is obviously a more concrete example of why FLUSH retries
are worthwhile in mpath.

But I understand (and agree) that we'd be better off if mpath could
differentiate between failures rather than blindly retrying on failures
like it does today (fails path and retries if additional paths

I'm not seeing where anything is broken with current mpath.  If a
multipathed LUN is WCE=1 then it should be fair to assume the cache is
mirrored or shared across ports.  Therefore retrying the SYNCHRONIZE
CACHE is needed.

Do we still have fear that SYNCHRONIZE CACHE can silently drop data?
Seems unlikely especially given what Tejun shared from SBC.

It seems that at worst, with current mpath, we retry when it doesn't
make sense (e.g. target failure).

Mike
--

From: Kiyoshi Ueda
Date: Friday, August 27, 2010 - 2:47 am

Hi Mike,


Do we have any proof to wipe that fear?

If retrying on flush failure is safe on all storages used with multipath
(e.g. SCSI, CCISS, DASD, etc), then current dm-mpath should be fine in
the real world.
But I'm afraid if there is a storage where something like below can happen:
    - a flush command is returned as error to mpath because a part of
      cache has physically broken at the time or so, then that part of
      data loses and the size of the cache is shrunk by the storage.
    - mpath retries the flush command using other path.
    - the flush command is returned as success to mpath.
    - mpath passes the result, success, to upper layer, but some of
      the data already lost.

Thanks,
Kiyoshi Ueda
--

From: Mike Snitzer
Date: Friday, August 27, 2010 - 6:49 am

On Fri, Aug 27 2010 at  5:47am -0400,

That does seem like a valid concern.  But I'm not seeing why its unique
to SYNCHRONIZE CACHE.  Any IO that fails on the target side should be
passed up once the error gets to DM.

Mike
--


Hi Kiyoshi,

On Mon, Aug 30 2010 at  2:13am -0400,

I reached out to Fred Knight on this, to get a more insight from a pure
SCSI SBC perspective, and he shared the following:

----- End forwarded message -----

Seems like verifying/improving the handling of CHECK CONDITION is a more
pressing concern than silent data loss purely due to SYNCHRONIZE CACHE
retries.  Without proper handling we could completely miss these
deferred errors.

But how to effectively report such errors to upper layers is unclear to
me given that a particular SCSI command can carry error information for
IO that was already acknowledged successful (e.g. to the FS).

drivers/scsi/scsi_error.c's various calls to scsi_check_sense()
illustrate Linux's current CHECK CONDITION handling.  I need to look
closer at how deferred errors propagate to upper layers.  After an
initial look it seems scsi_error.c does handle retrying commands where
appropriate.

I believe Hannes has concerns/insight here.

Mike
--


Quite. We _should_ be handling deferred errors correctly;
if you check drivers/scsi/scsi_lib.c:scsi_io_completion()
you'll find this:

	if (host_byte(result) == DID_RESET) {
		/* Third party bus reset or reset for error recovery
		 * reasons.  Just retry the command and see what
		 * happens.
		 */
		action = ACTION_RETRY;
	} else if (sense_valid && !sense_deferred) {
                ...
	} else {
		description = "Unhandled error code";
		action = ACTION_FAIL;
	}

ie for deferred errors we're already aborting the command. Not sure
if I agree with this bit in drivers/scsi/scsi_lib.c:

static int scsi_check_sense(struct scsi_cmnd *scmd)
{
	struct scsi_device *sdev = scmd->device;
	struct scsi_sense_hdr sshdr;

	if (! scsi_command_normalize_sense(scmd, &sshdr))
		return FAILED;	/* no valid sense data */

	if (scsi_sense_is_deferred(&sshdr))
		return NEEDS_RETRY;

I doubt we can resolve the situation by retrying the command, which
will be the wrong command to retry anyway. I would rather
have those retry 'SUCCESS' and add another case in scsi_io_completion()
to notify us about the deferred error.

I'll be sending a patch.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--


Ah. No. That is actually correct. SPC-3 states:
If the task terminates with CHECK CONDITION status and the sense data
describes a deferred error, the command for the terminated task shall
not have been processed.

So we're good after all and I would just add this patch:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index fb841e3..efb4609 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -912,7 +912,10 @@ void scsi_io_completion(struct scsi_cmnd *cmd, unsigned int
 good_bytes)
                        break;
                }
        } else {
-               description = "Unhandled error code";
+               if (sense_deferred)
+                       description = "Deferred error";
+               else
+                       description = "Unhandled error code";
                action = ACTION_FAIL;
        }
 
to make the whole situation more transparent.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--

From: Vladislav Bolkhovitin
Date: Tuesday, August 24, 2010 - 10:11 am

In SCSI there are conditions when a command, including FLUSH 
(SYNC_CACHE), failed which don't imply lost data. For them the caller 
expected to retry the failed command. Most common cases are Unit 
Attentions and TASK QUEUE FULL status.

Vlad
--

From: Vladislav Bolkhovitin
Date: Friday, August 13, 2010 - 5:55 am

I generally agree with the patchset, but I believe this particular move 
is a really bad move.

I'm not mentioning the obvious that a common functionality (enforcing 
requests ordering in this case) should be handled by a common library, 
but not internally by a zillion file systems Linux has.

The worst in this move is that it would hide all the requests ordering 
semantic inside file systems in, most likely, a very much unclear way. 
That would lead that if I or someone else decide to implement the 
"hardware offload" of requests ordering (ORDERED requests), I or he/she 
would not be able to see any improvement until at least one file system 
be changed to be able to use it. Worse, if the implementor can't 
demonstrate the improvement, how can he encourage file systems 
developers to update their file systems? Which, basically, would mean 
that only a person with *BOTH* deep storage and file systems internals 
knowledge can do the job. How many do you know such people? Both storage 
and file systems topics are very wide and tricky, so nearly always 
people specialize in one of them, not both.

Thus, this move would basically mean that the proper ordered queuing 
would probably never be implemented in Linux.

I believe, much better would be to create a common interface, which file 
systems would use to enforce requests order, when they need it.

Advantages of this approach:

1. The ordering requirements of file systems would be clear.

2. They would be handled in one place by a common code.

3. Any storage level expert can try to implement ordered queuing without 
a deep dive into file systems design and implementation.

I already suggested such interface in 
http://marc.info/?l=linux-scsi&m=128077574815881&w=2. Internally for the 
moment it can be implemented using existing REQ_FLUSH/FUA/etc. and 
waiting for all the requests in the group to finish. As a nice side 
effect, if a device doesn't support FUA, it would be possible to issue 
SYNC_CACHE command(s) only for ...
From: Christoph Hellwig
Date: Friday, August 13, 2010 - 6:17 am

I/O ordering is still handled mostly by common code, that is the
pagecache and the buffercache, although a few filesystems like XFS and
btrfs have their own implementation of the second one.

The current ordered semantics of barriers have only successfull
implemented by a complete queue drain, and not effectively been used
by filesystems.  This patchset removes the bogus global ordering
enforced by the block layer whenever a filesystems wants to be able
to use cache flushes, and because of that allows deeper outstanding
queue depth I/O with less latency.

Now I know you in particular are a fan of scsi ordered tags.  And as I
told you before I'm open to review such an implementation if it shows
us any advantages.  Adding it after this patch is in fact not any more
complicated than before, I'd almost be tempted it's easier as you don't
have to plug it into the complex state machine we used for barriers, and
more importantly we drop the requirement for the barrier sequence to
be atomic, which in fact made implementing barriers using tagged queues
impossible with the current scsi layer.

As far as playing with ordered tags it's just adding a new flag for
it on the bio that gets passed down to the driver.  For a final version
you'd need a queue-level feature if it's supported, but you don't
even need that for the initial work.  Then you can implement a
variant of blk_do_flush that does away with queueing additional requests
once finish but queues all two or three at the same time with your
new ordered flag set, at which point you are back to the level or
ordered tag usage that the old code allows.  You're still left with
all the hard problems of actually implementing error handling for it
and using it higher up in the filesystem and generic page cache code.

I'd really love to see your results, up to the point of just trying
that once I get a little spare time.  But my theory is that it won't
help us - the problem with ordered tags is that they enforce global
ordering while we ...
From: Vladislav Bolkhovitin
Date: Wednesday, August 18, 2010 - 12:29 pm

But how about file systems doing internal local order-by-drain? Without 
converting them to use ordered commands it would be impossible to show 
full potential of them and to make the conversion one would need deep 
internal FS knowledge. That's my point. But if there's a trivial way to 

The local ordering vs global ordering is relevant only if you have 
several applications/threads load. But how about a single 
application/thread?

Another point, for which, AFAIU, the ORDERED commands were invented, is 
that they make ordering on the _another_ side of the link _after_ all 
link/transfer latencies. This is why it's hard to see advantage of them 
on local disks.

Vlad
--

From: Tejun Heo
Date: Friday, August 13, 2010 - 6:21 am

Hello,


I still think the benefit of ordering by tag would be marginal at
best, and what have you guys measured there?  Under the current
framework, there's no easy way to measure full ordered-by-tag
implementation.  The mechanism for filesystems to communicate the
ordering information (which would be a partially ordered graph) just
isn't there and there is no way the current usage of ordering-by-tag
only for barrier sequence can achieve anything close to that level of
difference.

Ripping out the original ordering by tag mechanism doesn't amount to
much.  The use of ordering-by-tag was pretty half-assed there anyway.
If you think exporting full ordering information from filesystem to
the lower layers is worthwhile, please go ahead.  It would be very
interesting to see how much actual difference it can make compared to
ordering-by-filesystem and if it's actually better and the added
complexity is manageable, there's no reason not to do that.

Thank you.

-- 
tejun
--

From: Vladislav Bolkhovitin
Date: Wednesday, August 18, 2010 - 12:30 pm

Hello,


Basically, I measured how iSCSI link utilization depends from amount of 
queued commands and queued data size. This is why I made it as a table. 
 From it you can see which improvement you will have removing queue 
draining after 1, 2, 4, etc. commands depending of commands sizes.

For instance, on my previous XFS rm example, where rm of 4 files took 
3.5 minutes with nobarrier option, I could see that XFS was sending 1-3 
  32K commands in a row. From my table you can see that if it sent all 
them at once without draining, it would have about 150-200% speed increase.

Vlad
--

From: Tejun Heo
Date: Thursday, August 19, 2010 - 2:51 am

Hello,


You compared barrier off/on.  Of course, it will make a big
difference.  I think good part of that gain should be realized by the
currently proposed patchset which removes draining.  What's needed to
be demonstrated is the difference between ordered-by-waiting and
ordered-by-tag.  We've never had code to do that properly.

The original ordered-by-tag we had only applied tag ordering to two or
three command sequences inside a barrier, which doesn't amount to much
(and could even be harmful as it imposes draining of all simple
commands inside the device only to reduce issue latencies for a few
commands).  You'll need to hook into filesystem and somehow export the
ordering information down to the driver so that whatever needs
ordering is sent out as ordered commands.

As I've wrote multiple times, I'm pretty skeptical it will bring much.
Ordered tag mandates draining inside the device just like the original
barrier implementation.  Sure, it's done at a lower layer and command
issue latencies will be reduced thanks to that but ordered-by-waiting
doesn't require _any_ draining at all.  The whole pipeline can be kept
full all the time.  I'm often wrong tho, so please feel free to go
ahead and prove me wrong.  :-)

Thanks.

-- 
tejun
--

From: Hannes Reinecke
Date: Monday, August 30, 2010 - 2:54 am

Actually, I thought about ordered tag writes, too.
But eventually I had to give up on this for a simple reason:
Ordered tag controls the ordering on the SCSI _TARGET_. But for a
meaningful implementation we need to control the ordering all the way
down from ->queuecommand(). Which means we have three areas we need
to cover here:
- driver (ie between ->queuecommand() and passing it off to the firmware)
- firmware
- fabric

Sadly, the latter two are really hard to influence. And, what's more,
with the new/modern CNAs with multiple queues and possible multiple
routes to the target it becomes impossible to guarantee ordering.
So using ordered tags for FibreChannel is not going to work, which
makes implementing it a bit of a pointless exercise for me.

Cheers,

Hannes
-- 
Dr. Hannes Reinecke		      zSeries & Storage
hare@suse.de			      +49 911 74053 688
SUSE LINUX Products GmbH, Maxfeldstr. 5, 90409 Nürnberg
GF: Markus Rex, HRB 16746 (AG Nürnberg)
--

From: Vladislav Bolkhovitin
Date: Monday, August 30, 2010 - 1:34 pm

The situation is, actually, much better than you think. An SCSI 
transport should provide an in-order delivery of commands. In some 
transports it is required (e.g. iSCSI), in some - optional (e.g. FC). 
For FC "an application client may determine if a device server supports 
the precise delivery function by using the MODE SENSE and MODE SELECT 
commands to examine and set the enable precise delivery checking (EPDC) 
bit in the Fibre Channel Logical Unit Control page" (Fibre Channel 
Protocol for SCSI (FCP)). You can find more details in FCP section 
"Precise delivery of SCSI commands".

Regarding multiple queues, in case of a multipath access to a device 
SCSI requires either each path be a separate I_T nexus, where order of 
commands is maintained, or a transport required to maintain in-order 
commands delivery among multiple paths in a single I_T nexus (session) 
as it is done in iSCSI's MC/S and, most likely, wide SAS ports.

So, everything is in the specs. We only need to use it properly. How it 
can be done on the drivers level as well as how errors recovery can be 
done using ACA and UA_INTLCK facilities I wrote few weeks ago in the 
"[RFC] relaxed barrier semantics" thread.

Vlad
--

From: Christoph Hellwig
Date: Wednesday, August 18, 2010 - 2:46 am

FYI: One issue with this series is that make_request based drivers
not have to access all REQ_FLUSH and REQ_FUA requests.  We'll either
need to add handling to empty REQ_FLUSH requests to all of them or
figure out a way to prevent them getting sent.  That is assuming they'll
simply ignore REQ_FLUSH/REQ_FUA on normal writes.

--

From: Tejun Heo
Date: Thursday, August 19, 2010 - 2:57 am

Hello,


Can you be a bit more specific?  In most cases, request based drivers
should be fine.  They sit behind the front most request_queue which
would discompose REQ_FLUSH/FUAs into appropriate command sequence.
For the request based drivers, it's not different from the original
REQ_HARDBARRIER mechanism, it'll just see flushes and optionally FUA
writes.

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Thursday, August 19, 2010 - 3:20 am

I said make_request based drivers, that is drivers taking bios.   These
get bios directly from __generic_make_request and need to deal with
REQ_FLUSH/FUA themselves.  We have quite a few more than just dm/md of
this kind:

arch/powerpc/sysdev/axonram.c:	blk_queue_make_request(bank->disk->queue, axon_ram_make_request);
drivers/block/aoe/aoeblk.c:     blk_queue_make_request(d->blkq, aoeblk_make_request);
drivers/block/brd.c:		blk_queue_make_request(brd->brd_queue, brd_make_request);
drivers/block/drbd/drbd_main.c: blk_queue_make_request(q, drbd_make_request_26);
drivers/block/loop.c:		blk_queue_make_request(lo->lo_queue, loop_make_request);
drivers/block/pktcdvd.c:        blk_queue_make_request(q, pkt_make_request);
drivers/block/ps3vram.c:        blk_queue_make_request(queue, ps3vram_make_request);
drivers/block/umem.c:		blk_queue_make_request(card->queue, mm_make_request);
drivers/s390/block/dcssblk.c:	blk_queue_make_request(dev_info->dcssblk_queue, dcssblk_make_request);
drivers/s390/block/xpram.c:	blk_queue_make_request(xpram_queues[i], xpram_make_request);
drivers/staging/zram/zram_drv.c:blk_queue_make_request(zram->queue, zram_make_request);


--

From: Tejun Heo
Date: Thursday, August 19, 2010 - 3:22 am

Hello,






Will work on these.

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Friday, August 20, 2010 - 6:22 am

FYI: here's a little writeup to document the new cache flushing scheme,
intended to replace Documentation/block/barriers.txt.  Any good
suggestion for a filename in the kernel tree?

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches.  That means the devices signal I/O completion to the
operating system before data actually has hit the physical medium.  This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the physical medium when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides a two simple mechanism that lets filesystems
control the caching behavior of the storage device.  These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started.  The explicit
guarantees write requests that have completed before the bio was submitted
actually are on the physical medium before this request has started.
In addition the REQ_FLUSH flag can be set on an otherwise empty bio
structure, which causes only an explicit cache flush without any dependent
I/O.  It is recommend to use the blkdev_issue_flush() helper for a pure
cache flush.


Forced Unit Access
-----------------

The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this requests is not
signaled before the data has made it to non-volatile storage on the
physical medium.


Implementation details for filesystems
--------------------------------------

Filesystem can simply set the REQ_FLUSH and ...
From: Ric Wheeler
Date: Friday, August 20, 2010 - 8:18 am

I was thinking that we might be better off using the "durable writes" term more 
since it is well documented (at least in the database world, where it is the "D" 

Should we mention that users can also disable the write cache on the target device?

It might also be worth mentioning that storage needs to be properly configured - 
i.e., an internal hardware RAID card with battery backing needs can expose 
itself as a writethrough cache *only if* it actually has control over all of the 
backend disks and can flush/disable their write caches.

Maybe that is too much detail, but I know that people have lost data with some 
of these setups.

The rest of the write up below sounds good, thanks for pulling this together!


--

From: Chris Mason
Date: Friday, August 20, 2010 - 9:00 am

sata_lies.txt?

Ok, maybe writeback_cache.txt?

-chris
--

From: Ric Wheeler
Date: Friday, August 20, 2010 - 9:02 am

writeback_cache.txt is certainly the least confusing :)

ric

--

From: Tejun Heo
Date: Monday, August 23, 2010 - 5:30 am

Hello,


The term is very foreign to people outside of enterprise / database
loop.  writeback-cache.txt or write-cache-control.txt sounds good

It might be useful to give several example configurations with
different cache configurations.  I don't have much experience with
battery backed arrays but aren't they suppose to report write through
cache automatically?

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Monday, August 23, 2010 - 5:48 am

They usually do.  I have one that doesn't, but SYNCHRONIZE CACHE on
it is so fast that it effectively must be a no-op.

--

From: Ric Wheeler
Date: Monday, August 23, 2010 - 6:58 am

Arrays are not a problem in general - they normally have internally, redundant 
batteries to hold up the cache.

The issue is when you have an internal hardware RAID card with a large cache. 
Those cards sit in your server and the batteries on the card protect its 
internal cache, but do not have the capacity to hold up the drives behind it.

Normally, those drives should have their write cache disabled, but sometimes 
(especially with S-ATA disks) this is not done.

ric

--

From: Christoph Hellwig
Date: Monday, August 23, 2010 - 7:08 am

I haven't seen it.  I don't care particularly about this case, but once
it a while people want to disable flushing for testing or because they
really don't care.

What about adding a sysfs attribue to every request_queue that allows
disabling the cache flushing feature?  Compared to the barrier option
this controls the feature at the right level and makes it available
to everyone instead of beeing duplicated.  After a while we can then
simply ignore the barrier/nobarrier options.
--

From: Tejun Heo
Date: Monday, August 23, 2010 - 7:13 am

Hello,


Yeah, that sounds reasonable.  blk_queue_flush() can be called anytime
without locking anyway, so it should be really easy to implement too.

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Monday, August 23, 2010 - 7:19 am

I don't think we can simply call blk_queue_flush - we must ensure to
never set more bits than the device allows.  We'll just need two
sets of flags in the request queue, with the sysfs file checking that
it never allows more flags than blk_queue_flush.

I'll prepare a patch for this on top of the current series.

--

From: Jens Axboe
Date: Wednesday, August 25, 2010 - 4:31 am

Agree, that would be fine.

-- 
Jens Axboe

--

From: Ric Wheeler
Date: Monday, August 23, 2010 - 8:19 am

Hi Jens,

There are actually two distinct problems:

(1) arrays with a non-volatile write cache (battery backed, navram, whatever) 
that do not NOOP a SYNC_CACHE command. I know of one brand that seems to do 
this, but it is not a common brand. If we do not issue flushes for write through 
caches, I think that we will avoid this in any case.

(2) hardware raid cards with internal buffer memory and on-card battery backup 
(they sit in your server, disks sit in jbod like expansion shelves). These are 
fine if the drives in those shelves have write cache disabled.

ric
--

From: Ric Wheeler
Date: Monday, August 23, 2010 - 9:49 am

Yes - this is certainly one way to do it. Note that this will not work if the 
card advertises itself as a write through cache (and we end up not sending down 
the SYNC_CACHE commands).

At least one hardware RAID card (I unfortunately cannot mention the brand) did 
not do this command forwarding.

ric

--

From: Sergey Vlasov
Date: Monday, August 23, 2010 - 9:45 am

On Mon, Aug 23, 2010 at 11:19:13AM -0400, Ric Wheeler wrote:

Actually some of such cards keep write cache on the drives enabled and
issue FLUSH CACHE commands to the drives.  E.g., 3ware 9690SA behaves
like this at least with SATA drives (the FLUSH CACHE commands can be
seen after enabling performance monitoring - they often end up in the
"10 commands having the largest latency" table).  This can actually be
safe if the card waits for the FLUSH CACHE completion before making
the write cache data in its battery-backed memory available for reuse
(and the drive implements the FLUSH CACHE command correctly).
From: Tejun Heo
Date: Monday, August 23, 2010 - 5:36 am

Hello,


A bit nit picky but flash devices can also have writeback caches and
the term physical medium sounds a bit off for those cases.  Maybe just

It may be worthwhile to explain the sequence of operations when
REQ_FLUSH + data + REQ_FUA is executed.  It can be extrapolated from
the previous two descriptions but I think giving examples of different
sequences depending on FLUSH/FUA configuration would be helpful to
help understanding the overall picture of things.

Other than those, looks good to me.

Thanks.

-- 
tejun
--

From: Christoph Hellwig
Date: Monday, August 23, 2010 - 7:05 am

Below is an updated version of the documentation.  It fixes several
typos Zach Brown noticed and replaces all references to a physical
medium with the term non-volatile storage.  I haven't added any examples
yet as I need to figure how they fit into the rest of the document.

---

Explicit volatile write cache control
=====================================

Introduction
------------

Many storage devices, especially in the consumer market, come with volatile
write back caches.  That means the devices signal I/O completion to the
operating system before data actually has hit the non-volatile storage.  This
behavior obviously speeds up various workloads, but it means the operating
system needs to force data out to the non-volatile storage when it performs
a data integrity operation like fsync, sync or an unmount.

The Linux block layer provides two simple mechanism that lets filesystems
control the caching behavior of the storage device.  These mechanisms are
a forced cache flush, and the Force Unit Access (FUA) flag for requests.


Explicit cache flushes
----------------------

The REQ_FLUSH flag can be OR ed into the r/w flags of a bio submitted from
the filesystem and will make sure the volatile cache of the storage device
has been flushed before the actual I/O operation is started.  This explicitly
guarantees that previously completed write requests are on non-volatile
storage before the flagged bio starts. In addition the REQ_FLUSH flag can be
set on an otherwise empty bio structure, which causes only an explicit cache
flush without any dependent I/O.  It is recommend to use
the blkdev_issue_flush() helper for a pure cache flush.


Forced Unit Access
-----------------

The REQ_FUA flag can be OR ed into the r/w flags of a bio submitted from the
filesystem and will make sure that I/O completion for this requests is only
signaled after the data has been commited to non-volatile storage.


Implementation details for ...
From: Christoph Hellwig
Date: Monday, August 23, 2010 - 7:15 am

We need to call blk_rq_init and elv_insert for all cases in queue_next_fseq,
so take these calls into common code.  Also move the end_io initialization
from queue_flush into queue_next_fseq and rename queue_flush to
init_flush_request now that it's old name doesn't apply anymore.

Signed-off-by: Christoph Hellwig <hch@lst.de>

Index: linux-2.6/block/blk-flush.c
===================================================================
--- linux-2.6.orig/block/blk-flush.c	2010-08-17 15:34:27.864004351 +0200
+++ linux-2.6/block/blk-flush.c	2010-08-17 16:12:53.504253827 +0200
@@ -74,16 +74,11 @@ static void post_flush_end_io(struct req
 	blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
 }
 
-static void queue_flush(struct request_queue *q, struct request *rq,
-			rq_end_io_fn *end_io)
+static void init_flush_request(struct request *rq, struct gendisk *disk)
 {
-	blk_rq_init(q, rq);
 	rq->cmd_type = REQ_TYPE_FS;
 	rq->cmd_flags = REQ_FLUSH;
-	rq->rq_disk = q->orig_flush_rq->rq_disk;
-	rq->end_io = end_io;
-
-	elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
+	rq->rq_disk = disk;
 }
 
 static struct request *queue_next_fseq(struct request_queue *q)
@@ -91,29 +86,28 @@ static struct request *queue_next_fseq(s
 	struct request *orig_rq = q->orig_flush_rq;
 	struct request *rq = &q->flush_rq;
 
+	blk_rq_init(q, rq);
+
 	switch (blk_flush_cur_seq(q)) {
 	case QUEUE_FSEQ_PREFLUSH:
-		queue_flush(q, rq, pre_flush_end_io);
+		init_flush_request(rq, orig_rq->rq_disk);
+		rq->end_io = pre_flush_end_io;
 		break;
-
 	case QUEUE_FSEQ_DATA:
-		/* initialize proxy request, inherit FLUSH/FUA and queue it */
-		blk_rq_init(q, rq);
 		init_request_from_bio(rq, orig_rq->bio);
 		rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA);
 		rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA);
 		rq->end_io = flush_data_end_io;
-
-		elv_insert(q, rq, ELEVATOR_INSERT_FRONT);
 		break;
-
 	case QUEUE_FSEQ_POSTFLUSH:
-		queue_flush(q, rq, post_flush_end_io);
+		init_flush_request(rq, ...
From: John Robinson
Date: Monday, August 23, 2010 - 9:28 am

Nit: it's "its" above, not "it's". If in doubt, if it's "it is" (or "it 
has") it's "it's" but if it could be "his" or "hers" it's "its".

I'm guessing English isn't your first language (a) because of your .de 
address and (b) because it's better than most British people's, but 
still, it's a common mistake. If I can remember any of the German I 
studied all those years ago, "its" is roughly equivalent to "sein", and 
"it's" to "es ist".

Cheers,

John.
--

Previous thread: [PATCH 11/11] block: use REQ_FLUSH in blkdev_issue_flush() by Tejun Heo on Thursday, August 12, 2010 - 5:41 am. (1 message)

Next thread: BUG: using smp_processor_id() in preemptible [00000000] code by Sergey Senozhatsky on Thursday, August 12, 2010 - 5:51 am. (1 message)