This patch converts request-based dm to support the new REQ_FLUSH/FUA.
The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.
In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them. However, the
capability isn't currently in use and adds a lot of complexity.
Moreoever, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.
This patch rips out special flush code path and deals handles
REQ_FLUSH/FUA requests the same way as other requests. The only
special treatment is that REQ_FLUSH requests use the block address 0
when finding target, which is enough for now.
Signed-off-by: Tejun Heo <tj@kernel.org>
---
drivers/md/dm.c | 204 ++++++-------------------------------------------------
1 files changed, 20 insertions(+), 184 deletions(-)
diff --git a/drivers/md/dm.c b/drivers/md/dm.c
index e67c519..81a012f 100644
--- a/drivers/md/dm.c
+++ b/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
spinlock_t deferred_lock;
/*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
- int barrier_error;
-
- /*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
-
- /* A pointer to the currently processing pre/post flush request */
- struct request *flush_request;
/*
* The current mapping.
@@ -732,23 +721,6 @@ static void end_clone_bio(struct bio *clone, int error)
blk_update_request(tio->orig, 0, nr_bytes);
}
-static void store_barrier_error(struct mapped_device ...On Mon, Aug 30 2010 at 5:58am -0400,
Looks very comparable to the patch I prepared but I have 2 observations
blk_rq_prep_clone() of a REQ_FLUSH request will result in a
rq_data_dir(clone) of read.
I still had the following:
if (rq->cmd_flags & REQ_FLUSH) {
blk_rq_init(NULL, clone);
clone->cmd_type = REQ_TYPE_FS;
/* without this the clone has a rq_data_dir of 0 */
clone->cmd_flags |= WRITE_FLUSH;
} else {
r = blk_rq_prep_clone(clone, rq, tio->md->bs, GFP_ATOMIC,
dm_rq_bio_constructor, tio);
...
Request-based DM's REQ_FLUSH still works without this special casing but
I figured I'd raise this to ask: what is the proper rq_data_dir() is for
I also needed to avoid the ->busy call for REQ_FLUSH:
if (!(rq->cmd_flags & REQ_FLUSH)) {
ti = dm_table_find_target(map, blk_rq_pos(rq));
BUG_ON(!dm_target_is_valid(ti));
if (ti->type->busy && ti->type->busy(ti))
goto plug_and_out;
} else {
/* rq-based only ever has one target! leverage this for FLUSH */
ti = dm_table_get_target(map, 0);
}
If I allowed ->busy to be called for REQ_FLUSH it would result in a
deadlock. I haven't identified where/why yet.
Other than these remaining issues this patch looks good.
Thanks,
Mike
--
Hello, Hmmm... why? blk_rq_prep_clone() copies all REQ_* flags in Technically block layer doesn't care one way or the other but WRITE definitely. Maybe it would be a good idea to enforce that from block Ah... that's probably from "if (!elv_queue_empty(q))" check below, flushes are on a separate queue but I forgot to update elv_queue_empty() to check the flush queue. elv_queue_empty() can return %true spuriously in which case the queue won't be plugged and restarted later leading to queue hang. I'll fix elv_queue_empty(). Thanks. -- tejun --
I think I was too quick to blame elv_queue_empty(). Can you please
test whether the following patch fixes the hang?
Thanks.
---
block/blk-flush.c | 18 +++++++++++++-----
1 file changed, 13 insertions(+), 5 deletions(-)
Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -28,7 +28,8 @@ unsigned blk_flush_cur_seq(struct reques
}
static struct request *blk_flush_complete_seq(struct request_queue *q,
- unsigned seq, int error)
+ unsigned seq, int error,
+ bool from_end_io)
{
struct request *next_rq = NULL;
@@ -51,6 +52,13 @@ static struct request *blk_flush_complet
if (!list_empty(&q->pending_flushes)) {
next_rq = list_entry_rq(q->pending_flushes.next);
list_move(&next_rq->queuelist, &q->queue_head);
+ /*
+ * Moving a request silently to queue_head may
+ * stall the queue, kick the queue if we
+ * aren't in the issue path already.
+ */
+ if (from_end_io)
+ __blk_run_queue(q);
}
}
return next_rq;
@@ -59,19 +67,19 @@ static struct request *blk_flush_complet
static void pre_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error, true);
}
static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error, true);
}
static void post_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+ blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error, true);
}
static void init_flush_request(struct request *rq, struct gendisk *disk)
@@ -165,7 +173,7 @@ struct request *blk_do_flush(struct ...On Mon, Aug 30 2010 at 11:07am -0400, It does, thanks! Tested-by: Mike Snitzer <snitzer@redhat.com> --
On Mon, Aug 30 2010 at 3:08pm -0400, Hmm, but unfortunately I was too quick to say the patch fixed the hang. It is much more rare, but I can still get a hang. I just got the following running vgcreate against an DM mpath (rq-based) device: INFO: task vgcreate:3517 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. vgcreate D ffff88003d677a00 5168 3517 3361 0x00000080 ffff88003d677998 0000000000000046 ffff880000000000 ffff88003d677fd8 ffff880039c84860 ffff88003d677fd8 00000000001d3880 ffff880039c84c30 ffff880039c84c28 00000000001d3880 00000000001d3880 ffff88003d677fd8 Call Trace: [<ffffffff81389308>] io_schedule+0x73/0xb5 [<ffffffff811c7304>] get_request_wait+0xef/0x17d [<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811c7890>] __make_request+0x333/0x467 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9 [<ffffffff811c5e91>] generic_make_request+0x342/0x3bf [<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf [<ffffffff81069df2>] ? local_clock+0x41/0x5a [<ffffffff811c5fe9>] submit_bio+0xdb/0xf8 [<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf [<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c [<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9 [<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59 [<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90 [<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4 [<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b [<ffffffff810e40f8>] ? might_fault+0x5c/0xac [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9 [<ffffffff8110e131>] do_sync_read+0xcb/0x108 [<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e [<ffffffff81389a99>] ? __mutex_unlock_slowpath+0x120/0x132 [<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50 [<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33 [<ffffffff8110e7a3>] vfs_read+0xab/0x107 [<ffffffff81075473>] ? ...
Can you please try this one instead?
Thanks.
---
block/blk-flush.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)
Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -56,22 +56,38 @@ static struct request *blk_flush_complet
return next_rq;
}
+static void blk_flush_complete_seq_end_io(struct request_queue *q,
+ unsigned seq, int error)
+{
+ bool was_empty = elv_queue_empty(q);
+ struct request *next_rq;
+
+ next_rq = blk_flush_complete_seq(q, seq, error);
+
+ /*
+ * Moving a request silently to empty queue_head may stall the
+ * queue. Kick the queue in those cases.
+ */
+ if (next_rq && was_empty)
+ __blk_run_queue(q);
+}
+
static void pre_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_PREFLUSH, error);
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_PREFLUSH, error);
}
static void flush_data_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_DATA, error);
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_DATA, error);
}
static void post_flush_end_io(struct request *rq, int error)
{
elv_completed_request(rq->q, rq);
- blk_flush_complete_seq(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
+ blk_flush_complete_seq_end_io(rq->q, QUEUE_FSEQ_POSTFLUSH, error);
}
static void init_flush_request(struct request *rq, struct gendisk *disk)
--
tejun
--
On Tue, Aug 31 2010 at 6:29am -0400, Still hit the hang on the 5th iteration of my test: while true ; do ./test_dm_discard_mpath.sh && sleep 1 ; done Would you like me to (re)send my test script offlist? INFO: task vgcreate:2617 blocked for more than 120 seconds. "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. vgcreate D ffff88007bf7ba00 4688 2617 2479 0x00000080 ffff88007bf7b998 0000000000000046 ffff880000000000 ffff88007bf7bfd8 ffff88005542a430 ffff88007bf7bfd8 00000000001d3880 ffff88005542a800 ffff88005542a7f8 00000000001d3880 00000000001d3880 ffff88007bf7bfd8 Call Trace: [<ffffffff81389338>] io_schedule+0x73/0xb5 [<ffffffff811c7304>] get_request_wait+0xef/0x17d [<ffffffff810642be>] ? autoremove_wake_function+0x0/0x39 [<ffffffff811c7890>] __make_request+0x333/0x467 [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9 [<ffffffff811c5e91>] generic_make_request+0x342/0x3bf [<ffffffff81074714>] ? trace_hardirqs_off+0xd/0xf [<ffffffff81069df2>] ? local_clock+0x41/0x5a [<ffffffff811c5fe9>] submit_bio+0xdb/0xf8 [<ffffffff810754a4>] ? trace_hardirqs_on+0xd/0xf [<ffffffff811381a6>] dio_bio_submit+0x7b/0x9c [<ffffffff81138dbe>] __blockdev_direct_IO+0x7f3/0x97d [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9 [<ffffffff81136d7a>] blkdev_direct_IO+0x57/0x59 [<ffffffff81135f58>] ? blkdev_get_blocks+0x0/0x90 [<ffffffff810ce301>] generic_file_aio_read+0xed/0x5b4 [<ffffffff81077932>] ? lock_release_non_nested+0xd5/0x23b [<ffffffff810e40f8>] ? might_fault+0x5c/0xac [<ffffffff810251e5>] ? pvclock_clocksource_read+0x50/0xb9 [<ffffffff8110e131>] do_sync_read+0xcb/0x108 [<ffffffff81074688>] ? trace_hardirqs_off_caller+0x1f/0x9e [<ffffffff81389ac9>] ? __mutex_unlock_slowpath+0x120/0x132 [<ffffffff8119d805>] ? fsnotify_perm+0x4a/0x50 [<ffffffff8119d86c>] ? security_file_permission+0x2e/0x33 [<ffffffff8110e7a3>] vfs_read+0xab/0x107 [<ffffffff81075473>] ? trace_hardirqs_on_caller+0x11d/0x141 ...
init_flush_request() only set REQ_FLUSH when initializing flush
requests making them READ requests. Use WRITE_FLUSH instead.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Mike Snitzer <snitzer@redhat.com>
---
So, this was the culprit for the incorrect data direction for flush
requests.
Thanks.
block/blk-flush.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: block/block/blk-flush.c
===================================================================
--- block.orig/block/blk-flush.c
+++ block/block/blk-flush.c
@@ -77,7 +77,7 @@ static void post_flush_end_io(struct req
static void init_flush_request(struct request *rq, struct gendisk *disk)
{
rq->cmd_type = REQ_TYPE_FS;
- rq->cmd_flags = REQ_FLUSH;
+ rq->cmd_flags = WRITE_FLUSH;
rq->rq_disk = disk;
}
--
This patch converts request-based dm to support the new REQ_FLUSH/FUA.
The original request-based flush implementation depended on
request_queue blocking other requests while a barrier sequence is in
progress, which is no longer true for the new REQ_FLUSH/FUA.
In general, request-based dm doesn't have infrastructure for cloning
one source request to multiple targets, but the original flush
implementation had a special mostly independent path which can issue
flushes to multiple targets and sequence them. However, the
capability isn't currently in use and adds a lot of complexity.
Moreoever, it's unlikely to be useful in its current form as it
doesn't make sense to be able to send out flushes to multiple targets
when write requests can't be.
This patch rips out special flush code path and deals handles
REQ_FLUSH/FUA requests the same way as other requests. The only
special treatment is that REQ_FLUSH requests use the block address 0
when finding target, which is enough for now.
* added BUG_ON(!dm_target_is_valid(ti)) in dm_request_fn() as
suggested by Mike Snitzer
Signed-off-by: Tejun Heo <tj@kernel.org>
Cc: Mike Snitzer <snitzer@redhat.com>
---
Here's a version w/ BUG_ON() added. Once the queue hang issue is
tracked down, I'll refresh the whole series and repost.
Thanks.
drivers/md/dm.c | 206 +++++---------------------------------------------------
1 file changed, 22 insertions(+), 184 deletions(-)
Index: block/drivers/md/dm.c
===================================================================
--- block.orig/drivers/md/dm.c
+++ block/drivers/md/dm.c
@@ -143,20 +143,9 @@ struct mapped_device {
spinlock_t deferred_lock;
/*
- * Protect barrier_error from concurrent endio processing
- * in request-based dm.
- */
- spinlock_t barrier_error_lock;
- int barrier_error;
-
- /*
- * Processing queue (flush/barriers)
+ * Processing queue (flush)
*/
struct workqueue_struct *wq;
- struct work_struct barrier_work;
-
- /* A pointer to the ...On Mon, Aug 30 2010 at 11:45am -0400, Looks good. Acked-by: Mike Snitzer <snitzer@redhat.com> Junichi and/or Kiyoshi, Could you please review this patch and add your Acked-by if it is OK? (Alasdair will want to see NEC's Ack to accept this patch). Thanks, Mike --
Hi Tejun,
Thank you for your work.
I don't see any obvious problem on this patch.
However, I hit a NULL pointer dereference below when I use a mpath
device with barrier option of ext3. I'm investigating the cause now.
(Also I'm not sure the cause of the hang which Mike is hitting yet.)
I tried on the commit 28dd53b26d362c16234249bad61db8cbd9222d0b of
git://git.kernel.org/pub/scm/linux/kernel/git/tj/misc.git flush-fua.
# mke2fs -j /dev/mapper/mpatha
# mount -o barrier=1 /dev/mapper/mpatha /mnt/0
# dd if=/dev/zero of=/mnt/0/a bs=512 count=1
# sync
BUG: unable to handle kernel NULL pointer dereference at 0000000000000078
IP: [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
PGD 29fd9a067 PUD 2a21ff067 PMD 0
Oops: 0000 [#1] SMP
last sysfs file: /sys/devices/system/cpu/cpu3/cache/index2/shared_cpu_map
CPU 1
Modules linked in: ext4 jbd2 crc16 ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables autofs4 lockd sunrpc cpufreq_ondemand acpi_cpufreq bridge stp llc iscsi_tcp libiscsi_tcp libiscsi scsi_transport_iscsi dm_mirror dm_region_hash dm_log dm_service_time dm_multipath scsi_dh dm_mod video output sbs sbshc battery ac kvm_intel kvm e1000e sg sr_mod cdrom lpfc scsi_transport_fc piix rtc_cmos rtc_core ioatdma ata_piix button serio_raw rtc_lib libata dca megaraid_sas sd_mod scsi_mod crc_t10dif ext3 jbd uhci_hcd ohci_hcd ehci_hcd [last unloaded: microcode]
Pid: 0, comm: kworker/0:0 Not tainted 2.6.36-rc2+ #1 MS-9196/Express5800/120Lj [N8100-1417]
RIP: 0010:[<ffffffffa0070ec3>] [<ffffffffa0070ec3>] scsi_finish_command+0xa3/0x120 [scsi_mod]
RSP: 0018:ffff880002c83e50 EFLAGS: 00010297
RAX: 0000000000000000 RBX: 0000000000001000 RCX: 0000000000000000
RDX: 0000000000007d7c RSI: ffffffff81389c55 RDI: 0000000000000286
RBP: ffff880002c83e70 R08: 0000000000000002 R09: 0000000000000001
R10: 0000000000000001 R11: 0000000000000000 R12: ffff8802a2acf750
R13: ffff8802a25686c8 R14: ffff8802791f7eb8 R15: 0000000000000100
FS: 0000000000000000(0000) ...On Wed, Sep 01 2010 at 3:15am -0400, FYI, I can't reproduce this using all of Tejun's latest patches (not yet in the flush-fua git tree). But I haven't tried the specific flush-fua commit that you referenced. Mike --
Hello, Hmm... I'm trying to reproduce this problem but hasn't been successful I sure can remove it but md->wq already has most stuff necessary to process deferred requests and when someone starts using it, having the comment there about the rather delicate ordering would definitely be helpful, so I suggest keeping the comment. Thanks. -- tejun --
Ooh, never mind. Reproduced it. Thanks. -- tejun --
Hi Tejun, OK, makes sense. Thanks, Kiyoshi Ueda --
rq->rq_disk and bio->bi_bdev->bd_disk may differ if a request has passed through remapping drivers. FSEQ_DATA request incorrectly followed bio->bi_bdev->bd_disk ending up being issued w/ mismatching rq_disk. Make it follow orig_rq->rq_disk. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> --- Kiyoshi, can you please apply this patch on top and verify that the problem goes away? The reason I and Mike didn't see the problem was because we were using REQ_FUA capable underlying device, which makes block layer bypass sequencing for FSEQ_DATA request. Thanks. block/blk-flush.c | 7 +++++++ 1 file changed, 7 insertions(+) Index: block/block/blk-flush.c =================================================================== --- block.orig/block/blk-flush.c +++ block/block/blk-flush.c @@ -111,6 +111,13 @@ static struct request *queue_next_fseq(s break; case QUEUE_FSEQ_DATA: init_request_from_bio(rq, orig_rq->bio); + /* + * orig_rq->rq_disk may be different from + * bio->bi_bdev->bd_disk if orig_rq got here through + * remapping drivers. Make sure rq->rq_disk points + * to the same one as orig_rq. + */ + rq->rq_disk = orig_rq->rq_disk; rq->cmd_flags &= ~(REQ_FLUSH | REQ_FUA); rq->cmd_flags |= orig_rq->cmd_flags & (REQ_FLUSH | REQ_FUA); rq->end_io = flush_data_end_io; --
Hi Tejun,
Ah, I see, thank you for the quick fix!
I confirmed no panic occurs with this patch.
Tested-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
By the way, I had been considering a block-layer interface which remaps
struct request and its bios to a block device such as:
void blk_remap_request(struct request *rq, struct block_device *bdev)
{
rq->rq_disk = bdev->bd_disk;
__rq_for_each_bio(bio, rq) {
bio->bi_bdev = bdev->bd_disk;
}
}
If there is such an interface and remapping drivers use it, then these
kind of issues may be avoided in the future.
Thanks,
Kiyoshi Ueda
--
Hello, I think the problem is more with request initialization. After all, once bios are packed into a request, they are (or at least should be) just data containers. We now have multiple request init paths in block layer and different ones initialize different subsets and it's not very clear which fields are supposed to be initialized to what by whom. But yeah I agree removing discrepancy between request and bio would be nice to have too. It's not really remapping tho. Maybe just blk_set_rq_q() or something like that (it should also set rq->q)? Thanks. -- tejun --
Hello, Oh, right. Maybe blk_set_rq_bdev() then? Thanks. -- tejun --
Hi Tejun, Yeah, although struct request doesn't have the member 'bdev', blk_set_rq_bdev() may be better from the view of the arguments. Thanks, Kiyoshi Ueda --
