Hi Jens, James and Alasdair,
This is a new version of request-based dm-multipath patches.
The patches are created on top of 2.6.27-rc6 + Alasdair's dm patches
for linux-next below:
dm-mpath-use-more-error-codes.patch
dm-mpath-remove-is_active-from-struct-dm_path.patch
Major changes from the previous version (*) are:
- Moved busy state information for device/host to
q->backing_dev_info from q->queue_flags, since backing_dev_info
seems to be more appropriate location. (PATCH 03)
And corresponding changes to the scsi driver. (PATCH 04)
- Added a queue flag to indicate whether the block device is
request stackable or not, so that request stacking drivers
can avoid to stack request-based device on bio-based device.
(PATCH 05)
- Fixed the problem that requests are not flushed on flush suspend.
(PATCH 10)
- Changed queue initialization method for bio-based dm devices
from blk_alloc_queue() to blk_init_queue(). (PATCH 11)
- Changed congestion check method in dm-multipath not to invoke
__choose_pgpath(). (PATCH 13)
(*) http://lkml.org/lkml/2008/3/19/478
Some basic function/performance testings are done with NEC iStorage
(active-active multipath), and no problem was found.
Please review and apply if no problem.
Summary of the patch-set:
01/13: block: add request data completion interface
02/13: block: add request submission interface
03/13: mm: export driver's busy state via backing_dev_info
04/13: scsi: export busy status
05/13: block: add a queue flag for request stacking support
06/13: dm: remove unused code (preparation for request-based dm)
07/13: dm: tidy local_init (preparation for request-based dm)
08/13: dm: prepare mempools on module init for request-based dm
09/13: dm: add target interface for request-based dm
10/13: dm: add core functions for request-based dm
11/13: dm: add a switch to enable request-based dm if target is ready
12/13: dm: ...This patch adds blk_update_request(), which updates struct request
with completing its data part, but doesn't complete the struct
request itself.
Though it looks like end_that_request_first() of older kernels,
blk_update_request() should be used only by request stacking drivers.
Request-based dm will use it in bio->bi_end_io callback to update
the original request when a data part of a cloned request completes.
Followings are additional background information of why request-based
dm needs this interface.
- Request stacking drivers can't use blk_end_request() directly from
the lower driver's completion context (bio->bi_end_io or rq->end_io),
because some device drivers (e.g. ide) may try to complete
their request with queue lock held, and it may cause deadlock.
See below for detailed description of possible deadlock:
<http://marc.info/?l=linux-kernel&m=120311479108569&w=2>
- To solve that, request-based dm offloads the completion of
cloned struct request to softirq context (i.e. using
blk_complete_request() from rq->end_io). See PATCH 10.
- Though it is possible to use the same solution from bio->bi_end_io,
it will delay the notification of bio completion to the original
submitter. Also, it will cause inefficient partial completion,
because the lower driver can't perform the cloned request anymore
and request-based dm needs to requeue and redispatch it to
the lower driver again later. That's not good.
- So request-based dm needs blk_update_request() to perform the bio
completion in the lower driver's completion context, which is more
efficient.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Jens Axboe <jens.axboe@oracle.com>
---
block/blk-core.c | 59 +++++++++++++++++++++++++++++++++++++++++--------
include/linux/blkdev.h | 2 +
2 files changed, 52 insertions(+), 9 deletions(-)
Index: ...This patch adds blk_submit_request(), a generic request submission interface for request stacking drivers. Request-based dm will use it to submit their clones to underlying devices. blk_rq_check_limits() is also added because it is possible that the lower queue has stronger limitations than the upper queue if multiple drivers are stacking at request-level. Not only for blk_submit_request()'s internal use, the function will be used by request-based dm when the queue limitation is modified (e.g. by replacing dm's table). Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Jens Axboe <jens.axboe@oracle.com> --- block/blk-core.c | 81 +++++++++++++++++++++++++++++++++++++++++++++++++ include/linux/blkdev.h | 2 + 2 files changed, 83 insertions(+) Index: 2.6.27-rc6/block/blk-core.c =================================================================== --- 2.6.27-rc6.orig/block/blk-core.c +++ 2.6.27-rc6/block/blk-core.c @@ -1517,6 +1517,87 @@ void submit_bio(int rw, struct bio *bio) EXPORT_SYMBOL(submit_bio); /** + * blk_rq_check_limits - Helper function to check a request for the queue limit + * @q: the queue + * @rq: the request being checked + * + * Description: + * @rq may have been made based on weaker limitations of upper-level queues + * in request stacking drivers, and it may violate the limitation of @q. + * Since the block layer and the underlying device driver trust @rq + * after it is inserted to @q, it should be checked against @q before + * the insertion using this generic function. + * + * This function should also be useful for request stacking drivers + * in some cases below, so export this fuction. + * Request stacking drivers like request-based dm may change the queue + * limits while requests are in the queue (e.g. dm's table swapping). + * Such request stacking drivers should check those requests agaist + * the new queue limits again when ...
This looks awfully similar to blk_execute_rq_nowait With an Added blk_rq_check_limits, minus the __generic_unplug_device() and q->request_fn(q) calls. Perhaps the common code could be re factored --
Hi Boaz, Jens,
They might look simlar but don't have much in common actually.
I could refactor them like the attached patch, but I'm not sure
this is a correct way and this is cleaner than the current code.
(e.g. blk_execute_rq_nowait() can't be called with irqs-disabled,
but blk_insert_request() and my blk_submit_request() can be called
with irqs-disabled.)
So I'd leave them as it is unless you or others strongly prefer
the attached patch...
Anyway, I would like to leave the refactoring as a separate patch,
blk_insert_request() is in blk-core.c and it is similar to
blk_submit_request(), so I added it to blk-core.c.
But maybe both should be in blk-exec.c.
I don't have any problem on this, I'd like to hear Jens' opinion.
Thanks,
Kiyoshi Ueda
---
block/blk-core.c | 20 +++----------------
block/blk-exec.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++-------
2 files changed, 54 insertions(+), 23 deletions(-)
Index: linux-2.6-block/block/blk-core.c
===================================================================
--- linux-2.6-block.orig/block/blk-core.c
+++ linux-2.6-block/block/blk-core.c
@@ -881,7 +881,7 @@ EXPORT_SYMBOL(blk_get_request);
*/
void blk_start_queueing(struct request_queue *q)
{
- if (!blk_queue_plugged(q))
+ if (!blk_queue_plugged(q) && !blk_queue_stopped(q))
q->request_fn(q);
else
__generic_unplug_device(q);
@@ -930,11 +930,10 @@ EXPORT_SYMBOL(blk_requeue_request);
* of the queue for things like a QUEUE_FULL message from a device, or a
* host that is unable to accept a particular command.
*/
-void blk_insert_request(struct request_queue *q, struct request *rq,
- int at_head, void *data)
+void blk_insert_special_request(struct request_queue *q, struct request *rq,
+ int at_head, void *data)
{
int where = at_head ? ELEVATOR_INSERT_FRONT : ELEVATOR_INSERT_BACK;
- unsigned long flags;
/*
* tell I/O scheduler that this isn't a regular read/write (ie it
@@ -946,18 +945,7 @@ void ...If it wasn't for the _irq vs _irqsave, I would apply it. But I agree, your current approach is fine. -- Jens Axboe --
Hi Jens, OK, I'll rebase my patches for for-2.6.28 of your block git and repost the block bits. Maybe I need a time to confirm whether the diffrences between 2.6.27-rc6 and the block git affect my patches. (Hopefully, I'd like to repost this week.) Thanks, Kiyoshi Ueda --
This patch adds an interface to check lld's busy status
from the block layer. (scsi patch is also included.)
This resolves a performance problem on request stacking devices below.
Some drivers like scsi mid layer stop dispatching request when
they detect busy state on its low-level device like host/bus/device.
It allows other requests to stay in the I/O scheduler's queue
for a chance of merging.
Request stacking drivers like request-based dm should follow
the same logic.
However, there is no generic interface for the stacked device
to check if the underlying device(s) are busy.
If the request stacking driver dispatches and submits requests to
the busy underlying device, the requests will stay in
the underlying device's queue without a chance of merging.
This causes performance problem on burst I/O load.
With this patch, busy state of the underlying device is exported
via the state flag of queue's backing_dev_info. So the request
stacking driver can check it and stop dispatching requests if busy.
The underlying device driver must set/clear the flag appropriately:
ON: when the device driver can't process requests immediately.
OFF: when the device driver can process requests immediately,
including abnormal situations where the device driver needs
to kill all requests.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
---
include/linux/backing-dev.h | 8 ++++++++
mm/backing-dev.c | 13 +++++++++++++
2 files changed, 21 insertions(+)
Index: 2.6.27-rc6/include/linux/backing-dev.h
===================================================================
--- 2.6.27-rc6.orig/include/linux/backing-dev.h
+++ 2.6.27-rc6/include/linux/backing-dev.h
@@ -26,6 +26,7 @@ enum bdi_state {
BDI_pdflush, /* A pdflush thread is working this device */
BDI_write_congested, /* The ...This patch adds a queue flag to indicate the block device can be used for request stacking. Request stacking drivers need to stack their devices on top of only devices of which q->request_fn is functional. Since bio stacking drivers (e.g. md, loop) basically initialize their queue using blk_alloc_queue() and don't set q->request_fn, the check of (q->request_fn == NULL) looks enough for that purpose. However, dm becomes both types of stacking driver (bio-based and request-based) with this patch-set. And dm always sets q->request_fn even if the dm device is bio-based of which q->request_fn is not functional actually. So we need something else to distinguish the type of the device. Adding a queue flag is a solution for that. The reason why dm always sets q->request_fn is to keep the compatibility of dm user-space tools. Currently, all dm user-space tools are using bio-based dm without specifying the type of the dm device they use. To use request-based dm without changing such tools, the kernel must decide the type of the dm device automatically. The automatic type decision can't be done at the device creation time and needs to be deferred until such tools load a mapping table, since the actual type is decided by dm target type included in the mapping table. So a dm device has to be initialized using blk_init_queue() so that we can load either type of table. Then, all queue stuffs are set (e.g. q->request_fn) and we have no element to distinguish that it is bio-based or request-based, even after a table is loaded and the type of the device is decided. By the way, some stuffs of the queue (e.g. request_list, elevator) are needless when the dm device is used as bio-based. But the memory size is not so large (about 20[KB] per queue on ia64), so I hope the memory loss can be acceptable for bio-based dm users. Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com> Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com> Cc: Jens Axboe <jens.axboe@oracle.com> --- ...
This patch change scsi mid-layer to export its busy status.
Not set the busy flag, when scsi can't dispatch I/Os anymore and
needs to kill I/Os. Otherwise, request stacking drivers may hold
requests forever.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
---
drivers/scsi/scsi.c | 4 ++--
drivers/scsi/scsi_lib.c | 23 ++++++++++++++++++++++-
2 files changed, 24 insertions(+), 3 deletions(-)
Index: 2.6.27-rc6/drivers/scsi/scsi_lib.c
===================================================================
--- 2.6.27-rc6.orig/drivers/scsi/scsi_lib.c
+++ 2.6.27-rc6/drivers/scsi/scsi_lib.c
@@ -459,17 +459,30 @@ static void scsi_init_cmd_errh(struct sc
void scsi_device_unbusy(struct scsi_device *sdev)
{
+ int host_congested;
struct Scsi_Host *shost = sdev->host;
unsigned long flags;
spin_lock_irqsave(shost->host_lock, flags);
shost->host_busy--;
+ if ((shost->can_queue > 0 && shost->host_busy >= shost->can_queue) ||
+ shost->host_blocked || shost->host_self_blocked ||
+ scsi_host_in_recovery(shost))
+ host_congested = 1;
+ else
+ host_congested = 0;
+
if (unlikely(scsi_host_in_recovery(shost) &&
(shost->host_failed || shost->host_eh_scheduled)))
scsi_eh_wakeup(shost);
spin_unlock(shost->host_lock);
+
spin_lock(sdev->request_queue->queue_lock);
sdev->device_busy--;
+ if (bdi_lld_congested(&sdev->request_queue->backing_dev_info) &&
+ !host_congested && !(sdev->device_busy >= sdev->queue_depth) &&
+ !sdev->device_blocked)
+ clear_bdi_lld_congested(&sdev->request_queue->backing_dev_info);
spin_unlock_irqrestore(sdev->request_queue->queue_lock, flags);
}
@@ -1495,9 +1508,14 @@ static void scsi_request_fn(struct reque
* accept it.
*/
req = elv_next_request(q);
- if (!req || !scsi_dev_queue_ready(q, sdev))
+ if (!req)
break;
+ if (!scsi_dev_queue_ready(q, sdev)) ...This patch removes dead codes for the noflush suspend.
No functional change.
This patch is just a clean up of the codes and not functionally
related to request-based dm. But included here due to literal
dependency.
The dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL) in dm_suspend()
is never invoked because:
- The 'goto flush_and_out' is same as 'goto out', because
the 'goto flush_and_out' is called only when '!noflush'
- If the 'r && noflush' is true, the interrupt check code above
is invoked and 'goto out'
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Signed-off-by: Milan Broz <mbroz@redhat.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
drivers/md/dm.c | 14 +-------------
1 files changed, 1 insertion(+), 13 deletions(-)
Index: 2.6.27-rc6/drivers/md/dm.c
===================================================================
--- 2.6.27-rc6.orig/drivers/md/dm.c
+++ 2.6.27-rc6/drivers/md/dm.c
@@ -76,7 +76,6 @@ union map_info *dm_get_mapinfo(struct bi
*/
struct dm_wq_req {
enum {
- DM_WQ_FLUSH_ALL,
DM_WQ_FLUSH_DEFERRED,
} type;
struct work_struct work;
@@ -1384,9 +1383,6 @@ static void dm_wq_work(struct work_struc
down_write(&md->io_lock);
switch (req->type) {
- case DM_WQ_FLUSH_ALL:
- __merge_pushback_list(md);
- /* pass through */
case DM_WQ_FLUSH_DEFERRED:
__flush_deferred_io(md);
break;
@@ -1516,7 +1512,7 @@ int dm_suspend(struct mapped_device *md,
if (!md->suspended_bdev) {
DMWARN("bdget failed in dm_suspend");
r = -ENOMEM;
- goto flush_and_out;
+ goto out;
}
/*
@@ -1567,14 +1563,6 @@ int dm_suspend(struct mapped_device *md,
set_bit(DMF_SUSPENDED, &md->flags);
-flush_and_out:
- if (r && noflush)
- /*
- * Because there may be already I/Os in the pushback list,
- * flush them before return.
- */
- dm_queue_flush(md, DM_WQ_FLUSH_ALL, NULL);
-
out:
if (r && md->suspended_bdev) {
...This patch tidies local_init() as preparation for another patch
(PATCH 08), which creates some kmem_cache for request-based dm.
No functional change.
This patch is just a clean up of the codes and not functionally
related to request-based dm. But included here due to literal
dependency.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
drivers/md/dm.c | 34 +++++++++++++++++-----------------
1 files changed, 17 insertions(+), 17 deletions(-)
Index: 2.6.27-rc6/drivers/md/dm.c
===================================================================
--- 2.6.27-rc6.orig/drivers/md/dm.c
+++ 2.6.27-rc6/drivers/md/dm.c
@@ -150,40 +150,40 @@ static struct kmem_cache *_tio_cache;
static int __init local_init(void)
{
- int r;
+ int r = -ENOMEM;
/* allocate a slab for the dm_ios */
_io_cache = KMEM_CACHE(dm_io, 0);
if (!_io_cache)
- return -ENOMEM;
+ return r;
/* allocate a slab for the target ios */
_tio_cache = KMEM_CACHE(dm_target_io, 0);
- if (!_tio_cache) {
- kmem_cache_destroy(_io_cache);
- return -ENOMEM;
- }
+ if (!_tio_cache)
+ goto out_free_io_cache;
r = dm_uevent_init();
- if (r) {
- kmem_cache_destroy(_tio_cache);
- kmem_cache_destroy(_io_cache);
- return r;
- }
+ if (r)
+ goto out_free_tio_cache;
_major = major;
r = register_blkdev(_major, _name);
- if (r < 0) {
- kmem_cache_destroy(_tio_cache);
- kmem_cache_destroy(_io_cache);
- dm_uevent_exit();
- return r;
- }
+ if (r < 0)
+ goto out_uevent_exit;
if (!_major)
_major = r;
return 0;
+
+out_uevent_exit:
+ dm_uevent_exit();
+out_free_tio_cache:
+ kmem_cache_destroy(_tio_cache);
+out_free_io_cache:
+ kmem_cache_destroy(_io_cache);
+
+ return r;
}
static void local_exit(void)
--
This patch prepares some kmem_cache for request-based dm.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
drivers/md/dm.c | 45 ++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 44 insertions(+), 1 deletion(-)
Index: 2.6.27-rc6/drivers/md/dm.c
===================================================================
--- 2.6.27-rc6.orig/drivers/md/dm.c
+++ 2.6.27-rc6/drivers/md/dm.c
@@ -32,6 +32,7 @@ static unsigned int _major = 0;
static DEFINE_SPINLOCK(_minor_lock);
/*
+ * For bio based dm.
* One of these is allocated per bio.
*/
struct dm_io {
@@ -43,6 +44,7 @@ struct dm_io {
};
/*
+ * For bio based dm.
* One of these is allocated per target within a bio. Hopefully
* this will be simplified out one day.
*/
@@ -52,6 +54,31 @@ struct dm_target_io {
union map_info info;
};
+/*
+ * For request based dm.
+ * One of these is allocated per request.
+ *
+ * Since assuming "original request : cloned request = 1 : 1" and
+ * a counter for number of clones like struct dm_io.io_count isn't needed,
+ * struct dm_io and struct target_io can be merged.
+ */
+struct dm_rq_target_io {
+ struct mapped_device *md;
+ struct dm_target *ti;
+ struct request *orig, clone;
+ int error;
+ union map_info info;
+};
+
+/*
+ * For request based dm.
+ * One of these is allocated per bio.
+ */
+struct dm_clone_bio_info {
+ struct bio *orig;
+ struct request *rq;
+};
+
union map_info *dm_get_mapinfo(struct bio *bio)
{
if (bio && bio->bi_private)
@@ -147,6 +174,8 @@ struct mapped_device {
#define MIN_IOS 256
static struct kmem_cache *_io_cache;
static struct kmem_cache *_tio_cache;
+static struct kmem_cache *_rq_tio_cache;
+static struct kmem_cache *_bio_info_cache;
static int __init local_init(void)
{
@@ -162,9 +191,17 @@ static int __init local_init(void)
if (!_tio_cache)
goto out_free_io_cache;
+ _rq_tio_cache ...This patch adds the following target interfaces for request-based dm.
map_rq : for mapping a request
rq_end_io : for finishing a request
busy : for avoiding performance regression from bio-based dm.
Target can tell dm core not to map requests now, and
that may help requests in the block layer queue to be
bigger by I/O merging.
In bio-based dm, this behavior is done by device
drivers which managing the block layer queue.
But in request-based dm, dm core has to do that
since dm core manages the block layer queue, and
target drivers help is needed for it.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
include/linux/device-mapper.h | 15 +++++++++++++++
1 files changed, 15 insertions(+)
Index: 2.6.27-rc6/include/linux/device-mapper.h
===================================================================
--- 2.6.27-rc6.orig/include/linux/device-mapper.h
+++ 2.6.27-rc6/include/linux/device-mapper.h
@@ -46,6 +46,8 @@ typedef void (*dm_dtr_fn) (struct dm_tar
*/
typedef int (*dm_map_fn) (struct dm_target *ti, struct bio *bio,
union map_info *map_context);
+typedef int (*dm_map_request_fn) (struct dm_target *ti, struct request *clone,
+ union map_info *map_context);
/*
* Returns:
@@ -58,6 +60,9 @@ typedef int (*dm_map_fn) (struct dm_targ
typedef int (*dm_endio_fn) (struct dm_target *ti,
struct bio *bio, int error,
union map_info *map_context);
+typedef int (*dm_request_endio_fn) (struct dm_target *ti,
+ struct request *clone, int error,
+ union map_info *map_context);
typedef void (*dm_flush_fn) (struct dm_target *ti);
typedef void (*dm_presuspend_fn) (struct dm_target *ti);
@@ -77,6 +82,13 @@ typedef int (*dm_ioctl_fn) (struct dm_ta
typedef int (*dm_merge_fn) (struct dm_target *ti, ...This patch adds core functions for request-based dm.
When struct mapped device (md) is initialized as request-based,
md->queue has an I/O scheduler and the following functions are set:
make_request_fn: __make_request() (existing block layer function)
request_fn: dm_request_fn() (newly added function)
Actual initializations are done in another patch (PATCH 11).
Below is a brief summary of how request-based dm behaves, including:
- making request from bio
- cloning, mapping and dispatching request
- completing request and bio
- suspending md
- resuming md
bio to request
==============
md->queue->make_request_fn() (__make_request()) is called for a bio
submitted to the md.
Then, the bio is kept in the queue as a new request or merged into
another request in the queue if possible.
Cloning and Mapping
===================
Cloning and mapping are done in md->queue->request_fn() (dm_request_fn()),
when requests are dispatched after they are sorted by the I/O scheduler.
dm_request_fn() checks busy state of underlying devices using
target's defer_map() function and stops dispatching requests
to keep them on the dm device's queue if busy.
It helps better I/O merging, since no merge is done for a request
once it is dispatched to underlying devices.
Actual cloning and mapping are done in dm_prep_fn() and map_request()
called from dm_request_fn().
dm_prep_fn() clones not only request but also bios of the request
so that dm can hold bio completion in error cases and prevent
the bio submitter from noticing the error.
(See the "Completion" section below for details.)
After the cloning, the clone is mapped by target's map_rq() function
and inserted to underlying device's queue using __elv_add_request().
Completion
==========
Request completion can be hooked by rq->end_io(), but then, all bios
in the request will have been completed even error cases, and the bio
submitter will have noticed the error.
To prevent the bio completion in error ...On Fri, Sep 12, 2008 at 8:16 PM, Kiyoshi Ueda <k-ueda@ct.jp.nec.com> wrote: <snip> Why not add barrier support in the beginning itself, so that targets can be developed with barriers in mind? At least can we make the target to return error, instead of the core? Thanks Nikanth Karthikesan --
Hi Nikanth, Currently, there is no barrier support in dm, not only request-based. Barrier support is a different feature in the next step, I think. As you noticed in the PATCH#11, current request-based dm has the limitation that multiple targets are not supported, so it may look barrier support in request-based dm is easy. But we may be able to remove the limitation in the future, so depending on it is not a good idea. Thanks, Kiyoshi Ueda --
This patch detects requests violating the queue limitations
and rejects them.
The same limitation checks are done when requests are submitted
to the queue by blk_submit_request().
However, such violation can happen if a table is swapped and
the queue limitations are shrunk while some requests are
in the queue.
Since struct request is a reliable one in the block layer and
device drivers, dispatching such requests is pretty dangerous.
(e.g. it may cause kernel panic easily.)
So avoid to dispatch such problematic requests in request-based dm.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
drivers/md/dm.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+)
Index: 2.6.27-rc6/drivers/md/dm.c
===================================================================
--- 2.6.27-rc6.orig/drivers/md/dm.c
+++ 2.6.27-rc6/drivers/md/dm.c
@@ -1469,6 +1469,30 @@ static void map_request(struct dm_target
tio->ti = ti;
atomic_inc(&md->pending);
+
+ /*
+ * Although submitted requests to the md->queue are checked against
+ * the table/queue limitations at the submission time, the limitations
+ * may be changed by a table swapping while those already checked
+ * requests are in the md->queue.
+ * If the limitations have been shrunk in such situations, we may be
+ * dispatching requests violating the current limitations here.
+ * Since struct request is a reliable one in the block-layer
+ * and device drivers, dispatching such requests is dangerous.
+ * (e.g. it may cause kernel panic easily.)
+ * Avoid to dispatch such problematic requests in request-based dm.
+ *
+ * Since dm_kill_request() decrements the md->pending, this have to
+ * be done after incrementing the md->pending.
+ */
+ r = blk_rq_check_limits(rq->q, rq);
+ if (unlikely(r)) {
+ DMWARN("violating the queue limitation. the limitation may be"
+ " shrunk while there are some ...This patch enables request-based dm.
o Request-based dm and bio-based dm coexist, since there are
some target drivers which are more fitting to bio-based dm.
Also, there are other bio-based devices in the kernel
(e.g. md, loop).
Since bio-based device can't receive struct request,
there are some limitations on device stacking between
bio-based and request-based.
type of underlying device
bio-based requeset-based
----------------------------------------------
bio-based OK OK
request-based NG OK
The device type is recognized by the queue flag in the kernel,
so dm follows that.
o The type of a dm device is decided at the first table loading time.
Until then, mempool creations are deferred, since mempools for
request-based dm are different from those for bio-based dm.
Once the type of a dm device is decided, the type can't be changed.
o Currently, request-based dm supports only tables that have a single
target. To support multiple targets, we need to support request
splitting or prevent bio/request from spanning multiple targets.
The former needs lots of changes in the block layer, and the latter
needs that all target drivers support merge() function.
Both will take a time.
Signed-off-by: Kiyoshi Ueda <k-ueda@ct.jp.nec.com>
Signed-off-by: Jun'ichi Nomura <j-nomura@ce.jp.nec.com>
Cc: Alasdair G Kergon <agk@redhat.com>
---
drivers/md/dm-ioctl.c | 13 ++++
drivers/md/dm-table.c | 68 +++++++++++++++++++++++
drivers/md/dm.c | 122 ++++++++++++++++++++++++++++++++++--------
drivers/md/dm.h | 15 +++++
include/linux/device-mapper.h | 1
5 files changed, 196 insertions(+), 23 deletions(-)
Index: 2.6.27-rc6/drivers/md/dm-table.c
===================================================================
--- 2.6.27-rc6.orig/drivers/md/dm-table.c
+++ 2.6.27-rc6/drivers/md/dm-table.c
@@ -108,6 ...So will some configurations would become impossible, if a target is made request-based and bio-based one is removed. Hope both Barrier support should be straight forward? Thanks Nikanth Karthikesan --
Hi Nikanth, If the target is used anywhere in the device stack (e.g. linear), both types of the target are needed, and that is possible. As for multipath, see my another reply to your comment for PATCH 13. Thanks, Kiyoshi Ueda --
This patch converts dm-multipath target to request-based from bio-based.
Basically, the patch just converts the I/O unit from struct bio
to struct request.
In the course of the conversion, it also changes the I/O queueing
mechanism. The change in the I/O queueing is described in details
as follows.
I/O queueing mechanism change
-----------------------------
In I/O submission, map_io(), there is no mechanism change from
bio-based, since the clone request is ready for retry as it is.
However, in I/O complition, do_end_io(), there is a mechanism change
from bio-based, since the clone request is not ready for retry.
In do_end_io() of bio-based, the clone bio has all needed memory
for resubmission. So the target driver can queue it and resubmit
it later without memory allocations.
The mechanism has almost no overhead.
On the other hand, in do_end_io() of request-based, the clone request
doesn't have clone bios, so the target driver can't resubmit it
as it is. To resubmit the clone request, memory allocation for
clone bios is needed, and it takes some overheads.
To avoid the overheads just for queueing, the target driver doesn't
queue the clone request inside itself.
Instead, the target driver asks dm core for queueing and remapping
the original request of the clone request, since the overhead for
queueing is just a freeing memory for the clone request.
As a result, the target driver doesn't need to record/restore
the information of the original request for resubmitting
the clone request. So dm_bio_details in dm_mpath_io is removed.
multipath_busy()
---------------------
The target driver returns "busy", only when the following case:
o The target driver will map I/Os, if map() function is called
and
o The mapped I/Os will wait on underlying device's queue due to
their congestions, if map() function is called now.
In other cases, the target driver doesn't return "busy".
Otherwise, dm core will keep the I/Os and the target driver can't
do what it ...Can we keep both the bio-based as well as request-based dm-multipath? Thanks Nikanth Karthikesan --
Hi Nikanth, Why do you want to keep bio-based dm-multipath? In the realistic environment, dm-multipath is usually the bottom of the stacking devices. So I think existing working configurations are still possible with request-based dm-multipath. If we keep both types of multipath targets, maintenance cost would be doubled and additional user interface might be needed to specify which type is used. I can't see any major benefit worth paying the cost of keeping bio-based dm-multipath. I know dm developers sometimes use linear/error maps under multipath just for testing purpose. If that is your requirement, I can make request-based targets for linear and error (e.g. named like linear_rq and error_rq). Thanks, Kiyoshi Ueda --
You have to base the block patches off the for-2.6.28 branch of the block git repo, otherwise I cannot merge the block bits. -- Jens Axboe --
