From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
PROBLEM:
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
HOW TO FIX:
The patch fixes the problem by changing a merging rule.
The problem is caused by merging different partition's I/Os. So the patch
check whether a merging bio or request is a same partition as a request or not
by using a partition's ...I really think this is wrong.
We should just carry the partition information around in the req and
the bio, and just compare the pointers, rather than compare the range.
No need to even dereference the pointers, you should be able to just
do
/* don't merge if not on the same partition */
if (bio->part != req->part)
return 0;
or something.
This is doubly true since the accounting already does that horrible
partition lookup: rather than look it up, we should just _set_ it in
__generic_make_request(), where I think we already know it since we do
that whole blk_partition_remap().
So just something like the appended (TOTALLY UNTESTED) perhaps?
Note that this should get it right even for overlapping partitions etc.
Linus
Hi Linus, Yasuaki, and Jens
The problem can occur even if your patches are applied. Think about a case
like the following.
1) There are 2 partition, sda1 and sda2, on sda.
2) Open sda and issue an IO to sda2's first sector. Then sda2's in_flight
is incremented though you open not sda2 but sda. It is because of
partition lookup method. It is based on which partition rq->__sector
sector belongs to.
3) Issue an IO to sda1's last sector and it merged to the IO issued in
step (2) because their part are both sda. In addition, rq->__sector
is modified to the sda1's region.
4) After completing the IO, sda1's in_flight is decremented and diskstat
is corrupted here.
I think fixing this case is difficult and would cause more complexity.
I hit on another approach. Although it doesn'tprevent any merge as Linus
preferred, it can fix the problem anyway. In this idea, in_flight is
incremented and decremented for the partition which the request belonged
to in its creation. It has the following merits.
- It can fix the problem which Yasuaki reported, including the cases which
I mentioned above.
- It only append one extra field to request.
Although it would causes a bit gap, it doesn't have most influences because
merging requests beyond partitions is the rare case.
I confirmed the attached patch can be applied to 2.6.37-rc4 and succeeded
to compile. If you can accept this idea, I'll test it soon.
---
block/blk-core.c | 12 +++++++-----
block/blk-merge.c | 2 +-
include/linux/blkdev.h | 6 ++++++
3 files changed, 14 insertions(+), 6 deletions(-)
Index: linux-2.6.37-rc4/block/blk-core.c
===================================================================
--- linux-2.6.37-rc4.orig/block/blk-core.c 2010-11-30 13:42:04.000000000 +0900
+++ linux-2.6.37-rc4/block/blk-core.c 2010-12-07 14:31:55.000000000 +0900
@@ -64,11 +64,13 @@ static void drive_stat_acct(struct reque
return;
cpu = ...Ok, so idea seems to be that lets keep track of the sector number against which we do the accounting. Even if we are doing merging later, accounting sector of the request will not change so that in_flight will not go out of the sync. The only thing is that by allowing merging across partitions, one request can have IO from multiple partitions and all of it being accounted to only one partition. So it is more of little accounting error. Though I am not sure how big a issue that is. Would "acct_sector" be a better name. It just means this is the sector which we will be using for accounting purposes of this rq. Thanks Vivek --
I really would prefer if we fixed up the patchset we ended up reverting. At least that had a purpose with growing struct request, since we saved on doing the partition lookups. -- Jens Axboe --
Hi Jens, Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't contain Yasuaki's former logic. Thanks, Satoru --
That I prefer we fix that code up, since I think it's the best solution to the problem. -- Jens Axboe --
Hi Jens, I already postedit. https://lkml.org/lkml/2010/12/8/12 I think it is OK without mail subject :-) Thanks, Satoru --
No, that's not it at all. What I mean (and what was reverted) was caching the partition lookup, and using that for the stats. The problem with that approach turned out to be the elevator queiscing logic not being fully correct. One easier way to fix that would be to reference count the part stats, instead of having to drain the queue. -- Jens Axboe --
Taking reference to hd_struct and storing it in rq, will definitely save us 1 lookup while doing accounting on completion path. It does not save on rq size though. IIUC, current patch does not increase the number of existing lookups. So current situation does not deteriorate with the patch. But storing a reference in rq and avoiding 1 lookup in completion path definitely sounds better. Thanks Vivek --
Storing a pointer to partition in rq also got the advantage that we can easily not allow merging of requests across partitions for better accounting. Satoru, so yes, if you can implement what jens is suggesting, would be good. Thanks Vivek --
I don't think it really matters (as long as we keep in_flight consistent of course). The requests that got merged across partition are very likely rare enough to be neglectable. --
The partition is already deleted in a rcu callback function. If I
understand RCU correctly, we just need to use rcu_dereference() each time
we use rq->part. Something like the following *untested* patch.
Jerome
---
diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..70d23f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,15 @@ static void drive_stat_acct(struct request *rq, int new_io)
return;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
- if (!new_io)
+ if (!new_io) {
+ part = blk_rq_get_part(rq);
part_stat_inc(cpu, part, merges[rw]);
- else {
+ } else {
+ part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
part_round_stats(cpu, part);
part_inc_in_flight(part, rw);
+ rq->part = part;
}
part_stat_unlock();
@@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->ref_count = 1;
rq->start_time = jiffies;
set_start_time_ns(rq);
+ rq->part = NULL;
}
EXPORT_SYMBOL(blk_rq_init);
@@ -1776,7 +1784,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
int cpu;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+ part = blk_rq_get_part(req);
part_stat_add(cpu, part, sectors[rw], bytes >> 9);
part_stat_unlock();
}
@@ -1796,7 +1804,7 @@ static void blk_account_io_done(struct request *req)
int cpu;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+ part = blk_rq_get_part(req);
part_stat_inc(cpu, part, ios[rw]);
part_stat_add(cpu, part, ticks[rw], duration);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 77b7c26..0ea5416 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,7 +351,7 @@ static void blk_account_io_merge(struct request *req)
int cpu;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+ part = ...Jerome, I don't think that rcu_dereference() is going to solve the problem. The partition table will be freed as soon as one rcu period is over. So to make sure partition table is not freed one has to be holding rcu_read_lock(). It is not a good idea to keep rcu period going till request finishes so a better idea will to to reference count it. Thanks --
Exactly. The only change you would do to partition handling is ensure that each io grabs a reference to it and drops it at the end. You need not even do that in the core bits outside each IO, we just need to ensure that the partition struct persists in memory even if it is no longer the valid partition table. The rcu call to free the memory would happen when the ref drops to zero. What Vivek says it completely correct, and rcu_dereference() would only help if you also had rcu_read_lock() over each IO. That is not feasible at all. -- Jens Axboe --
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
Also add a refcount ...[snip]
I don't think this is completely safe. The rcu lock is held due to the
part_stat_lock(), but that only prevents the __delete_partition()
callback from happening. Lets say you have this:
CPU0 CPU1
part = disk_map_sector_rcu()
kref_put(part); <- now 0
part_stat_unlock()
__delete_partition();
...
delete_partition_rcu_cb();
merge, or endio, boom
Now rq has ->part pointing to freed memory, later merges or end
accounting will touch freed memory.
I think we can fix this by just having delete_partition_rcu_rb() check
the reference count and return if non-zero. Since someone holds a
reference to the table, they will drop it and we'll re-schedule the rcu
callback.
--
Jens Axboe
--
This is interesting. Using RCU with kref(). So even if somebody has done a kref_put() and this is last reference, but rcu period is not over, somebody can still go and take reference again and set it to 1 again and then partition will not be freed as delete_partition_rcu_cb() will find it set. I guess read shall have to be atomic_read() and struct kref is opaque so one might have to introduce kref_read() or something like that and possibly update Documentation/kref.txt for this usage of with RCU. I would also recommend it to get it reviewed from Paul McKenney to make sure this usage of RCU is fine. Thanks Vivek --
A kref_get(part) happens here, or am I missing something?
If we use an atomic kref_test_and_get() function, which only increment the
refcount if it's not zero, we would be safe. If kref_test_and_get() fails,
---
diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..72d12d2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
return;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
- if (!new_io)
+ if (!new_io) {
+ part = rq->part;
part_stat_inc(cpu, part, merges[rw]);
- else {
+ } else {
+ part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
+ if(part->partno && !kref_test_and_get(&part->ref))
+ /*
+ * The partition is already being removed,
+ * the request will be accounted on the disk only
+ */
+ part = &rq->rq_disk->part0;
part_round_stats(cpu, part);
part_inc_in_flight(part, rw);
+ rq->part = part;
}
part_stat_unlock();
@@ -128,6 +136,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
rq->ref_count = 1;
rq->start_time = jiffies;
set_start_time_ns(rq);
+ rq->part = NULL;
}
EXPORT_SYMBOL(blk_rq_init);
@@ -1776,7 +1785,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
int cpu;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+ part = req->part;
part_stat_add(cpu, part, sectors[rw], bytes >> 9);
part_stat_unlock();
}
@@ -1796,13 +1805,15 @@ static void blk_account_io_done(struct request *req)
int cpu;
cpu = part_stat_lock();
- part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+ part = req->part;
part_stat_inc(cpu, part, ios[rw]);
part_stat_add(cpu, part, ticks[rw], duration);
part_round_stats(cpu, part);
part_dec_in_flight(part, rw);
+ if (part->partno)
+ kref_put(&part->ref, __delete_partition);
part_stat_unlock();
...It might happen that kref_get(part) is called after kref_put() has been called and reference has reached 0 and and delete_parition() call has been scheduled as soon as rcu grace period is over. So if you do Conceptually it kind of makes sense to me. So if even if we get the pointer to partition under rcu_read_lock(), we will not account the IO Do we have to check this part->partno both while taking and releasing reference. Can't we take one extra reference for disk->part0, at alloc_disk_node() time so that it is never freed and only freed when disk is going away and gendisk is being freed. That way, you don't have to differentiate between disk->part0 and rest of the partitions while taking or dropping references. Thanks --
We could do it in any way, as long as we don't end up trying to free disk->part0. I choose not to touch part0->ref at all, but we could also drop all the part->partno test, and get a reference on part0 when we use it as a backup. I have no strong opinion about a way or an other. --
Ok, I personally like taking an extra reference to disk->part0 and put a proper comment there so that at rest of the places we don't try to differentiate between part0 and other partitions. Having said that I am not too concerned about it. It is just one of the minor details. So post the new version of patch after testing. Thanks Vivek --
Add kref_test_and_get() function, which atomically add a reference only if
refcount is not zero. This prevent to add a reference to an object that is
already being removed.
Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
---
include/linux/kref.h | 1 +
lib/kref.c | 16 ++++++++++++++++
2 files changed, 17 insertions(+)
diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6cc38fc..90b9e44 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -23,6 +23,7 @@ struct kref {
void kref_init(struct kref *kref);
void kref_get(struct kref *kref);
+int kref_test_and_get(struct kref *kref);
int kref_put(struct kref *kref, void (*release) (struct kref *kref));
#endif /* _KREF_H_ */
diff --git a/lib/kref.c b/lib/kref.c
index d3d227a..5f663b9 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -37,6 +37,22 @@ void kref_get(struct kref *kref)
}
/**
+ * kref_test_and_get - increment refcount for object only if refcount is not
+ * zero.
+ * @kref: object.
+ *
+ * Return non-zero if the refcount was incremented, 0 otherwise
+ */
+int kref_test_and_get(struct kref *kref)
+{
+ int ret;
+ smp_mb__before_atomic_inc();
+ ret = atomic_inc_not_zero(&kref->refcount);
+ smp_mb__after_atomic_inc();
+ return ret;
+}
+
+/**
* kref_put - decrement refcount for object.
* @kref: object.
* @release: pointer to the function that will clean up the object when the
--
/proc/diskstats would display a strange output as follows.
$ cat /proc/diskstats |grep sda
8 0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
8 1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
~~~~~~~~~~
8 2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
8 3 sda3 54 487 2188 92 0 0 0 0 0 88 92
8 4 sda4 4 0 8 0 0 0 0 0 0 0 0
8 5 sda5 81 2027 2130 138 0 0 0 0 0 87 137
Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.
The detailed root cause is as follows.
Assuming that there are two partition, sda1 and sda2.
1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
is 0 and sda2's one is 1.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
2. A bio belongs to sda1 is issued and is merged into the request mentioned on
step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
from sda2 region to sda1 region. However the two partition's
hd_struct->in_flight are not changed.
| hd_struct->in_flight
---------------------------
sda1 | 0
sda2 | 1
---------------------------
3. The request is finished and blk_account_io_done() is called. In this case,
sda2's hd_struct->in_flight, not a sda1's one, is decremented.
| hd_struct->in_flight
---------------------------
sda1 | -1
sda2 | 1
---------------------------
The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.
Also add a refcount ...No, don't do this, use a kref correctly and no such function should be That is the function that should properly increment the reference count on the object. If the object is "being removed", then it will return At all. thanks, greg k-h --
atomic_inc_not_zero() has full memory barrier semantic already, you dont need smp_mb__before_atomic_inc() or smp_mb__after_atomic_inc(); --
We just removed a function like this recently as it really isn't the solution for what you need here at all. Please use a lock that protects the desctruction of the kref, like the rest of the kernel, to keep someone from grabing a reference while the device could be going away. That's the only way to safely do this. See the archives for why this overall, isn't a good idea for a kref to have this type of function. thanks, greg k-h --
