Re: [PATCH 1/2] kref: add kref_test_and_get

Previous thread: [RFC][PATCH 0/2 v6] blk: fix a wrong accounting of hd_struct->in_flight by Yasuaki Ishimatsu on Monday, December 6, 2010 - 2:44 am. (1 message)

Next thread: [PATCH 1/2] Change a argument of disk_map_sector_rcu() to hd_part->start_sect from req->__sector by Yasuaki Ishimatsu on Monday, December 6, 2010 - 2:45 am. (1 message)
From: Yasuaki Ishimatsu
Date: Monday, December 6, 2010 - 2:44 am

From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>

PROBLEM:

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

HOW TO FIX:

The patch fixes the problem by changing a merging rule.

The problem is caused by merging different partition's I/Os. So the patch
check whether a merging bio or request is a same partition as a request or not
by using a partition's ...
From: Linus Torvalds
Date: Monday, December 6, 2010 - 9:08 am

I really think this is wrong.

We should just carry the partition information around in the req and
the bio, and just compare the pointers, rather than compare the range.
No need to even dereference the pointers, you should be able to just
do

   /* don't merge if not on the same partition */
   if (bio->part != req->part)
      return 0;

or something.

This is doubly true since the accounting already does that horrible
partition lookup: rather than look it up, we should just _set_ it in
__generic_make_request(), where I think we already know it since we do
that whole blk_partition_remap().

So just something like the appended (TOTALLY UNTESTED) perhaps?

Note that this should get it right even for overlapping partitions etc.

                     Linus
From: Satoru Takeuchi
Date: Tuesday, December 7, 2010 - 12:18 am

Hi Linus, Yasuaki,  and Jens


The problem can occur even if your patches are applied. Think about a case
like the following.

  1) There are 2 partition, sda1 and sda2, on sda.
  2) Open sda and issue an IO to sda2's first sector. Then sda2's in_flight
     is incremented though you open not sda2 but sda. It is because of
     partition lookup method. It is based on which partition rq->__sector
     sector belongs to.
  3) Issue an IO to sda1's last sector and it merged to the IO issued in
     step (2) because their part are both sda. In addition, rq->__sector
     is modified to the sda1's region.
  4) After completing the IO, sda1's in_flight is decremented and diskstat
     is corrupted here.

I think fixing this case is difficult and would cause more complexity.

I hit on another approach. Although it doesn'tprevent any merge as Linus
preferred, it can fix the problem anyway. In this idea, in_flight is
incremented and decremented for the partition which the request belonged
to in its creation. It has the following merits.

  - It can fix the problem which Yasuaki reported, including the cases which
    I mentioned above.
  - It only append one extra field to request.

Although it would causes a bit gap, it doesn't have most influences because
merging requests beyond partitions is the rare case.

I confirmed the attached patch can be applied to 2.6.37-rc4 and succeeded
to compile. If you can accept this idea, I'll test it soon.

---
  block/blk-core.c       |   12 +++++++-----
  block/blk-merge.c      |    2 +-
  include/linux/blkdev.h |    6 ++++++
  3 files changed, 14 insertions(+), 6 deletions(-)

Index: linux-2.6.37-rc4/block/blk-core.c
===================================================================
--- linux-2.6.37-rc4.orig/block/blk-core.c	2010-11-30 13:42:04.000000000 +0900
+++ linux-2.6.37-rc4/block/blk-core.c	2010-12-07 14:31:55.000000000 +0900
@@ -64,11 +64,13 @@ static void drive_stat_acct(struct reque
  		return;
  
  	cpu = ...
From: Vivek Goyal
Date: Tuesday, December 7, 2010 - 11:39 am

Ok, so idea seems to be that lets keep track of the sector number against
which we do the accounting. Even if we are doing merging later, accounting
sector of the request will not change so that in_flight will not go out
of the sync.

The only thing is that by allowing merging across partitions, one request
can have IO from multiple partitions and all of it being accounted to
only one partition. So it is more of little accounting error. Though I am
not sure how big a issue that is.


Would "acct_sector" be a better name. It just means this is the sector
which we will be using for accounting purposes of this rq.

Thanks
Vivek
--

From: Jens Axboe
Date: Wednesday, December 8, 2010 - 12:33 am

I really would prefer if we fixed up the patchset we ended up reverting.
At least that had a purpose with growing struct request, since we saved
on doing the partition lookups.

-- 
Jens Axboe

--

From: Satoru Takeuchi
Date: Wednesday, December 8, 2010 - 12:59 am

Hi Jens,


Revert is already finished. 2.6.37-rc-5 and latest stable kernel doesn't
contain Yasuaki's former logic.


Thanks,
Satoru

--

From: Jens Axboe
Date: Wednesday, December 8, 2010 - 1:06 am

That I prefer we fix that code up, since I think it's the best solution
to the problem.

-- 
Jens Axboe

--

From: Satoru Takeuchi
Date: Wednesday, December 8, 2010 - 1:11 am

Hi Jens,


I already postedit.

https://lkml.org/lkml/2010/12/8/12

I think it is OK without mail subject :-)

Thanks,
Satoru

--

From: Jens Axboe
Date: Wednesday, December 8, 2010 - 7:46 am

No, that's not it at all. What I mean (and what was reverted) was
caching the partition lookup, and using that for the stats. The problem
with that approach turned out to be the elevator queiscing logic not
being fully correct. One easier way to fix that would be to reference
count the part stats, instead of having to drain the queue.

-- 
Jens Axboe

--

From: Vivek Goyal
Date: Wednesday, December 8, 2010 - 8:51 am

Taking reference to hd_struct and storing it in rq, will definitely save
us 1 lookup while doing accounting on completion path. It does not save
on rq size though.

IIUC, current patch does not increase the number of existing lookups. So 
current situation does not deteriorate with the patch.

But storing a reference in rq and avoiding 1 lookup in completion path
definitely sounds better.

Thanks
Vivek
--

From: Vivek Goyal
Date: Wednesday, December 8, 2010 - 8:58 am

Storing a pointer to partition in rq also got the advantage that we can
easily not allow merging of requests across partitions for better
accounting.

Satoru, so yes, if you can implement what jens is suggesting, would be
good.

Thanks
Vivek
--

From: Jerome Marchand
Date: Friday, December 10, 2010 - 4:22 am

I don't think it really matters (as long as we keep in_flight consistent
of course). The requests that got merged across partition are very likely 
rare enough to be neglectable.


--

From: Jerome Marchand
Date: Friday, December 10, 2010 - 9:12 am

The partition is already deleted in a rcu callback function. If I
understand RCU correctly, we just need to use rcu_dereference() each time
we use rq->part. Something like the following *untested* patch.

Jerome

---

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..70d23f9 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,15 @@ static void drive_stat_acct(struct request *rq, int new_io)
 		return;
 
 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
-	if (!new_io)
+	if (!new_io) {
+		part = blk_rq_get_part(rq);
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}
 
 	part_stat_unlock();
@@ -128,6 +130,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1776,7 +1784,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = blk_rq_get_part(req);
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1796,7 +1804,7 @@ static void blk_account_io_done(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = blk_rq_get_part(req);
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
diff --git a/block/blk-merge.c b/block/blk-merge.c
index 77b7c26..0ea5416 100644
--- a/block/blk-merge.c
+++ b/block/blk-merge.c
@@ -351,7 +351,7 @@ static void blk_account_io_merge(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = ...
From: Vivek Goyal
Date: Friday, December 10, 2010 - 9:55 am

Jerome,

I don't think that rcu_dereference() is going to solve the problem. The
partition table will be freed as soon as one rcu period is over. So to
make sure partition table is not freed one has to be holding
rcu_read_lock(). It is not a good idea to keep rcu period going till
request finishes so a better idea will to to reference count it.

Thanks
--

From: Jens Axboe
Date: Tuesday, December 14, 2010 - 1:25 pm

Exactly. The only change you would do to partition handling is ensure
that each io grabs a reference to it and drops it at the end. You need
not even do that in the core bits outside each IO, we just need to
ensure that the partition struct persists in memory even if it is no
longer the valid partition table. The rcu call to free the memory would
happen when the ref drops to zero.

What Vivek says it completely correct, and rcu_dereference() would only
help if you also had rcu_read_lock() over each IO. That is not feasible
at all.


-- 
Jens Axboe

--

From: Jerome Marchand
Date: Friday, December 17, 2010 - 6:42 am

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount ...
From: Jens Axboe
Date: Friday, December 17, 2010 - 12:06 pm

[snip]


I don't think this is completely safe. The rcu lock is held due to the
part_stat_lock(), but that only prevents the __delete_partition()
callback from happening. Lets say you have this:

CPU0                                         CPU1
part = disk_map_sector_rcu()
                                             kref_put(part); <- now 0
part_stat_unlock()
                                             __delete_partition();
                                             ...
                                             delete_partition_rcu_cb();
merge, or endio, boom

Now rq has ->part pointing to freed memory, later merges or end
accounting will touch freed memory.

I think we can fix this by just having delete_partition_rcu_rb() check
the reference count and return if non-zero. Since someone holds a
reference to the table, they will drop it and we'll re-schedule the rcu
callback.


-- 
Jens Axboe

--

From: Vivek Goyal
Date: Friday, December 17, 2010 - 3:32 pm

This is interesting. Using RCU with kref(). So even if somebody has done
a kref_put() and this is last reference, but rcu period is not over, somebody
can still go and take reference again and set it to 1 again and then
partition will not be freed as delete_partition_rcu_cb() will find it set.

I guess read shall have to be atomic_read() and struct kref is opaque so
one might have to introduce kref_read() or something like that and
possibly update Documentation/kref.txt for this usage of with RCU. I would
also recommend it to get it reviewed from  Paul McKenney to make sure this
usage of RCU is fine.

Thanks
Vivek
--

From: Jerome Marchand
Date: Thursday, December 23, 2010 - 8:10 am

A kref_get(part) happens here, or am I missing something?
If we use an atomic kref_test_and_get() function, which only increment the
refcount if it's not zero, we would be safe. If kref_test_and_get() fails,


---

diff --git a/block/blk-core.c b/block/blk-core.c
index 4ce953f..72d12d2 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -64,13 +64,21 @@ static void drive_stat_acct(struct request *rq, int new_io)
 		return;
 
 	cpu = part_stat_lock();
-	part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
 
-	if (!new_io)
+	if (!new_io) {
+		part = rq->part;
 		part_stat_inc(cpu, part, merges[rw]);
-	else {
+	} else {
+		part = disk_map_sector_rcu(rq->rq_disk, blk_rq_pos(rq));
+		if(part->partno && !kref_test_and_get(&part->ref))
+			/*
+			 * The partition is already being removed,
+			 * the request will be accounted on the disk only
+			 */
+			part = &rq->rq_disk->part0;
 		part_round_stats(cpu, part);
 		part_inc_in_flight(part, rw);
+		rq->part = part;
 	}
 
 	part_stat_unlock();
@@ -128,6 +136,7 @@ void blk_rq_init(struct request_queue *q, struct request *rq)
 	rq->ref_count = 1;
 	rq->start_time = jiffies;
 	set_start_time_ns(rq);
+	rq->part = NULL;
 }
 EXPORT_SYMBOL(blk_rq_init);
 
@@ -1776,7 +1785,7 @@ static void blk_account_io_completion(struct request *req, unsigned int bytes)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 		part_stat_add(cpu, part, sectors[rw], bytes >> 9);
 		part_stat_unlock();
 	}
@@ -1796,13 +1805,15 @@ static void blk_account_io_done(struct request *req)
 		int cpu;
 
 		cpu = part_stat_lock();
-		part = disk_map_sector_rcu(req->rq_disk, blk_rq_pos(req));
+		part = req->part;
 
 		part_stat_inc(cpu, part, ios[rw]);
 		part_stat_add(cpu, part, ticks[rw], duration);
 		part_round_stats(cpu, part);
 		part_dec_in_flight(part, rw);
 
+		if (part->partno)
+			kref_put(&part->ref, __delete_partition);
 		part_stat_unlock();
 ...
From: Vivek Goyal
Date: Thursday, December 23, 2010 - 8:39 am

It might happen that kref_get(part) is called after kref_put() has been
called and reference has reached 0 and and delete_parition() call has
been scheduled as soon as rcu grace period is over. So if you do

Conceptually it kind of makes sense to me. So if even if we get the
pointer to partition under rcu_read_lock(), we will not account the IO

Do we have to check this part->partno both while taking and releasing
reference. Can't we take one extra reference for disk->part0, at
alloc_disk_node() time so that it is never freed and only freed when
disk is going away and gendisk is being freed.

That way, you don't have to differentiate between disk->part0 and rest
of the partitions while taking or dropping references.

Thanks
--

From: Jerome Marchand
Date: Thursday, December 23, 2010 - 10:04 am

We could do it in any way, as long as we don't end up trying to free
disk->part0. I choose not to touch part0->ref at all, but we could also
drop all the part->partno test, and get a reference on part0 when we use
it as a backup. I have no strong opinion about a way or an other.

--

From: Vivek Goyal
Date: Friday, December 24, 2010 - 12:29 pm

Ok, I personally like taking an extra reference to disk->part0 and put
a proper comment there so that at rest of the places we don't try to
differentiate between part0 and other partitions. 

Having said that I am not too concerned about it. It is just one of the
minor details. So post the new version of patch after testing.

Thanks
Vivek
--

From: Jerome Marchand
Date: Tuesday, January 4, 2011 - 8:52 am

Add kref_test_and_get() function, which atomically add a reference only if
refcount is not zero. This prevent to add a reference to an object that is
already being removed.

Signed-off-by: Jerome Marchand <jmarchan@redhat.com>
---
 include/linux/kref.h |    1 +
 lib/kref.c           |   16 ++++++++++++++++
 2 files changed, 17 insertions(+)

diff --git a/include/linux/kref.h b/include/linux/kref.h
index 6cc38fc..90b9e44 100644
--- a/include/linux/kref.h
+++ b/include/linux/kref.h
@@ -23,6 +23,7 @@ struct kref {
 
 void kref_init(struct kref *kref);
 void kref_get(struct kref *kref);
+int kref_test_and_get(struct kref *kref);
 int kref_put(struct kref *kref, void (*release) (struct kref *kref));
 
 #endif /* _KREF_H_ */
diff --git a/lib/kref.c b/lib/kref.c
index d3d227a..5f663b9 100644
--- a/lib/kref.c
+++ b/lib/kref.c
@@ -37,6 +37,22 @@ void kref_get(struct kref *kref)
 }
 
 /**
+ * kref_test_and_get - increment refcount for object only if refcount is not
+ * zero.
+ * @kref: object.
+ *
+ * Return non-zero if the refcount was incremented, 0 otherwise
+ */
+int kref_test_and_get(struct kref *kref)
+{
+	int ret;
+	smp_mb__before_atomic_inc();
+	ret = atomic_inc_not_zero(&kref->refcount);
+	smp_mb__after_atomic_inc();
+	return ret;
+}
+
+/**
  * kref_put - decrement refcount for object.
  * @kref: object.
  * @release: pointer to the function that will clean up the object when the
--

From: Jerome Marchand
Date: Tuesday, January 4, 2011 - 8:55 am

/proc/diskstats would display a strange output as follows.

$ cat /proc/diskstats |grep sda
   8       0 sda 90524 7579 102154 20464 0 0 0 0 0 14096 20089
   8       1 sda1 19085 1352 21841 4209 0 0 0 0 4294967064 15689 4293424691
                                                ~~~~~~~~~~
   8       2 sda2 71252 3624 74891 15950 0 0 0 0 232 23995 1562390
   8       3 sda3 54 487 2188 92 0 0 0 0 0 88 92
   8       4 sda4 4 0 8 0 0 0 0 0 0 0 0
   8       5 sda5 81 2027 2130 138 0 0 0 0 0 87 137

Its reason is the wrong way of accounting hd_struct->in_flight. When a bio is
merged into a request belongs to different partition by ELEVATOR_FRONT_MERGE.

The detailed root cause is as follows.

Assuming that there are two partition, sda1 and sda2.

1. A request for sda2 is in request_queue. Hence sda1's hd_struct->in_flight
   is 0 and sda2's one is 1.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

2. A bio belongs to sda1 is issued and is merged into the request mentioned on
   step1 by ELEVATOR_BACK_MERGE. The first sector of the request is changed
   from sda2 region to sda1 region. However the two partition's
   hd_struct->in_flight are not changed.

        | hd_struct->in_flight
   ---------------------------
   sda1 |          0
   sda2 |          1
   ---------------------------

3. The request is finished and blk_account_io_done() is called. In this case,
   sda2's hd_struct->in_flight, not a sda1's one, is decremented.

        | hd_struct->in_flight
   ---------------------------
   sda1 |         -1
   sda2 |          1
   ---------------------------

The patch fixes the problem by caching the partition lookup
inside the request structure, hence making sure that the increment
and decrement will always happen on the same partition struct. This
also speeds up IO with accounting enabled, since it cuts down on
the number of lookups we have to do.

Also add a refcount ...
From: Greg KH
Date: Tuesday, January 4, 2011 - 2:00 pm

No, don't do this, use a kref correctly and no such function should be

That is the function that should properly increment the reference count
on the object.  If the object is "being removed", then it will return

At all.

thanks,

greg k-h
--

From: Eric Dumazet
Date: Tuesday, January 4, 2011 - 9:05 am

atomic_inc_not_zero() has full memory barrier semantic already, you dont
need smp_mb__before_atomic_inc() or smp_mb__after_atomic_inc();



--

From: Greg KH
Date: Tuesday, January 4, 2011 - 1:57 pm

We just removed a function like this recently as it really isn't the
solution for what you need here at all.

Please use a lock that protects the desctruction of the kref, like the
rest of the kernel, to keep someone from grabing a reference while the
device could be going away.  That's the only way to safely do this.

See the archives for why this overall, isn't a good idea for a kref to
have this type of function.

thanks,

greg k-h
--

Previous thread: [RFC][PATCH 0/2 v6] blk: fix a wrong accounting of hd_struct->in_flight by Yasuaki Ishimatsu on Monday, December 6, 2010 - 2:44 am. (1 message)

Next thread: [PATCH 1/2] Change a argument of disk_map_sector_rcu() to hd_part->start_sect from req->__sector by Yasuaki Ishimatsu on Monday, December 6, 2010 - 2:45 am. (1 message)