[PATCH 3/3] blkio: Increment the blkio cgroup stats for real now

Previous thread: 2.6.33.1 pwc (quickcam pro 4000 is broken) by Justin Piszcz on Thursday, April 1, 2010 - 2:45 pm. (1 message)

Next thread: Re: [PATCH] drivers:staging: sources for ST core by Pavan Savoy on Thursday, April 1, 2010 - 3:43 pm. (2 messages)
From: Divyesh Shah
Date: Thursday, April 1, 2010 - 3:00 pm

The following series implements some additional stats for IO controller.

These stats have helped us debug issues with earlier IO controller
versions and should be useful now as well.
We've been using these stats for monitoring and debugging problems after the
fact as these stats can be collected and stored for later use.

One might argue that most of this information can be exported using blktrace
when debugging. However, blktrace has non-trivial performance impact and
cannot be always turned on. These stats provide a way for continuous monitoring
without losing much performance on rotational disks. We've been able to look
at these stats and debug issues after problems have been reported in the wild
and understand the IO pattern of the affected workloads.
Some of these stats are also a good data source for high-level analysis and
capacity planning.

This patchset adds 4 stats and I will send out another patchset later for
stats like io_merged and some stats that can be turned on only for
debugging - idle_time (total time spent idling for this blkio_group),
wait_time (total time spent by the blkio_group waiting before any one of its
queues got a timeslice). I've tried to breakdown the stats and sent the most
basic ones here.
---

Divyesh Shah (3):
      Increment the blkio cgroup stats for real now.
      Add io controller stats like
      Remove per-cfqq nr_sectors as we'll be passing that info at request dispatch


 block/blk-cgroup.c     |  239 ++++++++++++++++++++++++++++++++++++++++++++----
 block/blk-cgroup.h     |   57 +++++++++--
 block/blk-core.c       |    6 +
 block/cfq-iosched.c    |   14 +--
 include/linux/blkdev.h |   20 ++++
 5 files changed, 291 insertions(+), 45 deletions(-)

-- 
Divyesh
--

From: Divyesh Shah
Date: Thursday, April 1, 2010 - 3:01 pm

that info at request dispatch with other stats now. This patch removes the
existing support for accounting sectors for a blkio_group. This will be added
back differently in the next two patches.

Signed-off-by: Divyesh Shah<dpshah@google.com>
---

 block/blk-cgroup.c  |    3 +--
 block/blk-cgroup.h  |    6 ++----
 block/cfq-iosched.c |   10 ++--------
 3 files changed, 5 insertions(+), 14 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4b686ad..5be3981 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -56,10 +56,9 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
 
 void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
-			unsigned long time, unsigned long sectors)
+						unsigned long time)
 {
 	blkg->time += time;
-	blkg->sectors += sectors;
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
 
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 8ccc204..fe44517 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -106,7 +106,7 @@ extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
 extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
 						void *key);
 void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
-			unsigned long time, unsigned long sectors);
+						unsigned long time);
 #else
 struct cgroup;
 static inline struct blkio_cgroup *
@@ -123,8 +123,6 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
 static inline struct blkio_group *
 blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
 static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
-			unsigned long time, unsigned long sectors)
-{
-}
+						unsigned long time) {}
 #endif
 #endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ef1680b..c18e348 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -141,8 +141,6 @@ ...
From: Divyesh Shah
Date: Thursday, April 1, 2010 - 3:01 pm

- io_service_time
- io_wait_time
- io_serviced
- io_service_bytes

These stats are accumulated per operation type helping us to distinguish between
read and write, and sync and async IO. This patch does not increment any of
these stats.

Signed-off-by: Divyesh Shah<dpshah@google.com>
---

 block/blk-cgroup.c  |  178 ++++++++++++++++++++++++++++++++++++++++++++++-----
 block/blk-cgroup.h  |   39 +++++++++--
 block/cfq-iosched.c |    2 -
 3 files changed, 194 insertions(+), 25 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5be3981..ad6843f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -55,12 +55,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 }
 EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
 
-void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
-						unsigned long time)
+void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
 {
-	blkg->time += time;
+	unsigned long flags;
+
+	spin_lock_irqsave(&blkg->stats_lock, flags);
+	blkg->stats.time += time;
+	spin_unlock_irqrestore(&blkg->stats_lock, flags);
 }
-EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
+EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
 
 void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
 			struct blkio_group *blkg, void *key, dev_t dev)
@@ -170,13 +173,121 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
 	return 0;
 }
 
-#define SHOW_FUNCTION_PER_GROUP(__VAR)					\
+static int
+blkiocg_reset_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+	struct blkio_cgroup *blkcg;
+	struct blkio_group *blkg;
+	struct hlist_node *n;
+	struct blkio_group_stats *stats;
+
+	blkcg = cgroup_to_blkio_cgroup(cgroup);
+	spin_lock_irq(&blkcg->lock);
+	hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
+		spin_lock(&blkg->stats_lock);
+		stats = &blkg->stats;
+		memset(stats, 0, sizeof(struct ...
From: Vivek Goyal
Date: Friday, April 2, 2010 - 11:10 am

Hi Divyesh,

Some more description what exactly these stats are will be helpful. Please
also update Documentation/cgroup/blkio-controller.txt file appropriately.

get_key_name() should be static? Can we prefix all these internal function
names with blkio? 

Do we have to introduce this separate notion of char *disk_id. Can we
simply stick to dev_t and do the sprintf like functions to convert that


Can we move above define to the beginning of the file. So that it is

static function? blkio_get_typed_stat()?


I think now so many macros to define different kind of functions is
becoming really confusing. How about creating a two dimensional stat
array which is indexed by type of io stat like "io_service_time" and
then indexed by sub io type that is "READ, WRITE, SYNC" etc. And we
can define a single function to read all kind of stats.

That way we will not have to define one GET_STAT_INDEXED() and GET_STAT()
for each unindexed variable. If there are not multiple functions then,
then everywhere we don't have to pass around a function pointer to read
the variable stat and code should become much simpler. Example, code
follows.


enum stat_type {
	BLKIO_STAT_TIME
	BLKIO_STAT_SECTORS
	BLKIO_STAT_WAIT_TIME
	BLKIO_STAT_SERVICE_TIME
}

enum stat_sub_type {
	BLKIO_STAT_READ
	BLKIO_STAT_WRITE
	BLKIO_STAT_SYNC
}

blkio_get_stat_var(enum stat_type var, enum stat_sub_type var_type) {

	if (var == BLKIO_STAT_TIME)
		reutrn blkg->stat.time

	/* Same as above for sectors */

	return blkg->stat[var][var_type];
}


blkiocg_reset_write() is invoked if a user does some write on any of the
above files. This will reset all the stats. So if I do echo x >
blkio.time, I will see all other files being reset? I think this is
little counter intutive.

Either we need to reset only those specific stats (the file being written
to) or may be we can provide a separate file "blkio.reset_stats" to reset
all the stats?

Secondly, instead of providing separate files for all the ...
From: Divyesh Shah
Date: Friday, April 2, 2010 - 1:53 pm

Vivek, thanks for the detailed review. Comments inline. I'll send a V2
patchset soon.




I'm not sure adding this level of complexity (2nd dimension for stats)
and the numerous if (..) statements (or switch) is warranted when they
only apply for the get_stat() and get_typed_stat() functions. This

I added a note in the Documentation file that writing an int to any of
the stats should reset all. I don't get the point of introducing
another file for this and would want stats to be cleared all together
in order to have say io_service_time and io_serviced in sync, i.e.,
the io_service_time should always be the total service time for the

--

From: Vivek Goyal
Date: Monday, April 5, 2010 - 7:45 am

On Fri, Apr 02, 2010 at 01:53:30PM -0700, Divyesh Shah wrote:


I think we need only two conditions to check. BLKIO_STAT_TIME and
BLKIO_STAT_SECTROS. Rest of these fall into typed category and we

I think this really is simplification. This also gets rid of all the
function pointers you are passing around to differentiate between two
types of stats. Also gets rid of special macros to generate one function
each for one kind of stat. These dynamic macros make code hard to read
and understand. 

Can you please try this two dimensional array for stats approach once.

I agree that reset should mean reset of all stats. It is a matter of
whether to introduce a separate file or write to one file resets all the
other files. Though I find writting to one file resetting the stats of
other file un-intitutive, at the same time don't feel too excited about
introducing a new file just for reset.

Do we really need to introduce reset interface. Changing the ioscheduler to
noop or deadline and then setting it back to cfq should reset the stats for
all the cgroups.

In fact we can probably get rid of per cgroup reset stats interface. We
will then get rid of extra stat spin lock. If somebody wants to reset
the stats, just change the ioscheduler.

This will work until and unless somebody wants to reset the stats of one
cgroup but not the other.

Thanks
Vivek
--

From: Divyesh Shah
Date: Monday, April 5, 2010 - 3:16 pm

There are 2 conditions now but as I mentioned that this is one of 2-3
patchsets for adding stats. More stats will increase the special
casing. I do agree the macros aren't exactly easy to read. I'll try

As mentioned on the other thread this is exactly what I find the main
use case of this interface. Once again, if you feel too strongly about
it we can have a separate reset_stats interface to make it more
--

From: Vivek Goyal
Date: Friday, April 2, 2010 - 11:17 am

On Thu, Apr 01, 2010 at 03:01:24PM -0700, Divyesh Shah wrote:


Hi Divyesh,

I think now you are exporting blkio.time in ns instead of ms? 

- First of all your are breaking ABI.
- Secondly, how does that help. You are capturing all your slice used
  stats in ms. You already lost any ns granularity. What's the point
  in converting these to ns now?
- Does user space software really need that fine grained accounting. If
  this information is used for charging purposes, isn't ms good enough.
 
Thanks
Vivek
--

From: Divyesh Shah
Date: Friday, April 2, 2010 - 11:54 am

This is an error.. I should be exporting this in ms. Will send out a
patchset with the change once I also address you other comments in a
--

From: Divyesh Shah
Date: Thursday, April 1, 2010 - 3:01 pm

We also add start_time_ns and io_start_time_ns fields to struct request
here to record the time when a request is created and when it is
dispatched to device. We use ns uints here as ms and jiffies are
not very useful for non-rotational media.

Signed-off-by: Divyesh Shah<dpshah@google.com>
---

 block/blk-cgroup.c     |   60 ++++++++++++++++++++++++++++++++++++++++++++++--
 block/blk-cgroup.h     |   14 +++++++++--
 block/blk-core.c       |    6 +++--
 block/cfq-iosched.c    |    4 ++-
 include/linux/blkdev.h |   20 +++++++++++++++-
 5 files changed, 95 insertions(+), 9 deletions(-)

diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ad6843f..9af7257 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
 #include <linux/kdev_t.h>
 #include <linux/module.h>
 #include <linux/err.h>
+#include <linux/blkdev.h>
 #include "blk-cgroup.h"
 
 static DEFINE_SPINLOCK(blkio_list_lock);
@@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
 }
 EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
 
+/*
+ * Add to the appropriate stat variable depending on the request type.
+ * This should be called with the blkg->stats_lock held.
+ */
+void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
+{
+	if (flags & REQ_RW)
+		stat[IO_WRITE] += add;
+	else
+		stat[IO_READ] += add;
+	/*
+	 * Everywhere in the block layer, an IO is treated as sync if it is a
+	 * read or a SYNC write. We follow the same norm.
+	 */
+	if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
+		stat[IO_SYNC] += add;
+	else
+		stat[IO_ASYNC] += add;
+}
+
 void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
 {
 	unsigned long flags;
@@ -65,6 +86,41 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
 }
 EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
 
+void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
+						struct request *rq)
+{
+	struct blkio_group_stats ...
From: Vivek Goyal
Date: Friday, April 2, 2010 - 12:10 pm

Hi Divyesh,

Can we have any request based information limited to cfq and not put that
in blkio-cgroup. The reason being that I am expecting that some kind of
max bw policy interface will not necessarily be implemented at CFQ
level. We might have to implement it at higher level so that it can
work with all dm/md devices. If that's the case, then it might very well
be either a bio based interface also.

So just keeping that possibility in mind, can we keep blk-cgroup as
generic as possible and not necessarily make it dependent on "struct
request".

If you implement, two dimensional arrays for stats then we can have
following function.

blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val)

The idea is that let policy specify what is the type of IO completed,
read, or write or sync, async and lets not make assumption in blkio-cgroup

io_service_bytes is esentially same as blkio.sectors but it is giving
info in bytes instead of sectors and it is breaking down IO into subtypes.
That's fine if it is helping you.

instead of blk_rq_sectors(rq) << 9, you can probably use blk_rq_bytes().


Same here that knowledge of rq, move into CFQ and keep blk-cgroup.c free
--

From: Divyesh Shah
Date: Friday, April 2, 2010 - 4:36 pm

Ok. I do understand the motivation for keeping the request related
info out of blk-cgroup. Everything except the rq->cmd_flags can be
easily done away with. Maybe I'll need to have CFQ send the sync and
direction bits as args to the functions that need it. Not ideal coz

I would want to avoid calls like these from CFQ into the blkcg code
because many CFQ events trigger update for multiple stats (you'll see
more with stats in later patchsets) and doing these calls
independently for each stat would mean that we would also need to grab

Good point. However, if I'm passing rq related data from CFQ now, I'd
--

From: Vivek Goyal
Date: Monday, April 5, 2010 - 8:12 am

I understand the need to club the updates and reduce the need of taking
stats_lock multiple times. I was thinking of any of following.

- Get rid of reset interface per cgroup. Rely on changing ioscheduler on
  request queue and that will get rid of stats_lock entirely.

- Can we use a function blkio_add_stat() with variable number of arguments
  so that more than one stat can be updated in a single call?

If you have other ideas to implement it without assuming "struct rq" in
blk-cgroup, please do that.

Thanks
Vivek
--

From: Divyesh Shah
Date: Monday, April 5, 2010 - 9:53 am

This takes away the ability to reset stats at will which is very


I've already got rid of any rq assumptions in blk-cgroup. The only
place where we're using rq is for rq_start_time_ns() and
rq_io-start_time_ns() functions but they are not used by the
blk-cgroup code directly (only CFQ uses them). For another user of
--

From: Vivek Goyal
Date: Monday, April 5, 2010 - 10:29 am

What do you mean by "reset stats at will"? You can change ioscheduler at
will and reset stats? The only possible issue I could think of is that only
admin can change the ioscheduler in providing per cgroup interface, one can
give write permission to indiviaul user and allow users to reset stats.

I am not sure in practice why would you allow a user to reset stats.

Ok, you have made blkg->stats_lock visible to cfq. That's fine too.
Can you rename io_add_stat to blkio_add_stat. I think in V2 also, it is
still io_add_stat.

Vivek
--

From: Divyesh Shah
Date: Monday, April 5, 2010 - 3:06 pm

I should've said reset stats for a given cgroup at will. This is
mostly needed when debugging and for automated tests where we're
testing multiple cgroups and/or IO controller behavior between
different workloads (sync readers, buffered writers, sync_writers,
directIO read/write, etc). Resetting cgroup stats is something we use
regularly and depend on. If you feel very strongly about this being
counter-intuitive I can add a reset_stats file and make the other

By visible you mean accessible through cfqg->blkg, right? However, I

--

From: Jens Axboe
Date: Thursday, April 1, 2010 - 11:45 pm

Thanks, I think this makes sense. I've applied it for 2.6.35.

-- 
Jens Axboe

--

Previous thread: 2.6.33.1 pwc (quickcam pro 4000 is broken) by Justin Piszcz on Thursday, April 1, 2010 - 2:45 pm. (1 message)

Next thread: Re: [PATCH] drivers:staging: sources for ST core by Pavan Savoy on Thursday, April 1, 2010 - 3:43 pm. (2 messages)