The following series implements some additional stats for IO controller.
These stats have helped us debug issues with earlier IO controller
versions and should be useful now as well.
We've been using these stats for monitoring and debugging problems after the
fact as these stats can be collected and stored for later use.
One might argue that most of this information can be exported using blktrace
when debugging. However, blktrace has non-trivial performance impact and
cannot be always turned on. These stats provide a way for continuous monitoring
without losing much performance on rotational disks. We've been able to look
at these stats and debug issues after problems have been reported in the wild
and understand the IO pattern of the affected workloads.
Some of these stats are also a good data source for high-level analysis and
capacity planning.
This patchset adds 4 stats and I will send out another patchset later for
stats like io_merged and some stats that can be turned on only for
debugging - idle_time (total time spent idling for this blkio_group),
wait_time (total time spent by the blkio_group waiting before any one of its
queues got a timeslice). I've tried to breakdown the stats and sent the most
basic ones here.
---
Divyesh Shah (3):
Increment the blkio cgroup stats for real now.
Add io controller stats like
Remove per-cfqq nr_sectors as we'll be passing that info at request dispatch
block/blk-cgroup.c | 239 ++++++++++++++++++++++++++++++++++++++++++++----
block/blk-cgroup.h | 57 +++++++++--
block/blk-core.c | 6 +
block/cfq-iosched.c | 14 +--
include/linux/blkdev.h | 20 ++++
5 files changed, 291 insertions(+), 45 deletions(-)
--
Divyesh
--
that info at request dispatch with other stats now. This patch removes the
existing support for accounting sectors for a blkio_group. This will be added
back differently in the next two patches.
Signed-off-by: Divyesh Shah<dpshah@google.com>
---
block/blk-cgroup.c | 3 +--
block/blk-cgroup.h | 6 ++----
block/cfq-iosched.c | 10 ++--------
3 files changed, 5 insertions(+), 14 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 4b686ad..5be3981 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -56,10 +56,9 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time, unsigned long sectors)
+ unsigned long time)
{
blkg->time += time;
- blkg->sectors += sectors;
}
EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
diff --git a/block/blk-cgroup.h b/block/blk-cgroup.h
index 8ccc204..fe44517 100644
--- a/block/blk-cgroup.h
+++ b/block/blk-cgroup.h
@@ -106,7 +106,7 @@ extern int blkiocg_del_blkio_group(struct blkio_group *blkg);
extern struct blkio_group *blkiocg_lookup_group(struct blkio_cgroup *blkcg,
void *key);
void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time, unsigned long sectors);
+ unsigned long time);
#else
struct cgroup;
static inline struct blkio_cgroup *
@@ -123,8 +123,6 @@ blkiocg_del_blkio_group(struct blkio_group *blkg) { return 0; }
static inline struct blkio_group *
blkiocg_lookup_group(struct blkio_cgroup *blkcg, void *key) { return NULL; }
static inline void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time, unsigned long sectors)
-{
-}
+ unsigned long time) {}
#endif
#endif /* _BLK_CGROUP_H */
diff --git a/block/cfq-iosched.c b/block/cfq-iosched.c
index ef1680b..c18e348 100644
--- a/block/cfq-iosched.c
+++ b/block/cfq-iosched.c
@@ -141,8 +141,6 @@ ...- io_service_time
- io_wait_time
- io_serviced
- io_service_bytes
These stats are accumulated per operation type helping us to distinguish between
read and write, and sync and async IO. This patch does not increment any of
these stats.
Signed-off-by: Divyesh Shah<dpshah@google.com>
---
block/blk-cgroup.c | 178 ++++++++++++++++++++++++++++++++++++++++++++++-----
block/blk-cgroup.h | 39 +++++++++--
block/cfq-iosched.c | 2 -
3 files changed, 194 insertions(+), 25 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5be3981..ad6843f 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -55,12 +55,15 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
}
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
-void blkiocg_update_blkio_group_stats(struct blkio_group *blkg,
- unsigned long time)
+void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
{
- blkg->time += time;
+ unsigned long flags;
+
+ spin_lock_irqsave(&blkg->stats_lock, flags);
+ blkg->stats.time += time;
+ spin_unlock_irqrestore(&blkg->stats_lock, flags);
}
-EXPORT_SYMBOL_GPL(blkiocg_update_blkio_group_stats);
+EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
void blkiocg_add_blkio_group(struct blkio_cgroup *blkcg,
struct blkio_group *blkg, void *key, dev_t dev)
@@ -170,13 +173,121 @@ blkiocg_weight_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
return 0;
}
-#define SHOW_FUNCTION_PER_GROUP(__VAR) \
+static int
+blkiocg_reset_write(struct cgroup *cgroup, struct cftype *cftype, u64 val)
+{
+ struct blkio_cgroup *blkcg;
+ struct blkio_group *blkg;
+ struct hlist_node *n;
+ struct blkio_group_stats *stats;
+
+ blkcg = cgroup_to_blkio_cgroup(cgroup);
+ spin_lock_irq(&blkcg->lock);
+ hlist_for_each_entry(blkg, n, &blkcg->blkg_list, blkcg_node) {
+ spin_lock(&blkg->stats_lock);
+ stats = &blkg->stats;
+ memset(stats, 0, sizeof(struct ...Hi Divyesh,
Some more description what exactly these stats are will be helpful. Please
also update Documentation/cgroup/blkio-controller.txt file appropriately.
get_key_name() should be static? Can we prefix all these internal function
names with blkio?
Do we have to introduce this separate notion of char *disk_id. Can we
simply stick to dev_t and do the sprintf like functions to convert that
Can we move above define to the beginning of the file. So that it is
static function? blkio_get_typed_stat()?
I think now so many macros to define different kind of functions is
becoming really confusing. How about creating a two dimensional stat
array which is indexed by type of io stat like "io_service_time" and
then indexed by sub io type that is "READ, WRITE, SYNC" etc. And we
can define a single function to read all kind of stats.
That way we will not have to define one GET_STAT_INDEXED() and GET_STAT()
for each unindexed variable. If there are not multiple functions then,
then everywhere we don't have to pass around a function pointer to read
the variable stat and code should become much simpler. Example, code
follows.
enum stat_type {
BLKIO_STAT_TIME
BLKIO_STAT_SECTORS
BLKIO_STAT_WAIT_TIME
BLKIO_STAT_SERVICE_TIME
}
enum stat_sub_type {
BLKIO_STAT_READ
BLKIO_STAT_WRITE
BLKIO_STAT_SYNC
}
blkio_get_stat_var(enum stat_type var, enum stat_sub_type var_type) {
if (var == BLKIO_STAT_TIME)
reutrn blkg->stat.time
/* Same as above for sectors */
return blkg->stat[var][var_type];
}
blkiocg_reset_write() is invoked if a user does some write on any of the
above files. This will reset all the stats. So if I do echo x >
blkio.time, I will see all other files being reset? I think this is
little counter intutive.
Either we need to reset only those specific stats (the file being written
to) or may be we can provide a separate file "blkio.reset_stats" to reset
all the stats?
Secondly, instead of providing separate files for all the ...Vivek, thanks for the detailed review. Comments inline. I'll send a V2 patchset soon. I'm not sure adding this level of complexity (2nd dimension for stats) and the numerous if (..) statements (or switch) is warranted when they only apply for the get_stat() and get_typed_stat() functions. This I added a note in the Documentation file that writing an int to any of the stats should reset all. I don't get the point of introducing another file for this and would want stats to be cleared all together in order to have say io_service_time and io_serviced in sync, i.e., the io_service_time should always be the total service time for the --
On Fri, Apr 02, 2010 at 01:53:30PM -0700, Divyesh Shah wrote: I think we need only two conditions to check. BLKIO_STAT_TIME and BLKIO_STAT_SECTROS. Rest of these fall into typed category and we I think this really is simplification. This also gets rid of all the function pointers you are passing around to differentiate between two types of stats. Also gets rid of special macros to generate one function each for one kind of stat. These dynamic macros make code hard to read and understand. Can you please try this two dimensional array for stats approach once. I agree that reset should mean reset of all stats. It is a matter of whether to introduce a separate file or write to one file resets all the other files. Though I find writting to one file resetting the stats of other file un-intitutive, at the same time don't feel too excited about introducing a new file just for reset. Do we really need to introduce reset interface. Changing the ioscheduler to noop or deadline and then setting it back to cfq should reset the stats for all the cgroups. In fact we can probably get rid of per cgroup reset stats interface. We will then get rid of extra stat spin lock. If somebody wants to reset the stats, just change the ioscheduler. This will work until and unless somebody wants to reset the stats of one cgroup but not the other. Thanks Vivek --
There are 2 conditions now but as I mentioned that this is one of 2-3 patchsets for adding stats. More stats will increase the special casing. I do agree the macros aren't exactly easy to read. I'll try As mentioned on the other thread this is exactly what I find the main use case of this interface. Once again, if you feel too strongly about it we can have a separate reset_stats interface to make it more --
On Thu, Apr 01, 2010 at 03:01:24PM -0700, Divyesh Shah wrote: Hi Divyesh, I think now you are exporting blkio.time in ns instead of ms? - First of all your are breaking ABI. - Secondly, how does that help. You are capturing all your slice used stats in ms. You already lost any ns granularity. What's the point in converting these to ns now? - Does user space software really need that fine grained accounting. If this information is used for charging purposes, isn't ms good enough. Thanks Vivek --
This is an error.. I should be exporting this in ms. Will send out a patchset with the change once I also address you other comments in a --
We also add start_time_ns and io_start_time_ns fields to struct request
here to record the time when a request is created and when it is
dispatched to device. We use ns uints here as ms and jiffies are
not very useful for non-rotational media.
Signed-off-by: Divyesh Shah<dpshah@google.com>
---
block/blk-cgroup.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++--
block/blk-cgroup.h | 14 +++++++++--
block/blk-core.c | 6 +++--
block/cfq-iosched.c | 4 ++-
include/linux/blkdev.h | 20 +++++++++++++++-
5 files changed, 95 insertions(+), 9 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index ad6843f..9af7257 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -15,6 +15,7 @@
#include <linux/kdev_t.h>
#include <linux/module.h>
#include <linux/err.h>
+#include <linux/blkdev.h>
#include "blk-cgroup.h"
static DEFINE_SPINLOCK(blkio_list_lock);
@@ -55,6 +56,26 @@ struct blkio_cgroup *cgroup_to_blkio_cgroup(struct cgroup *cgroup)
}
EXPORT_SYMBOL_GPL(cgroup_to_blkio_cgroup);
+/*
+ * Add to the appropriate stat variable depending on the request type.
+ * This should be called with the blkg->stats_lock held.
+ */
+void io_add_stat(uint64_t *stat, uint64_t add, unsigned int flags)
+{
+ if (flags & REQ_RW)
+ stat[IO_WRITE] += add;
+ else
+ stat[IO_READ] += add;
+ /*
+ * Everywhere in the block layer, an IO is treated as sync if it is a
+ * read or a SYNC write. We follow the same norm.
+ */
+ if (!(flags & REQ_RW) || flags & REQ_RW_SYNC)
+ stat[IO_SYNC] += add;
+ else
+ stat[IO_ASYNC] += add;
+}
+
void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
{
unsigned long flags;
@@ -65,6 +86,41 @@ void blkiocg_update_timeslice_used(struct blkio_group *blkg, unsigned long time)
}
EXPORT_SYMBOL_GPL(blkiocg_update_timeslice_used);
+void blkiocg_update_request_dispatch_stats(struct blkio_group *blkg,
+ struct request *rq)
+{
+ struct blkio_group_stats ...Hi Divyesh, Can we have any request based information limited to cfq and not put that in blkio-cgroup. The reason being that I am expecting that some kind of max bw policy interface will not necessarily be implemented at CFQ level. We might have to implement it at higher level so that it can work with all dm/md devices. If that's the case, then it might very well be either a bio based interface also. So just keeping that possibility in mind, can we keep blk-cgroup as generic as possible and not necessarily make it dependent on "struct request". If you implement, two dimensional arrays for stats then we can have following function. blkio_add_stat(enum stat_type var enum stat_sub_type var_type, u64 val) The idea is that let policy specify what is the type of IO completed, read, or write or sync, async and lets not make assumption in blkio-cgroup io_service_bytes is esentially same as blkio.sectors but it is giving info in bytes instead of sectors and it is breaking down IO into subtypes. That's fine if it is helping you. instead of blk_rq_sectors(rq) << 9, you can probably use blk_rq_bytes(). Same here that knowledge of rq, move into CFQ and keep blk-cgroup.c free --
Ok. I do understand the motivation for keeping the request related info out of blk-cgroup. Everything except the rq->cmd_flags can be easily done away with. Maybe I'll need to have CFQ send the sync and direction bits as args to the functions that need it. Not ideal coz I would want to avoid calls like these from CFQ into the blkcg code because many CFQ events trigger update for multiple stats (you'll see more with stats in later patchsets) and doing these calls independently for each stat would mean that we would also need to grab Good point. However, if I'm passing rq related data from CFQ now, I'd --
I understand the need to club the updates and reduce the need of taking stats_lock multiple times. I was thinking of any of following. - Get rid of reset interface per cgroup. Rely on changing ioscheduler on request queue and that will get rid of stats_lock entirely. - Can we use a function blkio_add_stat() with variable number of arguments so that more than one stat can be updated in a single call? If you have other ideas to implement it without assuming "struct rq" in blk-cgroup, please do that. Thanks Vivek --
This takes away the ability to reset stats at will which is very I've already got rid of any rq assumptions in blk-cgroup. The only place where we're using rq is for rq_start_time_ns() and rq_io-start_time_ns() functions but they are not used by the blk-cgroup code directly (only CFQ uses them). For another user of --
What do you mean by "reset stats at will"? You can change ioscheduler at will and reset stats? The only possible issue I could think of is that only admin can change the ioscheduler in providing per cgroup interface, one can give write permission to indiviaul user and allow users to reset stats. I am not sure in practice why would you allow a user to reset stats. Ok, you have made blkg->stats_lock visible to cfq. That's fine too. Can you rename io_add_stat to blkio_add_stat. I think in V2 also, it is still io_add_stat. Vivek --
I should've said reset stats for a given cgroup at will. This is mostly needed when debugging and for automated tests where we're testing multiple cgroups and/or IO controller behavior between different workloads (sync readers, buffered writers, sync_writers, directIO read/write, etc). Resetting cgroup stats is something we use regularly and depend on. If you feel very strongly about this being counter-intuitive I can add a reset_stats file and make the other By visible you mean accessible through cfqg->blkg, right? However, I --
Thanks, I think this makes sense. I've applied it for 2.6.35. -- Jens Axboe --
