Re: [RFC][PATCH -mm 0/5] cgroup: block device i/o controller (v9)

Previous thread: [PATCH] byteorder: use generic C version for value byteswapping by Harvey Harrison on Wednesday, August 27, 2008 - 9:06 am. (1 message)

Next thread: [RFC][PATCH -mm 1/5] i/o controller documentation by Andrea Righi on Wednesday, August 27, 2008 - 9:07 am. (5 messages)
From: Andrea Righi
Date: Wednesday, August 27, 2008 - 9:07 am

The objective of the i/o controller is to improve i/o performance
predictability of different cgroups sharing the same block devices.

Compared to other priority/weight-based solutions, the approach used by this
controller is to explicitly choke applications' requests that directly (or
indirectly) generate i/o activity in the system.

The direct bandwidth and/or iops limiting method has the advantage of improving
the performance predictability at the cost of reducing, in general, the overall
performance of the system (in terms of throughput).
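
A minimal user-space sketch of the direct bandwidth-limiting idea, assuming
token-bucket style accounting. This is not the code from the patchset:
struct iot_limit, iot_init() and iot_charge() are hypothetical names chosen
for illustration. The point is only to show how a byte limit can be turned
into a sleep time for the task that generates the i/o.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Hypothetical per-cgroup, per-device limit state (illustrative only). */
struct iot_limit {
    uint64_t bytes_per_sec; /* configured bandwidth cap */
    uint64_t bucket_bytes;  /* maximum burst allowed */
    int64_t  tokens;        /* bytes currently available; negative = deficit */
    uint64_t last_ns;       /* timestamp of the last refill */
};

static void iot_init(struct iot_limit *l, uint64_t bps, uint64_t burst,
                     uint64_t now_ns)
{
    assert(bps > 0);
    l->bytes_per_sec = bps;
    l->bucket_bytes = burst;
    l->tokens = (int64_t)burst;
    l->last_ns = now_ns;
}

/*
 * Account `bytes` of i/o submitted at time now_ns and return how many
 * nanoseconds the submitting task should sleep (0 if under the limit).
 * Sketch only: very large elapsed times or rates could overflow here.
 */
static uint64_t iot_charge(struct iot_limit *l, uint64_t bytes,
                           uint64_t now_ns)
{
    uint64_t elapsed_ns = now_ns - l->last_ns;
    int64_t refill = (int64_t)(elapsed_ns * l->bytes_per_sec / NSEC_PER_SEC);

    /* Refill the bucket for the time that has passed, capped at the burst. */
    l->last_ns = now_ns;
    l->tokens += refill;
    if (l->tokens > (int64_t)l->bucket_bytes)
        l->tokens = (int64_t)l->bucket_bytes;

    /* Charge the request; a negative balance means the cap was exceeded. */
    l->tokens -= (int64_t)bytes;
    if (l->tokens >= 0)
        return 0;

    /* Sleep long enough for the refill to pay back the deficit. */
    return (uint64_t)(-l->tokens) * NSEC_PER_SEC / l->bytes_per_sec;
}
```

With a 1 MiB/s limit and a 1 MiB burst, a first 1 MiB request passes without
delay, while a further 512 KiB submitted at the same instant yields a sleep
of half a second. That is the predictability vs. throughput trade-off in
miniature.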

Detailed information about the design, its goals and usage is described in
the documentation.

Patchset against 2.6.27-rc1-mm1.

The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

This patchset is an experimental implementation; it includes functional
differences with respect to the previous versions (see the changelog below),
and I haven't done much testing yet. So, comments are really welcome.

Changelog: (v8 -> v9)

* introduce struct res_counter_ratelimit as a generic structure to implement
  throttling-based cgroup subsystems
* removed the throttling hooks from the page cache (set_page_dirty): set a
  single throttling hook in submit_bio() both for read and write operations; a
  generic process that is dirtying pages on a limited block device (for the
  cgroup it belongs to) is forced to flush the same amount of pages back to the
  block device (in this way write operations are forced to occur in the same IO
  context of the process that actually generated the IO)
* collect per cgroup, block device and task throttling statistics (throttle
  counter and total time slept for throttling) and export them to userspace
  through blockio.throttlcnt (in the cgroup filesystem) and
  /proc/PID/io-throttle-stat (per-task statistics)
* fair throttling: simple attempt to distribute the sleeps equally among all
  the tasks belonging to the same cgroup; instead of ...
From: Vivek Goyal
Date: Tuesday, September 2, 2008 - 11:06 am

Hi Andrea,

I was checking out the past discussion on this topic and there seemed to
be two kinds of people: those who wanted to control max bandwidth and those
who liked the proportional bandwidth approach (dm-ioband folks).

I was just wondering, is it possible to have both approaches and let
users decide at run time which one they want to use (something like
the way users can choose io schedulers)?

Thanks
Vivek
--

From: Andrea Righi
Date: Tuesday, September 2, 2008 - 1:50 pm

Hi Vivek,

yes, sounds reasonable (adding the proportional bandwidth control to my
TODO list).

Right now I've a totally experimental patch to add the ionice-like
functionality (it's not the same but it's quite similar to the
proportional bandwidth feature) on-top-of my IO controller. See below.

The patch is not very well tested; I don't even know if it applies
cleanly to the latest io-throttle patch I posted, or if it has runtime
failures. It needs more testing.

Anyway, this adds the file blockio.ionice that can be used to set
per-cgroup IO priorities, just like ionice, the difference is that it
works per-cgroup instead of per-task (it can be easily improved to
also support per-device priority).

The solution I've used is really trivial: all the tasks belonging to a
cgroup share the same io_context, so actually it means that they also
share the same disk time given by the IO scheduler and the tasks'
requests coming from a cgroup are treated as if they were issued by a
single task. This works only for CFQ and AS, because deadline and noop
have no concept of IO contexts.

I would also like to merge Satoshi's cfq-cgroup functionality to
provide "fairness" also within each cgroup, but the drawback is that it
would work only for CFQ.

So, in conclusion, I'd really like to implement a more generic
weighted/priority cgroup-based policy to schedule bios like dm-ioband,
maybe implementing the hook directly in submit_bio() or
generic_make_request(), independent also of the dm infrastructure.

-Andrea

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 block/blk-io-throttle.c         |   72 +++++++++++++++++++++++++++++++++++++--
 block/blk-ioc.c                 |   16 +-------
 include/linux/blk-io-throttle.h |    7 ++++
 include/linux/iocontext.h       |   15 ++++++++
 kernel/fork.c                   |    3 +-
 5 files changed, 95 insertions(+), 18 deletions(-)

diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index 0fa235d..2a52e8d ...
From: Vivek Goyal
Date: Tuesday, September 2, 2008 - 2:41 pm

Probably we don't want to share io contexts among the tasks of the same
cgroup because then requests from all the tasks of the cgroup will be queued
on the same cfq queue and we will lose the notion of task priority.

(I think you already covered this point in next paragraph.)


I thought that an implementation at the generic layer could provide fairness
between various cgroups (based on their weight/priority) and then fairness
within a cgroup would be provided by the respective IO scheduler (depending
on what kind of fairness notion the IO scheduler carries, for example task
priority in cfq).

So at the generic layer we probably just need to think about how to keep
track of various cgroups per device (probably in an rb tree like the cpu
scheduler) and how to schedule these cgroups to submit requests to the IO
scheduler, based on cgroup weight/priority.


I was wondering why dm-ioband creates another LVM driver.
Configuring an ioband device for every logical/physical device
we want to control looks a little odd to me. Can't we achieve the same thing
by implementing all the logic in the generic block layer without any
additional LVM driver?

Thanks
Vivek
--

From: Vivek Goyal
Date: Friday, September 5, 2008 - 8:59 am

Ok, to be more specific, I was thinking of the following.

Currently, all the requests for a block device go into the request queue in
a linked list and then the associated elevator selects the best request for
dispatch based on the various policies dictated by the elevator.

Can we maintain an rb-tree per request queue, so that all the requests being
queued on that request queue first go into this rb-tree? Then, based on
cgroup grouping and control policy (max bandwidth capping, proportional
bandwidth etc), one can pass the requests to the elevator associated with the
queue (which will do the actual job of merging and other things).

So effectively we first provide control at the cgroup level and then let
the elevator take the best decisions within that.
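
The two-level idea above (cgroup-level control first, elevator second) can be
sketched in user space as follows. This is only an illustration under stated
assumptions: struct io_group, struct group_sched and the gs_*() helpers are
hypothetical names, a linear scan over a small array stands in for the
rb-tree, and a per-group counter stands in for the real queue of requests.
The policy shown is proportional weight, implemented with a per-group
virtual time that advances more slowly for heavier groups.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_GROUPS 8
#define VT_SCALE   1024u /* fixed-point scale for virtual time */

/* Hypothetical per-cgroup state hanging off one request queue. */
struct io_group {
    unsigned weight;  /* relative share of the device */
    uint64_t vtime;   /* service received so far, scaled by 1/weight */
    unsigned pending; /* queued requests (stand-in for a real list) */
};

struct group_sched {
    struct io_group groups[MAX_GROUPS];
    size_t nr;
};

/* Register a group; returns its index, or -1 if the table is full. */
static int gs_add_group(struct group_sched *s, unsigned weight)
{
    assert(weight > 0);
    if (s->nr == MAX_GROUPS)
        return -1;
    s->groups[s->nr] = (struct io_group){ .weight = weight };
    return (int)s->nr++;
}

/* Queue nr_requests requests on behalf of group g. */
static void gs_enqueue(struct group_sched *s, int g, unsigned nr_requests)
{
    assert(g >= 0 && (size_t)g < s->nr);
    s->groups[g].pending += nr_requests;
}

/*
 * Pick the backlogged group with the smallest virtual time and release one
 * request of `bytes` bytes from it. In the proposal this is the point where
 * the request would be handed to the elevator. Returns the group index, or
 * -1 if nothing is queued.
 */
static int gs_dispatch(struct group_sched *s, uint64_t bytes)
{
    int best = -1;

    for (size_t i = 0; i < s->nr; i++) {
        if (s->groups[i].pending == 0)
            continue;
        if (best < 0 || s->groups[i].vtime < s->groups[best].vtime)
            best = (int)i;
    }
    if (best < 0)
        return -1;

    s->groups[best].pending--;
    s->groups[best].vtime += bytes * VT_SCALE / s->groups[best].weight;
    return best;
}
```

With two continuously backlogged groups of weight 2 and 1 and equally sized
requests, the first group gets two dispatches for every one of the second.
A max-bandwidth policy could sit at the same dispatch point instead; the
elevator behind it stays unmodified.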

This should not require creation of any dm-ioband devices to control the
devices. Each block device will contain one rb-tree (with cgroups hanging
off it) as long as somebody has put a controlling policy on that device. (We
can probably use your interfaces to create policies on devices through cgroup
files).

This should not require elevator modifications and should work well with
stacked devices. 

I will try to write some prototype patches and see if all the above
gibber makes any sense and is workable or not.

One limitation of this scheme is that we are providing grouping capability
based on cgroups only and it is not as generic as what dm-ioband is providing.
Do we really require other ways of creating groupings? Creating another device
for each device you want to control sounds odd to me.

Thanks
Vivek
--

From: Andrea Righi
Date: Friday, September 5, 2008 - 10:38 am

Could a workqueue like kblockd move requests from rb-tree to the equivalent

I think I have to figure out all the implementation details better, but yes,
sounds good. This seems to be the right approach to provide any kind of
IO controlling: bandwidth throttling, proportional bandwidth,


In any case libcgroup could help here to define any grouping policy
(uid, gid, pid, ...). So, IMHO the grouping capability provided by
cgroups is, in perspective, as generic as what dm-ioband provides.

Thanks,
-Andrea
--

From: Hirokazu Takahashi
Date: Wednesday, September 17, 2008 - 12:18 am

FYI, this can also be done with bio-cgroup, which determines the owner cgroup
of a given anonymous page.

Thanks,
Hirokazu Takahashi
--

From: Andrea Righi
Date: Wednesday, September 17, 2008 - 1:47 am

That would be great! FYI here is how I would like to proceed:

- today I'll post a new version of my cgroup-io-throttle patch rebased
  to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough).
  To keep things light and simple I've implemented custom
  get_cgroup_from_page() / put_cgroup_from_page() in the memory
  controller to retrieve the owner of a page, holding a reference to the
  corresponding memcg, during async writes in submit_bio(); this is
  probably not the best way to proceed, and a more generic framework like
  bio-cgroup sounds better, but it seems to work quite well. The only
  problem I've found is that during swap_writepage() the page is not
  assigned to any page_cgroup (page_get_page_cgroup() returns NULL), and
  so I'm not able to charge the cost of this I/O operation to the right
  cgroup. Does bio-cgroup address or even resolve this issue?
- begin to implement a new branch of cgroup-io-throttle on top of
  bio-cgroup
- also start to implement an additional request queue to provide first a
  control at the cgroup level and a dispatcher to pass the request to
  the elevator (as suggested by Vivek)

Thanks,
-Andrea
--

From: Hirokazu Takahashi
Date: Thursday, September 18, 2008 - 4:24 am

This behavior depends on the version of memory-cgroup.
In the previous version, pages in the swap cache were owned by one of
the cgroups.

Kamezawa-san, one of the implementers, told me he turned this feature off
temporarily and he was going to turn it on again. I think this
workaround is chosen because the current implementation of memory

For now, bio-cgroup can't support pages in the swap cache with the
current linux kernel either, since it shares the same infrastructure
with memory-cgroup.

Now, they have just started to rewrite the infrastructure to track pages
with page_cgroup, which is going to give us better performance than ever.
After that I'm going to enhance bio-cgroup more, such as dirty page
tracking. To tell the truth, I already have a dirty page tracking patch
for the current linux in hand, which isn't posted yet. I'm going to
port it to the new infrastructure.

If memory cgroup team change their mind, I will implement swap-pages

Thanks,
Hirokazu Takahashi.
--

From: Andrea Righi
Date: Thursday, September 18, 2008 - 7:37 am

Very good! In any case it seems I'll get the tracking of swap-pages from
someone else.. so I don't have to change/implement anything in my
io-throttle patchset. :)

I'll start to use bio-cgroup in io-throttle ASAP and do some tests. I'll
keep you informed.

Thanks,
-Andrea
--

From: Vivek Goyal
Date: Thursday, September 18, 2008 - 6:55 am

Hi Andrea,

So if we maintain an rb-tree per request queue and implement the cgroup
rules there, then that will take care of io-throttling also. (One can
control the release of bio/requests to the elevator based on any kind of
rules: proportional weight/max-bandwidth).

If that's the case, I was wondering what you mean by "begin to
implement a new branch of cgroup-io-throttle on top of bio-cgroup".

Thanks
Vivek
--

From: Andrea Righi
Date: Thursday, September 18, 2008 - 7:54 am

Correct, with the rb-tree per request queue solution there's no need to
keep track of the context in the struct bio, since the i/o control
based on per cgroup rules has already been performed by the first i/o
dispatcher. And I would really like to dedicate all my efforts to move
in this direction, but it would be interesting as well to test the
bio-cgroup functionality since it's already working, it's a generic
framework and it's used by another project (dm-ioband). This is the reason
why I put it there, specifying to open a new branch, because it
would be an alternative solution to the following point.

-Andrea
--

From: Takuya Yoshikawa
Date: Wednesday, September 17, 2008 - 2:04 am

Hi,


Could you explain which cgroup we should charge when swap in or out occurs?
Is there any difference between the following cases?

Target page is
1. used as page cache and not mapped to any space
2. used as page cache and mapped to some space
3. not used as page cache and mapped to some space


Thanks,
Takuya Yoshikawa
--

From: Andrea Righi
Date: Wednesday, September 17, 2008 - 2:42 am

IMHO we should charge the owner of the page being swapped in/out (not
kswapd I mean). If a task is using a lot of memory and the memory of
this task is swapped out, it's actually generating i/o. Yes, we could
also hit other tasks that are using few pages in this way, but the most
memory consuming guys should be charged proportionally to the memory
they're consuming. IOW, this kind of i/o activity should be charged to the
cgroup the task belongs to.

-Andrea
--

From: Andrea Righi
Date: Wednesday, September 17, 2008 - 3:08 am

As a generic implementation, when a read/write request is submitted to the
IO subsystem (i.e. submit_bio()), look at the first page in the struct bio
and charge the IO cost to the owner of that page. If this makes sense, we
have to just keep track of all the pages when they're submitted to the IO
subsystem in this way. Unfortunately, this doesn't seem to work during
swap_writepage(), but maybe bio-cgroup is able to handle this case.
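
A minimal user-space sketch of the "charge the owner of the first page" rule
described above. All types here (struct cg, struct pg, struct bio_sketch) and
the charge_bio() helper are hypothetical stand-ins, not kernel structures.
The fallback group used when the first page has no owner is an assumption
made for this example; it models the swap_writepage() case, where no owner
can be found, but the thread does not say how that case should be handled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a cgroup with an i/o accounting counter. */
struct cg {
    const char *name;
    uint64_t bytes_charged;
};

/* Hypothetical stand-in for a page; owner is NULL when untracked. */
struct pg {
    struct cg *owner;
};

/* Hypothetical stand-in for a bio: a vector of pages plus a total size. */
struct bio_sketch {
    struct pg **pages;
    size_t nr_pages;
    size_t size; /* total bytes carried by this request */
};

/*
 * Charge the whole request to the owner of its first page. If that page
 * has no owner, charge `fallback` instead (assumption for this sketch).
 * Returns the group that was charged, or NULL if none could be found.
 */
static struct cg *charge_bio(const struct bio_sketch *bio, struct cg *fallback)
{
    struct cg *owner = NULL;

    assert(bio != NULL);
    if (bio->nr_pages > 0 && bio->pages[0] != NULL)
        owner = bio->pages[0]->owner;
    if (owner == NULL)
        owner = fallback;
    if (owner != NULL)
        owner->bytes_charged += bio->size;
    return owner;
}
```

Looking only at the first page keeps the accounting cheap, at the price of
mis-charging requests whose pages belong to different cgroups.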

-Andrea
--
