The objective of the i/o controller is to improve the i/o performance predictability of different cgroups sharing the same block devices.

Compared with other priority/weight-based solutions, the approach used by this controller is to explicitly choke applications' requests that directly (or indirectly) generate i/o activity in the system.

The direct bandwidth and/or iops limiting method has the advantage of improving performance predictability, at the cost of reducing, in general, the overall performance of the system (in terms of throughput).

Detailed information about the design, its goals and usage is given in the documentation.

Patchset against 2.6.27-rc1-mm1. The all-in-one patch (and previous versions) can be found at:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/

This patchset is an experimental implementation. It includes functional differences with respect to the previous versions (see the changelog below), and I haven't done much testing yet, so comments are really welcome.

Changelog: (v8 -> v9)
* introduce struct res_counter_ratelimit as a generic structure to implement throttling-based cgroup subsystems
* removed the throttling hooks from the page cache (set_page_dirty): set a single throttling hook in submit_bio() for both read and write operations; a generic process that is dirtying pages on a limited block device (for the cgroup it belongs to) is forced to flush the same amount of pages back to the block device (in this way write operations are forced to occur in the same IO context as the process that actually generated the IO)
* collect per-cgroup, per-block-device and per-task throttling statistics (throttle counter and total time slept for throttling) and export them to userspace through blockio.throttlcnt (in the cgroup filesystem) and /proc/PID/io-throttle-stat (per-task statistics)
* fair throttling: simple attempt to distribute the sleeps equally among all the tasks belonging to the same cgroup; instead of ...
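For readers who want a concrete picture of the direct bandwidth limiting idea, here is a minimal user-space sketch in Python. It is not the code from the patchset and does not model struct res_counter_ratelimit; the class BandwidthLimiter, its charge() method and the limit_bps parameter are hypothetical names chosen only for illustration. It assumes a single limit per cgroup and device, and returns how long a submitter would have to sleep so that the average rate stays under that limit.

```python
from dataclasses import dataclass


@dataclass
class BandwidthLimiter:
    """Hypothetical per-cgroup, per-device limiter (illustration only)."""

    limit_bps: float        # maximum bytes per second allowed
    next_free: float = 0.0  # earliest time the next request may start

    def charge(self, nbytes: int, now: float) -> float:
        """Account nbytes submitted at time `now`.

        Returns the number of seconds the submitting task should sleep
        so that the long-run rate does not exceed limit_bps.
        """
        # A request cannot start before the previous one has "paid" for
        # its bytes at the configured rate.
        start = max(now, self.next_free)
        self.next_free = start + nbytes / self.limit_bps
        return start - now
```

With a 1 MiB/s limit, the first 1 MiB request at time 0 is not delayed, while a second 1 MiB request submitted at the same instant would be told to sleep for one second. Throughput is traded for predictability, which is the trade-off described above.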
Hi Andrea,

I was checking out the past discussion on this topic and there seemed to be two kinds of people: those who wanted to control max bandwidth, and those who liked the proportional bandwidth approach (dm-ioband folks).

I was just wondering, is it possible to have both approaches and let users decide at run time which one they want to use (something like the way users can choose io schedulers)?

Thanks
Vivek
--
Hi Vivek,

yes, sounds reasonable (adding proportional bandwidth control to my TODO list).

Right now I have a totally experimental patch to add ionice-like functionality (it's not the same, but it's quite similar to the proportional bandwidth feature) on top of my IO controller. See below.

The patch is not very well tested. I don't even know if it applies cleanly to the latest io-throttle patch I posted, or if it has runtime failures; it needs more testing.

Anyway, this adds the file blockio.ionice that can be used to set per-cgroup IO priorities, just like ionice. The difference is that it works per-cgroup instead of per-task (it can be easily improved to also support per-device priority).

The solution I've used is really trivial: all the tasks belonging to a cgroup share the same io_context, so they also share the same disk time given by the IO scheduler, and the requests coming from the tasks of a cgroup are treated as if they were issued by a single task. This works only for CFQ and AS, because deadline and noop have no concept of IO contexts.

I would also like to merge Satoshi's cfq-cgroup functionality to provide "fairness" within each cgroup as well, but the drawback is that it would work only for CFQ.

So, in conclusion, I'd really like to implement a more generic weighted/priority cgroup-based policy to schedule bios like dm-ioband, maybe implementing the hook directly in submit_bio() or generic_make_request(), independent also of the dm infrastructure.

-Andrea

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 block/blk-io-throttle.c         |   72 +++++++++++++++++++++++++++++++++++++--
 block/blk-ioc.c                 |   16 +-------
 include/linux/blk-io-throttle.h |    7 ++++
 include/linux/iocontext.h       |   15 ++++++++
 kernel/fork.c                   |    3 +-
 5 files changed, 95 insertions(+), 18 deletions(-)

diff --git a/block/blk-io-throttle.c b/block/blk-io-throttle.c
index 0fa235d..2a52e8d ...
We probably don't want to share io contexts among the tasks of the same cgroup, because then requests from all the tasks of the cgroup will be queued on the same cfq queue and we will lose the notion of task priority. (I think you already covered this point in the next paragraph.)

I thought that an implementation at the generic layer could provide fairness between the various cgroups (based on their weight/priority), and fairness within a cgroup would then be provided by the respective IO scheduler (depending on what notion of fairness the IO scheduler carries, for example task priority in cfq).

So at the generic layer we probably just need to think about how to keep track of the various cgroups per device (probably in an rb-tree, like the cpu scheduler) and how to schedule these cgroups to submit requests to the IO scheduler, based on cgroup weight/priority.

I was wondering why dm-ioband creates another LVM driver. Configuring an ioband device for every logical/physical device we want to control looks a little odd to me. Can't we achieve the same thing by implementing all the logic in the generic block layer without any additional LVM driver?

Thanks
Vivek
--
Ok, to be more specific, I was thinking of the following.

Currently, all the requests for a block device go into the request queue in a linked list, and then the associated elevator selects the best request for dispatch based on the various policies dictated by the elevator.

Can we maintain an rb-tree per request queue, so that all the requests being queued on that request queue first go into this rb-tree? Then, based on cgroup grouping and control policy (max bandwidth capping, proportional bandwidth, etc.), one can pass the requests to the elevator associated with the queue (which will do the actual job of merging and other things).

So effectively we first provide control at the cgroup level and then let the elevator take the best decisions within that.

This should not require the creation of any dm-ioband devices to control the devices. Each block device will contain one rb-tree (with cgroups hanging off it) as long as somebody has put a controlling policy on that device. (We can probably use your interfaces to create policies on devices through cgroup files.)

This should not require elevator modifications and should work well with stacked devices.

I will try to write some prototype patches and see whether all the above gibberish makes any sense and is workable or not.

One limitation of this scheme is that we provide grouping capability based on cgroups only, so it is not as generic as what dm-ioband provides. Do we really require other ways of creating groupings? Creating another device for each device you want to control sounds odd to me.

Thanks
Vivek
--
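To make the two-level idea above easier to discuss, here is a rough user-space model in Python. It is only a sketch under stated assumptions and not a kernel prototype: the class CgroupDispatcher and its methods are hypothetical names, a binary heap from the standard heapq module stands in for the per-request-queue rb-tree, and the policy shown is one possible proportional-weight rule (the cgroup with the smallest weighted service so far goes next). Whatever dispatch() returns would be handed to the elevator, which is not modeled here.

```python
import heapq
from collections import deque


class CgroupDispatcher:
    """Hypothetical first-level dispatcher in front of the elevator."""

    def __init__(self):
        self.weights = {}  # cgroup name -> weight
        self.queues = {}   # cgroup name -> FIFO of (request_id, size)
        self.vtime = {}    # cgroup name -> weighted service received so far
        self.ready = []    # heap of (vtime, cgroup); stand-in for the rb-tree

    def add_cgroup(self, name, weight):
        self.weights[name] = weight
        self.queues[name] = deque()
        self.vtime[name] = 0.0

    def queue_request(self, cgroup, request_id, size):
        was_empty = not self.queues[cgroup]
        self.queues[cgroup].append((request_id, size))
        if was_empty:
            # The cgroup becomes eligible for dispatch again.
            heapq.heappush(self.ready, (self.vtime[cgroup], cgroup))

    def dispatch(self):
        """Return (cgroup, request_id) for the elevator, or None if idle."""
        if not self.ready:
            return None
        _, cgroup = heapq.heappop(self.ready)
        request_id, size = self.queues[cgroup].popleft()
        # Heavier cgroups advance more slowly, so they are picked more often.
        self.vtime[cgroup] += size / self.weights[cgroup]
        if self.queues[cgroup]:
            heapq.heappush(self.ready, (self.vtime[cgroup], cgroup))
        return cgroup, request_id
```

With two backlogged cgroups of weight 2 and 1 issuing equal-sized requests, the heavier one receives about twice as many dispatch slots. A max-bandwidth cap could be plugged into the same place by delaying the pop instead of reordering it. The sketch deliberately ignores what to do with a cgroup that has been idle for a long time, which a real implementation would have to address.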
Could a workqueue like kblockd move requests from rb-tree to the equivalent

I think I have to work out all the implementation details better, but yes, sounds good. This seems to be the right approach to provide any kind of IO control: bandwidth throttling, proportional bandwidth,

In any case libcgroup could help here to define any grouping policy (uid, gid, pid, ...). So, IMHO, the grouping capability provided by cgroups is potentially as generic as what dm-ioband provides.

Thanks,
-Andrea
--
FYI, this can also be done with bio-cgroup, which determines the owner cgroup of a given anonymous page.

Thanks,
Hirokazu Takahashi
--
That would be great!

FYI here is how I would like to proceed:

- today I'll post a new version of my cgroup-io-throttle patch rebased to 2.6.27-rc5-mm1 (it's well tested and seems to be stable enough). To keep things light and simple I've implemented custom get_cgroup_from_page() / put_cgroup_from_page() in the memory controller to retrieve the owner of a page, holding a reference to the corresponding memcg, during async writes in submit_bio(). This is probably not the best way to proceed, and a more generic framework like bio-cgroup sounds better, but it seems to work quite well. The only problem I've found is that during swap_writepage() the page is not assigned to any page_cgroup (page_get_page_cgroup() returns NULL), so I'm not able to charge the cost of this I/O operation to the right cgroup. Does bio-cgroup address or even resolve this issue?
- begin to implement a new branch of cgroup-io-throttle on top of bio-cgroup
- also start to implement an additional request queue to provide first a control at the cgroup level and a dispatcher to pass the request to the elevator (as suggested by Vivek)

Thanks,
-Andrea
--
This behavior depends on the version of memory-cgroup. In the previous version, pages in the swap cache were owned by one of the cgroups. Kamezawa-san, one of the implementers, told me he turned this feature off temporarily and is going to turn it on again.

I think this workaround is chosen because the current implementation of memory

Bio-cgroup can't support pages in the swap cache for now with the current linux kernel either, since it shares the same infrastructure with memory-cgroup.

Now they have just started to rewrite the infrastructure to track pages with page_cgroup, which should give us better performance than ever. After that I'm going to enhance bio-cgroup further, for example with dirty page tracking.

To tell the truth, I already have a dirty page tracking patch for the current linux in hand, which isn't posted yet. I'm going to port it to the new infrastructure.

If the memory cgroup team changes their mind, I will implement swap-pages

Thanks,
Hirokazu Takahashi.
--
Very good! In any case it seems I'll get the tracking of swap-pages from someone else, so I don't have to change or implement anything in my io-throttle patchset. :)

I'll start to use bio-cgroup in io-throttle ASAP and do some tests. I'll keep you informed.

Thanks,
-Andrea
--
Hi Andrea,

So if we maintain an rb-tree per request queue and implement the cgroup rules there, then that will take care of io-throttling as well. (One can control the release of bios/requests to the elevator based on any kind of rule: proportional weight or max bandwidth.)

If that's the case, I was wondering what you mean by "begin to implement a new branch of cgroup-io-throttle on top of bio-cgroup".

Thanks
Vivek
--
Correct, with the rb-tree per request queue solution there's no need to keep track of the context in the struct bio, since the i/o control based on per-cgroup rules has already been performed by the first i/o dispatcher.

I would really like to dedicate all my efforts to moving in this direction, but it would be interesting as well to test the bio-cgroup functionality, since it already works, it's a generic framework, and it's used by another project (dm-ioband). This is why I listed it and specified opening a new branch: it would be an alternative solution to the following point.

-Andrea
--
Hi,

Could you explain which cgroup we should charge when a swap in or out occurs? Is there any difference between the following cases?

The target page is
1. used as page cache and not mapped to any space
2. used as page cache and mapped to some space
3. not used as page cache and mapped to some space

Thanks,
Takuya Yoshikawa
--
IMHO we should charge the owner of the page being swapped in/out (not kswapd, I mean). If a task is using a lot of memory and the memory of this task is swapped out, it's actually generating i/o. Yes, in this way we could also hit other tasks that are using only a few pages, but the biggest memory consumers should be charged proportionally to the memory they're consuming. IOW, this kind of i/o activity should be charged to the cgroup the task belongs to.

-Andrea
--
As a generic implementation, when a read/write request is submitted to the IO subsystem (i.e. submit_bio()), look at the first page in the struct bio and charge the IO cost to the owner of that page. If this makes sense, we just have to keep track of all the pages when they're submitted to the IO subsystem in this way. Unfortunately, this doesn't seem to work during swap_writepage(), but maybe bio-cgroup is able to handle this case.

-Andrea
--
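The charging rule described in this message can be modeled in a few lines of user-space Python. This is only an illustrative sketch, not kernel code: IoAccounting, track_page() and submit_bio() here are hypothetical stand-ins, pages are represented by plain identifiers, and a missing owner models the swap_writepage() situation in which no page_cgroup is attached to the page.

```python
from collections import defaultdict


class IoAccounting:
    """Hypothetical model of 'charge the owner of the first page'."""

    def __init__(self):
        self.page_owner = {}             # page id -> owning cgroup
        self.charged = defaultdict(int)  # cgroup -> bytes charged so far

    def track_page(self, page, cgroup):
        """Record which cgroup owns a page (the tracking step)."""
        self.page_owner[page] = cgroup

    def submit_bio(self, pages, nbytes):
        """Charge nbytes to the owner of the first page of the bio.

        Returns the charged cgroup, or None when the first page has no
        known owner, in which case nothing is charged.
        """
        owner = self.page_owner.get(pages[0])
        if owner is None:
            return None
        self.charged[owner] += nbytes
        return owner
```

Note that only the first page decides who pays: a bio that mixes pages from several cgroups is charged entirely to one of them, which is a simplification inherent in the rule as stated.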
