Hi All,

I have got excellent results with dm-ioband, which controls disk I/O bandwidth even when it accepts delayed write requests.

This time, I ran some benchmarks on high-end storage, to avoid a performance bottleneck due to mechanical factors such as seek time.

You can see the details of the benchmarks at:
http://people.valinux.co.jp/~ryov/dm-ioband/hps/

Thanks,
Ryo Tsuruta
--
Hi Ryo,

I had a query about the dm-ioband patches. IIUC, the dm-ioband patches will break the notion of process priority in CFQ, because the dm-ioband device now holds each bio and issues it to the lower layers later, based on which bios become ready. Hence the actual bio-submitting context might be different, and because CFQ derives the io_context from the current task, it will be broken.

To mitigate that problem, we probably need to implement Fernando's suggestion of putting an io_context pointer in the bio. Have you already done something to solve this issue?

Secondly, why do we have to create an additional dm-ioband device for every device we want to control using rules? This looks a little odd, at least to me. Can't we keep it in line with the rest of the controllers, where task grouping takes place using cgroups and rules are specified in the cgroup itself (the way Andrea Righi does for the io-throttling patches)?

To avoid stacking another device (dm-ioband) on top of every device we want to subject to rules, I was thinking of maintaining an rb-tree per request queue. Requests will first go into this rb-tree upon __make_request() and will then filter down to the elevator associated with the queue (if there is one). This gives us control over releasing bios to the elevator based on policies (proportional weight, max bandwidth, etc.) with no need to stack an additional block device.

I am working on some experimental proof-of-concept patches. It will take some time, though.

I was thinking of the following:

- Adopt Andrea Righi's style of specifying rules for devices, and group the tasks using cgroups.
- To begin with, adopt dm-ioband's approach of a proportional bandwidth controller. It makes sense to me to limit bandwidth usage only in case of contention. If there is really a need to limit max bandwidth, then we can probably implement additional rules, or implement some policy switcher where the user can decide what kind of policies need to be implemented.
- Get rid ...
Thanks Vivek. All sounds reasonable to me and I think this is the right way to proceed.

I'll try to design and implement your rb-tree per request-queue idea in my io-throttle controller; maybe we can also reuse it for a more generic solution. Feel free to send me your experimental proof of concept if you want, even if it's not yet complete. I can review it, test it and contribute.

-Andrea
--
Currently I have taken code from bio-cgroup to implement cgroups and to provide the functionality to associate a bio with a cgroup. I need this to be able to queue the bios at the right node in the rb-tree, and also to be able to decide when it is the right time to release a few requests.

Right now, in a crude implementation, I am working on making the system boot. Once the patches are at least in somewhat working shape, I will send them to you to have a look.

Thanks
Vivek
--
I wonder... wouldn't it be simpler to just use the memory controller to retrieve this information, starting from struct page?

I mean, following this path (in short, obviously using the appropriate interfaces for locking and referencing the different objects):

cgrp = page->page_cgroup->mem_cgroup->css.cgroup

Once you get the cgrp it's very easy to use the corresponding controller structure.

Actually, this is how I'm doing it in cgroup-io-throttle to associate a bio with a cgroup. What other functionalities/advantages does bio-cgroup provide in addition to that?

Thanks,
-Andrea
--
Andrea,
Ok, you are first retrieving the cgroup associated with the page owner and then
retrieving the respective iothrottle state using that
cgroup (cgroup_to_iothrottle). I have yet to dive deeper into the cgroup
data structures, but does it work if the iothrottle and memory controllers
are mounted on separate hierarchies?

The bio-cgroup guys are doing a similar thing, in the sense of retrieving the
relevant pointer through page and page_cgroup and using that to reach the
bio_cgroup structure. The difference is that they don't first retrieve the
css object of the mem_cgroup; instead they store the pointer to the
bio_cgroup directly in page_cgroup (when the page is being charged in the
memory controller).

While the page is being charged, determine the bio_cgroup associated with
the task and store this info in page->page_cgroup->bio_cgroup.

static inline struct bio_cgroup *bio_cgroup_from_task(struct task_struct *p)
{
        return container_of(task_subsys_state(p, bio_cgroup_subsys_id),
                            struct bio_cgroup, css);
}

At any later point, one can look at a bio and reach the respective bio_cgroup
via:

bio->page->page_cgroup->bio_cgroup

It looks like we are now getting rid of the page_cgroup pointer in "struct page",
so we shall have to change the implementation accordingly.

Thanks
Vivek
--
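
For readers following along, the charge-time store and later lookup described in this message can be sketched in user space roughly as follows. This is only an illustration under stated assumptions: the structs are hypothetical stand-ins that borrow the kernel names, and charge_page() and bio_to_bio_cgroup() are made-up helper names, not functions from the bio-cgroup patches.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical user-space stand-ins for the kernel structures named above. */
struct bio_cgroup  { int id; };
struct page_cgroup { struct bio_cgroup *bio_cgroup; };
struct page        { struct page_cgroup *page_cgroup; };
struct bio         { struct page *page; };
struct task        { struct bio_cgroup *biocg; };

/* Charge time: remember which bio_cgroup the charging task belongs to. */
void charge_page(struct page *pg, const struct task *t)
{
    assert(pg && pg->page_cgroup && t);
    pg->page_cgroup->bio_cgroup = t->biocg;
}

/*
 * Any later point (for example, delayed writeback issued by another task):
 * walk bio -> page -> page_cgroup -> bio_cgroup. The result does not depend
 * on which task is submitting the bio.
 */
struct bio_cgroup *bio_to_bio_cgroup(const struct bio *bio)
{
    if (!bio || !bio->page || !bio->page->page_cgroup)
        return NULL;
    return bio->page->page_cgroup->bio_cgroup;
}
```

The point of the sketch is that the owner is recorded once, when the page is charged, so the lookup still finds the right group when the I/O is issued later from a different context.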
ehm... I have to check. I usually mount all the controllers into the same

Actually, only the page_get_page_cgroup() implementation would change.

And we don't have to worry about the particular implementation (hash, radix_tree, whatever..); in any case bio-cgroup simply has to use the appropriate interface: page_get_page_cgroup(struct page *page).

-Andrea
--
I've decided to get Ryo to post the accurate dirty-page tracking patch for bio-cgroup, which isn't perfect yet though. The memory controller never wants to support this tracking because migrating a page between memory cgroups is really heavy.

I also thought enhancing the memory controller would be good enough, but a lot of people said they wanted to control memory resources and block I/O resources separately. So you can create several bio-cgroups in one memory-cgroup, or you can use bio-cgroup without memory-cgroup.

I also have a plan to implement a more accurate tracking mechanism on bio-cgroup after the memory cgroup team re-implements the infrastructure, which won't be supported by memory-cgroup. When a process is moved into another memory cgroup, the pages belonging to the process don't move to the new cgroup, because migrating pages is so heavy. It's hard to find the pages from the process, and migrating pages may cause some memory pressure. I'll implement this feature only on bio-cgroup, with minimum overhead.

Thanks,
Hirokazu Takahashi.
--
On Fri, 19 Sep 2008 12:34:05 +0900 (JST)

I really would like to move page_cgroup to the new cgroup when the process moves... But it's just in my plan and I'm not sure whether I can do it or not.

Anyway, what's next for me is:
1. resolve the current discussion about removing the page->page_cgroup pointer.
2. reduce locks.
3. support swap and swap-cache.

I think the algorithm for (1) and (2) is now getting smart.

Thanks,
-Kame
--
Kamezawa-San,

I am not dead against it, but I would provide a knob/control point for the system administrator to decide if movement is important for applications.

Are you planning on reposting these? I've been trying other approaches at my end:
1. Use a radix tree per-node per-zone
2. Use radix trees only for 32-bit systems
3. Depend on CONFIG_HAVE_MEMORY_PRESENT, build a sparse data structure and use pre-allocation

--
Balbir
--
On Fri, 19 Sep 2008 22:18:01 -0700

I'll post next Monday. It's obvious that I should do more tests/fixes... My patch has (many) bugs. Several are fixed but there will still be more ;) SwapCache beats me again because it easily reuses uncharged pages...

BTW, why do you like radix-tree? It's not very good for our purpose.

Thanks,
-Kame
--
This is a completely different problem we have to solve. The CFQ scheduler makes the really bad assumption that the current process must be the owner. This problem occurs when you use some of device

Actually, I already have a patch to solve this problem, which makes each bio have a pointer to the io_context of the owner process. Would you take a look at the thread whose subject is "I/O context inheritance" in:
http://www.uwsg.iu.edu/hypermail/linux/kernel/0804.2/index.html#2850

Fernando also knows this.

Thank you,
Hirokazu Takahashi.
--
Great. Sure, I will have a look at this thread. This is something we shall have to implement, irrespective of whether we go for the dm-ioband approach or an rb-tree per request queue approach.

Thanks
Vivek
--
It isn't essential that dm-band be implemented as one of the device-mapper
drivers. I've also been considering that this algorithm itself can be
implemented in the block layer directly.

However, the current implementation has merits. It is flexible.
- Dm-ioband can be placed anywhere you like, which may be right before
the I/O schedulers or may be on top of LVM devices.
- It supports partition-based bandwidth control, which works without
cgroups and is quite easy to use.
- It is independent of any I/O scheduler, including ones which will
be introduced in the future.

I also understand it will be hard to set up without some tools

I think it's a bit late to control I/O requests there, since a process
may be blocked in get_request_wait when the I/O load is high.
Please imagine the situation where cgroups with low bandwidths are
consuming most of the "struct request"s while another cgroup with a high
bandwidth is blocked and can't get enough "struct request"s.
--
Hi,

An rb-tree per request queue should also be able to give us this flexibility. Because the logic is implemented per request queue, rules can be placed at any layer: either at the bottom-most layer, where requests are passed to the elevator, or at a higher layer, where requests are passed to lower-level block devices in the stack. It is just that we shall have to modify some of the higher-level dm/md drivers to make use of

This scheme should also be independent of any of the IO schedulers. We might have to make small changes in the IO schedulers to decouple things from __make_request() a bit, in order to insert the rb-tree between __make_request() and the IO scheduler. Otherwise, fundamentally, this approach should not

That's something I wish to avoid. If we can keep it simple by doing

Ok, this is a good point. Because the number of struct requests is limited and they seem to be allocated on a first-come, first-served basis, a cgroup that is generating a lot of IO might win.

But dm-ioband will face the same issue. Essentially it is also a request queue, and it will have a limited number of request descriptors.

Have you modified the logic somewhere to allocate request descriptors to the waiting processes based on their weights? If yes, the logic can probably be implemented here too.

Thanks
Vivek
--
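
One possible reading of "allocating request descriptors based on weights" is a per-group cap proportional to the group's weight, so that a low-weight cgroup cannot consume most of the descriptors. The sketch below only illustrates that idea; struct rq_group, group_quota(), try_get_request() and put_request() are hypothetical names, and nothing here is taken from dm-ioband or from the block layer.

```c
#include <assert.h>

/* Hypothetical per-cgroup accounting for one request queue. */
struct rq_group {
    unsigned weight;   /* proportional weight of the cgroup */
    unsigned in_use;   /* request descriptors currently held */
};

/* Share of the queue's nr_requests descriptors this weight may hold. */
unsigned group_quota(unsigned nr_requests, unsigned weight,
                     unsigned total_weight)
{
    assert(total_weight > 0);
    unsigned q = (unsigned)((unsigned long long)nr_requests * weight
                            / total_weight);
    return q ? q : 1;  /* always allow at least one descriptor */
}

/* Returns 1 if the group may take a descriptor, 0 if it must wait. */
int try_get_request(struct rq_group *g, unsigned nr_requests,
                    unsigned total_weight)
{
    if (g->in_use >= group_quota(nr_requests, g->weight, total_weight))
        return 0;
    g->in_use++;
    return 1;
}

/* Called when a request completes and its descriptor is freed. */
void put_request(struct rq_group *g)
{
    if (g->in_use)
        g->in_use--;
}
```

A hard cap like this wastes descriptors when the high-weight group is idle, so a real implementation would presumably need to lend unused quota to other groups.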
Maybe throttling dirty page ratio in memory could help to avoid this problem. I mean, if a cgroup is exceeding the i/o limits do ehm... something.. also at the balance_dirty_pages() level. -Andrea --
That is one of the important features to be implemented for controlling I/O. Controlling the dirty page ratio can help to avoid this issue, but it isn't a guarantee. So both of them should be implemented.

What do you think happens when some cgroups have tons of threads which issue a lot of direct I/Os, or others have huge amounts of memory?

Thanks,
Hirokazu Takahashi.
--
Request descriptors are allocated right before passing I/O requests to the elevators. Even if you move the descriptor allocation point before calling the dm/md drivers, the drivers can't make use of them.

When one of the dm drivers accepts an I/O request, the request won't have either a real device number or a real sector number. The request will be re-mapped to another sector of another device in every dm driver. The request may even be replicated there. So it is really hard to find the right request queue to put

It's possible that the algorithm of dm-ioband can be placed in the block layer if it is really a big problem. But I doubt it can control every control block I/O as we wish since

Nope. Dm-ioband doesn't have this issue since it works before allocating the descriptors. Only I/O requests dm-ioband has passed can allocate its
--
You are right. Request descriptors are currently allocated at the bottom-most layer. Anyway, in the rb-tree we put bio cgroups as logical elements, and every bio cgroup then contains a list of either bios or request descriptors. So what kind of list a bio-cgroup maintains can depend on whether it is a higher-layer driver (which will maintain bios) or a lower-layer driver (which will maintain a list of request descriptors per bio-cgroup).

So basically the mechanism of maintaining an rb-tree can be completely ignorant of whether a driver is keeping track of bios or keeping

Hmm... I thought that all the incoming requests to a dm/md driver would remain in a single queue maintained by that driver (irrespective of which request queues these requests go into in the lower layers after replication or other operations). I am not very familiar with dm/md

I had a question regarding the cgroup interface. I am assuming that in a system, one will be using other controllers as well, apart from the IO controller. The other controllers will be using cgroups as the grouping mechanism. Coming up with an additional grouping mechanism only for the IO controller seems a little odd to me. It will make the job of higher-level management software harder.

Looking at the dm-ioband grouping examples given in the patches, I think the cases of grouping based on pid, pgrp, uid and kvm can be handled by creating the right cgroup and making sure applications are launched in/moved into the right cgroup by user-space tools.

I think keeping the grouping mechanism in line with the rest of the controllers should help, because a uniform grouping mechanism should make life simpler.

I am not very sure about moving the dm-ioband algorithm into the block layer. Looks

Ok. Got it. dm-ioband does not block on allocation of request descriptors. It does seem to block in prevent_burst_bios(), but that would be per group, so it should be fine.

That means for lower layers, one shall have to do request descriptor allocation as per the cgroup weight to make sure a cgroup with ...
I'm getting confused about your idea.

I thought you wanted to make each cgroup have its own rb-tree, and wanted to make all the layers share the same rb-tree. If so, are you going to put different things into the same tree? Do you even want all the I/O schedulers to use the same tree?

Are you going to block request descriptors in the tree? From the viewpoint of performance, all the request descriptors should be passed to the I/O schedulers, since the maximum number of request descriptors is limited.

And I still don't understand: if you want to make your rb-tree work efficiently, you need to put a lot of bios or request descriptors into the tree. Is that what you are going to do? On the other hand, dm-ioband tries to minimize the number of blocked bios. And I have a plan to reduce the maximum number that can be blocked there.

I don't care whether the queue is implemented as an rb-tree or some

They never look into the queues maintained in drivers. Some of them have their own little queue and others don't. Some may just modify the sector numbers of I/O requests or may create a new I/O request themselves. Others such as md-raid5 have their own queues to control I/Os, where

A write request may cause several read requests and has to wait for their completion before the actual write starts.

Thanks,
Hirokazu Takahashi.
--
Ok, I will give more details of the thought process.

I was thinking of maintaining an rb-tree per request queue, and not an rb-tree per cgroup. This tree can contain all the bios submitted to that request queue through __make_request(). Every node in the tree will represent one cgroup and will contain a list of bios issued by the tasks in that cgroup.

Every bio entering the request queue through the __make_request() function will first be queued in one of the nodes in this rb-tree, depending on which cgroup that bio belongs to.

Once the bios are buffered in the rb-tree, we release them to the underlying elevator depending on the proportionate weight of the nodes/cgroups.

Some more details which I was trying to implement yesterday.

There will be one bio_cgroup object per cgroup. This object will contain many bio_group objects. A bio_group object will be created for each request queue where a bio from the bio_cgroup is queued. Essentially the idea is that bios belonging to a cgroup can be on various request queues in the system. So a single object cannot serve the purpose, as it cannot be on many rb-trees at the same time. Hence create one sub-object which will keep track of the bios belonging to one cgroup on a particular request queue.

Each bio_group will contain a list of bios, and this bio_group object will be a node in the rb-tree of the request queue. For example, let's say there are two request queues in the system, q1 and q2 (say they belong to /dev/sda and /dev/sdb). Let's say a task t1 in /cgroup/io/test1 is issuing IO for both /dev/sda and /dev/sdb.

The bio_cgroup belonging to /cgroup/io/test1 will have two sub bio_group objects, say bio_group1 and bio_group2. bio_group1 will be in q1's rb-tree and bio_group2 will be in q2's rb-tree. bio_group1 will contain a list of bios issued by task t1 for /dev/sda and bio_group2 will contain a list of bios issued by task t1 for /dev/sdb. I thought the same can be extended to stacked devices also.

I am still trying to implementing ...
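
To make the data-structure description above concrete, here is a minimal user-space sketch, written under several assumptions. A small array stands in for the per-queue rb-tree, a counter stands in for the bio list, and the "smallest virtual time first" rule is just one common way to get proportional-weight release; the message above does not say which algorithm the patches use. queue_bio() and dispatch_one() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_GROUPS  8
#define VTIME_SCALE 1000UL

/* One bio_group per (cgroup, request queue) pair. */
struct bio_group {
    int cgroup_id;
    unsigned weight;        /* proportional weight of the owning cgroup */
    unsigned pending;       /* buffered bios (stand-in for a bio list) */
    unsigned long vtime;    /* service received so far, scaled by 1/weight */
};

/* The array is a stand-in for the per-queue rb-tree of bio_groups. */
struct request_queue {
    struct bio_group groups[MAX_GROUPS];
    size_t nr_groups;
};

/* Buffer one bio in the node of its cgroup, creating the node on first use. */
struct bio_group *queue_bio(struct request_queue *q, int cgroup_id,
                            unsigned weight)
{
    assert(weight > 0 && weight <= VTIME_SCALE);
    for (size_t i = 0; i < q->nr_groups; i++) {
        if (q->groups[i].cgroup_id == cgroup_id) {
            q->groups[i].pending++;
            return &q->groups[i];
        }
    }
    if (q->nr_groups == MAX_GROUPS)
        return NULL;
    struct bio_group *g = &q->groups[q->nr_groups++];
    g->cgroup_id = cgroup_id;
    g->weight = weight;
    g->pending = 1;
    g->vtime = 0;
    return g;
}

/*
 * Release one bio to the elevator from the backlogged group that has
 * received the least weighted service. Returns its cgroup id, or -1 if
 * nothing is buffered.
 */
int dispatch_one(struct request_queue *q)
{
    struct bio_group *best = NULL;
    for (size_t i = 0; i < q->nr_groups; i++) {
        struct bio_group *g = &q->groups[i];
        if (g->pending && (!best || g->vtime < best->vtime))
            best = g;
    }
    if (!best)
        return -1;
    best->pending--;
    best->vtime += VTIME_SCALE / best->weight;
    return best->cgroup_id;
}
```

With two backlogged cgroups of weight 2 and 1 on the same queue, repeated calls to dispatch_one() release their bios in a 2:1 ratio. A second request_queue gets its own independent set of bio_group nodes for the same cgroups, which mirrors the bio_group1/bio_group2 example.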
Vivek Goyal wrote:

Vivek, thanks for the detailed explanation. Only one comment. I guess that if we don't also change the per-process optimizations/improvements made by some IO schedulers, we can have undesirable behaviours.

For example: CFQ uses the per-process iocontext to improve fairness between *all* the processes in a system. But it doesn't have the concept of a cgroup context on top of the processes.

So, some optimizations made to guarantee fairness among processes could conflict with algorithms implemented at the cgroup layer, and potentially lead to undesirable behaviours.

For example, an issue I'm experiencing with my cgroup-io-throttle patchset is that a cgroup can consistently increase its IO rate (always respecting the max limits) simply by increasing its number of IO worker tasks relative to another cgroup with a lower number of IO workers. This is probably due to the fact that CFQ tries to give the same amount of "IO time" to all the tasks, without considering that they're organized in cgroups.

I don't see this behaviour with noop or deadline, because they don't have the concept of iocontext.

-Andrea
--
BTW this is why I proposed to use a single shared iocontext for all the processes running in the same cgroup. Anyway, this is not the best solution, because in this way all the IO requests coming from a cgroup will be queued to the same cfq queue. If I'm not wrong in this way we would implement noop (FIFO) between tasks belonging to the same cgroup and CFQ between cgroups. But, at least for this particular case, we would be able to provide fairness among cgroups. -Andrea --
Ah! Also have a look at this:
http://download.systemimager.org/~arighi/linux/patches/io-throttle/benchmark/graph/eff...

The graph highlights the dependency between the IO rate and the number of tasks running in a cgroup.

For this testcase I've used 2 cgroups:
- cgroup A, with a single task doing IO (large O_DIRECT read stream)
- cgroup B, with a variable number of tasks, ranging from 1 to 16, doing IO in parallel

If we want to be "fair", the gap in IO performance between the cgroups should be close to 0.

Using "plain" cfq (red line), the performance gap increases as the number of tasks in a cgroup increases.

Using cgroup-io-throttle on top of cfq (green line), the performance gap is lower (the asymptotic curve is due to the bandwidth capping provided by cgroup-io-throttle).

Using cgroup-io-throttle and a single shared iocontext for each cgroup (blue line), the performance gap is really close to 0.

Anyway, I repeat, I don't think this is a wonderful solution; it is just to highlight this issue and share with you the results of some tests I did.

-Andrea
--
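
A rough way to see why the gap grows with the number of tasks is to assume an idealized scheduler that gives every task exactly the same share of disk time, and compare it with one that gives every cgroup the same share. This is a back-of-the-envelope model only, not a description of how CFQ actually works, and per_task_share() and per_cgroup_share() are hypothetical helper names.

```c
#include <assert.h>

/*
 * Fraction of disk time a cgroup receives if time is split equally
 * between all tasks in the system (idealized per-task fairness).
 */
double per_task_share(unsigned tasks_in_group, unsigned total_tasks)
{
    assert(total_tasks > 0 && tasks_in_group <= total_tasks);
    return (double)tasks_in_group / total_tasks;
}

/*
 * Fraction of disk time a cgroup receives if time is split equally
 * between cgroups (for example, one shared iocontext per cgroup).
 */
double per_cgroup_share(unsigned nr_cgroups)
{
    assert(nr_cgroups > 0);
    return 1.0 / nr_cgroups;
}
```

Under this model, with 1 task in cgroup A and 16 tasks in cgroup B, per-task fairness gives B 16/17 of the disk time (about 94%) and A only 1/17, while per-cgroup fairness gives each cgroup half regardless of how many tasks it runs.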
I once thought the same thing, but this approach breaks compatibility. I think we should make ionice effective only among the processes in the same cgroup.

A system gives some amount of bandwidth to each of its cgroups, and the processes in each cgroup fairly share the given bandwidth. I think this is the straightforward approach. What do you think?

I think the CFQ-cgroup that the NEC guys are working on, the OpenVZ team's CFQ scheduler, and dm-ioband with bio-cgroup all work like this.

Thank you,
Hirokazu Takahashi.
--
If by "fairly share the given bandwidth" you mean "share according to their IO-nice values" then you're right on this, Hirokazu. We always use a two-level scheduler and would like to see the same behavior in anything that will be
--
Yes. There is also another little mechanism that prevent_burst_bios()

Yes. But when cgroups with higher weight aren't issuing a lot of I/Os,

You mean each layer should have its own rb-tree? Is it per device? One lvm logical volume may consist of several physical volumes, which will be shared with other logical volumes. And some layers may split one bio into several bios. I can hardly imagine what these structures will look like.

But I guess it is a good thing that we are going to support

Thanks,
Hirokazu Takahashi.
--
Ok. Now, with the new thought, I am completely dropping the idea of queuing the request descriptors. I am now thinking of capturing the bios and buffering them in the rb-tree as soon as they enter the request queue through the associated request function. All the request descriptor allocation will come later, when bios are actually released to the elevator from

Yes, one rb-tree per device, be it a physical device or a logical device (because there is one request queue associated with each physical/logical block device).

I was thinking of getting hold of/hijacking the bios as soon as they are submitted to the device through the associated request function. So if there is a logical device built on top of two physical devices, the associated bio copy or other logic should not even see the bio the moment it is submitted to the device. It will see the bio only when it is released to it from the associated rb-tree. Do you think this will not work? To me this is what dm-ioband is doing logically. The only difference is that it does this with the help of a separate request queue.

Thanks
Vivek
--
No, logical block devices don't have any request queues, and they essentially won't block any bios unless it is impossible to

I think it's easy to just make all logical devices (device-mapper devices) and all physical devices have their own bandwidth control mechanism. But I'm not clear how your algorithm works to control the bandwidth.

At which level are you going to guarantee the bandwidth: at the logical volume layer, such as lvm, or at the physical device layer?

Thanks,
Hirokazu Takahashi.
--
Grouping by pid, pgrp and uid is not the point; I've been thinking that it can be replaced with cgroups once the implementation of bio-cgroup is done.

I think the problem with cgroups is that they can't support lots of storage devices and hotplug devices; they just handle them as if they were one resource.

I don't insist that the interface of dm-ioband is the best. I just hope the

Thanks,
Hirokazu Takahashi.
--
What sort of support will help you? -- Balbir --
Sorry, I did not fully understand. Can you please explain in detail what kind of situation will not be covered by the cgroup interface?

Thanks
Vivek
--
From the concept of cgroups, if you want to control several disks independently, you should make each disk have its own cgroup subsystem, which can only be defined when compiling the kernel. This is impossible because every Linux box has a different number of disks.

So you may think it is possible, as a workaround, to make each cgroup have lots of control files, one for each device. But it isn't allowed to add/remove control files when devices are hot-added or hot-removed.

Thanks,
Hirokazu Takahashi.
--
mmh? Not true. You can define a single cgroup subsystem that implements the appropriate interfaces to apply your type of control, and use many structures allocated dynamically, one for each controlled object (one for each block device, disk, partition, ... or using any kind of grouping/splitting policy).

Actually, this is how cgroup-io-throttle, as

Why not a single control file for all the devices?

-Andrea
--
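
As an illustration of the "single control file, many dynamically allocated structures" idea, here is a small user-space sketch. The "<major>:<minor> <weight>" line format, struct dev_rule, set_rule() and get_weight() are all assumptions made up for this example; they are not the actual cgroup-io-throttle interface.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* One dynamically allocated rule per controlled device. */
struct dev_rule {
    unsigned major, minor, weight;
    struct dev_rule *next;
};

/*
 * Handle one write to the (hypothetical) control file. The line format
 * "<major>:<minor> <weight>" is an assumption for this sketch; a weight
 * of 0 removes the rule, e.g. after the device is hot-removed.
 * Returns 0 on success, -1 on a parse or allocation error.
 */
int set_rule(struct dev_rule **head, const char *buf)
{
    unsigned major, minor, weight;

    if (sscanf(buf, "%u:%u %u", &major, &minor, &weight) != 3)
        return -1;

    for (struct dev_rule **pp = head; *pp; pp = &(*pp)->next) {
        struct dev_rule *r = *pp;
        if (r->major == major && r->minor == minor) {
            if (weight == 0) {          /* remove an existing rule */
                *pp = r->next;
                free(r);
            } else {                    /* update an existing rule */
                r->weight = weight;
            }
            return 0;
        }
    }
    if (weight == 0)                    /* nothing to remove */
        return 0;

    struct dev_rule *r = malloc(sizeof(*r));
    if (!r)
        return -1;
    r->major = major;
    r->minor = minor;
    r->weight = weight;
    r->next = *head;
    *head = r;
    return 0;
}

/* Look up the weight for a device; 0 means "no rule set". */
unsigned get_weight(const struct dev_rule *head, unsigned major,
                    unsigned minor)
{
    for (; head; head = head->next)
        if (head->major == major && head->minor == minor)
            return head->weight;
    return 0;
}
```

Rules for newly hot-added devices can be written at any time and rules for removed devices can be dropped, all without adding or removing control files.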
This is possible, but I wonder if this is really the way we should go. It looks like you tried implementing another ioctl-like interface on top of the cgroup control file interface. You can do anything you want with this interface, though.

I guess there should be at least some rules for implementing this kind of ioctl-like interface if they don't want to enhance the cgroup interface.

Thank you,
Hirokazu Takahashi.
--
Hi Tsuruta-san,

I took a look at your beautiful results!

When you have time, would you explain to me how you managed to measure the time and bandwidth, especially when you did the write() tests? Actually, I tried similar tests and failed to measure the bandwidth correctly. Did you insert something in the kernel source?

Thanks,
Takuya Yoshikawa
--
I'm using our own tool, which issues I/Os in parallel for a specified period and counts how many I/Os are issued and how many bytes are transferred in that period.

I'm also using our own tool for measuring throughput variation, to see the internal data of dm-ioband. This tool is implemented as a kernel module.

Thanks,
Ryo Tsuruta
--
