Hi,

> > Fernando Luis Vázquez Cao wrote:
> > >>> This seems to be the easiest part, but the current cgroups
> > >>> infrastructure has some limitations when it comes to dealing with block
> > >>> devices: impossibility of creating/removing certain control structures
> > >>> dynamically and hardcoding of subsystems (i.e. resource controllers).
> > >>> This makes it difficult to handle block devices that can be hotplugged
> > >>> and go away at any time (this applies not only to usb storage but also
> > >>> to some SATA and SCSI devices). To cope with this situation properly we
> > >>> would need hotplug support in cgroups, but, as suggested before and
> > >>> discussed in the past (see (0) below), there are some limitations.
> > >>>
> > >>> Even in the non-hotplug case it would be nice if we could treat each
> > >>> block I/O device as an independent resource, which means we could do
> > >>> things like allocating I/O bandwidth on a per-device basis. As long as
> > >>> performance is not compromised too much, adding some kind of basic
> > >>> hotplug support to cgroups is probably worth it.
> > >>>
> > >>> (0) http://lkml.org/lkml/2008/5/21/12
> > >> What about using major,minor numbers to identify each device and account
> > >> IO statistics? If a device is unplugged we could reset IO statistics
> > >> and/or remove IO limitations for that device from userspace (i.e. by a
> > >> daemon), but plugging/unplugging the device would not be blocked/aff ...
With the IO limiting approach, minimum requirements are supposed to be guaranteed if the user configures a generic block device so that the sum of the limits doesn't exceed the total IO bandwidth of that device. But, in principle, there's nothing in "throttling" that guarantees "fairness" among different cgroups doing IO on the same block devices, which means there's nothing to guarantee minimum requirements (and this is the reason why I liked Satoshi's CFQ-cgroup approach together with io-throttle).

A more complicated issue is how to evaluate the total IO bandwidth of a generic device. We can use some kind of averaging/prediction, but basically it would be inaccurate due to the mechanics of disks (head seeks, but also caching and buffering mechanisms implemented directly in the device, etc.). It's a hard problem. And the same problem exists for proportional bandwidth as well, in terms of IO rate predictability I mean.

The only difference is that with proportional bandwidth you know that (taking the same example reported by Hirokazu) with, e.g., 10 similar IO requests, 7 will be dispatched to the first cgroup and 3 to the other cgroup. So you don't need anything to guarantee "fairness", but in this case too it's hard to evaluate the cost of the 7 IO requests with respect to the cost of the other 3 IO requests as seen by user applications, which is the cost the users care about.

-Andrea
--
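To make the two models concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the function names, the weights 70/30 and the bandwidth figures are hypothetical and do not come from io-throttle, dm-ioband or CFQ-cgroup. The first helper checks the "sum of the limits" condition for the limiting approach; the second splits a batch of similar requests by weight, which reproduces the 7/3 split for 10 requests.

```python
from fractions import Fraction


def limits_fit_device(limits_bps, device_bps):
    """True if the per-cgroup limits do not oversubscribe the device.

    device_bps is an *estimate* of the total bandwidth; as noted in the
    thread, obtaining a reliable value is the hard part.
    """
    return sum(limits_bps.values()) <= device_bps


def proportional_dispatch(weights, n_requests):
    """Split n_requests among cgroups in proportion to their weights.

    Uses largest-remainder rounding so the counts always add up to
    n_requests. Assumes all requests have a similar cost.
    """
    total = sum(weights.values())
    exact = {cg: Fraction(w * n_requests, total) for cg, w in weights.items()}
    counts = {cg: int(share) for cg, share in exact.items()}
    leftover = n_requests - sum(counts.values())
    # Hand the remaining requests to the largest fractional parts.
    by_remainder = sorted(exact, key=lambda cg: exact[cg] - counts[cg],
                          reverse=True)
    for cg in by_remainder[:leftover]:
        counts[cg] += 1
    return counts


if __name__ == "__main__":
    print(limits_fit_device({"cg1": 30, "cg2": 60}, device_bps=100))
    print(proportional_dispatch({"cg1": 70, "cg2": 30}, 10))
```

Note that neither helper says anything about what the 7 requests cost compared with the other 3 once seeks are involved, which is exactly the open question above.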
BTW, as I said in a previous email, an interesting path to explore IMHO could be to think in terms of IO time: look at the time an IO request is issued to the drive, look at the time the request is served, evaluate the difference, and charge the consumed IO time to the appropriate cgroup. Then dispatch IO requests as a function of the consumed IO time debts/credits, using for example a token-bucket strategy. And probably the best place to implement the IO time accounting is the elevator.
--
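A rough sketch of that idea, written as user-space Python rather than elevator code. Everything here is an assumption made for illustration: the class name, the `rate` and `burst` parameters, and the rule that a cgroup may dispatch while it is not in debt. Tokens are seconds of disk service time; each completed request is charged `served_at - issued_at`.

```python
class IoTimeBucket:
    """Token bucket whose tokens are seconds of disk service time.

    Hypothetical helper for illustration; not taken from any posted patch.
    """

    def __init__(self, rate, burst, now=0.0):
        self.rate = rate        # seconds of IO time earned per wall-clock second
        self.burst = burst      # upper bound on accumulated credit, in seconds
        self.tokens = burst     # start with a full bucket
        self.last = now         # time of the last refill

    def _refill(self, now):
        earned = (now - self.last) * self.rate
        self.tokens = min(self.burst, self.tokens + earned)
        self.last = now

    def charge(self, issued_at, served_at):
        """Charge the IO time consumed by one completed request."""
        self.tokens -= served_at - issued_at

    def may_dispatch(self, now):
        """A cgroup may dispatch as long as it is not in debt."""
        self._refill(now)
        return self.tokens >= 0

    def delay_until_credit(self, now):
        """Seconds to wait until the debt is paid back (0 if none)."""
        self._refill(now)
        return max(0.0, -self.tokens / self.rate)


if __name__ == "__main__":
    bucket = IoTimeBucket(rate=0.5, burst=0.1)   # one bucket per cgroup
    bucket.charge(issued_at=0.0, served_at=0.3)  # a slow request: 300 ms
    print(bucket.may_dispatch(now=0.3))
    print(bucket.delay_until_credit(now=0.3))
```

One bucket per cgroup is assumed. Charging on completion means a cgroup can overdraw by one request, so the bucket tracks debt as a negative token count instead of refusing the charge.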
Please note that the seek time for a specific IO request is strongly correlated with the IO requests that preceded it, which means that the owner of that request is not the only one to blame if it takes too long to process. In other words, with the algorithm you propose we may end up charging the wrong guy.
--
Mmh... yes. The only scenario I can imagine where this solution is not fair is when there are a lot of guys always requesting the same nearby blocks and a single guy looking for a single distant block (supposing disk seeks are more expensive than read/write ops). In this case it would be fair to charge a huge amount only to the guy requesting the single distant block, and to distribute the cost of the seek that moves the head back equally among the other guys. Using the algorithm I proposed, instead, both the single "bad" guy and the first "good" guy that moves the disk head back would spend a large sum of IO credits.

-Andrea
--
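The scenario can be reproduced with a toy model. This is a sketch under loud assumptions: seek cost is taken to be linear in the block distance from the previous request, every transfer costs one unit, and the owner names and block numbers are made up. Real disks do not behave this simply.

```python
def charge_by_service_time(requests, start_block=0,
                           seek_cost_per_block=1.0, transfer_cost=1.0):
    """Charge each request's full service cost to its owner.

    requests is a list of (owner, block) pairs in dispatch order. The
    cost of a request depends on where the *previous* request left the
    head, which is why the charge can land on the wrong owner.
    """
    head = start_block
    charges = {}
    for owner, block in requests:
        cost = abs(block - head) * seek_cost_per_block + transfer_cost
        charges[owner] = charges.get(owner, 0.0) + cost
        head = block
    return charges


if __name__ == "__main__":
    # A, B and C read neighbouring blocks; D reads one distant block.
    order = [("A", 100), ("B", 101), ("C", 102),
             ("D", 9000),
             ("A", 103), ("B", 104), ("C", 105)]
    print(charge_by_service_time(order, start_block=100))
```

In this run D pays for the long seek out, and A, who merely happens to be the first one served afterwards, pays for the long seek back. B and C pay almost nothing, although all three "good" guys behave identically.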
Hi,

I have a question about your description. In I/O control, what do you think "fair" means among cgroups? I have been confused about it lately.

IMHO, if cgroups have the same access time and the same access opportunity for disk I/O regardless of their I/O style (sequential / random / mixed / …), I think it is fair. Of course, in this fair situation, cgroups with the same priority or weight can get different I/O bandwidth, but I think it will stay within a reasonable range.

So, if cgroups with fast I/O are sacrificed for a cgroup with very slow I/O in order to equalize the I/O quantity, that can be considered "unfair" to the cgroups with fast I/O.

Am I getting something wrong about the "fair" concept? This is just my opinion :) I welcome and appreciate other opinions and comments about this.

PS) Andrea, this question is not related to the io-controller, but I just wonder whether your other project, network io-throttle, is still going on. My colleague has researched a similar project and he is trying to implement another one. I am also interested in a net io-controller.

Thank you,
Dong-Jae Kang
--
Good question, thanks!

fair = equally distribute the IO cost and the throttling among cgroups, instead of equally among processes, and then equally among the processes belonging to the same cgroup.

In the previous scenario the process that moves the disk head back wouldn't be charged for the whole IO cost. It's the cgroup it belongs to that would be charged instead. So the accounting is perfectly fair from this point of view, because the cgroup credits are shared among the processes within the cgroup.

The IO controller, instead, should be able to apply throttling in a "fair" way. That means that when the credits are exhausted it should distribute the throttling time equally among the processes within the cgroup, i.e. impose a sleep of total_time_to_sleep/N on each process (where N is the number of processes in the cgroup) or, even better, distribute total_time_to_sleep as a function of the IO each task previously generated, looking at the IO taskstats for example (/proc/PID/io). But this is another problem anyway.

For net-io-controller there's a better solution than mine, have a look at this: http://lkml.org/lkml/2008/7/24/455

-Andrea
--
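The two ways of distributing the throttling time can be sketched as follows. The function names are hypothetical, and the per-task byte counts are assumed to have been collected elsewhere (for example from the per-task IO accounting mentioned above); the sketch only shows the arithmetic.

```python
def split_sleep_equally(total_time_to_sleep, pids):
    """Impose total_time_to_sleep / N on each of the N processes."""
    if not pids:
        return {}
    share = total_time_to_sleep / len(pids)
    return {pid: share for pid in pids}


def split_sleep_by_io(total_time_to_sleep, io_bytes_by_pid):
    """Distribute the sleep time in proportion to each task's past IO.

    io_bytes_by_pid maps pid -> bytes of IO previously generated.
    Falls back to an equal split when nobody has done any IO yet.
    """
    total_io = sum(io_bytes_by_pid.values())
    if total_io == 0:
        return split_sleep_equally(total_time_to_sleep, list(io_bytes_by_pid))
    return {pid: total_time_to_sleep * nbytes / total_io
            for pid, nbytes in io_bytes_by_pid.items()}


if __name__ == "__main__":
    print(split_sleep_equally(1.2, [10, 11, 12]))
    print(split_sleep_by_io(1.0, {10: 300, 11: 100}))
```

With the weighted variant a task that generated three times the IO of its sibling sleeps three times as long, while the total sleep imposed on the cgroup stays the same.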
Actually it's a little-known easy problem. The capacity planning community does it all the time, but then describes it in terms that are only interesting (intelligible?) to an enthusiastic amateur mathematician (;-))

One finds the point, called N*, at which the throughput flattens out and the response time starts to grow without bound, and calls that level the maximum.

In practice, one does an easier variant. One sets a response-time limit and throttles *everyone* proportionally when the disk starts to regularly degrade beyond the limit. Interestingly, because we're slowing the application to prevent slowing the disks, the value we pick needn't be terribly precise. It also doesn't require any pre-knowledge about the disks.

Send me a note if you want to discuss this in more detail.

--dave
--
David Collier-Brown        | Always do right. This will gratify
Sun Microsystems, Toronto  | some people and astonish the rest
davecb@sun.com             |                      -- Mark Twain
cell: (647) 833-9377, bridge: (877) 385-4099 code: 506 9191#
--
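One possible reading of the "easier variant", as a small Python sketch. The details are assumptions, not something stated in the message: "regularly degrade" is interpreted as more than half of the recent response-time samples exceeding the limit, and the throttle is a single multiplicative factor that is tightened or relaxed by a fixed step on each control interval.

```python
def next_throttle_factor(samples_ms, limit_ms, factor=1.0,
                         step=0.9, min_factor=0.1, degraded_share=0.5):
    """One control step: return the new global throttle factor.

    samples_ms are recent disk response times. If more than
    degraded_share of them exceed limit_ms, tighten the throttle;
    otherwise relax it back toward 1.0 (no throttling).
    """
    if not samples_ms:
        return factor
    over = sum(1 for s in samples_ms if s > limit_ms)
    if over / len(samples_ms) > degraded_share:
        return max(min_factor, factor * step)
    return min(1.0, factor / step)


def throttled_rates(rates, factor):
    """Scale every cgroup's allowed rate by the same factor."""
    return {cg: rate * factor for cg, rate in rates.items()}


if __name__ == "__main__":
    factor = next_throttle_factor([5, 30, 40, 35], limit_ms=20)
    print(factor)
    print(throttled_rates({"cg1": 100.0, "cg2": 50.0}, factor))
```

Because every cgroup is scaled by the same factor, the ratios between cgroups are preserved, and no estimate of the device's total bandwidth is needed, only the response-time limit.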
Yes, it would be really cool if we could provide hard bandwidth guarantees, but it certainly does not look like a trivial task. To achieve that, among other things, we would need to take into account both the topology of block devices (RAID type, etc.) and the physical characteristics of the disks that compose them.

The former problem could be tackled at the block layer, since it is there that stacking devices are implemented. But it is the elevators that should examine the characteristics of the underlying devices and schedule IO in such a way that the variable factors, such as seek times, do not compromise the hard bandwidth requirements (of course, it would also be nice if we did not kill global I/O performance in the process). Finally, such an elevator would still need to cooperate with the block layer to make further topology-dependent adjustments.

- Fernando

P.S.: For some reason I received neither Dong-Jae's email nor yours, so I had to pick it up from the mailing list. I would appreciate it if you kept me CCed.
--
Hi, Takahashi-san,

In my previous posting, what I meant was an absolute guarantee of minimum bandwidth, regardless of disk seek time or the I/O type (sequential, random, or mixed …) of the process.

I also basically prefer proportional share depending on priority or weight, and I think it is meaningful, as in dm-ioband, 2-layer CFQ (Satoshi) and 2-layer CFQ (Vasily). But, in addition, I think an absolute guarantee of the minimum bandwidth will be required, and several related companies want it to be supported, because proportional share makes performance hard to predict, as Andrea mentioned before.

Yes, I agree with you. This was what I intended to say.

Thank you,
Dong-Jae Kang
--
