>
> If we are pursuing an I/O prioritization model à la CFQ, the temptation
> is to implement it at the elevator layer or to extend one of the
> existing I/O schedulers.
>
> There have been several proposals that extend either the CFQ scheduler
> (see (1), (2) below) or the AS scheduler (see (3) below). The problem
> with these controllers is that they are scheduler dependent, which means
> that they become unusable when we change the scheduler or when we want
> to control stacking devices which define their own make_request_fn
> function (md and dm come to mind). It could be argued that the physical
> devices controlled by a dm or md driver are likely to be fed by
> traditional I/O schedulers such as CFQ, but these I/O schedulers would
> be running independently from each other, each one controlling its own
> device and ignoring the fact that it is part of a stacking device. This
> lack of information at the elevator layer makes it pretty difficult to
> obtain accurate results when using stacking devices. It seems that
> unless we can make the elevator layer aware of the topology of stacking
> devices (possibly by extending the elevator API?), elevator-based
> approaches do not constitute a generic solution. From here onwards, for
> discussion purposes, I will refer to this type of I/O bandwidth
> controller as an elevator-based I/O controller.
>
> A simple way of solving the problems discussed in the previous paragraph
> is to perform I/O control before the I/O actually enters the block
> layer, either at the pagecache level (when pages are dirtied) or at the
> entry point to the generic block layer (generic_make_request()).
> Andrea's I/O throttling patches stick to the former variant (see (4)
> below) and Tsuruta-san and Takahashi-san's dm-ioband (see (5) below)
> takes the latter approach. The rationale is that by hooking into the
> source of I/O requests we can perform I/O control in a
> topology-agnostic and elevator-agnostic way. I will refer to this new
> type of I/O bandwidth controller as a block layer I/O controller.
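>
> To illustrate why a single hook at the submission point is independent
> of both the device topology and the elevator, here is a minimal
> userspace sketch in C. It is not taken from dm-ioband or from Andrea's
> patches: the names io_group, io_req, group_refill() and submit_gate(),
> as well as the per-period sector budget, are assumptions made only for
> this example.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-cgroup budget, refilled once per accounting period. */
struct io_group {
    unsigned long tokens;   /* sectors the group may still submit */
    unsigned long refill;   /* sectors granted at each period start */
};

/* Hypothetical stand-in for a bio: who issued it and how big it is. */
struct io_req {
    struct io_group *group;
    unsigned long sectors;
};

/* Start a new accounting period for the group. */
static void group_refill(struct io_group *g)
{
    g->tokens = g->refill;
}

/*
 * Single gate at the submission point. It looks only at the issuing
 * group, never at the target device or its scheduler, so the same check
 * applies to plain disks and to stacking devices alike.
 * Returns true if the request may proceed now, false if the caller
 * should hold it until the next refill. Note that in this toy version a
 * request larger than the refill amount would never pass.
 */
static bool submit_gate(struct io_req *req)
{
    struct io_group *g = req->group;

    if (req->sectors > g->tokens)
        return false;
    g->tokens -= req->sectors;
    return true;
}
```

> A real controller would also have to queue the held requests and wake
> them up on refill; the sketch only shows where the decision is made.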
>
> By residing just above the generic block layer the implementation of a
> block layer I/O controller becomes relatively easy, but by not taking
> into account the characteristics of the underlying devices we might risk
> underutilizing them. For this reason, in some cases it would probably
> make sense to complement a generic I/O controller with an
> elevator-based I/O controller, so that the maximum throughput can be
> squeezed from the physical devices.
>
> (1) Uchida-san's CFQ-based scheduler: http://lwn.net/Articles/275944/
> (2) Vasily's CFQ-based scheduler: http://lwn.net/Articles/274652/
> (3) Naveen Gupta's AS-based scheduler: http://lwn.net/Articles/288895/
> (4) Andrea Righi's I/O bandwidth controller (I/O throttling): http://thread.gmane.org/gmane.linux.kernel.containers/5975
> (5) Tsuruta-san and Takahashi-san's dm-ioband: http://thread.gmane.org/gmane.linux.kernel.virtualization/6581
>
> 6.- I/O tracking
>
> This is arguably the most important part, since to perform I/O control
> we need to be able to determine where the I/O is coming from.
>
> Reads are trivial because they are served in the context of the task
> that generated the I/O. But most writes are performed by pdflush,
> kswapd, and friends, so performing I/O control only in the synchronous
> I/O path would lead to large inaccuracies. To get this right we would need
> to track ownership all the way up to the pagecache page. In other words,
> it is necessary to track who is dirtying pages so that when they are
> written to disk the right task is charged for that I/O.
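>
> The idea can be sketched in a few lines of userspace C. This is only a
> toy model under stated assumptions, not the bio-cgroup implementation:
> the types page_track and io_tracker, the functions track_dirty() and
> writeback_page(), the fixed array sizes, and the "first dirtier owns
> the page" policy are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define NR_PAGES  8      /* toy page cache size (assumption) */
#define NR_GROUPS 4      /* toy number of cgroups (assumption) */
#define NO_OWNER  (-1)   /* marks a clean, unowned page */

/* Hypothetical per-page tracking info: which group dirtied the page. */
struct page_track {
    int owner;
};

struct io_tracker {
    struct page_track pages[NR_PAGES];
    unsigned long charged[NR_GROUPS];   /* pages written back per group */
};

static void tracker_init(struct io_tracker *t)
{
    for (size_t i = 0; i < NR_PAGES; i++)
        t->pages[i].owner = NO_OWNER;
    for (size_t g = 0; g < NR_GROUPS; g++)
        t->charged[g] = 0;
}

/* Runs in the context of the task that dirties the page. */
static void track_dirty(struct io_tracker *t, size_t page, int group)
{
    assert(page < NR_PAGES && group >= 0 && group < NR_GROUPS);
    /* Policy chosen for this sketch: the first dirtier keeps ownership
     * until the page is written back. */
    if (t->pages[page].owner == NO_OWNER)
        t->pages[page].owner = group;
}

/*
 * Runs in the flusher's context (think pdflush), which is not the task
 * that dirtied the page. The recorded owner is charged, not the caller.
 * Returns the charged group, or NO_OWNER if the page was clean.
 */
static int writeback_page(struct io_tracker *t, size_t page)
{
    assert(page < NR_PAGES);
    int owner = t->pages[page].owner;

    if (owner != NO_OWNER) {
        t->charged[owner]++;
        t->pages[page].owner = NO_OWNER;
    }
    return owner;
}
```

> The point of the sketch is that the write is accounted to whoever
> dirtied the page, regardless of which thread ends up issuing the I/O.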
>
> Fortunately, such tracking of pages is one of the things the existing
> memory resource controller is doing to control memory usage. This is a
> clever observation which has a useful implication: if the rather
> intertwined tracking and accounting parts of the memory resource
> controller were split, the I/O controller could leverage the existing
> infrastructure to track buffered and asynchronous I/O. This is exactly
> what the bio-cgroup (see (6) below) patches set out to do.
>
> It is also possible to do without I/O tracking. For that we would need
> to hook into the synchronous I/O path and every place in the kernel
> where pages are dirtied (see (4) above for details). However,
> controlling the rate at which a cgroup can generate dirty pages seems to
> be a task that belongs in the memory controller, not the I/O controller.
> As Dave and Paul suggested, it is probably better to delegate this to
> the memory controller. In fact, it seems that Yamamoto-san is cooking
> some patches that implement just that: dirty balancing for cgroups (see
> (7) below for details).
>
> Another argument in favor of I/O tracking is that not only block layer
> I/O controllers would benefit from it, but also the existing I/O
> schedulers and the elevator-based I/O controllers proposed by
> Uchida-san, Vasily, and Naveen (Yoshikawa-san, who is CCed, and I are
> working on this and will hopefully be sending patches soon).
>
> (6) Tsuruta-san and Takahashi-san's I/O tracking patches: http://lkml.org/lkml/2008/8/4/90
> (7) Yamamoto-san's dirty balancing patches: http://lwn.net/Articles/289237/
>
> *** How to move on
>
> As discussed before, it probably makes sense to have both a block layer
> I/O controller and an elevator-based one, and they could certainly
> coexist. Since all of them need I/O tracking capabilities, I would like
> to suggest the plan below to get things started:
>
> - Improve the I/O tracking patches (see (6) above) until they are in
> mergeable shape.
> - Modify CFQ and AS to use the new I/O tracking functionality to show its
> benefits. If the performance impact is acceptable, this should suffice to
> convince the respective maintainers and get the I/O tracking patches
> merged.
> - Implement a block layer resource controller. dm-ioband is a working,
> feature-rich solution, but its dependency on the dm infrastructure is
> likely to meet opposition (the dm layer does not handle barriers
> properly and the maximum size of I/O requests can be limited in some
> cases). In such a case, we could either try to build a standalone
> resource controller based on dm-ioband (which would probably hook into
> generic_make_request) or try to come up with something new.
> - If the I/O tracking patches make it into the kernel, we could move on
> and try to get the cgroup extensions to CFQ and AS mentioned before (see
> (1), (2), and (3) above for details) merged.
> - Delegate the task of controlling the rate at which a task can
> generate dirty pages to the memory controller.
>
> This RFC is somewhat vague, but my feeling is that we should build some
> consensus on the goals and basic design aspects before delving into
> implementation details.
>
> I would appreciate your comments and feedback.