When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do the writeback. But this dirtier may not be the one who took the system to this state. Instead, if we can track the dirty count per-file, we could throttle the dirtier of a file, when the file's dirty pages exceed a certain limit. Even though this dirtier may not be the one who dirtied the other pages of this file, it is fair to throttle this process, as it uses that file. This patch 1. Adds dirty page accounting per-file. 2. Exports the number of pages of this file in cache and no of pages dirty via proc-fdinfo. 3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a files dirty data exceeds this limit, the writeback of that inode is done by the current dirtier. This certainly will affect the throughput of certain heavy-dirtying workloads, but should help for interactive systems. Signed-off-by: Nikanth Karthikesan <knikanth@suse.de> --- diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt index b606c2c..4f8bc06 100644 --- a/Documentation/sysctl/vm.txt +++ b/Documentation/sysctl/vm.txt @@ -133,6 +133,15 @@ Setting this to zero disables periodic writeback altogether. ============================================================== +file_dirty_bytes + +When a files total dirty data exceeds file_dirty_bytes, the current generator +of dirty data would be made to do the writeback of dirty pages of that file. + +0 disables this behaviour. + +============================================================== + drop_caches Writing to this will cause the kernel to drop clean caches, dentries and diff --git a/fs/proc/base.c b/fs/proc/base.c index a1c43e7..23becb2 100644 --- a/fs/proc/base.c +++ b/fs/proc/base.c @@ -83,6 +83,8 @@ #include <linux/pid_namespace.h> #include <linux/fs_struct.h> #include <linux/slab.h> +#include <linux/pagemap.h> +#include <linux/writeback.h> #include "internal.h" /* NOTE: @@ -1791,7 +1793,7 @@ out: return ~0U; } ...
I'm not really charmed by this.. it adds another random variable to prod at. Nor does it really tell me why you're wanting to do this. We already have per-task invluence on the dirty limits, a task that sporadically dirties pages (your vi) will already end up with a higher dirty limit This seems wrong, wth are you doing it here and not in the generic Idem, replicating that code at every site that can dirty a page is just wrong, hook into the regular set_page_dirty()->balance_dirty_pages() So you're adding a extra cacheline to dirty even though its not used by Preferably we don't add any extra fields under tree_lock so that we can easily split it up if/when we decide to use a fine-grain locked radix tree. Also, like mentioned, you just added a whole new cacheline to dirty. --
Oh, nice. Per-task limit is an elegant solution, which should help during most of the common cases. But I just wonder what happens, when 1. The dirtier is multiple co-operating processes 2. Some app like a shell script, that repeatedly calls dd with seek and skip? People do this for data deduplication, sparse skipping etc.. 3. The app dies and comes back again. Like a VM that is rebooted, and continues writing to a disk backed by a file on the host. Yes, this should be moved to the single site, inside generic Right, this could be avoided. Some f(vm_dirty_ratio) should be used. I just wanted to provide a way to disable this behaviour at run-time. Thanks --
Those cases do indeed defeat the current per-task-limit, however I think the solution to that is to limit the amount of writeback done by each blocked process. Jan Kara had some good ideas in that department. --
Blocked on what? Sorry, I do not understand. Thanks --
balance_dirty_pages(), by limiting the work done there (or actually, the amount of page writeback completions you wait for -- starting IO isn't that expensive), you can also affect the time it takes, and therefore influence the impact. --
There is an ongoing effort to look at per-cgroup dirty limits and I honestly think it would be nice to do it at that level first. We need it there as a part of the overall I/O controller. As a specialized need it could handle your case as well. -- Three Cheers, Balbir --
Well, it would be good to isolate that to the cgroup code. Also from what I understood, the plan was to simply mark dirty inodes with a cgroup and use that from writeout_inodes() to write out inodes specifically used by that cgroup. That is, on top of what Andrea Righi already proposed, which would provide the actual per cgroup dirty limit (although the per-bdi proportions applied to a cgroup limit aren't strictly correct, but that seems to be something you'll have to live with, a per-bdi-per-cgroup proportion would simply be accounting insanity). That is a totally different thing than what was proposed. --
Understood, I was indirectly trying to get Nikanth to look at cgroups since he was interested in the dirtier (as in task). -- Three Cheers, Balbir --
But this has nothing special to do with the cases like multi-threaded dirtier, which is why I was confused. :) Thanks Nikanth --
I agree with your problem description, a single program which writes a single large file can make an interactive system suck. Creating a 25+GB Blu-Ray image will often saturate the buffer space. I played with per-fd limiting during 2.5.xx development and I had an app writing 5-10GB files. While I wanted to get I found that the effect was about the same as forcing the application to use O_DIRECT, and since it was our application I could do that. Not all badly-behaved programs are open source, so that addressed my issue but not the general case. I think you really need to track by process, not file, as you said "Even though this dirtier may not be the one who dirtied the other pages of this file..." that doesn't work, you block a process which is contributing minimally to the problem while letting the real problem process continue. Ex: a log file, with one process spewing error messages while others write a few lines/min. You have to get it right, I think. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot --
Bill, The block layer in recent kernels are much better at preventing SYNC read/write from being delayed by lots of ASYNC writeback requests. And we are attacking the other responsiveness problems under light Good point. Peter implemented that idea long ago in upstream kernel, see the comment for task_dirty_limit() in commit 1babe1838. Thanks, Fengguang --
Nikanth, there's a more elegant solution in upstream kernel. See the comment for task_dirty_limit() in commit 1babe1838. NFS may want to limit per-file dirty pages, to prevent long stall time inside the nfs_getattr()->filemap_write_and_wait() calls (and problems like that). Peter Staubach has similar ideas on it. Thanks, Fengguang --
