Re: [RFC][PATCH] Per file dirty limit throttling

Previous thread: linux-next: build warning in Linus' tree by Stephen Rothwell on Sunday, August 15, 2010 - 9:00 pm. (1 message)

Next thread: linux-next: manual merge of the alacrity tree with Linus' tree by Stephen Rothwell on Sunday, August 15, 2010 - 9:19 pm. (1 message)
From: Nikanth Karthikesan
Date: Sunday, August 15, 2010 - 9:19 pm

When the total dirty pages exceed vm_dirty_ratio, the dirtier is made to do
the writeback. But this dirtier may not be the one who took the system to this
state. Instead, if we can track the dirty count per-file, we could throttle
the dirtier of a file, when the file's dirty pages exceed a certain limit.
Even though this dirtier may not be the one who dirtied the other pages of
this file, it is fair to throttle this process, as it uses that file.

This patch
1. Adds dirty page accounting per-file.
2. Exports the number of pages of this file in cache and no of pages dirty via
proc-fdinfo.
3. Adds a new tunable, /proc/sys/vm/file_dirty_bytes. When a files dirty data
exceeds this limit, the writeback of that inode is done by the current
dirtier.

This certainly will affect the throughput of certain heavy-dirtying workloads,
but should help for interactive systems.

Signed-off-by: Nikanth Karthikesan <knikanth@suse.de>

---

diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index b606c2c..4f8bc06 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -133,6 +133,15 @@ Setting this to zero disables periodic writeback altogether.
 
 ==============================================================
 
+file_dirty_bytes
+
+When a files total dirty data exceeds file_dirty_bytes, the current generator
+of dirty data would be made to do the writeback of dirty pages of that file.
+
+0 disables this behaviour.
+
+==============================================================
+
 drop_caches
 
 Writing to this will cause the kernel to drop clean caches, dentries and
diff --git a/fs/proc/base.c b/fs/proc/base.c
index a1c43e7..23becb2 100644
--- a/fs/proc/base.c
+++ b/fs/proc/base.c
@@ -83,6 +83,8 @@
 #include <linux/pid_namespace.h>
 #include <linux/fs_struct.h>
 #include <linux/slab.h>
+#include <linux/pagemap.h>
+#include <linux/writeback.h>
 #include "internal.h"
 
 /* NOTE:
@@ -1791,7 +1793,7 @@ out:
 	return ~0U;
 }
 ...
From: Peter Zijlstra
Date: Monday, August 16, 2010 - 4:05 am

I'm not really charmed by this.. it adds another random variable to prod
at. Nor does it really tell me why you're wanting to do this. We already
have per-task invluence on the dirty limits, a task that sporadically
dirties pages (your vi) will already end up with a higher dirty limit


This seems wrong, wth are you doing it here and not in the generic

Idem, replicating that code at every site that can dirty a page is just
wrong, hook into the regular set_page_dirty()->balance_dirty_pages()

So you're adding a extra cacheline to dirty even though its not used by


Preferably we don't add any extra fields under tree_lock so that we can
easily split it up if/when we decide to use a fine-grain locked radix
tree.

Also, like mentioned, you just added a whole new cacheline to dirty.
--

From: Nikanth Karthikesan
Date: Monday, August 16, 2010 - 10:09 pm

Oh, nice.  Per-task limit is an elegant solution, which should help during 
most of the common cases.

But I just wonder what happens, when
1. The dirtier is multiple co-operating processes
2. Some app like a shell script, that repeatedly calls dd with seek and skip? 
People do this for data deduplication, sparse skipping etc..
3. The app dies and comes back again. Like a VM that is rebooted, and 
continues writing to a disk backed by a file on the host.


Yes, this should be moved to the single site, inside generic 

Right, this could be avoided. Some f(vm_dirty_ratio) should be used. I just 
wanted to provide a way to disable this behaviour at run-time.

Thanks

--

From: Peter Zijlstra
Date: Tuesday, August 17, 2010 - 1:24 am

Those cases do indeed defeat the current per-task-limit, however I think
the solution to that is to limit the amount of writeback done by each
blocked process.

Jan Kara had some good ideas in that department.
--

From: Nikanth Karthikesan
Date: Wednesday, August 18, 2010 - 2:22 am

Blocked on what? Sorry, I do not understand.

Thanks
--

From: Peter Zijlstra
Date: Wednesday, August 18, 2010 - 2:58 am

balance_dirty_pages(), by limiting the work done there (or actually, the
amount of page writeback completions you wait for -- starting IO isn't
that expensive), you can also affect the time it takes, and therefore
influence the impact.


--

From: Balbir Singh
Date: Wednesday, August 18, 2010 - 7:08 am

There is an ongoing effort to look at per-cgroup dirty limits and I
honestly think it would be nice to do it at that level first. We need
it there as a part of the overall I/O controller. As a specialized
need it could handle your case as well. 

-- 
	Three Cheers,
	Balbir
--

From: Peter Zijlstra
Date: Wednesday, August 18, 2010 - 7:25 am

Well, it would be good to isolate that to the cgroup code. Also from
what I understood, the plan was to simply mark dirty inodes with a
cgroup and use that from writeout_inodes() to write out inodes
specifically used by that cgroup.

That is, on top of what Andrea Righi already proposed, which would
provide the actual per cgroup dirty limit (although the per-bdi
proportions applied to a cgroup limit aren't strictly correct, but that
seems to be something you'll have to live with, a per-bdi-per-cgroup
proportion would simply be accounting insanity).

That is a totally different thing than what was proposed.


--

From: Balbir Singh
Date: Wednesday, August 18, 2010 - 7:48 am

Understood, I was indirectly trying to get Nikanth to look at cgroups
since he was interested in the dirtier (as in task).


-- 
	Three Cheers,
	Balbir
--

From: Nikanth Karthikesan
Date: Monday, August 23, 2010 - 5:19 am

But this has nothing special to do with the cases like multi-threaded dirtier, 
which is why I was confused. :)

Thanks
Nikanth
--

From: Bill Davidsen
Date: Monday, August 16, 2010 - 9:53 am

I agree with your problem description, a single program which writes a single 
large file can make an interactive system suck. Creating a 25+GB Blu-Ray image 
will often saturate the buffer space. I played with per-fd limiting during 
2.5.xx development and I had an app writing 5-10GB files. While I wanted to get 
I found that the effect was about the same as forcing the application to use 
O_DIRECT, and since it was our application I could do that. Not all 
badly-behaved programs are open source, so that addressed my issue but not the 
general case.

I think you really need to track by process, not file, as you said "Even though 
this dirtier may not be the one who dirtied the other pages of this file..." 
that doesn't work, you block a process which is contributing minimally to the 
problem while letting the real problem process continue. Ex: a log file, with 
one process spewing error messages while others write a few lines/min. You have 
to get it right, I think.

-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--

From: Wu Fengguang
Date: Monday, August 16, 2010 - 7:53 pm

Bill,


The block layer in recent kernels are much better at preventing SYNC
read/write from being delayed by lots of ASYNC writeback requests.
And we are attacking the other responsiveness problems under light

Good point. Peter implemented that idea long ago in upstream kernel,
see the comment for task_dirty_limit() in commit 1babe1838.

Thanks,
Fengguang
--

From: Wu Fengguang
Date: Monday, August 16, 2010 - 7:41 pm

Nikanth, there's a more elegant solution in upstream kernel.
See the comment for task_dirty_limit() in commit 1babe1838.

NFS may want to limit per-file dirty pages, to prevent long stall time
inside the nfs_getattr()->filemap_write_and_wait() calls (and problems
like that). Peter Staubach has similar ideas on it.

Thanks,
Fengguang
--

Previous thread: linux-next: build warning in Linus' tree by Stephen Rothwell on Sunday, August 15, 2010 - 9:00 pm. (1 message)

Next thread: linux-next: manual merge of the alacrity tree with Linus' tree by Stephen Rothwell on Sunday, August 15, 2010 - 9:19 pm. (1 message)