Re: [PATCH -mm] page-writeback: fine-grained dirty_ratio and dirty_background_ratio

Previous thread: ps3: BUG: spinlock lockup on CPU#1, udevd/505 or modprobe/633 by Geert Uytterhoeven on Friday, September 12, 2008 - 8:02 am. (2 messages)

Next thread: [RFC] [PATCH -mm 1/2] memcg dirty_ratio and additional page statistics by Andrea Righi on Friday, September 12, 2008 - 8:09 am. (1 message)
From: Andrea Righi
Date: Friday, September 12, 2008 - 8:09 am

The goal of the patch is to control how many dirty file pages a cgroup can have
at any given time (see also [1]).

Dirty file and writeback pages are accounted for each cgroup using the memory
controller statistics. Moreover, the dirty_ratio parameter is added to the
memory controller. It contains, as a percentage of the cgroup memory, the
number of dirty pages at which the processes belonging to the cgroup which are
generating disk writes will start writing out dirty data.

So, the behaviour is actually the same as the global dirty_ratio, except that
it works per cgroup.

Interface:
- two new entries "writeback" and "filedirty" are added to the file
  memory.stat, to export to userspace respectively the number of pages under
  writeback and the number of dirty file pages in the cgroup

- the new file memory.dirty_ratio is added in the cgroup filesystem to show/set
  the memcg dirty_ratio

[ This patch is still experimental and I have only done a few quick tests. I'd
like to run more detailed benchmarks and compare the results; I guess the
overhead introduced by this patch shouldn't be so small... and BTW I would
prefer a dirty limit in bytes, instead of using a percentage of memory. Bytes
are much more flexible IMHO: they allow finer-grained limits to be defined, so
this would work better on large-memory machines. ]

[1] http://lkml.org/lkml/2008/9/9/245

-Andrea
--

From: Andrew Morton
Date: Friday, September 12, 2008 - 1:18 pm

On Fri, 12 Sep 2008 17:09:50 +0200


I tend to duck experimental and rfc patches ;)

One thing to think about please: Michael Rubin is hitting problems with
the existing /proc/sys/vm/dirty_ratio.  Its present granularity of 1%
is just too coarse for really large machines, and as
memory-size/disk-speed ratios continue to increase, this will just get
worse.

So after thinking about it a bit I encouraged him to propose a patch
which adds a new /proc/sys/vm/hires-dirty-ratio (for some value of
"hires" ;)) which simply offers a higher-resolution interface to the
same internal kernel machinery.

How does this affect you?  I don't think we should be adding new
interfaces which have the old 1%-resolution problem.  Once we get this
higher-resolution interface sorted out, your new interface should do it
the same way.


--

From: Andrea Righi
Date: Friday, September 12, 2008 - 4:04 pm

Totally agree.

The hires-dirty-ratio interface seems much better. I'll follow the progress
of this new interface; reusing the same approach in my patch doesn't look too
difficult in any case.

BTW why not use a simple dirty-ratio-in-bytes?

Thanks for commenting,
-Andrea
--

From: Andrew Morton
Date: Friday, September 12, 2008 - 4:10 pm

On Sat, 13 Sep 2008 01:04:35 +0200

s/ratio/amount/  ;)

No particular reason - I haven't really thought about it frankly.

A "ratio" might make more sense in a containerised setup, particularly
if the container can be resized on the fly.
--

From: Michael Rubin
Date: Monday, September 22, 2008 - 3:26 pm

Currently the problem we are hitting is that we cannot specify pdflush
to have background limits less than 1% of memory. I am currently
finishing up a patch right now that adds a dirty_ratio_millis
interface.  I hope to submit the patch to LKML by the end of the week.

The idea is that we don't want to break backwards compatibility and we
also don't want to have two conflicting knobs in the sysctl or
/proc/sys/vm/ space. I thought adding a new knob for those who want to
specify finer grained functionality was a compromise. So the patch has
a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
specify 0-100% and the second to specify .0 to .999%.

So to represent 0.125% of RAM we set
vm_dirty_ratio = 0
vm_dirty_ratio_millis = 125

The same for the background_ratio.
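
As a rough illustration of how the two knobs could combine into a single
threshold, here is a minimal C sketch. The function name, its parameters and
the use of a page count as the base are assumptions made for the example; they
are not taken from the patch under discussion.

```c
#include <stdint.h>

/*
 * Hypothetical helper: combine a whole-percent knob (0..100) and a
 * thousandths-of-a-percent knob (0..999) into a threshold in pages.
 */
static uint64_t dirty_threshold_pages(uint64_t base_pages,
                                      unsigned int ratio,
                                      unsigned int ratio_millis)
{
    /* Express the whole setting in thousandths of a percent: 1% == 1000. */
    uint64_t millipercent = (uint64_t)ratio * 1000 + ratio_millis;

    /* 100% == 100000 thousandths of a percent. */
    return base_pages * millipercent / 100000;
}
```

With ratio = 0 and ratio_millis = 125, a base of 1,000,000 pages gives a
threshold of 1,250 pages, i.e. 0.125%.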

Any feedback?

mrubin

On Fri, Sep 12, 2008 at 4:10 PM, Andrew Morton
--

From: Michael Rubin
Date: Monday, September 22, 2008 - 4:41 pm

On Fri, Sep 12, 2008 at 1:18 PM, Andrew Morton

Re-sending since I top-posted before. Never again. Also adding more
thoughts on a byte based interface.

Currently the problem we are hitting is that we cannot specify pdflush
to have background limits less than 1% of memory. I am currently
finishing up a patch right now that adds a dirty_ratio_millis
interface.  I hope to submit the patch to LKML by the end of the week.

The idea is that we don't want to break backwards compatibility and we
also don't want to have two conflicting knobs in the sysctl or
/proc/sys/vm/ space. I thought adding a new knob for those who want to
specify finer grained functionality was a compromise. So the patch has
a vm_dirty_ratio and a vm_dirty_ratio_millis interface. The first to
specify 0-100% and the second to specify .0 to .999%.

So to represent 0.125% of RAM we set
vm_dirty_ratio = 0
vm_dirty_ratio_millis = 125

The same for the background_ratio.

I would also prefer using a bytes interface but I am not sure how to
offer that without either removing the legacy ratio interface
or offering a concurrent interface that might be confusing, such as
when users are looking at the old one and are not aware of the new one.

Any feedback?

mrubin
--

From: Andrea Righi
Date: Tuesday, September 23, 2008 - 5:50 am

I think using millis is ok today, but it may not scale well to systems
with 1TB of memory (in this case the min granularity would be 10MB).
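
The granularity figure can be checked with a small C sketch. It assumes that
"millis" means thousandths of one percent (so 100% is split into 100,000
steps) and that 1TB means 2^40 bytes; the function name is made up for the
example.

```c
#include <stdint.h>

/* Smallest non-zero limit, in bytes, when 100% is split into `steps` units. */
static uint64_t min_step_bytes(uint64_t memory_bytes, uint64_t steps)
{
    return memory_bytes / steps;
}
```

For 2^40 bytes and 100,000 steps this gives 10,995,116 bytes, roughly the
10MB mentioned above; with the existing 1% resolution (100 steps) one step is
about 10GB.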

A bytes/pages interface would also solve this problem for tomorrow's
machines.

Moreover, wouldn't it be safer to make them mutually exclusive? I mean,
writing a value != 0 to vm_dirty_ratio_millis automatically sets
vm_dirty_ratio to 0 (disabled) and vice versa (this could be implemented
using an appropriate .proc_handler for example).

OK, I would like to set percentages like 12.456%, but if we don't do so
a simple "sysctl -p" could create unexpected behaviours, reconfiguring
the vm_dirty_ratio and not vm_dirty_ratio_millis for example.

The same should be valid also for a bytes/pages interface, so setting
vm_dirty_bytes != 0 (or vm_dirty_pages) should "disable" vm_dirty_ratio
and vice versa.
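
The mutually exclusive behaviour could look roughly like the following C
model. This is only a userspace sketch: the struct and the setter names are
hypothetical, and a real implementation would live in a sysctl .proc_handler
as suggested above.

```c
/* Hypothetical model of two mutually exclusive knobs; not kernel code. */
struct dirty_knobs {
    unsigned int ratio;   /* percent; 0 means "disabled" */
    unsigned long bytes;  /* absolute limit; 0 means "disabled" */
};

static void set_dirty_ratio(struct dirty_knobs *k, unsigned int ratio)
{
    k->ratio = ratio;
    if (ratio != 0)
        k->bytes = 0;     /* a non-zero ratio disables the byte limit */
}

static void set_dirty_bytes(struct dirty_knobs *k, unsigned long bytes)
{
    k->bytes = bytes;
    if (bytes != 0)
        k->ratio = 0;     /* a non-zero byte limit disables the ratio */
}
```

Whichever knob was written last with a non-zero value is the only one in
effect, so a stale value in the other knob cannot silently win.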

Thanks,
-Andrea
--

From: KOSAKI Motohiro
Date: Tuesday, September 23, 2008 - 10:48 am

Why is vm_dirty_ratio = 0.125 wrong?

Sure.
We don't have any motivation for changing that interface.



--

From: Michael Rubin
Date: Tuesday, September 23, 2008 - 1:21 pm

On Tue, Sep 23, 2008 at 10:48 AM, KOSAKI Motohiro

Here's an idea to build off Kosaki's suggestion and incorporate other
previous suggestions.

What if we have two knobs for every ratio. So we could have
vm_dirty_ratio and also vm_dirty_KB

vm_dirty_KB allows the user to set the number of KB desired and also
read the amount of KB that has been set.

Writing to vm_dirty_ratio works just as before and only allows whole
percentages.
Reading from vm_dirty_ratio will return a reply as before except if KB
has been set it can return a number in percentages (rounded off to
thousandths).

This way we allow new functionality and preserve old functionality
while not surprising the user.
Maybe we should deprecate the vm_dirty_ratio interface also and point

We are seeing problems where we are generating a lot of dirty memory
from asynchronous background writes while more important traffic is
operating with DIRECT_IO. The DIRECT_IO traffic will incur high
latency spikes as the pdflush hits the background threshold and tries
to write a lot of dirty buffers at once.

What we want to do is lower the background threshold low enough so
that we don't end up writing a lot of data at one time. As systems get
more and more memory this is and will become difficult. 1% of system
RAM could tie up a disk.

mrubin
--

From: KOSAKI Motohiro
Date: Tuesday, September 23, 2008 - 11:59 pm

yup.
sorry, I chose my words badly in my last mail, and that caused the confusion.
I only disagreed with vm_dirty_KB.

I agree with a fine-grained vm_dirty_ratio.

Thanks.


--

From: Andrea Righi
Date: Tuesday, October 7, 2008 - 3:35 am

The more I think about this, the more I would prefer to have an
interface in KB (or pages) that automatically adjusts the old int percentage
in dirty_ratio (the same for dirty_background_ratio).

The parser issue for writing decimal values doesn't seem to be a big
problem, but if the user expects to read an int from vm_dirty_ratio and
instead receives something like 0.125, well... this could break
something. So, IMHO, this way we're also changing the kernel-userspace
interface.

-Andrea
--

From: Balbir Singh
Date: Tuesday, October 7, 2008 - 4:04 am

Just provide a vm_dirty_ratio_in_bytes interface and keep it in sync with
vm_dirty_ratio (they are just two representations of the same internal value)
and for higher resolution propose that users use the bytes interface.



-- 
	Balbir
--

From: Andrea Righi
Date: Tuesday, October 7, 2008 - 8:49 am

Hi Balbir,

now that I have read the documentation carefully, the description in
Documentation/filesystems/proc.txt seems to be a bit misleading. In
proc.txt we say that dirty_ratio and dirty_background_ratio are "a
percentage of total system memory", but in mm/page-writeback.c we apply
the percentages to the dirtyable memory: free pages + reclaimable pages.
So, first of all I think we should clarify this in the documentation...

That said, keeping vm_dirty_amount_in_bytes in sync with
dirty_ratio_in_percentage is not a trivial task. One is a static value,
the other depends on the dirtyable memory in the system. If we want to
preserve the same behaviour we should do the following:

dirty_ratio = x => dirty_amount_in_bytes = x * dirtyable_memory / 100

dirty_amount_in_bytes = y => dirty_ratio = y / dirtyable_memory * 100

But anytime the dirtyable memory (or the total memory in the system)
changes we should update both values accordingly to preserve the
coherency between them (ouch!).
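
A small C sketch of the two conversions makes the coherency problem concrete.
The function names are invented for the example and the dirtyable memory is
passed in as a plain byte count; note that the multiplication is done before
the division so that integer arithmetic does not truncate the ratio to 0.

```c
#include <stdint.h>

/* ratio (percent) -> bytes, for a given amount of dirtyable memory */
static uint64_t ratio_to_bytes(unsigned int ratio, uint64_t dirtyable_bytes)
{
    return (uint64_t)ratio * dirtyable_bytes / 100;
}

/* bytes -> ratio (percent, truncated), for a given amount of dirtyable memory */
static unsigned int bytes_to_ratio(uint64_t bytes, uint64_t dirtyable_bytes)
{
    return (unsigned int)(bytes * 100 / dirtyable_bytes);
}
```

With 4,000,000,000 dirtyable bytes, a ratio of 10 corresponds to 400,000,000
bytes. If the dirtyable memory later drops to 2,000,000,000 bytes, that same
byte value corresponds to a ratio of 20, so one of the two exported values has
to change whenever the dirtyable memory does.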

Possible solutions:

1) introduce fine-grained dirty_ratio handling decimals by an appropriate
   parser (disadvantage: this would break the compatibility with all the
   userspace apps that expect to read an int from vm_dirty_ratio)

2) introduce dirty_ratio + dirty_ratio_millis (disadvantage: can
   generate unexpected behaviours when something is written to
   dirty_ratio ignoring the existence of dirty_ratio_millis)

3) introduce dirty_ratio + dirty_amount_in_bytes mutually exclusive,
   writing to one automatically "disables" the other (disadvantage:
   writing to dirty_ratio ignoring dirty_amount_in_bytes can cause
   unexpected behaviours)

4) introduce dirty_ratio + dirty_amount_in_bytes and change the
   old behaviour: when something is written to dirty_ratio,
   dirty_amount_in_bytes is computed as a function of totalram_pages (or
   the memcg limit) and then we always use this static value, instead of
   something that depends on the dirtyable memory - we can easily ...
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, October 7, 2008 - 6:16 pm

On Tue, 07 Oct 2008 17:49:49 +0200

Hmm... I agree with "5"... like this?
==
provides
  - vm.dirty_ratio (1/100)
  - vm.dirty_ratio_percentmille (1/100,000, pcm)

and allow
#echo 0.05 > vm/dirty_ratio
#cat vm/dirty_ratio 
0
#cat vm/dirty_ratio_percentmille
50
==
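
Accepting a decimal value on write needs a parser that avoids floating point.
The following C function is a minimal userspace sketch of one way to do it,
assuming the stored unit is 1/100,000 (so 1% == 1000); the function name is
hypothetical and overflow of very long inputs is not handled.

```c
#include <ctype.h>

/*
 * Parse a decimal percentage such as "5", "0.125" or "12.456" into
 * thousandths of a percent (1% == 1000) using integer arithmetic only.
 * Digits beyond the third decimal place are ignored (truncated).
 * Returns -1 on malformed input.
 */
static long parse_percent_pcm(const char *s)
{
    long whole = 0, frac = 0;
    int frac_digits = 0;

    if (!isdigit((unsigned char)*s))
        return -1;
    while (isdigit((unsigned char)*s))
        whole = whole * 10 + (*s++ - '0');

    if (*s == '.') {
        s++;
        while (isdigit((unsigned char)*s)) {
            if (frac_digits < 3) {
                frac = frac * 10 + (*s - '0');
                frac_digits++;
            }
            s++;
        }
    }

    /* Pad to exactly three fractional digits, e.g. ".5" becomes 500. */
    for (; frac_digits < 3; frac_digits++)
        frac *= 10;

    /* Allow the trailing newline that echo appends. */
    if (*s != '\0' && *s != '\n')
        return -1;
    return whole * 1000 + frac;
}
```

For instance, "0.125" parses to 125 and "12.456" to 12456, while "abc" is
rejected.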

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, October 8, 2008 - 6:13 am

I guess this would be the easiest way forward, I'll let you select the
granularity of the interface and its meaning.


-- 
	Balbir
--

From: Andrea Righi
Date: Thursday, October 9, 2008 - 8:29 am

The current granularity of 5% of dirtyable memory for dirty pages writeback is
too coarse for large memory machines and this will get worse as
memory-size/disk-speed ratio continues to increase.

These large writebacks can be unpleasant for desktop or latency-sensitive
environments, where the time to complete a writeback can be perceived as a
lack of responsiveness by the whole system.

So, something to define fine grained settings is needed.

What follows is a solution similar to the one discussed in [1], but I tried to
simplify things a little bit, in order to provide the same functionality
(in particular to avoid backward compatibility problems) and reduce the
amount of code needed to implement an in-kernel parser to handle percentages
with decimal digits.

The kernel provides the following parameters:
 - dirty_ratio, dirty_background_ratio in percentage
   (1 ... 100)
 - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille
   (1 ... 100,000)

Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable,
only the interface to read/write this value is different. The same is valid for
dirty_background_ratio and dirty_background_ratio_pcm.

In this way it's possible to provide a fine grained interface to configure the
writeback policy and at the same time preserve the compatibility with the old
coarse grained dirty_ratio / dirty_background_ratio users.

Examples:
 # echo 5 > /proc/sys/vm/dirty_ratio
 # cat /proc/sys/vm/dirty_ratio
 5
 # cat /proc/sys/vm/dirty_ratio_pcm
 5000

 # echo 500 > /proc/sys/vm/dirty_ratio_pcm
 # cat /proc/sys/vm/dirty_ratio
 0
 # cat /proc/sys/vm/dirty_ratio_pcm
 500

 # echo 5500 > /proc/sys/vm/dirty_ratio_pcm
 # cat /proc/sys/vm/dirty_ratio
 5
 # cat /proc/sys/vm/dirty_ratio_pcm
 5500
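
The behaviour in the examples above can be modelled in a few lines of C: one
stored value in percent mille, with two read/write views on it. This is only
a sketch of the semantics; the struct, the function names and the
PCM_PER_PERCENT constant are invented here and are not the identifiers used
in the patch.

```c
/* Hypothetical model: a single stored value in percent mille (1% == 1000). */
#define PCM_PER_PERCENT 1000

struct dirty_setting {
    int pcm;
};

/* Coarse view: whole percent. Reads truncate, writes scale up. */
static void write_dirty_ratio(struct dirty_setting *s, int percent)
{
    s->pcm = percent * PCM_PER_PERCENT;
}

static int read_dirty_ratio(const struct dirty_setting *s)
{
    return s->pcm / PCM_PER_PERCENT;
}

/* Fine view: the stored value itself. */
static void write_dirty_ratio_pcm(struct dirty_setting *s, int pcm)
{
    s->pcm = pcm;
}

static int read_dirty_ratio_pcm(const struct dirty_setting *s)
{
    return s->pcm;
}
```

Because the coarse read truncates, writing 500 or 5500 through the fine view
reads back as 0 or 5 through the coarse one, matching the transcript.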

[1] http://lkml.org/lkml/2008/10/7/230

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 Documentation/filesystems/proc.txt |   20 +++++++++
 include/linux/sysctl.h             |    7 +++
 ...
--

From: KAMEZAWA Hiroyuki
Date: Thursday, October 9, 2008 - 5:41 pm

On Thu, 09 Oct 2008 17:29:46 +0200
I like this. thanks.

I wonder... doesn't this overflow on 32-bit systems?

Thanks,
-Kame


--

From: Andrea Righi
Date: Friday, October 10, 2008 - 2:32 am

Correct! the worst case is (in pages):

4GB = 100,000 * determine_dirtyable_memory()

that means 42950 pages (~168MB) of dirtyable memory is enough to overflow :(.
Using a u64 for dirty_total should fix it.
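
The overflow can be reproduced in userspace with fixed-width types. The sketch
below assumes ONE_HUNDRED_PCM is 100000 (100% in percent mille) and uses
uint32_t to stand in for a 32-bit unsigned long; the function names are made
up. In C the width of a multiplication is decided by the operand types, not by
the variable that receives the result, so the 64-bit variant widens one
operand before multiplying.

```c
#include <stdint.h>

#define ONE_HUNDRED_PCM 100000u  /* assumed value: 100% in percent mille */

/* 32-bit arithmetic: the product wraps modulo 2^32 before the division. */
static uint32_t dirty_total_32(uint32_t ratio_pcm, uint32_t dirtyable_pages)
{
    return ratio_pcm * dirtyable_pages / ONE_HUNDRED_PCM;
}

/* Widen one operand first so the multiplication itself is done in 64 bits. */
static uint64_t dirty_total_64(uint32_t ratio_pcm, uint32_t dirtyable_pages)
{
    return (uint64_t)ratio_pcm * dirtyable_pages / ONE_HUNDRED_PCM;
}
```

At 100,000 pcm, 42,949 dirtyable pages still fit in 32 bits, while 42,950
pages wrap around and the 32-bit version returns 0 instead of 42,950.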

Delta patch is below.

Unfortunately I only have 64-bit machines right now. Maybe tomorrow I'll
be able to get a 32-bit box, unless someone tests this first.

Thanks!
-Andrea

---
Subject: fix overflow in 32-bit systems using fine-grained dirty_ratio

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/page-writeback.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 6bc8c9b..29913e5 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -133,7 +133,7 @@ static struct prop_descriptor vm_dirties;
  */
 static int calc_period_shift(void)
 {
-	unsigned long dirty_total;
+	u64 dirty_total;
 
 	dirty_total = (vm_dirty_ratio * determine_dirtyable_memory())
 			/ ONE_HUNDRED_PCM;
--

From: Andrea Righi
Date: Friday, October 10, 2008 - 6:13 am

I've been able to sort this out quickly by creating an i386 VM with 1GB of memory under kvm. :)

Everything seems to work fine and with the following fix it doesn't overflow.

--

From: Andrea Righi
Date: Monday, November 10, 2008 - 1:58 pm

The current granularity of 5% of dirtyable memory for dirty pages writeback is
too coarse for large memory machines and this will get worse as
memory-size/disk-speed ratio continues to increase.

These large writebacks can be unpleasant for desktop or latency-sensitive
environments, where the time to complete each writeback can be perceived as a
lack of responsiveness by the whole system.

What follows is a solution similar to the one discussed in [1], but a little
bit simplified in order to provide the same functionality (in particular
to avoid backward compatibility problems) and reduce the amount of code
needed to implement an in-kernel parser to handle percentages with
decimal digits.

The kernel provides the following parameters:
 - dirty_ratio, dirty_background_ratio in percentage (1 ... 100)
 - dirty_ratio_pcm, dirty_background_ratio_pcm in units of percent mille (1 ... 100,000)

Both dirty_ratio and dirty_ratio_pcm refer to the same vm_dirty_ratio variable,
only the interface to read/write this value is different. The same is valid for
dirty_background_ratio.

In this way it's possible to provide a fine-grained interface to configure the
writeback policy and at the same time preserve the compatibility with the old
dirty_ratio / dirty_background_ratio users.

Examples:
 # echo 5 > /proc/sys/vm/dirty_ratio
 # cat /proc/sys/vm/dirty_ratio
 5
 # cat /proc/sys/vm/dirty_ratio_pcm
 5000

 # echo 500 > /proc/sys/vm/dirty_ratio_pcm
 # cat /proc/sys/vm/dirty_ratio
 0
 # cat /proc/sys/vm/dirty_ratio_pcm
 500

 # echo 5500 > /proc/sys/vm/dirty_ratio_pcm
 # cat /proc/sys/vm/dirty_ratio
 5
 # cat /proc/sys/vm/dirty_ratio_pcm
 5500

Changelog: (v1 -> v2)

* fix overflow in 32bit systems (calc_period_shift needs a u64)
* rebased onto (and tested on) 2.6.28-rc2-mm1

[1] http://lkml.org/lkml/2008/10/7/230

Signed-off-by: Andrea Righi <righi.andrea@gmail.com>
---
 Documentation/filesystems/proc.txt |   20 +++++++++
 include/linux/sysctl.h             |    7 +++
 ...
--

From: Andrew Morton
Date: Monday, November 10, 2008 - 2:12 pm

On Mon, 10 Nov 2008 21:58:28 +0100

hm, so how long until dirty_ratio_pcm becomes too coarse...

What happened to the idea of specifying these in units of kilobytes?
--

From: Andrea Righi
Date: Monday, November 10, 2008 - 3:03 pm

The conclusion was that units in KB require much more complexity
to keep the old dirty_ratio (and dirty_background_ratio)
interface in sync with the new one.

The KB limit is a static value, the other depends on the dirtyable
memory. If we want to preserve the same behaviour we should do the
following:

- when dirty_ratio changes to x:
  dirty_amount_in_bytes = x * dirtyable_memory / 100.

- when dirty_amount_in_bytes changes to x:
  dirty_ratio = x / dirtyable_memory * 100

But anytime the dirtyable memory changes (as well as the total memory in
the system) we should update both values accordingly to preserve the
coherency between them.

I wonder if also making PERCENT_PCM (that is, 1% expressed in
fine-grained units) a parameter could be a better long-term solution.
It would also need another name, because in this case it would no longer be
a milli-percent value.

-Andrea
--

From: Andrew Morton
Date: Monday, November 10, 2008 - 3:12 pm

On Mon, 10 Nov 2008 23:03:13 +0100


How about we forget the percentage thing and create
/proc/sys/vm/dirty_ratio_millionths?  That will give us a few more years
of moores_law(memory size)/moores_law(disk speed) too..  
--

From: David Rientjes
Date: Monday, November 10, 2008 - 3:15 pm

I think the idea is for a dynamic dirty_ratio based on a static value 
dirty_amount_in_bytes:

	dirtyable_memory = determine_dirtyable_memory() * PAGE_SIZE;

Only dirty_ratio is actually updated if dirty_amount_in_bytes is static.

This allows you to control how many pages are NR_FILE_DIRTY or 
NR_UNSTABLE_NFS and gives you the granularity that you want with 
dirty_ratio_pcm, but on a byte scale instead of percent.

It's also a clean interface:

	echo 200M > /proc/sys/vm/dirty_ratio_bytes
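
Accepting a value such as "200M" implies parsing a size suffix. The C function
below is a userspace sketch of that step using strtoull(); the function name
is hypothetical, only K, M and G suffixes are handled, and overflow is not
checked.

```c
#include <stdint.h>
#include <stdlib.h>

/*
 * Parse "200M", "64K", "1G" or a plain byte count into bytes.
 * Returns 0 on malformed input.
 */
static uint64_t parse_size(const char *s)
{
    char *end;
    uint64_t value = strtoull(s, &end, 10);

    if (end == s)
        return 0;           /* no digits at all */

    switch (*end) {
    case 'G': case 'g':
        value <<= 10;       /* fall through */
    case 'M': case 'm':
        value <<= 10;       /* fall through */
    case 'K': case 'k':
        value <<= 10;
        end++;
        break;
    default:
        break;
    }

    /* Allow the trailing newline that echo appends. */
    return (*end == '\0' || *end == '\n') ? value : 0;
}
```

Here "200M" yields 209,715,200 bytes, which can then be compared directly
against the number of dirty bytes without going through a percentage.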
--