Re: [RFC][PATCH] page reclaim throttle take2

Previous thread: [PATCH] x86_64: force re setting the mmconf for fam10h if acpi=off by Yinghai Lu on Monday, February 25, 2008 - 10:41 pm. (3 messages)

Next thread: [PATCH 0/4] autofs4 - autofs needs a miscelaneous device for ioctls by Ian Kent on Monday, February 25, 2008 - 11:21 pm. (40 messages)
To: <linux-kernel@...>, <linux-mm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Balbir Singh <balbir@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Cc: <kosaki.motohiro@...>
Date: Monday, February 25, 2008 - 10:32 pm

Hi

This patch is a page reclaim improvement.

o previous discussion:
	http://marc.info/?l=linux-mm&m=120339997125985&w=2

o test method
  $ ./hackbench 120 process 1000

o test result (average of 5 measurements)

limit   hackbench     sys-time     major-fault   max-spent-time 
        time(s)       (s)                        in shrink_zone()
                                                 (jiffies)
--------------------------------------------------------------------
3       42.06         378.70       5336          6306


o reason why parallel reclaim is restricted to 3 tasks per zone

We tested various parameters.
  - a limit of 1 gives the fewest major faults,
    but the worst max spent time.
  - a limit of 3 gives the best max time spent in reclaim and the best hackbench result.

I think "restrict 3" gives the best overall experience.


limit      hackbench     sys-time     major-fault   max-spent-time 
           time(s)       (s)                        in shrink_zone()
                                                    (jiffies)
--------------------------------------------------------------------
1          48.50         283.89       3690          9057
2          44.43         350.94       5245          7159
3          42.06         378.70       5336          6306
4          48.84         401.87       5474          6669
unlimited  282.30        1248.47      29026          -
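
The throttle described above can be pictured with a small user-space sketch. This is not the code from the patch (the diff is truncated in this archive): it is an analogue using pthreads, and the names `zone_throttle`, `throttle_enter`, `throttle_try_enter` and `throttle_exit` are hypothetical. It wakes all waiters on exit, on the assumption that this mirrors the wake_up_all() behaviour mentioned later in the thread.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical user-space stand-in for a zone; not the kernel's struct zone. */
struct zone_throttle {
    pthread_mutex_t lock;
    pthread_cond_t  wait;   /* tasks sleep here while the zone is full */
    int active;             /* tasks currently reclaiming from this zone */
    int limit;              /* maximum concurrent reclaimers, e.g. 3 */
};

static void throttle_init(struct zone_throttle *z, int limit)
{
    assert(limit > 0);
    pthread_mutex_init(&z->lock, NULL);
    pthread_cond_init(&z->wait, NULL);
    z->active = 0;
    z->limit = limit;
}

/* Block until fewer than `limit` tasks are reclaiming, then take a slot. */
static void throttle_enter(struct zone_throttle *z)
{
    pthread_mutex_lock(&z->lock);
    while (z->active >= z->limit)
        pthread_cond_wait(&z->wait, &z->lock);
    z->active++;
    pthread_mutex_unlock(&z->lock);
}

/* Non-blocking variant: returns 1 if a slot was taken, 0 if the zone is full. */
static int throttle_try_enter(struct zone_throttle *z)
{
    int ok;

    pthread_mutex_lock(&z->lock);
    ok = z->active < z->limit;
    if (ok)
        z->active++;
    pthread_mutex_unlock(&z->lock);
    return ok;
}

/* Release the slot and wake every waiter; each one rechecks the limit. */
static void throttle_exit(struct zone_throttle *z)
{
    pthread_mutex_lock(&z->lock);
    z->active--;
    pthread_cond_broadcast(&z->wait);
    pthread_mutex_unlock(&z->lock);
}
```

A reclaiming task would call throttle_enter() before scanning the zone and throttle_exit() afterwards. With a limit of 3, a fourth caller sleeps until one of the first three leaves.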



Any comments are welcome!



Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
CC: Nick Piggin <npiggin@suse.de>


---
 include/linux/mmzone.h |    3 +
 mm/page_alloc.c        |    4 +
 mm/vmscan.c            |  101 ++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 99 insertions(+), 9 deletions(-)

Index: b/include/linux/mmzone.h
==================================================...
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <linux-kernel@...>, <linux-mm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Balbir Singh <balbir@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Tuesday, February 26, 2008 - 5:18 pm

Small nit, that extra blank line seems at the wrong end of the text

Would it be possible - and worthwhile - to make this FIFO fair?
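
One way to picture FIFO-fair admission is a ticket scheme. The sketch below is a user-space illustration only, not something proposed in the thread: `fifo_throttle`, `fifo_enter` and `fifo_exit` are hypothetical names, and pthreads stands in for kernel wait queues. Each arriving task takes a ticket, and tickets are admitted strictly in arrival order as earlier tasks finish.

```c
#include <assert.h>
#include <pthread.h>

/* Hypothetical FIFO-fair throttle; names are illustrative, not from the patch. */
struct fifo_throttle {
    pthread_mutex_t lock;
    pthread_cond_t  wait;
    unsigned long next_ticket;  /* next ticket to hand out */
    unsigned long finished;     /* number of tasks that have left */
    unsigned long limit;        /* maximum concurrent reclaimers */
};

static void fifo_init(struct fifo_throttle *t, unsigned long limit)
{
    assert(limit > 0);
    pthread_mutex_init(&t->lock, NULL);
    pthread_cond_init(&t->wait, NULL);
    t->next_ticket = 0;
    t->finished = 0;
    t->limit = limit;
}

/* Take a ticket and wait until it falls inside the admitted window. */
static unsigned long fifo_enter(struct fifo_throttle *t)
{
    unsigned long ticket;

    pthread_mutex_lock(&t->lock);
    ticket = t->next_ticket++;
    /* Tickets below finished + limit are admitted, in arrival order. */
    while (ticket >= t->finished + t->limit)
        pthread_cond_wait(&t->wait, &t->lock);
    pthread_mutex_unlock(&t->lock);
    return ticket;
}

/* Leave the throttle; this admits exactly the next ticket in line. */
static void fifo_exit(struct fifo_throttle *t)
{
    pthread_mutex_lock(&t->lock);
    t->finished++;
    /* A condition variable cannot target one ticket, so all waiters recheck. */
    pthread_cond_broadcast(&t->wait);
    pthread_mutex_unlock(&t->lock);
}
```

The cost of strict ordering is that every waiter is woken on each exit just to recheck its ticket, which is the kind of overhead a fairness change would need to be benchmarked against.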

--
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <kosaki.motohiro@...>, <linux-kernel@...>, <linux-mm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Balbir Singh <balbir@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 12:26 am

Agghhh, sorry ;-)

Hmmm,
maybe we don't need perfect fairness,
because try_to_free_pages() is an unfair mechanism.

But I will test using wake_up() instead of wake_up_all().
It gives a reasonably fair order, if no performance regression happens.

Thanks for the very useful comment.


- kosaki



--
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 12:27 am

One more thing, I would request you to add default heuristics (number of
reclaimers), based on the number of cpus in the system. Letting people tune it
is fine, but defaults should be related to number of cpus, nodes and zones on
the system. Zones can be reaped in parallel per node and cpus allow threads to
run in parallel. So please use that to come up with good defaults, instead of a
number like "3".

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <kosaki.motohiro@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 12:45 am

I don't think so.
All modern many-cpu machines are built on NUMA.
That means the following:
 - if the number of cpus increases, then the number of zones increases, too.

If the default value increases with #cpus, lock contention increases dramatically
on large NUMA machines.

Have I overlooked anything?
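
The arithmetic behind this argument can be sketched as follows. The helper names are hypothetical, and the sketch simplifies by assuming one reclaimable zone per node; the machine sizes in the note below are made up for illustration and are not measurements from this thread.

```c
#include <assert.h>

/* Per-zone limit if the default were scaled by online CPUs (hypothetical policy). */
static unsigned int limit_scaled_by_cpus(unsigned int nr_cpus, unsigned int per_cpu)
{
    assert(per_cpu > 0);
    return nr_cpus * per_cpu;
}

/* Upper bound on reclaimers across the whole machine for a given per-zone limit. */
static unsigned int system_wide_bound(unsigned int nr_zones, unsigned int per_zone_limit)
{
    return nr_zones * per_zone_limit;
}
```

With a fixed per-zone limit of 3, a hypothetical 16-node machine allows up to system_wide_bound(16, 3) = 48 reclaimers in total, yet never more than 3 contend on any one zone. If the per-zone limit were instead one reclaimer per cpu, a 64-cpu machine would allow limit_scaled_by_cpus(64, 1) = 64 tasks to contend on a single zone.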


And (however) I am afraid 3 is too small a value.
If you have other test results from a large machine, please show me.

- kosaki


--
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: <balbir@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:00 am

On Wed, 27 Feb 2008 13:45:18 +0900
How about adding something like..
==
config SIMULTANEOUS_PAGE_RECLAIMERS
	int "Simultaneous page reclaimers per zone"
	default 3
	depends on DEBUG
	help
	  This value determines the number of threads which can do page reclaim
	  in a zone simultaneously. If this is too big, performance under heavy memory
	  pressure will decrease.
	  If unsure, use default.
==

Then, you can get performance reports from people interested in this
feature in test cycle.

Thanks,
-Kame


--
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: <kosaki.motohiro@...>, <balbir@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:04 am

Hm, interesting.
But a sysctl parameter would be better, I think.

OK, I'll add it in the next post.



--
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:03 am

I think sysctl should be interesting. The config option provides good
documentation, but it is static in nature (requires reboot to change). I wish we
could have the best of both worlds.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Balbir Singh <balbir@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:19 am

I disagree, the config option is indeed static but so is the NUMA topology 
of the machine.  It represents the maximum number of page reclaim threads 
that should be allowed for that specific topology; a maximum should not 
need to be redefined with yet another sysctl and should remain independent 
of various workloads.

However, I would recommend adding the word "MAX" to the config option.

		David
--
To: David Rientjes <rientjes@...>
Cc: <kosaki.motohiro@...>, Balbir Singh <balbir@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:33 am

Is MAX_PARALLEL_RECLAIM_TASK a good name?

- kosaki

--
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Balbir Singh <balbir@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:47 am

I'd use _THREAD instead of _TASK, but I'd also wait for Balbir's input 
because perhaps I missed something in my original analysis that this 
config option represents only the maximum number of concurrent reclaim 
threads and other heuristics are used in addition to this that determine 
the exact number of threads depending on VM strain.

		David
--
To: David Rientjes <rientjes@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:48 am

Things are changing: with memory hot-add/remove and CPU hotplug, the topology can
change and is no longer static. One can create fake NUMA nodes on the fly using
a boot option as well.

Since we're talking of parallel reclaims, I think it's a function of CPUs and
Nodes. I'd rather keep it as a sysctl with a good default value based on the
topology. If we end up getting it wrong, the system administrator has a choice.
That is better than expecting him/her to recompile the kernel and boot that. A
sysctl does not create problems either w.r.t. changing the number of threads: there are no
hard-to-solve race conditions, and it is fairly straightforward.




-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <kosaki.motohiro@...>, David Rientjes <rientjes@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 2:52 am

Sorry, I don't understand yet.
I think my patch is already a function of CPUs and nodes:
a per-zone limit means the total is proportional to #cpus and #nodes.

Please tell me a topology on which the per-zone limit doesn't work well.

I think a boot option and a sysctl should be used only while in -mm,
to get various feedback.
In the end, we should select a better default and remove the sysctl.


- kosaki


--
To: Balbir Singh <balbir@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 2:09 am

We lack node hotplug, so the dependence on the number of system nodes in 
the equation is static and can easily be defined at compile-time.

I agree that the maximum number of parallel reclaim threads should be a 
function of cpus, so you can easily make it that by adding callback 
functions for cpu hotplug events.

Perhaps a better alternative than creating a set of heuristics and setting 
a user-defined maximum on the number of concurrent reclaim threads is to 
configure the number of threads to be used for each online cpu called 
CONFIG_NUM_RECLAIM_THREADS_PER_CPU.  This solves the lock contention 
problem if configured properly that was mentioned earlier.

Adding yet another sysctl for this functionality seems unnecessary, unless 
it is attempting to address other VM problems where page reclaim needs to 
be throttled when it is being stressed.  Those issues need to be addressed 
directly, in my opinion, instead of attempting to workaround it by 
limiting the number of concurrent reclaim threads.

		David
--
To: David Rientjes <rientjes@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 3:59 am

Let's forget node hotplug for the moment, but what if someone

1. Changes the machine configuration and adds more nodes, do we expect the
kernel to be recompiled? Or is it easier to update /etc/sysctl.conf?
2. Uses fake NUMA nodes and increases/decreases the number of nodes across

I am afraid it doesn't. Consider as you scale number of CPU's with the same

We are providing a solution with a good default value, allowing the
administrator to change them when our defaults don't work well.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Balbir Singh <balbir@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 4:47 am

That is why the proposal was made to make this a static configuration 
option, such as CONFIG_NUM_RECLAIM_THREADS_PER_NODE, that will handle both 

The benchmarks that have been posted suggest that memory locality is more 
important than lock contention, as I've already mentioned.

		David
--
To: David Rientjes <rientjes@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 5:01 am

You mentioned CONFIG_NUM_RECLAIM_THREADS_PER_CPU and not
CONFIG_NUM_RECLAIM_THREADS_PER_NODE. The advantage with sysctls is that even if
we get the thing wrong, the system administrator has an alternative. Please look
through the existing sysctl's and you'll see what I mean. What is wrong with
providing the flexibility that comes with sysctl? We cannot possibly think of
all situations and come up with the right answer for a heuristic. Why not come
up with a default and let everyone use what works for them?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: David Rientjes <rientjes@...>, KOSAKI Motohiro <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 5:44 am

I agree with Balbir, just turn it into a sysctl. It's easy enough to do,
and those who need it will thank you for it instead of cursing you for
hard-coding it.

--
To: David Rientjes <rientjes@...>
Cc: <kosaki.motohiro@...>, Balbir Singh <balbir@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 3:10 am

hm,

Could you post another patch?
I hope to avoid discussion without an implementation,
and I hope we can compare by benchmark results.


-kosaki

--
To: KOSAKI Motohiro <kosaki.motohiro@...>
Cc: Balbir Singh <balbir@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 3:19 am

My suggestion is merely to make the number of concurrent page reclaim 
threads be a function of how many online cpus there are.  Threads can 
easily be added or removed for cpu hotplug events by callback functions.

That's different than allowing users to change the number of threads with 
yet another sysctl.  Unless there are situations that can be presented 
where tuning the number of threads is advantageous to reduce lock 
contention, for example, and not simply working around other VM problems, 
then I see no point for an additional sysctl.

So my suggestion is to implement this in terms of 
CONFIG_NUM_RECLAIM_THREADS_PER_CPU and add callback functions for cpu 
hotplug events that add or remove this number of threads.
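
The idea of deriving the limit from the number of online cpus can be sketched in user space as below. This is an illustration of the suggestion only, not kernel code: `reclaim_limit`, `reclaim_cpu_online` and `reclaim_cpu_offline` are hypothetical names standing in for the hotplug callbacks, and the macro stands in for the proposed config option with an arbitrary value of 1.

```c
#include <assert.h>

/* Hypothetical stand-in for the proposed CONFIG_NUM_RECLAIM_THREADS_PER_CPU. */
#define NUM_RECLAIM_THREADS_PER_CPU 1

struct reclaim_limit {
    unsigned int online_cpus;
    unsigned int max_threads;   /* derived from online_cpus */
};

/* Recompute the derived limit after the cpu count changes. */
static void reclaim_limit_update(struct reclaim_limit *r)
{
    r->max_threads = r->online_cpus * NUM_RECLAIM_THREADS_PER_CPU;
}

/* Would be called from a cpu-online hotplug callback. */
static void reclaim_cpu_online(struct reclaim_limit *r)
{
    r->online_cpus++;
    reclaim_limit_update(r);
}

/* Would be called from a cpu-offline hotplug callback. */
static void reclaim_cpu_offline(struct reclaim_limit *r)
{
    assert(r->online_cpus > 0);
    r->online_cpus--;
    reclaim_limit_update(r);
}
```

Under this scheme the limit follows the cpu count with no administrator input, which is the trade-off against a sysctl that the rest of the thread debates.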

		David
--
To: David Rientjes <rientjes@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Balbir Singh <balbir@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 11:30 am

On Tue, 26 Feb 2008 23:19:08 -0800 (PST)

The more CPUs there are, the more lock contention you want?

Somehow that seems backwards :)

-- 
All rights reversed.
--
To: David Rientjes <rientjes@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Balbir Singh <balbir@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 3:51 am

On Tue, 26 Feb 2008 23:19:08 -0800 (PST)

Hmm, but kswapd, which is the main worker of page reclaim, is per-node,
and reclaim is done per zone.
Per-zone/per-node throttling seems to make sense.

I know his environment has 4 cpus per node, but a throttle of 3 was the best
number in his measurement. So num-per-cpu seems excessive.
(At least, a ratio (%) would be better.)
When zone reclaim is improved to scale well, we'll have to change
this throttle.

BTW, could someone try his patch on x86_64/ppc?
I'd like to see how heavy the contention is on other machines.

Thanks,
-kame
 

--
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Balbir Singh <balbir@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 3:56 am

That's another argument for not introducing the sysctl; the number of 
nodes and zones are a static property of the machine that cannot change 
without a reboot (numa=fake, mem=, introducing movable zones, etc).  We 
don't have node hotplug that can suddenly introduce additional zones from 
which to reclaim.

My point was that there doesn't appear to be any use case for tuning this 
via a sysctl that isn't simply attempting to workaround some other reclaim 
problem when the VM is stressed.  If that's agreed upon, then deciding 
between a config option that is either per-cpu or per-node should be based 
on the benchmarks that you've run.  At this time, it appears that per-node 

That seems to indicate that the NUMA topology is more important than lock 
contention for the reclaim throttle.

		David
--
To: David Rientjes <rientjes@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, Balbir Singh <balbir@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 4:09 am

On Tue, 26 Feb 2008 23:56:39 -0800 (PST)

Hmm, do you know there is already zone hotplug? ;)
(That is, onlining new memory in a new zone increases the # of zones.)
I agree that what is best should be decided based on benchmarks.
I like per-node, for now.
Lastly, I hear that there is also an I/O bottleneck in page reclaim.


Thanks,
-Kame

--
To: <balbir@...>
Cc: <kosaki.motohiro@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>, Peter Zijlstra <a.p.zijlstra@...>, <linux-kernel@...>, <linux-mm@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Wednesday, February 27, 2008 - 1:13 am

OK, I'll follow your opinion.


- kosaki


--
To: Peter Zijlstra <a.p.zijlstra@...>
Cc: KOSAKI Motohiro <kosaki.motohiro@...>, <linux-kernel@...>, <linux-mm@...>, Balbir Singh <balbir@...>, Rik van Riel <riel@...>, Lee Schermerhorn <Lee.Schermerhorn@...>, Nick Piggin <npiggin@...>
Date: Tuesday, February 26, 2008 - 8:50 pm

On Tue, 26 Feb 2008 22:18:38 +0100
I don't think fairness makes sense here.

IMHO, this functionality is an unfair one in nature. While someone is
reclaiming pages, other processes can get a newly reclaimed page without
calling try_to_free_pages.

For high-priority processes,

1. avoiding diving into try_to_free_pages if it's congested, and
2. just waiting for someone else to reclaim pages and grabbing them ASAP

may be good for quick work.

Thanks,
-Kame

--