Hi,

This patch is a page reclaim improvement.

o previous discussion:
  http://marc.info/?l=linux-mm&m=120339997125985&w=2

o test method

  $ ./hackbench 120 process 1000

o test result (average of 5 measurements)

  limit      hackbench  sys-time  major-fault  max-spent-time
             time (s)   (s)                    in shrink_zone()
                                               (jiffies)
  --------------------------------------------------------------------
  3          42.06      378.70    5336         6306

o reason for restricting parallel reclaim to 3 tasks per zone

  We tested various values of the limit.
  - A limit of 1 gives the fewest major faults, but the worst max spent time.
  - A limit of 3 gives the best max spent reclaim time and the best hackbench result.

  I think a limit of 3 gives the best overall experience.

  limit      hackbench  sys-time  major-fault  max-spent-time
             time (s)   (s)                    in shrink_zone()
                                               (jiffies)
  --------------------------------------------------------------------
  1          48.50      283.89    3690         9057
  2          44.43      350.94    5245         7159
  3          42.06      378.70    5336         6306
  4          48.84      401.87    5474         6669
  unlimited  282.30     1248.47   29026        -

Any comments are welcome!

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
CC: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: Balbir Singh <balbir@linux.vnet.ibm.com>
CC: Rik van Riel <riel@redhat.com>
CC: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
CC: Nick Piggin <npiggin@suse.de>

---
 include/linux/mmzone.h |    3 +
 mm/page_alloc.c        |    4 +
 mm/vmscan.c            |  101 ++++++++++++++++++++++++++++++++++++++++++++-----
 3 files changed, 99 insertions(+), 9 deletions(-)

Index: b/include/linux/mmzone.h
==================================================...
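For readers without the full diff, the idea of capping the number of parallel reclaimers per zone can be sketched in userspace C roughly as follows. This is not the code from the patch: `struct zone_throttle`, `reclaim_try_enter()` and `reclaim_exit()` are hypothetical names, and the sketch simply refuses entry when the limit is reached, where a kernel implementation would more likely put the task to sleep on a wait queue.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-zone state; not the kernel's struct zone. */
struct zone_throttle {
    atomic_int nr_reclaimers;   /* tasks currently reclaiming in this zone */
    int max_reclaimers;         /* e.g. 3, the value favoured by the benchmark */
};

/* Try to become a reclaimer; returns false if the zone is already at its limit. */
static bool reclaim_try_enter(struct zone_throttle *z)
{
    int cur = atomic_load(&z->nr_reclaimers);

    while (cur < z->max_reclaimers) {
        /* On failure, cur is reloaded with the current count and we re-check. */
        if (atomic_compare_exchange_weak(&z->nr_reclaimers, &cur, cur + 1))
            return true;
    }
    return false;
}

/* Leave the zone, freeing one reclaim slot. */
static void reclaim_exit(struct zone_throttle *z)
{
    atomic_fetch_sub(&z->nr_reclaimers, 1);
}
```

A caller would wrap its reclaim work between `reclaim_try_enter()` and `reclaim_exit()`, and decide for itself what to do (wait, retry, or skip the zone) when entry is refused.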
Small nit: that extra blank line seems to be at the wrong end of the text.
Would it be possible - and worthwhile - to make this FIFO fair? --
Agghhh, sorry ;-) Hmm, maybe we don't need it to be perfectly fair, because try_to_free_pages() is an unfair mechanism. But I will test using wake_up() instead of wake_up_all(); that gives a reasonably fair order, provided no performance regression happens. Thanks for the very useful comment. - kosaki --
One more thing, I would request you to add default heuristics (number of reclaimers), based on the number of cpus in the system. Letting people tune it is fine, but defaults should be related to the number of cpus, nodes and zones on the system. Zones can be reaped in parallel per node and cpus allow threads to run in parallel. So please use that to come up with good defaults, instead of a number like "3". -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I don't think so. All modern many-CPU machines are NUMA, which means that if the number of CPUs increases, the number of zones increases too. If the default value grew with #cpus, lock contention would increase dramatically on large NUMA machines. Have I overlooked anything? That said, I am afraid 3 may be too small a value. If you have other test results from a large machine, please show me. - kosaki --
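To make the scaling argument above concrete, here is a rough arithmetic sketch. The helper names and the example machine sizes are illustrative assumptions, not measurements from this thread.

```c
/*
 * System-wide upper bound on concurrent reclaimers when each zone has a
 * fixed limit: it grows with the number of zones, while the number of
 * tasks contending on any single zone stays constant.
 */
static unsigned int total_reclaimers_fixed(unsigned int nr_zones,
                                           unsigned int per_zone_limit)
{
    return nr_zones * per_zone_limit;
}

/*
 * Per-zone limit if the default instead scaled with the CPU count
 * (per_cpu_factor is a hypothetical tuning factor): contention on each
 * zone then grows with the size of the machine.
 */
static unsigned int per_zone_limit_scaled(unsigned int nr_cpus,
                                          unsigned int per_cpu_factor)
{
    return nr_cpus * per_cpu_factor;
}
```

Assuming 4 zones and 16 CPUs, a fixed limit of 3 allows up to 12 reclaimers system-wide but never more than 3 per zone, whereas a limit of one reclaimer per CPU would allow 16 tasks to contend on a single zone.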
On Wed, 27 Feb 2008 13:45:18 +0900

How about adding something like..
==
CONFIG_SIMULTANEOUS_PAGE_RECLAIMERS
    int
    default 3
    depends on DEBUG
    help
      This value determines the number of threads which can do page reclaim
      in a zone simultaneously. If this is too big, performance under heavy
      memory pressure will decrease.
      If unsure, use default.
==
Then, you can get performance reports from people interested in this feature in test cycle.

Thanks,
-Kame --
Hm, interesting. But I think a sysctl parameter is better. OK, I'll add it in the next post. --
I think sysctl should be interesting. The config option provides good documentation, but it is static in nature (requires reboot to change). I wish we could have the best of both worlds. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I disagree, the config option is indeed static but so is the NUMA topology of the machine. It represents the maximum number of page reclaim threads that should be allowed for that specific topology; a maximum should not need to be redefined with yet another sysctl and should remain independent of various workloads. However, I would recommend adding the word "MAX" to the config option. David --
Is MAX_PARALLEL_RECLAIM_TASK a good name? - kosaki --
I'd use _THREAD instead of _TASK, but I'd also wait for Balbir's input because perhaps I missed something in my original analysis that this config option represents only the maximum number of concurrent reclaim threads and other heuristics are used in addition to this that determine the exact number of threads depending on VM strain. David --
Things are changing: with memory hot-add/remove and CPU hotplug, the topology can change and is no longer static. One can create fake NUMA nodes on the fly using a boot option as well. Since we're talking of parallel reclaims, I think it's a function of CPUs and Nodes. I'd rather keep it as a sysctl with a good default value based on the topology. If we end up getting it wrong, the system administrator has a choice. That is better than expecting him/her to recompile the kernel and boot that. A sysctl does not create problems either w.r.t. changing the number of threads, no hard-to-solve race conditions - it is fairly straightforward -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Sorry, I don't understand yet. I think my patch is already a function of CPUs and nodes: with a per-zone limit, the total is proportional to #cpus and #nodes. Please tell me a topology where a per-zone limit doesn't work well. I think a boot option and a sysctl should be used only while the patch is in -mm, to get various feedback. In the end, we should select a better default and remove the sysctl. - kosaki --
We lack node hotplug, so the dependence on the number of system nodes in the equation is static and can easily be defined at compile-time. I agree that the maximum number of parallel reclaim threads should be a function of cpus, so you can easily make it that by adding callback functions for cpu hotplug events. Perhaps a better alternative than creating a set of heuristics and setting a user-defined maximum on the number of concurrent reclaim threads is to configure the number of threads to be used for each online cpu, called CONFIG_NUM_RECLAIM_THREADS_PER_CPU. If configured properly, this solves the lock contention problem that was mentioned earlier. Adding yet another sysctl for this functionality seems unnecessary, unless it is attempting to address other VM problems where page reclaim needs to be throttled when it is being stressed. Those issues need to be addressed directly, in my opinion, instead of attempting to work around them by limiting the number of concurrent reclaim threads. David --
Let's forget node hotplug for the moment, but what if someone
1. Changes the machine configuration and adds more nodes, do we expect the kernel to be recompiled? Or is it easier to update /etc/sysctl.conf?
2. Uses fake NUMA nodes and increases/decreases the number of nodes across
I am afraid it doesn't. Consider as you scale number of CPU's with the same
We are providing a solution with a good default value, allowing the administrator to change them when our defaults don't work well. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
That is why the proposal was made to make this a static configuration option, such as CONFIG_NUM_RECLAIM_THREADS_PER_NODE, that will handle both
The benchmarks that have been posted suggest that memory locality is more important than lock contention, as I've already mentioned. David --
You mentioned CONFIG_NUM_RECLAIM_THREADS_PER_CPU and not CONFIG_NUM_RECLAIM_THREADS_PER_NODE. The advantage with sysctls is that even if we get the thing wrong, the system administrator has an alternative. Please look through the existing sysctls and you'll see what I mean. What is wrong with providing the flexibility that comes with sysctl? We cannot possibly think of all situations and come up with the right answer for a heuristic. Why not come up with a default and let everyone use what works for them? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I agree with Balbir, just turn it into a sysctl. It's easy enough to do, and those who need it will thank you for it instead of cursing you for hard-coding it. --
Hm, could you post another patch? I hope to avoid discussion without an implementation, and I would like to compare using benchmark results. -kosaki --
My suggestion is merely to make the number of concurrent page reclaim threads be a function of how many online cpus there are. Threads can easily be added or removed for cpu hotplug events by callback functions. That's different than allowing users to change the number of threads with yet another sysctl. Unless there are situations that can be presented where tuning the number of threads is advantageous to reduce lock contention, for example, and not simply working around other VM problems, then I see no point for an additional sysctl. So my suggestion is to implement this in terms of CONFIG_NUM_RECLAIM_THREADS_PER_CPU and add callback functions for cpu hotplug events that add or remove this number of threads. David --
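The per-CPU proposal above could be sketched in userspace C roughly as below. `NUM_RECLAIM_THREADS_PER_CPU` stands in for the proposed config option (its value of 1 is an assumption), and `struct reclaim_limit` and the two callbacks are hypothetical names rather than existing kernel interfaces; real CPU hotplug notifiers are not modelled here.

```c
/* Stand-in for the proposed CONFIG_NUM_RECLAIM_THREADS_PER_CPU; value assumed. */
#define NUM_RECLAIM_THREADS_PER_CPU 1

/* Hypothetical bookkeeping for the reclaim thread limit. */
struct reclaim_limit {
    unsigned int online_cpus;   /* CPUs currently online */
    unsigned int max_threads;   /* derived limit on concurrent reclaim threads */
};

/* Recompute the limit from the current number of online CPUs. */
static void reclaim_limit_update(struct reclaim_limit *rl)
{
    rl->max_threads = rl->online_cpus * NUM_RECLAIM_THREADS_PER_CPU;
}

/* Would be called from a CPU-online hotplug callback. */
static void reclaim_cpu_online(struct reclaim_limit *rl)
{
    rl->online_cpus++;
    reclaim_limit_update(rl);
}

/* Would be called from a CPU-offline hotplug callback; keeps at least one CPU. */
static void reclaim_cpu_offline(struct reclaim_limit *rl)
{
    if (rl->online_cpus > 1)
        rl->online_cpus--;
    reclaim_limit_update(rl);
}
```

The point of the sketch is only that the limit follows the online CPU count automatically, with no administrator-visible knob.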
On Tue, 26 Feb 2008 23:19:08 -0800 (PST) The more CPUs there are, the more lock contention you want? Somehow that seems backwards :) -- All rights reversed. --
On Tue, 26 Feb 2008 23:19:08 -0800 (PST) Hmm, but kswapd, which is the main worker of page reclaim, is per-node, and reclaim is done per zone. Per-zone/per-node throttling seems to make sense. I know his environment has 4 CPUs per node, but a throttle of 3 was the best number in his measurement. So num-per-cpu seems excessive. (At least, a ratio (%) would be better.) When zone reclaim is improved to scale well, we'll have to change this throttle. BTW, could someone try his patch on x86_64/ppc? I'd like to see how heavy the contention is on other machines. Thanks, -kame --
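The ratio idea mentioned above might look something like the following sketch. `node_reclaim_limit()` and the 75% figure are assumptions chosen so that 4 CPUs per node yields the measured best value of 3; nothing in the thread specifies an actual percentage.

```c
/*
 * Hypothetical helper: derive a per-node reclaimer limit as a percentage
 * of the CPUs on that node, never going below one reclaimer.
 */
static unsigned int node_reclaim_limit(unsigned int cpus_per_node,
                                       unsigned int percent)
{
    unsigned int limit = cpus_per_node * percent / 100;

    return limit ? limit : 1;
}
```

With 4 CPUs per node and 75%, this gives 3; integer division rounds down, so small nodes fall back to the minimum of 1.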
That's another argument for not introducing the sysctl; the number of nodes and zones are a static property of the machine that cannot change without a reboot (numa=fake, mem=, introducing movable zones, etc). We don't have node hotplug that can suddenly introduce additional zones from which to reclaim. My point was that there doesn't appear to be any use case for tuning this via a sysctl that isn't simply attempting to work around some other reclaim problem when the VM is stressed. If that's agreed upon, then deciding between a config option that is either per-cpu or per-node should be based on the benchmarks that you've run. At this time, it appears that per-node
That seems to indicate that the NUMA topology is more important than lock contention for the reclaim throttle. David --
On Tue, 26 Feb 2008 23:56:39 -0800 (PST) Hmm, do you know there is already zone-hotplug? ;) (That is, onlining new memory in a new zone increases the number of zones.) I agree that what is best should be decided by benchmarks. I like per-node, for now. Lastly, I hear that there is also an I/O bottleneck in page reclaim. Thanks, -Kame --
OK, I'll follow your suggestion. - kosaki --
On Tue, 26 Feb 2008 22:18:38 +0100 I don't think fairness makes sense here. IMHO, this functionality is unfair by nature. While someone is reclaiming pages, other processes can get a newly reclaimed page without calling try_to_free_pages(). For high-priority processes, 1. avoiding diving into try_to_free_pages() if it's congested, and 2. just waiting for someone else to reclaim pages and grabbing them ASAP, may be good for quick work. Thanks, -Kame --
