From: Hamid R. Jahanjou Implements the idea of swap-out page clustering from *BSD for Linux. Each time a candidate page is to be swapped out, virtually-nearby pages are scanned to find eligible pages to be swapped out too as a cluster. This technique increases the likelihood of bringing in related data on a page fault and decreases swap space fragmentation in the long run. Currently, Linux searches only physically-nearby pages which is not optimal since, over time, physically- adjacent pages may become unrelated. The code can be statically tuned. No benchmarks. I'm not sure whether the added complexity is acceptable. Signed-off-by: Hamid R. Jahanjou <hamid.jahanjou@gmail.com> --- diff -uprN -X linux-2.6.26.1-vanilla/Documentation/dontdiff linux-2.6.26.1-vanilla/include/linux/pageout_clustering.h linux-2.6.26.1HJ/include/linux/pageout_clustering.h --- linux-2.6.26.1-vanilla/include/linux/pageout_clustering.h 1970-01-01 03:30:00.000000000 +0330 +++ linux-2.6.26.1HJ/include/linux/pageout_clustering.h 2008-08-28 20:19:52.000000000 +0330 @@ -0,0 +1,12 @@ +#ifndef PAGEOUT_CLUSTERING_H_ +#define PAGEOUT_CLUSTERING_H_ + +#include<linux/list.h> +#define INITIAL_CLUSTER_SCAN_RANGE 4 +#define HALF_PAGEOUT_CLUSTER_SIZE 8 /* default number of pages to be clustered + * on either side of a given page */ + +int try_to_cluster(struct page *, struct list_head *, + struct list_head *, int, int); + +#endif diff -uprN -X linux-2.6.26.1-vanilla/Documentation/dontdiff linux-2.6.26.1-vanilla/include/linux/rmap.h linux-2.6.26.1HJ/include/linux/rmap.h --- linux-2.6.26.1-vanilla/include/linux/rmap.h 2008-08-02 02:28:24.000000000 +0330 +++ linux-2.6.26.1HJ/include/linux/rmap.h 2008-08-28 20:19:52.000000000 +0330 @@ -102,6 +102,12 @@ pte_t *page_check_address(struct page *, unsigned long page_address_in_vma(struct page *, struct vm_area_struct *); /* + * Used by pageout clustering + */ +struct anon_vma *page_lock_anon_vma(struct page *); +void ...
On Thu, 04 Sep 2008 15:34:30 +0330 It would be good to get some idea of what testing you have done with this patch. -- All rights reversed. --
On Thu, 04 Sep 2008 15:34:30 +0330 I tried that once. The code all worked as-designed but didn't seem to Benchmarks are essential, please. Good ones. The whole point of the patch is to improve performance. If we don't know whether it improves performance, we cannot proceed in any way. Secondly, please don't just dump a pile of new code in our laps and expect us to pick through it and work out what it does and how it does it. Please at least provide a carefully-written english-language description of the design and implementation. Thirdly, I'd suggest that this code be converted into vaguely standard kernel coding style sooner rather than later. Convert it to use eight-column hard tabs then check it with scripts/checkpatch.pl, thanks. --
Do you recommend any specific benchmarks for memory-related projects ? Which one did you use for your own implementation ? Thank You --
Pretty much any and all benchmarks, when there's not enough memory (boot with mem=). Kernel compiles, database benchmarks, you name it. Basically anything which has a measurable execution time is relevant. Designing and running these things is a lot of work and thought. And it's an integral part of the development process, because some things are going to get slower and others will show large inter-run variations and then one needs to get in there and find explanations and hopefully fixes. For example, start with small and simple microbenchmarks: run two processes which concurrently allocate 1*mem, then modify it all linearly, then read it all linearly. Build up from there. Do not dive into complex benchmarks on day 1, because you'll have no hope of explaining the results which they produce. Beware that behaviour on SMP can be very different from behaviour on uniprocessor because of the relatively large duration of a timeslice. Both must be tested. --
Just some general comments: First I think virtual swap clustering is a great idea in theory and long overdue. Hopefully the numbers will agree. In general the code would be much nicer if you didn't pass around all that much in a structure (which is just a fancy way to have a function with lots of arguments) Perhaps try to restructure it a bit to make this smaller? Ideally clustering_info should disappear or at least get much smaller. Then continue_search seems to be only set to one value so it can be eliminated? Perhaps there is more like this. I didn't quite understand the "adjust the value of our search by the allocation order". The allocation order should be normally 0. I think having a tunable for the cluster sizes would be a good idea. At some point they might be even device dependent (e.g. on a flash device you would like to have them roughly erase block sized) -Andi -- ak@linux.intel.com --
Thanks for the review and the comments. Do you consider the clustering_info struct to hurt the readability The allocation order value is initially passed to the try_to_free_pages() in __alloc_pages_internal() and its value can be well more than zero. The clustering code adjusts the cluster size to the value of the allocation order (in a non-linear fashion of course). In this sense the cluster size is not determined at compile time, but the parameters affecting it are. Perhaps the term "statically-tunable" is appropriate here. -- Hamid R. Jahanjou (hamid.jahanjou@gmail.com) --
It's a fancy way to have a function with lots of arguments and having lots of arguments is typically a sign for a poor code structure, with functions not being properly separated. I know the standard scanner does it too, but it's also somewhat poor there and especially there's no excuse for doing it in much Yes, but that has nothing to do in how much you should swap cluster. AFAIK the main motivation for swap clustering is to avoid the extreme seekiness that happens during swap in (that is the main reason why swap on Linux often takes so long). And also to use the IO device well. IO devices today typically have a "minimum useful unit of IO" which is much larger than a page. Typically it's at least a MB, sometimes even more e.g. on flash or on some RAID controllers. Making it even larger is good to avoid seeks on read which are very costly on spinning disks. Also for user application swapping the order will be near always 0 (except for large page users, but that's really a exotic corner case today) And clustering for one page doesn't make much sense. So I don't think you should care about that order, but make it a tunable and experiment what the best values are on different IO devices and then possibly later make it dependent on the device. -Andi -- ak@linux.intel.com --
On Fri, 5 Sep 2008 21:45:01 +0200 That is indeed the reason. Basically, the swap IO size should be large enough that swap throughput is not incredibly slow when swapping in lots of related data in a process. On the other hand, you do want to avoid evicting data that the process is still using, just because you are swapping out a not recently referenced page on a nearby virtual address. I suspect that the swap IO cluster could contain nearby and virtually contiguous pages that are (1) not recently referenced and (2) on the inactive list. Of course, my guess may be wrong - this needs testing :) -- All rights reversed. --
Rik van Riel wrote: How about writing the nearby pages to swap anyway and mark the still in-use pages as SwapCache. --
On Sun, 07 Sep 2008 18:28:30 -0600 That is what will happen pretty much automatically. Another thing the swap out page clustering code could do is move pages that were recently referenced back to the active list, so the VM will not have to scan those pages again. -- All rights reversed. --
I agree. As you said, we have two conflicting requirements here: on the one hand one likes to swap as many pages as to satisfy the backing storage "proper block IO size," on the other hand, one should not make a process lose too many pages. The code needs tuning really, I'm going to rewrite it making it more flexible according to the suggestion given on I think this is a very good idea, actually the code does this already. Thanks. --
On Wed, Sep 10, 2008 at 1:16 AM, Hamid R. Jahanjou I'd much rather that one process lose a lot of pages than many processes lose a few. When a system is swap thrashing, each process has some of its active working set tossed to swap, which means *any* application you try to switch to is then sluggish. If it were only one application, then only that one takes the penalty hit when the user switches focus and context. --
On Wed, 10 Sep 2008 10:08:07 -0700 Better still, with proper IO clustering that one process can get its pages back into memory quite quickly. Paging of anonymous pages is limited by disk seeks, so minimizing those is a top goal. -- All rights reversed. --
I do not think that the virt_to_page() can work well on userland virtual address space. And the linear searching for whole address space of a vma is not good --
Thank you for the review and comments. You are right in that searching the whole process address space is not a good idea. The idea is to make the scan range flexible, thus normally, assuming that the code has been well-tuned, only a very limited number of VMA's are scanned. In general, i think that the cluster size and the scan range should be tuned depending on the backing storage characteristics and the low-on-memory severity. Like any VM code, this one needs tuning. BTW, the code could be implemented much more elegantly if the VMA's were connected using a doubly-linked list. I wonder if there are enought other codes with the same situation to justify making it doubly-linked. --
Em Thu, 04 Sep 2008 15:34:30 +0330 "Hamid R. Jahanjou" <hamid.jahanjou@gmail.com> escreveu: | From: Hamid R. Jahanjou | | Implements the idea of swap-out page clustering from *BSD for | Linux. Each time a candidate page is to be swapped out, | virtually-nearby pages are scanned to find eligible pages to be | swapped out too as a cluster. This technique increases the likelihood of | bringing in related data on a page fault and decreases swap space | fragmentation in the long run. Currently, Linux searches only | physically-nearby pages which is not optimal since, over time, physically- | adjacent pages may become unrelated. | | The code can be statically tuned. No benchmarks. I'm not sure whether | the added complexity is acceptable. Sorry for this (very) late response, but I ran some very simple tests this week and thought you might be interested in the results. The test machine has 512MB of RAM. For the tests I booted with 64MB and setup a 512MB swap file. For the first test I let an allyesconfig kernel building for several hours with six jobs (make -j6). I did not have any problems, so considering that your code was used I think it is stable enough. In the second test I have built an allnoconfig kernel with and without your patches, still with six jobs. Results: 2.6.27-rc7-vanilla 44:12 minutes 2.6.27-rc7-pgc 52:09 minutes (pgc means page-clustering patches). For both builds I did not change any swap paramenter and I could observe that about 80MB of swap were used most of the time. But please, note that these tests were quite far from being 'scientific' eg. I should have built the kernel more times and should have tried other setups... I will wait for the next version of the patches to run more serious tests. -- Luiz Fernando N. Capitulino --
