Re: [PATCH] VM: Implements the swap-out page-clustering technique

Previous thread: mmc/card/block.c : mmc_blk_open readonly mount bug? by sasin on Thursday, September 4, 2008 - 2:17 am. (3 messages)

Next thread: Who removed b43's QoS default parameters? by Michael Buesch on Thursday, September 4, 2008 - 4:32 am. (1 message)
From: Hamid R. Jahanjou
Date: Thursday, September 4, 2008 - 5:04 am

From: Hamid R. Jahanjou

Implements the idea of swap-out page clustering from *BSD for
Linux. Each time a candidate page is to be swapped out,
virtually-nearby pages are scanned to find eligible pages to be
swapped out too as a cluster. This technique increases the likelihood of
bringing in related data on a page fault and decreases swap space
fragmentation in the long run. Currently, Linux searches only
physically-nearby pages which is not optimal since, over time, physically-
adjacent pages may become unrelated.

The code can be statically tuned. No benchmarks. I'm not sure whether
the added complexity is acceptable.

Signed-off-by: Hamid R. Jahanjou <hamid.jahanjou@gmail.com>
---

diff -uprN -X linux-2.6.26.1-vanilla/Documentation/dontdiff linux-2.6.26.1-vanilla/include/linux/pageout_clustering.h linux-2.6.26.1HJ/include/linux/pageout_clustering.h
--- linux-2.6.26.1-vanilla/include/linux/pageout_clustering.h	1970-01-01 03:30:00.000000000 +0330
+++ linux-2.6.26.1HJ/include/linux/pageout_clustering.h	2008-08-28 20:19:52.000000000 +0330
@@ -0,0 +1,12 @@
+#ifndef PAGEOUT_CLUSTERING_H_
+#define PAGEOUT_CLUSTERING_H_
+
+#include<linux/list.h>
+#define INITIAL_CLUSTER_SCAN_RANGE 4 
+#define HALF_PAGEOUT_CLUSTER_SIZE 8      /* default number of pages to be clustered 
+					  * on either side of a given page */
+
+int try_to_cluster(struct page *, struct list_head *, 
+		   struct list_head *, int, int);
+
+#endif 
diff -uprN -X linux-2.6.26.1-vanilla/Documentation/dontdiff linux-2.6.26.1-vanilla/include/linux/rmap.h linux-2.6.26.1HJ/include/linux/rmap.h
--- linux-2.6.26.1-vanilla/include/linux/rmap.h	2008-08-02 02:28:24.000000000 +0330
+++ linux-2.6.26.1HJ/include/linux/rmap.h	2008-08-28 20:19:52.000000000 +0330
@@ -102,6 +102,12 @@ pte_t *page_check_address(struct page *,
 unsigned long page_address_in_vma(struct page *, struct vm_area_struct *);
 
 /*
+ * Used by pageout clustering 
+ */
+struct anon_vma *page_lock_anon_vma(struct page *);
+void ...
From: Rik van Riel
Date: Thursday, September 4, 2008 - 4:14 pm

On Thu, 04 Sep 2008 15:34:30 +0330

It would be good to get some idea of what testing you have done
with this patch.

-- 
All rights reversed.
--

From: Andrew Morton
Date: Thursday, September 4, 2008 - 4:14 pm

On Thu, 04 Sep 2008 15:34:30 +0330

I tried that once.  The code all worked as-designed but didn't seem to

Benchmarks are essential, please.  Good ones.

The whole point of the patch is to improve performance.  If we don't
know whether it improves performance, we cannot proceed in any way.


Secondly, please don't just dump a pile of new code in our laps and
expect us to pick through it and work out what it does and how it does
it.  Please at least provide a carefully-written english-language
description of the design and implementation.

Thirdly, I'd suggest that this code be converted into vaguely standard
kernel coding style sooner rather than later.  Convert it to use
eight-column hard tabs then check it with scripts/checkpatch.pl,
thanks.


--

From: Hamid R. Jahanjou
Date: Friday, September 5, 2008 - 12:45 am

Do you recommend any specific benchmarks for memory-related projects ?
Which
one did you use for your own implementation ?

Thank You
--

From: Andrew Morton
Date: Thursday, September 4, 2008 - 11:58 pm

Pretty much any and all benchmarks, when there's not enough memory
(boot with mem=).  Kernel compiles, database benchmarks, you name it. 
Basically anything which has a measurable execution time is relevant.

Designing and running these things is a lot of work and thought.  And
it's an integral part of the development process, because some things
are going to get slower and others will show large inter-run variations
and then one needs to get in there and find explanations and hopefully
fixes.

For example, start with small and simple microbenchmarks: run two
processes which concurrently allocate 1*mem, then modify it all
linearly, then read it all linearly.  Build up from there.  Do not
dive into complex benchmarks on day 1, because you'll have no hope of
explaining the results which they produce.

Beware that behaviour on SMP can be very different from behaviour on
uniprocessor because of the relatively large duration of a timeslice. 
Both must be tested.
--

From: Andi Kleen
Date: Friday, September 5, 2008 - 2:19 am

Just some general comments:

First I think virtual swap clustering is a great idea in theory and
long overdue.  Hopefully the numbers will agree.

In general the code would be much nicer if you didn't pass around 
all that much in a structure (which is just a fancy way to have
a function with lots of arguments) Perhaps try to restructure
it a bit to make this smaller? Ideally clustering_info should disappear
or at least get much smaller.

Then continue_search seems to be only set to one value so it 
can be eliminated? Perhaps there is more like this.

I didn't quite understand the "adjust the value of our search by
the allocation order". The allocation order should be normally 0.
I think having a tunable for the cluster sizes would be a good idea.
At some point they might be even device dependent (e.g. on a flash
device you would like to have them roughly erase block sized) 

-Andi


-- 
ak@linux.intel.com
--

From: Hamid R. Jahanjou
Date: Friday, September 5, 2008 - 1:27 pm

Thanks for the review and the comments.
Do you consider the clustering_info struct to hurt the readability

The allocation order value is initially passed to the try_to_free_pages()
in __alloc_pages_internal() and its value can be well more than zero.
The clustering code adjusts the cluster size to the value of the allocation
order (in a non-linear fashion of course). In this sense the cluster
size is
not determined at compile time, but the parameters affecting it are.
Perhaps the term "statically-tunable" is appropriate here.

-- 
Hamid R. Jahanjou
(hamid.jahanjou@gmail.com) 

--

From: Andi Kleen
Date: Friday, September 5, 2008 - 12:45 pm

It's a fancy way to have a function with lots of arguments
and having lots of arguments is typically a sign for a poor code
structure, with functions not being properly separated.

I know the standard scanner does it too, but it's also somewhat
poor there and especially there's no excuse for doing it in much

Yes, but that has nothing to do in how much you should swap cluster.

AFAIK the main motivation for swap clustering is to avoid the extreme
seekiness that happens during swap in (that is the main reason
why swap on Linux often takes so long). And also to use the IO
device well. IO devices today typically have a "minimum useful 
unit of IO" which is much larger than a page. Typically it's at least a 
MB, sometimes even more e.g. on flash or on some RAID controllers.  

Making it even larger is good to avoid seeks on read which are very 
costly on spinning disks.

Also for user application swapping the order will be near always 0
(except for large page users, but that's really a exotic corner case today)
And clustering for one page doesn't make much sense.

So I don't think you should care about that order, but make it a 
tunable and experiment what the best values are on different IO
devices and then possibly later make it dependent on the device.

-Andi
-- 
ak@linux.intel.com
--

From: Rik van Riel
Date: Friday, September 5, 2008 - 10:42 pm

On Fri, 5 Sep 2008 21:45:01 +0200

That is indeed the reason.

Basically, the swap IO size should be large enough that
swap throughput is not incredibly slow when swapping in
lots of related data in a process.

On the other hand, you do want to avoid evicting data 
that the process is still using, just because you are 
swapping out a not recently referenced page on a
nearby virtual address.

I suspect that the swap IO cluster could contain nearby
and virtually contiguous pages that are (1) not recently
referenced and (2) on the inactive list.

Of course, my guess may be wrong - this needs testing :)

-- 
All rights reversed.
--

From: Zan Lynx
Date: Sunday, September 7, 2008 - 5:28 pm

Rik van Riel wrote:

How about writing the nearby pages to swap anyway and mark the still 
in-use pages as SwapCache.
--

From: Rik van Riel
Date: Sunday, September 7, 2008 - 5:55 pm

On Sun, 07 Sep 2008 18:28:30 -0600

That is what will happen pretty much automatically.

Another thing the swap out page clustering code could
do is move pages that were recently referenced back
to the active list, so the VM will not have to scan
those pages again.

-- 
All rights reversed.
--

From: Hamid R. Jahanjou
Date: Wednesday, September 10, 2008 - 1:16 am

I agree. As you said, we have two conflicting requirements here: on the
one hand one likes to swap as many pages as to satisfy the backing
storage "proper block IO size," on the other hand, one should not make a
process lose too many pages.  The code needs tuning really, I'm going to
rewrite it making it more flexible according to the suggestion given on

I think this is a very good idea, actually the code does this already.

Thanks.
--

From: Ray Lee
Date: Wednesday, September 10, 2008 - 10:08 am

On Wed, Sep 10, 2008 at 1:16 AM, Hamid R. Jahanjou

I'd much rather that one process lose a lot of pages than many
processes lose a few. When a system is swap thrashing, each process
has some of its active working set tossed to swap, which means *any*
application you try to switch to is then sluggish. If it were only one
application, then only that one takes the penalty hit when the user
switches focus and context.
--

From: Rik van Riel
Date: Wednesday, September 10, 2008 - 10:39 am

On Wed, 10 Sep 2008 10:08:07 -0700

Better still, with proper IO clustering that one process can get
its pages back into memory quite quickly.

Paging of anonymous pages is limited by disk seeks, so minimizing
those is a top goal.

-- 
All rights reversed.
--

From: Li Yu
Date: Sunday, September 7, 2008 - 8:50 pm

I do not think that the virt_to_page() can work well on userland virtual 
address space.
And the linear searching for whole address space of a vma is not good 

--

From: hamidreza jahanjou
Date: Monday, September 8, 2008 - 2:51 am

Thank you for the review and comments. You are right in that searching
the whole process address space is not a good idea. The idea is to
make the scan range flexible, thus normally, assuming that the code
has been well-tuned, only a very limited number of VMA's are scanned.

In general, i think that the cluster size and the scan range should be
tuned depending on the backing storage characteristics and the
low-on-memory severity. Like any VM code, this one needs tuning.

BTW, the code could be implemented much more elegantly if the VMA's
were connected using a doubly-linked list. I wonder if there are
enought other codes with the same situation to justify making it
doubly-linked.
--

From: Luiz Fernando N. Capitulino
Date: Wednesday, September 24, 2008 - 6:56 am

Em Thu, 04 Sep 2008 15:34:30 +0330
"Hamid R. Jahanjou" <hamid.jahanjou@gmail.com> escreveu:

| From: Hamid R. Jahanjou
| 
| Implements the idea of swap-out page clustering from *BSD for
| Linux. Each time a candidate page is to be swapped out,
| virtually-nearby pages are scanned to find eligible pages to be
| swapped out too as a cluster. This technique increases the likelihood of
| bringing in related data on a page fault and decreases swap space
| fragmentation in the long run. Currently, Linux searches only
| physically-nearby pages which is not optimal since, over time, physically-
| adjacent pages may become unrelated.
| 
| The code can be statically tuned. No benchmarks. I'm not sure whether
| the added complexity is acceptable.

 Sorry for this (very) late response, but I ran some very simple tests
this week and thought you might be interested in the results.

 The test machine has 512MB of RAM. For the tests I booted with 64MB
and setup a 512MB swap file.

 For the first test I let an allyesconfig kernel building for several
hours with six jobs (make -j6). I did not have any problems, so
considering that your code was used I think it is stable enough.

 In the second test I have built an allnoconfig kernel with and without
your patches, still with six jobs. Results:

	2.6.27-rc7-vanilla   44:12 minutes
	2.6.27-rc7-pgc       52:09 minutes

 (pgc means page-clustering patches).

 For both builds I did not change any swap paramenter and I could
observe that about 80MB of swap were used most of the time.

 But please, note that these tests were quite far from being 'scientific'
eg. I should have built the kernel more times and should have tried other
setups...

 I will wait for the next version of the patches to run more serious
tests.

-- 
Luiz Fernando N. Capitulino
--

Previous thread: mmc/card/block.c : mmc_blk_open readonly mount bug? by sasin on Thursday, September 4, 2008 - 2:17 am. (3 messages)

Next thread: Who removed b43's QoS default parameters? by Michael Buesch on Thursday, September 4, 2008 - 4:32 am. (1 message)