[PATCH] mm: MADV_WILLNEED implementation for anonymous memory

Previous thread: [GIT PULL] KVM updates for the 2.6.25 merge window by Avi Kivity on Wednesday, January 30, 2008 - 1:18 pm. (1 message)

Next thread: Strange commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 by Adrian Bunk on Wednesday, January 30, 2008 - 1:55 pm. (2 messages)
To: Andrew Morton <akpm@...>
Cc: Hugh Dickins <hugh@...>, linux-kernel <linux-kernel@...>, linux-mm <linux-mm@...>, Nick Piggin <npiggin@...>, riel <riel@...>, Lennart Poettering <mztabzr@...>, Matt Mackall <mpm@...>
Date: Wednesday, January 30, 2008 - 1:28 pm

Subject: mm: MADV_WILLNEED implementation for anonymous memory

Implement MADV_WILLNEED for anonymous pages by walking the page tables and
starting asynchonous swap cache reads for all encountered swap pages.

Doing so required a modification to the page table walking library functions.
Previously ->pte_entry() could be called while holding a kmap_atomic, to
overcome this problem the pte walker is changed to copy batches of the pmd
and iterate them.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/madvise.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
mm/pagewalk.c | 33 +++++++++++++++++++++++++++------
2 files changed, 68 insertions(+), 13 deletions(-)

Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -11,6 +11,8 @@
#include <linux/mempolicy.h>
#include <linux/hugetlb.h>
#include <linux/sched.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>

/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,17 +102,51 @@ out:
return error;
}

+static int madvise_willneed_anon_pte(pte_t *ptep, unsigned long addr,
+ unsigned long end, void *arg)
+{
+ struct vm_area_struct *vma = arg;
+ struct page *page;
+
+ if (is_swap_pte(*ptep)) {
+ page = read_swap_cache_async(pte_to_swp_entry(*ptep),
+ GFP_KERNEL, vma, addr);
+ if (page)
+ page_cache_release(page);
+ }
+
+ return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ unsigned long len;
+ struct mm_walk walk = {
+ .pte_entry = madvise_willneed_anon_pte,
+ };
+
+ len = max_sane_readahead((end - start) >> PAGE_SHIFT);
+ end = start + (len << PAGE_SHIFT);
+
+ walk_page_range(vma->vm_mm, start, end, &walk, vma);
+ *prev = vma;
+
+ return 0;
+}
+
/*
*...

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Wednesday, January 30, 2008 - 6:40 pm

On Wed, 30 Jan 2008 18:28:59 +0100

Why cannot this use (a perhaps suitably-modified) make_pages_present()?
--

To: Andrew Morton <akpm@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 4:44 am

Because make_pages_present() relies on page faults to bring data in and
will thus wait for all data to be present before returning.

This solution is async; it will just issue a read for the requested
pages and moves on.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 5:12 am

I of course realise that. I also realise that swapin_readahead() is
_supposed_ to make the difference moot.

There's something you guys aren't telling us. Several things, actually.
Please don't do that.

Implementation-wise: make_pages_present() _can_ be converted to do this.
But it's a lot of patching, and the result will be a cleaner, faster and
smaller core MM. Whereas your approach is easy, but adds more code and
leaves the old stuff slow-and-dirty.

Guess which approach is preferred? ;)

--

To: Andrew Morton <akpm@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 5:35 am

Ok, I'll look at using make_pages_present().

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 5:47 am

Am still curious to know what inspired this change. What are the use
cases? Performance testing results, etc?
--

To: Andrew Morton <akpm@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 5:53 am

Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
might not have been used for a while, and he wants to be able to
pre-fetch those when he suspects they might need to be played. So that
once the audio thread comes along and stuffs them down /dev/dsp its all
nice in memory.

Since its all soft real-time at best he feels its better to do a best
effort at not hitting swap than it is to strain the system with mlock
usage.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Andrew Morton <akpm@...>, <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 6:15 am

The real problem that seems to make swapping so slow is that the data
tends to be badly fragmented on the swap partition. I suspect if that
problem was attached the need for such prefetching would be far less
because swap in would be much faster.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 6:19 am

Yeah, the 2.5 switch to physical scanning killed us there.

I still don't know why my allocate-swapspace-according-to-virtual-address
change didn't help. Much. Marcelo played with that a bit too.

--

To: Andrew Morton <akpm@...>
Cc: Andi Kleen <andi@...>, Peter Zijlstra <a.p.zijlstra@...>, <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 7:06 am

I've been thinking about just always doing swap on > page clusters.
Any reason swapping couldn't be done on e.g. 1MB chunks?

-Andi
--

To: Andi Kleen <andi@...>
Cc: Andrew Morton <akpm@...>, Andi Kleen <andi@...>, Peter Zijlstra <a.p.zijlstra@...>, <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 6:52 am

On Thu, 31 Jan 2008 12:06:10 +0100

Don't malloc() and free() hopelessly fragment memory
over time, ensuring that little related data can be
found inside each 1MB chunk if the process is large
enough? (say, firefox)
--

To: Rik van Riel <riel@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 7:32 am

Even if they do (I don't know if it's true or not) it does not really
matter because on modern hard disks/systems it does not cost less to
transfer 1MB versus 4K. The actual threshold seems to be rising in fact.

The only drawback is that the swap might be full sooner, but
I would actually consider this a feature because it would likely
end many prolonged oom death dances much sooner.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Andi Kleen <andi@...>, Andrew Morton <akpm@...>, Peter Zijlstra <a.p.zijlstra@...>, <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 7:09 am

On Thu, 31 Jan 2008 12:32:24 +0100

A second drawback would be that we evict more potentially
useful data every time we swap in a whole lot of extra
data around the little bit of data we need.

On the other hand, swapping should be the exception on
many of today's workloads.

Maybe we can measure how many of the swapped in pages end
up being used and how many are evicted again without being
used and automatically change our chunk size based on those
statistics?

I would expect most desktop systems to end up with large
chunks, because they rarely swap.
--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 6:05 am

hrm. Does he know about pthread_create()?
--

To: Andrew Morton <akpm@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 6:10 am

I'm very sure he does. So you're suggesting to just create a thread and
touch that memory and be done with it?

Lennart?

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: <hugh@...>, <linux-kernel@...>, <linux-mm@...>, <npiggin@...>, <riel@...>, <mztabzr@...>, <mpm@...>
Date: Thursday, January 31, 2008 - 6:18 am

That would get him out of trouble. But it certainly makes _sense_ for the
kernel to implement MADV_WILLNEED for anon memory. From a consistency POV.

But I don't know that the usefulness of the feature is worth actually
expending code on. Heck, after five-odd years I'm still asking every
second person I meet "why don't you use fadvise()?" (Reponse: ooooh!)

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Andrew Morton <akpm@...>, Hugh Dickins <hugh@...>, linux-kernel <linux-kernel@...>, linux-mm <linux-mm@...>, Nick Piggin <npiggin@...>, riel <riel@...>, Lennart Poettering <mztabzr@...>
Date: Wednesday, January 30, 2008 - 2:15 pm

That's a pretty reasonable approach. My original approach was to buffer
a page worth of PTEs with all the attendant malloc annoyances. Then
Andrew and I came up with another fix a bit ago by effectively doing a
batch of size 1: mapping and immediately unmapping per PTE. That's
basically a no-op on !HIGHPTE but could potentially be expensive in the
HIGHPTE case. Your approach might be a good complexity/performance
middle ground.

Unfortunately, I think we only implemented our fix in one of the
relevant places: the /proc/pid/pagemap code hooks a callback at the pte
table level and then does its own walk across the table. Perhaps I
should refactor this so that it hooks in at the pte entry level of the

Looks like this could be:

for (i = 0; i < WALK_BATCH_SIZE && addr + i * PAGE_SIZE != end; i++)
for (i = 0; i < WALK_BATCH_SIZE && addr != end;
i++, addr+= PAGE_SIZE) {
err = walk->pte_entry(ptes[i], addr, addr + PAGE_SIZE,
private);

And we can ditch start.

Also, one wonders if setting batch size to 1 will then convince the
compiler to collapse this into a more trivial loop in the !HIGHPTE case.

--
Mathematics is the supreme nostalgia of our time.

--

Previous thread: [GIT PULL] KVM updates for the 2.6.25 merge window by Avi Kivity on Wednesday, January 30, 2008 - 1:18 pm. (1 message)

Next thread: Strange commit 42a73808ed4f30b739eb52bcbb33a02fe62ceef5 by Adrian Bunk on Wednesday, January 30, 2008 - 1:55 pm. (2 messages)