Subject: mm: MADV_WILLNEED implementation for anonymous memory
Implement MADV_WILLNEED for anonymous pages by walking the page tables and
starting asynchonous swap cache reads for all encountered swap pages.Doing so required a modification to the page table walking library functions.
Previously ->pte_entry() could be called while holding a kmap_atomic, to
overcome this problem the pte walker is changed to copy batches of the pmd
and iterate them.Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/madvise.c | 48 +++++++++++++++++++++++++++++++++++++++++-------
mm/pagewalk.c | 33 +++++++++++++++++++++++++++------
2 files changed, 68 insertions(+), 13 deletions(-)Index: linux-2.6/mm/madvise.c
===================================================================
--- linux-2.6.orig/mm/madvise.c
+++ linux-2.6/mm/madvise.c
@@ -11,6 +11,8 @@
#include <linux/mempolicy.h>
#include <linux/hugetlb.h>
#include <linux/sched.h>
+#include <linux/swap.h>
+#include <linux/swapops.h>/*
* Any behaviour which results in changes to the vma->vm_flags needs to
@@ -100,17 +102,51 @@ out:
return error;
}+static int madvise_willneed_anon_pte(pte_t *ptep, unsigned long addr,
+ unsigned long end, void *arg)
+{
+ struct vm_area_struct *vma = arg;
+ struct page *page;
+
+ if (is_swap_pte(*ptep)) {
+ page = read_swap_cache_async(pte_to_swp_entry(*ptep),
+ GFP_KERNEL, vma, addr);
+ if (page)
+ page_cache_release(page);
+ }
+
+ return 0;
+}
+
+static long madvise_willneed_anon(struct vm_area_struct *vma,
+ struct vm_area_struct **prev,
+ unsigned long start, unsigned long end)
+{
+ unsigned long len;
+ struct mm_walk walk = {
+ .pte_entry = madvise_willneed_anon_pte,
+ };
+
+ len = max_sane_readahead((end - start) >> PAGE_SHIFT);
+ end = start + (len << PAGE_SHIFT);
+
+ walk_page_range(vma->vm_mm, start, end, &walk, vma);
+ *prev = vma;
+
+ return 0;
+}
+
/*
*...
On Wed, 30 Jan 2008 18:28:59 +0100
Why cannot this use (a perhaps suitably-modified) make_pages_present()?
--
Because make_pages_present() relies on page faults to bring data in and
will thus wait for all data to be present before returning.This solution is async; it will just issue a read for the requested
pages and moves on.--
I of course realise that. I also realise that swapin_readahead() is
_supposed_ to make the difference moot.There's something you guys aren't telling us. Several things, actually.
Please don't do that.Implementation-wise: make_pages_present() _can_ be converted to do this.
But it's a lot of patching, and the result will be a cleaner, faster and
smaller core MM. Whereas your approach is easy, but adds more code and
leaves the old stuff slow-and-dirty.Guess which approach is preferred? ;)
--
Ok, I'll look at using make_pages_present().
--
Am still curious to know what inspired this change. What are the use
cases? Performance testing results, etc?
--
Ah, that is Lennarts Pulse Audio thing, he has samples in memory which
might not have been used for a while, and he wants to be able to
pre-fetch those when he suspects they might need to be played. So that
once the audio thread comes along and stuffs them down /dev/dsp its all
nice in memory.Since its all soft real-time at best he feels its better to do a best
effort at not hitting swap than it is to strain the system with mlock
usage.--
The real problem that seems to make swapping so slow is that the data
tends to be badly fragmented on the swap partition. I suspect if that
problem was attached the need for such prefetching would be far less
because swap in would be much faster.-Andi
--
Yeah, the 2.5 switch to physical scanning killed us there.
I still don't know why my allocate-swapspace-according-to-virtual-address
change didn't help. Much. Marcelo played with that a bit too.--
I've been thinking about just always doing swap on > page clusters.
Any reason swapping couldn't be done on e.g. 1MB chunks?-Andi
--
On Thu, 31 Jan 2008 12:06:10 +0100
Don't malloc() and free() hopelessly fragment memory
over time, ensuring that little related data can be
found inside each 1MB chunk if the process is large
enough? (say, firefox)
--
Even if they do (I don't know if it's true or not) it does not really
matter because on modern hard disks/systems it does not cost less to
transfer 1MB versus 4K. The actual threshold seems to be rising in fact.The only drawback is that the swap might be full sooner, but
I would actually consider this a feature because it would likely
end many prolonged oom death dances much sooner.-Andi
--
On Thu, 31 Jan 2008 12:32:24 +0100
A second drawback would be that we evict more potentially
useful data every time we swap in a whole lot of extra
data around the little bit of data we need.On the other hand, swapping should be the exception on
many of today's workloads.Maybe we can measure how many of the swapped in pages end
up being used and how many are evicted again without being
used and automatically change our chunk size based on those
statistics?I would expect most desktop systems to end up with large
chunks, because they rarely swap.
--
hrm. Does he know about pthread_create()?
--
I'm very sure he does. So you're suggesting to just create a thread and
touch that memory and be done with it?Lennart?
--
That would get him out of trouble. But it certainly makes _sense_ for the
kernel to implement MADV_WILLNEED for anon memory. From a consistency POV.But I don't know that the usefulness of the feature is worth actually
expending code on. Heck, after five-odd years I'm still asking every
second person I meet "why don't you use fadvise()?" (Reponse: ooooh!)--
That's a pretty reasonable approach. My original approach was to buffer
a page worth of PTEs with all the attendant malloc annoyances. Then
Andrew and I came up with another fix a bit ago by effectively doing a
batch of size 1: mapping and immediately unmapping per PTE. That's
basically a no-op on !HIGHPTE but could potentially be expensive in the
HIGHPTE case. Your approach might be a good complexity/performance
middle ground.Unfortunately, I think we only implemented our fix in one of the
relevant places: the /proc/pid/pagemap code hooks a callback at the pte
table level and then does its own walk across the table. Perhaps I
should refactor this so that it hooks in at the pte entry level of theLooks like this could be:
for (i = 0; i < WALK_BATCH_SIZE && addr + i * PAGE_SIZE != end; i++)
for (i = 0; i < WALK_BATCH_SIZE && addr != end;
i++, addr+= PAGE_SIZE) {
err = walk->pte_entry(ptes[i], addr, addr + PAGE_SIZE,
private);And we can ditch start.
Also, one wonders if setting batch size to 1 will then convince the
compiler to collapse this into a more trivial loop in the !HIGHPTE case.--
Mathematics is the supreme nostalgia of our time.--
| Greg KH | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Andrew Morton | Re: 2.6.23-rc6-mm1 |
| Luciano Rocha | usb hdd problems with 2.6.27.2 |
git: | |
| Gerrit Renker | [PATCH 15/37] dccp: Set per-connection CCIDs via socket options |
| Andrew Morton | Re: [BUG] New Kernel Bugs |
| David Miller | [GIT]: Networking |
| Jarek Poplawski | [PATCH take 2] pkt_sched: Protect gen estimators under est_lock. |
