[PATCH 23/25] Noreclaim LRU scan sysctl

From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:50 pm

The patches to follow are a continuation of the V8 "VM pageout scalability
improvements" series that Rik van Riel posted to LKML on 23May08.  These
patches apply atop Rik's series with the following overlap:

Patches 13 through 16 replace the corresponding patches in Rik's posting.

Patch 13, the noreclaim lru infrastructure, now includes Kosaki Motohiro's
memcontrol enhancements to track nonreclaimable pages.

Patches 14 and 15 are largely unchanged, except for refresh.  Includes
some minor statistics formatting cleanup.

Patch 16 includes a fix for a potential [unobserved] race condition during
SHM_UNLOCK.

---

Additional patches in this series:

Patches 17 through 20 keep mlocked pages off the normal [in]active LRU
lists using the noreclaim lru infrastructure.   These patches represent
a fairly significant rework of an RFC patch originally posted by Nick Piggin.

Patches 21 and 22 are optional, but recommended, enhancements to the overall
noreclaim series.  

Patches 23 and 24 are optional enhancements useful during debug and testing.

Patch 25 is a rather verbose document describing the noreclaim lru
infrastructure and the use thereof to keep ramfs, SHM_LOCKED and mlocked
pages off the normal LRU lists.

---

The entire stack, including Rik's split lru patches, is holding up very
well under stress loads.  E.g., it ran for 90+ hours over last weekend on
both x86_64 [32GB, 8core] and ia64 [32GB, 16cpu] platforms without error.

I think these are ready for a spin in -mm atop Rik's patches.

Lee

--

From: Andrew Morton
Date: Thursday, May 29, 2008 - 1:16 pm

On Thu, 29 May 2008 15:50:30 -0400


I was >this< close to getting onto Rik's patches (honest) but a few
other people have been kicking the tyres and seem to have caused some
punctures so I'm expecting V9?
--

From: Rik van Riel
Date: Thursday, May 29, 2008 - 1:20 pm

On Thu, 29 May 2008 13:16:24 -0700

If I send you a V9 up to patch 12, you can apply Lee's patches
straight over my V9 :)

*fidgets with quilt mail*

-- 
All rights reversed.
--

From: John Stoffel
Date: Friday, May 30, 2008 - 6:52 am

>>>>> "Rik" == Rik van Riel <riel@redhat.com> writes:

Rik> On Thu, 29 May 2008 13:16:24 -0700

Rik> If I send you a V9 up to patch 12, you can apply Lee's patches
Rik> straight over my V9 :)

I haven't seen any performance numbers talking about how well this
stuff works on single or dual CPU machines with smaller amounts of
memory, or whether it's worth using on these machines at all?

The big machines with lots of memory and lots of CPUs are certainly
becoming more prevalent, but for my home machine with 4GB RAM and dual
core, what's the advantage?  

Let's not slow down the common case for the sake of the bigger guys if
possible.

John

--

From: Rik van Riel
Date: Friday, May 30, 2008 - 7:29 am

On Fri, 30 May 2008 09:52:48 -0400

I wouldn't call your home system with 4GB RAM "small".

After all, the VM that Linux currently has was developed
mostly on machines with less than 1GB of RAM and later
encrusted in bandaids to make sure the large systems did
not fail too badly.

As for small system performance, I believe that my patch
series should cause no performance regressions on those
systems and has a framework that allows us to improve
performance on those systems too.

If you manage to break performance with my patch set
somehow, please let me know so I can fix it.  Something
like the VM is very subtle and any change is pretty
much guaranteed to break something, so I am very interested
in feedback.

-- 
All rights reversed.
--

From: John Stoffel
Date: Friday, May 30, 2008 - 7:36 am

Rik> On Fri, 30 May 2008 09:52:48 -0400

Rik> I wouldn't call your home system with 4GB RAM "small".

*grin* me either in some ways.  But my other main Linux box, which
acts as an NFS server, has 2GB of RAM but only a pair of PIII Xeons at
550MHz.  This is the box I'd be worried about in some ways, since it
handles a bunch of stuff like backups, mysql, apache, NFS server,
etc.  

Rik> After all, the VM that Linux currently has was developed mostly
Rik> on machines with less than 1GB of RAM and later encrusted in
Rik> bandaids to make sure the large systems did not fail too badly.

Sure, I understand.  

Rik> As for small system performance, I believe that my patch series
Rik> should cause no performance regressions on those systems and has
Rik> a framework that allows us to improve performance on those
Rik> systems too.

Great!  It would be nice to just be able to track this nicely.

Rik> If you manage to break performance with my patch set somehow,
Rik> please let me know so I can fix it.  Something like the VM is
Rik> very subtle and any change is pretty much guaranteed to break
Rik> something, so I am very interested in feedback.

What are you using to test/benchmark your changes as you develop this
patchset?  What would you suggest as a test load to help check
performance?

John
--

From: Rik van Riel
Date: Friday, May 30, 2008 - 8:27 am

On Fri, 30 May 2008 10:36:05 -0400

Your normal workload.

I am doing some IO throughput, swap throughput and database tests,
however those are probably not representative of what YOU throw at
the VM.

There are no VM benchmarks that cover everything, so what is needed
most at this point is real world exposure.  I cannot promise that
the code is perfect; all I can promise is that I will try to fix
any performance issue that people find.

-- 
All rights reversed.
--

From: MinChan Kim
Date: Thursday, May 29, 2008 - 6:56 pm

I failed to apply Lee's patches over your V9.

barrios@barrios-desktop:~/linux-2.6$ patch -p1 < /tmp/msg0_13.txt
patching file mm/Kconfig
patching file include/linux/page-flags.h
patching file include/linux/mmzone.h
patching file mm/page_alloc.c
patching file include/linux/mm_inline.h
patching file include/linux/swap.h
patching file include/linux/pagevec.h
patching file mm/swap.c
patching file mm/migrate.c
patching file mm/vmscan.c
Hunk #10 FAILED at 1162.
Hunk #11 succeeded at 1210 (offset 3 lines).
Hunk #12 succeeded at 1242 (offset 3 lines).
Hunk #13 succeeded at 1380 (offset 3 lines).
Hunk #14 succeeded at 1411 (offset 3 lines).
Hunk #15 succeeded at 1962 (offset 3 lines).
Hunk #16 succeeded at 2300 (offset 3 lines).
1 out of 16 hunks FAILED -- saving rejects to file mm/vmscan.c.rej
patching file mm/mempolicy.c
patching file mm/internal.h
patching file mm/memcontrol.c
patching file include/linux/memcontrol.h

-- 
Kind regards,
MinChan Kim
--

From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:50 pm

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Infrastructure to manage pages excluded from reclaim--i.e., hidden
from vmscan.  Based on a patch by Larry Woodman of Red Hat. Reworked
to maintain "nonreclaimable" pages on a separate per-zone LRU list,
to "hide" them from vmscan.

Kosaki Motohiro added the support for the memory controller noreclaim
lru list.

Pages on the noreclaim list have both PG_noreclaim and PG_lru set.
Thus, PG_noreclaim is analogous to and mutually exclusive with
PG_active--it specifies which LRU list the page is on.  

The noreclaim infrastructure is enabled by a new mm Kconfig option
[CONFIG_]NORECLAIM_LRU.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether
or not a page is reclaimable.  Subsequent patches will add the various
!reclaimable tests.  We'll want to keep these tests light-weight for
use in shrink_active_list() and, possibly, the fault path.

To avoid races between tasks putting pages [back] onto an LRU list and
tasks that might be moving the page from nonreclaimable to reclaimable
state, one should test reclaimability under page lock and place
nonreclaimable pages directly on the noreclaim list before dropping the
lock.  Otherwise, we risk "stranding" reclaimable pages on the noreclaim
list.  It's OK to use the pagevec caches for reclaimable pages.  The new
function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles
this transition, including potential page truncation while the page is
unlocked.
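
As a rough illustration of the rule above (test reclaimability under the
page lock, then place the page directly on the right list), here is a
minimal userspace sketch in C.  It is not the kernel code: every toy_*
name, the flag values and the single 'mapping_noreclaim' condition are
assumptions invented for this example.  It only shows that PG_noreclaim
and PG_active are treated as mutually exclusive list selectors.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-ins for page flags; not the kernel's definitions. */
#define TOY_PG_LRU       (1u << 0)  /* page is on some LRU list */
#define TOY_PG_ACTIVE    (1u << 1)  /* ... and that list is the active list */
#define TOY_PG_NORECLAIM (1u << 2)  /* ... and that list is the noreclaim list */

enum toy_lru { TOY_LRU_INACTIVE, TOY_LRU_ACTIVE, TOY_LRU_NORECLAIM };

struct toy_page {
	unsigned int flags;
	bool locked;             /* models the page lock being held */
	bool mapping_noreclaim;  /* models one "!reclaimable" condition */
};

/* Toy reclaimability test: one cheap check, no list or rmap walking. */
static bool toy_page_reclaimable(const struct toy_page *page)
{
	return !page->mapping_noreclaim;
}

/*
 * Toy putback: the caller holds the page lock and the page is off the LRU.
 * Returns the list the page was placed on.
 */
static enum toy_lru toy_putback_lru_page(struct toy_page *page)
{
	bool was_active;

	assert(page->locked);
	assert(!(page->flags & TOY_PG_LRU));

	was_active = (page->flags & TOY_PG_ACTIVE) != 0;
	page->flags &= ~(TOY_PG_ACTIVE | TOY_PG_NORECLAIM);

	if (!toy_page_reclaimable(page)) {
		/* Nonreclaimable: goes straight to the noreclaim list. */
		page->flags |= TOY_PG_LRU | TOY_PG_NORECLAIM;
		return TOY_LRU_NORECLAIM;
	}

	page->flags |= TOY_PG_LRU;
	if (was_active) {
		page->flags |= TOY_PG_ACTIVE;
		return TOY_LRU_ACTIVE;
	}
	return TOY_LRU_INACTIVE;
}
```

In the sketch, a page never ends up with both the active and noreclaim
bits set, which is the invariant the description relies on.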

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

 include/linux/memcontrol.h |    2 
 include/linux/mm_inline.h  |   13 ++-
 include/linux/mmzone.h     |   24 ++++++
 include/linux/page-flags.h |   13 +++
 include/linux/pagevec.h    |    1 
 include/linux/swap.h       |   12 +++
 mm/Kconfig                 |   10 ++
 mm/internal.h              |   26 +++++++
 mm/memcontrol.c         ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:50 pm

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Report non-reclaimable pages per zone and system wide.

Kosaki Motohiro added support for memory controller noreclaim
statistics.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

 drivers/base/node.c |    6 ++++++
 fs/proc/proc_misc.c |    6 ++++++
 mm/memcontrol.c     |    6 ++++++
 mm/page_alloc.c     |   16 +++++++++++++++-
 mm/vmstat.c         |    3 +++
 5 files changed, 36 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c	2008-05-28 10:39:23.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c	2008-05-28 10:42:52.000000000 -0400
@@ -1918,12 +1918,20 @@ void show_free_areas(void)
 	}
 
 	printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
-		" inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+		" inactive_file:%lu"
+//TODO:  check/adjust line lengths
+#ifdef CONFIG_NORECLAIM_LRU
+		" noreclaim:%lu"
+#endif
+		" dirty:%lu writeback:%lu unstable:%lu\n"
 		" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
 		global_page_state(NR_ACTIVE_ANON),
 		global_page_state(NR_ACTIVE_FILE),
 		global_page_state(NR_INACTIVE_ANON),
 		global_page_state(NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM_LRU
+		global_page_state(NR_NORECLAIM),
+#endif
 		global_page_state(NR_FILE_DIRTY),
 		global_page_state(NR_WRITEBACK),
 		global_page_state(NR_UNSTABLE_NFS),
@@ -1950,6 +1958,9 @@ void show_free_areas(void)
 			" inactive_anon:%lukB"
 			" active_file:%lukB"
 			" inactive_file:%lukB"
+#ifdef CONFIG_NORECLAIM_LRU
+			" noreclaim:%lukB"
+#endif
 			" present:%lukB"
 			" pages_scanned:%lu"
 			" all_unreclaimable? %s"
@@ -1963,6 +1974,9 @@ void show_free_areas(void)
 			K(zone_page_state(zone, NR_INACTIVE_ANON)),
 ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:50 pm

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists.  When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list.  Round and round she goes...

Define new address_space flag [shares address_space flags member
with mapping's gfp mask] to indicate that the address space contains
all non-reclaimable pages.  This will provide for efficient testing
of ramdisk pages in page_reclaimable().

Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.

Set the noreclaim state on address_space structures for new
ramdisk inodes.  Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.

Similarly, ramfs pages are non-reclaimable.  Set the 'noreclaim'
address_space flag for new ramfs inodes.

These changes depend on [CONFIG_]NORECLAIM_LRU.
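
To make the "shares the flags member with the gfp mask" point concrete,
here is a small userspace sketch in C.  It is only a model: the toy_*
names are made up, TOY_GFP_BITS_SHIFT is an assumed value (the real
__GFP_BITS_SHIFT depends on the kernel version), and plain bit operations
stand in for the kernel's atomic set_bit()/test_bit().

```c
#include <stdbool.h>

/* Assumed value for illustration; not the kernel's __GFP_BITS_SHIFT. */
#define TOY_GFP_BITS_SHIFT 20
#define TOY_GFP_BITS_MASK  ((1ul << TOY_GFP_BITS_SHIFT) - 1)
/* Mirrors the patch's AS_NORECLAIM: a bit number above the gfp bits. */
#define TOY_AS_NORECLAIM   (TOY_GFP_BITS_SHIFT + 2)

struct toy_address_space {
	unsigned long flags;  /* low bits: gfp mask, high bits: AS_* flags */
};

static void toy_mapping_set_noreclaim(struct toy_address_space *mapping)
{
	mapping->flags |= 1ul << TOY_AS_NORECLAIM;
}

static void toy_mapping_clear_noreclaim(struct toy_address_space *mapping)
{
	mapping->flags &= ~(1ul << TOY_AS_NORECLAIM);
}

static bool toy_mapping_non_reclaimable(const struct toy_address_space *mapping)
{
	return (mapping->flags >> TOY_AS_NORECLAIM) & 1ul;
}

/* The gfp mask stays readable no matter how the AS_* bits are set. */
static unsigned long toy_mapping_gfp_mask(const struct toy_address_space *mapping)
{
	return mapping->flags & TOY_GFP_BITS_MASK;
}
```

Because the noreclaim state is a single bit in a word the reclaim code
already has in hand, the test costs one load and one mask, which is why
it is cheap enough for page_reclaimable().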

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>

 drivers/block/brd.c     |   13 +++++++++++++
 fs/ramfs/inode.c        |    1 +
 include/linux/pagemap.h |   22 ++++++++++++++++++++++
 mm/vmscan.c             |    5 +++++
 4 files changed, 41 insertions(+)

Index: linux-2.6.26-rc2-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagemap.h	2008-05-28 13:01:14.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagemap.h	2008-05-28 13:02:50.000000000 -0400
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
 	}
 }
 
+#ifdef CONFIG_NORECLAIM_LRU
+#define AS_NORECLAIM	(__GFP_BITS_SHIFT + 2)	/* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+	set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:50 pm

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Against:  2.6.26-rc2-mm1

While working with Nick Piggin's mlock patches, I noticed that
shmem segments locked via shmctl(SHM_LOCK) were not being handled.
SHM_LOCKed pages work like ramdisk pages--the writeback function
just redirties the page so that it can't be reclaimed.  Deal with
these using the same approach as for ram disk pages.

Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed
shared memory regions as non-reclaimable.  Then these pages
will be culled off the normal LRU lists during vmscan.

Add new wrapper function to clear the mapping's noreclaim state
when/if shared memory segment is munlocked.

Add 'scan_mapping_noreclaim_pages()' to mm/vmscan.c to scan all
pages in the shmem segment's mapping [struct address_space] for
reclaimability now that they're no longer locked.  Move any that are
reclaimable to the appropriate zone lru list.  Note that
scan_mapping_noreclaim_pages() must be able to sleep in lock_page(),
so we can't call it holding the shmem info spinlock nor the shmid
spinlock.  So, we pass the mapping [address_space] back to shmctl()
on SHM_UNLOCK for rescuing any nonreclaimable pages after dropping
the spinlocks.  Once we drop the shmid lock, the backing shmem file
can be deleted if the calling task doesn't have the shm area
attached.  To handle this, we take an extra reference on the file
before dropping the shmid lock and drop the reference after scanning
the mapping's noreclaim pages.

Changes depend on [CONFIG_]NORECLAIM_LRU.

Signed-off-by:  Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by:  Rik van Riel <riel@redhat.com>
Signed-off-by:  Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>

 include/linux/mm.h      |    9 ++--
 include/linux/pagemap.h |   12 ++++--
 include/linux/swap.h    |    4 ++
 ipc/shm.c               |   20 +++++++++-
 mm/shmem.c              |   10 +++--
 mm/vmscan.c             |   93 ++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 136 ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

Originally
From: Nick Piggin <npiggin@suse.de>

Against:  2.6.26-rc2-mm1

V8:
+ more refinement of rmap interaction, including attempt to
  handle mlocked pages in non-linear mappings.
+ cleanup of lockdep reported errors.
+ enhancement of munlock page table walker to detect and 
  handle pages under migration [migration ptes].

V6:
+ Kosaki-san and Rik van Riel:  added check for "page mapped
  in vma" to try_to_unlock() processing in try_to_unmap_anon().
+ Kosaki-san added munlock page table walker to avoid use of
  get_user_pages() for munlock.  get_user_pages() proved to be
  unreliable for some types of vmas.
+ added filtering of "special" vmas.  Some [_IO||_PFN] we skip
  altogether.  Others, we just "make_pages_present" to simulate
  old behavior--i.e., populate page tables.  Clear/don't set
  VM_LOCKED in non-mlockable vmas so that we don't try to unlock
  at exit/unmap time.
+ rework PG_mlock page flag definitions for new page flags
  macros.
+ Clear PageMlocked when COWing a page into a VM_LOCKED vma
  so we don't leave an mlocked page in another non-mlocked
  vma.  If the other vma[s] had the page mlocked, we'll re-mlock
  it if/when we try to reclaim it.  This is less expensive than
  walking the rmap in the COW/fault path.
+ in vmscan:shrink_page_list(), avoid  adding anon page to
  the swap cache if it's in a VM_LOCKED vma, even tho'
  PG_mlocked might not be set.  Call try_to_unlock() to
  determine this.  As a result, we'll never try to unmap
  an mlocked anon page.
+ in support of the above change, updated try_to_unlock()
  to use same logic as try_to_unmap() when it encounters a
  VM_LOCKED vma--call mlock_vma_page() directly.  Added
  stub try_to_unlock() for vmscan when NORECLAIM_MLOCK
  not configured.

V4 -> V5:
+ fixed problem with placement of #ifdef CONFIG_NORECLAIM_MLOCK
  in prep_new_page() [Thanks, minchan Kim!].

V3 -> V4:
+ Added #ifdef CONFIG_NORECLAIM_MLOCK, #endif around use of
  PG_mlocked in free_page_check(), et al.  Not defined ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against:  2.6.26-rc2-mm1

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when
  not configured.

New in V2.

We need to hold the mmap_sem for write to initiate mlock()/munlock()
because we may need to merge/split vmas.  However, this can lead to
very long lock hold times attempting to fault in a large memory region
to mlock it into memory.   This can hold off other faults against the
mm [multithreaded tasks] and other scans of the mm, such as via /proc.
To alleviate this, downgrade the mmap_sem to read mode during the 
population of the region for locking.  This is especially the case 
if we need to reclaim memory to lock down the region.  We [probably?]
don't need to do this for unlocking as all of the pages should be
resident--they're already mlocked.

Now, the callers of the mlock functions [mlock_fixup() and 
mlock_vma_pages_range()] expect the mmap_sem to be returned in write
mode.  Changing all callers appears to be way too much effort at this
point.  So, restore write mode before returning.  Note that this opens
a window where the mmap list could change in a multithreaded process.
So, at least for mlock_fixup(), where we could be called in a loop over
multiple vmas, we check that a vma still exists at the start address
and that vma still covers the page range [start,end).  If not, we return
an error, -EAGAIN, and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup()
if the vma at 'start' disappears or changes so that the page range
[start,end) is no longer contained in the vma.  Again, let the caller
deal with it.  Looks like only sys_remap_file_pages() [via mmap_region()]
should actually care.
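
The revalidation step described above can be sketched in userspace C as
follows.  This is only a model of the check, not the patch's code: the
toy_vma struct and toy_revalidate_range() are hypothetical names, and the
'vma' argument stands for whatever a fresh lookup at 'start' returned
after the mmap_sem was retaken in write mode (NULL if nothing was found).

```c
#include <errno.h>
#include <stddef.h>

/* Toy vma covering the half-open range [vm_start, vm_end). */
struct toy_vma {
	unsigned long vm_start;
	unsigned long vm_end;
};

/*
 * Return 0 if the vma found at 'start' still covers [start, end),
 * otherwise -EAGAIN so the caller can decide what to do.
 */
static int toy_revalidate_range(const struct toy_vma *vma,
				unsigned long start, unsigned long end)
{
	if (vma == NULL)
		return -EAGAIN;   /* vma at 'start' disappeared */
	if (start < vma->vm_start || end > vma->vm_end)
		return -EAGAIN;   /* vma was split, shrunk or replaced */
	return 0;
}
```

A caller looping over several vmas would run this check after each
downgrade/upgrade cycle and stop at the first -EAGAIN.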

With this patch, I no longer see processes like ps(1) blocked for seconds
or minutes at a time waiting for a large [multiple gigabyte] region to be
locked down.  However, I occasionally see delays ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

Originally
From: Nick Piggin <npiggin@suse.de>

Against:  2.6.26-rc2-mm1

V6:
+ munlock page in range of VM_LOCKED vma being covered by
  remap_file_pages(), as this is an implied unmap of the
  range.
+ in support of special vma filtering, don't account for
  non-mlockable vmas as locked_vm. 

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no changes]

V1 -> V2:
+  modified mmap.c:mmap_region() to return error if mlock_vma_pages_range()
   does.  This can only occur if the vma gets removed/changed while
   we're switching mmap_sem lock modes.   Most callers don't care, but
   sys_remap_file_pages() appears to.

Rework of Nick Piggin's "mm: move mlocked pages off the LRU" patch
-- part 2 of 2.

Remove mlocked pages from the LRU using "NoReclaim infrastructure"
during mmap(), munmap(), mremap() and truncate().  Try to move back
to normal LRU lists on munmap() when last mlocked mapping removed.
Remove PageMlocked() status when page truncated from file.

Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

 mm/fremap.c   |   26 +++++++++++++++++---
 mm/internal.h |   13 ++++++++--
 mm/mlock.c    |   10 ++++---
 mm/mmap.c     |   75 ++++++++++++++++++++++++++++++++++++++++++++--------------
 mm/mremap.c   |    8 +++---
 mm/truncate.c |    4 +++
 6 files changed, 106 insertions(+), 30 deletions(-)

Index: linux-2.6.26-rc2-mm1/mm/mmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mmap.c	2008-05-23 11:01:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mmap.c	2008-05-23 11:01:41.000000000 -0400
@@ -32,6 +32,8 @@
 #include <asm/tlb.h>
 #include <asm/mmu_context.h>
 
+#include "internal.h"
+
 #ifndef arch_mmap_check
 #define arch_mmap_check(addr, len, flags)	(0)
 #endif
@@ -961,6 +963,7 @@ unsigned long do_mmap_pgoff(struct file ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From: Nick Piggin <npiggin@suse.de>
  To: Linux Memory Management <linux-mm@kvack.org>
  Subject: [patch 4/4] mm: account mlocked pages
  Date:	Mon, 12 Mar 2007 07:39:14 +0100 (CET)

Against:  2.6.26-rc2-mm1

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix definitions of NR_MLOCK to fix build errors when not configured.

V1 -> V2:
+  new in V2 -- pulled in & reworked from Nick's previous series

Add NR_MLOCK zone page state, which provides a (conservative) count of
mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the
Reclaim Scalability series.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

 drivers/base/node.c    |   24 +++++++++++++++---------
 fs/proc/proc_misc.c    |    6 ++++++
 include/linux/mmzone.h |    5 +++++
 mm/internal.h          |   14 +++++++++++---
 mm/mlock.c             |   22 ++++++++++++++++++----
 mm/vmstat.c            |    3 +++
 6 files changed, 58 insertions(+), 16 deletions(-)

Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c	2008-05-22 15:24:51.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c	2008-05-22 15:26:49.000000000 -0400
@@ -69,6 +69,9 @@ static ssize_t node_read_meminfo(struct 
 		       "Node %d Inactive(file): %8lu kB\n"
 #ifdef CONFIG_NORECLAIM_LRU
 		       "Node %d Noreclaim:      %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+		       "Node %d Mlocked:        %8lu kB\n"
+#endif
 #endif
 #ifdef CONFIG_HIGHMEM
 		       "Node %d HighTotal:      %8lu kB\n"
@@ -91,16 +94,19 @@ static ssize_t node_read_meminfo(struct 
 		       nid, K(i.totalram),
 		       nid, K(i.freeram),
 		       nid, K(i.totalram - i.freeram),
-		       nid, ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against:  2.6.26-rc2-mm1

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series.

V1 -> V2:
+  no changes

"Optional" part of "noreclaim infrastructure"

In the fault paths that install new anonymous pages, check whether
the page is reclaimable or not using lru_cache_add_active_or_noreclaim().
If the page is reclaimable, just add it to the active lru list [via
the pagevec cache], else add it to the noreclaim list.  

This "proactive" culling in the fault path mimics the handling of
mlocked pages in Nick Piggin's series to keep mlocked pages off
the lru lists.

Notes:

1) This patch is optional--e.g., if one is concerned about the
   additional test in the fault path.  We can defer the moving of
   nonreclaimable pages until when vmscan [shrink_*_list()]
   encounters them.  Vmscan will only need to handle such pages
   once.

2) The 'vma' argument to page_reclaimable() is required to notice that
   we're faulting a page into an mlock()ed vma w/o having to scan the
   page's rmap in the fault path.   Culling mlock()ed anon pages is
   currently the only reason for this patch.

3) We can't cull swap pages in read_swap_cache_async() because the
   vma argument doesn't necessarily correspond to the swap cache
   offset passed in by swapin_readahead().  This could [did!] result
   in mlocking pages in non-VM_LOCKED vmas if [when] we tried to
   cull in this path.

4) Move set_pte_at() to after where we add page to lru to keep it
   hidden from other tasks that might walk the page table.
   We already do it in this order in do_anonymous_page().  And,
   these are COW'd anon pages.  Is this safe?


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

 include/linux/swap.h |    2 ++
 mm/memory.c          |   20 ++++++++++++--------
 mm/swap.c            |   21 +++++++++++++++++++++
 3 files changed, 35 insertions(+), 8 deletions(-)

Index: ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From:  Lee Schermerhorn <lee.schermerhorn@hp.com>

Against:  2.6.26-rc2-mm1

Add some event counters to vmstats for testing noreclaim/mlock.  
Some of these might be interesting enough to keep around.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/vmstat.h |   11 +++++++++++
 mm/internal.h          |    4 +++-
 mm/mlock.c             |   33 +++++++++++++++++++++++++--------
 mm/vmscan.c            |   16 +++++++++++++++-
 mm/vmstat.c            |   12 ++++++++++++
 5 files changed, 66 insertions(+), 10 deletions(-)

Index: linux-2.6.26-rc2-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/vmstat.h	2008-05-28 13:01:13.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/vmstat.h	2008-05-28 13:03:10.000000000 -0400
@@ -41,6 +41,17 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
 #ifdef CONFIG_HUGETLB_PAGE
 		HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
 #endif
+#ifdef CONFIG_NORECLAIM_LRU
+		NORECL_PGCULLED,	/* culled to noreclaim list */
+		NORECL_PGSCANNED,	/* scanned for reclaimability */
+		NORECL_PGRESCUED,	/* rescued from noreclaim list */
+#ifdef CONFIG_NORECLAIM_MLOCK
+		NORECL_PGMLOCKED,
+		NORECL_PGMUNLOCKED,
+		NORECL_PGCLEARED,
+		NORECL_PGSTRANDED,	/* unable to isolate on unlock */
+#endif
+#endif
 		NR_VM_EVENT_ITEMS
 };
 
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c	2008-05-28 13:02:55.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c	2008-05-28 13:03:10.000000000 -0400
@@ -453,12 +453,13 @@ int putback_lru_page(struct page *page)
 {
 	int lru;
 	int ret = 1;
+	int was_nonreclaimable;
 
 	VM_BUG_ON(!PageLocked(page));
 	VM_BUG_ON(PageLRU(page));
 
 	lru = !!TestClearPageActive(page);
-	ClearPageNoreclaim(page);	/* for page_reclaimable() */
+	was_nonreclaimable = ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against:  2.6.26-rc2-mm1

V6:
+ moved to end of series as optional debug patch

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

New in V2

This patch adds a function to scan individual or all zones' noreclaim
lists and move any pages that have become reclaimable onto the respective
zone's inactive list, where shrink_inactive_list() will deal with them.

Adds sysctl to scan all nodes, and per node attributes to individual
nodes' zones.
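
For readers who want the shape of the scan without the kernel details,
here is a toy userspace model in C.  It is an assumption-laden sketch:
the toy_scan_page struct and toy_scan_noreclaim() are invented names, an
array stands in for a zone's noreclaim list, and no locking is modeled.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy page: which list it is on, and whether it is reclaimable now. */
struct toy_scan_page {
	bool reclaimable;
	bool on_noreclaim;
	bool on_inactive;
};

/*
 * Walk the toy "noreclaim list" and move every page that has become
 * reclaimable to the inactive list.  Returns the number rescued.
 */
static size_t toy_scan_noreclaim(struct toy_scan_page *pages, size_t n)
{
	size_t rescued = 0;

	for (size_t i = 0; i < n; i++) {
		if (pages[i].on_noreclaim && pages[i].reclaimable) {
			pages[i].on_noreclaim = false;
			pages[i].on_inactive = true;
			rescued++;
		}
	}
	return rescued;
}
```

Pages that are still nonreclaimable are left where they are, so a second
pass over the same list rescues nothing until some page changes state.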

Kosaki:
If a reclaimable page is found on the noreclaim lru during a write to
/proc/sys/vm/scan_noreclaim_pages, print the filename and file offset
of that page.

TODO:  DEBUGGING ONLY: NOT FOR UPSTREAM MERGE

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>


 drivers/base/node.c  |    5 +
 include/linux/rmap.h |    3 
 include/linux/swap.h |   15 ++++
 kernel/sysctl.c      |   10 +++
 mm/rmap.c            |    4 -
 mm/vmscan.c          |  161 +++++++++++++++++++++++++++++++++++++++++++++++++++
 6 files changed, 196 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h	2008-05-28 13:03:07.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h	2008-05-28 13:03:13.000000000 -0400
@@ -7,6 +7,7 @@
 #include <linux/list.h>
 #include <linux/memcontrol.h>
 #include <linux/sched.h>
+#include <linux/node.h>
 
 #include <asm/atomic.h>
 #include <asm/page.h>
@@ -235,15 +236,29 @@ static inline int zone_reclaim(struct zo
 #ifdef CONFIG_NORECLAIM_LRU
 extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
 extern void scan_mapping_noreclaim_pages(struct address_space *);
+
+extern unsigned long scan_noreclaim_pages;
+extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
+					void ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against:  2.6.26-rc2-mm1

Allow free of mlock()ed pages.  This shouldn't happen, but during
development, it occasionally did.

This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.


Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 include/linux/vmstat.h |    1 +
 mm/internal.h          |   17 +++++++++++++++++
 mm/page_alloc.c        |    1 +
 mm/vmstat.c            |    1 +
 4 files changed, 20 insertions(+)

Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h	2008-05-28 10:12:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h	2008-05-28 10:15:20.000000000 -0400
@@ -152,6 +152,22 @@ static inline void mlock_migrate_page(st
 	}
 }
 
+/*
+ * free_page_mlock() -- clean up attempts to free an mlocked page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static inline void free_page_mlock(struct page *page)
+{
+	if (unlikely(TestClearPageMlocked(page))) {
+		unsigned long flags;
+
+		local_irq_save(flags);
+		__dec_zone_page_state(page, NR_MLOCK);
+		__count_vm_event(NORECL_MLOCKFREED);
+		local_irq_restore(flags);
+	}
+}
 
 #else /* CONFIG_NORECLAIM_MLOCK */
 static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
@@ -161,6 +177,7 @@ static inline int is_mlocked_vma(struct 
 static inline void clear_page_mlock(struct page *page) { }
 static inline void mlock_vma_page(struct page *page) { }
 static inline void mlock_migrate_page(struct page *new, struct page *old) { }
+static inline void free_page_mlock(struct page *page) { }
 
 #endif /* CONFIG_NORECLAIM_MLOCK */
 
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c	2008-05-28 10:12:15.000000000 -0400
+++ ...
From: Lee Schermerhorn
Date: Thursday, May 29, 2008 - 12:51 pm

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Documentation for noreclaim lru list and its usage.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

 Documentation/vm/noreclaim-lru.txt |  609 +++++++++++++++++++++++++++++++++++++
 1 file changed, 609 insertions(+)

Index: linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt	2008-05-28 14:01:32.000000000 -0400
@@ -0,0 +1,609 @@
+
+This document describes the Linux memory management "Noreclaim LRU"
+infrastructure and the use of this infrastructure to manage several types
+of "non-reclaimable" pages.  The document attempts to provide the overall
+rationale behind this mechanism and the rationale for some of the design
+decisions that drove the implementation.  The latter design rationale is
+discussed in the context of an implementation description.  Admittedly, one
+can obtain the implementation details--the "what does it do?"--by reading the
+code.  One hopes that the descriptions below add value by providing the answer
+to "why does it do that?".
+
+Noreclaim LRU Infrastructure:
+
+The Noreclaim LRU adds an additional LRU list to track non-reclaimable pages
+and to hide these pages from vmscan.  This mechanism is based on a patch by
+Larry Woodman of Red Hat to address several scalability problems with page
+reclaim in Linux.  The problems have been observed at customer sites on large
+memory x86_64 systems.  For example, a non-NUMA x86_64 platform with 128GB
+of main memory will have over 32 million 4k pages in a single zone.  When a
+large fraction of these pages are not reclaimable for any reason [see below],
+vmscan will spend a lot of time scanning the LRU lists looking for the small
+fraction of pages that are reclaimable.  This can result in a situation where
+all cpus are spending 100% of their time in vmscan for hours or ...
From: KOSAKI Motohiro
Date: Friday, May 30, 2008 - 2:27 am

Note:
On a Fujitsu server (IA64, 8 CPUs, 8GB), this patch series has also run well for 48+ hours :)



--
