The patches to follow are a continuation of the V8 "VM pageout scalability improvements" series that Rik van Riel posted to LKML on 23May08. These patches apply atop Rik's series with the following overlap:

Patches 13 through 16 replace the corresponding patches in Rik's posting.

Patch 13, the noreclaim lru infrastructure, now includes Kosaki Motohiro's memcontrol enhancements to track nonreclaimable pages.

Patches 14 and 15 are largely unchanged, except for a refresh. They include some minor statistics formatting cleanup.

Patch 16 includes a fix for a potential [unobserved] race condition during SHM_UNLOCK.

---

Additional patches in this series:

Patches 17 through 20 keep mlocked pages off the normal [in]active LRU lists using the noreclaim lru infrastructure. These patches represent a fairly significant rework of an RFC patch originally posted by Nick Piggin.

Patches 21 and 22 are optional, but recommended, enhancements to the overall noreclaim series.

Patches 23 and 24 are optional enhancements useful during debug and testing.

Patch 25 is a rather verbose document describing the noreclaim lru infrastructure and the use thereof to keep ramfs, SHM_LOCKED and mlocked pages off the normal LRU lists.

---

The entire stack, including Rik's split lru patches, is holding up very well under stress loads. E.g., it ran for over 90 hours last weekend on both x86_64 [32GB, 8core] and ia64 [32GB, 16cpu] platforms without error.

I think these are ready for a spin in -mm atop Rik's patches.

Lee
--
On Thu, 29 May 2008 15:50:30 -0400 I was >this< close to getting onto Rik's patches (honest) but a few other people have been kicking the tyres and seem to have caused some punctures so I'm expecting V9? --
On Thu, 29 May 2008 13:16:24 -0700 If I send you a V9 up to patch 12, you can apply Lee's patches straight over my V9 :) *fidgets with quilt mail* -- All rights reversed. --
>>>>> "Rik" == Rik van Riel <riel@redhat.com> writes: Rik> On Thu, 29 May 2008 13:16:24 -0700 Rik> If I send you a V9 up to patch 12, you can apply Lee's patches Rik> straight over my V9 :) I haven't seen any performance numbers talking about how well this stuff works on single or dual CPU machines with smaller amounts of memory, or whether it's worth using on these machines at all? The big machines with lots of memory and lots of CPUs are certainly becoming more prevalent, but for my home machine with 4Gb RAM and dual core, what's the advantage? Let's not slow down the common case for the sake of the bigger guys if possible. John --
On Fri, 30 May 2008 09:52:48 -0400 I wouldn't call your home system with 4GB RAM "small". After all, the VM that Linux currently has was developed mostly on machines with less than 1GB of RAM and later encrusted in bandaids to make sure the large systems did not fail too badly. As for small system performance, I believe that my patch series should cause no performance regressions on those systems and has a framework that allows us to improve performance on those systems too. If you manage to break performance with my patch set somehow, please let me know so I can fix it. Something like the VM is very subtle and any change is pretty much guaranteed to break something, so I am very interested in feedback. -- All rights reversed. --
Rik> On Fri, 30 May 2008 09:52:48 -0400
Rik> I wouldn't call your home system with 4GB RAM "small".

*grin* me either in some ways. But my other main linux box, which acts as an NFS server has 2Gb of RAM, but a pair of PIII Xeons at 550mhz. This is the box I'd be worried about in some ways, since it handles a bunch of stuff like backups, mysql, apache, NFS server, etc.

Rik> After all, the VM that Linux currently has was developed mostly
Rik> on machines with less than 1GB of RAM and later encrusted in
Rik> bandaids to make sure the large systems did not fail too badly.

Sure, I understand.

Rik> As for small system performance, I believe that my patch series
Rik> should cause no performance regressions on those systems and has
Rik> a framework that allows us to improve performance on those
Rik> systems too.

Great! It would be nice to just be able to track this nicely.

Rik> If you manage to break performance with my patch set somehow,
Rik> please let me know so I can fix it. Something like the VM is
Rik> very subtle and any change is pretty much guaranteed to break
Rik> something, so I am very interested in feedback.

What are you using to test/benchmark your changes as you develop this patchset? What would you suggest as a test load to help check performance?

John
--
On Fri, 30 May 2008 10:36:05 -0400 Your normal workload. I am doing some IO throughput, swap throughput and database tests, however those are probably not representative of what YOU throw at the VM. There are no VM benchmarks that cover everything, so what is needed most at this point is real world exposure. I cannot promise that the code is perfect; all I can promise is that I will try to fix any performance issue that people find. -- All rights reversed. --
I failed to apply Lee's patches over your V9.

barrios@barrios-desktop:~/linux-2.6$ patch -p1 < /tmp/msg0_13.txt
patching file mm/Kconfig
patching file include/linux/page-flags.h
patching file include/linux/mmzone.h
patching file mm/page_alloc.c
patching file include/linux/mm_inline.h
patching file include/linux/swap.h
patching file include/linux/pagevec.h
patching file mm/swap.c
patching file mm/migrate.c
patching file mm/vmscan.c
Hunk #10 FAILED at 1162.
Hunk #11 succeeded at 1210 (offset 3 lines).
Hunk #12 succeeded at 1242 (offset 3 lines).
Hunk #13 succeeded at 1380 (offset 3 lines).
Hunk #14 succeeded at 1411 (offset 3 lines).
Hunk #15 succeeded at 1962 (offset 3 lines).
Hunk #16 succeeded at 2300 (offset 3 lines).
1 out of 16 hunks FAILED -- saving rejects to file mm/vmscan.c.rej
patching file mm/mempolicy.c
patching file mm/internal.h
patching file mm/memcontrol.c
patching file include/linux/memcontrol.h

--
Kind regards,
MinChan Kim
--
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Infrastructure to manage pages excluded from reclaim--i.e., hidden from vmscan. Based on a patch by Larry Woodman of Red Hat. Reworked to maintain "nonreclaimable" pages on a separate per-zone LRU list, to "hide" them from vmscan.

Kosaki Motohiro added the support for the memory controller noreclaim lru list.

Pages on the noreclaim list have both PG_noreclaim and PG_lru set. Thus, PG_noreclaim is analogous to and mutually exclusive with PG_active--it specifies which LRU list the page is on.

The noreclaim infrastructure is enabled by a new mm Kconfig option [CONFIG_]NORECLAIM_LRU.

A new function 'page_reclaimable(page, vma)' in vmscan.c tests whether or not a page is reclaimable. Subsequent patches will add the various !reclaimable tests. We'll want to keep these tests light-weight for use in shrink_active_list() and, possibly, the fault path.

To avoid races between tasks putting pages [back] onto an LRU list and tasks that might be moving the page from nonreclaimable to reclaimable state, one should test reclaimability under page lock and place nonreclaimable pages directly on the noreclaim list before dropping the lock. Otherwise, we risk "stranding" reclaimable pages on the noreclaim list. It's OK to use the pagevec caches for reclaimable pages. The new function 'putback_lru_page()'--inverse to 'isolate_lru_page()'--handles this transition, including potential page truncation while the page is unlocked.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

include/linux/memcontrol.h | 2
include/linux/mm_inline.h | 13 ++-
include/linux/mmzone.h | 24 ++++++
include/linux/page-flags.h | 13 +++
include/linux/pagevec.h | 1
include/linux/swap.h | 12 +++
mm/Kconfig | 10 ++
mm/internal.h | 26 +++++++
mm/memcontrol.c ...
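For readers who want the flag and locking rules above in executable form, here is a minimal userspace sketch. It is not the kernel code: `struct toy_page`, `toy_page_reclaimable()` and `toy_putback_lru_page()` are hypothetical stand-ins, and the `nonreclaimable` field simply stands for whatever test the later patches add. It only illustrates that the reclaimability test and the list choice happen while the page lock is held, and that the active and noreclaim flags never end up set together.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins; none of these are kernel definitions. */
enum toy_lru { TOY_LRU_INACTIVE, TOY_LRU_ACTIVE, TOY_LRU_NORECLAIM };

struct toy_page {
	bool locked;         /* models the page lock being held */
	bool lru;            /* models PG_lru */
	bool active;         /* models PG_active */
	bool noreclaim;      /* models PG_noreclaim */
	bool nonreclaimable; /* stand-in for whatever makes a page !reclaimable */
};

/* Stand-in for page_reclaimable(); the real tests come in later patches. */
static bool toy_page_reclaimable(const struct toy_page *page)
{
	return !page->nonreclaimable;
}

/*
 * Put an isolated page back on an LRU list. The caller holds the page
 * lock, so the reclaimability test and the list choice cannot be
 * separated by a concurrent state change.
 */
static enum toy_lru toy_putback_lru_page(struct toy_page *page)
{
	bool was_active;

	assert(page->locked);
	assert(!page->lru);

	was_active = page->active;
	page->active = false;
	page->noreclaim = false;

	if (toy_page_reclaimable(page))
		page->active = was_active;
	else
		page->noreclaim = true;
	page->lru = true;

	/* The two list-selector flags are mutually exclusive. */
	assert(!(page->active && page->noreclaim));

	if (page->noreclaim)
		return TOY_LRU_NORECLAIM;
	return page->active ? TOY_LRU_ACTIVE : TOY_LRU_INACTIVE;
}
```

In this model a locked, isolated page that fails the reclaimability test always lands on the noreclaim list, whatever its previous active state was.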
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Report non-reclaimable pages per zone and system wide.

Kosaki Motohiro added support for memory controller noreclaim statistics.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

drivers/base/node.c | 6 ++++++
fs/proc/proc_misc.c | 6 ++++++
mm/memcontrol.c | 6 ++++++
mm/page_alloc.c | 16 +++++++++++++++-
mm/vmstat.c | 3 +++
5 files changed, 36 insertions(+), 1 deletion(-)

Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-28 10:39:23.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/page_alloc.c 2008-05-28 10:42:52.000000000 -0400
@@ -1918,12 +1918,20 @@ void show_free_areas(void)
}
printk("Active_anon:%lu active_file:%lu inactive_anon%lu\n"
- " inactive_file:%lu dirty:%lu writeback:%lu unstable:%lu\n"
+ " inactive_file:%lu"
+//TODO: check/adjust line lengths
+#ifdef CONFIG_NORECLAIM_LRU
+ " noreclaim:%lu"
+#endif
+ " dirty:%lu writeback:%lu unstable:%lu\n"
" free:%lu slab:%lu mapped:%lu pagetables:%lu bounce:%lu\n",
global_page_state(NR_ACTIVE_ANON),
global_page_state(NR_ACTIVE_FILE),
global_page_state(NR_INACTIVE_ANON),
global_page_state(NR_INACTIVE_FILE),
+#ifdef CONFIG_NORECLAIM_LRU
+ global_page_state(NR_NORECLAIM),
+#endif
global_page_state(NR_FILE_DIRTY),
global_page_state(NR_WRITEBACK),
global_page_state(NR_UNSTABLE_NFS),
@@ -1950,6 +1958,9 @@ void show_free_areas(void)
" inactive_anon:%lukB"
" active_file:%lukB"
" inactive_file:%lukB"
+#ifdef CONFIG_NORECLAIM_LRU
+ " noreclaim:%lukB"
+#endif
" present:%lukB"
" pages_scanned:%lu"
" all_unreclaimable? %s"
@@ -1963,6 +1974,9 @@ void show_free_areas(void)
K(zone_page_state(zone, NR_INACTIVE_ANON)), ...
From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>
Christoph Lameter pointed out that ram disk pages also clutter the
LRU lists. When vmscan finds them dirty and tries to clean them,
the ram disk writeback function just redirties the page so that it
goes back onto the active list. Round and round she goes...
Define new address_space flag [shares address_space flags member
with mapping's gfp mask] to indicate that the address space contains
all non-reclaimable pages. This will provide for efficient testing
of ramdisk pages in page_reclaimable().
Also provide wrapper functions to set/test the noreclaim state to
minimize #ifdefs in ramdisk driver and any other users of this
facility.
Set the noreclaim state on address_space structures for new
ramdisk inodes. Test the noreclaim state in page_reclaimable()
to cull non-reclaimable pages.
Similarly, ramfs pages are non-reclaimable. Set the 'noreclaim'
address_space flag for new ramfs inodes.
These changes depend on [CONFIG_]NORECLAIM_LRU.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
drivers/block/brd.c | 13 +++++++++++++
fs/ramfs/inode.c | 1 +
include/linux/pagemap.h | 22 ++++++++++++++++++++++
mm/vmscan.c | 5 +++++
4 files changed, 41 insertions(+)
Index: linux-2.6.26-rc2-mm1/include/linux/pagemap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/pagemap.h 2008-05-28 13:01:14.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/pagemap.h 2008-05-28 13:02:50.000000000 -0400
@@ -30,6 +30,28 @@ static inline void mapping_set_error(str
}
}
+#ifdef CONFIG_NORECLAIM_LRU
+#define AS_NORECLAIM (__GFP_BITS_SHIFT + 2) /* e.g., ramdisk, SHM_LOCK */
+
+static inline void mapping_set_noreclaim(struct address_space *mapping)
+{
+ set_bit(AS_NORECLAIM, &mapping->flags);
+}
+
+static inline int mapping_non_reclaimable(struct address_space ...

From: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

Against: 2.6.26-rc2-mm1

While working with Nick Piggin's mlock patches, I noticed that shmem segments locked via shmctl(SHM_LOCK) were not being handled. SHM_LOCKed pages work like ramdisk pages--the writeback function just redirties the page so that it can't be reclaimed. Deal with these using the same approach as for ram disk pages.

Use the AS_NORECLAIM flag to mark address_space of SHM_LOCKed shared memory regions as non-reclaimable. Then these pages will be culled off the normal LRU lists during vmscan.

Add new wrapper function to clear the mapping's noreclaim state when/if shared memory segment is munlocked.

Add 'scan_mapping_noreclaim_pages()' to mm/vmscan.c to scan all pages in the shmem segment's mapping [struct address_space] for reclaimability now that they're no longer locked. If so, move them to the appropriate zone lru list. Note that scan_mapping_noreclaim_pages() must be able to sleep in lock_page(), so we can't call it holding the shmem info spinlock nor the shmid spinlock. So, we pass the mapping [address_space] back to shmctl() on SHM_UNLOCK for rescuing any nonreclaimable pages after dropping the spinlocks. Once we drop the shmid lock, the backing shmem file can be deleted if the calling task doesn't have the shm area attached. To handle this, we take an extra reference on the file before dropping the shmid lock and drop the reference after scanning the mapping's noreclaim pages.

Changes depend on [CONFIG_]NORECLAIM_LRU.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: Kosaki Motohiro <kosaki.motohiro@jp.fujitsu.com>

include/linux/mm.h | 9 ++--
include/linux/pagemap.h | 12 ++++--
include/linux/swap.h | 4 ++
ipc/shm.c | 20 +++++++++-
mm/shmem.c | 10 +++--
mm/vmscan.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 136 ...
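The SHM_LOCK/SHM_UNLOCK handling described above can be modelled in userspace roughly as follows. This is only a sketch under stated assumptions: all `toy_*` names are hypothetical, the bit number is arbitrary (the patch defines AS_NORECLAIM relative to __GFP_BITS_SHIFT), plain bit operations stand in for the kernel's atomic set_bit() family, and the rescue loop ignores page locking and per-zone lists entirely.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bit number, chosen only for this userspace model. */
#define TOY_AS_NORECLAIM 2

struct toy_mapping {
	unsigned long flags; /* models address_space flags */
};

struct toy_shm_page {
	bool on_noreclaim_list;
};

/* SHM_LOCK side: mark every page of the mapping as non-reclaimable. */
static void toy_mapping_set_noreclaim(struct toy_mapping *m)
{
	m->flags |= 1UL << TOY_AS_NORECLAIM;
}

/* SHM_UNLOCK side: the mapping's pages may be reclaimable again. */
static void toy_mapping_clear_noreclaim(struct toy_mapping *m)
{
	m->flags &= ~(1UL << TOY_AS_NORECLAIM);
}

static bool toy_mapping_non_reclaimable(const struct toy_mapping *m)
{
	return m != NULL && (m->flags & (1UL << TOY_AS_NORECLAIM)) != 0;
}

/*
 * Model of the post-unlock rescue scan: walk the mapping's pages and
 * move any that sit on the noreclaim list back to a normal list.
 * Returns the number of pages rescued; does nothing while the mapping
 * is still marked noreclaim.
 */
static size_t toy_rescue_mapping_pages(const struct toy_mapping *m,
				       struct toy_shm_page *pages, size_t n)
{
	size_t i, rescued = 0;

	assert(pages != NULL || n == 0);
	if (toy_mapping_non_reclaimable(m))
		return 0;

	for (i = 0; i < n; i++) {
		if (pages[i].on_noreclaim_list) {
			pages[i].on_noreclaim_list = false;
			rescued++;
		}
	}
	return rescued;
}
```

The ordering is the part worth noticing: the flag is cleared first, and the scan runs afterwards, which in the patch happens only once the spinlocks have been dropped.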
Originally From: Nick Piggin <npiggin@suse.de>

Against: 2.6.26-rc2-mm1

V8:
+ more refinement of rmap interaction, including attempt to handle mlocked pages in non-linear mappings.
+ cleanup of lockdep reported errors.
+ enhancement of munlock page table walker to detect and handle pages under migration [migration ptes].

V6:
+ Kosaki-san and Rik van Riel: added check for "page mapped in vma" to try_to_unlock() processing in try_to_unmap_anon().
+ Kosaki-san added munlock page table walker to avoid use of get_user_pages() for munlock. get_user_pages() proved to be unreliable for some types of vmas.
+ added filtering of "special" vmas. Some [_IO||_PFN] we skip altogether. Others, we just "make_pages_present" to simulate old behavior--i.e., populate page tables. Clear/don't set VM_LOCKED in non-mlockable vmas so that we don't try to unlock at exit/unmap time.
+ rework PG_mlock page flag definitions for new page flags macros.
+ Clear PageMlocked when COWing a page into a VM_LOCKED vma so we don't leave an mlocked page in another non-mlocked vma. If the other vma[s] had the page mlocked, we'll re-mlock it if/when we try to reclaim it. This is less expensive than walking the rmap in the COW/fault path.
+ in vmscan:shrink_page_list(), avoid adding anon page to the swap cache if it's in a VM_LOCKED vma, even tho' PG_mlocked might not be set. Call try_to_unlock() to determine this. As a result, we'll never try to unmap an mlocked anon page.
+ in support of the above change, updated try_to_unlock() to use same logic as try_to_unmap() when it encounters a VM_LOCKED vma--call mlock_vma_page() directly. Added stub try_to_unlock() for vmscan when NORECLAIM_MLOCK not configured.

V4 -> V5:
+ fixed problem with placement of #ifdef CONFIG_NORECLAIM_MLOCK in prep_new_page() [Thanks, minchan Kim!].

V3 -> V4:
+ Added #ifdef CONFIG_NORECLAIM_MLOCK, #endif around use of PG_mlocked in free_page_check(), et al. Not defined ...
From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against: 2.6.26-rc2-mm1

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no change]
+ fix function return types [void -> int] to fix build when not configured.

New in V2.

We need to hold the mmap_sem for write to initiate mlock()/munlock() because we may need to merge/split vmas. However, this can lead to very long lock hold times attempting to fault in a large memory region to mlock it into memory. This can hold off other faults against the mm [multithreaded tasks] and other scans of the mm, such as via /proc.

To alleviate this, downgrade the mmap_sem to read mode during the population of the region for locking. This is especially the case if we need to reclaim memory to lock down the region. We [probably?] don't need to do this for unlocking as all of the pages should be resident--they're already mlocked.

Now, the callers of the mlock functions [mlock_fixup() and mlock_vma_pages_range()] expect the mmap_sem to be returned in write mode. Changing all callers appears to be way too much effort at this point. So, restore write mode before returning.

Note that this opens a window where the mmap list could change in a multithreaded process. So, at least for mlock_fixup(), where we could be called in a loop over multiple vmas, we check that a vma still exists at the start address and that vma still covers the page range [start,end). If not, we return an error, -EAGAIN, and let the caller deal with it.

Return -EAGAIN from mlock_vma_pages_range() function and mlock_fixup() if the vma at 'start' disappears or changes so that the page range [start,end) is no longer contained in the vma. Again, let the caller deal with it. Looks like only sys_remap_file_pages() [via mmap_region()] should actually care.

With this patch, I no longer see processes like ps(1) blocked for seconds or minutes at a time waiting for a large [multiple gigabyte] region to be locked down. However, I occasionally see delays ...
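The revalidation step described above (check, after write mode is restored, that the vma at 'start' still covers [start,end), else return -EAGAIN) can be sketched in userspace like this. The names are hypothetical and the sketch assumes the caller has already looked up the vma at 'start' again after re-taking the lock; it does no locking itself and is not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for the two vm_area_struct fields that matter here. */
struct toy_vma {
	unsigned long vm_start; /* first address covered */
	unsigned long vm_end;   /* first address past the end */
};

/*
 * 'vma' is whatever a fresh lookup at 'start' returned after the lock
 * was re-taken in write mode (NULL if nothing is mapped there now).
 * Returns 0 if it still covers [start, end), -EAGAIN otherwise.
 */
static int toy_revalidate_range(const struct toy_vma *vma,
				unsigned long start, unsigned long end)
{
	assert(start < end);

	if (vma == NULL)
		return -EAGAIN; /* vma disappeared while we held read mode */
	if (vma->vm_start > start || vma->vm_end < end)
		return -EAGAIN; /* vma was split or shrunk under us */
	return 0;
}
```

A caller looping over several vmas would stop at the first -EAGAIN and leave the decision to its own caller, which matches the "let the caller deal with it" policy in the description.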
Originally From: Nick Piggin <npiggin@suse.de>

Against: 2.6.26-rc2-mm1

V6:
+ munlock page in range of VM_LOCKED vma being covered by remap_file_pages(), as this is an implied unmap of the range.
+ in support of special vma filtering, don't account for non-mlockable vmas as locked_vm.

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series [no changes]

V1 -> V2:
+ modified mmap.c:mmap_region() to return error if mlock_vma_pages_range() does. This can only occur if the vma gets removed/changed while we're switching mmap_sem lock modes. Most callers don't care, but sys_remap_file_pages() appears to.

Rework of Nick Piggin's "mm: move mlocked pages off the LRU" patch -- part 2 of 2.

Remove mlocked pages from the LRU using "NoReclaim infrastructure" during mmap(), munmap(), mremap() and truncate(). Try to move back to normal LRU lists on munmap() when last mlocked mapping removed. Removed PageMlocked() status when page truncated from file.

Originally Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

mm/fremap.c | 26 +++++++++++++++++---
mm/internal.h | 13 ++++++++--
mm/mlock.c | 10 ++++---
mm/mmap.c | 75 ++++++++++++++++++++++++++++++++++++++++++++--------------
mm/mremap.c | 8 +++---
mm/truncate.c | 4 +++
6 files changed, 106 insertions(+), 30 deletions(-)

Index: linux-2.6.26-rc2-mm1/mm/mmap.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/mmap.c 2008-05-23 11:01:34.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/mmap.c 2008-05-23 11:01:41.000000000 -0400
@@ -32,6 +32,8 @@
#include <asm/tlb.h>
#include <asm/mmu_context.h>
+#include "internal.h"
+
#ifndef arch_mmap_check
#define arch_mmap_check(addr, len, flags) (0)
#endif
@@ -961,6 +963,7 @@ unsigned long do_mmap_pgoff(struct file ...
From: Nick Piggin <npiggin@suse.de>
To: Linux Memory Management <linux-mm@kvack.org>
Subject: [patch 4/4] mm: account mlocked pages
Date: Mon, 12 Mar 2007 07:39:14 +0100 (CET)

Against: 2.6.26-rc2-mm1

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series
+ fix definitions of NR_MLOCK to fix build errors when not configured.

V1 -> V2:
+ new in V2 -- pulled in & reworked from Nick's previous series

Add NR_MLOCK zone page state, which provides a (conservative) count of mlocked pages (actually, the number of mlocked pages moved off the LRU).

Reworked by lts to fit in with the modified mlock page support in the Reclaim Scalability series.

Signed-off-by: Nick Piggin <npiggin@suse.de>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

drivers/base/node.c | 24 +++++++++++++++---------
fs/proc/proc_misc.c | 6 ++++++
include/linux/mmzone.h | 5 +++++
mm/internal.h | 14 +++++++++++---
mm/mlock.c | 22 ++++++++++++++++++----
mm/vmstat.c | 3 +++
6 files changed, 58 insertions(+), 16 deletions(-)

Index: linux-2.6.26-rc2-mm1/drivers/base/node.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/drivers/base/node.c 2008-05-22 15:24:51.000000000 -0400
+++ linux-2.6.26-rc2-mm1/drivers/base/node.c 2008-05-22 15:26:49.000000000 -0400
@@ -69,6 +69,9 @@ static ssize_t node_read_meminfo(struct
"Node %d Inactive(file): %8lu kB\n"
#ifdef CONFIG_NORECLAIM_LRU
"Node %d Noreclaim: %8lu kB\n"
+#ifdef CONFIG_NORECLAIM_MLOCK
+ "Node %d Mlocked: %8lu kB\n"
+#endif
#endif
#ifdef CONFIG_HIGHMEM
"Node %d HighTotal: %8lu kB\n"
@@ -91,16 +94,19 @@ static ssize_t node_read_meminfo(struct
nid, K(i.totalram),
nid, K(i.freeram),
nid, K(i.totalram - i.freeram),
- nid, ...
From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against: 2.6.26-rc2-mm1

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split lru series.

V1 -> V2:
+ no changes

"Optional" part of "noreclaim infrastructure"

In the fault paths that install new anonymous pages, check whether the page is reclaimable or not using lru_cache_add_active_or_noreclaim(). If the page is reclaimable, just add it to the active lru list [via the pagevec cache], else add it to the noreclaim list.

This "proactive" culling in the fault path mimics the handling of mlocked pages in Nick Piggin's series to keep mlocked pages off the lru lists.

Notes:

1) This patch is optional--e.g., if one is concerned about the additional test in the fault path. We can defer the moving of nonreclaimable pages until when vmscan [shrink_*_list()] encounters them. Vmscan will only need to handle such pages once.

2) The 'vma' argument to page_reclaimable() is required to notice that we're faulting a page into an mlock()ed vma w/o having to scan the page's rmap in the fault path. Culling mlock()ed anon pages is currently the only reason for this patch.

3) We can't cull swap pages in read_swap_cache_async() because the vma argument doesn't necessarily correspond to the swap cache offset passed in by swapin_readahead(). This could [did!] result in mlocking pages in non-VM_LOCKED vmas if [when] we tried to cull in this path.

4) Move set_pte_at() to after where we add page to lru to keep it hidden from other tasks that might walk the page table. We already do it in this order in do_anonymous_page(). And, these are COW'd anon pages. Is this safe?

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>

include/linux/swap.h | 2 ++
mm/memory.c | 20 ++++++++++++--------
mm/swap.c | 21 +++++++++++++++++++++
3 files changed, 35 insertions(+), 8 deletions(-)

Index: ...
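As a rough illustration of the fault-path culling described in this patch, the following userspace sketch picks the target list from the vma's flags alone, without any rmap walk. Everything here is a hypothetical stand-in (`TOY_VM_LOCKED`, `struct toy_vma`, the `toy_*` functions), and treating a NULL vma as "no vma information, assume reclaimable" is an assumption of the sketch rather than something stated in the patch description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag value for this model only. */
#define TOY_VM_LOCKED 0x1UL

struct toy_vma {
	unsigned long vm_flags;
};

enum toy_list { TOY_LIST_ACTIVE, TOY_LIST_NORECLAIM };

/*
 * Model of the vma-based part of the reclaimability test: a page being
 * faulted into a locked vma is treated as non-reclaimable.
 */
static bool toy_page_reclaimable(const struct toy_vma *vma)
{
	if (vma != NULL && (vma->vm_flags & TOY_VM_LOCKED))
		return false;
	return true;
}

/*
 * Model of lru_cache_add_active_or_noreclaim(): reclaimable pages go to
 * the active list, everything else is culled straight to noreclaim.
 */
static enum toy_list toy_lru_cache_add_active_or_noreclaim(const struct toy_vma *vma)
{
	enum toy_list list;

	list = toy_page_reclaimable(vma) ? TOY_LIST_ACTIVE : TOY_LIST_NORECLAIM;
	assert(list == TOY_LIST_ACTIVE || list == TOY_LIST_NORECLAIM);
	return list;
}
```

The design trade-off from note 1 is visible here: the test is one flag check per fault, and skipping it only defers the same decision to vmscan.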
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
Add some event counters to vmstats for testing noreclaim/mlock.
Some of these might be interesting enough to keep around.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/vmstat.h | 11 +++++++++++
mm/internal.h | 4 +++-
mm/mlock.c | 33 +++++++++++++++++++++++++--------
mm/vmscan.c | 16 +++++++++++++++-
mm/vmstat.c | 12 ++++++++++++
5 files changed, 66 insertions(+), 10 deletions(-)
Index: linux-2.6.26-rc2-mm1/include/linux/vmstat.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/vmstat.h 2008-05-28 13:01:13.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/vmstat.h 2008-05-28 13:03:10.000000000 -0400
@@ -41,6 +41,17 @@ enum vm_event_item { PGPGIN, PGPGOUT, PS
#ifdef CONFIG_HUGETLB_PAGE
HTLB_BUDDY_PGALLOC, HTLB_BUDDY_PGALLOC_FAIL,
#endif
+#ifdef CONFIG_NORECLAIM_LRU
+ NORECL_PGCULLED, /* culled to noreclaim list */
+ NORECL_PGSCANNED, /* scanned for reclaimability */
+ NORECL_PGRESCUED, /* rescued from noreclaim list */
+#ifdef CONFIG_NORECLAIM_MLOCK
+ NORECL_PGMLOCKED,
+ NORECL_PGMUNLOCKED,
+ NORECL_PGCLEARED,
+ NORECL_PGSTRANDED, /* unable to isolate on unlock */
+#endif
+#endif
NR_VM_EVENT_ITEMS
};
Index: linux-2.6.26-rc2-mm1/mm/vmscan.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/vmscan.c 2008-05-28 13:02:55.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/vmscan.c 2008-05-28 13:03:10.000000000 -0400
@@ -453,12 +453,13 @@ int putback_lru_page(struct page *page)
{
int lru;
int ret = 1;
+ int was_nonreclaimable;
VM_BUG_ON(!PageLocked(page));
VM_BUG_ON(PageLRU(page));
lru = !!TestClearPageActive(page);
- ClearPageNoreclaim(page); /* for page_reclaimable() */
+ was_nonreclaimable = ...

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Against: 2.6.26-rc2-mm1

V6:
+ moved to end of series as optional debug patch

V2 -> V3:
+ rebase to 23-mm1 atop RvR's split LRU series

New in V2

This patch adds a function to scan individual or all zones' noreclaim lists and move any pages that have become reclaimable onto the respective zone's inactive list, where shrink_inactive_list() will deal with them.

Adds sysctl to scan all nodes, and per node attributes to individual nodes' zones.

Kosaki: If a reclaimable page is found on the noreclaim lru when /proc/sys/vm/scan_noreclaim_pages is written, print the filename and file offset of that page.

TODO: DEBUGGING ONLY: NOT FOR UPSTREAM MERGE

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
Signed-off-by: Rik van Riel <riel@redhat.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>

drivers/base/node.c | 5 +
include/linux/rmap.h | 3
include/linux/swap.h | 15 ++++
kernel/sysctl.c | 10 +++
mm/rmap.c | 4 -
mm/vmscan.c | 161 +++++++++++++++++++++++++++++++++++++++++++++++++++
6 files changed, 196 insertions(+), 2 deletions(-)

Index: linux-2.6.26-rc2-mm1/include/linux/swap.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/include/linux/swap.h 2008-05-28 13:03:07.000000000 -0400
+++ linux-2.6.26-rc2-mm1/include/linux/swap.h 2008-05-28 13:03:13.000000000 -0400
@@ -7,6 +7,7 @@
#include <linux/list.h>
#include <linux/memcontrol.h>
#include <linux/sched.h>
+#include <linux/node.h>
#include <asm/atomic.h>
#include <asm/page.h>
@@ -235,15 +236,29 @@ static inline int zone_reclaim(struct zo
#ifdef CONFIG_NORECLAIM_LRU
extern int page_reclaimable(struct page *page, struct vm_area_struct *vma);
extern void scan_mapping_noreclaim_pages(struct address_space *);
+
+extern unsigned long scan_noreclaim_pages;
+extern int scan_noreclaim_handler(struct ctl_table *, int, struct file *,
+ void ...
From: Lee Schermerhorn <lee.schermerhorn@hp.com>
Against: 2.6.26-rc2-mm1
Allow free of mlock()ed pages. This shouldn't happen, but during
development, it occasionally did.
This patch allows us to survive that condition, while keeping the
statistics and events correct for debug.
Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>
include/linux/vmstat.h | 1 +
mm/internal.h | 17 +++++++++++++++++
mm/page_alloc.c | 1 +
mm/vmstat.c | 1 +
4 files changed, 20 insertions(+)
Index: linux-2.6.26-rc2-mm1/mm/internal.h
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/internal.h 2008-05-28 10:12:15.000000000 -0400
+++ linux-2.6.26-rc2-mm1/mm/internal.h 2008-05-28 10:15:20.000000000 -0400
@@ -152,6 +152,22 @@ static inline void mlock_migrate_page(st
}
}
+/*
+ * free_page_mlock() -- clean up attempts to free an mlocked page.
+ * Page should not be on lru, so no need to fix that up.
+ * free_pages_check() will verify...
+ */
+static inline void free_page_mlock(struct page *page)
+{
+ if (unlikely(TestClearPageMlocked(page))) {
+ unsigned long flags;
+
+ local_irq_save(flags);
+ __dec_zone_page_state(page, NR_MLOCK);
+ __count_vm_event(NORECL_MLOCKFREED);
+ local_irq_restore(flags);
+ }
+}
#else /* CONFIG_NORECLAIM_MLOCK */
static inline int is_mlocked_vma(struct vm_area_struct *v, struct page *p)
@@ -161,6 +177,7 @@ static inline int is_mlocked_vma(struct
static inline void clear_page_mlock(struct page *page) { }
static inline void mlock_vma_page(struct page *page) { }
static inline void mlock_migrate_page(struct page *new, struct page *old) { }
+static inline void free_page_mlock(struct page *page) { }
#endif /* CONFIG_NORECLAIM_MLOCK */
Index: linux-2.6.26-rc2-mm1/mm/page_alloc.c
===================================================================
--- linux-2.6.26-rc2-mm1.orig/mm/page_alloc.c 2008-05-28 10:12:15.000000000 -0400
+++ ...

From: Lee Schermerhorn <lee.schermerhorn@hp.com>

Documentation for noreclaim lru list and its usage.

Signed-off-by: Lee Schermerhorn <lee.schermerhorn@hp.com>

Documentation/vm/noreclaim-lru.txt | 609 +++++++++++++++++++++++++++++++++++++
1 file changed, 609 insertions(+)

Index: linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6.26-rc2-mm1/Documentation/vm/noreclaim-lru.txt 2008-05-28 14:01:32.000000000 -0400
@@ -0,0 +1,609 @@
+
+This document describes the Linux memory management "Noreclaim LRU"
+infrastructure and the use of this infrastructure to manage several types
+of "non-reclaimable" pages. The document attempts to provide the overall
+rationale behind this mechanism and the rationale for some of the design
+decisions that drove the implementation. The latter design rationale is
+discussed in the context of an implementation description. Admittedly, one
+can obtain the implementation details--the "what does it do?"--by reading the
+code. One hopes that the descriptions below add value by providing the answer
+to "why does it do that?".
+
+Noreclaim LRU Infrastructure:
+
+The Noreclaim LRU adds an additional LRU list to track non-reclaimable pages
+and to hide these pages from vmscan. This mechanism is based on a patch by
+Larry Woodman of Red Hat to address several scalability problems with page
+reclaim in Linux. The problems have been observed at customer sites on large
+memory x86_64 systems. For example, a non-NUMA x86_64 platform with 128GB
+of main memory will have over 32 million 4k pages in a single zone. When a
+large fraction of these pages are not reclaimable for any reason [see below],
+vmscan will spend a lot of time scanning the LRU lists looking for the small
+fraction of pages that are reclaimable. This can result in a situation where
+all cpus are spending 100% of their time in vmscan for hours or ...
Note: On a Fujitsu server (IA64, 8 CPUs, 8GB), this patch series has also been running well for 48+ hours :) --
