This is a series of patches for memory resource controller.
Based on mmotm Sep18 ver. This passed some tests and seems to work well.
This consists of the following:
- fixes.
* fixes do_swap_page() handling.
- new feature
* "root" cgroup is treated as nolimit.
* implements account_move() and moves accounts at force_empty rather than
forgetting them all.
* atomic page_cgroup->flags.
* page_cgroup lookup system. (and page_cgroup.h is added.)
- optimize.
* per cpu status update.
- remove page_cgroup pointer from struct page.
- lazy lru add/remove
Performance is here (on 8cpu Xeon/64bit). Not so bad.
2.6.26-rc6-mm1(2008/9/18 version)
==
Execl Throughput 2311.6 lps (29.9 secs, 3 samples)
C Compiler Throughput 1331.9 lpm (60.4 secs, 3 samples)
Shell Scripts (1 concurrent) 7500.7 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 3031.0 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1729.7 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 99310.2 lpm (30.0 secs, 3 samples)
after all patches.
==
Execl Throughput 2308.7 lps (29.9 secs, 3 samples)
C Compiler Throughput 1343.4 lpm (60.3 secs, 3 samples)
Shell Scripts (1 concurrent) 7451.7 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) 3024.0 lpm (60.0 secs, 3 samples)
Shell Scripts (16 concurrent) 1752.0 lpm (60.0 secs, 3 samples)
Dc: sqrt(2) to 99 decimal places 99255.3 lpm (30.0 secs, 3 samples)
after all patches + add padding to make "struct page" to be 64 bytes ;)
==
Execl Throughput 2332.2 lps (29.9 secs, 3 samples)
C Compiler Throughput 1345.3 lpm (60.4 secs, 3 samples)
Shell Scripts (1 concurrent) 7564.3 lpm (60.0 secs, 3 samples)
Shell Scripts (8 concurrent) ...
There are not-on-LRU pages which can be mapped, and they are not worth
accounting (because we can't shrink them and would need dirty code to handle
the special case). We don't want to account out-of-vm's-control pages.
When special_mapping_fault() is called, page->mapping tends to be NULL
and the page is charged as an anonymous page. So avoid accounting it in __do_fault().
We can know that by checking the anon variable.
insert_page() also handles some special pages from drivers.
Changelog:
- new patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory.c | 18 ++++++------------
mm/rmap.c | 4 ++--
2 files changed, 8 insertions(+), 14 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
pte_t *pte;
spinlock_t *ptl;
- retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
- if (retval)
- goto out;
-
retval = -EINVAL;
if (PageAnon(page))
- goto out_uncharge;
+ goto out;
retval = -ENOMEM;
flush_dcache_page(page);
pte = get_locked_pte(mm, addr, &ptl);
if (!pte)
- goto out_uncharge;
+ goto out;
retval = -EBUSY;
if (!pte_none(*pte))
goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
return retval;
out_unlock:
pte_unmap_unlock(pte, ptl);
-out_uncharge:
- mem_cgroup_uncharge_page(page);
out:
return retval;
}
@@ -2542,7 +2536,7 @@ static int __do_fault(struct mm_struct *
}
- if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+ if (anon && mem_cgroup_charge(page, mm, GFP_KERNEL)) {
ret = VM_FAULT_OOM;
goto out;
}
@@ -2584,10 +2578,10 @@ static int __do_fault(struct mm_struct *
/* no need to invalidate: a not-present page won't be cached */
update_mmu_cache(vma, address, entry);
} else {
- mem_cgroup_uncharge_page(page);
- if (anon)
+ if (anon) ...
While page-cache's charge/uncharge is done under page_lock(), swap-cache's
isn't. (An anonymous page is charged when it's newly allocated.)
This patch moves do_swap_page()'s charge() call under the lock.
This is good for avoiding an unnecessary slow path in charge().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memory.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -2320,15 +2320,14 @@ static int do_swap_page(struct mm_struct
count_vm_event(PGMAJFAULT);
}
+ lock_page(page);
+ delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+
if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
ret = VM_FAULT_OOM;
goto out;
}
-
mark_page_accessed(page);
- lock_page(page);
- delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
/*
* Back out if somebody else already faulted in this pte.
--
Make the root cgroup of the memory resource controller have no limit.
With this, users cannot set a limit on the root group. This makes the root
cgroup a kind of trash-can.
For accounting pages which have no owner, which are created by force_empty,
we need some cgroup with no limit. A patch for rewriting force_empty
will follow this one.
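The rule described above can be sketched in user space as follows. This is only a minimal model, not the kernel code: struct memcg_model, root_memcg, memcg_set_limit() and memcg_charge() are hypothetical names made up for illustration. It shows the two properties the description asks for: a limit write to the root group is rejected with -EINVAL, while usage of the root group is still accounted.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical user-space model of a memory cgroup; not the kernel struct. */
struct memcg_model {
    unsigned long long limit;   /* bytes; ~0ULL stands for "no limit" */
    unsigned long long usage;   /* bytes currently accounted */
};

/* The root group always keeps "no limit". */
static struct memcg_model root_memcg = { .limit = ~0ULL, .usage = 0 };

static int is_root(const struct memcg_model *cg)
{
    return cg == &root_memcg;
}

/* Returns 0 on success, -EINVAL when the caller tries to limit the root. */
static int memcg_set_limit(struct memcg_model *cg, unsigned long long limit)
{
    if (is_root(cg))
        return -EINVAL;
    cg->limit = limit;
    return 0;
}

/* Usage is accounted for every group, including the root. */
static int memcg_charge(struct memcg_model *cg, unsigned long long bytes)
{
    /* Written this way so that usage + bytes cannot overflow. */
    if (cg->usage > cg->limit || bytes > cg->limit - cg->usage)
        return -ENOMEM;
    cg->usage += bytes;
    return 0;
}
```

Because the root group can never fail a charge in this model, it can safely receive accounts that have lost their owner, which is what the force_empty rewrite relies on.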
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -136,6 +136,9 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
+#define is_root_cgroup(cgrp) ((cgrp) == &init_mem_cgroup)
+
+
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
* lock. We need to ensure that page->page_cgroup is at least two
@@ -944,6 +947,10 @@ static int mem_cgroup_write(struct cgrou
switch (cft->private) {
case RES_LIMIT:
+ if (is_root_cgroup(memcg)) {
+ ret = -EINVAL;
+ break;
+ }
/* This function does all necessary parse...reuse it */
ret = res_counter_memparse_write_strategy(buffer, &val);
if (!ret)
Index: mmotm-2.6.27-rc6+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc6+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc6+/Documentation/controllers/memory.txt
@@ -121,6 +121,9 @@ The corresponding routines that remove a
a page from Page Cache is used to decrement the accounting counters of the
cgroup.
+The root cgroup is not allowed to be set limit but usage is accounted.
+For controlling usage of memory, you need to create a cgroup.
+
2.3 Shared Page Accounting
Shared pages are accounted on the basis of the first touch approach. The
@@ -172,6 +175,7 @@ We can alter the memory limit:
NOTE: We can use a suffix (k, K, m, M, g or G) to indicate values in kilo,
mega or ...
The current force_empty of the memory resource controller just removes page_cgroup.
This means the page is never accounted at all, and it creates an in-use page which
has no page_cgroup.
This patch tries to move the account to the "root" cgroup. With this patch, force_empty
doesn't leak an account but moves it to the "root" cgroup. Maybe someone can
think of other enhancements, such as moving the account to its parent.
(But moving to the parent means we have to handle the "limit" of pages.
More complicated work is needed to do that.)
For now, just move the account to the root cgroup.
Note: all locks other than the old mem_cgroup's lru_lock
in this path are try_lock().
Changelog (v3) -> (v4)
- no changes
Changelog (v2) -> (v3)
- split out mem_cgroup_move_account().
- replaced get_page() with get_page_unless_zero().
(This is necessary for avoiding conflicts with migration.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Documentation/controllers/memory.txt | 7 ++--
mm/memcontrol.c | 51 +++++++++++++++++++++--------------
2 files changed, 35 insertions(+), 23 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -29,6 +29,7 @@
#include <linux/slab.h>
#include <linux/swap.h>
#include <linux/spinlock.h>
+#include <linux/pagemap.h>
#include <linux/fs.h>
#include <linux/seq_file.h>
#include <linux/vmalloc.h>
@@ -976,17 +977,14 @@ int mem_cgroup_resize_limit(struct mem_c
/*
- * This routine traverse page_cgroup in given list and drop them all.
- * *And* this routine doesn't reclaim page itself, just removes page_cgroup.
+ * This routine traverse page_cgroup in given list and move them all.
*/
-#define FORCE_UNCHARGE_BATCH (128)
static void mem_cgroup_force_empty_list(struct mem_cgroup *mem,
struct mem_cgroup_per_zone *mz,
enum lru_list lru)
{
struct page_cgroup ...
do _NOT_ use yield() ever! unless you know what you're doing, and probably
not even then. NAK!
--
Hmm, sorry. cond_resched() is ok ?
Thanks,
-Kame
--
depends on what you want to do, please explain what you're trying to do.
--
Sorry again.
This force_empty is called only in the following situation:
- there are no user threads in this cgroup.
- a user tries to rmdir() this cgroup or explicitly types
echo 1 > ../memory.force_empty.
force_empty() scans the lru list of this cgroup and checks the page_cgroups on the
list one by one. Because there are no tasks in this group, force_empty can
see the following racy conditions while scanning:
- global lru tries to remove the page which is pointed to by page_cgroup
and it is not-on-LRU.
- the page is locked by someone.
....find some lock contention with invalidation/truncate.
- in a later patch, page_cgroup can be on a pagevec (I added) and we have to drain
it to remove it from the LRU.
In the above situations, force_empty() has to wait for some event to proceed.
Hmm...detecting the busy situation in the loop and sleeping outside of the loop
is better ? Anyway, ok, I'll rewrite this.
BTW, what is sched.c::yield() for now ?
Thanks,
-Kame
--
So you either skip the page because it already got un-accounted, or you
Then unlock, drain, lock, no need to sleep some arbitrary amount of time
The better solution is to wait for events in a non-polling fashion, for
example by using wait_event().
yield() might not actually wait at all, suppose you're the highest
priority FIFO task on the system - if you used yield and rely on someone
else to run you'll deadlock.
Also, depending on sysctl_sched_compat_yield, SCHED_OTHER tasks using
There are some (legacy) users of yield, sadly they are all incorrect, but
removing them is non-trivial for various reasons.
The -rt kernel has 2 sites where yield() is the correct thing to do. In
both cases it's where 2 SCHED_FIFO-99 tasks (migration and stop_machine)
depend on each other.
--
Hmm, spin_unlock -> wait_on_page_locked() -> break loop or
spin_lock and retry
Thank you for your kind advice. I'll rewrite.
Regards,
-Kame
--
This patch tries to make page->mapping NULL before
mem_cgroup_uncharge_cache_page() is called.
"page->mapping == NULL" is a good check for "whether the page is still
in the radix-tree or not".
This patch also adds VM_BUG_ON() to mem_cgroup_uncharge_cache_page().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/filemap.c | 2 +-
mm/memcontrol.c | 1 +
mm/migrate.c | 12 +++++++++---
3 files changed, 11 insertions(+), 4 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/filemap.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/filemap.c
+++ mmotm-2.6.27-rc6+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
{
struct address_space *mapping = page->mapping;
- mem_cgroup_uncharge_cache_page(page);
radix_tree_delete(&mapping->page_tree, page->index);
page->mapping = NULL;
mapping->nrpages--;
__dec_zone_page_state(page, NR_FILE_PAGES);
BUG_ON(page_mapped(page));
+ mem_cgroup_uncharge_cache_page(page);
/*
* Some filesystems seem to re-dirty the page even after
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -859,6 +859,7 @@ void mem_cgroup_uncharge_page(struct pag
void mem_cgroup_uncharge_cache_page(struct page *page)
{
VM_BUG_ON(page_mapped(page));
+ VM_BUG_ON(page->mapping);
__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
}
Index: mmotm-2.6.27-rc6+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc6+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
__inc_zone_page_state(newpage, NR_FILE_PAGES);
spin_unlock_irq(&mapping->tree_lock);
- if (!PageSwapCache(newpage))
- mem_cgroup_uncharge_cache_page(page);
return 0;
}
@@ -378,7 +376,15 @@ static ...
Some obvious optimizations to memcg.
I found mem_cgroup_charge_statistics() is a little big (in object size) and
does unnecessary address calculation.
This patch is an optimization to reduce the size of this function.
Also, res_counter_charge() is 'likely' to succeed.
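The address-calculation point can be illustrated with a small user-space model. This is a sketch under assumptions: the types and names below (stat_cpu, stat_all, charge_statistics(), NR_CPUS_MODEL) are hypothetical, and the cpu argument stands in for smp_processor_id(). The per-CPU slot is resolved once in the caller, so the helper only does a single indexed add.

```c
#include <assert.h>

#define NR_CPUS_MODEL 4     /* hypothetical CPU count for this model */

enum stat_index { STAT_CACHE, STAT_RSS, STAT_PGPGIN, STAT_PGPGOUT, STAT_NSTATS };

struct stat_cpu { long count[STAT_NSTATS]; };
struct stat_all { struct stat_cpu cpustat[NR_CPUS_MODEL]; };

/* The helper receives the per-CPU slot directly, so it never looks up the CPU. */
static inline void stat_add(struct stat_cpu *cs, enum stat_index idx, int val)
{
    cs->count[idx] += val;
}

/* 'cpu' stands in for smp_processor_id(); the slot address is computed once. */
static void charge_statistics(struct stat_all *stat, int cpu,
                              int is_cache, int charge)
{
    int val = charge ? 1 : -1;
    struct stat_cpu *cs = &stat->cpustat[cpu];

    stat_add(cs, is_cache ? STAT_CACHE : STAT_RSS, val);
    stat_add(cs, charge ? STAT_PGPGIN : STAT_PGPGOUT, 1);
}

/* A reader sums the per-CPU counters to get the group-wide value. */
static long read_stat(const struct stat_all *stat, enum stat_index idx)
{
    long sum = 0;

    for (int cpu = 0; cpu < NR_CPUS_MODEL; cpu++)
        sum += stat->cpustat[cpu].count[idx];
    return sum;
}
```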
Changelog v3->v4:
- merged with another leaf patch.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 18 ++++++++++--------
1 file changed, 10 insertions(+), 8 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -67,11 +67,10 @@ struct mem_cgroup_stat {
/*
* For accounting under irq disable, no need for increment preempt count.
*/
-static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
enum mem_cgroup_stat_index idx, int val)
{
- int cpu = smp_processor_id();
- stat->cpustat[cpu].count[idx] += val;
+ stat->count[idx] += val;
}
static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
@@ -238,18 +237,21 @@ static void mem_cgroup_charge_statistics
{
int val = (charge)? 1 : -1;
struct mem_cgroup_stat *stat = &mem->stat;
+ struct mem_cgroup_stat_cpu *cpustat;
VM_BUG_ON(!irqs_disabled());
+
+ cpustat = &stat->cpustat[smp_processor_id()];
if (PageCgroupCache(pc))
- __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
else
- __mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+ __mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
if (charge)
- __mem_cgroup_stat_add_safe(stat,
+ __mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
else
- __mem_cgroup_stat_add_safe(stat,
+ __mem_cgroup_stat_add_safe(cpustat,
MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
}
@@ ...
Sorry, this patch comes after "3".
==
This patch makes page_cgroup->flags use atomic ops and defines
functions (and macros) to access it.
This patch itself makes memcg slower, but its final purpose is
to remove lock_page_cgroup() and allow fast access to page_cgroup.
(And total performance will increase after all patches are applied.)
Before trying to modify the memory resource controller, this atomic operation
on flags is necessary. Most of the flags in this patch are for LRU and are modified
under mz->lru_lock, but we'll soon add other flags which are not for LRU.
So we use the atomic version here.
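A user-space sketch of the macro-generated accessors may make the idea concrete. It is an assumption-laden model: the kernel patch uses the kernel bit operations on an unsigned long, while this sketch swaps in C11 <stdatomic.h>, and struct pc_model is a hypothetical stand-in for struct page_cgroup. Only the flag names and the accessor naming scheme are taken from the patches in this series.

```c
#include <assert.h>
#include <stdatomic.h>

/* Bit numbers, not masks, as in the enum introduced by the patch. */
enum { PCG_CACHE, PCG_ACTIVE, PCG_FILE, PCG_UNEVICTABLE };

/* Hypothetical stand-in for struct page_cgroup; only the flags word matters. */
struct pc_model {
    atomic_ulong flags;
};

/* Each macro expands to one small accessor function per flag. */
#define TESTPCGFLAG(uname, lname) \
static inline int PageCgroup##uname(struct pc_model *pc) \
    { return (atomic_load(&pc->flags) >> PCG_##lname) & 1UL; }

#define SETPCGFLAG(uname, lname) \
static inline void SetPageCgroup##uname(struct pc_model *pc) \
    { atomic_fetch_or(&pc->flags, 1UL << PCG_##lname); }

#define CLEARPCGFLAG(uname, lname) \
static inline void ClearPageCgroup##uname(struct pc_model *pc) \
    { atomic_fetch_and(&pc->flags, ~(1UL << PCG_##lname)); }

TESTPCGFLAG(Cache, CACHE)
SETPCGFLAG(Cache, CACHE)

TESTPCGFLAG(Active, ACTIVE)
SETPCGFLAG(Active, ACTIVE)
CLEARPCGFLAG(Active, ACTIVE)
```

Since every update is a single atomic read-modify-write, a flag that is not protected by mz->lru_lock can be changed without corrupting the LRU bits stored in the same word.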
Changelog: (v3) -> (v4)
- removed unused operations.
- adjusted to new ctype MEM_CGROUP_CHARGE_TYPE_SHMEM
Changelog: (v2) -> (v3)
- renamed macros and flags to be longer name.
- added comments.
- added "default bit set" for File, Shmem, Anon.
Changelog: (preview) -> (v1):
- patch ordering is changed.
- Added macro for defining functions for Test/Set/Clear bit.
- made the names of flags shorter.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 122 +++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 82 insertions(+), 40 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -161,12 +161,46 @@ struct page_cgroup {
struct list_head lru; /* per cgroup LRU list */
struct page *page;
struct mem_cgroup *mem_cgroup;
- int flags;
+ unsigned long flags;
};
-#define PAGE_CGROUP_FLAG_CACHE (0x1) /* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE (0x2) /* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE (0x4) /* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8) /* page is unevictableable */
+
+enum {
+ /* flags for mem_cgroup */
+ PCG_CACHE, /* charged as cache */
+ /* flags for LRU placement ...
Sorry, this patch is after "3.5", before "4"....
==
This patch provides a function to move the account information of a page between
mem_cgroups.
This moving of page_cgroup is done under the following conditions:
- the page is locked.
- lru_lock of the source/destination mem_cgroup is held.
Then, a routine which touches pc->mem_cgroup without page_lock() should
confirm whether pc->mem_cgroup is still valid. Typical code can be as follows.
(while page is not under lock_page())
mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
if (pc->mem_cgroup == mem)
...../* some list handling */
spin_unlock_irqrestore(&mz->lru_lock, flags);
If you find a page_cgroup from mem_cgroup's LRU under mz->lru_lock, you don't
have to worry about anything.
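The snapshot-then-recheck pattern above can be exercised in a single-threaded user-space model. Everything here is hypothetical (memcg_model, pc_model, move_lists_checked(), the race_hook callback); the hook simply simulates a concurrent move_account() that runs between the unlocked read of pc->mem_cgroup and the lock acquisition.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical models; lru_locked stands in for mz->lru_lock. */
struct memcg_model { int lru_locked; };
struct pc_model { struct memcg_model *mem_cgroup; int active; };

/* Lets a caller simulate a concurrent move_account() before the lock is taken. */
typedef void (*race_hook)(struct pc_model *pc);

static struct memcg_model root_model;

/* Example hook: the account is moved to the root group behind our back. */
static void move_to_root(struct pc_model *pc)
{
    pc->mem_cgroup = &root_model;
}

/* Returns 1 if the list handling was done, 0 if the owner changed. */
static int move_lists_checked(struct pc_model *pc, int active, race_hook hook)
{
    struct memcg_model *mem = pc->mem_cgroup;   /* unlocked snapshot */
    int done = 0;

    if (hook)
        hook(pc);                   /* owner may change before we lock */

    mem->lru_locked = 1;            /* stands in for spin_lock_irqsave() */
    if (pc->mem_cgroup == mem) {    /* still the same owner: safe to touch */
        pc->active = active;
        done = 1;
    }
    mem->lru_locked = 0;            /* stands in for spin_unlock_irqrestore() */
    return done;
}
```

When the recheck fails, the caller has locked the wrong group's LRU, so the only safe action in this model is to do nothing and drop the lock.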
Changelog: (v3) -> (v4)
- no changes.
Changelog: (v2) -> (v3)
- added lock_page_cgroup().
- split out from new-force-empty patch.
- added how-to-use text.
- fixed race in __mem_cgroup_uncharge_common().
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 82 insertions(+), 3 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -424,6 +424,7 @@ int task_in_mem_cgroup(struct task_struc
void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
{
struct page_cgroup *pc;
+ struct mem_cgroup *mem;
struct mem_cgroup_per_zone *mz;
unsigned long flags;
@@ -442,9 +443,14 @@ void mem_cgroup_move_lists(struct page *
pc = page_get_page_cgroup(page);
if (pc) {
+ mem = pc->mem_cgroup;
mz = page_cgroup_zoneinfo(pc);
spin_lock_irqsave(&mz->lru_lock, flags);
- __mem_cgroup_move_lists(pc, lru);
+ /*
+ * check against the race with move_account.
+ */
+ if (likely(mem == pc->mem_cgroup))
+ __mem_cgroup_move_lists(pc, lru);
...
Is this check needed? Both move_lists and move_account take the page_cgroup lock.
Thanks,
--
On Wed, 24 Sep 2008 15:50:11 +0900
__mem_cgroup_move_lists() doesn't take it.
But yes, if you know what it does, you can reduce checks.
The above is an example.
Thanks,
--
Remove page_cgroup pointer from struct page.
This patch removes the page_cgroup pointer from struct page and makes it possible
to get page_cgroup from the pfn. Then, the relationship between them is
Before this:
pfn <-> struct page <-> struct page_cgroup.
After this:
struct page <-> pfn -> struct page_cgroup -> struct page.
The benefit of this approach is that we can remove 8 bytes from struct page.
Other changes are:
- lock/unlock_page_cgroup() uses its own bit on struct page_cgroup.
- all necessary page_cgroups are allocated at boot.
Characteristics:
- page_cgroups are allocated in chunks.
This patch uses SECTION_SIZE as the chunk size if 64bit/SPARSEMEM is enabled.
If not, an appropriate default number is selected.
- all page_cgroup structs are maintained by a hash.
I think we have 2 ways to handle a sparse index in general
...radix-tree and hash. This uses a hash because radix-tree's layout is
affected by the memory map's layout.
- page_cgroup.h/page_cgroup.c are added.
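The chunk-plus-hash lookup described above can be sketched in user space. This is a minimal model under stated assumptions, not the patch's code: the chunk size (128 pages), the table size, the multiplicative hash and the names add_chunk() and lookup() are all made up for illustration (the patch itself uses hash_long()). It shows the two steps of a lookup: derive the chunk index from the pfn, then walk one hash chain to find the chunk.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define ENTS_PER_CHUNK_SHIFT 7                      /* assumed: 128 pages per chunk */
#define ENTS_PER_CHUNK (1UL << ENTS_PER_CHUNK_SHIFT)
#define HASH_SHIFT 4                                /* assumed table size: 16 buckets */
#define HASH_SIZE (1UL << HASH_SHIFT)

struct pc_model { unsigned long pfn; };             /* stand-in for page_cgroup */

struct chunk {
    unsigned long index;        /* pfn >> ENTS_PER_CHUNK_SHIFT */
    struct pc_model *map;       /* ENTS_PER_CHUNK entries */
    struct chunk *next;         /* hash chain */
};

static struct chunk *hash_table[HASH_SIZE];

/* Simple multiplicative hash used only in this sketch. */
static unsigned long hashfun(unsigned long index)
{
    return (index * 2654435761UL) & (HASH_SIZE - 1);
}

/* Allocate the chunk that covers pfns [index << SHIFT, (index + 1) << SHIFT). */
static int add_chunk(unsigned long index)
{
    struct chunk *c = malloc(sizeof(*c));
    unsigned long h;

    if (!c)
        return -1;
    c->map = calloc(ENTS_PER_CHUNK, sizeof(*c->map));
    if (!c->map) {
        free(c);
        return -1;
    }
    c->index = index;
    for (unsigned long i = 0; i < ENTS_PER_CHUNK; i++)
        c->map[i].pfn = (index << ENTS_PER_CHUNK_SHIFT) + i;
    h = hashfun(index);
    c->next = hash_table[h];
    hash_table[h] = c;
    return 0;
}

/* pfn -> page_cgroup; returns NULL for a hole with no chunk. */
static struct pc_model *lookup(unsigned long pfn)
{
    unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;

    for (struct chunk *c = hash_table[hashfun(index)]; c; c = c->next)
        if (c->index == index)
            return &c->map[pfn & (ENTS_PER_CHUNK - 1)];
    return NULL;
}
```

Only chunks that actually exist consume memory, so a very sparse pfn space costs one chain walk per lookup rather than a table sized by the highest pfn.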
Changelog: v3 -> v4.
- changed arguments to lookup_page_cgroup() from "pfn" to "page".
Changelog: v2 -> v3
- changed arguments from pfn to struct page*.
- added memory hotplug callback (no undo...needs more work.)
- adjusted to new mmotm.
Changelog: v1 -> v2
- Fixed memory allocation failure at boot to do panic with good message.
- rewrote charge/uncharge path (no changes in logic.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/mm_types.h | 4
include/linux/page_cgroup.h | 89 +++++++++++++++
mm/Makefile | 2
mm/memcontrol.c | 251 +++++++++++---------------------------------
mm/page_alloc.c | 9 -
mm/page_cgroup.c | 235 +++++++++++++++++++++++++++++++++++++++++
6 files changed, 394 insertions(+), 196 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ ...
The one thing I don't see here is much explanation about how large this
structure will get. Basing it on max_pfn makes me nervous because of
what it will do on machines with very sparse memory. Is this like
sparsemem where the structure can be small enough to actually span all
of physical memory, or will it be a large memory user?
Can you lay out how much memory this will use on a machine like Dave
Miller's which has 1GB of memory at 0x0 and 1GB of memory at 1TB up in
the address space?
Also, how large do the hash buckets get in the average case?
-- Dave
--
I admit this calculation is too easy. Hmm, based on totalram_pages is
on my 48GB box, hashtable was 16384bytes. (in dmesg log.)
(section size was 128MB.)
I'll rewrite this based on totalram_pages.
BTW, do you know the difference between num_physpages and totalram_pages ?
Thanks,
-Kame
--
No, I was setting a trap. ;)
If you use totalram_pages, I'll just complain that it doesn't work if a
memory hotplug machine drastically changes its size. You'll end up with
pretty darn big hash buckets.
You basically can't get away with the fact that you (potentially) have
really sparse addresses to play with here. Using a hash table is
exactly the same as using an array such as sparsemem except you randomly
index into it instead of using straight arithmetic.
My gut says that you'll need to do exactly the same things sparsemem did
here, which is at *least* have a two-level lookup before you get to the
linear search. The two-level lookup also makes the hotplug problem
easier.
As I look at this, I always have to bounce between these tradeoffs:
1. deal with sparse address spaces (keeps you from using max_pfn)
2. scale as that sparse address space has memory hotplugged into it
(keeps you from using boot-time present_pages)
3. deal with performance impacts from new data structures created to
num_physpages appears to be linked to the size of the address space and
totalram_pages looks like the amount of ram present. Kinda
spanned_pages and present_pages. But, who knows how consistent they are
these days. :)
-- Dave
--
As I wrote, this is just a _generic_ one.
I'll add FLATMEM and SPARSEMEM support later.
I never want to write SPARSEMEM_EXTREME by myself and want to depend
In above case, just one step. 16384/8 * 128MB.
In ppc, it has 16MB sections, so the hash table will be bigger. But the "walk" is not
very long.
Anyway, how long the "walk" is is not a very big problem because the look-aside
buffer helps.
I'll add FLATMEM/SPARSEMEM support later. Could you wait for a while ?
Because we have lookup_page_cgroup() after this, we can do anything.
Thanks,
-Kame
--
OK, I'll stop harassing for the moment, and take a look at the cache. :)
-- Dave
--
The reason I don't say "optimize this! now! more!" is that the places where this is
called are limited now: only at charge/uncharge. This is not memmap.
charge ...the first page fault to the page
add to radix-tree
uncharge ...the last unmap against the page
remove from radix-tree.
I can make this faster by using characteristics of FLATMEM and others.
(with more #ifdefs and code.)
But I would like to start from a generic one because adding the interface is
the first thing I have to do here.
BTW, to be honest, I don't like 2-level-table-lookup like
SPARSEMEM_EXTREME, here. A style like SPARSEMEM_VMEMMAP...using
a linear virtual address map will be my goal.
Thanks,
-Kame
--
Could you provide further detail? That is, is this solely because our
radix tree implementation is sucky for large indexes?
If so, I did most of the work of fixing that, just need to spend a
little more time to stabilize the code.
--
IIUC, radix tree's height is determined by how sparse the space is.
In big servers, each node's memory tends to be aligned to some
aligned address, like this (the following is an extreme example):
256M.....node 0 equips 4GB mem =32section
<very big hole>
256T .... node 1 equips 4GB mem =32section
<very big hole>
512T .... node 2 equips 4GB mem =32section
<very big hole>
.....
Then, the number of steps to reach entries tends to be larger than with a hash.
I'm sorry if I misunderstood.
Thanks,
-Kame
--
No problems. I'll try and brush up that radix tree code and post
sometime soon.
--
After sleeping all day, I changed my mind and decided to drop this.
It seems no one likes this.
I'll add FLATMEM/DISCONTIGMEM/SPARSEMEM support directly.
I have already wasted a month on this not-interesting work and want to fix
this soon.
I'd be glad if people helped me test FLATMEM/DISCONTIGMEM/SPARSEMEM because
there are various kinds of memory maps. I have only an x86-64 box.
Thanks,
-Kame
On Mon, 22 Sep 2008 20:12:06 +0900
--
On Wed, Sep 24, 2008 at 5:18 AM, KAMEZAWA Hiroyuki
Let's look at the basic requirement, make memory resource controller
not suck with 32 bit systems. I have been thinking about removing
page_cgroup from struct page only for 32 bit systems (use radix tree),
32 bit systems can have a maximum of 64GB if PAE is enabled, I suspect
radix tree should work there and let the 64 bit systems work as is. If
performance is an issue, I would recommend the 32 bit folks upgrade to
I can help test your patches on powerpc 64 bit and find a 32 bit
system to test it as well. What do you think about the points above?
Balbir
--
On Wed, 24 Sep 2008 07:39:58 +0530
My thinking is below. (assume 64bit)
- removing the page_cgroup pointer from struct page allows us to reduce
static memory usage at boot by 8bytes/4096bytes if memory cgroup is disabled.
This reaches 96MB on my 48 GB box. I think this is big.
- pre-allocation of page_cgroup gives us the following.
Pros.
- We don't need to be afraid of "failure of kmalloc" and
"going down to memory reclaim at kmalloc".
This makes the memory resource controller much simpler and more robust.
- We can know what amount of kernel memory will be used for
LRU page management.
Cons.
- All page_cgroups are allocated at boot.
This reaches 480MB on my 48GB box.
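The two figures above can be checked with a few lines of arithmetic. The sketch below assumes 4096-byte pages and binary units (1MB = 2^20 bytes); the 40 bytes per page_cgroup is an inference from the 480MB figure, and it matches a 64-bit layout of list_head (16) + two pointers (16) + unsigned long flags (8). The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE_BYTES 4096ULL     /* assumed page size */

/* Number of pages that back 'ram_bytes' of memory. */
static uint64_t nr_pages(uint64_t ram_bytes)
{
    return ram_bytes / PAGE_SIZE_BYTES;
}

/* Total bytes used when every page costs 'per_page' bytes of metadata. */
static uint64_t overhead_bytes(uint64_t ram_bytes, uint64_t per_page)
{
    return nr_pages(ram_bytes) * per_page;
}

/* Convert bytes to MB, using 1MB = 2^20 bytes. */
static uint64_t to_mb(uint64_t bytes)
{
    return bytes >> 20;
}
```

For a 48GB box this gives 12582912 pages; 8 bytes per page is 96MB (the pointer removed from struct page) and 40 bytes per page is 480MB (the pre-allocated page_cgroups).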
But I think we can ignore "Cons.". If we use up memory, we'll use tons of
page_cgroup. Considering memory fragmentation caused by allocating a lots of
In Kconfig, x86-64 just uses SPARSEMEM and FLATMEM/DISCONTIGMEM cannot be selected.
I can compile by hand but cannot do real test.
I already wrote a replacement, quite easy to read.
(now under test.)
==
Pre-allocate all page_cgroups at boot and remove the page_cgroup pointer
from struct page. This patch adds an interface as
struct page_cgroup *lookup_page_cgroup(struct page*)
All of FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG are supported.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 11 -
include/linux/mm_types.h | 4
include/linux/mmzone.h | 8 +
include/linux/page_cgroup.h | 90 +++++++++++++++
mm/Makefile | 2
mm/memcontrol.c | 256 ++++++++++++--------------------------------
mm/page_alloc.c | 10 -
mm/page_cgroup.c | 200 ++++++++++++++++++++++++++++++++++
8 files changed, 374 insertions(+), 207 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ ...
This looks like a good patch. I'll review and test it.
-- Balbir
--
On Wed, 24 Sep 2008 01:31:59 -0700
At least, I should handle the "use vmalloc if kmalloc fails" case.
But there will be no major updates.
I'll update the whole series to the newest mmotm and post tomorrow if I can
start testing tonight.
Thanks,
-Kame
--
Use a per-cpu cache for fast access to page_cgroup.
This patch is for making the fastpath faster.
Because page_cgroup is accessed when the page is allocated/freed,
we can assume several contiguous page_cgroups will be accessed soon.
(If not interleaved on NUMA...but in such a case, alloc/free itself is slow.)
We cache a set of page_cgroup base pointers in a per-cpu area and
use them when we hit.
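The caching idea above can be sketched as a direct-mapped cache keyed by chunk index. This is a user-space model under assumptions: the chunk size, the backing array and slow_lookup_base() are hypothetical stand-ins for the hash walk, and the hits/misses counters exist only for the demo. The slot count and mask follow the PCG_CACHE_MAX_SLOT definition in the diff.

```c
#include <assert.h>
#include <stddef.h>

#define ENTS_PER_CHUNK_SHIFT 7          /* assumed chunk size: 128 pages */
#define PCG_CACHE_MAX_SLOT 32
#define PCG_CACHE_MASK (PCG_CACHE_MAX_SLOT - 1)

struct pc_model { int dummy; };         /* stand-in for struct page_cgroup */

/* One instance per CPU in the real design; a single one is enough here. */
struct pcg_cache_model {
    struct {
        unsigned long index;            /* chunk index cached in this slot */
        struct pc_model *base;          /* first entry of that chunk, or NULL */
    } slots[PCG_CACHE_MAX_SLOT];
    unsigned long hits, misses;         /* demo-only counters */
};

/* Hypothetical slow path standing in for the hash walk: 4 chunks exist. */
static struct pc_model backing[4][1 << ENTS_PER_CHUNK_SHIFT];

static struct pc_model *slow_lookup_base(unsigned long index)
{
    return index < 4 ? backing[index] : NULL;
}

static struct pc_model *cached_lookup(struct pcg_cache_model *pcc,
                                      unsigned long pfn)
{
    unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
    unsigned long offset = pfn & ((1UL << ENTS_PER_CHUNK_SHIFT) - 1);
    int s = index & PCG_CACHE_MASK;     /* direct-mapped slot */
    struct pc_model *base;

    if (pcc->slots[s].base && pcc->slots[s].index == index) {
        pcc->hits++;                    /* fast path: no hash walk */
        return pcc->slots[s].base + offset;
    }
    pcc->misses++;
    base = slow_lookup_base(index);
    if (!base)
        return NULL;
    pcc->slots[s].index = index;        /* remember the chunk for next time */
    pcc->slots[s].base = base;
    return base + offset;
}
```

Neighbouring pfns share a chunk index, so after the first miss a run of allocations or frees in the same chunk is served from the cached base pointer.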
Changelog: v3 -> v4
- rewrite noinline -> noinline_for_stack.
- added cpu hotplug support.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/page_cgroup.c | 73 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 70 insertions(+), 3 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/page_cgroup.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/page_cgroup.c
+++ mmotm-2.6.27-rc6+/mm/page_cgroup.c
@@ -6,7 +6,7 @@
#include <linux/page_cgroup.h>
#include <linux/hash.h>
#include <linux/memory.h>
-
+#include <linux/cpu.h>
struct pcg_hash_head {
@@ -44,15 +44,26 @@ static int pcg_hashmask __read_mostly;
#define PCG_HASHMASK (pcg_hashmask)
#define PCG_HASHSIZE (1 << pcg_hashshift)
+#define PCG_CACHE_MAX_SLOT (32)
+#define PCG_CACHE_MASK (PCG_CACHE_MAX_SLOT - 1)
+struct percpu_page_cgroup_cache {
+ struct {
+ unsigned long index;
+ struct page_cgroup *base;
+ } slots[PCG_CACHE_MAX_SLOT];
+};
+DEFINE_PER_CPU(struct percpu_page_cgroup_cache, pcg_cache);
+
static int pcg_hashfun(unsigned long index)
{
return hash_long(index, pcg_hashshift);
}
-struct page_cgroup *lookup_page_cgroup(struct page *page)
+noinline_for_stack static struct page_cgroup *
+__lookup_page_cgroup(struct percpu_page_cgroup_cache *pcc,unsigned long pfn)
{
- unsigned long pfn = page_to_pfn(page);
unsigned long index = pfn >> ENTS_PER_CHUNK_SHIFT;
+ int s = index & PCG_CACHE_MASK;
struct pcg_hash *ent;
struct pcg_hash_head *head;
struct hlist_node *node;
@@ -65,6 +76,8 @@ struct ...
Free page_cgroup from its LRU in a batched manner.
When uncharge() is called, the page is pushed onto a per-cpu vector and
removed from the LRU later. This routine resembles the global LRU's pagevec.
This patch is half of the whole change and forms a set with the following lazy LRU add
patch.
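The batching described above can be modelled in user space as follows. This is a sketch under assumptions: zone_model, free_vec, release() and drain() are hypothetical names, the lock is a plain flag with a counter, and the drain threshold is simply the vector size (the patch keeps a separate limit field). It shows why batching helps: the zone lock is taken once per run of entries that share a zone, not once per page.

```c
#include <assert.h>
#include <stddef.h>

#define VEC_SIZE 14     /* same value as MEMCG_PCPVEC_SIZE in the diff */

/* Hypothetical per-zone state; 'locked' and 'lock_count' model lru_lock. */
struct zone_model { int locked; int lock_count; int nr_on_lru; };
struct pc_model { struct zone_model *mz; };

/* One vector per CPU in the real design. */
struct free_vec {
    int nr;
    struct pc_model *vec[VEC_SIZE];
};

/* Remove every queued entry from its LRU, switching locks only on zone change. */
static void drain(struct free_vec *v)
{
    struct zone_model *prev = NULL;

    for (int i = v->nr - 1; i >= 0; i--) {
        struct pc_model *pc = v->vec[i];

        if (pc->mz != prev) {
            if (prev)
                prev->locked = 0;       /* unlock the previous zone */
            prev = pc->mz;
            assert(!prev->locked);
            prev->locked = 1;           /* lock the new zone */
            prev->lock_count++;
        }
        prev->nr_on_lru--;              /* stands in for the list removal */
    }
    if (prev)
        prev->locked = 0;
    v->nr = 0;
}

/* Queue one entry; drain when the vector is full (assumed threshold). */
static void release(struct free_vec *v, struct pc_model *pc)
{
    v->vec[v->nr++] = pc;
    if (v->nr >= VEC_SIZE)
        drain(v);
}
```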
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
mm/memcontrol.c | 163 ++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 153 insertions(+), 10 deletions(-)
Index: mmotm-2.6.27-rc6+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc6+/mm/memcontrol.c
@@ -35,6 +35,7 @@
#include <linux/vmalloc.h>
#include <linux/mm_inline.h>
#include <linux/page_cgroup.h>
+#include <linux/cpu.h>
#include <asm/uaccess.h>
@@ -533,6 +534,116 @@ out:
return ret;
}
+
+#define MEMCG_PCPVEC_SIZE (14) /* size of pagevec */
+struct memcg_percpu_vec {
+ int nr;
+ int limit;
+ struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+ unsigned long flags;
+ struct mem_cgroup_per_zone *mz, *prev_mz;
+ struct page_cgroup *pc;
+ int i, nr;
+
+ local_irq_save(flags);
+ nr = mpv->nr;
+ mpv->nr = 0;
+ prev_mz = NULL;
+ for (i = nr - 1; i >= 0; i--) {
+ pc = mpv->vec[i];
+ VM_BUG_ON(PageCgroupUsed(pc));
+ mz = page_cgroup_zoneinfo(pc);
+ if (prev_mz != mz) {
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ prev_mz = mz;
+ spin_lock(&mz->lru_lock);
+ }
+ __mem_cgroup_remove_list(mz, pc);
+ css_put(&pc->mem_cgroup->css);
+ pc->mem_cgroup = NULL;
+ }
+ if (prev_mz)
+ spin_unlock(&prev_mz->lru_lock);
+ local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+ struct memcg_percpu_vec *mpv;
+
+ mpv = &get_cpu_var(memcg_free_vec);
+ mpv->vec[mpv->nr++] = pc;
+ if (mpv->nr >= ...
Delay add_to_lru() and do it in a batched manner like pagevec.
For doing that, this uses 2 flags, PCG_USED and PCG_LRU.
If PCG_LRU is set, the page is on the LRU. It is safe to access the LRU via page_cgroup.
(under some lock.)
For avoiding races, this patch uses TestSetPageCgroupUsed()
and checks the PCG_USED bit and PCG_LRU bit in the add/free vector.
With this, lock_page_cgroup() in mem_cgroup_charge() is removed.
(I don't want to call lock_page_cgroup() under mz->lru_lock in the
add/free vector core logic. So, the TestSetPageCgroupUsed() logic is added.
TestSet is an easy way to avoid unnecessary nesting of locks.)
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/page_cgroup.h | 10 +++
mm/memcontrol.c | 125 ++++++++++++++++++++++++++++++--------------
2 files changed, 98 insertions(+), 37 deletions(-)
Index: mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc6+/include/linux/page_cgroup.h
@@ -23,6 +23,7 @@ enum {
PCG_LOCK, /* page cgroup is locked */
PCG_CACHE, /* charged as cache */
PCG_USED, /* this object is in use. */
+ PCG_LRU, /* this is on LRU */
/* flags for LRU placement */
PCG_ACTIVE, /* page is active in this cgroup */
PCG_FILE, /* page is file system backed */
@@ -41,11 +42,20 @@ static inline void SetPageCgroup##uname(
static inline void ClearPageCgroup##uname(struct page_cgroup *pc) \
{ clear_bit(PCG_##lname, &pc->flags); }
+#define TESTSETPCGFLAG(uname, lname)\
+static inline int TestSetPageCgroup##uname(struct page_cgroup *pc) \
+ { return test_and_set_bit(PCG_##lname, &pc->flags); }
+
/* Cache flag is set only once (at allocation) */
TESTPCGFLAG(Cache, CACHE)
TESTPCGFLAG(Used, USED)
CLEARPCGFLAG(Used, USED)
+TESTSETPCGFLAG(Used, USED)
+
+TESTPCGFLAG(LRU, LRU)
+SETPCGFLAG(LRU, LRU)
+CLEARPCGFLAG(LRU, LRU)
+/* LRU management flags (from global-lru definition) ...
There is a small race in do_swap_page(). When the page swapped-in is charged,
the mapcount can be greater than 0. But, at the same time some process (which
shares it) calls unmap and makes the mapcount 1->0, and the page is uncharged.
For fixing this, I added a new interface.
- precharge: account to res_counter by PAGE_SIZE and try to free pages if necessary.
- commit: register page_cgroup and add to LRU if necessary.
- cancel: uncharge PAGE_SIZE because of do_swap_page failure.
This protocol uses the PCG_USED bit on page_cgroup for avoiding over accounting.
The usual mem_cgroup_charge_common() does precharge -> commit at a time.
These precharge/commit/cancel can be used for other places,
shmem, migration, etc..we'll revisit later.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
include/linux/memcontrol.h | 21 +++++++
mm/memcontrol.c | 135 +++++++++++++++++++++++++++++++--------------
mm/memory.c | 6 +-
3 files changed, 120 insertions(+), 42 deletions(-)
Index: mmotm-2.6.27-rc6+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc6+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc6+/include/linux/memcontrol.h
@@ -32,6 +32,13 @@ struct mm_struct;
extern struct page_cgroup *page_get_page_cgroup(struct page *page);
extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_precharge(struct mm_struct *mm,
+ gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swap(struct page *page,
+ struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swap(struct mem_cgroup *ptr);
+
extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
gfp_t gfp_mask);
extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -94,6 +101,20 @@ static inline int mem_cgroup_cache_charg
return 0;
}
+static int ...
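The precharge/commit/cancel protocol from the description of this patch can be sketched in user space. This is only a model under assumptions: the names precharge(), commit_charge() and cancel_charge(), the structs, and the use of C11 atomics are hypothetical, reclaim is not modelled, and giving the precharge back when the page is already marked used is this sketch's reading of "avoiding over accounting", not something the truncated diff confirms.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stddef.h>

#define PAGE_SIZE_MODEL 4096UL
#define PCG_USED_MASK 1UL           /* models the PCG_USED bit */

struct memcg_model { unsigned long usage, limit; };
struct pc_model { atomic_ulong flags; struct memcg_model *mem_cgroup; };

/* Step 1: account PAGE_SIZE to the counter before touching the page. */
static int precharge(struct memcg_model *mem)
{
    if (mem->usage + PAGE_SIZE_MODEL > mem->limit)
        return -ENOMEM;             /* the kernel would try reclaim first */
    mem->usage += PAGE_SIZE_MODEL;
    return 0;
}

/* Step 3b: give the precharge back, e.g. when do_swap_page() fails. */
static void cancel_charge(struct memcg_model *mem)
{
    mem->usage -= PAGE_SIZE_MODEL;
}

/*
 * Step 3a: bind the precharge to the page.
 * Returns 1 if this call registered the page, 0 if it was already charged,
 * in which case the precharge is returned so the page is not counted twice.
 */
static int commit_charge(struct pc_model *pc, struct memcg_model *mem)
{
    if (atomic_fetch_or(&pc->flags, PCG_USED_MASK) & PCG_USED_MASK) {
        cancel_charge(mem);
        return 0;
    }
    pc->mem_cgroup = mem;
    return 1;
}
```

Splitting the charge this way lets the caller take the sleeping, possibly reclaiming step first and do the short commit step later, once it knows whether the fault will succeed.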
On Mon, 22 Sep 2008 19:51:59 +0900
Sorry for the crazy patch numbering...
1 -> 1
2 -> 2
3 -> 3
3.5 -> 4
3.6 -> 5
4 -> 6
5 -> 7
6 -> 8
9 -> 9
10 -> 10
11 -> 11
12 -> 12
I may not be able to respond quickly, sorry.
Thanks,
-Kame
--
