[PATCH 6/12] memcg optimize percpu stat

Previous thread: sched: by Bharata B Rao on Wednesday, September 24, 2008 - 9:21 pm. (4 messages)

Next thread: [PATCH 7/7] x86: print out irq nr for msi/ht -v2 by Yinghai Lu on Wednesday, September 24, 2008 - 11:13 pm. (17 messages)
From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:11 pm

Hi, I updated the stack and reflected comments.
Against the latest mmotm. (rc7-mm1)

Major changes from previous one is 
  - page_cgroup allocation/lookup manner is changed.
    all FLATMEM/DISCONTIGMEM/SPARSEMEM and MEMORY_HOTPLUG is supported.
  - force_empty is totally rewritten. and a problem that "force_empty takes long time"
    in previous version is fixed (I think...)
  - reordered patches.
     - first half are easy ones.
     - second half are big ones.

I'm still testing with full debug option. No problem found yet.
(I'm afraid of race condition which have not been caught yet.)

[1/12]  avoid accounting special mappings not on LRU. (fix)
[2/12]  move charege() call to swapped-in page under lock_page() (clean up)
[3/12]  make root cgroup to be unlimited. (change semantics.)
[4/12]  make page->mapping NULL before calling uncharge (clean up)
[5/12]  make page->flags to use atomic ops. (changes in infrastructure)
[6/12]  optimize stat. (clean up)
[7/12]  add support function for moving account. (new function)
[8/12]  rewrite force_empty to use move_account. (change semantics.)
[9/12]  allocate all page_cgroup at boot. (changes in infrastructure)
[10/12] free page_cgroup from LRU in lazy way (optimize)
[11/12] add page_cgroup to LRU in lazy way (optimize)
[12/12] fix race at charging swap  (fix by new logic.)

*Any* comment is welcome.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:13 pm

There are not-on-LRU pages which can be mapped and they are not worth to
be accounted. (becasue we can't shrink them and need dirty codes to handle
specical case) We'd like to make use of usual objrmap/radix-tree's protcol
and don't want to account out-of-vm's control pages.

When special_mapping_fault() is called, page->mapping is tend to be NULL 
and it's charged as Anonymous page.
insert_page() also handles some special pages from drivers.

This patch is for avoiding to account special pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memory.c |   18 ++++++------------
 mm/rmap.c   |    4 ++--
 2 files changed, 8 insertions(+), 14 deletions(-)

Index: mmotm-2.6.27-rc6+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc6+.orig/mm/memory.c
+++ mmotm-2.6.27-rc6+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
-	if (retval)
-		goto out;
-
 	retval = -EINVAL;
 	if (PageAnon(page))
-		goto out_uncharge;
+		goto out;
 	retval = -ENOMEM;
 	flush_dcache_page(page);
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
-		goto out_uncharge;
+		goto out;
 	retval = -EBUSY;
 	if (!pte_none(*pte))
 		goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
 	return retval;
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
-out_uncharge:
-	mem_cgroup_uncharge_page(page);
 out:
 	return retval;
 }
@@ -2542,7 +2536,7 @@ static int __do_fault(struct mm_struct *
 
 	}
 
-	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
+	if (anon && mem_cgroup_charge(page, mm, GFP_KERNEL)) {
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
@@ -2584,10 +2578,10 @@ static int __do_fault(struct mm_struct *
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, address, entry);
 	} else {
-		mem_cgroup_uncharge_page(page);
-		if (anon)
+		if (anon) ...
From: Balbir Singh
Date: Friday, September 26, 2008 - 1:25 am

Hmm... I am a little concerned that with these changes actual usage will much
more than what we report in memory.usage_in_bytes. Why not move them to
non-reclaimable LRU list as unevictable pages (once those patches go in, we can
push this as well). I suspect the size of special pages is too short to affect
anything or are you seeing something very different?
-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:17 am

On Fri, 26 Sep 2008 13:55:54 +0530


I don't want put pages never goes to LRU onto memcgroup's LRU.

Thanks,
-Kame

--

From: Balbir Singh
Date: Friday, September 26, 2008 - 2:32 am

I understand.. Thanks for clarifying.. my concern is w.r.t accounting, we
account it in RSS (file RSS).

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:55 am

On Fri, 26 Sep 2008 15:02:13 +0530
Yes. Because page->mapping is NULL, there are accounted as RSS.

AFAIK, what bad about !PageLRU(page) is..
1. we skips !PageLRU(page) in force_empty. and we cannot reduce account.
2. we don't know we can trust that page is treated as usual LRU page.

I'll update description.

Thanks,
-Kame





--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:14 pm

While page-cache's charge/uncharge is done under page_lock(), swap-cache
isn't. (anonymous page is charged when it's newly allocated.)

This patch moves do_swap_page()'s charge() call under lock. This helps
us to avoid to charge already mapped one, unnecessary calls.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memory.c |    7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memory.c
+++ mmotm-2.6.27-rc7+/mm/memory.c
@@ -2320,15 +2320,14 @@ static int do_swap_page(struct mm_struct
 		count_vm_event(PGMAJFAULT);
 	}
 
+	lock_page(page);
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+
 	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
-		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		ret = VM_FAULT_OOM;
 		goto out;
 	}
-
 	mark_page_accessed(page);
-	lock_page(page);
-	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 
 	/*
 	 * Back out if somebody else already faulted in this pte.

--

From: Balbir Singh
Date: Friday, September 26, 2008 - 1:36 am

Seems reasonable to me

Just one quick comment though, as a result of this change, mark_page_accessed is
now called with PageLock held, I suspect you would want to move that call prior
to lock_page().



-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:18 am

On Fri, 26 Sep 2008 14:06:02 +0530
Ok. I'll check it

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:15 pm

Make root cgroup of memory resource controller to have no limit.

By this, users cannot set limit to root group. This is for making root cgroup
as a kind of trash-can.

For accounting pages which has no owner, which are created by force_empty,
we need some cgroup with no_limit. A patch for rewriting force_empty will
will follow this one.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |    4 ++++
 mm/memcontrol.c                      |    7 +++++++
 2 files changed, 11 insertions(+)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -136,6 +136,9 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
+#define is_root_cgroup(cgrp)	((cgrp) == &init_mem_cgroup)
+
+
 /*
  * We use the lower bit of the page->page_cgroup pointer as a bit spin
  * lock.  We need to ensure that page->page_cgroup is at least two
@@ -945,6 +948,10 @@ static int mem_cgroup_write(struct cgrou
 
 	switch (cft->private) {
 	case RES_LIMIT:
+		if (is_root_cgroup(memcg)) {
+			ret = -EINVAL;
+			break;
+		}
 		/* This function does all necessary parse...reuse it */
 		ret = res_counter_memparse_write_strategy(buffer, &val);
 		if (!ret)
Index: mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
===================================================================
--- mmotm-2.6.27-rc7+.orig/Documentation/controllers/memory.txt
+++ mmotm-2.6.27-rc7+/Documentation/controllers/memory.txt
@@ -121,6 +121,9 @@ The corresponding routines that remove a
 a page from Page Cache is used to decrement the accounting counters of the
 cgroup.
 
+The root cgroup is not allowed to be set limit but usage is accounted.
+For controlling usage of memory, you need to create a cgroup.
+
 2.3 Shared Page Accounting
 
 Shared pages are accounted on the basis of the first touch approach. The
@@ ...
From: Balbir Singh
Date: Friday, September 26, 2008 - 1:41 am

This is an ABI change (although not too many people might be using it, I wonder
if we should add memory.features (a set of flags and let users enable them and
provide good defaults), like sched features.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:21 am

On Fri, 26 Sep 2008 14:11:00 +0530
I think "feature" flag is complicated, at this stage.
We'll add more features and not settled yet.

Hmm, if you don't like this,
calling try_to_free_page() at force_empty() instead of move_account() ?


Thanks,
-Kame

--

From: Balbir Singh
Date: Friday, September 26, 2008 - 2:29 am

I know.. but breaking ABI is a bad bad thing. We'll have to keep the feature
flags extensible (add new things). If we all feel we don't have enough users

Not sure I understand this.


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:59 am

On Fri, 26 Sep 2008 14:59:48 +0530
In following series, force_empty() uses move_account() rather than forget all.
By this, accounted file caches are kept as accounted and the whole accounting
will be sane.

Another choice is calling try_to_free_pages() at force_empty rather than forget all
and makes memory usage to be Zero. This will also makes the whole accounting sange.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:16 pm

This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.

"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not".
This patch also adds BUG_ON() to mem_cgroup_uncharge_cache_page();


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/filemap.c    |    2 +-
 mm/memcontrol.c |    1 +
 mm/migrate.c    |   12 +++++++++---
 3 files changed, 11 insertions(+), 4 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/filemap.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/filemap.c
+++ mmotm-2.6.27-rc7+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
 {
 	struct address_space *mapping = page->mapping;
 
-	mem_cgroup_uncharge_cache_page(page);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	BUG_ON(page_mapped(page));
+	mem_cgroup_uncharge_cache_page(page);
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -737,6 +737,7 @@ void mem_cgroup_uncharge_page(struct pag
 void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON(page_mapped(page));
+	VM_BUG_ON(page->mapping);
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
Index: mmotm-2.6.27-rc7+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc7+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
 	__inc_zone_page_state(newpage, NR_FILE_PAGES);
 
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!PageSwapCache(newpage))
-		mem_cgroup_uncharge_cache_page(page);
 
 	return 0;
 }
@@ -378,7 +376,15 @@ static ...
From: Balbir Singh
Date: Friday, September 26, 2008 - 2:47 am

Isn't it better and correct coding style to do

	/*
	 * Uncharge obsolete file cache
	 */
	if (!PageAnon(page))
		mem_cgroup_uncharge_cache_page(page);
	/* else - uncharged at try_to_unmap() */
	page->mapping = NULL;


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 3:07 am

On Fri, 26 Sep 2008 15:17:48 +0530
yea, maybe.
I always wonder what I should do when I want to add comment to if-then-else...

But ok, will remove {}.

Thanks,
-Kame



--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:17 pm

This patch makes page_cgroup->flags to be atomic_ops and define
functions (and macros) to access it.

This patch itself makes memcg slow but this patch's final purpose is 
to remove lock_page_cgroup() and allowing fast access to page_cgroup.
(And total performance will increase after all patches applied.)

Before trying to modify memory resource controller, this atomic operation
on flags is necessary. Most of flags in this patch is for LRU and modfied
under mz->lru_lock but we'll add another flags which is not for LRU soon.
So we use atomic version here.

 
Changelog: (v4) -> (v5)
 - removed unsued operations.
 - adjusted to new ctype MEM_CGROUP_CHARGE_TYPE_SHMEM

Changelog: (v3) -> (v4)
 - no changes.

Changelog:  (v2) -> (v3)
 - renamed macros and flags to be longer name.
 - added comments.
 - added "default bit set" for File, Shmem, Anon.

Changelog:  (preview) -> (v1):
 - patch ordering is changed.
 - Added macro for defining functions for Test/Set/Clear bit.
 - made the names of flags shorter.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |  122 +++++++++++++++++++++++++++++++++++++-------------------
 1 file changed, 82 insertions(+), 40 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -161,12 +161,46 @@ struct page_cgroup {
 	struct list_head lru;		/* per cgroup LRU list */
 	struct page *page;
 	struct mem_cgroup *mem_cgroup;
-	int flags;
+	unsigned long flags;
 };
-#define PAGE_CGROUP_FLAG_CACHE	   (0x1)	/* charged as cache */
-#define PAGE_CGROUP_FLAG_ACTIVE    (0x2)	/* page is active in this cgroup */
-#define PAGE_CGROUP_FLAG_FILE	   (0x4)	/* page is file system backed */
-#define PAGE_CGROUP_FLAG_UNEVICTABLE (0x8)	/* page is unevictableable */
+
+enum {
+	/* flags for mem_cgroup */
+	PCG_CACHE, /* charged as cache */
+	/* flags for LRU placement ...
From: Balbir Singh
Date: Friday, September 26, 2008 - 11:58 pm

Looks good to me

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>
provided we push in the lockless ones too :)

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:18 pm

Some obvious optimization to memcg.

I found mem_cgroup_charge_statistics() is a little big (in object) and
does unnecessary address calclation.
This patch is for optimization to reduce the size of this function.

And res_counter_charge() is 'likely' to success.

Changelog v3->v4:
 - merged with an other leaf patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -66,11 +66,10 @@ struct mem_cgroup_stat {
 /*
  * For accounting under irq disable, no need for increment preempt count.
  */
-static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
 		enum mem_cgroup_stat_index idx, int val)
 {
-	int cpu = smp_processor_id();
-	stat->cpustat[cpu].count[idx] += val;
+	stat->count[idx] += val;
 }
 
 static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
@@ -237,18 +236,21 @@ static void mem_cgroup_charge_statistics
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
+	struct mem_cgroup_stat_cpu *cpustat;
 
 	VM_BUG_ON(!irqs_disabled());
+
+	cpustat = &stat->cpustat[smp_processor_id()];
 	if (PageCgroupCache(pc))
-		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
-		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
 
 	if (charge)
-		__mem_cgroup_stat_add_safe(stat,
+		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGIN_COUNT, 1);
 	else
-		__mem_cgroup_stat_add_safe(stat,
+		__mem_cgroup_stat_add_safe(cpustat,
 				MEM_CGROUP_STAT_PGPGOUT_COUNT, 1);
 }
 
@@ ...
From: Balbir Singh
Date: Friday, September 26, 2008 - 2:53 am

Looks good to me

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:27 pm

This patch provides a function to move account information of a page between
mem_cgroups.

This moving of page_cgroup is done under
 - lru_lock of source/destination mem_cgroup is held.
 - lock_page_cgroup() is held.

Then, a routine which touches pc->mem_cgroup without lock_page_cgroup() should
confirm pc->mem_cgroup is still valid or not. Typlical code can be following.

(while page is not under lock_page())
	mem = pc->mem_cgroup;
	mz = page_cgroup_zoneinfo(pc)
	spin_lock_irqsave(&mz->lru_lock);
	if (pc->mem_cgroup == mem)
		...../* some list handling */
	spin_unlock_irq(&mz->lru_lock);

Or better way is
	lock_page_cgroup(pc);
	....
	unlock_page_cgroup(pc);

But you should confirm the nest of lock and avoid deadlock.
(trylock is better if it's ok.)

If you find page_cgroup from mem_cgroup's LRU under mz->lru_lock,
you don't have to worry about what pc->mem_cgroup points to.

Changelog: (v4) -> (v5)
  - check for lock_page() is removed.
  - rewrote description.

Changelog: (v2) -> (v4)
  - added lock_page_cgroup().
  - splitted out from new-force-empty patch.
  - added how-to-use text.
  - fixed race in __mem_cgroup_uncharge_common().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 81 insertions(+), 3 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -426,6 +426,7 @@ int task_in_mem_cgroup(struct task_struc
 void mem_cgroup_move_lists(struct page *page, enum lru_list lru)
 {
 	struct page_cgroup *pc;
+	struct mem_cgroup *mem;
 	struct mem_cgroup_per_zone *mz;
 	unsigned long flags;
 
@@ -444,9 +445,14 @@ void mem_cgroup_move_lists(struct page *
 
 	pc = page_get_page_cgroup(page);
 	if (pc) {
+		mem = pc->mem_cgroup;
 		mz = page_cgroup_zoneinfo(pc);
 ...
From: Daisuke Nishimura
Date: Friday, September 26, 2008 - 12:30 am

I'm sorry, but I've not been convinced yet why these checks are needed here.
(Those checks are removed by [9/12] anyway.)

IIUC, pc->mem_cgroup is moved to another group only by mem_cgroup_move_account
under lock_page_cgroup( and mz->lru_lock).
And those two above(mem_cgroup_move_lists and __mem_cgroup_uncharge_common) sets
mem = pc->mem_cgroup under lock_page_cgroup, so I don't think those checks
(mem != pc->mem_cgroup) is needed.


Thanks,
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:24 am

On Fri, 26 Sep 2008 16:30:50 +0900
you're right.

Thanks,

--

From: Balbir Singh
Date: Saturday, September 27, 2008 - 12:56 am

What is the interface for moving the accounting? Is it an explicit call like
force_empty? The other concern I have is about merging two LRU lists, when we
move LRU pages from one mem_cgroup to another, where do we add them? To the head
or tail? I think we need to think about it and document it well.

The other thing is that once we have mult-hierarchy support (which we really


Please BUG_ON() if the charging fails, we can be sure we catch assumptions that


Coding style above is broken. Can this race really occur? Why do we get mem


-- 
	Balbir
--

From: kamezawa.hiroyu
Date: Saturday, September 27, 2008 - 1:35 am

Hmm, good point. considering force_empty, head is better.
(But I don't think about this seriously because I assumed root cgroup
 is unlimited)

I don't think this kind of internal behavior should not be documented as UI
yes. of course. please do as you like if this goes.
adding logic to do so needs more patch, so please wait.

I'll remove this charge() and change protocol to be 
yes, I think so. please check SwapCache behavior.
It's costly to call page_cgroup_zoneinfo() again. so just saves mem and
compare pc->mem_cgroup.

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:29 pm

Current force_empty of memory resource controller just removes page_cgroup.
This maans the page is never accounted at all and create an in-use page which
has no page_cgroup.

This patch tries to move account to "root" cgroup. By this patch, force_empty
doesn't leak an account but move account to "root" cgroup. Maybe someone can
think of other enhancements as moving account to its parent.
(But moving to the parent means we have to handle "limit" of pages.
 Need more complicated work to do that.)

For now, just moves account to root cgroup.

Note: all lock other than old mem_cgroup's lru_lock
      in this path is try_lock().

Changelog (v4) -> (5)
 - removed yield()
 - remove lock_page().
 - use list_for_each_entry_safe() instead of list_empty() loop.
 - check list is empty or not rather than see usage.
 - added lru_add_drain_all() at the start of loops.

Changelog (v2) -> (v4)
 - splitted out mem_cgroup_move_account().
 - replaced get_page() with get_page_unless_zero().
   (This is necessary for avoiding confliction with migration)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |    7 ++-
 mm/memcontrol.c                      |   68 ++++++++++++++++++++---------------
 2 files changed, 43 insertions(+), 32 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -29,10 +29,12 @@
 #include <linux/slab.h>
 #include <linux/swap.h>
 #include <linux/spinlock.h>
+#include <linux/pagemap.h>
 #include <linux/fs.h>
 #include <linux/seq_file.h>
 #include <linux/vmalloc.h>
 #include <linux/mm_inline.h>
+#include <linux/writeback.h>
 
 #include <asm/uaccess.h>
 
@@ -979,45 +981,34 @@ int mem_cgroup_resize_limit(struct mem_c
 
 
 /*
- * This routine traverse page_cgroup in given list and drop them all.
- * *And* this routine doesn't reclaim page itself, just ...
From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:32 pm

Allocate all page_cgroup at boot and remove page_cgroup poitner
from struct page. This patch adds an interface as

 struct page_cgroup *lookup_page_cgroup(struct page*)

All FLATMEM/DISCONTIGMEM/SPARSEMEM  and MEMORY_HOTPLUG is supported.

Remove page_cgroup pointer reduces the amount of memory by
 - 4 bytes per PAGE_SIZE.
 - 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)
meta data usage of this is no problem in FLATMEM/DISCONTIGMEM.
On SPARSEMEM, this makes mem_section[] size twice.

On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
(and uses xx kbytes for mem_section.)
I think this reduction makes sense.

By pre-allocation, kmalloc/kfree in charge/uncharge are removed. 
This means
  - we're not necessary to be afraid of kmalloc faiulre.
    (this can happen because of gfp_mask type.)
  - we can avoid calling kmalloc/kfree.
  - we can avoid allocating tons of small objects which can be fragmented.
  - we can know what amount of memory will be used for this extra-lru handling.

I added printk message as

	"allocated %ld bytes of page_cgroup"
        "please try cgroup_disable=memory option if you don't want"

maybe enough informative for users.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h  |   11 -
 include/linux/mm_types.h    |    4 
 include/linux/mmzone.h      |    9 +
 include/linux/page_cgroup.h |   90 +++++++++++++++
 mm/Makefile                 |    2 
 mm/memcontrol.c             |  258 ++++++++++++--------------------------------
 mm/page_alloc.c             |   12 --
 mm/page_cgroup.c            |  253 +++++++++++++++++++++++++++++++++++++++++++
 8 files changed, 431 insertions(+), 208 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/page_cgroup.c
===================================================================
--- /dev/null
+++ mmotm-2.6.27-rc7+/mm/page_cgroup.c
@@ -0,0 +1,253 @@
+#include ...
From: Dave Hansen
Date: Thursday, September 25, 2008 - 11:40 am

I thought the use of this variable was under the

	+#ifdef CONFIG_FLAT_NODE_MEM_MAP

options.  Otherwise, we unconditionally bloat mem_section, right?

-- Dave

--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 6:17 pm

On Thu, 25 Sep 2008 11:40:47 -0700
Hmmm......Oh, yes ! nice catch.

Thanks, I'll fix.
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 6:22 pm

On Fri, 26 Sep 2008 10:17:54 +0900

But in reality, this is under CONFIG_SPARSEMEM and if CONFIG_SPARSEMEM,
FLAT_NODE_MEM_MAP is not true (I think).
Hmm..Maybe I shouldn't use checking CONFIG_FLAT_NODE_MEM_MAP and should
just use CONFIG_SPARSEMEM check. I'll rewrite.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Thursday, September 25, 2008 - 6:00 pm

Hmm, who uses this function?

Just for clarification, in what sequence will the page be mapped here?


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 6:43 pm

On Fri, 26 Sep 2008 10:00:22 +0900
Uh, ok. unnecessary. (In my first version, this allocation error
just shows Warning. Now, it panics.)

Please think about folloing situation.

   There is a SwapCache which is referred from 2 process, A, B.
   A maps it.
   B doesn't maps it.

   And now, process A exits.

	CPU0(process A)				CPU1 (process B)
 
    zap_pte_range()
    => page remove from rmap			=> charge() (do_swap_page)
	=> set page->mapcount->0          	
		=> uncharge()			=> set page->mapcount=1

This race is what patch 12/12 is fixed.

Thank you for review.

Regards,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 7:05 pm

On Fri, 26 Sep 2008 10:43:36 +0900
Sorry, my brain seems to be sleeping.. above page_mapped() check doesn't
help this situation. Maybe this page_mapped() check is not necessary
because it's of no use.

I think this kind of problem will not be fixed until we handle SwapCache.


Thanks,
-Kame












--

From: Daisuke Nishimura
Date: Thursday, September 25, 2008 - 10:54 pm

I've not fully understood yet what [12/12] does, but if we handle
swapcache properly, [12/12] would become unnecessary?

If so, how about handling swapcache instead of adding new interface?
I think it can be done independent of mem+swap.


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 11:54 pm

On Fri, 26 Sep 2008 14:54:22 +0900
Hmm, worth to be considered. But I'll reuse the interface itself for othres
(shmem, migrate, move_account etc)
But, in previous trial of SwapCache handling, we saw many troubles.
Then, I'd like to go carefully step by step to handle that.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 8:47 pm

On Fri, 26 Sep 2008 14:54:22 +0900

Try to illustrate what is trouble more precisely.


in do_swap_page(), page is charged when SwapCache lookup ends.

Here, 
     - charged when page is not mapped.
     - not charged when page is mapped.
set_pte() etc...are done under appropriate lock.

On the other side, when a task exits, zap_pte_range() is called.
It calls page_remove_rmap(). 

Case A) Following is race.

            Thread A                     Thread B

     do_swap_page()                      zap_pte_range()
	(1)try charge (mapcount=1)
                                         (2) page_remove_rmap()
					     (3) uncharge page. 
	(4) map it


Then,
 at (1),  mapcount=1 and this page is not charged.
 at (2),  page_remove_rmap() is called and mapcount goes down to Zero.
          uncharge(3) is called.
 at (4),  at the end of do_swap_page(), page->mapcount=1 but not charged.

Case B) In another scenario.

            Thread A                     Thread B

     do_swap_page()                      zap_pte_range()
	(1)try charge (mapcount=1)
                                         (2) page_remove_rmap()
	(3) map it
                                         (4) uncharge is called.

In (4), uncharge is capped but mapcount can go up to 1.

protocol 12/12 is for case (A).
After 12/12, double-check page_mapped() under lock_page_cgroup() will be fix to
case (B).

Huu, I don't like swap-cache ;)
Anyway, we'll have to handle swap cache later.

Thanks,
-Kame






















--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 8:25 pm

On Fri, 26 Sep 2008 10:43:36 +0900

I think I saw page_mapped() case (in your shmem/page test.)
I'll check what cause this if I have time.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 7:21 pm

Because of terrible compile error in FLATMEM/DISCONTIGMEM, I post
fixed version (and avoid patch bomb again.)
I post replacements for 9(fix), and 10,11,12 (adjustment for HUNK).

==
Allocate all page_cgroup at boot and remove page_cgroup poitner
from struct page. This patch adds an interface as

 struct page_cgroup *lookup_page_cgroup(struct page*)

All FLATMEM/DISCONTIGMEM/SPARSEMEM  and MEMORY_HOTPLUG is supported.

Remove page_cgroup pointer reduces the amount of memory by
 - 4 bytes per PAGE_SIZE.
 - 8 bytes per PAGE_SIZE
if memory controller is disabled. (even if configured.)

On usual 8GB x86-32 server, this saves 8MB of NORMAL_ZONE memory.
On my x86-64 server with 48GB of memory, this saves 96MB of memory.
I think this reduction makes sense.

By pre-allocation, kmalloc/kfree in charge/uncharge are removed. 
This means
  - we're not necessary to be afraid of kmalloc faiulre.
    (this can happen because of gfp_mask type.)
  - we can avoid calling kmalloc/kfree.
  - we can avoid allocating tons of small objects which can be fragmented.
  - we can know what amount of memory will be used for this extra-lru handling.

I added printk message as

	"allocated %ld bytes of page_cgroup"
        "please try cgroup_disable=memory option if you don't want"

maybe enough informative for users.

TODO:
  - small race window still remains in do_swap_page(). will be fixed by
    later patch in series.

Changelog: v5 -> v6.
 * removed "ctype" from uncharge.
 * improved comment to show FLAT_NODE_MEM_MAP == !SPARSEMEM
 * fixed errors in !SPARSEMEM codes
 * removed unused function in !SPARSEMEM codes.
(start from v5 because of series..)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h  |   11 -
 include/linux/mm_types.h    |    4 
 include/linux/mmzone.h      |   14 ++
 include/linux/page_cgroup.h |   90 +++++++++++++++
 mm/Makefile                 |    2 
 mm/memcontrol.c             |  264 ...
From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 7:25 pm

adjustment for changes in 9/12(fixed)
==
Free page_cgroup from its LRU in batched manner.

When uncharge() is called, page is pushed onto per-cpu vector and
removed from LRU, later.. This routine resembles to global LRU's pagevec.
This patch is half of the whole patch and a set with following lazy LRU add
patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |  164 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 152 insertions(+), 12 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -36,6 +36,7 @@
 #include <linux/mm_inline.h>
 #include <linux/writeback.h>
 #include <linux/page_cgroup.h>
+#include <linux/cpu.h>
 
 #include <asm/uaccess.h>
 
@@ -531,6 +532,116 @@ out:
 	return ret;
 }
 
+
+#define MEMCG_PCPVEC_SIZE	(14)	/* size of pagevec */
+struct memcg_percpu_vec {
+	int nr;
+	int limit;
+	struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+	unsigned long flags;
+	struct mem_cgroup_per_zone *mz, *prev_mz;
+	struct page_cgroup *pc;
+	int i, nr;
+
+	local_irq_save(flags);
+	nr = mpv->nr;
+	mpv->nr = 0;
+	prev_mz = NULL;
+	for (i = nr - 1; i >= 0; i--) {
+		pc = mpv->vec[i];
+		VM_BUG_ON(PageCgroupUsed(pc));
+		mz = page_cgroup_zoneinfo(pc);
+		if (prev_mz != mz) {
+			if (prev_mz)
+				spin_unlock(&prev_mz->lru_lock);
+			prev_mz = mz;
+			spin_lock(&mz->lru_lock);
+		}
+		__mem_cgroup_remove_list(mz, pc);
+		css_put(&pc->mem_cgroup->css);
+		pc->mem_cgroup = NULL;
+	}
+	if (prev_mz)
+		spin_unlock(&prev_mz->lru_lock);
+	local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+	struct memcg_percpu_vec *mpv;
+
+	mpv = ...
From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 7:28 pm

Fixed HUNK with 9/12(fixed)
==
Delaying add_to_lru() and do it in batched manner like page_vec.
For doing that 2 flags PCG_USED and PCG_LRU.

If PCG_LRU is set, page is on LRU. It safe to access LRU via page_cgroup.
(under some lock.)

For avoiding race, this patch uses TestSetPageCgroupUsed().
and checking PCG_USED bit and PCG_LRU bit in add/free vector.
By this, lock_page_cgroup() in mem_cgroup_charge() is removed.

(I don't want to call lock_page_cgroup() under mz->lru_lock when 
 add/free vector core logic. So, TestSetPageCgroupUsed() logic is added.
 TestSet is an easy way to avoid unneccesary nest of locks.)


Changelog: v3 -> v5.
 - removed css_get/put per page_cgroup struct.
   Now, *new* force_empty checks there is page_cgroup on the memcg.
   We don't need to be afraid of leak.

Changelog: v2 -> v3
 - added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
 - renamed function name from use_page_cgroup to set_page_cgroup_lru().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |   10 +++
 mm/memcontrol.c             |  121 +++++++++++++++++++++++++++++++-------------
 2 files changed, 96 insertions(+), 35 deletions(-)

Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -24,6 +24,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_LRU, /* this is on LRU */
 	/* flags for LRU placement */
 	PCG_ACTIVE, /* page is active in this cgroup */
 	PCG_FILE, /* page is file system backed */
@@ -42,11 +43,20 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)\
+static inline int ...
From: Balbir Singh
Date: Tuesday, September 30, 2008 - 9:03 pm

I like the changelog very much, it tells exactly what we set out to do. Thanks

Should we check for slab_is_available() before calling kmalloc_node? Otherwise,

I like this code very much, even though I've not tested it. I think it combines
all we've been thinking and discussing very well (from my initial flatmem based

Yay! I think we should add some documentation saying, if you think about
extending struct page, think again, look at how we did it for the memory


I don't understand the preempt_disable() is it for atomic operations? I would
like to split the removal of struct page from atomic operations for performance.

Except for the the performance changes for atomic operations, I like this patch
very much.

For the page removal portions
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> as a co-maintainer


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 30, 2008 - 10:07 pm

On Wed, 01 Oct 2008 09:33:53 +0530
That check is not necessary.
Because we use kmalloc()/vmalloc()  in mem_cgroup_create(), after this.

(We assume usual page allocateor is available here. For FLATMEM, size of 
???? it's usually not accepted. I've seen such case very much.
Memory hotremoval is not related to this.

This preempt_disable() is avoidance for deadlock. (I thought)

 lock_page_cgroup();
	-> spin_lock_irqsave(&mz->lru_lock);
	-> spin_unlock_irqrestore(&mz->lru_lock);
 unlock_page_cgroup();

Did you see performance regression by atomic ops ?

I'll rewrite the whole stack from now for your request. 

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 30, 2008 - 10:32 pm

On Wed, 1 Oct 2008 14:07:48 +0900

BTW, do you have good idea to modify flag bit without affecting LOCK bit on
page_cgroup->flags ?

At least, we'll have to set ACTIVE/INACTIVE/UNEVICTABLE flags dynamically.
take lock_page_cgroup() always ?
__mem_cgroup_move_lists() will have some amount of changes. And we should
check dead lock again.

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 30, 2008 - 10:59 pm

In the past patches, I've used lock_page_cgroup(). In some cases like
initialization at boot time, I've ignored taking the lock, since I know no-one

__mem_cgroup_move_lists() is called from mem_cgroup_isolate_pages() and
mem_cgroup_move_lists(). In mem_cgroup_move_lists(), we have the page_cgroup
lock. I think the current code works on the assumption (although not documented
anywhere I've seen), that PAGE_CGROUP_FLAG_INACTIVE/ACTIVE/UNEVICTABLE bits are
protected by lru_lock. Please look at

__mem_cgroup_remove_list
__mem_cgroup_add_list
__mem_cgroup_move_lists
__mem_cgroup_charge_common (sets this flag, before the pc is associated with the
page).

Am I missing something (today is holiday here, so I am in a bit of a lazy/sleepy
mood :) )

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 30, 2008 - 11:17 pm

On Wed, 01 Oct 2008 11:29:04 +0530
But my point is lru_lock doesn't means page_cgroup is not locked by someone and
we must take always lock_page_cgroup() when we modify flags.

Then, mem_cgroup_isolate_page() should have to take lock.
But this means we have to care preemption for avoiding deadlock.
Maybe need some time to test.

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 30, 2008 - 10:30 pm

I don't think you'll need to do a major rewrite? Will you? My concern is that
this patch does too much to be a single patch. Consider someone trying to do a
git-bisect to identify a problem? It is hard to review as well and I think the
patch that just removes struct page member can go in faster.

It will be easier to test/debug as well, we'll know if the problem is because of
new page_cgroup being outside struct page rather then guessing if it was the
atomic ops that caused the problem.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 30, 2008 - 10:41 pm

On Wed, 01 Oct 2008 11:00:42 +0530
yes. LOCK bit is on page_cgroup->flags bit.
Then, I'll check all places which modify page_cgroup->flags again and
take lock always.

atomic_ops patch just rewrite exisiting behavior.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 30, 2008 - 11:12 pm

On Wed, 1 Oct 2008 14:41:50 +0900
please forgive me to post v6 today, which passed 24h+ tests.
v5 is a week old.
Discussion about patch order is welcome. But please give me hint.

Thanks,
-Kame

--

From: Balbir Singh
Date: Tuesday, September 30, 2008 - 11:26 pm

That sounds impressive. I'll test and review v6.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:33 pm

Free page_cgroup from its LRU in batched manner.

When uncharge() is called, page is pushed onto per-cpu vector and
removed from LRU, later.. This routine resembles to global LRU's pagevec.
This patch is half of the whole patch and a set with following lazy LRU add
patch.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |  164 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 152 insertions(+), 12 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -36,6 +36,7 @@
 #include <linux/mm_inline.h>
 #include <linux/writeback.h>
 #include <linux/page_cgroup.h>
+#include <linux/cpu.h>
 
 #include <asm/uaccess.h>
 
@@ -533,6 +534,116 @@ out:
 	return ret;
 }
 
+
+#define MEMCG_PCPVEC_SIZE	(14)	/* size of pagevec */
+struct memcg_percpu_vec {
+	int nr;
+	int limit;
+	struct page_cgroup *vec[MEMCG_PCPVEC_SIZE];
+};
+static DEFINE_PER_CPU(struct memcg_percpu_vec, memcg_free_vec);
+
+static void
+__release_page_cgroup(struct memcg_percpu_vec *mpv)
+{
+	unsigned long flags;
+	struct mem_cgroup_per_zone *mz, *prev_mz;
+	struct page_cgroup *pc;
+	int i, nr;
+
+	local_irq_save(flags);
+	nr = mpv->nr;
+	mpv->nr = 0;
+	prev_mz = NULL;
+	for (i = nr - 1; i >= 0; i--) {
+		pc = mpv->vec[i];
+		VM_BUG_ON(PageCgroupUsed(pc));
+		mz = page_cgroup_zoneinfo(pc);
+		if (prev_mz != mz) {
+			if (prev_mz)
+				spin_unlock(&prev_mz->lru_lock);
+			prev_mz = mz;
+			spin_lock(&mz->lru_lock);
+		}
+		__mem_cgroup_remove_list(mz, pc);
+		css_put(&pc->mem_cgroup->css);
+		pc->mem_cgroup = NULL;
+	}
+	if (prev_mz)
+		spin_unlock(&prev_mz->lru_lock);
+	local_irq_restore(flags);
+
+}
+
+static void
+release_page_cgroup(struct page_cgroup *pc)
+{
+	struct memcg_percpu_vec *mpv;
+
+	mpv = &get_cpu_var(memcg_free_vec);
+	mpv->vec[mpv->nr++] = pc;
+	if (mpv->nr ...
From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:35 pm

Delaying add_to_lru() and do it in batched manner like page_vec.
For doing that 2 flags PCG_USED and PCG_LRU.

If PCG_LRU is set, page is on LRU. It safe to access LRU via page_cgroup.
(under some lock.)

For avoiding race, this patch uses TestSetPageCgroupUsed().
and checking PCG_USED bit and PCG_LRU bit in add/free vector.
By this, lock_page_cgroup() in mem_cgroup_charge() is removed.

(I don't want to call lock_page_cgroup() under mz->lru_lock when 
 add/free vector core logic. So, TestSetPageCgroupUsed() logic is added.
 This TestSet is an easy way to avoid unneccesary nest of locks.)


Changelog: v3 -> v5.
 - removed css_get/put per page_cgroup struct.
   Now, new force_empty checks there is page_cgroup on the memcg.
   We don't need to be afraid of leak.

Changelog: v2 -> v3
 - added TRANSIT flag and removed lock from core logic.
Changelog: v1 -> v2:
 - renamed function name from use_page_cgroup to set_page_cgroup_lru().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |   10 +++
 mm/memcontrol.c             |  121 +++++++++++++++++++++++++++++++-------------
 2 files changed, 96 insertions(+), 35 deletions(-)

Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -24,6 +24,7 @@ enum {
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
+	PCG_LRU, /* this is on LRU */
 	/* flags for LRU placement */
 	PCG_ACTIVE, /* page is active in this cgroup */
 	PCG_FILE, /* page is file system backed */
@@ -42,11 +43,20 @@ static inline void SetPageCgroup##uname(
 static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ clear_bit(PCG_##lname, &pc->flags);  }
 
+#define TESTSETPCGFLAG(uname, lname)\
+static inline int TestSetPageCgroup##uname(struct page_cgroup ...
From: KAMEZAWA Hiroyuki
Date: Wednesday, September 24, 2008 - 11:36 pm

There is a small race in do_swap_page(). When the page swapped-in is charged,
the mapcount can be greater than 0. But, at the same time some process (shares
it ) call unmap and make mapcount 1->0 and the page is uncharged.

For fixing this, I added a new interface.
  - precharge
   account to res_counter by PAGE_SIZE and try to free pages if necessary.
  - commit	
   register page_cgroup and add to LRU if necessary.
  - cancel
   uncharge PAGE_SIZE because of do_swap_page failure.

This protocol uses PCG_USED bit on page_cgroup for avoiding over accounting.
Usual mem_cgroup_charge_common() does precharge -> commit at a time.

These precharge/commit/cancel is useful and can be used for other places,
 - shmem, (and other places need precharge.)
 - migration
 - move_account(force_empty) etc...
etc..we'll revisit later.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/memcontrol.h |   21 +++++++
 mm/memcontrol.c            |  135 +++++++++++++++++++++++++++++++--------------
 mm/memory.c                |    6 +-
 3 files changed, 120 insertions(+), 42 deletions(-)

Index: mmotm-2.6.27-rc7+/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/memcontrol.h
+++ mmotm-2.6.27-rc7+/include/linux/memcontrol.h
@@ -31,6 +31,13 @@ struct mm_struct;
 
 extern int mem_cgroup_charge(struct page *page, struct mm_struct *mm,
 				gfp_t gfp_mask);
+/* for swap handling */
+extern int mem_cgroup_precharge(struct mm_struct *mm,
+		gfp_t gfp_mask, struct mem_cgroup **ptr);
+extern void mem_cgroup_commit_charge_swap(struct page *page,
+					struct mem_cgroup *ptr);
+extern void mem_cgroup_cancel_charge_swap(struct mem_cgroup *ptr);
+
 extern int mem_cgroup_cache_charge(struct page *page, struct mm_struct *mm,
 					gfp_t gfp_mask);
 extern void mem_cgroup_move_lists(struct page *page, enum lru_list lru);
@@ -85,6 +92,20 @@ static inline int mem_cgroup_cache_charg
 	return ...
From: Daisuke Nishimura
Date: Thursday, September 25, 2008 - 7:32 pm

I got general protection fault.

(log from dump)
general protection fault: 0000 [1] SMP
last sysfs file: /sys/devices/system/cpu/cpu15/cache/index1/shared_cpu_map
CPU 0
Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6
autofs4 hidp rfcomm l2cap bluetooth sunrpc microcode dm_mirror dm_log dm_multipath dm_mod
rfkill input_polldev sbs sbshc battery ac lp sg e1000 ide_cd_mod cdrom button acpi_memhotp
lug serio_raw rtc_cmos parport_pc rtc_core parport rtc_lib i2c_i801 i2c_core pcspkr shpchp
 ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci
_hcd
Pid: 8001, comm: shmem_test_02 Tainted: G        W 2.6.27-rc7-mm1-7eacf5c9 #1
RIP: 0010:[<ffffffff802a0ebb>]  [<ffffffff802a0ebb>] __mem_cgroup_move_lists+0x8b/0xa2
RSP: 0018:ffff8800bb4ad888  EFLAGS: 00010046
RAX: ffff88010b253080 RBX: ffff88010c67d618 RCX: dead000000100100
RDX: dead000000200200 RSI: ffff88010b253088 RDI: ffff88010c67d630
RBP: 0000000000000000 R08: ffff88010fc020a3 R09: 000000000000000f
R10: ffffffff802a204a R11: 00000000fffffffa R12: ffff88010b253080
R13: 0000000000000000 R14: ffff8800bb4ad9c8 R15: 0000000000000000
FS:  00007f4600faa6f0(0000) GS:ffffffff80638900(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00000033af86c027 CR3: 00000000c1549000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process shmem_test_02 (pid: 8001, threadinfo ffff8800bb4ac000, task ffff880107d21470)
Stack:  ffffe200028ef8b0 0000000000000082 ffff88010c67d618 ffffffff802a1cb9
 ffff880000016f80 0000000000000000 ffff880000016f80 ffffe200028ef888
 ffff8800bb4adb38 ffffffff8027dd09 ffffc20001859000 0000000000000000
Call Trace:
 [<ffffffff802a1cb9>] mem_cgroup_move_lists+0x50/0x74
 [<ffffffff8027dd09>] shrink_list+0x443/0x4ff
 [<ffffffff8027e04e>] shrink_zone+0x289/0x315
 [<ffffffff802805d2>] congestion_wait+0x74/0x80
 ...
From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 7:58 pm

On Fri, 26 Sep 2008 11:32:28 +0900
Thank you.

How about following ?
-Kame
==
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -597,8 +597,8 @@ __set_page_cgroup_lru(struct memcg_percp
 			spin_lock(&mz->lru_lock);
 		}
 		if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
-			SetPageCgroupLRU(pc);
 			__mem_cgroup_add_list(mz, pc);
+			SetPageCgroupLRU(pc);
 		}
 	}
 






--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 8:04 pm

On Fri, 26 Sep 2008 11:58:10 +0900
Of course, remove side should be..
-Kame
==
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -564,8 +564,8 @@ __release_page_cgroup(struct memcg_percp
 			spin_lock(&mz->lru_lock);
 		}
 		if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
-			__mem_cgroup_remove_list(mz, pc);
 			ClearPageCgroupLRU(pc);
+			__mem_cgroup_remove_list(mz, pc);
 		}
 	}
 	if (prev_mz)
@@ -597,8 +597,8 @@ __set_page_cgroup_lru(struct memcg_percp
 			spin_lock(&mz->lru_lock);
 		}
 		if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
-			SetPageCgroupLRU(pc);
 			__mem_cgroup_add_list(mz, pc);
+			SetPageCgroupLRU(pc);
 		}
 	}
 

--

From: Daisuke Nishimura
Date: Thursday, September 25, 2008 - 8:00 pm

I'll test it with updated version of 9-11 and report you back.


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 25, 2008 - 9:05 pm

On Fri, 26 Sep 2008 12:00:19 +0900
Thank you. below is the new one...(Sorry!)

-Kame
==
Check LRU bit under lru_lock.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memcontrol.c |    9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -340,11 +340,12 @@ void mem_cgroup_move_lists(struct page *
 	if (!trylock_page_cgroup(pc))
 		return;
 
-	if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
+	if (PageCgroupUsed(pc)) {
 		mem = pc->mem_cgroup;
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
-		__mem_cgroup_move_lists(pc, lru);
+		if (PageCgroupLRU(pc))
+			__mem_cgroup_move_lists(pc, lru);
 		spin_unlock_irqrestore(&mz->lru_lock, flags);
 	}
 	unlock_page_cgroup(pc);
@@ -564,8 +565,8 @@ __release_page_cgroup(struct memcg_percp
 			spin_lock(&mz->lru_lock);
 		}
 		if (!PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
-			__mem_cgroup_remove_list(mz, pc);
 			ClearPageCgroupLRU(pc);
+			__mem_cgroup_remove_list(mz, pc);
 		}
 	}
 	if (prev_mz)
@@ -597,8 +598,8 @@ __set_page_cgroup_lru(struct memcg_percp
 			spin_lock(&mz->lru_lock);
 		}
 		if (PageCgroupUsed(pc) && !PageCgroupLRU(pc)) {
-			SetPageCgroupLRU(pc);
 			__mem_cgroup_add_list(mz, pc);
+			SetPageCgroupLRU(pc);
 		}
 	}
 



--

From: Daisuke Nishimura
Date: Thursday, September 25, 2008 - 10:24 pm

Unfortunately, there remains some bugs yet...

------------[ cut here ]------------
WARNING: at lib/list_debug.c:51 list_del+0x5c/0x87()
list_del corruption. next->prev should be ffff88010ca291e8, but was dead000000200200
Modules linked in: ipt_REJECT xt_tcpudp iptable_filter ip_tables x_tables bridge stp ipv6
autofs4 hidp rfcomm l2cap bluetooth sunrpc microcode dm_mirror dm_log dm_multipath dm_mod
rfkill input_polldev sbs sbshc battery ac lp sg e1000 ide_cd_mod cdrom button acpi_memhotp
lug parport_pc rtc_cmos rtc_core parport serio_raw rtc_lib i2c_i801 i2c_core shpchp pcspkr
 ata_piix libata megaraid_mbox megaraid_mm sd_mod scsi_mod ext3 jbd ehci_hcd ohci_hcd uhci
_hcd
Pid: 3940, comm: bash Tainted: G        W 2.6.27-rc7-mm1-dd8bf0fe #1
Call Trace:
 [<ffffffff8023ac52>] warn_slowpath+0xb4/0xd2
 [<ffffffff8024c0ec>] prepare_to_wait_exclusive+0x38/0x5a
 [<ffffffff8024c089>] finish_wait+0x32/0x5d
 [<ffffffff8049ec80>] __wait_on_bit_lock+0x5b/0x66
 [<ffffffff80272ff4>] __lock_page+0x5e/0x64
 [<ffffffff8022c4f9>] target_load+0x2a/0x58
 [<ffffffff8022cf59>] place_entity+0x85/0xb3
 [<ffffffff8022f6db>] enqueue_entity+0x16e/0x18f
 [<ffffffff8027ff0a>] zone_statistics+0x3a/0x5d
 [<ffffffff8027ff0a>] zone_statistics+0x3a/0x5d
 [<ffffffff802788d3>] get_page_from_freelist+0x455/0x5bf
 [<ffffffff803402e8>] list_del+0x5c/0x87
 [<ffffffff802a1530>] mem_cgroup_commit_charge+0x6f/0xdd
 [<ffffffff802a16ed>] mem_cgroup_charge_common+0x4c/0x62
 [<ffffffff802835de>] handle_mm_fault+0x222/0x791
 [<ffffffff8027ff0a>] zone_statistics+0x3a/0x5d
 [<ffffffff8028235a>] follow_page+0x2d/0x2c2
 [<ffffffff80283e42>] __get_user_pages+0x2f5/0x3f3
 [<ffffffff802a74ed>] get_arg_page+0x46/0xa5
 [<ffffffff802a7724>] copy_strings+0xfc/0x1de
 [<ffffffff802a7827>] copy_strings_kernel+0x21/0x33
 [<ffffffff802a8aff>] do_execve+0x140/0x256
 [<ffffffff8020a495>] sys_execve+0x35/0x4c
 [<ffffffff8020c1ea>] stub_execve+0x6a/0xc0
---[ end trace 4eaa2a86a8e2da22 ]---
------------[ cut here ]------------
WARNING: at ...
From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:28 am

On Fri, 26 Sep 2008 14:24:55 +0900
I confirmed I can reproduce this.

I found one chance to cause this. (and confirmed this happens by printk)
                                   set_lru()..
                                      
   TestSetPageCgroup(pc);
   ....                            if (PageCgroupUsed(pc) && !PageCgroupLRU(pc))
   pc->mem_cgroup = mem;
                                      SetPageCgroupLRU();
                                      __mem_cgroup_add_list();


Then, page_cgroup will be added to wrong LRU whic doesn't match pc->mem_cgroup.

But there is still more...still digging.

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 3:43 am

On Fri, 26 Sep 2008 14:24:55 +0900

Thank you for your patient good test!

I'm now testing following (and will do over-night test.)
In this an hour, this seems to work good. 
(under your test which usually panics in 10-20min on my box.)

==
page_cgroup is not valid until pc->mem_cgroup is set to appropriate value.
There is a small race between Set-Used-Bit and Set-Proper-pc->mem_cgroup.
This patch tries to fix that by adding PCG_VALID bit

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 include/linux/page_cgroup.h |    3 +++
 mm/memcontrol.c             |   22 ++++++++++++++--------
 2 files changed, 17 insertions(+), 8 deletions(-)

Index: mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
===================================================================
--- mmotm-2.6.27-rc7+.orig/include/linux/page_cgroup.h
+++ mmotm-2.6.27-rc7+/include/linux/page_cgroup.h
@@ -21,6 +21,7 @@ struct page_cgroup *lookup_page_cgroup(s
 
 enum {
 	/* flags for mem_cgroup */
+	PCG_VALID, /* you can access this page cgroup */
 	PCG_LOCK,  /* page cgroup is locked */
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
@@ -50,6 +51,10 @@ static inline int TestSetPageCgroup##una
 /* Cache flag is set only once (at allocation) */
 TESTPCGFLAG(Cache, CACHE)
 
+TESTPCGFLAG(Valid, VALID)
+SETPCGFLAG(Valid, VALID)
+CLEARPCGFLAG(Valid, VALID)
+
 TESTPCGFLAG(Used, USED)
 CLEARPCGFLAG(Used, USED)
 TESTSETPCGFLAG(Used, USED)
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -340,7 +340,7 @@ void mem_cgroup_move_lists(struct page *
 	if (!trylock_page_cgroup(pc))
 		return;
 
-	if (PageCgroupUsed(pc) && PageCgroupLRU(pc)) {
+	if (PageCgroupValid(pc) && PageCgroupLRU(pc)) {
 		mem = pc->mem_cgroup;
 		mz = page_cgroup_zoneinfo(pc);
 		spin_lock_irqsave(&mz->lru_lock, flags);
@@ -434,7 +434,7 @@ unsigned long ...
From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 7:53 pm

On Fri, 26 Sep 2008 19:43:09 +0900
This helps some, but causes other troubles. (But I know what it is.)
I'll add lock_page_cgroup() to charge side and see what happens.

Thanks,

--

From: Balbir Singh
Date: Friday, September 26, 2008 - 1:18 am

Kame,

I'm beginning to review test the patches now. It would be really nice to split
the development patches from the maintenance ones. I think the full patchset has
too many things and is confusing to look at.


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 2:22 am

On Fri, 26 Sep 2008 13:48:58 +0530
I hope I can do....but maybe difficult.
If you give me ack, 1,2,4,6, can be pushed at early stage.

Thanks,
-Kame




--

From: Balbir Singh
Date: Friday, September 26, 2008 - 2:31 am

I think (1) might be OK, except for the accounting issues pointed out (change in
behaviour visible to end user again, sigh! :( ). Is (1) a serious issue? (2)
seems OK, except for the locking change for mark_page_accessed. I am looking at
(4) and (6) currently.

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 3:36 am

On Fri, 26 Sep 2008 15:01:46 +0530
Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Friday, September 26, 2008 - 8:19 pm

On Fri, 26 Sep 2008 19:36:02 +0900

I'll do in following way in the next Monday.
Divide patches into 2 set

in early fix/optimize set.
  - push (2)
  - push (4)
  - push (6)
  - push (1)

drops (3).

I don't want to remove all? pages-never-on-LRU before fixing force_empty.

in updates
  - introduce atomic flags. (5)
  - add move_account() function (7)
  - add memory.attribute to each memcg dir. (NEW)
  - enhance force_empty (was (8))
       - remove "forget all" logic. and add attribute to select following 2 behavior
          - call try_to_free_page() until the usage goes down to 0.
            This allows faiulre (if page is mlocked, we can't do.). (NEW)
          - call move_account() to move all charges to its parent (as much as possible) (NEW)
          In future, I'd liket to add trash-box cgroup for force_empty somewhere.
  - allocate all page cgroup at boot (9)
  - lazy lru free/add (10,11) with fixes.
  - fix race at charging swap. (12)

After (9), all page and page_cgroup has one-to-one releationship and we want to
assume that "if page is alive and on LRU, it's accounted and has page_cgroup."
(other team, bio cgroup want to use page_cgroup and I want to make it easy.)

For this, fix to behavior of force_empty..."forget all" is necessary.
SwapCache handling is also necessary but I'd like to postpone until next set
because it's complicated.

After above all.
 - handle swap cache 
 - Mem+Swap controller.
 - add trashbox feature ?
 - add memory.shrink_usage_to file.

It's long way to what I really want to do....


Thanks,
-Kame








  - 




--

From: Balbir Singh
Date: Sunday, September 28, 2008 - 8:02 pm

Yes a long way to go, I want to add

1) Multi-hierarchy support
2) Support for soft-limits
3) get swappiness working (there are patches posted for it by Yamamoto-San, but


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Sunday, September 28, 2008 - 8:27 pm

On Mon, 29 Sep 2008 08:32:07 +0530
I'll add -EBUSY behavior to force_empty.
I'm now adding(merging) code to move_account.patch for supporing force_empty.
Thanks. no major changes in my current stack from already posted one.

Thanks,

--

Previous thread: sched: by Bharata B Rao on Wednesday, September 24, 2008 - 9:21 pm. (4 messages)

Next thread: [PATCH 7/7] x86: print out irq nr for msi/ht -v2 by Yinghai Lu on Wednesday, September 24, 2008 - 11:13 pm. (17 messages)