[PATCH 1/4] memcg: account swap cache under lock

Previous thread: [PATCH] sched: minor optimizations in wake_affine and select_task_rq_fair by Amit K. Arora on Monday, September 29, 2008 - 3:02 am. (7 messages)

Next thread: 2.6.27-rc7, freezes with > 1 s2ram cycle by Soeren Sonnenburg on Monday, September 29, 2008 - 4:29 am. (2 messages)
From: KAMEZAWA Hiroyuki
Date: Monday, September 29, 2008 - 3:19 am

Cut out 4 patches from memcg update v5 series.
(Then, this is a part of v6)

I think we got some agreement on these 4.

please ack if ok.

[1/4] ...  account swap under lock
[2/4] ...  make page->mapping to be NULL before uncharge cache.
[3/4] ...  avoid accounting not-on-LRU pages.
[4/4] ...  optimize cpu stat

I still have 6 patches but it's under test and needs review and discussion.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Monday, September 29, 2008 - 3:21 am

While page-cache's charge/uncharge is done under page_lock(), swap-cache
isn't. (anonymous page is charged when it's newly allocated.)

This patch moves do_swap_page()'s charge() call under lock.
I don't see any bad problem *now* but this fix will be good for future
for avoiding unneccesary racy state.


Changelog: (v5) -> (v6)
 - mark_page_accessed() is moved before lock_page().
 - fixed missing unlock_page()
(no changes in previous version)

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/memory.c |   11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memory.c
+++ mmotm-2.6.27-rc7+/mm/memory.c
@@ -2326,16 +2326,17 @@ static int do_swap_page(struct mm_struct
 		count_vm_event(PGMAJFAULT);
 	}
 
+	mark_page_accessed(page);
+
+	lock_page(page);
+	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
+
 	if (mem_cgroup_charge(page, mm, GFP_KERNEL)) {
-		delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
 		ret = VM_FAULT_OOM;
+		unlock_page(page);
 		goto out;
 	}
 
-	mark_page_accessed(page);
-	lock_page(page);
-	delayacct_clear_flag(DELAYACCT_PF_SWAPIN);
-
 	/*
 	 * Back out if somebody else already faulted in this pte.
 	 */

--

From: Daisuke Nishimura
Date: Monday, September 29, 2008 - 4:33 am

Looks good to me.

--

From: Balbir Singh
Date: Tuesday, September 30, 2008 - 1:05 am

Looks good to me

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 29, 2008 - 3:22 am

This patch tries to make page->mapping to be NULL before
mem_cgroup_uncharge_cache_page() is called.

"page->mapping == NULL" is a good check for "whether the page is still
radix-tree or not".
This patch also adds BUG_ON() to mem_cgroup_uncharge_cache_page();


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 mm/filemap.c    |    2 +-
 mm/memcontrol.c |    1 +
 mm/migrate.c    |    9 +++++++--
 3 files changed, 9 insertions(+), 3 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/filemap.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/filemap.c
+++ mmotm-2.6.27-rc7+/mm/filemap.c
@@ -116,12 +116,12 @@ void __remove_from_page_cache(struct pag
 {
 	struct address_space *mapping = page->mapping;
 
-	mem_cgroup_uncharge_cache_page(page);
 	radix_tree_delete(&mapping->page_tree, page->index);
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	BUG_ON(page_mapped(page));
+	mem_cgroup_uncharge_cache_page(page);
 
 	/*
 	 * Some filesystems seem to re-dirty the page even after
Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -734,6 +734,7 @@ void mem_cgroup_uncharge_page(struct pag
 void mem_cgroup_uncharge_cache_page(struct page *page)
 {
 	VM_BUG_ON(page_mapped(page));
+	VM_BUG_ON(page->mapping);
 	__mem_cgroup_uncharge_common(page, MEM_CGROUP_CHARGE_TYPE_CACHE);
 }
 
Index: mmotm-2.6.27-rc7+/mm/migrate.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/migrate.c
+++ mmotm-2.6.27-rc7+/mm/migrate.c
@@ -330,8 +330,6 @@ static int migrate_page_move_mapping(str
 	__inc_zone_page_state(newpage, NR_FILE_PAGES);
 
 	spin_unlock_irq(&mapping->tree_lock);
-	if (!PageSwapCache(newpage))
-		mem_cgroup_uncharge_cache_page(page);
 
 	return 0;
 }
@@ -341,6 +339,8 @@ static int ...
From: Daisuke Nishimura
Date: Monday, September 29, 2008 - 4:39 am

[Empty message]
From: Balbir Singh
Date: Tuesday, September 30, 2008 - 8:50 pm

Looks good to me

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 29, 2008 - 3:23 am

There are not-on-LRU pages which can be mapped and they are not worth to
be accounted. (becasue we can't shrink them and need dirty codes to handle
specical case) We'd like to make use of usual objrmap/radix-tree's protcol
and don't want to account out-of-vm's control pages.

When special_mapping_fault() is called, page->mapping is tend to be NULL 
and it's charged as Anonymous page.
insert_page() also handles some special pages from drivers.

This patch is for avoiding to account special pages.

Changlog: v5 -> v6
  - modified Documentation.
  - fixed to charge only when a page is newly allocated.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |   24 ++++++++++++++++--------
 mm/memory.c                          |   29 +++++++++++++----------------
 mm/rmap.c                            |    4 ++--
 3 files changed, 31 insertions(+), 26 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memory.c
+++ mmotm-2.6.27-rc7+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
-	if (retval)
-		goto out;
-
 	retval = -EINVAL;
 	if (PageAnon(page))
-		goto out_uncharge;
+		goto out;
 	retval = -ENOMEM;
 	flush_dcache_page(page);
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
-		goto out_uncharge;
+		goto out;
 	retval = -EBUSY;
 	if (!pte_none(*pte))
 		goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
 	return retval;
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
-out_uncharge:
-	mem_cgroup_uncharge_page(page);
 out:
 	return retval;
 }
@@ -2463,6 +2457,7 @@ static int __do_fault(struct mm_struct *
 	struct page *page;
 	pte_t entry;
 	int anon = 0;
+	int charged = 0;
 	struct page *dirty_page = NULL;
 	struct vm_fault vmf;
 	int ret;
@@ -2503,6 +2498,12 @@ ...
From: Daisuke Nishimura
Date: Monday, September 29, 2008 - 4:19 am

checkpatch reports a warning here.

I think it should be like

@@ -2585,7 +2581,8 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
 		/* no need to invalidate: a not-present page won't be cached */
 		update_mmu_cache(vma, address, entry);
 	} else {
-		mem_cgroup_uncharge_page(page);
+		if (charged)
+			mem_cgroup_uncharge_page(page);
 		if (anon)
 			page_cache_release(page);
 		else


Thanks,
--

From: kamezawa.hiroyu
Date: Monday, September 29, 2008 - 4:59 am

Oh, thanks. I'll post fixed one, tomorrow.

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Monday, September 29, 2008 - 6:17 pm

This is conding-style fixed version. Thank you, Nishimura-san.
-Kmae
==
There are not-on-LRU pages which can be mapped and they are not worth to
be accounted. (becasue we can't shrink them and need dirty codes to handle
specical case) We'd like to make use of usual objrmap/radix-tree's protcol
and don't want to account out-of-vm's control pages.

When special_mapping_fault() is called, page->mapping is tend to be NULL 
and it's charged as Anonymous page.
insert_page() also handles some special pages from drivers.

This patch is for avoiding to account special pages.

Changlog: v5 -> v6
  - modified Documentation.
  - fixed to charge only when a page is newly allocated.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

 Documentation/controllers/memory.txt |   24 ++++++++++++++++--------
 mm/memory.c                          |   25 +++++++++++--------------
 mm/rmap.c                            |    4 ++--
 3 files changed, 29 insertions(+), 24 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memory.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memory.c
+++ mmotm-2.6.27-rc7+/mm/memory.c
@@ -1323,18 +1323,14 @@ static int insert_page(struct vm_area_st
 	pte_t *pte;
 	spinlock_t *ptl;
 
-	retval = mem_cgroup_charge(page, mm, GFP_KERNEL);
-	if (retval)
-		goto out;
-
 	retval = -EINVAL;
 	if (PageAnon(page))
-		goto out_uncharge;
+		goto out;
 	retval = -ENOMEM;
 	flush_dcache_page(page);
 	pte = get_locked_pte(mm, addr, &ptl);
 	if (!pte)
-		goto out_uncharge;
+		goto out;
 	retval = -EBUSY;
 	if (!pte_none(*pte))
 		goto out_unlock;
@@ -1350,8 +1346,6 @@ static int insert_page(struct vm_area_st
 	return retval;
 out_unlock:
 	pte_unmap_unlock(pte, ptl);
-out_uncharge:
-	mem_cgroup_uncharge_page(page);
 out:
 	return retval;
 }
@@ -2463,6 +2457,7 @@ static int __do_fault(struct mm_struct *
 	struct page *page;
 	pte_t entry;
 	int anon = 0;
+	int charged = 0;
 	struct page *dirty_page ...
From: Balbir Singh
Date: Tuesday, September 30, 2008 - 8:49 pm

If I understand this correctly, we now account only when the VMA is not shared?
Seems reasonable, since we don't allocate a page otherwise.



Is the change because we expect the page to get directly uncharged when it is
removed from cache? i.e, page->mapping is set to NULL before uncharge?

Looks good to me, I am yet to test it though.

Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>


-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, September 30, 2008 - 9:50 pm

On Wed, 01 Oct 2008 09:19:10 +0530
Thanks,

--

From: Balbir Singh
Date: Tuesday, September 30, 2008 - 1:14 am

Coding style braces for a single-line if-else, please recheck

-- 
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, September 29, 2008 - 3:24 am

Some obvious optimization to memcg.

I found mem_cgroup_charge_statistics() is a little big (in object) and
does unnecessary address calclation.
This patch is for optimization to reduce the size of this function.

And res_counter_charge() is 'likely' to success.

Changlog: v5->v6
 - patch series was reordered and needs some adjustment. no changes in logic.
Changelog v3->v4:
 - merged with an other leaf patch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Acked-by: Balbir Singh <balbir@linux.vnet.ibm.com>


 mm/memcontrol.c |   18 ++++++++++--------
 1 file changed, 10 insertions(+), 8 deletions(-)

Index: mmotm-2.6.27-rc7+/mm/memcontrol.c
===================================================================
--- mmotm-2.6.27-rc7+.orig/mm/memcontrol.c
+++ mmotm-2.6.27-rc7+/mm/memcontrol.c
@@ -66,11 +66,10 @@ struct mem_cgroup_stat {
 /*
  * For accounting under irq disable, no need for increment preempt count.
  */
-static void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat *stat,
+static inline void __mem_cgroup_stat_add_safe(struct mem_cgroup_stat_cpu *stat,
 		enum mem_cgroup_stat_index idx, int val)
 {
-	int cpu = smp_processor_id();
-	stat->cpustat[cpu].count[idx] += val;
+	stat->count[idx] += val;
 }
 
 static s64 mem_cgroup_read_stat(struct mem_cgroup_stat *stat,
@@ -190,18 +189,21 @@ static void mem_cgroup_charge_statistics
 {
 	int val = (charge)? 1 : -1;
 	struct mem_cgroup_stat *stat = &mem->stat;
+	struct mem_cgroup_stat_cpu *cpustat;
 
 	VM_BUG_ON(!irqs_disabled());
+
+	cpustat = &stat->cpustat[smp_processor_id()];
 	if (flags & PAGE_CGROUP_FLAG_CACHE)
-		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_CACHE, val);
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_CACHE, val);
 	else
-		__mem_cgroup_stat_add_safe(stat, MEM_CGROUP_STAT_RSS, val);
+		__mem_cgroup_stat_add_safe(cpustat, MEM_CGROUP_STAT_RSS, val);
 
 	if (charge)
-		__mem_cgroup_stat_add_safe(stat,
+		__mem_cgroup_stat_add_safe(cpustat,
 ...
From: Daisuke Nishimura
Date: Monday, September 29, 2008 - 4:44 am

Reviewed-by: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>

I'll test with all the patches applied tonight.


Thanks,
--

From: Balbir Singh
Date: Monday, October 6, 2008 - 10:15 am

Hi, Andrew,

Could you please pick this patchset, following it is a very important set of
patches that remove struct page's page_cgroup member.


-- 
	Balbir
--

Previous thread: [PATCH] sched: minor optimizations in wake_affine and select_task_rq_fair by Amit K. Arora on Monday, September 29, 2008 - 3:02 am. (7 messages)

Next thread: 2.6.27-rc7, freezes with > 1 s2ram cycle by Soeren Sonnenburg on Monday, September 29, 2008 - 4:29 am. (2 messages)