Re: [PATCH -mmotm 1/5] memcg: disable irq at page cgroup lock

Previous thread: HTC Dream: Hook up LEDs by Pavel Machek on Sunday, March 14, 2010 - 1:16 pm. (1 message)

Next thread: [2.6.34-rc1 REGRESSION] alsa: no mixer controls (and no sound) with 2.6.24-rc1+ on Realtek ALC260 by Alessandro Guido on Sunday, March 14, 2010 - 5:06 pm. (3 messages)
From: Andrea Righi
Date: Sunday, March 14, 2010 - 4:26 pm

Control the maximum amount of dirty pages a cgroup can have at any given time.

Per cgroup dirty limit is like fixing the max amount of dirty (hard to reclaim)
page cache used by any cgroup. So, in case of multiple cgroup writers, they
will not be able to consume more than their designated share of dirty pages and
will be forced to perform write-out if they cross that limit.

The overall design is the following:

 - account dirty pages per cgroup
 - limit the number of dirty pages via memory.dirty_ratio / memory.dirty_bytes
   and memory.dirty_background_ratio / memory.dirty_background_bytes in
   cgroupfs
 - start to write-out (background or actively) when the cgroup limits are
   exceeded

This feature is supposed to be strictly connected to any underlying IO
controller implementation, so we can stop increasing dirty pages in VM layer
and enforce a write-out before any cgroup will consume the global amount of
dirty pages defined by the /proc/sys/vm/dirty_ratio|dirty_bytes and
/proc/sys/vm/dirty_background_ratio|dirty_background_bytes limits.

Changelog (v6 -> v7)
~~~~~~~~~~~~~~~~~~~~~~
 * introduce trylock_page_cgroup() to guarantee that lock_page_cgroup()
   is never called under tree_lock (no strict accounting, but better overall
   performance)
 * do not account file cache statistics for the root cgroup (zero
   overhead for the root cgroup)
 * fix: evaluate cgroup free pages as at the minimum free pages of all
   its parents

Results
~~~~~~~
The testcase is a kernel build (2.6.33 x86_64_defconfig) on a Intel Core 2 @
1.2GHz:

<before>
 - root  cgroup:	11m51.983s
 - child cgroup:	11m56.596s

<after>
 - root cgroup:		11m51.742s
 - child cgroup:	12m5.016s

In the previous version of this patchset, using the "complex" locking scheme
with the _locked and _unlocked version of mem_cgroup_update_page_stat(), the
child cgroup required 11m57.896s and 12m9.920s with lock_page_cgroup()+irq_disabled.

With this version there's no overhead for the root cgroup (the ...
From: Andrea Righi
Date: Sunday, March 14, 2010 - 4:26 pm

Document cgroup dirty memory interfaces and statistics.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 Documentation/cgroups/memory.txt |   36 ++++++++++++++++++++++++++++++++++++
 1 files changed, 36 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 49f86f3..38ca499 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -310,6 +310,11 @@ cache		- # of bytes of page cache memory.
 rss		- # of bytes of anonymous and swap cache memory.
 pgpgin		- # of pages paged in (equivalent to # of charging events).
 pgpgout		- # of pages paged out (equivalent to # of uncharging events).
+filedirty	- # of pages that are waiting to get written back to the disk.
+writeback	- # of pages that are actively being written back to the disk.
+writeback_tmp	- # of pages used by FUSE for temporary writeback buffers.
+nfs		- # of NFS pages sent to the server, but not yet committed to
+		  the actual storage.
 active_anon	- # of bytes of anonymous and  swap cache memory on active
 		  lru list.
 inactive_anon	- # of bytes of anonymous memory and swap cache memory on
@@ -345,6 +350,37 @@ Note:
   - a cgroup which uses hierarchy and it has child cgroup.
   - a cgroup which uses hierarchy and not the root of hierarchy.
 
+5.4 dirty memory
+
+  Control the maximum amount of dirty pages a cgroup can have at any given time.
+
+  Limiting dirty memory is like fixing the max amount of dirty (hard to
+  reclaim) page cache used by any cgroup. So, in case of multiple cgroup writers,
+  they will not be able to consume more than their designated share of dirty
+  pages and will be forced to perform write-out if they cross that limit.
+
+  The interface is equivalent to the procfs interface: /proc/sys/vm/dirty_*.
+  It is possible to configure a limit to trigger both a direct writeback or a
+  background writeback performed by per-bdi flusher threads.
+
+  Per-cgroup dirty limits can be set using the ...
From: Daisuke Nishimura
Date: Tuesday, March 16, 2010 - 12:41 am

It would be better to note that what those files of root cgroup mean.
We cannot write any value to them, IOW, we cannot control dirty limit about root cgroup.
And they show the same value as the global one(strictly speaking, it's not true
because global values can change. We need a hook in mem_cgroup_dirty_read()?).

Thanks,
--

From: Greg Thelen
Date: Wednesday, March 17, 2010 - 10:48 am

On Mon, Mar 15, 2010 at 11:41 PM, Daisuke Nishimura

Should these new memory.stat counters (filedirty, etc) report byte
counts rather than page counts?  I am thinking that byte counters
would make reporting more obvious depending on how heterogeneous page
sizes are used. Byte counters would also agree with /proc/meminfo.
Within the kernel we could still maintain page counts.  The only
change would be to the reporting routine, mem_cgroup_get_local_stat(),
which would scale the page counts by PAGE_SIZE as it does for for
cache,rss,etc.

--
Greg
--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 12:02 pm

I agree, byte counts would be better than page counts. pgpin and
pgpout are special cases where the pages matter, the size does not due
to the nature of the operation. 

-- 
	Three Cheers,
	Balbir
--

From: Andrea Righi
Date: Wednesday, March 17, 2010 - 3:43 pm

OK, we can just return system-wide value if mem_cgroup_is_root() in
mem_cgroup_dirty_read(). Will change this in the next version.

Thanks,
-Andrea
--

From: Andrea Righi
Date: Sunday, March 14, 2010 - 4:26 pm

Introduce page_cgroup flags to keep track of file cache pages.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/page_cgroup.h |   22 +++++++++++++++++++++-
 1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index bf9a913..65247e4 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -40,7 +40,11 @@ enum {
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
 	/* for cache-status accounting */
-	PCG_FILE_MAPPED,
+	PCG_FILE_MAPPED, /* page is accounted as file rss*/
+	PCG_FILE_DIRTY, /* page is dirty */
+	PCG_FILE_WRITEBACK, /* page is being written back to disk */
+	PCG_FILE_WRITEBACK_TEMP, /* page is used as temporary buffer for FUSE */
+	PCG_FILE_UNSTABLE_NFS, /* NFS page not yet committed to the server */
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -83,6 +87,22 @@ TESTPCGFLAG(FileMapped, FILE_MAPPED)
 SETPCGFLAG(FileMapped, FILE_MAPPED)
 CLEARPCGFLAG(FileMapped, FILE_MAPPED)
 
+TESTPCGFLAG(FileDirty, FILE_DIRTY)
+SETPCGFLAG(FileDirty, FILE_DIRTY)
+CLEARPCGFLAG(FileDirty, FILE_DIRTY)
+
+TESTPCGFLAG(FileWriteback, FILE_WRITEBACK)
+SETPCGFLAG(FileWriteback, FILE_WRITEBACK)
+CLEARPCGFLAG(FileWriteback, FILE_WRITEBACK)
+
+TESTPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+SETPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+CLEARPCGFLAG(FileWritebackTemp, FILE_WRITEBACK_TEMP)
+
+TESTPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+SETPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+CLEARPCGFLAG(FileUnstableNFS, FILE_UNSTABLE_NFS)
+
 static inline int page_cgroup_nid(struct page_cgroup *pc)
 {
 	return page_to_nid(pc->page);
-- 
1.6.3.3

--

From: Andrea Righi
Date: Sunday, March 14, 2010 - 4:26 pm

Infrastructure to account dirty pages per cgroup and add dirty limit
interfaces in the cgroupfs:

 - Direct write-out: memory.dirty_ratio, memory.dirty_bytes

 - Background write-out: memory.dirty_background_ratio, memory.dirty_background_bytes

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 include/linux/memcontrol.h |   92 ++++++++-
 mm/memcontrol.c            |  484 +++++++++++++++++++++++++++++++++++++++++---
 2 files changed, 540 insertions(+), 36 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 88d3f9e..0602ec9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -19,12 +19,55 @@
 
 #ifndef _LINUX_MEMCONTROL_H
 #define _LINUX_MEMCONTROL_H
+
+#include <linux/writeback.h>
 #include <linux/cgroup.h>
+
 struct mem_cgroup;
 struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/* Cgroup memory statistics items exported to the kernel */
+enum mem_cgroup_read_page_stat_item {
+	MEMCG_NR_DIRTYABLE_PAGES,
+	MEMCG_NR_RECLAIM_PAGES,
+	MEMCG_NR_WRITEBACK,
+	MEMCG_NR_DIRTY_WRITEBACK_PAGES,
+};
+
+/* File cache pages accounting */
+enum mem_cgroup_write_page_stat_item {
+	MEMCG_NR_FILE_MAPPED,		/* # of pages charged as file rss */
+	MEMCG_NR_FILE_DIRTY,		/* # of dirty pages in page cache */
+	MEMCG_NR_FILE_WRITEBACK,	/* # of pages under writeback */
+	MEMCG_NR_FILE_WRITEBACK_TEMP,	/* # of pages under writeback using
+					   temporary buffers */
+	MEMCG_NR_FILE_UNSTABLE_NFS,	/* # of NFS unstable pages */
+
+	MEMCG_NR_FILE_NSTAT,
+};
+
+/* Dirty memory parameters */
+struct vm_dirty_param {
+	int dirty_ratio;
+	int dirty_background_ratio;
+	unsigned long dirty_bytes;
+	unsigned long dirty_background_bytes;
+};
+
+/*
+ * TODO: provide a validation check routine. And retry if validation
+ * fails.
+ */
+static inline void get_global_vm_dirty_param(struct vm_dirty_param *param)
+{
+	param->dirty_ratio = vm_dirty_ratio;
+	param->dirty_bytes = ...
From: KAMEZAWA Hiroyuki
Date: Sunday, March 14, 2010 - 7:26 pm

On Mon, 15 Mar 2010 00:26:41 +0100

Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


--

From: Daisuke Nishimura
Date: Monday, March 15, 2010 - 7:32 pm

hmm, mem_cgroup_page_stat() can return negative value, but you place BUG_ON()
in [5/5] to check it returns negative value. What happens if the current is moved
to root between mem_cgroup_has_dirty_limit() and mem_cgroup_page_stat() ?
How about making mem_cgroup_has_dirty_limit() return the target mem_cgroup, and
I don't have a strong objection, but I prefer showing them in bytes.
And can you add to mem_cgroup_stat_show() something like:

	for (i = 0; i < NR_MCS_STAT; i++) {
		if (i == MCS_SWAP && !do_swap_account)
			continue;
+		if (i >= MCS_FILE_STAT_STAR && i <= MCS_FILE_STAT_END &&
+		   mem_cgroup_is_root(mem_cont))
+			continue;
		cb->fill(cb, memcg_stat_strings[i].local_name, mystat.stat[i]);
	}

not to show file stat in root cgroup ? It's meaningless value anyway.
Of course, you'd better mention it in [2/5] too.


Thanks,
--

From: Vivek Goyal
Date: Tuesday, March 16, 2010 - 7:11 am

On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote:


Hmm, if mem_cgroup_has_dirty_limit() retrun pointer to memcg, then one
shall have to use rcu_read_lock() and that will look ugly.

Why don't we simply look at the return value and if it is negative, we
fall back to using global stats and get rid of BUG_ON()?

Or, modify mem_cgroup_page_stat() to return global stats if it can't
determine per cgroup stat for some reason. (mem=NULL or root cgroup etc).

Vivek
--

From: Daisuke Nishimura
Date: Tuesday, March 16, 2010 - 8:09 am

On Tue, 16 Mar 2010 10:11:50 -0400
I don't have any objection as long as we don't hit BUG_ON.


Thanks,



--

From: Andrea Righi
Date: Wednesday, March 17, 2010 - 3:37 pm

I vote for this one. IMHO the caller of mem_cgroup_page_stat() should
fallback to the equivalent global stats. This allows to keep the things

Thanks,
-Andrea
--

From: Andrea Righi
Date: Wednesday, March 17, 2010 - 3:52 pm

On Tue, Mar 16, 2010 at 11:32:38AM +0900, Daisuke Nishimura wrote:

Agreed. I would be definitely more clear. Balbir, KAME-san, what do you


OK.

Thanks,
-Andrea
--

From: Greg Thelen
Date: Wednesday, March 17, 2010 - 11:48 pm

I have a proposed change to "[PATCH -mmotm 4/5] memcg: dirty pages accounting
and limiting infrastructure" v6.  The change is small and I am presenting it
as a git patch (below) to be applied after 4/5 v6 has been applied.
The change is fairly simple.  An alternative would be to reject my
patch (below) and enhance get_vm_dirty_param() to loop for consistenty in all
cases.

---patch snip here, rest of email is git patch of 4/5 v6 ---

Removed unneeded looping from get_vm_dirty_param().  The only caller of
get_vm_dirty_param() gracefully handles inconsistent values, so there is no
need for get_vm_dirty_param() to loop to ensure consistency.  The previous
looping was inconsistent because it did not loop for the case where memory
cgroups were disabled.

Signed-off-by: Greg Thelen <gthelen@google.com>
---
 mm/memcontrol.c |   28 ++++++++++++----------------
 1 files changed, 12 insertions(+), 16 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4d00c0f..990a907 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1081,6 +1081,10 @@ static void __mem_cgroup_get_dirty_param(struct vm_dirty_param *param,
  * The function fills @param with the current memcg dirty memory settings. If
  * memory cgroup is disabled or in case of error the structure is filled with
  * the global dirty memory settings.
+ *
+ * Because global and memcg vm_dirty_param are not protected, inconsistent
+ * values may be returned.  If consistent values are required, then the caller
+ * should call this routine until dirty_param_is_valid() returns true.
  */
 void get_vm_dirty_param(struct vm_dirty_param *param)
 {
@@ -1090,28 +1094,20 @@ void get_vm_dirty_param(struct vm_dirty_param *param)
 		get_global_vm_dirty_param(param);
 		return;
 	}
+
 	/*
 	 * It's possible that "current" may be moved to other cgroup while we
 	 * access cgroup. But precise check is meaningless because the task can
 	 * be moved after our access and writeback tends to take long time.
 	 * At least, "memcg" ...
From: Andrea Righi
Date: Sunday, March 14, 2010 - 4:26 pm

Apply the cgroup dirty pages accounting and limiting infrastructure to
the opportune kernel functions.

[ NOTE: for now do not account WritebackTmp pages (FUSE) and NILFS2
bounce pages. This depends on charging also bounce pages per cgroup. ]

As a bonus, make determine_dirtyable_memory() static again: this
function isn't used anymore outside page writeback.

Signed-off-by: Andrea Righi <arighi@develer.com>
---
 fs/nfs/write.c            |    4 +
 include/linux/writeback.h |    2 -
 mm/filemap.c              |    1 +
 mm/page-writeback.c       |  215 ++++++++++++++++++++++++++++-----------------
 mm/rmap.c                 |    4 +-
 mm/truncate.c             |    1 +
 6 files changed, 141 insertions(+), 86 deletions(-)

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index 53ff70e..3e8b9f8 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -440,6 +440,7 @@ nfs_mark_request_commit(struct nfs_page *req)
 			NFS_PAGE_TAG_COMMIT);
 	nfsi->ncommit++;
 	spin_unlock(&inode->i_lock);
+	mem_cgroup_inc_page_stat(req->wb_page, MEMCG_NR_FILE_UNSTABLE_NFS);
 	inc_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 	inc_bdi_stat(req->wb_page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 	__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
@@ -451,6 +452,7 @@ nfs_clear_request_commit(struct nfs_page *req)
 	struct page *page = req->wb_page;
 
 	if (test_and_clear_bit(PG_CLEAN, &(req)->wb_flags)) {
+		mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(page->mapping->backing_dev_info, BDI_RECLAIMABLE);
 		return 1;
@@ -1277,6 +1279,8 @@ nfs_commit_list(struct inode *inode, struct list_head *head, int how)
 		req = nfs_list_entry(head->next);
 		nfs_list_remove_request(req);
 		nfs_mark_request_commit(req);
+		mem_cgroup_dec_page_stat(req->wb_page,
+				MEMCG_NR_FILE_UNSTABLE_NFS);
 		dec_zone_page_state(req->wb_page, NR_UNSTABLE_NFS);
 		dec_bdi_stat(req->wb_page->mapping->backing_dev_info,
 				BDI_RECLAIMABLE);
diff ...
From: KAMEZAWA Hiroyuki
Date: Sunday, March 14, 2010 - 7:31 pm

On Mon, 15 Mar 2010 00:26:42 +0100

Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>


"dirty" seems not to be necessary.
	if (get_dirty_writeback_pages() < dirty_thresh) ?

Thanks,
-Kame



--

From: Andrea Righi
Date: Sunday, March 14, 2010 - 4:26 pm

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, file-mapped is maintaiend. But more generic update function
will be needed for dirty page accounting.

For accountig page status, we have to guarantee lock_page_cgroup()
will be never called under tree_lock held.
To guarantee that, we use trylock at updating status.
By this, we do fuzzy accounting, but in almost all case, it's correct.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |    7 +++-
 include/linux/page_cgroup.h |   15 +++++++
 mm/memcontrol.c             |   94 +++++++++++++++++++++++++++++++++----------
 mm/rmap.c                   |    4 +-
 4 files changed, 95 insertions(+), 25 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 44301c6..88d3f9e 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,7 +124,12 @@ static inline bool mem_cgroup_disabled(void)
 	return false;
 }
 
-void mem_cgroup_update_file_mapped(struct page *page, int val);
+enum mem_cgroup_page_stat_item {
+	MEMCG_NR_FILE_MAPPED,
+	MEMCG_NR_FILE_NSTAT,
+};
+
+void mem_cgroup_update_stat(struct page *page, int idx, bool charge);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
 						int zid);
diff --git a/include/linux/page_cgroup.h b/include/linux/page_cgroup.h
index 30b0813..bf9a913 100644
--- a/include/linux/page_cgroup.h
+++ b/include/linux/page_cgroup.h
@@ -39,6 +39,8 @@ enum {
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
+	/* for cache-status accounting */
+	PCG_FILE_MAPPED,
 };
 
 #define TESTPCGFLAG(uname, lname)			\
@@ -57,6 +59,10 @@ static inline void ClearPageCgroup##uname(struct page_cgroup *pc)	\
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
+/* Page/File stat ...
From: KAMEZAWA Hiroyuki
Date: Sunday, March 14, 2010 - 5:06 pm

On Mon, 15 Mar 2010 00:26:38 +0100

Bad patch title...."use trylock for safe accounting in some contexts" ?

Thanks,

--

From: Andrea Righi
Date: Monday, March 15, 2010 - 3:00 am

OK, sounds better. I just copy & paste the email subject, but the title
was probably related to the old lock_page_cgroup()+irq_disable patch.

Thanks,
-Andrea
--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 12:04 am

What happens if the trylock fails, do we lose the stat?

-- 
	Three Cheers,
	Balbir
--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 4:58 am

I don't like this at all, but in almost all cases is not acceptable
for statistics, since decisions will be made on them and having them
incorrect is really bad. Could we do a form of deferred statistics and
fix this.

 
-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 17, 2010 - 4:54 pm

On Wed, 17 Mar 2010 17:28:55 +0530

plz show your implementation which has no performance regresssion.
For me, I don't neee file_mapped accounting, at all. If we can remove that,
we can add simple migration lock.
file_mapped is a feattue you added. please improve it.


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 17, 2010 - 5:45 pm

On Thu, 18 Mar 2010 08:54:11 +0900

BTW, I should explain how acculate this accounting is in this patch itself.

Now, lock_page_cgroup/unlock_page_cgroup happens when
	- charge/uncharge/migrate/move accounting

Then, the lock contention (trylock failure) seems to occur in conflict
with
	- charge, uncharge, migarate. move accounting

About dirty accounting, charge/uncharge/migarate are operation in synchronous
manner with radix-tree (holding treelock etc). Then no account leak.
move accounting is only source for inacculacy...but I don't think this move-task
is ciritial....moreover, we don't move any file pages at task-move, now.
(But Nishimura-san has a plan to do so.)
So, contention will happen only at confliction with force_empty.

About FILE_MAPPED accounting, it's not synchronous with radix-tree operaton.
Then, accounting-miss seems to happen when charge/uncharge/migrate/account move.
But
	charge .... we don't account a page as FILE_MAPPED before it's charged.
	uncharge .. usual operation in turncation is unmap->remove-from-radix-tree.
		    Then, it's sequential in almost all case. The race exists when...
		    Assume there are 2 threads A and B. A truncate a file, B map/unmap that.
		    This is very unusal confliction.
	migrate.... we do try_to_unmap before migrating pages. Then, FILE_MAPPED
		    is properly handled.
	move account .... we don't have move-account-mapped-file, yet.

Then, this trylock contention happens at contention with force_empty and truncate.


Then, main issue for contention is force_empty. But it's called for removing memcg,
accounting for such memcg is not important.
Then, I say "this accounting is Okay."

To do more accurate, we may need another "migration lock". But to get better
performance for root cgroup, we have to call mem_cgroup_is_root() before
taking lock and there will be another complicated race.

Bye,
-Kame












--

From: Daisuke Nishimura
Date: Wednesday, March 17, 2010 - 7:16 pm

FILE_MAPPED is updated under pte lock. OTOH, move account is also done under
pte lock. page cgroup lock is held under pte lock in both cases, so move account
is not so problem as for FILE_MAPPED.


Thanks,
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 17, 2010 - 7:58 pm

On Thu, 18 Mar 2010 11:16:53 +0900
HmmHmm, thank you. then, only racy cases are truncate and force_empty.
Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 10:12 pm

Agreed, we need to find a simpler way of doing this without affecting
the accuracy of accounting - may be two accounting routines for two
code paths. I have not thought through this yet.

-- 
	Three Cheers,
	Balbir
--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 9:19 pm

That doesn't matter, if you need it, I think the larger user base
matters. Unmapped and mapped page cache is critical and I use it

I will, but please don't break it silently

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 17, 2010 - 9:21 pm

On Thu, 18 Mar 2010 09:49:44 +0530
I never do silently. All e-mails are open.

Thanks,
-Kame

--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 11:25 pm

No, I meant it from the Documentation update perspective. The end user
will be confused when looking at stats without proper documentation. 

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, March 17, 2010 - 9:35 pm

On Thu, 18 Mar 2010 09:49:44 +0530
Andrea, could you go in following way ?

	- don't touch FILE_MAPPED stuff.
	- add new functions for other dirty accounting stuff as in this series.
	  (using trylock is ok.)
	
Then, no probelm. It's ok to add mem_cgroup_udpate_stat() indpendent from
mem_cgroup_update_file_mapped(). The look may be messy but it's not your
fault. But please write "why add new function" to patch description.

I'm sorry for wasting your time.

Thanks,
-Kame




--

From: Balbir Singh
Date: Thursday, March 18, 2010 - 9:28 am

Do we need to go down this route? We could check the stat and do the
correct thing. In case of FILE_MAPPED, always grab page_cgroup_lock
and for others potentially look at trylock. It is OK for different
stats to be protected via different locks.

/me takes a look at the code again.

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Thursday, March 18, 2010 - 6:23 pm

On Thu, 18 Mar 2010 21:58:55 +0530

I _don't_ want to see a mixture of spinlock and trylock in a function.

Thanks,
-Kame

--

From: Balbir Singh
Date: Thursday, March 18, 2010 - 7:40 pm

A well documented well written function can help. The other thing is to
of-course solve this correctly by introducing different locking around
the statistics. Are you suggesting the later?

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Thursday, March 18, 2010 - 8:00 pm

On Fri, 19 Mar 2010 08:10:39 +0530

No. As I wrote.
	- don't modify codes around FILE_MAPPED in this series.
	- add a new functions for new statistics
Then,
	- think about clean up later, after we confirm all things work as expected.

Thanks,
-Kame


--

From: Greg Thelen
Date: Tuesday, April 13, 2010 - 11:55 pm

I have ported Andrea Righi's memcg dirty page accounting patches to latest
mmtom-2010-04-05-16-09.  In doing so I have to address this locking issue.  Does
the following look good?  I will (of course) submit the entire patch for review,
but I wanted make sure I was aiming in the right direction.

void mem_cgroup_update_page_stat(struct page *page,
			enum mem_cgroup_write_page_stat_item idx, bool charge)
{
	static int seq;
	struct page_cgroup *pc;

	if (mem_cgroup_disabled())
		return;
	pc = lookup_page_cgroup(page);
	if (!pc || mem_cgroup_is_root(pc->mem_cgroup))
		return;

	/*
	 * This routine does not disable irq when updating stats.  So it is
	 * possible that a stat update from within interrupt routine, could
	 * deadlock.  Use trylock_page_cgroup() to avoid such deadlock.  This
	 * makes the memcg counters fuzzy.  More complicated, or lower
	 * performing locking solutions avoid this fuzziness, but are not
	 * currently needed.
	 */
	if (irqs_disabled()) {
		if (! trylock_page_cgroup(pc))
			return;
	} else
		lock_page_cgroup(pc);

	__mem_cgroup_update_page_stat(pc, idx, charge);
	unlock_page_cgroup(pc);
}

__mem_cgroup_update_page_stat() has a switch statement that updates all of the
MEMCG_NR_FILE_{MAPPED,DIRTY,WRITEBACK,WRITEBACK_TEMP,UNSTABLE_NFS} counters
using the following form:
	switch (idx) {
	case MEMCG_NR_FILE_MAPPED:
		if (charge) {
			if (!PageCgroupFileMapped(pc))
				SetPageCgroupFileMapped(pc);
			else
				val = 0;
		} else {
			if (PageCgroupFileMapped(pc))
				ClearPageCgroupFileMapped(pc);
			else
				val = 0;
		}
		idx = MEM_CGROUP_STAT_FILE_MAPPED;
		break;

		...
	}

	/*
	 * Preemption is already disabled. We can use __this_cpu_xxx
	 */
	if (val > 0) {
		__this_cpu_inc(mem->stat->count[idx]);
	} else if (val < 0) {
		__this_cpu_dec(mem->stat->count[idx]);
	}

In my current tree, irq is never saved/restored by cgroup locking code.  To
protect against interrupt reentrancy, trylock_page_cgroup() is used.  As ...
From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 2:29 am

On Tue, 13 Apr 2010 23:55:12 -0700

I prefer trylock_page_cgroup() always.

I have another idea fixing this up _later_. (But I want to start from simple one.)

My rough idea is following.  Similar to your idea which you gave me before.

==
DEFINE_PERCPU(account_move_ongoing);
DEFINE_MUTEX(move_account_mutex):

void memcg_start_account_move(void)
{
	mutex_lock(&move_account_mutex);
	for_each_online_cpu(cpu)
		per_cpu(cpu, account_move_ongoing) += 1;
	mutex_unlock(&move_account_mutex);
	/* Wait until there are no lockless update */
	synchronize_rcu();
	return;
}

void memcg_end_account_move(void)
{
	mutex_lock(&move_account_mutex);
	for_each_online_cpu(cpu)
		per_cpu(cpu, account_move_ongoing) -= 1;
	mutex_unlock(&move_account_mutex);
}

/* return 1 when we took lock, return 0 if lockess OPs is guarantedd to be safe */
int memcg_start_filecache_accounting(struct page_cgroup *pc)
{
	rcu_read_lock();
	smp_rmb();
	if (!this_cpu_read(move_account_ongoing))
		return 0; /* no move account is ongoing */
	lock_page_cgroup(pc);
	return 1;
}

void memcg_end_filecache_accounting(struct page_cgroup *pc, int unlock)
{
	if (unlock)
		unlock_page_cgroup(pc);

	rcu_read_unlock();
}

and call memcg_start_account_move()/end_account_move() in the start/end of
migrainting chunk of pages.

Bye.
-Kame




















--

From: Vivek Goyal
Date: Wednesday, April 14, 2010 - 7:04 am

Hi Kame-san,

May be I am missing something but how does it solve the issue of making sure
lock_page_cgroup() is not held in interrupt context? IIUC, above code will
make sure that for file cache accouting, lock_page_cgroup() is taken only
if task migration is on. But say task migration is on, and then some IO
completes and we update WRITEBACK stat (i think this is the one which can
be called from interrupt context), then we will still take the
lock_page_cgroup() and again run into the issue of deadlocks?

Thanks
Vivek
--

From: Greg Thelen
Date: Wednesday, April 14, 2010 - 12:31 pm

I agree. I think the lock/unlock_page_cgrpoup() calls suggested by
Kame-san should also include local_irq_save/restore() calls to prevent
the interrupt context deadlock Vivek describes.  These new
local_irq_save/restore() calls would only be used if
move_account_ongoing is set.  They behave just like the optional calls
to lock/unlock_page_cgroup().

--
Greg
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 5:14 pm

On Wed, 14 Apr 2010 10:04:30 -0400
Yes and No.

At "Set", IIRC, almost all updates against DIRTY and WRITBACK accountings
can be done under mapping->tree_lock, which disables IRQ always.
(ex. I don't mention TestSetPageWriteback but account_page_diritied().)
Then, save/restore irq flags is just a cost and no benefit, in such cases.
Of course, there are cases irqs doesn't enabled.

About FILE_MAPPED, it's not updated under mapping->tree_lock. 
So, we'll have race with charge/uncharge. We have to take lock_page_cgroup(), always.


So, I think we'll have 2 or 3 interfaces, finally.
	mem_cgroup_update_stat_fast()  // the caller must disable IRQ and lock_page()
and
	mem_cgroup_update_stat_locked() // the caller has lock_page().
and
	mem_cgroup_update_stat_safe()  // the caller don't have to do anything.

Why update_stat_fast() is for avoiding _unnecessary_ costs.
When we lock a page and disables IRQ, we don't have to do anything.
There are no races. 

But yes, it's complicated. I'd like to see what we can do.


Thanks,
-Kame


--

From: Greg Thelen
Date: Wednesday, April 14, 2010 - 9:22 am

On Wed, Apr 14, 2010 at 2:29 AM, KAMEZAWA Hiroyuki

What is your reason for preferring trylock_page_cgroup()?  I assume
it's for code simplicity, but I wanted to check.

I had though about using trylock_page_cgroup() always, but I think
that would make file_mapped accounting even more fuzzy that it already
it is.  I was trying to retain the current accuracy of file_mapped and
only make new counters, like writeback/dirty/etc (those obtained in

Hi Kame-san,

I like the general approach.  The code I previously gave you appears
to work and is faster than non-root memcgs using mmotm due to mostly

What's the reason for having a per-cpu account_move_ongoing flag?
Would a single system-wide global be sufficient?  I assume the
majority of the time this value will not be changing because
accounting moves are rare.

Perhaps all of the per-cpu variables are packed within a per-cpu
cacheline making accessing it more likely to be local, but I'm not
sure if this is true.

--
Greg
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 5:22 pm

On Wed, 14 Apr 2010 09:22:41 -0700

file_mapped should have different interface as mem_cgroup_update_stat_verrrry_safe().
or some.

I don't think accuracy is important (if it's doesn't go minus) but if people want,

Yes. this value is rarely updated but update is not enough rare to put
this value to read_mostly section. We see cacheline ping-pong by random
placement of global variables. This is performance critical.
Recent updates for percpu variables accessor makes access to percpu 
very efficient. I'd like to make use of it.

Thanks,
-Kame


--

From: Vivek Goyal
Date: Wednesday, April 14, 2010 - 7:05 am

^^^^^^^^^

Vivek
--

From: Greg Thelen
Date: Wednesday, April 14, 2010 - 1:14 pm

Good catch.  I will replace irqs_disabled() with in_interrupt().

Thank you.

--
Greg
--

From: Daisuke Nishimura
Date: Wednesday, April 14, 2010 - 7:40 pm

I think you should check both. __remove_from_page_cache(), which will update
DIRTY, is called with irq disabled(iow, under mapping->tree_lock) but not in
interrupt context.

Anyway, I tend to agree with KAMEZAWA-san: use trylock always(except for FILE_MAPPED),
or add some new interfaces(e.g. mem_cgroup_update_stat_locked/safe...).

Thanks,
Daisuke Nishimura.
--

From: Greg Thelen
Date: Wednesday, April 14, 2010 - 9:48 pm

On Wed, Apr 14, 2010 at 7:40 PM, Daisuke Nishimura

The only reason to use trylock in this case is to prevent deadlock
when running in a context that may have preempted or interrupted a
routine that already holds the bit locked.  In the
__remove_from_page_cache() irqs are disabled, but that does not imply
that a routine holding the spinlock has been preempted.  When the bit
is locked, preemption is disabled.  The only way to interrupt a holder
of the bit for an interrupt to occur (I/O, timer, etc).  So I think

Thank you for the input.  I'm thinking more on this.

--
Greg
--

From: Daisuke Nishimura
Date: Wednesday, April 14, 2010 - 11:21 pm

IIUC, it's would be enough to prevent deadlock where one CPU tries to acquire
the same page cgroup lock. But there is still some possibility where 2 CPUs
can cause dead lock each other(please see the commit e767e056).
IOW, my point is "don't call lock_page_cgroup() under mapping->tree_lock". 

Thanks,
Daisuke Nishimura.
--

From: Greg Thelen
Date: Wednesday, April 14, 2010 - 11:38 pm

On Wed, Apr 14, 2010 at 11:21 PM, Daisuke Nishimura

I see your point.  Thank you for explaining.

--
Greg
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 11:54 pm

On Thu, 15 Apr 2010 15:21:04 +0900
Hmm, maybe worth to try. We may be able to set/clear all DIRTY/WRITBACK bit
on page_cgroup without mapping->tree_lock.
In such case, of course, the page itself should be locked by lock_page().

But.Hmm..for example.

account_page_dirtied() is the best place to mark page_cgroup dirty. But
it's called under mapping->tree_lock.

Another thinking:
I wonder we may have to change our approach for dirty page acccounting.

Please see task_dirty_inc(). It's for per task dirty limiting.
And you'll notice soon that there is no task_dirty_dec().

Making use of lib/proportions.c's proportion calculation as task_dirty limit or
per-bdi dirty limit does is worth to be considered.
This is very simple and can be implemented without problems we have now.
(Need to think about algorithm itself, but it's used and works well.)
We'll never see complicated race condtions.

I know some guys wants "accurate" accounting, but I myself don't want too much.
Using propotions.c can offer us unified approach with per-task dirty accounting.
or per-bid dirty accouting.

If we do so, memcg will have interface like per-bdi dirty ratio (see below)
[kamezawa@bluextal kvm2]$ ls /sys/block/dm-0/bdi/
max_ratio  min_ratio  power  read_ahead_kb  subsystem  uevent

Maybe
  memory.min_ratio
  memory.max_ratio

And use this instead of  task_dirty_limit(current, pbdi_dirty); As

	if (mem_cgroup_dirty_ratio_support(current)) // return 0 if root cgroup
		memcg_dirty_limit(current, pbdi_dirty, xxxx?);
	else
		task_dirty_limit(current, pbdi_diryt)

To be honest, I don't want to increase caller of lock_page_cgroup() and don't
want to see complicated race conditions.

Thanks,
-Kame




--

From: Greg Thelen
Date: Friday, April 23, 2010 - 1:17 pm

Hello Kame-san,

This is an interesting idea.  If this applies to memcg dirty accounting,
then would it also apply to system-wide dirty accounting?  I don't think
so, but I wanted to float the idea.  It looks like this proportions.c
code is good is at comparing the rates of events (for example: per-task
dirty page events).  However, in the case of system-wide dirty
accounting we also want to consider the amount of dirty memory, not just
the rate at which it is being dirtied.

Cgroup dirty page accounting imposes the following additional accounting
complexities:
* hierarchical accounting
* page migration between cgroups

For per-memcg dirty accounting, are you thinking that each mem_cgroup
would have a struct prop_local_single to represent a memcg's dirty
memory usage relative to a system wide prop_descriptor?

My concern is that we will still need an efficient way to determine the
mem_cgroup associated with a page under a variety of conditions (page
fault path for new mappings, softirq for dirty page writeback).

Currently -rc4 and -mmotm use a non-irq safe lock_page_cgroup() to
protect a page's cgroup membership.  I think this will cause problems as
we add more per-cgroup stats (dirty page counts, etc) that are adjusted
in irq handlers.  Proposed approaches include:
1. use try-style locking.  this can lead to fuzzy counters, which some
   do not like.  Over time these fuzzy counter may drift.

2. mask irq when calling lock_page_cgroup().  This has some performance
   cost, though it may be small (see below).

3. because a page's cgroup membership rarely changes, use RCU locking.
   This is fast, but may be more complex than we want.

The performance of simple irqsave locking or more advanced RCU locking
is similar to current locking (non-irqsave/non-rcu) for several
workloads (kernel build, dd).  Using a micro-benchmark some differences
are seen:
* irqsave is 1% slower than mmotm non-irqsave/non-rcu locking.
* RCU locking is 4% faster than mmotm non-irqsave/non-rcu ...
From: Peter Zijlstra
Date: Friday, April 23, 2010 - 1:54 pm

Insta-bug here, there is nothing guaranteeing we're not preemptible

--

From: Greg Thelen
Date: Saturday, April 24, 2010 - 8:53 am

Good catch.  A call to smp_wmb() belongs in
mem_cgroup_begin_page_cgroup_reassignment() like so:

static void mem_cgroup_begin_page_cgroup_reassignment(void)
{
	VM_BUG_ON(mem_cgroup_account_move_ongoing);
	mem_cgroup_account_move_ongoing = true;
	smp_wmb();
	synchronize_rcu();
}


The irq-disable is not needed for current code, only for upcoming
per-memcg dirty page accounting which will be refactoring
mem_cgroup_update_file_mapped() into a generic memcg stat update
routine.  I assume these locking changes should be bundled with the
dependant memcg dirty page accounting changes which need the ability to
update counters from irq routines.  I'm sorry I didn't make that more

My addition of VM_BUG_ON() was to programmatic assert what the comment
was asserting.  All callers of mem_cgroup_update_file_mapped() hold the
pte spinlock, which disables preemption.  So I don't think this
VM_BUG_ON() will cause panic.  A function level comment for
mem_cgroup_update_file_mapped() declaring that "callers must have

--
Greg
--

From: Peter Zijlstra
Date: Friday, April 23, 2010 - 1:57 pm

btw, you know synchronize_rcu() is _really_ slow?

--

From: KAMEZAWA Hiroyuki
Date: Friday, April 23, 2010 - 7:22 pm

On Fri, 23 Apr 2010 22:57:06 +0200
IIUC, this is called once per an event when task is moved and we have
to move accouting information...and once per an event when we call
rmdir() to destroy cgroup. 

So, this is not frequenctly called.
(hooks to migration in this patch is removable.)

Thanks,
-Kame

--

From: Peter Zijlstra
Date: Friday, April 23, 2010 - 2:19 pm

Correct, the whole proportion thing is purely about comparing rates of

Depending on what architecture you care about local_t might also be an
option, it uses per-cpu irq/nmi safe instructions (and falls back to
local_irq_save/restore for architectures lacking this support).



--

From: KAMEZAWA Hiroyuki
Date: Friday, April 23, 2010 - 7:19 pm

On Fri, 23 Apr 2010 13:17:38 -0700

I think the direction itself is good. But, about FILE_MAPPED, it's out of
control of radix-tree. So, we need lock_page_cgroup/unlock_page_cgroup
always for avoiding race with charge/uncharge.

And you don't need to call "begin/end reassignment" at page migration.
(I'll post a patch for modifing codes around page-migration. It has BUG now.)

But you have to call begin/end reassignment at force_empty. It does account move.

Thanks,

--

From: Balbir Singh
Date: Wednesday, April 14, 2010 - 7:44 am

Since this is just stats can we used deferred updates?
        else

Do charging + any deferred charges pending in

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Sunday, March 14, 2010 - 7:36 pm

On Mon, 15 Mar 2010 00:26:37 +0100
Okay, thanks. This seems good result. Optimization for children can be done under
I hope, too.

Thanks,
-Kame

--

From: Andrea Righi
Date: Monday, March 15, 2010 - 3:02 am

OK, I'll wait a bit to see if someone has other fixes or issues and post
a new version soon including these small changes.

Thanks,
-Andrea
--

From: Vivek Goyal
Date: Monday, March 15, 2010 - 10:12 am

For me even with this version I see that group with 100M limit is getting
much more BW.

root cgroup
==========
#time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 55.7979 s, 77.0 MB/s

real	0m56.209s

test1 cgroup with memory limit of 100M
======================================
# time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 20.9252 s, 205 MB/s

real	0m21.096s

Note, these two jobs are not running in parallel. These are running one
after the other.

--

From: Vivek Goyal
Date: Monday, March 15, 2010 - 10:19 am

Ok, here is the strange part. I am seeing similar behavior even without
your patches applied.

root cgroup
==========
#time dd if=/dev/zero of=/root/zerofile bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 56.098 s, 76.6 MB/s

real	0m56.614s

test1 cgroup with memory limit 100M
===================================
# time dd if=/dev/zero of=/root/zerofile1 bs=4K count=1M
4294967296 bytes (4.3 GB) copied, 19.8097 s, 217 MB/s

real	0m19.992s

--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 4:54 am

This is strange, did you flish the cache between the two runs?
NOTE: Since the files are same, we reuse page cache from the
other cgroup.

-- 
	Three Cheers,
	Balbir
--

From: Vivek Goyal
Date: Wednesday, March 17, 2010 - 6:34 am

Files are different. Note suffix "1".

Vivek
--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 11:53 am

Thanks, I'll get the perf output and see what I get. 

-- 
	Three Cheers,
	Balbir
--

From: Vivek Goyal
Date: Wednesday, March 17, 2010 - 12:15 pm

One more thing I noticed and that is, it happens only if we limit the
memory of cgroup to 100M. If same cgroup test1 is unlimited memory 
thing, then it did not happen.

I also did not notice this happening on other system where I have 4G of
memory. So it also seems to be related with only bigger configurations.

Thanks
--

From: Balbir Singh
Date: Wednesday, March 17, 2010 - 12:17 pm

The data is not always repeatable at my end. Are you able to modify
the order and get repeatable results?

In fact, I saw

for cgroup != root
------------------
4294967296 bytes (4.3 GB) copied, 120.359 s, 35.7 MB/s

for cgroup = root
-----------------
4294967296 bytes (4.3 GB) copied, 84.504 s, 50.8 MB/s

This is without the patches applied.


-- 
	Three Cheers,
	Balbir
--

From: Vivek Goyal
Date: Wednesday, March 17, 2010 - 12:48 pm

I lost the access to the big configuration machine but on that machine I
could reproduce it all the time. But on smaller machine (4core, 4G), I 
could not.

I will do some more tests later.

Vivek
--

From: Balbir Singh
Date: Tuesday, March 16, 2010 - 11:44 pm

I'll test this, I have a small machine to test on at the moment, I'll
revert back with data. 

-- 
	Three Cheers,
	Balbir
--

Previous thread: HTC Dream: Hook up LEDs by Pavel Machek on Sunday, March 14, 2010 - 1:16 pm. (1 message)

Next thread: [2.6.34-rc1 REGRESSION] alsa: no mixer controls (and no sound) with 2.6.24-rc1+ on Realtek ALC260 by Alessandro Guido on Sunday, March 14, 2010 - 5:06 pm. (3 messages)