Re: [PATCH] memcg: use ID in page_cgroup

Previous thread: [PATCH/RFCv4 0/6] The Contiguous Memory Allocator framework by Michal Nazarewicz on Friday, August 20, 2010 - 2:50 am. (60 messages)

Next thread: [GIT PULL] sh updates for 2.6.36-rc2 by Paul Mundt on Friday, August 20, 2010 - 5:04 am. (1 message)
From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 2:55 am

This is v5.

Sorry for delaying...but I had time for resetting myself and..several
changes are added. I think this version is simpler than v4.

Major changes from v4 is 
 a) added kernel/cgroup.c hooks again. (for b)
 b) make RCU aware. previous version seems dangerous in an extreme case.

Then, codes are updated. Most of changes are related to RCU.

Patch brief view:
 1. add hooks to kernel/cgroup.c for ID management.
 2. use ID-array in memcg.
 3. record ID to page_cgroup rather than pointer.
 4. make update_file_mapped to be RCU aware routine instead of spinlock.
 5. make update_file_mapped as general-purpose function.


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 2:58 am

CC'ed to Paul Menage and Li Zefan.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

When cgroup subsystem use ID (ss->use_id==1), each css's ID is assigned
after successful call of ->create(). css_ID is tightly coupled with
css struct itself but it is allocated by ->create() call, IOW,
per-subsystem special allocations.

To know css_id before creation, this patch adds id_attached() callback.
after css_ID allocation. This will be used by memory cgroup's quick lookup
routine.

Maybe you can think of other implementations as
	- pass ID to ->create()
	or
	- add post_create()
	etc...
But when considering dirtiness of codes, this straightforward patch seems
good to me. If someone wants post_create(), this patch can be replaced.

Changelog: 20100820
 - new approarch.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 Documentation/cgroups/cgroups.txt |   10 ++++++++++
 include/linux/cgroup.h            |    1 +
 kernel/cgroup.c                   |    5 ++++-
 3 files changed, 15 insertions(+), 1 deletion(-)

Index: mmotm-0811/Documentation/cgroups/cgroups.txt
===================================================================
--- mmotm-0811.orig/Documentation/cgroups/cgroups.txt
+++ mmotm-0811/Documentation/cgroups/cgroups.txt
@@ -621,6 +621,16 @@ and root cgroup. Currently this will onl
 the default hierarchy (which never has sub-cgroups) and a hierarchy
 that is being created/destroyed (and hence has no sub-cgroups).
 
+void id_attached(struct cgroup_subsys *ss, struct cgroup *root)
+(cgroup_mutex and ss->hierarchy_mutex held by caller)
+(called only when ss->use_id=1)
+
+Called when css_id is attached to css. Because css_id is assigned
+against "css", css_id is not available until ->create() is called.
+If subsystem wants to make use of ID at createtion time, use
+this handler. This handler will be called after css_id is assigned
+to css. Not necessary to be implemented in usual(see memcontrol.c)
+
 4. Questions
 ============
 ...
From: Greg Thelen
Date: Tuesday, August 24, 2010 - 12:19 am

Maybe this sounds better?

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 12:18 am

On Tue, 24 Aug 2010 00:19:01 -0700


Thank you for review.

-Kame

--

From: Li Zefan
Date: Tuesday, August 24, 2010 - 2:04 am

Acked-by: Li Zefan <lizf@cn.fujitsu.com>


Maybe pass the id number to id_attached() is better.

And actually the @ss argument is not necessary, because the memcg's
id_attached() handler of course knows it's dealing with the memory
cgroup subsystem.

So I suspect we can just remove all the @ss from all the callbacks..

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 4:58 pm

On Tue, 24 Aug 2010 17:04:52 +0800

I agree. But could you write "remove ss" patch later ?
It seems an design change. I leave "ss" as this for now.

Thanks
-Kame


--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 5:11 pm

Yes, I don't think any subsystem uses these. They dated originally
from when, as part of the initial cgroups framwork, I included a
library that could wrap a mostly-unmodified CKRM resource controller
into a cgroups subsystem, at which point the callback code didn't
necessarily know which subsystem it was being called for. But that's
obsolete now.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 5:17 pm

On Tue, 24 Aug 2010 17:11:22 -0700

Hmm, then, should I remove it in the next version, or leave it as it is
now and should be removed by another total clean up ?

(IOW, mixture of inconsistent interface is O.K. ?)

Thanks,
-Kame

--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 5:25 pm

On Tue, Aug 24, 2010 at 5:17 PM, KAMEZAWA Hiroyuki

I think these are two separate issues. A cleanup to remove the ss
parameters from the existing callback methods would be separate from
your proposed id callback, which I think is better done (as I
mentioned above) by passing the id to create()

Paul
--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 5:09 pm

On Fri, Aug 20, 2010 at 2:58 AM, KAMEZAWA Hiroyuki

I think I'd prefer the approach where any necessary css_ids are
allocated prior to calling any create methods (which gives the
additional advantage of removing the need to roll back partial
creation of a cgroup in the event of alloc_css_id() failing) and then
passed in to the create() method. The main cgroups framework would
still be responsible for actually filling the css->id field with the
allocated id.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 5:20 pm

On Tue, 24 Aug 2010 17:09:25 -0700

Hmm, sure. I'll change the ->create() interface.  O.K. ?

Thanks,
-Kame

--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 5:34 pm

On Tue, Aug 24, 2010 at 5:20 PM, KAMEZAWA Hiroyuki

Hmm. An alternative (possibly cleaner) would be:

1) add a css_size field in cgroup_subsys that contains the size of the
per-subsystem structure
2) change cgroups to allocate and populate the css *before* calling
create(), since it now knows the actual size
3) simplify all the subsystem create() methods since they no longer
have to worry about allocation or out-of-memory handling
4) also add a top_css field in cgroups that allows cpusets to use the
statically-allocated top_cpuset since it's initialized prior to memory
allocation being reliable

This avoids us having to pass in any new parameters to the create()
method in future since they can be populated in the CSS.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 5:37 pm

On Tue, 24 Aug 2010 17:34:38 -0700

Ou...I'm sorry but I would like to use attach_id() for this time.
Forgive me, above seems a big change.
I'd like to write a series of patch to do above, later.
At least, to do a trial.

Thanks,
-Kame

--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 5:46 pm

On Tue, Aug 24, 2010 at 5:37 PM, KAMEZAWA Hiroyuki

Sure, for testing outside the tree. I don't think introducing new code
into mainline that's intentionally ugly and cutting corners is a good
idea.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 6:03 pm

On Tue, 24 Aug 2010 17:46:03 -0700

Hmm. How this pseudo code looks like ? This passes "new id" via
cgroup->subsys[array] at creation. (Using union will be better, maybe).

---
 include/linux/cgroup.h |    1 +
 kernel/cgroup.c        |   26 +++++++++++++++++++-------
 2 files changed, 20 insertions(+), 7 deletions(-)

Index: mmotm-0811/include/linux/cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/cgroup.h
+++ mmotm-0811/include/linux/cgroup.h
@@ -508,6 +508,7 @@ struct cgroup_subsys {
 	struct cgroupfs_root *root;
 	struct list_head sibling;
 	/* used when use_id == true */
+	int max_id;
 	struct idr idr;
 	spinlock_t id_lock;
 
Index: mmotm-0811/kernel/cgroup.c
===================================================================
--- mmotm-0811.orig/kernel/cgroup.c
+++ mmotm-0811/kernel/cgroup.c
@@ -3335,19 +3335,31 @@ static long cgroup_create(struct cgroup 
 		set_bit(CGRP_NOTIFY_ON_RELEASE, &cgrp->flags);
 
 	for_each_subsys(root, ss) {
-		struct cgroup_subsys_state *css = ss->create(ss, cgrp);
+		struct cgroup_subsys_state *css;
 
-		if (IS_ERR(css)) {
-			err = PTR_ERR(css);
-			goto err_destroy;
-		}
-		init_cgroup_css(css, ss, cgrp);
 		if (ss->use_id) {
 			err = alloc_css_id(ss, parent, cgrp);
 			if (err)
 				goto err_destroy;
+			/*
+ 			 * Here, created css_id is recorded into
+ 			 * cgrp->subsys[ss->subsys_id]
+			 * array and passed to subsystem.
+			 */
+		} else
+			cgrp->subsys[ss->subsys_id] = NULL;
+
+		css = ss->create(ss, cgrp);
+
+		if (IS_ERR(css)) {
+			/* forget preallocated id */
+			if (cgrp->subsys[ss->subsys_id])
+				free_css_id_direct((struct css_id *)
+					cgrp->subsys[ss->subsys_id]);
+			err = PTR_ERR(css);
+			goto err_destroy;
 		}
-		/* At error, ->destroy() callback has to free assigned ID. */
+		init_cgroup_css(css, ss, cgrp);
 	}
 
 	cgroup_lock_hierarchy(root);

	

--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 6:35 pm

On Tue, Aug 24, 2010 at 6:03 PM, KAMEZAWA Hiroyuki

That's rather ugly. I was thinking of something more like this. (Not
even compiled yet, and the only subsystem updated is cpuset).

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index ed3e92e..063d9f2 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -458,8 +458,7 @@ void cgroup_release_and_wakeup_rmdir(struct
cgroup_subsys_state *css);
  */

 struct cgroup_subsys {
-	struct cgroup_subsys_state *(*create)(struct cgroup_subsys *ss,
-						  struct cgroup *cgrp);
+	int (*create)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*pre_destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	void (*destroy)(struct cgroup_subsys *ss, struct cgroup *cgrp);
 	int (*can_attach)(struct cgroup_subsys *ss, struct cgroup *cgrp,
@@ -513,6 +512,12 @@ struct cgroup_subsys {

 	/* should be defined only by modular subsystems */
 	struct module *module;
+
+	/* Total size of the subsystem's CSS object */
+	size_t css_size;
+
+	/* If non-NULL, the CSS to use for the root cgroup */
+	struct cgroup_subsys_state *root_css;
 };

 #define SUBSYS(_x) extern struct cgroup_subsys _x ## _subsys;
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 192f88c..c589a41 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -3307,6 +3307,7 @@ static long cgroup_create(struct cgroup *parent,
struct dentry *dentry,
 			     mode_t mode)
 {
 	struct cgroup *cgrp;
+	struct cgroup_subsys_state *new_css[CGROUP_SUBSYS_COUNT] = {};
 	struct cgroupfs_root *root = parent->root;
 	int err = 0;
 	struct cgroup_subsys *ss;
@@ -3325,6 +3326,16 @@ static long cgroup_create(struct cgroup
*parent, struct dentry *dentry,

 	mutex_lock(&cgroup_mutex);

+	for_each_subsys(root, ss) {
+		int id = ss->subsys_id;
+		new_css[id] = kzalloc(ss->css_size, GFP_KERNEL);
+		if (!new_css) {
+			/* Failed to allocate memory */
+			err = -ENOMEM;
+			goto err_destroy;
+		}
+	}
+
 	init_cgroup_housekeeping(cgrp);

 ...
From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 6:42 pm

On Tue, 24 Aug 2010 18:35:00 -0700

Hmm, but placing css and subsystem's its own structure in different cache line
can increase cacheline/TLB miss, I think.

I wonder I should stop this patch series and do small thing.
I prefer to call alloc_css_id() by ->create() call by subsys's its own decistion
is much better and cleaner. (as my original design)

mem_cgroup_create()
{

	cgroup_attach_css_id(ss, cgrp, &mem->css);
}

And then, there will be no difficulty.

Do we have to call alloc_css_id() in kernel/cgroup.c ?

Thanks,
-Kame

--

From: Paul Menage
Date: Tuesday, August 24, 2010 - 6:52 pm

On Tue, Aug 24, 2010 at 6:42 PM, KAMEZAWA Hiroyuki

My patch shouldn't affect the memory placement of any structures.
struct cgroup_subsys_state is still embedded in the per-subsystem

I guess not, if no-one's using it except for memcg. The general
approach of allocating the CSS in cgroup.c rather than in every
subsystem is something that I'd like to do separately, though.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 7:29 pm

On Tue, 24 Aug 2010 18:52:20 -0700

I'll use this for v6.
When you move css's allcation up to kernel/cgroup.c, this can be moved, too.
My regret is that I should argue CSS_ID is very tightly coupled with css itself
and should use this design from the 1st version.

-Kame

==

---
 block/blk-cgroup.c     |    6 ++++++
 include/linux/cgroup.h |   16 +++++++++-------
 kernel/cgroup.c        |   35 +++++++++--------------------------
 mm/memcontrol.c        |    4 ++++
 4 files changed, 28 insertions(+), 33 deletions(-)

Index: mmotm-0811/kernel/cgroup.c
===================================================================
--- mmotm-0811.orig/kernel/cgroup.c
+++ mmotm-0811/kernel/cgroup.c
@@ -770,9 +770,6 @@ static struct backing_dev_info cgroup_ba
 	.capabilities	= BDI_CAP_NO_ACCT_AND_WRITEBACK,
 };
 
-static int alloc_css_id(struct cgroup_subsys *ss,
-			struct cgroup *parent, struct cgroup *child);
-
 static struct inode *cgroup_new_inode(mode_t mode, struct super_block *sb)
 {
 	struct inode *inode = new_inode(sb);
@@ -3257,7 +3254,8 @@ static void init_cgroup_css(struct cgrou
 	css->cgroup = cgrp;
 	atomic_set(&css->refcnt, 1);
 	css->flags = 0;
-	css->id = NULL;
+	if (!ss->use_id)
+		css->id = NULL;
 	if (cgrp == dummytop)
 		set_bit(CSS_ROOT, &css->flags);
 	BUG_ON(cgrp->subsys[ss->subsys_id]);
@@ -3342,11 +3340,6 @@ static long cgroup_create(struct cgroup 
 			goto err_destroy;
 		}
 		init_cgroup_css(css, ss, cgrp);
-		if (ss->use_id) {
-			err = alloc_css_id(ss, parent, cgrp);
-			if (err)
-				goto err_destroy;
-		}
 		/* At error, ->destroy() callback has to free assigned ID. */
 	}
 
@@ -3709,17 +3702,6 @@ int __init_or_module cgroup_load_subsys(
 
 	/* our new subsystem will be attached to the dummy hierarchy. */
 	init_cgroup_css(css, ss, dummytop);
-	/* init_idr must be after init_cgroup_css because it sets css->id. */
-	if (ss->use_id) {
-		int ret = cgroup_init_idr(ss, css);
-		if (ret) {
-			dummytop->subsys[ss->subsys_id] = ...
From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 2:59 am

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, memory cgroup has an ID per cgroup and make use of it at
 - hierarchy walk,
 - swap recording.

This patch is for making more use of it. The final purpose is
to replace page_cgroup->mem_cgroup's pointer to an unsigned short.

This patch caches a pointer of memcg in an array. By this, we
don't have to call css_lookup() which requires radix-hash walk.
This saves some amount of memory footprint at lookup memcg via id.

Changelog: 20100811
 - adjusted onto mmotm-2010-08-11
 - fixed RCU related parts.
 - use attach_id() callback.

Changelog: 20100804
 - fixed description in init/Kconfig

Changelog: 20100730
 - fixed rcu_read_unlock() placement.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 init/Kconfig    |   10 ++++++
 mm/memcontrol.c |   83 ++++++++++++++++++++++++++++++++++++++++++--------------
 2 files changed, 73 insertions(+), 20 deletions(-)

Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -195,6 +195,7 @@ static void mem_cgroup_oom_notify(struct
  */
 struct mem_cgroup {
 	struct cgroup_subsys_state css;
+	int	valid; /* for checking validness under RCU access.*/
 	/*
 	 * the counter to account for memory usage
 	 */
@@ -294,6 +295,29 @@ static bool move_file(void)
 					&mc.to->move_charge_at_immigrate);
 }
 
+/* 0 is unused */
+static atomic_t mem_cgroup_num;
+#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
+static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
+
+/* Must be called under rcu_read_lock */
+static struct mem_cgroup *id_to_memcg(unsigned short id)
+{
+	struct mem_cgroup *ret;
+	/* see mem_cgroup_free() */
+	ret = rcu_dereference_check(mem_cgroups[id], rch_read_lock_held());
+	if (likely(ret && ret->valid))
+		return ret;
+	return NULL;
+}
+
+static void register_memcg_id(struct mem_cgroup ...
From: Daisuke Nishimura
Date: Sunday, August 22, 2010 - 8:35 pm

Could you explain why we need rcu_read_unlock() before mem_cgroup_put() ?
I suspect that it's because mem_cgroup_put() can free the memcg, but do we
need mem->valid then ?


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 23, 2010 - 4:51 pm

On Mon, 23 Aug 2010 12:35:33 +0900
mem_cgroup_put() may call synchronize_rcu(). So, we have to unlock before it.

Thanks,
-Kame



--

From: Daisuke Nishimura
Date: Monday, August 23, 2010 - 5:19 pm

Ah, I see. Thank you for your explanation.

Daisuke Nishimura.
--

From: Greg Thelen
Date: Tuesday, August 24, 2010 - 12:44 am

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 12:42 am

On Tue, 24 Aug 2010 00:44:59 -0700
yes, mayb overwritten by following patch.. thank you for finding.


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 3:01 am

I have an idea to remove page_cgroup->page pointer, 8bytes reduction per page.
But it will be after this work.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, addresses of memory cgroup can be calculated by their ID without complex.
This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
On 64bit architecture, this offers us more 6bytes room per page_cgroup.
Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
some light-weight concurrent access.

We may able to move this id onto flags field but ...go step by step.

Changelog: 20100811
 - using new rcu APIs, as rcu_dereference_check() etc.
Changelog: 20100804
 - added comments to page_cgroup.h
Changelog: 20100730
 - fixed some garbage added by debug code in early stage

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |    6 ++++-
 mm/memcontrol.c             |   52 +++++++++++++++++++++++++++-----------------
 mm/page_cgroup.c            |    2 -
 3 files changed, 38 insertions(+), 22 deletions(-)

Index: mmotm-0811/include/linux/page_cgroup.h
===================================================================
--- mmotm-0811.orig/include/linux/page_cgroup.h
+++ mmotm-0811/include/linux/page_cgroup.h
@@ -9,10 +9,14 @@
  * page_cgroup helps us identify information about the cgroup
  * All page cgroups are allocated at boot or memory hotplug event,
  * then the page cgroup for pfn always exists.
+ *
+ * TODO: It seems ID for cgroup can be packed into "flags". But there will
+ * be race between assigning ID <-> set/clear flags. Please be careful.
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
+	unsigned short mem_cgroup;	/* ID of assigned memory cgroup */
+	unsigned short blk_cgroup;	/* Not Used..but will be. */
 	struct page *page;
 	struct list_head lru;		/* per cgroup LRU list */
 };
Index: ...
From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 3:05 am

But title....this was [PATCH 3/5].
Sorry.
-kame

--

From: Daisuke Nishimura
Date: Sunday, August 22, 2010 - 10:32 pm

On Fri, 20 Aug 2010 19:01:32 +0900
Another off topic. I think we can reduce the size of mem_cgroup by packing
Do you mean, "Set safe==true if we can ensure by some locks that the id can be

It might be better to add

	BUG_ON(newid->id != 1)

in cgroup.c::cgroup_init_idr().


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 23, 2010 - 4:52 pm

On Mon, 23 Aug 2010 14:32:37 +0900


Why ??

Thanks,
-Kame



--

From: Daisuke Nishimura
Date: Monday, August 23, 2010 - 6:14 pm

Just to make sure that the root css has id==1. mem_cgroup_is_rootid() make
use of the fact.
I'm sorry if I miss something.

Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 23, 2010 - 6:54 pm

On Tue, 24 Aug 2010 10:14:25 +0900

Hmm. The function allocating ID does

4530 static struct css_id *get_new_cssid(struct cgroup_subsys *ss, int depth)
4531 {
==
4546         spin_lock(&ss->id_lock);
4547         /* Don't use 0. allocates an ID of 1-65535 */
4548         error = idr_get_new_above(&ss->idr, newid, 1, &myid);
4549         spin_unlock(&ss->id_lock);
==

and allocates ID above "1", always.

Adding BUG_ON(newid->id != 1) will mean that we doubt the bitmap function and
consider possibility that new->id == 0.

But, we're 100% sure that it never happens.

I don't think adding a comment is a right thing to do.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Monday, August 23, 2010 - 9:04 pm

On Tue, 24 Aug 2010 10:54:05 +0900
Okey, I don't have strong requirement to add BUG_ON() anyway.

These patches looks good to me except for some minor points I've commented.

Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 23, 2010 - 11:05 pm

On Tue, 24 Aug 2010 13:04:02 +0900

Thank you for review. I'll repost after applying your feedback.

-Kame

--

From: Greg Thelen
Date: Tuesday, August 24, 2010 - 12:47 am

Should this be VM_BUG_ON(!rcu_read_lock_held())?  I suspect that

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 12:51 am

On Tue, 24 Aug 2010 00:47:50 -0700


No. 
 - lock_page_cgroup() is for keeping page_cgroup's status stable.
 - rcu_read_lock() is for delaying to discard mem_cgroup object.

rcu_read_lock() is just for delaying to discard object, not for avoiding
racy updates. All _updates_ requires proper lock or speculative logic as
atomic_inc_not_zero() etc... Basically, RCU is for avoiding use-after-free.

Thanks,
-Kame

--

From: Greg Thelen
Date: Tuesday, August 24, 2010 - 1:35 am

It seems like mem_cgroup_move_account() is not balanced.  Why is
lock_page_cgroup(pc) used to lock but rcu_read_unlock() used to unlock?
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 1:38 am

On Tue, 24 Aug 2010 01:35:37 -0700

Nice catch. It's bug. It seems my eyes were corrupted..
Will be fixed in the next version. Sorry for bad code.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 3:02 am

No changes from v4.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At accounting file events per memory cgroup, we need to find memory cgroup
via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().

But, considering the context which page-cgroup for files are accessed,
we can use alternative light-weight mutual execusion in the most case.
At handling file-caches, the only race we have to take care of is "moving"
account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
update is done while the page-cache is in stable state, we don't have to
take care of race with charge/uncharge.

Unlike charge/uncharge, "move" happens not so frequently. It happens only when
rmdir() and task-moving (with a special settings.)
This patch adds a race-checker for file-cache-status accounting v.s. account
moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move 
  1. Increment it before start moving
  2. Call synchronize_rcu()
  3. Decrement it after the end of moving.
By this, file-status-counting routine can check it needs to call
lock_page_cgroup(). In most case, I doesn't need to call it.


Changelog: 20100804
 - added a comment for possible optimization hint.
Changelog: 20100730
 - some cleanup.
Changelog: 20100729
 - replaced __this_cpu_xxx() with this_cpu_xxx
   (because we don't call spinlock)
 - added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   79 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 67 insertions(+), 12 deletions(-)

Index: mmotm-0811/mm/memcontrol.c
===================================================================
--- mmotm-0811.orig/mm/memcontrol.c
+++ mmotm-0811/mm/memcontrol.c
@@ -90,6 +90,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout ...
From: Daisuke Nishimura
Date: Monday, August 23, 2010 - 1:50 am

This patch looks good to me, but I have one question.

Why do we need to acquire sc.lock inside mem_cgroup_(start|end)_move() ?
These functions doesn't access mc.*.

Thanks,
Daisuke Nishimura.

On Fri, 20 Aug 2010 19:02:56 +0900
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 23, 2010 - 4:49 pm

On Mon, 23 Aug 2010 17:50:15 +0900

just reusing a lock to update status. If you don't like, I'll add a new lock.

Thanks,

--

From: Daisuke Nishimura
Date: Monday, August 23, 2010 - 5:19 pm

On Tue, 24 Aug 2010 08:49:16 +0900
I see. I think it would be enough just to add some comments about it.


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Friday, August 20, 2010 - 3:03 am

No changes from v4.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
Using a unified macro and more generic names.
All counters will have the same rule for updating.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |   24 ++++++++++++++++++++++
 include/linux/page_cgroup.h |   19 ++++++++++++------
 mm/memcontrol.c             |   46 ++++++++++++++++++--------------------------
 3 files changed, 56 insertions(+), 33 deletions(-)

Index: mmotm-0811/include/linux/memcontrol.h
===================================================================
--- mmotm-0811.orig/include/linux/memcontrol.h
+++ mmotm-0811/include/linux/memcontrol.h
@@ -25,6 +25,30 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/*
+ * Per-cpu Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,   /* A check for locking move account/status */
+	/* When you add new member for file-stat, please update page_cgroup.h */
+	MEM_CGROUP_FSTAT_BASE,
+	MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
+	MEM_CGROUP_FSTAT_END,
+	MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
+};
+
+#define MEMCG_FSTAT_IDX(idx)	((idx) - MEM_CGROUP_FSTAT_BASE)
+#define NR_FILE_FLAGS_MEMCG ((MEM_CGROUP_FSTAT_END - MEM_CGROUP_FSTAT_BASE))
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
Index: ...
From: Balbir Singh
Date: Tuesday, August 24, 2010 - 12:46 am

Thanks for being persistent, will review the patches with comments in
the relevant patches. 

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 24, 2010 - 12:59 am

On Tue, 24 Aug 2010 13:16:17 +0530

Thank you. I may be able to post v6, tomorrow.

-Kame

--

Previous thread: [PATCH/RFCv4 0/6] The Contiguous Memory Allocator framework by Michal Nazarewicz on Friday, August 20, 2010 - 2:50 am. (60 messages)

Next thread: [GIT PULL] sh updates for 2.6.36-rc2 by Paul Mundt on Friday, August 20, 2010 - 5:04 am. (1 message)