Re: [PATCH -mm 3/5] memcg scalable file stat accounting method

Previous thread: GFS2: Pre-pull patch posting by Steven Whitehouse on Monday, August 2, 2010 - 2:27 am. (9 messages)

Next thread: [PATCH] nconfig: Fix segfault when menu is empty by Andrej Gelenberg on Monday, August 2, 2010 - 2:59 am. (2 messages)
From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 3:11 am

This is v3. removed terrble garbages from v2 and tested.(no big changes)

Now, it's merge-window and I'll have to maintain this in my box for a while.
I'll continue to update this. Maybe we can make new progress after LinuxCon.
(And I'll be busy for a while.)

This set has 2+1 purposes.
 1. re-desgin struct page_cgroup and makes room for blocckio-cgroup ID.
 2. implement quick updating method for memcg's file stat.
 3. optionally? use spin_lock instead of bit_spinlock.

Plans after this.

 1. check influence of Mel's new writeback method.
    I think we'll see OOM easier. IIUC, memory cgroup needs a thread like kswapd
    to do background writeback or low-high watermark.
    (By this, we can control priority of background writeout thread priority
     by CFS. This is very good.)

 2. implementing dirty_ratio.
    Now, Greg Thelen is working on. One of biggest problems of previous trial was
    update cost of status. I think this patch set can reduce it.

 3. record blockio cgroup's ID.
    Ikeda posted one. IIUC, it requires some consideration on (swapin)readahead
    for assigning IDs. But it seemed to be good in general.

Importance is in this order in my mind. But all aboves can be done in parallel.

Beyond that, some guys has problem with file-cache-control. If it need to use
account migration, we have to take care of races.


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 3:13 am

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, memory cgroup has an ID per cgroup and make use of it at
 - hierarchy walk,
 - swap recording.

This patch is for making more use of it. The final purpose is
to replace page_cgroup->mem_cgroup's pointer to an unsigned short.

This patch caches a pointer of memcg in an array. By this, we
don't have to call css_lookup() which requires radix-hash walk.
This saves some amount of memory footprint at lookup memcg via id.

Changelog: 20100730
 - fixed rcu_read_unlock() placement.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 init/Kconfig    |   11 +++++++++++
 mm/memcontrol.c |   48 ++++++++++++++++++++++++++++++++++--------------
 2 files changed, 45 insertions(+), 14 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -292,6 +292,30 @@ static bool move_file(void)
 					&mc.to->move_charge_at_immigrate);
 }
 
+/* 0 is unused */
+static atomic_t mem_cgroup_num;
+#define NR_MEMCG_GROUPS (CONFIG_MEM_CGROUP_MAX_GROUPS + 1)
+static struct mem_cgroup *mem_cgroups[NR_MEMCG_GROUPS] __read_mostly;
+
+static struct mem_cgroup *id_to_memcg(unsigned short id)
+{
+	/*
+	 * This array is set to NULL when mem_cgroup is freed.
+	 * IOW, there are no more references && rcu_synchronized().
+	 * This lookup-caching is safe.
+	 */
+	if (unlikely(!mem_cgroups[id])) {
+		struct cgroup_subsys_state *css;
+
+		rcu_read_lock();
+		css = css_lookup(&mem_cgroup_subsys, id);
+		rcu_read_unlock();
+		if (!css)
+			return NULL;
+		mem_cgroups[id] = container_of(css, struct mem_cgroup, css);
+	}
+	return mem_cgroups[id];
+}
 /*
  * Maximum loops in mem_cgroup_hierarchical_reclaim(), used for soft
  * limit reclaim to prevent infinite loops, if they ever occur.
@@ -1824,18 +1848,7 @@ static void mem_cgroup_cancel_charge(str
  * it's concern. (dropping refcnt from swap ...
From: Balbir Singh
Date: Monday, August 2, 2010 - 8:22 pm

It is a memory versus speed tradeoff, but if the number of created
cgroups is low, it might not be all that slow, besides we do that for
swap_cgroup anyway - no?
 

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 8:21 pm

On Tue, 3 Aug 2010 08:52:16 +0530

In following patch, pc->page_cgroup is changed from pointer to ID.
Then, this lookup happens in lru_add/del, for example.
And, by this, we can place all lookup related things to __read_mostly.
With css_lookup(), we can't do it and have to be afraid of cache
behavior.

I hear there are a users who create 2000+ cgroups and considering
about "low number" user here is not important.
This patch is a help for getting _stable_ performance even when there are
many cgroups.

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, August 2, 2010 - 8:38 pm

I've heard of one such user on the libcgroup mailing list. 

-- 
	Three Cheers,
	Balbir
--

From: Daisuke Nishimura
Date: Monday, August 2, 2010 - 9:31 pm

Hi.

Thank you for all of your works.

Several comments are inlined.

On Mon, 2 Aug 2010 19:13:04 +0900
id_to_memcg() seems to be called under rcu_read_lock() already, so I think
We don't use vmalloc() area in this version :)


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 9:37 pm

On Tue, 3 Aug 2010 13:31:09 +0900

Maybe. I thought about which is better to add

	VM_BUG_ON(!rcu_read_lock_held);
or
	rcu_read_lock()
	..
	rcu_read_unlock()

Oh. yes. thank you. I'll fix

Thanks,
-Kame


--

From: Daisuke Nishimura
Date: Monday, August 2, 2010 - 9:51 pm

On Tue, 3 Aug 2010 13:37:23 +0900
Yes, I personally like the former.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 9:54 pm

On Tue, 3 Aug 2010 13:51:29 +0900

ok, will rewrite in that style.

-Kame

--

From: Daisuke Nishimura
Date: Monday, August 2, 2010 - 10:04 pm

On Tue, 3 Aug 2010 13:54:13 +0900
Ah, looking into [2/5], this would not be true necessarily..
So, it might be better to leave as it is.

--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 3:14 am

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Now, addresses of memory cgroup can be calculated by their ID without complex.
This patch relplaces pc->mem_cgroup from a pointer to a unsigned short.
On 64bit architecture, this offers us more 6bytes room per page_cgroup.
Use 2bytes for blkio-cgroup's page tracking. More 4bytes will be used for
some light-weight concurrent access.

We may able to move this id onto flags field but ...go step by step.

Changelog: 20100730
 - fixed some garbage added by debug code in early stage

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/page_cgroup.h |    3 ++-
 mm/memcontrol.c             |   32 +++++++++++++++++++-------------
 mm/page_cgroup.c            |    2 +-
 3 files changed, 22 insertions(+), 15 deletions(-)

Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -12,7 +12,8 @@
  */
 struct page_cgroup {
 	unsigned long flags;
-	struct mem_cgroup *mem_cgroup;
+	unsigned short mem_cgroup;	/* ID of assigned memory cgroup */
+	unsigned short blk_cgroup;	/* Not Used..but will be. */
 	struct page *page;
 	struct list_head lru;		/* per cgroup LRU list */
 };
Index: mmotm-0727/mm/page_cgroup.c
===================================================================
--- mmotm-0727.orig/mm/page_cgroup.c
+++ mmotm-0727/mm/page_cgroup.c
@@ -15,7 +15,7 @@ static void __meminit
 __init_page_cgroup(struct page_cgroup *pc, unsigned long pfn)
 {
 	pc->flags = 0;
-	pc->mem_cgroup = NULL;
+	pc->mem_cgroup = 0;
 	pc->page = pfn_to_page(pfn);
 	INIT_LIST_HEAD(&pc->lru);
 }
Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -379,7 +379,7 @@ struct cgroup_subsys_state *mem_cgroup_c
 static struct mem_cgroup_per_zone *
 ...
From: Balbir Singh
Date: Monday, August 2, 2010 - 8:45 pm

Can I recommend that on 64 bit systems, we merge the flag, mem_cgroup
and blk_cgroup into one 8 byte value. We could use
__attribute("packed") and do something like this

struct page_cgroup {
        unsigned int flags;
        unsigned short mem_cgroup;
        unsigned short blk_cgroup;
        ...
} __attribute(("packed"));

Then we need to make sure we don't use more that 32 bits for flags,
which is very much under control at the moment.

This will save us 8 bytes in total on 64 bit systems and nothing on 32
bit systems, but will enable blkio cgroup to co-exist.


-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 8:48 pm

On Tue, 3 Aug 2010 09:15:13 +0530

set_bit() requires "long" as its argument. more some trick is required.

 And, IIUC, packing implies
	pc->mem_cgroup = mem_cgroup_id; or
	pc->blk_cgroup = blk_cgroup_id; will have race with
	set/clear_bit(BIT_XXXX, &pc->flags)
 
 This "packing" is not very easy. we have to consider all possible combinations

yes. But I have cocnerns of race condition. to do that, we need
patch 3-5. (But patch 5 adds spinlock, then no 8bytes reduce.)

Let me go step by step. I'm _really_ afraid of race conditions.

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 3:15 am

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

At accounting file events per memory cgroup, we need to find memory cgroup
via page_cgroup->mem_cgroup. Now, we use lock_page_cgroup().

But, considering the context which page-cgroup for files are accessed,
we can use alternative light-weight mutual execusion in the most case.
At handling file-caches, the only race we have to take care of is "moving"
account, IOW, overwriting page_cgroup->mem_cgroup. Because file status
update is done while the page-cache is in stable state, we don't have to
take care of race with charge/uncharge.

Unlike charge/uncharge, "move" happens not so frequently. It happens only when
rmdir() and task-moving (with a special settings.)
This patch adds a race-checker for file-cache-status accounting v.s. account
moving. The new per-cpu-per-memcg counter MEM_CGROUP_ON_MOVE is added.
The routine for account move 
  1. Increment it before start moving
  2. Call synchronize_rcu()
  3. Decrement it after the end of moving.
By this, file-status-counting routine can check it needs to call
lock_page_cgroup(). In most case, I doesn't need to call it.


Changelog: 20100730
 - some cleanup.
Changelog: 20100729
 - replaced __this_cpu_xxx() with this_cpu_xxx
   (because we don't call spinlock)
 - added VM_BUG_ON().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   78 +++++++++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 66 insertions(+), 12 deletions(-)

Index: mmotm-0727/mm/memcontrol.c
===================================================================
--- mmotm-0727.orig/mm/memcontrol.c
+++ mmotm-0727/mm/memcontrol.c
@@ -88,6 +88,7 @@ enum mem_cgroup_stat_index {
 	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
 	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
 	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,   /* A check for locking move account/status */
 
 ...
From: Balbir Singh
Date: Monday, August 2, 2010 - 8:33 pm

Is for_each_possible really required? Won't online cpus suffice? There
can be a race if a hotplug event happens between start and end move,
shouldn't we handle that. My concern is that with something like 1024
cpus possible today, we might need to optimize this further.

May be we can do this first and optimize later.

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 8:39 pm

On Tue, 3 Aug 2010 09:03:27 +0530
yes. I have the same concern. But I don't have any justification to disable
Maybe. For now, cpu-hotplug event hanlder tend to be a noise for this patch.
I would like to do it later.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Tuesday, August 3, 2010 - 5:55 pm

Is this check necessary?

Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, August 3, 2010 - 6:11 pm

On Wed, 4 Aug 2010 09:55:13 +0900

Yes, I hit NULL here.
That happens migration=off case, IIRC.

Thanks,

--

From: Daisuke Nishimura
Date: Tuesday, August 3, 2010 - 6:25 pm

On Wed, 4 Aug 2010 10:11:50 +0900
Ah, you're right.
Thank you for your clarification.

Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 3:17 am

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

Preparing for adding new status arounf file caches.(dirty, writeback,etc..)
Using a unified macro and more generic names.
All counters will have the same rule for updating.

Changelog:
 - clean up and moved mem_cgroup_stat_index to header file.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h  |   23 ++++++++++++++++++++++
 include/linux/page_cgroup.h |   12 +++++------
 mm/memcontrol.c             |   46 ++++++++++++++++++--------------------------
 3 files changed, 48 insertions(+), 33 deletions(-)

Index: mmotm-0727/include/linux/memcontrol.h
===================================================================
--- mmotm-0727.orig/include/linux/memcontrol.h
+++ mmotm-0727/include/linux/memcontrol.h
@@ -25,6 +25,29 @@ struct page_cgroup;
 struct page;
 struct mm_struct;
 
+/*
+ * Per-cpu Statistics for memory cgroup.
+ */
+enum mem_cgroup_stat_index {
+	/*
+	 * For MEM_CONTAINER_TYPE_ALL, usage = pagecache + rss.
+	 */
+	MEM_CGROUP_STAT_CACHE,		/* # of pages charged as cache */
+	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
+	MEM_CGROUP_STAT_PGPGIN_COUNT,	/* # of pages paged in */
+	MEM_CGROUP_STAT_PGPGOUT_COUNT,	/* # of pages paged out */
+	MEM_CGROUP_STAT_SWAPOUT, /* # of pages, swapped out */
+	MEM_CGROUP_EVENTS,	/* incremented at every  pagein/pageout */
+	MEM_CGROUP_ON_MOVE,   /* A check for locking move account/status */
+	/* About file-stat please see memcontrol.h */
+	MEM_CGROUP_FSTAT_BASE,
+	MEM_CGROUP_FSTAT_FILE_MAPPED = MEM_CGROUP_FSTAT_BASE,
+	MEM_CGROUP_FSTAT_END,
+	MEM_CGROUP_STAT_NSTATS = MEM_CGROUP_FSTAT_END,
+};
+
+#define MEMCG_FSTAT_IDX(idx)	((idx) - MEM_CGROUP_FSTAT_BASE)
+
 extern unsigned long mem_cgroup_isolate_pages(unsigned long nr_to_scan,
 					struct list_head *dst,
 					unsigned long *scanned, int order,
Index: ...
From: Balbir Singh
Date: Monday, August 2, 2010 - 9:03 pm

Do we use the bit in pc->flags, otherwise is there an advantage of
creating a separate index for the other stats the block I/O needs?

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 9:24 pm

On Tue, 3 Aug 2010 09:33:04 +0530
??? using pc->flags.

use SetPageFileMapped() etc.. and drop this patch ?
I don't want to add swtich(idx) to call SetPageFileMapped() etc.


Thanks,
-Kmae


--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 3:20 am

From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This patch replaces bit_spinlock with spinlock. In general,
spinlock has good functinality than bit_spin_lock and we should use
it if we have a room for it. In 64bit arch, we have extra 4bytes.
Let's use it.
expected effects:
 - use better codes.
 - ticket lock on x86-64
 - para-vitualization aware lock
etc..

Chagelog: 20090729
 - fixed page_cgroup_is_locked().

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
--
---
 include/linux/page_cgroup.h |   33 ++++++++++++++++++++++++++++++++-
 mm/memcontrol.c             |    2 +-
 mm/page_cgroup.c            |    3 +++
 3 files changed, 36 insertions(+), 2 deletions(-)

Index: mmotm-0727/include/linux/page_cgroup.h
===================================================================
--- mmotm-0727.orig/include/linux/page_cgroup.h
+++ mmotm-0727/include/linux/page_cgroup.h
@@ -10,8 +10,14 @@
  * All page cgroups are allocated at boot or memory hotplug event,
  * then the page cgroup for pfn always exists.
  */
+#ifdef CONFIG_64BIT
+#define PCG_HAS_SPINLOCK
+#endif
 struct page_cgroup {
 	unsigned long flags;
+#ifdef PCG_HAS_SPINLOCK
+	spinlock_t	lock;
+#endif
 	unsigned short mem_cgroup;	/* ID of assigned memory cgroup */
 	unsigned short blk_cgroup;	/* Not Used..but will be. */
 	struct page *page;
@@ -36,7 +42,9 @@ struct page_cgroup *lookup_page_cgroup(s
 
 enum {
 	/* flags for mem_cgroup */
-	PCG_LOCK,  /* page cgroup is locked */
+#ifndef PCG_HAS_SPINLOCK
+	PCG_LOCK,  /* page cgroup is locked (see below also.)*/
+#endif
 	PCG_CACHE, /* charged as cache */
 	PCG_USED, /* this object is in use. */
 	PCG_ACCT_LRU, /* page has been accounted for */
@@ -65,8 +73,6 @@ static inline void ClearPageCgroup##unam
 static inline int TestClearPageCgroup##uname(struct page_cgroup *pc)	\
 	{ return test_and_clear_bit(PCG_##lname, &pc->flags);  }
 
-TESTPCGFLAG(Locked, LOCK)
-
 /* Cache flag is set only once (at allocation) */
 ...
From: Balbir Singh
Date: Monday, August 2, 2010 - 9:06 pm

The additional space usage is a big concern, I think saving space
would be of highest priority. I understand the expected benefits, but
a spinlock_t per page_cgroup is quite expensive at the moment. If
anything I think it should be a config option under CONFIG_DEBUG or
something else to play with and see the side effects.

-- 
	Three Cheers,
	Balbir
--

From: KAMEZAWA Hiroyuki
Date: Monday, August 2, 2010 - 9:25 pm

On Tue, 3 Aug 2010 09:36:45 +0530

Hmm. As I already wrote, packing id to flags is not easy. 
leave 4 bytes space _pad for a while and drop this patch ?

I don't like to add CONFIG_DEBUG in this core.

Thanks,
-Kame


--

From: Balbir Singh
Date: Monday, August 2, 2010 - 7:36 pm

I was catching up with my inbox, did not realize you had moved onto v3

Agreed, background watermark based reclaim is something we should look


-- 
	Three Cheers,
	Balbir
--

Previous thread: GFS2: Pre-pull patch posting by Steven Whitehouse on Monday, August 2, 2010 - 2:27 am. (9 messages)

Next thread: [PATCH] nconfig: Fix segfault when menu is empty by Andrej Gelenberg on Monday, August 2, 2010 - 2:59 am. (2 messages)