Re: [RFC][PATCH 2/2] memcg: hardwall hierarhcy for memcg

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 3, 2008 - 9:58 pm

Hi, this is the third version.

While the code changes are small, the whole _tone_ of the code has changed.
I'm not in a hurry; any comments are welcome.

Based on 2.6.26-rc2-mm1 + memcg patches in -mm queue.

Changes from v2:
 - Named the policy HardWall.
 - Rewrote the code to be easier to read; renamed functions.
 - Added documentation text.
 - Added support for the hierarchy_model parameter.
   For now, no_hierarchy and hardwall_hierarchy are implemented.

HardWall Policy:
  - designed for strict resource isolation under a hierarchy.
    Automatic load balancing between cgroups can break the
    user's assumptions even if it is implemented very well.
  - parent overcommits all children
     parent->usage = resource used by itself + resource moved to children.
     Of course, parent->limit > parent->usage.
  - when a child's limit is set, the resource moves.
  - no automatic resource moving between parent <-> child

Example)
  1) Assume a cgroup with a 1GB limit (and no tasks belong to it for now).
     - group_A limit=1G,usage=0M.

  2) create group B, C under A.
     - group A limit=1G, usage=0M
          - group B limit=0M, usage=0M.
          - group C limit=0M, usage=0M.

  3) increase group B's limit to 300M.
     - group A limit=1G, usage=300M.
          - group B limit=300M, usage=0M.
          - group C limit=0M, usage=0M.

  4) increase group C's limit to 500M
     - group A limit=1G, usage=800M.
          - group B limit=300M, usage=0M.
          - group C limit=500M, usage=0M.

  5) reduce group B's limit to 100M
     - group A limit=1G, usage=600M.
          - group B limit=100M, usage=0M.
          - group C limit=500M, usage=0M.
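
The five steps above can be reproduced with a small userland sketch. This is not the code from the patch: struct hw_counter, hw_set_limit() and the MB units are names and choices made up for illustration, and the real patch would try to reclaim memory where this sketch simply fails.

```c
#include <stddef.h>

/* Hypothetical stand-in for res_counter; field names are assumptions. */
struct hw_counter {
	unsigned long long limit;   /* hard limit, in MB for this example */
	unsigned long long usage;   /* own usage + resource moved to children */
	struct hw_counter *parent;  /* NULL for the top-level group */
};

/*
 * Change a child's limit and move the difference from (or back to) the
 * parent. Returns 0 on success, -1 if the parent cannot supply the
 * increase or the child already uses more than the new limit.
 */
static int hw_set_limit(struct hw_counter *c, unsigned long long new_limit)
{
	struct hw_counter *p = c->parent;

	if (new_limit < c->usage)
		return -1;	/* a real implementation would reclaim first */

	if (p) {
		if (new_limit > c->limit) {
			unsigned long long delta = new_limit - c->limit;

			if (p->usage + delta > p->limit)
				return -1;	/* parent has nothing left to lend */
			p->usage += delta;	/* charged as parent's usage */
		} else {
			p->usage -= c->limit - new_limit;	/* returned to parent */
		}
	}
	c->limit = new_limit;
	return 0;
}
```

With group A at limit=1024 and B, C as its children, setting B to 300, C to 500 and then B to 100 leaves A's usage at 300, 800 and 600 in turn, matching steps 3) to 5).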


Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 3, 2008 - 10:01 pm

A simple hard-wall hierarchy support for res_counter.

Changelog v2->v3
 - changed the name and arguments of functions.
 - rewrote to be read easily.
 - named as HardWall hierarchy.

This implements the following model:
 - A cgroup's tree represents a hierarchy of resources.
 - All of a child's resource is moved from its parent.
 - The resource moved to children is charged as the parent's usage.
 - The resource moves when child->limit is changed.
 - The sum of the resource given to children and the cgroup's own usage is
   limited by "limit".
 
This implies:
 - No dynamic, automatic hierarchy balancing in the kernel.
 - Each resource is isolated completely.
 - The kernel just supports resource-move-at-change-in-limit.
 - The user (middleware) is responsible for keeping the hierarchy well balanced.
   Good balance can be achieved by changing limits from userland.


Background:
 Recently, one resource isolation technique has become popular and widely used,
 i.e. hardware virtualization. We can do hierarchical resource isolation
 by using cgroups on top of it. But supporting hierarchy management in cgroups
 has some advantages in performance, unity and management cost.

 Other OSs have good resource management, and they support some kind of
 hierarchical resource management. We wonder what kind of hierarchy policy
 is good for Linux. There is another point: a hierarchical system can be
 implemented by cooperation between the kernel and userland. So there are
 various choices for what to do in the kernel: doing everything in the kernel,
 or exporting some proper interfaces to userland. Middleware tends to be used
 for management. I hope there will be an Open Source one.

 When supporting hierarchy in cgroups, several characteristics of a
 hierarchy policy can be considered. Some users need automatic balancing
 between several groups.

  - fairness    ... how fairness is kept under the policy

  - performance ... should be _fast_. Multi-level resource balancing tends
                 to use a large amount of CPU and can cause soft lockups.

  - ...
--

From: Li Zefan
Date: Tuesday, June 3, 2008 - 11:54 pm

Since parent and for_children are also protected by res_count->lock,
the above text should appear before 'e. spinlock_t lock'.
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 12:03 am

On Wed, 04 Jun 2008 14:54:12 +0800
ok.

Thanks,
-Kame

--

From: YAMAMOTO Takashi
Date: Wednesday, June 4, 2008 - 12:20 am

s/parent/child/

YAMAMOTO Takashi
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 12:32 am

On Wed,  4 Jun 2008 16:20:48 +0900 (JST)
Hmm..yes. 

Thanks,
-Kame

--

From: Paul Menage
Date: Wednesday, June 4, 2008 - 1:59 am

On Tue, Jun 3, 2008 at 10:01 PM, KAMEZAWA Hiroyuki

I think that the hierarchy/reclaim handling that you currently have in
the memory controller should be here; the memory controller should
just be able to pass a reference to try_to_free_mem_cgroup_pages() and
have everything else handled by res_counter.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 2:18 am

On Wed, 4 Jun 2008 01:59:31 -0700
Sounds reasonable. I'll redesign the whole thing as much as possible. I think I can do more.

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, June 9, 2008 - 2:48 am

We'd definitely like to see a user level tool/application as a demo of how this

Soft limits have been on my plate for a while now. I'll take a crack at it. At
the moment the statistics are a bit of a worry, since users/administrators need

The other reason for preferring a shares-based approach is that it will be

I would prefer to use a better name, lent_out? reserved_for_children?


OK, after reading this I am totally sure I want a shares-based interface. Limits
are not shared like this.

A child and a parent should both be capable of having a limit of 1G, but they
could use different share factors to govern how much each child will get.



I don't like the idea of spinning in an infinite loop, I would prefer to fail



-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: KAMEZAWA Hiroyuki
Date: Monday, June 9, 2008 - 3:20 am

On Mon, 09 Jun 2008 15:18:47 +0530
I don't have one now. I'll write one when I have time. Do you need it now?
Hmm... maybe I (we) need some more patches to implement useful statistics,

You have to think of the major difference in the nature of CPU and memory.
We have to reclaim the resource with some feedback among several cgroups.
But OK, if it can be implemented in a simple way.
I have no objections if the cost is very low. My concern is only performance.
Not easy to use, from my point of view. Can we use 'share' in a proper way

yield() after callback() means that res_counter's state will be
far different from the state just after the callback.
So, we have to yield before the callback and check res_counter sooner.

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, June 9, 2008 - 3:37 am

Not sure I understand your question. Share represents the share of available


But does yield() get us any guarantees of seeing the state change?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: kamezawa.hiroyu
Date: Monday, June 9, 2008 - 5:02 am

----- Original Message -----

If there is no swap, you cannot reclaim anonymous pages and shared memory.
Then, the kernel has to abandon any kind of auto-balancing somewhere.
(just an example. Things will be more complicated when we consider
Hmm, maybe my explanation is bad.

in following sequence
   1.callback()
   2.yield()
   3.check usage again
Elapsed time between 1->3 is big.

in following
   1.yield()
   2.callback()
   3.check usage again
Elapsed time between 2->3 is small.
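
A rough userland sketch of the second ordering, with a retry bound so that the loop fails instead of spinning forever. Everything here is an assumption for illustration: struct demo_counter, demo_shrink(), the callback type and the max_tries parameter are not from the patch, sched_yield() stands in for the kernel's yield(), and locking is left out.

```c
#include <sched.h>

struct demo_counter {
	unsigned long long usage;
	unsigned long long limit;
};

/* Hypothetical reclaim callback: returns nonzero if it freed anything. */
typedef int (*demo_reclaim_fn)(struct demo_counter *cnt);

/* Example callback: pretend to free up to 10 units per call. */
static int demo_reclaim_step(struct demo_counter *cnt)
{
	unsigned long long freed = cnt->usage < 10 ? cnt->usage : 10;

	cnt->usage -= freed;
	return freed != 0;
}

/*
 * Try to make room for val. Each round yields first, then reclaims, and
 * the next check of usage follows the callback directly, so the window
 * between "callback" and "check usage again" stays small.
 * Returns 0 on success, -1 after max_tries rounds or on no progress.
 */
static int demo_shrink(struct demo_counter *cnt, unsigned long long val,
		       demo_reclaim_fn reclaim, int max_tries)
{
	int i;

	for (i = 0; i < max_tries; i++) {
		if (cnt->usage + val <= cnt->limit)
			return 0;
		sched_yield();		/* 1. yield() */
		if (!reclaim(cnt))	/* 2. callback() */
			break;		/* no progress, give up */
		/* 3. the top of the loop checks usage again right away */
	}
	return cnt->usage + val <= cnt->limit ? 0 : -1;
}
```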

There is an option to implement "changing limit gradually"

Thanks,
-Kame

--

From: Randy Dunlap
Date: Wednesday, June 11, 2008 - 4:24 pm

For these non-static (non-private) functions, please use kernel-doc notation
(see Documentation/kernel-doc-nano-HOWTO.txt and/or examples in other source files).
Also, we prefer for the function documentation to be above its definition (implementation)
rather than above its declaration, so the kernel-doc should be moved to .c files
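
For reference, a kernel-doc comment placed above a definition in a .c file looks roughly like the sketch below. The function and struct (kd_charge_parent, kd_counter) are invented for this illustration and are not the ones in the patch; only the comment layout is the point.

```c
/* Hypothetical counter used only to give the example something to document. */
struct kd_counter {
	unsigned long long usage;
	unsigned long long limit;
};

/**
 * kd_charge_parent - account resource moved to a child against its parent
 * @parent: counter that lends the resource
 * @val: amount of resource to move to the child
 *
 * The parent's usage is incremented by @val. Nothing is changed if the
 * parent's limit would be exceeded.
 *
 * Returns 0 on success, -1 if @parent does not have @val left to lend.
 */
int kd_charge_parent(struct kd_counter *parent, unsigned long long val)
{
	if (parent->usage + val > parent->limit)
		return -1;
	parent->usage += val;
	return 0;
}
```

The comment opens with `/**`, names the function on the first line with a short summary, lists each parameter as `@name:`, and keeps the longer description after a blank comment line.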



---
~Randy
'"Daemon' is an old piece of jargon from the UNIX operating system,
where it referred to a piece of low-level utility software, a
fundamental part of the operating system."
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 11, 2008 - 9:59 pm

On Wed, 11 Jun 2008 16:24:27 -0700
Ah, sorry. I'll do so in the next version. Maybe I should move other comments
will fix.

Thank you for review!

Thanks,
-Kame

--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 3, 2008 - 10:03 pm

Hard-Wall hierarchy support for memcg.
 - new member hierarchy_model is added to memcg.

Only the root cgroup can modify this, and only when it has no children.

Adds the following functions for supporting HARDWALL hierarchy.
 - try to reclaim memory when "limit" is changed.
 - try to reclaim all memory at force_empty
 - return resources to the parent at destroy.

Changelog v2->v3
 - added documentation.
 - hierarchy_model parameter is added.


Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

---
 Documentation/controllers/memory.txt |   27 +++++-
 mm/memcontrol.c                      |  156 ++++++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 5 deletions(-)

Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
+++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
@@ -137,6 +137,8 @@ struct mem_cgroup {
 	struct mem_cgroup_lru_info info;
 
 	int	prev_priority;	/* for recording reclaim priority */
+
+	int	hierarchy_model; /* used hierarchical policy */
 	/*
 	 * statistics.
 	 */
@@ -144,6 +146,10 @@ struct mem_cgroup {
 };
 static struct mem_cgroup init_mem_cgroup;
 
+
+#define MEMCG_NO_HIERARCHY	(0)
+#define MEMCG_HARDWALL_HIERARCHY	(1)
+
 /*
  * We use the lower bit of the page->page_cgroup pointer as a bit spin
  * lock.  We need to ensure that page->page_cgroup is at least two
@@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
 }
 
 /*
+ * Memory Controller hierarchy support.
+ */
+
+/*
+ * shrink usage to be res->usage + val < res->limit.
+ */
+
+int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
+{
+	struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
+	unsigned long flags;
+	int ret = 1;
+	int progress = 1;
+
+retry:
+	spin_lock_irqsave(&cnt->lock, flags);
+	/* Need to shrink ? */
+	if (cnt->usage + val <= cnt->limit)
+		ret = 0;
+	spin_unlock_irqrestore(&cnt->lock, ...
--

From: Li Zefan
Date: Tuesday, June 3, 2008 - 11:42 pm

hierarchy_model can be a global value instead of per cgroup value.
--

From: KAMEZAWA Hiroyuki
Date: Tuesday, June 3, 2008 - 11:54 pm

On Wed, 04 Jun 2008 14:42:21 +0800
Ah, Hmm...yes. thank you for pointing out.

Regards,
-Kame

--

From: Paul Menage
Date: Wednesday, June 4, 2008 - 1:59 am

On Tue, Jun 3, 2008 at 10:03 PM, KAMEZAWA Hiroyuki

Can't this logic be in res_counter itself? I.e. the callback can
assume that some shrinking needs to be done, and should just do it and

Again, a lot of this function seems like generic logic that should be
in res_counter. The only bit that's memory specific is the
memcg_shrink_val, and maybe that could just be passed when creating
the res_counter. Perhaps we should have a res_counter_ops structure
with operations like "parse" for parsing strings into numbers
(currently called "write_strategy") and "reclaim" for trying to shrink
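
One possible shape for such an ops structure, as a userland sketch only. The thread names just the two operations "parse" and "reclaim"; the struct layout, the rc_ prefix, the signatures and the example callbacks below are all assumptions made for illustration.

```c
#include <errno.h>
#include <stdlib.h>

struct rc_counter;

/* Hypothetical ops table with the two operations mentioned above. */
struct rc_counter_ops {
	/* turn a user-supplied string into a number; 0 on success */
	int (*parse)(const char *buf, unsigned long long *val);
	/* try to shrink usage; nonzero if some progress was made */
	int (*reclaim)(struct rc_counter *cnt);
};

struct rc_counter {
	unsigned long long usage;
	unsigned long long limit;
	const struct rc_counter_ops *ops;
};

/* Example "parse": plain decimal number, nothing else allowed. */
static int rc_parse_ull(const char *buf, unsigned long long *val)
{
	char *end;

	errno = 0;
	*val = strtoull(buf, &end, 10);
	return (errno || end == buf || *end != '\0') ? -1 : 0;
}

/* Example "reclaim": pretend to free one unit per call. */
static int rc_reclaim_one(struct rc_counter *cnt)
{
	if (cnt->usage == 0)
		return 0;
	cnt->usage--;
	return 1;
}

static const struct rc_counter_ops rc_demo_ops = {
	.parse = rc_parse_ull,
	.reclaim = rc_reclaim_one,
};

/* Generic limit write: controller-specific work goes through ops only. */
static int rc_write_limit(struct rc_counter *cnt, const char *buf)
{
	unsigned long long new_limit;

	if (cnt->ops->parse(buf, &new_limit))
		return -1;
	while (cnt->usage > new_limit)
		if (!cnt->ops->reclaim(cnt))
			return -1;	/* could not shrink far enough */
	cnt->limit = new_limit;
	return 0;
}
```

In this arrangement the memory controller would supply only its own reclaim callback when creating the counter, and the generic code would own the limit-change logic.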

The res_counter already knows whether it has a parent, so these checks

Should we also re-account any remaining child usage to the parent?

Paul
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 2:26 am

On Wed, 4 Jun 2008 01:59:12 -0700
Hmm, OK. Maybe all I have to do is to define "What the callback has to do"
When this is called, there are no processes in this group. Then, the remaining
resources at this level are
  - file cache
  - swap cache (if shared)
  - shmem

And the biggest usage will be "file cache".
So, I don't think it's necessary to move the child's usage to the parent
in a hurry. But maybe shmem is worth moving.

I'd like to revisit this when I implement the "usage move at task move"
logic. (Currently, memory usage doesn't move to the new cgroup at task_attach.)

It will help me to implement the "move remaining usage to the parent" logic
in a clean way.

Thanks,
-Kame

--

From: Daisuke Nishimura
Date: Wednesday, June 4, 2008 - 5:53 am

I agree that "usage move at task move" is needed before
"move remaining usage to the parent".


Thanks,
Daisuke Nishimura.
--

From: Daisuke Nishimura
Date: Wednesday, June 4, 2008 - 5:32 am

Shouldn't it be called after verifying that no task remains
in this group?

If called via mem_cgroup_pre_destroy, it has already been verified
that no task remains, but if called via
mem_force_empty_write, some tasks may remain, and
this means a great many pages are swapped out, doesn't it?


Thanks,
Daisuke Nishimura.
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 5:04 pm

On Wed, 4 Jun 2008 21:32:35 +0900
you're right. I misunderstood where the number of children is checked.

Thanks,

--

From: Balbir Singh
Date: Monday, June 9, 2008 - 3:56 am

We have res_counter_check_under_limit(); maybe we could reuse that here by

I know callback is called from the two functions specified in patch 1/2 (move
and return resource). I don't understand why it is OK to force the limit to be




-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: kamezawa.hiroyu
Date: Monday, June 9, 2008 - 5:09 am

Paul suggested another way. (please see his mail, sorry)

So, I changed this behavior as follows.
"a cgroup's hierarchy mode can be changed when
  - parent's hierarchy mode is "no hierarchy"
I'm now rearranging the patch set to do that :)

Thank you for comments.

Regards,
-Kame


--

From: Randy Dunlap
Date: Wednesday, June 11, 2008 - 4:24 pm

The parent's usage is incremented by val.


---
~Randy
'"Daemon' is an old piece of jargon from the UNIX operating system,
where it referred to a piece of low-level utility software, a
fundamental part of the operating system."
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 11, 2008 - 10:00 pm

Thank you. will fix.

-Kame

On Wed, 11 Jun 2008 16:24:23 -0700

--

From: Paul Menage
Date: Wednesday, June 4, 2008 - 1:59 am

Hi Kame,

I like the idea of keeping the kernel simple, and moving more of the
intelligence to userspace.

It may need the kernel to expose a bit more in the way of VM details,
such as memory pressure, OOM notifications, etc, but as long as
userspace can respond quickly to memory imbalance, it should work
fine. We're doing something a bit similar using cpusets and fake NUMA
at Google - the principle of juggling memory between cpusets is the
same, but the granularity is much worse :-)

On Tue, Jun 3, 2008 at 9:58 PM, KAMEZAWA Hiroyuki

Should we try to support hierarchy and non-hierarchy cgroups in the
same tree? Maybe we should just enforce the restrictions that:

- the hierarchy mode can't be changed on a cgroup if you have children
or any non-zero usage/limit

I'm not sure that "overcommits" is the right word here - specifically,
the model ensures that a parent can't overcommit its children beyond
its limit.

Paul
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 2:15 am

On Wed, 4 Jun 2008 01:59:32 -0700
Yes, the next problem is adding interfaces, but we have to investigate
Ah, my patch does it (I think). My explanation was bad.

- a mem cgroup's mode can be changed only on the ROOT node, and only when it
  has no children.
- a child inherits its parent's mode.

Thanks,
-Kame


--

From: Paul Menage
Date: Wednesday, June 4, 2008 - 2:15 am

On Wed, Jun 4, 2008 at 2:15 AM, KAMEZAWA Hiroyuki

But if it can only be changed for the root cgroup when it has no
children, that implies that all cgroups must have the same mode. I'm
suggesting that we allow non-root cgroups to change their mode, as
long as:

- they have no children

- they don't have any limit charged to their parent (which means that
either they have a zero limit, or they have no parent, or they're not
in hierarchy mode)
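
The two conditions can be written down as a small predicate. This is only a sketch: struct hm_group and hm_can_change_mode() are hypothetical names, and only the two MEMCG_* constants are taken from the posted patch.

```c
#include <stdbool.h>
#include <stddef.h>

#define MEMCG_NO_HIERARCHY		(0)
#define MEMCG_HARDWALL_HIERARCHY	(1)

/* Hypothetical, simplified view of a cgroup for this check. */
struct hm_group {
	struct hm_group *parent;	/* NULL for the root */
	int nr_children;
	unsigned long long limit;
	int mode;			/* one of the MEMCG_* values above */
};

/* May this group switch its hierarchy mode right now? */
static bool hm_can_change_mode(const struct hm_group *g)
{
	/* condition 1: no children */
	if (g->nr_children > 0)
		return false;

	/*
	 * condition 2: no limit charged to the parent, i.e. there is no
	 * parent, or the limit is zero, or the group is not in hierarchy
	 * mode.
	 */
	return !g->parent || g->limit == 0 ||
	       g->mode == MEMCG_NO_HIERARCHY;
}
```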

Paul
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, June 4, 2008 - 2:31 am

On Wed, 4 Jun 2008 02:15:32 -0700
Hmm, I got your point. Your suggestion seems reasonable.
I'll try that logic in the next version.

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, June 9, 2008 - 2:30 am

Hi, Kamezawa-San,

Sorry for the delay in responding. Like we discussed last time, I'd prefer a
shares-based approach for hierarchical memcg management. I'll review/try these
patches and provide more feedback.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: KAMEZAWA Hiroyuki
Date: Monday, June 9, 2008 - 2:55 am

On Mon, 09 Jun 2008 15:00:22 +0530
Hi,

I'm now totally rearranging the patches, so just look at the concepts.

In a previous e-mail, I thought that there was a difference between 'your share'
and 'my share'. So, could you please explain again?

My 'share' has the following characteristics.

  - works as a soft limit, not a hard limit.
  - no limit when there is no high memory pressure.
  - resource usage will be proportionally fair to each group's share (priority)
    under memory pressure.

If you want to work on this, I can stop this for a while and do other important
patches, like background reclaim, mlock limiter, guarantee, etc., because my
priority for hierarchy is not very high (but it seemed better to do this before
other misc work, so I did).

Anyway, we have to test the new LRU (RvR LRU) first in the next -mm ;)

Thanks,
-Kame

--

From: Balbir Singh
Date: Monday, June 9, 2008 - 3:33 am

My share is very similar to yours.

A group might have a share of 100% and a hard limit of 1G. In this case the hard
limit applies if the system has more than 1G of memory. I think of hard limit as
the final controlling factor and shares are suggestive.
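
One way to read the paragraph above, as a sketch under assumptions: the share gives a suggested amount out of the system's memory, and the hard limit caps it. The function name, the percentage representation of a share and the MB units are all made up for illustration; nothing like this exists in the posted patches.

```c
/*
 * Suggested allowance for a group, in MB.
 * total_mb:      memory available on the system
 * share_percent: the group's share, 0..100
 * hard_limit_mb: the group's hard limit, the final controlling factor
 */
static unsigned long long eff_allowance_mb(unsigned long long total_mb,
					   unsigned int share_percent,
					   unsigned long long hard_limit_mb)
{
	unsigned long long by_share = total_mb * share_percent / 100;

	return by_share < hard_limit_mb ? by_share : hard_limit_mb;
}
```

With a 100% share and a 1G hard limit, a 4G system yields 1024 (the hard limit applies), while a 512M system yields 512 (the share is the binding value).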


I do, but I don't want to stop you from doing it. The mlock limiter is definitely
important, along with some control for large pages. Hierarchy is definitely
important, since we cannot add other major functionality without first solving
this problem. After that, high on my list is

1. Soft limits

Yes :) I just saw that going in


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
