Hi, this is the third version.
While the changes to the code are small, the whole _tone_ of the code has changed.
I'm not in a hurry; any comments are welcome.

Based on 2.6.26-rc2-mm1 + memcg patches in -mm queue.

Changes from v2:
- Named as HardWall policy.
- Rewrote the code to be easier to read. Changed the names of functions.
- Added text.
- Supported the hierarchy_model parameter.
  Now, no_hierarchy and hardwall_hierarchy are implemented.

HardWall Policy:
- designed for strict resource isolation under hierarchy.
  Usually, automatic load balancing between cgroups can break the
  user's assumptions even if it's implemented very well.
- parent overcommits all children
  parent->usage = resource used by itself + resource moved to children.
  Of course, parent->limit > parent->usage.
- when a child's limit is set, the resource moves.
- no automatic resource moving between parent <-> child

Example)
1) Assume a cgroup with a 1GB limit (and no tasks belong to it for now).
   - group A limit=1G, usage=0M.

2) Create groups B and C under A.
   - group A limit=1G, usage=0M.
   - group B limit=0M, usage=0M.
   - group C limit=0M, usage=0M.

3) Increase group B's limit to 300M.
   - group A limit=1G, usage=300M.
   - group B limit=300M, usage=0M.
   - group C limit=0M, usage=0M.

4) Increase group C's limit to 500M.
   - group A limit=1G, usage=800M.
   - group B limit=300M, usage=0M.
   - group C limit=500M, usage=0M.

5) Reduce group B's limit to 100M.
   - group A limit=1G, usage=600M.
   - group B limit=100M, usage=0M.
   - group C limit=500M, usage=0M.

Thanks,
-Kame
--
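The five steps of the example can be read as simple arithmetic on two fields per group. Below is a minimal user-space sketch of that accounting, not the patch itself: `struct hw_counter` and `hw_set_limit()` are hypothetical names, locking is omitted, and refusing a change stands in for the reclaim attempt the real code would make.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified counter; not the kernel's struct res_counter. */
struct hw_counter {
	unsigned long long limit;   /* what this group may use or hand out */
	unsigned long long usage;   /* own usage + limits moved to children */
	struct hw_counter *parent;  /* NULL for the top-level group */
};

/*
 * Change a group's limit under the hard-wall policy.
 * Returns 0 on success, -1 if the change is refused.
 */
static int hw_set_limit(struct hw_counter *c, unsigned long long new_limit)
{
	struct hw_counter *p = c->parent;

	/* Assumption: no reclaim here, so a limit below usage is refused. */
	if (new_limit < c->usage)
		return -1;

	if (p) {
		assert(p->usage <= p->limit);
		if (new_limit > c->limit) {
			unsigned long long delta = new_limit - c->limit;

			/* The parent must have enough unused room to move. */
			if (delta > p->limit - p->usage)
				return -1;
			p->usage += delta;
		} else {
			/* Shrinking returns the difference to the parent. */
			p->usage -= c->limit - new_limit;
		}
	}
	c->limit = new_limit;
	return 0;
}
```

With A at limit=1G, setting B to 300M, C to 500M and then B back to 100M leaves A's usage at 300M, 800M and 600M in turn, matching steps 3) to 5).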
Hi, Kamezawa-San,
Sorry for the delay in responding. Like we discussed last time, I'd prefer a
shares based approach for hierarchical memcg management. I'll review/try these
patches and provide more feedback.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
On Mon, 09 Jun 2008 15:00:22 +0530
Hi,

I'm now totally re-arranging the patches, so just look at the concepts.
In a previous e-mail, I thought that there was a difference between 'your share'
and 'my share'. So, could you please explain again?

My 'share' has the following characteristics.
- works as a soft limit, not a hard limit.
- no limit when there is no high memory pressure.
- resource usage will be proportionally fair to each group's share (priority)
  under memory pressure.

If you want to work on this, I can stop this for a while and do other important
patches, like background reclaim, mlock limiter, guarantee, etc., because my
priority for hierarchy is not very high (but it seemed better to do this before
other misc work, so I did).

Anyway, we have to test the new LRU (RvR LRU) first in the next -mm ;)
Thanks,
-Kame
--
My share is very similar to yours.
A group might have a share of 100% and a hard limit of 1G. In this case the hard
limit applies if the system has more than 1G of memory. I think of the hard limit as
the final controlling factor, and shares are suggestive.

I do, but I don't want to stop you from doing it. An mlock limiter is definitely
important, along with some control for large pages. Hierarchy is definitely
important, since we cannot add other major functionality without first solving
this problem. After that, high on my list is

1. Soft limits
Yes :) I just saw that going in
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
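One way to read "the hard limit is the final controlling factor and shares are suggestive" is as the minimum of two bounds. The helper below is only an illustration under that reading; `effective_allowance()` and its parameters are hypothetical and do not appear in any posted patch.

```c
#include <assert.h>

/*
 * Hypothetical helper: how much memory a group may hold under pressure.
 * The share gives a proportional slice of the machine; the hard limit
 * caps that slice no matter how large the share is.
 */
static unsigned long long effective_allowance(unsigned long long total_mem,
					      unsigned int share,
					      unsigned int total_shares,
					      unsigned long long hard_limit)
{
	unsigned long long by_share;

	assert(total_shares > 0 && share <= total_shares);

	/* Divide first so the product cannot overflow for large totals. */
	by_share = total_mem / total_shares * share;

	return by_share < hard_limit ? by_share : hard_limit;
}
```

For a group holding all shares with a 1G hard limit, a 4G machine yields 1G (the hard limit applies), while a 512M machine yields 512M (the share applies).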
Hi Kame,
I like the idea of keeping the kernel simple, and moving more of the
intelligence to userspace.

It may need the kernel to expose a bit more in the way of VM details,
such as memory pressure, OOM notifications, etc, but as long as
userspace can respond quickly to memory imbalance, it should work
fine. We're doing something a bit similar using cpusets and fake NUMA
at Google - the principle of juggling memory between cpusets is the
same, but the granularity is much worse :-)

On Tue, Jun 3, 2008 at 9:58 PM, KAMEZAWA Hiroyuki
Should we try to support hierarchy and non-hierarchy cgroups in the
same tree? Maybe we should just enforce the restrictions that:

- the hierarchy mode can't be changed on a cgroup if you have children
  or any non-zero usage/limit

I'm not sure that "overcommits" is the right word here - specifically,
the model ensures that a parent can't overcommit its children beyond
its limit.

Paul
--
On Wed, 4 Jun 2008 01:59:32 -0700
yes, next problem is adding interfaces. but we have to investigate
Ah, my patch does it (I think). My explanation was bad.

- a mem cgroup's mode can be changed only on the ROOT node, and only when it has no children.
- a child inherits its parent's mode.

Thanks,
-Kame
--
On Wed, Jun 4, 2008 at 2:15 AM, KAMEZAWA Hiroyuki
But if it can only be changed for the root cgroup when it has no
children, that implies that all cgroups must have the same mode. I'm
suggesting that we allow non-root cgroups to change their mode, as
long as:

- they have no children
- they don't have any limit charged to their parent (which means that
  either they have a zero limit, or they have no parent, or they're not
  in hierarchy mode)

Paul
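The two conditions above translate into a small predicate. This is a sketch only, assuming a simplified group structure; `struct cg`, `enum hier_mode` and `cg_can_change_mode()` are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical mirror of the two modes named in the patch. */
enum hier_mode { NO_HIERARCHY = 0, HARDWALL_HIERARCHY = 1 };

/* Hypothetical, simplified group; not the kernel's struct mem_cgroup. */
struct cg {
	struct cg *parent;          /* NULL for a top-level group */
	unsigned int nr_children;
	unsigned long long limit;
	enum hier_mode mode;
};

/* May this group switch its hierarchy mode right now? */
static bool cg_can_change_mode(const struct cg *g)
{
	assert(g != NULL);

	/* Condition 1: no children. */
	if (g->nr_children > 0)
		return false;

	/*
	 * Condition 2: nothing is charged to the parent, which holds when
	 * there is no parent, the limit is zero, or hierarchy mode is off.
	 */
	return g->parent == NULL || g->limit == 0 || g->mode == NO_HIERARCHY;
}
```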
--
On Wed, 4 Jun 2008 02:15:32 -0700
Hmm, I got your point. Your suggestion seems reasonable.
I'll try that logic in the next version.

Thanks,
-Kame
--
Hard-Wall hierarchy support for memcg.
- new member hierarchy_model is added to memcg.

Only the root cgroup can modify this, and only when it has no children.
Adds the following functions for supporting HARDWALL hierarchy.
- try to reclaim memory at the change of "limit".
- try to reclaim all memory at force_empty
- returns resources to the parent at destroy.

Changelog v2->v3
- added documentation.
- hierarchy_model parameter is added.

Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/controllers/memory.txt | 27 +++++-
mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++-
2 files changed, 178 insertions(+), 5 deletions(-)

Index: temp-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- temp-2.6.26-rc2-mm1.orig/mm/memcontrol.c
+++ temp-2.6.26-rc2-mm1/mm/memcontrol.c
@@ -137,6 +137,8 @@ struct mem_cgroup {
struct mem_cgroup_lru_info info;

int prev_priority; /* for recording reclaim priority */
+
+ int hierarchy_model; /* used hierarchical policy */
/*
* statistics.
*/
@@ -144,6 +146,10 @@ struct mem_cgroup {
};
static struct mem_cgroup init_mem_cgroup;
+
+#define MEMCG_NO_HIERARCHY (0)
+#define MEMCG_HARDWALL_HIERARCHY (1)
+
/*
* We use the lower bit of the page->page_cgroup pointer as a bit spin
* lock. We need to ensure that page->page_cgroup is at least two
@@ -792,6 +798,89 @@ int mem_cgroup_shrink_usage(struct mm_st
}

/*
+ * Memory Controller hierarchy support.
+ */
+
+/*
+ * shrink usage to be res->usage + val < res->limit.
+ */
+
+int memcg_shrink_val(struct res_counter *cnt, unsigned long long val)
+{
+ struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
+ unsigned long flags;
+ int ret = 1;
+ int progress = 1;
+
+retry:
+ spin_lock_irqsave(&cnt->lock, flags);
+ /* Need to shrink ? */
+ if (cnt->usage + val <= cnt->limit)
+ ret = 0;
+ spin...
The parent's usage is incremented by val.
---
~Randy
'"Daemon' is an old piece of jargon from the UNIX operating system,
where it referred to a piece of low-level utility software, a
fundamental part of the operating system."
--
Thank you. will fix.
-Kame
On Wed, 11 Jun 2008 16:24:23 -0700
--
We have res_counter_check_under_limit(), may be we could re-use that here by
I know callback is called from the two functions specified in patch 1/2 (move
and return resource). I don't understand why it is OK to force the limit to be

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
Paul suggested another way. (please see his mail, sorry)
So, I changed this behavior as follows.
"a cgroup's hierarchy mode can be changed when
- parent's hierarchy mode is "no hierarchy"
I'm now rearranging the set to do that :)

Thank you for comments.
Regards,
-Kame
--
Shouldn't it be called after verifying that no tasks remain
in this group?

If called via mem_cgroup_pre_destroy, it has already been verified
that no tasks remain, but if called via
mem_force_empty_write, some tasks may remain, and
this means a great many pages are swapped out, doesn't it?

Thanks,
Daisuke Nishimura.
--
On Wed, 4 Jun 2008 21:32:35 +0900
You're right. I misunderstood where the number of children is checked.

Thanks,
--
On Tue, Jun 3, 2008 at 10:03 PM, KAMEZAWA Hiroyuki
Can't this logic be in res_counter itself? I.e. the callback can
assume that some shrinking needs to be done, and should just do it and

Again, a lot of this function seems like generic logic that should be
in res_counter. The only bit that's memory specific is the
memcg_shrink_val, and maybe that could just be passed when creating
the res_counter. Perhaps we should have a res_counter_ops structure
with operations like "parse" for parsing strings into numbers
(currently called "write_strategy") and "reclaim" for trying to shrink

The res_counter already knows whether it has a parent, so these checks
Should we also re-account any remaining child usage to the parent?
Paul
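The res_counter_ops idea suggested above could look roughly like the following user-space sketch. Everything here is an assumption made for illustration: `struct rc`, `struct rc_ops`, `rc_set_limit()` and the two `demo_*` callbacks are hypothetical and are not the kernel's res_counter API.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct rc;	/* hypothetical stand-in for struct res_counter */

/* Controller-specific hooks, as suggested in the review. */
struct rc_ops {
	/* Parse a user-supplied string into a value; 0 on success. */
	int (*parse)(const char *buf, unsigned long long *val);
	/* Try to free resources; nonzero if some progress was made. */
	int (*reclaim)(struct rc *cnt);
};

struct rc {
	unsigned long long limit;
	unsigned long long usage;
	const struct rc_ops *ops;
};

/* Generic limit update: the counter drives reclaim through its ops. */
static int rc_set_limit(struct rc *cnt, const char *buf)
{
	unsigned long long new_limit;
	int ret;

	assert(cnt && cnt->ops);

	ret = cnt->ops->parse(buf, &new_limit);
	if (ret)
		return ret;

	/* Shrink until usage fits, and give up when reclaim stalls. */
	while (cnt->usage > new_limit) {
		if (!cnt->ops->reclaim(cnt))
			return -EBUSY;
	}
	cnt->limit = new_limit;
	return 0;
}

/* Toy callbacks for demonstration only. */
static int demo_parse(const char *buf, unsigned long long *val)
{
	char *end;

	errno = 0;
	*val = strtoull(buf, &end, 10);
	return (errno || end == buf || *end != '\0') ? -EINVAL : 0;
}

static int demo_reclaim(struct rc *cnt)
{
	/* Pretend each call frees up to 10 units. */
	unsigned long long step = cnt->usage < 10 ? cnt->usage : 10;

	cnt->usage -= step;
	return step != 0;
}
```

In this arrangement the memory controller would supply only the two callbacks, and the limit check and retry logic would live in the generic counter code.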
--
On Wed, 4 Jun 2008 01:59:12 -0700
Hmm ok. Maybe all I have to do is to define "What the callback has to do"
When this is called, there are no processes in this group. Then, the remaining
resources at this level are
- file cache
- swap cache (if shared)
- shmem

And the biggest usage will be "file cache".
So, I don't think it's necessary to move the child's usage to the parent
in a hurry. But maybe shmem is worth moving.

I'd like to revisit this when I implement the "usage move at task move"
logic. (Currently, memory usage doesn't move to the new cgroup at task_attach.)

It will help me to implement the logic "move remaining usage to the parent"
in a clean way.

Thanks,
-Kame
--
I agree that "usage move at task move" is needed before
"move remaining usage to the parent".

Thanks,
Daisuke Nishimura.
--
hierarchy_model can be a global value instead of per cgroup value.
--
On Wed, 04 Jun 2008 14:42:21 +0800
Ah, hmm...yes. Thank you for pointing that out.

Regards,
-Kame
--
A simple hard-wall hierarchy support for res_counter.

Changelog v2->v3
- changed the name and arguments of functions.
- rewrote to be read easily.
- named as HardWall hierarchy.

This implements the following model
- A cgroup's tree means hierarchy of resource.
- All of a child's resource is moved from its parent.
- The resource moved to children is charged as parent's usage.
- The resource moves when child->limit is changed.
- The sum of resource for children and its own usage is limited by "limit".

This implies
- No dynamic automatic hierarchy balancing in the kernel.
- Each resource is isolated completely.
- The kernel just supports resource-move-at-change-in-limit.
- The user (middleware) is responsible for keeping the hierarchy well balanced.
  Good balance can be achieved by changing limits from user land.

Background:
Recently, a resource isolation technique has become popular and widely used,
i.e. hardware virtualization. We can do hierarchical resource isolation
by using cgroup on it. But supporting hierarchy management in cgroups
has some advantages in performance, unity and cost of management.

There is good resource management in other OSs; they support some kind of
hierarchical resource management. We wonder what kind of hierarchy policy
is good for Linux. And there is another point. A hierarchical system can be
implemented by kernel and user-land co-operation. So, there are various
choices of what to do in the kernel: doing everything in the kernel, or exporting
some proper interfaces to user-land. Middleware tends to be used for management.
I hope there will be an Open Source one.

When supporting hierarchy in cgroup, several characteristics of
a hierarchy policy can be considered. Some need automatic balancing
between several groups.

- fairness ... how fairness is kept under policy
- performance ... should be _fast_. multi-level resource balancing tends
  to use a large amount of CPU and can cause soft lockup.
-...
For these non-static (non-private) functions, please use kernel-doc notation
(see Documentation/kernel-doc-nano-HOWTO.txt and/or examples in other source files).
Also, we prefer for the function documentation to be above its definition (implementation)
rather than above its declaration, so the kernel-doc should be moved to .c files

---
~Randy
--
On Wed, 11 Jun 2008 16:24:27 -0700
Ah, sorry. I'll do so in the next version. Maybe I should move other comments
will fix.

Thank you for review!
Thanks,
-Kame
--
We'd definitely like to see a user level tool/application as a demo of how this
Soft limits has been on my plate for a while now. I'll take a crack at it. At
the moment the statistics is a bit of a worry, since users/administrators need

The other reason for preferring a shares based approach is that it will be
I would prefer to use a better name, lent_out? reserved_for_children?
OK, after reading this I am totally sure I want a shares based interface. Limits
are not shared like this.

A child and a parent should both be capable of having a limit of 1G, but they
could use different share factors to govern how much each child will get.

I don't like the idea of spinning in an infinite loop, I would prefer to fail
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
On Mon, 09 Jun 2008 15:18:47 +0530
I don't have one now. I'll write one when I have time. Do you need it now?
Hmm...maybe I (we) need some more patches to implement useful statistics,

You have to think about the major difference in the nature of CPU and memory.
We have to reclaim the resource with some feedback among several cgroups.
But OK, if it can be implemented in a simple way.
I have no objections if the cost is very low. My concern is only performance.
Not easy to use from my point of view. Can we use 'share' in a proper way

yield() after callback() means that res_counter's state will be
far different from the state after callback.
So, we have to yield before the callback and check res_counter sooner.

Thanks,
-Kame
--
Not sure I understand your question. Share represents the share of available
But does yield() get us any guarantees of seeing the state change?
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
----- Original Message -----
If there is no swap, you cannot reclaim anonymous pages and shared memory.
Then, the kernel has to abandon any kind of auto-balancing somewhere.
(just an example. Things will be more complicated when we consider
Hmm, maybe my explanation is bad.

In the following sequence
1. callback()
2. yield()
3. check usage again
the elapsed time between 1->3 is big.

In the following
1. yield()
2. callback()
3. check usage again
the elapsed time between 2->3 is small.

There is an option to implement "changing limit gradually"
Thanks,
-Kame
--
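The yield-then-callback ordering described above, combined with the earlier request to fail rather than spin forever, can be sketched as a bounded retry loop. This is a user-space illustration only: POSIX `sched_yield()` stands in for the kernel's `yield()`, the counter lock is omitted, and `struct counter`, `shrink_until_fits()`, `max_retries` and `demo_shrink()` are hypothetical names.

```c
#include <assert.h>
#include <sched.h>
#include <stddef.h>

/* Hypothetical, simplified counter (no locking in this sketch). */
struct counter {
	unsigned long long usage;
	unsigned long long limit;
};

typedef int (*shrink_cb)(struct counter *c, unsigned long long val);

/*
 * Shrink until usage + val <= limit, or give up after max_retries.
 * Returns 0 on success, -1 if the counter still does not fit.
 */
static int shrink_until_fits(struct counter *c, unsigned long long val,
			     shrink_cb cb, int max_retries)
{
	int i;

	assert(cb != NULL);

	for (i = 0; i < max_retries; i++) {
		if (c->usage + val <= c->limit)
			return 0;
		sched_yield();	/* 1. yield first ...            */
		cb(c, val);	/* 2. ... then run the callback  */
		/* 3. the check at the top of the loop follows at once */
	}
	return c->usage + val <= c->limit ? 0 : -1;
}

/* Toy callback for demonstration: frees one unit per call. */
static int demo_shrink(struct counter *c, unsigned long long val)
{
	(void)val;
	if (c->usage > 0)
		c->usage--;
	return 1;
}
```

Because the check comes right after the callback, the state it sees is close to what the callback left behind, and the retry bound turns a stalled reclaim into an error return instead of an endless loop.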
On Tue, Jun 3, 2008 at 10:01 PM, KAMEZAWA Hiroyuki
I think that the hierarchy/reclaim handling that you currently have in
the memory controller should be here; the memory controller should
just be able to pass a reference to try_to_free_mem_cgroup_pages() and
have everything else handled by res_counter.

Paul
--
On Wed, 4 Jun 2008 01:59:31 -0700
Sounds reasonable. I'll re-design the whole AMAP. I think I can do more.

Thanks,
-Kame
--
s/parent/child/
YAMAMOTO Takashi
--
On Wed, 4 Jun 2008 16:20:48 +0900 (JST)
Hmm..yes.

Thanks,
-Kame
--
Since parent and for_children are also protected by res_count->lock,
the above text should appear before 'e. spinlock_t lock'.
--
On Wed, 04 Jun 2008 14:54:12 +0800
ok.

Thanks,
-Kame
--
