This is a rewritten version of the memcg hierarchy handling.
...and I'm sorry for the tons of typos in v1.
Changelog:
- fixed typo.
- removed meaningless params (borrow)
- renamed structure members.
not-for-test. just for discussion. (I'll rewrite when our direction is fixed.)
Implemented Policy:
- parent overcommits all children
parent->usage = resource used by itself + resource moved to children.
Of course, parent->limit > parent->usage.
- when a child's limit is set, the resource moves.
- no automatic resource moving between parent <-> child
Example)
1) Assume a cgroup with a 1GB limit. (and no tasks belong to it, for now)
- group_A limit=1G,usage=0M.
2) create group B, C under A.
- group A limit=1G, usage=0M
- group B limit=0M, usage=0M.
- group C limit=0M, usage=0M.
3) increase group B's limit to 300M.
- group A limit=1G, usage=300M.
- group B limit=300M, usage=0M.
- group C limit=0M, usage=0M.
4) increase group C's limit to 500M
- group A limit=1G, usage=800M.
- group B limit=300M, usage=0M.
- group C limit=500M, usage=0M.
5) reduce group B's limit to 100M
- group A limit=1G, usage=600M.
- group B limit=100M, usage=0M.
- group C limit=500M, usage=0M.
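
The accounting in the example above can be sketched in userland C as follows.
This is only an illustration under stated assumptions, not the code in the
patch: struct counter and set_limit() are hypothetical names, units are MB,
and a request that does not fit simply fails instead of triggering reclaim.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified counter (units: MB). Not the kernel's res_counter. */
struct counter {
	uint64_t limit;
	uint64_t usage;          /* own usage + limits handed to children */
	struct counter *parent;  /* NULL for the root */
};

/*
 * Set a new limit on c. For a non-root counter, the difference is charged to
 * (or returned to) the parent's usage. Returns 0 on success, -1 if the parent
 * has no room or if c already uses more than new_limit.
 */
int set_limit(struct counter *c, uint64_t new_limit)
{
	struct counter *p = c->parent;

	if (new_limit < c->usage)
		return -1;       /* assumption: no shrinking/reclaim here */

	if (p) {
		assert(p->usage <= p->limit);
		if (new_limit > c->limit) {
			uint64_t diff = new_limit - c->limit;

			if (diff > p->limit - p->usage)
				return -1;       /* not enough room in parent */
			p->usage += diff;        /* resource moves to child */
		} else {
			p->usage -= c->limit - new_limit; /* moves back */
		}
	}
	c->limit = new_limit;
	return 0;
}
```

With A created at limit=1024 and B, C created under it at limit=0, calling
set_limit() on B with 300, on C with 500, and on B again with 100 leaves A's
usage at 300, 800 and 600 respectively, matching steps 3) to 5).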
Why is this enough?
- A middleware can do various kinds of resource balancing only by resetting "limit"
in userland.
TODO(maybe)
- rewrite force_empty to move the resource to the parent.
Thanks,
-Kame
--
This patch tries to implement a _simple_ 'hierarchy policy' in res_counter.
While several hierarchy policies can be considered, this patch implements
a simple one:
- the parent includes, and over-commits, the child
- there is no shared resource
- dynamic hierarchy resource usage management in the kernel is not necessary
It works as follows.
1. create a child. set default child limits to be 0.
2. set limit to child.
2-a. before setting limit to child, prepare enough room in parent.
2-b. increase 'usage' of parent by child's limit.
3. the child sets its limit to the value moved from the parent.
the parent remembers the amount of resource assigned to its children.
Above means that
- a directory's usage implies the sum of all sub directories +
own usage.
- there are no shared resource between parent <-> child.
Pros.
- simple and easy policy.
- no hierarchy overhead.
- no resource share among child <-> parent. very suitable for multilevel
resource isolation.
Cons.
- not good to implement some kind of _intelligent_ hierarchy balancing
in the _kernel_
Changelog:
-removed borrow.
-fixed tons of typos.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
Documentation/controllers/resource_counter.txt | 28 +++++
include/linux/res_counter.h | 72 +++++++++++++
kernel/res_counter.c | 130 +++++++++++++++++++++++--
3 files changed, 222 insertions(+), 8 deletions(-)
Index: hie-2.6.26-rc2-mm1/include/linux/res_counter.h
===================================================================
--- hie-2.6.26-rc2-mm1.orig/include/linux/res_counter.h
+++ hie-2.6.26-rc2-mm1/include/linux/res_counter.h
@@ -39,6 +39,11 @@ struct res_counter {
*/
unsigned long long failcnt;
/*
+ * the sum of all resource which is assigned to children.
+ */
+ unsigned long long for_children;
+
+ /*
* the lock to protect all of the above.
* the routines below ...

I am not sure if this is desirable. The concept of a hierarchy applies really

The problem with this is that you are forcing the parent will run into a reclaim

Sharing is an important aspect of hierarchies. I am not convinced of this
approach. Did you look at the patches I sent out? Was there something
fundamentally broken in them?

[snip]

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
OK, let's consider a _middleware_ which has the following parameters.

A param exported to the user:
- user_memory_limit
Parameters for co-operation with the kernel:
- kernel_memory_limit

And here,
user_memory_limit >= kernel_memory_limit == cgroup's memory.limit_in_bytes

When a user asks the middleware to set the limit to 1Gbytes:
user_memory_limit = 1G
kernel_memory_limit = 0-1G.

It moves kernel_memory_limit dynamically between 0 and 1Gbytes and resets
limit_in_bytes dynamically while checking the memory cgroup's statistics.
Of course, we can add some kind of interface, as follows:
- failure_notifier - triggered at failcnt increment.

That's not a problem because it's avoidable by users.

Yes, I read them. I tried to make it faster and found it would be complicated.
One problem is the overhead of the counter itself.
Another problem is the overhead of shrinking a multi-level LRU with feedback.
One more problem is that it's hard to implement various kinds of hierarchy
policy. I believe there are hierarchy policies other than the one OpenVZ wants
to use. Kicking functions out to middleware AMAP is what I'm thinking now.

Thanks,
-Kame
--
One way to manage hierarchies other than via limits is to use shares (please
see the shares used by the cpu controller). Basically, what you've done with
limits is done with shares.

If a parent has 100 shares, then it can decide how many to pass on to its
children based on the shares of the child, and your logic would work well.
I propose assigning top level (high resolution) shares to the root of the
cgroup and, in a hierarchy, passing them down to children and sharing it with
them. Based on the shares, deduce the limit of each node in the hierarchy.

What do you think?

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
As you wrote, a middleware can do share-based control by means of limits.
And it seems much easier to implement it in userland rather than in the kernel.
Here is an example. (just an example...)
Please point out if I'm misunderstanding "share".
root_level/ = limit 1G.
/child_A = share=30
/child_B = share=15
/child_C = share=5
(and assume there is no process under root_level, to make the explanation easy..)
0. At first, before starting to use memory, set all kernel_memory_limit.
root_level.limit = 1G
child_A.limit=64M,usage=0
child_B.limit=64M,usage=0
child_C.limit=64M,usage=0
free_resource=808M
1. next, a process in child_C starts to run and uses 600M of memory.
root_level.limit = 1G
child_A.limit=64M
child_B.limit=64M
child_C.limit=600M,usage=600M
free_resource=272M
2. now, a process in child_A starts to run and uses 800M of memory.
child_A.limit=800M,usage=800M
child_B.limit=64M,usage=0M
child_C.limit=136M,usage=136M
free_resource=0, A:C=6:1
3. Finally, a process in child_B starts and uses 500M of memory.
child_A.limit=600M,usage=600M
child_B.limit=300M,usage=300M
child_C.limit=100M,usage=100M
free_resource=0, A:B:C=6:3:1
4. one more, a process in A exits.
child_A.limit=64M, usage=0M
child_B.limit=500M, usage=500M
child_C.limit=436M, usage=436M
free_resource=0, B:C=3:1 (but B just wants to use 500M)
This is only an example, and the middleware can do more precise "limit"
control by checking statistics of the memory controller hierarchy based on
its own policy.
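
For the fully contended case (step 3 above, where every child wants more than
its share), the target limits follow directly from the share values. A minimal
sketch in C, assuming a hypothetical helper shares_to_limits() that lives in
the middleware; it does not handle children that want less than their share
(as child_B does in step 4) or a per-child minimum such as the 64M above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical middleware helper: split 'total' among n children in
 * proportion to shares[]. Integer division rounds down, so a small
 * remainder may be left unassigned. Returns -1 if all shares are zero.
 */
int shares_to_limits(uint64_t total, const unsigned *shares,
		     uint64_t *limits, size_t n)
{
	uint64_t sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += shares[i];
	if (sum == 0)
		return -1;

	for (i = 0; i < n; i++)
		limits[i] = total * shares[i] / sum;
	return 0;
}
```

With shares {30, 15, 5} and 1000M to distribute (the example's 1G, rounded
the way the figures above are), this gives 600M, 300M and 100M, i.e.
A:B:C=6:3:1.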
What I'm thinking about now is what kind of statistics/notifiers/controls are
necessary to implement shares in middleware. How precisely/quickly the
middleware can work depends on the interfaces.
Maybe the middleware should know "how fast the application runs now" by
some kind of check or co-operative interface with the application.
But I'm not sure how the kernel can help it.
Thanks,
-Kame
--
The good thing about user space is that it moves unnecessary code outside the
kernel, but the hard thing is standardization. If every middleware is going to
implement what you say, imagine the code duplication, unless we standardize
this into a library component.

More comments below.

I am not sure about the difference between user_memory_limit and
kernel_memory_limit. Could you please

This sounds incorrect, since the limits should be proportional to shares. If
the maximum shares in the root were 100 (*ideally we want higher resolution
than that), then
child_A.limit = .3 * 1G
child_B.limit = .15 * 1G

How is that feasible? Its limit was 64M, how did it bump up to 600M? If you

I am not sure if I understand your proposal at this point.

--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--
It's not a problem. We're not developing a world-wide ecosystem.
It's good that there are several development groups. It's a way to evolve.
Something popular will become the de facto standard.

The above just shows the params given to the kernel.
From the user's view, the memory limitation is A:B:C=6:3:1 if memory is fully
used. (In the above case, usage=0.)
In general, "share" works only when the total usage reaches the limit.
(See how the cpu scheduler works.)

The middleware just does the following when child_C.failcnt hits:
echo 64M > childC.memory.limit_in_bytes.
and periodically checks A, B, C and allows C to use what it wants because

The middleware notices that usage in A is growing and moves resources to A.
echo current child_C's limit - 64M > child_C
echo current child_A's limit + 64M > child_A
It does the above step by step in a loop to make A:C = 6:1.

echo current child_C's limit - 64M > child_C
echo current child_A's limit - 64M > child_A
echo current child_B's limit + 64M > child_B

The middleware can notice that memory pressure from child_A is reduced.
echo current child_A's limit - 64M > child_A
echo current child_C's limit + 64M > child_C
echo current child_B's limit + 64M > child_B
It does the above step by step in a loop to make B:C = 3:1 with avoiding

The most important point is that the cgroup's memory.limit_in_bytes is _just_ a
notification to ask the kernel to limit the memory usage of process groups
temporarily. It changes often. Based on the user's notification to the
middleware (share or limit), the middleware changes limit_in_bytes to a
suitable value, and changes it dynamically and periodically.

Thanks,
-Kame
--
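
The step-by-step loop described in the message above could look roughly like
this in C. It is a sketch under assumptions, not an existing tool:
rebalance() and STEP_MB are hypothetical, values are in MB, and a real
middleware would write each new value to the cgroup's limit file instead of
updating local variables.

```c
#include <assert.h>
#include <stdint.h>

#define STEP_MB 64	/* assumed granularity, as in the echo examples */

/*
 * Move STEP_MB at a time from *from to *to, for as long as the ratio
 * to:from after the move does not exceed share_to:share_from.
 * The donor is lowered first so the parent always has room for the raise.
 * Returns the number of steps taken.
 */
unsigned rebalance(uint64_t *to, uint64_t *from,
		   unsigned share_to, unsigned share_from)
{
	unsigned steps = 0;

	while (*from >= STEP_MB &&
	       (*to + STEP_MB) * share_from <= (*from - STEP_MB) * share_to) {
		*from -= STEP_MB;	/* e.g. lower child_C's limit */
		*to += STEP_MB;		/* e.g. raise child_A's limit */
		steps++;
	}
	return steps;
}
```

Starting from hypothetical values of 336M for A and 600M for C with shares
30 and 5, the loop stops after 7 steps at A=784M and C=152M, the closest it
gets to 6:1 in 64M steps without overshooting.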
My code adds shrinking_at_limit_change. I'm now trying to write
migrate_resources_at_task_move. (But it seems not so easy to implement in a
clean/fast way.)

I have no objection to soft-limit if it's easy to implement.
(I wrote that my explanation was just an example and that we could add more
knobs.)

_But_ I think that something to control multiple cgroups with regard to
hierarchy under some policy will never be a simple one. Adding some knobs to
each cgroup to do soft-limit would be a simple one if there were no hierarchy.

The memory controller's difference from the scheduler's hierarchy is that we
have to do multilevel page reclaim with feedback under some policy (not only
one..). Even without hierarchy, we _did_ make the kernel's LRU logic more
complicated. But we can get help from the middleware here, I think.

My goal is never to make cgroup slow or complicated. If it's slow, I'd like to
say "OK, please use VMware. It's simpler and fast enough for you."
"How fast it works rather than Hardware-Virtualization" is the most

Thanks, I'm sorry for my poor explanation skill.

Regards,
-Kame
--
is there any reason to check member != RES_LIMIT here,

can you reduce gratuitous differences between res_counter_borrow_resource
and res_counter_repay_resource? eg. 'success' vs 'done', how to decrement
'retry'.

YAMAMOTO Takashi
--
Ah, sorry. I'll rewrite.

I'll make the next version's quality better.

Thanks,
-Kame
--
hierarchy support for memcg.
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
mm/memcontrol.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 89 insertions(+), 1 deletion(-)
Index: hie-2.6.26-rc2-mm1/mm/memcontrol.c
===================================================================
--- hie-2.6.26-rc2-mm1.orig/mm/memcontrol.c
+++ hie-2.6.26-rc2-mm1/mm/memcontrol.c
@@ -792,6 +792,78 @@ int mem_cgroup_shrink_usage(struct mm_st
}
/*
+ * Memory Controller hierarchy support.
+ */
+
+int memcg_shrink_callback(struct res_counter *cnt, unsigned long long val)
+{
+ struct mem_cgroup *memcg = container_of(cnt, struct mem_cgroup, res);
+ unsigned long flags;
+ int ret = 1;
+ int progress = 1;
+
+retry:
+ spin_lock_irqsave(&cnt->lock, flags);
+ /* Need to shrink ? */
+ if (cnt->usage + val <= cnt->limit)
+ ret = 0;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+
+ if (!ret)
+ return 0;
+
+ if (!progress)
+ return 1;
+ progress = try_to_free_mem_cgroup_pages(memcg, GFP_KERNEL);
+
+ goto retry;
+}
+
+
+int mem_cgroup_resize_callback(struct res_counter *cnt, unsigned long long val)
+{
+ struct mem_cgroup *child = container_of(cnt, struct mem_cgroup, res);
+ struct mem_cgroup *parent;
+ struct cgroup *my_cg;
+ unsigned long flags, borrow;
+ unsigned long long diffs;
+ int ret = 0;
+
+ my_cg = child->css.cgroup;
+ /* Is this root group ? */
+ if (!my_cg->parent) {
+ spin_lock_irqsave(&cnt->lock, flags);
+ cnt->limit = val;
+ spin_unlock_irqrestore(&cnt->lock, flags);
+ return 0;
+ }
+ spin_lock_irqsave(&cnt->lock, flags);
+ if (val > cnt->limit) {
+ diffs = val - cnt->limit;
+ borrow = 1;
+ } else {
+ diffs = cnt->limit - val;
+ borrow = 0;
+ }
+ spin_unlock_irqrestore(&cnt->lock, flags);
+
+ parent = mem_cgroup_from_cont(my_cg->parent);
+ /* When we increase resource, call borrow. When decrease, call repay*/
+ if (borrow)
+ ret = res_counter_borrow_resource(cnt, &parent->res, ...

On Fri, 30 May 2008 10:43:12 +0900

I like this idea. The alternative could mean having a page live on multiple
cgroup LRU lists, not just the zone LRU and the one cgroup LRU, and
drastically increasing run time overhead.

Swapping memory in and out is horrendously slow anyway, so the idea of having
a daemon adjust the limits on the fly should work just fine.

--
All rights reversed.
--
