This is the third version of the address space control patches. These patches are against 2.6.25-mm1 and have been tested using KVM in SMP mode, both with and without the config enabled. The first patch adds the user interface. The second patch fixes the cgroup mm_owner_changed callback to pass the task struct, so that accounting can be adjusted on owner changes. The thrid patch adds accounting and control. The fourth patch updates documentation. An earlier post of the patchset can be found at http://lwn.net/Articles/275143/ This patch is built on top of the mm owner patches and utilizes that feature to virtually group tasks by mm_struct. Reviews, Comments? Series rlimit-controller-setup.patch cgroup-add-task-to-mm--owner-callbacks.patch rlimit-controller-address-space-accounting.patch rlimit-controller-add-documentation.patch -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I can't read the whole patch deeply now but this new concept "rlimit-controlle r" seems make sense to me. At quick glance, I have some thoughts. 1. kerner/rlimit_cgroup.c is better for future expansion. 2. why "+This controller framework is designed to be extensible to control any "+resource limit (memory related) with little effort." memory only ? Ok, all you want to do is related to memory, but someone may want to limit RLIMIT_CPU by group or RLIMIT_CORE by group or.... (I have no plan but they seems useful.;) So, could you add design hint of rlimit contoller to the documentation ? 3. Rleated to 2. Showing what kind of "rlimit" params are supported by cgroup will be good. I don't think you have to implement all things at once. Staring from "only RLIMIT_AS is supported now" is good. Someone will expand it if he needs. But showing basic view of "gerenal purpose rlimit contoller" in _doc ument_ or _comments_ or _codes_ is a good thing to do. If you don't want to provide RLIMIT feature other than address space, it's better to avoid using the name of RLIMIT. It's confusing. Thanks, -Kame --
I currently mentioned memory, since we have the infrastructure to group using mm->owner infrastructure. For other purposes, we'll need to enhance the Do you mean in init/Kconfig or documentation?. I should probably rename limit_in_bytes and usage_in_bytes to add an as_ prefix, so that the UI clearly I used RLIMIT since I want to extend it later to control memory locked pages :) -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
On Mon, 05 May 2008 09:51:19 +0530 I see. Thanks, -Kame --
...RLIMIT_MEMLOCK is in my want-to-do-list ;) Thanks, -Kame --
Mine too, but I want to get to it after the hierarchy and soft limit patches. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
This is the documentation patch. It describes the rlimit controller and how to build and use it. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- Documentation/controllers/rlimit.txt | 29 +++++++++++++++++++++++++++++ 1 file changed, 29 insertions(+) diff -puN /dev/null Documentation/controllers/rlimit.txt --- /dev/null 2008-05-03 22:12:13.033285313 +0530 +++ linux-2.6.25-balbir/Documentation/controllers/rlimit.txt 2008-05-04 03:06:06.000000000 +0530 @@ -0,0 +1,29 @@ +This controller is enabled by the CONFIG_CGROUP_RLIMIT_CTLR option. Prior +to reading this documentation please read Documentation/cgroups.txt and +Documentation/controllers/memory.txt. Several of the principles of this +controller are similar to the memory resource controller. + +This controller framework is designed to be extensible to control any +resource limit (memory related) with little effort. + +This new controller, controls the address space expansion of the tasks +belonging to a cgroup. Address space control is provided along the same lines as +RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2). +The interface for controlling address space is provided through +"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user +interface. Please see section 3 of the memory resource controller documentation +for more details on how to use the user interface to get and set values. + +The "rlimit.usage_in_bytes" file provides information about the total address +space usage of the tasks in the cgroup, in bytes. + +Advantages of providing this feature + +1. Control over virtual address space allows for a cgroup to fail gracefully + i.e., via a malloc or mmap failure as compared to OOM kill when no + pages can be reclaimed. +2. It provides better control over how many pages can be swapped out when + the cgroup goes over its limit. A badly setup cgroup can cause excessive + swapping. Providing control over the address space allocations ensur...
On Sun, 04 May 2008 03:08:25 +0530 Finally, with a bit of between-the-line reading, I begin to understand what this stuff is actually supposed to do. It puts an upper limit upon the _total_ address-space size of all the mms which are contained within the resource group, yes? (can am mm be shared by two threads whcih are in different resource groups, Here's another missing piece: what is the kernel's behaviour when such a limit is increased? Seems that the sole option is a failure return from mmap/brk/sbrk/etc, yes? This should be spelled out in careful detail, please. This is a newly-proposed kernel<->userspace interface and we care about those very much. Finally, I worry about overflows. afacit the sum-of-address-space-sizes-for-a-cgroup is accounted for in an unsigned long? If so, a 32-bit machine could easily overflow it. And a 64-bit machine could possibly do so with a bit of effort, perhaps? That's assuming that the code doesn't attempt to avoid duplicate accounting due to multiple-mms-mapping-the-same-pages, which afaict appears to be the case. (Then again, perhaps no machine will ever have the pagetable space to get that far). Ho hum, I had to do rather a lot of guesswork here to try to understand your proposed overall design for this feature. I'd prefer to hear about your design via more direct means. --
Do you have any suggestions on how to do that better. Would you like documentation to be the first patch in the series? I had sent out two RFC's earlier and got comments and feedback from several people. Having said that, I do agree that design is the most vital thing of any patchset and communicating that up front and better is critical. I am open to any suggestions to help make that process better. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I do like to see the overall what-i-am-setting-out-to-do description in there somewhere - sometimes a Docuemtation/ file is appropriate, other times do it via changelog. But the first part of the review is reviewing whatever it is which you set out to achieve. Once that's understood and sounds like a good idea then we can start looking at how you did it. --
I think having documentation upfront does make sense in that case. I'll also try and make the changelogs more verbose. I usually try to point to the previous Yes, I agree. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special
cases that are handled here as well (arch/ia64/kernel/perform.c,
arch/x86/kernel/ptrace.c, insert_special_mapping())
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
arch/ia64/kernel/perfmon.c | 6 ++
arch/x86/kernel/ds.c | 10 ++++
fs/exec.c | 4 +
include/linux/rlimitcgroup.h | 20 +++++++++
kernel/fork.c | 12 +++++
mm/mmap.c | 11 +++++
mm/rlimitcgroup.c | 87 +++++++++++++++++++++++++++++++++++++++++++
7 files changed, 149 insertions(+), 1 deletion(-)
diff -puN mm/rlimitcgroup.c~rlimit-controller-address-space-accounting mm/rlimitcgroup.c
--- linux-2.6.25/mm/rlimitcgroup.c~rlimit-controller-address-space-accounting 2008-05-04 02:53:20.000000000 +0530
+++ linux-2.6.25-balbir/mm/rlimitcgroup.c 2008-05-04 02:53:20.000000000 +0530
@@ -44,6 +44,40 @@ struct rlimit_cgroup *rlimit_cgroup_from
struct rlimit_cgroup, css);
}
+struct rlimit_cgroup *rlimit_cgroup_from_task(struct task_struct *p)
+{
+ return container_of(task_subsys_state(p, rlimit_cgroup_subsys_id),
+ struct rlimit_cgroup, css);
+}
+
+int rlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+ int ret;
+ struct rlimit_cgroup *rcg;
+
+ rcu_read_lock();
+ rcg = rlimit_cgroup_from_task(rcu_dereference(mm->owner));
+ css_get(&rcg->css);
+ rcu_read_unlock();
+
+ ret = res_counter_charge(&rcg->as_res, (nr_pages << PAGE_SHIFT));
+ css_put(&rcg->css);
+ return ret;
+}
+
+void rlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+ struct rlimit_cgroup *rcg;
+
+ rcu_read_lock();
+ rcg = rlimit_cgroup_from_task(rcu_derefe...The basic idea of the patches looks fine (apart from some synchronization issues) but Is calling this the "rlimit" controller a great idea? That implies that it handles all (or at least many) of the things that setrlimit()/getrlimit() handle. While some of the other rlimit things definitely do make sense as cgroup controllers, putting them all in the same controller doesn't really - paying for the address-space tracking overhead just to get, say, the equivalent of RLIMIT_NPROC (max tasks) isn't a great idea. Can you instead give this a name that somehow refers to virtual address space limits, e.g. "va" or "as". That would still fit if you expanded it to deal with locked virtual address space limits too. I think that an "rlimit" controller would probably be best for representing just those limits that don't really make sense when aggregated across different tasks, but apply separately to each task (e.g. RLIMIT_FSIZE, RLIMIT_CORE, RLIMIT_NICE, RLIMIT_NOFILE, RLIMIT_RTPRIO, RLIMIT_STACK, RLIMIT_SIGPENDING, and maybe RLIMIT_CPU), in order to provide an easy way to change these limits on a group of running tasks. On a separate note for the address-space tracking, ideally the subsystem would track whether or not it was bound to a hierarchy, and skip charging/uncharging if not. That way there's no (noticeable) overhead for compiling in the subsystem but not using it. At the point when the subsystem was bound to a hierarchy, it could at that point run through all mms and charge each one's existing address space to the appropriate cgroup. (Currently that would only be the root cgroup in the hierarchy). Paul --
I currently intend to use this controller for controlling memory related rlimits, like address space and mlock'ed memory. How about we use something like Good suggestion, but it will be hard if not impossible to account the data correctly as it changes, if we do the accounting/summation at bind time. We'll need a really big lock to do it, something I want to avoid. Did you have something else in mind? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Yes, it'll be tricky but I think worthwhile. I believe it can be done
without the charge/uncharge code needing to take a global lock, except
for when we're actually binding/unbinding, with careful use of RCU.
My first thought for how to do this was that we have a field
"bind_transition" that indicates whether we're transitioning between
bound and unbound, and a bind_mutex. By default the charge/unpath uses
RCU, but by marking that we're in a transition state, the charge path
will use the mutex instead. By waiting for all existing chargers that
are using RCU to exit, we can then take the lock and synchronize with
the chargers.
So the charge/uncharge path would do:
rcu_read_lock();
if (ss->tranistioning) {
rcu_read_unlock();
locked = 1;
mutex_lock(&ss->bind_mutex);
}
if (ss->active) {
/* do charge/uncharge stuff, which must not block */
}
if (locked) {
mutex_unlock(&ss->bind_mutex);
} else {
rcu_read_unlock();
}
and the bind path would do something like:
ss->transitioning = 1;
synchronize_rcu();
mutex_lock(&ss->bind_mutex);
for_each_mm(mm) {
down_read(&mm->mmap_sem);
add_charge_for_mm();
up_read(&mm->mmap_sem);
}
mutex_unlock(&ss->bind_mutex);
ss->transitioning = 0;
But this would break because we're nesting mmap_sem inside bind_mutex
in the bind path, but in the charge path we're nesting bind_mutex
inside mmap_sem. So we'd probably need to define a new bit
MMF_RLIMIT_ACCOUNTED in mm->flags to indicate whether that mm's
address space usage is accounted for. Once we've done that, we can use
mmap_sem to synchronize changes to the per-mm charged status for free,
since we already hold mmap_sem whenever we're doing the charging,
right? So it becomes simple:
charge path:
if (!test_bit(MMF_RLIMIT_ACCOUNTED, &mm->flags))
return 0;
/* do charge/uncharge stuff */
bind path:
while((mm = find_unaccounted_mm()) {
down_write(&mm->mmap_se...[snip] This is an optimization that I am willing to consider later in the project. At first I want to focus on functionality. I would like to optimize once I know that the functionality has been well tested by a good base of users and make sure that the optimization is real. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
You need to synchronize against mm->owner changing, or
mm->owner->cgroups changing. How about:
int rlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
{
int ret;
struct rlimit_cgroup *rcg;
struct task_struct *owner;
rcu_read_lock();
again:
/* Find and lock the owner of the mm */
owner = rcu_dereference(mm->owner);
task_lock(owner);
if (mm->owner != owner) {
task_unlock(owner);
goto again;
}
/* Charge the owner's cgroup with the new memory */
rcg = rlimit_cgroup_from_task(owner);
ret = res_counter_charge(&rcg->as_res, (nr_pages << PAGE_SHIFT));
task_unlock(owner);
rcu_read_unlock();
return ret;
Can't this be implemented as just a call to charge() with a negative
value? (Possibly fixing res_counter_charge() to handle negative values
You mean disallow all movement within a hierarchy that has this cgroup
Since you need to protect against concurrent charges, and against
concurrent mm ownership changes, I think you should just do this under
Also needs to task_lock(p) to prevent concurrent charges or cgroup
reassignments?
Paul
--My mind goes blank at times, so forgive me asking, what happens if we don't use task_lock(). mm->owner cannot be freed, even if it changes, we get the callback in mm_owner_changed(). The locations from where we call _charge and _uncharge, I did that earlier, but then Pavel suggested splitting it up as charge/uncharge. Consider the following scenario We try to move task "t1" from cgroup "A" to cgroup "B". Doing so, causes "B" to go over it's limit, what do we do? Ideally, we would like to be able to go back to cgroups and say, please fail Please see my comment above on task_lock(). I'll draw some diagrams on the The callbacks are called with task_lock() held. Please see mm_update_next_owner->cgroup_mm_owner_callbacks() in kernel/exit.c -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I guess I'm concerned about a race like: A and B are threads in cgroup G, and C is a different process A->mm->owner == B A: enter rlimit_cgroup_charge_as() A: charge new page to G C: enter attach_task(newG, B) C: update B->cgroup to point to newG C: call memrlimit->attach(G, newG, B) C: charge mm->total_vm to newG C: uncharge mm->total_vm from G A: add new page to mm->total_vm Maybe this can be solved very simply by just taking mm->mmap_sem in rlimit_cgroup_move_task() and rlimit_cgroup_mm_owner_changed() ? Since mmap_sem is (I hope) held across all operations that change OK, that sounds reasonable - that's what the can_attach() callback is for. Paul --
No, I'd keep two calls - charge and uncharge. This makes you sure that the code xxx_charge(value) is a charge, regardless of what the "value" is. Besides, xxx_charge returns an error code, you need to check (BTW, I think we should add a __must_check attribute there), while uncharge does --
On Sun, 04 May 2008 03:08:14 +0530 I worry a bit about all the conversion between page-counts and byte-counts in this code. For example, what happens if a process sits there increasing its rss with sbrk(4095) or sbrk(4097) or all sorts of other scenarios? Do we get in a situation in which the accounting is systematically wrong? Worse, do we risk getting into that situation in the future, as unrelated changes are made to the surrounding code? IOW, have we chosen the best, most maintainable representation for these things? --
We already do all our accounting in pages for total_vm (field of mm_struct). task_vsize() for example multiplies PAGE_SIZE with total_vm before returning the I can't see that happening, but I'll look again and request reviewers to help me That's a good question. From the sustenance point of view, resource counters have worked really well so far. Abstracting accounting and tracking from the controllers has been a good thing. One of the goals of the rlimit controller is to keep it open for extension, so that others can add their own control for other resources like mlock'ed pages. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
There was discussion early on in the development of the memory controller that bytes were going to be used as the base unit for accounting. I had disagreed in favor of kB since page sizes are always in these increments and historically the kernel has exported statistics this way via /proc/meminfo. --
This patch adds an additional field to the mm_owner callbacks. This field
is required to get to the mm that changed.
Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
include/linux/cgroup.h | 3 ++-
kernel/cgroup.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff -puN kernel/cgroup.c~cgroup-add-task-to-mm--owner-callbacks kernel/cgroup.c
--- linux-2.6.25/kernel/cgroup.c~cgroup-add-task-to-mm--owner-callbacks 2008-05-04 02:53:05.000000000 +0530
+++ linux-2.6.25-balbir/kernel/cgroup.c 2008-05-04 02:53:05.000000000 +0530
@@ -2772,7 +2772,7 @@ void cgroup_mm_owner_callbacks(struct ta
if (oldcgrp == newcgrp)
continue;
if (ss->mm_owner_changed)
- ss->mm_owner_changed(ss, oldcgrp, newcgrp);
+ ss->mm_owner_changed(ss, oldcgrp, newcgrp, new);
}
}
}
diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm--owner-callbacks include/linux/cgroup.h
--- linux-2.6.25/include/linux/cgroup.h~cgroup-add-task-to-mm--owner-callbacks 2008-05-04 02:53:05.000000000 +0530
+++ linux-2.6.25-balbir/include/linux/cgroup.h 2008-05-04 02:53:05.000000000 +0530
@@ -310,7 +310,8 @@ struct cgroup_subsys {
*/
void (*mm_owner_changed)(struct cgroup_subsys *ss,
struct cgroup *old,
- struct cgroup *new);
+ struct cgroup *new,
+ struct task_struct *p);
int subsys_id;
int active;
int disabled;
_
--
Warm Regards,
Balbir Singh
Linux Technology Center
IBM, ISTL
--As Andrew suggested, can you improve the documentation? Ideally, there should be a paragraph in Documentation/cgroups.txt that describes the circumstances (including locking state) in which the callback is called. Paul --
On Sun, 04 May 2008 03:08:04 +0530 If mm_owner_changed() had any documentation I'd suggest that it be updated. Sneaky. The existing comment: /* * This routine is called with the task_lock of mm->owner held */ void (*mm_owner_changed)(struct cgroup_subsys *ss, struct cgroup *old, struct cgroup *new); Is rather mysterious. To what mm does it refer? --
No, there's no documentation besides the comments. I'll go ahead and update This callback is called when the mm->owner field that points to/owns a cgroup changes as a result of the owner exiting. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
This patch sets up the rlimit cgroup controller. It adds the basic create, destroy and populate functionality. The user interface provided is very similar to the memory resource controller. The rlimit controller can be enhanced easily in the future to control mlocked pages. Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com> --- include/linux/cgroup_subsys.h | 4 + include/linux/rlimitcgroup.h | 19 +++++ init/Kconfig | 10 ++ mm/Makefile | 1 mm/rlimitcgroup.c | 141 ++++++++++++++++++++++++++++++++++++++++++ 5 files changed, 175 insertions(+) diff -puN /dev/null mm/rlimitcgroup.c --- /dev/null 2008-05-03 22:12:13.033285313 +0530 +++ linux-2.6.25-balbir/mm/rlimitcgroup.c 2008-05-04 02:52:51.000000000 +0530 @@ -0,0 +1,141 @@ +/* + * Copyright
--
Yes, it can be. Thanks! -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
On Sun, 04 May 2008 03:07:36 +0530 Whatever this is doing, it should not be doing it this way ;) perhaps *tmp = ALIGN(*tmp, PAGE_SIZE); or even *tmp = PAGE_ALIGN(*tmp); ? <looks at PAGE_ALIGN> Each architecture implements its own version and they of course do it differently. It's crying out for a consolidated implementation but we have no include/linux/page.h into which to consolidate it. --
May be we can move this to asm-generic/page.h? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
