Re: [-mm][PATCH 4/4] Add memrlimit controller accounting and control (v4)

Previous thread: Badness seen on 2.6.26-rc2 with lockdep enabled by Balbir Singh on Wednesday, May 14, 2008 - 5:45 am. (2 messages)

Next thread: 2.6.26-rc1 sound regression by Wolfgang Erig on Wednesday, May 14, 2008 - 6:07 am. (4 messages)
From: Balbir Singh
Date: Wednesday, May 14, 2008 - 6:09 am

This is the fourth version of the address space control patches. These
patches are against 2.6.26-rc2-mm1  and have been tested using KVM in SMP mode,
both with and without the config enabled, on a powerpc box and using UML.
The patches have also been compile tested with the config disabled on a
powerpc box.

The goal of this patch is to implement a virtual address space controller
using cgroups. The documentation describes the controller, it's goal and
usage in further details.

The first patch adds the user interface. The second patch fixes the
cgroup mm_owner_changed callback to pass the task struct, so that
accounting can be adjusted on owner changes. The thrid patch adds accounting
and control. The fourth patch updates documentation.

An earlier post of the patchset can be found at
http://lwn.net/Articles/275143/

This patch is built on top of the mm owner patches and utilizes that feature
to virtually group tasks by mm_struct.

Reviews, Comments?

Changelog

1. Add documentation upfront
2. Refactor the code (error handling and changes to improvde code)
3. Protect reading of total_vm with mmap_sem
4. Port to 2.6.26-rc2
5. Changed the name from rlimit to memrlimit

Series

memrlimit-controller-add-documentation.patch
memrlimit-controller-setup.patch
cgroup-add-task-to-mm-owner-callbacks.patch
memrlimit-controller-address-space-accounting-and-control.patch

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Balbir Singh
Date: Wednesday, May 14, 2008 - 6:09 am

Documentation patch - describes the goals and usage of the memrlimit
controller.


Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 Documentation/controllers/memrlimit.txt |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff -puN /dev/null Documentation/controllers/memrlimit.txt
--- /dev/null	2008-05-14 04:27:30.032276540 +0530
+++ linux-2.6.26-rc2-balbir/Documentation/controllers/memrlimit.txt	2008-05-14 18:35:55.000000000 +0530
@@ -0,0 +1,29 @@
+This controller is enabled by the CONFIG_CGROUP_MEMRLIMIT_CTLR option. Prior
+to reading this documentation please read Documentation/cgroups.txt and
+Documentation/controllers/memory.txt. Several of the principles of this
+controller are similar to the memory resource controller.
+
+This controller framework is designed to be extensible to control any
+memory resource limit with little effort.
+
+This new controller, controls the address space expansion of the tasks
+belonging to a cgroup. Address space control is provided along the same lines as
+RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
+The interface for controlling address space is provided through
+"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user
+interface. Please see section 3 of the memory resource controller documentation
+for more details on how to use the user interface to get and set values.
+
+The "memrlimit.usage_in_bytes" file provides information about the total address
+space usage of the tasks in the cgroup, in bytes.
+
+Advantages of providing this feature
+
+1. Control over virtual address space allows for a cgroup to fail gracefully
+   i.e., via a malloc or mmap failure as compared to OOM kill when no
+   pages can be reclaimed.
+2. It provides better control over how many pages can be swapped out when
+   the cgroup goes over its limit. A badly setup cgroup can cause excessive
+   swapping. Providing control over the address space allocations ensures
+   ...
From: Li Zefan
Date: Wednesday, May 14, 2008 - 6:20 pm

From: Avi Kivity
Date: Thursday, May 15, 2008 - 11:22 am

Do you mean by this, limiting the number of pagetable pages (that are 
pinned in memory), this preventing oom by a cgroup that instantiates 
many pagetables?


-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

--

From: Balbir Singh
Date: Thursday, May 15, 2008 - 11:39 am

This is not for page tables (that is in the long term TODO list). This is more
for user space calls to mmap(), malloc() or anything that causes the total
virtual memory of the process to go up (in our case cgroups). The motivation is
similar to the motivations of RLIMIT_AS.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Balbir Singh
Date: Wednesday, May 14, 2008 - 6:09 am

This patch sets up the rlimit cgroup controller. It adds the basic create,
destroy and populate functionality. The user interface provided is very
similar to the memory resource controller. The rlimit controller can be
enhanced easily in the future to control mlocked pages.

Changelog v3->v4

1. Use PAGE_ALIGN()
2. Rename rlimit to memrlimit


Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/cgroup_subsys.h   |    4 +
 include/linux/memrlimitcgroup.h |   19 +++++
 init/Kconfig                    |   10 ++
 mm/Makefile                     |    1 
 mm/memrlimitcgroup.c            |  144 ++++++++++++++++++++++++++++++++++++++++
 5 files changed, 178 insertions(+)

diff -puN include/linux/cgroup_subsys.h~memrlimit-controller-setup include/linux/cgroup_subsys.h
--- linux-2.6.26-rc2/include/linux/cgroup_subsys.h~memrlimit-controller-setup	2008-05-14 18:36:36.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/cgroup_subsys.h	2008-05-14 18:36:36.000000000 +0530
@@ -47,4 +47,8 @@ SUBSYS(mem_cgroup)
 SUBSYS(devices)
 #endif
 
+#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
+SUBSYS(memrlimit_cgroup)
+#endif
+
 /* */
diff -puN /dev/null include/linux/memrlimitcgroup.h
--- /dev/null	2008-05-14 04:27:30.032276540 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h	2008-05-14 18:36:36.000000000 +0530
@@ -0,0 +1,19 @@
+/*
+ * Copyright 
From: Pavel Emelyanov
Date: Wednesday, May 14, 2008 - 6:29 am

Acked-by: Pavel Emelyanov <xemul@openvz.org>
--

From: Balbir Singh
Date: Wednesday, May 14, 2008 - 6:09 am

This patch adds an additional field to the mm_owner callbacks. This field
is required to get to the mm that changed.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/cgroup.h |    3 ++-
 kernel/cgroup.c        |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks include/linux/cgroup.h
--- linux-2.6.26-rc2/include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/cgroup.h	2008-05-14 18:36:59.000000000 +0530
@@ -310,7 +310,8 @@ struct cgroup_subsys {
 	 */
 	void (*mm_owner_changed)(struct cgroup_subsys *ss,
 					struct cgroup *old,
-					struct cgroup *new);
+					struct cgroup *new,
+					struct task_struct *p);
 	int subsys_id;
 	int active;
 	int disabled;
diff -puN kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks kernel/cgroup.c
--- linux-2.6.26-rc2/kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/cgroup.c	2008-05-14 18:36:59.000000000 +0530
@@ -2772,7 +2772,7 @@ void cgroup_mm_owner_callbacks(struct ta
 			if (oldcgrp == newcgrp)
 				continue;
 			if (ss->mm_owner_changed)
-				ss->mm_owner_changed(ss, oldcgrp, newcgrp);
+				ss->mm_owner_changed(ss, oldcgrp, newcgrp, new);
 		}
 	}
 }
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Balbir Singh
Date: Wednesday, May 14, 2008 - 6:09 am

This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special
cases that are handled here as well (arch/ia64/kernel/perform.c,
arch/x86/kernel/ptrace.c, insert_special_mapping())

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/ia64/kernel/perfmon.c      |    6 ++
 arch/x86/kernel/ptrace.c        |   17 +++++--
 fs/exec.c                       |    5 ++
 include/linux/memrlimitcgroup.h |   21 ++++++++
 kernel/fork.c                   |    8 +++
 mm/memrlimitcgroup.c            |   94 ++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                       |   11 ++++
 7 files changed, 157 insertions(+), 5 deletions(-)

diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
--- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control	2008-05-14 18:09:32.000000000 +0530
+++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c	2008-05-14 18:09:32.000000000 +0530
@@ -40,6 +40,7 @@
 #include <linux/capability.h>
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/errno.h>
 #include <asm/intrinsics.h>
@@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct
 
 	DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size));
 
+	if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT))
+		return -ENOMEM;
+
 	/*
 	 * check requested size to avoid Denial-of-service attacks
 	 * XXX: may have to refine this test
@@ -2313,6 +2317,7 @@ pfm_smpl_buffer_alloc(struct task_struct
 	smpl_buf = pfm_rvmalloc(size);
 	if (smpl_buf == NULL) {
 		DPRINT(("Can't allocate sampling ...
From: Balbir Singh
Date: Wednesday, May 14, 2008 - 6:25 am

* Balbir Singh <balbir@linux.vnet.ibm.com> [2008-05-14 18:39:51]:

Here's a better version of the patch. ptrace.c changes have a bug,
where uncharge is called unconditionally. We now check for ret < 0.
The two signed-off-by's by the same person did not look good either :)

Description

This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special
cases that are handled here as well (arch/ia64/kernel/perform.c,
arch/x86/kernel/ptrace.c, insert_special_mapping())



Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/ia64/kernel/perfmon.c      |    6 ++
 arch/x86/kernel/ptrace.c        |   18 +++++--
 fs/exec.c                       |    5 ++
 include/linux/memrlimitcgroup.h |   21 ++++++++
 kernel/fork.c                   |    8 +++
 mm/memrlimitcgroup.c            |   94 ++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                       |   11 ++++
 7 files changed, 158 insertions(+), 5 deletions(-)

diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
--- linux-2.6.26-rc2/arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control	2008-05-14 18:37:11.000000000 +0530
+++ linux-2.6.26-rc2-balbir/arch/ia64/kernel/perfmon.c	2008-05-14 18:37:11.000000000 +0530
@@ -40,6 +40,7 @@
 #include <linux/capability.h>
 #include <linux/rcupdate.h>
 #include <linux/completion.h>
+#include <linux/memrlimitcgroup.h>
 
 #include <asm/errno.h>
 #include <asm/intrinsics.h>
@@ -2294,6 +2295,9 @@ pfm_smpl_buffer_alloc(struct task_struct
 
 	DPRINT(("sampling buffer rsize=%lu size=%lu bytes\n", rsize, size));
 
+	if (memrlimit_cgroup_charge_as(mm, size >> PAGE_SHIFT))
+		return -ENOMEM;
+
 	/*
 	 * check requested size to avoid ...
From: Paul Menage
Date: Wednesday, May 14, 2008 - 7:25 pm

Assuming that we're holding a write lock on mm->mmap_sem here, and we
additionally hold mmap_sem for the whole of mm_update_next_owner(),
then maybe we don't need any extra synchronization here? Something
like simply:

int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
{
       struct memrlimit_cgroup *memrcg = memrlimit_cgroup_from_task(mm->owner);
       return res_counter_charge(&memrcg->as_res, (nr_pages << PAGE_SHIFT));
}

Seems good to minimize additional synchronization on the fast path.

The only thing that's still broken is that the task_struct.cgroups
pointer gets updated only under the synchronization of task_lock(), so
we've still got the race of:

A: attach_task() updates B->cgroups

B: memrlimit_cgroup_charge_as() charges the new res counter and
updates mm->total_vm

A: memrlimit_cgroup_move_task() moves mm->total_vm from the old
counter to the new counter

Here's one way I see to fix this:

We change attach_task() so that rather than updating the
task_struct.cgroups pointer once from the original css_set to the
final css_set, it goes through a series of intermediate css_set
structures, one for each subsystem in the hierarchy, transitioning
from the old set to the final set. Then for each subsystem ss, it
would do:

next_css = <old css with pointer for ss updated>
if (ss->attach) {
  ss->attach(ss, p, next_css);
} else {
  task_lock(p);
  rcu_assign_ptr(p->cgroups, next_css);
  task_unlock(p);
}

i.e. the subsystem would be free to implement any synchronization it
desired in the attach() code. The attach() method's responsibility
would be to ensure that p->cgroups was updated to point to next_css
before returning. This should make it much simpler for a subsystem to
handle potential races between attach() and accounting. The current
semantics of can_attach()/update/attach() are sufficient for cpusets,
but probably not for systems with more complex accounting. I'd still
need to figure out a nice way to get the kind of ...
From: Balbir Singh
Date: Wednesday, May 14, 2008 - 11:17 pm

The charge_as routine is not always called with mmap_sem held, since
the undo path gets more complicated under the lock. We already have
our own locking mechanism for the counters. We're not really accessing
any member of the mm here except the owner. Do we need to be called

A transaction manager would be great. We do the
mm_update_owner_changes under the task_lock(), may be the attach


Sure, I will. One of things we want to ensure is that mm->owner does




Taking the mmap_sem here would mean, we would need to document
(something I should have done earlier), that mmap_sem nests under
cgroup_mutex


Looking at the mm->owner patches, I am thinking of writing down a race
scenario card


        Race conditions

        R1: mm->owner can change dynamically under task_lock
        R2: mm->owner's cgroup can change under cgroup_mutex

                        Read            Write

        mm->owner       Prevent         hold task_lock 
                        cgroup from
                        changing

                        Prevent owner
                        from changing

        Scenarios requiring protection/consistency

        R1: causes no problem, since we expect to make appropriate
            adjustment in mm_owner_changed
        R2: Is handled by the attach() callback

        Which leaves us with the following conclusion

        We don't have move_task(), mm_owner_changed() and charge/uncharge()
        running in parallel at the same time.

If we agree with the assertion/conclusion above, then a simple lock
might be able to protect us, assuming that it does not create a






-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Wednesday, May 14, 2008 - 11:55 pm

On Wed, May 14, 2008 at 11:17 PM, Balbir Singh

I'm not worried about the counters themselves being inconsistent - I'm
worried about the case where charge_as() is called in the middle of
the attach operation, and we account the charge X to the new cgroup's
res_counter and update mm->total_vm, and then when we do the move, we
charge the whole of mm->total_mm to the new cgroup even though the
last charge was already accounted to the new res_counter, not the old
one.

That's what I'm hoping to address with the idea of splitting the
attach into one update per subsystem, and letting the subsystems

Not necessarily mmap_sem, but there needs to be something to ensure
that the update to mm->total_vm and the charge/uncharge against the
res_counter are an atomic pair with respect to the code that shifts an
mm between two cgroups, either due to mm->owner change or due to an
attach_task(). Since mmap_sem is held for write on almost all the fast
path calls to the rlimit_as charge/uncharge functions, using that for
the synchronization avoids the need for any additional synchronization
in the fast path.

Can you say more about the complications of holding a write lock on

Yes - but new_css_set will be slightly different for each callback.
Specifically, it will differ from the existing set pointed to by
p->cgroups in the pointer for this particular subsystem. So the task
will move over in a staggered fashion, and each subsystem will get to


Right - and if we can make that lock be the mmap_sem of the mm in
question, we avoid introducing a new lock into the fast path.

Paul
--

From: Balbir Singh
Date: Thursday, May 15, 2008 - 12:03 am

I want to focus on this conclusion/assertion, since it takes care of
most of the locking related discussion above, unless I missed
something.

My concern with using mmap_sem, is that

1. It's highly contended (every page fault, vma change, etc)
2. It's going to make the locking hierarchy deeper and complex
3. It's not appropriate to call all the accounting callbacks with
   the mmap_sem() held, since the undo operations _can get_ complicated
   at the caller.

I would prefer introducing a new lock, so that other subsystems are
not affected. 

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Thursday, May 15, 2008 - 12:39 am

On Thu, May 15, 2008 at 12:03 AM, Balbir Singh

But the only *new* cases of taking the mmap_sem that this would
introduce would be:

- on a failed vm limit charge
- when a task exit/exec causes an mm ownership change
- when a task moves between two cgroups in the memrlimit hierarchy.

All of these should be rare events, so I don't think the additional

Yes, potentially. But if the upside of that is that we eliminate a
lock/unlock on a shared lock on every mmap/munmap call, it might well


For getting the first cut of the memrlimit controller working this may
well make sense. But it would be nice to avoid it longer-term.

Paul
--

From: Balbir Singh
Date: Thursday, May 15, 2008 - 1:25 am

Why a failed charge? Aren't we talking of moving all charge/uncharge


Yes, this would nest cgroup_mutex and mmap_sem. Not sure if that would

We do make several of all charge calls under the mmap_sem, but not

Some paths of the uncharge are not under mmap_sem. Undoing the

OK, so here's what I am going to try and do

Refactor the code to try and use mmap_sem and see what I come up
with. Basically use mmap_sem for all charge/uncharge operations as
well use mmap_sem in read_mode in the move_task() and
mm_owner_changed() callbacks. That should take care of the race
conditions discussed, unless I missed something.
Try and instrument insert_vm_struct() for charge/uncharge

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Paul Menage
Date: Thursday, May 15, 2008 - 8:28 am

Sorry, I worded that wrongly - I meant "cleaning up a successful
charge after an expansion fails for other reasons"

I thought that all the charges and most of the uncharges were already
under mmap_sem, and it would just be a few of the cleanup paths that

I think it's already nested that way - e.g. the cpusets code can call
various migration functions (which take mmap_sem) while holding

Sounds good.

Thanks,

Paul
--

From: Balbir Singh
Date: Thursday, May 15, 2008 - 10:01 am

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Balbir Singh
Date: Saturday, May 17, 2008 - 1:15 pm

I've revamped the last two patches. Please review


This patch adds an additional field to the mm_owner callbacks. This field
is required to get to the mm that changed. Hold mmap_sem in write mode
before calling the mm_owner_changed callback

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 include/linux/cgroup.h |    3 ++-
 kernel/cgroup.c        |    4 +++-
 kernel/exit.c          |    3 +++
 3 files changed, 8 insertions(+), 2 deletions(-)

diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks include/linux/cgroup.h
--- linux-2.6.26-rc2/include/linux/cgroup.h~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/cgroup.h	2008-05-14 18:36:59.000000000 +0530
@@ -310,7 +310,8 @@ struct cgroup_subsys {
 	 */
 	void (*mm_owner_changed)(struct cgroup_subsys *ss,
 					struct cgroup *old,
-					struct cgroup *new);
+					struct cgroup *new,
+					struct task_struct *p);
 	int subsys_id;
 	int active;
 	int disabled;
diff -puN kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks kernel/cgroup.c
--- linux-2.6.26-rc2/kernel/cgroup.c~cgroup-add-task-to-mm-owner-callbacks	2008-05-14 18:36:59.000000000 +0530
+++ linux-2.6.26-rc2-balbir/kernel/cgroup.c	2008-05-17 22:09:57.000000000 +0530
@@ -2758,6 +2758,8 @@ void cgroup_fork_callbacks(struct task_s
  * Called on every change to mm->owner. mm_init_owner() does not
  * invoke this routine, since it assigns the mm->owner the first time
  * and does not change it.
+ *
+ * The callbacks are invoked with mmap_sem held in read mode.
  */
 void cgroup_mm_owner_callbacks(struct task_struct *old, struct task_struct *new)
 {
@@ -2772,7 +2774,7 @@ void cgroup_mm_owner_callbacks(struct ta
 			if (oldcgrp == newcgrp)
 				continue;
 			if (ss->mm_owner_changed)
-				ss->mm_owner_changed(ss, oldcgrp, newcgrp);
+				ss->mm_owner_changed(ss, oldcgrp, newcgrp, new);
 		}
 	}
 }
diff -puN kernel/exit.c~cgroup-add-task-to-mm-owner-callbacks ...
From: Balbir Singh
Date: Saturday, May 17, 2008 - 1:17 pm

Here's the last patch for review


This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap(). 

Changelog v5->v4

Move specific hooks in code to insert_vm_struct
Use mmap_sem to protect mm->owner from changing and mm->owner from
changing cgroups.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 arch/x86/kernel/ptrace.c        |   18 +++++--
 include/linux/memrlimitcgroup.h |   21 +++++++++
 kernel/fork.c                   |    8 +++
 mm/memrlimitcgroup.c            |   92 ++++++++++++++++++++++++++++++++++++++++
 mm/mmap.c                       |   17 ++++++-
 5 files changed, 149 insertions(+), 7 deletions(-)

diff -puN arch/ia64/kernel/perfmon.c~memrlimit-controller-address-space-accounting-and-control arch/ia64/kernel/perfmon.c
diff -puN arch/x86/kernel/ds.c~memrlimit-controller-address-space-accounting-and-control arch/x86/kernel/ds.c
diff -puN fs/exec.c~memrlimit-controller-address-space-accounting-and-control fs/exec.c
diff -puN include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control include/linux/memrlimitcgroup.h
--- linux-2.6.26-rc2/include/linux/memrlimitcgroup.h~memrlimit-controller-address-space-accounting-and-control	2008-05-17 23:14:53.000000000 +0530
+++ linux-2.6.26-rc2-balbir/include/linux/memrlimitcgroup.h	2008-05-17 23:14:53.000000000 +0530
@@ -16,4 +16,25 @@
 #ifndef LINUX_MEMRLIMITCGROUP_H
 #define LINUX_MEMRLIMITCGROUP_H
 
+#ifdef CONFIG_CGROUP_MEMRLIMIT_CTLR
+
+int memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages);
+void memrlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages);
+
+#else /* !CONFIG_CGROUP_RLIMIT_CTLR */
+
+static inline int
+memrlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	return ...
From: Pavel Emelyanov
Date: Wednesday, May 14, 2008 - 6:32 am

AFAIS you didn't cover all the cases when VM expands. At least all
the arch/ia64/ia32/binfmt_elf32.c is missed.

I'd insert this charge into insert_vm_struct. This would a) cover
all of the missed cases and b) reduce the amount of places to patch.

[snip the rest of the patch]
--

From: Balbir Singh
Date: Wednesday, May 14, 2008 - 12:39 pm

I thought I have those covered. insert_vm_struct() is called from
places that we have covered in this patch. As far as
arch/ia64/ia32/binfmt_elf32.c is concerned, it inserts a GDT, LDT
and does not change total_vm. Having said that, I am not against
putting the hooks in insert_vm_struct().
 

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--

From: Andrew Morton
Date: Thursday, September 18, 2008 - 1:54 pm

On Wed, 14 May 2008 18:39:51 +0530


Large changes in linux-next's arch/x86/kernel/ptrace.c caused damage to
the memrlimit patches.

I decided to retain the patches because it looks repairable.  The
problem is this reject from
memrlimit-add-memrlimit-controller-accounting-and-control.patch:

***************
*** 808,828 ****
  
  	current->mm->total_vm  -= old_size;
  	current->mm->locked_vm -= old_size;
  
  	if (size == 0)
  		goto out;
  
  	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
  	vm = current->mm->total_vm  + size;
  	if (rlim < vm) {
  		ret = -ENOMEM;
  
  		if (!reduce_size)
- 			goto out;
  
  		size = rlim - current->mm->total_vm;
  		if (size <= 0)
- 			goto out;
  	}
  
  	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
--- 809,833 ----
  
  	current->mm->total_vm  -= old_size;
  	current->mm->locked_vm -= old_size;
+ 	memrlimit_cgroup_uncharge_as(mm, old_size);
  
  	if (size == 0)
  		goto out;
  
+ 	if (memrlimit_cgroup_charge_as(current->mm, size))
+ 		goto out;
+ 
  	rlim = current->signal->rlim[RLIMIT_AS].rlim_cur >> PAGE_SHIFT;
  	vm = current->mm->total_vm  + size;
  	if (rlim < vm) {
  		ret = -ENOMEM;
  
  		if (!reduce_size)
+ 			goto out_uncharge;
  
  		size = rlim - current->mm->total_vm;
  		if (size <= 0)
+ 			goto out_uncharge;
  	}
  
  	rlim = current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur >> PAGE_SHIFT;
***************
*** 831,851 ****
  		ret = -ENOMEM;
  
  		if (!reduce_size)
- 			goto out;
  
  		size = rlim - current->mm->locked_vm;
  		if (size <= 0)
- 			goto out;
  	}
  
  	ret = ds_allocate((void **)&child->thread.ds_area_msr,
  			  size << PAGE_SHIFT);
  	if (ret < 0)
- 		goto out;
  
  	current->mm->total_vm  += size;
  	current->mm->locked_vm += size;
  
  out:
  	if (child->thread.ds_area_msr)
  		set_tsk_thread_flag(child, TIF_DS_AREA_MSR);
--- 836,859 ----
  		ret = -ENOMEM;
  
  		if (!reduce_size)
+ 			goto out_uncharge;
  
  ...
From: Balbir Singh
Date: Thursday, September 18, 2008 - 4:55 pm

I'll take a look tonight and see what needs to be done

-- 
	Balbir
--

From: Balbir Singh
Date: Thursday, September 18, 2008 - 11:38 pm

Andrew,

I could not apply mmotm to linux-next (both downloaded right now). I
applied the patches one-by-one resolving differences starting from #mm
in the series file.

Here is my fixed version of the patch, I compiled the patch, but could
not run it, since I could not create the full series of applied
patches. I compiled arch/x86/kernel/ds.o and ptrace.o. I've included
the patch below, please let me know if the code looks OK (via review)
and the patch applies. I'll test it once I can resonably resolve all
conflicts between linux-next and mmotm.


From: Balbir Singh <balbir@linux.vnet.ibm.com>

This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap().

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: Sudhir Kumar <skumar@linux.vnet.ibm.com>
Cc: YAMAMOTO Takashi <yamamoto@valinux.co.jp>
Cc: Paul Menage <menage@google.com>
Cc: Li Zefan <lizf@cn.fujitsu.com>
Cc: Pavel Emelianov <xemul@openvz.org>
Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: David Rientjes <rientjes@google.com>
Cc: Vivek Goyal <vgoyal@redhat.com>
Cc: Hugh Dickins <hugh@veritas.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 arch/x86/kernel/ptrace.c        |    1 
 include/linux/memrlimitcgroup.h |   21 ++++++
 kernel/fork.c                   |   14 ++++
 mm/memrlimitcgroup.c            |   92 ++++++++++++++++++++++++++++++
 mm/mmap.c                       |   17 ++++-
 5 files changed, 142 insertions(+), 3 deletions(-)

Index: linux-next/include/linux/memrlimitcgroup.h
===================================================================
--- linux-next.orig/include/linux/memrlimitcgroup.h	2008-09-18 23:19:26.000000000 -0700
+++ linux-next/include/linux/memrlimitcgroup.h	2008-09-18 ...
From: Andrew Morton
Date: Friday, September 19, 2008 - 1:14 pm

On Thu, 18 Sep 2008 23:38:23 -0700

mmotm includes linux-next.patch.  mmotm is based upon the most recent
2.6.x-rcy.

This is the only way to do it - I often have to change linux-next.patch
due to rejects and it's unreasonable to expect people to base off the

OK, we'll give it a shot, thanks.

--

From: Balbir Singh
Date: Friday, September 19, 2008 - 2:28 pm

Thanks for the info


Thanks!

-- 
	Balbir
--

Previous thread: Badness seen on 2.6.26-rc2 with lockdep enabled by Balbir Singh on Wednesday, May 14, 2008 - 5:45 am. (2 messages)

Next thread: 2.6.26-rc1 sound regression by Wolfgang Erig on Wednesday, May 14, 2008 - 6:07 am. (4 messages)