login
Header Space

 
 

Re: [-mm][PATCH 4/4] Add rlimit controller documentation

Previous thread: [patch] video: build fix for drivers/media/video/au0828 by Ingo Molnar on Saturday, May 3, 2008 - 5:23 pm. (1 message)

Next thread: [patch] irda: fix !PNP support for drivers/net/irda/smsc-ircc2.c by Ingo Molnar on Saturday, May 3, 2008 - 6:26 pm. (1 message)
To: <linux-mm@...>
Cc: Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, May 3, 2008 - 5:37 pm

This is the third version of the address space control patches. These
patches are against 2.6.25-mm1  and have been tested using KVM in SMP mode,
both with and without the config enabled.

The first patch adds the user interface. The second patch fixes the
cgroup mm_owner_changed callback to pass the task struct, so that
accounting can be adjusted on owner changes. The thrid patch adds accounting
and control. The fourth patch updates documentation.

An earlier post of the patchset can be found at
http://lwn.net/Articles/275143/

This patch is built on top of the mm owner patches and utilizes that feature
to virtually group tasks by mm_struct.

Reviews, Comments?

Series

rlimit-controller-setup.patch
cgroup-add-task-to-mm--owner-callbacks.patch
rlimit-controller-address-space-accounting.patch
rlimit-controller-add-documentation.patch

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Sunday, May 4, 2008 - 11:24 am

I can't read the whole patch deeply now but this new concept "rlimit-controlle
r" seems make sense to me.

At quick glance, I have some thoughts.

1. kerner/rlimit_cgroup.c is better for future expansion.
2. why 
   "+This controller framework is designed to be extensible to control any
   "+resource limit (memory related) with little effort."
   memory only ? Ok, all you want to do is related to memory, but someone
   may want to limit RLIMIT_CPU by group or RLIMIT_CORE by group or....
   (I have no plan but they seems useful.;)
   So, could you add design hint of rlimit contoller to the documentation ?
   
3. Rleated to 2. Showing what kind of "rlimit" params are supported by
   cgroup will be good.

I don't think you have to implement all things at once. Staring from
"only RLIMIT_AS is supported now" is good. Someone will expand it if
he needs. But showing basic view of "gerenal purpose rlimit contoller" in _doc
ument_ or _comments_ or _codes_ is a good thing to do.

If you don't want to provide RLIMIT feature other than address space,
it's better to avoid using the name of RLIMIT. It's confusing.

Thanks,
-Kame





--
To: <kamezawa.hiroyu@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>
Date: Monday, May 5, 2008 - 12:21 am

I currently mentioned memory, since we have the infrastructure to group using
mm-&gt;owner infrastructure. For other purposes, we'll need to enhance the


Do you mean in init/Kconfig or documentation?. I should probably rename
limit_in_bytes and usage_in_bytes to add an as_ prefix, so that the UI clearly


I used RLIMIT since I want to extend it later to control memory locked pages :)


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>
Date: Tuesday, May 6, 2008 - 9:09 pm

On Mon, 05 May 2008 09:51:19 +0530
I see.

Thanks,
-Kame

--
To: <kamezawa.hiroyu@...>
Cc: Balbir Singh <balbir@...>, <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Sunday, May 4, 2008 - 11:27 am

...RLIMIT_MEMLOCK is in my want-to-do-list ;)

Thanks,
-Kame
--
To: <kamezawa.hiroyu@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>
Date: Monday, May 5, 2008 - 12:24 am

Mine too, but I want to get to it after the hierarchy and soft limit patches.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <linux-mm@...>
Cc: Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, May 3, 2008 - 5:38 pm

This is the documentation patch. It describes the rlimit controller and how
to build and use it.

Signed-off-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
---

 Documentation/controllers/rlimit.txt |   29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff -puN /dev/null Documentation/controllers/rlimit.txt
--- /dev/null	2008-05-03 22:12:13.033285313 +0530
+++ linux-2.6.25-balbir/Documentation/controllers/rlimit.txt	2008-05-04 03:06:06.000000000 +0530
@@ -0,0 +1,29 @@
+This controller is enabled by the CONFIG_CGROUP_RLIMIT_CTLR option. Prior
+to reading this documentation please read Documentation/cgroups.txt and
+Documentation/controllers/memory.txt. Several of the principles of this
+controller are similar to the memory resource controller.
+
+This controller framework is designed to be extensible to control any
+resource limit (memory related) with little effort.
+
+This new controller, controls the address space expansion of the tasks
+belonging to a cgroup. Address space control is provided along the same lines as
+RLIMIT_AS control, which is available via getrlimit(2)/setrlimit(2).
+The interface for controlling address space is provided through
+"rlimit.limit_in_bytes". The file is similar to "limit_in_bytes" w.r.t. the user
+interface. Please see section 3 of the memory resource controller documentation
+for more details on how to use the user interface to get and set values.
+
+The "rlimit.usage_in_bytes" file provides information about the total address
+space usage of the tasks in the cgroup, in bytes.
+
+Advantages of providing this feature
+
+1. Control over virtual address space allows for a cgroup to fail gracefully
+   i.e., via a malloc or mmap failure as compared to OOM kill when no
+   pages can be reclaimed.
+2. It provides better control over how many pages can be swapped out when
+   the cgroup goes over its limit. A badly setup cgroup can cause excessive
+   swapping. Providing control over the address space allocations ensur...
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <balbir@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 6:35 pm

On Sun, 04 May 2008 03:08:25 +0530

Finally, with a bit of between-the-line reading, I begin to understand what
this stuff is actually supposed to do.

It puts an upper limit upon the _total_ address-space size of all the mms
which are contained within the resource group, yes?

(can am mm be shared by two threads whcih are in different resource groups,

Here's another missing piece: what is the kernel's behaviour when such a
limit is increased?  Seems that the sole option is a failure return from
mmap/brk/sbrk/etc, yes?

This should be spelled out in careful detail, please.  This is a
newly-proposed kernel&lt;-&gt;userspace interface and we care about those very
much.

Finally, I worry about overflows.  afacit the
sum-of-address-space-sizes-for-a-cgroup is accounted for in an unsigned
long?

If so, a 32-bit machine could easily overflow it.

And a 64-bit machine could possibly do so with a bit of effort, perhaps? 
That's assuming that the code doesn't attempt to avoid duplicate accounting
due to multiple-mms-mapping-the-same-pages, which afaict appears to be the
case.  (Then again, perhaps no machine will ever have the pagetable space
to get that far).



Ho hum, I had to do rather a lot of guesswork here to try to understand
your proposed overall design for this feature.  I'd prefer to hear about
your design via more direct means.
--
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 1:39 am

Do you have any suggestions on how to do that better. Would you like
documentation to be the first patch in the series? I had sent out two RFC's
earlier and got comments and feedback from several people.

Having said that, I do agree that design is the most vital thing of any patchset
and communicating that up front and better is critical. I am open to any
suggestions to help make that process better.


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 1:54 am

I do like to see the overall what-i-am-setting-out-to-do description in
there somewhere - sometimes a Docuemtation/ file is appropriate, other
times do it via changelog.

But the first part of the review is reviewing whatever it is which you set
out to achieve.  Once that's understood and sounds like a good idea then we
can start looking at how you did it.


--
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 3:59 am

I think having documentation upfront does make sense in that case. I'll also try
and make the changelogs more verbose. I usually try to point to the previous

Yes, I agree.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <linux-mm@...>
Cc: Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, May 3, 2008 - 5:38 pm

This patch adds support for accounting and control of virtual address space
limits. The accounting is done via the rlimit_cgroup_(un)charge_as functions.
The core of the accounting takes place during fork time in copy_process(),
may_expand_vm(), remove_vma_list() and exit_mmap(). There are some special
cases that are handled here as well (arch/ia64/kernel/perform.c,
arch/x86/kernel/ptrace.c, insert_special_mapping())

Signed-off-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
---

 arch/ia64/kernel/perfmon.c   |    6 ++
 arch/x86/kernel/ds.c         |   10 ++++
 fs/exec.c                    |    4 +
 include/linux/rlimitcgroup.h |   20 +++++++++
 kernel/fork.c                |   12 +++++
 mm/mmap.c                    |   11 +++++
 mm/rlimitcgroup.c            |   87 +++++++++++++++++++++++++++++++++++++++++++
 7 files changed, 149 insertions(+), 1 deletion(-)

diff -puN mm/rlimitcgroup.c~rlimit-controller-address-space-accounting mm/rlimitcgroup.c
--- linux-2.6.25/mm/rlimitcgroup.c~rlimit-controller-address-space-accounting	2008-05-04 02:53:20.000000000 +0530
+++ linux-2.6.25-balbir/mm/rlimitcgroup.c	2008-05-04 02:53:20.000000000 +0530
@@ -44,6 +44,40 @@ struct rlimit_cgroup *rlimit_cgroup_from
 				struct rlimit_cgroup, css);
 }
 
+struct rlimit_cgroup *rlimit_cgroup_from_task(struct task_struct *p)
+{
+	return container_of(task_subsys_state(p, rlimit_cgroup_subsys_id),
+				struct rlimit_cgroup, css);
+}
+
+int rlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	int ret;
+	struct rlimit_cgroup *rcg;
+
+	rcu_read_lock();
+	rcg = rlimit_cgroup_from_task(rcu_dereference(mm-&gt;owner));
+	css_get(&amp;rcg-&gt;css);
+	rcu_read_unlock();
+
+	ret = res_counter_charge(&amp;rcg-&gt;as_res, (nr_pages &lt;&lt; PAGE_SHIFT));
+	css_put(&amp;rcg-&gt;css);
+	return ret;
+}
+
+void rlimit_cgroup_uncharge_as(struct mm_struct *mm, unsigned long nr_pages)
+{
+	struct rlimit_cgroup *rcg;
+
+	rcu_read_lock();
+	rcg = rlimit_cgroup_from_task(rcu_derefe...
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 11:29 pm

The basic idea of the patches looks fine (apart from some
synchronization issues) but Is calling this the "rlimit" controller a
great idea? That implies that it handles all (or at least many) of the
things that setrlimit()/getrlimit() handle.

While some of the other rlimit things definitely do make sense as
cgroup controllers, putting them all in the same controller doesn't
really - paying for the address-space tracking overhead just to get,
say, the equivalent of RLIMIT_NPROC (max tasks) isn't a great idea.

Can you instead give this a name that somehow refers to virtual
address space limits, e.g. "va" or "as". That would still fit if you
expanded it to deal with locked virtual address space limits too.

I think that an "rlimit" controller would probably be best for
representing just those limits that don't really make sense when
aggregated across different tasks, but apply separately to each task
(e.g. RLIMIT_FSIZE, RLIMIT_CORE, RLIMIT_NICE, RLIMIT_NOFILE,
RLIMIT_RTPRIO, RLIMIT_STACK, RLIMIT_SIGPENDING, and maybe RLIMIT_CPU),
in order to provide an easy way to change these limits on a group of
running tasks.

On a separate note for the address-space tracking, ideally the
subsystem would track whether or not it was bound to a hierarchy, and
skip charging/uncharging if not. That way there's no (noticeable)
overhead for compiling in the subsystem but not using it. At the point
when the subsystem was bound to a hierarchy, it could at that point
run through all mms and charge each one's existing address space to
the appropriate cgroup. (Currently that would only be the root cgroup
in the hierarchy).

Paul
--
To: Paul Menage <menage@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Thursday, May 8, 2008 - 10:35 am

I currently intend to use this controller for controlling memory related
rlimits, like address space and mlock'ed memory. How about we use something like

Good suggestion, but it will be hard if not impossible to account the data
correctly as it changes, if we do the accounting/summation at bind time. We'll
need a really big lock to do it, something I want to avoid. Did you have
something else in mind?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Thursday, May 8, 2008 - 5:45 pm

Yes, it'll be tricky but I think worthwhile. I believe it can be done
without the charge/uncharge code needing to take a global lock, except
for when we're actually binding/unbinding, with careful use of RCU.

My first thought for how to do this was that we have a field
"bind_transition" that indicates whether we're transitioning between
bound and unbound, and a bind_mutex. By default the charge/unpath uses
RCU, but by marking that we're in a transition state, the charge path
will use the mutex instead. By waiting for all existing chargers that
are using RCU to exit, we can then take the lock and synchronize with
the chargers.

So the charge/uncharge path would do:

  rcu_read_lock();
  if (ss-&gt;tranistioning) {
    rcu_read_unlock();
    locked = 1;
    mutex_lock(&amp;ss-&gt;bind_mutex);
  }
  if (ss-&gt;active) {
    /* do charge/uncharge stuff, which must not block */
  }
  if (locked) {
    mutex_unlock(&amp;ss-&gt;bind_mutex);
  } else {
    rcu_read_unlock();
  }

and the bind path would do something like:

ss-&gt;transitioning = 1;
synchronize_rcu();
mutex_lock(&amp;ss-&gt;bind_mutex);
for_each_mm(mm) {
  down_read(&amp;mm-&gt;mmap_sem);
  add_charge_for_mm();
  up_read(&amp;mm-&gt;mmap_sem);
}
mutex_unlock(&amp;ss-&gt;bind_mutex);
ss-&gt;transitioning = 0;

But this would break because we're nesting mmap_sem inside bind_mutex
in the bind path, but in the charge path we're nesting bind_mutex
inside mmap_sem. So we'd probably need to define a new bit
MMF_RLIMIT_ACCOUNTED in mm-&gt;flags to indicate whether that mm's
address space usage is accounted for. Once we've done that, we can use
mmap_sem to synchronize changes to the per-mm charged status for free,
since we already hold mmap_sem whenever we're doing the charging,
right? So it becomes simple:

charge path:

if (!test_bit(MMF_RLIMIT_ACCOUNTED, &amp;mm-&gt;flags))
  return 0;
/* do charge/uncharge stuff */

bind path:

while((mm = find_unaccounted_mm()) {
  down_write(&amp;mm-&gt;mmap_se...
To: Paul Menage <menage@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, May 9, 2008 - 9:35 am

[snip]

This is an optimization that I am willing to consider later in the project. At
first I want to focus on functionality. I would like to optimize once I know
that the functionality has been well tested by a good base of users and make
sure that the optimization is real.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 11:17 pm

You need to synchronize against mm-&gt;owner changing, or
mm-&gt;owner-&gt;cgroups changing. How about:

int rlimit_cgroup_charge_as(struct mm_struct *mm, unsigned long nr_pages)
{
  int ret;
  struct rlimit_cgroup *rcg;
  struct task_struct *owner;

  rcu_read_lock();
 again:

  /* Find and lock the owner of the mm */
  owner = rcu_dereference(mm-&gt;owner);
  task_lock(owner);
  if (mm-&gt;owner != owner) {
    task_unlock(owner);
    goto again;
  }

  /* Charge the owner's cgroup with the new memory */
  rcg = rlimit_cgroup_from_task(owner);
  ret = res_counter_charge(&amp;rcg-&gt;as_res, (nr_pages &lt;&lt; PAGE_SHIFT));
  task_unlock(owner);
  rcu_read_unlock();
  return ret;

Can't this be implemented as just a call to charge() with a negative
value? (Possibly fixing res_counter_charge() to handle negative values

You mean disallow all movement within a hierarchy that has this cgroup

Since you need to protect against concurrent charges, and against
concurrent mm ownership changes, I think you should just do this under

Also needs to task_lock(p) to prevent concurrent charges or cgroup
reassignments?

Paul
--
To: Paul Menage <menage@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Thursday, May 8, 2008 - 10:54 am

My mind goes blank at times, so forgive me asking, what happens if we don't use
task_lock(). mm-&gt;owner cannot be freed, even if it changes, we get the callback
in mm_owner_changed(). The locations from where we call _charge and _uncharge,

I did that earlier, but then Pavel suggested splitting it up as charge/uncharge.

Consider the following scenario

We try to move task "t1" from cgroup "A" to cgroup "B".
Doing so, causes "B" to go over it's limit, what do we do?
Ideally, we would like to be able to go back to cgroups and say, please fail

Please see my comment above on task_lock(). I'll draw some diagrams on the

The callbacks are called with task_lock() held. Please see
mm_update_next_owner-&gt;cgroup_mm_owner_callbacks() in kernel/exit.c

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Thursday, May 8, 2008 - 7:22 pm

I guess I'm concerned about a race like:

A and B are threads in cgroup G, and C is a different process
A-&gt;mm-&gt;owner == B

A: enter rlimit_cgroup_charge_as()
A: charge new page to G
C: enter attach_task(newG, B)
C: update B-&gt;cgroup to point to newG
C: call memrlimit-&gt;attach(G, newG, B)
C: charge mm-&gt;total_vm to newG
C: uncharge mm-&gt;total_vm from G
A: add new page to mm-&gt;total_vm

Maybe this can be solved very simply by just taking mm-&gt;mmap_sem in
rlimit_cgroup_move_task() and rlimit_cgroup_mm_owner_changed() ? Since
mmap_sem is (I hope) held across all operations that change

OK, that sounds reasonable - that's what the can_attach() callback is for.

Paul
--
To: Paul Menage <menage@...>
Cc: Balbir Singh <balbir@...>, <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Wednesday, May 7, 2008 - 1:59 am

No, I'd keep two calls - charge and uncharge. This makes you sure that
the code xxx_charge(value) is a charge, regardless of what the "value"
is. Besides, xxx_charge returns an error code, you need to check (BTW, I
think we should add a __must_check attribute there), while uncharge does

--
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <balbir@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 6:24 pm

On Sun, 04 May 2008 03:08:14 +0530

I worry a bit about all the conversion between page-counts and byte-counts
in this code.

For example, what happens if a process sits there increasing its rss with
sbrk(4095) or sbrk(4097) or all sorts of other scenarios?  Do we get in a
situation in which the accounting is systematically wrong?

Worse, do we risk getting into that situation in the future, as unrelated
changes are made to the surrounding code?

IOW, have we chosen the best, most maintainable representation for these
things?

--
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 1:34 am

We already do all our accounting in pages for total_vm (field of mm_struct).
task_vsize() for example multiplies PAGE_SIZE with total_vm before returning the

I can't see that happening, but I'll look again and request reviewers to help me

That's a good question. From the sustenance point of view, resource counters
have worked really well so far. Abstracting accounting and tracking from the
controllers has been a good thing. One of the goals of the rlimit controller is
to keep it open for extension, so that others can add their own control for
other resources like mlock'ed pages.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Andrew Morton <akpm@...>
Cc: Balbir Singh <balbir@...>, <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 6:32 pm

There was discussion early on in the development of the memory controller 
that bytes were going to be used as the base unit for accounting.  I had 
disagreed in favor of kB since page sizes are always in these increments 
and historically the kernel has exported statistics this way via 
/proc/meminfo.
--
To: <linux-mm@...>
Cc: Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, May 3, 2008 - 5:38 pm

This patch adds an additional field to the mm_owner callbacks. This field
is required to get to the mm that changed.

Signed-off-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
---

 include/linux/cgroup.h |    3 ++-
 kernel/cgroup.c        |    2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff -puN kernel/cgroup.c~cgroup-add-task-to-mm--owner-callbacks kernel/cgroup.c
--- linux-2.6.25/kernel/cgroup.c~cgroup-add-task-to-mm--owner-callbacks	2008-05-04 02:53:05.000000000 +0530
+++ linux-2.6.25-balbir/kernel/cgroup.c	2008-05-04 02:53:05.000000000 +0530
@@ -2772,7 +2772,7 @@ void cgroup_mm_owner_callbacks(struct ta
 			if (oldcgrp == newcgrp)
 				continue;
 			if (ss-&gt;mm_owner_changed)
-				ss-&gt;mm_owner_changed(ss, oldcgrp, newcgrp);
+				ss-&gt;mm_owner_changed(ss, oldcgrp, newcgrp, new);
 		}
 	}
 }
diff -puN include/linux/cgroup.h~cgroup-add-task-to-mm--owner-callbacks include/linux/cgroup.h
--- linux-2.6.25/include/linux/cgroup.h~cgroup-add-task-to-mm--owner-callbacks	2008-05-04 02:53:05.000000000 +0530
+++ linux-2.6.25-balbir/include/linux/cgroup.h	2008-05-04 02:53:05.000000000 +0530
@@ -310,7 +310,8 @@ struct cgroup_subsys {
 	 */
 	void (*mm_owner_changed)(struct cgroup_subsys *ss,
 					struct cgroup *old,
-					struct cgroup *new);
+					struct cgroup *new,
+					struct task_struct *p);
 	int subsys_id;
 	int active;
 	int disabled;
_

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 7:00 pm

As Andrew suggested, can you improve the documentation? Ideally, there
should be a paragraph in Documentation/cgroups.txt that describes the
circumstances (including locking state) in which the callback is
called.

Paul

--
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <balbir@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 6:15 pm

On Sun, 04 May 2008 03:08:04 +0530

If mm_owner_changed() had any documentation I'd suggest that it be updated.
Sneaky.

The existing comment:

	/*
	 * This routine is called with the task_lock of mm-&gt;owner held
	 */
	void (*mm_owner_changed)(struct cgroup_subsys *ss,
					struct cgroup *old,
					struct cgroup *new);

Is rather mysterious.  To what mm does it refer?
--
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 11:43 pm

No, there's no documentation besides the comments. I'll go ahead and update

This callback is called when the mm-&gt;owner field that points to/owns a cgroup
changes as a result of the owner exiting.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <linux-mm@...>
Cc: Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <lizf@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, May 3, 2008 - 5:37 pm

This patch sets up the rlimit cgroup controller. It adds the basic create,
destroy and populate functionality. The user interface provided is very
similar to the memory resource controller. The rlimit controller can be
enhanced easily in the future to control mlocked pages.

Signed-off-by: Balbir Singh &lt;balbir@linux.vnet.ibm.com&gt;
---

 include/linux/cgroup_subsys.h |    4 +
 include/linux/rlimitcgroup.h  |   19 +++++
 init/Kconfig                  |   10 ++
 mm/Makefile                   |    1 
 mm/rlimitcgroup.c             |  141 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 175 insertions(+)

diff -puN /dev/null mm/rlimitcgroup.c
--- /dev/null	2008-05-03 22:12:13.033285313 +0530
+++ linux-2.6.25-balbir/mm/rlimitcgroup.c	2008-05-04 02:52:51.000000000 +0530
@@ -0,0 +1,141 @@
+/*
+ * Copyright
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 9:31 pm

--
To: Li Zefan <lizf@...>
Cc: <linux-mm@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, Paul Menage <menage@...>, <linux-kernel@...>, David Rientjes <rientjes@...>, Pavel Emelianov <xemul@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, May 6, 2008 - 4:15 am

Yes, it can be. Thanks!

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: Balbir Singh <balbir@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <balbir@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 6:11 pm

On Sun, 04 May 2008 03:07:36 +0530

Whatever this is doing, it should not be doing it this way ;)

perhaps

	*tmp = ALIGN(*tmp, PAGE_SIZE);

or even

	*tmp = PAGE_ALIGN(*tmp);

?


&lt;looks at PAGE_ALIGN&gt;

Each architecture implements its own version and they of course do it
differently.  It's crying out for a consolidated implementation but we have
no include/linux/page.h into which to consolidate it.

--
To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <skumar@...>, <yamamoto@...>, <menage@...>, <lizf@...>, <linux-kernel@...>, <rientjes@...>, <xemul@...>, <kamezawa.hiroyu@...>
Date: Monday, May 5, 2008 - 11:40 pm

May be we can move this to asm-generic/page.h?


-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
Previous thread: [patch] video: build fix for drivers/media/video/au0828 by Ingo Molnar on Saturday, May 3, 2008 - 5:23 pm. (1 message)

Next thread: [patch] irda: fix !PNP support for drivers/net/irda/smsc-ircc2.c by Ingo Molnar on Saturday, May 3, 2008 - 6:26 pm. (1 message)
speck-geostationary