Re: [-mm] Add an owner to the mm_struct (v8)

To: Paul Menage <menage@...>, Pavel Emelianov <xemul@...>
Cc: Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Balbir Singh <balbir@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, April 4, 2008 - 4:05 am

Changelog v7
------------
1. Make mm_need_new_owner() more readable
2. Remove extra white space from init_task.h

Changelog v6
------------

1. Fix typos
2. Document the use of delay_group_leader()

Changelog v5
------------
Remove the hooks for .owner from init_task.h and move it to init/main.c

Changelog v4
------------
1. Release rcu_read_lock() after acquiring task_lock(). Also get a reference
   to the task_struct
2. Change cgroup mm_owner_changed callback to callback only if the
   cgroup of old and new task is different and to pass the old and new
   cgroups instead of task pointers
3. Port the patch to 2.6.25-rc8-mm1

Changelog v3
------------

1. Add mm->owner change callbacks using cgroups

This patch removes the mem_cgroup member from mm_struct and instead adds
an owner. This approach was suggested by Paul Menage. The advantage of
this approach is that, once the mm->owner is known, the cgroup can be
determined using the subsystem id. It also allows several control groups
that are virtually grouped by mm_struct to exist independently of the memory
controller, i.e., without adding a mem_cgroup-style member to mm_struct
for each controller.
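
The lookup described above can be sketched as a small userspace model. This
is not the kernel code from the patch: the struct layouts, the enum values
and the helper name mm_subsys_state() are all assumptions made for
illustration. It only shows why a single owner pointer in the mm is enough
for any controller to find its state.

```c
#include <stddef.h>

/* Hypothetical subsystem ids, for illustration only. */
enum subsys_id { MEM_SUBSYS, VM_SUBSYS, SUBSYS_COUNT };

struct subsys_state { const char *name; };

/* Per-task cgroup membership: one state pointer per subsystem. */
struct task {
    struct subsys_state *subsys[SUBSYS_COUNT];
};

/* The mm records only its owning task, not one pointer per controller. */
struct mm {
    struct task *owner;
};

/* Resolve the state of any subsystem for an mm through its owner. */
static struct subsys_state *mm_subsys_state(const struct mm *mm,
                                            enum subsys_id id)
{
    if (mm == NULL || mm->owner == NULL)
        return NULL;
    return mm->owner->subsys[id];
}
```

A new controller in this model only needs a new subsystem id; the mm
structure itself does not grow.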

A new config option CONFIG_MM_OWNER is added and the memory resource
controller selects this config option.

This patch also adds cgroup callbacks to notify subsystems when mm->owner
changes. The mm_owner_changed callback is called with the task_lock()
of the new task held and is called just prior to changing the mm->owner.
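
A minimal userspace sketch of that ordering, combined with the v4 changelog
rule that the callback fires only when the old and new owner sit in
different cgroups and receives the cgroups rather than the tasks. All names
here (mm_set_owner(), count_change(), the struct layouts) are hypothetical,
and the task_lock() that the real callback runs under is left out.

```c
#include <stddef.h>

struct cgroup { int id; };
struct task { struct cgroup *cgrp; };
struct mm { struct task *owner; };

typedef void (*owner_changed_fn)(struct cgroup *old_cgrp,
                                 struct cgroup *new_cgrp);

/* Example subscriber: counts how many notifications it received. */
static int owner_change_count;

static void count_change(struct cgroup *old_cgrp, struct cgroup *new_cgrp)
{
    (void)old_cgrp;
    (void)new_cgrp;
    owner_change_count++;
}

/* Switch the owner; returns 1 if the callback was invoked, 0 otherwise. */
static int mm_set_owner(struct mm *mm, struct task *new_owner,
                        owner_changed_fn notify)
{
    struct cgroup *old_cgrp = mm->owner ? mm->owner->cgrp : NULL;
    struct cgroup *new_cgrp = new_owner ? new_owner->cgrp : NULL;
    int notified = 0;

    /* Notify only across cgroup boundaries, before the owner changes. */
    if (notify != NULL && old_cgrp != new_cgrp) {
        notify(old_cgrp, new_cgrp);
        notified = 1;
    }
    mm->owner = new_owner;
    return notified;
}
```

Handing the mm between two tasks in the same cgroup costs one pointer
assignment in this model; only a move across cgroups pays for the callback.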

I am indebted to Paul Menage for his several reviews of this patchset
and for helping me make it lighter and simpler.

This patch was tested on a powerpc box; it was compiled with the
MM_OWNER config option both turned on and off.

After the thread group leader exits, it's moved to init_css_set by
cgroup_exit(), thus all future charges from running threads would
be redirected to the init_css_set's subsystem.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---

 fs/exec.c          ...
To: Balbir Singh <balbir@...>
Cc: Paul Menage <menage@...>, Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>
Date: Tuesday, April 8, 2008 - 8:42 pm

On Fri, 04 Apr 2008 13:35:44 +0530
I'm sorry for my laziness.

Why do_each_thread()? Is for_each_process() not enough?
(because of delay_group_leader().)

And what we have to test for the worst case is the following, right?
==
 1. create tons of threads.
 2. create a process which calls vfork().
 3. keep the child alive while the vfork() caller exits
==

Thanks,
-Kame

--
To: Balbir Singh <balbir@...>
Cc: <menage@...>, <xemul@...>, <hugh@...>, <skumar@...>, <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, <rientjes@...>, <balbir@...>, <kamezawa.hiroyu@...>
Date: Monday, April 7, 2008 - 6:09 pm

On Fri, 04 Apr 2008 13:35:44 +0530

Do we really want to offer this option to people?  It's rather a low-level
thing and it's likely to cause more confusion than it's worth.  Remember
that most kernels get to our users via kernel vendors - to what will they

Presumably they'll always be setting it to "y" if they are enabling cgroups

I suppose these should be __read_mostly.

--
To: Andrew Morton <akpm@...>
Cc: <menage@...>, <xemul@...>, <hugh@...>, <skumar@...>, <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, <rientjes@...>, <kamezawa.hiroyu@...>
Date: Monday, April 7, 2008 - 10:39 pm

I suspect that this kernel option will not be set explicitly. This option
will be selected by other config options (memory controller, swap namespace,

Yes, good point. I'll send out v9 with this fix.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: <menage@...>, <xemul@...>, <hugh@...>, <skumar@...>, <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, <rientjes@...>, <kamezawa.hiroyu@...>
Date: Monday, April 7, 2008 - 10:55 pm

I believe that the way to do this is to not give the option a `help'
section.  That makes it a Kconfig-internal-only thing.

--
To: Balbir Singh <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, April 4, 2008 - 4:12 am

And its uncharges, which is more of the problem I was getting at
earlier - surely when the mm is finally destroyed, all its virtual
address space charges will be uncharged from the root cgroup rather
than the correct cgroup, if we left the delayed group leader as the
owner? Which is why I think the group leader optimization is unsafe.

--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, April 4, 2008 - 4:28 am

It won't uncharge for the memory controller from the root cgroup since each page
has the mem_cgroup information associated with it. For other controllers,
they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh).

Not having the group leader optimization can introduce big overheads (consider
thousands of tasks, with the group leader being the first one to exit).

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, April 4, 2008 - 4:50 am

Can you test the overhead?

As long as we find someone to pass the mm to quickly, it shouldn't be
too bad - I think we're already optimized for that case. Generally the
group leader's first child will be the new owner, and any subsequent
times the owner exits, they're unlikely to have any children so
they'll go straight to the sibling check and pass the mm to the
parent's first child.

Unless they all exit in strict sibling order and hence pass the mm
along the chain one by one, we should be fine. And if that exit
ordering does turn out to be common, then simply walking the child and
sibling lists in reverse order to find a victim will minimize the
amount of passing.

One other thing occurred to me - what lock protects the child and
sibling links? I don't see any documentation anywhere, but from the
code it looks as though it's tasklist_lock rather than RCU - so maybe
we should be holding that with a read_lock(), at least for the first
two parts of the search? (The full thread search is RCU-safe).
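
The three-stage search can be modelled in userspace roughly as follows.
This is a sketch, not the patch's code: the struct fields, scan_list() and
pick_new_owner() are hypothetical names, and the locking raised above
(tasklist_lock for the child and sibling lists, RCU for the full scan) is
deliberately omitted.

```c
#include <stddef.h>

struct mm { int unused; };

struct task {
    struct task *parent;
    struct task *first_child;   /* head of this task's child list */
    struct task *next_sibling;  /* next entry in the parent's child list */
    struct mm *mm;
};

/* Return the first task in a child list that uses mm, skipping `skip`. */
static struct task *scan_list(struct task *head, const struct mm *mm,
                              const struct task *skip)
{
    for (struct task *t = head; t != NULL; t = t->next_sibling)
        if (t != skip && t->mm == mm)
            return t;
    return NULL;
}

/* Pick a new owner for exiting->mm: children, then siblings, then all. */
static struct task *pick_new_owner(struct task *exiting,
                                   struct task **all, size_t n)
{
    const struct mm *mm = exiting->mm;
    struct task *t;

    /* 1. The exiting task's own children. */
    t = scan_list(exiting->first_child, mm, exiting);
    if (t != NULL)
        return t;

    /* 2. Its siblings, i.e. the parent's child list. */
    if (exiting->parent != NULL) {
        t = scan_list(exiting->parent->first_child, mm, exiting);
        if (t != NULL)
            return t;
    }

    /* 3. Fall back to a full scan of every task. */
    for (size_t i = 0; i < n; i++)
        if (all[i] != exiting && all[i]->mm == mm)
            return all[i];

    return NULL;  /* no other user of this mm is left */
}
```

Walking the lists from the tail instead of the head, as suggested for the
strict-sibling-order case, would only change the iteration direction in
scan_list().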

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, April 4, 2008 - 5:25 am

Yes, it would be, but worth the trouble. Is it really critical to move a dead



Finding the next mm might not be all that bad, but doing it each time a task
exits can be an overhead, especially for large multi-threaded programs. This can
get severe if the new mm->owner belongs to a different cgroup, in which case we
need to use callbacks as well.

If half the threads belonged to a different cgroup and the new mm->owner kept
switching between cgroups, the overhead would be really high, with the callbacks

You are right about the read_lock()

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Friday, April 4, 2008 - 3:11 pm

It struck me that this whole group leader optimization is broken as it
stands since there could (in strange configurations) be multiple
thread groups sharing the same mm.

I wonder if we can't just delay the exit_mm() call of a group leader

Right, but we only have that overhead if we actually end up passing
the mm from one to another each time they exit. It would be
interesting to know in what order the threads in a large multi-threaded
process typically exit (when the main process exits and all the
threads die).

I guess it's likely to be one of:

- in thread creation order (i.e. in order of parent->children list),
in which case we should try to throw the mm to the parent's last child
- in reverse creation order, in which case we should try to throw the
mm to the parent's first child
- in random order depending on which threads the scheduler runs first
(in which case we can expect that a small fraction of the threads will

To me, it seems that setting up a *virtual address space* cgroup
hierarchy and then putting half your threads in one group and half in
another is asking for trouble. We need to not break in that
situation, but I'm not sure it's a case to optimize for.

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 10:47 am

Not sure about this one. I suspect keeping the group_leader around is an
optimization; I am not sure how changing exit_mm() for the group_leader will
impact functionality or standards. It might even break some applications.

Repeating my earlier question:

Can we delay setting task->cgroups = &init_css_set for the group_leader, until
all threads have exited? If the user is unable to remove a cgroup node, it will
be due to a valid reason: the group_leader is still around, since the threads are

That could potentially happen, if the virtual address space cgroup and cpu
control cgroup were bound together in the same hierarchy by the sysadmin.

I measured the overhead of removing the delay_group_leader optimization and
found a 4% impact on throughput (with volanomark, which is one of the
multi-threaded benchmarks I know of).

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 1:23 pm

Potentially, yes. It also might make more sense to move the
exit_cgroup() for all threads to a later point rather than special

Yes, I agree it could potentially happen. But it seems like a strange
thing to do if you're planning not to have the same groupings for

Interesting, I thought (although I've never actually looked at the
code) that volanomark was more of a scheduling benchmark than a
process start/exit benchmark. How frequently does it have processes
(not threads) exiting?

How many runs was that over? Ingo's recently posted volanomark tests
against -rc7 showed ~3% random variation between runs.

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 1:48 pm

Yes, that makes sense. I think that patch should be independent of this one

It's easier to set it up that way. Usually the end user gets the same SLA for

I could not find any other interesting benchmark for benchmarking fork/exits. I
know that volanomark is heavily threaded, so I used it. The threads quickly exit
after processing the messages; I thought that would be a good test to see the

I ran the test four times and took the average of the runs. I did see some
variation between runs, but I did not calculate the standard deviation.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 1:57 pm

Yes, it would probably need to be a separate patch. The current
positioning of cgroup_exit() is more or less inherited from cpusets.

True - but in that case why wouldn't they have the same SLA for

But surely the performance of thread exits wouldn't be affected by the
delay_group_leader(p) change, since none of the exiting threads would
be a group leader. That optimization only matters when the entire
process exits.

Does oprofile show any interesting differences?

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <lizf@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 2:59 pm

Yes, mostly. That's why I had made the virtual address space patches as a config

On the client side, each JVM instance exits after the test. I see the thread

Need to try oprofile.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 7:31 pm

*If* they want to use the virtual address space controller, that is.

By that argument, you should make the memory and cpu controllers the
same controller, since in your scenario they'll usually be used
together..

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Sunday, April 6, 2008 - 2:31 am

Heh, virtual address space and memory are more closely interlinked than CPU and memory.
-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, April 8, 2008 - 2:32 am

If you consider virtual address space limits a useful way to limit
swap usage, that's true.

But if you don't, then memory and CPU are more closely linked since
they represent real resource usage, whereas virtual address space is a
more abstract quantity.

Paul
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Saturday, April 5, 2008 - 7:29 pm

How long does the test run for? How many threads does each client have?

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Sunday, April 6, 2008 - 1:38 am

The test on each client side runs for about 10 seconds. I saw the client create
up to 411 threads.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, April 8, 2008 - 2:37 am

I'm not convinced that an application that creates 400 threads and
exits in 10 seconds is particularly representative of a high-performance
application.

But I agree that it's an example of something it may be worth trying
to optimize for.

You mention that you saw tgid exits - what order did the individual
threads exit in? If we threw the mm to the last thread in the thread
group rather than the first, would that help?

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, April 8, 2008 - 2:52 am

I agree, but like I said earlier, this was the easily available ready made

The order was different each time. When we have many threads all exiting at
once and running in parallel, I don't know if we can have any ordering or
predict the order in which threads exit.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, April 8, 2008 - 2:57 am

How about a simple program that creates N threads that just sleep,
then has the main thread exit?

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, April 8, 2008 - 3:05 am

That is not really representative of anything. I have that program handy. How do
we measure the impact on throughput?

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--
To: <balbir@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Tuesday, April 8, 2008 - 3:29 am

It's very representative of how much additional overhead in terms of
mm->owner churn there is in a large multi-threaded application
exiting, which is the thing that you're trying to optimize with the
delayed thread group leader checks.

Paul
--
To: Paul Menage <menage@...>
Cc: Pavel Emelianov <xemul@...>, Hugh Dickins <hugh@...>, Sudhir Kumar <skumar@...>, YAMAMOTO Takashi <yamamoto@...>, <linux-kernel@...>, <taka@...>, <linux-mm@...>, David Rientjes <rientjes@...>, Andrew Morton <akpm@...>, KAMEZAWA Hiroyuki <kamezawa.hiroyu@...>
Date: Thursday, April 10, 2008 - 5:09 am

I see almost no overhead after the notification change optimization (notify only
if owner belongs to a different cgroup).

My program creates n processes with k threads each and forces the thread group
leader to exit. For my experiment I created 10 processes with 800 threads each
(NOTE: you need to change ulimit -s for this to work).
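
The actual test program was not posted in this thread. A workload of the
same shape could look like the sketch below, which is an assumption and not
the program used for the measurement: each child process starts a number of
sleeping threads and then lets its thread group leader call pthread_exit()
first. The function names, the 100 ms sleep and the 256 KiB per-thread stack
(chosen so that the stack ulimit does not have to be changed) are all
illustrative, and the sketch generates the workload only; it does not
measure throughput.

```c
#define _POSIX_C_SOURCE 200809L
#include <pthread.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* How long each worker thread sleeps; children inherit it across fork(). */
static long worker_sleep_ms = 100;

static void *sleeper(void *unused)
{
    struct timespec ts = { worker_sleep_ms / 1000,
                           (worker_sleep_ms % 1000) * 1000000L };
    (void)unused;
    nanosleep(&ts, NULL);
    return NULL;
}

/* Child side: start nthreads sleepers, then retire the group leader. */
static void leader_exits_first(int nthreads)
{
    pthread_attr_t attr;

    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 256 * 1024);  /* assumed small stack */
    for (int i = 0; i < nthreads; i++) {
        pthread_t tid;
        if (pthread_create(&tid, &attr, sleeper, NULL) != 0)
            _exit(1);
        pthread_detach(tid);
    }
    /* The leader goes first; the process ends with status 0 once the
     * last sleeper has returned. */
    pthread_exit(NULL);
}

/* Returns how many of the nproc processes terminated with status 0. */
static int run_leader_exit_test(int nproc, int nthreads)
{
    pid_t *pids = calloc((size_t)nproc, sizeof(*pids));
    int started = 0, ok = 0;

    if (pids == NULL)
        return 0;
    for (; started < nproc; started++) {
        pid_t pid = fork();
        if (pid < 0)
            break;
        if (pid == 0)
            leader_exits_first(nthreads);  /* never returns */
        pids[started] = pid;
    }
    for (int i = 0; i < started; i++) {
        int status;
        if (waitpid(pids[i], &status, 0) == pids[i] &&
            WIFEXITED(status) && WEXITSTATUS(status) == 0)
            ok++;
    }
    free(pids);
    return ok;
}
```

Calling run_leader_exit_test(10, 800) would mirror the 10 x 800 shape
described above; timing the call on kernels with and without the
optimization is left to the caller.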

I am going to remove the delay_group_leader() optimization and submit v9.

-- 
	Warm Regards,
	Balbir Singh
	Linux Technology Center
	IBM, ISTL
--