Changelog v7
------------
1. Make mm_need_new_owner() more readable
2. Remove extra white space from init_task.h

Changelog v6
------------
1. Fix typos
2. Document the use of delay_group_leader()

Changelog v5
------------
Remove the hooks for .owner from init_task.h and move it to init/main.c

Changelog v4
------------
1. Release rcu_read_lock() after acquiring task_lock(). Also get a reference to the task_struct
2. Change cgroup mm_owner_changed callback to callback only if the cgroup of old and new task is different and to pass the old and new cgroups instead of task pointers
3. Port the patch to 2.6.25-rc8-mm1

Changelog v3
------------
1. Add mm->owner change callbacks using cgroups

This patch removes the mem_cgroup member from mm_struct and instead adds an owner. This approach was suggested by Paul Menage. The advantage of this approach is that, once the mm->owner is known, using the subsystem id, the cgroup can be determined. It also allows several control groups that are virtually grouped by mm_struct to exist independent of the memory controller, i.e., without adding mem_cgroup's for each controller to mm_struct.

A new config option CONFIG_MM_OWNER is added and the memory resource controller selects this config option.

This patch also adds cgroup callbacks to notify subsystems when mm->owner changes. The mm_owner_changed callback is called with the task_lock() of the new task held and is called just prior to changing the mm->owner.

I am indebted to Paul Menage for the several reviews of this patchset and helping me make it lighter and simpler.

This patch was tested on a powerpc box, it was compiled with both the MM_OWNER config turned on and off.

After the thread group leader exits, it's moved to init_css_set by cgroup_exit(), thus all future charges from running threads would be redirected to the init_css_set's subsystem.

Signed-off-by: Balbir Singh <balbir@linux.vnet.ibm.com>
---
fs/exec.c ...
On Fri, 04 Apr 2008 13:35:44 +0530

I'm sorry for my laziness. Why do_each_thread()? Isn't for_each_process() enough? (because of delay_group_leader().)

And what we have to test for the worst case is the following, right?
==
1. create tons of threads.
2. create a process which calls vfork().
3. keep the child alive and have the vfork() caller exit
==

Thanks,
-Kame
--
On Fri, 04 Apr 2008 13:35:44 +0530

Do we really want to offer this option to people? It's rather a low-level thing and it's likely to cause more confusion than it's worth. Remember that most kernels get to our users via kernel vendors - to what will they

Presumably they'll always be setting it to "y" if they are enabling cgroups

I suppose these should be __read_mostly.
--
I suspect that this kernel option will not be explicitly set. This option will be selected by other config options (memory controller, swap namespace,

Yes, good point. I'll send out v9 with this fix.

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I believe that the way to do this is to not give the option a `help' section. That makes it a Kconfig-internal-only thing. --
And its uncharges, which is more of the problem I was getting at earlier - surely when the mm is finally destroyed, all its virtual address space charges will be uncharged from the root cgroup rather than the correct cgroup, if we left the delayed group leader as the owner? Which is why I think the group leader optimization is unsafe. --
It won't uncharge for the memory controller from the root cgroup since each page has the mem_cgroup information associated with it. For other controllers, they'll need to monitor exit() callbacks to know when the leader is dead :( (sigh). Not having the group leader optimization can introduce big overheads (consider thousands of tasks, with the group leader being the first one to exit). -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Can you test the overhead? As long as we find someone to pass the mm to quickly, it shouldn't be too bad - I think we're already optimized for that case. Generally the group leader's first child will be the new owner, and any subsequent times the owner exits, they're unlikely to have any children so they'll go straight to the sibling check and pass the mm to the parent's first child.

Unless they all exit in strict sibling order and hence pass the mm along the chain one by one, we should be fine. And if that exit ordering does turn out to be common, then simply walking the child and sibling lists in reverse order to find a victim will minimize the amount of passing.

One other thing occurred to me - what lock protects the child and sibling links? I don't see any documentation anywhere, but from the code it looks as though it's tasklist_lock rather than RCU - so maybe we should be holding that with a read_lock(), at least for the first two parts of the search? (The full thread search is RCU-safe).

Paul
--
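The search order discussed above (children first, then siblings, then every thread) can be sketched as a small userspace model. This is only an illustration under stated assumptions: `struct task`, `struct mm`, `scan_list()` and `pick_new_owner()` are hypothetical names, not the kernel's data structures or the patch's code, and locking (tasklist_lock, RCU, task_lock()) is left out entirely.

```c
#include <stddef.h>

/* Toy model only: these types are illustrative, not the kernel's. */
struct mm { int id; };

struct task {
    struct mm *mm;              /* address space; NULL once released */
    struct task *parent;        /* NULL for the root of the toy tree */
    struct task *first_child;   /* head of this task's children list */
    struct task *next_sibling;  /* next entry in the parent's children list */
};

/* Return the first task on a sibling-linked list that still uses mm. */
static struct task *scan_list(struct task *head, const struct mm *mm,
                              const struct task *skip)
{
    for (struct task *t = head; t != NULL; t = t->next_sibling)
        if (t != skip && t->mm == mm)
            return t;
    return NULL;
}

/*
 * Pick a new owner for exiting->mm in three passes:
 *   1. the exiting task's children,
 *   2. its siblings (the parent's children),
 *   3. every task in the system (here: the array all[0..n-1]).
 * Returns NULL when no other task uses the mm.
 */
struct task *pick_new_owner(struct task *exiting, struct task **all, size_t n)
{
    const struct mm *mm = exiting->mm;
    struct task *t;

    t = scan_list(exiting->first_child, mm, exiting);
    if (t != NULL)
        return t;

    if (exiting->parent != NULL) {
        t = scan_list(exiting->parent->first_child, mm, exiting);
        if (t != NULL)
            return t;
    }

    for (size_t i = 0; i < n; i++)
        if (all[i] != exiting && all[i]->mm == mm)
            return all[i];

    return NULL;
}
```

In this model, walking the lists in reverse order (the suggestion for the strict-sibling-order exit case) would only change the direction in which `scan_list()` traverses the list.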
Yes, it would be, but worth the trouble. Is it really critical to move a dead

Finding the next mm might not be all that bad, but doing it each time a task exits can be an overhead, especially for large multi-threaded programs. This can get severe if the new mm->owner belongs to a different cgroup, in which case we need to use callbacks as well. If half the threads belonged to a different cgroup and the new mm->owner kept switching between cgroups, the overhead would be really high, with the callbacks

You are right about the read_lock()

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
It struck me that this whole group leader optimization is broken as it stands since there could (in strange configurations) be multiple thread groups sharing the same mm. I wonder if we can't just delay the exit_mm() call of a group leader

Right, but we only have that overhead if we actually end up passing the mm from one to another each time they exit. It would be interesting to know what order the threads in a large multi-threaded process exit typically (when the main process exits and all the threads die). I guess it's likely to be one of:

- in thread creation order (i.e. in order of parent->children list), in which case we should try to throw the mm to the parent's last child
- in reverse creation order, in which case we should try to throw the mm to the parent's first child
- in random order depending on which threads the scheduler runs first (in which case we can expect that a small fraction of the threads will

To me, it seems that setting up a *virtual address space* cgroup hierarchy and then putting half your threads in one group and half in another is asking for trouble. We need to not break in that situation, but I'm not sure it's a case to optimize for.

Paul
--
Not sure about this one, I suspect keeping the group_leader around is an optimization, changing exit_mm() for the group_leader, not sure how that will impact functionality or standards. It might even break some applications. Repeating my question earlier

Can we delay setting task->cgroups = &init_css_set for the group_leader, until all threads have exited? If the user is unable to remove a cgroup node, it will be due to a valid reason, the group_leader is still around, since the threads are

That could potentially happen, if the virtual address space cgroup and cpu control cgroup were bound together in the same hierarchy by the sysadmin.

I measured the overhead of removing the delay_group_leader optimization and found a 4% impact on throughput (with volanomark, that is one of the multi-threaded benchmarks I know of).

-- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Potentially, yes. It also might make more sense to move the exit_cgroup() for all threads to a later point rather than special

Yes, I agree it could potentially happen. But it seems like a strange thing to do if you're planning to not have the same groupings for

Interesting, I thought (although I've never actually looked at the code) that volanomark was more of a scheduling benchmark than a process start/exit benchmark. How frequently does it have processes (not threads) exiting?

How many runs was that over? Ingo's recently posted volanomark tests against -rc7 showed ~3% random variation between runs.

Paul
--
Yes, that makes sense. I think that patch should be independent of this one It's easier to set it up that way. Usually the end user gets the same SLA for I could not find any other interesting benchmark for benchmarking fork/exits. I know that volanomark is heavily threaded, so I used it. The threads quickly exit after processing the messages, I thought that would be a good test to see the I ran the test four times. I took the average of runs, I did see some variation between runs, I did not calculate the standard deviation. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
Yes, it would probably need to be a separate patch. The current positioning of cgroup_exit() is more or less inherited from cpusets. True - but in that case why wouldn't they have the same SLA for But surely the performance of thread exits wouldn't be affected by the delay_group_leader(p) change, since none of the exiting threads would be a group leader. That optimization only matters when the entire process exits. Does oprofile show any interesting differences? Paul --
Yes, mostly. That's why I had made the virtual address space patches as a config On the client side, each JVM instance exits after the test. I see the thread Need to try oprofile. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
*If* they want to use the virtual address space controller, that is. By that argument, you should make the memory and cpu controllers the same controller, since in your scenario they'll usually be used together.. Paul --
Heh, Virtual address and memory are more closely interlinked than CPU and Memory. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
If you consider virtual address space limits a useful way to limit swap usage, that's true. But if you don't, then memory and CPU are more closely linked since they represent real resource usage, whereas virtual address space is a more abstract quantity. Paul --
How long does the test run for? How many threads does each client have? Paul --
The test on each client side runs for about 10 seconds. I saw the client create up to 411 threads. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
I'm not convinced that an application that creates 400 threads and exits in 10 seconds is particularly representative of a high-performance application. But I agree that it's an example of something it may be worth trying to optimize for. You mention that you saw tgid exits - what order did the individual threads exit in? If we threw the mm to the last thread in the thread group rather than the first, would that help? Paul --
I agree, but like I said earlier, this was the easily available ready made The order was different each time. I suspect that when we have too many threads all exiting at once and they are all running in parallel, I don't know if we can have ordering or predict the order in which threads exit. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
How about a simple program that creates N threads that just sleep, then has the main thread exit? Paul --
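A minimal sketch of such a test helper, assuming POSIX threads: it starts N detached threads that only sleep. The names `sleeper()` and `spawn_sleepers()` are hypothetical and this is not the program used for any measurement in this thread.

```c
#include <pthread.h>
#include <stdint.h>
#include <unistd.h>

/* Thread body: sleep for the number of seconds packed into arg, then exit. */
static void *sleeper(void *arg)
{
    sleep((unsigned int)(uintptr_t)arg);
    return NULL;
}

/*
 * Start up to n detached threads that each sleep for secs seconds.
 * Returns the number of threads actually started; creation stops at
 * the first failure (for example when a resource limit is reached).
 */
int spawn_sleepers(int n, unsigned int secs)
{
    pthread_attr_t attr;
    int started = 0;

    if (pthread_attr_init(&attr) != 0)
        return 0;
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);

    for (int i = 0; i < n; i++) {
        pthread_t tid;

        if (pthread_create(&tid, &attr, sleeper,
                           (void *)(uintptr_t)secs) != 0)
            break;
        started++;
    }

    pthread_attr_destroy(&attr);
    return started;
}
```

A main() that calls spawn_sleepers(N, 30) and then pthread_exit(NULL) terminates only the main thread (the thread group leader), leaving the sleeping threads as the remaining users of the mm.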
That is not really representative of anything. I have that program handy. How do we measure the impact on throughput? -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
It's very representative of how much additional overhead in terms of mm->owner churn there is in a large multi-threaded application exiting, which is the thing that you're trying to optimize with the delayed thread group leader checks. Paul --
I see almost no overhead after the notification change optimization (notify only if owner belongs to a different cgroup). My program creates n processes with k threads each and forces the thread group leader to exit. For my experiment I created 10 processes with 800 threads each (NOTE: you need to change ulimit -s for this to work). I am going to remove the delay_group_leader() optimization and submit v9. -- Warm Regards, Balbir Singh Linux Technology Center IBM, ISTL --
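The notification change mentioned above (notify only if the new owner belongs to a different cgroup) can be illustrated with a toy model. The types `struct toy_cgroup` and `struct toy_task` and the function `notify_owner_change()` are hypothetical; the real callback in the patch is a cgroup subsystem hook whose exact signature is not shown in this thread.

```c
#include <stddef.h>

/* Toy model only: illustrative types, not the kernel's. */
struct toy_cgroup { int id; };
struct toy_task { struct toy_cgroup *cgrp; };

/* Hypothetical callback type: receives the old and new cgroups. */
typedef void (*owner_changed_fn)(struct toy_cgroup *old_cgrp,
                                 struct toy_cgroup *new_cgrp, void *data);

/*
 * Report an mm owner change to a subscriber only when the old and new
 * owners sit in different cgroups. Returns 1 when a notification was
 * due, 0 when it was skipped because both owners share a cgroup.
 */
int notify_owner_change(const struct toy_task *old_owner,
                        const struct toy_task *new_owner,
                        owner_changed_fn cb, void *data)
{
    if (old_owner->cgrp == new_owner->cgrp)
        return 0;   /* same cgroup: nothing for the subsystem to do */

    if (cb != NULL)
        cb(old_owner->cgrp, new_owner->cgrp, data);
    return 1;
}
```

When all threads of a process share one cgroup, every hand-over takes the early return, which is consistent with the near-zero overhead reported for that case.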
