Greetings, Comments, suggestions etc highly welcome. This patch implements an idea from Linus, to automatically create task groups per tty, to improve desktop interactivity under hefty load such as kbuild. The feature is enabled from boot by default, The default setting can be changed via the boot option ttysched=0, and can be can be turned on or off on the fly via echo [01] > /proc/sys/kernel/sched_tty_sched_enabled. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 pert/s: 229 >5484.43us: 41 min: 0.15 max:12069.42 avg:2193.81 sum/s:502382us overhead:50.24% pert/s: 222 >5652.28us: 43 min: 0.46 max:12077.31 avg:2248.56 sum/s:499181us overhead:49.92% pert/s: 211 >5809.38us: 43 min: 0.16 max:12064.78 avg:2381.70 sum/s:502538us overhead:50.25% pert/s: 223 >6147.92us: 43 min: 0.15 max:16107.46 avg:2282.17 sum/s:508925us overhead:50.49% pert/s: 218 >6252.64us: 43 min: 0.16 max:12066.13 avg:2324.11 sum/s:506656us overhead:50.27% Signed-off-by: Mike Galbraith <efault@gmx.de> --- drivers/char/tty_io.c | 2 include/linux/sched.h | 14 +++++ include/linux/tty.h | 3 + init/Kconfig | 13 +++++ kernel/sched.c | 9 +++ kernel/sched_tty.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++ kernel/sched_tty.h | 7 ++ kernel/sysctl.c | 11 ++++ 8 files changed, 186 insertions(+), 1 deletion(-) Index: linux-2.6.36.git/include/linux/sched.h =================================================================== --- linux-2.6.36.git.orig/include/linux/sched.h +++ linux-2.6.36.git/include/linux/sched.h @@ -1900,6 +1900,20 @@ int sched_rt_handler(struct ctl_table *t extern unsigned int sysctl_sched_compat_yield; +#ifdef CONFIG_SCHED_DESKTOP +int sched_tty_sched_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, + loff_t *ppos); + +extern unsigned int sysctl_sched_tty_sched_enabled; + +void ...
You might be wanting to exclude RT tasks from the tty groups since there is no interface to grant them any runtime and such :-) Also, I think tty_sched_move_task and sched_move_task() should be sharing lots more code -- I recently proposed a fix for sched_move_task() because the Android people complained, but they haven't replied back yet.. --
Yeah. I should be able to just do sched_move_task(), but it doesn't currently work even with your patch, turning tty_sched on/off can lead to incredible delays before you get box back. With virgin source, it's size infinity for all intents and purposes. -Mike --
Ah, first feedback I've got on that patch,. but surely since you created one that does work we can fix my patch and use the normal path? :-) --
Yeah. I wanted to get the RFC out the door, so took a shortcut. -Mike --
Actually, your patch works just peachy, my fiddling with sleepers vruntime at attach time was a dainbramaged thing to do. -Mike --
OK, I'll queue it with a Tested-by from you. Thanks! --
I don't think you need to disable IRQs for tasklist lock, nor do I think you actually need it. If you enable tty groups and then scan all the existing tasks you've covered them all, new tasks will already be placed right, dying tasks we don't care about anyway. --
OK, thanks. (No such thing as too paranoid;) -Mike --
It was suggested that I show a bit more info.
The same load without per tty task groups.
pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24%
pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99%
pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20%
pert/s: 26 >42344.91us: 11 min: 3.83 max:52029.60 avg:36164.70 sum/s:940282us overhead:91.12%
pert/s: 24 >44262.90us: 14 min: 5.05 max:82735.15 avg:40314.33 sum/s:967544us overhead:92.22%
^^^^^usecs ^^^^^usecs ^^the competition got
Average service latency is an order of magnitude better with tty_sched.
(Imagine that pert is Xorg or whatnot instead)
Using Mathieu Desnoyers' wakeup-latency testcase (attached):
With taskset -c 3 make -j 10 running..
taskset -c 3 ./wakeup-latency& sleep 30;killall wakeup-latency
without:
maximum latency: 42963.2 µs
average latency: 9077.0 µs
missed timer events: 0
with:
maximum latency: 4160.7 µs
average latency: 149.4 µs
missed timer events: 0
Patch makes a big difference in desktop feel under hefty load here.
-Mike
That's really nice! Could this feature realistically do block IO isolation as well? It's always annoying when some big IO job is making the desktop jerky. Especially as your patch is selecting the block cgroup feature already: + select BLK_CGROUP Thanks, Ingo --
I know my cgroup pgid config helps a bunch with rummaging in my email while other IO is going on. I've been attributing that to BLK_CGROUP, but have no proof. -Mike --
Yes, I was going to complain that the numbers in the commit message
Very impressive. This definitely looks like something people will notice.
That said, I do think we should think carefully about calling this a
"tty" feature. I think we might want to leave the door open to other
heuristics than _just_ the tty group. I think the tty group approach
is wonderful for traditional Unix loads in a desktop environment, but
I suspect we might hit issues with IDE's etc too. I don't know if we
can notice things like that automatically, but I think it's worth
thinking about.
So I think the patch looks pretty good, and the numbers seem to look
just stunningly so, but I'd like to name the feature more along the
lines of "automatic process group scheduling" rather than about tty's
per se.
And you actually did that for the Kconfig option, which makes me quite happy.
The one other thing I do wonder about is how noticeable the group
scheduling overhead is. If people compare with a non-CGROUP_SCHED
kernel, will a desktop-optimized kernel suddenly have horrible pipe
latency due to much higher scheduling cost? Right now that whole
feature is hidden by EXPERIMENTAL, I don't know how much it hurts, and
I never timed it when I tried it out long ago..
Linus
--
Oh, absolutely, that's what it's all about really. What I'd _like_ is to get per process group scheduling working on the cheap..ish. Your idea of tty cgoups looked much simpler though, so I figured that would be a great place to start. It turned out to be much simpler than I thought it would be, which is encouraging, and it works well in testing Very noticeable, cgroups is far from free. It would make no sense for a performance freak to even think about it. I don't run cgroup enabled kernels usually, and generally strip to the bone because I favor throughput very very heavily, but when I look at the desktop under load, The scheduling cost is quite high. But realistically, the cost of a distro kernel with full featured network stack is (much) higher. I seriously doubt the cost of cgroups would be noticed by the typical _desktop_ user. Overall latencies for any switchy microbenchmark will certainly be considerably higher with the feature enabled. -Mike --
Q/D test of kernels w/wo, with same .config using pipe-test (pure sched) gives on my box ~590khz with tty_sched active, 620khz without cgroups acitve in same kernel/config without patch. last time I measured stripped down config (not long ago, but not yesterday either) gave max ctx rate ~690khz on this box. (note: very Q, very D numbers, no variance testing, ballpark) -Mike --
That's 5% overhead in context switches. Definitely not in the 'horrible' category. This would be a rather tempting item for 2.6.37 ... especially as it really mainly reuses existing group scheduling functionality, in a clever way. Mind doing more of the tty->desktop renames/generalizations as Linus suggested, and resend the patch? I'd also suggest to move it out of EXPERIMENTAL - we dont really do that for core kernel features as most distros enable CONFIG_EXPERIMENTAL so it's a rather meaningless distinction. Since the feature is default-n, people will get the old scheduler by default but can also choose this desktop-centric scheduling mode. I'd even argue to make it default-y, because this patch clearly cures a form of kbuild cancer. Thanks, Ingo --
Here she comes. Better/Worse? Changes: - tty->autogroup. - only autogroup fair class tasks. - removed dainbramaged sleeper vruntime twiddling. - removed paranoid locking. You top dogs can make the default call.. it it's accepted that is ;-) Marcus: care to try the below? Works fine for me (but so did first cut). It depends on the attached patch, and applied to virgin shiny new 2.6.36. A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. How it works: at tty alloc/dealloc time, a task group is created/destroyed, so there is always a task group active per tty. When we select a runqueue, if the task has a has a tty association, and no task group, attach it to a per tty autogroup on the fly. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Some numbers. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 About measurement proggy: pert/sec = perturbations/sec min/max/avg = scheduler service latencies in usecs sum/s = time accrued by the competition per sample period (1 sec here) overhead = %CPU received by the competition per sample period pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24% pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99% pert/s: 24 >42150.22us: 12 min: 8.86 max:61265.91 avg:39459.91 sum/s:947038us overhead:92.20% pert/s: 26 >42344.91us: 11 min: ...
I really like the new 'autogroup scheduler' name - as we really dont want to turn this into anything but an intelligent grouping thing. Via the naming we can resist heuristics for example. Btw., how does Xorg fare with this? Can we remove sleeper fairness for example and simplify other bits of the CFS logic as a side-effect? Ingo --
Works a treat for me. As I write this in evolution, I have amarok playing with a visualization and a make -j100 running. Song switch is instant, visualization is nice and smooth despite unaccelerated Xorg. We can't whack fair sleepers though, not without inventing a new preemption model to take it's place. -Mike --
Works fine interactively, but I still get the same kernel BUG on reboot time as before. Would a photo of the trace help you? -- Markus --
Odd. Yeah, please send me the photo and your .config. -Mike --
In the backtrace, the scheduler is called from:
do_group_exit()
__dequeue_signal()
do_exit()
given that task_group() is called from many spots in the scheduler, I wonder if
some checks making sure that tg != NULL in task_group() would be appropriate ?
Also checking that p->signal is non-NULL in autogroup_attach_tty() might help.
Thanks,
Mathieu
--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com
--
* Mike Galbraith (efault@gmx.de) wrote: Hi Mike, This per-tty task grouping approach looks very promising. I'll give it a spin when I find the time. Meanwhile, a little question about locking here: how is the read lock supposed to protect from p->signal (and p->signal->tty) modifications ? What's the locking scheme here ? So maybe just simple rcu_dereference are missing, or maybe the tsk->sighand->siglock might be required. In all cases, I feel something is missing there. Thanks! Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
minor nit, I think in theory this needs barrier(), or struct tty_struct *tty = ACCESS_ONCE(p->signal->tty); if (tty) No, I don't understand this ;) But I know nothig about task groups, most probably this is OK. It is not clear to me why do we need rcu_read_lock() and how it can help. The tty can go away right after dereferencing signal->tty. Even if the task doesn't exit, it (or its sub-thread) can do sys_setsid() at any moment and free this tty. If any thread was moved to tty->sg, doesn't this mean that, say, ->cfs_rq will point to the already freed tg->cfs_rq? From http://marc.info/?l=linux-kernel&m=128764874422614 +int sched_autogroup_handler(struct ctl_table *table, int write, + void __user *buffer, size_t *lenp, loff_t *ppos) +{ + struct task_struct *p, *t; + struct task_group *tg; + int ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + + if (ret || !write) + return ret; + + for_each_process(p) { Hmm. This needs rcu lock at least? + tg = task_group(p); Why? + sched_move_task(p); + list_for_each_entry_rcu(t, &p->thread_group, thread_group) { + sched_move_task(t); + } + } Looks like, you can just do do_each_thread(p, t) { sched_move_task(t); } while_each_thread(p, t); With the same effect. Oleg. --
Yeah. So in theory, the tty can go away on me. I knew this was too easy. -Mike --
Which was Marcus' crash. Didn't happen here only because I didn't have CONFIG_PREEMPT set. Changes since v2: - drop --
Bumped mouse, message escaped. Doesn't matter though, damn thing just blew up during enable/disable plus hackbench stress test, despite holding a reference to the tty at every place tty changes (under sighand lock), and moving the task with that reference held. CONFIG_PREEMPT is being a little S.O.B. -Mike --
So I have a suggestion that may not be popular with you, because it does end up changing the approach of your patch a lot. And I have to say, I like how your last patch looked. It was surprisingly small, simple, and clean. So I hate saying "I think it should perhaps do things a bit differently". That said, I would suggest: - don't depend on "tsk->signal->tty" at all. - INSTEAD, introduce a "tsk->signal->sched_group" pointer that points to whatever the current auto-task_group is. Remember, long-term, we'd want to maybe have other heuristics than just the tty groups, so we'd want this separate from the tty logic _anyway_ - at fork time, just copy the task_group pointer in copy_signal() if it is non-NULL, and increment the refcount (I don't think struct task_group is refcounted now, but this would require it). - at free_signal_struct(), just do a "put_task_group(sig->task_group);" before freeing it. - make the scheduler use the "tsk->signal->sched_group" as the default group if nothing else exists. Now, all the basic logic is _entirely_ unaware of any tty logic, and it's generic. And none of it has any races with some odd tty release logic or anything like that. Now, after this, the only thing you'd need to do is hook into __proc_set_tty(), which already holds the sighand lock, and _there_ you would attach the task_group to the process. Notice how it would never be attached to a tty at all, so tty_release etc would never be involved in any taskgroup thing - it's not really the tty that owns the taskgroup, it's simply the act of becoming a tty task group leader that attaches the task to a new scheduling group. It also means, for example, that if a process loses its tty (and doesn't get a new one - think hangup), it still remains in whatever scheduling group it started out with. The tty really is immaterial. And the nice thing about this is that it should be trivial to make other things than tty's trigger this same thing, if we find a pattern (or create ...
Suggestions highly welcome. The raciness is driving me nuts. I can't Much more tasteful than what I was about to do as a last resort funky race killer, namely make my on/off switch a machine wide atomic bomb :) Thanks! -Mike --
Greetings from sunny Arizona! I _finally_ got back to this yesterday, and implemented your suggestion, though with a couple minor variations. Putting the autogroup pointer in the signal struct didn't look right to me, so I plugged it into the task struct instead. I also didn't refcount taskgroups, wanted the patchlet to be as self-contained as possible, so refcounted the autogroup struct instead. I also left group movement on tty disassociation in place, but may nuke it. The below has withstood an all night thrashing in my laptop with a PREEMPT_RT kernel, and looks kinda presentable to me, so... A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the task's reference to the default group is dropped, a new task group is created, and the task is moved out of the old group and into the new. Children thereafter inherit this task group, and increase it's refcount. Calls to __tty_hangup() and proc_clear_tty() move the caller back to the init_task_group, and possibly destroy the task group. On exit, reference to the current task group is dropped, and the task group is potentially destroyed. At runqueue selection time, iff a task has no cgroup assignment, it's current autogroup is used. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root ...
Very well contained, minimally invasive to anything else! ( Noticed only one very small detail: sched_autogroup.h has an illness of lack of newlines which makes it a bit hard to read - but this is cured easily. ) I'll test and apply this patch to the scheduler tree, so if anyone has objections please holler now :-) Linus, does this look OK to you too, can i add your Acked-by? Thanks, Ingo --
Ok, the patch looks fine, but I do have a few comments:
- the reason I suggested the signal struct was really that I thought
it would avoid extra (unnecessary) cost in thread creation/teardown.
Maybe I should have made that clear, but this seems to
unnecessarily do the whole atomic_inc/dec for each thread. That seems
a bit sad.
That said, if not having to dereference ->signal simplifies the
scheduler interaction, I guess the extra atomic ref at thread
creation/deletion is fine. So I don't think this is wrong, it's just
something I wanted to bring up.
- You misspelled "detach". That just drives me wild. Please fix.
- What I _do_ think is wrong is how I think you're trying to be "too
precise". I think that's fundamentally wrong, because I think we
should make it very clear that it's a heuristic. So I dislike seeing
these functions: sched_autogroup_handler() - we shouldn't care about
old state, sched_autogroup_detach() - even with the fixed spelling I
don't really see why a tty hangup should cause the process to go back
to the default group, for example.
IOW, I think you tried a bit _too_ hard to make it a 1:1 relationship
with the tty. I don't think it needs to be. Just because a process
loses its tty because of a hangup, I don't think that that should have
any real implications for the auto-group scheduling. Or maybe it
should, but that decision should be based on "does it help scheduling
behavior" rather than on "it always matches the tty". See what I'm
saying?
That said, I do love how the patch looks. I think this is absolutely
the right thing to do. My issues are small details. I'd Ack it even in
this form (well, as long as spelling is fixed, that really does rub me
the wrong way), and the other things are more details that are about
how I'm thinking about it rather than "you need to do it this way".
Linus
--
Yeah, and it doesn't in the common case at least. The handler's classifier was because a 100% pinned hog would never obey the user's wishes, but I can whack it along with the hangup. Less is more in the scheduler. Thanks for the comments. -Mike --
Well, it cuts both ways. Maybe your approach is simpler and avoids
overhead at scheduling time. And "tsk->signal" may not be reliable due
to races with exit etc, so it may well be that going through the
signal struct could end up being a source of nasty races. I didn't
look whether the scheduler already derefenced ->signal for some other
reason, for example.
So your patch may well have done the exact right thing.
Linus
--
Just in case, starting from 2.6.35 tsk->signal is reliable. Oleg. --
Just to add some data; here are the results from my machine (AMD 4 cores) running a -j4 kernel build, while I browsed the web: 1) perf sched record sleep 30 without: total_wakeups: 44306 avg_wakeup_latency (ns): 36784 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 9378852 with: total_wakeups: 43836 avg_wakeup_latency (ns): 67607 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 8983036 2) perf record -a -e sched:sched_switch -e sched:sched_wakeup sleep 10 without: total_wakeups: 13195 avg_wakeup_latency (ns): 48484 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 8722497 with: total_wakeups: 14106 avg_wakeup_latency (ns): 92532 min_wakeup_latency (ns): 20 max_wakeup_latency (ns): 5642393 So the avg_wakeup_latency nearly doubled with your patch, while the max_wakeup_latency is lowered by a good amount. -- Markus --
When you say with/without, does that mean enabled/disabled, or patched/virgin and/or cgroups/nocgroups? -Mike --
Patched/virgin and nocgroups -- Markus --
Figures. As you can see, group scheduling is not wonderful for extreme switchers. Fortunately, most apps do a bit of work in between. -Mike --
I didn't read this patch carefully (yet) but, Surely this doesn't need rcu. But the real problem is that copy_process() can fail after that, This doesn't look right. Note that "p" can run/sleep after that (or in parallel), set_task_rq() can use the freed ->autogroup. Btw, I can't apply this patch... Oleg. --
It depends on the patch below from Peter, or manual fixup.
Subject: sched, cgroup: Fixup broken cgroup movement
From: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Fri Oct 15 15:24:15 CEST 2010
Reported-by: Dima Zavin <dima@android.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
include/linux/sched.h | 2 +-
kernel/sched.c | 8 ++++----
kernel/sched_fair.c | 25 +++++++++++++++++++------
3 files changed, 24 insertions(+), 11 deletions(-)
Index: linux-2.6.36.git/kernel/sched.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sched.c
+++ linux-2.6.36.git/kernel/sched.c
@@ -8297,12 +8297,12 @@ void sched_move_task(struct task_struct
if (unlikely(running))
tsk->sched_class->put_prev_task(rq, tsk);
- set_task_rq(tsk, task_cpu(tsk));
-
#ifdef CONFIG_FAIR_GROUP_SCHED
- if (tsk->sched_class->moved_group)
- tsk->sched_class->moved_group(tsk, on_rq);
+ if (tsk->sched_class->task_move_group)
+ tsk->sched_class->task_move_group(tsk, on_rq);
+ else
#endif
+ set_task_rq(tsk, task_cpu(tsk));
if (unlikely(running))
tsk->sched_class->set_curr_task(rq);
Index: linux-2.6.36.git/include/linux/sched.h
===================================================================
--- linux-2.6.36.git.orig/include/linux/sched.h
+++ linux-2.6.36.git/include/linux/sched.h
@@ -1072,7 +1072,7 @@ struct sched_class {
struct task_struct *task);
#ifdef CONFIG_FAIR_GROUP_SCHED
- void (*moved_group) (struct task_struct *p, int on_rq);
+ void (*task_move_group) (struct task_struct *p, int on_rq);
#endif
};
Index: linux-2.6.36.git/kernel/sched_fair.c
===================================================================
--- linux-2.6.36.git.orig/kernel/sched_fair.c
+++ linux-2.6.36.git/kernel/sched_fair.c
@@ -3824,13 +3824,26 @@ static void set_curr_task_fair(struct rq
}
#ifdef CONFIG_FAIR_GROUP_SCHED
-static void moved_group_fair(struct task_struct *p, int ...Just in case, the lock order may be wrong. sched_autogroup_exit() takes task_group_lock under write_lock(tasklist), while sched_autogroup_handler() takes them in reverse order. I am not sure, but perhaps this can be simpler? wake_up_new_task() does autogroup_fork(), and do_exit() does Thanks. It also applies cleanly to 2.6.36. Very basic question. Currently sched_autogroup_create_attach() has the only caller, __proc_set_tty(). It is a bit strange that signal->tty change is process-wide, but sched_autogroup_create_attach() move the single thread, the caller. What about other threads in OK, but I don't understand "p->autogroup != &autogroup_default" check. This is true if autogroup_create() succeeds. Otherwise autogroup_create() does autogroup_kref_get(autogroup_default), doesn't this mean we need unconditional _put ? Fell free to ignore, but imho return autogroup_task_group(p, tg); looks a bit better. Why autogroup_task_group() returns its result via pointer? Oleg. --
That's what I was going to do. That said, I couldn't have had the problem if I'd tied final put directly to life of container, and am Yeah, I really should (will) move all on the spot, though it doesn't seem to matter in general practice, forks afterward land in the right bucket. With per tty or p->signal, migration will pick up stragglers No particularly good reason, I'll do the cosmetic change. Thanks, -Mike --
I didn't nuke the handler, but did hide it under a debug option since it is useful for testing. If the user enables it, and turns autogroup off, imho off should means off NOW, so I stuck with it as is. I coded up a lazy (tick time check) move to handle pinned tasks not otherwise being moved, but that was too much for even my (lack of) taste to handle. The locking should be fine as it was now, since autogroup_exit() isn't under the tasklist lock any more. (surprising i didn't hit any problems with this or use after free in rt kernel given how hard i beat on it) Pondering adding some debug bits to identify autogroup tasks, maybe I ended up tying it directly to p->signal's life, and beat on it with CONFIG_PREEMPT. I wanted to give it a thrashing in PREEMPT_RT, but when I snagged your signal patches, I apparently didn't snag quite Did that, and the rest. This patch will apply to tip or .git. patchlet: A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. The ...
Unfortunately it won't. There's a missing "+" at line 507: markus@arch linux % patch -p1 --dry-run < /home/markus/tty_autogroup2.patch patching file include/linux/sched.h Hunk #3 succeeded at 1935 (offset -1 lines). patching file kernel/sched.c Hunk #2 succeeded at 607 (offset 1 line). Hunk #3 succeeded at 2011 with fuzz 2 (offset 1 line). Hunk #4 succeeded at 7959 (offset -25 lines). Hunk #5 succeeded at 8489 (offset -25 lines). Hunk #6 succeeded at 8514 (offset -25 lines). patching file kernel/fork.c patching file drivers/tty/tty_io.c patching file kernel/sched_autogroup.h patching file kernel/sched_autogroup.c patch: **** malformed patch at line 507: Index: linux-2.6/kernel/sysctl.c -- Markus --
Oh well, yesterday ping (one of sister's laptop loving cats) opened _187_ screenshots while I was away from the keyboard. I'll blame any spelling errors on him too while I'm at it :) Here's a fresh copy, no cats is sight. A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Some numbers. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 About measurement proggy: pert/sec = perturbations/sec min/max/avg = scheduler service latencies in usecs sum/s = time accrued by the competition per sample period (1 sec here) overhead = %CPU received by the competition per sample period pert/s: 31 >40475.37us: 3 min: 0.37 ...
This catless version looks very good to me. Assuming testing and
approval by the scheduler people, this has an enthusiastic "ack" from
me.
Linus
--
On Sun, Nov 14, 2010 at 11:28 AM, Linus Torvalds
Oh, btw, maybe one comment: is CONFIG_SCHED_AUTOGROUP_DEBUG even worth
having as a config option? The only thing it enables is the small
/proc interface, and I don't see any downside to just always having
that if you have AUTOGROUP enabled in the first place.
It isn't like it's some traditional kernel debug feature that makes
things slower or adds tons of debug output.
Linus
--
It also enables the sched_autogroup_handler, that you disliked seeing in the previous version. -- Markus --
On Sun, Nov 14, 2010 at 12:27 PM, Markus Trippelsdorf
Yes, but that one exists only to make things "exact". I don't really
see why it exists. What's the point of doing the task movement, if the
group information is just going to be ignored anyway?
IOW, my dislike of the sched_autogroup_handler is not about the debug
option per se, it's about the same thing I said about tty hangups etc:
why do we care so deeply?
I think it would be perfectly acceptably to make the "enable" bit just
decide whether or not to take that autogroup information into account
for scheduling decision. Why do we care so deeply about having to move
the groups around right _then_?
Maybe there is some really deep reason for actually having to do the
reclassification and the sched_move_task() thing for each thread
synchronously when disabling/enabling that thing. If so, I'm just not
seeing it. Why can't we just let things be, and next time things get
scheduled they'll move on their own?
So my objection was really about the same "why do we have to try so
hard to keep the autogroups so 1:1 with the tty's"? It's just a
heuristic, trying to be exact about it seems to miss the point.
Linus
--
Not only. pinned tasks would stay in their autogroup until somebody moved them to a cgroup. Them wandering back over time would be fine, and all but pinned tasks will. -Mike --
But why is a problem?
That's kind of my point. Why do we care about some special case that
(a) likely doesn't happen and (b) even if it happens, what's the
problem with it happening?
Let me put it this way: the autogroup thing modifies the scheduler in
some subtle ways, but it doesn't make any _semantic_ difference. And
it's not like enabling/disabling autogroup scheduling is something
critical to begin with. It's a heuristic.
THAT is why I think it's so silly to try to be so strict and walk over
all processes while holding a couple of spinlocks.
Ok, so you disable autogroups after having used them, _and_ having
done some really specific things, and some processes that used to be
in a group may end up still scheduling in that group. Why care? It's
not like random people can enable/disable autogroup scheduling and try
to use this as a way to get unfair scheduling.
In fact, I'd go as far as saying that for the tty-based autogroup
scheduling, if you want the autogroup policy to take effect, it would
be perfectly ok if it really only affected the actual group
_allocation_. So when you turn it on, old tty associations that
existed before autogroup being turned on would not actually be in
their own groups at all. And when you turn it off, existing tty groups
would continue to exist, and you'd actually have to open a new window
to get processes to no longer be part of any autogroup behavior.
See what I'm saying? I still think you're bending over backwards to be
"careful" in ways that don't seem to make much sense to me.
Now, if there is some really fundamental reason why processes _have_
to be disassociated with the group when the autogroup policy changes,
that would be a different issue. If the scheduler oopses when it hits
a process that was in a tty autogroup but autogrouping has been turned
off, _that_ would be a reason to say "recalculate everything when you
disable/enable policy groups". But I don't think that's the case.
Linus
--
On Sun, Nov 14, 2010 at 4:15 PM, Linus Torvalds
Btw, let me say that I think the patch is great even with that thing
in. It looks clean, the thing I'm complaining about is not a big deal,
and it seems to perform very much as advertized. The difference with
autogroup scheduling is very noticeable with a simple "make -j64"
kernel compile.
So I really don't think it's a big deal. The sysctl handler isn't even
complicated. But boy does it hurt my eyes to see a spinlock held
around a "do_each_thread()". And I do get the feeling that the
simplest way to fix it would be to just remove the code entirely, and
just say that "enabling/disabling may be delayed for old processes
with existing autogroups".
Linus
--
Which is what I just did. If the oddball case isn't a big deal, the patch shrinks, which is a good thing. I just wanted to cover all bases. Patchlet with handler whacked: A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Some numbers. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 About measurement proggy: pert/sec = perturbations/sec min/max/avg = scheduler service latencies in usecs sum/s = time accrued by the competition per sample period (1 sec here) overhead = %CPU received by the competition per sample period pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us ...
Yeah. And I have to say that I'm (very happily) surprised by just how
small that patch really ends up being, and how it's not intrusive or
ugly either.
I'm also very happy with just what it does to interactive performance.
Admittedly, my "testcase" is really trivial (reading email in a
web-browser, scrolling around a bit, while doing a "make -j64" on the
kernel at the same time), but it's a test-case that is very relevant
for me. And it is a _huge_ improvement.
It's an improvement for things like smooth scrolling around, but what
I found more interesting was how it seems to really make web pages
load a lot faster. Maybe it shouldn't have been surprising, but I
always associated that with network performance. But there's clearly
enough of a CPU load when loading a new web page that if you have a
load average of 50+ at the same time, you _will_ be starved for CPU in
the loading process, and probably won't get all the http requests out
quickly enough.
So I think this is firmly one of those "real improvement" patches.
Good job. Group scheduling goes from "useful for some specific server
loads" to "that's a killer feature".
Linus
--
Next logical auto-step would be to try to subvert cfq. At a glance, io_context looks highly hackable for !CONFIG_CFQ_GROUP_IOSCHED case. The other case may get interesting for someone not very familiar with cfq innards, but I see a tempting looking subversion point. -Mike --
That made me go wtf.. curious way of writing things.
I guess the usual way of writing that (which is admittedly a little more
verbose) would be something like:
static bool
task_wants_autogroup(struct task_struct *tsk, struct task_group *tg)
{
if (tg != &root_task_group)
return false;
if (tsk->sched_class != &fair_sched_class)
return false;
if (tsk->flags & PF_EXITING)
return false;
return true;
}
static inline struct task_group *
autogroup_task_group(struct task_struct *tsk, struct task_group *tg)
{
if (task_wants_autogroup(tsk, tg))
return tsk->signal->autogroup->tg;
return tg;
}
--
That is a bit easier on the eye, modulo my hysterical attachment to "p". Ok, so it's hopefully now ready to either take wing or go splat. Baked: A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Some numbers. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 About measurement proggy: pert/sec = perturbations/sec min/max/avg = scheduler service latencies in usecs sum/s = time accrued by the competition per sample period (1 sec here) overhead = %CPU received by the competition per sample period pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24% pert/s: 23 ...
The cat forgot to quilt refresh :) A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Some numbers. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 About measurement proggy: pert/sec = perturbations/sec min/max/avg = scheduler service latencies in usecs sum/s = time accrued by the competition per sample period (1 sec here) overhead = %CPU received by the competition per sample period pert/s: 31 >40475.37us: 3 min: 0.37 max:48103.60 avg:29573.74 sum/s:916786us overhead:90.24% pert/s: 23 >41237.70us: 12 min: 0.36 max:56010.39 avg:40187.01 sum/s:924301us overhead:91.99% pert/s: 24 >42150.22us: ...
I continue to play the advocatus diaboli ;) Hmm, why? Perhaps PF_EXITING was needed in the previous version to avoid the race with release_task(). But now it is always safe to use signal->autogroup. And the exiting task can do a lot before it disappears, probably Not sure I understand why do we need rq->lock... It can't protect the change of signal->autogroup, multiple callers can use different rq's. However. Currently the only callers holds ->siglock, so we are safe. Perhaps we should just document that autogroup_move_group() needs ->siglock. This also mean the patch can be simplified even more, __sched_move_task() Well, in theory this can race with another thread doing autogroup_move_group(). We can read the old ->autogroup, and then use it after it was already freed. Probably this needs ->siglock too. Oleg. --
That came into existence when I stress tested previous version in PREEMPT_RT (boom). I see no good reason to bother an exiting task Guaranteed live ->autogroup should be good enough for heuristic use, and had better be so. Having to take ->siglock in the fast path would kill Another landmine. Done. A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only implements Linus' per tty task group suggestion, and only for fair class tasks, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls __proc_set_tty(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Some numbers. A 100% hog overhead measurement proggy pinned to the same CPU as a make -j10 About measurement proggy: pert/sec = perturbations/sec min/max/avg = scheduler service latencies in usecs sum/s = time accrued by the competition per sample period (1 sec here) overhead = %CPU received ...
lockdep_assert_held() is a good way to document these things. --
On Mon, Nov 15, 2010 at 02:25:50PM -0700, Mike Galbraith wrote: Mike, IIUC, this automatically created task group is invisible to user space? I mean generally there is a task group associated with a cgroup and user space tools can walk through cgroup hierarchy to figure out how system is configured. Will that be possible with this patch. I am wondering what will happen to things like some kind of per cgroup stats. For example block controller keeps track of number of sectors transferred per cgroup. Hence this information will not be available for these internal task groups? Looks like everybody likes the idea but let me still ask the following question. Should this kind of thing be done in user space? I mean what we are essentially doing providing isolation between two groups. That's why this cgroup infrastructure is in place. Just that currently how cgroups are created is fully depends on user space and kernel does not create cgroups of its own by default (ecept root cgroup). I think systemd does something similar in the sense every system service it puts in a cgroup of its own on system startup. libcgroup daemon has the facility to listen for kernel events (through netlink socket), and then put newly created tasks in cgroups as per the user spcefied rules in a config file. For example, if one wants isolation between tasks of two user ids, one can just write a rule and once the user logs in, its login session will be automatically placed in right cgroup. Hence one will be able to achieve isolation between two users. I think now it also has rules for classifying executables based on names/paths. So one can put "firefox" in one cgroup and say "make -j64" in a separate cgroup and provide isolation between two applications. It is just a matter of putting right rule in the config file. This patch sounds like an extension to user id problem where we want isolation between the processes of same user (process groups using different terminals). Would it make sense ...
Almost certainly not.
First off, user-space is a fragmented mess. Just from a "let's get it
done" angle, it just doesn't work. There are lots of different thing
that create new tty's, and you can't have them all fixed. Plus it
would be _way_ more code in user space than it is in kernel space.
Secondly, user-space daemons are a total mess. We've tried it many
many times, and every time the _intention_ is to make things simpler
to debug and deploy. And it almost never is. The interfaces end up
being too painful, and the "part of the code is in kernel space, part
of it is in user space" means that things just break all the time.
Finally, the whole "user space is more flexible" is just a lie. It
simply doesn't end up being true. It will be _harder_ to configure
some user-space daemon than it is to just set a flag in /sys or
whatever. The "flexibility" tends to be more a flexibility to get
things wrong than any actual advantage.
Just look at the patch in question. It's simple, it's clean, and it
"just works". Doing the same thing in user space? It would be a total
nightmare, and exactly _because_ it would be a total nightmare, the
code would never be that simple or clean.
Linus
--
Please elaborate, is this a generic statement or a comment on -- Three Cheers, Balbir --
No, it won't. The target audience is those folks who don't _do_ the configuration they _could_ do, folks who don't use SCHED_IDLE or nice, or the power available through userspace cgroup tools.. folks who expect I was of the same mind when Linus first broached the subject, but Ingo convinced me it was worth exploring because of the simple fact that people are not using the available tools. Sadly, this includes distros. AFAIK, no distro has cgroups configured and ready for aunt tilly, no distro has taught the GUI to use cgroups _at all_, even though it's I fiddled with configuring my system, but found options lacking. For instance, I found no way to automate per tty (or pgid in my case). I had to cobble scripts together to get the job done. Nothing that my distro delivered could just make it just happen for me. When the tool mature, and distros use them, in kernel automation may well become more or less moot, but in the here and now, there is a Per user isn't very useful. The typical workstation has one user whacking away on the kbd/mouse. While you can identify firefox etc, it's not being done, and requires identifying every application. Heck, cgroups is built in, but userspace doesn't even mount. Nothing but I see your arguments, and agree to a large extent. As Linus noted, there are other advantages to in kernel automation, but for me, it all boils down to the fact that userspace is doing nothing with the tools. ATM, cgroups is an enterprise or power user tool. The out of the box distro kernel user sees zero benefit for the overhead investment. -Mike --
The yet another init rewrite called systemd is supposedly cgroup happy.. No idea if its going to be useful though, I doubt its going to have an effect on me launching a konsole or the like, or screen creating a bunch of ttys. --
systemd uses cgroups only for process tracking. No resource management. Though afaik, Lennart has some plans of doing resource management using systemd. I wonder how autogroups will interact with systemd in that case. Dhaval --
On Tue, Nov 16, 2010 at 9:03 AM, Lennart Poettering
Numbers talk, bullshit walks.
The numbers have been quoted. The clear interactive behavior has been seen.
And you're just full of bullshit.
Come back when you have something working and with numbers and better
interactive performance. Until then, nobody cares.
Linus
--
Here's my super-complex patch btw, to achieve exactly the same thing
from userspace without involving any kernel or systemd patching and
kernel-side logic. Simply edit your own ~/.bashrc and add this to the end:
if [ "$PS1" ] ; then
mkdir -m 0700 /sys/fs/cgroup/cpu/user/$$
echo $$ > /sys/fs/cgroup/cpu/user/$$/tasks
fi
Then, as the superuser do this:
mount -t cgroup cgroup /sys/fs/cgroup/cpu -o cpu
mkdir -m 0777 /sys/fs/cgroup/cpu/user
Done. Same effect. However: not crazy.
I am not sure I myself will find the time to prep some 'numbers' for
you. They'd be the same as with the kernel patch anyway. But I am sure
somebody else will do it for you...
Lennart
--
Lennart Poettering - Red Hat, Inc.
--
Not quite the same, you're nesting one level deeper. But the reality is, not a lot of people will change their userspace. --
That's a weak argument - not a lot of people will (explicitly) change their kernel either - they'll get a fresh kernel via their distro updates, as they would get userspace updates. So it's only a few people (distros) that actually need to make such a change. Paul --
what is the downside of this patch going to be? people who currently expect all the processes to compete equally will now find that they no longer do so. If I am understanding this correctly, this could mean that a box that was dedicated to running one application will now have that application no longer dominate the system, instead it will 'share equally' with the housekeeping apps on the system. what would need to be done to revert to the prior situation? David Lang --
build with: CONFIG_SCHED_AUTOGROUP=n, boot with: noautogroup or runtime: echo 0 > /proc/sysctl/kernel/sched_autogroup_enabled (although the latter is a lazy one, it won't force existing tasks back to the root group) --
I think this might create some issues with controllers which support some kind of upper limit on resource usage. These hidden group can practically consume any amount of resource and because use space tools can't see these, they will not be able to place a limit or control it. If it is done from user space and cgroups are visible, then user can atleast monitor the resource usage and do something about it. Thanks Vivek --
Its cpu-controller only, and then only for SCHED_OTHER tasks which are proportionally fair. --
In this thread there are already mentions of extending it to block controller also which now supports the upper limit. Secondly I am assuming at some point of time, cpu controller will also support throttling (including SCHED_OTHER class). So as it is not a problem but the moment we start extending this logic to other controllers supporting upper limits it does pose the question that how do we handle it. Even if we automatically create groups inside the kernel (because for all the reasons that why user space can't do it), it probably should create cgroups also so that these are visible to user space and not hidden. Thanks Vivek --
Possibly, but in both cases its voluntary -- that is, unless explicitly configured a cgroup can use as much as possible. These auto groups will use that. If you want more control, create your own cgroup so you can poke at the parameters at will. --
I can create my own cgroups and control these. But one would like to control these hidden groups also. For example, in a clustered environment one might not want allow more than certain IOPS from machine X. In such scenario one would like to see all the groups and possibly put the limit. Now one can argue that thse groups are under root and put overall limit on root and these groups get automatically controlled. But controllers like block and memory also support flat mode where all the groups are at same level. pivot / | \ root g2 g3 Here g2 and g3 are hidden auto groups and because it is flat configuration putting limit on root will not help and one can not limit g2 and g3 and one can not control total amount of IO from the system. Now one can say use hierarchy and not flat setup. I am writting patches to enable hierarchy but that is disabled by default as flat setup came first and also there are concerns of accounting overhead etc. So point being that these autogroups being hidden is a concern. Can we do something so that these groups show up in cgroup hierarchy and are then user controllable. Thanks Vivek --
Well, either you disable the autogroup feature, or explicitly put each task in a taskgroup other than root and you'll be fine. Nothing avoids that from working. --
I think that's valid, and I don't think it would be wrong to have some
interface to show them.
Of course, then it's an interface question, and you'd want to be a bit
careful in designing it.
Linus
--
Well, it's _currently_ CPU controller only. People have already wondered if we should try to do something similar for IO scheduling too. So the thing I think is worth keeping in mind is that the "per-tty scheduling group" is really just an implementation issue. There is absolutely no question that it can't be about more than just scheduling, and that it can't be about more than just tty's also. And an important thing to keep in mind is that "user interfaces are bad". The thinner the interface, the better. One of the reasons I really like autogroup is that it has _no_ interface at all. It's very much a heuristic, and it has zero user interface (apart from the knob that turns it on and off, of course). That is a great feature, because it means that you cannot break the interface. You will never need to have applications that have special linux-specific hooks in them, or system daemons who have to touch magical /proc files etc. One of the problems I found annoying when just testing it using the plain cgroup interface (before the patch) was the resource management. You needed root, and they actually made sense requiring root, because I don't think we _want_ to allow people creating infinite numbers of cgroups. Vivek's "trivial patch" (shell script) is a major DoS thing, for example. Letting normal users create cgroups willy-nilly is not a good idea (and as Vivek already found out, his trivial script leaks cgroups in a pretty fundamental way). The tty approach is somewhat self-limiting in that it requires you to get the tty to get an autogroup. But also, because it's very much a heuristic and doesn't have any user-visible interfaces, from a kernel perspective it's wonderful. There are no "semantics" to break. If it turns out that there is some way to create excessive cgroups, we can introduce per-user limits etc to say "the heuristic works up to X cgroups and then you'll just get your own user group". And nobody would ever notice. So doing things automatically and without ...
Well, remove the 'user' part of the path and you have the exact same behaviour. Userspace usually gets updated way more frequently in most distributions than the kernel is. Maybe *you* never update userspace. But well, you are not the examplary Linux user, are you? Lennart -- Lennart Poettering - Red Hat, Inc. --
I run very recent userspace on my desktop and laptop, and I get really tired of fixing it every upgrade. That said, I do run more recent kernels than get shipped, in fact I never run distro kernels at all. --
On Tue, Nov 16, 2010 at 10:16 AM, Lennart Poettering
Right. And that's basically how this "patch" was actually tested
originally - by doing this by hand, without actually having a patch in
hand. I told people: this seems to work really well. Mike made it work
automatically.
Because it's something we want to do it for all users, and for all
shells, and make sure it gets done automatically. Including for users
that have old distributions etc, and make it easy to do in one place.
And then you do it for all the other heuristics we can see easily in
the kernel. And then you do it magically without users even having to
_notice_.
Suddenly it doesn't seem that wonderful any more to play with bashrc, does it?
That's the point. We can push out the kernel change, and everything
will "just work". We can make that feature we already have in the
kernel actually be _useful_.
User-level configuration for something that should just work is
annoying. We can do better.
Put another way: if we find a better way to do something, we should
_not_ say "well, if users want it, they can do this <technical thing
here>". If it really is a better way to do something, we should just
do it. Requiring user setup is _not_ a feature.
Now, I'm not saying that we shouldn't allow users to use cgroups. Of
course they can do things manually too. But we shouldn't require users
to do silly things that we can more easily do ourselves.
If the choice is between telling everybody "you should do this", and
"we should just do this for you", I'll take the second one every time.
We know it should be done. Why should we then tell somebody else to do
it for us?
Linus
--
On Tue, Nov 16, 2010 at 8:49 PM, Linus Torvalds
Completely agreed. Desktop users should not be required to fiddle with
kernel knobs from userspace to fix interactivity problems. Having sane
defaults applies to the kernel as much as it does to userspace.
Pekka
--
The only problem is that *desktop* users use *desktop* apps, which never have a controlling tty. This is a *nerd* config not helping any regular desktop user. You guys are optimizing the system for people who build kernels and watch movies at the same time, not for desktop users. Hooking into TTYs works for almost nobody sane out there. :) It's all fine, no comment on the patch, it's a nice hack. But please stop talking about the *desktop* here, it has almost nothing to do with it. Kay --
Yes. And have you noticed people complaining about stuttering while
they run just their desktop app? No.
That's the thing. If you run a web browser and you use a flash player
to view youtube or whatever in it, and only use other interactive
desktop apps anyway, you won't see any problems _regardless_. It's not
like the other desktop app you have open (and is idle, because you're
not touching it) will matter.
Ask yourself who is complaining? _I_ have been complaining for years
about desktop latency. I usually do it in private to developers, but
trust me, I do it. Much of it has been about IO (the whole fsync
fiasco), but some of it has been about the scheduler.
Look at who were trying out Con's patches. They were compiling things
and running games at the same time. That's literally the kind of loads
that people were looking at. Partly because that's the kinds of loads
we haven't been good at. And it's the kind of load that this helps
with.
Anyway, I find it depressing that now that this is solved, people come
out of the woodwork and say "hey you could do this". Where were you
guys a year ago or more?
Tough. I found out that I can solve it using cgroups, I asked people
to comment and help, and I think the kernel approach is wonderful and
_way_ simpler than the scripts I've seen. Yes, I'm biased ("kernels
are easy - user space maintenance is a big pain").
Linus
--
Jeez. Don't mentione the desktop. On the desktop this is compleltely irrelevant. There are not TTYs on the desktop. There's no "make -j" of the kernel tree on the desktop. The kernel patch discussed here *has* *no* *relevance* for normal users. The kernel patch discussed here is only relevant for people which start mplayer from one terminal, and "make -j" from another. Lennart -- Lennart Poettering - Red Hat, Inc. --
Really, who uses Linux that is not a geek? -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
agreed, the question is if this really is a better way. It's definantly a this is good for desktop interactivity because it no longer treats all processes equally, it give more CPU to processes that are running 'stand-alone' then it will to processes that are forked off from one master process. In the desktop case where you really want something like 'make -j64' to be in the background and not interfere with the other tasks, but what about the situation of a dedicated web server? on that box the apache processes will get less CPU because they will all be in one group, and other things that happen on the box will get more CPU as they will be in different groups. Is this really the right change to make? having an option to do this sounds like a wonderful idea, and I would expect that desktop distros would want to have it default to 'on', but should it really be a default? David Lang --
This isn#t good for desktop interatctivey. It is *irrelevant* for desktop interactivity -- unless you define running "make -j" a typical desktop usecase. Which it isn't. Just stop bringing about the word "desktop" here. It has no point in No you don't. Because that is not a desktop use case. Lennart -- Lennart Poettering - Red Hat, Inc. --
On Tue, Nov 16, 2010 at 12:33 PM, Lennart Poettering
See my other response. You don't care AT ALL, because by your
judgement, all desktop is is a web browser and a word processor.
So why are you even discussing this? It's irrelevant for you. cgroups
will _never_ matter for what you are talking about, and that has
nothing to do with ttys, automation, scripting or anything else.
Because your definition of "desktop" seems to be "only interactive
apps". So this i all irrelevant.
Which is fine by me. It's not what the patch is supposed to affect.
Linus
--
Well, I do care. But I care more about *real* problems. For example the fact that "updatedb" makes your system sluggish while it runs. Or "man-db". Or anything else that runs from cron in the background. Doing this tty dance won't help you much with background tasks such as man-db, updatedb and cron and its jobs, will it? They don't have ttys. Sorry for you. meh! Meh! meh! meh! meh! (And along comes systemd, which actually handles this properly, since it actually has a proper notion of what a service is, and what a session is, and what an app is. And which hence can control all this sanely.) Binding this to a tty is just solves a tiny bit of the real problem: i.e. your own use of make -j. End of story. Lennart -- Lennart Poettering - Red Hat, Inc. --
Oh, you mean real problems like systemd can boot 10 seconds faster. What is 10 seconds when your system is up days or weeks? -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
Lennart, would you mind pointing me the the paragraph that states, autogroup excludes any other improvements in this area? In contrary, Linus clearly states, that this solves a long standing use case, that _he_ is suffering from a lot, and I bet, most of us in one or another way.. And it contains all that is needed: a fine selection of knobs for switching on/off that beast. Hopefully it can be taught to reveal some of its internal mechanics to the world, then all is fine. If you think, that systemd can solve this and probably other aspects of responsiveness, go for it, compete with it, and _prove_ it with real facts and numbers, not just hand waving. As already mentioned countless times (and some of it was even renamed for this very fact): the grouping by tty is just a starter. There are plenty of other possibilities to group the scheduling. The hard part is to find the right grouping concepts, that are making sense in the usability department _and_ are easy enough to be picked up from our favorite system and desktop environments. That's where the generic cgroup concept seems to be lacking ATM.. In one year from now on, our preferred distros will show, who won this competition. Probably both of you ;-) Pete --
Actually, cgroups should probably be completely hierarchical: sessions contain process groups, which contain processes, which contain threads. You could also gather sessions with the same uid. Samuel --
Hierarchical is ~tempting, but adds overhead for not much gain. -Mike --
What overhead? The implementation of cgroups is actually already hierarchical. Samuel --
On Thu, Nov 18, 2010 at 3:43 PM, Samuel Thibault
Well, at least the actual group creation overhead.
If it's a "only at setsid()", that's a fairly rare thing (although I
think somebody might want to run something like the AIM7 benchmark - I
have this memory of it doing lots of tty tests).
Or if it's only at "user launches new program from window manager",
that's rare too.
But if you do it per process group, now you're doing one for each
command invocation in a shell, for example. If you're doing things per
thread, you've already lost.
Also, remember the goal: it was never about some theoretical end
result. It's all about a simple heuristic that makes things work
better. Trying to do that "perfectly" totally and utterly misses the
whole point.
(google "Perfect is the enemy of good" - Voltaire)
Linus
--
Well, if it's from an interactive shell, it's not really a problem :) But when it's from a script it can become one, yes. But are cgroups so Not per thread, per process, i.e. put threads of the same process in the same cgroup. Again, I would have thought that creating a cgroup is very lightweight in front of a fork(). If not, maybe we are just looking for another, more lightweight container information that the scheduler would use [1], and keep more heavyweight containers for the non-automatic Sure. Using sid should already be quite good, but including the uid information as well should be easily even better. Samuel [1] Actually, I happen to have written a PhD thesis on this, that's why I'm popping up:) --
Also note that having a hierarchical process structure should permit to make things globally more efficient: avoid putting e.g. your cpp, cc1, and asm processes at three corners of your 4-socket NUMA machine :) Samuel --
We have the hierarchy mandated by POSIX to track parents, childs, sessions and all that stuff, its just not the data structure used for scheduling. And no, using that to load-balance between CPUs doesn't necessarily help with the NUMA case, load-balancing is an impossible job (equivalent to page-replacement -- you simply don't know the future), applications simply do wildly weird stuff. From a process hierarchy there's absolutely no difference between a cc1/cpp/asm and some MPI jobs, both can be parent-child relations with pipes between, some just run short and have data affinity, others run long and don't have any. --
MPI jobs typically communicate with each other. Keeping them on the same socket permits to keep shared-memory MPI drivers to mostly remain in e.g. the L3 cache. That typically gives benefits. Samuel --
Colour me unconvinced, measuring shared cache footprint using PMUs might help (and people have actually implemented and played with that at various times in the past) but again, the added overhead of doing so I'm not at all convinced using the process hierarchy will really help much, but feel free to write the patch and test it. But making the Pushing them away permits them to use a larger part of that same L3 cache allowing them to work on larger data sets. Most of the MPI apps have a large compute to communication ratio because that is what allows them to run in parallel so well (traditionally the interconnects were terribly slow to boot), that suggests that working on larger data sets is a good thing and running on the same node really doesn't matter since communication is assumes slow anyway. There really is no simple solution to his. --
Err, if the compute to communication ratio is big, then you should use all CPU cores, up to the point where communication becomes a matter again, and making sure that related MPI processes end up on the same I never said there was even a solution, actually (in particular any kind of generic solution), but that there are a a few simple ways exist to make things better. Samuel --
On Thu, Nov 18, 2010 at 4:02 PM, Samuel Thibault
Absolutely not.
We have a good light-weight fork(). We try to avoid any extra
allocations. We *definitely* don't want things at that level.
Seriously. I'd really like somebody running AIM7 just to see that even
just doing it at setsid() doesn't hurt too badly. And that's something
that happens once in a blue moon compared to fork and/or process
groups.
Once per session is about as much as is acceptable. That's the kind of
granularity we should look at. So things like "groups per user",
"groups per session", "groups per one graphical application" are good.
Not things that can happen tens of thousands of times a second.
Linus
--
All right. I believe that should already work quite well both for desktop and servers indeed. "Per one graphical application" will most probably require desktop panel patching, however. Samuel --
On Thu, Nov 18, 2010 at 4:59 PM, Samuel Thibault
Sure. As mentioned somewhere earlier in this thread, I don't think
this whole grouping thing is some "either or" black-or-white thing.
So I personally like the kernel patch because it's simple, and it
"just works", regardless of what user space you happen to run. But
that doesn't mean that user space couldn't easily give additional
hints (to the point that if you end up having a distro that does
hinting for everything, you could just turn off the kernel side
entirely).
Linus
--
I think that could be done with a fork flag with the same semantics as reset on fork. Once your task launcher (ala kdeinit) is tagged, it launches task groups, the children lose that ability. That requires userspace interaction though, dunno how you'd detect automatically. -Mike --
Mmm, even if always creating groups can have a slight overhead, you should probably not prevent userspace from deciding to do so when it knows it's appropriate. Samuel --
I think you may have misunderstood. The flag would be a hint that userland can set to say "I want to fork off task groups", just as SCHED_RESET_ON_FORK asks the kernel to reset a child's sched class and nice level on fork. -Mike --
Ah, ok. I would have rather done this per fork call, as there may be a difference between starting an application and starting a panel widget. Samuel --
It must be nice to be that ignorant ;-) Speaking for the scheduler cgroup controller (that being the only one I actually know), most all the load-balance operations are O(n) in the number of active cgroups, and a lot of the cpu local schedule operations are O(d) where d is the depth of the cgroup tree. [ and that's with the .38 targeted code, current mainline is O(n ln(n)) for load balancing and truly sucks on multi-socket ] You add a lot of pointer chasing to all the scheduler fast paths and there is quite significant data size bloat for even compiling with the controller enabled, let alone actually using the stuff. But sure, treat them as if they were free to use, I guess your machine is fast enough. --
In general though, I think you can say that: cgroups ass overhead. Simply because you add constraints, this means you need to 1) account more, 2) enforce constraints. Both have definite non-zero cost in both data and time. --
I really think you meant "add" here ? (Hey! The keys were next to each other!) Yep, this looks like one of these perpetual throughput vs latency trade-offs. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
Trade-off sure, throughput vs latency only in a specific use-case, its more a feature vs cost thing, just like all them trace people want lower cost tracing but want more features at the same time.. --
Yep, agreed for "feature vs cost", given that the kind of latency that is fixed in this case really boils down to a sluggish system under a relatively common workload -- so making this work might be called a "feature". ;) FWIW, about tracing, well, we should distinguish between features that add cost to the fast-path and slow-path only. But I really don't care anymore, since Ingo, Linus, Thomas and yourself made it very clear that you don't care. So let's not even start this discussion -- it's going nowhere. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
The same would apply to CPU autogroups, presumably? Paul --
Yep, they're not special at all... uses the same mechanism. --
The only difference is cost of creation and destruction, so cgroups and autogroups suck boulders of slightly different diameter when creating and/or destroying at high frequency. -Mike --
The pointer chasing hurts, but fast path + global spinlock is what slapped the hierarchical outta me :) -Mike --
I can say that for memory, with hierarchies we account all the way up, which can be a visible overhead, depending on how often you fault. -- Three Cheers, Balbir --
Of course systemd can do it. You don't even need systemd, autogroup or whatever else if you don't mind a little scripting. There's nothing to prove, any numbers he generates will be identical with any numbers I generate... it's the same scheduler whether you configure it from Actually, I switched to setsid alone for now at least, and the only thing in the root group is kernel threads. -Mike --
On Tue, Nov 16, 2010 at 10:33 PM, Lennart Poettering
How about you stop bringing *your* narrow definition of "desktop" to
the discussion? I am a kernel hacker but that doesn't mean I'm not
also a desktop user. I shouldn't need to play with cgroups from
userspace to keep music playing while I compile kernels. Things should
just work.
Pekka
--
On Tue, Nov 16, 2010 at 7:49 PM, Linus Torvalds So what do you think about something like systemd handling it. systemd already does a lot of this stuff already in the form of process tracking, so it is quite trivial to do this. And more happily avoids all this complexity in the kernel. Thanks, Dhaval --
Note that even if this was a mistake, systemd can disable it and use its own heuristics. It would be nice to know what Lennart thinks about that, because it would be pointless to have a feature that eventually could be disabled in each boot in most distros. --
What complexity? Have you looked at the patch? It has no complexity anywhere.
It's a _lot_ less complex than having system daemons you don't
control. We have not had good results with that approach in the past.
System daemons tend to cause nasty problems, and debugging them is a
nightmare.
Have you ever tried to debug the interaction between acpid and the
kernel? I doubt it. It's not simple.
Did you ever try to understand how the old pcmcia code worked, with
the user daemon to handle the add/remove requests?
Have you ever really worked with the "interesting" situations that
come from the X server handling graphics mode switching, and thus
being involved in suspend, resume and VT switching? Trust me, just
designing the _interfaces_ to do that thing is nasty, and the code
itself was a morass. There's a reason the graphics people wanted to
put modeswitching in the kernel. Because doing it in a daemon is a
f*cking pain in the ass.
Put another way: "do it in user space" usually does _not_ make things simpler.
For example: how do you do reference counting for a cgroup in user
space, when processes fork and exit without you even knowing it? In
kernel space, it's _trivial_. That's what kernel/autogroup.c does, and
it has lots of support for it, because that kind of reference counting
is exactly what the kernel does.
In a system daemon? Good luck with that. It's a nightmare. Maybe you
could just poll all the cgroups, and try to remove them once a minute,
and if they are empty it works. Or something like that. But what a
hacky thing it would be.
And more importantly: I don't run systemd. Neither do a lot of other
people. The way the patch does things, "it just works".
Did you go to the phoronix forum to look at how people reacted to the
phoronix article about the patch? There were a number of testers. It
was just so _easy_ to test and set up. If you want people to run some
specific system daemon, it immediately gets much harder to set up and
do.
...On Tue, Nov 16, 2010 at 11:45 AM, Linus Torvalds There's an existing cgroups API - release_agent and notify_on_release - whereby the kernel will spawn a userspace command once a given cgroup is completely empty. It's intended for pretty much exactly this purpose. Paul --
Yes. And it seems to be working just fine for me. I modiefied Lennart's
script a bit to achieve that.
Addition to my .bashrc.
if [ "$PS1" ] ; then
mkdir -m 0700 -p /cgroup/cpu/$$
echo 1 > /cgroup/cpu/$$/notify_on_release
echo $$ > /cgroup/cpu/$$/tasks
fi
I created one file /bin/rmcgroup to clean up the cgroup.
#!/bin/bash
rmdir /cgroup/cpu/$1
And did following to mount cgroup and setup empty group notification.
mount -t cgroup -o cpu none /cgroup/cpu
echo "/bin/rmcgroup" > /cgroup/cpu/release_agent
And it works fine. Upon ssh to my box, a cpu cgroup is automatically
created and upon exiting the shell, this group is automatically destroyed.
So API/interface for automatically reclaiming the cgroup once it is empty
seems to be pretty simple and works.
Thanks
Vivek
--
You can just do an rmdir from the cgroup release handler. Heck, "rmdir" is a pretty good GC in itself, since it deletes a cgroup only if it is Well, that nightmare already exists. It's systemd. We use the cgroup release handler. If you ask me it's an aweful interface, but works So this basically boils down to the fact that this is useful for your particular usecase. Because you don't want to update userspace. But don't claim this would be useful for anybody but you. It is definitely Jeez. Phoronix! If you truly believe that the Phoronix usecase of running "make -j64" over the kernel tree was in any way relevant in real life for anybody but kernel developers, then I can't help you. Lennart -- Lennart Poettering - Red Hat, Inc. --
To me, this starts sounding like a legendary "userspace suspend" versus "kernel suspend" support fights many years ago... Since that, I have _sometimes_but_rarely_ been luckily enought to get my laptops recovered from the suspend states successfully... And as the kernel patches allows disabling the schedyles feature either via kernel build parameters, boot parameters or at runtime _if_you_ really want, I do not understand why do you complaint if there will be much better defaults in the kernel immediately once you just build and boot the new kernel. (I really do not plan to start converting init-rd boot scripts from all of my computers systemd way of doing things in near future...) Mika --
Nah. Patch accepted, rejected, obsoleted 10 minutes from now etc etc, it'll never be a big deal. -Mike --
Well, this feature is pretty much interesting only for kernel hackers anyway. It is completely irrelevant for normal desktop people. Because a) normal users don't use many terminals anyway and that very seldom and b) if they do that they do not create gazillion of processes from one of the terminals and only very few in the other. Because only then this patch becomes relevant. Heck, the patch wouldn't even have any effect on my machine (and I am hacker), because I tend to run my builds from within emacs. And since emacs has no TTY (since it is a X11/gtk build) it wouldn't be in its own scheduling group. So, this patch only has an effect of people who build kernels from an xterm with make -j all day, and at the same time want to watch a movie, from a player they also start from a terminal, but from another one. Only in such a setup this patch matters. And for everybody else it is completely irrelevant. If you don't use terminals it has no effect. If you don't run massivily parallel CPU hoggers it has no effect. If you do not run your mplayer from a terminal it has no effect. The kernel bears your name, but that doesn't mean your own use of it was typical for more than you and a number kernel hackers like you. Also, this implicit assumption that nobody would ever see this because people upgrade their kernels often and userspace seldom is just nonsense anyway. That's how *you* do it. But in reality userspace is updated for most folks way more often then the kernel is. Or to turn this around: think how awesome that would be if we did this in userspace, then we could support older kernels too and wouldn't have to upgrade everything to your not-yet-released new kernel. Suddenly it doesn't seem that wonderful anymore to play with the kernel Well, I have no plans in pushing anybody to do this in bash really. All I am saying that tying this to a tty is crazy. And this is policy, and should be done in userspace, and we are almost there to do this in userspace. In ...
From: Lennart Poettering <mzxreary@0pointer.de> Type 'tty' in that emacs shell, what does it give you? It does get it's own TTY and it will thus get it's own scheduling group. --
I am running "make -j", not a shell from emacs. it doesn't have a tty. I just want to build something, not have terminal emulator in emacs. Lennart -- Lennart Poettering - Red Hat, Inc. --
From: Lennart Poettering <mzxreary@0pointer.de> It does the same exact thing. Please actually check and verify your facts instead of tossing back kneejerk responses. davem@sunset:~/src/GIT$ cat Makefile all: tty davem@sunset:~/src/GIT$ "M-x compile" in that directory gives: -------------------- -*- mode: compilation; default-directory: "~/src/GIT/" -*- Compilation started at Tue Nov 16 13:13:01 make -k tty /dev/pts/2 Compilation finished at Tue Nov 16 13:13:01 -------------------- --
And within the desktop where would you put this - in the window manager on the basis of top level windows or in the app startup ? Alan --
Btw, I suspect either of these are reasonable. In fact, I don't think
it would be at all wrong to have the desktop launcher have an option
to "launch in a group" (although I think it would need to be named
better than that). Right now, when you create desktop launchers under
at least gnome, it allows you to specify a "type" for the application
("Application" or "Application in Terminal"), and maybe there could be
a "CPU-bound application" choice that would set it in a CPU group of
its own. Or whatever.
So I do _not_ believe that the autogroup feature should necessarily
mean that you cannot do other grouping decisions too. I just do think
that the whole notion of "it got started from a tty" is actually a
very useful thing for legacy applications, and one where it's just
simpler to do it in the kernel than build up any extra infrastructure
for it.
So it's not necessarily at all an "either-or" thing.
Linus
--
Well, my plan was actually to by default put everything into its own group, and then let users opt-out of that for specific processes, if the want to. Lennart -- Lennart Poettering - Red Hat, Inc. --
How many users are likely to do this, though? I think you really want to make this be something which the application can specify by default that they should start in their own cgroup. One idea might be to it to the applications menu entry: http://standards.freedesktop.org/desktop-entry-spec/latest/ ... so there would be a new key value pair, "start_in_cgroup", that would allow the user to start an application in their own cgroup. It would be up to the desktop launcher to honor that if it was present. One nice thing about having the desktop launch each application in its own cgroup is that it becomes easier for an desktop task manager to have a UI listing that lists things in a format which will be somewhat easier to understand than process listing. The cgroup would be a useful way to organize what is going on for each launched application, and it would allow people to see how much memory some application like evolution really requires. (On the other hand, maybe they would be happier not knowing. :-) - Ted --
This is pretty much in line with what I want to do, except I want opt-out, not opt-in behaviour here. Lennart -- Lennart Poettering - Red Hat, Inc. --
That works for me. One suggestion is that in addition to "opt-out",
it should be also possible for an application launcher file to specify
a specific cgroup name that should be used. That would allow multiple
applications in a group to assigned to the same cgroup.
There also needs to be a way to start applications that are started
via the GNOME and KDE session manager in a specified cgroup as well.
I assume that's in your plan as well?
- Ted
--
Being able to specify cgroup name/path is a good idea. That way one can make use of cgroup hierarchy also. Thinking more about opt-in vs opt-out issue. Generally cgroups provide some kind of isolation between application groups and in the process can be somewhat expensive. More memory allocation, more accounting overhead and for CFQ block controller it can also mean additional idling and can result in overall reduced throughput. Keeping that in mind, is it really a good idea to launch each application in a separate group. Will it be better to let user decide if the application should be launched in a separate cgroup? The flip side is that how many people will really know about the functionality and will really launch application in a separate group. And may be it is a good idea to put everybody in a seprate cgroup by default even it means some cost so that if a application starts consuming too much of resources (make -j64), then its impact on rest of the groups can be contained. I really don't have strong inclination for one over other. Just thinking loud... Vivek --
I wouldn't be too concerned here. It's not that we end up with 1000s of groups here. It's way < 40 or in the end, for a single user machine. Which I think isn't that bad. Lennart -- Lennart Poettering - Red Hat, Inc. --
>>>>> "Lennart" == Lennart Poettering <mzxreary@0pointer.de> writes: Lennart> I wouldn't be too concerned here. It's not that we end up Lennart> with 1000s of groups here. It's way < 40 or in the end, for a Lennart> single user machine. Which I think isn't that bad. Ok, so how will this impact users on a single machine which is used to host VNC sessions? I've got machines with 30+ users all running one or more VNC sessions each on there. Nothing CPU bound generally, unless something freaks out and starts hogging a CPU. But will the overhead of 1000 groups be noticeable? Thanks, John --
+1 for implementing this in systemd than in the kernel. The userspace has much more info about which process needs to go into which group. --
-1 on your systemd/gnome/wm/etc. crap, that means it will become impossible to sanely switch off (gnome firmly believes knobs are evil), leaving everybody who _does_ know wth they're doing up a certain creek without a paddle. --
Please, can we stop with this false dichotomy? This is decidedly not true and as Lennart has already pointed out, the knob already exists in systemd. You may like the kernel approach, but this does not mean there is no place for grouping driven by userspace. - Ben --
Yes, and then at the next release, some idiotic GNOME engineer will decide to the "improve" the system by removing yet another knob.... -- Ted --
From: Theodore Tso <tytso@MIT.EDU> I have to agree on this one, there is no reason to believe that this trend will not continue and I've lost every ounce of optimism I've ever had in this area. Besides, I'm having trouble believing someone who can't even get it through his thick skull that even "M-x compile" in emacs gives a TTY to the make process. Yet he kept claiming the opposite over and over again until I absolutely proved it to him, at which point he became completely silent on that line of reasoning. Someone who argues and behaves in that way is not someone I put much stock in as far as implementing things in a wise or well informed way, it feels more like a mechanism which is being forced and rushed ahead without enough research or consideration. --
Wow, so much hate! Love you too, Lennart -- Lennart Poettering - Red Hat, Inc. --
From: Lennart Poettering <mzxreary@0pointer.de> Thanks for not addressing my technical arguments at all, which serves only to further my reservations. --
Guys, calm down. It's really not -that- much of a deal.
We can do both. There is basically zero cost in the kernel: you can
disabled the feature, and it's not like it's a maintenance headache
(all the complexity that matters is the group scheduling itself, not
the autogroup code that is well separated). And the kernel approach is
useful to just take an otherwise unmodified system and just make it
have nice default behavior.
And the user level approach? I think it's fine too. If you run systemd
for other reasons (or if the gnome people add it to the task launcher
or whatever), doing it there isn't wrong. I personally think it's
somewhat disgusting to have a user-level callback with processes etc
just to clean up a group, but whatever. As long as it's not common,
who cares?
And you really _can_ combine them. As mentioned, I'd be nervous about
AIM benchmarks. I keep mentioning AIM, but that's because it has shown
odd tty issues before. Back when the RT guys wanted to do that crazy
blocking BKL thing (replace the spinlock with a semaphore), AIM7
plummeted by 40%. And people were looking all over the place (file
locking etc), and the tty layer was the biggest reason iirc.
Now, I don't know if AIM7 actually uses setsid() heavily etc, and it's
possible it never hits it at all and only does read/write/kill. And
it's not like AIM7 is the main thing we should look at regardless. But
the point is that we know that there are some tty-heavy loads that are
relevant, and it's very possible that a hybrid approach with "tty's
handled automatically by the kernel, window manager does others in
user space" would be a good way to avoid the (potential) nasty side of
something that has a lot of short-lived tty connections.
So guys, calm down. We don't need to hate each other.
Linus
--
On that note, is there a good reason why the notify_on_release interface works the way it does? Wouldn't it be simpler if the cgroup simply provided a file on which a process (e.g. systemd) could block? I guess it's a little too late at this point considering the old mechanism will still need to be supported, but it seems like this would provide a slightly cheaper cleanup path. Just my (perhaps flawed) two Thanks for the nudge back to sanity. Cheers, - Ben --
Actually, the sane interface would likely just be to have a "drop on
release" interface.
Maybe some people really want to be _notified_. But my guess would
that that just dropping the cgroup when it becomes empty would be at
least a reasonable subset of users.
Who uses that thing now? The desktop launcher/systemd approach
definitely doesn't seem want the overhead of being notified and having
to remove it manually. Does anybody else really want it?
But that's really an independent question from all the other things.
But with new cgroup users, it's very possible that it turns out that
some of the original interfaces are just inconvenient and silly.
Linus
--
Hmm, I think automatic cleanup would be quite nice, but there are some niche cases to think about first (i.e. what do you do if a process generates a cgroup and wants to make itself a member of it, but then dies before it can do that, and the cgroup stays around but empty and I'd like automatic cleanup, but definitely also want to be notified when a cgroup runs empty. Here's the use case: we run apache in a cgroup (in a named hierarchy, not attached to any controller, we do this for keeping track of the service and all its children). Now apache dies. Its children, various CGI scripts stay around. However, since Apache is configured to be restarted systemd now kills all remaining children and waits until they are all gone, so that when it starts Apache anew we are in a clean and defined environment, and no remains of the previous instance remain. For this to work I need some kind of notification when all children are gone. Of course if systemd is PID 1, I can just use SIGCHLD for that, but that's more difficult when we are managing user porcesses, and want to do that with a PID != 1. And even even we are PID 1 its kinda neat to have an explicit notification for when a cgroup is empty instead of having to constantly check whether the cgroup is now empty after each SIGCHLD we receive. Also, there must be a way to opt out of automatic cleanup for some groups, since it might make sense to give users access to subtrees of the hierarchy and if you clean up groups belonging to privileged code then you get a namespace problem because unprivileged code might recreate that group and confuse everybody. So yeah, auto-cleanup is nice, but notifications I want too, please thanks. Lennart -- Lennart Poettering - Red Hat, Inc. --
Backwards-compatibility with cpusets, which is what cgroups evolved from. A delete_on_release option would be possible too, for the cases where there's really no entity that wants to do more than simply delete the group in question. Paul --
The notify_on_release interface is awful indeed. Feels like the old hotplug interface where each module request by the kernel caused a hotplug script to be spawned by the kernel. However, I am not sure I like the idea of having pollable files like that, because in the systemd case I am very much interested in getting recursive notifications, i.e. I want to register once for getting notifications for a full subtree instead of having to register for each cgroup individually. My personal favourite solution would be to get a netlink msg when a cgroup runs empty. That way multiple programs could listen to the events at the same time, and we'd have an easy way to subscribe to a whole hierarchy of groups. Lennart -- Lennart Poettering - Red Hat, Inc. --
The netlink message should not be hard to do if we agree to work on it. The largest objections I've heard is that netlink implies network programming and most users want to be able to script in their automation and network scripting is hard. -- Three Cheers, Balbir --
Well, the notify_on_release stuff cannot be dropped anyway at this point in time, so netlink support would be an addition to, not a replacement for the current stuff that might be useful for scripting folks. Lennart -- Lennart Poettering - Red Hat, Inc. --
Agreed, we still need the old notify_on_release. Are you suggesting that for scripting we use the old interface and newer tools use netlink? -- Three Cheers, Balbir --
No, the contrary. I was referring to "the current stuff that might be useful for scripting and folks". And then netlink stuff would be for everything beyond scripting, but not scripting itself. Lennart -- Lennart Poettering - Red Hat, Inc. --
If I were home, I'd have already checked out AIM7 and more, but I didn't load laptop up with all my toys unfortunately. (or fortunately, since T130 ain't exactly a speed daemon) -Mike --
It does for systemd, but he also said he's proposing patches for other userspace components (something gnome). And as we all know, gnome is not My concern is that I don't want to go trawling through a dozen unrelated userspace package configurations to disable all this. --
As a last update to this messy discussion, here's a commit I made to systemd which ensures that every user session by default gets its own cgroup in the 'cpu' hierarchy. http://cgit.freedesktop.org/systemd/commit/?id=74fe1fe36e35a26d764f1e3119d5f6d014db573c And here's the one that does the same to ensure that every service systemd manages gets by default its own cgroup in the 'cpu' hierarchy. http://cgit.freedesktop.org/systemd/commit/?id=d686d8a97bd7945af0a61504392d01a3167b576f And finally, here's bugzilla entry for a patch I prepped for gnome-terminal's livte which creates a subgroup for each terminal widget shown by the same g-t instance: https://bugzilla.gnome.org/show_bug.cgi?id=635119 With this all together we get quite a bit further than with the kernel patch, since we also cover all kinds of services, and treat user sessions equal to each other, and even users. (i.e. user A cannot get double the amount of CPU time simply by spawning double the amount of processes or double the amount of session thatn user B). This should fix things for people with systemd and GNOME. Yes, all others are left in the cold. Sorry for that. Lennart -- Lennart Poettering - Red Hat, Inc. --
Is there an easy opt out for that, other than booting a CONFIG_CGROUP=n kernel? --
systemd relies on CONFIG_CGROUP=y, since it useses it for service management. It creates its own name=systemd hierarchy for that with no controllers attached. If you turn that off, then systemd will refuse to boot. However, it does not rely on any of the controllers, and hence you are welcome to disable all cgroup controlls and systemd won't complain. If you want to disable the automatic creation of groups in the 'cpu' hierarchy for user sessions then you can tell pam_systemd that by passing "controllers=" on the PAM config line. ("controllers=cpu" is the implied default.) There's currently no global option to disable the same logic in systemd when it creates 'cpu' cgroups for the various services it runs. However, you can disable that individually with "ControlGroups=cpu:/" in the .service files. I will now add a global option as well. Lennart -- Lennart Poettering - Red Hat, Inc. --
Do expect distro bugzilla entries when this 'awesome'-ness hits the A global knob is a must -- preferably with neon signs on so I can find it. Luckily I don't use this GNOME junk, otherwise I'd have had to ask how to revert that crap as well. --
Amen!! -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
This global option is now commited to systemd git. Lennart -- Lennart Poettering - Red Hat, Inc. --
The plan with systemd is to make it manage both the system and the sessions. It's along the lines of what launchd does on MacOS: one instance for the system, another one for the user, because starting and supervising a system service and a session service are actually very very similar things. In F15 we'll introduce systemd as an init system only. The next step will be to make it manage sessions too, and replace/augment gnome-session. Lennart -- Lennart Poettering - Red Hat, Inc. --
>>>>> "LP" == Lennart Poettering <mzxreary@0pointer.de> writes: LP> Heck, the patch wouldn't even have any effect on my machine (and I am LP> hacker), because I tend to run my builds from within emacs. And since LP> emacs has no TTY (since it is a X11/gtk build) it wouldn't be in its own LP> scheduling group. M-x compile uses a pty to talk to its sub-process, so if ptys get their own cgroup then a compile w/in emacs will, too. (Try using tty as the command M-x compile should run to confirm.) -JimC -- James Cloos <cloos@jhcloos.com> OpenPGP: 1024D/ED7DAEA6 --
This is absolutely fine, as long as everyone wants the feature, "just works" for us as kernel developers might not be the same as "just works" for all end users. How does one find and disable this feature if one is not happy? I don't think it is very hard, it could be a simple tool that the distro provides or documentation, but hidden works both ways, IMHO. In summary, we need tooling with good defaults. -- Three Cheers, Balbir --
So you have tested this and have a nice demo and numbers to back it up? -- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
I just modified my .bashrc and for my each ssh session, it seems to be working fine and creating a cgroup as soon as I ssh into the box. Just that it will need little modification to reap the group automatically when shell exits. Thanks Vivek --
Yeah, self-reaping launcher scripts work fine, I cobbled one together. There's no question that the infrastructure works. -Mike --
Yes, I tested this. And I thought the paragraph was explicit enough about the numbers. Lennart -- Lennart Poettering - Red Hat, Inc. --
OK, I've done some tests and the result is that Lennart's approach seems to work best. It also _feels_ better interactively compared to the vanilla kernel and in-kernel cgrougs on my machine. Also it's really nice to have an interface to actually see what is going on. With the kernel patch you're totally in the dark about what is going on right now. Here are some numbers all recorded while running a make -j4 job in one shell. perf sched record sleep 30 perf trace -s /usr/libexec/perf-core/scripts/perl/wakeup-latency.pl : vanilla kernel without cgroups: total_wakeups: 44306 avg_wakeup_latency (ns): 36784 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 9378852 with in-kernel patch: total_wakeups: 43836 avg_wakeup_latency (ns): 67607 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 8983036 with Lennart's approach: total_wakeups: 51070 avg_wakeup_latency (ns): 29047 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 10008237 perf record -a -e sched:sched_switch -e sched:sched_wakeup sleep 10 perf trace -s /usr/libexec/perf-core/scripts/perl/wakeup-latency.pl : without cgroups: total_wakeups: 13195 avg_wakeup_latency (ns): 48484 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 8722497 with in-kernel approach: total_wakeups: 14106 avg_wakeup_latency (ns): 92532 min_wakeup_latency (ns): 20 max_wakeup_latency (ns): 5642393 Lennart's approach: total_wakeups: 22215 avg_wakeup_latency (ns): 24118 min_wakeup_latency (ns): 0 max_wakeup_latency (ns): 8001142 -- Markus --
systemd already creates a named cgroup for each user who logs in and each session inside it. That's implemented via pam_systemd, which is enabled in all distros doing systemd. We create those groups right now only in the named "systemd" hierarchy, but iiuc then simply doing the same in the "cpu" hierarchy would have the exact same behaviour as this patch, but actually is based on a sane definition of what a session is. Binding something like this to TTYs is just backwards. No graphical session has a TTY attached anymore. And there might be multiple TTYs used in the same session. I really wonder why logic like this should live in kernel space at all, since a) the kernel has no real notion of a session, except audit and b) this is policy and as soon as people have this kind of group then they probably want other kind of autogrouping as well for the other controllers, which hence means userspace is a better, and configurable place for this. Lennart -- Lennart Poettering - Red Hat, Inc. --
Using a group per tty makes sense for us console jockeys.. Anyway, nobody uses systemd yet and afaik not all distro's even plan on using it (I know I'm not waiting to learn yet another init variant). --
-- "They that give up essential liberty to obtain temporary safety, deserve neither liberty nor safety." (Ben Franklin) "The course of history shows that as a government grows, liberty decreases." (Thomas Jefferson) --
Well, then maybe you shouldn't claim this was relevant for anybody but Well, the way it looks right now we have convinced all big distros. Lennart -- Lennart Poettering - Red Hat, Inc. --
Crap, means I get to blame you for more than non working sound. I get really fed up having to re-discover how to fix broken init scripts on every upgrade :/ --
For what it's worth, I suspect that the object that should be bound to is probably not the tty, but rather the session ID of the process (which generally is 1:1 with controlling TTY for console processes.) -hpa --
Agreed. That'll catch both the tty case (implemented by the proposed patch), and the rest. Samuel --
This really does make a lot of sense. Tying on the Session ID rather than the TTY would allow to deal with graphical applications by letting them specify session IDs with setsid() when the application starts. It seems much more generic than TTY, and maps to TTY already. Thanks, Mathieu -- Mathieu Desnoyers Operating System Efficiency R&D Consultant EfficiOS Inc. http://www.efficios.com --
That systemd gizmo sounds like it'll eventually take care of new installations, though a tad costlier (for which you likely get more tweakability). -Mike --
I don't really understand what makes the exiting task different,
but OK.
However, I must admit I dislike this check. Because, looking at this
code, it is not clear why do we check PF_EXITING. It looks as if it
is needed for correctness.
Yes, sure, rq->lock should ensure signal->autogroup can't go away.
(even if it can be changed under us). And it does, we are moving all
Well, this looks a bit strange (but correct).
We are changing ->autogroup assuming the caller holds ->siglock.
But if we hold ->siglock we do not need rcu_read_lock() to iterate
over the thread_group, we can just do
p->signal->autogroup = autogroup_kref_get(ag);
t = p;
do {
sched_move_task(t);
} while_each_thread(p, t);
Again, this is minor, I won't insist.
Oleg.
--
Is _not_ needed I presume. I'll do it that way. I was pondering adding the option to move one or all as cgroups does, but don't think that will ever be needed. Thanks, -Mike --
Argh! I was wrong, it _is_ needed for correctness. Yes, it is always safe Exactly. And this means we can _only_ assume it can't go away if autogroup_move_group() can see us on ->thread_group list. Perhaps this deserve a commen (unless I missed something again). Mike, sorry for confusion. Oleg. --
Oh no, thank you. I hadn't figured it out yet, was going to go back and poke rt kernel with sharp sticks. (exit can be one scary beast) -Mike --
Mike, Mind sending a new patch with a separate v2 announcement in a new thread, once you have something i could apply to the scheduler tree (for a v2.6.38 merge)? You sent a couple of iterations in this discussion and i'd rather not fish out the wrong one. Thanks, Ingo --
Will do. (v3->936 became wider than expected;) -Mike --
Changes since last: - switch to per session vs tty - make autogroups visible in /proc/sched_debug - make autogroups visible in /proc/<pid>/autogroup - add nice level bandwidth tweakability to /proc/<pid>/autogroup Modulo "kill it" debate outcome... A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only per session autogroups, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, it's current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem. cat /proc/<pid>/autogroup displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith ...
Commit-ID: 5091faa449ee0b7d73bc296a93bca9540fc51d0a Gitweb: http://git.kernel.org/tip/5091faa449ee0b7d73bc296a93bca9540fc51d0a Author: Mike Galbraith <efault@gmx.de> AuthorDate: Tue, 30 Nov 2010 14:18:03 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Tue, 30 Nov 2010 16:03:35 +0100 sched: Add 'autogroup' scheduling feature: automated per session task groups A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem: cat /proc/<pid>/autogroup Displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via: echo [01] > ...
Sorry for the late reply, I am slowly crawling through my mbox...
I assume this is the latest version. In this case I think it needs
We can race with autogroup_move_group() and use the already freed
->autogroup. We need ->siglock or task_rq_lock() to read it.
IOW, I think we need something like the patch below, but - sorry -
if was completely untested.
Do we really want this if ag == autogroup_default ? Say, autogroup_create()
fails, now the owner of this process can affect init_task_group. Or admin
can change init_task_group "by accident" (although currently this is hardly
possible, sched_autogroup_detach() has no callers). Just curious.
Oleg.
--- a/kernel/sched_autogroup.c
+++ a/kernel/sched_autogroup.c
@@ -41,6 +41,19 @@ static inline struct autogroup *autogrou
return ag;
}
+static struct autogroup *task_get_autogroup(struct task_struct *p)
+{
+ struct autogroup *ag = NULL;
+ unsigned long flags;
+
+ if (lock_task_sighand(p, &flags)) {
+ ag = autogroup_kref_get(p->signal->autogroup);
+ unlock_task_sighand(p, &flags);
+ }
+
+ return ag;
+}
+
static inline struct autogroup *autogroup_create(void)
{
struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
@@ -149,11 +162,7 @@ EXPORT_SYMBOL(sched_autogroup_detach);
void sched_autogroup_fork(struct signal_struct *sig)
{
- struct task_struct *p = current;
-
- spin_lock_irq(&p->sighand->siglock);
- sig->autogroup = autogroup_kref_get(p->signal->autogroup);
- spin_unlock_irq(&p->sighand->siglock);
+ sig->autogroup = task_get_autogroup(current);
}
void sched_autogroup_exit(struct signal_struct *sig)
@@ -172,7 +181,6 @@ __setup("noautogroup", setup_autogroup);
#ifdef CONFIG_PROC_FS
-/* Called with siglock held. */
int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
{
static unsigned long next = INITIAL_JIFFIES;
@@ -193,9 +201,11 @@ int proc_sched_autogroup_set_nice(struct
if (!capable(CAP_SYS_ADMIN) && time_before(jiffies, next))
return ...I was going to lock it all up, but convinced myself it wasn't necessary.
I don't see how/why. I took a reference to the new group before
assignment of p->signal->autogroup, and put the previous group after
it's assigned.
Ponders that.. uhoh.
Mover does atomic write, but signal->autogroup write comes after that,
so can still be in flight when reader dereferences. Game over unless
the reader beats ->autogroup writer to the punch.
Thanks again for your excellent eyeballs. The below should plug that
sched_group_set_shares() does the right thing, says no to changing the
root task group's shares.
sched: fix potential access to freed memory
Oleg pointed out that the /proc interface kref_get() useage may race with
the final put during autogroup_move_group(). A signal->autogroup assignment
may be in flight when the /proc interface dereference, leaving them taking
a reference to an already dead group.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 57a7ac2..713b6c0 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -41,6 +41,12 @@ static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
return ag;
}
+static inline struct autogroup *autogroup_task_get(struct task_struct *p)
+{
+ smp_rmb();
+ return autogroup_kref_get(p->signal->autogroup);
+}
+
static inline struct autogroup *autogroup_create(void)
{
struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
@@ -119,6 +125,7 @@ autogroup_move_group(struct task_struct *p, struct autogroup *ag)
}
p->signal->autogroup = autogroup_kref_get(ag);
+ smp_mb();
t = p;
do {
@@ -172,7 +179,6 @@ __setup("noautogroup", setup_autogroup);
#ifdef CONFIG_PROC_FS
-/* Called with siglock held. */
int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
{
static unsigned long next = INITIAL_JIFFIES;
@@ -194,7 +200,7 @@ int ...I'd also have to disable interrupts though, so may as well just lock it.
I didn't do the -ESRCH or no display bit. As far as autogroup is
concerned, if you couldn't lock, it's history, so belongs to init.
sched: fix potential access to freed memory
Oleg pointed out that the /proc interface kref_get() useage may race with
the final put during autogroup_move_group(). A signal->autogroup assignment
may be in flight when the /proc interface dereference, leaving them taking
a reference to an already dead group.
Reported-by: Oleg Nesterov <oleg@redhat.com>
Signed-off-by: Mike Galbraith <efault@gmx.de>
diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c
index 57a7ac2..c80fedc 100644
--- a/kernel/sched_autogroup.c
+++ b/kernel/sched_autogroup.c
@@ -41,6 +41,20 @@ static inline struct autogroup *autogroup_kref_get(struct autogroup *ag)
return ag;
}
+static inline struct autogroup *autogroup_task_get(struct task_struct *p)
+{
+ struct autogroup *ag;
+ unsigned long flags;
+
+ if (!lock_task_sighand(p, &flags))
+ return autogroup_kref_get(&autogroup_default);
+
+ ag = autogroup_kref_get(p->signal->autogroup);
+ unlock_task_sighand(p, &flags);
+
+ return ag;
+}
+
static inline struct autogroup *autogroup_create(void)
{
struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL);
@@ -149,11 +163,7 @@ EXPORT_SYMBOL(sched_autogroup_detach);
void sched_autogroup_fork(struct signal_struct *sig)
{
- struct task_struct *p = current;
-
- spin_lock_irq(&p->sighand->siglock);
- sig->autogroup = autogroup_kref_get(p->signal->autogroup);
- spin_unlock_irq(&p->sighand->siglock);
+ sig->autogroup = autogroup_task_get(current);
}
void sched_autogroup_exit(struct signal_struct *sig)
@@ -172,7 +182,6 @@ __setup("noautogroup", setup_autogroup);
#ifdef CONFIG_PROC_FS
-/* Called with siglock held. */
int proc_sched_autogroup_set_nice(struct task_struct *p, int *nice)
{
static unsigned long next = INITIAL_JIFFIES;
@@ -194,7 +203,7 @@ int ...Agreed. I considered this option too, but I was worried about sched_group_set_shares(root). Aha, I see, thanks. I believe the patch is correct and closes the hole. Oleg. --
Commit-ID: 4f8219875a0dad2cfad9e93a3fafcd9626db98d2 Gitweb: http://git.kernel.org/tip/4f8219875a0dad2cfad9e93a3fafcd9626db98d2 Author: Mike Galbraith <efault@gmx.de> AuthorDate: Thu, 16 Dec 2010 15:09:52 +0100 Committer: Ingo Molnar <mingo@elte.hu> CommitDate: Tue, 4 Jan 2011 15:10:34 +0100 sched, autogroup: Fix potential access to freed memory Oleg pointed out that the /proc interface kref_get() useage may race with the final put during autogroup_move_group(). A signal->autogroup assignment may be in flight when the /proc interface dereference, leaving them taking a reference to an already dead group. Reported-by: Oleg Nesterov <oleg@redhat.com> Signed-off-by: Mike Galbraith <efault@gmx.de> Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> LKML-Reference: <1292508592.5940.28.camel@maggy.simson.net> Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/sched_autogroup.c | 25 +++++++++++++++++-------- 1 files changed, 17 insertions(+), 8 deletions(-) diff --git a/kernel/sched_autogroup.c b/kernel/sched_autogroup.c index 57a7ac2..c80fedc 100644 --- a/kernel/sched_autogroup.c +++ b/kernel/sched_autogroup.c @@ -41,6 +41,20 @@ static inline struct autogroup *autogroup_kref_get(struct autogroup *ag) return ag; } +static inline struct autogroup *autogroup_task_get(struct task_struct *p) +{ + struct autogroup *ag; + unsigned long flags; + + if (!lock_task_sighand(p, &flags)) + return autogroup_kref_get(&autogroup_default); + + ag = autogroup_kref_get(p->signal->autogroup); + unlock_task_sighand(p, &flags); + + return ag; +} + static inline struct autogroup *autogroup_create(void) { struct autogroup *ag = kzalloc(sizeof(*ag), GFP_KERNEL); @@ -149,11 +163,7 @@ EXPORT_SYMBOL(sched_autogroup_detach); void sched_autogroup_fork(struct signal_struct *sig) { - struct task_struct *p = current; - - spin_lock_irq(&p->sighand->siglock); - sig->autogroup = ...
On Tue, Nov 30, 2010 at 9:09 PM, tip-bot for Mike Galbraith The above change as well as the recent changes due to tg_shares_up improvements have two (undesirable ?) side effects on /proc/sched_debug output. The autogroup patchset removes the display of cgroup name from sched_debug output. On a 16 CPU system, with 2 groups having one task each and one task in root group, the difference in o/p appears like this: $ grep while1 sched_debug-2.6.37-rc5 R while1 2208 13610.855787 1960 120 13610.855787 19272.857661 0.000000 /2 R while1 2207 20255.605110 3160 120 20255.605110 31572.634065 0.000000 /1 R while1 2209 63913.721827 1273 120 63913.721827 12604.411880 0.000000 / $ grep while1 sched_debug-2.6.37-rc5-tip R while1 2173 17603.479529 2754 120 17603.479529 25818.279858 4.560010 R while1 2174 11435.667691 1669 120 11435.667691 16456.476663 0.000000 R while1 2175 10074.709060 1019 120 10074.709060 10075.915495 0.000000 So you can see in the latter case, it becomes difficult to see which task belongs to which group. The group names are also missing from per-CPU rq information. Hence in the above example of 2 groups, I see 2 blocks of data for 2 cfs_rqs, but it is not possible to know which group they represent. Also, with tg_shares_up improvements, the leaf cfs_rqs are maintained on rq->leaf_cfs_rq_list only if they carry any load. But the code to display cfs_rq information for sched_debug isn't updated and hence information from a few cfs_rqs are missing from sched_debug. $ grep cfs_rq ...
Hrmph.. that wasn't supposed to happen, care to send a patch to fix that Well, that's a _good_ thing, right? I mean, if we know they're empty, and don't contribute to schedule, why bother displaying them? --
There are two aspects here: - Printing cgroup name for per-CPU cfs_rqs shouldn't be affected by autogroup and the old code should work here. - Printing cgroup name for tasks depends on task_group(), which has been changed by autogroup patch. I haven't really looked deep into autogroup patch, but from whatever I can gather, Mike had a reason to remove this bit from sched_debug. The task groups created for autogroups don't have cgroups associated with them and hence no dentries and hence no pathnames. In addition to tasks, we do display other details pertaining to the cfs_rq. I thought, having a complete view of all the cfs_rqs in the system would be better and consistent than obtaining different cfs_rqs on different captures of /proc/sched_debug. Regards, Bharata. -- http://bharata.sulekha.com/blog/posts.htm, http://raobharata.wordpress.com/
I missed adding the similar bits to RT in sched_dubug.c. If this approach is reasonable, I can send the next one with RT changes included. Regards, Bharata. --
No that code needs a serious cleanup, I think you even (re-)introduced a NULL pointer deref (but I didn't look too closely). Simply re-instating the code that was removed isn't sufficient. Just write your patch as if its a new feature (and never even look at the old code), you'll very probably end up with a much saner patch. --
+#ifdef CONFIG_CGROUP_SCHED
+ {
+ char path[64];
+
...
+#if defined(CONFIG_CGROUP_SCHED) && defined(CONFIG_FAIR_GROUP_SCHED)
+ char path[128];
One reason it was removed was path[64/128].
-Mike
--
Other callers of cgroup_path use PATH_MAX=4096 here. I believe the original reason for these short path sizes was to be light on stack and as well as to avoid allocation. Can we have some reasonable length (256 or 512 ?) and live with truncation (if that ever happens) ? Also, while displaying group name with tasks, does it make sense to display autogroup-<id> (the one shown in /proc/<pid>/autogroup) ? Regards, Bharata. --
I did to me obviously :) I'm fine with it not showing up, though if it survives and evolves, maybe it'll want visibility. ATM, you know what's in what group via ps -o foo,session. -Mike --
Resurrecting this thread a bit, one question I didn't see discussed is simply: Why doesn't "nice" work for this? On my Fedora 14 system, "ps alxf" shows almost everything in my session is running at the default nice 0. The only exceptions are "/usr/libexec/tracker-miner-fs" at 19, and pulseaudio at -11. I don't know What would happen if say the scheduler effectively group-scheduled each nice value? Then, what we tell people to do is run "nice make". Which in fact, has been documented as a thing to do for decades. Actually I tend to use "ionice" too, which is also useful if any of your desktop applications happen to make the mistake of doing I/O in the mainloop (emacs fsync()ing in UI thread, I'm looking at you). Quickly testing kernel-2.6.35.6-48.fc14.x86_64 on a "Intel(R) Core(TM)2 Quad CPU Q9400 @ 2.66GHz", the difference between "make -j 128" and "nice make -j 128" is quite noticeable. As you'd expect. The CFS docs already say: "The CFS scheduler has a much stronger handling of nice levels and SCHED_BATCH than the previous vanilla scheduler: both types of workloads are isolated much more aggressively" Does it just need to be even more aggressive, and people use "nice"? --
"nice" doesn't work. It never has. Nobody ever uses it, and that has always been true. As you note, you can find occasional cases of it being used, but they are either for things that are _so_ unimportant (and know they are) and annoying cpu hogs that they wouldn't be allowed to live unless they were niced down maximally (your tracker-miner example), or they use nice not because they really want to, but because it is an approximation for what they really do want (ie pulseaudio wants low latencies, and is set up by the distro, so you'll find it niced up). But the fundamental issue is that 'nice' is broken. It's very much broken at a conceptual and technical design angle (absolute priority levels, no fairness), but it's broken also from a psychological and practical angle (ie expecting people to manually do extra work is Why would you want to do that? If you are willing to do group scheduling, do it on something sane and meaningful, and something that doesn't need user interaction or decisions. And do it on something that has more than 20 levels. Nobody but morons ever "documented" that. Sure, you can find people saying it, but you won't be finding people actually _doing_ it. Look around. Seriously. Nobody _ever_ does "nice make", unless they are seriously repressed beta-males (eg MIS people who get shouted at when they do system maintenance unless they hide in dark corners and don't get discovered). It just doesn't happen. But more fundamentally, it's still the wrong thing to do. What nice level should you use? And btw, it's not just "make". One of the things that originally caused me to want something like this is that you can enable some pretty aggressive threading with "git diff". If you use the "core.preloadindex" setting, git will fire up 20 threads just to do "lstat()" system calls as quickly as it humanly can. Or "git grep" will happily use lots of threads and really mess with your system, except it limits the threads to a smallish number just to not ...
On Sat, Dec 4, 2010 at 1:33 PM, Linus Torvalds I don't see it as ridiculous - for the simple reason that it really In this case, the "user interaction" component is pretty damn small. Look around...where? On what basis are you making that claim? I did a quick web search for "unix background process", and this tutorial (in the first page of Google search results) aimed at grad students who use Unix at college definitely describes "nice make": http://acs.ucsd.edu/info/jobctrl.shtml There are some that don't, like: http://linux.about.com/od/itl_guide/a/gdeitl35t01.htm and http://www.albany.edu/its/quickstarts/qs-common_unix.html But then again here's a Berkeley "Unix Tutorial" that does cover it: http://people.ischool.berkeley.edu/~kevin/unix-tutorial/section13.html So, does your random Linux-using college student or professional developer know about "nice"? My guess is "likely". Do they use it for "make"? No data. The issue is that you really only have a bad experience on *large* projects. But if we just said to people who come to us "Hey, when I compile webkit/linux/mozilla my system slows down" we can tell them "use nice", especially since it's already documented on the web, that seems to me like a pretty damn good Heh. Well, I do at least (or rather, my personal automagic build wrapper script does (it detects Makefile/autotools etc. and tries to Doesn't matter - if they all got group-scheduled together, then the Hmm...how many threads are we talking about here? If it's just say one per core, then I doubt it needs nicing. The reason people nice make is because the whole thing alternates between being CPU bound and I/O bound, so you need to start more jobs than cores (sometimes a lot Sure...though I imagine for "most" people that's totally I/O bound Well, the text of Documentation/scheduler/sched-design-CFS.txt certainly seems to be claiming it was a big improvement in this kind of situation from the previous scheduler. If we're finding out ...
What part of "nobody does that" didn't you understand? I know about "nice". I think it was part of the original Unix course I ever took, and it probably made more sense back then (sixteen people at a time on a microvax, compiling stuff). And I never used it, afaik. Nor does really anybody else. But hey, whatever floats your boat. You can use it. And you can feel special and better than the rest of us exactly because you know you I think git defaults to a maximum of 20 for it. Remember: it's not about "cores". It's about IO, and then 20 is a "let's not mess up everybody else _too_ much when we're actually CPU-bound". But that's not the point. The point is that "nice" is totally the wrong thing to do. It's _always_ the wrong thing to do. The only reason it's in tutorials and taught in intro Unix classes is that it's the only thing there is in traditional unix. And we can be better. We don't need to be stupid and traditional. Sure. And "most" people do something totally different. What's your point? The fact is, the session-based group scheduling really does work. It works on a lot of different loads. It's nice for things like my use, but it's _also_ nice for things like me ssh'ing into my kids or wife's computers to update their kernel. And it's nice for things like "make -j test" for git etc. And it doesn't hurt you. If you're happy with "nice", go on and use it. Why are you even discussing it? I'm telling you the FACT that others aren't happy with nice, and that smart people consider nice to be totally useless. "Mommy mommy, it hurts when I stick forks in my eyes!" What's your point again? It's a heuristic. It works great for the cases many normal people have. If you have a graphical desktop, most sane people would tend to start the browser from that nice big browser icon. But again, if you want to stick forks in your eyes, go right ahead. It's not _my_ problem. And similarly, it's not _your_ problem if other people want to do saner things, is it? ...
On Sat, Dec 4, 2010 at 5:39 PM, Linus Torvalds Because it seems to me like a bug if it isn't as good as group scheduling? Most of your message is saying it's worthless, and I don't disagree that it's not very good *right now*. I guess where we So if it's a heuristic the OS can get wrong, wouldn't it be a good idea to support a way for programs and/or interactive users to explicitly specify things? Unfortunately the cgroups utilities don't make this easy (and of course there's the issue that no major released OS exports write permission to the cpu cgroup for a desktop session uid). I guess "nice" could be patched to, if the user has permission to the cgroups, to auto-create a group. Or...nice could be fixed. On a more productive note, I see now Documentation/scheduler/sched-nice-design.txt has a lot of really useful history regarding "nice" and the complaints over time (I guess this is where some of your assertions that it's failed/worthless comes from). --
It's not worth 'fixing", because it works exactly like it's designed -
and supposed - to work.
There really isn't anything to fix. 'nice' is what it is. It's a
simple legacy interface to scheduler priority. The fact that it's also
almost totally useless is irrelevant. It's like male nipples. We
wouldn't be better off lactating, and they look like some odd wart
that doesn't do much good. But it would be worse to remove it.
'nice' is a bad idea. It's a bad idea that has perfectly
understandable historical reasons for it, but it's an _unfixably_ bad
idea.
Linus
--
Consider a multi-user machine. `nice` is an orthogonal concern in that case. Therefore, fixing nice doesn't address all issues. Also: Most linux systems are multi-user (root and the physical tty user.) Further, even a single user wears multiple hats on a single system. The idea is to infer those hats, and deal with them fairly. No one is taking nice away from you. Keep using it if you like. If you want to allow users to explicitly specify group scheduling, then good news: we already have that feature. You just seem to not be using it. Much like the other 99.993% of us. The kernel is supposed to have *sane defaults*. That's what is under discussion here. ~r. --
For the purposes of this discussion again, let's say "fixing nice" means say "group schedule each nice level above 0". There are obviously many possibilities here, but let's consider this one precisely. How, exactly, under what scenario in a "multi-user machine" does this break? How exactly is it orthogonal? Two people logged in would get their "make" jobs group scheduled together. What is the problem? Since Linus appears to be more interested in talking about nipples than explaining exactly what it would break, but you appear to agree with him, hopefully you'll be able to explain... --
THAT IS NOT HOW 'nice' WORKS! For chissake, how hard is it to understand? The semantics of "nice" are not - and have never been - to put things into process scheduling groups of their own. When somebody says "nice xyzzy", they are explicitly stating that "xyzzy" isn't as important as other processes. It's done for stuff that you don't care about, and more specifically, for stuff that you really don't want to impact anything else. So if there are other things to be run, 'nice' means that those should get more CPU time. (Obviously, negative nice levels work the other way around). This is very much documented. People rely on it. Look at the man-page. The problem is that you don't know what the hell you are talking about. Different nice levels shouldn't get group scheduled together - they should be scheduled *less*. And it's not about "make", since nobody really ever uses nice on make anyway, it's about things like pulseaudio (that wants higher priorities) and random background filesystem indexers etc (that want lower priorities). Nice levels are _not_ about group scheduling. They're about priorities. And since the cgroup code doesn't even support priority levels for the groups, it's a really *horrible* match. And the thing is, the nice semantics are traditional. They are also *horrible*, but that doesn't allow you to change their semantics. People rely on those crazy traditional and mostly useless semantics. Not very much (because they are mostly useless), but there really are people who use it. And they use it knowing that positive nice levels means that something is less important. In contrast, giving processes a scheduling group doesn't imply "less important". Not AT ALL. It doesn't really mean "more important" either, it just means "somewhat insulated from other groups". So let's say that you have a filesystem indexer, and you nice it up to make sure that it doesn't steal CPU bandwidth from your "real work". Now, let's say that you start a "make -16" ...
On Sun, Dec 5, 2010 at 3:47 PM, Linus Torvalds Again, I obviously understand that - the point is to explore the space of changes here and consider what would (and wouldn't) break. And Well, we established my Fedora 14 system doesn't. You said "no one" uses "nice" interactively. So...that leaves - who? If you were saying to me something like "I know Yahoo has some code in their data centers which uses a range of nice values; if we made this change, all of a sudden they'd get more CPU contention..." Or like, "I'm pretty sure Maemo uses very low nice values for some UI code". But you so far haven't really done that, it's just been (mostly) assertions/handwaving. Now you obviously have a lot more experience that gives those assertions and handwaving a lot of credibility - but all we need is one concrete example to shut me up =) Playing around with Google code search a bit, hits for "nice" were almost all duplicates of various C library headers/implementations. "setpriority" was a bit more interesting, it appears Chromium has some code to bump up the nice value by 5 for "background" processes: http://google.com/codesearch/p?hl=en#OAMlx_jo-ck/src/base/process_linux.cc&q=setpr... But all my Chrome related processes here are 0, so who knows what that's used for. There are also hits for chromium's copy of embedded cygwin+perl...terrifying. I assume (hope, desperately) that Cygwin+Perl is just used for building... Another hit here in some random X screensaver code: http://google.com/codesearch/p?hl=en#tJJawb1IJ20/driver/exec.c&q=setpriority%20fil... But I can't find a place where it's setting a non-zero value for that. So...ah, here's one in Android's "development" git: http://google.com/codesearch/p?hl=en#CRBM04-7BoA/simulator/wrapsim/Init.c&q=setpri... Except it appears to be unused =/ Oh! Here we go, one in the Android UI ...
[...] I'll give you two re-world examples from two (closed source, but still) apps we develop at my current employer. The first one is a server/network monitoring app where there are lots of child processes devoted to performing checks, storing data, displaying results etc. Most of these processes just run at the default nice level. One of the processes sometimes has a need for a cryptographic key pair and it can generate this when it needs it, but it's better if one is reaily available, so we have a seperate child process running that maintains a small pool of new key pairs - this process runs at a high nice level since it should not take CPU time away from the rest of the processes (it's not important, it's just a small optimization), the need for key pairs comes at large intervals, so the pool will almost never be depleted even if this process doesn't get very much CPU time for a long time and besides, if the pool ever gets depleted its no disaster since the consumer will then just generate a key pair when needed and burn the required CPU. The second is a backup aplication where one child process is in charge of doing background disk scanning, compression and encryption. This process is not interactive, it must result in minimal interference with whatever the user is currently using the machine for as his primary task and time-to-completion is not really that important. So, this process runs at a rather high nice level to avoid stealing CPU from the users primary task(s). -- Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/ Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please. --
Ohh and a third example. On my home laptop I got sufficiently annoyed with 'updatedb' starting up from cron while I was in the middle of something so that cron job now runs updatedb with 'nice 19' and also uses ionice so it runs at the 'best effort' class and with priority 7 (lowest). -- Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/ Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please. --
It does in fact, nice maps to a weight, we then schedule so that each
entity (be it task or group) gets a proportional amount of time relative
to the other entities (of the same parent).
The scheduler basically solves the following differential equation:
dt_i = w_i * dt / \Sum_j w_j
For tasks we map nice to weight like:
static const int prio_to_weight[40] = {
/* -20 */ 88761, 71755, 56483, 46273, 36291,
/* -15 */ 29154, 23254, 18705, 14949, 11916,
/* -10 */ 9548, 7620, 6100, 4904, 3906,
/* -5 */ 3121, 2501, 1991, 1586, 1277,
/* 0 */ 1024, 820, 655, 526, 423,
/* 5 */ 335, 272, 215, 172, 137,
/* 10 */ 110, 87, 70, 56, 45,
/* 15 */ 36, 29, 23, 18, 15,
};
For groups we expose the weight directly in cgroupfs://cpu.shares with a
default equivalent to nice-0 (1024).
So 'nice make -j9' will run make and all its children with weight=110,
if this task hierarchy has ~9 runnable tasks it will get about as much
time as a single nice-0 competing task.
[ 9*110 = 990, 1*1024 = 1024, which gives: 49% vs 51% ]
Now group scheduling is in fact closely related to nice, the only thing
group scheduling does is:
w_i = \unit * \Prod_j { w_i,j / \Sum_k w_k,j }, where:
j \elem i and its parents
k \elem entities of group j (where a task is a trivial group)
Where we compute a task's effective weight (w_i) by multiplying it with
the effective weight of their ancestors.
Suppose a grouped make -j9 against 1 competing task (all nice-0 or
equivalent), and make's 9 active children [a..i] in the group G:
R
/ \
t G
/ \
a...i
So w_t = 1024, w_G = 1024 and w_[a..i] = 1024.
Now, per the above the effective weight (weight as in the root group) of
each grouped task is:
w_[a..i] = 1024 * ...Greets. I applaud your efforts to continue addressing interactivity and responsiveness but, I know I'm going to regret this, I feel strongly enough to speak up about this change. This is precisely what I see as the flaw in this approach. The whole reason you have CFS now is that we had a scheduler which was pretty good for all the other things in the O(1) scheduler, but needed heuristics to get interactivity right. I put them there. Then I spent the next few years trying to find a way to get rid of them. The reason is precisely what Colin says above. Heuristics get it wrong sometimes. So no matter how smart you think your heuristics are, it is impossible to get it right 100% of the time. If the heuristics make it better 99% of the time, and introduce disastrous corner cases, regressions and exploits 1% of the time, that's unforgivable. That's precisely what we had with the old O(1) scheduler and that's what you got rid of when you put CFS into mainline. The whole reason CFS was better was it was mostly fair and concentrated on ensuring decent latency rather than trying to guess what would be right, so it was predictable and reliable. So if you introduce heuristics once again into the scheduler to try and improve the desktop by unfairly distributing CPU, you will go back to where you once were. Mostly better but sometimes really badly wrong. No matter how smart you think you can be with heuristics they cannot be right all the time. And there are regressions with these tty followed by per session group patches. Search forums where desktop users go and you'll see that people are afraid to speak up on lkml but some users are having mplayer and amarok skipping under light load when trying them. You want to program more intelligence in to work around these regressions, you'll just get yourself deeper and deeper into the same quagmire. The 'quick fix' you seek now is not something you should be defending so vehemently. The "I have a solution now" just ...
Actually, Linus laid the foundation with sleeper fairness, Ingo expanded it to requeue "interactive" tasks in the active array, and you tweaked And it still is Con. I didn't rewrite the thing, I just added an automated task grouping. Session to session fairness is just as holy as An automated per session task group is an evil heuristic, so we should use kinda sorta sensitive, really REALLY sensitive, don't give a damn, or no frickin' clue... to make 100% accurate non-heuristic scheduling decisions instead? Did I get that right? Goodbye. -Mike --
I think you are misunderstanding Mike's auto-group scheduling feature. The scheduling itself is not 'heuristics'. It is the _composition of a group_ that has a heuristic default. (We use the 'tty' to act as the grouping) But that can be changed: the cgroup interfaces can be (and are) used by Gnome to create different groups. They can be used by users as well, using cgroup tooling. This is not some kernel heuristic that cannot be modified - which was the main problem of the O(1) scheduler. This is a common-sense default that can be overriden by user-space if it wants to. So i definitely think you are confusing the two cases. Thanks, Ingo --
as someone who starts firefox from a terminal session all the time, I always want to start it from it's own dedicated session, if for no other reason that it spits out a TON of error messages over time, and I don't want them popping up in a window where I'm doing something else. so this is a very bad example. David Lang --
Btw, most people don't do that anymore. They don't use terminals. They click the application icons on their desktops and start menus or double click the executables in their file managers. So it's not a matter of opening a second terminal tab, because the first one isn't even open. To have a fluid desktop one shouldn't require to hack with terminal commands. --
Its a regression for those who do - and often have good reason to do. This is of course why you don't put policy in the kernel and the original Which is the classic mentality that ruins the big bloated GNOME Linux desktop "It works my way and every other way is wrong so go screw". Of course the other half of the problem is exactly that - firefox was once a small browser, its now a bloated monster too so the scheduler is quite sensible to pick on it. Alan --
On Sun, 5 Dec 2010 15:12:20 +0000 Your rant about big bloated GNOME is... well just a rant. You will never be able to change it. You can just hope, that over time the evolutionary aspects of open source development will fix it. There is nothing wrong with trying to provide ease of use. Graphical interfaces that are well designed are easier to use. Most command-line people just can't cope with the unstable nature of interfaces in the graphical world. CLI's are mostly better from an ergonomic view (old people, heavy working hackers and other power users) because they provide a stable focus point. But this comes at some cost because the human mind is (originally) not tuned for text processing and remembering abstract things like 'words'. It's unique ability to adapt itself to this is.. extraordinary. Most hackers probably don't realize this, but images/icons and other graphical interfaces are more similar to real life and are thus easier to use for 'unadapted' people. "Master minds" that can remember numbers with [really-big-number] of digits often use a trick to do this: They associate every digit-pair with an everyday item. When they learn the number, they construct a story using those items. This story is what they then later use to restore the original number. All that is because humans can remember real life things better than digits or words. Regards, Flo --
Ah of course. How sweet, your response to a point about the arrogance of certain desktop attitudes is to lecture, and make bogus pronouncements And a bit of pop psychology to go with it. I assume you are trying to No of course not we are all dim, thank you for using small words. I am actually familiar with the real models here btw and there are a couple of rather important basics you are missing - Different people have different strengths in different areas - these don't specifically line up with the senses. There isn't vast amounts of evidence to support computing people are all strong in a particular area either. You'll see in studies that some of them prefer to diagram and flowchart, some write text, some have kinesthetic models (movement and flow). I don't doubt there are others who sense it in different ways. - Visual and textual data communicate different things more effectively (as do sounds, smells, movements, ....) and are processed with different natural preferences by different people - Oh and there is no evidence I've ever seen to suggest old people are more text oriented. So any notion of CLI's or GUI's being better for [class of people] is generally naïve. It depends what is being done, who is doing it and the situation. If you really want to understand the trade-offs in a graphical world watch some good CAD operators for an hour. They'll use the same tools in very different ways - some very command line, some heavily mouse/trackball, others graphical but with hotkeys, and they'll often If you wish to see an unadapted person you probably want to go look at amazonian tribes. Bit different. But then the GUI world is the world that put "logout" under "start" 8) and thinks that "Insert OLE Object" is a good thing to put on a menu. Alan --
On Sun, 5 Dec 2010 19:48:11 +0000 Didn't want to imply that. But I think I maybe assumed wrongly that your usage of 'bloat' and 'oversize' was an overstatement and you were You seem to be angry. What's the problem? I find your reaction strange. Isn't that a valid point? I really have problems with graphical programs and tools where you have to click on the left side of the window on some 'X' Icon in Ubuntu and right top corner on fluxbox. and on the third desktop setup ratpoison is running with some super special key combos and all you can do is strg-backspace, except it's disabled. With CLI programs I don't have those problems. I just strg-c most of I find your tone inappropriate. I was not trying to insult you or Why do you think I did imply something other? I was talking about humans in general. Language is an abstract concept. It's not something we intuitivly know when we are born, we have to learn it. I was getting at the fact that the brain adapts towards it's usage. People doing lot's of sport do have less trouble learning a new kind of sport. People that do hack a lot on computers have less trouble learning to use a new tool. A physics professor has less trouble understanding a new theory about the beginning of the world. People working a lot with programming languages and on the commandline are more used to it. Other people find interfaces that resemble real world items easier to use, because they aren't used to inputting Indeed. I didn't try to classify people at all. I'm kind of sad you Regards, Flo --
What is a very clear regression is a threaded app (say firefox) vs a single threaded app, particularly on UP. The per thread scheduling model wins hands down there, because the scheduler very heavily favors the threaded application. Take that unfairness away, and you have an undeniable regression. Yes, it's not black and white, never is. -Mike --
P.S. You also have an obvious _progression_ from the perspective of the single threaded application, which may just as well be interactive. --
P.P.S :) systemd will have the same regressions/progressions. It doesn't matter one whit whether it's kernel/userland making policy. If distro-X includes either one, or neither, they are guaranteed to be wrong :) --
The fact that something is documented doesn't mean the documentation actually is correct. There exists a Linux guide written by somebody (who has enough of a rep that you can safely say "should have known better") who didn't understand the difference between traditional Unix and Linux, nor what the original concept was, and it documented the proper way to take a system down quickly as: # sync;sync;sync;halt Of course, the *original* was: # sync # sync # sync # halt And the whole point of 3 syncs was that the typing time of the second and third sync's chewed up the time till the first sync finished. Of course, sync;sync doesn't start the first sync and then make you type. And it overlooked that the Linux sync is a lot more synchronous than the ATT Unix sync, which returned as soon as the I/O was scheduled, not completed.
PREEMPT_RT has a slightly different exit path IIRC. If that was the only thing you saw it explode on we could leave the check out for now and revisit it in the -rt patches when and if it pops up. Hmm? --
Yeah, I'm going to whack it. (and add your lockdep thingy) -Mike --
Hi Mike, This is a bit of a problem, as it's called in_atomic context and kmalloc's under GFP_KERNEL (which can sleep.) This results in sleep-under-spinlock prints when CONFIG_DEBUG_SPINLOCK_SLEEP=y. I spent a bit of time thinking about how to fix that, but it's a bit difficult because of the nested spin_lock_irq in that bit of the tty_ioctl callchain. I'll think about it some more tonight and follow-up if I can think of a way to fix it. regards, Kyle --
Blame me, I threw that out as a single point where this can be done.
In fact, holding the signal spinlock was seen as a bonus, since that
was used to serialize the access to the signal->autogroup access.
Which I think is required.
But yes, it does create problems for the allocation. It could be done
as just a GFP_ATOMIC, of course, and on allocation failure you'd just
punt and not do it. Not pretty, but functional.
Linus
--
Yeah, I didn't look any deeper than kernel/sched.c::sched_create_group, but that would need to GFP_ATOMIC as well. Looking at it now, so would alloc_rt_sched_group/alloc_fair_sched_group, and we're looking at an awful lot of sleepless allocations. Not sure that's a feasible plan. --Kyle --
Yeah, I got another report about that today. -Mike --
If the on/off is available, you can silently strand pinned tasks forever in an autogroup. So, I tied the on/off switch to the global move so the user won't be surprised. -Mike --
So the set of all tasks that never call proc_set_tty() ends up in the same one big default group, correct? Do we have any provisions for making sure that if a user has 8 or 10 windows open doing heavy work, the default group (with a lot of important daemons/etc in it) doesn't get starved with only a 1/10th share of sched_autogroup_detach() instead?
Well, yes and no.
Yes, that's what the code currently does. But I did ask Mike (and he
delivered) to try to make the code look and work in a way where the
whole "tty thing" is just one of the heuristics.
It's not clear exactly what the non-tty heuristics would be, but I do
have a few suggestions:
- I think it might be a good idea to associate a task group with the
current "cred" of a process, and fall back on it in the absense of a
tty-provided one.
Now, for desktop use, that probably doesn't often matter, but even
for just the desktop it would mean that "system daemons" would at
least get a group of their own, rather than be grouped with
"everything else"
(As is, I think the autogroup thing already ends up protecting
system daemons more than _not_ having the autogroup, since it will
basically mean that a "make -j64" won't be stealing all the CPU time
from everybody else - even if the system daemons all end up in that
single "default" group together with the non-tty graphical apps.)
I think you're missing the fact that without the autogroups, it's
_worse_. If you do "make -j64" without autogroups, those important
daemons get starved by all that work. With the autogroups, they end up
being protected by being in the default group.
So the fact that they are in the default group with lots of other
users doesn't hurt, quite the reverse. User apps are likely to be in
their own groups, so they affect system apps less than they do now.
But I do agree that we might be able to make that all much more explicit.
Linus
--
Or how about (just brainstorming here) a group per 'process group'? -- Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/ Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please. --
I switched to per session, which on my system at least looks like more than enough granularity. -Mike --
Sounds sane and it's closer to the original 'per tty' which was so successful. -- Jesper Juhl <jj@chaosbits.net> http://www.chaosbits.net/ Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html Plain text mails only, please. --
Will that have an effect on software like Chromium which creates a fork for each tab? If a user opens Thunderbird and Chromium with 100 tabs, Thunderbird should probably get 50% CPU time instead of just 1%... --
Chromium doesn't create a session or even a process group for each tab. Samuel --
At least on my machine, all of the chromium processes have the same session -- Thomas Fjellstrom thomas@fjellstrom.ca --
That'd be too heavy by Linus' measurement, as you have a process group for each shell command. Samuel --
Yes, all tasks never having had a tty association are relegated to the root task group, and no, there is no provision for the root task group getting more than it's fair share of CPU. The patch is only intended to (hopefully) better suit the general case Hm, why? -Mike -Mike --
You really aren't a good speller of that word, are you?
Linus
--
<stare> d e t a c h... d e t a t....t? aw crap. Guess not. Couldn't even see it. -Mike --
Hmm. Just found a bug. I'm not sure if it's the autogroup patches
themselves, or whether it's just the cgroup code that the autogroup
patch enables for me.
When I do
echo t > /proc/sysrq-trigger
(or "w") I get a NULL pointer dereference (offset 0x38 - decimal 56)
in "cgroup_path+0x7", with a call trace of sched_debug_show,
show_state_filter, sysrq_handle_showstate_blocked. I don't have the
whole oops, because the machine is really dead at that point
(presumably died holding the runqueue lock or some other critical
resource), but if required I could take a photo of it. However, I bet
it is repeatable, so I doubt you need it.
Anyway, that "cgroup_path+0x7" is the very first memory dereference:
movq 56(%rdi), %rsi # cgrp_5(D)->dentry, _________p1
so sched_debug_show() is apparently calling cgroup_path() with a NULL
cgroup. I think it's "print_task()" that is to blame, it does
cgroup_path(task_group(p)->css.cgroup, ..
without checking whether there _is_ any css.cgroup.
Peter, that looks like your code (commit d19ca30874f2)
Guys?
Linus
--
On Mon, Nov 15, 2010 at 5:18 PM, Linus Torvalds Right - previously the returned task_group would be always associated with a cgroup. Now, it may not be. The original task_group() should be made accessible for anything that wants a real cgroup in the scheduler hierarchy, and called from the new task_group() function. Not sure what the best naming convention would be, maybe task_group() and effective_task_group() ? Paul --
effective_task_group() works for me. Autogroup (currently at least)
only needs to interface with set_task_rq().
A tasty alternative would be to have autogroup be it's own subsystem,
with full cgroup userspace visibility/tweakability. I have no idea if
that's feasible though. I glanced at cgroup.c, but didn't see the dirt
simple I wanted, and quickly ran away.
---
kernel/sched.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -606,27 +606,33 @@ static inline int cpu_of(struct rq *rq)
*/
static inline struct task_group *task_group(struct task_struct *p)
{
- struct task_group *tg;
struct cgroup_subsys_state *css;
css = task_subsys_state_check(p, cpu_cgroup_subsys_id,
lockdep_is_held(&task_rq(p)->lock));
- tg = container_of(css, struct task_group, css);
+ return container_of(css, struct task_group, css);
+}
- return autogroup_task_group(p, tg);
+static inline struct task_group *effective_task_group(struct task_struct *p)
+{
+ return autogroup_task_group(p, task_group(p));
}
/* Change a task's cfs_rq and parent entity if it moves across CPUs/groups */
static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
{
+#if (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+ struct task_group *tg = effective_task_group(p);
+#endif
+
#ifdef CONFIG_FAIR_GROUP_SCHED
- p->se.cfs_rq = task_group(p)->cfs_rq[cpu];
- p->se.parent = task_group(p)->se[cpu];
+ p->se.cfs_rq = tg->cfs_rq[cpu];
+ p->se.parent = tg->se[cpu];
#endif
#ifdef CONFIG_RT_GROUP_SCHED
- p->rt.rt_rq = task_group(p)->rt_rq[cpu];
- p->rt.parent = task_group(p)->rt_se[cpu];
+ p->rt.rt_rq = tg->rt_rq[cpu];
+ p->rt.parent = tg->rt_se[cpu];
#endif
}
--
What exactly do you envisage by that? Having autogroup (in its current incarnation) be a subsystem wouldn't really make sense - there's already a cgroup subsystem for partitioning CPU scheduler groups. If autogroups were integrated with cgroups I think that it would be as a way of automatically creating (and destroying?) groups based on tty connectedness. We tried something like this with the ns subsystem, which would create/enter a new cgroup whenever a new namespace was created; in the end it turned out to be more of a nuisance than anything else. People have proposed all sorts of in-kernel approaches for auto-creation of cgroups based on things like userid, process name, now tty, etc. The previous effort for kernel process grouping (CKRM) started off with a complex in-kernel rules engine that was ultimately dropped and moved to userspace. My feeling is that userspace is a better place for this - as Lennart pointed out, you can get a similar effect with a few lines tweaking in a bash login script or a pam module that's much more configurable from userspace and keeps all the existing cgroup stats available. Paul --
Right, that doesn't solve the full problem though.
/proc/sched_debug should show these automagic task_groups, its just that
there's currently no way to properly name them, we can of course add
something like a name field to the struct autogroup thing, but what do
we fill it with? "autogroup-%d" and keep a sequence number for each
autogroup?
Then the below task_group_path() thing can try the autogroup name scheme
if it finds a NULL css.
Something like the below might avoid the explosion:
---
kernel/sched_debug.c | 28 ++++++++++++++--------------
1 files changed, 14 insertions(+), 14 deletions(-)
diff --git a/kernel/sched_debug.c b/kernel/sched_debug.c
index 2e1b0d1..9b5560f 100644
--- a/kernel/sched_debug.c
+++ b/kernel/sched_debug.c
@@ -87,6 +87,19 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu,
}
#endif
+#if defined(CONFIG_CGROUP_SCHED) && \
+ (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
+static void task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+ /* may be NULL if the underlying cgroup isn't fully-created yet */
+ if (!tg->css.cgroup) {
+ buf[0] = '\0';
+ return;
+ }
+ cgroup_path(tg->css.cgroup, buf, buflen);
+}
+#endif
+
static void
print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
{
@@ -115,7 +128,7 @@ print_task(struct seq_file *m, struct rq *rq, struct task_struct *p)
char path[64];
rcu_read_lock();
- cgroup_path(task_group(p)->css.cgroup, path, sizeof(path));
+ task_group_path(task_group(p), path, sizeof(path));
rcu_read_unlock();
SEQ_printf(m, " %s", path);
}
@@ -147,19 +160,6 @@ static void print_rq(struct seq_file *m, struct rq *rq, int rq_cpu)
read_unlock_irqrestore(&tasklist_lock, flags);
}
-#if defined(CONFIG_CGROUP_SCHED) && \
- (defined(CONFIG_FAIR_GROUP_SCHED) || defined(CONFIG_RT_GROUP_SCHED))
-static void task_group_path(struct task_group *tg, char *buf, int buflen)
-{
- /* may be NULL if the underlying cgroup ...I was considering exactly that for /proc/N/cgroup visibility. Might get --
My assumption is that no additional locking is needed. The tty is refcounted, dropped in release_task()->__exit_signal(), at which point the task is unhashed, is history. The tty can't go away until the last task referencing it goes away. -Mike --
I've tested your patch and it runs smoothly on my machine. However I had several NULL pointer dereference BUGs that happened when I left X or rebooted my system. I think this is caused by your patch. There is nothing in the logs unfortunately, but I scribbled down the following by hand (not the whole trace, I'm too lazy): BUG: unable to handle NULL pointer dereference at 0..038 IP: pick_next_task_fair 0xa7/0x1a0 ... Call Trace: schedule ... --
Hm. Not much I can do without the trace, but thanks for testing and
reporting anyway, guess I need to do some heavy stress testing. I'm
re-writing it as I write this anyway.
thanks,
-Mike
--
