Hello Mike, I tested it a bit, and autosched-v4 crashes on bootup with with attached config. Note: the box has serial logging enabled and there's UART code in the stacktrace - maybe it's related. Let me know if you need the full bootup log. Thanks, Ingo [FAILED] Enabling local filesystem quotas: [ OK ] PPS event at 4294886381 Enabling /etc/fstab swaps: swapon: /dev/hda2: Function not implemented [FAILED] INIT: Entering runleveBUG: unable to handle kernel paging request at f548604c IP:l: 3 [<c10307f0>] update_cfs_shares+0x60/0x160 *pdpt = 0000000002017001 *pde = 00000000029d4067 *pte = 8000000035486160 Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC last sysfs file: /sys/block/sr0/dev Pid: 1, comm: init Not tainted 2.6.37-rc2-tip+ #64308 A8N-E/System Product Name EIP: 0060:[<c10307f0>] EFLAGS: 00010086 CPU: 1 EIP is at update_cfs_shares+0x60/0x160 EAX: fffffffe EBX: f547603b ECX: 00000400 EDX: 00000002 ESI: f5486000 EDI: 0000013b EBP: f6459d48 ESP: f6459d3c DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068 Process init (pid: 1, ti=f6458000 task=f6450000 task.ti=f6458000) Stack: f5475a80 f6f066c0 00000004 f6459d84 c103256f 00000002 00000001 00000000 c10324d0 c200e6c0 00000001 f6f06b34 00000046 f5475a80 f5475ac8 f6f066c0 00000001 ffffffff f6459dfc c1b32820 f64a0010 f6459dc4 00000046 00000000 Call Trace: [<c103256f>] update_shares+0x9f/0x170 [<c10324d0>] ? update_shares+0x0/0x170 [<c1b32820>] schedule+0x580/0x9d0 [<c1039335>] ? sub_preempt_count+0xa5/0xe0 [<c1b330e5>] schedule_timeout+0x125/0x2a0 [<c104fe60>] ? process_timeout+0x0/0x10 [<c15aef4f>] uart_close+0x17f/0x350 [<c105fea0>] ? autoremove_wake_function+0x0/0x50 [<c1471f72>] tty_release+0x102/0x500 [<c1125fdf>] ? locks_remove_posix+0xf/0xa0 [<c1119a43>] ? fsnotify+0x1e3/0x2f0 [<c11198d3>] ? fsnotify+0x73/0x2f0 [<c10ea1e1>] fput+0xb1/0x230 [<c10e7e7e>] filp_close+0x4e/0x70 [<c10e7f14>] sys_close+0x74/0xc0 [<c1002b90>] sysenter_do_call+0x12/0x31 Code: 00 00 00 8b 18 8b 79 1c 8b 49 ...
Btw., there's a small cleanup in the patch that i picked up (see below), and i also edited the commit log a bit - so you might want to pick up the version below. Ingo ---------------> From 741430a6af2bdefe3d017226bbcfe96f9ed46b58 Mon Sep 17 00:00:00 2001 From: Mike Galbraith <efault@gmx.de> Date: Sat, 20 Nov 2010 12:35:00 -0700 Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. This patch only per session autogroups, but leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), the process wide reference to the default group is dropped, a new task group is created, and the process is moved into the new task group. Children thereafter inherit this task group, and increase it's refcount. On exit, a reference to the current task group is dropped when the last reference to each signal struct is dropped. The task group is destroyed when the last signal struct referencing it is freed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem. cat /proc/<pid>/autogroup displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can be also be turned on/off on ...
This needs spin_lock_irq(), ->siglock is irq-safe.
This is called by fs/proc. In this case nothing protects us from
release_task(), we can hit ->siglock == NULL (or we can race with
exec which changes ->sighand in theory).
This needs lock_task_sighand() (it can fail). Perhaps something
else have the same problem...
If the task is current and it is not exiting, or it is the new
child (sched_autogroup_fork), then it is safe to use ->siglock
directly.
This looks a bit confusing. We do not need current->sighand->siglock
to set sig->autogroup. Nobody except us can see this new signal_struct,
and in any case current->sighand->siglock can't help.
It is needed for autogroup_kref_get(), but we already have autogroup_get().
I'd suggest
void sched_autogroup_fork(struct signal_struct *sig)
{
sig->autogroup = autogroup_get(current);
}
Oleg.
--
Oh my. Thanks. I'll do that.. once the thing is non-toxic to tip. Thanks a lot Oleg. -Mike --
Oh crud. I ran 37, but not tip, it's toxic with my own config. So much for darn thing being ready :( -Mike --
And I just 2 hours ago got it working on 2.6.36.1(rc1) but had to learn and add to my 'makeit' script before I could make x work again. Yeah, I'm a bad bad boy, I run the latest nvidia drivers. A tail on the syslog is clean (so far anyway, uptime is 2:06). So you can have (FWTW) my reviewed by: Gene Heskett These patches are a definite keeper IMNSHO. -- Cheers, Gene "There are four boxes to be used in defense of liberty: soap, ballot, jury, and ammo. Please use in that order." -Ed Howdershelt (Author) On the whole, I'd rather be in Philadelphia. -- W.C. Fields' epitaph --
Hah. Took a while between vacation activities and my wimpy little memory ordering knowledge, but I've finally got it fingered out. Tip's update_shares() changes exposed previously invisible (to me) memory ordering woes nicely it seems. My vacation is (sniff) over, so I won't get a fully tested patch out the door for review until I get back home. -Mike --
Either I forgot to pack my eyeballs, or laptop is just too dinky and
annoying. Now back home on beloved box, this little bugger poked me
dead in the eye.
Something else is seriously wrong though. 36.1 with attached (plus
sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
37.git and tip with fixlet below both suck rocks. With a make -j40
running, wakeup-latency is showing latencies of >100ms, amarok skips,
mouse lurches badly.. generally horrid. Something went south.
sched: fix 3d4b47b4 typo.
Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: new submission
---
kernel/sched.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8087,7 +8087,6 @@ static inline void unregister_fair_sched
{
struct rq *rq = cpu_rq(cpu);
unsigned long flags;
- int i;
/*
* Only empty task groups can be destroyed; so we can speculatively
@@ -8097,7 +8096,7 @@ static inline void unregister_fair_sched
return;
raw_spin_lock_irqsave(&rq->lock, flags);
- list_del_leaf_cfs_rq(tg->cfs_rq[i]);
+ list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
raw_spin_unlock_irqrestore(&rq->lock, flags);
}
#else /* !CONFG_FAIR_GROUP_SCHED */
Can you test -rc3? Is that still ok? And are you perhaps using Nouveau? There's a report of some graphics (?) regression since -rc3 about bad desktop performance: https://bugzilla.kernel.org/show_bug.cgi?id=23912 but it doesn't have any more information yet (so if -rc3 _is_ good for you, and you can add anything to that report, it would be good. The original reporter is hopefully bisecting it now) Linus --
Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that out too: it has most of the current sched/core commits, but has none of the post-rc3 DRM changes. Thanks, Ingo --
Well we totally re-wrote the cgroup load-balancer in -tip. The thing currently in -linus is a utter crap because its very strongly serialized across all cores (some people spend like 25% of their time in there). --
Yes, 92c883adf03b includes those changes: 08f3c3065f4c: Merge branch 'sched/core' 9437178f623a: sched: Update tg->shares after cpu.shares write d6b5591829bd: sched: Allow update_cfs_load() to update global load 3b3d190ec368: sched: Implement demand based update_cfs_load() c66eaf619c0c: sched: Update shares on idle_balance a7a4f8a752ec: sched: Add sysctl_sched_shares_window 67e86250f8ea: sched: Introduce hierarchal order on shares update list e33078baa4d3: sched: Fix update_cfs_load() synchronization f0d7442a5924: sched: Fix load corruption from update_cfs_shares() 9e3081ca6114: sched: Make tg_shares_up() walk on-demand 3d4b47b4b040: sched: Implement on-demand (active) cfs_rq list 2069dd75c7d0: sched: Rewrite tg_shares_up) 48c5ccae88dc: sched: Simplify cpu-hot-unplug task migration 92fd4d4d67b9: Merge commit 'v2.6.37-rc2' into sched/core I just wanted to give Mike a known-stable sha1 that has -rc3 but not the post-rc3 DRM changes. Thanks, Ingo --
The good news is that the 37.git kernel was mislabeled in grub, was also booting the _tip_ kernel, and is actually just fine. It's only tip, and tip 92fd4d4d67b9 is still bad. I'll try a quick bisect before getting back to backlog. -Mike --
Just curious, what's the freshest still good -tip sha1? Thanks, Ingo --
I don't have a good tip yet. My bisection started with a merge, which tested bad, so it spat out... marge:..git/linux-2.6 # git bisect bad The merge base e53beacd23d9cb47590da6a7a7f6d417b941a994 is bad. This means the bug has been fixed between e53beacd23d9cb47590da6a7a7f6d417b941a994 and [19650e8580987c0ffabc2fe2cbc16b944789df8b]. marge:..git/linux-2.6 # git bisect log git bisect start # good: [19650e8580987c0ffabc2fe2cbc16b944789df8b] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6 git bisect good 19650e8580987c0ffabc2fe2cbc16b944789df8b # bad: [92fd4d4d67b945c0766416284d4ab236b31542c4] Merge commit 'v2.6.37-rc2' into sched/core git bisect bad 92fd4d4d67b945c0766416284d4ab236b31542c4 # bad: [e53beacd23d9cb47590da6a7a7f6d417b941a994] Linux 2.6.37-rc2 git bisect bad e53beacd23d9cb47590da6a7a7f6d417b941a994 marge:..git/linux-2.6 # --
Well, it seems that the rewrite is more crap than the "utter crap" in
current -git. What does that make -tip? Super-utter-crap?
Peter - getting the wrong answer quickly is not any better than strong
serialization.
Anyway, I'm happy to hear that the problem hasn't reached mainline yet.
Linus
--
I know, from the testing so far we _thought_ it was fairly sane. Apparently there's still some work to do. --
Btw., i think it shows the conceptual power of Mike's patch that this cgroups scheduling suckage was exposed so clearly. Previously it took weeks (sometimes months) for bugs to reach those who are using cgroups. Thanks, Ingo --
Damn thing bisected to:
commit 92fd4d4d67b945c0766416284d4ab236b31542c4
Merge: fe7de49 e53beac
Author: Ingo Molnar <mingo@elte.hu>
Date: Thu Nov 18 13:22:14 2010 +0100
Merge commit 'v2.6.37-rc2' into sched/core
Merge reason: Move to a .37-rc base.
Signed-off-by: Ingo Molnar <mingo@elte.hu>
92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit
git bisect start
# good: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36
git bisect good f6f94e2ab1b33f0082ac22d71f66385a60d8157f
# bad: [3a2b7f908d45fa45670e8ba9e7e24c0409ba43d8] Merge branch 'linus'
git bisect bad 3a2b7f908d45fa45670e8ba9e7e24c0409ba43d8
# good: [520045db940a381d2bee1c1b2179f7921b40fb10] Merge branches 'upstream/xenfs' and 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
git bisect good 520045db940a381d2bee1c1b2179f7921b40fb10
# good: [520045db940a381d2bee1c1b2179f7921b40fb10] Merge branches 'upstream/xenfs' and 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
git bisect good 520045db940a381d2bee1c1b2179f7921b40fb10
# good: [120a795da07c9a02221ca23464c28a7c6ad7de1d] audit mmap
git bisect good 120a795da07c9a02221ca23464c28a7c6ad7de1d
# good: [19650e8580987c0ffabc2fe2cbc16b944789df8b] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
git bisect good 19650e8580987c0ffabc2fe2cbc16b944789df8b
# good: [11259d65a61b84ad954953a194c41fe84dff889a] Merge branch 'out-of-tree'
git bisect good 11259d65a61b84ad954953a194c41fe84dff889a
# good: [eae0932ceba16e7ee0b5690455a13ef8364845da] Merge branch 'x86/mm'
git bisect good eae0932ceba16e7ee0b5690455a13ef8364845da
# good: [0464a38aaca10e1a8afed003d16d25dca2168d86] Merge branch 'sched/urgent'
git bisect good 0464a38aaca10e1a8afed003d16d25dca2168d86
# good: [22d1b202a8d0e1dedc35086b8f3df0a7b37d1371] Merge branch 'x86/urgent'
git bisect good 22d1b202a8d0e1dedc35086b8f3df0a7b37d1371
# bad: [282810f891cf6587dfc04fc5e26ec7772330c8cb] Merge branch 'sched/core'
git bisect ...Hm, i'd suggest to double check the two originator points: e53beac - is it really 'bad' ? fe7de49 - is it really 'good'? Thanks, Ingo --
https://lkml.org/lkml/2010/11/29/566 Should fix this. We missed this in testing as the delay between last-task-exit and group destruction was always sufficiently large as to ensure that the task_group had aged out of shares updates (as opposed to requiring explicit removal). With autogroup obviously the window here is essentially instantaneous which leads to the buggy removal code being executed. --
No, I had it in place where pertinent. Problem with bisection is that there are a couple of spots where X doesn't work. With X, it's obvious, less so in text console. Looks like I must have miscalled one of those. -Mike --
I've left some machines running tip + fix above + autogroup to see if anything else emerges. Hasn't crashed yet, I'll leave it going --
Thanks. Below is the hopefully final version against tip. The last I sent contained a couple remnants. From: Mike Galbraith <efault@gmx.de> Date: Tue Nov 30 14:07:12 CET 2010 Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups A recurring complaint from CFS users is that parallel kbuild has a negative impact on desktop interactivity. This patch implements an idea from Linus, to automatically create task groups. Currently, only per session autogroups are implemented, but the patch leaves the way open for enhancement. Implementation: each task's signal struct contains an inherited pointer to a refcounted autogroup struct containing a task group pointer, the default for all tasks pointing to the init_task_group. When a task calls setsid(), a new task group is created, the process is moved into the new task group, and a reference to the preveious task group is dropped. Child processes inherit this task group thereafter, and increase it's refcount. When the last thread of a process exits, the process's reference is dropped, such that when the last process referencing an autogroup exits, the autogroup is destroyed. At runqueue selection time, IFF a task has no cgroup assignment, its current autogroup is used. Autogroup bandwidth is controllable via setting it's nice level through the proc filesystem. cat /proc/<pid>/autogroup displays the task's group and the group's nice level. echo <nice level> > /proc/<pid>/autogroup Sets the task group's shares to the weight of nice <level> task. Setting nice level is rate limited for !admin users due to the abuse risk of task group locking. The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the boot option noautogroup, and can also be turned on/off on the fly via.. echo [01] > /proc/sys/kernel/sched_autogroup_enabled. ..which will automatically move tasks to/from the root task group. Signed-off-by: Mike Galbraith ...
Looks good to me, Thanks Mike! Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --
Ok, great! I've queued it up in tip:sched/core and started testing it - will push it out if it passes basic tests. Added Linus's Acked-by - i presume that's still valid for v4 as well, right? Thanks, Ingo --
Because it didn't build (for obvious reasons - the CONFIG conditions dont match up), but more importantly it's quite ugly. Some existing 'path' variables are 64 byte, some are 128 byte - so there's pre-existing damage - i removed it all. Could we do this debugging code in a bit saner way please? (as a delta patch on top of the -tip that i'll push out in the next hour or so.) Thanks, Ingo --
Won't removing that hunk bring back oops if you cat /proc/sched_debug? cfs_rq[0]:/autogroup-88 .exec_clock : 0.228697 .MIN_vruntime : 0.000001 .min_vruntime : 0.819879 .max_vruntime : 0.000001 .spread : 0.000000 Guess I'll try. -Mike --
Hi Mike, I was wonderig if these autogroups can be visible in regular cgroup hierarchy so that once somebody mounts cpu controller, these are visible? I was wondering why is a good idea to create a separate interface for autogroups through proc and not try to integrate it with cgroup interface. Without it now any user space tool shall have to either disable the autogroup feature completely or now also worry about /proc interface and there also autogroups are searchable through pid and there is no direct way to access these. IIUC, these autogroups create flat setup and are at same level as init_task_group and are not children of it. Currently cpu cgroup is hierarchical by default and any new cgroup is child of init_task_group and that could lead to representation issues. Well, will we not get same kind of latency boost if we make these autogroups children of root? If yes, then hierarchical representation issue of autogroup will be a moot point. We already have /proc/<pid>/cgroup interface which points to tasks's cgroup. We probably can avoid creating /proc/<pid>/autgroup if there is an associated cgroup which appears in cgroup hierachy and then user can change the weight of group through cgroup interface (instead of introducing another interface). Thanks --
No, autogroup is not auto-cgroup. You get zero whistles and zero bells I only put an interface there at all because it was requested, and made it a dirt simple 'nice level' interface because there's nothing simpler than 'nice'. The whole autogroup thing is intended for folks who don't want to set up cgroups, shares yadayada, so tying it into the cgroups Well, it's flat, but autogroup does.. That's possible (for someone familiar with cgroups;), but I don't see any reason for a wedding. -Mike --
Few things.
- This /proc/<pid>/autogroup is good for doing this single thing but when
I start thinking of possible extensions of it down the line, it creates
issues.
- Once we have some kind of uppper limit support in cpu controller, these
autogroups are beyond control. If you want to impose some kind of
limits on them then you shall have to extend parallel interface
/proc/<pid>/autogroup to also speicify upper limit (like nice levels).
- Similiarly if this autgroup notion is extended to other cgroup
controllers, then you shall have to again extend /proc/<pid>/autogroup
to be able to specify these additional parameters.
- If there is a monitoring tool which is monitoring the system for
resource usage by the groups, then I think these autogroups are beyond
reach and any stats exported by cgroup interface will not be available.
(though right now I can't see any stats being exported by cgroup files
in cpu controller but other controllers like block and memory do.).
- I am doing some testing with the patch and w.r.t. cgroup interface some
things don't seem right.
I have applied your patch and enabled CONFIG_AUTO_GROUP. Now I boot
into the kernel and open a new ssh connection to the machine.
# echo $$
3555
# cat /proc/3555/autogroup
/autogroup-63 nice 0
IIUC, task 3555 has been moved into an autogroup. Now I mount the cpu
controller and this task is visible in root cgroup.
# mount -t cgroup -o cpu none /cgroup/cpu
# cat /cgroup/cpu/tasks | grep 3555
3555
First of all this gives user a wrong impression that task 3555 is in
root cgroup.
Now I create a child group test1 and move the task there and also change
the weight/shares of the cgroup to 10240.
# mkdir test1
# echo 3555 > test1/tasks
# echo 10240 > test1/cpu.shares
# cat /proc/3555/cgroup
3:cpu:/test1
# cat /proc/3555/autogroup
/autogroup-63 nice 0
So again, user will think that task is in cgroup test1 and is ...Hmm, maybe we can make AUTO_GROUP depend on !CGROUPS? It seems that autogroup only uses 'struct task_group', no other cgroup things, so I think that is reasonable and doable. --
Build time exclusion is not as flexible. As is, the user can have one kernel, and use whatever he likes. -Mike --
That's only going to create more #ifdefery in sched.c (and we already got way too much of that), for little to no gain. But yes, technically that could be done. --
A non-starter if you have a Fedora Rawhide box that has systemd installed, as that won't even make it to single-user if you don't have CGROUPS in the kernel config.
Yes, if it evolves, it's interface will need to evolve as well. It could have been a directory containing buttons, knobs and statistics, If you're manually assigning bandwidth et al from userland, there's not much point to in-kernel automation is there? If I had married the two, the first thing that would have happened is gripes about things appearing and disappearing in cgroups directories, It is in the root cgroup. It is not in the root autogroup is not It is the case here. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND 7573 root 20 0 7996 340 256 R 50 0.0 3:35.86 3 pert 7572 root 20 0 7996 340 256 R 50 0.0 9:21.68 3 pert ... marge:/cgroups/test # echo 7572 > tasks marge:/cgroups/test # echo 4096 > cpu.shares PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND 7572 root 20 0 7996 340 256 R 80 0.0 10:06.92 3 pert 7573 root 20 0 7996 340 256 R 20 0.0 4:05.80 3 pert When you move a task into a cgroup, it still has an autogroup The user has to specifically ask for it in his config, can turn it on or ..it has a different mission, with different users being targeted, so IMHO, cgroups should have been 'nice' from the start, but the folks who wrote it did what they thought best. I like nice a lot better than Perhaps in future, they'll get married, and perhaps they should, but in the here and now, I think they have similar but not identical missions. If you turn on one, turn off the other. Maybe that should be automated. Systemd thingy may make autogroup short lived anyway. I had a query from an embedded guy (hm, which I spaced) suggesting autogroup may be quite nice for handheld stuff though, so who knows. -Mike --
Agreed, but by the time I realized that the shares thing was already in the wild. I did (probably still do) have a patch that adds a nice file to the cgroup file. Anyway, I think the whole proc/nice interface for autogroups is already a tad too far. If people want control they can use cgroups, but I really don't care enough to argue much about it. --
Agreed. I've no intention of expanding it. -Mike --
On Wed, Dec 01, 2010 at 06:57:48AM +0100, Mike Galbraith wrote: Ok, so I got confused with the fact that after moving a task into a cgroup it is still associated with an autogroup. So IIUC, if a task is in root cgroup, then it would not necessarily be driven by cpu.shares of root cgroup (as task could be in its own autogroup). But if I move the task into a non-root cgroup, then it will for sure be subjected to rules imposed by non-root cgroup cpu.shares. That's not too bad. Thanks Vivek --
I think the normal case would be either one or the other being in use at any given time, but yes, if both are active, that's how it'd work. -Mike --
Nope. I did a bisection this morning in text mode with a pipe-test
based measurement proggy, and it bisected cleanly.
2069dd75c7d0f49355939e5586daf5a9ab216db7 is the first bad commit
commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date: Mon Nov 15 15:47:00 2010 -0800
sched: Rewrite tg_shares_up)
-Mike
--
Ok. And has this fixed it: 822bc180a7f7: sched: Fix unregister_fair_sched_group() ... or are there two bugs? Thanks, Ingo --
another detail is that i needed this fix: --- linux.orig/init/Kconfig +++ linux/init/Kconfig @@ -790,6 +790,7 @@ endif # NAMESPACES config SCHED_AUTOGROUP bool "Automatic process group scheduling" + select EVENTFD select CGROUPS select CGROUP_SCHED select FAIR_GROUP_SCHED Because CGROUPS depends on eventfd infrastructure. Thanks, Ingo --
Shouldn't then cgroups select that? --
It depends on it so selecting it is fine. !EVENTFD is a CONFIG_EMBEDDED-only thing anyway. Thanks, Ingo --
Two bugs. 822bc180a7f7 fixes the explosions that were happening in tip. The interactivity issue is some problem in the update_shares() stuff. -Mike --
I'll hunt as soon as I can (inbox runneth over). -Mike --
I'm looking at this. The share:share ratios looked good in static testing, but perhaps we need a little more wake-up boost to improve interactivity. Should have something tomorrow. --
I'm looking at this. The share:share ratios looked good in static testing, but perhaps we need a little more wake-up boost to improve interactivity. Should have something tomorrow. --
Yeah, feels like a wakeup issue. I too did a (brief) static test, and that looked ok. -Mike --
Hey Mike,
Does something like the below help?
We're quick to drive the load_contribution up (to avoid over-commit).
However on sleepy workloads this results in lots of weight being
stranded (since it reaches maximum contribution instantaneously but
decays slowly) as the thread migrates between cpus.
Avoid this by averaging "up" in the wake-up direction as well as the sleep.
We also get a boost from the fact that we use the instantaneous weight
in computing the actual received shares.
I actually don't have a desktop setup handy to test "interactivity" (sad
but true -- working on grabbing one). But it looks better on under
synthetic load.
- Paul
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -743,12 +743,19 @@ static void update_cfs_load(struct cfs_r
return;
now = rq_of(cfs_rq)->clock;
- delta = now - cfs_rq->load_stamp;
+
+ if (likely(cfs_rq->load_stamp))
+ delta = now - cfs_rq->load_stamp;
+ else {
+ /* avoid large initial delta and initialize load_period */
+ delta = 1;
+ cfs_rq->load_stamp = 1;
+ }
/* truncate load history at 4 idle periods */
if (cfs_rq->load_stamp > cfs_rq->load_last &&
now - cfs_rq->load_last > 4 * period) {
- cfs_rq->load_period = 0;
+ cfs_rq->load_period = period/2;
cfs_rq->load_avg = 0;
}
--
Unfortunately not. For example, Xorg+mplayer needs (30 sec refresh).. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND 6487 root 20 0 366m 30m 5100 S 31 0.4 2:04.83 2 Xorg 4454 root 20 0 318m 28m 15m S 29 0.4 0:38.06 3 mplayer ..but gets this when a heavy kbuild is running along with them. PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ P COMMAND 6487 root 20 0 366m 30m 5136 S 12 0.4 2:25.98 1 Xorg 5595 root 20 0 318m 28m 15m R 8 0.4 0:09.31 3 mplayer There are 4 task groups active at this time, Xorg, mplayer, Amarok and konsole where the kbuild is running make -j40. -Mike --
Hmm.. unfortunate. Ok -- based on the traces of synthetic loads and the traces of their share on wake-up I think this is the right track at least, will refine it tomorrow. --
What tools are actually used to test "interactivity" ? I posted a tool to the list some time ago, but I don't think anyone noticed. My tool is very simple. When you hold a key down, it should repeat. It should repeat at a constant predictable interval. So, my tool just waits for key presses and times when each one occurred. The tester simply presses a key and holds it down. If the time between each key press is constant, it indicates good "interactivity". If the time between each key press varies a lot, it indicates bad "interactivity". You can reliably test if one kernel is better than the next using actual measurable figures. Kind Regards James --
On Sat, Dec 4, 2010 at 3:55 PM, James Courtier-Dutton Could you drop me a pointer? I can certainly give it a try. It would be extra useful if it included any histogram functionality. I've been using a combination of various synthetic wakeup and load scripts and measuring the received bandwidth / wakeup latency. They have not succeeded in reproducing the starvation or poor latency observed by Mike above however. (Although I've pulled a box to try reproducing his exact conditions [ e.g. user environment ] on Monday). --
Desktop hardware came in today and I can now reproduce the issues Mike's been seeing; tuning in progress. --
This goose is now cooked. It turns out the new shares code is doing the "right" thing, but in the wrong order for tasks with very small time slices. This made it rather gnarly to track down since at all times the new evaluation / the old evaluation / and an "OPT" evaluation were all in agreement! As we hierarchically dequeue we now instantaneously adjust entity weights to account for the new global state (good). However, when we then update on the parent (e.g. the group entity owning the just-adjusted cfs_rq), the accrued unaccounted time is charged at the new weight for that entity instead of the old. For longer running processes, the periodic updates hide this. However, for an interactive process, such as Xorg (which uses many _small_ timeslices -- e.g. almost all accounting ends up being at dequeue as opposed to periodic) this results in significant vruntime over-charging and a loss of fairness. In Xorg's case the loss of fairness is compounded by the fact that there is only one runnable thread means we transition between NICE_0_LOAD and MIN_SHARES for the over-charging above. This is fixed by charging the unaccounted time versus a group entity before we manipulate its weight (as a result of child movement). Thanks for your patience while I tracked this down.. it's been a few sleepless nights while I cranked through a number of dead-end theories (rather frustrating when the numbers are all right but the results are all wrong! ;). Cleaned up patch inbound in the morning. - Paul --
Right, the previous thing cheated quite enormous with wakeups simply because it was way too expensive to compute proper shares on wakeups. Maybe we should re-instate some of that cheating. --
