Re: [PATCH v4] sched: automated per session task groups

Previous thread: [PATCH] Change return type of obvious Boolean list fns to "bool". by Robert P. J. Day on Sunday, November 21, 2010 - 6:35 am. (3 messages)

Next thread: [tip:perf/core] perf tools: Change my maintainer address by tip-bot for Arnaldo Carvalho de Melo on Sunday, November 21, 2010 - 6:43 am. (1 message)
From: Ingo Molnar
Date: Sunday, November 21, 2010 - 6:37 am

Hello Mike,


I tested it a bit, and autosched-v4 crashes on bootup with with attached config.

Note: the box has serial logging enabled and there's UART code in the stacktrace - 
maybe it's related. Let me know if you need the full bootup log.

Thanks,

	Ingo

[FAILED]
Enabling local filesystem quotas:  [  OK  ]
PPS event at 4294886381
Enabling /etc/fstab swaps:  swapon: /dev/hda2: Function not implemented
[FAILED]
INIT: Entering runleveBUG: unable to handle kernel paging request at f548604c
IP:l: 3 [<c10307f0>] update_cfs_shares+0x60/0x160
*pdpt = 0000000002017001 *pde = 00000000029d4067 *pte = 8000000035486160 
Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
last sysfs file: /sys/block/sr0/dev

Pid: 1, comm: init Not tainted 2.6.37-rc2-tip+ #64308 A8N-E/System Product Name
EIP: 0060:[<c10307f0>] EFLAGS: 00010086 CPU: 1
EIP is at update_cfs_shares+0x60/0x160
EAX: fffffffe EBX: f547603b ECX: 00000400 EDX: 00000002
ESI: f5486000 EDI: 0000013b EBP: f6459d48 ESP: f6459d3c
 DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process init (pid: 1, ti=f6458000 task=f6450000 task.ti=f6458000)
Stack:
 f5475a80 f6f066c0 00000004 f6459d84 c103256f 00000002 00000001 00000000
 c10324d0 c200e6c0 00000001 f6f06b34 00000046 f5475a80 f5475ac8 f6f066c0
 00000001 ffffffff f6459dfc c1b32820 f64a0010 f6459dc4 00000046 00000000
Call Trace:
 [<c103256f>] update_shares+0x9f/0x170
 [<c10324d0>] ? update_shares+0x0/0x170
 [<c1b32820>] schedule+0x580/0x9d0
 [<c1039335>] ? sub_preempt_count+0xa5/0xe0
 [<c1b330e5>] schedule_timeout+0x125/0x2a0
 [<c104fe60>] ? process_timeout+0x0/0x10
 [<c15aef4f>] uart_close+0x17f/0x350
 [<c105fea0>] ? autoremove_wake_function+0x0/0x50
 [<c1471f72>] tty_release+0x102/0x500
 [<c1125fdf>] ? locks_remove_posix+0xf/0xa0
 [<c1119a43>] ? fsnotify+0x1e3/0x2f0
 [<c11198d3>] ? fsnotify+0x73/0x2f0
 [<c10ea1e1>] fput+0xb1/0x230
 [<c10e7e7e>] filp_close+0x4e/0x70
 [<c10e7f14>] sys_close+0x74/0xc0
 [<c1002b90>] sysenter_do_call+0x12/0x31
Code: 00 00 00 8b 18 8b 79 1c 8b 49 ...
From: Ingo Molnar
Date: Sunday, November 21, 2010 - 6:39 am

Btw., there's a small cleanup in the patch that i picked up (see below), and i also 
edited the commit log a bit - so you might want to pick up the version below.

	Ingo

--------------->
From 741430a6af2bdefe3d017226bbcfe96f9ed46b58 Mon Sep 17 00:00:00 2001
From: Mike Galbraith <efault@gmx.de>
Date: Sat, 20 Nov 2010 12:35:00 -0700
Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups

A recurring complaint from CFS users is that parallel kbuild has
a negative impact on desktop interactivity.  This patch
implements an idea from Linus, to automatically create task
groups.  This patch only per session autogroups, but leaves the
way open for enhancement.

Implementation: each task's signal struct contains an inherited
pointer to a refcounted autogroup struct containing a task group
pointer, the default for all tasks pointing to the
init_task_group.  When a task calls setsid(), the process wide
reference to the default group is dropped, a new task group is
created, and the process is moved into the new task group.
Children thereafter inherit this task group, and increase it's
refcount.  On exit, a reference to the current task group is
dropped when the last reference to each signal struct is
dropped.  The task group is destroyed when the last signal
struct referencing it is freed.

At runqueue selection time, IFF a task has no cgroup assignment, its
current autogroup is used.

Autogroup bandwidth is controllable via setting it's nice level
through the proc filesystem.  cat /proc/<pid>/autogroup displays
the task's group and the group's nice level.

 echo <nice level> > /proc/<pid>/autogroup

Sets the task group's shares to the weight of nice <level> task.
Setting nice level is rate limited for !admin users due to the abuse
risk of task group locking.

The feature is enabled from boot by default if
CONFIG_SCHED_AUTOGROUP=y is selected, but can be disabled via the
boot option noautogroup, and can be also be turned on/off on ...
From: Oleg Nesterov
Date: Sunday, November 21, 2010 - 8:44 am

This needs spin_lock_irq(), ->siglock is irq-safe.


This is called by fs/proc. In this case nothing protects us from
release_task(), we can hit ->siglock == NULL (or we can race with
exec which changes ->sighand in theory).

This needs lock_task_sighand() (it can fail). Perhaps something
else have the same problem...

If the task is current and it is not exiting, or it is the new
child (sched_autogroup_fork), then it is safe to use ->siglock
directly.


This looks a bit confusing. We do not need current->sighand->siglock
to set sig->autogroup. Nobody except us can see this new signal_struct,
and in any case current->sighand->siglock can't help.

It is needed for autogroup_kref_get(), but we already have autogroup_get().
I'd suggest

	void sched_autogroup_fork(struct signal_struct *sig)
	{
		sig->autogroup = autogroup_get(current);	
	}

Oleg.

--

From: Mike Galbraith
Date: Sunday, November 21, 2010 - 9:35 am

Oh my.  Thanks.



I'll do that.. once the thing is non-toxic to tip.

Thanks a lot Oleg.

	-Mike

--

From: Mike Galbraith
Date: Sunday, November 21, 2010 - 9:15 am

Oh crud.  I ran 37, but not tip, it's toxic with my own config.  So much
for darn thing being ready :(

	-Mike

--

From: Gene Heskett
Date: Sunday, November 21, 2010 - 11:43 am

And I just 2 hours ago got it working on 2.6.36.1(rc1) but had to learn and 
add to my 'makeit' script before I could make x work again.  Yeah, I'm a 
bad bad boy, I run the latest nvidia drivers.  A tail on the syslog is 
clean (so far anyway, uptime is 2:06).

So you can have (FWTW) my reviewed by: Gene Heskett

These patches are a definite keeper IMNSHO.

-- 
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
On the whole, I'd rather be in Philadelphia.
		-- W.C. Fields' epitaph
--

From: Mike Galbraith
Date: Thursday, November 25, 2010 - 9:00 am

Hah.  Took a while between vacation activities and my wimpy little
memory ordering knowledge, but I've finally got it fingered out.  Tip's
update_shares() changes exposed previously invisible (to me) memory
ordering woes nicely it seems.

My vacation is (sniff) over, so I won't get a fully tested patch out the
door for review until I get back home.

	-Mike


--

From: Mike Galbraith
Date: Sunday, November 28, 2010 - 7:24 am

Either I forgot to pack my eyeballs, or laptop is just too dinky and
annoying.  Now back home on beloved box, this little bugger poked me
dead in the eye.

Something else is seriously wrong though.  36.1 with attached (plus
sched, cgroup: Fixup broken cgroup movement) works a treat, whereas
37.git and tip with fixlet below both suck rocks.  With a make -j40
running, wakeup-latency is showing latencies of >100ms, amarok skips,
mouse lurches badly.. generally horrid.  Something went south.

sched: fix 3d4b47b4 typo.

Signed-off-by: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
Cc: Ingo Molnar <mingo@elte.hu>
LKML-Reference: new submission
---
 kernel/sched.c |    3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -8087,7 +8087,6 @@ static inline void unregister_fair_sched
 {
 	struct rq *rq = cpu_rq(cpu);
 	unsigned long flags;
-	int i;
 
 	/*
 	* Only empty task groups can be destroyed; so we can speculatively
@@ -8097,7 +8096,7 @@ static inline void unregister_fair_sched
 		return;
 
 	raw_spin_lock_irqsave(&rq->lock, flags);
-	list_del_leaf_cfs_rq(tg->cfs_rq[i]);
+	list_del_leaf_cfs_rq(tg->cfs_rq[cpu]);
 	raw_spin_unlock_irqrestore(&rq->lock, flags);
 }
 #else /* !CONFG_FAIR_GROUP_SCHED */

From: Linus Torvalds
Date: Sunday, November 28, 2010 - 12:31 pm

Can you test -rc3? Is that still ok? And are you perhaps using
Nouveau? There's a report of some graphics (?) regression since -rc3
about bad desktop performance:

   https://bugzilla.kernel.org/show_bug.cgi?id=23912

but it doesn't have any more information yet (so if -rc3 _is_ good for
you, and you can add anything to that report, it would be good. The
original reporter is hopefully bisecting it now)

                     Linus
--

From: Ingo Molnar
Date: Sunday, November 28, 2010 - 1:18 pm

Mike, the last pure -rc3 -tip commit is 92c883adf03b - you could try to check that 
out too: it has most of the current sched/core commits, but has none of the post-rc3 
DRM changes.

Thanks,

	Ingo
--

From: Peter Zijlstra
Date: Monday, November 29, 2010 - 4:53 am

Well we totally re-wrote the cgroup load-balancer in -tip. The thing
currently in -linus is a utter crap because its very strongly serialized
across all cores (some people spend like 25% of their time in there).
--

From: Ingo Molnar
Date: Monday, November 29, 2010 - 5:30 am

Yes, 92c883adf03b includes those changes:

 08f3c3065f4c: Merge branch 'sched/core'
 9437178f623a: sched: Update tg->shares after cpu.shares write
 d6b5591829bd: sched: Allow update_cfs_load() to update global load
 3b3d190ec368: sched: Implement demand based update_cfs_load()
 c66eaf619c0c: sched: Update shares on idle_balance
 a7a4f8a752ec: sched: Add sysctl_sched_shares_window
 67e86250f8ea: sched: Introduce hierarchal order on shares update list
 e33078baa4d3: sched: Fix update_cfs_load() synchronization
 f0d7442a5924: sched: Fix load corruption from update_cfs_shares()
 9e3081ca6114: sched: Make tg_shares_up() walk on-demand
 3d4b47b4b040: sched: Implement on-demand (active) cfs_rq list
 2069dd75c7d0: sched: Rewrite tg_shares_up)
 48c5ccae88dc: sched: Simplify cpu-hot-unplug task migration
 92fd4d4d67b9: Merge commit 'v2.6.37-rc2' into sched/core

I just wanted to give Mike a known-stable sha1 that has -rc3 but not the post-rc3 
DRM changes.

Thanks,

	Ingo
--

From: Mike Galbraith
Date: Monday, November 29, 2010 - 6:45 am

The good news is that the 37.git kernel was mislabeled in grub, was also
booting the _tip_ kernel, and is actually just fine.  It's only tip, and
tip 92fd4d4d67b9 is still bad.  I'll try a quick bisect before getting
back to backlog.

	-Mike

--

From: Ingo Molnar
Date: Monday, November 29, 2010 - 6:47 am

Just curious, what's the freshest still good -tip sha1?

Thanks,

	Ingo
--

From: Mike Galbraith
Date: Monday, November 29, 2010 - 7:04 am

I don't have a good tip yet.  My bisection started with a merge, which
tested bad, so it spat out...

marge:..git/linux-2.6 # git bisect bad
The merge base e53beacd23d9cb47590da6a7a7f6d417b941a994 is bad.
This means the bug has been fixed between e53beacd23d9cb47590da6a7a7f6d417b941a994 and [19650e8580987c0ffabc2fe2cbc16b944789df8b].

marge:..git/linux-2.6 # git bisect log
git bisect start
# good: [19650e8580987c0ffabc2fe2cbc16b944789df8b] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
git bisect good 19650e8580987c0ffabc2fe2cbc16b944789df8b
# bad: [92fd4d4d67b945c0766416284d4ab236b31542c4] Merge commit 'v2.6.37-rc2' into sched/core
git bisect bad 92fd4d4d67b945c0766416284d4ab236b31542c4
# bad: [e53beacd23d9cb47590da6a7a7f6d417b941a994] Linux 2.6.37-rc2
git bisect bad e53beacd23d9cb47590da6a7a7f6d417b941a994
marge:..git/linux-2.6 #

--

From: Linus Torvalds
Date: Monday, November 29, 2010 - 9:27 am

Well, it seems that the rewrite is more crap than the "utter crap" in
current -git. What does that make -tip? Super-utter-crap?

Peter - getting the wrong answer quickly is not any better than strong
serialization.

Anyway, I'm happy to hear that the problem hasn't reached mainline yet.

                    Linus
--

From: Ingo Molnar
Date: Monday, November 29, 2010 - 9:44 am

It wont.

Thanks,

	Ingo
--

From: Peter Zijlstra
Date: Monday, November 29, 2010 - 10:37 am

I know, from the testing so far we _thought_ it was fairly sane.
Apparently there's still some work to do.


--

From: Ingo Molnar
Date: Monday, November 29, 2010 - 11:03 am

Btw., i think it shows the conceptual power of Mike's patch that this cgroups 
scheduling suckage was exposed so clearly. Previously it took weeks (sometimes 
months) for bugs to reach those who are using cgroups.

Thanks,

	Ingo
--

From: Mike Galbraith
Date: Monday, November 29, 2010 - 12:06 pm

Damn thing bisected to:

commit 92fd4d4d67b945c0766416284d4ab236b31542c4
Merge: fe7de49 e53beac
Author: Ingo Molnar <mingo@elte.hu>
Date:   Thu Nov 18 13:22:14 2010 +0100

    Merge commit 'v2.6.37-rc2' into sched/core

    Merge reason: Move to a .37-rc base.

    Signed-off-by: Ingo Molnar <mingo@elte.hu>

92fd4d4d67b945c0766416284d4ab236b31542c4 is the first bad commit

git bisect start
# good: [f6f94e2ab1b33f0082ac22d71f66385a60d8157f] Linux 2.6.36
git bisect good f6f94e2ab1b33f0082ac22d71f66385a60d8157f
# bad: [3a2b7f908d45fa45670e8ba9e7e24c0409ba43d8] Merge branch 'linus'
git bisect bad 3a2b7f908d45fa45670e8ba9e7e24c0409ba43d8
# good: [520045db940a381d2bee1c1b2179f7921b40fb10] Merge branches 'upstream/xenfs' and 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
git bisect good 520045db940a381d2bee1c1b2179f7921b40fb10
# good: [520045db940a381d2bee1c1b2179f7921b40fb10] Merge branches 'upstream/xenfs' and 'upstream/core' of git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen
git bisect good 520045db940a381d2bee1c1b2179f7921b40fb10
# good: [120a795da07c9a02221ca23464c28a7c6ad7de1d] audit mmap
git bisect good 120a795da07c9a02221ca23464c28a7c6ad7de1d
# good: [19650e8580987c0ffabc2fe2cbc16b944789df8b] Merge branch 'bugfixes' of git://git.linux-nfs.org/projects/trondmy/nfs-2.6
git bisect good 19650e8580987c0ffabc2fe2cbc16b944789df8b
# good: [11259d65a61b84ad954953a194c41fe84dff889a] Merge branch 'out-of-tree'
git bisect good 11259d65a61b84ad954953a194c41fe84dff889a
# good: [eae0932ceba16e7ee0b5690455a13ef8364845da] Merge branch 'x86/mm'
git bisect good eae0932ceba16e7ee0b5690455a13ef8364845da
# good: [0464a38aaca10e1a8afed003d16d25dca2168d86] Merge branch 'sched/urgent'
git bisect good 0464a38aaca10e1a8afed003d16d25dca2168d86
# good: [22d1b202a8d0e1dedc35086b8f3df0a7b37d1371] Merge branch 'x86/urgent'
git bisect good 22d1b202a8d0e1dedc35086b8f3df0a7b37d1371
# bad: [282810f891cf6587dfc04fc5e26ec7772330c8cb] Merge branch 'sched/core'
git bisect ...
From: Ingo Molnar
Date: Monday, November 29, 2010 - 12:20 pm

Hm, i'd suggest to double check the two originator points:

  e53beac - is it really 'bad' ?
  fe7de49 - is it really 'good'?

Thanks,

	Ingo
--

From: Paul Turner
Date: Monday, November 29, 2010 - 8:39 pm

https://lkml.org/lkml/2010/11/29/566

Should fix this.  We missed this in testing as the delay between
last-task-exit and group destruction was always sufficiently large as
to ensure that the task_group had aged out of shares updates (as
opposed to requiring explicit removal).

With autogroup obviously the window here is essentially instantaneous
which leads to the buggy removal code being executed.
--

From: Mike Galbraith
Date: Monday, November 29, 2010 - 9:14 pm

No, I had it in place where pertinent.  Problem with bisection is that
there are a couple of spots where X doesn't work.  With X, it's obvious,
less so in text console.  Looks like I must have miscalled one of those.

	-Mike

--

From: Paul Turner
Date: Monday, November 29, 2010 - 9:23 pm

I've left some machines running tip + fix above + autogroup to see if
anything else emerges.  Hasn't crashed yet, I'll leave it going
--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 6:18 am

Thanks.  Below is the hopefully final version against tip.  The last I
sent contained a couple remnants.

From: Mike Galbraith <efault@gmx.de>
Date: Tue Nov 30 14:07:12 CET 2010
Subject: [PATCH] sched: Improve desktop interactivity: Implement automated per session task groups

A recurring complaint from CFS users is that parallel kbuild has a negative
impact on desktop interactivity.  This patch implements an idea from Linus,
to automatically create task groups.  Currently, only per session autogroups
are implemented, but the patch leaves the way open for enhancement.

Implementation: each task's signal struct contains an inherited pointer to
a refcounted autogroup struct containing a task group pointer, the default
for all tasks pointing to the init_task_group.  When a task calls setsid(),
a new task group is created, the process is moved into the new task group,
and a reference to the preveious task group is dropped.  Child processes
inherit this task group thereafter, and increase it's refcount.  When the
last thread of a process exits, the process's reference is dropped, such
that when the last process referencing an autogroup exits, the autogroup
is destroyed.

At runqueue selection time, IFF a task has no cgroup assignment, its current
autogroup is used.

Autogroup bandwidth is controllable via setting it's nice level through the
proc filesystem.  cat /proc/<pid>/autogroup displays the task's group and the
group's nice level.  echo <nice level> > /proc/<pid>/autogroup Sets the task
group's shares to the weight of nice <level> task.  Setting nice level is rate
limited for !admin users due to the abuse risk of task group locking.

The feature is enabled from boot by default if CONFIG_SCHED_AUTOGROUP=y is
selected, but can be disabled via the boot option noautogroup, and can also
be turned on/off on the fly via..
	echo [01] > /proc/sys/kernel/sched_autogroup_enabled.
..which will automatically move tasks to/from the root task group.

Signed-off-by: Mike Galbraith ...
From: Peter Zijlstra
Date: Tuesday, November 30, 2010 - 6:48 am

Looks good to me, Thanks Mike!

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
--

From: Ingo Molnar
Date: Tuesday, November 30, 2010 - 6:59 am

Ok, great!

I've queued it up in tip:sched/core and started testing it - will push it out if it 
passes basic tests. Added Linus's Acked-by - i presume that's still valid for v4 as 
well, right?

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Tuesday, November 30, 2010 - 7:13 am

Because it didn't build (for obvious reasons - the CONFIG conditions dont match up), 
but more importantly it's quite ugly. Some existing 'path' variables are 64 byte, 
some are 128 byte - so there's pre-existing damage - i removed it all.

Could we do this debugging code in a bit saner way please? (as a delta patch on top 
of the -tip that i'll push out in the next hour or so.)

Thanks,

	Ingo
--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 9:41 am

Won't removing that hunk bring back oops if you cat /proc/sched_debug?

cfs_rq[0]:/autogroup-88
  .exec_clock                    : 0.228697
  .MIN_vruntime                  : 0.000001
  .min_vruntime                  : 0.819879
  .max_vruntime                  : 0.000001
  .spread                        : 0.000000

Guess I'll try.

	-Mike



--

From: Vivek Goyal
Date: Tuesday, November 30, 2010 - 8:17 am

Hi Mike,

I was wonderig if these autogroups can be visible in regular cgroup
hierarchy so that once somebody mounts cpu controller, these are visible?

I was wondering why is a good idea to create a separate interface for
autogroups through proc and not try to integrate it with cgroup interface.

Without it now any user space tool shall have to either disable the
autogroup feature completely or now also worry about /proc interface
and there also autogroups are searchable through pid and there is no
direct way to access these.

IIUC, these autogroups create flat setup and are at same level as
init_task_group and are not children of it. Currently cpu cgroup 
is hierarchical by default and any new cgroup is child of init_task_group
and that could lead to representation issues.

Well, will we not get same kind of latency boost if we make these autogroups
children of root? If yes, then hierarchical representation issue of autogroup
will be a moot point.

We already have /proc/<pid>/cgroup interface which points to tasks's
cgroup. We probably can avoid creating /proc/<pid>/autgroup if there
is an associated cgroup which appears in cgroup hierachy and then user
can change the weight of group through cgroup interface (instead of
introducing another interface).

Thanks
--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 10:13 am

No, autogroup is not auto-cgroup.  You get zero whistles and zero bells

I only put an interface there at all because it was requested, and made
it a dirt simple 'nice level' interface because there's nothing simpler
than 'nice'.  The whole autogroup thing is intended for folks who don't
want to set up cgroups, shares yadayada, so tying it into the cgroups


Well, it's flat, but autogroup does..


That's possible (for someone familiar with cgroups;), but I don't see
any reason for a wedding.

	-Mike

--

From: Vivek Goyal
Date: Tuesday, November 30, 2010 - 12:36 pm

Few things.

- This /proc/<pid>/autogroup is good for doing this single thing but when
  I start thinking of possible extensions of it down the line, it creates
  issues. 

- Once we have some kind of uppper limit support in cpu controller, these
  autogroups are beyond control. If you want to impose some kind of 
  limits on them then you shall have to extend parallel interface
  /proc/<pid>/autogroup to also speicify upper limit (like nice levels).

- Similiarly if this autgroup notion is extended to other cgroup
  controllers, then you shall have to again extend /proc/<pid>/autogroup
  to be able to specify these additional parameters.

- If there is a monitoring tool which is monitoring the system for
  resource usage by the groups, then I think these autogroups are beyond
  reach and any stats exported by cgroup interface will not be available.
  (though right now I can't see any stats being exported by cgroup files
   in cpu controller but other controllers like block and memory do.).

- I am doing some testing with the patch and w.r.t. cgroup interface some
  things don't seem right.

  I have applied your patch and enabled CONFIG_AUTO_GROUP. Now I boot
  into the kernel and open a new ssh connection to the machine. 

  # echo $$
    3555
  # cat /proc/3555/autogroup
    /autogroup-63 nice 0

  IIUC, task 3555 has been moved into an autogroup. Now I mount the cpu
  controller and this task is visible in root cgroup.

  # mount -t cgroup -o cpu none /cgroup/cpu
  # cat /cgroup/cpu/tasks | grep 3555
    3555

  First of all this gives user a wrong impression that task 3555 is in
  root cgroup.

  Now I create a child group test1 and move the task there and also change
  the weight/shares of the cgroup to 10240.

  # mkdir test1
  # echo 3555 > test1/tasks
  # echo 10240 > test1/cpu.shares
  # cat /proc/3555/cgroup
    3:cpu:/test1
  # cat /proc/3555/autogroup
    /autogroup-63 nice 0

So again, user will think that task is in cgroup test1 and is ...
From: Américo Wang
Date: Tuesday, November 30, 2010 - 10:01 pm

Hmm, maybe we can make AUTO_GROUP depend on !CGROUPS?

It seems that autogroup only uses 'struct task_group', no other cgroup things,
so I think that is reasonable and doable.
--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 11:09 pm

Build time exclusion is not as flexible.  As is, the user can have one
kernel, and use whatever he likes.

	-Mike

--

From: Peter Zijlstra
Date: Wednesday, December 1, 2010 - 4:36 am

That's only going to create more #ifdefery in sched.c (and we already
got way too much of that), for little to no gain.

But yes, technically that could be done.

--

From: Valdis.Kletnieks
Date: Wednesday, December 1, 2010 - 3:12 pm

A non-starter if you have a Fedora Rawhide box that has systemd installed, as
that won't even make it to single-user if you don't have CGROUPS in the kernel config.

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 10:57 pm

Yes, if it evolves, it's interface will need to evolve as well.  It
could have been a directory containing buttons, knobs and statistics,

If you're manually assigning bandwidth et al from userland, there's not
much point to in-kernel automation is there?

If I had married the two, the first thing that would have happened is
gripes about things appearing and disappearing in cgroups directories,

It is in the root cgroup.  It is not in the root autogroup is not

It is the case here.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 7573 root      20   0  7996  340  256 R   50  0.0   3:35.86 3 pert
 7572 root      20   0  7996  340  256 R   50  0.0   9:21.68 3 pert
...
marge:/cgroups/test # echo 7572 > tasks
marge:/cgroups/test # echo 4096 > cpu.shares

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 7572 root      20   0  7996  340  256 R   80  0.0  10:06.92 3 pert
 7573 root      20   0  7996  340  256 R   20  0.0   4:05.80 3 pert

When you move a task into a cgroup, it still has an autogroup

The user has to specifically ask for it in his config, can turn it on or

..it has a different mission, with different users being targeted, so

IMHO, cgroups should have been 'nice' from the start, but the folks who
wrote it did what they thought best.  I like nice a lot better than

Perhaps in future, they'll get married, and perhaps they should, but in
the here and now, I think they have similar but not identical missions.
If you turn on one, turn off the other.  Maybe that should be automated.

Systemd thingy may make autogroup short lived anyway.  I had a query
from an embedded guy (hm, which I spaced) suggesting autogroup may be
quite nice for handheld stuff though, so who knows.

	-Mike

--

From: Peter Zijlstra
Date: Wednesday, December 1, 2010 - 4:33 am

Agreed, but by the time I realized that the shares thing was already in
the wild. I did (probably still do) have a patch that adds a nice file
to the cgroup file.

Anyway, I think the whole proc/nice interface for autogroups is already
a tad too far. If people want control they can use cgroups, but I really
don't care enough to argue much about it.



--

From: Mike Galbraith
Date: Wednesday, December 1, 2010 - 4:55 am

Agreed.  I've no intention of expanding it.

	-Mike

--

From: Vivek Goyal
Date: Wednesday, December 1, 2010 - 7:55 am

On Wed, Dec 01, 2010 at 06:57:48AM +0100, Mike Galbraith wrote:


Ok, so I got confused with the fact that after moving a task into a
cgroup it is still associated with an autogroup.

So IIUC, if a task is in root cgroup, then it would not necessarily be driven
by cpu.shares of root cgroup (as task could be in its own autogroup). But
if I move the task into a non-root cgroup, then it will for sure be
subjected to rules imposed by non-root cgroup cpu.shares. That's not too
bad.

Thanks
Vivek
--

From: Mike Galbraith
Date: Wednesday, December 1, 2010 - 8:04 am

I think the normal case would be either one or the other being in use at
any given time, but yes, if both are active, that's how it'd work.

	-Mike

--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 12:54 am

Nope.   I did a bisection this morning in text mode with a pipe-test
based measurement proggy, and it bisected cleanly.

2069dd75c7d0f49355939e5586daf5a9ab216db7 is the first bad commit

commit 2069dd75c7d0f49355939e5586daf5a9ab216db7
Author: Peter Zijlstra <a.p.zijlstra@chello.nl>
Date:   Mon Nov 15 15:47:00 2010 -0800

    sched: Rewrite tg_shares_up)

	-Mike

--

From: Ingo Molnar
Date: Tuesday, November 30, 2010 - 7:18 am

Ok. And has this fixed it:

  822bc180a7f7: sched: Fix unregister_fair_sched_group()

... or are there two bugs?

Thanks,

	Ingo
--

From: Ingo Molnar
Date: Tuesday, November 30, 2010 - 7:53 am

another detail is that i needed this fix:

--- linux.orig/init/Kconfig
+++ linux/init/Kconfig
@@ -790,6 +790,7 @@ endif # NAMESPACES
 
 config SCHED_AUTOGROUP
 	bool "Automatic process group scheduling"
+	select EVENTFD
 	select CGROUPS
 	select CGROUP_SCHED
 	select FAIR_GROUP_SCHED

Because CGROUPS depends on eventfd infrastructure.

Thanks,

	Ingo
--

From: Peter Zijlstra
Date: Tuesday, November 30, 2010 - 8:01 am

Shouldn't then cgroups select that?
--

From: Ingo Molnar
Date: Tuesday, November 30, 2010 - 8:11 am

It depends on it so selecting it is fine. !EVENTFD is a CONFIG_EMBEDDED-only thing 
anyway.

Thanks,

	Ingo
--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 9:28 am

Two bugs.  822bc180a7f7 fixes the explosions that were happening in tip.
The interactivity issue is some problem in the update_shares() stuff.

	-Mike

--

From: Mike Galbraith
Date: Sunday, November 28, 2010 - 10:45 pm

I'll hunt as soon as I can (inbox runneth over).

	-Mike

--

From: Paul Turner
Date: Tuesday, November 30, 2010 - 8:39 pm

I'm looking at this.

The share:share ratios looked good in static testing, but perhaps we 
need a little more wake-up boost to improve interactivity.

Should have something tomorrow.



--

From: Paul Turner
Date: Tuesday, November 30, 2010 - 8:39 pm

I'm looking at this.

The share:share ratios looked good in static testing, but perhaps we 
need a little more wake-up boost to improve interactivity.

Should have something tomorrow.


--

From: Mike Galbraith
Date: Tuesday, November 30, 2010 - 11:16 pm

Yeah, feels like a wakeup issue.  I too did a (brief) static test, and
that looked ok.

	-Mike

--

From: Paul Turner
Date: Thursday, December 2, 2010 - 10:11 pm

Hey Mike,

Does something like the below help?

We're quick to drive the load_contribution up (to avoid over-commit). 
However on sleepy workloads this results in lots of weight being 
stranded (since it reaches maximum contribution instantaneously but 
decays slowly) as the thread migrates between cpus.

Avoid this by averaging "up" in the wake-up direction as well as the sleep.

We also get a boost from the fact that we use the instantaneous weight 
in computing the actual received shares.

I actually don't have a desktop setup handy to test "interactivity" (sad 
but true -- working on grabbing one).  But it looks better on under 
synthetic load.

- Paul

===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -743,12 +743,19 @@ static void update_cfs_load(struct cfs_r
                 return;

         now = rq_of(cfs_rq)->clock;
-       delta = now - cfs_rq->load_stamp;
+
+       if (likely(cfs_rq->load_stamp))
+               delta = now - cfs_rq->load_stamp;
+       else {
+               /* avoid large initial delta and initialize load_period */
+               delta = 1;
+               cfs_rq->load_stamp = 1;
+       }

         /* truncate load history at 4 idle periods */
         if (cfs_rq->load_stamp > cfs_rq->load_last &&
             now - cfs_rq->load_last > 4 * period) {
-               cfs_rq->load_period = 0;
+               cfs_rq->load_period = period/2;
                 cfs_rq->load_avg = 0;
         }
--

From: Mike Galbraith
Date: Thursday, December 2, 2010 - 11:48 pm

Unfortunately not.  For example, Xorg+mplayer needs (30 sec refresh)..

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 6487 root      20   0  366m  30m 5100 S   31  0.4   2:04.83 2 Xorg
 4454 root      20   0  318m  28m  15m S   29  0.4   0:38.06 3 mplayer

..but gets this when a heavy kbuild is running along with them.

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  P COMMAND
 6487 root      20   0  366m  30m 5136 S   12  0.4   2:25.98 1 Xorg
 5595 root      20   0  318m  28m  15m R    8  0.4   0:09.31 3 mplayer

There are 4 task groups active at this time, Xorg, mplayer, Amarok and
konsole where the kbuild is running make -j40.

	-Mike


--

From: Paul Turner
Date: Friday, December 3, 2010 - 1:37 am

Hmm.. unfortunate.  Ok -- based on the traces of synthetic loads and
the traces of their share on wake-up I think this is the right track
at least, will refine it tomorrow.

--

From: James Courtier-Dutton
Date: Saturday, December 4, 2010 - 4:55 pm

What tools are actually used to test "interactivity" ?
I posted a tool to the list some time ago, but I don't think anyone noticed.
My tool is very simple.
When you hold a key down, it should repeat. It should repeat at a
constant predictable interval.
So, my tool just waits for key presses and times when each one occurred.
The tester simply presses a key and holds it down.
If the time between each key press is constant, it indicates good
"interactivity". If the time between each key press varies a lot, it
indicates bad "interactivity".
You can reliably test if one kernel is better than the next using
actual measurable figures.

Kind Regards

James
--

From: Paul Turner
Date: Saturday, December 4, 2010 - 10:11 pm

On Sat, Dec 4, 2010 at 3:55 PM, James Courtier-Dutton

Could you drop me a pointer?  I can certainly give it a try.  It would
be extra useful if it included any histogram functionality.

I've been using a combination of various synthetic wakeup and load
scripts and measuring the received bandwidth / wakeup latency.

They have not succeeded in reproducing the starvation or poor latency
observed by Mike above however.  (Although I've pulled a box to try
reproducing his exact conditions [ e.g. user environment ] on Monday).
--

From: Paul Turner
Date: Tuesday, December 7, 2010 - 4:32 am

Desktop hardware came in today and I can now reproduce the issues
Mike's been seeing; tuning in progress.

--

From: Paul Turner
Date: Wednesday, December 15, 2010 - 5:10 am

This goose is now cooked.

It turns out the new shares code is doing the "right" thing, but in
the wrong order for tasks with very small time slices.  This made it
rather gnarly to track down since at all times the new evaluation /
the old evaluation / and an "OPT" evaluation were all in agreement!

As we hierarchically dequeue we now instantaneously adjust entity
weights to account for the new global state (good).  However, when we
then update on the parent (e.g. the group entity owning the
just-adjusted cfs_rq), the accrued unaccounted time is charged at the
new weight for that entity instead of the old.

For longer running processes, the periodic updates hide this.
However, for an interactive process, such as Xorg (which uses many
_small_ timeslices -- e.g. almost all accounting ends up being at
dequeue as opposed to periodic) this results in significant vruntime
over-charging and a loss of fairness.  In Xorg's case the loss of
fairness is compounded by the fact that there is only one runnable
thread means we transition between NICE_0_LOAD and MIN_SHARES for the
over-charging above.

This is fixed by charging the unaccounted time versus a group entity
before we manipulate its weight (as a result of child movement).

Thanks for your patience while I tracked this down.. it's been a few
sleepless nights while I cranked through a number of dead-end theories
(rather frustrating when the numbers are all right but the results are
all wrong! ;).  Cleaned up patch inbound in the morning.

- Paul

--

From: Peter Zijlstra
Date: Wednesday, December 1, 2010 - 4:34 am

Right, the previous thing cheated quite enormous with wakeups simply
because it was way too expensive to compute proper shares on wakeups.

Maybe we should re-instate some of that cheating.

--

Previous thread: [PATCH] Change return type of obvious Boolean list fns to "bool". by Robert P. J. Day on Sunday, November 21, 2010 - 6:35 am. (3 messages)

Next thread: [tip:perf/core] perf tools: Change my maintainer address by tip-bot for Arnaldo Carvalho de Melo on Sunday, November 21, 2010 - 6:43 am. (1 message)