Re: [git pull] scheduler updates for v2.6.24

To: Linus Torvalds <torvalds@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 15, 2007 - 10:17 am

Linus, please pull the latest scheduler git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched.git

It contains lots of scheduler updates from lots of people - hopefully
the last big one for quite some time. Most of the focus was on
performance (both micro-performance and scalability/balancing), but
there's the fair-scheduling feature now Kconfig selectable too. Find the
shortlog below.

Code that is touched outside of the scheduler: the KVM bits were acked
by Avi, the net/unix change is trivial and only affects sync wakeups,
ditto the fs/pipe.c changes - but i can push those separately if it
needs an ack from David first.

ABI/API changes:

- new CONFIG_FAIR_USER_SCHED and /sys/kernel/uids/ + uevent API.
- /proc/stat and /proc/<pid>/stat changes for guest-CPU usage [KVM]
- /proc/sched_debug formats changed/enhanced

Testing status: the changes are chronological and all the
interactivity-impacting changes are near the head of the queue and most
of them were done weeks ago, and were thus part of the CFS-v22 backport
series - which was tested by many people. There are no known regressions
at the moment. It's all fully bisectable.

Thanks,

Ingo

------------------>
Alexey Dobriyan (1):
sched: uninline scheduler

Andi Kleen (4):
sched: cleanup: remove unnecessary gotos
sched: cleanup: refactor common code of sleep_on / wait_for_completion
sched: cleanup: refactor normalize_rt_tasks
sched: remove stale comment from sched_group_set_shares()

Arjan van de Ven (1):
Make scheduler debug file operations const

Dhaval Giani (1):
sched: group scheduling, sysfs tunables

Dmitry Adamushko (14):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: optimize task_ne...

To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 6:04 am

How does this one compare to the v22 you released earlier ?

I'm thinking of backporting any fixes/optimizations to 2.6.22
(and possibly 2.6.23)

--
Thomas
-

To: Thomas Backlund <tmb@...>
Cc: <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 6:08 am

i have already backported it as v22.1 - will release it within a few
days. (once the currently open regressions have been fixed)

Ingo
-

To: Thomas Backlund <tmb@...>
Cc: <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 6:12 am

i've uploaded what i have at the moment, to:

http://people.redhat.com/mingo/cfs-scheduler/devel/sched-cfs-v2.6.23.1-v...

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 7:00 am

Big thanks for your work...

Now I just have to see if I can get it to work with the -hrt series and
I'm really happy ;-)

--
Thomas
-

To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 15, 2007 - 10:38 pm

Nice work...

However it's a pity all the balancing stuff got wildly changed
in 2.6.23 and then somewhat changed back again now.

Despite appearances, a lot of those things weren't actually
*completely* arbitrary values. I fear that it will make finding
performance regressions harder than it should have...

Anyway.

-

To: Linus Torvalds <torvalds@...>
Cc: <linux-kernel@...>, Andrew Morton <akpm@...>
Date: Monday, October 15, 2007 - 11:04 am

so i dropped them and re-pushed. New shortlog below.

Ingo

------------------>
Alexey Dobriyan (1):
sched: uninline scheduler

Andi Kleen (4):
sched: cleanup: remove unnecessary gotos
sched: cleanup: refactor common code of sleep_on / wait_for_completion
sched: cleanup: refactor normalize_rt_tasks
sched: remove stale comment from sched_group_set_shares()

Arjan van de Ven (1):
Make scheduler debug file operations const

Dhaval Giani (1):
sched: group scheduling, sysfs tunables

Dmitry Adamushko (14):
sched: clean up struct load_stat
sched: clean up schedstat block in dequeue_entity()
sched: sched_setscheduler() fix
sched: add set_curr_task() calls
sched: do not keep current in the tree and get rid of sched_entity::fair_key
sched: optimize task_new_fair()
sched: simplify sched_class::yield_task()
sched: rework enqueue/dequeue_entity() to get rid of set_curr_task()
sched: yield fix
sched: fix __pick_next_entity()
sched: tidy up SCHED_RR
sched: cleanup, remove calc_weighted()
sched: cleanup, make dequeue_entity() and update_stats_wait_end() similar
sched: fix group scheduling for SCHED_BATCH

Gautham R Shenoy (1):
sched: fix rt ptracer monopolizing CPU

Hiroshi Shimamoto (1):
sched: clean up sched_fork()

Ingo Molnar (71):
sched: fix sysctl_sched_child_runs_first flag
sched: resched task in task_new_fair()
sched: small sched_debug cleanup
sched: debug: track maximum 'slice'
sched: uniform tunings
sched: use constants if !CONFIG_SCHED_DEBUG
sched: remove stat_gran
sched: remove precise CPU load
sched: remove precise CPU load calculations #2
sched: track cfs_rq->curr on !group-scheduling too
sched: cleanup: simplify cfs_rq_curr() methods
sched: uninline __enqueue_entity()/__dequeue_entity()
sched: speed up update_load_add/_sub()
sc...

To: Ingo Molnar <mingo@...>
Cc: <torvalds@...>, <linux-kernel@...>
Date: Monday, October 15, 2007 - 2:35 pm

On Mon, 15 Oct 2007 16:17:23 +0200

Did Paul Jackson's crash get fixed?
-

To: Andrew Morton <akpm@...>
Cc: <torvalds@...>, <linux-kernel@...>
Date: Monday, October 15, 2007 - 2:53 pm

yes - that crash was a showstopper that was holding up the pull request
for 2 days. Paul bisected it down to the culprit and the fix was to do
this in wake_up_new_task():

- if (!p->sched_class->task_new || !current->se.on_rq) {
+ if (!p->sched_class->task_new || !current->se.on_rq || !rq->cfs.curr) {

(during early bootup the cfs_rq has no curr pointer yet.) It's not clear
why this race did not trigger earlier. (and the two checks can probably
be consolidated into a single "!rq->cfs.curr" condition.)
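
The patched condition can be modelled outside the kernel. This is only a rough sketch: the `mock_` types, `needs_plain_activate()` and `demo_guard()` are hypothetical names invented for illustration, and the structures are reduced to the three fields the `if` actually reads.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for the kernel structures. */
struct mock_se { int on_rq; };
struct mock_cfs_rq { struct mock_se *curr; };
struct mock_rq { struct mock_cfs_rq cfs; };

struct mock_task;
struct mock_sched_class {
    void (*task_new)(struct mock_rq *rq, struct mock_task *p);
};
struct mock_task {
    const struct mock_sched_class *sched_class;
    struct mock_se se;
};

/* Placeholder hook so that "class has a task_new method" can be modelled. */
static void mock_task_new(struct mock_rq *rq, struct mock_task *p)
{
    (void)rq;
    (void)p;
}

/*
 * Mirrors the patched condition: fall back to plain activation when the
 * class has no task_new hook, when the parent is not on the runqueue, or
 * when the runqueue's cfs_rq has no current entity yet.
 */
static bool needs_plain_activate(const struct mock_task *p,
                                 const struct mock_task *parent,
                                 const struct mock_rq *rq)
{
    assert(p && parent && rq && p->sched_class);
    return !p->sched_class->task_new ||
           !parent->se.on_rq ||
           !rq->cfs.curr;
}

/* Builds one scenario from three booleans and evaluates the guard. */
static bool demo_guard(bool has_hook, bool parent_on_rq, bool rq_has_curr)
{
    struct mock_sched_class cls = { .task_new = has_hook ? mock_task_new : NULL };
    struct mock_task parent = { .sched_class = &cls, .se = { .on_rq = parent_on_rq } };
    struct mock_task child = { .sched_class = &cls, .se = { .on_rq = 0 } };
    struct mock_rq rq = { .cfs = { .curr = rq_has_curr ? &parent.se : NULL } };

    return needs_plain_activate(&child, &parent, &rq);
}
```

With a hook present and the parent queued, `demo_guard(true, true, false)` still selects the plain path because `cfs.curr` is NULL, which is the case the added third test covers.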

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, <torvalds@...>, <linux-kernel@...>, Paul Jackson <pj@...>
Date: Tuesday, October 16, 2007 - 6:38 pm

an update on this issue:

in short, SD_BALANCE_FORK is required to trigger this problem, hence
only NUMA machines could have been affected by it (and only ia64 and
x86 have SD_BALANCE_FORK in SD_NODE_INIT).

more details:

it's perfectly legitimate for 'rq->cfs.curr' to be NULL in
task_new_fair() in the case when this_cpu != task_cpu(p) (where p is
the newly created task).

why this_cpu != task_cpu(p) :

do_fork() --> copy_process() --> sched_fork() -->
cpu = sched_balance_self(this_cpu, SD_BALANCE_FORK)

chose a different cpu for the new task and there is _no_
'class_sched_fair' task running on this cpu at the moment (that's why
rq->cfs.curr == NULL).
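
The scenario can be reproduced with a toy model. This is only a sketch under loud assumptions: `mock_pick_fork_cpu()` is a hypothetical "fewest runnable tasks" policy standing in for sched_balance_self(), not its real logic, and the `mock_` structures keep only the fields needed to show the NULL 'curr'.

```c
#include <assert.h>
#include <stddef.h>

#define MOCK_NR_CPUS 2

/* Hypothetical, simplified stand-ins for sched_entity and cfs_rq. */
struct mock_se { unsigned long long vruntime; };
struct mock_cfs_rq {
    struct mock_se *curr;     /* running fair entity, NULL if none */
    unsigned int nr_running;  /* runnable fair entities on this CPU */
};

/*
 * Hypothetical placement policy (NOT the kernel's sched_balance_self()):
 * pick the CPU with the fewest runnable fair tasks, lowest index on ties.
 */
static int mock_pick_fork_cpu(const struct mock_cfs_rq rqs[], int nr_cpus)
{
    int best = 0;

    assert(rqs != NULL && nr_cpus > 0);
    for (int cpu = 1; cpu < nr_cpus; cpu++)
        if (rqs[cpu].nr_running < rqs[best].nr_running)
            best = cpu;
    return best;
}

/* NULL-safe test: does the target runqueue have a current fair entity? */
static int mock_has_curr(const struct mock_cfs_rq *cfs_rq)
{
    return cfs_rq->curr != NULL;
}

/*
 * The parent runs on CPU 0; CPU 1 runs no fair task at all. Returns 1 when
 * the child lands on a different CPU whose cfs_rq has no current entity.
 */
static int demo_fork_lands_on_idle_cpu(void)
{
    struct mock_se parent = { .vruntime = 100 };
    struct mock_cfs_rq rqs[MOCK_NR_CPUS] = {
        { .curr = &parent, .nr_running = 1 },  /* this_cpu */
        { .curr = NULL,    .nr_running = 0 },  /* no fair task running */
    };
    int this_cpu = 0;
    int target = mock_pick_fork_cpu(rqs, MOCK_NR_CPUS);

    return target != this_cpu && !mock_has_curr(&rqs[target]);
}
```

Any code that dereferences the target runqueue's 'curr' in this state needs a check like `mock_has_curr()` first.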

[ thanks a lot to Paul for providing debugging information ]

btw., it's not the 'curr->vruntime < se->vruntime' part in
task_new_fair() that gave us the oops (it's only executed in the case
of this_cpu == task_cpu(p)) _but_ it's rather:

2 checks are required as 'current' and rq->cfs.curr are not the same :-)
It also should work if we just get rid of [*] or add an additional
(curr != NULL) check there.

just as an additional observation:

there are lots of per-cpu threads (like events/cpu, ksoftirqd/cpu,
etc.) being created on start-up (x NUMBER_OF_CPUS) and SD_BALANCE_FORK
(actually, sched_balance_self() from sched_fork()) is just overhead
in this case...
although sched_balance_self() is likely to account for only a minor
% of the time taken to create a new context, so optimizing it away
(esp. for some corner cases) won't noticeably improve the start-up time

--
Best regards,
Dmitry Adamushko
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, <torvalds@...>, <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 6:13 pm

Maybe not related to that, but my box now dies after this merge.

When I do not do much on the box I get maybe 6h of uptime; when doing some work (compiling etc.) it freezes at random.

I was finally able to capture the oops:

...

[15692.917111] BUG: unable to handle kernel NULL pointer dereference at virtual address 00000044
[15692.917159] printing eip:
[15692.917174] c0111f90
[15692.917185] *pde = 00000000
[15692.917200] Oops: 0000 [#1]
[15692.917208] PREEMPT SMP
[15692.917240] Modules linked in: fuse netconsole configfs pc87360 hwmon_vid eeprom adm1021 uhci_hcd sr_mod shpchp pci_hotplug ohci_hcd iTCO_wdt iTCO_vendor_support intel_agp i82860_edac i2c_i801 ehci_hcd usbcore edac_core cdrom agpgart 3c59x mii ext4dev jbd2 capability commoncap loop lp parport_pc parport evdev
[15692.917623] CPU: 0
[15692.917625] EIP: 0060:[<c0111f90>] Not tainted VLI
[15692.917629] EFLAGS: 00010046 (2.6.23-g65a6ec0d #330)
[15692.917661] EIP is at pick_next_task_fair+0x1f/0x2d
[15692.917672] eax: c150a7b8 ebx: 00000000 ecx: 00000000 edx: 00000000
[15692.917689] esi: c1507a48 edi: 00000000 ebp: 00eaaf7a esp: cb1fdf14
[15692.917701] ds: 007b es: 007b fs: 00d8 gs: 0000 ss: 0068
[15692.917715] Process sed (pid: 28999, ti=cb1fc000 task=cfdc3500 task.ti=cb1fc000)
[15692.917725] Stack: c02f8268 c02ef7b5 00000002 cb1fdf58 cb1fdf50 00000000 c0400f38 c0403780
[15692.917833] cfdc3500 cfdc3634 c150a780 00000000 c011a8e7 00000000 c1077aa0 000000ff
[15692.917942] 00000000 00000000 00000000 cb1fdf8c 00000010 cfdc3500 cb1fdf8c c011ace5
[15692.918048] Call Trace:
[15692.918072] [<c02ef7b5>] schedule+0x321/0x58f
[15692.918109] [<c011a8e7>] do_exit+0x293/0x6c6
[15692.918143] [<c011ace5>] do_exit+0x691/0x6c6
[15692.918169] [<c011ad87>] sys_exit_group+0x0/0xd
[15692.918195] [<c01026e6>] sysenter_past_esp+0x5f/0x85
[15692.918232] =======================
[15692.918244] Code: 8b 53 28 89 43 34 89 53 38 5b 5e c3 53 31 d2 83 78 40 00 74 20 83 c...

To: Gabriel C <nix.or.die@...>, Srivatsa Vaddagiri <vatsa@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <torvalds@...>, <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 7:31 pm

[ cc'ed Srivatsa ]

Gabriel, could you please post the disassembled code for pick_next_task_fair()?
(objdump -d kernel/sched.o and then search for pick_next_task_fair --
copy and paste)

anyway, my guess is that it's :

se = pick_next_entity(cfs_rq);
cfs_rq = group_cfs_rq(se);

'se' _happens_ to be NULL and group_cfs_rq(se) does se->my_q and
(according to my calculations) offset(my_q) == 68 (0x44) for an x86 32bit
system with CONFIG_SCHEDSTATS=n and CONFIG_FAIR_GROUP_SCHED=y
(according to the config).
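
The arithmetic behind that guess can be sketched as follows. The layout is an assumption made purely for illustration: `struct mock_sched_entity` is hypothetical, and its 0x44 bytes of padding stand in for whatever real fields precede my_q in that configuration. Only the base-plus-offset reasoning is the point.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical layout, NOT the real struct sched_entity: the padding models
 * the fields that precede my_q, and a 32-bit slot models the pointer on a
 * 32-bit x86 system.
 */
struct mock_sched_entity {
    uint8_t earlier_fields[0x44];
    uint32_t my_q;
};

/* Address touched when reading a member at member_offset from base. */
static uintptr_t fault_address(uintptr_t base, size_t member_offset)
{
    return base + member_offset;
}

/* Fault address produced by reading se->my_q when se is NULL (base 0). */
static uintptr_t null_my_q_fault_address(void)
{
    return fault_address(0, offsetof(struct mock_sched_entity, my_q));
}
```

Under this assumed layout `null_my_q_fault_address()` evaluates to 0x44 (68), the same value as the faulting virtual address in the oops.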

that might take place provided put_prev_task_fair() failed for some
reason to insert 'current' (or its corresponding group element) back
into the tree in schedule()... say, due to some inconsistency in
cfs_rq's data.

Srivatsa, that's somewhat similar to another issue that has been
posted earlier today (crash in put_prev_task_fair() -->
__enqueue_task() --> rb_insert_color()) that you are already aware of
... (/me will continue tomorrow).

--
Best regards,
Dmitry Adamushko
-

To: Dmitry Adamushko <dmitry.adamushko@...>
Cc: Srivatsa Vaddagiri <vatsa@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, <torvalds@...>, <linux-kernel@...>
Date: Tuesday, October 16, 2007 - 7:50 pm

Sure here it is :

00000e49 <pick_next_task_fair>:
     e49:   53                      push   %ebx
     e4a:   31 d2                   xor    %edx,%edx
     e4c:   83 78 40 00             cmpl   $0x0,0x40(%eax)
     e50:   74 20                   je     e72 <pick_next_task_fair+0x29>
     e52:   83 c0 38                add    $0x38,%eax
     e55:   8b 50 20                mov    0x20(%eax),%edx
     e58:   31 db                   xor    %ebx,%ebx
     e5a:   85 d2                   test   %edx,%edx
     e5c:   74 0a                   je     e68 <pick_next_task_fair+0x1f>
     e5e:   8d 5a f8                lea    -0x8(%edx),%ebx
     e61:   89 da                   mov    %ebx,%edx
     e63:   e8 a9 ff ff ff          call   e11 <set_next_entity>
     e68:   8b 43 44                mov    0x44(%ebx),%eax
     e6b:   85 c0                   test   %eax,%eax
     e6d:   75 e6                   jne    e55 <pick_next_task_fair+0xc>
     e6f:   8d 53 d0                lea    -0x30(%ebx),%edx
     e72:   89 d0                   mov    %edx,%eax
     e74:   5b                      pop    %ebx
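
The listing can be matched against the oops by hand. The oops reports EIP at pick_next_task_fair+0x1f, ebx 00000000 and a fault at virtual address 00000044. The sketch below just spells out that arithmetic; the helper names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* "symbol+0xOFF" from an oops: address of the instruction in the listing. */
static uint32_t insn_address(uint32_t symbol_start, uint32_t offset)
{
    assert(offset <= UINT32_MAX - symbol_start);  /* no wrap-around */
    return symbol_start + offset;
}

/* Effective address of a disp(%base) memory operand. */
static uint32_t effective_address(uint32_t base_reg, uint32_t disp)
{
    return base_reg + disp;
}

/* Listing starts at 0xe49 and the oops offset is 0x1f: faulting instruction. */
static uint32_t faulting_insn(void)
{
    return insn_address(0xe49, 0x1f);
}

/* That instruction reads 0x44(%ebx); the oops shows ebx == 0. */
static uint32_t faulting_data_address(void)
{
    return effective_address(0x0, 0x44);
}
```

`faulting_insn()` lands on e68, the `mov 0x44(%ebx),%eax`, and with ebx zero the load touches address 0x44. That matches the reported fault address and is reached through the `je e68` taken when %edx is zero.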
-
