Re: [git pull] scheduler changes for v2.6.26

Previous thread: [git pull] x86 changes for v2.6.26, part two by Ingo Molnar on Saturday, April 19, 2008 - 11:08 am. (1 message)

Next thread: Pull request for semaphore include changes by Matthew Wilcox on Saturday, April 19, 2008 - 11:40 am. (6 messages)
From: Ingo Molnar
Date: Saturday, April 19, 2008 - 11:13 am

Linus, please pull the latest scheduler git tree from:

   git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git for-linus

The cpumask changes (needed for 4096 CPU support) also touch the x86 
tree - the recent x86.git pull i sent will cleanly merge with this tree, 
in either order. The only real complication the 4K CPU related changes 
have are in the scheduler code - that's why they were merged and tested 
in this tree. They have wider API impact than the scheduler but those 
are straightforward/trivial. It was all tested in linux-next as well.

Thanks!

	Ingo

------------------>
Dhaval Giani (4):
      sched: cleanup cpuacct variable names
      sched: allow cpuacct stats to be reset
      sched: mix tasks and groups
      sched: allow the group scheduler to have multiple levels

Dmitry Adamushko (1):
      latencytop: optimize LT_BACKTRACEDEPTH loops a bit

Gregory Haskins (1):
      sched: fix cpus_allowed settings

Guillaume Chazarain (1):
      sched: fix rq->clock overflows detection with CONFIG_NO_HZ

Hidetoshi Seto (2):
      sched, cpuset: customize sched domains, docs
      sched, cpuset: customize sched domains, core

Ingo Molnar (12):
      sched: re-do "sched: fix fair sleepers"
      sched: make cpu_clock() globally synchronous
      sched: introduce SCHED_FEAT_SYNC_WAKEUPS, turn it off
      sched: feat affine wakeups
      sched: cache hot buddy
      sched: reenable sync wakeups
      sched: remove sysctl_sched_batch_wakeup_granularity
      time: add ns_to_ktime()
      sched: fix checks
      sched: add SCHED_FEAT_DEADLINE
      sched: features fix
      sched: build fix

Mike Travis (21):
      cpumask: add cpumask_scnprintf_len function
      x86: reduce memory and stack usage in intel_cacheinfo
      x86: oprofile: remove NR_CPUS arrays in arch/x86/oprofile/nmi_int.c
      asm-generic: add node_to_cpumask_ptr macro
      numa: move large array from stack to _initdata section
      cpumask: Cleanup more ...
From: Guillaume Chazarain
Date: Saturday, April 19, 2008 - 11:47 am

This one should have been dropped (per http://lkml.org/lkml/2008/1/25/413)
I don't like rq->clock relying on jiffies, and you already fixed the
original manifestation of the problem in:

commit 2997c8c4a0b179e8b834a7f30ba4323f2c60ccf4
Author: Ingo Molnar <mingo@elte.hu>
Date:   Fri Jan 11 13:35:54 2008 +0100

    block: fix blktrace timestamps

    David Dillow reported broken blktrace timestamps. The reason
    is cpu_clock() which is not a global time source.

    Fix bkltrace timestamps by using ktime_get() like the networking
    code does for packet timestamps. This also removes a whole lot
    of complexity from bkltrace.c and shrinks the code by 500 bytes:

       text    data     bss     dec     hex filename
       2888     124      44    3056     bf0 blktrace.o.before
       2390     116      44    2550     9f6 blktrace.o.after

    Signed-off-by: Ingo Molnar <mingo@elte.hu>
    Signed-off-by: Jens Axboe <jens.axboe@oracle.com>


-- 
Guillaume
--

From: Ingo Molnar
Date: Saturday, April 19, 2008 - 11:49 am

no, it was still needed on a testbox of Peter on nohz.

	Ingo
--

From: Peter Zijlstra
Date: Saturday, April 19, 2008 - 11:54 am

Yeah, I got a rq->clock rate of about 500ms every second. With this
patch I got the full 1000.

--

From: Frans Pop
Date: Saturday, April 19, 2008 - 12:47 pm

Has the regression [1] that was traced to this change been fixed now?

The regression was confirmed to be also present in sched-devel.git at the 
time of -rc8, so I'd expect it still to be valid unless of course the cause 
was found since then.

Cheers,
FJP

[1] http://bugzilla.kernel.org/show_bug.cgi?id=10428
--

From: Ingo Molnar
Date: Monday, April 21, 2008 - 5:39 am

Peter reported a similar interactivity problem to be gone, so lets 
re-try this again - the change makes sense and if you see problems it's 
hiding something we really want to fix. It would be nice if you could 
try sched-devel/latest because it has an improved ftrace "sched_switch" 
tracer where you can generate much longer traces of this incident. Try 
the new /debug/trace_entries runtime tunable.

	Ingo
--

From: Frans Pop
Date: Monday, April 21, 2008 - 9:31 am

IIRC he mentioned seeing something himself during 2.6.25, but we'd already 
established that was unrelated to what I was seeing as his workaround 
(booting with nohz) did not work for me.

I can again reproduce the issue with sched-devel/latest :-(
latencytop regularly reports 500-700 msec latencies for amarokapp again 
while compiling glibc.

I expect that means it's also back in mainline?
What would be preferred for bugzilla: reopening the existing report (#10428) 

I'll try to get the trace and will reply on the private thread we had.
I may need additional instructions though.

Cheers,
FJP
--

From: Ingo Molnar
Date: Monday, April 21, 2008 - 12:43 pm

you could also reply to this thread if you dont mind, so that others can 
chime in too.

the 700-800 msecs of delays you see are very "brutal" so there must be 
something fundamentally wrong going on here.

Could you first check (under sched-devel/latest) the quality of your 
sched-clock, via running this script:

   http://people.redhat.com/mingo/cfs-scheduler/tools/watch-rq-clock.sh

if you run it, it should output ~1000 msecs periods every second:

 europe:~> watch-rq-clock.sh
 1002.115042
 1005.509851
 1004.187275
 1004.409980
 1004.430264
 1004.445508

if it's way too 'slow', say it only 100 msecs per second, then the 
scheduler clock is mis-measuring time and what the scheduler thinks to 
be a 40 msecs delay might become a 400 msecs delay.

	Ingo
--

From: Mike Galbraith
Date: Tuesday, April 22, 2008 - 1:51 am

I'm seeing latency hits with 26.git, whereas 25 is hit free. 

   LatencyTOP version 0.3       (C) 2008 Intel Corporation

Cause                                                Maximum     Percentage
Scheduler: waiting for cpu                        436.0 msec         79.8 %
do_fork sys_vfork ptregscall_common                37.7 msec          1.5 %
blk_execute_rq scsi_execute scsi_execute_req sr_te 30.5 msec          0.2 %
blk_execute_rq scsi_execute scsi_execute_req sd_re 28.7 msec          0.6 %
msleep wakeup_rh uhci_rh_resume hcd_bus_resume gen 23.3 msec          0.3 %
blk_execute_rq scsi_execute scsi_execute_req scsi_ 23.0 msec          0.8 %
down tty_write vfs_write sys_write system_call_aft 21.9 msec          0.1 %
do_get_write_access journal_get_write_access __ext 21.8 msec          0.1 %
blk_execute_rq scsi_execute scsi_execute_req sr_te 16.5 msec          0.3 %
r_block_media_changed check_disk_change cdrom_open sr_block_open do_open




Process amarokapp (4645)
Scheduler: waiting for cpu                        436.0 msec         93.8 %
do_select core_sys_select sys_select system_call_a  4.8 msec          4.7 %

Erm, should my Q6600 emit such?

2.6.25:
51.743501
124.292008
59.719506
268.016760
64.004011
144.113851
87.900658
116.007257
72.004509

On 26.git, I get numbers like yours, but with occasional dips down to
~700, though the latency hits don't _seem_ to be synchronous with
watch-rq-clock.sh glitchies.

	-Mike

--

From: Ingo Molnar
Date: Tuesday, April 22, 2008 - 1:59 am

ok. I think what happens is that your broken sched-clock hid the real 
breakage. Lets try fix the real breakage now.

I've uploaded a new sched-devel.git that is against very latest -git, 
could you try ftrace (with a sufficiently large 
/debug/tracing/tracing_max_entries value) - perhaps the 
worst-case-wakeup-latency tracer shows large latencies? If not, then 
maybe the sched_switch tracer gives a better insight into what's 

on nohz it could happen - and fixed in -git. Patch looked too dangerous 

hm, the dips shouldnt be happening normally.

	Ingo
--

From: Peter Zijlstra
Date: Tuesday, April 22, 2008 - 2:59 am

Also, it only happens when the cpu has idle time; and that typically
happens when its well,. idle - so not much to schedule wrong.


Agreed, those ought to be gone.. Guillaume do you see any holes in the
current rq clock code?

--

From: Mike Galbraith
Date: Tuesday, April 22, 2008 - 5:25 am

Well, looks like this is going to have to wait.  sched-devel is locking
hard on me, both sched-devel and .git are losing my darn keyboard right
in the middle of things, and my serial ports are (still) so screwed up I
can't get minicom working right.  Arg.

	-Mike

--

From: Mike Galbraith
Date: Wednesday, April 23, 2008 - 1:11 am

I dug my old p3/500 out and hooked it up since I still can't get minicom
functional plugging P4->Q6600, but no joy.  Whatever the heck is going
wrong in sched-devel is so catastrophic that not one character hits the
console when it locks, and nmi_watchdog=1 does nada.

FWIW, all I am doing is listening to music in amarok while a make -j4
glibc build is running, watching latencytop and watch-rq-clock.sh (nice
-15).  Latency hits are smaller, but still present without NO_HZ btw.
The largest I ever saw in latencytop before death arrived was < 300ms.
(still size mondo)

config attached.

	-Mike
From: Mike Galbraith
Date: Wednesday, April 23, 2008 - 3:30 am

I built x86/latest to see if it locked, and while testing to see if
minicom was really really working properly on p3/500, I created an
explosion.  I see in the log that clockevents griped during boot, and
when I later poked SysRq-W, box blew to pieces.

[    0.000000] Linux version 2.6.26-devel-smp (root@marge) (gcc version 4.2.1 (SUSE Linux)) #1 SMP Wed Apr 23 11:16:47 CEST 2008
[    0.000000] Command line: root=/dev/disk/by-id/scsi-SATA_ST3400620AS_6QG2LA0M-part3 vga=0x314 resume=/dev/sdb2 splash=silent console=ttyS0,115200n8 console=tty nmi_watchdog=1
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009e400 (usable)
[    0.000000]  BIOS-e820: 000000000009e400 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 00000000bfe90000 (usable)
[    0.000000]  BIOS-e820: 00000000bfe90000 - 00000000bfee3000 (ACPI NVS)
[    0.000000]  BIOS-e820: 00000000bfee3000 - 00000000bfef0000 (ACPI data)
[    0.000000]  BIOS-e820: 00000000bfef0000 - 00000000bff00000 (reserved)
[    0.000000]  BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
[    0.000000]  BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
[    0.000000] max_pfn_mapped = 1048576
[    0.000000] x86: PAT support disabled.
[    0.000000] init_memory_mapping
[    0.000000] DMI 2.5 present.
[    0.000000] ACPI: RSDP 000F7F20, 0014 (r0 MEDION)
[    0.000000] ACPI: RSDT BFEE3000, 003C (r1 MEDION MEDIONAG 42302E31 AWRD        0)
[    0.000000] ACPI: FACP BFEE3080, 0074 (r1 MEDION MEDIONAG 42302E31 AWRD        0)
[    0.000000] ACPI: DSDT BFEE3100, 4243 (r1 MEDION AWRDACPI     1000 MSFT  3000000)
[    0.000000] ACPI: FACS BFE90000, 0040
[    0.000000] ACPI: HPET BFEE75C0, 0038 (r1 MEDION MEDIONAG 42302E31 AWRD       98)
[    0.000000] ACPI: MCFG BFEE7600, 003C (r1 MEDION MEDIONAG 42302E31 AWRD        0)
[    0.000000] ACPI: SLIC BFEE7380, 0176 (r1 MEDION MEDIONAG 42302E31 ...
From: Peter Zijlstra
Date: Wednesday, April 23, 2008 - 4:21 am

Sorry about that :-/

this should fix it:

---
Subject: sched: fix oops

sched_debug uses SEQ_printf to use printk when the seqfile 'm' is NULL.
Instead of doing that here too; choose to not output the weight tree
to cut back on output.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 89fa32b..353a481 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1643,8 +1643,10 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
 	for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
 		print_cfs_rq(m, cpu, cfs_rq);
 
-	seq_printf(m, "\nWeight tree:\n");
-	print_cfs_rq_tasks(m, &cpu_rq(cpu)->cfs, 1);
+	if (m) {
+		seq_printf(m, "\nWeight tree:\n");
+		print_cfs_rq_tasks(m, &cpu_rq(cpu)->cfs, 1);
+	}
 	rcu_read_unlock();
 }
 #endif


--

From: Mike Galbraith
Date: Wednesday, April 23, 2008 - 6:27 am

Did.  But...

Continuing testing, x86/latest doesn't seem to hard lock (though i'm
going to run a while longer to be sure).  However, it does do something
else. watch-rq-clock.sh is triggering an occasional oops.  After the
oops, instead of heading off to bit heaven, grep hangs around eating
100% cpu, _possibly_ until it's parent is killed.  I say possibly,
because that's what it looked like the first time.  Also, amarok
suddenly ceased playing and exited without a peep... weird.  I restarted
amarok, kept testing and repeated the oops, but just as luck would have
it, my keyboard went away again, so I had to reboot before I was able to
see if killing watch-rq-clock.sh would kill the runaway grep again, or
if amarok was going to go awol again.

[  399.477589] BUG: unable to handle kernel NULL pointer dereference at 0000000000000418
[  399.481049] IP: [<ffffffff802296d2>] calc_delta_weight+0x17/0x37
[  399.481049] PGD 7b0ba067 PUD 7b1a2067 PMD 0 
[  399.481049] Oops: 0000 [1] SMP 
[  399.481049] CPU 2 
[  399.481049] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace ipv6 cpufreq_powersave acpi_cpufreq microcode nls_iso8859_1 nls_cp437 vfat fat fuse loop dm_mod tda1004x saa7134_dvb videobuf_dvb dvb_core tuner tea5767 tda8290 tda827x tuner_xc2028 tda9887 tuner_simple mt20xx tea5761 saa7134 snd_hda_intel compat_ioctl32 videodev v4l1_compat v4l2_common snd_pcm firewire_ohci videobuf_dma_sg firewire_core snd_timer videobuf_core ir_kbd_i2c usbhid snd_page_alloc ir_common crc_itu_t snd_hwdep usb_storage snd tveeprom e1000e ohci1394 hid iTCO_wdt rtc_cmos i2c_i801 ieee1394 iTCO_vendor_support intel_agp rtc_core sr_mod i2c_core rtc_lib button serio_raw ff_memless sg cdrom soundcore ehci_hcd uhci_hcd ...
From: Kevin Winchester
Date: Tuesday, April 22, 2008 - 2:41 am

Is this supposed to be true for everyone?

kevin@alekhine:~/linux$ uname -a
Linux alekhine 2.6.25-02519-g3925e6f #10 PREEMPT Sat Apr 19 12:36:49 ADT 2008 x86_64 GNU/Linux

kevin@alekhine:~/linux$ ./watch-rq-clock.sh
89.986517
81.033471
76.942776
90.986318
75.988551
85.987089
74.988696
85.987078
73.988858
88.986641
68.989600

kevin@alekhine:~/linux$ cat /proc/cpuinfo
processor       : 0
vendor_id       : AuthenticAMD
cpu family      : 15
model           : 31
model name      : AMD Athlon(tm) 64 Processor 3000+
stepping        : 0
cpu MHz         : 1000.000
cache size      : 512 KB
fpu             : yes
fpu_exception   : yes
cpuid level     : 1
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov
pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow
rep_good lahf_lm
bogomips        : 2006.79
TLB size        : 1024 4K pages
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management: ts fid vid ttp

Does that mean anything?  Is there any other testing I should perform?

-- 
Kevin Winchester
--

From: David Miller
Date: Tuesday, April 22, 2008 - 3:49 am

From: Kevin Winchester <kjwinchester@gmail.com>

The results on my 128-cpu Niagara2 box are even more interesting:

davem@maramba:~$ /bin/bash ./watch-rq-clock.sh 
0
0
0
0
0
0
0
0
0
0
0
0
0
...

I guess this script doesn't work correctly when the cpu
whose clock value it greps out of /proc/sched_debug is
in NOHZ mode?
--

From: Peter Zijlstra
Date: Tuesday, April 22, 2008 - 4:10 am

Yeah - looking at the script it seems to look at the last one, so if
indeed that cpu is fully idle its rq clock will be stalled.

The fix that went in right after .25 was that when it came out of nohz
mode rq->clock could catch up 1 jiffy even though it had been out much
longer.

So the interesting thing to know is whether rq->clock properly accounts
for all idle time when the cpu leaves idle mode.


--

From: Frans Pop
Date: Wednesday, April 23, 2008 - 2:36 am

(Dropping Rafael, Linus and Andrew from CC.)


OK. I must admit I was a bit surprised that things were taken private last=
=20

My clock looks OK:
1009.609822
1012.920151
1015.125826
1009.978393
1011.318990
1010.932234
1012.080720
1009.895988
1013.510765
1009.280119
1009.758037
1013.955419
1008.747707
1014.474823

Before I get to some facts from testing, first a possible tracer bug.

Trace settings:
# cat available_tracers
mmiotrace wakeup sched_switch none
# cat current_tracer
sched_switch
# cat iter_ctrl
print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock=20
nostacktrace nosched-tree
# cat trace_entries
50005

If I enable tracing with these settings, I get data in both trace *and* in=
=20
latency_trace. Is the last correct? From http://lkml.org/lkml/2008/2/10/43=
=20
I got the impression that file should only be used if the "wakeup" tracer=20
is active.


Right, finally to the traces and some background info.

I've run three 20 minute tests. For each I've counted the clearly audible=20
skips in music:
1) build without tracing, console with build running active: 17 skips
2) build without tracing, build console not visible: 16 skips
3) build with tracing, build console not visible: 11 skips

With the first build processor usage varies between 60 and 75% (over two=20
cores); with the others its fairly constant at 50% so 1 core in full use=20
and the other basically idle. All three builds do reach the same point in=20
the about the same time though, so the extra processor usage for the first=
=20
build is apparently just display update overhead.

The number of skips seems fairly constant. They don't seem to happen at=20
exactly the same points, but there are clearly points where they are more=20
likely. Most striking is that in all three cases I had a series of 4-6=20
skips at ~17-18 minutes into the build.

=46rom the last run I've got three traces with 50000 entries (about a minut=
e=20
worth each). Traces 1 and 2 should ...
From: Peter Zijlstra
Date: Wednesday, April 23, 2008 - 2:42 am

Yes, but those traces were << 1s and didn't appear to contain the jumps.

--

From: Ingo Molnar
Date: Wednesday, April 23, 2008 - 2:46 am

yeah, that's the intended behavior. 'trace' and 'latency_trace' are two 
kinds of 'views' on the same set of tracer data (with a different output 
format). They are both non-destructive - i.e. they dont deplete the 
trace buffers.

there's even a third view: 'trace_pipe' - which depletes the trace 
buffers and can be used to pipe trace data into a larger file:

    cat trace_pipe > /tmp/large-trace.txt

i guess it would be less confusing if 'trace' and 'latency_trace' was 
just a single 'view' and its behavior depended on iter_ctl settings. (In 
fact, we could probably just get rid of latency_trace - it is a leftover 


hm, i looked at them - i thought i mailed you about that but indeed i 
didnt (sorry about that) - the traces were too short in scope - they 
just covered a few dozen milliseconds in scope, while your description 

no need to rerun - but if you do the next test it might make sense to 
disable it, just to reduce the number of variables.

another thing - Peter observed skipping on NOHZ. Does your skipping go 
away if you boot with 'nohz' and/or with CONFIG_NOHZ=y disabled in the 
.config?

	Ingo
--

From: Frans Pop
Date: Wednesday, April 23, 2008 - 6:36 am

Ah, I was wondering about that. Sorry about the mistake, but if you'd let me 
know I could have run new traces.

If there's anything wrong with the current traces, please _do_ let me know. 
Problem is that I just see a huge pile of data without really knowing what 
I'm looking at. For example, I've no idea what the time units are. If I've 
guessed those wrong, it's not surprising that you don't see anything as I'm 
then basically stopping the trace too late or including too few entries.

What would be a good value for trace_entries to get a decent time interval?

If there's more documentation available about what's in those files, please 
let me know.

Cheers,
FJP

P.S. It would probably be good to just mention the units in the headers of 
the debug files. It may be obvious to you, but that way it's obvious even 
to morons like me :-P
--

From: Ingo Molnar
Date: Tuesday, April 29, 2008 - 5:29 am

yeah, agreed. We did that in -rt's /proc/latency_trace but it isnt there 
in ftrace. Steve?

	Ingo
--

From: Steven Rostedt
Date: Tuesday, April 29, 2008 - 8:28 am

Yep, I'll make a patch.

-- Steve

--

From: Frans Pop
Date: Wednesday, April 23, 2008 - 9:23 am

Just realized that I forgot to reply to this...

Already tried that with .24 (<200804111718.56958.elendil@planet.nl>):

! On Friday 11 April 2008, Peter Zijlstra wrote:
! > I'm having some issue on my own desktop - I get occasional latencies of
! > up to 1 second! It seems to be related to NO_HZ, because when I run a
! > simple while loop to keep the cpu busy, or boot with nohz=off it
! > doesn't seem to happen.
! >
! > Could you try that, just to see if we're seeing the same thing?
!
! Booted with nohz=off, but it does not seem to make any difference for me.
! I still get the music skips and chew still gives the high latencies.
!
! $ ./chew2
! pid 31568 preempted 4022 us after 1376 us
! pid 31568 preempted 6303 us after 6177 us
! pid 31568 preempted 8836 us after 7963 us
! pid 31568 preempted 13134 us after 7903 us
! pid 31568 preempted 20533 us after 9496 us
! pid 31568 preempted 22957 us after 8556 us
! pid 31568 preempted 28693 us after 9518 us
! pid 31568 preempted 251718 us after 8485 us
--

Previous thread: [git pull] x86 changes for v2.6.26, part two by Ingo Molnar on Saturday, April 19, 2008 - 11:08 am. (1 message)

Next thread: Pull request for semaphore include changes by Matthew Wilcox on Saturday, April 19, 2008 - 11:40 am. (6 messages)