Linus, please pull the latest scheduler git tree from:
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git for-linus
The cpumask changes (needed for 4096 CPU support) also touch the x86
tree - the recent x86.git pull i sent will cleanly merge with this tree,
in either order. The only real complication the 4K CPU related changes
have are in the scheduler code - that's why they were merged and tested
in this tree. They have wider API impact than the scheduler but those
are straightforward/trivial. It was all tested in linux-next as well.
Thanks!
Ingo
------------------>
Dhaval Giani (4):
sched: cleanup cpuacct variable names
sched: allow cpuacct stats to be reset
sched: mix tasks and groups
sched: allow the group scheduler to have multiple levels
Dmitry Adamushko (1):
latencytop: optimize LT_BACKTRACEDEPTH loops a bit
Gregory Haskins (1):
sched: fix cpus_allowed settings
Guillaume Chazarain (1):
sched: fix rq->clock overflows detection with CONFIG_NO_HZ
Hidetoshi Seto (2):
sched, cpuset: customize sched domains, docs
sched, cpuset: customize sched domains, core
Ingo Molnar (12):
sched: re-do "sched: fix fair sleepers"
sched: make cpu_clock() globally synchronous
sched: introduce SCHED_FEAT_SYNC_WAKEUPS, turn it off
sched: feat affine wakeups
sched: cache hot buddy
sched: reenable sync wakeups
sched: remove sysctl_sched_batch_wakeup_granularity
time: add ns_to_ktime()
sched: fix checks
sched: add SCHED_FEAT_DEADLINE
sched: features fix
sched: build fix
Mike Travis (21):
cpumask: add cpumask_scnprintf_len function
x86: reduce memory and stack usage in intel_cacheinfo
x86: oprofile: remove NR_CPUS arrays in arch/x86/oprofile/nmi_int.c
asm-generic: add node_to_cpumask_ptr macro
numa: move large array from stack to _initdata section
cpumask: Cleanup more ...This one should have been dropped (per http://lkml.org/lkml/2008/1/25/413) I don't like rq->clock relying on jiffies, and you already fixed the original manifestation of the problem in: commit 2997c8c4a0b179e8b834a7f30ba4323f2c60ccf4 Author: Ingo Molnar <mingo@elte.hu> Date: Fri Jan 11 13:35:54 2008 +0100 block: fix blktrace timestamps David Dillow reported broken blktrace timestamps. The reason is cpu_clock() which is not a global time source. Fix bkltrace timestamps by using ktime_get() like the networking code does for packet timestamps. This also removes a whole lot of complexity from bkltrace.c and shrinks the code by 500 bytes: text data bss dec hex filename 2888 124 44 3056 bf0 blktrace.o.before 2390 116 44 2550 9f6 blktrace.o.after Signed-off-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Jens Axboe <jens.axboe@oracle.com> -- Guillaume --
no, it was still needed on a testbox of Peter on nohz. Ingo --
Yeah, I got a rq->clock rate of about 500ms every second. With this patch I got the full 1000. --
Has the regression [1] that was traced to this change been fixed now? The regression was confirmed to be also present in sched-devel.git at the time of -rc8, so I'd expect it still to be valid unless of course the cause was found since then. Cheers, FJP [1] http://bugzilla.kernel.org/show_bug.cgi?id=10428 --
Peter reported a similar interactivity problem to be gone, so lets re-try this again - the change makes sense and if you see problems it's hiding something we really want to fix. It would be nice if you could try sched-devel/latest because it has an improved ftrace "sched_switch" tracer where you can generate much longer traces of this incident. Try the new /debug/trace_entries runtime tunable. Ingo --
IIRC he mentioned seeing something himself during 2.6.25, but we'd already established that was unrelated to what I was seeing as his workaround (booting with nohz) did not work for me. I can again reproduce the issue with sched-devel/latest :-( latencytop regularly reports 500-700 msec latencies for amarokapp again while compiling glibc. I expect that means it's also back in mainline? What would be preferred for bugzilla: reopening the existing report (#10428) I'll try to get the trace and will reply on the private thread we had. I may need additional instructions though. Cheers, FJP --
you could also reply to this thread if you dont mind, so that others can chime in too. the 700-800 msecs of delays you see are very "brutal" so there must be something fundamentally wrong going on here. Could you first check (under sched-devel/latest) the quality of your sched-clock, via running this script: http://people.redhat.com/mingo/cfs-scheduler/tools/watch-rq-clock.sh if you run it, it should output ~1000 msecs periods every second: europe:~> watch-rq-clock.sh 1002.115042 1005.509851 1004.187275 1004.409980 1004.430264 1004.445508 if it's way too 'slow', say it only 100 msecs per second, then the scheduler clock is mis-measuring time and what the scheduler thinks to be a 40 msecs delay might become a 400 msecs delay. Ingo --
I'm seeing latency hits with 26.git, whereas 25 is hit free. LatencyTOP version 0.3 (C) 2008 Intel Corporation Cause Maximum Percentage Scheduler: waiting for cpu 436.0 msec 79.8 % do_fork sys_vfork ptregscall_common 37.7 msec 1.5 % blk_execute_rq scsi_execute scsi_execute_req sr_te 30.5 msec 0.2 % blk_execute_rq scsi_execute scsi_execute_req sd_re 28.7 msec 0.6 % msleep wakeup_rh uhci_rh_resume hcd_bus_resume gen 23.3 msec 0.3 % blk_execute_rq scsi_execute scsi_execute_req scsi_ 23.0 msec 0.8 % down tty_write vfs_write sys_write system_call_aft 21.9 msec 0.1 % do_get_write_access journal_get_write_access __ext 21.8 msec 0.1 % blk_execute_rq scsi_execute scsi_execute_req sr_te 16.5 msec 0.3 % r_block_media_changed check_disk_change cdrom_open sr_block_open do_open Process amarokapp (4645) Scheduler: waiting for cpu 436.0 msec 93.8 % do_select core_sys_select sys_select system_call_a 4.8 msec 4.7 % Erm, should my Q6600 emit such? 2.6.25: 51.743501 124.292008 59.719506 268.016760 64.004011 144.113851 87.900658 116.007257 72.004509 On 26.git, I get numbers like yours, but with occasional dips down to ~700, though the latency hits don't _seem_ to be synchronous with watch-rq-clock.sh glitchies. -Mike --
ok. I think what happens is that your broken sched-clock hid the real breakage. Lets try fix the real breakage now. I've uploaded a new sched-devel.git that is against very latest -git, could you try ftrace (with a sufficiently large /debug/tracing/tracing_max_entries value) - perhaps the worst-case-wakeup-latency tracer shows large latencies? If not, then maybe the sched_switch tracer gives a better insight into what's on nohz it could happen - and fixed in -git. Patch looked too dangerous hm, the dips shouldnt be happening normally. Ingo --
Also, it only happens when the cpu has idle time; and that typically happens when its well,. idle - so not much to schedule wrong. Agreed, those ought to be gone.. Guillaume do you see any holes in the current rq clock code? --
Well, looks like this is going to have to wait. sched-devel is locking hard on me, both sched-devel and .git are losing my darn keyboard right in the middle of things, and my serial ports are (still) so screwed up I can't get minicom working right. Arg. -Mike --
I dug my old p3/500 out and hooked it up since I still can't get minicom functional plugging P4->Q6600, but no joy. Whatever the heck is going wrong in sched-devel is so catastrophic that not one character hits the console when it locks, and nmi_watchdog=1 does nada. FWIW, all I am doing is listening to music in amarok while a make -j4 glibc build is running, watching latencytop and watch-rq-clock.sh (nice -15). Latency hits are smaller, but still present without NO_HZ btw. The largest I ever saw in latencytop before death arrived was < 300ms. (still size mondo) config attached. -Mike
I built x86/latest to see if it locked, and while testing to see if minicom was really really working properly on p3/500, I created an explosion. I see in the log that clockevents griped during boot, and when I later poked SysRq-W, box blew to pieces. [ 0.000000] Linux version 2.6.26-devel-smp (root@marge) (gcc version 4.2.1 (SUSE Linux)) #1 SMP Wed Apr 23 11:16:47 CEST 2008 [ 0.000000] Command line: root=/dev/disk/by-id/scsi-SATA_ST3400620AS_6QG2LA0M-part3 vga=0x314 resume=/dev/sdb2 splash=silent console=ttyS0,115200n8 console=tty nmi_watchdog=1 [ 0.000000] BIOS-provided physical RAM map: [ 0.000000] BIOS-e820: 0000000000000000 - 000000000009e400 (usable) [ 0.000000] BIOS-e820: 000000000009e400 - 00000000000a0000 (reserved) [ 0.000000] BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved) [ 0.000000] BIOS-e820: 0000000000100000 - 00000000bfe90000 (usable) [ 0.000000] BIOS-e820: 00000000bfe90000 - 00000000bfee3000 (ACPI NVS) [ 0.000000] BIOS-e820: 00000000bfee3000 - 00000000bfef0000 (ACPI data) [ 0.000000] BIOS-e820: 00000000bfef0000 - 00000000bff00000 (reserved) [ 0.000000] BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved) [ 0.000000] BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved) [ 0.000000] max_pfn_mapped = 1048576 [ 0.000000] x86: PAT support disabled. [ 0.000000] init_memory_mapping [ 0.000000] DMI 2.5 present. [ 0.000000] ACPI: RSDP 000F7F20, 0014 (r0 MEDION) [ 0.000000] ACPI: RSDT BFEE3000, 003C (r1 MEDION MEDIONAG 42302E31 AWRD 0) [ 0.000000] ACPI: FACP BFEE3080, 0074 (r1 MEDION MEDIONAG 42302E31 AWRD 0) [ 0.000000] ACPI: DSDT BFEE3100, 4243 (r1 MEDION AWRDACPI 1000 MSFT 3000000) [ 0.000000] ACPI: FACS BFE90000, 0040 [ 0.000000] ACPI: HPET BFEE75C0, 0038 (r1 MEDION MEDIONAG 42302E31 AWRD 98) [ 0.000000] ACPI: MCFG BFEE7600, 003C (r1 MEDION MEDIONAG 42302E31 AWRD 0) [ 0.000000] ACPI: SLIC BFEE7380, 0176 (r1 MEDION MEDIONAG 42302E31 ...
Sorry about that :-/
this should fix it:
---
Subject: sched: fix oops
sched_debug uses SEQ_printf to use printk when the seqfile 'm' is NULL.
Instead of doing that here too; choose to not output the weight tree
to cut back on output.
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index 89fa32b..353a481 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -1643,8 +1643,10 @@ static void print_cfs_stats(struct seq_file *m, int cpu)
for_each_leaf_cfs_rq(cpu_rq(cpu), cfs_rq)
print_cfs_rq(m, cpu, cfs_rq);
- seq_printf(m, "\nWeight tree:\n");
- print_cfs_rq_tasks(m, &cpu_rq(cpu)->cfs, 1);
+ if (m) {
+ seq_printf(m, "\nWeight tree:\n");
+ print_cfs_rq_tasks(m, &cpu_rq(cpu)->cfs, 1);
+ }
rcu_read_unlock();
}
#endif
--
Did. But... Continuing testing, x86/latest doesn't seem to hard lock (though i'm going to run a while longer to be sure). However, it does do something else. watch-rq-clock.sh is triggering an occasional oops. After the oops, instead of heading off to bit heaven, grep hangs around eating 100% cpu, _possibly_ until it's parent is killed. I say possibly, because that's what it looked like the first time. Also, amarok suddenly ceased playing and exited without a peep... weird. I restarted amarok, kept testing and repeated the oops, but just as luck would have it, my keyboard went away again, so I had to reboot before I was able to see if killing watch-rq-clock.sh would kill the runaway grep again, or if amarok was going to go awol again. [ 399.477589] BUG: unable to handle kernel NULL pointer dereference at 0000000000000418 [ 399.481049] IP: [<ffffffff802296d2>] calc_delta_weight+0x17/0x37 [ 399.481049] PGD 7b0ba067 PUD 7b1a2067 PMD 0 [ 399.481049] Oops: 0000 [1] SMP [ 399.481049] CPU 2 [ 399.481049] Modules linked in: ip6t_LOG nf_conntrack_ipv6 xt_pkttype ipt_LOG xt_limit af_packet snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device ip6t_REJECT xt_tcpudp ipt_REJECT xt_state iptable_mangle iptable_nat nf_nat iptable_filter ip6table_mangle nf_conntrack_ipv4 nf_conntrack ip_tables ip6table_filter ip6_tables x_tables cpufreq_conservative cpufreq_userspace ipv6 cpufreq_powersave acpi_cpufreq microcode nls_iso8859_1 nls_cp437 vfat fat fuse loop dm_mod tda1004x saa7134_dvb videobuf_dvb dvb_core tuner tea5767 tda8290 tda827x tuner_xc2028 tda9887 tuner_simple mt20xx tea5761 saa7134 snd_hda_intel compat_ioctl32 videodev v4l1_compat v4l2_common snd_pcm firewire_ohci videobuf_dma_sg firewire_core snd_timer videobuf_core ir_kbd_i2c usbhid snd_page_alloc ir_common crc_itu_t snd_hwdep usb_storage snd tveeprom e1000e ohci1394 hid iTCO_wdt rtc_cmos i2c_i801 ieee1394 iTCO_vendor_support intel_agp rtc_core sr_mod i2c_core rtc_lib button serio_raw ff_memless sg cdrom soundcore ehci_hcd uhci_hcd ...
Is this supposed to be true for everyone? kevin@alekhine:~/linux$ uname -a Linux alekhine 2.6.25-02519-g3925e6f #10 PREEMPT Sat Apr 19 12:36:49 ADT 2008 x86_64 GNU/Linux kevin@alekhine:~/linux$ ./watch-rq-clock.sh 89.986517 81.033471 76.942776 90.986318 75.988551 85.987089 74.988696 85.987078 73.988858 88.986641 68.989600 kevin@alekhine:~/linux$ cat /proc/cpuinfo processor : 0 vendor_id : AuthenticAMD cpu family : 15 model : 31 model name : AMD Athlon(tm) 64 Processor 3000+ stepping : 0 cpu MHz : 1000.000 cache size : 512 KB fpu : yes fpu_exception : yes cpuid level : 1 wp : yes flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext fxsr_opt lm 3dnowext 3dnow rep_good lahf_lm bogomips : 2006.79 TLB size : 1024 4K pages clflush size : 64 cache_alignment : 64 address sizes : 40 bits physical, 48 bits virtual power management: ts fid vid ttp Does that mean anything? Is there any other testing I should perform? -- Kevin Winchester --
From: Kevin Winchester <kjwinchester@gmail.com> The results on my 128-cpu Niagara2 box are even more interesting: davem@maramba:~$ /bin/bash ./watch-rq-clock.sh 0 0 0 0 0 0 0 0 0 0 0 0 0 ... I guess this script doesn't work correctly when the cpu whose clock value it greps out of /proc/sched_debug is in NOHZ mode? --
Yeah - looking at the script it seems to look at the last one, so if indeed that cpu is fully idle its rq clock will be stalled. The fix that went in right after .25 was that when it came out of nohz mode rq->clock could catch up 1 jiffy even though it had been out much longer. So the interesting thing to know is whether rq->clock properly accounts for all idle time when the cpu leaves idle mode. --
(Dropping Rafael, Linus and Andrew from CC.) OK. I must admit I was a bit surprised that things were taken private last= =20 My clock looks OK: 1009.609822 1012.920151 1015.125826 1009.978393 1011.318990 1010.932234 1012.080720 1009.895988 1013.510765 1009.280119 1009.758037 1013.955419 1008.747707 1014.474823 Before I get to some facts from testing, first a possible tracer bug. Trace settings: # cat available_tracers mmiotrace wakeup sched_switch none # cat current_tracer sched_switch # cat iter_ctrl print-parent nosym-offset nosym-addr noverbose noraw nohex nobin noblock=20 nostacktrace nosched-tree # cat trace_entries 50005 If I enable tracing with these settings, I get data in both trace *and* in= =20 latency_trace. Is the last correct? From http://lkml.org/lkml/2008/2/10/43= =20 I got the impression that file should only be used if the "wakeup" tracer=20 is active. Right, finally to the traces and some background info. I've run three 20 minute tests. For each I've counted the clearly audible=20 skips in music: 1) build without tracing, console with build running active: 17 skips 2) build without tracing, build console not visible: 16 skips 3) build with tracing, build console not visible: 11 skips With the first build processor usage varies between 60 and 75% (over two=20 cores); with the others its fairly constant at 50% so 1 core in full use=20 and the other basically idle. All three builds do reach the same point in=20 the about the same time though, so the extra processor usage for the first= =20 build is apparently just display update overhead. The number of skips seems fairly constant. They don't seem to happen at=20 exactly the same points, but there are clearly points where they are more=20 likely. Most striking is that in all three cases I had a series of 4-6=20 skips at ~17-18 minutes into the build. =46rom the last run I've got three traces with 50000 entries (about a minut= e=20 worth each). Traces 1 and 2 should ...
Yes, but those traces were << 1s and didn't appear to contain the jumps. --
yeah, that's the intended behavior. 'trace' and 'latency_trace' are two
kinds of 'views' on the same set of tracer data (with a different output
format). They are both non-destructive - i.e. they dont deplete the
trace buffers.
there's even a third view: 'trace_pipe' - which depletes the trace
buffers and can be used to pipe trace data into a larger file:
cat trace_pipe > /tmp/large-trace.txt
i guess it would be less confusing if 'trace' and 'latency_trace' was
just a single 'view' and its behavior depended on iter_ctl settings. (In
fact, we could probably just get rid of latency_trace - it is a leftover
hm, i looked at them - i thought i mailed you about that but indeed i
didnt (sorry about that) - the traces were too short in scope - they
just covered a few dozen milliseconds in scope, while your description
no need to rerun - but if you do the next test it might make sense to
disable it, just to reduce the number of variables.
another thing - Peter observed skipping on NOHZ. Does your skipping go
away if you boot with 'nohz' and/or with CONFIG_NOHZ=y disabled in the
.config?
Ingo
--
Ah, I was wondering about that. Sorry about the mistake, but if you'd let me know I could have run new traces. If there's anything wrong with the current traces, please _do_ let me know. Problem is that I just see a huge pile of data without really knowing what I'm looking at. For example, I've no idea what the time units are. If I've guessed those wrong, it's not surprising that you don't see anything as I'm then basically stopping the trace too late or including too few entries. What would be a good value for trace_entries to get a decent time interval? If there's more documentation available about what's in those files, please let me know. Cheers, FJP P.S. It would probably be good to just mention the units in the headers of the debug files. It may be obvious to you, but that way it's obvious even to morons like me :-P --
yeah, agreed. We did that in -rt's /proc/latency_trace but it isnt there in ftrace. Steve? Ingo --
Just realized that I forgot to reply to this... Already tried that with .24 (<200804111718.56958.elendil@planet.nl>): ! On Friday 11 April 2008, Peter Zijlstra wrote: ! > I'm having some issue on my own desktop - I get occasional latencies of ! > up to 1 second! It seems to be related to NO_HZ, because when I run a ! > simple while loop to keep the cpu busy, or boot with nohz=off it ! > doesn't seem to happen. ! > ! > Could you try that, just to see if we're seeing the same thing? ! ! Booted with nohz=off, but it does not seem to make any difference for me. ! I still get the music skips and chew still gives the high latencies. ! ! $ ./chew2 ! pid 31568 preempted 4022 us after 1376 us ! pid 31568 preempted 6303 us after 6177 us ! pid 31568 preempted 8836 us after 7963 us ! pid 31568 preempted 13134 us after 7903 us ! pid 31568 preempted 20533 us after 9496 us ! pid 31568 preempted 22957 us after 8556 us ! pid 31568 preempted 28693 us after 9518 us ! pid 31568 preempted 251718 us after 8485 us --
