i'm pleased to announce release -v19 of the CFS scheduler patchset.
The rolled-up CFS patch against today's -git kernel, v2.6.22-rc7,
v2.6.22-rc6-mm1, v2.6.21.5 or v2.6.20.14 can be downloaded from the
usual place:
http://people.redhat.com/mingo/cfs-scheduler/
The biggest user-visible change in -v19 is reworked sleeper fairness:
it's similar in behavior to -v18 but works more consistently across nice
levels. Fork-happy workloads (like kernel builds) should behave better
as well. There are also a handful of speedups: unsigned math, 32-bit
speedups, O(1) task pickup, debloating and other micro-optimizations.
Changes since -v18:
- merged the group-scheduling CFS-core changes from Srivatsa Vaddagiri.
This makes up for the bulk of the changes in -v19 but has no
behavioral impact. The final group-fairness enabler patch is now a
small and lean add-on patch to CFS.
- fix the bloat noticed by Andrew. On 32-bit it's now this:
text data bss dec hex filename
24362 3905 24 28291 6e83 sched.o-rc7
33015 2538 20 35573 8af5 sched.o-v18
25805 2426 20 28251 6e5b sched.o-v19
so it's a net win compared to vanilla. On 64-bit it's even better:
text data bss dec hex filename
35732 40314 2168 78214 13186 sched.o.x64-rc7
41397 37642 2168 81207 13d37 sched.o.x64-v18
36132 37410 2168 75710 127be sched.o.x64-v19
( and there's also a +1.5K data win per CPU on x32, which is not
shown here. [+3.0K data win per CPU on x64.] )
- good number of core code updates, cleanups and streamlining.
(Mike Galbraith, Srivatsa Vaddagiri, Dmitry Adamushko, me.)
- use unsigned data types almost everywhere in CFS. This produces
faster and smaller code, and simplifies the logic.
- turn as many 'u64' data types into 'unsigned long' as possible, to
reduce the 32-bit footprint and to reduce 64-bit arithmetic...Hi, I run a java application at nice 15. Its been a background application here for as long as SD and CFS have been around. If I have a compile running at nice 0, with v19 java gets so little cpu that the the wrapper that runs to monitor it is timing out waiting for it to start. This is new in v19 - something in v19 is not meshing well with my mix of applications... Kernel is gentoo 2.6.22-r1 + cfs v19 How can I help to debug this? Ed Tomlinson -
how much longer did the startup of the java app get relative to say v18? to debug this, could you check whether this problem goes away if you use nice 10 (or nice 5) instead of nice 15? Ingo -
Ingo, It may take a day to two before I get to test this. I have had to revert to 2.6.21 - it seems that 22 triggers a stall here (21 also can trigger this but its harder)... Thanks Ed -
I hear Ingo is off having a genuine long weekend, but in the meantime, you could try echo 30 > /proc/sys/kernel/sched_features to eliminate the sleeper fairness changes in v19, since that was the biggest change. Sending a few seconds of logged /proc/sched_debug will also help get a picture of what's happening, and lovely would be a method to reproduce the problem locally. -Mike -
also, by running this script: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh and sending us the file it produces we'll have most of the environmental information as well. Ingo -
Hi. Is there anything like the sched_debug in the 2.6.22-git5? Because I have a cfs-problem as well [1]. Markus [1] http://lkml.org/lkml/2007/07/14/60 -
Hm. Tasks disappearing isn't you're typical process scheduler problem by any means, nor is an idle box exhibiting mouse "lurchiness". Is there anything unusual in your logs? -Mike -
I know that its not typical, but when my current kernel is stable and shows the same problem with the cfs-patch applied like the git-snapshot, I would say its a cfs issue. There is nothing in the logs when a program dies, thats why asked for a way to make the kernel more verbose. But I can build a plain 2.6.22 without cfs and one with it and compare dmesgs output, if that helps. Markus -
Yes, from your description, and with the now presented additional information that the git-snapshot exhibits the same symptoms, it sounds like cfs _may_ be implicated in some way. I can't imagine how at the moment. In your original report, there are other patches involved, which are an unknown variables. The git-snapshot contains very many changes other than cfs as well. I'd eliminate absolutely all unknowns Yes. It would definitely be worth while to test a virgin stable kernel, and then add only cfs with identical config. Dmesg output may not turn up anything, but eliminating all other variables should either pin the tail on the donkey (cfs?) or vindicate it, and that's what needs to be nailed down solidly first. -Mike -
Well, I took a 2.6.21 (gentoo-patchset...) and applied your cfs-v19 patch for the 2.6.21 kernels. Its a amd64 system and the "normal" When I start the system, the mouse is jerking like under heavy load. (also its idle!) Then some programs just quit. They just disappear, no message in any log. (It were about four apps I realized in a period of about two hours of testing.) I'll try a git snapshot next, but is there a way I can get more output? Greetz Markus -
> I'll try a git snapshot next, but is there a way I can get more output? Did it. The mouse is smooth, just when one app is being quit (dont know why...) the mouse will be jerking for a few seconds... So, is there any way to get a more verbose output regarding the cfs? Markus -
is the mouse jerky on any app quitting? Or is your observation the following: _sometimes_ apps quit unexpectedly (their window just vanishes?), and _at the same time_, the mouse becomes jerky as well, for a few seconds? the mouse typically only becomes jerky when there's some really high load on the system - anything else would be a kernel bug. A jerky mouse on an unloaded system is definitely a sign of some sort of kernel bug (in or outside of the scheduler). An app vanishing unexpectedly might mean an OOM-kill - but that would should up in the syslog as well. Pretty weird. Can you make this regression trigger arbitrarily, so that we could debug it better? Apps exiting unexpectedly can be debugged via: http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/b... you can turn it on via the print-fatal-signals=1 boot option or via: echo 1 > /proc/sys/kernel/print-fatal-signals this feature will produce a small dump to the syslog about every app that exits unexpectedly. Note that this might not cover all types of "window suddenly vanishes" regressions. Ingo -
Well, the system uses about 30% of the cpu (cool'n'quite put it on the lowest frequency). I made a plain 2.6.22.1 and could use it for about 2 hours without any problem. Then I applied the cfs-v19 for that kernel, rebuild from mrproper with the saved config and booted. After a few minutes the first app vanished... some more followed by time (I just surfed around a bit...) The dmesg output is not differing in any interesting point (just some numbers, like raid-benchmark, some irqs or usb-numbers...) Nothing is printed for a disapeared app for me. Is there anything more I can try? Markus -
sure - could you start one of those apps via: strace -ttt -TTT -o trace.log -f <app> and wait for it to "disappear"? Then compress the trace.log via bzip2 -9 (it's probably going to be a really large file) and send me it? Ingo -
> could you please send me the cfs-debug-info output nevertheless? private mail, aswell (187K) When attachments are allowed, I can resend them on the list as well (or just ask me...) To answer a private mail: I do not use any kernel-module thats not part of the official kernel! And of course nothing proprietary # cat /proc/sys/kernel/tainted 0 I used gcc-4.1.2 (glibc-2.5-r4) to build the kernels. (Its a amd64 system, quite stable so far.) Programs that "disappeared" are most graphical, because others I have not noticed so far... also [1] might be caused by this... amarok, kdesktop, whole X, konqueror, konsole but also gtk-apps Markus [1] http://lkml.org/lkml/2007/07/14/64 -
i think it fails here due to some IO error:
9173 1184675906.674610 write(2, "konqueror: Fatal IO error: clien"..., 41) = 41 <0.000007>
could this be due to:
9173 1184675906.194424 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff341af5c0)
= -1 ENOTTY (Inappropriate ioctl for device) <0.000006>
9173 1184675906.194463 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff341af5c0)
= -1 ENOTTY (Inappropriate ioctl for device) <0.000004>
? Are those -ENOTTY results normal?
or perhaps the problem started alot earlier, at:
9173 1184675906.155015 write(2, "In file kernel/qpixmap_x11.cpp, "..., 56) = 56 <0.000006>
9173 1184675906.155052 write(2, "QImage::convertDepth: Image is a"..., 44) = 44 <0.000004>
9173 1184675906.155169 gettimeofday({1184675906, 155179}, NULL) = 0 <0.000006>
9173 1184675906.155249 write(11, "close(6f1c2f7):about:konqueror\n", 31) = 31 <0.000032>
i think konqueror tried to say something here about an image problem?
could you perhaps upload the strace to some webpage so that others can
take a look too?
it might also be good to add "-s 1000" to the strace command, so that we
can see the full messages that konqueror tried to log to some other
task, i.e.:
strace -s 1000 -ttt -TTT -o trace.log -f <app>
and perhaps try to do a 'comparison' trace.normal.log as well, with
konqueror having no problems. Also a KDE expert's advice would be useful
here too i guess ...
Ingo
-> 9173 1184675906.194424 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS,
Well, yes:
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
konqueror: Fatal IO error: client killed
And no, my 2 GB of RAM are not full:
$ free -m
total used free shared buffers
cached
Mem: 2012 1077 935 0 22
441
-/+ buffers/cache: 612 1400
I now made some new strace logs:
- konq crash 251K
- Konq without crash on cfs 302K
- konq without crash on non-cfs 248K
Markus
-i think the problem starts here:
11902 1184699865.141939 read(3, "", 32) = 0 <0.000007>
file descriptor 3 is the Unix domain socket connection to X:
11902 1184699521.012316 socket(PF_FILE, SOCK_STREAM, 0) = 3 <0.000018>
a read result of 0 confused Konqueror so it quit immediately. I've
attached the relevant section below - does it make sense to anyone who
knows more about X, KDE and Konqueror internals than me? :)
Ingo
----------------------->
11902 1184699811.587247 gettimeofday({1184699811, 587257}, NULL) = 0 <0.000007>
11902 1184699811.587280 select(10, [3 4 5 7 9], [], [], {9, 999727}) = 1 (in [3], left {7, 946666}) <2.054594>
11902 1184699813.642502 gettimeofday({1184699813, 642849}, NULL) = 0 <0.000339>
11902 1184699813.643145 ioctl(3, FIONREAD, [32]) = 0 <0.000343>
11902 1184699813.643720 read(3, "\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0", 32) = 32 <0.000219>
11902 1184699813.644271 ioctl(3, FIONREAD, [0]) = 0 <0.000179>
11902 1184699813.644724 ioctl(3, FIONREAD, [0]) = 0 <0.000143>
11902 1184699813.645011 gettimeofday({1184699813, 645265}, NULL) = 0 <0.000248>
11902 1184699813.645448 select(10, [3 4 5 7 9], [], [], {7, 941719}) = 1 (in [3], left {0, 703333}) <7.238595>
11902 1184699820.884750 gettimeofday({1184699820, 884911}, NULL) = 0 <0.000155>
11902 1184699820.885279 ioctl(3, FIONREAD, [224]) = 0 <0.000339>
11902 1184699820.885960 read(3, "\34a\301!\5\0\200\4\365\0\0\0\210+\10\0\0\352\323\0\0\0\0\0\347 C\0\0\0\0\0\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\34a\301![\0\0\0V\1\0\0\211+\10\0\0mp\0\0\0\0\0\347 C\0\0\0\0\0\34a\301![\0\0\0V\1\0\0\211+\10\0\0mp\0\0\0\0\0\347 C\0\0\0\0\0\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\34a\301![\0\0\0002\1\0\0\211+\10\0\0mp\0\0\0\0\0\347 C\0\0\0\0\0", 224) = 224 <0.000475>
11902 1184699820.886800 write(3, "\2...Well, it's preceded by a poll() that says that it has a POLLHUP event, so
that socket would seem to have simply been closed from the other end.
There's also a huge amount of select() calls showing the same thing
(except since it's just the input side, you cannot tell that it's
POLLHUP).
Don't ask me *why*, though. It's preceded by
..
11902 1184699848.615201 read(3, 0x7fffb5b9c8b0, 32) = -1 EAGAIN (Resource temporarily unavailable) <0.000009>
11902 1184699848.615252 poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, -1) = 1 <0.009307>
11902 1184699848.624614 read(3, "\1 \303!\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0b\340T\0\0\0\0\0", 32) = 32 <0.000011>
.. got data ..
11902 1184699848.624710 ioctl(3, FIONREAD, [0]) = 0 <0.000009>
11902 1184699848.624762 ioctl(3, FIONREAD, [0]) = 0 <0.000048>
.. ok, nothing more..
11902 1184699848.624866 select(10, [3 4 5 7 9], [], [], NULL) = 1 (in [3]) <16.495008>
11902 1184699865.119950 ioctl(3, FIONREAD, [0]) = 0 <0.000006>
16+ seconds pass, now it's marked as readable, but returns zero bytes of
data: the other end closed it.
Tons of unnecessary and stupid sequences of:
11902 1184699865.119988 select(10, [3 4 5 7 9], [], [], NULL) = 1 (in [3]) <0.000007>
11902 1184699865.120031 ioctl(3, FIONREAD, [0]) = 0 <0.000005>
..
and then finally:
...
11902 1184699865.141809 poll([{fd=3, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 0) = 1 <0.000005>
11902 1184699865.141838 ioctl(3, FIONREAD, [0]) = 0 <0.000005>
11902 1184699865.141939 read(3, "", 32) = 0 <0.000007>
ie now konqueror noticed that it was *really* closed, and read the EOF.
But why does that happen? And why would the scheduler have *anything* to
do with this? No idea. Maybe timing. Maybe the time.c changes. Dunno.
Linus
-hm, Markus indicated that he tried the v2.6.21.6-cfsv19 patch, and that
does not include the time.c change. Markus - does your kernel include
the code below? (if yes, please revert it via patch -p1 -R )
Ingo
Index: linux/kernel/time.c
===================================================================
--- linux.orig/kernel/time.c
+++ linux/kernel/time.c
@@ -57,14 +57,17 @@ EXPORT_SYMBOL(sys_tz);
*/
asmlinkage long sys_time(time_t __user * tloc)
{
- time_t i;
- struct timeval tv;
+ /*
+ * We read xtime.tv_sec atomically - it's updated
+ * atomically by update_wall_time(), so no need to
+ * even read-lock the xtime seqlock:
+ */
+ time_t i = xtime.tv_sec;
- do_gettimeofday(&tv);
- i = tv.tv_sec;
+ smp_rmb(); /* sys_time() results are coherent */
if (tloc) {
- if (put_user(i,tloc))
+ if (put_user(i, tloc))
i = -EFAULT;
}
return i;
@@ -373,6 +376,20 @@ void do_gettimeofday (struct timeval *tv
tv->tv_sec = sec;
tv->tv_usec = usec;
+
+ /*
+ * Make sure xtime.tv_sec [returned by sys_time()] always
+ * follows the gettimeofday() result precisely. This
+ * condition is extremely unlikely, it can hit at most
+ * once per second:
+ */
+ if (unlikely(xtime.tv_sec != tv->tv_sec)) {
+ unsigned long flags;
+
+ write_seqlock_irqsave(&xtime_lock);
+ update_wall_time();
+ write_seqlock_irqrestore(&xtime_lock);
+ }
}
EXPORT_SYMBOL(do_gettimeofday);
-Well, I am now running a 2.6.22 (without cfs) and could now see it once (within a month...) that exactly the same message from konqueror was produced. So I think its a general problem of konqueror that was hidden and somehow its triggered much more often with the cfs. I just wonder why nobody else has this problem. Markus PS: I am currently building a 2.6.23.1... -
Well, I am back now, but the problem still exists in 2.6.23-rc2. And as there is nothing more I can do thats it for now. Markus -
could you send me your debug-info: http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh just run that script on 2.6.23-rc2 system and send me the file it produces. I've got a theory about what might be going on, and this debug-info could help prove/disprove it. Ingo -
Done by private mail on friday. Also tried the very current linux-2.6.git (friday aswell) with sched-ingo-combo.patch (as told in a private answer mail). Nothing fixed it so far. If I can do anything... Markus -
> hm, Markus indicated that he tried the v2.6.21.6-cfsv19 patch, and As already said, 2.6.22.1-cfs-v19 includes the patch and 2.6.21.6-cfs-v19 does not include it. But both suffer of the problem. I now reversed the patch on 2.6.22.1-cfs-v19 but it does not help. 2.6.22-git15 is not working as well... so the problem did not magically disappear like the processes. Any further things I can do to track it down? (I go on vacation on monday for two weeks, but just send them, I'll try them when I am back or answer the mails, when I dont need to build a new kernel...) Markus -
it _seems_ to be related to KDE apps most of the time, right? Linus analyzed your strace and it suggests that X is dropping the connection. I.e. it's user-space who gets confused for some reason - perhaps timing. Would be nice if a KDE/konqueror person could look at your traces and suggest a better debugging method to track this down - because right now konqueror just vanishes with only the "konqueror: Fatal IO error: client killed" message. Hopefully i've Cc:-ed the right list: konqueror kfm-devel@mail.kde.org. I've attached below Linus' analysis about this mystery 'konqueror just vanishes' problem, and the strace too. Could you perhaps mention the kde and Xorg versions you are using? Ingo ------------------------------------------- Well, it's preceded by a poll() that says that it has a POLLHUP event, so that socket would seem to have simply been closed from the other end. There's also a huge amount of select() calls showing the same thing (except since it's just the input side, you cannot tell that it's POLLHUP). Don't ask me *why*, though. It's preceded by .. 11902 1184699848.615201 read(3, 0x7fffb5b9c8b0, 32) = -1 EAGAIN (Resource temporarily unavailable) <0.000009> 11902 1184699848.615252 poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, -1) = 1 <0.009307> 11902 1184699848.624614 read(3, "\1 \303!\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0b\340T\0\0\0\0\0", 32) = 32 <0.000011> .. got data .. 11902 1184699848.624710 ioctl(3, FIONREAD, [0]) = 0 <0.000009> 11902 1184699848.624762 ioctl(3, FIONREAD, [0]) = 0 <0.000048> .. ok, nothing more.. 11902 1184699848.624866 select(10, [3 4 5 7 9], [], [], NULL) = 1 (in [3]) <16.495008> 11902 1184699865.119950 ioctl(3, FIONREAD, [0]) = 0 <0.000006> 16+ seconds pass, now it's marked as readable, but returns zero bytes of data: the other end closed it. Tons of unnecessary and stupid sequences of: 11902 1184699865.119988 select(10, [3 4 5 7 9], [], [...
> it _seems_ to be related to KDE apps most of the time, right? You are right (although some gtk-apps have the same problem), but that may be caused by the fact that I dont _see_ other apps vanish... I'll I have konqueror 3.5.5 and Xorg 7.2 on a native amd64 system (gcc 4.1.2 and glibc 2.5). Markus -
> hm, Markus indicated that he tried the v2.6.21.6-cfsv19 patch, and Well, the 2.6.22.1-cfs-v19 does include it, but the 2.6.21.6-cfs-v19 does not have that patch applied. But both show this problem. Markus -
oh, and i missed the obvious request: could you start konqueror from a terminal and see what it prints when it goes down? Ingo -
Greetings, I've taken mainline git tree (freshly integrated CFS!) out for a multimedia spin. I tested watching movies and listenign to music in the presence of various sleep/burn loads, pure burn loads, and mixed loads. All was peachy here.. I saw no frame drops or sound skips or other artifacts under any load where the processor could possibly meet demand. -Mike -
I would agree with preliminary testing, save that if you get a lot of processes updating the screen at once, there seems to be a notable case of processes getting no CPU for 100-300ms, followed by a lot of CPU. I see this clearly with the "glitch1" test with four scrolling xterms and glxgears, but also watching videos with little busy processes on the screen. The only version where I never see this in test or with real use is cfs-v13. -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot -
just as a test, does this go away if you:
renice -20 pidof `Xorg`
i.e. is this connected to the way X is scheduled?
Another thing to check would be whether it goes away if you set the
granularity to some really finegrained value:
echo 0 > /proc/sys/kernel/sched_wakeup_granularity_ns
echo 500000 > /proc/sys/kernel/sched_granularity_ns
this really pushes things - but it tests the theory whether this is
related to granularity.
Ingo
-Doing this slows down the display rates, but doesn't significantly help I didn't test this with standard Xorg priority, I should go back and try that. But it didn't really make much difference. The gears and scrolling xterms ran slower with Xorg at -20 with any sched settings. I'll do that as soon as a build finishes and I can reboot. I should really go back to 2.6.21.6, 2.6.22 has many bizarre behaviors with FC6. Automount starts taking 30% of CPU (unused at the moment), the sensors applet doesn't work, etc. I hope over the weekend I can get bug -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot -
Can you confirm whether CFS is involved, i.e. does it spin like that even without the CFS patch applied? -
hmmm .... could you take out the kernel/time.c (sys_time()) changes from the CFS patch, does that solve the automount issue? If yes, could someone take a look at automount and check whether it makes use of time(2) and whether it combines it with finer grained time sources? Ingo -
Yes it does and I have two reported bugs so far.
In several places I have code similar to:
wait.tv_sec = time(NULL) + 1;
wait.tv_nsec = 0;
signaled = 0;
while (!signaled) {
status = pthread_cond_timedwait(&cond, &mutex, &wait);
if (status) {
if (status == ETIMEDOUT)
break;
fatal(status);
}
}
lead to automount spinning with strace output a bit like:
futex(0x80034b60, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 130925919}) = 0
time(NULL) = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131160876}) = 0
time(NULL) = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131377080}) = 0
time(NULL) = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131593297}) = 0
time(NULL) = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1) = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131871792}) = 0
There should be something like:
futex(0x5555557868c4, FUTEX_WAIT, 5321099, {0, 998091311}) = -1 ETIMEDOUT (Connection timed out)
in there I think.
Ian
-ah! It passes in a low-res time source into a high-res time interface (pthread_cond_timedwait()). Could you change the time(NULL) + 1 to time(NULL) + 2, or change it to: gettimeofday(&wait, NULL); wait.tv_sec++; does this solve the spinning? i'm wondering how widespread this is. If automount is the only app doing this then _maybe_ we could get away with it by changing automount? Ingo -
Ok, that definitely should work. This is wrong. It's wrong for two reasons: - it really shouldn't be needed. I don't think "time()" has to be *exactly* in sync, but I don't think it can be off by a third of a second or whatever (as the "30% CPU load" would seem to imply) - gettimeofday works on a timeval, pthread_cond_timedwait() works on a timespec. So if it actually makes a difference, it makes a difference for the *wrong* reason: the time is still totally nonsensical in the tv_nsec field (because it actually got filled in with msecs!), but now the tv_sec field is in sync, so it hides the bug. Anyway, hopefully the patch below might help. But we probably should make this whole thing a much more generic routine (ie we have our internal "getnstimeofday()" that still is missing the second-overflow logic, and that is quite possibly the one that triggers the "30% off" behaviour). Ingo, I'd suggest: - ger rid of "timespec_add_ns()", or at least make it return a return value for when it overflows. - make all the people who overflow into tv_sec call a "fix_up_seconds()" thing that does the xtime overflow handling. Linus --- Subject: time: make sure sys_gettimeofday() and sys_time() are in sync From: Ingo Molnar <mingo@elte.hu> make sure sys_gettimeofday() and sys_time() results are coherent. Signed-off-by: Ingo Molnar <mingo@elte.hu> --- kernel/time/timekeeping.c | 13 +++++++++++++ 1 file changed, 13 insertions(+) Index: linux/kernel/time/timekeeping.c =================================================================== --- linux.orig/kernel/time/timekeeping.c +++ linux/kernel/time/timekeeping.c @@ -92,6 +92,19 @@ static inline void __get_realtime_clock_ } while (read_seqretry(&xtime_lock, seq)); timespec_add_ns(ts, nsecs); + /* + * Make sure xtime.tv_sec [returned by sys_time()] always + * follows the gettimeofday() result precisely. This + * condition is extremely unlikely, it can hit at m...
ah, i didnt notice that automount mixed up timespec with timeval! That is nasty and the tv_nsec field (which really is ts_usec to pthread_cond_timewait()) must stay cleared - or rather, to avoid bugs of yeah, i'll generalize it, but our internal getnstimeofday() used on most architectures is using __get_realtime_clock_ns(), and the patch you attached already adds the second-overflow logic to it. there are two versions of getnstimeofday(), a TIME_INTERPOLATION one and a !TIME_INTERPOLATION one. TIME_INTERPOLATION is only used on ia64 at the moment - and that one indeed does not have the second overflow ok, i'll do something clean. Ingo -
Spectacularly no! With this patch the "glitch1" script with multiple scrolling windows has all xterms and glxgears stop totally dead for ~200ms once per second. I didn't properly test anything else after that. Since the automount issue doesn't seem to start until something kicks it -
Bill, could you try the patch below - does it fix the automount problem, without introducing new problems? Ingo ---------------> Subject: time: introduce xtime_seconds From: Ingo Molnar <mingo@elte.hu> introduce the xtime_seconds optimization. This is a read-mostly low-resolution time source available to sys_time() and kernel-internal use. This variable is kept uptodate atomically, and it's monotically increased, every time some time interface constructs an xtime-alike time result that overflows the seconds value. (it's updated from the timer interrupt as well) this way high-resolution time results update their seconds component at the same time sys_time() does it: 1184858832999989000 1184858832000000000 1184858832999992000 1184858832000000000 1184858832999996000 1184858832000000000 1184858832999999000 1184858832000000000 1184858833000003000 1184858833000000000 1184858833000006000 1184858833000000000 1184858833000009000 1184858833000000000 [ these are nsec time results from alternating calls to sys_time() and sys_gettimeofday(), recorded at the seconds boundary. ] instead of the previous (non-coherent) behavior: 1184848950999987000 1184848950000000000 1184848950999990000 1184848950000000000 1184848950999994000 1184848950000000000 1184848950999997000 1184848950000000000 1184848951000001000 1184848950000000000 1184848951000005000 1184848950000000000 1184848951000008000 1184848950000000000 1184848951000011000 1184848950000000000 1184848951000015000 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/time.h | 13 +++++++++++-- kernel/time.c | 25 ++++++------------------- kernel/time/timekeeping.c | 28 ++++++++++++++++++++++++---- 3 files changed, 41 insertions(+), 25 deletions(-) Index: linux/include/linux/time.h =================================================================== --- linux.orig/include/linux/time.h +++ linux/include/linux/time.h...
Okay, as noted off-list, after I exported the xtime_seconds it now builds and works. However, there are a *lot* of "section mismatches" which are not reassuring. Boots, runs, glitch1 test runs reasonably smoothly. automount has not used significant CPU yet, but I don't know what triggers it, the bad behavior did not happen immediately without the patch. However, it looks very hopeful. Warnings attached to save you the trouble... -- Bill Davidsen <davidsen@tmr.com> "We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
Doesn't seem to apply against 2.6.22.1, I'm trying 2.6.22.6 as soon as I -
Applied to 2.6.22-git9, building now. -
ok, that's fine too. Ingo -
the patch below is merged against 2.6.22.1-cfs-v19 - does it solve the autofs problem (without any other bad side-effects)? Ingo -------------------> Subject: time: introduce xtime_seconds From: Ingo Molnar <mingo@elte.hu> introduce the xtime_seconds optimization. This is a read-mostly low-resolution time source available to sys_time() and kernel-internal use. This variable is kept uptodate atomically, and it's monotically increased, every time some time interface constructs an xtime-alike time result that overflows the seconds value. (it's updated from the timer interrupt as well) this way high-resolution time results update their seconds component at the same time sys_time() does it: 1184858832999989000 1184858832000000000 1184858832999992000 1184858832000000000 1184858832999996000 1184858832000000000 1184858832999999000 1184858832000000000 1184858833000003000 1184858833000000000 1184858833000006000 1184858833000000000 1184858833000009000 1184858833000000000 [ these are nsec time results from alternating calls to sys_time() and sys_gettimeofday(), recorded at the seconds boundary. ] instead of the previous (non-coherent) behavior: 1184848950999987000 1184848950000000000 1184848950999990000 1184848950000000000 1184848950999994000 1184848950000000000 1184848950999997000 1184848950000000000 1184848951000001000 1184848950000000000 1184848951000005000 1184848950000000000 1184848951000008000 1184848950000000000 1184848951000011000 1184848950000000000 1184848951000015000 Signed-off-by: Ingo Molnar <mingo@elte.hu> --- include/linux/time.h | 13 +++++++++++-- kernel/time.c | 25 ++++++------------------- kernel/time/timekeeping.c | 26 +++++++++++++++++++++++--- 3 files changed, 40 insertions(+), 24 deletions(-) Index: linux-cfs-2.6.22.q/include/linux/time.h =================================================================== --- linux-cfs-2.6.22.q.orig/include...
hm, it's against recent -git. dont waste your time on 2.6.21.6-cfsv19, it will likely not apply - give me a few minutes to create a patch for you against 2.6.22.1-cfsv19, ok? Ingo -
thanks. Andrew also just reported that it broke his laptop and i'm working on a proper version. Ingo -
Oh ya .. I thought it wouldn't hurt to add the fraction of the current second for correctness and actually put things like: gettimeofday(&now, NULL); wait.tv_sec = now.tv_sec + 1; wait.tv_nsec = now.tv_usec * 1000; in autofs. Ian -
This code is horribly broken. Don't change the kernel because this code is broken. First it adds a second, but then it subtracts up to a second. Just before the second boundary, this code can burn CPU like crazy, with each wait being just a few nanoseconds. What is the intent of this code? Is it to wait "up to a second, possibly for no time at all" or is to wait "for at least a second"? If so, why are you zeroing the nanosecond count? DS -
Yes, adding in the offset within the current second appears to resolve I don't think the change is unreasonable since I wasn't using an accurate time in the condition wait, so that's a coding mistake on my part which I will fix. Ian -
thanks Ian for taking care of this and for fixing it! Linus, Thomas, what do you think, should we keep the time.c change? Automount is one app affected so far, and it's a borderline case: the increased (30%) CPU usage is annoying, but it does not prevent the system from working per se, and an upgrade to a fixed/enhanced automount version resolves it. The temptation of using a really (and trivially) scalable low-resolution time-source (which is _easily_ vsyscall-able, on any platform) for DBMS use is really large, to me at least. Should i perhaps add a boot/config option that enables/disables this optimization, to allow distros finer grained control about this? And we've also got to wait whether there's any other app affected. Ingo -
No, not if it's off by the second field. That 30% CPU usage indicates that there's some nasty bug there somewhere, and that's just not worth it. If time() cannot get the second field right, it's bogus. I'm ok with us not *guaranteeing* monotonicity of the second field when you compare gettimeofday() with time(), but the 30% thing implies that it's much worse than that, and that "time()" will likely report the previous second (when compared to hrtimers) roughly a quarter of the time. And that isn't acceptable. So either it should be fixed, or reverted. Linus -
Allow it to be selected by the "features" so that admins can evaluate the implications without a reboot? That would be a convenient interface if you could provide it. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 -
Odds are there's at least one other app doing that somewhere. Would reverting the CFS changes to time.c fix this problem? That optimization just got merged in 2.6.22 mainline... -
OK, I'm with you, hi-res timer. But even so, how is the time in the past after adding a second. Is it because I'm not setting tv_nsec when it's close to a second I don't have a system to test this on so I'll try to get one of the I'm happy to change automount but that could cause odd version specific problems for people updating their kernel on an older installed base. Aaah .. and they'll all blame me!! ;) Ian -
yeah, it looks a bit suspicious: you create a +1 second timeout out of a 1 second resolution timesource. I dont yet understand the failure mode though that results in that looping and in the 30% CPU time use - do you understand it perhaps? (and automount is still functional while this is happening, correct?) Ingo -
Can't say, I have automount running because I get it by default, but I have nothing using at on my test machine. Why is it looping so fast when there are no mount points defined? If the config changes there's no requirement to notice right away, is there? -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 -
There are two threads where this mistake is made. One is used to trigger expire events for all automounted filesystems which happen all the time since I need to run the expire to check if anything is mounted and whether it needs to be umounted. The alarm handler sleeps on a condition until the alarm list in not empty and then sleeps on a condition until the next alarm in the list expires or an alarm is added to the list, in which case it then checks the list again. Since the autofs timeout granularity is one second this is a problem and will be fixed. This isn't the source of the problem that's been reported. The second is the state queue handler which runs tasks such as expires, map re-reads, shutdowns etc. for all automounted filesystems. While the check interval could be longer it causes autofs to be slugish in situations such as shutdowns where there are a largish number of mounts present and I need to cancel such things as expires and the like. It's possible I could improve this but, in fact, once the timespec is set correctly as Ingo suggests it works fine and uses very little resource. Ian -
I will try that, but not until Tuesday night. I've been here too long today and have an out-of-state meeting tomorrow. I'll take a look after dinner. Note that the latest 2.6.21 with cfs-v19 doesn't have any problems of any nature, other than suspend to RAM not working, and I may have the config wrong. Runs really well otherwise, but I'll test drive Will do. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 -
Partial answer: -10 didn't help with v16, I'll try more boost ASAP, but power has been spotty in upstate NY, 20k+ customers with none and the rest of us subject to "load shedding" with zero warning, and the test machines have UPS but no generator, so I hesitate to use them while Will do, you suggested dropping sched_granularity_ns to 1000000 earlier, and that didn't do it, but I didn't change the wakeup, and will test these values later today. -- bill davidsen <davidsen@tmr.com> CTO TMR Associates, Inc Doing interesting things with small computers since 1979 -
Hi Ingo ! Interestingly, I also noticed the possibility of O(1) task pickup when playing with v18, but did not detect any noticeable improvement with it. Of course, it depends on the workload and I probably didn't perform the most relevant tests. V19 works very well here on 2.6.20.14. I could start 32k busy loops at nice +19 (I exhausted the 32k pids limit), and could still perform normal operations. I noticed that 'vmstat' scans all the pid entries under /proc, which takes ages to collect data before displaying a line. Obviously, the system sometimes shows some short latencies, but not much more than what you get from and SSH through a remote DSL connection. Here's a vmstat 1 output : r b w swpd free buff cache si so bi bo in cs us sy id 32437 0 0 0 809724 488 6196 0 0 1 0 135 0 24 72 4 32436 0 0 0 811336 488 6196 0 0 0 0 717 0 78 22 0 32436 0 0 0 812956 488 6196 0 0 0 0 727 0 79 21 0 32436 0 0 0 810164 400 6196 0 0 0 0 707 0 80 20 0 32436 0 0 0 815116 400 6436 0 0 0 0 769 0 77 23 0 32436 0 0 0 811644 400 6436 0 0 0 0 715 0 80 20 0 Amusingly, I started mpg123 during this test and it skipped quite a bit. After setting all tasks to SCHED_IDLE, it did not skip anymore. All this seems to behave like one could expect. I also started 30k processes distributed in 130 groups of 234 chained by pipes in which one byte is passed. I get an average of 8000 in the run queue. The context switch rate is very low and sometimes even null in this test, maybe some of them are starving, I really do not know : r b w swpd free buff cache si so bi bo in cs us sy id 7752 0 1 0 656892 244 4196 0 0 0 0 725 0 16 84 0 7781 0 1 0 656768 244 4196 0 0 0 0 724 0 16 84 0 8000 0 1 0 656916 244 41...
yeah - it's a small tweak. CFS is O(31) in sleep/wakeup so it's now all yeah. It behaves better than i expected in fact - 32K tasks is pushing hm, could you profile this? We could have some bottleneck somewhere (likely not in the scheduler) with that many tasks being runnable. [ well, just a little bit of improvement would be nice to have too :) Ingo -
Hi Ingo, sorry for the long delay, I've spent a week doing non-kernel work. Yes, that's what I tried to explain to a
