login
Header Space

 
 

Re: [patch] CFS scheduler, -v19

Previous thread: [PATCH] some kmalloc/memset ->kzalloc (tree wide) by Yoann Padioleau on Friday, July 6, 2007 - 12:51 pm. (11 messages)

Next thread: none
To: <linux-kernel@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, July 6, 2007 - 1:33 pm

i'm pleased to announce release -v19 of the CFS scheduler patchset.

The rolled-up CFS patch against today's -git kernel, v2.6.22-rc7, 
v2.6.22-rc6-mm1, v2.6.21.5 or v2.6.20.14 can be downloaded from the 
usual place:

    http://people.redhat.com/mingo/cfs-scheduler/
 
The biggest user-visible change in -v19 is reworked sleeper fairness: 
it's similar in behavior to -v18 but works more consistently across nice 
levels. Fork-happy workloads (like kernel builds) should behave better 
as well. There are also a handful of speedups: unsigned math, 32-bit 
speedups, O(1) task pickup, debloating and other micro-optimizations.

Changes since -v18:

 - merged the group-scheduling CFS-core changes from Srivatsa Vaddagiri. 
   This makes up for the bulk of the changes in -v19 but has no
   behavioral impact. The final group-fairness enabler patch is now a 
   small and lean add-on patch to CFS.

 - fix the bloat noticed by Andrew. On 32-bit it's now this:

      text    data     bss     dec     hex   filename
     24362    3905      24   28291    6e83   sched.o-rc7
     33015    2538      20   35573    8af5   sched.o-v18
     25805    2426      20   28251    6e5b   sched.o-v19

   so it's a net win compared to vanilla. On 64-bit it's even better:

      text    data     bss     dec     hex   filename
     35732   40314    2168   78214   13186   sched.o.x64-rc7
     41397   37642    2168   81207   13d37   sched.o.x64-v18
     36132   37410    2168   75710   127be   sched.o.x64-v19

   ( and there's also a +1.5K data win per CPU on x32, which is not
     shown here. [+3.0K data win per CPU on x64.] )

 - good number of core code updates, cleanups and streamlining.
   (Mike Galbraith, Srivatsa Vaddagiri, Dmitry Adamushko, me.)

 - use unsigned data types almost everywhere in CFS. This produces 
   faster and smaller code, and simplifies the logic.

 - turn as many 'u64' data types into 'unsigned long' as possible, to 
   reduce the 32-bit footprint and to reduce 64-bit arithmetic...
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Saturday, July 14, 2007 - 1:19 pm

Hi,

I run a java application at nice 15.  Its been a background application here for as long
as SD and CFS have been around.  If I have a compile running at nice 0, with v19 java 
gets so little cpu that the the wrapper that runs to monitor it is timing out waiting for 
it to start.  This is new in v19 - something in v19 is not meshing well with my mix 
of applications...

Kernel is gentoo 2.6.22-r1 + cfs v19

How can I help to debug this?
Ed Tomlinson

-
To: Ed Tomlinson <edt@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, July 16, 2007 - 5:17 am

how much longer did the startup of the java app get relative to say v18?

to debug this, could you check whether this problem goes away if you use 
nice 10 (or nice 5) instead of nice 15?

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, July 16, 2007 - 7:10 am

Ingo,

It may take a day to two before I get to test this. I have had to revert to 2.6.21 -
it seems that 22 triggers a stall here (21 also can trigger this but its harder)...

Thanks
Ed

-
To: Ed Tomlinson <edt@...>
Cc: Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Sunday, July 15, 2007 - 1:25 am

I hear Ingo is off having a genuine long weekend, but in the meantime,
you could try echo 30 &gt; /proc/sys/kernel/sched_features to eliminate the
sleeper fairness changes in v19, since that was the biggest change.

Sending a few seconds of logged /proc/sched_debug will also help get a
picture of what's happening, and lovely would be a method to reproduce
the problem locally.

	-Mike

-
To: Mike Galbraith <efault@...>
Cc: Ed Tomlinson <edt@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, July 16, 2007 - 4:00 am

also, by running this script:

   http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

and sending us the file it produces we'll have most of the environmental 
information as well.

	Ingo
-
To: Mike Galbraith <efault@...>
Cc: Ed Tomlinson <edt@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Sunday, July 15, 2007 - 8:53 am

Hi. Is there anything like the sched_debug in the 2.6.22-git5?
Because I have a cfs-problem as well [1].

   Markus


[1] http://lkml.org/lkml/2007/07/14/60
-
To: Markus <lists4me@...>
Cc: Ed Tomlinson <edt@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Sunday, July 15, 2007 - 3:46 pm

Hm.  Tasks disappearing isn't you're typical process scheduler problem
by any means, nor is an idle box exhibiting mouse "lurchiness".  Is
there anything unusual in your logs?

	-Mike

-
To: Mike Galbraith <efault@...>
Cc: Ed Tomlinson <edt@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Sunday, July 15, 2007 - 5:11 pm

I know that its not typical, but when my current kernel is stable and 
shows the same problem with the cfs-patch applied like the 
git-snapshot, I would say its a cfs issue.
There is nothing in the logs when a program dies, thats why asked for a 
way to make the kernel more verbose.
But I can build a plain 2.6.22 without cfs and one with it and compare 
dmesgs output, if that helps.

   Markus
-
To: Markus <lists4me@...>
Cc: Ed Tomlinson <edt@...>, Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, July 16, 2007 - 2:42 am

Yes, from your description, and with the now presented additional
information that the git-snapshot exhibits the same symptoms, it sounds
like cfs _may_ be implicated in some way.  I can't imagine how at the
moment.  In your original report, there are other patches involved,
which are an unknown variables.  The git-snapshot contains very many
changes other than cfs as well.  I'd eliminate absolutely all unknowns

Yes.  It would definitely be worth while to test a virgin stable kernel,
and then add only cfs with identical config.  Dmesg output may not turn
up anything, but eliminating all other variables should either pin the
tail on the donkey (cfs?) or vindicate it, and that's what needs to be
nailed down solidly first.

	-Mike

-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>
Date: Saturday, July 14, 2007 - 7:34 am

Well, I took a 2.6.21 (gentoo-patchset...) and applied your cfs-v19 
patch for the 2.6.21 kernels. Its a amd64 system and the "normal" 

When I start the system, the mouse is jerking like under heavy load. 
(also its idle!)
Then some programs just quit. They just disappear, no message in any 
log. (It were about four apps I realized in a period of about two hours 
of testing.)

I'll try a git snapshot next, but is there a way I can get more output?

Greetz
Markus
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>
Date: Saturday, July 14, 2007 - 11:11 am

&gt; I'll try a git snapshot next, but is there a way I can get more 
output?
Did it. The mouse is smooth, just when one app is being quit (dont know 
why...) the mouse will be jerking for a few seconds...

So, is there any way to get a more verbose output regarding the cfs?

   Markus
-
To: Markus <lists4me@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, July 16, 2007 - 5:41 am

is the mouse jerky on any app quitting? Or is your observation the 
following: _sometimes_ apps quit unexpectedly (their window just 
vanishes?), and _at the same time_, the mouse becomes jerky as well, for 
a few seconds?

the mouse typically only becomes jerky when there's some really high 
load on the system - anything else would be a kernel bug. A jerky mouse 
on an unloaded system is definitely a sign of some sort of kernel bug 
(in or outside of the scheduler). An app vanishing unexpectedly might 
mean an OOM-kill - but that would should up in the syslog as well. 
Pretty weird.

Can you make this regression trigger arbitrarily, so that we could debug 
it better? Apps exiting unexpectedly can be debugged via:

  http://kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.22-rc6/2.6.22-rc6-mm1/b...

you can turn it on via the print-fatal-signals=1 boot option or via:

  echo 1 &gt; /proc/sys/kernel/print-fatal-signals

this feature will produce a small dump to the syslog about every app 
that exits unexpectedly. Note that this might not cover all types of 
"window suddenly vanishes" regressions.

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, July 16, 2007 - 1:59 pm

Well, the system uses about 30% of the cpu (cool'n'quite put it on the 
lowest frequency).

I made a plain 2.6.22.1 and could use it for about 2 hours without 
any problem. Then I applied the cfs-v19 for that kernel, rebuild from 
mrproper with the saved config and booted. After a few minutes the 
first app vanished... some more followed by time (I just surfed around 
a bit...)

The dmesg output is not differing in any interesting point (just some 
numbers, like raid-benchmark, some irqs or usb-numbers...)

Nothing is printed for a disapeared app for me.


Is there anything more I can try?


   Markus
-
To: Markus <lists4me@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, July 17, 2007 - 3:37 am

sure - could you start one of those apps via:

	strace -ttt -TTT -o trace.log -f &lt;app&gt;

and wait for it to "disappear"? Then compress the trace.log via bzip2 -9 
(it's probably going to be a really large file) and send me it?

	Ingo
-
To: Linux Kernel Mailing List <linux-kernel@...>
Cc: Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, July 17, 2007 - 9:06 am

&gt; could you please send me the cfs-debug-info output nevertheless?
private mail, aswell (187K)

When attachments are allowed, I can resend them on the list as well (or 
just ask me...)


To answer a private mail: I do not use any kernel-module thats not part 
of the official kernel!
And of course nothing proprietary
# cat /proc/sys/kernel/tainted
0

I used gcc-4.1.2 (glibc-2.5-r4) to build the kernels. (Its a amd64 
system, quite stable so far.)

Programs that "disappeared" are most graphical, because others I have 
not noticed so far... also [1] might be caused by this...
amarok, kdesktop, whole X, konqueror, konsole but also gtk-apps


   Markus


[1] http://lkml.org/lkml/2007/07/14/64
-
To: Markus <lists4me@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 1:06 pm

i think it fails here due to some IO error:

 9173  1184675906.674610 write(2, "konqueror: Fatal IO error: clien"..., 41) = 41 &lt;0.000007&gt;

could this be due to:

9173  1184675906.194424 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff341af5c0)
= -1 ENOTTY (Inappropriate ioctl for device) &lt;0.000006&gt;
9173  1184675906.194463 ioctl(2, SNDCTL_TMR_TIMEBASE or TCGETS, 0x7fff341af5c0)
= -1 ENOTTY (Inappropriate ioctl for device) &lt;0.000004&gt;

? Are those -ENOTTY results normal?

or perhaps the problem started alot earlier, at:

9173  1184675906.155015 write(2, "In file kernel/qpixmap_x11.cpp, "..., 56) = 56 &lt;0.000006&gt;
9173  1184675906.155052 write(2, "QImage::convertDepth: Image is a"..., 44) = 44 &lt;0.000004&gt;
9173  1184675906.155169 gettimeofday({1184675906, 155179}, NULL) = 0 &lt;0.000006&gt;
9173  1184675906.155249 write(11, "close(6f1c2f7):about:konqueror\n", 31) = 31 &lt;0.000032&gt;

i think konqueror tried to say something here about an image problem?

could you perhaps upload the strace to some webpage so that others can 
take a look too?

it might also be good to add "-s 1000" to the strace command, so that we 
can see the full messages that konqueror tried to log to some other 
task, i.e.:

  strace -s 1000 -ttt -TTT -o trace.log -f &lt;app&gt;

and perhaps try to do a 'comparison' trace.normal.log as well, with 
konqueror having no problems. Also a KDE expert's advice would be useful 
here too i guess ...

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 3:42 pm

&gt; 9173  1184675906.194424 ioctl(1, SNDCTL_TMR_TIMEBASE or TCGETS, 
Well, yes:
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
In file kernel/qpixmap_x11.cpp, line 633: Out of memory
QImage::convertDepth: Image is a null image
konqueror: Fatal IO error: client killed

And no, my 2 GB of RAM are not full:
$ free -m
             total       used       free     shared    buffers     
cached
Mem:          2012       1077        935          0         22        
441
-/+ buffers/cache:        612       1400
I now made some new strace logs:
- konq crash 251K
- Konq without crash on cfs 302K
- konq without crash on non-cfs 248K


   Markus
-
To: Markus <lists4me@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 4:09 pm

i think the problem starts here:

  11902 1184699865.141939 read(3, "", 32) = 0 &lt;0.000007&gt;

file descriptor 3 is the Unix domain socket connection to X:

  11902 1184699521.012316 socket(PF_FILE, SOCK_STREAM, 0) = 3 &lt;0.000018&gt;

a read result of 0 confused Konqueror so it quit immediately. I've 
attached the relevant section below - does it make sense to anyone who 
knows more about X, KDE and Konqueror internals than me? :)

	Ingo

-----------------------&gt;
11902 1184699811.587247 gettimeofday({1184699811, 587257}, NULL) = 0 &lt;0.000007&gt;
11902 1184699811.587280 select(10, [3 4 5 7 9], [], [], {9, 999727}) = 1 (in [3], left {7, 946666}) &lt;2.054594&gt;
11902 1184699813.642502 gettimeofday({1184699813, 642849}, NULL) = 0 &lt;0.000339&gt;
11902 1184699813.643145 ioctl(3, FIONREAD, [32]) = 0 &lt;0.000343&gt;
11902 1184699813.643720 read(3, "\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0", 32) = 32 &lt;0.000219&gt;
11902 1184699813.644271 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000179&gt;
11902 1184699813.644724 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000143&gt;
11902 1184699813.645011 gettimeofday({1184699813, 645265}, NULL) = 0 &lt;0.000248&gt;
11902 1184699813.645448 select(10, [3 4 5 7 9], [], [], {7, 941719}) = 1 (in [3], left {0, 703333}) &lt;7.238595&gt;
11902 1184699820.884750 gettimeofday({1184699820, 884911}, NULL) = 0 &lt;0.000155&gt;
11902 1184699820.885279 ioctl(3, FIONREAD, [224]) = 0 &lt;0.000339&gt;
11902 1184699820.885960 read(3, "\34a\301!\5\0\200\4\365\0\0\0\210+\10\0\0\352\323\0\0\0\0\0\347 C\0\0\0\0\0\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\34a\301![\0\0\0V\1\0\0\211+\10\0\0mp\0\0\0\0\0\347 C\0\0\0\0\0\34a\301![\0\0\0V\1\0\0\211+\10\0\0mp\0\0\0\0\0\347 C\0\0\0\0\0\v\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\34a\301![\0\0\0002\1\0\0\211+\10\0\0mp\0\0\0\0\0\347 C\0\0\0\0\0", 224) = 224 &lt;0.000475&gt;
11902 1184699820.886800 write(3, "\2...
To: Ingo Molnar <mingo@...>
Cc: Markus <lists4me@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 4:37 pm

Well, it's preceded by a poll() that says that it has a POLLHUP event, so 
that socket would seem to have simply been closed from the other end. 
There's also a huge amount of select() calls showing the same thing 
(except since it's just the input side, you cannot tell that it's 
POLLHUP).

Don't ask me *why*, though. It's preceded by

	..
	11902 1184699848.615201 read(3, 0x7fffb5b9c8b0, 32) = -1 EAGAIN (Resource temporarily unavailable) &lt;0.000009&gt;
	11902 1184699848.615252 poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, -1) = 1 &lt;0.009307&gt;
	11902 1184699848.624614 read(3, "\1 \303!\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0b\340T\0\0\0\0\0", 32) = 32 &lt;0.000011&gt;

.. got data ..

	11902 1184699848.624710 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000009&gt;
	11902 1184699848.624762 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000048&gt;

.. ok, nothing more..

	11902 1184699848.624866 select(10, [3 4 5 7 9], [], [], NULL) = 1 (in [3]) &lt;16.495008&gt;
	11902 1184699865.119950 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000006&gt;

16+ seconds pass, now it's marked as readable, but returns zero bytes of 
data: the other end closed it.

Tons of unnecessary and stupid sequences of:

	11902 1184699865.119988 select(10, [3 4 5 7 9], [], [], NULL) = 1 (in [3]) &lt;0.000007&gt;
	11902 1184699865.120031 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000005&gt;
	..

and then finally:

	...
	11902 1184699865.141809 poll([{fd=3, events=POLLIN, revents=POLLIN|POLLHUP}], 1, 0) = 1 &lt;0.000005&gt;
	11902 1184699865.141838 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000005&gt;
	11902 1184699865.141939 read(3, "", 32) = 0 &lt;0.000007&gt;

ie now konqueror noticed that it was *really* closed, and read the EOF.

But why does that happen? And why would the scheduler have *anything* to 
do with this? No idea. Maybe timing. Maybe the time.c changes. Dunno.

		Linus
-
To: Linus Torvalds <torvalds@...>
Cc: Markus <lists4me@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 4:43 pm

hm, Markus indicated that he tried the v2.6.21.6-cfsv19 patch, and that 
does not include the time.c change. Markus - does your kernel include 
the code below? (if yes, please revert it via patch -p1 -R )

	Ingo

Index: linux/kernel/time.c
===================================================================
--- linux.orig/kernel/time.c
+++ linux/kernel/time.c
@@ -57,14 +57,17 @@ EXPORT_SYMBOL(sys_tz);
  */
 asmlinkage long sys_time(time_t __user * tloc)
 {
-	time_t i;
-	struct timeval tv;
+	/*
+	 * We read xtime.tv_sec atomically - it's updated
+	 * atomically by update_wall_time(), so no need to
+	 * even read-lock the xtime seqlock:
+	 */
+	time_t i = xtime.tv_sec;
 
-	do_gettimeofday(&amp;tv);
-	i = tv.tv_sec;
+	smp_rmb(); /* sys_time() results are coherent */
 
 	if (tloc) {
-		if (put_user(i,tloc))
+		if (put_user(i, tloc))
 			i = -EFAULT;
 	}
 	return i;
@@ -373,6 +376,20 @@ void do_gettimeofday (struct timeval *tv
 
 	tv-&gt;tv_sec = sec;
 	tv-&gt;tv_usec = usec;
+
+	/*
+	 * Make sure xtime.tv_sec [returned by sys_time()] always
+	 * follows the gettimeofday() result precisely. This
+	 * condition is extremely unlikely, it can hit at most
+	 * once per second:
+	 */
+	if (unlikely(xtime.tv_sec != tv-&gt;tv_sec)) {
+		unsigned long flags;
+
+		write_seqlock_irqsave(&amp;xtime_lock);
+		update_wall_time();
+		write_seqlock_irqrestore(&amp;xtime_lock);
+	}
 }
 
 EXPORT_SYMBOL(do_gettimeofday);
-
To: <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, October 16, 2007 - 8:02 pm

Well, I am now running a 2.6.22 (without cfs) and could now see it once 
(within a month...) that exactly the same message from konqueror was 
produced.
So I think its a general problem of konqueror that was hidden and 
somehow its triggered much more often with the cfs.

I just wonder why nobody else has this problem.

   Markus

PS: I am currently building a 2.6.23.1...
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Thursday, August 9, 2007 - 1:34 pm

Well, I am back now, but the problem still exists in 2.6.23-rc2.
And as there is nothing more I can do thats it for now.

   Markus
-
To: Markus <lists4me@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Friday, August 10, 2007 - 3:46 am

could you send me your debug-info:

  http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

just run that script on 2.6.23-rc2 system and send me the file it 
produces. I've got a theory about what might be going on, and this 
debug-info could help prove/disprove it.

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, August 14, 2007 - 1:15 pm

Done by private mail on friday.

Also tried the very current linux-2.6.git (friday aswell) with 
sched-ingo-combo.patch (as told in a private answer mail).

Nothing fixed it so far.

If I can do anything...

   Markus
-
To: Ingo Molnar <mingo@...>, Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Friday, July 20, 2007 - 6:26 pm

&gt; hm, Markus indicated that he tried the v2.6.21.6-cfsv19 patch, and 

As already said, 2.6.22.1-cfs-v19 includes the patch and 
2.6.21.6-cfs-v19 does not include it.
But both suffer of the problem.

I now reversed the patch on 2.6.22.1-cfs-v19 but it does not help.
2.6.22-git15 is not working as well... so the problem did not magically 
disappear like the processes.

Any further things I can do to track it down? (I go on vacation on 
monday for two weeks, but just send them, I'll try them when I am back 
or answer the mails, when I dont need to build a new kernel...)


   Markus
-
To: Markus <lists4me@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, <kfm-devel@...>
Date: Sunday, July 22, 2007 - 7:59 am

it _seems_ to be related to KDE apps most of the time, right? Linus 
analyzed your strace and it suggests that X is dropping the connection. 
I.e. it's user-space who gets confused for some reason - perhaps timing.

Would be nice if a KDE/konqueror person could look at your traces and 
suggest a better debugging method to track this down - because right now 
konqueror just vanishes with only the "konqueror: Fatal IO error: client 
killed" message.

Hopefully i've Cc:-ed the right list: konqueror kfm-devel@mail.kde.org. 
I've attached below Linus' analysis about this mystery 'konqueror just 
vanishes' problem, and the strace too. Could you perhaps mention the kde 
and Xorg versions you are using?

	Ingo

-------------------------------------------


Well, it's preceded by a poll() that says that it has a POLLHUP event, so 
that socket would seem to have simply been closed from the other end. 
There's also a huge amount of select() calls showing the same thing 
(except since it's just the input side, you cannot tell that it's 
POLLHUP).

Don't ask me *why*, though. It's preceded by

	..
	11902 1184699848.615201 read(3, 0x7fffb5b9c8b0, 32) = -1 EAGAIN (Resource temporarily unavailable) &lt;0.000009&gt;
	11902 1184699848.615252 poll([{fd=3, events=POLLIN, revents=POLLIN}], 1, -1) = 1 &lt;0.009307&gt;
	11902 1184699848.624614 read(3, "\1 \303!\0\0\0\0\4\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0b\340T\0\0\0\0\0", 32) = 32 &lt;0.000011&gt;

.. got data ..

	11902 1184699848.624710 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000009&gt;
	11902 1184699848.624762 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000048&gt;

.. ok, nothing more..

	11902 1184699848.624866 select(10, [3 4 5 7 9], [], [], NULL) = 1 (in [3]) &lt;16.495008&gt;
	11902 1184699865.119950 ioctl(3, FIONREAD, [0]) = 0 &lt;0.000006&gt;

16+ seconds pass, now it's marked as readable, but returns zero bytes of 
data: the other end closed it.

Tons of unnecessary and stupid sequences of:

	11902 1184699865.119988 select(10, [3 4 5 7 9], [], [...
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, <kfm-devel@...>
Date: Sunday, July 22, 2007 - 10:26 am

&gt; it _seems_ to be related to KDE apps most of the time, right?
You are right (although some gtk-apps have the same problem), but that 
may be caused by the fact that I dont _see_ other apps vanish... I'll 
I have konqueror 3.5.5 and Xorg 7.2 on a native amd64 system (gcc 4.1.2 
and glibc 2.5).


   Markus
-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 6:03 pm

&gt; hm, Markus indicated that he tried the v2.6.21.6-cfsv19 patch, and 
Well, the 2.6.22.1-cfs-v19 does include it, but the 2.6.21.6-cfs-v19 
does not have that patch applied.
But both show this problem.

   Markus
-
To: Markus <lists4me@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>, Alan Cox <alan@...>
Date: Tuesday, July 17, 2007 - 1:13 pm

oh, and i missed the obvious request: could you start konqueror from a 
terminal and see what it prints when it goes down?

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, July 10, 2007 - 4:08 am

Greetings,


I've taken mainline git tree (freshly integrated CFS!) out for a
multimedia spin.  I tested watching movies and listenign to music in the
presence of various sleep/burn loads, pure burn loads, and mixed loads.
All was peachy here.. I saw no frame drops or sound skips or other
artifacts under any load where the processor could possibly meet demand.

	-Mike

-
To: <linux-kernel@...>
Cc: Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Wednesday, July 11, 2007 - 1:26 pm

I would agree with preliminary testing, save that if you get a lot of 
processes updating the screen at once, there seems to be a notable case 
of processes getting no CPU for 100-300ms, followed by a lot of CPU.

I see this clearly with the "glitch1" test with four scrolling xterms 
and glxgears, but also watching videos with little busy processes on the 
screen. The only version where I never see this in test or with real use 
is cfs-v13.

-- 
Bill Davidsen &lt;davidsen@tmr.com&gt;
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot

-
To: Bill Davidsen <davidsen@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Wednesday, July 11, 2007 - 4:55 pm

just as a test, does this go away if you:

	renice -20 pidof `Xorg`

i.e. is this connected to the way X is scheduled?

Another thing to check would be whether it goes away if you set the 
granularity to some really finegrained value:

    echo 0 &gt; /proc/sys/kernel/sched_wakeup_granularity_ns
    echo 500000 &gt; /proc/sys/kernel/sched_granularity_ns

this really pushes things - but it tests the theory whether this is 
related to granularity.

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Friday, July 13, 2007 - 5:19 pm

Doing this slows down the display rates, but doesn't significantly help 
I didn't test this with standard Xorg priority, I should go back and try 
that. But it didn't really make much difference. The gears and scrolling 
xterms ran slower with Xorg at -20 with any sched settings. I'll do that 
as soon as a build finishes and I can reboot.

I should really go back to 2.6.21.6, 2.6.22 has many bizarre behaviors 
with FC6. Automount starts taking 30% of CPU (unused at the moment), the 
sensors applet doesn't work, etc. I hope over the weekend I can get bug 


-- 
Bill Davidsen &lt;davidsen@tmr.com&gt;
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
-
To: Bill Davidsen <davidsen@...>
Cc: Ingo Molnar <mingo@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ian Kent <raven@...>
Date: Monday, July 16, 2007 - 5:34 pm

Can you confirm whether CFS is involved, i.e. does it spin like that
even without the CFS patch applied?

-
To: Chuck Ebbert <cebbert@...>
Cc: Bill Davidsen <davidsen@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ian Kent <raven@...>
Date: Monday, July 16, 2007 - 5:55 pm

hmmm .... could you take out the kernel/time.c (sys_time()) changes from 
the CFS patch, does that solve the automount issue? If yes, could 
someone take a look at automount and check whether it makes use of 
time(2) and whether it combines it with finer grained time sources?

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 17, 2007 - 1:01 am

Yes it does and I have two reported bugs so far.

In several places I have code similar to:

wait.tv_sec = time(NULL) + 1;
wait.tv_nsec = 0;

signaled = 0;
while (!signaled) {
        status = pthread_cond_timedwait(&amp;cond, &amp;mutex, &amp;wait);
       if (status) {
             if (status == ETIMEDOUT)
                  break;
             fatal(status);
      }
}

lead to automount spinning with strace output a bit like:

futex(0x80034b60, FUTEX_WAKE, 1)  = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 130925919}) = 0
time(NULL)                        = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1)  = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131160876}) = 0
time(NULL)                        = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1)  = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131377080}) = 0
time(NULL)                        = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1)  = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131593297}) = 0
time(NULL)                        = 1184593935
futex(0x80034b60, FUTEX_WAKE, 1)  = 0
clock_gettime(CLOCK_REALTIME, {1184593936, 131871792}) = 0

There should be something like:

futex(0x5555557868c4, FUTEX_WAIT, 5321099, {0, 998091311}) = -1 ETIMEDOUT (Connection timed out)

in there I think.

Ian


-
To: Ian Kent <raven@...>
Cc: Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 17, 2007 - 3:45 am

ah! It passes in a low-res time source into a high-res time interface 
(pthread_cond_timedwait()). Could you change the time(NULL) + 1 to 
time(NULL) + 2, or change it to:

	gettimeofday(&amp;wait, NULL);
	wait.tv_sec++;

does this solve the spinning?

i'm wondering how widespread this is. If automount is the only app doing 
this then _maybe_ we could get away with it by changing automount?

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>
Date: Wednesday, July 18, 2007 - 12:03 pm

Ok, that definitely should work.


This is wrong. It's wrong for two reasons:

 - it really shouldn't be needed. I don't think "time()" has to be 
   *exactly* in sync, but I don't think it can be off by a third of a 
   second or whatever (as the "30% CPU load" would seem to imply)

 - gettimeofday works on a timeval, pthread_cond_timedwait() works on a 
   timespec.

So if it actually makes a difference, it makes a difference for the 
*wrong* reason: the time is still totally nonsensical in the tv_nsec field 
(because it actually got filled in with msecs!), but now the tv_sec field 
is in sync, so it hides the bug.

Anyway, hopefully the patch below might help. But we probably should make 
this whole thing a much more generic routine (ie we have our internal 
"getnstimeofday()" that still is missing the second-overflow logic, and 
that is quite possibly the one that triggers the "30% off" behaviour).

Ingo, I'd suggest:
 - ger rid of "timespec_add_ns()", or at least make it return a return 
   value for when it overflows.
 - make all the people who overflow into tv_sec call a "fix_up_seconds()" 
   thing that does the xtime overflow handling.

		Linus

---
Subject: time: make sure sys_gettimeofday() and sys_time() are in sync
From: Ingo Molnar &lt;mingo@elte.hu&gt;

make sure sys_gettimeofday() and sys_time() results are coherent.

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 kernel/time/timekeeping.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

Index: linux/kernel/time/timekeeping.c
===================================================================
--- linux.orig/kernel/time/timekeeping.c
+++ linux/kernel/time/timekeeping.c
@@ -92,6 +92,19 @@ static inline void __get_realtime_clock_
 	} while (read_seqretry(&amp;xtime_lock, seq));
 
 	timespec_add_ns(ts, nsecs);
+	/*
+	 * Make sure xtime.tv_sec [returned by sys_time()] always
+	 * follows the gettimeofday() result precisely. This
+	 * condition is extremely unlikely, it can hit at m...
To: Linus Torvalds <torvalds@...>
Cc: Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 4:16 am

ah, i didnt notice that automount mixed up timespec with timeval! That 
is nasty and the tv_nsec field (which really is ts_usec to 
pthread_cond_timewait()) must stay cleared - or rather, to avoid bugs of 

yeah, i'll generalize it, but our internal getnstimeofday() used on most 
architectures is using __get_realtime_clock_ns(), and the patch you 
attached already adds the second-overflow logic to it.

there are two versions of getnstimeofday(), a TIME_INTERPOLATION one and 
a !TIME_INTERPOLATION one. TIME_INTERPOLATION is only used on ia64 at 
the moment - and that one indeed does not have the second overflow 

ok, i'll do something clean.

	Ingo
-
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Wednesday, July 18, 2007 - 5:37 pm

Spectacularly no! With this patch the "glitch1" script with multiple 
scrolling windows has all xterms and glxgears stop totally dead for 
~200ms once per second. I didn't properly test anything else after that. 
Since the automount issue doesn't seem to start until something kicks it 

-
To: <davidsen@...>
Cc: Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 10:32 am

Bill, could you try the patch below - does it fix the automount problem, 
without introducing new problems?

	Ingo

---------------&gt;
Subject: time: introduce xtime_seconds
From: Ingo Molnar &lt;mingo@elte.hu&gt;

introduce the xtime_seconds optimization. This is a read-mostly 
low-resolution time source available to sys_time() and kernel-internal 
use. This variable is kept uptodate atomically, and it's monotically 
increased, every time some time interface constructs an xtime-alike time 
result that overflows the seconds value. (it's updated from the timer 
interrupt as well)

this way high-resolution time results update their seconds component at 
the same time sys_time() does it:

 1184858832999989000
 1184858832000000000
 1184858832999992000
 1184858832000000000
 1184858832999996000
 1184858832000000000
 1184858832999999000
 1184858832000000000
 1184858833000003000
 1184858833000000000
 1184858833000006000
 1184858833000000000
 1184858833000009000
 1184858833000000000

 [ these are nsec time results from alternating calls to sys_time() and 
   sys_gettimeofday(), recorded at the seconds boundary. ]

instead of the previous (non-coherent) behavior:

 1184848950999987000
 1184848950000000000
 1184848950999990000
 1184848950000000000
 1184848950999994000
 1184848950000000000
 1184848950999997000
 1184848950000000000
 1184848951000001000
 1184848950000000000
 1184848951000005000
 1184848950000000000
 1184848951000008000
 1184848950000000000
 1184848951000011000
 1184848950000000000
 1184848951000015000

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 include/linux/time.h      |   13 +++++++++++--
 kernel/time.c             |   25 ++++++-------------------
 kernel/time/timekeeping.c |   28 ++++++++++++++++++++++++----
 3 files changed, 41 insertions(+), 25 deletions(-)

Index: linux/include/linux/time.h
===================================================================
--- linux.orig/include/linux/time.h
+++ linux/include/linux/time.h...
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 10:32 pm

Okay, as noted off-list, after I exported the xtime_seconds it now 
builds and works. However, there are a *lot* of "section mismatches" 
which are not reassuring.

Boots, runs, glitch1 test runs reasonably smoothly. automount has not 
used significant CPU yet, but I don't know what triggers it, the bad 
behavior did not happen immediately without the patch. However, it looks 
very hopeful.

Warnings attached to save you the trouble...

-- 
Bill Davidsen &lt;davidsen@tmr.com&gt;
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
To: Ingo Molnar <mingo@...>
Cc: <davidsen@...>, Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 1:06 pm

Doesn't seem to apply against 2.6.22.1, I'm trying 2.6.22.6 as soon as I 

-
To: Ingo Molnar <mingo@...>
Cc: Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 1:26 pm

Applied to 2.6.22-git9, building now.
-
To: <davidsen@...>
Cc: Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 1:42 pm

ok, that's fine too.

	Ingo
-
To: Bill Davidsen <davidsen@...>
Cc: <davidsen@...>, Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 1:17 pm

the patch below is merged against 2.6.22.1-cfs-v19 - does it solve the 
autofs problem (without any other bad side-effects)?

	Ingo

-------------------&gt;
Subject: time: introduce xtime_seconds
From: Ingo Molnar &lt;mingo@elte.hu&gt;

introduce the xtime_seconds optimization. This is a read-mostly 
low-resolution time source available to sys_time() and kernel-internal 
use. This variable is kept uptodate atomically, and it's monotically 
increased, every time some time interface constructs an xtime-alike time 
result that overflows the seconds value. (it's updated from the timer 
interrupt as well)

this way high-resolution time results update their seconds component at 
the same time sys_time() does it:

 1184858832999989000
 1184858832000000000
 1184858832999992000
 1184858832000000000
 1184858832999996000
 1184858832000000000
 1184858832999999000
 1184858832000000000
 1184858833000003000
 1184858833000000000
 1184858833000006000
 1184858833000000000
 1184858833000009000
 1184858833000000000

 [ these are nsec time results from alternating calls to sys_time() and 
   sys_gettimeofday(), recorded at the seconds boundary. ]

instead of the previous (non-coherent) behavior:

 1184848950999987000
 1184848950000000000
 1184848950999990000
 1184848950000000000
 1184848950999994000
 1184848950000000000
 1184848950999997000
 1184848950000000000
 1184848951000001000
 1184848950000000000
 1184848951000005000
 1184848950000000000
 1184848951000008000
 1184848950000000000
 1184848951000011000
 1184848950000000000
 1184848951000015000

Signed-off-by: Ingo Molnar &lt;mingo@elte.hu&gt;
---
 include/linux/time.h      |   13 +++++++++++--
 kernel/time.c             |   25 ++++++-------------------
 kernel/time/timekeeping.c |   26 +++++++++++++++++++++++---
 3 files changed, 40 insertions(+), 24 deletions(-)

Index: linux-cfs-2.6.22.q/include/linux/time.h
===================================================================
--- linux-cfs-2.6.22.q.orig/include...
To: Bill Davidsen <davidsen@...>
Cc: <davidsen@...>, Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 1:10 pm

hm, it's against recent -git.

dont waste your time on 2.6.21.6-cfsv19, it will likely not apply - give 
me a few minutes to create a patch for you against 2.6.22.1-cfsv19, ok?

	Ingo
-
To: <davidsen@...>
Cc: Linus Torvalds <torvalds@...>, Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>
Date: Thursday, July 19, 2007 - 4:53 am

thanks. Andrew also just reported that it broke his laptop and i'm 
working on a proper version.

	Ingo
-
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>
Date: Wednesday, July 18, 2007 - 1:31 pm

Oh ya .. I thought it wouldn't hurt to add the fraction of the current
second for correctness and actually put things like:

gettimeofday(&amp;now, NULL);
wait.tv_sec = now.tv_sec + 1;
wait.tv_nsec = now.tv_usec * 1000;

in autofs.

Ian


-
To: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>, Ian Kent <raven@...>
Cc: Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>
Date: Tuesday, July 17, 2007 - 5:16 pm

This code is horribly broken. Don't change the kernel because this code is
broken.

First it adds a second, but then it subtracts up to a second. Just before
the second boundary, this code can burn CPU like crazy, with each wait being
just a few nanoseconds.

What is the intent of this code? Is it to wait "up to a second, possibly for
no time at all" or is to wait "for at least a second"? If so, why are you
zeroing the nanosecond count?

DS


-
To: <davids@...>
Cc: Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>, Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>
Date: Wednesday, July 18, 2007 - 1:59 am

Yes, adding in the offset within the current second appears to resolve

I don't think the change is unreasonable since I wasn't using an
accurate time in the condition wait, so that's a coding mistake on my
part which I will fix.

Ian


-
To: Ian Kent <raven@...>, Linus Torvalds <torvalds@...>
Cc: <davids@...>, Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>, Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, July 18, 2007 - 3:54 am

thanks Ian for taking care of this and for fixing it!

Linus, Thomas, what do you think, should we keep the time.c change? 
Automount is one app affected so far, and it's a borderline case: the 
increased (30%) CPU usage is annoying, but it does not prevent the 
system from working per se, and an upgrade to a fixed/enhanced automount 
version resolves it.

The temptation of using a really (and trivially) scalable low-resolution 
time-source (which is _easily_ vsyscall-able, on any platform) for DBMS 
use is really large, to me at least. Should i perhaps add a boot/config 
option that enables/disables this optimization, to allow distros finer
grained control about this? And we've also got to wait whether there's
any other app affected.

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Ian Kent <raven@...>, <davids@...>, Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>, Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, July 18, 2007 - 1:23 pm

No, not if it's off by the second field. That 30% CPU usage indicates that 
there's some nasty bug there somewhere, and that's just not worth it.

If time() cannot get the second field right, it's bogus. I'm ok with us 
not *guaranteeing* monotonicity of the second field when you compare 
gettimeofday() with time(), but the 30% thing implies that it's much worse 
than that, and that "time()" will likely report the previous second (when 
compared to hrtimers) roughly a quarter of the time.

And that isn't acceptable. 

So either it should be fixed, or reverted.

		Linus
-
To: Ingo Molnar <mingo@...>
Cc: Ian Kent <raven@...>, Linus Torvalds <torvalds@...>, <davids@...>, Linux-Kernel@Vger. Kernel. Org <linux-kernel@...>, Chuck Ebbert <cebbert@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, July 18, 2007 - 9:50 am

Allow it to be selected by the "features" so that admins can evaluate 
the implications without a reboot?  That would be a convenient interface 
if you could provide it.

-- 
bill davidsen &lt;davidsen@tmr.com&gt;
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-
To: Ingo Molnar <mingo@...>
Cc: Ian Kent <raven@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 17, 2007 - 12:30 pm

Odds are there's at least one other app doing that somewhere.

Would reverting the CFS changes to time.c fix this problem?
That optimization just got merged in 2.6.22 mainline...
-
To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 17, 2007 - 7:17 am

OK, I'm with you, hi-res timer.
But even so, how is the time in the past after adding a second.

Is it because I'm not setting tv_nsec when it's close to a second

I don't have a system to test this on so I'll try to get one of the

I'm happy to change automount but that could cause odd version specific
problems for people updating their kernel on an older installed base.

Aaah .. and they'll all blame me!! ;)

Ian


-
To: Ian Kent <raven@...>
Cc: Chuck Ebbert <cebbert@...>, Bill Davidsen <davidsen@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 17, 2007 - 1:16 pm

yeah, it looks a bit suspicious: you create a +1 second timeout out of a 
1 second resolution timesource. I dont yet understand the failure mode 
though that results in that looping and in the 30% CPU time use - do you 
understand it perhaps? (and automount is still functional while this is 
happening, correct?)

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: Ian Kent <raven@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Tuesday, July 17, 2007 - 9:24 pm

Can't say, I have automount running because I get it by default, but I 
have nothing using at on my test machine. Why is it looping so fast when 
there are no mount points defined? If the config changes there's no 
requirement to notice right away, is there?

-- 
bill davidsen &lt;davidsen@tmr.com&gt;
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-
To: Bill Davidsen <davidsen@...>
Cc: Ingo Molnar <mingo@...>, Chuck Ebbert <cebbert@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, July 18, 2007 - 2:19 am

There are two threads where this mistake is made.

One is used to trigger expire events for all automounted filesystems
which happen all the time since I need to run the expire to check if
anything is mounted and whether it needs to be umounted. The alarm
handler sleeps on a condition until the alarm list in not empty and then
sleeps on a condition until the next alarm in the list expires or an
alarm is added to the list, in which case it then checks the list again.
Since the autofs timeout granularity is one second this is a problem and
will be fixed. This isn't the source of the problem that's been
reported.

The second is the state queue handler which runs tasks such as expires,
map re-reads, shutdowns etc. for all automounted filesystems. While the
check interval could be longer it causes autofs to be slugish in
situations such as shutdowns where there are a largish number of mounts
present and I need to cancel such things as expires and the like. It's
possible I could improve this but, in fact, once the timespec is set
correctly as Ingo suggests it works fine and uses very little resource.

Ian

-
To: Ingo Molnar <mingo@...>
Cc: Chuck Ebbert <cebbert@...>, <linux-kernel@...>, Linus Torvalds <torvalds@...>, Ian Kent <raven@...>
Date: Tuesday, July 17, 2007 - 12:22 am

I will try that, but not until Tuesday night. I've been here too long 
today and have an out-of-state meeting tomorrow. I'll take a look after 
dinner. Note that the latest 2.6.21 with cfs-v19 doesn't have any 
problems of any nature, other than suspend to RAM not working, and I may 
have the config wrong. Runs really well otherwise, but I'll test drive 
Will do.

-- 
bill davidsen &lt;davidsen@tmr.com&gt;
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Thursday, July 12, 2007 - 8:41 am

Partial answer: -10 didn't help with v16, I'll try more boost ASAP, but 
power has been spotty in upstate NY, 20k+ customers with none and the 
rest of us subject to "load shedding" with zero warning, and the test 
machines have UPS but no generator, so I hesitate to use them while 

Will do, you suggested dropping sched_granularity_ns to 1000000 earlier, 
and that didn't do it, but I didn't change the wakeup, and will test 
these values later today.

-- 
bill davidsen &lt;davidsen@tmr.com&gt;
  CTO TMR Associates, Inc
  Doing interesting things with small computers since 1979

-
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Sunday, July 8, 2007 - 1:46 pm

Hi Ingo !


Interestingly, I also noticed the possibility of O(1) task pickup when
playing with v18, but did not detect any noticeable improvement with it.
Of course, it depends on the workload and I probably didn't perform the
most relevant tests.

V19 works very well here on 2.6.20.14. I could start 32k busy loops at
nice +19 (I exhausted the 32k pids limit), and could still perform normal
operations. I noticed that 'vmstat' scans all the pid entries under /proc,
which takes ages to collect data before displaying a line. Obviously, the
system sometimes shows some short latencies, but not much more than what
you get from and SSH through a remote DSL connection.

Here's a vmstat 1 output :

 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
32437  0  0      0 809724    488   6196    0    0     1     0  135     0 24 72 4
32436  0  0      0 811336    488   6196    0    0     0     0  717     0 78 22 0
32436  0  0      0 812956    488   6196    0    0     0     0  727     0 79 21 0
32436  0  0      0 810164    400   6196    0    0     0     0  707     0 80 20 0
32436  0  0      0 815116    400   6436    0    0     0     0  769     0 77 23 0
32436  0  0      0 811644    400   6436    0    0     0     0  715     0 80 20 0

Amusingly, I started mpg123 during this test and it skipped quite a bit.
After setting all tasks to SCHED_IDLE, it did not skip anymore. All this
seems to behave like one could expect.

I also started 30k processes distributed in 130 groups of 234 chained by pipes
in which one byte is passed. I get an average of 8000 in the run queue. The
context switch rate is very low and sometimes even null in this test, maybe
some of them are starving, I really do not know :

 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
7752  0  1      0 656892    244   4196    0    0     0     0  725     0 16 84  0
7781  0  1      0 656768    244   4196    0    0     0     0  724     0 16 84  0
8000  0  1      0 656916    244   41...
To: Willy Tarreau <w@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Monday, July 9, 2007 - 6:39 pm

yeah - it's a small tweak. CFS is O(31) in sleep/wakeup so it's now all 



yeah. It behaves better than i expected in fact - 32K tasks is pushing 

hm, could you profile this? We could have some bottleneck somewhere 
(likely not in the scheduler) with that many tasks being runnable. [ 

well, just a little bit of improvement would be nice to have too :)

	Ingo
-
To: Ingo Molnar <mingo@...>
Cc: <linux-kernel@...>, Linus Torvalds <torvalds@...>, Andrew Morton <akpm@...>, Mike Galbraith <efault@...>, Arjan van de Ven <arjan@...>, Thomas Gleixner <tglx@...>, Dmitry Adamushko <dmitry.adamushko@...>, Srivatsa Vaddagiri <vatsa@...>
Date: Tuesday, July 17, 2007 - 5:44 pm

Hi Ingo,

sorry for the long delay, I've spent a week doing non-kernel work.


Yes, that's what I tried to explain to a guy onc