this is the second release of the CFS (Completely Fair Scheduler)
patchset, against v2.6.21-rc7:

   http://redhat.com/~mingo/cfs-scheduler/sched-cfs-v2.patch

i'd like to thank everyone for the tremendous amount of feedback and
testing the v1 patch got - i could hardly keep up with just reading the
mails! Some of the stuff people addressed i couldn't implement yet, i
mostly concentrated on bugs, regressions and debuggability.

there's a fair amount of churn:

   15 files changed, 456 insertions(+), 241 deletions(-)

But it's an encouraging sign that there was no crash bug found in v1,
all the bugs were related to scheduling-behavior details. The code was
tested on 3 architectures so far: i686, x86_64 and ia64. Most of the
code size increase in -v2 is due to debugging helpers, they'll be
removed later. (The new /proc/sched_debug file can be used to see the
fine details of CFS scheduling.)

Changes since -v1:

 - make nice levels less starvable. (reported by Willy Tarreau)

 - fixed child-runs first. A /proc/sys/kernel/sched_child_runs_first
   flag can be used to turn it on/off. (This might fix the Kaffeine bug
   reported by S.Çağlar Onur)

 - changed SCHED_FAIR back to SCHED_NORMAL (suggested by Con Kolivas)

 - UP build fix. (reported by Gabriel C)

 - timer tick micro-optimization (Dmitry Adamushko)

 - preemption fix: sched_class->check_preempt_curr method to decide
   whether to preempt after a wakeup (or at a timer tick). (Found via a
   fairness-test-utility written for CFS by Mike Galbraith)

 - start forked children with neutral statistics instead of trying to
   inherit them from the parent: Willy Tarreau reported that this
   results in better behavior on extreme workloads, and it also
   simplifies the code quite nicely. Removed sched_exit() and the
   ->task_exit() methods.

 - make nice levels independent of the sched_granularity value

 - new /proc/sched_debug file listing runqueue details and the rbtree

 - new SCH-* fields in ...
On Tuesday 17 April 2007, Ingo Molnar wrote:

Sorry for the delayed response, but i just found some free time. Do you
still want me to test mainline + the "parent-runs first" patch, or shall
i drop that one and test v2, which can change the default behaviour?

--
S.Çağlar Onur <caglar@pardus.org.tr>
http://cekirdek.pardus.org.tr/~caglar/

Linux is like living in a teepee. No Windows, no Gates and an Apache in house!
i suspect for now it would be sufficient if you could check the v2
patch.
if it _works_, please try this:
echo 0 > /proc/sys/kernel/sched_child_runs_first
this should break Kaffeine again :)
(if it doesn't work then the Kaffeine problem is unrelated to
child-runs-first.)
Ingo
-
On Tuesday 17 April 2007, Ingo Molnar wrote:

OK, i tested both plain -rc7 and -rc7 + CFSv2 with
sched_child_runs_first enabled/disabled.

I'm always using the same video file and try to reproduce the freeze by
constantly pressing the forward/backward buttons. With CFS, 2-3
forward/backward attempts reproduce this behaviour.

And here are the results.

Mainline still has no issues with both xine-lib/kaffeine and xine-ui
(kaffeine-0.8.4, xine-lib-1.1.5 [both xcb enabled], xine-ui-0.99.4). I
really tried hard to reproduce the freeze, but i can't...

And CFSv2 still fails for both the child_runs_first and
parent_runs_first cases with the same strace output (FUTEX_WAIT).

If you want me to test something else just ask please :)

Cheers
--
S.Çağlar Onur <caglar@pardus.org.tr>
I have the same problem here (same packages). Even with VLC, if I go forward/backward and then play again it starts to
-
yes, it would be nice to do a:
strace -o kaffine.log -f -tttTTT kaffeine
log. Because in your old log this is visible:
clone(child_stack=0xb02394a4,
flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID,
parent_tidptr=0xb0239bd8, {entry_number:6, base_addr:0xb0239b90,
limit:1048575, seg_32bit:1, contents:0, read_exec_only:0,
limit_in_pages:1, seg_not_present:0, useable:1},
child_tidptr=0xb0239bd8) = 11340
futex(0x89ac218, FUTEX_WAKE, 1) = 1
we cloned a task and immediately afterwards we used futex 0x89ac218.
After that point many things happen, but the lockup itself:
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
futex(0x89ac218, FUTEX_WAIT, 2, NULL) = 0
is the same futex. Probably related to the same child thread? It would
be nice to also get a gdb backtrace:
gdb kaffeine
<reproduce the hang>
Ctrl-C
bt
this should give you a gdb backtrace of that kaffeine hang. Thanks,
Ingo
-
Can I make a suggestion? Would it be possible (from now on) to publish
changes relative to the previous patch (eventually leading to a series
of patches that describes the evolution of the new scheduler) so that
it's easier for us reviewers/critics to see the latest changes? E.g. if
I import such changes into something like quilt (using my gquilt GUI
wrapper, of course :-)) I can then use meld (or similar) to follow
what's going on as suggestions get folded in and bugs get fixed etc.

Thanks
Peter
--
Peter Williams                                   pwil3058@bigpond.net.au

"Learning, n. The kind of ignorance distinguishing the studious."
 -- Ambrose Bierce
-
the v1 patch is still downloadable so you can do a delta by first applying the v1 patch to a quilt queue, doing a 'quilt snapshot', then 'quilt pop', add the v2 patch to the series file, do a 'quilt push', then doing a "quilt diff --snapshot". (I just posted the delta patch in this thread so you can pick it from there too.) Ingo -
This one (v2-rc2) is not a keeper I'm sorry to say, Ingo. v2-rc0 was
much better.

Watching amanda run with htop, kmail's composer is being subjected to 5
to 10 second pauses, and htop says that gzip -best isn't getting more
than 15% of the cpu. The /amandatapes drive is being written to in a
regular pattern that seems to be the cause of the pauses according to
gkrellm, which also seems to track the size of the writes, and can show
anything from 4.3k to 54 megs as being written in one cycle of its
screen update. Normally hdd will fire up and take it at about
40+M/second steady till it's done when there is a file ready to write,
even if it's a 7GB file. And I can type right on during the disk i/o.
But not now.

In short, I seem to be heavily I/O bound. But when the write to
/dev/hdd3 is done, then gzip -best pops right up to 90% plus cpu and I
get my machine back.

In between file writes I checked the drive's speed with hdparm:

[root@coyote ~]# hdparm -Tt /dev/hdd

/dev/hdd:
 Timing cached reads:          856 MB in 2.01 seconds = 426.15 MB/sec
 Timing buffered disk reads:   222 MB in 3.01 seconds =  73.68 MB/sec

That's not too shabby, and obviously dma is active at least for the
reading. gzip -best was running while this was executing. So I think
the drive is fine and the scheduling is what's funkity. Sorry.

--
Cheers, Gene
"There are four boxes to be used in defense of liberty:
 soap, ballot, jury, and ammo. Please use in that order."
-Ed Howdershelt (Author)
After they got rid of capital punishment, they had to hang twice as many
people as before.
-
Hi Gene,

Have you tried the previous version with the fair-fork patch? It might
be possible that your workload is sensitive to the fork()'s child
getting much CPU upon startup.

Ingo, maybe I'm saying something stupid, but in my userland scheduler,
when new tasks are "forked", they are queued at the end of the run queue
with a fixed priority. In our case, this would translate into assigning
them the same prio and timeslice as their parent, but queuing them at
the end so that they don't make existing tasks starve during huge fork()
loads. I don't know how that would be possible (nor if that would help
in anything), but I found it was a good compromise over sharing the
timeslice with the parent.

Perhaps we should have some absolute timeslice and some relative
timeslice (eg: X percent of total time divided by the number of tasks)?

Regards,
Willy
-
Somewhat relevant to this, I have amanda doing a verify phase too.
During the verify phase (and while I was waiting for gmail to transmit
this message, it took 30 minutes before it showed up on the list) I
noted that when amrestore fired up, it and its child tar were only
taking about 20% of the cpu between them, and that /dev/hdd was showing
a pretty steady 55 to 75MB/sec being read. As to what this tells us, I'm
not going to hazard a guess because it wouldn't, this time of the night
here in WV, USA, even be a SWAG. It's coming up on 2am and the
toothpicks holding my eyes open are

Willy, I think that patch went by, and was followed by the v2-rc2 so
fast that I never got a chance to try it with the v2-rc0 framework. So I
believe the answer there is probably no. I never saw a problem with the
v2-rc0, but Ingo shot me a message about it without enough detail that I
could have tested for it.

FWIW, I've been using the CFQ I/O scheduler for quite a while, is it
time I gave the AS or Deadline versions another check? They are all
built in but I

Thanks Willy.

--
Cheers, Gene
-
On Tue, 17 Apr 2007 01:51:08 -0400

easy :)

# cat /sys/block/DEVICE/queue/scheduler
as noop [cfq] ...
# echo IO_SCHED > /sys/block/DEVICE/queue/scheduler

--
Paolo Ornati
Linux 2.6.21-rc7 on x86_64
-
Dunno about that, but here's a possibly related datapoint. I reported to Ingo yesterday that I was sometimes losing control of my GUI (KDE) under heavy IO. I just reproduced it in mainline rc7. If I start a bonnie, and click around popping windows to the foreground, then poke KDE's menu button, I may lose all GUI capability for a _very_ long time. Here, with bonnie, that means until it gets past writing with putc, and moves on to rewrite. Ages. -Mike -
the fair-fork patch is now included in -v2, but that was already in -v2-rc0 too that i sent to Gene separately. I've attached the -rc0->final delta. Gene, could you please apply this patch to your -v2-rc0 tree and do a quick double-check that indeed these changes cause the regression? Ingo
One way of handling forked tasks is to give them a high priority but a
small chunk (i.e. give them a relatively short time to do some work and
surrender the CPU voluntarily before you boot them off). If you choose
the size of this reduced chunk well, the vast majority of tasks will
never be booted off and will do a small bit of work and either exit or
sleep and will suffer no penalty as a result of this mechanism. But it
gives you a chance to move any newly forked process that turns out to be
a CPU hog to a lower priority before it gets its next chunk of CPU, at
which time it can revert to getting normal size chunks as pre-emption
will stop it hogging the CPU from then on.

I've trialled this mechanism in some of my schedulers and it works well.
I found that 10 milliseconds was a good value for the initial chunk of
CPU for a newly forked process.

Peter
-
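[ A rough sketch of the mechanism Peter describes, under stated assumptions. Only the 10 millisecond initial chunk comes from the mail above; struct task, NORMAL_CHUNK_MS, task_forked() and chunk_done() are hypothetical names made up for illustration, and lower prio numbers are assumed to mean higher priority. It is not taken from any of Peter's schedulers. ]

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum {
	INITIAL_CHUNK_MS = 10,	/* value suggested in the mail above */
	NORMAL_CHUNK_MS = 100	/* hypothetical "normal size" chunk */
};

/* Hypothetical, stripped-down task descriptor. */
struct task {
	int prio;		/* lower value = higher priority (assumption) */
	unsigned int chunk_ms;	/* CPU time allowed before being booted off */
	bool fresh;		/* still on its first chunk after fork */
};

/* A new child starts with a high priority but only a short chunk. */
static void task_forked(struct task *t, int high_prio)
{
	assert(t != NULL);
	t->prio = high_prio;
	t->chunk_ms = INITIAL_CHUNK_MS;
	t->fresh = true;
}

/*
 * Called when the task stops running after 'ran_ms' milliseconds.
 * Returns true if it used up its whole chunk (i.e. was booted off).
 * A fresh task that hogs the CPU is demoted to 'normal_prio'; one that
 * slept or exited early keeps its priority and suffers no penalty.
 * Either way it gets normal size chunks from then on (assumption).
 */
static bool chunk_done(struct task *t, unsigned int ran_ms, int normal_prio)
{
	bool used_up = ran_ms >= t->chunk_ms;

	if (t->fresh) {
		t->fresh = false;
		t->chunk_ms = NORMAL_CHUNK_MS;
		if (used_up)
			t->prio = normal_prio;
	}
	return used_up;
}
```

[ The point of keeping the demotion inside the first chunk_done() call is that the decision is made exactly once per fork, before the child gets its next chunk. ]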
ok - fortunately the delta between -v2-rc0 and -v2-final is pretty
small. One difference is the child-runs-first fix. To restore the
parent-runs-first logic, do this:

   echo 0 > /proc/sys/kernel/sched_child_runs_first

does this make any difference? If not then pretty much the only other
change was the nice level tweak i did. Could you try to grab a few
snapshots of scheduling state via something like:

   while sleep 1; do cat /proc/sched_debug >> to-ingo.txt; done

(and tell me the PID of the kmail composer, to make sure i'm checking
the right task's behavior.)

also, as a separate experiment, could you perhaps run this script as
root:

   cd /proc; for N in [1-9]*; do renice -n 0 $N; done

this will move all tasks in the system to nice level 0 and should make
any nice level handling logic in the scheduler irrelevant. Do you have X
reniced perhaps? Lots of system threads have negative or positive nice
levels, so once you have executed this script, only a reboot will be a
practical way to restore it to the previous settings.

	Ingo
-
ok, i've got something better to test: i separated the delta out into a
more fine-grained stack of 3 patches. You can pick them up from:

   http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0.patch
   http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-preempt-fix.patch
   http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-child-runs-first.patch
   http://redhat.com/~mingo/cfs-scheduler/older/sched-cfs-v2-rc0-misc.patch

i test-built and test-booted all 4 steps of this. The baseline -v2-rc0
patch should be the one that works - you might want to double-check it,
just to be sure. One of the other 3 patches on top of this baseline
causes the regression on your desktop.

My current bet is on preempt-fix, so i have put that one first. The
other one would be the second patch, child-runs-first. The misc patch
should have no effect on behavior - but i've included it for
completeness. (and i was wrong about the 'nice fix', it is not in this
delta)

	Ingo
-
Isn't that easier for everyone if you keep them as quilt series (ala syslets)? - Davide -
i _do_ have a quilt tree, but i never had the clean splitup above. Why? Because i worked on all of these aspects (and a whole lot of other aspects as well) in parallel during the past 2 days, back and forth, often mixing changes, etc. and there was never any clean splitup. Now it turned out that the clean splitup of -rc0->final delta would ease Gene's testing so i created it. Note that this is just 30% of the total v1->v2 delta and i just saved the work of having to do a clean splitup of the other 70%. (and note that this splitup will be undone because it makes no sense for any potential upstream merge at all, it's only to ease testing for Gene) Ingo -
Now he tells me. :-) But I have some CHO stuff to do, so it will be about 36
--
Cheers, Gene
-
Ahh, so many cats, and so few recipes here Ingo. In this case cats=patches & recipes=time to test adequately. I do have another box, but it would probably take a week & about a big buck to get that old rh7.3 brought up to date & suitable, and it's only a 500MHZ K-III, which might make the diffs more obvious. It would need a video card to replace its dinosaur Diamond and a fresh dvd drive. And its motherboard has very buggy usb chips. TYAN S-1590.
--
Cheers, Gene
-
Sorry, I did not follow the latest developments, but how many tunables we have so far in CFS? Are those for debug only or they're supposed to stay? Weren't those listed inside the Axis of Evil (just to remain in topic :) till yesterday? - Davide -
Actually I think this is something that makes sense to add, even if just for debugging, but maybe also for production, depending on how much it impacts things. Child runs first is a heuristic optimisation that exploits a VM detail (however fundamental). But for things that don't exec right after forking (and maybe some things that do), it can be nicer to reduce context switches, improve cache patterns, and allow children to be load balanced away before touching memory, if child_runs_first is turned off.
-
yeah, the primary intent was debug. Nick, am i confused to conclude that
mainline in fact runs the _parent_ first, despite all the elaborate
runqueue juggling we do there? This piece of code in wake_up_new_task()
caught my eye:

	p->prio = current->prio;
	p->normal_prio = current->normal_prio;
	list_add_tail(&p->run_list, &current->run_list);
	p->array = current->array;
	p->array->nr_active++;
	inc_nr_running(p, rq);

shouldn't the list_add_tail() be list_add(), so that task pickup sees the
child first? Maybe we still do child-runs-first in practice, due to the
timeslice and sleep average fixups that happen if the parent preempts,
but the above piece of code seems a quite elaborate way of doing
activate_task(). To have the child _before_ the parent we'd need the
add-on patch below. But ... i could be wrong, this is just a quick
thought.
Ingo
---
kernel/sched.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux/kernel/sched.c
===================================================================
--- linux.orig/kernel/sched.c
+++ linux/kernel/sched.c
@@ -1685,7 +1685,7 @@ void fastcall wake_up_new_task(struct ta
 	else {
 		p->prio = current->prio;
 		p->normal_prio = current->normal_prio;
-		list_add_tail(&p->run_list, &current->run_list);
+		list_add(&p->run_list, &current->run_list);
 		p->array = current->array;
 		p->array->nr_active++;
 		inc_nr_running(p, rq);
-
I think that it works because the list we're adding to is not the normal runqueue list head, but the parent's list_head on that runqueue. -
yeah, you are right, i was confused: list_add() adds _after_ the head, list_add_tail() adds _before_ the head - and in the middle of the list if we do a list_add_tail() it adds before that entry. So everything's fine and working as expected :) Ingo -
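[ To illustrate the list semantics discussed above, here is a minimal userspace sketch. It is a stand-in written for this example, not the kernel's include/linux/list.h; list_position() and child_ahead_of_parent() are hypothetical helpers and the queue contents are made up. It shows that list_add_tail() relative to the parent's own node places the child directly ahead of the parent, while list_add() places it directly behind. ]

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for a circular doubly linked list node. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Link 'new' between two nodes that are currently adjacent. */
static void list_insert(struct list_head *new,
			struct list_head *prev, struct list_head *next)
{
	assert(prev->next == next && next->prev == prev);
	next->prev = new;
	new->next = next;
	new->prev = prev;
	prev->next = new;
}

/* Insert 'new' right after 'pos'. */
static void list_add(struct list_head *new, struct list_head *pos)
{
	list_insert(new, pos, pos->next);
}

/* Insert 'new' right before 'pos'. */
static void list_add_tail(struct list_head *new, struct list_head *pos)
{
	list_insert(new, pos->prev, pos);
}

/* 1-based position of 'node' walking forward from 'head', 0 if absent. */
static int list_position(const struct list_head *head,
			 const struct list_head *node)
{
	const struct list_head *p;
	int i = 1;

	for (p = head->next; p != head; p = p->next, i++)
		if (p == node)
			return i;
	return 0;
}

/*
 * Build the queue "a, parent, b", then insert a child relative to the
 * parent's own node. Returns 1 if the child ends up ahead of the parent
 * in pickup order, 0 otherwise.
 */
static int child_ahead_of_parent(int use_tail)
{
	struct list_head queue, a, parent, b, child;

	list_init(&queue);
	list_add_tail(&a, &queue);
	list_add_tail(&parent, &queue);
	list_add_tail(&b, &queue);

	if (use_tail)
		list_add_tail(&child, &parent);	/* a, child, parent, b */
	else
		list_add(&child, &parent);	/* a, parent, child, b */

	return list_position(&queue, &child) < list_position(&queue, &parent);
}
```

[ So the "tail" in list_add_tail() only means "end of the queue" when the second argument is the list head itself; given any other node it simply means "before that node". ]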
yeah, debug only. I strongly suspect the Kaffeine breakage for example
was related to child-runs-first, so userspace developers might be
interested in a switch to turn this on/off.

while reviewing the upstream scheduler it occurred to me that we are
probably _not_ doing child-runs-first there due to the list_add_tail()
[it should be a list_add() for it to be child-first. But i haven't
instrumented this heavily and this portion of the mainline scheduler is
pretty fragile.]. So via this flag we could also see the performance
heh ;)

	Ingo
-
And I let the crf0 version run longer as I was looking for the composer's pid, but htop (or I) can't see it. Even a ps -e isn't seeing it! But it's running, I'm actively typing in it. So you get 3 files, the third one called
--
Cheers, Gene
-
Have you considered using rq->raw_weighted_load instead of
rq->nr_running in calculating fair_clock? This would take the nice value
(or RT priority) of the other tasks into account when determining what's
fair.

Peter
PS You'd have to change the migration thread's load_weight from 0 to 1
in order to prevent divide by zero without having to explicitly check
for it every time.
-
I suspect you mean (curr->load_weight*delta_exec)/rq->raw_weighted_load in update_curr(). -- wli -
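[ A minimal sketch of the expression wli quotes, assuming hypothetical stripped-down stand-ins for the scheduler structures. struct task, struct runqueue and weighted_delta() are names invented for this example, and how update_curr() would actually consume the result is not shown in this thread. ]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins; only the field names come from the thread. */
struct task {
	uint64_t load_weight;		/* weight derived from nice / RT prio */
};

struct runqueue {
	uint64_t raw_weighted_load;	/* sum of load_weight of queued tasks */
};

/*
 * (curr->load_weight * delta_exec) / rq->raw_weighted_load
 *
 * Scales the time the current task just ran by its share of the
 * runqueue's total weight. The divisor must be non-zero, which is the
 * divide-by-zero concern raised in the PS above.
 */
static uint64_t weighted_delta(const struct runqueue *rq,
			       const struct task *curr, uint64_t delta_exec)
{
	assert(rq->raw_weighted_load > 0);
	return (curr->load_weight * delta_exec) / rq->raw_weighted_load;
}
```

[ With three equally weighted tasks queued, a task that ran for 3000 time units is credited a third of that, which matches what dividing by nr_running would give; the two only diverge once the weights differ. ]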
i'll try another thing too: we could perhaps get rid of rq->nr_running
and only use raw_weighted_load, because now the only main remaining
property of ->nr_running is "is it zero or not".

[ ->nr_running's only other significant use is 'group_capacity', but in
  reality it is only interested in whether all CPUs in the group are
  busy and what the combined cpu power of that group is, and this could
  be restructured to use rq->curr and cpu_power - and become independent
  of nr_running. ]

[ then there are other details like load-average, but we could change
  that to be weighted-cpu-load driven - that makes sense anyway: a
  reniced task should have less effect on the 'system load' than a
  non-reniced task. ]

that would be one less variable to maintain in the scheduler hotpath,
and it would make smpnice an effective _replacement_ for nr_running,
instead of an add-on thing that costs a bit of performance.

	Ingo
-
In the longer term, I'd suggest modifying this idea to use the maximum of rq->raw_weighted_load and a running average of rq->raw_weighted_load much the same as was done within the load balancer code. This will tend to make scheduling "smoother". To try the idea out you could (on an SMP system) use one of the rq->cpu_load[] metrics as the running average.

Peter
-
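[ A small sketch of the "maximum of the instantaneous load and a running average" idea, under stated assumptions. update_avg() and smoothed_load() are hypothetical helpers; the shift-based decaying average is an assumption chosen for illustration and is not claimed to be the exact rq->cpu_load[] update used by the load balancer. ]

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical decaying average:
 *   avg = (avg * (2^shift - 1) + sample) / 2^shift
 * A larger shift makes the average react more slowly.
 */
static uint64_t update_avg(uint64_t avg, uint64_t sample, unsigned int shift)
{
	assert(shift > 0 && shift < 32);
	return (avg * ((1ULL << shift) - 1) + sample) >> shift;
}

/*
 * Divisor candidate: whichever is larger, the current weighted load or
 * its running average. A brief dip in load is therefore smoothed out,
 * while a sudden rise takes effect immediately.
 */
static uint64_t smoothed_load(uint64_t raw_weighted_load, uint64_t avg_load)
{
	return raw_weighted_load > avg_load ? raw_weighted_load : avg_load;
}
```

[ Taking the maximum rather than the average alone is what makes this asymmetric: it never under-estimates the load that is on the runqueue right now. ]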
Or something like that, yes. :-) I was trying to make the point that the weighted load stuff provides useful data for implementing nice (in a number of ways e.g. see spa_ebs). Also, now that the old time slices are gone, a simpler more efficient function for mapping RT priority or nice (as appropriate) to p->load_weight can be used instead of the current one which uses the time slice the task would have been allocated as a basis. I'd suggest the function that the current one replaced. (Because it was mine :-)).

Peter
-
Actually, this formula can't be used for the migration thread itself as
its load_weight isn't an accurate reflection of its static priority. But
as the migration thread is a real time task this probably isn't an
issue, right?

If this assumption is correct (i.e. curr is never a real time task) then
my earlier caveat re division by zero being possible is invalid because
the migration task will never be the only task on the runqueue when this
code is called.

I'm also assuming here that (because of its name) curr is already on the
runqueue when this code is called. If it isn't, the divisor in the above
expression should be (rq->raw_weighted_load + curr->load_weight). This
would also preclude the possibility of divide by zero.

Peter
-
