(Back onto lkml)

This doesn't really help either (at 10ms). (For the record, I've tried turning SD_WAKE_IDLE, SD_WAKE_AFFINE on and off for each domain and that hasn't helped either.) I've also tried increasing sched_latency_ns as far as it can go.

BTW, this is pretty nasty behaviour, if you ask my opinion. It starts *increasing* the number of involuntary context switches as resources get oversubscribed. That's completely unintuitive as far as I can see -- when we get overloaded, the obvious thing to do is try to increase efficiency, or at least try as hard as possible not to lose it. So context switches should be steady or decreasing as I add more processes to a runqueue.

It seems to max out at nearly 100 context switches per second, and this has actually been shown to be too frequent for modern CPUs and big caches.

Increasing the tunable didn't help for this workload, but it really needs to be fixed so it doesn't decrease timeslices as the number of processes increases.
--
/proc/sys/kernel/sched_min_granularity_ns
/proc/sys/kernel/sched_latency_ns
period := max(latency, nr_running * min_granularity)
slice := period * w_{i} / W
W := \Sum_{i} w_{i}
So if you want to increase the slice length for loaded systems, up
min_granularity.
--
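The period/slice formulas above can be checked with a few lines of Python. This is only a sketch of the arithmetic as written in the message, not the kernel's code; the function names, the 20ms latency, the 4ms minimum granularity and the equal weight of 1024 are assumptions picked for illustration.

```python
def period_ns(nr_running, latency_ns, min_granularity_ns):
    """period := max(latency, nr_running * min_granularity)"""
    return max(latency_ns, nr_running * min_granularity_ns)


def slice_ns(weight, total_weight, nr_running, latency_ns, min_granularity_ns):
    """slice := period * w_i / W, where W is the sum of all task weights."""
    period = period_ns(nr_running, latency_ns, min_granularity_ns)
    return period * weight // total_weight


# Hypothetical tunable values, chosen only to make the arithmetic easy to read.
LATENCY_NS = 20_000_000
MIN_GRANULARITY_NS = 4_000_000
WEIGHT = 1024  # assumed: every task has the same weight

for nr in (1, 2, 5, 10, 100):
    s = slice_ns(WEIGHT, nr * WEIGHT, nr, LATENCY_NS, MIN_GRANULARITY_NS)
    print(f"nr_running={nr:3d}  slice={s / 1e6:.1f} ms")
```

With equal weights the slice shrinks as latency / nr_running until nr_running reaches latency / min_granularity; past that point the period grows with the load and the slice stays at min_granularity. Raising min_granularity raises that floor.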
OK, but the very concept of reducing efficiency when load increases is nasty, and leads to nasty feedback loops. It's just a very bad behaviour to have out of the box, and as a general observation, 10ms is too short a default timeslice IMO. I don't see how it is really helpful for interactive processes either. By definition, if they are not CPU bound, then they should be run quite soon after waking up; if they are CPU bound, then reducing efficiency by increasing context switches is effectively going to increase their latency anyway. Can this be changed by default, please? --
How? Are you saying that switching the granularity to, say, 25ms, will *decrease* the latency of interactive tasks?

And the efficiency we're talking about reducing here is due to the fact that tasks are hitting cold caches more times per second when the granularity is smaller, correct? Or are you concerned by another

Not without benchmarks of interactivity, please. There are far, far more linux desktops than there are servers. People expect to have to tune servers (I do, for the servers I maintain). People don't expect to have to tune a desktop to make it run well.
--
Oh and even on servers, when your anti-virus proxy reaches a load of 800, you're happy not to have too large a time-slice so that you regularly get a chance of being allowed to type in commands over SSH. Large time-slices are needed only in HPC environments IMHO, where only one task runs. --
Your ssh session should be allowed to run anyway. I don't see the difference. If the runqueue length is 100 and the time-slice is (say) 10ms, then if your ssh only needs an average of 5ms of CPU time per second, then it should be run next when it becomes runnable. If it wants 20ms of CPU time per second, then it has to wait for 2 seconds anyway to be run next, regardless of whether

That's silly. By definition if there is only one task running, you don't care what the timeslice is.

We actually did conduct some benchmarks, and a 10ms timeslice can start hurting even things like kbuild quite a bit. But anyway, I don't care what the time slice is so much (although it should be higher -- if the scheduler can't get good interactive behaviour with a 20-30ms timeslice in most cases then it's no good IMO). I care mostly that the timeslice does not decrease when load increases.
--
It's not about what *ssh* uses but about what *others* use. Except by renicing SSH or marking it real-time, it has no way to say "give the CPU to me right now, I have something very short to do". So it will have to wait for the 100 other tasks to eat their 10ms, waiting 1 second to consume 5ms of CPU (and I was speaking about 800 and not 100).

It is one of the situations where I prefer to shorten timeslices when load increases because it will not slow down the service too much, but will still provide better interactivity, which is also beneficial to the service itself since there is no reason for the cycles usage to be the same for all processes. So by having a finer granularity, small CPU eaters

I mean there's only one important task. There is always a bit of pollution around it, and interrupting the tasks less often slightly reduces the

On the contrary, I think it's a fundamental requirement if you need to maintain reasonable interactivity and fair progress between all tasks. I think it's obvious that the only way to maintain a constant latency with a growing number of tasks is to reduce the time each task may spend on the CPU.

Contrary to other domains such as network, you don't know how much time a task will spend on the CPU if you grant it access, and there is no way to know, because only the work that this task will perform will determine if it should run shorter or longer. Fair scheduling in other areas such as network is "easier" because you know the size of your packets so you know how much time they will take on the wire. Here with tasks, the best you can do is estimate based on history. But it will be very rare that you'll be able to correctly guess and guarantee that the latency is correct.

Maybe the timeslices should shrink only past a certain load though (I don't know how it's done today).

Regards,
willy
--
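The wait described in this message is simple arithmetic. The sketch below assumes the scenario as stated there: a strict round-robin in which a newly runnable task queues behind every other runnable task and each of them uses its full slice. Whether the scheduler actually behaves that way is exactly what is disputed in this thread. The function name is made up for illustration.

```python
def round_robin_wait_ms(nr_other_tasks, slice_ms):
    """Wait before a task at the back of the queue runs,
    assuming every task ahead of it consumes one full slice."""
    return nr_other_tasks * slice_ms


# Task counts come from the message (100 and 800); slice lengths are examples.
for nr_other_tasks in (100, 800):
    for slice_ms in (1, 10, 25):
        wait = round_robin_wait_ms(nr_other_tasks, slice_ms)
        print(f"{nr_other_tasks:3d} tasks x {slice_ms:2d} ms slice -> wait {wait / 1000:.1f} s")
```

Under that assumption the wait scales linearly with both the number of runnable tasks and the slice length, which is the reason given here for shrinking slices under load.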
Um, if ssh is not using as much CPU time as the other processes running (if it has "something very short to do"), then yes, it should get the CPU *right now*, regardless of what the timeslice size is.

If it *is* using as much CPU time as everyone else, then it will have to wait to get time, just like everybody else; and in that case, lowering the timeslice will not help matters at all. Consider: if ssh has to compute for 20ms before returning control to the user, then with a 10ms timeslice it just has to wait for 2 slices. So in that case you actually do want a longer and more efficient timeslice so everybody (including ssh) can get their

I think it is important for many situations, not just HPC. Just because tpc-c runs are set up so the number of server threads exactly matches the number of cpus doesn't mean that real world servers don't run into lots of different overload conditions. And yes, cache

You are just asserting that shorter timeslices are more interactive. As far as I know (aside from implementation details of a given scheduler), that assertion only holds in general for a small number of things, for example video playing or 3d graphics, that adaptively scale back their output as they get starved for CPU (it might be better to drop every 2nd frame than to drop 10 frames every 20). I doubt there are many server side apps like that.

What you really need on your server is to give ssh more priority than your 800 spam threads. You can do that *properly* with nice or with this group fairness stuff. Lowering timeslices is basically shooting in the dark.
--
Nick,

We do grow the period as the load increases, and this keeps the slice constant - although it might not be big enough for your taste (but it's tunable).

Short running tasks will indeed be very likely to be run quickly after wakeup because wakeups are placed left in the tree (and when using sleeper fairness, can get up to a whole slice bonus).

Interactivity is all about generating a scheduling pattern that is easy on the human brain - that means predictable and preferably with lags < 40ms - as long as the interval is predictable the human brain will patch up a lot, once it becomes erratic all is out the window. (Human perception of lags is in the 10ms range, but up to 40ms seems to do acceptable patch up as long as it's predictable.)

Due to current desktop bloat, it's important cpu bound tasks are treated well too. Take for instance scrolling firefox - that utterly consumes the fastest cpus, still people expect a smooth experience.

By ensuring the scheduler behaviour degrades in a predictable fashion, and trying to keep the latency to a sane level.

The thing that seems to trip up this psql thing is the strict requirement to always run the leftmost task. If all tasks have very short runnable periods, we start interleaving between all contending tasks. The way we're looking to solve this is by weakening this leftmost requirement so that a server/client pair can ping-pong for a while, then switch to another pair which gets to ping-pong for a while.

This alternating pattern as opposed to the interleaving pattern is much more friendly to the cache. And we should do it in such a manner that we still ensure fairness and predictability and such.

The latest sched code contains a few patches in this direction (.25-rc6), and they seem to have the desired effect on 1 socket single and dual core and 8 socket single core and dual core. On quad core we seem to have some load balance problems that destroy the workload in other interesting ways - looking into that now.

- ...
Yeah, and firefox scrolling is in the class of workloads where they adaptively reduce CPU consumption as they get less quota (ie. because they just start skipping).

Still, for desktop workloads you shouldn't have to deal with lots of CPU hog processes on the runqueue, so I don't see why this is needed? I don't mind having the timeslice smallish, but it shouldn't be

Yeah, thanks for looking at that. Wow, scheduler patches sure make it upstream a lot quicker than when I used to work on the damn thing ;)

I did a quick run and it seems like the postgresql overload problem is far better if not solved now on my 2x quad core. Haven't had time to get some reportable results, but I hope to.
--
here's a performance comparison between 2.6.21 and -rc6, on an 8-socket/16-core system:

   http://redhat.com/~mingo/misc/sysbench-rc6.jpg

[transactions/sec, higher is better]

 clients      2.6.21    2.6.25-rc5    2.6.25-rc6
 -------------------------------------------------------
    1:        383.26       270.47        269.69
    2:        741.02       527.67        560.52
    4:       1880.79      1049.59       1184.44
    8:       3815.59      2901.07       3881.78
   16:       8944.81      8993.24       9000.81
   32:       8647.19      8568.66       8638.64
   64:       8058.10      7624.46       8212.92
  128:       6500.06      5804.75       8182.71
  256:       5625.27      3656.52       7661.02

[ Postgresql 8.3, default scheduler parameters, sysbench parameters: --test=oltp --db-driver=psql --max-time=60 --max-requests=0 --oltp-read-only=on. Ask if you need more info about the test. ]

as you can see near and after the saturation point .25 not only has fixed any regression but rules the picture and is 35%+ faster at 256 clients and shows no breakdown at all at high client counts.

( i also have to observe that while running with 256 clients overload, the 2.6.25 system was totally serviceable, while 2.6.21 showed bad lags. )

The "early rampup" phase [less than 25% utilized] is still not as good as we'd like it to be - our idle balancing force is still a tad too strong for this workload. (But that is relatively easy to solve in general and we are working on those bits.)

in any case, we welcome any help from you with these tuning efforts. It's certainly fun :)

	Ingo
--
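For anyone wanting to check the percentages, here is a small sketch that recomputes the relative change of 2.6.25-rc6 against 2.6.21 from the transactions/sec figures in the message above. The numbers are copied from that message; the helper name is just for illustration.

```python
# transactions/sec: clients -> (2.6.21, 2.6.25-rc5, 2.6.25-rc6)
RESULTS = {
    1: (383.26, 270.47, 269.69),
    2: (741.02, 527.67, 560.52),
    4: (1880.79, 1049.59, 1184.44),
    8: (3815.59, 2901.07, 3881.78),
    16: (8944.81, 8993.24, 9000.81),
    32: (8647.19, 8568.66, 8638.64),
    64: (8058.10, 7624.46, 8212.92),
    128: (6500.06, 5804.75, 8182.71),
    256: (5625.27, 3656.52, 7661.02),
}


def percent_change(new, old):
    """Relative change of `new` against `old`, in percent."""
    return (new - old) / old * 100.0


for clients, (v2621, rc5, rc6) in RESULTS.items():
    print(f"{clients:4d} clients: rc6 vs 2.6.21 {percent_change(rc6, v2621):+6.1f}%"
          f"   rc6 vs rc5 {percent_change(rc6, rc5):+6.1f}%")
```

At 256 clients the rc6 figure comes out more than 35% above 2.6.21, in line with the claim in the message, while at 1, 2 and 4 clients the change is negative, which matches the remark about the early rampup phase.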
Secondary issues like the actual cost of context switch, but they are

Linux desktops shouldn't run with massive loads anyway. Tuning the scheduler to "work" well in an X session when you have a make -j100 in the background is retarded. But sure, if the scheduler doesn't properly prioritize non-CPU bound tasks versus CPU bound ones, then it should be fixed to do so.
--
