Re: Poor PostgreSQL scaling on Linux 2.6.25-rc5 (vs 2.6.22)

From: Nick Piggin
Date: Tuesday, March 11, 2008 - 6:21 pm

(Back onto lkml)


This doesn't really help either (at 10ms).

(For the record, I've tried turning SD_WAKE_IDLE, SD_WAKE_AFFINE
on and off for each domain and that hasn't helped either).

I've also tried increasing sched_latency_ns as far as it can go.
BTW, this is a pretty nasty behaviour if you ask my opinion. It
starts *increasing* the number of involuntary context switches
as resources get oversubscribed. That's completely unintuitive as
far as I can see -- when we get overloaded, the obvious thing to
do is try to increase efficiency, or at least try as hard as
possible not to lose it. So context switches should be steady or
decreasing as I add more processes to a runqueue.

It seems to max out at nearly 100 context switches per second,
and this has actually been shown to be too frequent for modern CPUs
and big caches.

Increasing the tunable didn't help for this workload, but it really
needs to be fixed so it doesn't decrease timeslices as the number
of processes increases.

--

From: Peter Zijlstra
Date: Wednesday, March 12, 2008 - 12:58 am

/proc/sys/kernel/sched_min_granularity_ns
/proc/sys/kernel/sched_latency_ns

period := max(latency, nr_running * min_granularity)
slice := period * w_{i} / W
W := \Sum_{i} w_{i}

So if you want to increase the slice length for loaded systems, up
min_granularity.
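
As a rough illustration of the formulas above, here is a minimal Python sketch. The function names and the example values (20 ms latency, 4 ms minimum granularity, equal weights of 1024) are assumptions chosen to make the arithmetic easy to follow; they are not values taken from this thread.

```python
MS = 1_000_000  # nanoseconds per millisecond

# Assumed example tunables, not the defaults of any particular kernel.
LATENCY_NS = 20 * MS
MIN_GRANULARITY_NS = 4 * MS


def cfs_slice_ns(latency_ns, min_granularity_ns, weights, i):
    """Return (period, slice) in ns for task i, following the formulas above."""
    nr_running = len(weights)
    # period := max(latency, nr_running * min_granularity)
    period = max(latency_ns, nr_running * min_granularity_ns)
    # slice := period * w_i / W, with W the sum of all weights
    total_weight = sum(weights)
    slice_ns = period * weights[i] // total_weight
    return period, slice_ns


def equal_weight_slice(nr_running):
    """Slice of one task when nr_running tasks all have the same weight."""
    _, slice_ns = cfs_slice_ns(
        LATENCY_NS, MIN_GRANULARITY_NS, [1024] * nr_running, 0
    )
    return slice_ns


if __name__ == "__main__":
    for n in (1, 2, 5, 10, 100):
        print(n, "runnable:", equal_weight_slice(n) / MS, "ms slice")
```

With these assumed values the slice shrinks from 20 ms to 4 ms as the runqueue grows from 1 to 5 tasks, and then stays at min_granularity because the period starts growing instead. Raising min_granularity raises that floor.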

--

From: Nick Piggin
Date: Sunday, March 16, 2008 - 5:44 pm

OK, but the very concept of reducing efficiency when load increases
is nasty, and leads to nasty feedback loops. It's just a very bad
behaviour to have out of the box, and as a general observation, 10ms
is too short a default timeslice IMO.

I don't see how it is really helpful for interactive processes either.
By definition, if they are not CPU bound, then they should be run
quite soon after waking up; if they are CPU bound, then reducing
efficiency by increasing context switches is effectively going to
increase their latency anyway.

Can this be changed by default, please?

--

From: Ray Lee
Date: Sunday, March 16, 2008 - 10:16 pm

How? Are you saying that switching the granularity to, say, 25ms, will
*decrease* the latency of interactive tasks?

And the efficiency we're talking about reducing here is due to the
fact that tasks are hitting cold caches more times per second when the
granularity is smaller, correct? Or are you concerned by another

Not without benchmarks of interactivity, please. There are far, far
more linux desktops than there are servers. People expect to have to
tune servers (I do, for the servers I maintain). People don't expect
to have to tune a desktop to make it run well.
--

From: Willy Tarreau
Date: Sunday, March 16, 2008 - 10:21 pm

Oh and even on servers, when your anti-virus proxy reaches a load of 800,
you're happy not to have too large a time-slice so that you regularly get
a chance of being allowed to type in commands over SSH.

Large time-slices are needed only in HPC environments IMHO, where only
one task runs.

--

From: Nick Piggin
Date: Monday, March 17, 2008 - 12:19 am

Your ssh session should be allowed to run anyway. I don't see the difference.
If the runqueue length is 100 and the time-slice is (say) 10ms, then if your
ssh only needs an average of 5ms of CPU time per second, then it should be run
next when it becomes runnable. If it wants 20ms of CPU time per second, then
it has to wait for 2 seconds anyway to be run next, regardless of whether

That's silly. By definition if there is only one task running, you don't
care what the timeslice is.

We actually did conduct some benchmarks, and a 10ms timeslice can start
hurting even things like kbuild quite a bit.

But anyway, I don't care what the time slice is so much (although it should
be higher -- if the scheduler can't get good interactive behaviour with a
20-30ms timeslice in most cases then it's no good IMO). I care mostly that
the timeslice does not decrease when load increases.

--

From: Willy Tarreau
Date: Monday, March 17, 2008 - 1:26 am

It's not about what *ssh* uses but about what *others* use. Except by
renicing SSH or marking it real-time, it has no way to say "give the
CPU to me right now, I have something very short to do". So it will
have to wait for the 100 other tasks to eat their 10ms, waiting 1 second
to consume 5ms of CPU (and I was speaking about 800 and not 100).
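
The arithmetic in this argument can be sketched as follows. This is a toy model that assumes strict round-robin with no wakeup preference (the case described here, where a newly runnable task queues behind everyone else); the function names are hypothetical.

```python
import math


def wait_before_running_ms(tasks_ahead, slice_ms):
    """Wait for a newly runnable task if every task ahead uses a full slice."""
    return tasks_ahead * slice_ms


def completion_time_ms(tasks_ahead, slice_ms, work_ms):
    """Rough time until a task has received work_ms of CPU.

    Assumes the task goes to the back of the queue each round and the
    other tasks always consume their whole slice.
    """
    rounds = math.ceil(work_ms / slice_ms)
    return rounds * wait_before_running_ms(tasks_ahead, slice_ms) + work_ms


if __name__ == "__main__":
    print(wait_before_running_ms(100, 10))   # 100 tasks, 10 ms slices
    print(wait_before_running_ms(800, 10))   # 800 tasks, 10 ms slices
    print(completion_time_ms(100, 10, 5))    # 5 ms of work
    print(completion_time_ms(100, 10, 20))   # 20 ms of work
```

Under this model, 100 tasks with 10 ms slices mean a 1 second wait, and 800 tasks mean 8 seconds. Whether a waking task really has to queue like this is exactly the point in dispute in the replies.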

It is one of the situations where I prefer to shorten timeslices when
load increases because it will not slow down the service too much, but
will still provide better interactivity, which is also beneficial to the
service itself since there is no reason for the cycles usage to be the
same for all processes. So by having a finer granularity, small CPU eaters

I mean there's only one important task. There is always a bit of pollution
around it, and interrupting the tasks less often slightly reduces the

On the contrary, I think it's a fundamental requirement if you need to
maintain a reasonable interactivity, and a fair progress between all
tasks. I think it's obvious to understand that the only way to maintain
a constant latency with a growing number of tasks is to reduce the time
each task may spend on the CPU. Contrary to other domains such as network,
you don't know how much time a task will spend on the CPU if you grant an
access to it, and there is no way to know because only the work that this
task will perform will determine if it should run shorter or longer. Fair
scheduling in other areas such as network is "easier" because you know the
size of your packets so you know how much time they will take on the wire.

Here with tasks, the best you can do is estimating based on history. But
it will be very rare when you'll be able to correctly guess and guarantee
that the latency is correct.

Maybe the timeslices should shrink only past a certain load though (I don't
know how it's done today).

Regards,
willy

--

From: Nick Piggin
Date: Monday, March 17, 2008 - 1:54 am

Um, if ssh is not using as much CPU time as the other processes running,
(if it has "something very short to do") then yes it should get the CPU
*right now*, regardless of what the timeslice size is. If it *is* using
as much CPU time as everyone else, then it will have to wait to get time,
just like everybody else; and in that case, lowering the timeslice will
not help matters at all because consider if ssh has to compute for 20ms
before returning control to the user, then with a 10ms timeslice it just
has to wait for 2 slices. So in that case you actually do want a longer
and more efficient timeslice so everybody (including ssh) can get their

I think it is important for many situations, not just HPC.
Just because tpc-c runs are set up so the number of server threads
exactly matches the number of cpus, doesn't mean that real world servers
don't run into lots of different overload conditions. And yes, cache

You are just asserting that shorter timeslices are more interactive.
As far as I know (aside from implementation details of a given scheduler),
that assertion only holds in general for a small number of things like
for example video playing or 3d graphics that adaptively scale back their
output as they get starved for CPU (it might be better to drop every 2nd
frame than to drop 10 frames every 20). I doubt there are many server side
apps like that. What you really need on your server is to give ssh more 
priority than your 800 spam threads. You can do that *properly* with nice
or with this group fairness stuff. Lowering timeslices is basically
shooting in the dark.
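
To make the "give ssh more priority" option concrete, here is a minimal sketch of weight-proportional CPU sharing. It assumes that a nice-0 task has weight 1024 and that each nice level changes the weight by a factor of roughly 1.25; this is an approximation for illustration, not the kernel's exact table, and the function names are hypothetical. It only models tasks that are all continuously runnable.

```python
NICE_0_WEIGHT = 1024      # assumed weight of a nice-0 task
STEP_PER_NICE = 1.25      # assumed ratio between adjacent nice levels


def approx_weight(nice):
    """Approximate scheduling weight for a nice value (lower nice = heavier)."""
    return NICE_0_WEIGHT / (STEP_PER_NICE ** nice)


def cpu_share(my_nice, other_nices):
    """Fraction of CPU one task gets if all tasks are always runnable."""
    mine = approx_weight(my_nice)
    others = sum(approx_weight(n) for n in other_nices)
    return mine / (mine + others)


if __name__ == "__main__":
    spam_threads = [0] * 800
    for nice in (0, -5, -10):
        share = cpu_share(nice, spam_threads)
        print(f"ssh at nice {nice}: {share * 100:.3f}% of the CPU")
```

In this model, renicing changes the share ssh gets without touching the timeslice length at all, which is the distinction being drawn in the paragraph above.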

--

From: Peter Zijlstra
Date: Monday, March 17, 2008 - 2:28 am

Nick,

We do grow the period as the load increases, and this keeps the slice
constant - although it might not be big enough for your taste (but it's
tunable).

Short running tasks will indeed be very likely to be run quickly after
wakeup because wakeups are placed left in the tree. (and when using
sleeper fairness, can get up to a whole slice bonus).

Interactivity is all about generating a scheduling pattern that is easy
on the human brain - that means predictable and preferably with lags <
40ms - as long as the interval is predictable the human brain will patch
up a lot, once it becomes erratic all is out the window. (human
perception of lags is in the 10ms range, but up to 40ms seems to do
acceptable patch up as long as it's predictable).

Due to current desktop bloat, it's important cpu bound tasks are treated
well too. Take for instance scrolling firefox - that utterly consumes
the fastest cpus, still people expect a smooth experience. We get there
by ensuring the scheduler behaviour degrades in a predictable fashion,
and by trying to keep the latency to a sane level.


The thing that seems to trip up this psql thing is the strict
requirement to always run the leftmost task. If all tasks have very
short runnable periods, we start interleaving between all contending
tasks. The way we're looking to solve this is by weakening this leftmost
requirement so that a server/client pair can ping-pong for a while, then
switch to another pair which gets to ping-pong for a while.

This alternating pattern as opposed to the interleaving pattern is much
more friendly to the cache. And we should do it in such a manner that we
still ensure fairness and predictability and such.
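
The difference between the interleaving and the alternating pattern can be illustrated with a toy model. This is not the kernel's algorithm: the `burst` parameter (how many client/server exchanges one pair runs before the next pair gets a turn) and the use of "switches between different pairs" as a stand-in for cache disturbance are assumptions made for this sketch.

```python
def schedule(nr_pairs, exchanges, burst):
    """Return the run order as a list of (pair_id, role) tuples.

    Every pair performs `exchanges` client/server exchanges in total.
    burst=1 models interleaving; a larger burst models a pair that is
    allowed to ping-pong for a while before another pair runs.
    """
    order = []
    remaining = [exchanges] * nr_pairs
    while any(remaining):
        for pair in range(nr_pairs):
            run = min(burst, remaining[pair])
            for _ in range(run):
                order.append((pair, "client"))
                order.append((pair, "server"))
            remaining[pair] -= run
    return order


def cross_pair_switches(order):
    """Count context switches that move from one pair to a different pair."""
    return sum(1 for a, b in zip(order, order[1:]) if a[0] != b[0])


if __name__ == "__main__":
    for burst in (1, 4, 8):
        order = schedule(nr_pairs=4, exchanges=8, burst=burst)
        print("burst", burst, "->", cross_pair_switches(order), "cross-pair switches")
```

Every pair gets the same total amount of work in each case; only the order changes, and the number of switches to an unrelated pair drops as the burst grows.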

The latest sched code contains a few patches in this direction
(.25-rc6), and they seem to have the desired effect on 1 socket single
and dual core and 8 socket single core and dual core. On quad core we
seem to have some load balance problems that destroy the workload in
other interesting ways - looking into that now.

--

From: Nick Piggin
Date: Monday, March 17, 2008 - 2:56 am

Yeah, and firefox scrolling is in the class of workloads where they
adaptively reduce CPU consumption as they get less quota (ie. because
they just start skipping).

Still, for desktop workloads you shouldn't have to deal with lots of
CPU hog processes on the runqueue, so I don't see why this is needed?

I don't mind having the timeslice smallish, but it shouldn't be

Yeah, thanks for looking at that. Wow, scheduler patches sure make
it upstream a lot quicker than when I used to work on the damn thing ;)

I did a quick run and it seems like the postgresql overload problem is
far better if not solved now on my 2x quad core. Haven't had time to
get some reportable results, but I hope to.

--

From: Ingo Molnar
Date: Monday, March 17, 2008 - 3:16 am

here's a performance comparison between 2.6.21 and -rc6, on an
8-socket/16-core system:

   http://redhat.com/~mingo/misc/sysbench-rc6.jpg

                      [transactions/sec, higher is better]

              2.6.21         2.6.25-rc5         2.6.25-rc6
   -------------------------------------------------------
      1:      383.26             270.47             269.69
      2:      741.02             527.67             560.52
      4:     1880.79            1049.59            1184.44
      8:     3815.59            2901.07            3881.78
     16:     8944.81            8993.24            9000.81
     32:     8647.19            8568.66            8638.64
     64:     8058.10            7624.46            8212.92
    128:     6500.06            5804.75            8182.71
    256:     5625.27            3656.52            7661.02

  [ Postgresql 8.3, default scheduler parameters, sysbench parameters: 
    --test=oltp --db-driver=psql --max-time=60 --max-requests=0 
    --oltp-read-only=on. Ask if you need more info about the test. ]

as you can see near and after the saturation point .25 not only has 
fixed any regression but rules the picture and is 35%+ faster at 256 
clients and shows no breakdown at all at high client counts.
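
The relative changes can be recomputed from the table above with a short script. The numbers are copied from the table; the helper name `pct_change` is just a label for this sketch.

```python
# clients: (2.6.21, 2.6.25-rc5, 2.6.25-rc6) in transactions/sec
RESULTS = {
    1: (383.26, 270.47, 269.69),
    2: (741.02, 527.67, 560.52),
    4: (1880.79, 1049.59, 1184.44),
    8: (3815.59, 2901.07, 3881.78),
    16: (8944.81, 8993.24, 9000.81),
    32: (8647.19, 8568.66, 8638.64),
    64: (8058.10, 7624.46, 8212.92),
    128: (6500.06, 5804.75, 8182.71),
    256: (5625.27, 3656.52, 7661.02),
}


def pct_change(new, old):
    """Percentage change of new relative to old."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    print("clients  rc6 vs 2.6.21  rc6 vs rc5")
    for clients, (v21, rc5, rc6) in RESULTS.items():
        print(f"{clients:7d}  {pct_change(rc6, v21):+12.1f}%  {pct_change(rc6, rc5):+9.1f}%")
```

At 256 clients this gives roughly +36% for -rc6 over 2.6.21, in line with the "35%+" figure, while the 1, 2 and 4 client rows remain below 2.6.21, consistent with the "early rampup" remark in this message.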

( i also have to observe that while running with 256 clients overload, 
  the 2.6.25 system was totally serviceable, while 2.6.21 showed bad 
  lags. )

The "early rampup" phase [less than 25% utilized] is still not as good 
as we'd like it to be - our idle balancing force is still a tad too 
strong for this workload. (But that is relatively easy to solve in 
general and we are working on those bits.)

in any case, we welcome any help from you with these tuning efforts. 
It's certainly fun :)

	Ingo
--

From: Nick Piggin
Date: Sunday, March 16, 2008 - 10:34 pm

Secondary issues like the actual cost of context switch, but they are

Linux desktops shouldn't run with massive loads anyway. Tuning the
scheduler to "work" well in an X session when you have a make -j100
in the background is retarded.

But sure, if the scheduler doesn't properly prioritize non-CPU bound
tasks versus CPU bound ones, then it should be fixed to do so.

--
