On Sat, Feb 09, 2008 at 11:29:41PM -0600, Olof Johansson wrote:
No, it's really nothing. I had to push the loop to 1 billion to make the load
noticeable. You don't have 500 threads, you have 2 threads and that load is
repeated 500 times. And if we look at the numbers, let's take the worst one :
113291/500 = 227 microseconds/loop. This is still very low compared to the
smallest timeslice you would have (1 ms at HZ=1000).
So your threads are still completing *before* the scheduler has to preempt
them.
In fact, I don't think it takes longer, I think it does not do it at their
creation, but will do it immediately after the first slice is consumed. This
would explain the important differences here. I don't know how we could ensure
that the new thread is created on the second CPU from the start, though.
I tried inserting a sched_yield() at the top of the busy loop (1M loops).
By default, it did not change a thing. Then I simply set sched_compat_yield
to 1, and the two threads then ran simultaneously with a stable low time
(2700 ms instead of 10-12 seconds).
Doing so with 10k loops (initial test) shows times in the range 240-300 ms
only instead of 2200-6500 ms.
Ingo, would it be possible (and wise) to ensure that a new thread being
created gets immediately rebalanced in order to emulate what is done here
with sched_compat_yield=1 and sched_yield() in both threads just after the
thread creation ? I don't expect any performance difference doing this,
but maybe some shell scripts reliying on short-lived pipes would get faster
on SMP.
In fact, right now the following command uses only 1 CPU :
$ dd if=/dev/urandom of=rnd bs=1M count=1
$ (time bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9|bzip2 -9)<rnd >/dev/null
real 0m6.977s
user 0m6.908s
sys 0m0.072s
Maybe this is due to the block size in bzip2 though, because gzip does not
exhibit exactly the same behaviour :
$ (time gzipi -1|gzip -1|gzip -1|gzip -1|gzip -1|gzip -1)<rnd >/dev/null
real 0m5.300s
user 0m7.392s
sys 0m0.392s
I'm well aware of that. What's important is to be able to explain what is
causing the difference and why the test case does not represent anything
related to performance. Maybe the code author wanted to get 500 parallel
threads and got his code wrong ?
Regards,
willy
--