High priority tasks break SMP balancer?

Previous thread: [git pull] scheduler fixes by Ingo Molnar on Friday, November 9, 2007 - 3:06 pm. (1 message)

Next thread: [PATCH 0/11 v3] enable "make ARCH=x86" by Sam Ravnborg on Friday, November 9, 2007 - 4:08 pm. (34 messages)
From: Micah Dowty
Date: Friday, November 9, 2007 - 3:34 pm

I've been investigating a problem recently, in which N runnable
CPU-bound tasks on an N-way machine run on only N-1 CPUs. The
remaining CPU is almost 100% idle. I have seen it occur with both the
CFS and O(1) schedulers.

I've traced this down to what seems to be a quirk in the SMP balancer,
whereby a high-priority thread which spends most of its time sleeping
can artificially inflate the CPU load average calculated for one
processor. Most of the time this CPU is idle (nr_running==0) yet its
CPU load average is much higher than that of any other CPU.

Please find attached a sample program which demonstrates this
behaviour on a 2-way SMP machine. It creates three threads: two are
CPU bound and run at the default priority, the third spends most of
its time sleeping and runs at an elevated priority. It wakes up
frequently (using /dev/rtc) and randomly generates some CPU load.

On my machine (2-way Opteron with a vanilla 2.6.23.1 kernel) this test
program will reliably put the scheduler into a state where one CPU has
both of the busy-looping processes in its runqueue, and the other CPU
is usually idle. The usually-idle CPU will have a very high cpu_load,
as reported by /proc/sched_debug.

Your mileage may vary. On some machines, this test program will only
enter the "bad" state for a few seconds. Sometimes we bounce back and
forth between good and bad states every few seconds. In all cases,
removing the priority elevation fixes the balancing problem.

Is this a behaviour any of the scheduler developers are aware of? I
would be very greatful if anyone could shed some light on the root
cause behind the inflated cpu_load average. If this turns out to be a
real bug, I would be happy to work on a patch.

Thanks in advance,
Micah Dowty
From: Cyrus Massoumi
Date: Friday, November 9, 2007 - 4:56 pm

I tried your program on my machine (C2D, 2.6.17, O(1) scheduler).

Both CPUs are 100% busy all the time. Each busy-looping thread is 
running on its own CPU. I've been watching top output for 10 minutes, 
the spreading is stable and the threads don't bounce at all.


greetings
Cyrus


-

From: Micah Dowty
Date: Friday, November 9, 2007 - 5:11 pm

As I said, YMMV. I haven't been able to find a single set of
parameters for the demo program which cause the problem to occur 100%
of the time on all systems.

In general, boosting the MAINTHREAD_PRIORITY even more and increasing
the WAKE_HZ should exaggerate the problem. These parameters reproduce
the problem very reliably on my system:

#define NUM_BUSY_THREADS            2
#define MAINTHREAD_PRIORITY       -20
#define MAINTHREAD_WAKE_HZ       1024
#define MAINTHREAD_LOAD_PERCENT     5
#define MAINTHREAD_LOAD_CYCLES      2

It's also possible this problem doesn't occur on 2.6.17. I have only
tested this example on 2.6.23.1 and 2.6.20 so far.

Thanks,
--Micah
-

From: Micah Dowty
Date: Wednesday, November 14, 2007 - 11:39 am

I tested a handful of kernel versions, and it looks like this is
indeed the case. As far as I can tell, this problem was introduced
between 2.6.19.7 and 2.6.20.

Looking at the diff between those versions, it looks like there were
some noteworthy changes to the SMP balancer. I'm still working on
narrowing this down further.

Thanks,
--Micah
-

From: Kyle Moffett
Date: Thursday, November 15, 2007 - 11:48 am

First of all, since Ingo Molnar seems to be one of the head scheduler  
gurus, you might CC him on this.  Also added a couple other useful  
CCs for regression reports.


Well from these statistics; if you are requesting wakeups that often  
then it is probably *not* correct to try to move another thread to  
that CPU in the mean-time.  Essentially the migration cost will  
likely far outweigh the advantage of letting it run a little bit of  
extra time, and in addition it will dump out cache from the high- 
priority thread.  As per the description I think that an increased a  
priority and increased WAKE_HZ will certainly cause the "problem" to  
occur more, simply because it reduces the time between wakeups of the  
high-priority process and makes it less helpful to migrate another  
process over to that CPU during the sleep periods.  This will also  
depend on your hardware and possibly other configuration parameters.

I'm not really that much of an expert in this particular area,  
though, so it's entirely possible that one of the above-mentioned  
scheduler head-honchos will poke holes in my argument and give a  
better explanation or a possible patch.

Cheers,
Kyle Moffett

-

From: Micah Dowty
Date: Thursday, November 15, 2007 - 12:14 pm

The real problem, though, is that the high priority thread only needs
a total of a few percent worth of CPU time. The behaviour which I'm
reporting as a potential bug is that this process which needs very
little CPU time is effectively getting an entire CPU to itself,
despite the fact that other CPU-bound threads could benefit from
having time on that CPU.

One could argue that a thread with a high enough priority *should* get
a CPU all to itself even if it won't use that CPU- but that isn't the
behaviour I want. If this is in fact the intended effect of having a
high-priority thread wake very frequently, I should start a different
discussion about how to solve my specific problem without the use of
elevated priorities :)

I don't have any reason to believe, though, that this behaviour was
intentional. I just finished my "git bisect" run, and I found the
first commit after which I can reproduce the problem:

c9819f4593e8d052b41a89f47140f5c5e7e30582 is first bad commit
commit c9819f4593e8d052b41a89f47140f5c5e7e30582
Author: Christoph Lameter <clameter@sgi.com>
Date:   Sun Dec 10 02:20:25 2006 -0800

    [PATCH] sched: use softirq for load balancing
    
    Call rebalance_tick (renamed to run_rebalance_domains) from a newly introduced
    softirq.
    
    We calculate the earliest time for each layer of sched domains to be rescanned
    (this is the rescan time for idle) and use the earliest of those to schedule
    the softirq via a new field "next_balance" added to struct rq.
    
    Signed-off-by: Christoph Lameter <clameter@sgi.com>
    Cc: Peter Williams <pwil3058@bigpond.net.au>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Christoph Lameter <clameter@sgi.com>
    Cc: "Siddha, Suresh B" <suresh.b.siddha@intel.com>
    Cc: "Chen, Kenneth W" <kenneth.w.chen@intel.com>
    Acked-by: Ingo Molnar <mingo@elte.hu>
    Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@osdl.org>

Thanks! I've also CC'ed ...
From: Christoph Lameter
Date: Thursday, November 15, 2007 - 1:07 pm

The kernel crashes? Sounds like your application crashes with a divide by 
zero?


-

From: Micah Dowty
Date: Thursday, November 15, 2007 - 1:24 pm

Yes, the Python test harness crashes, not the kernel. It's just
because on a kernel which exhibits this SMP balancer bug, within a
couple of test iterations I'll hit a case where cpu1 was almost
totally idle and the test harness divides by zero when calculating the
imbalance.

--Micah
-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 2:28 pm

I still have some problem understanding what you are complaining about? 
The patch that you bisected had some issues that were fixed later by 
Siddha.

-

From: Micah Dowty
Date: Thursday, November 15, 2007 - 2:35 pm

On all kernels I've tested from after your patch was committed, I can
reproduce a problem where a single high-priority thread which wakes up
very frequently can artificially inflate the SMP balancer's load
average for one CPU, causing other tasks to be migrated off that
CPU. The result is that this high-priority thread (which may only use
a few percent CPU) gets an entire CPU to itself. Even if there are
several busy-looping threads running, this CPU will be mostly idle.

In the first email I CC'ed you on, I attached a test program that I've
been able to use to reproduce the issue. Feel free to ignore the
Python program- it's just a harness I was using to test for the
problem quickly during my git-bisect run.  to reproduce the
problem. The relevant code is all in priosched.c.

Thanks,
--Micah
-

From: Christoph Lameter
Date: Thursday, November 15, 2007 - 7:31 pm

I am a bit at a loss as to how this could relate to the patch. This looks 
like a load balance logic issue that causes the load calculation to go 
wrong?


-

From: Micah Dowty
Date: Thursday, November 15, 2007 - 7:44 pm

My best guess is that this has something to do with the timing with
which we sample the CPU's instantaneous load when calculating the load
averages.. but I still understand only the basics of the scheduler and
SMP balancer. All I really know for sure at this point regarding your
patch is that git-bisect found it for me.

It almost seems like the load average algorithm is ignoring the CPU's
idle time, and only accounting for the time that CPU spends running
processes. One of the symptoms is that the mostly-idle CPU in my test
has an instantaneous load which is usually zero, but a very high load
average. (9000, 30000, etc.)

I want to help get to the bottom of this issue, but I was hoping that
someone experienced with the Linux scheduler and SMP balancer would
have some insight or some suggestions about what to try next.

Thanks,
--Micah
-

From: Ingo Molnar
Date: Thursday, November 15, 2007 - 11:07 pm

hm, your code uses timeouts for this, right? The CPU load average that 
is used for SMP load balancing is sampled from the scheduler tick - and 
has been sampled from the scheduler tick for eons. v2.6.23 defaulted to 
a different method but v2.6.24 samples it from the tick again. So my 
guess is, your testcode behave similarly on 2.6.22 too, correct?

	Ingo
-

From: Micah Dowty
Date: Friday, November 16, 2007 - 2:19 am

My sample program (and the VMware code which I originally observed
this behaviour with) wakes up frequently using /dev/rtc.

I know the problem occurs on 2.6.20 and 2.6.23.1. I haven't tried
2.6.22 yet, but I'll do so as soon as I can.

Thanks,
--Micah
-

From: Micah Dowty
Date: Friday, November 16, 2007 - 3:48 am

Interesting.. here are the kernels I've tested so far, not including
the git-bisect run I did between 2.6.19 and 2.6.20:

2.6.17 -
2.6.19 -
2.6.19.7 -
2.6.20 +
2.6.21 +
2.6.22 -
2.6.23.1 +

Here a "-" means that the problem does not occur (my test program uses
100% of both CPUs) and a "+" means that the test program leaves one
CPU mostly idle.

Unless I've made a mistake, 2.6.22 seems like the outlier rather than
2.6.23. Is this inconsistent with the scheduler tick hypothesis?

Thanks,
--Micah
-

From: Dmitry Adamushko
Date: Friday, November 16, 2007 - 3:48 am

yeah... but one of the kernels in question is actually 2.6.23.1, which
does use the 'precise' accounting.

Micah,

could you try to change either :

cat /proc/sys/kernel/sched_stat_granularity

put it to the value equal to a tick on your system

or just remove bit #3 (which is responsible for 8 == 1000) here:

cat /proc/sys/kernel/sched_features

(this one is enabled by default in 2.6.23.1)

anyway, when it comes to calculating rq->cpu_load[], a nice(0) cpu-hog
task (on cpu_0) may generate a similar load (contribute to
rq->cpu_load[]) as e.g. some negatively reniced task (on cpu_1) which
runs only periodically (say, once in a tick for N ms., etc.) [*]

The thing is that the higher a prio of the task, the bigger 'weight'
it has (prio_to_wait[] table in sched.c) ... and roughly, the load it
generates is not only 'proportional' to 'run-time per fixed interval
of time' but also to its 'weight'. That's why the [*] above.

so you may have a situation :

cpu_0 : e.g. a nice(-20) task running periodically every tick and
generating, say ~10% cpu load ;

cpu_1 : 2-3 nice(0) cpu-hog tasks ;

both cpus may be seen with similar rq->load_cpu[]... yeah, one would
argue that one of the cpu hogs could be migrated to cpu_0 and consume
remaining 'time slots' and it would not "disturb" the nice(-20) task
as :
it's able to preempt the lower prio task whenever it want (provided,
fine-grained kernel preemption) and we don't care that much of
trashing of caches here.

btw., without the precise load balancing, there can be situations when
the nice(-20) (or say, a RT periodic task) can be even not seen (i.e.
don't contribute to cpu_load[]) on cpu_0...
we do sampling every tick (sched.c :: update_cpu_load()) and consider
this_rq->ls.load.weight at this particular moment (that is the sum of
'weights' for all runnable tasks on this rq)... and it may well be
that the aforementioned high-priority task is just never (or likely,
rarely) runnable at this particular moment (it runs for short ...
From: Micah Dowty
Date: Friday, November 16, 2007 - 3:14 pm

Aha. Turning off bit 3 appears to instantly fix my problem while it's
occurring in an existing process, and I can't reproduce it on any new

Right. I gathered from reading the scheduler source earlier that the
load average is intended to be proportional to the priority of the
task, but I was really confused by the fairly nondeterministic effect

Part of the problem may be that my high-priority task can run much
more often than every tick. In my test case and in the VMware code
that I originally observed the problem in, the thread can wake up
based on /dev/rtc or on a device IRQ. Either of these can happen much

When I try this, cpu0 has a cpu_load[] of over 10000 and cpu1 has a

Yes, that's the behaviour I expected to see (and what my application

Indeed. I think this is the major contributor to the nondeterminism
I'm seeing.

Thanks much,
--Micah
-

From: Micah Dowty
Date: Friday, November 16, 2007 - 6:03 pm

For consistency, I tested this using /dev/rtc. I set the rtc frequency
to 16 Hz, and replaced the main loop of my high (-19) priority thread
with:

   while (1) {
      unsigned long data;

      for (i = 0; i < 3; i++) {
         if (read(rtc, &data, sizeof data) != sizeof data) {
            perror("read");
            return 1;
         }
      }

      fcntl(rtc, F_SETFL, O_NONBLOCK);
      while (read(rtc, &data, sizeof data) < 0);
      fcntl(rtc, F_SETFL, 0);
   }

Now it's busy-looping for 62ms, and sleeping for three consecutive
62.5ms chunks totalling 187.5ms.

The results aren't quite what I was expecting. I have only observed
this so far in test cases where I have a very high wakeup frequency,
so I wasn't expecting this to work. I did, however, still observe the
problem where occasionally I get into a state where one CPU is mostly
idle.

Qualitatively, this feels a bit different. With the higher clock
frequency it seemed like the CPU would easily get "stuck" in this
state where it's mostly idle, and it would stay there for a long
time. With the low wakeup frequency, I'm seeing it toggle between the
busy and mostly-idle states more quickly.

I tried a similar test using usleep() and gettimeofday() rather than
/dev/rtc:

   while (1) {
      usleep(300000);

      gettimeofday(&t1, NULL);
      do {
         gettimeofday(&t2, NULL);
      } while (t2.tv_usec - t1.tv_usec +
               (t2.tv_sec - t1.tv_sec) * 1000000 < 100000);
   }

With this test program, I haven't yet seen a CPU imbalance that lasts
longer than a fraction of a second.

--Micah

-

From: Dmitry Adamushko
Date: Monday, November 19, 2007 - 3:22 pm

You seem to have a configuration with domains which don't have
SD_BALANCE_NEWIDLE  on (CONFIG_NUMA?) as there are no events (all
zeros above) for CPU_NEWLY_IDLE.

this one is being triggered whenever a cpu becomes idle (schedule()
--> idle_balance() --> load_balance_newidle()).

(this flag is a bit #1 == 2)

cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags

?

it can be changed with "echo new_value > ..."

does a change of it make any difference?

moreover, /proc/sys/kernel/sched_domain/cpu1/domain0/newidle_idx seems
to be responsible for a source of the load for calculating the busiest
group. e.g. with newidle_idx == 0, the current load on the queue is

-- 
Best regards,
Dmitry Adamushko
-

From: Micah Dowty
Date: Monday, November 19, 2007 - 4:05 pm

Hmm. I don't have this file on my system:

root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# ls
busy_factor  busy_idx  forkexec_idx  idle_idx  imbalance_pct  max_interval  min_interval  newidle_idx  wake_idx
root@micah-64:/proc/sys/kernel/sched_domain/cpu0/domain0# uname -a
Linux micah-64 2.6.23.1 #1 SMP Fri Nov 2 12:25:47 PDT 2007 x86_64 GNU/Linux

Is there a config option I'm missing?

Thanks,
--Micah

-

From: Ingo Molnar
Date: Monday, November 19, 2007 - 10:57 pm

yes, CONFIG_SCHED_DEBUG.

	Ingo
-

From: Micah Dowty
Date: Tuesday, November 20, 2007 - 11:06 am

I have that one. I even posted the /proc/sched_debug output :)

--Micah

-

From: Dmitry Adamushko
Date: Tuesday, November 20, 2007 - 2:47 pm

ah ok, try applying this patch on top of 2.6.23.1.

btw., what's your system? If I recall right, SD_BALANCE_NEWIDLE is on
by default for all configs, except for NUMA nodes.

(attached a white-space non-damaged version)

---

--- kernel/sched.c-old  2007-11-20 22:33:22.000000000 +0100
+++ kernel/sched.c      2007-11-20 22:37:07.000000000 +0100
@@ -5306,7 +5306,7 @@ set_table_entry(struct ctl_table *entry,
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-       struct ctl_table *table = sd_alloc_ctl_entry(14);
+       struct ctl_table *table = sd_alloc_ctl_entry(12);

        set_table_entry(&table[0], "min_interval", &sd->min_interval,
                sizeof(long), 0644, proc_doulongvec_minmax);
@@ -5326,10 +5326,10 @@ sd_alloc_ctl_domain_table(struct sched_d
                sizeof(int), 0644, proc_dointvec_minmax);
        set_table_entry(&table[8], "imbalance_pct", &sd->imbalance_pct,
                sizeof(int), 0644, proc_dointvec_minmax);
-       set_table_entry(&table[10], "cache_nice_tries",
+       set_table_entry(&table[9], "cache_nice_tries",
                &sd->cache_nice_tries,
                sizeof(int), 0644, proc_dointvec_minmax);
-       set_table_entry(&table[12], "flags", &sd->flags,
+       set_table_entry(&table[10], "flags", &sd->flags,
                sizeof(int), 0644, proc_dointvec_minmax);

        return table;



-- 
Best regards,
Dmitry Adamushko
From: Micah Dowty
Date: Thursday, November 22, 2007 - 12:46 am

It's a dual AMD64 Opteron.

So, I recompiled my 2.6.23.1 kernel without NUMA support, and with
your patch for scheduling domain flags in /proc. It looks like with
NUMA disabled, my test case no longer shows the CPU imbalance
problem. Cool.

With NUMA disabled (and my test running smoothly), the flags show that
SD_BALANCE_NEWIDLE is set:

root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
55

Next I turned it off:

root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu0/domain0/flags
root@micah-64:~# echo 53 > /proc/sys/kernel/sched_domain/cpu1/domain0/flags

Oddly enough, I still don't observe the CPU imbalance problem.

Now I reboot into a kernel which has NUMA re-enabled but which is
otherwise identical. I verify that now I can reproduce the CPU
imbalance again.

root@micah-64:~# cat /proc/sys/kernel/sched_domain/cpu0/domain0/flags
1101

Now I set cpu[10]/domain0/flags to 1099, and the imbalance immediately
disappears. I can reliably cause the imbalance again by setting it
back to 1101, and remove the imbalance by setting them to 1099.

Do these results make sense? I'm not sure I understand how
SD_BALANCE_NEWIDLE could be the whole story, since my /proc/schedstat
graphs do show that we continuously try to balance on idle, but we
can't successfully do so because the idle CPU has a much higher load
than the non-idle CPU. I don't understand how the problem I'm seeing
could be related to the time at which we run the balancer, rather than
being related to the load average calculation.

Assuming the CPU imbalance I'm seeing is actually related to
SD_BALANCE_NEWIDLE being unset, I have a couple of questions:

 - Is this intended/expected behaviour for a machine without
   NEWIDLE set? I'm not familiar with the rationale for disabling
   this flag on NUMA systems.

 - Is there a good way to detect, without any kernel debug flags
   set, whether the current machine has any scheduling domains
   that are missing the SD_BALANCE_NEWIDLE bit? I'm ...
From: Dmitry Adamushko
Date: Thursday, November 22, 2007 - 5:53 am

Look for the 'load_idx' variable in find_busiest_group() and how it's used.

include/linux/topology.h (+ some architectures defines additional
node-definition-structures, but that's primarily for NUMA)

There you can find : 'busy_idx', 'idle_idx' and 'newidle_idx' (amongst
others that affect load-balancing).

Those parameters specify the _source_ of load for determining the
busiest group for !CPU_IDLE, CPU_IDLE and CPU_NEWLY_IDLE cases
respectively.

when 'load_idx == 0' (say, when CPU_NEWLY_IDLE and 'newidle_idx == 0'
or when CPU_IDLE and 'idle_idx == 0') the cpu_load[] table is _not_
taken into account by {source,target}_load(), and the raw load on the
rq is considered instead. In case of CPU_NEWLY_IDLE the raw load is
always 0 (no tasks on the queue).

so for NUMA, newidle_idx = 0 and it also seems to be a generic option
for other variants on 2.6.23.1. Hence, it doesn't matter how big are

for NUMA, 'idle_idx != 0' (i.e. cpu_load[] is taken into account) for
the CPU_IDLE case
and
'newidle_idx == 0' (just because, SD_BALANCE_NEWIDLE is normally not


(1) CPU_IDLE and idle_idx != 0 --> cpu_load[] is considered

(2) CPU_NEWIDLE and newidle_idx == 0 --> raw load is considered (which

As you may see, there are varios node's configs in
include/linux/topology.h that affects load-balancing. Migrations
between NUMA nodes can be more expensive than e.g. between cpus on
'normal' SMP, multi-core, etc. So it should be about being more
conservative when doing load-balancing on some setups.

e.g. in your particular case, what happens is smth like this:

SD_BALANCE_NEWIDLE + newidle_idx == 0

cpu #0: 2 nice(0) cpu-hogs
cpu #1: nice(-15) blocks...
cpu #1: load_balance_newidle() is triggered  (*)
cpu #1: newidle_idx == 0 so the busiest group is on cpu #0
cpu #1: nice(0) (which is != current) is pulled over to cpu #1
cpu #1: nice(0) is running untill it's preempted by nice(-15)
cpu #1: nice(-15) is running
...
in the mean time, rebalance_domain event can be triggered on
cpu ...
From: Micah Dowty
Date: Monday, November 26, 2007 - 12:44 pm

Dmitry,

Thank you for the detailed explanation of the scheduler behaviour I've
been seeing.


Would it be reasonable perhaps to look for the numa_* entries in

The application doesn't really depend on the load-balancer's decisions
per se, it just happens that this behaviour I'm seeing on NUMA systems
is extremely bad for performance.

In this context, the application is a virtual machine runtime which is
executing either an SMP VM or it's executing a guest which has a
virtual GPU. In either case, there are at least three threads:

  - Two virtual CPU/GPU threads, which are nice(0) and often CPU-bound
  - A low-latency event handling thread, at nice(-10)

The event handling thread performs periodic tasks like delivering
timer interrupts and completing I/O operations. It isn't expected to
use much CPU, but it does run very frequently. To get the best latency
performance, we've been running it at nice(-10). The intention is that
this event handling thread should usually be able to preempt either of
the CPU-bound threads.

The behaviour we expect is that on a system which is otherwise lightly
loaded, the two CPU-bound threads each get nearly an entire physical
CPU to themselves. The event handling thread periodically wakes up,
preempts one of the CPU-bound threads briefly, then goes to sleep and
yields the CPU back to one of the CPU-bound threads.

The actual behaviour I see when the load balancer makes this
unfavorable decision is that the two CPU-bound threads share a single
CPU, and the other CPU is mostly idle. Best case, the virtual machine
runs half as fast as it could be running. Worst case it runs much
slower, because the CPU-bound threads often need to wait for each
other to catch up if they don't always run at about the same
rate. This involves a lot of extra synchronization overhead.

My current workaround for this is to avoid ever boosting our thread
priority on a kernel in which this problem could occur. Currently this
is any 2.6.20 or later kernel running ...
From: Dmitry Adamushko
Date: Tuesday, November 27, 2007 - 2:21 am

Are I/O operations initiated by these "virtual CPU/GPU threads"?

If so, would it make sense to have per-CPU event handling threads
(instead of one global)? They would handle I/O operations initiated
from their respective CPUs to (hopefully) achieve better data locality
(esp. if the most part of the involved data is per-CPU).

Then let the load balancer to evenly distribute the "virtual CPU/GPU
threads" or even (at least, as an experiment) fix them to different
CPUs as well?

sure, the scenario is highly dependent on a nature of those
'events'... and I can just speculate here :-) (but I'd imagine

-- 
Best regards,
Dmitry Adamushko
-

From: Micah Dowty
Date: Tuesday, November 27, 2007 - 10:13 am

We already do this when it makes sense to. The high priority "event"
thread is mostly used for asynchronous events- timers (driven by
/dev/rtc), completion of disk or USB I/O, etc. These are things that
happen asynchronously to the virtual CPU, and which may or may not
need to interrupt the virtual CPU at some point.

Another good example is audio. We have another thread (which ideally
is high-priority) which processes audio events asynchronously from the
virtual CPUs. The CPUs manipulate some shared memory to queue up audio
data, but while the audio is playing there is no direct control path
between the two threads. The virtual CPU reads/writes audio registers
whenever it needs to emulate accesses to the audio device, and the
audio thread wakes up any time the hardware needs more samples to
play. The two threads are asynchronous with very little shared
synchronization.

Hopefully that clarifies our situation a bit. Thanks again,
--Micah
-

From: David Newall
Date: Friday, November 16, 2007 - 12:13 pm

There are a couple of points I would make about your python test 
harness.  Your program compares real+system jiffies for both cpus; an 
ideal result would be 1.00.  The measurement is taken over a relatively 
short period of approximately a half-second, and you kill the CPU hogs 
before taking final measurements, even wait for them to die first.  You 
repeat this measurement, starting and killing CPU hogs each time.  Why 
do you do that?

What happens if you start the hogs and take the baseline outside of the 
loop?

    from __future__ import division
    import sys, os, time

    def getCpuTimes():
        cpu0 = 0
        cpu1 = 1
        for line in open("/proc/stat"):
            tokens = line.split()
            if tokens[0] == "cpu0":
                cpu0 = int(tokens[1]) + int(tokens[3])
            elif tokens[0] == "cpu1":
                cpu1 = int(tokens[1]) + int(tokens[3])
        return cpu0, cpu1

    pid = os.spawnl(os.P_NOWAIT, "./priosched")
    baseline = getCpuTimes()
    while True:
        time.sleep(0.5)
        current = getCpuTimes()
        print "%.04f" % (current[0] - baseline[0]) / (current[1] -
    baseline[1])

-

From: Micah Dowty
Date: Friday, November 16, 2007 - 2:38 pm

The Python test harness is fairly artificial, but this is just the
best way I found to reliably reproduce the problem in a short amount
of time. It was just for convenience while running git-bisect. When
running the C program directly, there seems to be a somewhat random
chance that it will start up in the "bad" state. Once the single CPU
is stuck in this mostly-idle mode, it seems to stay that way for a

The problem still occurs then, but killing/restarting the test app
seems to trigger the problem more reliably. As I said in the original
email about this, left to its own devices this problem will occur
seemingly-randomly. In the original VMware code I observed this
problem in, the same process would flip between the "good" and "bad"
states seemingly randomly, every few seconds.

Thanks,
--Micah

-

Previous thread: [git pull] scheduler fixes by Ingo Molnar on Friday, November 9, 2007 - 3:06 pm. (1 message)

Next thread: [PATCH 0/11 v3] enable "make ARCH=x86" by Sam Ravnborg on Friday, November 9, 2007 - 4:08 pm. (34 messages)