Erich Focht submitted a patch to the lkml, fixing a bug in migration threads (in the 2.5.8 development kernel) that lead to a deadlock, or frozen system. In the process of fixing the bug, he also worked to cleanup the initialization of the migration threads.
In an earlier email to the lkml, Ingo Molnar explained migration threads:
"The concept is the following: there arenew per-CPU system threads (so-called migration threads) that handle a per-runqueue 'migration queue'. set_cpus_allowed() registers tasks in the target CPU's migration queue, kicks the migrating thread and wakes up the migration thread. The migrating thread unschedules on its source CPU, at which point the migration thread picks the task up and puts it into the local runqueue."
The recent thread is an interesting read, mostly between Erich Focht, Robert Love and William Lee Irwin III.
From: Erich Focht
Subject: [PATCH] migration thread fix
Date: Thu, 18 Apr 2002 22:08:55 +0200 (CEST)
The patch below applies to the 2.5.8 kernel. It does two things:
1: Fixes a BUG in the migration threads: the interrupts MUST be disabled
before the double runqueue lock is aquired, otherwise this thing will
deadlock sometimes.
2: Streamlines the initialization of migration threads. Instead of
fiddling around with cache_deccay_ticks, waiting for migration_mask bits
and relying on the scheduler to distribute the tasks uniformly among
processors, it starts the migration thread on the boot cpu and uses it to
reliably distribute the other threads to their target cpus.
Please consider applying it!
Thanks,
Erich
From: William Lee Irwin III
Subject: Re: [PATCH] migration thread fix
Date: Thu, 18 Apr 2002 14:28:51 -0700
I have a patch to fix #2 as well. Did you see it? Did you try it?
Cheers,
Bill
From: Robert Love
Subject: Re: [PATCH] migration thread fix
Date: 18 Apr 2002 17:41:04 -0400
On Thu, 2002-04-18 at 17:28, William Lee Irwin III wrote:
> I have a patch to fix #2 as well. Did you see it? Did you try it?
Eric, I am interested in your opinion of wli's patch, too. I really
liked his approach.
You seem to remove a lot of code since, after starting the first thread,
you rely on set_cpus_allowed and the existing migration_thread to push
the task to the correct place. I suppose this will work .. but it may
depend implicitly on behavior of the migration_threads and load_balance
(not that the current code doesn't rely on load_balance - it does).
What happens if a migration_thread comes up on a CPU without a migration
thread and then you call set_cpus_allowed?
I am also curious what causes #1 you mention. Do you see it in the
_normal_ code or just with your patch? I cannot see what we race
against wrt interrupts ... disabling interrupts, however, would disable
load_balance and that is a potential pitfall with using
migration_threads to migrate migration_threads as noted above.
Robert Love
From: Anton Blanchard
Subject: Re: [PATCH] migration thread fix
Date: Fri, 19 Apr 2002 08:07:43 +1000
> I am also curious what causes #1 you mention. Do you see it in the
> _normal_ code or just with your patch?
We were getting lockups with 2.5.5ish on the 32 way which #1 fixed.
Anton
From: Robert Love
Subject: Re: [PATCH] migration thread fix
Date: 18 Apr 2002 18:27:48 -0400
On Thu, 2002-04-18 at 18:07, Anton Blanchard wrote:
> We were getting lockups with 2.5.5ish on the 32 way which #1 fixed.
You were getting lockups even on top of your TASK_INTERRUPTIBLE fix
(which was post-2.5.5)? The migration code sure is delicate...
Robert Love
Bill,
thanks for sending me your patch. No, I didn't see it before.
rl> Eric, I am interested in your opinion of wli's patch, too. I really
rl> liked his approach.
Hmm, as far as I can see, it probably fixes the problems but keeps the
original philosophy of depending on the load balancer to get the
migration tasks scheduled on every CPU.
rl> You seem to remove a lot of code since, after starting the first thread,
rl> you rely on set_cpus_allowed and the existing migration_thread to push
rl> the task to the correct place. I suppose this will work .. but it may
rl> depend implicitly on behavior of the migration_threads and load_balance
rl> (not that the current code doesn't rely on load_balance - it does).
The first thread starts up on the boot cpu with cpus_allowed mask set
to the boot cpu. It will write itself into cpu_rq(0)->migration_thread
and be fully functional because all other existing threads (and
migration_init) will call yield() or schedule(). The other
migration_threads can also run only on cpu#0 and will do so when
migration_CPU0 was completely initialized. Then they move themselves
to their designated CPU by calling set_cpus_allowed(). Their initial
CPU mask beeing 1<<0, they will call migration_CPU0, which is fully
functional at this time and only does the job it was designed
for. There is no dependency on the load balancer any more.
rl> What happens if a migration_thread comes up on a CPU without a migration
rl> thread and then you call set_cpus_allowed?
It doesn't because the cpus_allowed mask is 0x0001.
rl> I am also curious what causes #1 you mention. Do you see it in the
rl> _normal_ code or just with your patch? I cannot see what we race
rl> against wrt interrupts ... disabling interrupts, however, would disable
rl> load_balance and that is a potential pitfall with using
rl> migration_threads to migrate migration_threads as noted above.
I saw the problem #1 when testing an interface to
userspace which migrates tasks from one cpu to another. It happens
pretty easilly that the timer interrupt occurs while the migration
thread is doing its job and holds the current runqueue
lock. scheduler_tick() spinlocks on the own rq->lock which will never
be released by migration_thread. Maybe this occurs pretty seldomly on
IA32 with 10ms ticks, in IA64 it's easy to produce.
Best regards,
Erich
From: Robert Love
Subject: Re: [PATCH] migration thread fix
Date: 18 Apr 2002 18:38:29 -0400
On Thu, 2002-04-18 at 18:24, Erich Focht wrote:
> The first thread starts up on the boot cpu with cpus_allowed mask set
> to the boot cpu. It will write itself into cpu_rq(0)->migration_thread
> and be fully functional because all other existing threads (and
> migration_init) will call yield() or schedule(). The other
> migration_threads can also run only on cpu#0 and will do so when
> migration_CPU0 was completely initialized. Then they move themselves
> to their designated CPU by calling set_cpus_allowed(). Their initial
> CPU mask beeing 1<<0, they will call migration_CPU0, which is fully
> functional at this time and only does the job it was designed
> for. There is no dependency on the load balancer any more.
This seems right. I was skeptical because it is very easy here to rely
on some implicit behavior that is not at all defined. I cannot think of
anything wrong with your approach - good.
> I saw the problem #1 when testing an interface to
> userspace which migrates tasks from one cpu to another. It happens
> pretty easilly that the timer interrupt occurs while the migration
> thread is doing its job and holds the current runqueue
> lock. scheduler_tick() spinlocks on the own rq->lock which will never
> be released by migration_thread. Maybe this occurs pretty seldomly on
> IA32 with 10ms ticks, in IA64 it's easy to produce.
Oh, this is indeed a problem. But perhaps the bigger question is, why
does not double_rq_lock disable interrupts like task_rq_lock? You are
right about rq_lock vs. interrupts - thus interrupts must _always_ be
disabled when holding a runqueue lock.
The only other code that uses double_rq_lock is init_idle, which
correctly disables interrupts. Maybe it is wise to just make
double_rq_lock disable interrupts? If not, at least a comment above the
code: "must disable interrupts before calling!".
Also, this is almost certainly the cause of the preempt race in
set_cpus_allowed. I knew it was not kernel preemptions fault!
Robert Love
From: William Lee Irwin III
Subject: Re: [PATCH] migration thread fix
Date: Thu, 18 Apr 2002 16:27:53 -0700
On Fri, Apr 19, 2002 at 12:24:14AM +0200, Erich Focht wrote:
> Bill,
> thanks for sending me your patch. No, I didn't see it before.
No trouble at all.
On Fri, Apr 19, 2002 at 12:24:14AM +0200, Erich Focht wrote:
> Hmm, as far as I can see, it probably fixes the problems but keeps the
> original philosophy of depending on the load balancer to get the
> migration tasks scheduled on every CPU.
It does keep the original philosophy, and it's not too complicated:
(1) Keeping the bound migration_thread in busywait prevents
the idle task from not being able to migrate an unbound
task from a cpu where task is bound already because the
task it wants to migrate is already running -- basically
the state of busywait locks the cpu
(2) Putting migration_init() to sleep on a completion to free
up the cpu it's running on and waking it up when every
migration_thread has made it home to opens up the cpu
migration_init() was running on quite safely, and that
that progress is tracked by an atomic counter
(3) Getting rid of the timeouts, they lead to livelocks (or
very long waits) when the window while something's
awake is always when it can't get anything done
It's really a very conservative algorithm.
On Fri, Apr 19, 2002 at 12:24:14AM +0200, Erich Focht wrote:
> The first thread starts up on the boot cpu with cpus_allowed mask set
> to the boot cpu. It will write itself into cpu_rq(0)->migration_thread
> and be fully functional because all other existing threads (and
> migration_init) will call yield() or schedule(). The other
> migration_threads can also run only on cpu#0 and will do so when
> migration_CPU0 was completely initialized. Then they move themselves
> to their designated CPU by calling set_cpus_allowed(). Their initial
> CPU mask beeing 1<<0, they will call migration_CPU0, which is fully
> functional at this time and only does the job it was designed
> for. There is no dependency on the load balancer any more.
I'm not sure set_cpus_allowed() is safe to call at this stage. OTOH
setting task->cpus_allowed one way or the other from the argument will
ensure that the only idle task to steal it is the one from the cpu it's
destined for. Basically, it's whether you pick your cpu before or after
you land on it. The idle task will bring you home either way. An extra
push from cpu0 probably doesn't hurt either. =)
The only bit that really scares me is that I know my own patch works
and I don't really truly know that yours does (and never will until I
run it a number of times). And I'll bet you're in a similar position
yourself. The side issues are non-issues to me; I just want kernels that
do a little more than hang at boot so I can hack on other parts of it.
I'll get around to testing it soon, but it's kind of a pain, because
failed attempts didn't fail every time, and the machines where it fails
are not swift to boot... though I imagine you're in a similar position
wrt. testing being a hassle. =)
Let me get some other things out of the way and I'll make time for it.
Though we may have duplicated effort, I do appreciate your work.
Thanks,
Bill
From: Erich Focht
Subject: Re: [PATCH] migration thread fix
Date: Fri, 19 Apr 2002 03:51:29 +0200 (CEST)
Hi Bill,
> It does keep the original philosophy, and it's not too complicated:
[...]
>
> It's really a very conservative algorithm.
thanks for the explanations. Usually I'm also conservative with changes.
I'm currently working on a node affine scheduler extension for NUMA
machines and the load balancer behaves a bit different from the original.
So after a few boot failures with those slowly booting 16 CPU IA64
machines I thought there must be a simpler solution than synchronizing and
waiting for the load balancer: just let migration_CPU0 do what it is
designed for. So my proposal is:
- start all migration threads on CPU#0
- initialize migration_CPU0 (trivial, reliable, as it already is on
the right CPU)
- let all other migration threads use set_cpus_allowed() to get to the
right place
The only synchronization needed is the non-zero migration threads waiting
for migration_CPU0 to start working, which it will, as it is already on
the right CPU. This saves quite some lines of code.
I first posted this to LKML on March 6th (BTW, the fix #1, too) and since
then it was tested on several big NUMA platforms: 16 CPU NEC AzusA (IA64)
(also known as HP rx....), up to 32 CPU SGI IA64, 16 CPU IBM NUMA-Q
(IA32). No more lock-ups at boot since then. So I consider it working.
There is another good reason for this approach: the integration of the CPU
hotplug patch with the new scheduler becomes easier. One just needs to
create the new migration thread, it will move itself to the right CPU
without any additional magic (which you otherwise need because of the
synchronizations which won't be there at hotplug). Kimi Suganuma in the
neighboring cube is fiddling this out currently.
> yourself. The side issues are non-issues to me; I just want kernels that
> do a little more than hang at boot so I can hack on other parts of it.
> I'll get around to testing it soon, but it's kind of a pain, because
> failed attempts didn't fail every time, and the machines where it fails
> are not swift to boot... though I imagine you're in a similar position
> wrt. testing being a hassle. =)
:-) I know exactly what you mean...
Best regards,
Erich
PS: attached is the patch against Linus' latest kernel. This patch is only
for the migration_init change (#2), the fix for problem #1 is in the patch
sent by Robert, it has to be applied before.
From: William Lee Irwin III
Subject: Re: [PATCH] migration thread fix
Date: Thu, 18 Apr 2002 19:30:15 -0700
At some point in the past, I wrote:
>> It does keep the original philosophy, and it's not too complicated:
> [...]
>> It's really a very conservative algorithm.
On Fri, Apr 19, 2002 at 03:51:29AM +0200, Erich Focht wrote:
> thanks for the explanations. Usually I'm also conservative with changes.
Perhaps I was especially so, as I really have only one goal in mind,
that is, getting it to boot. I have, of course, no intention of
interfering with the development of new functionality.
On Fri, Apr 19, 2002 at 03:51:29AM +0200, Erich Focht wrote:
> I'm currently working on a node affine scheduler extension for NUMA
> machines and the load balancer behaves a bit different from the original.
> So after a few boot failures with those slowly booting 16 CPU IA64
> machines I thought there must be a simpler solution than synchronizing and
> waiting for the load balancer: just let migration_CPU0 do what it is
> designed for. So my proposal is:
I've seen a few posts of yours on the subject, though I'm sorry to say I
haven't followed it closely, as there is quite a bit of interesting work
going on these days in the scheduler. My main interests are elsewhere.
On Fri, Apr 19, 2002 at 03:51:29AM +0200, Erich Focht wrote:
> I first posted this to LKML on March 6th (BTW, the fix #1, too) and since
> then it was tested on several big NUMA platforms: 16 CPU NEC AzusA (IA64)
> (also known as HP rx....), up to 32 CPU SGI IA64, 16 CPU IBM NUMA-Q
> (IA32). No more lock-ups at boot since then. So I consider it working.
Sounds fairly thoroughly tested; this is actually more systems than I
myself have access to. Just to sort of doublecheck the references, is
there a mailing list archive where I can find reports of this problem
and/or successes of others using it?
On Fri, Apr 19, 2002 at 03:51:29AM +0200, Erich Focht wrote:
> There is another good reason for this approach: the integration of the CPU
> hotplug patch with the new scheduler becomes easier. One just needs to
> create the new migration thread, it will move itself to the right CPU
> without any additional magic (which you otherwise need because of the
> synchronizations which won't be there at hotplug). Kimi Suganuma in the
> neighboring cube is fiddling this out currently.
Given that it's been tested quite a bit, on a superset even of
platforms I'm explicitly trying to keep working I don't have any qualms
left. The code is correct from my review of it, it's been tested on my
machines, and he's trying to get something done that could make use of
some of the properties of the algorithm (the hotplug stuff).
Looks good to me.
Cheers,
Bill
From: Erich Focht
Subject: Re: [PATCH] migration thread fix
Date: Fri, 19 Apr 2002 10:09:35 +0200 (CEST)
Hi Bill,
> Sounds fairly thoroughly tested; this is actually more systems than I
> myself have access to. Just to sort of doublecheck the references, is
> there a mailing list archive where I can find reports of this problem
> and/or successes of others using it?
there is some email exchange with Jesse Barnes from SGI on the LSE and
linux-ia64 mailing lists.
Date: 1.-5. March, 2002
Subject: Re: [Linux-ia64] O(1) scheduler K3+ for IA64
The success report was a personal email.
On LSE and linux-kernel there were some emails related to the node affine
scheduler which contains the same migration mechanism (but a different
load balancer):
Date: 13. March and later
Subject: Node affine NUMA scheduler
Besides, Matt Dobson from IBM adapted the node affine scheduler to work
on NUMA-Q and tested it quite a bit, that email exchange was direct, too.
The testing on our side wasn't publicized, either, but our production
kernel for AzusA (which contains these patches) is about to be sent out to
customers and has undergone quite some testing.
It's a pity that there's so much duplicated effort in the Linux community,
but that's how it goes, you probably know it better than I do.
Best regards,
Erich
In conclusion...
Since this thread ended up going off-list, I thought I should share the results we reached to conclude this.
First, Linus took wli's migration_init rewrite before this thread began. I sent Linus a patch, based off his tree (wli's patch), to solve the interrupt issue Erich pointed out.
Erich reposted another copy of his patch, this time for use on top of the patch I sent. Linus liked it and wli and I agreed it certainly was simple. I think the biggest selling point is how much code it removed. Linus merged the patch.
So we now have Erich's migration_init fixes and the interrupt fix. This is in Linus's BK tree now and will appear in the next patch.
Robert Love