Re: [patch] sched: prevent bound kthreads from changing cpus_allowed

Previous thread: RE: Vpools and v3.0-UPSTREAM LIO by Nicholas A. Bellinger on Thursday, June 5, 2008 - 3:36 pm. (1 message)

Next thread: 2008/6/6 - 4:3:36 by 胡育茹 on Thursday, June 5, 2008 - 4:03 pm. (1 message)
To: Ingo Molnar <mingo@...>
Cc: Peter Zijlstra <peterz@...>, Paul Menage <menage@...>, Paul Jackson <pj@...>, <linux-kernel@...>
Date: Thursday, June 5, 2008 - 3:57 pm

Kthreads that have called kthread_bind() are bound to specific cpus, so
other tasks should not be able to change their cpus_allowed from under
them. Otherwise, it is possible to move kthreads, such as the migration
or software watchdog threads, so they are not allowed access to the cpu
they work on.

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Paul Menage <menage@google.com>
Cc: Paul Jackson <pj@sgi.com>
Signed-off-by: David Rientjes <rientjes@google.com>
---
include/linux/sched.h | 1 +
kernel/cpuset.c | 14 +++++++++++++-
kernel/kthread.c | 1 +
kernel/sched.c | 6 ++++++
4 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1504,6 +1504,7 @@ static inline void put_task_struct(struct task_struct *t)
#define PF_SWAPWRITE 0x00800000 /* Allowed to write to swap */
#define PF_SPREAD_PAGE 0x01000000 /* Spread page cache over cpuset */
#define PF_SPREAD_SLAB 0x02000000 /* Spread some slab caches over cpuset */
+#define PF_THREAD_BOUND 0x04000000 /* Thread bound to specific cpu */
#define PF_MEMPOLICY 0x10000000 /* Non-default NUMA mempolicy */
#define PF_MUTEX_TESTER 0x20000000 /* Thread belongs to the rt mutex tester */
#define PF_FREEZER_SKIP 0x40000000 /* Freezer should not count it as freezeable */
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1190,6 +1190,15 @@ static int cpuset_can_attach(struct cgroup_subsys *ss,

if (cpus_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed))
return -ENOSPC;
+ if (tsk->flags & PF_THREAD_BOUND) {
+ cpumask_t mask;
+
+ mutex_lock(&callback_mutex);
+ mask = cs->cpus_allowed;
+ mutex_unlock(&callback_mutex);
+ if (!cpus_equal(tsk->cpus_allowed, mask))
+ return -EINVAL;
+ }

return security_task_setscheduler(tsk, 0, NULL);
}
@@ -1203,11 +1212,14 @@ static void cpus...

To: David Rientjes <rientjes@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Paul Menage <menage@...>, Paul Jackson <pj@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 1:47 pm

What if user-space moves this task right before "|= PF_THREAD_BOUND" ?

Oleg.

--

To: David Rientjes <rientjes@...>
Cc: Peter Zijlstra <peterz@...>, Paul Menage <menage@...>, Paul Jackson <pj@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 6:28 am

applied to tip/sched-devel, thanks David.

Ingo
--

To: David Rientjes <rientjes@...>
Cc: <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>
Date: Thursday, June 5, 2008 - 5:47 pm

Ok - looks good to me. (I too have shot my foot off
moving tasks that shouldn't be moved - thanks.)

Reviewed-by: Paul Jackson <pj@sgi.com>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: David Rientjes <rientjes@...>
Cc: Ingo Molnar <mingo@...>, Peter Zijlstra <peterz@...>, Paul Menage <menage@...>, Paul Jackson <pj@...>, <linux-kernel@...>
Date: Thursday, June 5, 2008 - 4:52 pm

I'm all for it .. I've crashed test systems just by migrating tasks that
are suppose to stay bound while using cpusets ..

Daniel

--

To: David Rientjes <rientjes@...>
Cc: <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>
Date: Thursday, June 5, 2008 - 4:29 pm

A couple of questions on this:

1) Sometimes threads are bound to a set of CPUs, such as the CPUs
on a particular node:

drivers/pci/pci-driver.c: set_cpus_allowed_ptr(current, nodecpumask);
net/sunrpc/svc.c: set_cpus_allowed_ptr(current, nodecpumask);

Such cases can't invoke kthread_bind(), as that only binds to a single CPU.
I only see one place in your patch that sets PF_THREAD_BOUND; would it make
sense for such multi-CPU binds as above to be PF_THREAD_BOUND as well?

2) Sometimes calls to kthread_bind are binding to any online cpu, such as in:

drivers/infiniband/hw/ehca/ehca_irq.c: kthread_bind(cct->task, any_online_cpu(cpu_online_map));

In such cases, the PF_THREAD_BOUND seems inappropriate. The caller of
kthread_bind() really doesn't seem to care where that thread is bound;
they just want it on a CPU that is still online.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <pj@sgi.com> 1.940.382.4214
--

To: Paul Jackson <pj@...>
Cc: <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>
Date: Thursday, June 5, 2008 - 5:12 pm

Not in the drivers/pci/pci-driver.c case because the setting of
cpus_allowed to nodecpumask is only temporary (as is the disabling of the
mempolicy). It's just done for the probe callback and then reset to the
old cpumask. So any migration here would be lost.

I can't speculate about the net/sunrpc/svc.c case because I don't know if
user migration is harmful or discouraged. The behavior with my patch is
the same for any calls to set_cpus_allowed_ptr() for tasks that haven't
called kthread_bind(), so I'll leave that decision up to those who know

This particular case is simply moving the thread to any online cpu so that
it survives long enough for the subsequent kthread_stop() in
destroy_comp_task(). So I don't see a problem with this instance.

A caller to kthread_bind() can always remove PF_THREAD_BOUND itself upon
return, but I haven't found any cases in the tree where that is currently
necessary. And doing that would defeat the semantics of kthread_bind()
where these threads are supposed to be bound to a specific cpu and not
allowed to run on others.

David
--

To: David Rientjes <rientjes@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>, Oleg Nesterov <oleg@...>
Date: Monday, June 9, 2008 - 4:59 pm

Actually I have another use case here. Above example in particular may be ok
but it does demonstrate the issue nicely. Which is that in some cases kthreads
are bound to a CPU but do not have a strict "must run here" requirement and
could be moved if needed.
For example I need an ability to move workqueue threads. Workqueue threads do
kthread_bind().
So how about we add something like kthread_bind_strict() which would set
PF_THREAD_BOUND ?
We could also simply add flags argument to the kthread_bind() which would be
better imo but requires more changes. ie It'd look like
kthread_bind(..., cpu, KTHREAD_BIND_STRICT);

Things like migration threads, stop machine, etc would use the strict version
and everything else would use non-strict bind.

---
On the related note (this seems like the right crowd :). What do people think
about kthreads and cpusets in general. We currently have a bit of a disconnect
in the logic.
1. kthreads can be put into a cpuset at which point their cpus_allowed mask is
updated properly
2. kthread's cpus_allowed mask is updated properly when cpuset setup changes
(cpus added, removed, etc).
3. kthreads inherit cpuset from a parent (kthreadd for example) _but_ they
either do kthread_bind() or set_cpus_allowed() and both of those simply ignore
inherited cpusets.

Notice how scenario #3 does not fit into the overall picture. The behaviour is
inconsistent.
How about this:
- Split sched_setaffinity into

sched_setaffinity()
{
task *p = get_task_by_pid();
return task_setaffinity(p);
}

task_setaffinity(task, cpumask, flags)
{
if (flags & FORCE) {
// Used for kthreads that require strict binding.
// Detach the task from the current cpuset
// and put it into the root cpuset.
// Set PF_THREAD_BOUND.
}

// Rest of the original sched_setaffinity logic
}

- Have kthreads call task_setaffinity() instead of set_cpus_allowed() directly.
That way the behaviour will be consistent across the board.

Comments ?...

To: Max Krasnyanskiy <maxk@...>
Cc: David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <mingo@...>, <menage@...>, <linux-kernel@...>, Oleg Nesterov <oleg@...>
Date: Tuesday, June 10, 2008 - 2:44 am

Per cpu workqueues should stay on their cpu.

What you're really looking for is a more fine grained alternative to
flush_workqueue().

--

To: Peter Zijlstra <peterz@...>
Cc: David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <mingo@...>, <menage@...>, <linux-kernel@...>, Oleg Nesterov <oleg@...>
Date: Tuesday, June 10, 2008 - 11:38 am

Actually I had a discussion on that with Oleg Nesterov. If you remember my
original solution (ie centralized cpu_isolate_map) was to completely redirect
work onto other cpus. Then you pointed out that it's the flush_() that really
makes the box stuck. So I started thinking about redoing the flush. While
looking at the code I realized that if I only change the flush_() then queued
work can get stale so to speak. ie Machine does not get stuck but some work
submitted on the isolated cpus will sit there for a long time. Oleg pointed
out exact same thing. So the simplest solution that does not require any
surgery to the workqueue is to just move the threads to other cpus. I did not
want to get into too much detail on the workqueue stuff here. I'll start a
separate thread on this.
As I pointed out, there are a bunch of other kthreads like: kswapd, kacpid,
pdflush, khubd, etc, etc, that clearly do not need any pinning but still
violate cpuset constraints they inherit from kthreadd.

Max

--

To: Max Krasnyansky <maxk@...>
Cc: Peter Zijlstra <peterz@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <mingo@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 1:00 pm

Cough... I'd like to mention that I _personally agree with Peter, cwq->thread's
should stay on their cpu.

I just meant that from the workqueue.c pov it is (afaics) OK to move cwq->thread
to other CPUs, in a sense that this shouldn't add races or hotplug problems, etc.
But still this doesn't look right to me.

Oleg.

--

To: Oleg Nesterov <oleg@...>
Cc: Peter Zijlstra <peterz@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <mingo@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 2:00 pm

I never argued against the _should stay_ ;-). What I'm arguing against is the
Yeah, it's all about perceptions. We'll fix that ;-).

Max

--

To: Oleg Nesterov <oleg@...>
Cc: Max Krasnyansky <maxk@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <mingo@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 1:19 pm

The advantage of creating a more flexible or fine-grained flush is that
large machine also profit from it.

A simple scheme would be creating a workqueue context that is passed
along on enqueue, and then passed to flush.

This context could:

- either track the individual worklets and employ a completion scheme
to wait for them;

- or track on which cpus the worklets are enqueued and flush only those
few cpus.

Doing this would solve your case since nobody (except those having
business) will enqueue something on the isolated cpus.

And it will improve the large machine case for the same reasons - it
won't have to iterate all cpus.

Of course, things that use schedule_on_each_cpu() will still end up
doing things on your isolated cpus, but getting around those would
probably get you into some correctness trouble.

--

To: Peter Zijlstra <peterz@...>, Oleg Nesterov <oleg@...>, <mingo@...>, Andrew Morton <akpm@...>
Cc: David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>
Date: Tuesday, June 10, 2008 - 4:24 pm

Here is some backgound on this. Full cpu isolation requires some tweaks to the
workqueue handling. Either the workqueue threads need to be moved (which is my
current approach), or work needs to be redirected when it's submitted.
flush_*_work() needs to be improved too. See Peter's reply below.

First reaction that a lot of people get is "oh no no, this is bad, this will
not work". Which is understandable but _wrong_ ;-). See below for more details
and analysis.

One thing that helps in accepting this isolation idea is to think about the
use cases. There are two uses cases for it:
1. Normal threaded RT apps with threads that use system calls, block on
events, etc.
2. Specialized RT apps with thread(s) that require close to 100% of the CPU
resources. Their threads avoid using system calls and avoid blocking. This is
done to achieve very low latency and low overhead.

Scenario #1 is straightforward. You'd want to isolate the processor the RT app
is running on to avoid typical sources of latency. Workqueues running on the
same processor is not an issue (because RT threads block), but you do not get
the same latency guaranties.

Workqueues are an issue for the scenario #2. Workqueue kthreads do not get a
chance to run because user's RT threads are higher priority. However those RT
threads should not use regular kernel services because that by definition
means that they are not getting ~100% of the CPU they want. In other words
they cannot have it both ways :).

Therefore it's expected that the kernel won't be used heavily on those cpus,
and nothing really schedules workqueues and stuff. It's also expected that
certain kernel services may not be available on those CPUs. Again we cannot
have it both ways. ie Have all the kernel services and yet the kernel is not
supposed to use much CPU time :).

If at this point people still get this "Oh no, that's wrong" feeling, please
read this excellent statement by Paul J

"A key reason that Linux has succeeded is that it actively seeks to wo...

To: Max Krasnyansky <maxk@...>
Cc: Peter Zijlstra <peterz@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>
Date: Wednesday, June 11, 2008 - 12:08 pm

Yes, it is easy to implement flush_work(struct work_struct *work) which
only waits for that work, so it can't hang unless it was enqueued on the
isolated cpu.

But in most cases it is enough to just do

if (cancel_work_sync(work))
work->func(work);

Or we can add flush_workqueue_cpus(struct workqueue_struct *wq, cpumask_t *cpu_map).

Almost all of them should be changed to use cancel_work_sync().

Oleg.

--

To: Oleg Nesterov <oleg@...>, Peter Zijlstra <peterz@...>
Cc: <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>, Nick Piggin <nickpiggin@...>
Date: Wednesday, June 11, 2008 - 4:44 pm

Previous emails were very long :). So here is an executive summary of the
discussions so far:

----
Workqueue kthread starvation by non-blocking user RT threads.

Starving workqueue threads on the isolated cpus does not seems like a big
deal. All current mainline users of schedule_on_cpu() kind of api can live
with it. Starvation of the workqueue threads is an issue for the -rt kernels.
See http://marc.info/?l=linux-kernel&m=121316707117552&w=2 for more info.

If absolutely necessary moving workqueue threads from the isolated cpus is
also not a big deal, even for cpu hotplug. It's certainly _not_ encouraged in
general but at the same time is not strictly prohibited either, because
nothing fundamental brakes (that's what my current isolation solution does).

----
Optimize workqueue flush.

Current flush_work_queue() implementation is an issue for the starvation case
mentioned above and in general it's not very efficient because it has to
schedule in each online cpu.

Peter suggested rewriting flush logic to avoid scheduling on each online cpu.

Oleg suggested converting existing users of flush_queued_work() to
cancel_work_sync(work) which will provide fine grained flushing and will not
schedule on each cpu.

Both of the suggestions would improve overall performance and address the case
when machine gets stuck due work queue thread starvation.

----
Timer or IPI based Oprofile.

Currently oprofile collects samples by using schedule_work_on_cpu(). Which
means that if workqueue threads are starved on, or moved from cpuX oprofile
fails to collect samples on that cpuX.

It seems that it can be easily converted to use per-CPU timer or IPI.
This might be useful in general (ie less expensive) and will take care of the
issue described above.

----
Optimize pagevec drain.

Current pavevec drain logic on the NUMA boxes schedules a workqueue on each
online cpu. It's not an issue for the CPU isolation per se but can be improved
in general.
Peter suggested keeping a cpuma...

To: Oleg Nesterov <oleg@...>
Cc: Peter Zijlstra <peterz@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>
Date: Wednesday, June 11, 2008 - 3:21 pm

Cool. That would work.
btw Somehow I thought that you already implemented flush_work(). I do not see
it 2.6.25 but I could've sworn that I saw a patch flying by. Must have been
That'd be special casing. I mean something will have to know what cpus cannot

That'd be a lot of changes.

git grep flush_scheduled_work | wc
154 376 8674

Hmm, I guess maybe not that bad. I might actually do that :-).

Max
--

To: Max Krasnyansky <maxk@...>
Cc: Peter Zijlstra <peterz@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>
Date: Thursday, June 12, 2008 - 12:35 pm

Well... I don't think Andrew will take this patch right now... OK, I'll send
the preparation patch with comments. Do you see an immediate user for this

Cool! I _bet_ you will find a lot of bugs ;)

Oleg.

--

To: Oleg Nesterov <oleg@...>
Cc: Peter Zijlstra <peterz@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>
Date: Wednesday, June 11, 2008 - 3:21 pm

Cool. That would.
btw Somehow I thought that you already implemented flush_work(). I do not see
it 2.6.25 but I could've sworn that I saw a patch flying by. Must have been
That'd be special casing. I mean something will have to know what cpus cannot

That'd be a lot of changes.

git grep flush_scheduled_work | wc
154 376 8674

Hmm, I guess maybe not that bad. I might actually do that :-).

Max
--

To: Max Krasnyansky <maxk@...>
Cc: Oleg Nesterov <oleg@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, June 11, 2008 - 2:49 am

There are quite a bit more on -rt, where a lot of on_each_cpu() callers,

Sure, ondemand cpu_freq doesn't make sense while running (hard) rt

NMI/timers sound like a good way to run oprofile - I thought it could

Dude, SLUB uses on_each_cpu(), that's even worse for your #2 case. Hmm

It isn't actually swap only - its all paging, including pagecache etc..
Still, you're probably right in that the per cpu lrus are empty, but why
not improve the current scheme by keeping a cpumask of cpus with

Looking at this code I'm not seeing the harm in letting it run - even
for your #2 case, it certainly is not worse than some of the
on_each_cpu() code, and starving it doesn't seem like a big issue.

---

I'm worried by your approach to RT - both your solutions [1,2] and
oversight of the on_each_cpu() stuff seem to indicate you don't care
about some jitter on your isolated cpu.

Timers and on_each_cpu() code run with hardirqs disabled and can do all
kinds of funny stuff like spin on shared locks. This will certainly
affect your #2 case.

Again, the problem with most of your ideas is that they are very narrow
- they fail to consider the bigger picture/other use-cases.

To quote Paul again:
"A key reason that Linux has succeeded is that it actively seeks
to work for a variety of people, purposes and products"

You often seem to forget 'variety' and target only your one use-case.

I'm not saying it doesn't work for you - I'm just saying that by putting
in a little more effort (ok, -rt is a lot more effort) we can make it
work for a lot more people by taking out a lot of the restrictions
you've put upon yourself.

Please don't take this too personal - I'm glad you're working on this.
I'm just trying to see what we can generalize.

--

To: Peter Zijlstra <peterz@...>
Cc: Oleg Nesterov <oleg@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>, Thomas Gleixner <tglx@...>
Date: Wednesday, June 11, 2008 - 3:02 pm

Yes I noticed that. I did not look deep enough though. I played with -rt
To sum up the discussion:
Overall conclusion regarding workqueues for the current mainline kernels is
that starting/moving workqueue threads is not a big deal. It's definitely not
encouraged but at the same time is not be prohibited (and it isn't at this
point I'm moving them from user-space).

Looks like the action items are:
- Optimize workqueue flush
- Timer or IPI based Oprofile (configurable)
- Optimize pagevec drain. I wonder if there is something on that front in the
Nick's latest patches.

Of course I may not be able to look at all those

Did I miss any other suggestion ?

Yeah, I'm not sure why oprofile uses workqueues. Seems like a much heavier way
Sure. I was talking about _workqueues_ specifically.
There is also add_timer_on().

Also on_each_cpu() is an IPI. It's not necessarily worse for the #2 case.
Depending what the IPI does of course. In the SLUB case it's essentially a
noop on the CPU that is no a heavy SLUB user. I've been meaning to profile
IPIs and timers but they have not generated enough noise.

btw Converting everything to threads like -rt does is not necessarily the best
solutions for all cases. Look at the "empty SLUB" case. IPI will get in and
out with minimal intrusion. The threaded approach will have to run the
scheduler and do a context switch. Granted in-kernel context switches are
super fast but still not as fast as the IPI. Just to clarify I'm not saying
-rt is doing the wrong thing, I'm just saying it's not black and white that
Oh, I missed the pagecache part. btw It's also used only for the CONFIG_NUMA
I did not know enough about it to come with such an idea :). So yes now that
Yes that's I meant. ie That nothing needs to be done here. We just let it run
Hang on, why do say an oversight of on_each_cpu() ?
I started this discussion about workqueues specifically. on_each_cpu() is an
IPI. Actually awhile ago I looked at most uses of the xxx_on_cpu() api
including...

To: Max Krasnyansky <maxk@...>
Cc: Oleg Nesterov <oleg@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>, Thomas Gleixner <tglx@...>
Date: Thursday, June 12, 2008 - 2:44 pm

Sorry for being late,.. and I'm afraid most will have to wait a bit
longer :-(

No, no, you understand me wrong (or I expressed myself wrong). Your
ideas are good, its just the implementation / execution I have issues
with.

Like with extending the isolation map, what didn't leave any room for
hard-rt smp schedulers or multiple rt domains. Whereas the cpuset stuff
does.

--

To: Peter Zijlstra <peterz@...>
Cc: Oleg Nesterov <oleg@...>, <mingo@...>, Andrew Morton <akpm@...>, David Rientjes <rientjes@...>, Paul Jackson <pj@...>, <menage@...>, <linux-kernel@...>, Mark Hounschell <dmarkh@...>, Thomas Gleixner <tglx@...>
Date: Thursday, June 12, 2008 - 3:10 pm

Yep. And I redid it completely and switched gears to fix/improve cpusets,
genirq, etc.
Anyway, I think we're on the same page. Please look over the summary that I
sent out and see if I missed anything.

Max
--

To: Max Krasnyanskiy <maxk@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>, Oleg Nesterov <oleg@...>
Date: Monday, June 9, 2008 - 6:07 pm

That isn't a completely accurate description of the patch; the kthread
itself is always allowed to change its cpu affinity. This exception, for
example, allows stopmachine to still work correctly since it uses
set_cpus_allowed_ptr() for each thread.

This patch simply prohibits other tasks from changing the cpu affinity of

Let's elaborate a little on that: you're moving workqueue threads from
their originating cpu to another cpu (hopefully on the same NUMA node)

It depends on the number of exceptions that we want to allow. If there's
one or two, it's sufficient to just use

p->flags &= ~PF_THREAD_BOUND;

upon return from kthread_bind(), or simply convert the existing code to
use set_cpus_allowed_ptr() instead:

kthread_bind(p, cpu);

becomes

cpumask_t mask = cpumask_of_cpu(cpu);
set_cpus_allowed_ptr(p, &mask);

Or, if we have more exceptions to the rule than actual strict binding for
kthreads using kthread_bind(), we can just add

p->flags |= PF_THREAD_BOUND;

upon return on watchdog and migration threads.

But, to me, the semantics of kthread_bind() are pretty clear. I think
it's dangerous to move kthreads such as watchdog or migration threads out
from under them and that is easily shown if you try to do it. There are
perhaps exceptions to the rule where existing kthread_bind() calls could
be converted to set_cpus_allowed_ptr(), but I think we should enumerate
those cases and discuss the hazards that could be associated with changing
their cpu affinity.

I'd also like to hear why you choose to move your workqueue threads off

With my patch, this is slightly different. Kthreads that have called
kthread_bind() can have a different cpu affinity than the cpus_allowed of
their cpuset. This happens when such a kthread is attached to a cpuset
and its 'cpus' file subsequently changes. Cpusets is written correctly to
use set_cpus_allowed_ptr() so that this change in affinity is now rejected
for PF_THREAD_BOUND tasks, yet th...

To: Paul Jackson <pj@...>, <mingo@...>
Cc: David Rientjes <rientjes@...>, Peter Zijlstra <a.p.zijlstra@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 12:30 pm

I pointed this out in the email thread about PF_THREAD_BOUND patch and wanted
to restart the thread to make sure that people pay attention :).
I was going to cook up a patch for this and wanted to get some early feedback
to avoid time waste.

Basically the issue is that current behaviour of the cpusets is inconsistent
with regards to kthreads. Kthreads inherit cpuset from a parent properly but
they simply ignore cpuset.cpus when their cpu affinity is set/updated.
I think the behaviour must be consistent across the board. cpuset.cpus must
apply to _all_ the tasks in the set, not just some of the tasks. If kthread
must run on the cpus other than current_cpuset.cpus then it should detach from
the cpuset.

To give you an example kthreads like scsi_eh, kswapd, kacpid, pdflush,
kseriod, etc are all started with cpus_allows=ALL_CPUS even though they
inherit a cpuset from kthreadd. Yes they can moved manually (with
sched_setaffinity) but the behaviour is not consistent, and for no good
reason. kthreads can be stopped/started at any time (module load for example)
which means that the user will have to keep moving them.

To sum it up here is what I'm suggesting:
kthread_bind(task, cpu)
{
// Set PF_THREAD_BOUND
// Move into root cpuset
// Bind to the cpu
}

kthread_setaffinity(task, cpumask)
{
// Enforce cpuset.cpus_allowed
// Updated affinity mask and migrate kthread (if needed)
}

Kthreads that do not require strict cpu binding will be calling
kthread_setaffinity() instead of set_cpus_allowed_ptr() and such.

Kthreads that require strict cpu binding will be calling kthread_bind() and
detach from the cpuset they inherit from their parent.

That way the behaviour is consistent across the board.

Comments ?

Max

--

--

To: Max Krasnyansky <maxk@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 2:47 pm

I disagree that a cpuset's set of allowable cpus should apply to all tasks
attached to that cpuset. It's certainly beneficial to be able to further
constrict the set of allowed cpus for a task using sched_setaffinity().

It makes more sense to argue that for each task p, p->cpus_allowed is a

This doesn't seem to be purely a kthread issue. Tasks can be moved to a
disjoint set of cpus by any caller to set_cpus_allowed_ptr() in the
kernel.

David
--

To: David Rientjes <rientjes@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 4:44 pm

Yes that's exactly what I meant :). Sorry for not being clear. I did not mean
that each task in a cpuset must have the same affinity mask. So we're on the
Hmm, technically you are correct of course. But we do not have any other
kernel tasks besides kthreads. All the kernel tasks I have on my machines have
kthreadd as their parent.
And I'm not aware of any kernel code that changes affinity mask of a
user-space task without paying attention to the cpuset the task belongs to. If
you know of any we should fix it because it'd clearly be a bug.

Thanx
Max
--

To: Max Krasnyansky <maxk@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 4:54 pm

This is why it shouldn't belong in the sched or kthread code; the
discrepency that you point out between p->cpus_allowed and
task_cs(p)->cpus_allowed is a cpuset created one.

So to avoid having tasks with a cpus_allowed mask that is not a subset of
its cpuset's set of allowable cpus, the solution would probably be to add
a flavor of cpuset_update_task_memory_state() for a cpus generation value.
--

To: David Rientjes <rientjes@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, Peter Zijlstra <a.p.zijlstra@...>, <menage@...>, <linux-kernel@...>
Date: Tuesday, June 10, 2008 - 5:15 pm

I guess I do not see how the kernel task case is different from the
sched_setaffinity(). ie sched_setaffinity() is in the scheduler but it deals
with cpusets.

Also in this case the discrepancy is created _not_ by the cpuset, it's created
due to the lack of the appropriate API. In other words if we had something
Post policing would not work well in some scenarios due to lag time. ie By the
time cpusets realized that some kthread is running on the wrong cpus it maybe
too late.
We should just enforce cpuset constraint when kthreads change their affinity
mask, just like sched_setaffinity() already does.

Max
--

To: David Rientjes <rientjes@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>, Oleg Nesterov <oleg@...>
Date: Tuesday, June 10, 2008 - 12:23 am

I think what can and cannot be moved is a separate question. As far as cpu
affinity API for kthreads goes it makes sense to support both scenarios.
That would be inconsistent and confusing, I think. If a task is in the cpuset
X all constraints of the cpuset X must apply. If some constraints cannot be
Yes cpusets are not only about cpu affinity. But again the behaviour should be
consistent across the board. cpuset.cpus must apply to all the task in the
set, not just some of the tasks.

Also I think you missed my other point/suggestion. Yes your patch fixes one
problem, which is kthreads that must not move will not move. Although like I
said the behaviour with regards to the cpuset is not consistent and confusing.
The other thing that I pointed out is that kthreads that _can_ move do not
honor cpuset constrains.
Let me give another example. Forget the workqueues for a second. Kthreads like
scsi_eh, kswapd, kacpid, etc, etc are all started with cpus_allows=ALL_CPUS
even though they inherit a cpuset from kthreadd. Yes I can move them manually
but the behaviour is not consistent, and for no good reason. kthreads can be
stopped/started at any time (module load for example) which means that I'd
have to keep moving them.

To sum it up here is what I'm suggesting:
kthread_bind(task, cpu)
{
// Set PF_THREAD_BOUND
// Move into root cpuset
// Bind to the cpu
}

kthread_setaffinity(task, cpumask)
{
// Enforce cpuset.cpus_allowed
// Updated affinity mask and migrate kthread (if needed)
}

Instead of calling set_cpus_allowed_ptr() kthreads that do not need strict cpu
binding will be calling kthread_setaffinity().
That way the behaviour is consistent across the board.

Makes sense ?

Max
--

To: Max Krasnyansky <maxk@...>
Cc: Paul Jackson <pj@...>, <mingo@...>, <peterz@...>, <menage@...>, <linux-kernel@...>, Oleg Nesterov <oleg@...>
Date: Tuesday, June 10, 2008 - 1:04 pm

This probably isn't something that you should be doing, at least with the
workqueue threads. The slab cache reaper, for example, depends on being
able to drain caches for each cpu and will be neglected if they are moved.

I'm curious why you haven't encountered problems with this while isolating
per-cpu workqueue threads in cpusets that don't have access to their own
cpu.

Regardless, we'd need a patch to the slab layer and ack'd by the

It has always been possible to assign a task to a cpu and then further
constrict its set of allowable cpus with sched_setaffinity(). So, while
the cpus_allowed in this case are always a subset of the cpuset's cpus,

kthread_bind() usually happens immediately following kthread_create(), so
it should already be in the root cpuset. If it has been forked in a
different cpuset, however, implicitly moving it may be more harmful than
any inconsistency that exists in cpus_allowed.

David
--

Previous thread: RE: Vpools and v3.0-UPSTREAM LIO by Nicholas A. Bellinger on Thursday, June 5, 2008 - 3:36 pm. (1 message)

Next thread: 2008/6/6 - 4:3:36 by 胡育茹 on Thursday, June 5, 2008 - 4:03 pm. (1 message)