Re: IRQ affinities

Previous thread: Re: Poor PostgreSQL scaling on Linux 2.6.25-rc5 (vs 2.6.22) by Nick Piggin on Tuesday, March 11, 2008 - 6:21 pm. (12 messages)

Next thread: [2.6.25-rc5-mm1] BUG() at mnt_want_write(). by penguin-kernel on Tuesday, March 11, 2008 - 6:37 pm. (10 messages)
From: Max Krasnyansky
Date: Tuesday, March 11, 2008 - 6:23 pm

Folks,

Concept of 'boot' cgroup was discussed as part of the cpuset/cpuisol lkml threads.
In short 'boot' group is very much like the 'root' or toplevel group, i.e. it
contains all tasks, and 'boot' cpuset contains all cpus, mem nodes, irqs, etc.
The difference is that it can be easily shrunk if needed, whereas the
toplevel/root group cannot.

I just wanted to make sure that we still want to create 'boot' cgroup during
kernel init instead of doing it in the user-space.

After looking into this a little bit I'm thinking of creating 'boot' cgroup
right after cpuset_init_smp() (init/main.c:841). Just before do_basic_setup()
which creates work queues and stuff.

The thing is though that the very next thing we do there is run early
userspace. Which begs the question, shouldn't we just do it from early
user-space then ?
It'd be very simple to mount cgroup, create 'boot' group and move all the
tasks in there.

So kernel or early-userspace ?

If kernel.
Paul M, do you have a suggestion as to what's the best way of creating a
cgroup without mounting cgroup fs. Seems like there is currently no easy way
for doing that. I probably missed it.

Thanx
Max



--

From: Paul Menage
Date: Tuesday, March 11, 2008 - 6:27 pm

Seems simplest to me. We have an early boot script that creates a
"system" cpuset and moves all tasks into it. It seems to work fine for
us.

Paul
--

From: Max Krasnyansky
Date: Tuesday, March 11, 2008 - 7:34 pm

Suppose we were to do it from kernel. What's the right way to create a cgroup
without mounting a cgroupfs ?
I just want to play with it. There are a couple of advantages that I see for
doing it from kernel. We can move 'kthreadd' and idle threads into the 'boot'
cgroup early on and therefore later on won't even have to iterate through the
tasks and stuff. Whereas user-space has to iterate through tasks and be smart
about threads that are pinned and stuff. Not a big deal but if kernel code is
simple enough maybe it makes sense.
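
For comparison, the user-space variant discussed above can be sketched as follows. This is a minimal sketch, assuming the legacy cpuset filesystem is mounted at a path such as /dev/cpuset and exposes files named `cpus`, `mems` and `tasks`; the function names and the policy of skipping tasks that refuse to move are illustrative assumptions, not anything taken from an existing boot script.

```python
import os

def plan_boot_cpuset(root, name="boot"):
    """Return (boot_dir, writes): the ordered (path, value) writes that
    populate root/name as a copy of the top cpuset.

    Assumes the legacy cpuset layout, with 'cpus', 'mems' and 'tasks'
    files directly under the mount point `root`.
    """
    def read(fname):
        with open(os.path.join(root, fname)) as f:
            return f.read().split()

    boot = os.path.join(root, name)
    writes = [
        # A new cpuset starts out empty; it needs cpus and mems
        # before any task can be attached to it.
        (os.path.join(boot, "cpus"), ",".join(read("cpus"))),
        (os.path.join(boot, "mems"), ",".join(read("mems"))),
    ]
    # One pid per write, as the tasks file expects.
    writes += [(os.path.join(boot, "tasks"), pid) for pid in read("tasks")]
    return boot, writes

def apply_plan(boot, writes):
    """Create the cpuset directory and perform the writes.

    Tasks that cannot be moved (already exited, or refused by the
    kernel) are skipped and returned instead of being treated as fatal.
    """
    os.makedirs(boot, exist_ok=True)
    skipped = []
    for path, value in writes:
        try:
            with open(path, "w") as f:
                f.write(value)
        except OSError:
            if not path.endswith("tasks"):
                raise
            skipped.append(value)
    return skipped
```

Splitting the plan from its application keeps the iteration over tasks visible and easy to dry-run; on a live system it would be invoked as `apply_plan(*plan_boot_cpuset("/dev/cpuset"))`.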

So, any pointers. How do I do create_cgroup() without fs mounted ?

Max
--

From: Paul Menage
Date: Tuesday, March 11, 2008 - 7:36 pm

There isn't really a way, but you could always kern_mount() a

Would this be done based on some boot commandline option? I don't
think you'd want to do it unconditionally.

Paul
--

From: Max Krasnyansky
Date: Tuesday, March 11, 2008 - 7:53 pm

Hmm, I believe the original discussion was about doing it unconditionally.
Why not I guess ? It probably won't even affect your existing scripts since
they will be able to move tasks into another set just like they do now. The
only thing I can think of is that if your scripts use sched_load_balance then
they will now have to unset it in the 'boot' set as well. Otherwise since the
'boot' set will be non-exclusive (cpus and mems) it should not really affect
anything.
So what's your concern with unconditional 'boot' cgroup/cpuset ?

Max
--

From: Paul Menage
Date: Tuesday, March 11, 2008 - 8:09 pm

My boot scripts look in /dev/cpuset/tasks to find processes to move

That can break existing userspace, so I presume PaulJ isn't in favour


The exclusivity problem, as above.

Which subsystems are you going to include in this boot hierarchy?
Userspace is going to have to be aware of the fact that there's a
cpusets hierarchy which might have to be dismantled if it wants to set
up something different.

Paul
--

From: Max Krasnyansky
Date: Tuesday, March 11, 2008 - 8:39 pm

My impression was that he was ok with changing his stuff. But I may be
completely wrong of course. I'm actually perfectly fine with making it
conditional.
Maybe something like
	bootcpuset=1
Hold on, if you move all the tasks ... Oh, never mind :). You mean that you
won't be able to create any cpusets that must be exclusive unless you nuke
Yes I agree. If this 'boot' set is unconditional user-space tools will have to
change. As I mentioned above I totally do not mind if it is conditional. Any
I was going to only include 'cpusets'. Does it make sense for anything else ?

Max
--

From: Paul Jackson
Date: Tuesday, March 11, 2008 - 9:59 pm

You're right - I don't favor it.

Using the 'cpus' in one or more cpusets to determine both:
 1) which CPUs can receive an irq, and
 2) resolving conflicts in such irq placement,
excessively overloads the cpuset hierarchy, breaking existing
userspace, as Paul M notes.

If you don't have any other cpuset hierarchy you need to use, and
so don't really otherwise care what your cpuset hierarchy is, then
I suppose this works just fine.

But if you also need to use the cpuset hierarchy to define nested
subsets of CPUs and Memory Nodes, for the purposes of controlling
which tasks can run where (the original and still primary motivation
for cpusets) then one can only conveniently specify those trivial
irq configurations that happen to exactly conform with that hierarchy
(that exactly want to make use of some of the same sets of CPUs, and
that don't depend on the hierarchy to resolve conflicts in overlapping
irq directives).

Almost any non-trivial use of cpusets for both irq directivity and CPU
and Memory placement would complicate both hierarchies, forcing
unending confusion and breakage on the existing cpuset users.

Some examples:

    Let's say I have three cpusets defining the CPU and Memory Node
    sets in which I want to place my tasks:

	    /dev/cpuset/A
	    /dev/cpuset/B
	    /dev/cpuset/C

    and I want a particular set of irqs to be directed to the CPUs in A
    and B, but not C.  Well -- guess I can duplicate the irqs settings.

    But don't tell me to use a 'boot' cpuset, as in:

	    /dev/cpuset/boot/A
	    /dev/cpuset/boot/B
	    /dev/cpuset/C

    to accomplish this, as that intrudes in the hierarchy, breaking
    user code.

    If my irq isolation needs don't exactly partition along the
    'cpus' settings in A, B and C, then not even duplication helps.

    If the 'irqs' in /dev/cpuset/A/Z (where Z's cpus are a proper
    subset of A's) don't match the 'irqs' in /dev/cpuset/A, then I
    have further confusions resulting from conflicting irq ...
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 11:24 am

Hmm, I think we're mixing two different threads here.
	1. How to map irq affinity handling onto cpusets.
	2. Whether and how to create in kernel 'boot' cgroup/cpuset.
They are somewhat orthogonal imho. In a sense that no matter how we decide to
handle irqs (even if we do not do them under cpusets at all) we may still want
'boot' group. As I mentioned at the beginning of this thread 'boot' group/set
is basically just a convenience feature. The only difference from the root/top
group is that 'boot' group can be dynamically resized and moved.

Ok. So the rest of the email is mostly about irqs. It'd be nice if it was in 
I'm not sure #2 is a concern. With the latest cpuset irq handling patches 
I'm not sure what breakage you're talking about. But let's talk examples I 
How is that any different from tasks ? Exact same example right back at you.
Suppose I have a task that needs to run in A and B but not C. In fact if you 
look at the example that I provided in the other thread I already have such an 
app. In my current apps different threads have to run in different cpusets.

And yes I think the way to solve that is to use more complex cpuset hierarchy 
like the one you used above. I would not necessarily mix in the 'boot' set 
here. I mean if people want to subdivide it that's fine but they do not have 
It does not force any changes. irqs handled just like tasks and if people have 
complex partitioning requirements they may have to use more complicated 
That (ie additional sets of irqs) seems like a major overkill to me.
Probably because I do not think that there are any conflicts to resolve in the 
first place. As I explained above if we treat irqs just like tasks (from 
cpuset perspective) then same exact rules and limitations apply. Irq can be 
assigned to a single cpuset at a time. Complex requirements can be solved 
either by deeper cgroup/cpuset hierarchies or worst case if there is some 
totally wacky constraint people always have an option of assigning irq to 
the ...
--

From: Paul Jackson
Date: Wednesday, March 12, 2008 - 11:57 am

Can't happen.

Each task belongs to exactly one cpuset, no exceptions.

That's why you can't "treat irqs just like tasks".

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 12:11 pm

Sure you can.

I was talking about running on the _cpus_ that belong to the "sets A and B but 
not C" and not that a task must belong to more than one cpuset. Unless I 
misinterpreted your example you were talking about exact same thing. In other 
words that an irq needs to be assigned to the _cpus_ in the sets A and B but not C.
Makes sense ?

Max


--

From: Paul Jackson
Date: Wednesday, March 12, 2008 - 12:32 pm

This doesn't make sense to me.

If a task is to run on the CPUs in both sets A and B, then it has to be
in both those cpusets, which isn't allowed, or in some super set of both
A and B (that is, in this example, in the top cpuset), which doesn't
restrict the task to just A or B or their union.

I have no idea what distinction you are seeing between what _cpus_ a task
can run on, and what cpuset it belongs to.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 1:08 pm

Paul, we are in 100% agreement here about the tasks. All I'm saying is that 
the same exact thing applies to the irqs. Again let me try your example.

Suppose we have
	/dev/cpuset/A
	/dev/cpuset/B
	/dev/cpuset/C

Now suppose that for whatever reason I must run task1 on the cpus that belong 
to sets A and B but not C. The only way to do that with cpusets is

	/dev/cpuset/X
	          |-- A
	          `-- B
	/dev/cpuset/C

i.e. create parent cpuset X and assign task1 into cpuset X.
Of course if A and B are not cpu_exclusive then X does not have to be their 
parent.

Makes sense so far ?

Now the same exact thing can be said about the irqs. If I need to assign irq1 
to the cpus in sets A and B but not C I have to create set X that is the union 
of A and B, and assign irq1 to the set X.
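
The union described here can be made concrete with a short sketch. It assumes the cpuset `cpus` file uses the usual list form such as `0-3,8`; the CPU numbers chosen for A, B and C are hypothetical.

```python
def parse_cpulist(text):
    """Parse a cpuset-style list such as '0-3,8' into a set of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        lo, _, hi = part.partition("-")
        cpus.update(range(int(lo), int(hi or lo) + 1))
    return cpus

def format_cpulist(cpus):
    """Format a set of ints back into the compact 'a-b,c' form."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu          # extend the current run
        else:
            ranges.append([cpu, cpu])    # start a new run
    return ",".join(str(a) if a == b else f"{a}-{b}" for a, b in ranges)

# Hypothetical layout: irq1 may use the CPUs of A and B, never those of C.
A = parse_cpulist("0-1")
B = parse_cpulist("2-3")
C = parse_cpulist("4-7")
X = A | B                                # the parent set for both A and B

print(format_cpulist(X))                 # value to write into X's 'cpus' file
```

With these numbers X comes out as `0-3` and shares no CPU with C, so anything attached to X stays off C's CPUs.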

This is what I meant by "deeper hierarchies" in the earlier emails.

Did I do a better job explaining this time :) ?

Max






--

From: Paul Jackson
Date: Wednesday, March 12, 2008 - 1:37 pm

> This is what I meant by "deeper hierarchies" in the earlier emails.

These deeper hierarchies create an incompatibility in some common uses
of cpusets.

When my example had cpusets A, B and C, that was as stated, not as
might be modified to X, X/A, X/B and C.

If the user has or would have setup cpusets A, B and C because that's
what they needed to manage the CPU and Memory Node placement of their
tasks, then that's what they might have setup, and there is a good
chance that they would find the imposition of the extra 'X' cpuset to
be a problem, to require more code and to be a cause of bugs.

Adding irqs to the cpuset hierarchy isn't free; it can further overload
the hierarchy, with "deeper hierarchies" as you state.

If instead of deeper hierarchies, we allow the same irq to be listed in
more than one cpuset (unlike tasks, which only get one cpuset) then we
need some way, independent of the cpuset hierarchy, to determine how to
resolve conflicts.  We can't just add all the cpus together, allowing an
irq to be directed to any CPU which is listed in any cpuset that
accepts that irq, because a major use for this is to remove irqs from
certain realtime CPUs.

So ... if the natural hierarchy needed to map irqs to CPUs is not a
subset of the natural hierarchy needed to map tasks to sets of CPUs and
Nodes, then we either deepen the hierarchy (cross product of the
two maps, essentially) or we allow the same irq to be listed in
multiple cpusets and provide some alternative mechanism, outside the
hierarchy, to resolve the resulting conflicts in the irq to CPU map.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 3:29 pm

Isn't that just an issue of planning ? Those cpusets are not cast in stone, are 
they ? I mean yes users have setup A,B,C the way they did because that's what 
they needed. Now their plans/requirements have changed. They now want to also 
manage irqs via cpusets and in order to do that they need to replan/redo the 
partitioning.

In order to manage irqs the code has to change anyway because currently there 
is no way to do that via cpuset. The users would have two options:
1. keep all irqs in the top set and manage them individually via /proc
2. layout cpusets differently

btw I still do not see the "incompatibility" argument. Probably because I have 
no idea how the software you're talking about is designed. Are you saying that 
the software relies on a flat cpuset partitioning ? ie That it will break if 
This sounds like an overkill and as you pointed out is not even clear how it'd 
work.

Looks like we have a trade-off here:
1. use simple "irq == pseudo-task" concept and potentially break some existing 
software. We do have working solution.
2. come up with something that requires more complex irq management rules at 
the expense of complexity. We do not have working solution.

I think by natural you mean "compatible with existing sw". What is unnatural 
in extra levels of cpusets ? If I read cgroup/cpuset documentation it seems to 
imply that nested cgroups/cpuset are allowed.

Max


--

From: Paul Jackson
Date: Wednesday, March 12, 2008 - 4:30 pm

It's similar, perhaps, to what happens when we try to accommodate two
architectures in one file system, with things like:
	/x86_64/bin
	/ia64/bin
replacing the well known /bin.

Things break.  Apps such as the major batch schedulers (PBS and LSF)
and various other tools and scripts buried here and there have become
used to developing particular cpuset hierarchies over the last couple
of years.

Any time you force another dimension into such an existing hierarchy,
things break, and people get annoyed.

Sure ... the kernel doesn't care ... it can handle whatever hierarchy
you like.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 5:57 pm

Crazy idea. How about we add support for sym links to the cgroup fs ?
It's still much cleaner imo than dealing with complex irq grouping schemes.

In other words with symlinks we could do
`-- cpuset
     |-- A -> X/A
     |-- B -> X/B
     |-- C
     `-- X
         |-- A
         `-- B

The software that is used to the flat structure won't know the difference.

Max





--

From: Paul Jackson
Date: Thursday, March 13, 2008 - 12:03 am

What's this "complex irq grouping scheme" that you're referring to?

If it's what I posted last week, with named sets of irqs, and each
cpuset naming which set it belonged to, that seems to me to actually
fit the usage pattern rather well.

The jobs running in particular cpusets need only know the 'name' of
the set of irqs it makes sense to send to its CPUs (the realtime
irqs, a particular piece of hardware's irqs, the ordinary system
irqs, the absolute minimum set of irqs, ...) and the system admin
gets to specify, one time, which irq numbers are in which named
set, or to change, later on, which set a particular irq is in, all
without having to have detailed knowledge of the jobs that want
particular irq sets directed to their CPUs.

We tend to label whatever makes sense to us as "simple", and whatever
doesn't seem necessary in our experience, or doesn't make sense, as
"complex".

Such labels are losing their meaning these days, other than to help
others figure out what we favor, or disfavor.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Thursday, April 10, 2008 - 11:03 am

The context here was that we were talking about a way to group irqs and assign 
them to the cpusets. I was proposing to just treat IRQs as tasks, and you were 
proposing to add some additional grouping. Replies inline below.

I was just saying that cpuset already provides a nice grouping. After thinking 
about this some more I still do not see a need to group IRQs before assigning 
I agree in general. In this particular case additional grouping introduces 
even more hierarchy. It seems to me that
	"irqN -> cpu1, cpu2, cpu3"
is a very simple, straightforward relationship. Whereas
	"irqN -> groupX"
	"groupX -> cpu1"
	"groupX -> cpu2"
	"groupX -> cpu3"
Is not that straightforward.

Anyway. I think it all boils down to the compatibility with existing 
user-space apps. I still like the simple approach of treating irqs like tasks 
when it comes to assigning them to the cpusets. Which as we discussed earlier 
in some cases may require an extra level in the cpuset hierarchy. The question 
is, is that really such a big problem. If we make in kernel boot set optional, 
by default all irqs will be in the root cpuset. Which means people can still 
use /proc/irq/N/smp_affinity and manage irqs just like they do now. There are 
no compatibility issues in that case.

So do you think the apps compatibility is an issue in that case ?
Also isn't it likely that the apps will gradually adapt to handling 
multi-level cpusets anyway ? I mean you guys were talking about how wonderful 
and flexible cpusets are, but we cannot seem to use the flexibility because 
the apps are designed for a flat layout.

Max
--

From: Paul Jackson
Date: Monday, April 14, 2008 - 11:39 am

No.  Not flat.  Not at all flat.

We routinely and normally have an interesting hierarchy of cpusets
below /dev/cpuset.  However that hierarchy is determined by the
nesting of subsets of the nodes (CPUs and/or Memory) on the system.

These subsets of nodes in the /dev/cpuset hierarchy may well map
nicely into the subsets of CPUs that can receive a particular set
of IRQs, however that map is not bijective.  Of particular interest
here, it's not injective, meaning that multiple cpusets might and
will commonly receive the same set of IRQs.  You can force this map
to be injective by elaborating the cpuset hierarchy to reflect both
this new assignment of IRQs and the (CPU and/or Memory) node subset
hierarchy that it currently reflects, but that will break code that
was expecting the directory tree below /dev/cpuset to directly and
only reflect the node hierarchy.

In less mathematically obtuse wording, sure you can add more directory
layers below /dev/cpuset, to handle IRQ assignments, but that will
break code that was expecting the /dev/cpuset directory tree to only
reflect the nesting of (CPU and/or Memory) nodes.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Peter Zijlstra
Date: Friday, May 9, 2008 - 3:45 am

Sorry for being rather late to the game - other stuff keeps me from
doing anything much here :-(.

Anyway, the current applications don't support IRQ assignment anyway.
That's a new feature; and it's quite common that new features require
code changes.

So I'm not seeing the problem - don't change code and stuff works as
before - change code and you get new stuff.

So I'm arguing in favour of the IRQs as tasks idea that might need extra
hierarchy levels.

--

From: Paul Jackson
Date: Friday, May 9, 2008 - 4:17 am

It's common for new features to require code changes to take advantage
of the new features.

It's less desirable that taking advantage of such new features breaks
existing, basically unrelated, code.

My gut sense is that, in a misguided effort to find a "simple" answer
to irq distribution, we (well, y'all) are trying to attach this
feature to cpusets or cgroups.

Let me ask a different question:

  What solutions would you (Max, Peter, Ingo, lurkers, ...) be
  suggesting for this 'IRQ affinity' problem if cpusets and
  cgroups didn't exist in any form whatsoever?

The answer to that question might help me contribute to this discussion
in another way ... it might help me understand better what we're really
trying to do here.  You guys were proposing mechanisms that don't fit
my architectural sense of cpusets, but I was having problems figuring out
what are the essential underlying requirements, independent of choice
of mechanism.

Perhaps by describing one or two possible alternative, cpuset-free,
mechanisms that come more or less close to meeting our needs, I will
glean a better understanding of these elusive requirements, and can
better contribute to the discussion of design trade offs facing us.

So could you describe some possible cpuset-free solutions?  If they are
flawed in some critical way, that's ok, just point out said flaw(s).
Either way, this could help illuminate what's needed here.

It might be, once I better understand the requirements, possible
solutions and their tradeoffs, that I come to agree that cpusets or
cgroups present the best mechanism, given the tradeoffs and what's
needed.  Or it might be we find a better way to meet our needs.

Actually, if for no other reason than to bring any lurkers up to speed,
if you (Max or Peter, likely) wanted to describe, from the beginning,
what this discussion is about, that would be good too.  I doubt anyone
outside of three or four of us even recalls that long discussion of
February and March, 2008.

-- 
   ...
--

From: Peter Zijlstra
Date: Friday, May 9, 2008 - 4:48 am

I see two use-cases:

 - Isolation
 - NUMA node devices

With isolation you want to move all of your 'normal' system tasks off to
one side of your machine and use the other side for 'special - rt' tasks.

For IRQs this means that you want to move all the 'normal' IRQs along
with the 'normal' tasks, and move the special IRQs into the rt side.

Of course you can do this by setting IRQ affinities one by one, but
being able to group the IRQs seems a sensible thing to me.

One thing here is that we'd like to also provide a default group for new
IRQs, so that when a new device appears it's not allowed into the
'special' side of your machine.

This is what Max focussed on, and provides a binary division of your
machine: special and not special.


Now I was thinking that if we generalize this whole thing it might be
useful for other purposes such as IRQ placement near the nodes that host
the device and/or the application using them.


So what we'd end up with is named affinity groups that contain (unique)
IRQs. 
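
A minimal sketch of that data model, under the assumption that "unique" means each IRQ belongs to exactly one group and that unassigned (new) IRQs fall into a default group. The class, its method names and the group name `normal` are hypothetical and only illustrate the idea.

```python
class IrqGroups:
    """Named affinity groups; every IRQ is a member of exactly one group."""

    def __init__(self, default="normal"):
        self.default = default
        self.cpus = {default: set()}   # group name -> CPUs allowed for it
        self.group_of = {}             # irq number -> group name

    def define(self, name, cpus):
        """Create a group or replace the CPUs of an existing one."""
        self.cpus[name] = set(cpus)

    def assign(self, irq, name):
        """Move an IRQ into a group; the old membership is overwritten,
        so an IRQ can never be in two groups at once."""
        if name not in self.cpus:
            raise KeyError(name)
        self.group_of[irq] = name

    def affinity(self, irq):
        """CPUs this IRQ may be routed to.

        An IRQ nobody has assigned yet (a newly appeared device) falls
        into the default group and so stays off the 'special' CPUs.
        """
        return self.cpus[self.group_of.get(irq, self.default)]


groups = IrqGroups()
groups.define("normal", {0, 1})
groups.define("rt", {2, 3})
groups.assign(42, "rt")
```

Here IRQ 42 resolves to the `rt` CPUs, while an IRQ that has never been assigned resolves to the `normal` CPUs.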


--

From: Paul Jackson
Date: Friday, May 9, 2008 - 5:03 am

Ok ... so let me propose an entirely different solution.

No doubt it has some terrible flaw, but I'll just have to
await your replies to see what that is.

How about we have:

 1) Yet another text config file in /etc, this one containing
    lines having two fields:
	* a list of IRQs, and
	* a cpumask.
    This file would specify which CPUs should handle which IRQs.

 2) A utility that can be run, after changing the above file, 
    to poke the proper cpumask to each IRQ, as specified in
    the file.

(Obligatory "simple" marketing claim: the above requires no
kernel changes.)
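
A rough sketch of such a utility follows. The line format (an IRQ list like `16,18-20` followed by a hex cpumask), the comment syntax and both function names are assumptions made for illustration; the message only specifies the two fields.

```python
def parse_irq_config(text):
    """Map each IRQ number to the hex cpumask given on its line.

    Assumed line format (not specified in the thread):
        <irq list, e.g. 16,18-20>  <hex cpumask, e.g. 0f>
    Blank lines and '#' comments are ignored; later lines win.
    """
    masks = {}
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"line {lineno}: expected 2 fields")
        irq_list, mask = fields
        int(mask.replace(",", ""), 16)   # raises ValueError if not hex
        for part in irq_list.split(","):
            lo, _, hi = part.partition("-")
            for irq in range(int(lo), int(hi or lo) + 1):
                masks[irq] = mask
    return masks

def apply_irq_config(masks, proc_irq="/proc/irq"):
    """Write each mask to <proc_irq>/<N>/smp_affinity.

    Returns the IRQs that could not be written, for example because
    their /proc/irq/N directory does not exist.
    """
    failed = []
    for irq, mask in sorted(masks.items()):
        try:
            with open(f"{proc_irq}/{irq}/smp_affinity", "w") as f:
                f.write(mask)
        except OSError:
            failed.append(irq)
    return failed
```

Returning the failed IRQs rather than aborting lets the tool be rerun after more devices have come up.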

What am I missing?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Peter Zijlstra
Date: Friday, May 9, 2008 - 5:14 am

Two points:

 - we can't currently set irq affinities for non-existent (aka new) IRQs
 - it's a shame to duplicate the masks - most of this information would
also be used in the cpuset structure used to place the tasks.




--

From: Paul Jackson
Date: Friday, May 9, 2008 - 5:36 am

Ok.  Let me twist this a turn tighter then.

The first of your two points, a default affinity mask for new irqs,
would seem to require a kernel change.  But that change could be a
single cpumask, settable in /sys somewhere, specifying the default
affinity.  If that's all we needed, it would be easy.

The second of your two points, "duplicating masks", seems more delicate.

The space of named cpusets (the directory pathnames below the usual
mount point, /dev/cpuset) is not really much more compact than the
set of interesting cpumasks.  But I suppose your point is that some
of the -particular- cpumasks already named by the cpuset hierarchy
are tantalizingly close to the set of interesting cpumasks needed for
irq affinity ... close given some combination of union, intersection,
set difference and complement operations, given my usual bias toward
looking at such things as this using set theory mechanisms.  That is,
for example, one might want all the CPUs in cpusets foo, bar and baz,
except the CPUs in cpuset blip, to handle IRQs so and so.

Let me think on that ... it's my nap time now.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Friday, May 9, 2008 - 10:43 am

Ahh!  Perhaps that example has the keys to this kingdom.

How about this.  We add two files to each cpuset:

    irq_affinity_include	# IRQs to direct to CPUs in this cpuset
    irq_affinity_exclude	# IRQs -not- to direct to these CPUs

where irq_affinity_exclude overrides irq_affinity_include.

So, to determine to which CPUs a given interrupt (IRQ) can be directed:
 1) Combine (union) the 'cpus' of all the cpusets for which
    that IRQ is in that cpuset's irq_affinity_include, then
 2) Remove (set subtraction) the 'cpus' of any cpuset for which
    that IRQ is in that cpuset's irq_affinity_exclude.

In the simplest case of just wanting to isolate some CPUs with their
own special list of interrupts, one would:
 1) include all interrupts in the top cpuset's irq_affinity_include, and
 2) include the interrupts you don't want in the isolated cpuset's
    irq_affinity_exclude.

Observe that there is no dependency on the cpuset hierarchy in the above.
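
The two-step rule above can be sketched as a small function. The dictionary layout, the cpuset paths and the IRQ numbers are illustrative assumptions; only the include-then-exclude logic comes from the description.

```python
def irq_cpus(irq, cpusets):
    """CPUs an IRQ may be directed to under the include/exclude rule.

    `cpusets` maps a cpuset path to a dict with the keys 'cpus',
    'include' and 'exclude', each holding a set. The paths are only
    labels: the hierarchy plays no part in the computation.
    """
    allowed = set()
    # Step 1: union the 'cpus' of every cpuset that includes the IRQ.
    for cs in cpusets.values():
        if irq in cs["include"]:
            allowed |= cs["cpus"]
    # Step 2: subtract the 'cpus' of every cpuset that excludes it,
    # so an exclude always overrides an include.
    for cs in cpusets.values():
        if irq in cs["exclude"]:
            allowed -= cs["cpus"]
    return allowed

# Simplest case: all IRQs included at the top, and the isolated cpuset
# excludes the ones it does not want (hypothetical IRQs 16 and 17).
cpusets = {
    "/":   {"cpus": {0, 1, 2, 3}, "include": {16, 17, 18}, "exclude": set()},
    "/rt": {"cpus": {2, 3},       "include": set(),        "exclude": {16, 17}},
}
```

With this setup IRQ 16 is confined to CPUs 0 and 1, while IRQ 18, which `/rt` does not exclude, may still reach all four CPUs.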

The contents of the files irq_affinity_include and irq_affinity_exclude
would be inherited by child cpusets on creation from their parents.

The one detail that puzzles me at the moment is what ownership and
permissions these two irq_affinity_* files would have.  I am concerned
that the usual permissions, which allow a job to write its own cpuset
files would allow a job to affect the overall system to a greater
degree than is desired.  Perhaps an additional inheritance rule would
be useful and appropriate, such as a rule that a given cpuset's
irq_affinity_include must be a subset of its parent's or a rule that
a given cpuset's irq_affinity_exclude must be a -superset- of its
parent's; I'm unsure here.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Tuesday, May 20, 2008 - 6:21 pm

Looks like we arrived at the same conclusion. See my prev reply.
There is actually no duplication as far as I can see because IRQ layer already 
This would be an overkill imho.

Max

--

From: Max Krasnyanskiy
Date: Tuesday, May 20, 2008 - 6:14 pm

As Peter explained I'm focusing on the "CPU isolation" aspect. ie Shielding a 
CPU (or a set of CPUs) from various kernel activities (load balancing, soft 
and hard irq handling, workqueues, etc).

For the IRQs specifically all I need is to be able to tell the kernel to not 
route IRQs to certain CPUs. That mostly works already via 
/proc/irq/N/smp_affinity, the problem is dynamically allocated irqs because 
/proc/irq/N directory does not exist until those IRQs are allocated/enabled.

Originally I introduced global cpu_isolated_map. IRQ code was using that map 
to exclude CPU(s) from IRQ routing. What I realized now is that all I need is
/proc/irq/default_smp_affinity. In other words I just need to export default 
mask used by the IRQ layer. I think this makes sense regardless of what cpuset 
based solution we'll come up with.
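
To illustrate what would be written to such a file, here is a small sketch that builds the mask for "all CPUs except the isolated ones". It assumes the default mask uses the same hex bitmask format as /proc/irq/N/smp_affinity, with comma-separated 32-bit words on larger machines; the function names are hypothetical.

```python
def cpus_to_hex_mask(cpus):
    """Hex bitmask with bit N set for each CPU N.

    Masks wider than 32 bits are emitted as comma-separated 32-bit
    words, most significant word first.
    """
    bits = 0
    for cpu in cpus:
        bits |= 1 << cpu
    words = []
    while True:
        words.append(bits & 0xFFFFFFFF)   # collect least significant first
        bits >>= 32
        if bits == 0:
            break
    head, rest = words[-1], reversed(words[:-1])
    # Only the lower words are zero-padded to eight digits.
    return ",".join([f"{head:x}"] + [f"{w:08x}" for w in rest])

def default_affinity(all_cpus, isolated):
    """Mask that keeps newly allocated IRQs off the isolated CPUs."""
    return cpus_to_hex_mask(set(all_cpus) - set(isolated))
```

For an 8-CPU machine with CPUs 6 and 7 isolated, `default_affinity(range(8), {6, 7})` gives `3f`.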

Max
--

From: Arjan van de Ven
Date: Tuesday, May 20, 2008 - 9:45 pm

On Tue, 20 May 2008 18:14:58 -0700

why don't you tell irqbalance instead? it'll make sure the irq stays
out of the wind...
--

From: Max Krasnyanskiy
Date: Wednesday, May 21, 2008 - 9:18 am

That will be too late. By the time irqbalance sees that IRQ it may have 
already fired (possibly several times) on the "wrong" processor.

Max

--

From: Paul Jackson
Date: Tuesday, May 20, 2008 - 11:34 pm

I suspect that something like you're proposing to do here will answer
your needs, to "tell the kernel to not route IRQs to certain CPUs."

I suspect that other folks will have some additional needs, that perhaps
my idea of May 9, 2008:

       How about this.  We add two files to each cpuset:
       
           irq_affinity_include	# IRQs to direct to CPUs in this cpuset
           irq_affinity_exclude	# IRQs -not- to direct to these CPUs
       
       where irq_affinity_exclude overrides irq_affinity_include.

could meet.

It makes sense to me to deal with your "default_smp_affinity" patch
first, and then come back around and see what remains to be done, and
how to do it, perhaps with additional cpuset based mechanisms such as

Peter, et al: how does Max's planned "default_smp_affinity" patch sound
to you, as the next step we take on this?

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, May 21, 2008 - 10:58 am

Hi Paul,


I saw your earlier email with that proposal. Just had to digest it a bit :) 

That would work. But wouldn't it be hard for the users to debug things ? I 
mean if you have a complex cpuset hierarchy it may be hard to figure out why a 
certain irq is not getting to cpuX and vice versa.
btw How would we represent "all irqs", are you implying that those files 
contain masks ?
We'll also need to handle conflicts like "irq excluded from all cpusets", etc.
I still prefer "irq as a task" approach. It's very simple and straightforward 
mapping of an irq -> cpuset, no conflicts, etc. Easy to figure out for the 
user where an irq will end up.

btw I did not quite get the idea behind the "exclude" part. Why is "include" 

I think it makes sense regardless of the cpuset based approach. Seems like a 
logical extension of the existing interface (ie per IRQ mask plus the default).

Max


--

From: Paul Jackson
Date: Monday, April 14, 2008 - 11:42 am

Clearly, yes, the first is simpler than the second.

The question is which is correct.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Thursday, March 13, 2008 - 12:12 am

> How about we add support for sym links to the cgroup fs ?

Still pollutes the primary cpuset name space ... you have all
the directories X, X/A, and X/B as well as the symlinks A and B.

Symlinks allow for one path that needs to be 'aliased' to another,
but they are a one-way map; without an exhaustive search of the
potential namespace, one can't invert them, or determine if they
can't be inverted.

Tools have to constantly make heuristic decisions whether to
default to dereferencing the symlink, or not, and often have to
provide alternatives for the non-default choice.

They are a pain in the backside even if designed in and expected
up front.

If added as critical structure after the fact, something breaks,
pretty much for sure.

For one minor example, code I've probably buried someplace that
does "find /dev/cpuset -type d" to find all cpusets would break.

Or the one-line /sbin/cpuset_release_agent script:
	rmdir /dev/cpuset/$1
is broken -- fails to clean-up associated symlinks, and can't
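
Both breakages can be reproduced on an ordinary filesystem. The sketch below builds the proposed layout in a temporary directory (a stand-in, since no cgroup filesystem with symlink support is assumed to exist), lists cpusets the way `find -type d` would, and then removes X/A the way the release agent's `rmdir` would. The helper name `real_cpusets` is hypothetical.

```python
import os
import tempfile

def real_cpusets(root):
    """Directories below root, neither following nor counting symlinks
    (roughly `find root -type d`, minus root itself)."""
    found = []
    for dirpath, dirnames, _ in os.walk(root, followlinks=False):
        for d in dirnames:
            full = os.path.join(dirpath, d)
            # os.walk lists symlinks to directories in dirnames too.
            if not os.path.islink(full):
                found.append(os.path.relpath(full, root))
    return sorted(found)

with tempfile.TemporaryDirectory() as root:
    # Proposed layout: X/A and X/B are real, A and B are symlinks.
    for d in ("X/A", "X/B", "C"):
        os.makedirs(os.path.join(root, d))
    os.symlink("X/A", os.path.join(root, "A"))
    os.symlink("X/B", os.path.join(root, "B"))

    cpusets = real_cpusets(root)         # A and B are not reported

    os.rmdir(os.path.join(root, "X/A"))  # what the release agent does
    link = os.path.join(root, "A")
    dangling = os.path.islink(link) and not os.path.exists(link)
```

After the run, `cpusets` holds only C, X, X/A and X/B, and `dangling` is true: the symlink A outlives the cpuset it pointed to.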

Agreed ;)

But nice picture ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Thursday, April 10, 2008 - 10:24 am

Sorry for disappearing on you guys. I'm working on releasing the user-space 
framework and engine that uses cpu isolation for hard-RT. Once that's done I'm 
going to resurrect these efforts. In the mean time let me reply to your last 
comments.


Got it. Symlinks are out :)

Max


--

From: Paul Jackson
Date: Thursday, April 10, 2008 - 10:37 am

Good ;).

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Paul Jackson
Date: Wednesday, March 12, 2008 - 4:32 pm

Breaking existing software is not what I call working.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 5:46 pm

Ok ok I get it :)
You know what I meant though. In the other scheme it's not even clear how it'd 
work in general.

Max


--

From: Paul Menage
Date: Wednesday, March 12, 2008 - 12:16 pm

Not cgroups, no. If you really wanted to extend cpusets specifically
to allow irqs to be assigned to a cpuset to control which cpus they
could execute on, then that might be a possibility. But I don't see
how this would be useful for any other cgroup subsystem, so it doesn't
belong in the generic framework.

My feeling is that just using a simple bitmask assignment, unrelated
to cpusets or cgroups, as Max suggested in his later email is the way
to go.

Paul
--

From: Paul Jackson
Date: Wednesday, March 12, 2008 - 12:24 pm

I'll have to have another go at reading his replies.  I seem to have
more difficulty making sense of his posts ... not sure why.

-- 
                  I won't rest till it's the best ...
                  Programmer, Linux Scalability
                  Paul Jackson <pj@sgi.com> 1.940.382.4214
--

From: Max Krasnyanskiy
Date: Wednesday, March 12, 2008 - 12:30 pm

I'm sure it's because of gazillion typos in them :).

Max
--
