Re: [RFC] export irq_set/get_affinity() for multiqueue network drivers

From: Brice Goglin
Date: Thursday, August 28, 2008 - 1:21 pm

Hello,

Is there any way to set up IRQ masks from within a driver? myri10ge
currently relies on an external script (writing in
/proc/irq/*/smp_affinity) to bind each queue/MSI-X to a different
processor. By default, Linux will either:
* round-robin the interrupts (killing the benefit of DCA for instance)
* put all IRQs on the same CPU (killing much of the benefit of multislices)
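
For illustration, the kind of external script mentioned above could look roughly like the sketch below. It is not the actual myri10ge script; the function names and the IRQ numbers are hypothetical. It only relies on the fact that /proc/irq/<N>/smp_affinity takes a hexadecimal CPU bitmask, written in comma-separated 32-bit groups on machines with many CPUs.

```python
import os
from typing import Dict, List


def cpu_to_affinity_mask(cpu: int) -> str:
    """Return the hex bitmask string that selects exactly one CPU."""
    if cpu < 0:
        raise ValueError("CPU number must be non-negative")
    hex_digits = f"{1 << cpu:x}"
    # Split into 32-bit (8 hex digit) groups, most significant group first.
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])
        hex_digits = hex_digits[:-8]
    return ",".join(reversed(groups))


def plan_queue_affinity(irqs: List[int], cpus: List[int]) -> Dict[int, str]:
    """Assign each queue IRQ to one CPU, wrapping around if IRQs outnumber CPUs."""
    if not cpus:
        raise ValueError("need at least one CPU")
    return {irq: cpu_to_affinity_mask(cpus[i % len(cpus)])
            for i, irq in enumerate(irqs)}


def apply_affinity(plan: Dict[int, str], proc_root: str = "/proc/irq") -> None:
    """Write each mask to <proc_root>/<irq>/smp_affinity (requires root)."""
    for irq, mask in plan.items():
        with open(os.path.join(proc_root, str(irq), "smp_affinity"), "w") as f:
            f.write(mask)


if __name__ == "__main__":
    # Hypothetical IRQ numbers for four MSI-X vectors, bound to CPUs 0-3.
    for irq, mask in plan_queue_affinity([40, 41, 42, 43], [0, 1, 2, 3]).items():
        print(irq, mask)
```

Running it as a script only prints the plan; calling apply_affinity() on the result is what would actually change the bindings.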

With more and more drivers using multiqueues, I think we need a nice way
to bind MSI-X from within the drivers. I am not sure what's best; the
attached (untested) patch would just export the existing
irq_set_affinity() and add irq_get_affinity(). Comments?

thanks,
Brice

From: David Miller
Date: Thursday, August 28, 2008 - 1:56 pm

From: Brice Goglin <Brice.Goglin@inria.fr>

I think we should rather have some kind of generic thing in the
IRQ layer that allows specifying the usage model of the device's
interrupts, so that the IRQ layer can choose default affinities.

I never notice any of this complete insanity on sparc64 because
we flat spread out all of the interrupts across the machine.

What we don't want is drivers choosing IRQ affinity settings;
they have no idea about NUMA topology, what NUMA node the
PCI controller sits behind, what cpus are there, etc., and
without that kind of knowledge you cannot possibly make
affinity decisions properly.
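
As a rough illustration of the topology knowledge referred to here: on kernels that expose it, sysfs lists the CPUs local to a PCI device in a local_cpulist file, using the usual "0-3,8-11" cpulist syntax. The sketch below parses that syntax. The function names and the example PCI address are hypothetical, and this is only one way userspace might look the information up, not a proposal for the IRQ layer.

```python
import os
from typing import List


def parse_cpulist(text: str) -> List[int]:
    """Expand a kernel cpulist string such as "0-3,8-11" into CPU numbers."""
    cpus: List[int] = []
    for part in text.strip().split(","):
        if not part:
            continue  # empty list or trailing comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return cpus


def local_cpus_for_device(pci_addr: str,
                          sysfs_root: str = "/sys/bus/pci/devices") -> List[int]:
    """Read the CPUs local to a PCI device, e.g. pci_addr="0000:03:00.0"."""
    path = os.path.join(sysfs_root, pci_addr, "local_cpulist")
    with open(path) as f:
        return parse_cpulist(f.read())
```

For example, parse_cpulist("0-3,8-11") yields CPUs 0 through 3 and 8 through 11, which could then be used as the candidate set when spreading a device's interrupts.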
--

From: Brice Goglin
Date: Friday, August 29, 2008 - 12:08 am

As long as we get something better than the current behavior, I am fine
with it :)

Brice
--

From: Arjan van de Ven
Date: Friday, August 29, 2008 - 5:50 am

On Thu, 28 Aug 2008 22:21:53 +0200

* do the right thing with the userspace irq balancer

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Andi Kleen
Date: Friday, August 29, 2008 - 9:48 am

It probably also needs to be hooked up to sched_mc_power_savings.
When the switch is on, the interrupts shouldn't be spread out over
that many sockets.

Does it need callbacks to change the interrupts when that variable
changes?

Also I suspect handling SMT explicitly is a good idea, e.g. I would
always set the affinity to all thread siblings in a core, not
just a single one, because a context switch is very cheap between them.
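
A minimal sketch of the sibling idea, assuming the sibling list for a CPU is read from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list; the function names are hypothetical. It builds one affinity mask covering every hardware thread of a core instead of a single logical CPU.

```python
from typing import Iterable, List


def siblings_to_mask(siblings: Iterable[int]) -> str:
    """Build a hex affinity mask with one bit set per sibling thread."""
    mask = 0
    for cpu in siblings:
        mask |= 1 << cpu
    hex_digits = f"{mask:x}"
    # Masks wider than 32 bits are written as comma-separated 32-bit groups.
    groups = []
    while hex_digits:
        groups.append(hex_digits[-8:])
        hex_digits = hex_digits[:-8]
    return ",".join(reversed(groups))


def read_thread_siblings(cpu: int,
                         sysfs_root: str = "/sys/devices/system/cpu") -> List[int]:
    """Read the thread siblings of a CPU; the file uses "0,4" or "0-1" syntax."""
    path = f"{sysfs_root}/cpu{cpu}/topology/thread_siblings_list"
    siblings: List[int] = []
    with open(path) as f:
        for part in f.read().strip().split(","):
            if "-" in part:
                first, last = part.split("-", 1)
                siblings.extend(range(int(first), int(last) + 1))
            elif part:
                siblings.append(int(part))
    return siblings
```

On a machine where CPUs 0 and 4 are siblings, siblings_to_mask([0, 4]) gives "11", i.e. both threads of that core are allowed to take the interrupt.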

-Andi

-- 
ak@linux.intel.com
--

From: Arjan van de Ven
Date: Friday, August 29, 2008 - 9:52 am

On Fri, 29 Aug 2008 18:48:12 +0200


That is what irqbalance already does today, at least for what it
considers somewhat slower irqs.
For networking it still sucks because the packet reordering logic is
per logical cpu, so you still don't want to receive packets from the
same "stream" over multiple logical cpus.



-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Rick Jones
Date: Friday, August 29, 2008 - 10:14 am

That is true, but don't they also "compete" for pipeline resources?

rick jones
--
