At the moment, devices with the MSI-X capability can request multiple interrupts, but devices with MSI can only request one. This isn't an inherent limitation of MSI, it's just the way that Linux currently implements it. I intend to lift that restriction, so I'm throwing out some idea that I've had while looking into it. First, architectures need to support MSI, and I'm ccing the people who seem to have done the work in the past to keep them in the loop. I do intend to make supporting multiple MSIs optional (the midlayer code will fall back to supporting only a single MSI). Next, MSI requires that you assign a block of interrupts that is a power of two in size (between 2^0 and 2^5), and aligned to at least that power of two. I've looked at the x86 code and I think this is doable there [1]. I don't know how doable it is on other architectures. If not, just ignore all this and continue to have MSI hardware use a single interrupt. In a somewhat related topic, I really don't like the API for pci_enable_msix(). The all-or-nothing allocation and returning the number of vectors that could have been allocated is a bit kludgy, as is the existence of the msix_entry vector. I'd like some advice on a couple of alternative schemes: 1. pci_enable_msi_block(pdev, nr_irqs). If successful, updates pdev->irq to be the base irq number; the allocated interrupts are from pdev->irq to pdev->irq + nr_irqs - 1. If it fails, return the number of interrupts that could have been allocated. 2. pci_enable_msi_block(pdev, nr_irqs, min_irqs). Will allocate at least min_irqs or return failure, otherwise same as above. My design is largely influenced by the AHCI spec where the device can potentially cope with any number of MSI interrupts allocated and will use them as best it can. I don't know how common that is. One thing I do want to be clear in the API is that the driver can ask for any number of irqs, the pci layer will round up to the next power of two if necessary. I don't ...
Interesting. I've been thinking about that one for some time but back then, the feedback I got left and right is that nobody cares :-) I'm adding Michael Ellerman to the CC list, he's done a good part of the Well, it requires that for HW number. But I don't think it should require that at API level (ie. for driver visible irq numbers). Some architectures can fully remap between HW sources and "linux" visible IRQ numbers and thus wouldn't have that limitation from an API point of That would constraint the linux IRQ numbers to be a linear block just like the HW numbers. Better than having them be a power-of-two aligned but still a restriction on SW number allocation, though it's probably Well, that's where I'm not happy. The API shouldn't expose the "power-of-two" thing. The numbers shown to drivers aren't in the same space as the source numbers as seen by the HW on many architectures and It's very implementation specific. IE. On most powerpc implementations, MSI just route via a decoder to sources of the existing interrupt controller so we can control per-source affinity at that level. Some x86 seem to require different base addresses which makes it mostly impossible to spread them I believe (maybe that's why people came up Cheers, Ben. --
This is true and worth considering carefully. Are IRQ numbers a scarce resource on PowerPC? They are considerably less scarce than interrupt vectors are on x86-64. How hard is it to make IRQ numbers an abundent resource? Is it simply a question of increasing NR_IRQS? This cost should be traded off against the cost of allocating something like the msix_entry array in each driver that wants to use multiple MSIs, passing that array around, using it properly, etc. It would make some sense to pass nr_irqs all the way down to arch code and let arch code take care of reserving the block of vectors (aligned appropriately). That would conserve IRQ numbers, though not vectors. I think we have to consider excess vectors reserved. If we don't, we could get into the situation where a device uses more interrupts than the driver thinks it will and problems ensue. By the way, would people be interested in changing the MSI-X API to get rid of the msix_entry array? If allocating consecutive IRQs isn't a problem, then we could switch the MSI-X code to use consecutive IRQs. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Yes, indeed, they aren't really scarce... actually less than the underlying HW vectors in most cases, so it isn't a big issue to add some Ok, so I lift my objection there in the sense that allocating a linear array of virtual numbers shouldn't be a problem (somebody remind me to make NR_IRQS a config option one of these days on ppc, or help with just getting rid of irq_desc array alltogether :-) However, do you want to still keep the fact that they are power-of-2 aligned up to the API or can I just do a linear block allocation for virtual number sand require drivers to do the appropriate addition/subtraction to get the N'th one ? I will need to allocate appropriately aligned HW numbers but that's done via different mechanisms (and in some case not even under full linux control, ie, It would make a lot of code simpler... Ben. --
Not scarce, but increasing NR_IRQS makes some static arrays bigger, It's not a pretty API to be sure. I thought drivers needed the flexibility of being able to specify non-contiguous ranges. In practice it looks like only s2io is doing anything different. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab email: michaele@au.ibm.com stime: ellerman@au1.ibm.com notes: Michael Ellerman/Australia/IBM phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person --
Some years ago, we had discussions about getting rid of IRQ numbers altogether, or at least the requirement to have device drivers know about them. Does anyone remember what happened to that idea? I think the concept was that you pass around struct irq_desc pointers that may or may not be dynamically allocated by the interrupt controller code. Another simplification that should really help here is encapsulating request_irq() per subsystem so that you can do something like int my_irq(struct pci_dev *dev); err = pci_request_irq(pci_dev, &my_irq, IRQF_SHARED); Most PCI drivers should be trivial to convert to this model. If you want to have multiple MSI/MSI-X interrupts for one PCI device with this model, you'd need to introduce the number back as an offset, I guess. Arnd <>< --
I think it's not totally dead. Last I heard, someone (jgarzik ?) was slowly, bit by bit, removing the dependencies on the irq argument on irq Yup. There are still a few hard dependencies on numbers left and right tho. The main issue is old userspace tied to the layout of things like /proc/interrupts though I'd be happy to special case the 16 "legacy" interrupts (like we do on powerpc in our remapping layer) and Might indeed be a good idea. Cheers, Ben. --
You can't do that. /proc/interrupts is so terribly useful for a sysadmin that you can't remove information from it. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Ouch, missed that one... sad, would have been a good idea in the long You can create a new one with informations about the new stuff.. Anyway, looks like it's not happening and we'll be stuck with the bloody array for the time being. Crap. Ben. --
Well it looks like Linus' main objection is with changing the driver In most cases that I can see the conversion from irq number to irq_desc is done in the genirq code, so I don't see why you couldn't just put a remapping in there from irq numbers to descs that doesn't use an array.=20 But maybe I'm missing something. cheers --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
There are a pile of drivers you would also need to fix up that use NR_IRQ sized tables. There is also the performance concern. The irq->desc conversion has to be very fast. It would be better (IMHO) if dev->irq and irq in general was a pointer whose number for end users was irq->num or similar. --
Similar situation on x86-64, so I already thought about it ;-) I think the IRQ numbers should be contiguous, but not necessarily aligned. It certainly would! -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
On Thu, Jul 03, 2008 at 01:24:29PM +1000, Benjamin Herrenschmidt wrote: The drivers have to deal with the limitations of the HW spec. In this case it means they have to know they are getting power of 2 number of interrupts. I think exposing this in the API is a requirement Correct. MSI only has one address for multiple vectors and thus will only target one CPU. MSI-X has address/vector pairs (1:1). If the Local-APICs are able to redirect interrupts, then multiple CPUs can process the interrupts. I expect this "HW Interrupt redirection" is what the PCI committee expected to be used...however HP (and perhaps others) have HW which didn't implement "XTP" register (IIRC, that's the register required to redirect interrupts by the Local-APIC) since one gets better performance by "targeting" interrupts at specific CPUs. hth, grant --
That's not the only way it can work. If you have an APIC per root bus, you can target that with the write. The APIC could then map the interrupt request to the appropriate CPU. In this scenario, programming affinity would be twiddling some bits in the APIC and not need to write to the device's MSI register at all. What I've implemented for x86-64 can target any mask of CPUs that are in the same interrupt domain. My machine only has one interrupt domain, so I can target the MSI to any subset of the CPUs. They all move together, so you can't target a different subset of CPUs for different MSIs on the same device. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
On Mon, Jul 07, 2008 at 10:39:19AM -0600, Matthew Wilcox wrote: Ok...but that still means it's got only one target (either one CPU *nod* - that's what I meant. thanks, grant --
Ugh ? The number of interrupts isn't what I was talking about, but the -alignment- was. ie, the driver should make no assumption that the irq numbers it will obtain will be aligned to a power-of-two. Cheers, Ben. --
I don't think it's quite that strong. If a driver asked for 6 interrupts the MSI code could setup 8 and have 2 just hooked to nothing. I'm not sure that's a good idea, but it's possible. cheers -- Michael Ellerman OzLabs, IBM Australia Development Lab email: michaele@au.ibm.com stime: ellerman@au1.ibm.com notes: Michael Ellerman/Australia/IBM phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person --
Here's some code. It's four patches, the first two are to the PCI MSI code, the third is for the AHCI driver and the fourth is for the x86-64 interrupt code. It's had some light testing; no performance testing yet. I'd value some review. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
It's taking a while to come through ... patches are also at: http://www.parisc-linux.org/~willy/multiple-msi/ I'll put up the git tree if there's demand. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
I found and fixed a couple of bugs ... and implemented CPU affinity. As the comment says, when you move one MSI, you move them all (at least as far as I can figure out the x86 APIC architecture ... other architectures may not have this problem). Git tree now available: git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git multiple-msi (will be updated as and when) git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git multiple-msi-20080705 (semi-permanent) I don't intend to submit these patches to Linus myself; I'd like the first two to go in through the PCI tree. The third patch does depend on the first two, but should go in through the IDE tree. The fourth patch is entirely independent of the first three and can go in through the x86 tree at any time. I love these cross-responsibility patch sets ;-) -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Implement the arch_setup_msi_block() interface. Rewrite create_irq()
into create_irq_block() and call create_irq_block() from create_irq().
Implement __assign_irq_vector_block() based closely on __assign_irq_vector().
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
arch/x86/kernel/io_apic_64.c | 199 ++++++++++++++++++++++++++++++++++++++----
1 files changed, 183 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/io_apic_64.c b/arch/x86/kernel/io_apic_64.c
index ef1a8df..44e942a 100644
--- a/arch/x86/kernel/io_apic_64.c
+++ b/arch/x86/kernel/io_apic_64.c
@@ -683,6 +683,8 @@ static int pin_2_irq(int idx, int apic, int pin)
return irq;
}
+static int current_vector = FIRST_DEVICE_VECTOR;
+
static int __assign_irq_vector(int irq, cpumask_t mask)
{
/*
@@ -696,7 +698,7 @@ static int __assign_irq_vector(int irq, cpumask_t mask)
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu;
struct irq_cfg *cfg;
@@ -769,6 +771,97 @@ static int assign_irq_vector(int irq, cpumask_t mask)
return err;
}
+static int __assign_irq_vector_block(int irq, int count, cpumask_t mask)
+{
+ unsigned int old_vector;
+ int i, cpu;
+ struct irq_cfg *cfg;
+
+ /*
+ * We've got to be careful not to trash gate 0x80,
+ * because int 0x80 is hm, kind of importantish. ;)
+ */
+ BUG_ON((unsigned)irq + count > NR_IRQS);
+
+ /* Only try and allocate irqs on cpus that are present */
+ cpus_and(mask, mask, cpu_online_map);
+
+ for (i = 0; i < count; i++) {
+ cfg = &irq_cfg[irq + i];
+ if ((cfg->move_in_progress) || cfg->move_cleanup_count)
+ return -EBUSY;
+ }
+
+ cfg = &irq_cfg[irq];
+ old_vector = cfg->vector;
+ if (old_vector) {
+ cpumask_t tmp;
+ cpus_and(tmp, cfg->domain, mask);
+ if (!cpus_empty(tmp))
+ return ...This first part simply changes the msi_attrib data structure to store
how many vectors have been allocated. In order to do this, I shrink the
'type' from 5 bits to 2 and rename it to _type to catch any unsuspecting
users.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
drivers/pci/msi.c | 41 ++++++++++++++++++++++++-----------------
drivers/pci/msi.h | 4 ----
include/linux/msi.h | 6 +++++-
3 files changed, 29 insertions(+), 22 deletions(-)
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 8c61304..92992a8 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -106,11 +106,11 @@ static void msix_flush_writes(unsigned int irq)
entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
+ switch (entry->msi_attrib._type) {
+ case MSI_ATTRIB:
/* nothing to do */
break;
- case PCI_CAP_ID_MSIX:
+ case MSIX_ATTRIB:
{
int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
@@ -129,8 +129,8 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
+ switch (entry->msi_attrib._type) {
+ case MSI_ATTRIB:
if (entry->msi_attrib.maskbit) {
int pos;
u32 mask_bits;
@@ -144,7 +144,7 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
msi_set_enable(entry->dev, !flag);
}
break;
- case PCI_CAP_ID_MSIX:
+ case MSIX_ATTRIB:
{
int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
@@ -162,8 +162,8 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
void read_msi_msg(unsigned int irq, struct msi_msg *msg)
{
struct msi_desc *entry = get_irq_msi(irq);
- switch(entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
+ switch(entry->msi_attrib._type) {
+ case MSI_ATTRIB:
{
...Please don't, it significantly uglifies the code IMHO. Just add a new field for the size, I'd rather call it qsize to match the register. If you're worried about bloating msi_desc, there's several fields in there that are per-device not per-desc, so we could do another patch to move them into pci_dev or something hanging off it, eg. pci_dev->msi_info rather than storing them in every desc. --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
Uglifies the code? Seriously? Other than the _ addition (which really I just did to be sure I didn't miss a case), how is MSI_ATTRIB uglier than PCI_CAP_ID_MSI? I'd like to rename the register definition from QSIZE. It's _not_ a queue. I don't know where this misunderstanding came from, but I Might be worth it anyway for devices with lots of MSI-X interrupts. I think the MSI-X implementation is a bit poorly written anyway. If we had an array of msi_desc for each device, we could avoid the list_head in the msi_desc, for example. That'd save two pointers (8 or 16 bytes), plus the overhead of allocating each one individually. I also think that MSI-X could be improved by changing the interface to do away with this msix_entry list passed in -- just allocate the irqs consecutively. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Apparently some drivers rely on scattered allocation of MSI-X Ben. --
Yeah seriously :) The _ is part of it, but MSI_ATTRIB is uglier than PCI_CAP_ID_MSI exactly because it's not PCI_CAP_ID_MSI, which exists and I didn't say it was a queue, but a Q ;) But I agree it's not a good name, the spec calls it "multiple message enable", nvec would match the Eventually yeah, last I looked we didn't have any drivers using more It would be nice, but as I said the other day we have at least one driver (s2io) which asks for non-consecutive entries. That doesn't effect the irq allocation, but you need some way for the driver to express it. cheers --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
Here's an improvement over both the status quo and my patch -- simply
I don't see what's wrong with 'multiple'. log_nvec is clunky, and
Ouch. I just used pahole and discovered we were using 72 bytes on
64-bit. A swift rearrangement of a u16 gets us back down to 64.
Here's the replacement patch:
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 8c61304..8f7e483 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -106,20 +106,10 @@ static void msix_flush_writes(unsigned int irq)
entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
- /* nothing to do */
- break;
- case PCI_CAP_ID_MSIX:
- {
+ if (entry->msi_attrib.is_msix) {
int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
readl(entry->mask_base + offset);
- break;
- }
- default:
- BUG();
- break;
}
}
@@ -129,8 +119,12 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
+ if (entry->msi_attrib.is_msix) {
+ int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
+ writel(flag, entry->mask_base + offset);
+ readl(entry->mask_base + offset);
+ } else {
if (entry->msi_attrib.maskbit) {
int pos;
u32 mask_bits;
@@ -143,18 +137,6 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
} else {
msi_set_enable(entry->dev, !flag);
}
- break;
- case PCI_CAP_ID_MSIX:
- {
- int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
- PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
- writel(flag, entry->mask_base + offset);
- readl(entry->mask_base + offset);
- break;
- }
- default:
- BUG();
- break;
}
entry->msi_attrib.masked = !!flag;
}
@@ -162,9 +144,14 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 ...I prefer the "is_msix" for readability though I'm not particularly fond of bit fields (in any form). This is why I don't like bit fields. "uninitialized" (3rd state) doesn't exist. Is there something else in place to catch that state? (It's clearly a bug if someone did that and maybe I'm being too paranoid.) hth, grant --
msi_descs are only allocated in two places, one of which sets is_msix and the other doesn't. I wouldn't object to passing parameters to alloc_msi_entry() so there's only one place that fills in is_msix, but yes, I do think the current code is too paranoid. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Perhaps I'm pedantic, but I'd rather it was two patches, one to change *msg) A #define for "<< 4" would be nice. And should we be paranoid about potentially writing 0b110 or 0b111 which are reserved? cheers --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
Heh, I've been calling it MSI-Y in my internal monologue ;-) Maybe I'll make that change. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Add the new API pci_enable_msi_block() to allow drivers to
request multiple MSIs. Reimplement pci_enable_msi in terms
of pci_enable_msi_block. Add a default implementation of
arch_setup_msi_block() that only allows one MSI to be requested.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
arch/powerpc/kernel/msi.c | 2 +-
drivers/pci/msi.c | 109 +++++++++++++++++++++++++++++----------------
include/linux/msi.h | 3 +-
include/linux/pci.h | 6 ++-
4 files changed, 77 insertions(+), 43 deletions(-)
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index c62d101..317c7c8 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -32,7 +32,7 @@ int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
return ppc_md.setup_msi_irqs(dev, nvec, type);
}
-void arch_teardown_msi_irqs(struct pci_dev *dev)
+void arch_teardown_msi_irqs(struct pci_dev *dev, int nvec)
{
return ppc_md.teardown_msi_irqs(dev);
}
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 92992a8..6cbdf11 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -40,18 +40,31 @@ arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *entry)
}
int __attribute__ ((weak))
+arch_setup_msi_block(struct pci_dev *pdev, struct msi_desc *desc, int nvec)
+{
+ if (nvec > 1)
+ return 1;
+ return arch_setup_msi_irq(pdev, desc);
+}
+
+int __attribute__ ((weak))
arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;
int ret;
- list_for_each_entry(entry, &dev->msi_list, list) {
- ret = arch_setup_msi_irq(dev, entry);
- if (ret)
- return ret;
+ if (type == PCI_CAP_ID_MSI) {
+ desc = list_first_entry(&dev->msi_list, struct msi_desc, list);
+ ret = arch_setup_msi_block(dev, desc, nvec);
+ } else {
+ list_for_each_entry(desc, &dev->msi_list, list) {
+ ret = arch_setup_msi_irq(dev, desc);
+ if (ret)
+ break;
+ }
}
- return ...I don't think you need arch_setup_msi_block() at all. We already have an arch hook that takes a number of irqs, it's arch_setup_msi_irqs(), plural. It also has the type passed to it (MSI or MSI-X), so it can decide if it needs to allocate the irq numbers contiguously. Or am I missing something? --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
I suppose I should audit the current implementors of arch_setup_msi_irqs (er, maybe that's just you?) to be sure that there's no assumption that MSI -> asked for one. I'll look into doing it your way tomorrow (my timezone ;-) -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Yeah I think it's just us. But there's also the default implementation, which will happily use the singular arch hook to setup multiple MSIs without any constraint on the irq numbers - which will break. So I think you want to make the default arch_msi_check_device() return an error if you ask for MSI & nvec > 1. Then on powerpc we'll probably add the same check to our version (at least until we can test it), but Sure, although that'll be today in my timezone :D cheers --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
That was my intent ... something like this:
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index c62d101..79ff21f 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -29,10 +29,12 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
int arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
return ppc_md.setup_msi_irqs(dev, nvec, type);
}
-void arch_teardown_msi_irqs(struct pci_dev *dev)
+void arch_teardown_msi_irqs(struct pci_dev *dev, int nvec)
{
return ppc_md.teardown_msi_irqs(dev);
}
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 92992a8..4f7b31f 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -42,11 +42,14 @@ arch_setup_msi_irq(struct pci_dev *dev, struct msi_desc *entry)
int __attribute__ ((weak))
arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
{
- struct msi_desc *entry;
+ struct msi_desc *desc;
int ret;
- list_for_each_entry(entry, &dev->msi_list, list) {
- ret = arch_setup_msi_irq(dev, entry);
+ if ((type == PCI_CAP_ID_MSI) && (nvec > 1))
+ return 1;
+
+ list_for_each_entry(desc, &dev->msi_list, list) {
+ ret = arch_setup_msi_irq(dev, desc);
if (ret)
return ret;
}
@@ -60,13 +63,16 @@ void __attribute__ ((weak)) arch_teardown_msi_irq(unsigned int irq)
}
void __attribute__ ((weak))
-arch_teardown_msi_irqs(struct pci_dev *dev)
+arch_teardown_msi_irqs(struct pci_dev *dev, int nvec)
{
struct msi_desc *entry;
list_for_each_entry(entry, &dev->msi_list, list) {
- if (entry->irq != 0)
- arch_teardown_msi_irq(entry->irq);
+ int i;
+ if (entry->irq == 0)
+ continue;
+ for (i = 0; i < nvec; i++)
+ arch_teardown_msi_irq(entry->irq + i);
}
}
@@ -350,7 +356,7 @@ EXPORT_SYMBOL_GPL(pci_restore_msi_state);
* multiple messages. A return of zero indicates the successful setup
* of an entry zero with the new MSI irq or non-zero for ...This should go in arch_msi_check_device(). We might move it into a I think the check should be in the generic arch_msi_check_device(), so This looks wrong. You're looping through all MSIs for the device, and then for each one you're looping through all MSIs for the device. And you're assuming they're contiguous, which they won't be for MSI-X. You don't describe this behaviour in the doco. I'm a bit lukewarm on it, ie. returning the number that /could/ be allocated and having drivers use that, I think it's likely drivers will be poorly tested in the case where they get fewer irqs than they ask for. But I suppose that's a Your patches would be easier to read if you didn't keep renaming to ; Here you have "count", the implementation uses "nr_irqs", and the rest Someone will probably say this should be a static inline. cheers --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
... then x86 has to implement arch_msi_check_device in order to _not_ For MSI-X, nvec will be = 1. Maybe I should call it something else to avoid confusion. The code won't work for me as-was because it won't Ah, I changed the bahviour (to match msix) and forgot to update the comment. Thanks, I'll fix that. By the way I have an updated version There's inconsistency between the various implementations too. I got Not quite sure why. You don't get any better typechecking by making it a static inline. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Agreed, but I think that's still better. You might have alignment It will call arch_teardown_msi_irq() for all entries, unless they were never allocated (entry->irq =3D=3D 0). Or are we talking about different things? If you mean that you're allocating more irqs than there are entries then Yeah I agree, just pointing it out. cheers --=20 Michael Ellerman OzLabs, IBM Australia Development Lab wwweb: http://michael.ellerman.id.au phone: +61 2 6212 1183 (tie line 70 21183) We do not inherit the earth from our ancestors, we borrow it from our children. - S.M.A.R.T Person
AHCI controllers can support up to 16 interrupts, one per port. This
saves us a readl() in the interrupt path to determine which port has
generated the interrupt.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
drivers/ata/ahci.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 79 insertions(+), 5 deletions(-)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 061817a..4b2f90a 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -102,6 +102,7 @@ enum {
/* HOST_CTL bits */
HOST_RESET = (1 << 0), /* reset controller; self-clear */
HOST_IRQ_EN = (1 << 1), /* global IRQ enable */
+ HOST_MSI_RSM = (1 << 2), /* Revert to Single Message */
HOST_AHCI_EN = (1 << 31), /* AHCI enabled */
/* HOST_CAP bits */
@@ -1771,6 +1772,13 @@ static void ahci_port_intr(struct ata_port *ap)
}
}
+static irqreturn_t ahci_msi_interrupt(int irq, void *dev_instance)
+{
+ struct ata_port *ap = dev_instance;
+ ahci_port_intr(ap);
+ return IRQ_HANDLED;
+}
+
static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
{
struct ata_host *host = dev_instance;
@@ -2210,6 +2218,66 @@ static void ahci_p5wdh_workaround(struct ata_host *host)
}
}
+static int ahci_request_irqs(struct pci_dev *pdev, struct ata_host *host)
+{
+ int i, n_irqs, rc;
+ struct ahci_host_priv *hpriv = host->private_data;
+
+ if (hpriv->flags & AHCI_HFLAG_NO_MSI) {
+ n_irqs = 1;
+ } else {
+ u16 control;
+ int pos = pci_find_capability(pdev, PCI_CAP_ID_MSI);
+ pci_read_config_word(pdev, pos + PCI_MSI_FLAGS, &control);
+ n_irqs = 1 << ((control & PCI_MSI_FLAGS_QMASK) >> 1);
+
+ for (;;) {
+ rc = pci_enable_msi_block(pdev, n_irqs);
+ if (rc == 0) {
+ void __iomem *mmio = host->iomap[AHCI_PCI_BAR];
+ u32 host_ctl = readl(mmio + HOST_CTL);
+ if ((host_ctl & HOST_MSI_RSM) == 0)
+ break;
+ pci_disable_msi(pdev);
+ n_irqs = 1;
+ } else if (rc < 0) {
+ n_irqs = 1;
+ pci_intx(pdev, ...If the system is busy, the readl is the cost of coalescing the interrupts. I suspect it's cheaper to take one readl than handle 16 individual interrupts. I'm just pointing out the only upside of the existing code and not trying to argue against this patch. Can you confirm the upside for the use case of when multiple SSDs are attached? Avoiding the readl _and_ the loop will work much better if interrupts are getting redirected to multiple CPUs. Latency for each drive will be substantially lower and handle many more IOPS. Hrm...without targeting a particular socket on a multi-socket machine, spinlocks and control data cachelines are going to bounce around alot. *shrug* BTW, one more downside of the regular IRQ is it's possibly shared. Using MSI guaratees exclusive IRQ and avoids spurious readl's when AHCI is not busy but the other device is. This would be worth noting (or as a reminder) in the change log or as a comment in the code. hth, --
16 would be a maximum imposed by the AHCI spec. My ICH9 board has 6 There may well be an upside to the existing code, but it's pretty slim. The oprofile shows clearly that ahci_interrupt is the largest consumer of time during an iozone run. The only thing that routine does is read the HOST_IRQ_STAT register, acquire the spinlock and loop calling ahci_port_intr(). I don't have a profile for this new code yet. Hopefully we'll have one AHCI already allocates itself a new MSI if the machine supports MSI. This change merely extends AHCI to use multiple MSIs. Thanks. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Willy, where you able to get this profile? ok. thanks, --
