I'd like to thank Michael Ellerman for his feedback. This is a much better patchset than it used to be. Changes since the previous patchset: - Dumped the variable renaming as it made the patchset harder to read. - Moved the addition of the 'multiple' field from the first patch to the second. - Changed the calling convention of arch_teardown_msi_irqs back to upstream. - Fix the docbook for pci_enable_msi_block - Limit powerpc's MSI allocation to 1 in arch_msi_check_device, not arch_setup_msi_irqs - Rewrote AHCI interrupt claiming code to handle failures better and to claim fewer interrupts if possible. Outstanding things I'd like reviewed: - The x86-64 patch has received no comments so far. - The AHCI patch needs to be re-reviewed. Sorry, Jeff. It's a bit more complex than I'd like, but I don't see a clearer way to write it, given the nasty behaviour that the AHCI spec permits. - I've just added the MSI-HOWTO patch to this set. New reviewers are welcome. Patchset available as a git tree from: git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git multiple-msi (multiple-msi-20080710 is this patch set in particular) Patches will be sent in reply to this message. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
Implement the arch_setup_msi_irqs() interface. Extend create_irq()
into create_irq_block() and reimplement create_irq as a wrapper around
it. Create assign_irq_vector_block() based closely on
assign_irq_vector(). Teach set_msi_irq_affinity() how to handle
multiple MSIs.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
arch/x86/kernel/io_apic_64.c | 237 ++++++++++++++++++++++++++++++++++++------
1 files changed, 205 insertions(+), 32 deletions(-)
diff --git a/arch/x86/kernel/io_apic_64.c b/arch/x86/kernel/io_apic_64.c
index ef1a8df..6a00dca 100644
--- a/arch/x86/kernel/io_apic_64.c
+++ b/arch/x86/kernel/io_apic_64.c
@@ -61,7 +61,7 @@ struct irq_cfg {
};
/* irq_cfg is indexed by the sum of all RTEs in all I/O APICs. */
-struct irq_cfg irq_cfg[NR_IRQS] __read_mostly = {
+static struct irq_cfg irq_cfg[NR_IRQS] __read_mostly = {
[0] = { .domain = CPU_MASK_ALL, .vector = IRQ0_VECTOR, },
[1] = { .domain = CPU_MASK_ALL, .vector = IRQ1_VECTOR, },
[2] = { .domain = CPU_MASK_ALL, .vector = IRQ2_VECTOR, },
@@ -683,6 +683,8 @@ static int pin_2_irq(int idx, int apic, int pin)
return irq;
}
+static int current_vector = FIRST_DEVICE_VECTOR;
+
static int __assign_irq_vector(int irq, cpumask_t mask)
{
/*
@@ -696,7 +698,7 @@ static int __assign_irq_vector(int irq, cpumask_t mask)
* Also, we've got to be careful not to trash gate
* 0x80, because int 0x80 is hm, kind of importantish. ;)
*/
- static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+ static int current_offset = 0;
unsigned int old_vector;
int cpu;
struct irq_cfg *cfg;
@@ -769,6 +771,97 @@ static int assign_irq_vector(int irq, cpumask_t mask)
return err;
}
+static int __assign_irq_vector_block(int irq, int count, cpumask_t mask)
+{
+ unsigned int old_vector;
+ int i, cpu;
+ struct irq_cfg *cfg;
+
+ /*
+ * We've got to be careful not to trash gate 0x80,
+ * because int 0x80 is hm, kind of importantish. ;)
+ */
+ BUG_ON((unsigned)irq + ...Hi, I'm very sorry for very delayed comment, but I have a concern about irq affinity code. Since I didn't have enough time to look at your patch, I might be misunderstanding something... In my understanding, when irq affinity is changed on one of the IRQs corresponding to MSI, it will not completed until interrupts occur on all the IRQs. Attempts to change irq affinity before previous affinity change is finished will be ignored. Here, suppose that device uses three MSI irqs. In this case, your patch manages four irqs, but only three interrupts are used. If irq affinity changed on this MSI interrupts, will it be completed? Thanks, Kenji Kaneshige --
I have tested the affinity code with an ICH9 AHCI: 495: 117233 117966 118033 117797 PCI-MSI-edge ahci 496: 29860 29106 30191 28705 PCI-MSI-edge ahci 497: 0 0 0 0 PCI-MSI-edge ahci 498: 0 0 0 0 PCI-MSI-edge ahci 499: 0 0 0 0 PCI-MSI-edge ahci 500: 0 0 0 0 PCI-MSI-edge ahci This chip requires 16 MSIs to be registered, and it has 6 ports. Only ports 0 and 1 have a device attached. If I change the mask of an active irq (eg 495 or 496), it takes effect on both of them. If I change the mask of an inactive irq (497-500), nothing happens. But I can subsequently change the mask on 495 or 496 successfully. I can't tell you why this works this way; I haven't looked in enough detail at the irq affinity code, but this is my observation. Thanks for your comment. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
What I was worrying about was around irq_cfg->move_in_progress. Since your environment has less than 8 cpus according to your /proc/interrupts, I think your environment seems to use "apic_flat" mode for interrupt handling. In this case, irq_cfg->move_in_progress is not used for irq migration. I think this is why your environment didn't encounter the problem I worried about. We should note that vector allocation logic varies depending on the environment. BTW, I looked at your take 4 patch. The problem I worried about seems not to exist in this version (as a good side effect). Thanks, Kenji Kaneshige --
By changing from a 5-bit field to a 1-bit field, we free up some bits
that can be used by a later patch. Also rearrange the fields for better
packing on 64-bit platforms (reducing the size of msi_desc from 72 bytes
to 64 bytes).
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
drivers/pci/msi.c | 100 ++++++++++++++++----------------------------------
include/linux/msi.h | 4 +-
2 files changed, 34 insertions(+), 70 deletions(-)
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 8c61304..104f297 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -106,20 +106,10 @@ static void msix_flush_writes(unsigned int irq)
entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
- /* nothing to do */
- break;
- case PCI_CAP_ID_MSIX:
- {
+ if (entry->msi_attrib.is_msix) {
int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
readl(entry->mask_base + offset);
- break;
- }
- default:
- BUG();
- break;
}
}
@@ -129,8 +119,12 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
entry = get_irq_msi(irq);
BUG_ON(!entry || !entry->dev);
- switch (entry->msi_attrib.type) {
- case PCI_CAP_ID_MSI:
+ if (entry->msi_attrib.is_msix) {
+ int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
+ PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
+ writel(flag, entry->mask_base + offset);
+ readl(entry->mask_base + offset);
+ } else {
if (entry->msi_attrib.maskbit) {
int pos;
u32 mask_bits;
@@ -143,18 +137,6 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
} else {
msi_set_enable(entry->dev, !flag);
}
- break;
- case PCI_CAP_ID_MSIX:
- {
- int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
- PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
- writel(flag, entry->mask_base + offset);
- readl(entry->mask_base + ...AHCI controllers can support up to 16 interrupts, one per port. This
saves us a readl() in the interrupt path to determine which port has
generated the interrupt.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
drivers/ata/ahci.c | 127 +++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 122 insertions(+), 5 deletions(-)
diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 5e6468a..12562fc 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -102,6 +102,7 @@ enum {
/* HOST_CTL bits */
HOST_RESET = (1 << 0), /* reset controller; self-clear */
HOST_IRQ_EN = (1 << 1), /* global IRQ enable */
+ HOST_MSI_RSM = (1 << 2), /* Revert to Single Message */
HOST_AHCI_EN = (1 << 31), /* AHCI enabled */
/* HOST_CAP bits */
@@ -1771,6 +1772,15 @@ static void ahci_port_intr(struct ata_port *ap)
}
}
+static irqreturn_t ahci_msi_interrupt(int irq, void *dev_instance)
+{
+ struct ata_port *ap = dev_instance;
+ spin_lock(&ap->host->lock);
+ ahci_port_intr(ap);
+ spin_unlock(&ap->host->lock);
+ return IRQ_HANDLED;
+}
+
static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
{
struct ata_host *host = dev_instance;
@@ -2220,6 +2230,107 @@ static void ahci_p5wdh_workaround(struct ata_host *host)
}
}
+static int ahci_request_irqs(struct pci_dev *pdev, struct ata_host *host,
+ int n_irqs)
+{
+ int i, rc;
+
+ if (n_irqs > host->n_ports) {
+ n_irqs = host->n_ports;
+ } else if (n_irqs < host->n_ports || n_irqs == 1) {
+ n_irqs--;
+ rc = devm_request_irq(host->dev, pdev->irq + n_irqs,
+ ahci_interrupt,
+ IRQF_SHARED | (n_irqs ? IRQF_DISABLED : 0),
+ dev_driver_string(host->dev), host);
+ if (rc)
+ return rc;
+ }
+
+ for (i = 0; i < n_irqs; i++) {
+ rc = devm_request_irq(host->dev, pdev->irq + i,
+ ahci_msi_interrupt, IRQF_DISABLED,
+ dev_driver_string(host->dev), host->ports[i]);
+ if (rc)
+ goto free_irqs;
+ }
+
+ return 0;
+
+ free_irqs:
+ if (n_irqs < ...Add the new API pci_enable_msi_block() to allow drivers to
request multiple MSI and reimplement pci_enable_msi in terms of
pci_enable_msi_block. Ensure that the architecture back ends don't
have to know about multiple MSI.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
arch/powerpc/kernel/msi.c | 4 ++
drivers/pci/msi.c | 79 +++++++++++++++++++++++++++++++-------------
drivers/pci/msi.h | 4 --
include/linux/msi.h | 1 +
include/linux/pci.h | 6 ++-
5 files changed, 64 insertions(+), 30 deletions(-)
diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index c62d101..5bc8dda 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -19,6 +19,10 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
return -ENOSYS;
}
+ /* PowerPC doesn't support multiple MSI yet */
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
+
if (ppc_md.msi_check_device) {
pr_debug("msi: Using platform check routine.\n");
return ppc_md.msi_check_device(dev, nvec, type);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 104f297..bc2e888 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -45,6 +45,13 @@ arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
struct msi_desc *entry;
int ret;
+ /*
+ * If an architecture wants to support multiple MSI, it needs to
+ * override arch_setup_msi_irqs()
+ */
+ if (type == PCI_CAP_ID_MSI && nvec > 1)
+ return 1;
+
list_for_each_entry(entry, &dev->msi_list, list) {
ret = arch_setup_msi_irq(dev, entry);
if (ret)
@@ -65,8 +72,12 @@ arch_teardown_msi_irqs(struct pci_dev *dev)
struct msi_desc *entry;
list_for_each_entry(entry, &dev->msi_list, list) {
- if (entry->irq != 0)
- arch_teardown_msi_irq(entry->irq);
+ int i, nvec;
+ if (entry->irq == 0)
+ continue;
+ nvec = 1 << entry->msi_attrib.multiple;
+ for (i = 0; i < nvec; i++)
+ arch_teardown_msi_irq(entry->irq + i);
...Hi, First of all, it seems that mask/unmask of MSI has problems. - Per-vector masking is optional for MSI, so I think that allocating multiple messages for a function without masking capability would be not good idea, since all vector in the block will be masked/unmasked at once without any agreement. - Even if the function supports per-vector masking, current mask/unmask_msi_irq() functions assume that MSI uses only one vector, therefore they only set/unset the first bit of the maskbits which for the first vector of the block. The bits for other vectors are initialized as 'masked' but no one unmask them. I think we should return -EINVAL here. No one guarantee that 32 interrupts is able to be allocate at this time. And also I think -EINVAL should be returned if nvec is greater than the number of encoded in the function's "Multiple Message Capable", but I could not find any mention about handling of such over-capability request Thanks, H.Seto --
Thank you for pointing out the problems with masking. The device I am testing with does not support per-vector masking, so I have not paid attention to this. To your first point, if the function does not support per-vector masking, I think it's OK to mask/unmask all vectors at once. But we must be careful to manage this correctly in software; if we disable IRQ 496, disable IRQ 497, then enable IRQ 497, we must not enable IRQ 496 at that time. I think we can solve this problem, but I must think about it some more. The second point is a simple bug that should be easy to fix. Thank you It would be outside the scope of the PCI Bus Specification. I think you're right that we should check the MMC bits; but I think we should tell the driver to request a lower number, not return -EINVAL. Thanks for your comments. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
I tend to think we should just do soft-masking anyway for MSI... better than whacking config space. Ben --
I didn't find the previous version very useful, so I rewrote it. Also move it to the PCI subdirectory of Documentation. Signed-off-by: Matthew Wilcox <willy@linux.intel.com> --- Documentation/MSI-HOWTO.txt | 511 --------------------------------------- Documentation/PCI/MSI-HOWTO.txt | 318 ++++++++++++++++++++++++ 2 files changed, 318 insertions(+), 511 deletions(-) delete mode 100644 Documentation/MSI-HOWTO.txt create mode 100644 Documentation/PCI/MSI-HOWTO.txt diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt deleted file mode 100644 index a51f693..0000000 --- a/Documentation/MSI-HOWTO.txt +++ /dev/null @@ -1,511 +0,0 @@ - The MSI Driver Guide HOWTO - Tom L Nguyen tom.l.nguyen@intel.com - 10/03/2003 - Revised Feb 12, 2004 by Martine Silbermann - email: Martine.Silbermann@hp.com - Revised Jun 25, 2004 by Tom L Nguyen - -1. About this guide - -This guide describes the basics of Message Signaled Interrupts (MSI), -the advantages of using MSI over traditional interrupt mechanisms, -and how to enable your driver to use MSI or MSI-X. Also included is -a Frequently Asked Questions (FAQ) section. - -1.1 Terminology - -PCI devices can be single-function or multi-function. In either case, -when this text talks about enabling or disabling MSI on a "device -function," it is referring to one specific PCI device and function and -not to all functions on a PCI device (unless the PCI device has only -one function). - -2. Copyright 2003 Intel Corporation - -3. What is MSI/MSI-X? - -Message Signaled Interrupt (MSI), as described in the PCI Local Bus -Specification Revision 2.3 or later, is an optional feature, and a -required feature for PCI Express devices. MSI enables a device function -to request service by sending an Inbound Memory Write on its PCI bus to -the FSB as a Message Signal Interrupt transaction. Because MSI is -generated in the form of a Memory Write, all transaction conditions, -such as a Retry, Master-Abort, Target-Abort ...
... Willy [Catching up on my email since I was on vacation in July.] If this patch set can still be resurrected, please add: Reviewed-by: Grant Grundler <grundler@parisc-linux.org> Greater control over where interrupts can be directed (not generated) "writes" -> "DMA writes" This is one use. I'm told cciss driver does this. If so, it would be good to point at a working example, even if it's messy. The other use case is to distribute both initiating IO requests and handle IO completions on multiple CPUs as well. This is the Multi-Queue case where each Queue pair is associated with one MSI-X. I expect several NIC drivers could serve as examples here. I just haven't looked at I'm guessing the device driver is expected to call "request_irq()" for each msix_entry it is allocated. And later free_irq() for each one too. I see you've mentioned that in the "disable_msix()" description but The remainder looked fine to me. I should re-read the original sometime too just to compare. But I'm happy besides the nits I pointed out. thanks, grant --
There is a reason we don't have an API to support this. Linux can not reasonably support this, especially not on current X86. The designers of the of the AHCI were idiots and should have used MSI-X. Attempting to support multiple irqs in an MSI capability breaks every interesting use of an irq. mask/unmask is will likely break because the mask bit is optional and when it is not present we disable the msi capability. We can not set the affinity individually so we can not allow different queues to be processed on different cores. So in general it seems something that we have to jump through a million hurdles and the result is someones twisted parody of a multiple working irqs, that even Intel's IOMMU can't cure. So unless the performance of the AHCI is better by a huge amount I don't see the point, and even then I am extremely sceptical. Eric --
Thank you for your constructive feedback, Eric. Unfortunately, we have to deal with the world as it is, not how we would like it to be. Since I have it running, I'd like to know what is unreasonable about the This is true, and yet, it is still useful. Just not as useful as one I don't have performance numbers yet, but surely you can see that avoiding a register read in the interrupt path is a large win? -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
From: Matthew Wilcox <matthew@wil.cx> Such overhead is going to be amortized. AHCI is not like networking where we have lots of very small transactions to deal with, and therefore the per-IRQ overhead can begin to dominate. Therefore, like Eric, I think workng on multiple MSI is a very dubious usage of one's time. But, it's your time, so use it how you wish :) --
I didn't start hacking without performance numbers ;-) With an I/O-heavy workload, the two biggest consumers of CPU time in the profile were ahci_interrupt and ahci_qc_issue. I got 10% more bandwidth by removing the flushing readl() from ahci_qc_issue. I'm hoping for a similar improvement by removing this readl() too. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
At the very least it is setting all kinds of expectations that it doesn't meet. In addition the MSI-X spec predates the AHCI device by a long shot. In general my experience has been that the hardware designers who really care and have done their homework and can actually take Assuming AHCI implements the mask bits. In the general case this is not fixable. I know of several devices that do not implement Also a case of mismatched expectations. The linux irq API allows irqs to be bound to different cpus individually. Multi msi does No. It is not obvious to me. Eric --
There is one idea that seems to model this cleanly without breaking all kinds of expectations. That is an irq with a very small data payload. In that case we wire all of the vectors up to a single irq handler that computes the payload as: payload = vector - base-vector. And then we figure out how to pass that to the handler in irqaction. To most of the system it is a single irq so it avoids breaking expectations about what an irq is. To everything else it is a little odd, and has it's own unique set of rules (which is good as well). Eric --
OK, I'm willing to play this scenario through and see where it takes us. - The affinity now matches reality. Good. - For devices without individual masking, the masking API matches reality. Good. - For devices with individual masking, we'll want a new API. Adequate. - We'll still need to allocate an aligned block of vectors on x86-64. No change. I think rather than passing the 'vector - base_vector' integer, the request_irq_block() should pass in an array of pointers as large as nvec and irqaction passes the appropriate pointer to the handler. -- Intel are signing my paycheques ... these opinions are still mine "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
That idea requires a lot of mucking with generic_IRQ_handle, which I'm
not willing to do. Anyone have any problems with the below patch?
I've not implemented the corresponding bit in io_apic_64.c yet, so
completely untested.
diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 3aac154..a5d79e9 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -173,7 +173,7 @@ asmlinkage unsigned int do_IRQ(struct pt_regs *regs)
stack_overflow_check(regs);
#endif
- if (likely(irq < NR_IRQS))
+ if (likely((irq & ARCH_IRQ_DATA_MASK) < NR_IRQS))
generic_handle_irq(irq);
else {
if (!disable_apic)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 552e0ec..19c679c 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -282,6 +282,19 @@ extern unsigned int __do_IRQ(unsigned int irq);
#endif
/*
+ * For multiple MSI, we allow a small amount of data to be passed to
+ * the interrupt handler to determine which vector was called. MSI
+ * only allows 32 interrupts, so we default to requiring 5 bits of
+ * information to be passed. If your arch has NR_IRQS set higher than
+ * 2^27, you have some extremely large arrays and probably want to consider
+ * defining ARCH_IRQ_DATA_SHIFT and ARCH_IRQ_DATA_MASK.
+ */
+#ifndef ARCH_IRQ_DATA_SHIFT
+#define ARCH_IRQ_DATA_SHIFT 27
+#define ARCH_IRQ_DATA_MASK ((1 << ARCH_IRQ_DATA_SHIFT) - 1)
+#endif
+
+/*
* Architectures call this to let the generic IRQ layer
* handle an interrupt. If the descriptor is attached to an
* irqchip-style controller then we call the ->handle_irq() handler,
@@ -289,7 +302,7 @@ extern unsigned int __do_IRQ(unsigned int irq);
*/
static inline void generic_handle_irq(unsigned int irq)
{
- struct irq_desc *desc = irq_desc + irq;
+ struct irq_desc *desc = irq_desc + (irq & ARCH_IRQ_DATA_MASK);
#ifdef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
desc->handle_irq(irq, desc);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index ...With interrupt-remapping, we can program the individual interrupt remapping table entries to point to different cpu's etc. All we have to take care is, do the IRTE allocation in a consecutive block and program the starting index to the MSI registers. Just curious Eric, why do you think that won't work? thanks, --
Working mask/unmask. With MSI-X as specced if I mask an irq and then unmask it, an msi message will fire if something happened while the irq was masked and not taken care of before the irq was unmasked. That is the correct behavior for an irq and a mmu won't let me get that. The best I can do with an iommu is to run delayed disable and set the interrupt remapping slot in the iommu to reject the traffic. Which is almost but not quite what I want. Overall introducing a new concept into the linux irq model seems a lot cleaner and more portable, and even there we are likely to be a lot more fragile because of the difficulty in obtaining contiguous vectors. Speaking of. How many interrupt targets does the dmar iommu have for interrupts? 16K? Eric --
There can be multiple interrupt-remapping units in the platform and each of table in the remapping unit has max 64K entries. thanks, suresh --
Why do we ever? It is part of the linux irq API and people wind up using it in all kinds of strange ways. One of the more surprising uses I have see is for the real time kernel people mask the irq, wake up a kernel thread (to handle the irq), then ack the irq and get out of the interrupt handler. It looked like Thanks. That should last us for a little while. :) Eric --
And ? It's just a message, we can ignore it if masked, ie, do software-masking. Not a big deal... no ? Ben. --
It is edge triggered so it won't refire when unmasked (especially if we don't know). So it is easy to wind up in a state where the device is waiting for the software and the software is waiting for the device because an irq gets dropped. There are enough places that have problems that we have a fairly standard work around to the problem (listed above) by just taking the first irq (after we have disabled the irq) and setting it pending in software and then actually masking it in hardware. That works, but it is still isn't quite correct. Because we can run the interrupt handler once to often. For interrupts that are never shared and always in order with the DMA, generally don't require reading a status register on the card, and are otherwise highly optimized that might actually be a problem. Which is why I said that it doesn't look like even using an iommu can fix all of the issues with treating msi multi message mode messages as individual irqs. We can get very close but not quite there. Eric --
Well, we are smarter than that. soft-masking is a know well-solved problem. We just latch that something happened while masked and refire when unmasked. Not terribly hard. We already do that in various Masking in HW is totally optional. I don't mask in HW on cell for We only re-fire if it actually occured while "masked", that should take There must be some way of knowing what work is to do (ie, whether a DMA q entry is completed, some kind of done bit, etc...). There generally is at least, so that even in that case, spurrious MSIs are mostly a non I still mostly dislike the new approach, I prefer Matthew's original one with SW masking of the MSIs. For example, if you have the MSIs be 'one' interrupt, then you hit all of the logic in the IRQ core to make sure only one happens at once. Might not be what you want, and -will- cause some to be dropped... not nice. Ben. --
You are correct. Using the existing irq handling logic will cause us to drop irqs and to loose information if two messages are sent close to each other. Which is very nasty. With a little care we can avoid that problem by having a 32 bit bitmap of which sub irqs have fired so we can make all of them pending without loosing information. That does requires a new handle_irq method. One of the primary purposes of masking irqs in hardware is to prevent them from screaming. Unlikely with edge triggered irqs but not a capability I would like to give up. Multi-msi has the problem that cpu affinity can not be changed on a per message basis without an iommu. Which is a portability problem and a problem on common architectures. Therefore to support multi-msi it must be handled as a special case, we can not treat the individual messages like normal irqs. Eric --
And we end up re-doing the logic that the core already provides ... for Have you seen screaming MSIs yet ? We -could- if necessary add a hook It's a minor problem I believe. It's mostly an API issue in fact, and I'm happy to live with it rather than the other approach which sounds just ... gross. Sorry but you are reproducing locally within an IRQ what the whole IRQ is about to differenciate them in the first place. It also makes it harder to remove the "irq" argument to handlers which We can, that's what they are. They just are IRQs with -some- restrictions on -some- platforms (mostly affinity on x86 and need to soft-mask, which are fairly minor in my book). Ben. --
Ben. Multi-MSI is a crap hardware design. Why do you think we have MSI-X? MSI-X as specced is a properly operating irq controller that we don't need kludges to support. Multi-MSI with a full set of kludges almost work but not quite fits the linux irq model. Any hardware designer who choose to implement Multi-MSI instead of MSI-X was not really concerned about having a high performance device. If we can find a way to model the portable capabilities of Multi-MSI cleanly then we can support it, and our drivers and our users and our intermediate layers won't get surprised. So far we have too close fits but neither model really works. Further this is all about driver optimization, so none of this is necessary to have working hardware. Which makes kludges much less appropriate. Modelling Multi-MSI irqs as normal irqs requires a lot of nasty kludges. One of the kludges is allocating a continuous chunk of irq targets, and the resulting fragmentation issues that you get when you start allowing different sized allocations. Overall if Multi-MSI was to become common I think we would really regret it. Eric --
I know and I agree. Which is why I'd rather keep the SW crap totally local to the MSI support code and not add new concepts to the generic IRQ API such as sub-channels, for which it's really not ready for imho. They -are- separate IRQs, just badly implemented. Besides, a large part of the problem is purely due to the typical x86 implementation of them, since for example, on most PowerPC's (and possibly other archs), they tend to land in the PIC as normal sources, and as such benefit from all the "features" of such interrupts like HW masking, affinity control, etc... at the PIC level. In any case, SW masking will work just fine, and affinity can be dealt And you want to change the model for it ? I say no :-) Just make them fit with a hammer. In fact, it's not even a big hammer. The only thing I think Willy's initial model can work with SW masking. All of the latching & re-emission stuff is already in the core. The only problem is affinity which can be hacked around, by either requiring all irqs to change affinity or bouncing them all on one change, or only exposing No. Not a lot. And those kludge are mostly local to the MSI support core I agree. But in the meantime, if we want to support it, I prefer the option of making it fit in the existing model. Cheers, Ben. --
From: Benjamin Herrenschmidt <benh@kernel.crashing.org> This is how it works on sparc64 too. The x86 system designers decided to implement multi-MSI in an inconvenient way, it is not a "crap hardware design", merely some (unfortunately common) implementations of it happen to be. --
To be clear I was referring to the PCI spec that describes multi-MSI as a crap hardware design. At the very least you are left with the problem of allocating multiple contiguous destinations. Which has the potential to create fragmentation on all supported platforms. Optional mask bits are also nasty. My honest opinion is that the should have deprecated multi-msi after the introduction of the msi-x specification. Eric --
From: ebiederm@xmission.com (Eric W. Biederman) I don't think this is a very real concern. With 256 MSI slots available per domain, at least on sparc64, no real case can cause problems. And even so, drivers can and should fall back when allocations fail anyways. That's what drivers, at least the ones I have written, do even for MSI-X. For example, the NIU driver scales back the number of MSI-X vectors it askes for if the original request cannot be I don't think this would have materially influenced what happened with AHCI at all. --
Hi Matthew, what's current state of this work? I pulled it into a newly created git branch and tried to rebase it upon the latest Linus tree, found that there are some merge conflicts. -Jike --
The status is that I was working on it on the plane home from Plumbers and didn't quite managed to get it working on x86-32 yet. This is a requirement from Ingo before he'll accept the x86-64 implementation. I was hoping to work on it some more this week, but a combination of being ill and having other things to work on (including a lot of patch review) has meant that I have not had time to look at it. I did start setting up to test again on Friday ... hopefully I'll have time on Monday. -- Matthew Wilcox Intel Open Source Technology Centre "Bill, look, we understand that you're interested in selling us this operating system, but compare it to ours. We can't possibly take such a retrograde step." --
