Re: Multiple MSI, take 3

Previous thread: [PATCH] ioctl conversion by Stoyan Gaydarov on Thursday, July 10, 2008 - 5:28 pm. (4 messages)

Next thread: [PATCH 0/8] ftrace: updates by Steven Rostedt on Thursday, July 10, 2008 - 5:58 pm. (4 messages)
From: Matthew Wilcox
Date: Thursday, July 10, 2008 - 5:57 pm

I'd like to thank Michael Ellerman for his feedback.  This is a much
better patchset than it used to be.

Changes since the previous patchset:
 - Dumped the variable renaming as it made the patchset harder to read.
 - Moved the addition of the 'multiple' field from the first patch to the
   second.
 - Changed the calling convention of arch_teardown_msi_irqs back to
   upstream.
 - Fix the docbook for pci_enable_msi_block
 - Limit powerpc's MSI allocation to 1 in arch_msi_check_device, not
   arch_setup_msi_irqs
 - Rewrote AHCI interrupt claiming code to handle failures better and to
   claim fewer interrupts if possible.

Outstanding things I'd like reviewed:
 - The x86-64 patch has received no comments so far.
 - The AHCI patch needs to be re-reviewed.  Sorry, Jeff.  It's a bit
   more complex than I'd like, but I don't see a clearer way to write
   it, given the nasty behaviour that the AHCI spec permits.
 - I've just added the MSI-HOWTO patch to this set.  New reviewers are
   welcome.

Patchset available as a git tree from:

git://git.kernel.org/pub/scm/linux/kernel/git/willy/misc.git multiple-msi
(multiple-msi-20080710 is this patch set in particular)

Patches will be sent in reply to this message.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Matthew Wilcox
Date: Thursday, July 10, 2008 - 6:00 pm

Implement the arch_setup_msi_irqs() interface.  Extend create_irq()
into create_irq_block() and reimplement create_irq as a wrapper around
it.  Create assign_irq_vector_block() based closely on
assign_irq_vector().  Teach set_msi_irq_affinity() how to handle
multiple MSIs.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 arch/x86/kernel/io_apic_64.c |  237 ++++++++++++++++++++++++++++++++++++------
 1 files changed, 205 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kernel/io_apic_64.c b/arch/x86/kernel/io_apic_64.c
index ef1a8df..6a00dca 100644
--- a/arch/x86/kernel/io_apic_64.c
+++ b/arch/x86/kernel/io_apic_64.c
@@ -61,7 +61,7 @@ struct irq_cfg {
 };
 
 /* irq_cfg is indexed by the sum of all RTEs in all I/O APICs. */
-struct irq_cfg irq_cfg[NR_IRQS] __read_mostly = {
+static struct irq_cfg irq_cfg[NR_IRQS] __read_mostly = {
 	[0]  = { .domain = CPU_MASK_ALL, .vector = IRQ0_VECTOR,  },
 	[1]  = { .domain = CPU_MASK_ALL, .vector = IRQ1_VECTOR,  },
 	[2]  = { .domain = CPU_MASK_ALL, .vector = IRQ2_VECTOR,  },
@@ -683,6 +683,8 @@ static int pin_2_irq(int idx, int apic, int pin)
 	return irq;
 }
 
+static int current_vector = FIRST_DEVICE_VECTOR;
+
 static int __assign_irq_vector(int irq, cpumask_t mask)
 {
 	/*
@@ -696,7 +698,7 @@ static int __assign_irq_vector(int irq, cpumask_t mask)
 	 * Also, we've got to be careful not to trash gate
 	 * 0x80, because int 0x80 is hm, kind of importantish. ;)
 	 */
-	static int current_vector = FIRST_DEVICE_VECTOR, current_offset = 0;
+	static int current_offset = 0;
 	unsigned int old_vector;
 	int cpu;
 	struct irq_cfg *cfg;
@@ -769,6 +771,97 @@ static int assign_irq_vector(int irq, cpumask_t mask)
 	return err;
 }
 
+static int __assign_irq_vector_block(int irq, int count, cpumask_t mask)
+{
+	unsigned int old_vector;
+	int i, cpu;
+	struct irq_cfg *cfg;
+
+	/*
+	 * We've got to be careful not to trash gate 0x80,
+	 * because int 0x80 is hm, kind of importantish. ;)
+	 */
+	BUG_ON((unsigned)irq + ...
From: Kenji Kaneshige
Date: Thursday, July 10, 2008 - 9:50 pm

Hi,

I'm very sorry for very delayed comment, but I have a concern
about irq affinity code. Since I didn't have enough time to
look at your patch, I might be misunderstanding something...

In my understanding, when irq affinity is changed on one of
the IRQs corresponding to MSI, it will not completed until
interrupts occur on all the IRQs. Attempts to change irq
affinity before previous affinity change is finished will be
ignored. Here, suppose that device uses three MSI irqs. In
this case, your patch manages four irqs, but only three
interrupts are used. If irq affinity changed on this MSI
interrupts, will it be completed?

Thanks,
Kenji Kaneshige




--

From: Matthew Wilcox
Date: Friday, July 11, 2008 - 1:50 am

I have tested the affinity code with an ICH9 AHCI:

495:     117233     117966     118033     117797   PCI-MSI-edge      ahci
496:      29860      29106      30191      28705   PCI-MSI-edge      ahci
497:          0          0          0          0   PCI-MSI-edge      ahci
498:          0          0          0          0   PCI-MSI-edge      ahci
499:          0          0          0          0   PCI-MSI-edge      ahci
500:          0          0          0          0   PCI-MSI-edge      ahci

This chip requires 16 MSIs to be registered, and it has 6 ports.
Only ports 0 and 1 have a device attached.  If I change the mask of
an active irq (eg 495 or 496), it takes effect on both of them.  If I
change the mask of an inactive irq (497-500), nothing happens.  But I
can subsequently change the mask on 495 or 496 successfully.

I can't tell you why this works this way; I haven't looked in enough
detail at the irq affinity code, but this is my observation.

Thanks for your comment.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Kenji Kaneshige
Date: Sunday, July 13, 2008 - 6:08 pm

What I was worrying about was around irq_cfg->move_in_progress.
Since your environment has less than 8 cpus according to your
/proc/interrupts, I think your environment seems to use "apic_flat"
mode for interrupt handling. In this case, irq_cfg->move_in_progress
is not used for irq migration. I think this is why your environment
didn't encounter the problem I worried about. We should note that
vector allocation logic varies depending on the environment.

BTW, I looked at your take 4 patch. The problem I worried about
seems not to exist in this version (as a good side effect).

Thanks,
Kenji Kaneshige



--

From: Matthew Wilcox
Date: Thursday, July 10, 2008 - 5:59 pm

By changing from a 5-bit field to a 1-bit field, we free up some bits
that can be used by a later patch.  Also rearrange the fields for better
packing on 64-bit platforms (reducing the size of msi_desc from 72 bytes
to 64 bytes).

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 drivers/pci/msi.c   |  100 ++++++++++++++++----------------------------------
 include/linux/msi.h |    4 +-
 2 files changed, 34 insertions(+), 70 deletions(-)

diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 8c61304..104f297 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -106,20 +106,10 @@ static void msix_flush_writes(unsigned int irq)
 
 	entry = get_irq_msi(irq);
 	BUG_ON(!entry || !entry->dev);
-	switch (entry->msi_attrib.type) {
-	case PCI_CAP_ID_MSI:
-		/* nothing to do */
-		break;
-	case PCI_CAP_ID_MSIX:
-	{
+	if (entry->msi_attrib.is_msix) {
 		int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
 			PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
 		readl(entry->mask_base + offset);
-		break;
-	}
-	default:
-		BUG();
-		break;
 	}
 }
 
@@ -129,8 +119,12 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
 
 	entry = get_irq_msi(irq);
 	BUG_ON(!entry || !entry->dev);
-	switch (entry->msi_attrib.type) {
-	case PCI_CAP_ID_MSI:
+	if (entry->msi_attrib.is_msix) {
+		int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
+			PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
+		writel(flag, entry->mask_base + offset);
+		readl(entry->mask_base + offset);
+	} else {
 		if (entry->msi_attrib.maskbit) {
 			int pos;
 			u32 mask_bits;
@@ -143,18 +137,6 @@ static void msi_set_mask_bits(unsigned int irq, u32 mask, u32 flag)
 		} else {
 			msi_set_enable(entry->dev, !flag);
 		}
-		break;
-	case PCI_CAP_ID_MSIX:
-	{
-		int offset = entry->msi_attrib.entry_nr * PCI_MSIX_ENTRY_SIZE +
-			PCI_MSIX_ENTRY_VECTOR_CTRL_OFFSET;
-		writel(flag, entry->mask_base + offset);
-		readl(entry->mask_base + ...
From: Matthew Wilcox
Date: Thursday, July 10, 2008 - 6:00 pm

AHCI controllers can support up to 16 interrupts, one per port.  This
saves us a readl() in the interrupt path to determine which port has
generated the interrupt.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 drivers/ata/ahci.c |  127 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 files changed, 122 insertions(+), 5 deletions(-)

diff --git a/drivers/ata/ahci.c b/drivers/ata/ahci.c
index 5e6468a..12562fc 100644
--- a/drivers/ata/ahci.c
+++ b/drivers/ata/ahci.c
@@ -102,6 +102,7 @@ enum {
 	/* HOST_CTL bits */
 	HOST_RESET		= (1 << 0),  /* reset controller; self-clear */
 	HOST_IRQ_EN		= (1 << 1),  /* global IRQ enable */
+	HOST_MSI_RSM		= (1 << 2),  /* Revert to Single Message */
 	HOST_AHCI_EN		= (1 << 31), /* AHCI enabled */
 
 	/* HOST_CAP bits */
@@ -1771,6 +1772,15 @@ static void ahci_port_intr(struct ata_port *ap)
 	}
 }
 
+static irqreturn_t ahci_msi_interrupt(int irq, void *dev_instance)
+{
+	struct ata_port *ap = dev_instance;
+	spin_lock(&ap->host->lock);
+	ahci_port_intr(ap);
+	spin_unlock(&ap->host->lock);
+	return IRQ_HANDLED;
+}
+
 static irqreturn_t ahci_interrupt(int irq, void *dev_instance)
 {
 	struct ata_host *host = dev_instance;
@@ -2220,6 +2230,107 @@ static void ahci_p5wdh_workaround(struct ata_host *host)
 	}
 }
 
+static int ahci_request_irqs(struct pci_dev *pdev, struct ata_host *host,
+								int n_irqs)
+{
+	int i, rc;
+
+	if (n_irqs > host->n_ports) {
+		n_irqs = host->n_ports;
+	} else if (n_irqs < host->n_ports || n_irqs == 1) {
+		n_irqs--;
+		rc = devm_request_irq(host->dev, pdev->irq + n_irqs,
+				ahci_interrupt,
+				IRQF_SHARED | (n_irqs ? IRQF_DISABLED : 0),
+				dev_driver_string(host->dev), host);
+		if (rc)
+			return rc;
+	}
+
+	for (i = 0; i < n_irqs; i++) {
+		rc = devm_request_irq(host->dev, pdev->irq + i,
+				ahci_msi_interrupt, IRQF_DISABLED,
+				dev_driver_string(host->dev), host->ports[i]);
+		if (rc)
+			goto free_irqs;
+	}
+
+	return 0;
+
+ free_irqs:
+ 	if (n_irqs < ...
From: Matthew Wilcox
Date: Thursday, July 10, 2008 - 5:59 pm

Add the new API pci_enable_msi_block() to allow drivers to
request multiple MSI and reimplement pci_enable_msi in terms of
pci_enable_msi_block.  Ensure that the architecture back ends don't
have to know about multiple MSI.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 arch/powerpc/kernel/msi.c |    4 ++
 drivers/pci/msi.c         |   79 +++++++++++++++++++++++++++++++-------------
 drivers/pci/msi.h         |    4 --
 include/linux/msi.h       |    1 +
 include/linux/pci.h       |    6 ++-
 5 files changed, 64 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/kernel/msi.c b/arch/powerpc/kernel/msi.c
index c62d101..5bc8dda 100644
--- a/arch/powerpc/kernel/msi.c
+++ b/arch/powerpc/kernel/msi.c
@@ -19,6 +19,10 @@ int arch_msi_check_device(struct pci_dev* dev, int nvec, int type)
 		return -ENOSYS;
 	}
 
+	/* PowerPC doesn't support multiple MSI yet */
+	if (type == PCI_CAP_ID_MSI && nvec > 1)
+		return 1;
+
 	if (ppc_md.msi_check_device) {
 		pr_debug("msi: Using platform check routine.\n");
 		return ppc_md.msi_check_device(dev, nvec, type);
diff --git a/drivers/pci/msi.c b/drivers/pci/msi.c
index 104f297..bc2e888 100644
--- a/drivers/pci/msi.c
+++ b/drivers/pci/msi.c
@@ -45,6 +45,13 @@ arch_setup_msi_irqs(struct pci_dev *dev, int nvec, int type)
 	struct msi_desc *entry;
 	int ret;
 
+	/*
+	 * If an architecture wants to support multiple MSI, it needs to
+	 * override arch_setup_msi_irqs()
+	 */
+	if (type == PCI_CAP_ID_MSI && nvec > 1)
+		return 1;
+
 	list_for_each_entry(entry, &dev->msi_list, list) {
 		ret = arch_setup_msi_irq(dev, entry);
 		if (ret)
@@ -65,8 +72,12 @@ arch_teardown_msi_irqs(struct pci_dev *dev)
 	struct msi_desc *entry;
 
 	list_for_each_entry(entry, &dev->msi_list, list) {
-		if (entry->irq != 0)
-			arch_teardown_msi_irq(entry->irq);
+		int i, nvec;
+		if (entry->irq == 0)
+			continue;
+		nvec = 1 << entry->msi_attrib.multiple;
+		for (i = 0; i < nvec; i++)
+			arch_teardown_msi_irq(entry->irq + i);
 ...
From: Hidetoshi Seto
Date: Friday, July 11, 2008 - 1:28 am

Hi,

First of all, it seems that mask/unmask of MSI has problems.
 - Per-vector masking is optional for MSI, so I think that allocating
   multiple messages for a function without masking capability would be
   not good idea, since all vector in the block will be masked/unmasked
   at once without any agreement.
 - Even if the function supports per-vector masking, current
   mask/unmask_msi_irq() functions assume that MSI uses only one vector,
   therefore they only set/unset the first bit of the maskbits which
   for the first vector of the block.  The bits for other vectors are
   initialized as 'masked' but no one unmask them.


I think we should return -EINVAL here.
No one guarantee that 32 interrupts is able to be allocate at this time.

And also I think -EINVAL should be returned if nvec is greater than
the number of encoded in the function's "Multiple Message Capable", but
I could not find any mention about handling of such over-capability request

Thanks,
H.Seto
--

From: Matthew Wilcox
Date: Friday, July 11, 2008 - 2:45 am

Thank you for pointing out the problems with masking.  The device I am
testing with does not support per-vector masking, so I have not paid
attention to this.

To your first point, if the function does not support per-vector
masking, I think it's OK to mask/unmask all vectors at once.  But we
must be careful to manage this correctly in software; if we disable IRQ
496, disable IRQ 497, then enable IRQ 497, we must not enable IRQ 496 at
that time.  I think we can solve this problem, but I must think about it
some more.

The second point is a simple bug that should be easy to fix.  Thank you

It would be outside the scope of the PCI Bus Specification.  I think
you're right that we should check the MMC bits; but I think we should
tell the driver to request a lower number, not return -EINVAL.

Thanks for your comments.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Benjamin Herrenschmidt
Date: Friday, July 11, 2008 - 8:45 pm

I tend to think we should just do soft-masking anyway for MSI... better
than whacking config space.

Ben

--

From: Matthew Wilcox
Date: Thursday, July 10, 2008 - 6:00 pm

I didn't find the previous version very useful, so I rewrote it.
Also move it to the PCI subdirectory of Documentation.

Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
---
 Documentation/MSI-HOWTO.txt     |  511 ---------------------------------------
 Documentation/PCI/MSI-HOWTO.txt |  318 ++++++++++++++++++++++++
 2 files changed, 318 insertions(+), 511 deletions(-)
 delete mode 100644 Documentation/MSI-HOWTO.txt
 create mode 100644 Documentation/PCI/MSI-HOWTO.txt

diff --git a/Documentation/MSI-HOWTO.txt b/Documentation/MSI-HOWTO.txt
deleted file mode 100644
index a51f693..0000000
--- a/Documentation/MSI-HOWTO.txt
+++ /dev/null
@@ -1,511 +0,0 @@
-		The MSI Driver Guide HOWTO
-	Tom L Nguyen tom.l.nguyen@intel.com
-			10/03/2003
-	Revised Feb 12, 2004 by Martine Silbermann
-		email: Martine.Silbermann@hp.com
-	Revised Jun 25, 2004 by Tom L Nguyen
-
-1. About this guide
-
-This guide describes the basics of Message Signaled Interrupts (MSI),
-the advantages of using MSI over traditional interrupt mechanisms,
-and how to enable your driver to use MSI or MSI-X. Also included is
-a Frequently Asked Questions (FAQ) section.
-
-1.1 Terminology
-
-PCI devices can be single-function or multi-function.  In either case,
-when this text talks about enabling or disabling MSI on a "device
-function," it is referring to one specific PCI device and function and
-not to all functions on a PCI device (unless the PCI device has only
-one function).
-
-2. Copyright 2003 Intel Corporation
-
-3. What is MSI/MSI-X?
-
-Message Signaled Interrupt (MSI), as described in the PCI Local Bus
-Specification Revision 2.3 or later, is an optional feature, and a
-required feature for PCI Express devices. MSI enables a device function
-to request service by sending an Inbound Memory Write on its PCI bus to
-the FSB as a Message Signal Interrupt transaction. Because MSI is
-generated in the form of a Memory Write, all transaction conditions,
-such as a Retry, Master-Abort, Target-Abort ...
From: Grant Grundler
Date: Thursday, September 25, 2008 - 11:42 pm

...

Willy
[Catching up on my email since I was on vacation in July.]

If this patch set can still be resurrected, please add:
Reviewed-by: Grant Grundler <grundler@parisc-linux.org>


Greater control over where interrupts can be directed (not generated)


"writes" -> "DMA writes"


This is one use.  I'm told cciss driver does this. If so, it would be good
to point at a working example, even if it's messy.

The other use case is to distribute both initiating IO requests and handle
IO completions on multiple CPUs as well. This is the Multi-Queue case
where each Queue pair is associated with one MSI-X. I expect several
NIC drivers could serve as examples here. I just haven't looked at

I'm guessing the device driver is expected to call "request_irq()" for
each msix_entry it is allocated. And later free_irq() for each one too.
I see you've mentioned that in the "disable_msix()" description but

The remainder looked fine to me. I should re-read the original
sometime too just to compare. But I'm happy besides the nits
I pointed out.

thanks,

grant
--

From: Eric W. Biederman
Date: Friday, July 11, 2008 - 3:06 am

There is a reason we don't have an API to support this.  Linux can not
reasonably support this, especially not on current X86.  The designers
of the of the AHCI were idiots and should have used MSI-X.

Attempting to support multiple irqs in an MSI capability breaks
every interesting use of an irq.

mask/unmask is will likely break because the mask bit is optional
and when it is not present we disable the msi capability.

We can not set the affinity individually so we can not allow
different queues to be processed on different cores.

So in general it seems something that we have to jump through a million
hurdles and the result is someones twisted parody of a multiple working
irqs, that even Intel's IOMMU can't cure.

So unless the performance of the AHCI is better by a huge amount I don't
see the point, and even then I am extremely sceptical.

Eric
--

From: Matthew Wilcox
Date: Friday, July 11, 2008 - 3:23 am

Thank you for your constructive feedback, Eric.  Unfortunately, we have
to deal with the world as it is, not how we would like it to be.  Since
I have it running, I'd like to know what is unreasonable about the


This is true, and yet, it is still useful.  Just not as useful as one


I don't have performance numbers yet, but surely you can see that
avoiding a register read in the interrupt path is a large win?

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: David Miller
Date: Friday, July 11, 2008 - 3:32 am

From: Matthew Wilcox <matthew@wil.cx>

Such overhead is going to be amortized.

AHCI is not like networking where we have lots of very small
transactions to deal with, and therefore the per-IRQ overhead can
begin to dominate.

Therefore, like Eric, I think workng on multiple MSI is a very dubious
usage of one's time.  But, it's your time, so use it how you wish :)
--

From: Matthew Wilcox
Date: Friday, July 11, 2008 - 3:41 am

I didn't start hacking without performance numbers ;-)  With an
I/O-heavy workload, the two biggest consumers of CPU time in the profile
were ahci_interrupt and ahci_qc_issue.  I got 10% more bandwidth by
removing the flushing readl() from ahci_qc_issue.  I'm hoping for a
similar improvement by removing this readl() too.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Eric W. Biederman
Date: Friday, July 11, 2008 - 4:05 am

At the very least it is setting all kinds of expectations that it
doesn't meet.

In addition the MSI-X spec predates the AHCI device by a long shot.
In general my experience has been that the hardware designers who
really care and have done their homework and can actually take

Assuming AHCI implements the mask bits.  In the general case this
is not fixable.  I know of several devices that do not implement 

Also a case of mismatched expectations.  The linux irq API allows
irqs to be bound to different cpus individually.  Multi msi does

No.  It is not obvious to me.

Eric
--

From: Eric W. Biederman
Date: Friday, July 11, 2008 - 4:34 am

There is one idea that seems to model this cleanly
without breaking all kinds of expectations.

That is an irq with a very small data payload.

In that case we wire all of the vectors up to a single
irq handler that computes the payload as:
payload = vector - base-vector.

And then we figure out how to pass that to the handler in irqaction.

To most of the system it is a single irq so it avoids breaking
expectations about what an irq is.

To everything else it is a little odd, and has it's own unique
set of rules (which is good as well).

Eric
--

From: Matthew Wilcox
Date: Friday, July 11, 2008 - 5:17 am

OK, I'm willing to play this scenario through and see where it takes us.

 - The affinity now matches reality.  Good.
 - For devices without individual masking, the masking API matches
   reality.  Good.
 - For devices with individual masking, we'll want a new API.  Adequate.
 - We'll still need to allocate an aligned block of vectors on x86-64.
   No change.

I think rather than passing the 'vector - base_vector' integer, the
request_irq_block() should pass in an array of pointers as large as nvec
and irqaction passes the appropriate pointer to the handler.

-- 
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

From: Matthew Wilcox
Date: Friday, July 11, 2008 - 8:10 am

That idea requires a lot of mucking with generic_IRQ_handle, which I'm
not willing to do.  Anyone have any problems with the below patch?

I've not implemented the corresponding bit in io_apic_64.c yet, so
completely untested.

diff --git a/arch/x86/kernel/irq_64.c b/arch/x86/kernel/irq_64.c
index 3aac154..a5d79e9 100644
--- a/arch/x86/kernel/irq_64.c
+++ b/arch/x86/kernel/irq_64.c
@@ -173,7 +173,7 @@ asmlinkage unsigned int do_IRQ(struct pt_regs *regs)
 	stack_overflow_check(regs);
 #endif
 
-	if (likely(irq < NR_IRQS))
+	if (likely((irq & ARCH_IRQ_DATA_MASK) < NR_IRQS))
 		generic_handle_irq(irq);
 	else {
 		if (!disable_apic)
diff --git a/include/linux/irq.h b/include/linux/irq.h
index 552e0ec..19c679c 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -282,6 +282,19 @@ extern unsigned int __do_IRQ(unsigned int irq);
 #endif
 
 /*
+ * For multiple MSI, we allow a small amount of data to be passed to
+ * the interrupt handler to determine which vector was called.  MSI
+ * only allows 32 interrupts, so we default to requiring 5 bits of
+ * information to be passed.  If your arch has NR_IRQS set higher than
+ * 2^27, you have some extremely large arrays and probably want to consider
+ * defining ARCH_IRQ_DATA_SHIFT and ARCH_IRQ_DATA_MASK.
+ */
+#ifndef ARCH_IRQ_DATA_SHIFT
+#define ARCH_IRQ_DATA_SHIFT 27
+#define ARCH_IRQ_DATA_MASK ((1 << ARCH_IRQ_DATA_SHIFT) - 1)
+#endif
+
+/*
  * Architectures call this to let the generic IRQ layer
  * handle an interrupt. If the descriptor is attached to an
  * irqchip-style controller then we call the ->handle_irq() handler,
@@ -289,7 +302,7 @@ extern unsigned int __do_IRQ(unsigned int irq);
  */
 static inline void generic_handle_irq(unsigned int irq)
 {
-	struct irq_desc *desc = irq_desc + irq;
+	struct irq_desc *desc = irq_desc + (irq & ARCH_IRQ_DATA_MASK);
 
 #ifdef CONFIG_GENERIC_HARDIRQS_NO__DO_IRQ
 	desc->handle_irq(irq, desc);
diff --git a/kernel/irq/handle.c b/kernel/irq/handle.c
index ...
From: Suresh Siddha
Date: Friday, July 11, 2008 - 2:59 pm

With interrupt-remapping, we can program the individual interrupt
remapping table entries to point to different cpu's etc. All we have
to take care is, do the IRTE allocation in a consecutive block and
program the starting index to the MSI registers.

Just curious Eric, why do you think that won't work?

thanks,
--

From: Eric W. Biederman
Date: Friday, July 11, 2008 - 3:59 pm

Working mask/unmask.  With MSI-X as specced if I mask an irq and then unmask
it, an msi message will fire if something happened while the irq was masked
and not taken care of before the irq was unmasked.  That is the correct
behavior for an irq and a mmu won't let me get that.

The best I can do with an iommu is to run delayed disable and set the
interrupt remapping slot in the iommu to reject the traffic.  Which
is almost but not quite what I want.

Overall introducing a new concept into the linux irq model seems a lot
cleaner and more portable, and even there we are likely to be a lot
more fragile because of the difficulty in obtaining contiguous
vectors.

Speaking of.  How many interrupt targets does the dmar iommu have
for interrupts?  16K?

Eric
--

From: Suresh Siddha
Date: Friday, July 11, 2008 - 4:15 pm

There can be multiple interrupt-remapping units in the platform and
each of table in the remapping unit has max 64K entries.

thanks,
suresh
--

From: Eric W. Biederman
Date: Friday, July 11, 2008 - 4:59 pm

Why do we ever?  It is part of the linux irq API and people wind up using
it in all kinds of strange ways.

One of the more surprising uses I have see is for the real time kernel
people mask the irq, wake up a kernel thread (to handle the irq), then
ack the irq and get out of the interrupt handler.  It looked like

Thanks.  That should last us for a little while. :)

Eric
--

From: Benjamin Herrenschmidt
Date: Friday, July 11, 2008 - 8:52 pm

And ? It's just a message, we can ignore it if masked, ie, do
software-masking. Not a big deal... no ?

Ben.


--

From: Eric W. Biederman
Date: Friday, July 11, 2008 - 9:41 pm

It is edge triggered so it won't refire when unmasked (especially if we don't know).
So it is easy to wind up in a state where the device is waiting for the software
and the software is waiting for the device because an irq gets dropped.

There are enough places that have problems that we have a fairly standard work around
to the problem (listed above) by just taking the first irq (after we have disabled the
irq) and setting it pending in software and then actually masking it in hardware.

That works, but it is still isn't quite correct.  Because we can run the
interrupt handler once to often.  For interrupts that are never shared and
always in order with the DMA, generally don't require reading a status
register on the card, and are otherwise highly optimized that might actually
be a problem.

Which is why I said that it doesn't look like even using an iommu can
fix all of the issues with treating msi multi message mode messages
as individual irqs.  We can get very close but not quite there.

Eric
--

From: Benjamin Herrenschmidt
Date: Saturday, July 12, 2008 - 12:36 am

Well, we are smarter than that. soft-masking is a know well-solved
problem. We just latch that something happened while masked and refire
when unmasked. Not terribly hard. We already do that in various

Masking in HW is totally optional. I don't mask in HW on cell for

We only re-fire if it actually occured while "masked", that should take

There must be some way of knowing what work is to do (ie, whether a DMA
q entry is completed, some kind of done bit, etc...). There generally is
at least, so that even in that case, spurrious MSIs are mostly a non

I still mostly dislike the new approach, I prefer Matthew's original one
with SW masking of the MSIs. For example, if you have the MSIs be 'one'
interrupt, then you hit all of the logic in the IRQ core to make sure
only one happens at once. Might not be what you want, and -will- cause
some to be dropped... not nice.

Ben.


--

From: Eric W. Biederman
Date: Sunday, July 13, 2008 - 3:30 pm

You are correct.  Using the existing irq handling logic will cause us
to drop irqs and to loose information if two messages are sent close
to each other.  Which is very nasty.

With a little care we can avoid that problem by having a 32 bit bitmap
of which sub irqs have fired so we can make all of them pending
without loosing information.  That does requires a new handle_irq method.

One of the primary purposes of masking irqs in hardware is to prevent
them from screaming.  Unlikely with edge triggered irqs but not a capability
I would like to give up.

Multi-msi has the problem that cpu affinity can not be changed on a
per message basis without an iommu.  Which is a portability problem
and a problem on common architectures.

Therefore to support multi-msi it must be handled as a special case, we can not
treat the individual messages like normal irqs.

Eric
--

From: Benjamin Herrenschmidt
Date: Sunday, July 13, 2008 - 3:44 pm

And we end up re-doing the logic that the core already provides ... for

Have you seen screaming MSIs yet ? We -could- if necessary add a hook

It's a minor problem I believe. It's mostly an API issue in fact, and
I'm happy to live with it rather than the other approach which sounds
just ... gross. Sorry but you are reproducing locally within an IRQ what
the whole IRQ is about to differenciate them in the first place.

It also makes it harder to remove the "irq" argument to handlers which

We can, that's what they are. They just are IRQs with -some-
restrictions on -some- platforms (mostly affinity on x86 and need to
soft-mask, which are fairly minor in my book).

Ben.


--

From: Eric W. Biederman
Date: Sunday, July 13, 2008 - 4:29 pm

Ben.  Multi-MSI is a crap hardware design.  Why do you think we have
MSI-X?  MSI-X as specced is a properly operating irq controller that
we don't need kludges to support.  Multi-MSI with a full set of
kludges almost work but not quite fits the linux irq model.

Any hardware designer who choose to implement Multi-MSI instead of
MSI-X was not really concerned about having a high performance device.

If we can find a way to model the portable capabilities of Multi-MSI
cleanly then we can support it, and our drivers and our users and our
intermediate layers won't get surprised.

So far we have too close fits but neither model really works.

Further this is all about driver optimization, so none of this is
necessary to have working hardware.  Which makes kludges much less
appropriate.  Modelling Multi-MSI irqs as normal irqs requires a lot
of nasty kludges.

One of the kludges is allocating a continuous chunk of irq targets,
and the resulting fragmentation issues that you get when you start
allowing different sized allocations.

Overall if Multi-MSI was to become common I think we would really
regret it.

Eric
--

From: Benjamin Herrenschmidt
Date: Sunday, July 13, 2008 - 5:17 pm

I know and I agree. Which is why I'd rather keep the SW crap totally
local to the MSI support code and not add new concepts to the generic
IRQ API such as sub-channels, for which it's really not ready for imho.

They -are- separate IRQs, just badly implemented. Besides, a large part
of the problem is purely due to the typical x86 implementation of them,
since for example, on most PowerPC's (and possibly other archs), they
tend to land in the PIC as normal sources, and as such benefit from all
the "features" of such interrupts like HW masking, affinity control,
etc... at the PIC level.

In any case, SW masking will work just fine, and affinity can be dealt

And you want to change the model for it ? I say no :-) Just make them
fit with a hammer. In fact, it's not even a big hammer. The only thing



I think Willy's initial model can work with SW masking. All of the
latching & re-emission stuff is already in the core. The only problem is
affinity which can be hacked around, by either requiring all irqs to
change affinity or bouncing them all on one change, or only exposing

No. Not a lot. And those kludge are mostly local to the MSI support core


I agree. But in the meantime, if we want to support it, I prefer the
option of making it fit in the existing model.

Cheers,
Ben.

--

From: David Miller
Date: Sunday, July 13, 2008 - 5:44 pm

From: Benjamin Herrenschmidt <benh@kernel.crashing.org>

This is how it works on sparc64 too.

The x86 system designers decided to implement multi-MSI in an
inconvenient way, it is not a "crap hardware design", merely
some (unfortunately common) implementations of it happen to be.

--

From: Eric W. Biederman
Date: Sunday, July 13, 2008 - 7:03 pm

To be clear I was referring to the PCI spec that describes multi-MSI
as a crap hardware design.

At the very least you are left with the problem of allocating multiple
contiguous destinations.  Which has the potential to create
fragmentation on all supported platforms.

Optional mask bits are also nasty.

My honest opinion is that the should have deprecated multi-msi after the
introduction of the msi-x specification.

Eric
--

From: David Miller
Date: Sunday, July 13, 2008 - 8:19 pm

From: ebiederm@xmission.com (Eric W. Biederman)

I don't think this is a very real concern.  With 256 MSI slots
available per domain, at least on sparc64, no real case can cause
problems.

And even so, drivers can and should fall back when allocations fail
anyways.

That's what drivers, at least the ones I have written, do even for
MSI-X.  For example, the NIU driver scales back the number of
MSI-X vectors it askes for if the original request cannot be

I don't think this would have materially influenced what happened with
AHCI at all.
--

From: Jike Song
Date: Thursday, September 25, 2008 - 10:30 pm

Hi Matthew, what's current state of this work? I pulled it into a
newly created git branch and tried to rebase it upon the latest Linus
tree, found that there are some merge conflicts.

-Jike
--

From: Matthew Wilcox
Date: Saturday, September 27, 2008 - 12:04 pm

The status is that I was working on it on the plane home from Plumbers
and didn't quite managed to get it working on x86-32 yet.  This is a
requirement from Ingo before he'll accept the x86-64 implementation.
I was hoping to work on it some more this week, but a combination of
being ill and having other things to work on (including a lot of patch
review) has meant that I have not had time to look at it.

I did start setting up to test again on Friday ... hopefully I'll have
time on Monday.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."
--

Previous thread: [PATCH] ioctl conversion by Stoyan Gaydarov on Thursday, July 10, 2008 - 5:28 pm. (4 messages)

Next thread: [PATCH 0/8] ftrace: updates by Steven Rostedt on Thursday, July 10, 2008 - 5:58 pm. (4 messages)