Re: MSI problem since 2.6.21 for devices not providing a mask in their MSI capability

Previous thread: [PATCH 8/8] scsi: megaraid_sas - Update version and changelog by bo yang on Monday, October 1, 2007 - 9:18 am. (1 message)

Next thread: [PATCH] hugetlb: Fix pool resizing corner case V2 by Adam Litke on Wednesday, October 3, 2007 - 2:32 pm. (1 message)
From: Loic Prylli
Date: Wednesday, October 3, 2007 - 2:12 pm

Hi,

We observe a problem with MSI since kernel 2.6.21 where interrupts would
randomly stop working. We have tracked it down to the new
msi_set_mask_bit definition in 2.6.21. In the MSI case with a device not
providing a "native" MSI mask, it was a no-op before, and now it
disables MSI in the MSI-ctl register which according to the PCI spec is
interpreted as reverting the device to legacy interrupts. If such a
device try to generate a new interrupt during the "masked" window, the
device will try a legacy interrupt which is generally
ignored/never-acked and cause interrupts to no longer work for the
device/driver combination (even after the enable bit is restored).


Is there anything apart from irq migration that strongly requires
masking? Is is possible to do the irq migration without masking?



Loic

-

From: Eric W. Biederman
Date: Wednesday, October 3, 2007 - 2:49 pm

We should also be leaving the INTx irqs disabled.  So no irq
should be generated.

If you have a mask bit implemented you are required to be
able to refire it after the msi is enabled.  I don't recall
the requirements for when both intx and msi irqs are both
disabled.  Intuitively I would expect no irq message to
be generated, and at most the card would need to be polled
manually to recognize a device event happened.

Certainly firing an irq and having it get completely lost is
unfortunate, and a major pain if you are trying to use the
card.


enable_irq/disable_irq.  Although we can get away with a software
emulation there and those are only needed if the driver calls them.

The PCI spec requires disabling/masking the msi when reprogramming it.
So as a general rule we can not do better.  Further because we are
writing to multiple pci config registers the only way we can safely
reprogram the message is with the msi disabled/masked on the card in
some fashion.

I suspect what needs to happen is a spec search to verify that the
current linux behavior is at least reasonable within the spec.

Once we have verified that the generic code can not do better.
We can look at work-arounds.   One possibility is for the generic
code to provide some overrides for the methods for masking and
reading/writing to a msi message.

I don't want to break anyones hardware, but at the same time I want us
to be careful and in spec for the default case.

Eric
-

From: Benjamin Herrenschmidt
Date: Wednesday, October 3, 2007 - 3:03 pm

Well, yes and no ... A valid option here would be to use soft-masking,
which is possible because MSIs are edge interrupts. That is, basically,
when masked, just ignore them and set IRQF_PENDING, and when unmasked,
replay (which can be done with softirq if there is no HW mechanism for
that). The genirq code contains all the necessary infrastructure for
doing that stuff, it's fairly trivial, and would probably avoid stepping
in HW lalaland (how much do you bet HW generally get that masking thing

Hrm... all right, that will be an issue, so migration need a real

Ben.


-

From: Eric W. Biederman
Date: Wednesday, October 3, 2007 - 3:25 pm

Well.  If people actually use it I suspect it will work ok.  The
circuitry is quite simple so as long as people get their requirements
straight we should be fine.  Which is why I tried to get everything
working as well as we could sooner rather then later.  Of course
drivers are free not to call anything that would cause the irq
to be masked.

That said the current disable_irq and enable_irq path is using the
IRQF_PENDING infrastructure on x86.  So the only time this comes up
is for irq migration.

Eric
-

From: Loic Prylli
Date: Wednesday, October 3, 2007 - 6:44 pm

Even if the INTx line is not raised, you cannot rely on the device to
retain memory of a interrupt triggered while MSI are disabled, and
expect it to fire it under MSI form later when MSI are reenabled.  The
PCI spec does not provide any implicit or explicit guarantee about the
MSI enable flag that would allow it to be used for temporary masking
without running the risk of loosing such interrupts. Moreover, even if
you eventually call the interrupt handler to recover a lost-interrupt,
having switched the device to INTx mode (whether or not the INTx line
was forced down or not with the corresponding pci-command bit) without
informing the driver can (and will in our case) break interrupt
handshaking because MSI and INTx interrupts are not acked in the same
way (INTx requires an extra step that we don't do for MSI and that the



Indeed the masking case is well-defined by the spec (including the
operation of the pending bits). And my subject was definitely restricted




OK no-op was a bug, but using the enable-bit for temporary masking
purposes still feels like a bug. I am afraid the only safe solution
might be to prohibit any operation that absolutely requires masking if
real masking is not available. Maybe the set_affinity method should
simply be disabled for device not supported masking (unless there is an
option of doing it without masking for instance by guaranteeing only one



I don't think there is a problem here, no sane driver would depend on



Do you have a reference for that requirement. The spec only vaguely
associates MSI programming with "configuration", but I haven't found any





I don't see how you can disable MSI through the control bit (which is
equivalent to switching the device to INTx whether or not the INTx
disable bit is set in PCI_COMMAND) in the middle of operations, not tell
the driver, and not risk loosing interrupts (unless you rely on much


The interrupt while doing set_affinity masking would certainly cause a
problem for ...
From: Eric W. Biederman
Date: Wednesday, October 3, 2007 - 8:58 pm

Sure.  My expectation is if we happened to hit such a narrow window

Right.  And INTx has such a pending bit as well.  I guess I figured
if MSI was enabled transferring it over would be the obvious thing to

It's worth looking at, I think that happens in the common case.

Of course it might even make sense simply to refuse to enable MSI

I would have to look it up again but it said that the result is only
defined in the case when it is disabled/masked, when I looked a couple

I will relook.  My impression is that bit is defined as MSI enable.
Not mode switch.   Although myrinet has clearly implemented it as

Interesting.  So an irq fires before the driver has finished
processing the last instance of the irq.  This is very close to a
screaming irq and something we may actually want to deal with.

Eric
-

From: Loic Prylli
Date: Thursday, October 4, 2007 - 2:14 am

The INTx pending and disable bit were only added starting with PCI 2.3,
so in  PCI-2.2 and PCI-X 1.0{,a,b} those bits don't exist at all and
there is still a significant of such devices still in use or on the market.

I agree it would look natural (and that probably happens for a lot of or
most devices) to transfer the interrupt state from INTx to MSI, but I
don't think you can rely on it without doing some assumptions about



The possibility of masking for MSI was only specified (and then only as
something optional) starting with PCI-3.0, PCI Express 1.0 and 1.0a are
based on the older PCI-2.3 and corresponding devices are very unlikely
to have it. So there might still be majority of devices in the field
with no MSI masking capability in the different PCI categories:


I found this quote in PCI-3.0/6.8.3.5:
"For MSI-X, a function is permitted to cache Address and Data values
from unmasked
MSI-X Table entries. However, anytime software unmasks a currently
masked MSI-X
Table entry either by clearing its Mask bit or by clearing the Function
Mask bit, the
function must update any Address or Data values that it cached from that
entry. If
software changes the Address or Data value of an entry while the entry
is unmasked, the
result is undefined."


I haven't seen a caching possibility mentioned for the MSI case, so
apart from the problem with multi-word changes, maybe changing the



It is indeed defined as MSI-enable, but that's not a contradiction with 
being equivalent to a "mode switch between INTx and MSI" (ignoring MSI-X
in that context). The spec seems to define the following "modes":

MSI-enable = 1, INTx-disable= x : MSI-mode
MSI-enable = 0, INTx-disable= 0 : INTx-mode with INTx-signal == INTx-pending
MSI-enable = 0, INTx-disable= 1 : INTx-mode "polling/diag" mode using
INTx pending bit

The only specificity of Myrinet is having relatively independant logic
for the two modes, while at the same time requiring any pending INTx to



In our case it is ...
From: Eric W. Biederman
Date: Thursday, October 4, 2007 - 10:03 am

Loic Prylli <loic@myri.com> writes:

I still looking through my copy of the pci specs and so will reply to

Hmm. Something like this?

Only mask it if the irq is disabled, and only disable it if
the user requests it or if note_interrupt decides we are screaming?

void handle_edge_irq(unsigned int irq, struct irq_desc *desc)
{
	const unsigned int cpu = smp_processor_id();

	spin_lock(&desc->lock);

	desc->status &= ~(IRQ_REPLAY | IRQ_WAITING);

	/*
	 * If we're currently running this IRQ, or its disabled,
	 * we shouldn't process the IRQ. Mark it pending, handle
	 * the necessary masking and go out
	 */
	if (unlikely((desc->status & IRQ_DISABLED) || !desc->action)){
        	desc->status |= (IRQ_PENDING | IRQ_MASKED);
                mask_ack_irq(desc, irq);
                goto out_unlock;
	}    
	if (unlikely(desc->status & IRQ_INPROGRESS)) {
		desc->status |= IRQ_PENDING;
		desc->chip->ack(desc, irq);
                note_interrupt(irq, desc, IRQ_NONE /* IRQ_DUP? */);
		goto out_unlock;
	}

	kstat_cpu(cpu).irqs[irq]++;

	/* Start handling the irq */
	desc->chip->ack(irq);

	/* Mark the IRQ currently in progress.*/
	desc->status |= IRQ_INPROGRESS;

	do {
		struct irqaction *action = desc->action;
		irqreturn_t action_ret;

		if (unlikely(!action)) {
			desc->chip->mask(irq);
			goto out_unlock;
		}

		desc->status &= ~IRQ_PENDING;
		spin_unlock(&desc->lock);
		action_ret = handle_IRQ_event(irq, action);
		if (!noirqdebug)
			note_interrupt(irq, desc, action_ret);
		spin_lock(&desc->lock);

	} while ((desc->status & (IRQ_PENDING | IRQ_DISABLED)) == IRQ_PENDING);

	desc->status &= ~IRQ_INPROGRESS;
out_unlock:
	spin_unlock(&desc->lock);
}

Eric
-

Previous thread: [PATCH 8/8] scsi: megaraid_sas - Update version and changelog by bo yang on Monday, October 1, 2007 - 9:18 am. (1 message)

Next thread: [PATCH] hugetlb: Fix pool resizing corner case V2 by Adam Litke on Wednesday, October 3, 2007 - 2:32 pm. (1 message)