This addresses the review comments of the previous round: - renamed irq_data::status to drv_status - moved drv_status around to unbreak GENERIC_HARDIRQS_NO_DEPRECATED - fixed signature of get_irq_status (irq is now unsigned int) - converted register_lock into a global one - fixed critical white space breakage (that I just left in to check if anyone is actually reading the code, of course...) Note: The KVM patch still depends on http://thread.gmane.org/gmane.comp.emulators.kvm.devel/64515 Thanks for all comments! Final but critical question: Who will pick up which bits? Jan Kiszka (4): genirq: Introduce driver-readable IRQ status word genirq: Inform handler about line sharing state genirq: Add support for IRQF_COND_ONESHOT KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices Documentation/kvm/api.txt | 27 ++++ arch/x86/kvm/x86.c | 1 + include/linux/interrupt.h | 15 ++ include/linux/irq.h | 2 + include/linux/kvm.h | 6 + include/linux/kvm_host.h | 10 ++- kernel/irq/manage.c | 77 ++++++++++- virt/kvm/assigned-dev.c | 336 ++++++++++++++++++++++++++++++++++++++++----- 8 files changed, 436 insertions(+), 38 deletions(-) --
From: Jan Kiszka <jan.kiszka@siemens.com>
This associates a status word with every IRQ descriptor. Drivers can obtain
its content via get_irq_status(irq). First use case will be propagating the
interrupt sharing state.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
include/linux/interrupt.h | 2 ++
include/linux/irq.h | 2 ++
kernel/irq/manage.c | 15 +++++++++++++++
3 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..4c1aa72 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -126,6 +126,8 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+extern unsigned long get_irq_status(unsigned int irq);
+
#ifdef CONFIG_GENERIC_HARDIRQS
extern int __must_check
request_threaded_irq(unsigned int irq, irq_handler_t handler,
diff --git a/include/linux/irq.h b/include/linux/irq.h
index abde252..8bdb421 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -96,6 +96,7 @@ struct msi_desc;
* methods, to allow shared chip implementations
* @msi_desc: MSI descriptor
* @affinity: IRQ affinity on SMP
+ * @drv_status: driver-readable status flags (IRQS_*)
*
* The fields here need to overlay the ones in irq_desc until we
* cleaned up the direct references and switched everything over to
@@ -111,6 +112,7 @@ struct irq_data {
#ifdef CONFIG_SMP
cpumask_var_t affinity;
#endif
+ unsigned long drv_status;
};
/**
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 5f92acc..2ea0d30 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -1157,3 +1157,18 @@ int request_any_context_irq(unsigned int irq, irq_handler_t handler,
return !ret ? IRQC_IS_HARDIRQ : ret;
}
EXPORT_SYMBOL_GPL(request_any_context_irq);
+
+/**
+ * get_irq_status - read interrupt line status word
+ * @irq: Interrupt line of the status word
+ *
+ * This returns the current content of the status word ...We should document that this is a snapshot and in no way serialized against modifications of drv_status. I'll fix up the kernel doc. Thanks, tglx --
Yeah, I think I had some hint on this in the previous version but apparently dropped it for this round. Thanks, Jan
From: Jan Kiszka <jan.kiszka@siemens.com>
Provide an adaptive version of IRQF_ONESHOT: If the line is exclusively used,
IRQF_COND_ONESHOT provides the same semantics as IRQF_ONESHOT. If it is
shared, the line will be unmasked directly after the hardirq handler, just as
if IRQF_COND_ONESHOT was not provided.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
include/linux/interrupt.h | 3 +++
kernel/irq/manage.c | 19 ++++++++++++++++---
2 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 12e5fc0..bbb16f4 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -56,6 +56,8 @@
* irq line disabled until the threaded handler has been run.
* IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
* IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
+ * IRQF_COND_ONESHOT - If line is not shared, keep interrupt disabled after
+ * hardirq handler finshed.
*
*/
#define IRQF_DISABLED 0x00000020
@@ -69,6 +71,7 @@
#define IRQF_ONESHOT 0x00002000
#define IRQF_NO_SUSPEND 0x00004000
#define IRQF_ADAPTIVE 0x00008000
+#define IRQF_COND_ONESHOT 0x00010000
#define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND)
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 2dd4eef..9a73633 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -583,7 +583,7 @@ static int irq_thread(void *data)
struct sched_param param = { .sched_priority = MAX_USER_RT_PRIO/2, };
struct irqaction *action = data;
struct irq_desc *desc = irq_to_desc(action->irq);
- int wake, oneshot = desc->status & IRQ_ONESHOT;
+ int wake, oneshot;
sched_setscheduler(current, SCHED_FIFO, &param);
current->irqaction = action;
@@ -606,6 +606,7 @@ static int irq_thread(void *data)
desc->status |= IRQ_PENDING;
raw_spin_unlock_irq(&desc->lock);
} else {
+ oneshot = desc->status & IRQ_ONESHOT;
...From: Jan Kiszka <jan.kiszka@siemens.com>
PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share IRQs of such devices on the host side when passing
them to a guest.
However, IRQ disabling via the PCI config space is more costly than
masking the line via disable_irq. Therefore we register the IRQ in adaptive
mode and switch between line and device level disabling on demand.
This feature is optional, user space has to request it explicitly as it
also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
way, we can avoid unmasking the interrupt and signaling it if the guest
masked it via the PCI config space.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
Documentation/kvm/api.txt | 27 ++++
arch/x86/kvm/x86.c | 1 +
include/linux/kvm.h | 6 +
include/linux/kvm_host.h | 10 ++-
virt/kvm/assigned-dev.c | 336 ++++++++++++++++++++++++++++++++++++++++-----
5 files changed, 346 insertions(+), 34 deletions(-)
diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..1c34e25 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
/* Depends on KVM_CAP_IOMMU */
#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX (1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
+assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
4.48 KVM_DEASSIGN_PCI_DEVICE
@@ -1263,6 +1271,25 @@ struct kvm_assigned_msix_entry {
__u16 padding[3];
};
+4.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct ...From: Jan Kiszka <jan.kiszka@siemens.com>
This enabled interrupt handlers to retrieve the current line sharing state via
the new interrupt status word so that they can adapt to it.
The switch from shared to exclusive is generally uncritical and can thus be
performed on demand. However, preparing a line for shared mode may require
preparational steps of the currently registered handler. It can therefore
request an ahead-of-time notification via IRQF_ADAPTIVE. The notification
consists of an exceptional handler invocation with IRQS_MAKE_SHAREABLE set in
the status word.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
include/linux/interrupt.h | 10 +++++++++
kernel/irq/manage.c | 47 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 4c1aa72..12e5fc0 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -55,6 +55,7 @@
* Used by threaded interrupts which need to keep the
* irq line disabled until the threaded handler has been run.
* IRQF_NO_SUSPEND - Do not disable this IRQ during suspend
+ * IRQF_ADAPTIVE - Request notification about upcoming interrupt line sharing
*
*/
#define IRQF_DISABLED 0x00000020
@@ -67,6 +68,7 @@
#define IRQF_IRQPOLL 0x00001000
#define IRQF_ONESHOT 0x00002000
#define IRQF_NO_SUSPEND 0x00004000
+#define IRQF_ADAPTIVE 0x00008000
#define IRQF_TIMER (__IRQF_TIMER | IRQF_NO_SUSPEND)
@@ -126,6 +128,14 @@ struct irqaction {
extern irqreturn_t no_action(int cpl, void *dev_id);
+/*
+ * Driver-readable IRQ line status flags:
+ * IRQS_SHARED - line is shared between multiple handlers
+ * IRQS_MAKE_SHAREABLE - in the process of making an exclusive line shareable
+ */
+#define IRQS_SHARED 0x00000001
+#define IRQS_MAKE_SHAREABLE 0x00000002
+
extern unsigned long get_irq_status(unsigned int irq);
#ifdef ...What's the reason to clear this flag outside of the desc->lock held region. I need this status for other purposes as well, where I Thanks, tglx --
We need to synchronize the irq first before clearing the flag. The problematic scenario behind this: An IRQ started in shared mode, this the line was unmasked after the hardirq. Now we clear IRQS_SHARED before calling into the threaded handler. And that handler may now think Well, two options: wrap all bit manipulations with desc->lock acquisition/release or turn drv_status into an atomic. I don't know what Jan
Some bits for irq migration and other stuff, which allows us to avoid fiddling with irqdesc in the drivers. Thanks, tglx --
Can't remember who wrote this, but that guy might have been too tired for clear sentences: Yes, of course, we could run into troubles, if IRQS_SHARED was _not_ set while the IRQ line is unmasked between hard Be warned, might be painful. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux --
Right. As a side note, the current implementation requires that you lookup irq_data.drv_status for every invocation of the handler or have a reference to irq_data.drv_status somewhere locally, which I don't like either. I have an evil and nasy idea how to avoid that, need to look how ugly it gets. Worst case we need to go back to that notification thing which I wanted really avoid in the first place. Though I like the register_mutex idea which came out of this a lot as it allows us to reduce desc->lock held and interrupt disabled regions quite nicely. Bah, my brain became pain resistant when I started hacking that code. Thanks, tglx --
Talking about headache. Your solution above does not prevent that scenario. CPU 0 CPU 1 synchronize_irq(); hard irq comes in sees shared and unmasks clear IRQS_SHARED thread handler runs and sees !SHARED Same scenario, just moved by a few lines :) Thanks, tglx --
The same, just the other way around - and mostly harmless, I hope. :) Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux --
Errm ? It's set. Could you please stop to increase my confusion ? :) --
This is redundant. IRQS_SHARED gets set in a code path where all checks are done already. To make that more obvious we can set it right before raw_spin_unlock_irqrestore(&desc->lock, flags); conditionally on (shared). That way we can also move the kfree out of the mutex locked section. Thanks, tglx --
Nope, it's also set before entry of __setup_irq in case we call an IRQF_ADAPTIVE handler. We need to set it that early as we may race with IRQ events for the already registered handler happening between the sharing notification and the actual registration of the second handler. Jan
Hmm, ok. Though the MAKE_SHAREABLE flag should be sufficient to do the notification. Thanks, tglx --
For notification, yes. But we need SHARED once we reenable the line after the notification. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux --
Darn. Will think more about it. --
This is weird, really. I thought you wanted to avoid waiting for the threaded handler to finish if it's on the fly. So this should be disable_irq_nosync() or did you change your mind ? Thanks, tglx --
No, I did not. I wanted to avoid that we set MAKE_SHAREABLE while there might be another IRQ in flight. The handler that is called due to a real IRQ might misinterpret MAKE_SHAREABLE as "there is no real event" and perform the wrong steps (at least the current implementation for KVM would). However, I will rebase my patch over your series now and try to re-think this. The question is what could go wrong if we do not guarantee that MAKE_SHAREABLE and ordinary IRQ will always be distinguishable. If there is really nothing, specifically for the KVM scenario, we could even drop the disable/enable_irq. That would be also be nicer when thinking about potential delays of the already registered handler during this transitional phase. Jan
Actually, the requirement we have to fulfill here is to avoid that the hardirq handler sees !SHARED while the threaded one reads "SHARED". To achieve this without disabling the line, I'm still searching for a way to couple the sharing state of associated hard and threaded handler runs - but I think there is no reliable association, is there? Jan
Unfortunately not. So the only way to solve that is disabling the interrupt which makes sure that all handlers have completed. OTOH, if we have to disable anyway, then we could simply keep it disabled across the installation of a new handler. That would make the notification business go away, wouldn't it ? Thanks, tglx --
No, the notification is still necessary in case the registered handler keeps the line off after returning from both hard and threaded handler. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux --
And how should that happen? If it is in oneshot mode then the line is reenabled when the thread handler returns. Thanks, tglx --
disable_irq_nosync is called by the handler before returning. And it's the handler's job to revert this, properly synchronizing it internally. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux --
disable_irq_nosync() is really the worst thing to do. That's simply
not going to work without a lot of fuglyness.
What about the following:
primary_handler(....)
{
if (!shared)
return IRQ_WAKE_THREAD;
spin_lock(dev->irq_lock);
if (from_my_device() || dev->irq_thread_waiting) {
mask_dev();
dev->masked = true;
ret = IRQ_WAKE_THREAD;
} else
ret = IRQ_NONE;
spin_unlock(dev->irq_lock);
return ret;
}
check_timeout()
{
if (dev->irq_active && wait_longer())
return IRQ_WAKE_THREAD;
return IRQ_HANDLED;
}
unmask_dev_if_necessary()
{
if (dev->masked && dev->irq_active)
umask_dev();
}
threaded_handler(....)
{
if (!dev->irq_thread_waiting) {
spin_lock_irq(dev->irq_lock);
wake_user = do_magic_stuff_with_the_dev();
dev->irq_thread_waiting = wake_user;
spin_unlock(dev->irq_lock);
if (wake_user)
wake_up(user);
}
if (!dev->irq_thread_waiting) {
spin_lock_irq(dev->irq_lock);
unmask_dev_if_necessary();
spin_unlock(dev->irq_lock);
return IRQ_HANDLED;
}
/*
* Wait for user space to complete. Timeout is to
* avoid starvation of the irq line when
* something goes wrong
*/
wait_for_completion_timeout(dev->compl, SENSIBLE_TIMEOUT);
spin_lock_irq(dev->irq_lock);
if (timedout) {
mask_dev();
dev->masked = true;
/*
* Leave dev->irq_thread_waiting untouched and let
* the core code reschedule us when check_timeout
* decides it's worth to wait. In any case we leave
* the device masked at the device level, so we don't
* cause an interrupt storm.
*/
ret = check_timeout();
} else {
unmask_dev_if_necessary();
dev->irq_thread_waiting = false;
ret = IRQ_HANDLED;
}
spin_unlock(dev->irq_lock);
return ret;
}
userspace_complete()
{
complete(dev->irq_compl);
}
Your aproach with disable_irq_nosync() is completely flawed, simply
because you try to pretend that your interrupt handler is done, while
it is not done at all. The threaded interrupt ...disable_irq_nosync is the pattern currently used in KVM, it's nothing new in fact. The approach looks interesting but requires separate code for non-PCI-2.3 devices, i.e. when we have no means to mask at device level. Further drawbacks - unless I missed something on first glance: - prevents any future optimizations that would work without IRQ thread ping-pong (ie. once we allow guest IRQ injection from hardirq context for selected but typical setups) - two additional, though light-weight, context switches on each interrupt completion - continuous polling if user space decides to leave the interrupt unhandled (e.g. because the virtual IRQ line is masked) Maybe the latter can be solved in a nicer way, but I don't think we can avoid the first two. I'm not saying yet that they are killing this approach, we just need to asses their relevance. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux --
Why? You can have the same code, you just can't request COND_ONESHOT handlers for it, so it needs unshared ONESHOT or it won't work at all, no matter what approach you chose. No device level mask, no sharing, The drawback of these two points is way less than the horror which you need to introduce to do the whole async handler disable, userspace enable dance. Robust and simple solutions really a preferred over That should be a solvable problem. Thanks, tglx --
I'd like to note that the overhead of involving the scheduler in interrupt injection for an assigned device should be easily measurable: just make the MSI handlers threaded and see what the result is. In the case of emulated devices, when we had an extra thread involved in MSI handling, the vcpu thread and the interrupt injection thread were competing for cpu with strange fluctuations in performance as the result (i.e. sometimes we would get good speed as threading would introduce --
The procedure which has served us well in the past is that tip picks up the irq stuff and sticks them in a fast-forward-only branch; kvm merges the branch and applies the kvm bits on top. -- error compiling committee.c: too many arguments to function --
Just for the record, you either missed or introduced some new white space noise :) --
