[PATCH 0/5] KVM&genirq: Enable adaptive IRQ sharing for passed-through devices

Previous thread: [PATCH 1/5] genirq: Pass descriptor to __free_irq by Jan Kiszka on Friday, December 3, 2010 - 4:37 pm. (1 message)

Next thread: [PATCH v3 0/4] Introduce hardware spinlock framework by Ohad Ben-Cohen on Friday, December 3, 2010 - 4:50 pm. (17 messages)
From: Jan Kiszka
Date: Friday, December 3, 2010 - 4:37 pm

Besides 3 cleanup patches, this series consists of two major changes.
The first introduces an interrupt sharing notifier to the genirq
subsystem. It fires when an interrupt line is about to be use by more
than one driver or the last but one user called free_irq.

The second major change makes use of this interface in KVM's PCI pass-
through subsystem. KVM has to keep the interrupt source disabled while
calling into the guest to handle the event. This can be done at device
or line level. The former is required to share the interrupt line, the
latter is an order of magnitude faster (see patch 3 for details).

Beside pass-through support of KVM, further users of the IRQ notifier
could become VFIO (not yet mainline) and uio_pci_generic which have to
resolve the same conflict.

Jan Kiszka (5):
  genirq: Pass descriptor to __free_irq
  genirq: Introduce interrupt sharing notifier
  KVM: Split up MSI-X assigned device IRQ handler
  KVM: Clean up unneeded void pointer casts
  KVM: Allow host IRQ sharing for passed-through PCI 2.3 devices

 Documentation/kvm/api.txt |   25 +++
 arch/x86/kvm/x86.c        |    1 +
 include/linux/interrupt.h |   21 +++
 include/linux/irqdesc.h   |    9 +
 include/linux/kvm.h       |    6 +
 include/linux/kvm_host.h  |    4 +
 kernel/irq/irqdesc.c      |    6 +
 kernel/irq/manage.c       |  174 ++++++++++++++++++--
 virt/kvm/assigned-dev.c   |  420 +++++++++++++++++++++++++++++++++++++++------
 9 files changed, 606 insertions(+), 60 deletions(-)

--

From: Jan Kiszka
Date: Friday, December 3, 2010 - 4:37 pm

From: Jan Kiszka <jan.kiszka@siemens.com>

PCI 2.3 allows to generically disable IRQ sources at device level. This
enables us to share IRQs of such devices on the host side when passing
them to a guest.

However, IRQ disabling via the PCI config space is more costly than
masking the line via disable_irq. Therefore we register an IRQ sharing
notifier and switch between line and device level disabling on demand.

This feature is optional, user space has to request it explicitly as it
also has to inform us about its view of PCI_COMMAND_INTX_DISABLE. That
way, we can avoid unmasking the interrupt and signaling it if the guest
masked it via the PCI config space.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 Documentation/kvm/api.txt |   25 +++
 arch/x86/kvm/x86.c        |    1 +
 include/linux/kvm.h       |    6 +
 include/linux/kvm_host.h  |    4 +
 virt/kvm/assigned-dev.c   |  382 +++++++++++++++++++++++++++++++++++++++++----
 5 files changed, 385 insertions(+), 33 deletions(-)

diff --git a/Documentation/kvm/api.txt b/Documentation/kvm/api.txt
index e1a9297..5e164db 100644
--- a/Documentation/kvm/api.txt
+++ b/Documentation/kvm/api.txt
@@ -1112,6 +1112,14 @@ following flags are specified:
 
 /* Depends on KVM_CAP_IOMMU */
 #define KVM_DEV_ASSIGN_ENABLE_IOMMU	(1 << 0)
+/* The following two depend on KVM_CAP_PCI_2_3 */
+#define KVM_DEV_ASSIGN_PCI_2_3		(1 << 1)
+#define KVM_DEV_ASSIGN_MASK_INTX	(1 << 2)
+
+If KVM_DEV_ASSIGN_PCI_2_3 is set, the kernel will manage legacy INTx interrupts
+via the PCI-2.3-compliant device-level mask, but only if IRQ sharing with other
+assigned or host devices requires it. KVM_DEV_ASSIGN_MASK_INTX specifies the
+guest's view on the INTx mask, see KVM_ASSIGN_SET_INTX_MASK for details.
 
 4.48 KVM_DEASSIGN_PCI_DEVICE
 
@@ -1263,6 +1271,23 @@ struct kvm_assigned_msix_entry {
 	__u16 padding[3];
 };
 
+4.54 KVM_ASSIGN_SET_INTX_MASK
+
+Capability: KVM_CAP_PCI_2_3
+Architectures: x86
+Type: vm ioctl
+Parameters: struct ...
From: Avi Kivity
Date: Monday, December 6, 2010 - 9:21 am

What's the protocol for doing this?  I suppose userspace has to disable 
interrupts, ioctl(SET_INTX_MASK, masked), ..., ioctl(SET_INTX_MASK, 
unmasked), enable interrupts?

Isn't there a race window between the two operations?


-- 
error compiling committee.c: too many arguments to function

--

From: Jan Kiszka
Date: Monday, December 6, 2010 - 9:34 am

Userspace just has to synchronize against itself - what it already does:
qemu_mutex, and masking/unmasking is synchronous /wrt the the executing
VCPU. Otherwise, masking/unmasking is naturally racy, also in Real Life.

I think this is what VFIO does and is surely cleaner than this approach.
But it's not possible with the existing interface (sysfs + KVM ioctls) -
or can you restrict the sysfs access to the config space in such details?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--

From: Avi Kivity
Date: Monday, December 6, 2010 - 9:40 am

I meant when qemu sets INTX_MASK and the kernel clears it immediately 
afterwards because the two are not synchronized.  I guess that won't 

I'm sure you can, not sure it's worth it.  Can the situation be 
exploited?  what if userspace lies?

-- 
error compiling committee.c: too many arguments to function

--

From: Jan Kiszka
Date: Monday, December 6, 2010 - 9:46 am

Ah, there is indeed a race, and the qemu-kvm patches I did not post yet
(to wait for the kernel interface to settle) actually suffer from it:
userspace needs to set the kernel mask before writing the config space
(it's the other way around ATM). This avoids that the kernel overwrites
what userspace just wrote out. We always suffer from the race the other

That's also the above scenario inverted: Userspace can mask or unmask at
any time. If it unmasks a yet unhandled, thus raise interrupt, it will
trigger another one. The kernel will catch it and mask it again. That
can repeat forever with the frequency userspace is able to run its
unmasking code. Not nice, but nothing to leverage for a DoS.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--

From: Avi Kivity
Date: Monday, December 6, 2010 - 10:01 am

Please document the protocol.  Is this always the right order?  

Ok (I think).

-- 
error compiling committee.c: too many arguments to function

--

From: Jan Kiszka
Date: Monday, December 6, 2010 - 10:11 am

The kernel only notes down userspace's view on the mask, it doesn't
modify the hardware (unless it is in line-disabling mode, but then there
is no conflict anyway). Therefore, if userspace first unmasks and an
interrupts arrives before this was reported to the kernel, the device
interrupt will remain masked and no event is delivered to userspace.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux
--

From: Jan Kiszka
Date: Friday, December 3, 2010 - 4:37 pm

From: Jan Kiszka <jan.kiszka@siemens.com>

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 virt/kvm/assigned-dev.c |   12 ++++++------
 1 files changed, 6 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index 3f8a745..c6114d3 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -149,7 +149,7 @@ static void deassign_host_irq(struct kvm *kvm,
 
 		for (i = 0; i < assigned_dev->entries_nr; i++)
 			free_irq(assigned_dev->host_msix_entries[i].vector,
-				 (void *)assigned_dev);
+				 assigned_dev);
 
 		assigned_dev->entries_nr = 0;
 		kfree(assigned_dev->host_msix_entries);
@@ -159,7 +159,7 @@ static void deassign_host_irq(struct kvm *kvm,
 		/* Deal with MSI and INTx */
 		disable_irq(assigned_dev->host_irq);
 
-		free_irq(assigned_dev->host_irq, (void *)assigned_dev);
+		free_irq(assigned_dev->host_irq, assigned_dev);
 
 		if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
 			pci_disable_msi(assigned_dev->dev);
@@ -238,7 +238,7 @@ static int assigned_device_enable_host_intx(struct kvm *kvm,
 	 * are going to be long delays in accepting, acking, etc.
 	 */
 	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 IRQF_ONESHOT, dev->irq_name, (void *)dev))
+				 IRQF_ONESHOT, dev->irq_name, dev))
 		return -EIO;
 	return 0;
 }
@@ -257,7 +257,7 @@ static int assigned_device_enable_host_msi(struct kvm *kvm,
 
 	dev->host_irq = dev->dev->irq;
 	if (request_threaded_irq(dev->host_irq, NULL, kvm_assigned_dev_thread,
-				 0, dev->irq_name, (void *)dev)) {
+				 0, dev->irq_name, dev)) {
 		pci_disable_msi(dev->dev);
 		return -EIO;
 	}
@@ -284,7 +284,7 @@ static int assigned_device_enable_host_msix(struct kvm *kvm,
 	for (i = 0; i < dev->entries_nr; i++) {
 		r = request_threaded_irq(dev->host_msix_entries[i].vector,
 					 NULL, kvm_assigned_dev_thread_msix,
-					 0, dev->irq_name, (void *)dev);
+					 0, dev->irq_name, dev);
 		if (r)
 			goto err;
 	}
@@ ...
From: Jan Kiszka
Date: Friday, December 3, 2010 - 4:37 pm

From: Jan Kiszka <jan.kiszka@siemens.com>

The threaded IRQ handler for MSI-X has almost nothing in common with the
INTx/MSI handler. Move its code into a dedicated handler.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 virt/kvm/assigned-dev.c |   32 +++++++++++++++++++-------------
 1 files changed, 19 insertions(+), 13 deletions(-)

diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index ae72ae6..3f8a745 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -58,8 +58,6 @@ static int find_index_from_host_irq(struct kvm_assigned_dev_kernel
 static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 {
 	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
-	u32 vector;
-	int index;
 
 	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_INTX) {
 		spin_lock(&assigned_dev->intx_lock);
@@ -68,20 +66,28 @@ static irqreturn_t kvm_assigned_dev_thread(int irq, void *dev_id)
 		spin_unlock(&assigned_dev->intx_lock);
 	}
 
-	if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX) {
-		index = find_index_from_host_irq(assigned_dev, irq);
-		if (index >= 0) {
-			vector = assigned_dev->
-					guest_msix_entries[index].vector;
-			kvm_set_irq(assigned_dev->kvm,
-				    assigned_dev->irq_source_id, vector, 1);
-		}
-	} else
+	kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
+		    assigned_dev->guest_irq, 1);
+
+	return IRQ_HANDLED;
+}
+
+#ifdef __KVM_HAVE_MSIX
+static irqreturn_t kvm_assigned_dev_thread_msix(int irq, void *dev_id)
+{
+	struct kvm_assigned_dev_kernel *assigned_dev = dev_id;
+	int index = find_index_from_host_irq(assigned_dev, irq);
+	u32 vector;
+
+	if (index >= 0) {
+		vector = assigned_dev->guest_msix_entries[index].vector;
 		kvm_set_irq(assigned_dev->kvm, assigned_dev->irq_source_id,
-			    assigned_dev->guest_irq, 1);
+			    vector, 1);
+	}
 
 	return IRQ_HANDLED;
 }
+#endif
 
 /* Ack the irq line for an assigned device */
 static void kvm_assigned_dev_ack_irq(struct ...
From: Jan Kiszka
Date: Friday, December 3, 2010 - 4:37 pm

From: Jan Kiszka <jan.kiszka@siemens.com>

This enables drivers to switch their interrupt handling between
exclusive and shared mode.

While there is generally no advantage in exclusive interrupt handling
patterns, at least one use case can benefit from adaptive handling:
Generic stub drivers that hand out PCI devices for unprivileged use in
virtualization or user space device driver scenarios. They need to mask
the interrupt until the unprivileged driver had a chance to process the
event. As generic IRQ masking at device level (via PCI config space) is
at least 10 times slower than disabling at interrupt controller level on
x86, we only want to apply the costly method when there is a real need.

The sharing notifier provides both the required information about the
number of interrupt handlers as well as a properly synchronized
execution context to run interrupt de-/registration and mode switch
procedures. The notifier is called with IRQN_SETUP_USED or
IRQN_SETUP_USED, respectively, during registration to allow setup
according to the current interrupt use. When the number of users is one
on entry of request_threaded_irq or on exit of free_irq, IRQN_SHARED or
IRQN_EXCLUSIVE, respectively, are signaled. Deregistration is signaled
as IRQN_SHUTDOWN to the removed notifier.

Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
 include/linux/interrupt.h |   21 ++++++
 include/linux/irqdesc.h   |    9 +++
 kernel/irq/irqdesc.c      |    6 ++
 kernel/irq/manage.c       |  158 +++++++++++++++++++++++++++++++++++++++++++-
 4 files changed, 190 insertions(+), 4 deletions(-)

diff --git a/include/linux/interrupt.h b/include/linux/interrupt.h
index 79d0c4f..9ebb98f 100644
--- a/include/linux/interrupt.h
+++ b/include/linux/interrupt.h
@@ -96,6 +96,23 @@ enum {
 	IRQC_IS_NESTED,
 };
 
+/*
+ * These values are passed to the sharing notifier as second argument.
+ *
+ * IRQN_SHARED - interrupt line is shared by two or more devices
+ * IRQN_EXCLUSIVE - interrupt line is used by ...
From: Thomas Gleixner
Date: Saturday, December 4, 2010 - 3:37 am

Hmm. You basically want to have the following functionality:

If interrupt is shared, then you want to keep the current behaviour:

   disable at line level (IRQF_ONESHOT)
   run handler thread (PCI level masking)
   reenable at line level in irq_finalize_oneshot()
   reenable at PCI level when guest is done

If interrupt is not shared:

   disable at line level (IRQF_ONESHOT)
   run handler thread (no PCI level masking)
   no reenable at line level
   reenable at line level when guest is done

I think the whole notifier approach including the extra irq handlers
plus the requirement to call disable_irq_nosync() from the non shared
handler is overkill. We can be more clever.

The genirq code knows whether you have one or more handler
registered. So we can add IRQF_ONESHOT_UNMASK_SHARED and add a status
field to irq_data (which I was going to do anyway for other
reasons). In that status field you get a bit which says IRQ_MASK_DEVICE.

So with IRQF_ONESHOT_UNMASK_SHARED == 0 we keep the current behaviour.

If IRQF_ONESHOT_UNMASK_SHARED== 1 and IRQ_MASK_DEVICE == 1 we keep the
current behaviour.

If IRQF_ONESHOT_UNMASK_SHARED== 1 and IRQ_MASK_DEVICE == 0 then then
irq_finalize_oneshot() simply marks the interrupt disabled (equivalent
to disable_irq_nosync()) and returns.

Now in your primary irq handler you simply check the IRQ_MASK_DEVICE
status flag and decide whether you need to mask at PCI level or not.

Your threaded handler gets the same information via IRQ_MASK_DEVICE so
it can issue the appropriate user space notification depending on that
flag.

This works fully transparent across adding and removing handlers. On
request_irq/free_irq we update the IRQ_MASK_DEVICE flag with the
following logic:

    nr_actions	IRQF_ONESHOT_UNMASK_SHARED	IRQ_MASK_DEVICE
    1		0				1
    1		1				0
    >1		don't care			1

If interrupts are in flight accross request/free then this change
takes effect when the next interrupt comes in.

No notifiers, no ...
From: Jan Kiszka
Date: Saturday, December 4, 2010 - 4:34 am

If the interrupt is shared, we must mask at PCI level inside the primary
handler as we also have to support non-threaded users of the same line.
So we always have a transition line-level -> device-level

We only use threaded handlers so far for hairy lock-dependency reasons
and as certain injection complexity is too much guest-controllable. We
hope to move most injection cases (ie. any IRQ directed to a single VCPU
/ user space task) directly into a primary handler in the future to
reduce the latency. So both threaded and non-threaded cases should be



For registration, that might be too late. We need to synchronize on
in-flight interrupts for that line and then ensure that it gets enabled
independently of the registered user. That user may have applied
outdated information, thus would block the line for too long if user
space decides to do something else.

I think this will be the trickier part of the otherwise nice & simple

Pulling the effect of disable_irq_nosync into genirq also for threaded
handling would remove the need for re-registering handlers. That's good.
But we need to think about the hand-over scenarios again. Maybe
solvable, but non-trivial.

Jan

From: Thomas Gleixner
Date: Saturday, December 4, 2010 - 7:41 am

Jan,


Sorry that left out the hard irq part. Of course it needs to do the

The oneshot magic should work on non threaded cases as well. Needs

No, that's really just a corner case when going from one to two
handlers and I don't think it matters much. If you setup a new driver
it's not really important whether that first thing comes in a few ms
later.

Also there is a pretty simple solution for this: The core code knows,
that there is an ONESHOT interrupt in flight, so it simply can call
the primary handler of that device with the appropriate flag set
(maybe an additional one to indicate the transition) and let that deal
with it. Needs some thought vs. locking and races, but that shouldn't

See above.

Thanks,

	tglx
--

From: Jan Kiszka
Date: Saturday, December 4, 2010 - 7:54 am

It doesn't synchronize the tail part against the masking in the

Yes, I thought about this kind of transition (re-invoking the existing
handler) already. We do need notification of the switch (at least for
exclusive->shared) as only the driver can migrate the masking for
in-flight interrupts.

Jan

From: Thomas Gleixner
Date: Saturday, December 4, 2010 - 9:10 am

Right, but the core knows from the irq state, that the line is masked
and due to the ONESHOT or whatever feature it knows that it needs to
call the handler.


Right. It needs to set the device level mask in that case. As the
interrupt handler already has the code to do that it's the most
obvious function to call.

Thanks,

	tglx
--

Previous thread: [PATCH 1/5] genirq: Pass descriptor to __free_irq by Jan Kiszka on Friday, December 3, 2010 - 4:37 pm. (1 message)

Next thread: [PATCH v3 0/4] Introduce hardware spinlock framework by Ohad Ben-Cohen on Friday, December 3, 2010 - 4:50 pm. (17 messages)