Re: [PATCH] KVM x86: Handle hypercalls for assigned PCI devices

Previous thread: sound: azx_get_response timeout by Jan Engelhardt on Tuesday, April 29, 2008 - 6:31 am. (6 messages)

Next thread: Kernel crash report - maybe triggered by the bluetooth system by Giordano Battilana on Tuesday, April 29, 2008 - 6:39 am. (1 message)
To: <kvm-devel@...>
Cc: <virtualization@...>, <muli@...>, <BENAMI@...>, <gcosta@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 6:37 am

This patchset implements PVDMA for handling DMA requests from
devices assigned to the guest from the host machine.

They're also available from

git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git pvdma

These patches are based on my pci-passthrough tree, which is available
from

git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git

and the userspace from

git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm-userspace.git

The first and the third patch in this series is needed on the guest (with some
bits from the 2nd as well). The 2nd patch is meant for the host kernel.

Amit.
--

To: Amit Shah <amit.shah@...>
Cc: <kvm-devel@...>, <chrisw@...>, <allen.m.kay@...>, <linux-kernel@...>, <gcosta@...>, <avi@...>, <virtualization@...>, <BENAMI@...>
Date: Tuesday, April 29, 2008 - 9:15 am

You forgot to post a high level design overview of how this works,
what it is good for, what are the design trade offs etc.?

Include that in the first patch.

-Andi

--

To: <kvm-devel@...>
Cc: <virtualization@...>, <muli@...>, <BENAMI@...>, <gcosta@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>, Amit Shah <amit.shah@...>
Date: Tuesday, April 29, 2008 - 6:37 am

dma_alloc_coherent() doesn't call dma_ops->alloc_coherent in case no IOMMU
translations are necessary. However, if the device doing the DMA is a
physical device assigned to the guest OS by the host, we need to map
all the DMA addresses to the host machine addresses. This is done via
hypercalls to the host.

In KVM, with pci passthrough support, we can assign actual devices to the
guest OS which need this functionality.

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
arch/x86/kernel/pci-dma.c | 11 +++++++++++
include/asm-x86/dma-mapping.h | 2 ++
2 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 388b113..678cafb 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
memset(memory, 0, size);
if (!mmu) {
*dma_handle = bus;
+ if (unlikely(dma_ops->is_pv_device) &&
+ unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) {
+ void *r;
+ r = dma_ops->alloc_coherent(dev, size,
+ dma_handle, gfp);
+ if (r == NULL) {
+ free_pages((unsigned long)memory,
+ get_order(size));
+ memory = NULL;
+ }
+ }
return memory;
}
}
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index a1a4dc7..b9c6a39 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -55,6 +55,8 @@ struct dma_mapping_ops {
int direction);
int (*dma_supported)(struct device *hwdev, u64 mask);
int is_phys;
+ /* Is this a physical device in a paravirtualized guest? */
+ int (*is_pv_device)(struct device *hwdev, const char *name);
};

extern const struct dma_mapping_ops *dma_ops;
--
1.5.4.3

--

To: Amit Shah <amit.shah@...>
Cc: <kvm-devel@...>, <virtualization@...>, Ben-Ami Yassour1 <BENAMI@...>, <gcosta@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>
Date: Wednesday, April 30, 2008 - 2:29 am

I always thought this was a huge wart in the x86-64 DMA ops. Would
there be strong resistance to fixing it so that alloc_coherent
matches the way the other ops are used? This will eliminate the need
for this patch and will make other DMA ops implementations saner.

Cheers,
Muli
--

To: Amit Shah <amit.shah@...>
Cc: <kvm-devel@...>, <chrisw@...>, <allen.m.kay@...>, <linux-kernel@...>, <gcosta@...>, <avi@...>, <virtualization@...>, <BENAMI@...>
Date: Tuesday, April 29, 2008 - 9:14 am

First double unlikely in a condition is useless. Just drop them.

And then ->is_xyz() in a generic vops interface is about as ugly
and non generic as you can get. dma_alloc_coherent is not performance
critical, so you should rather change the interface that ->alloc_coherent
is always called and the other handlers handle the !mmu case correctly.
In fact they need that already I guess (e.g. on DMAR there is not really
a nommu case)

-Andi

--

To: Andi Kleen <andi@...>
Cc: <kvm-devel@...>, <chrisw@...>, <allen.m.kay@...>, <linux-kernel@...>, <gcosta@...>, <avi@...>, <virtualization@...>, <BENAMI@...>, Muli Ben-Yehuda <muli@...>
Date: Tuesday, April 29, 2008 - 9:49 am

This point came up the last time I sent out the patch; we should do this as
well as implement stackable dma_ops (the need for that is evident in the next
patch).

Thanks for the observation; this should be the next step.

Amit.
--

To: <kvm-devel@...>
Cc: <virtualization@...>, <muli@...>, <BENAMI@...>, <gcosta@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>, Amit Shah <amit.shah@...>
Date: Tuesday, April 29, 2008 - 6:37 am

We introduce three hypercalls:
1. When the guest wants to check if a particular device is an assigned device
(this is done once per device by the guest to enable / disable hypercall-
based translation of addresses)

2. map: to convert guest phyical addresses to host physical address to pass on
to the device for DMA. We also pin the pages thus requested so that they're
not swapped out.

3. unmap: to unpin the pages and free any information we might have stored.

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
arch/x86/kvm/x86.c | 211 +++++++++++++++++++++++++++++++++++++++++++-
include/asm-x86/kvm_host.h | 15 +++
include/asm-x86/kvm_para.h | 8 ++
3 files changed, 233 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb9b329..94ee4db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -24,8 +24,11 @@
#include <linux/interrupt.h>
#include <linux/kvm.h>
#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/pci.h>
#include <linux/vmalloc.h>
#include <linux/module.h>
+#include <linux/highmem.h>
#include <linux/mman.h>
#include <linux/highmem.h>

@@ -76,6 +79,9 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "halt_exits", VCPU_STAT(halt_exits) },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "hypercalls", VCPU_STAT(hypercalls) },
+ { "hypercall_map", VCPU_STAT(hypercall_map) },
+ { "hypercall_unmap", VCPU_STAT(hypercall_unmap) },
+ { "hypercall_pv_dev", VCPU_STAT(hypercall_pv_dev) },
{ "request_irq", VCPU_STAT(request_irq_exits) },
{ "irq_exits", VCPU_STAT(irq_exits) },
{ "host_state_reload", VCPU_STAT(host_state_reload) },
@@ -95,9 +101,164 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
};

+static struct kvm_pv_dma_map*
+find_pci_pv_dmap(struct list_head *head, dma_addr_t dma)
+{
+ struct list_head *ptr;
+ struct kvm_pv_dma_map *match;
+
+ list_for_each(ptr, h...

To: Amit Shah <amit.shah@...>
Cc: <kvm-devel@...>, <virtualization@...>, <muli@...>, <BENAMI@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 10:44 am

might be better to prefix those functions with kvm? Even though they are
those backwards goto are very clumsy. Might be better to give it a
that's ugly.

it's better to keep the free function with free-like semantics: just a
r = -1 is not really informative. Better use some meaningful error.

We can return here, and avoid this goto if we always increment the
better use atomic mappings here.

--

To: Glauber Costa <gcosta@...>
Cc: <kvm-devel@...>, <virtualization@...>, <muli@...>, <BENAMI@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 11:58 am

The function names are long enough already. Prefixing everything with kvm_

It does keep everything nicely in one place though. You're right though. Some

I was lazy and used the return value from here to propagate it down further.

The error's going to the guest. The guest, as we know, has already done a
successful DMA allocation. Something went wrong in the hypercall, and we
don't know why (bad page). Any kind of error here isn't going to be
intelligible to the guest anyway. It's mostly a host thing if we ever hit

We can't use atomic mappings for guest pages. They can be swapped out.
--

To: Glauber Costa <gcosta@...>
Cc: <kvm-devel@...>, <virtualization@...>, <muli@...>, <BENAMI@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>
Date: Thursday, May 1, 2008 - 9:18 am

Actually you were right: there's no sleeping call here after doing the
mapping. I've updated this call with kmap_atomic.

The other function that uses kmap can't be converted since we continue to map
several pages in a loop (depending on the length of the DMA region) and hence
can't use kmap_atomic there.
--

To: Amit Shah <amit.shah@...>
Cc: Glauber Costa <gcosta@...>, <kvm-devel@...>, <virtualization@...>, <muli@...>, <BENAMI@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <linux-kernel@...>
Date: Tuesday, April 29, 2008 - 6:48 pm

If the guest is not able to handle it, why bother returning an error?
Better to kill it.

kmap()ed pages can't be swapped out either. The atomic in kmap_atomic()
only refers to the context in which the pages are used. Atmoic kmaps
are greatly preferable to the nonatomic ones.

--
Any sufficiently difficult bug is indistinguishable from a feature.

--

To: Avi Kivity <avi@...>
Cc: Amit Shah <amit.shah@...>, Glauber Costa <gcosta@...>, <kvm-devel@...>, <virtualization@...>, Ben-Ami Yassour1 <BENAMI@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <linux-kernel@...>
Date: Wednesday, April 30, 2008 - 2:05 am

The guest should be able to deal with transient DMA mapping errors,
either by retrying, or quiescing the device. This is in line with how
HW IOMMUs work - they may run out of mappings for example and the
driver should be able to cope with it. Killing the guest is a last
resort.

Cheers,
Muli
--

To: <kvm-devel@...>
Cc: <virtualization@...>, <muli@...>, <BENAMI@...>, <gcosta@...>, <chrisw@...>, <dor.laor@...>, <allen.m.kay@...>, <avi@...>, <linux-kernel@...>, Amit Shah <amit.shah@...>
Date: Tuesday, April 29, 2008 - 6:37 am

We make the dma_mapping_ops structure to point to our structure
so that every DMA access goes through us.

We make a hypercall for every device that does a DMA operation
to find out if it is an assigned device -- so that we can make
hypercalls on each DMA access. The result of this hypercall is
cached, so that this hypercall is made only once for each device

This can be compiled as a module, but that's only used for debugging.
It can be compiled into the guest kernel directly without any side
effects (if you ignore one error message about the hypercall
failing for hard disks).

Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
arch/x86/Kconfig | 8 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/kvm_pv_dma.c | 391 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 400 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/kvm_pv_dma.c

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e5790fe..aad16d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -392,6 +392,14 @@ config KVM_GUEST
This option enables various optimizations for running under the KVM
hypervisor.

+config KVM_PV_DMA
+ tristate "KVM paravirtualized DMA access"
+ ---help---
+ Provides support for DMA operations in the guest. A hypercall
+ is raised to the host to enable devices owned by guest to use
+ DMA. Select this if compiling a guest kernel and you need
+ paravirtualized DMA operations.
+
source "arch/x86/lguest/Kconfig"

config PARAVIRT
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index fa19c38..0adb37b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o
obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
obj-$(CONFIG_KVM_GUEST) += kvm.o
obj-$(CONFIG_KVM_CLOCK) += kvmclock.o
+obj-$(CONFIG_KVM_PV_DMA) += kvm_pv_dma.o
obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).o

ifdef CONFIG_INPUT_PCSPKR
diff --git a/a...

To: Amit Shah <amit.shah@...>
Cc: <kvm-devel@...>, <chrisw@...>, <allen.m.kay@...>, <linux-kernel@...>, <gcosta@...>, <avi@...>, <virtualization@...>, <BENAMI@...>
Date: Tuesday, April 29, 2008 - 9:31 am

I suspect real dma ops stacking will need some further thought than

Are you sure that's correct? It looks quite bogus. order is a 2 logarithm,
normally npages = 1 << order

if you want npages from order the correct need 1 << order

Haven't read further, but to be honest the code doesn't seem to be anywhere
near merging quality.

-Andi

--

To: Andi Kleen <andi@...>
Cc: <kvm-devel@...>, <chrisw@...>, <allen.m.kay@...>, <linux-kernel@...>, <gcosta@...>, <avi@...>, <virtualization@...>, <BENAMI@...>, Muli Ben-Yehuda <muli@...>
Date: Tuesday, April 29, 2008 - 9:59 am

I'm basically using these patches to test the PCI passthrough functionality
(by which we can assign host PCI devices to a guest OS via KVM). While other
methods of handling DMA operations are being worked on (1-1 mapping of the
guest in the host address space and virtualization-aware IOMMU translations),
this patchset provides a quick way to check if things indeed work.

However, if some version of this patch can be useful upstream, I'll be glad to
work on that. That said, as you point out, we need stackable dma ops as well
as getting rid of the is_pv_device() function in dma_ops and that's something
that can be done right away.

Thanks for the review!

Amit
--

Previous thread: sound: azx_get_response timeout by Jan Engelhardt on Tuesday, April 29, 2008 - 6:31 am. (6 messages)

Next thread: Kernel crash report - maybe triggered by the bluetooth system by Giordano Battilana on Tuesday, April 29, 2008 - 6:39 am. (1 message)