This patchset implements PVDMA for handling DMA requests from
devices assigned to the guest from the host machine.They're also available from
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git pvdma
These patches are based on my pci-passthrough tree, which is available
fromgit-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm.git
and the userspace from
git-pull git://git.kernel.org/pub/scm/linux/kernel/git/amit/kvm-userspace.git
The first and the third patch in this series is needed on the guest (with some
bits from the 2nd as well). The 2nd patch is meant for the host kernel.Amit.
--
You forgot to post a high level design overview of how this works,
what it is good for, what are the design trade offs etc.?Include that in the first patch.
-Andi
--
dma_alloc_coherent() doesn't call dma_ops->alloc_coherent in case no IOMMU
translations are necessary. However, if the device doing the DMA is a
physical device assigned to the guest OS by the host, we need to map
all the DMA addresses to the host machine addresses. This is done via
hypercalls to the host.In KVM, with pci passthrough support, we can assign actual devices to the
guest OS which need this functionality.Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
arch/x86/kernel/pci-dma.c | 11 +++++++++++
include/asm-x86/dma-mapping.h | 2 ++
2 files changed, 13 insertions(+), 0 deletions(-)diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 388b113..678cafb 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -443,6 +443,17 @@ dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
memset(memory, 0, size);
if (!mmu) {
*dma_handle = bus;
+ if (unlikely(dma_ops->is_pv_device) &&
+ unlikely(dma_ops->is_pv_device(dev, dev->bus_id))) {
+ void *r;
+ r = dma_ops->alloc_coherent(dev, size,
+ dma_handle, gfp);
+ if (r == NULL) {
+ free_pages((unsigned long)memory,
+ get_order(size));
+ memory = NULL;
+ }
+ }
return memory;
}
}
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index a1a4dc7..b9c6a39 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -55,6 +55,8 @@ struct dma_mapping_ops {
int direction);
int (*dma_supported)(struct device *hwdev, u64 mask);
int is_phys;
+ /* Is this a physical device in a paravirtualized guest? */
+ int (*is_pv_device)(struct device *hwdev, const char *name);
};extern const struct dma_mapping_ops *dma_ops;
--
1.5.4.3--
I always thought this was a huge wart in the x86-64 DMA ops. Would
there be strong resistance to fixing it so that alloc_coherent
matches the way the other ops are used? This will eliminate the need
for this patch and will make other DMA ops implementations saner.Cheers,
Muli
--
First double unlikely in a condition is useless. Just drop them.
And then ->is_xyz() in a generic vops interface is about as ugly
and non generic as you can get. dma_alloc_coherent is not performance
critical, so you should rather change the interface that ->alloc_coherent
is always called and the other handlers handle the !mmu case correctly.
In fact they need that already I guess (e.g. on DMAR there is not really
a nommu case)-Andi
--
This point came up the last time I sent out the patch; we should do this as
well as implement stackable dma_ops (the need for that is evident in the next
patch).Thanks for the observation; this should be the next step.
Amit.
--
We introduce three hypercalls:
1. When the guest wants to check if a particular device is an assigned device
(this is done once per device by the guest to enable / disable hypercall-
based translation of addresses)2. map: to convert guest phyical addresses to host physical address to pass on
to the device for DMA. We also pin the pages thus requested so that they're
not swapped out.3. unmap: to unpin the pages and free any information we might have stored.
Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
arch/x86/kvm/x86.c | 211 +++++++++++++++++++++++++++++++++++++++++++-
include/asm-x86/kvm_host.h | 15 +++
include/asm-x86/kvm_para.h | 8 ++
3 files changed, 233 insertions(+), 1 deletions(-)diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index fb9b329..94ee4db 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -24,8 +24,11 @@
#include <linux/interrupt.h>
#include <linux/kvm.h>
#include <linux/fs.h>
+#include <linux/list.h>
+#include <linux/pci.h>
#include <linux/vmalloc.h>
#include <linux/module.h>
+#include <linux/highmem.h>
#include <linux/mman.h>
#include <linux/highmem.h>@@ -76,6 +79,9 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "halt_exits", VCPU_STAT(halt_exits) },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "hypercalls", VCPU_STAT(hypercalls) },
+ { "hypercall_map", VCPU_STAT(hypercall_map) },
+ { "hypercall_unmap", VCPU_STAT(hypercall_unmap) },
+ { "hypercall_pv_dev", VCPU_STAT(hypercall_pv_dev) },
{ "request_irq", VCPU_STAT(request_irq_exits) },
{ "irq_exits", VCPU_STAT(irq_exits) },
{ "host_state_reload", VCPU_STAT(host_state_reload) },
@@ -95,9 +101,164 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
};+static struct kvm_pv_dma_map*
+find_pci_pv_dmap(struct list_head *head, dma_addr_t dma)
+{
+ struct list_head *ptr;
+ struct kvm_pv_dma_map *match;
+
+ list_for_each(ptr, h...
might be better to prefix those functions with kvm? Even though they are
those backwards goto are very clumsy. Might be better to give it a
that's ugly.it's better to keep the free function with free-like semantics: just a
r = -1 is not really informative. Better use some meaningful error.We can return here, and avoid this goto if we always increment the
better use atomic mappings here.--
The function names are long enough already. Prefixing everything with kvm_
It does keep everything nicely in one place though. You're right though. Some
I was lazy and used the return value from here to propagate it down further.
The error's going to the guest. The guest, as we know, has already done a
successful DMA allocation. Something went wrong in the hypercall, and we
don't know why (bad page). Any kind of error here isn't going to be
intelligible to the guest anyway. It's mostly a host thing if we ever hitWe can't use atomic mappings for guest pages. They can be swapped out.
--
Actually you were right: there's no sleeping call here after doing the
mapping. I've updated this call with kmap_atomic.The other function that uses kmap can't be converted since we continue to map
several pages in a loop (depending on the length of the DMA region) and hence
can't use kmap_atomic there.
--
If the guest is not able to handle it, why bother returning an error?
Better to kill it.kmap()ed pages can't be swapped out either. The atomic in kmap_atomic()
only refers to the context in which the pages are used. Atmoic kmaps
are greatly preferable to the nonatomic ones.--
Any sufficiently difficult bug is indistinguishable from a feature.--
The guest should be able to deal with transient DMA mapping errors,
either by retrying, or quiescing the device. This is in line with how
HW IOMMUs work - they may run out of mappings for example and the
driver should be able to cope with it. Killing the guest is a last
resort.Cheers,
Muli
--
We make the dma_mapping_ops structure to point to our structure
so that every DMA access goes through us.We make a hypercall for every device that does a DMA operation
to find out if it is an assigned device -- so that we can make
hypercalls on each DMA access. The result of this hypercall is
cached, so that this hypercall is made only once for each deviceThis can be compiled as a module, but that's only used for debugging.
It can be compiled into the guest kernel directly without any side
effects (if you ignore one error message about the hypercall
failing for hard disks).Signed-off-by: Amit Shah <amit.shah@qumranet.com>
---
arch/x86/Kconfig | 8 +
arch/x86/kernel/Makefile | 1 +
arch/x86/kernel/kvm_pv_dma.c | 391 ++++++++++++++++++++++++++++++++++++++++++
3 files changed, 400 insertions(+), 0 deletions(-)
create mode 100644 arch/x86/kernel/kvm_pv_dma.cdiff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index e5790fe..aad16d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -392,6 +392,14 @@ config KVM_GUEST
This option enables various optimizations for running under the KVM
hypervisor.+config KVM_PV_DMA
+ tristate "KVM paravirtualized DMA access"
+ ---help---
+ Provides support for DMA operations in the guest. A hypercall
+ is raised to the host to enable devices owned by guest to use
+ DMA. Select this if compiling a guest kernel and you need
+ paravirtualized DMA operations.
+
source "arch/x86/lguest/Kconfig"config PARAVIRT
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index fa19c38..0adb37b 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -82,6 +82,7 @@ obj-$(CONFIG_DEBUG_NX_TEST) += test_nx.o
obj-$(CONFIG_VMI) += vmi_32.o vmiclock_32.o
obj-$(CONFIG_KVM_GUEST) += kvm.o
obj-$(CONFIG_KVM_CLOCK) += kvmclock.o
+obj-$(CONFIG_KVM_PV_DMA) += kvm_pv_dma.o
obj-$(CONFIG_PARAVIRT) += paravirt.o paravirt_patch_$(BITS).oifdef CONFIG_INPUT_PCSPKR
diff --git a/a...
I suspect real dma ops stacking will need some further thought than
Are you sure that's correct? It looks quite bogus. order is a 2 logarithm,
normally npages = 1 << orderif you want npages from order the correct need 1 << order
Haven't read further, but to be honest the code doesn't seem to be anywhere
near merging quality.-Andi
--
I'm basically using these patches to test the PCI passthrough functionality
(by which we can assign host PCI devices to a guest OS via KVM). While other
methods of handling DMA operations are being worked on (1-1 mapping of the
guest in the host address space and virtualization-aware IOMMU translations),
this patchset provides a quick way to check if things indeed work.However, if some version of this patch can be useful upstream, I'll be glad to
work on that. That said, as you point out, we need stackable dma ops as well
as getting rid of the is_pv_device() function in dma_ops and that's something
that can be done right away.Thanks for the review!
Amit
--
| FUJITA Tomonori | Re: Linux 2.6.25-rc4 |
| Greg KH | [GIT PATCH] driver core patches against 2.6.24 |
| Jan Engelhardt | intel iommu (Re: -mm merge plans for 2.6.23) |
| Artem Bityutskiy | [PATCH 11/44 take 2] [UBI] allocation unit header |
git: | |
| David Miller | [GIT]: Networking |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Natalie Protasevich | [BUG] New Kernel Bugs |
