Hi, this patch series implements stackable dma_ops on x86. This is useful to be able to fall back to a different dma_ops implementation if one can not handle a particular device (as necessary for example with paravirtualized device passthrough or if a hardware IOMMU only handles a subset of available devices). The patch series is split to change the different iommu implementations in different patches. This should make it easier for the specific maintainers to review the part that changes their code. Thanks for review and comments, Joerg diffstat: arch/x86/kernel/amd_iommu.c | 5 ++- arch/x86/kernel/pci-calgary_64.c | 21 ++++++------ arch/x86/kernel/pci-dma.c | 63 +++++++++++++++++++++++++++++++++++++- arch/x86/kernel/pci-gart_64.c | 9 +++++ arch/x86/kernel/pci-nommu.c | 12 +++++++ arch/x86/kernel/pci-swiotlb_64.c | 16 +++++++++- drivers/pci/intel-iommu.c | 8 +++++ include/asm-x86/device.h | 6 ++-- include/asm-x86/dma-mapping.h | 43 +++++++++++++++++++++---- include/asm-x86/swiotlb.h | 1 + 10 files changed, 160 insertions(+), 24 deletions(-) --
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/pci-calgary_64.c | 21 +++++++++++----------
1 files changed, 11 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index fe7695e..e2f2f60 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -525,6 +525,15 @@ static void calgary_free_coherent(struct device *dev, size_t size,
free_pages((unsigned long)vaddr, get_order(size));
}
+static int calgary_device_supported(struct device *dev)
+{
+ struct iommu_table *tbl;
+
+ tbl = find_iommu_table(dev);
+
+ return translation_enabled(tbl);
+}
+
static struct dma_mapping_ops calgary_dma_ops = {
.alloc_coherent = calgary_alloc_coherent,
.free_coherent = calgary_free_coherent,
@@ -532,6 +541,7 @@ static struct dma_mapping_ops calgary_dma_ops = {
.unmap_single = calgary_unmap_single,
.map_sg = calgary_map_sg,
.unmap_sg = calgary_unmap_sg,
+ .device_supported = calgary_device_supported,
};
static inline void __iomem * busno_to_bbar(unsigned char num)
@@ -1223,16 +1233,6 @@ static int __init calgary_init(void)
goto error;
} while (1);
- dev = NULL;
- for_each_pci_dev(dev) {
- struct iommu_table *tbl;
-
- tbl = find_iommu_table(&dev->dev);
-
- if (translation_enabled(tbl))
- dev->dev.archdata.dma_ops = &calgary_dma_ops;
- }
-
return ret;
error:
@@ -1535,6 +1535,7 @@ int __init calgary_iommu_init(void)
force_iommu = 1;
bad_dma_address = 0x0;
+ x86_register_dma_ops(&calgary_dma_ops, DMA_OPS_TYPE_HW);
/* dma_ops is set to swiotlb or nommu */
if (!dma_ops)
dma_ops = &nommu_dma_ops;
--
1.5.6.4
--
Whole patchset and this patch in particular looks very good. Will ack after I get a chance to test it. Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ and SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/ --
Yes. I will change it.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
Sure, but I prefer the explicit form since it lends itself to easier debugging (oops line numbers, adding printks, etc.). Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ xxx SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/ --
we never do that for simple stuff like this. The kernel would be twice as large if we did. An oops is easy enough to decode and an oops does not come with a line number. Ingo --
I assume you are talking about the source size, because the generated code size should be exactly the same. In any case, arguing where to place the semi-colon is a waste of time. Do as you see fit. Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ xxx SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/ --
This patch adds the function x86_register_dma_ops which drivers can use
to register their specific dma_ops implementations. The patch also adds
the data structures necessary for registration and the initializtion of
them.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/pci-dma.c | 31 +++++++++++++++++++++++++++++++
include/asm-x86/dma-mapping.h | 3 +++
2 files changed, 34 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index d2f2c01..c783b73 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -47,6 +47,11 @@ EXPORT_SYMBOL(iommu_bio_merge);
dma_addr_t bad_dma_address __read_mostly = 0;
EXPORT_SYMBOL(bad_dma_address);
+/* The lists for stacking dma_ops and its lock */
+static DEFINE_RWLOCK(dma_ops_list_lock);
+
+static struct list_head dma_ops_list[DMA_OPS_TYPE_MAX];
+
/* Dummy device used for NULL arguments (normally ISA). Better would
be probably a smaller DMA mask, but this is bug-to-bug compatible
to older i386. */
@@ -57,6 +62,26 @@ struct device x86_dma_fallback_dev = {
};
EXPORT_SYMBOL(x86_dma_fallback_dev);
+static void __init init_dma_ops_list(void)
+{
+ int i;
+
+ for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
+ INIT_LIST_HEAD(&dma_ops_list[i]);
+}
+
+void x86_register_dma_ops(struct dma_mapping_ops *ops,
+ enum dma_ops_driver_type type)
+{
+ unsigned long flags;
+
+ BUG_ON(type < DMA_OPS_TYPE_PV || type >= DMA_OPS_TYPE_MAX);
+
+ write_lock_irqsave(&dma_ops_list_lock, flags);
+ list_add_tail(&ops->list, &dma_ops_list[type]);
+ write_unlock_irqrestore(&dma_ops_list_lock, flags);
+}
+
int dma_set_mask(struct device *dev, u64 mask)
{
if (!dev->dma_mask || !dma_supported(dev, mask))
@@ -257,6 +282,12 @@ EXPORT_SYMBOL(dma_supported);
static int __init pci_iommu_init(void)
{
+ /*
+ * initialize dma_ops_list so that drivers can register their
+ * implementations
+ */
+ init_dma_ops_list();
+
...Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/pci-nommu.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pci-nommu.c b/arch/x86/kernel/pci-nommu.c
index 1c1c98a..0cbecb9 100644
--- a/arch/x86/kernel/pci-nommu.c
+++ b/arch/x86/kernel/pci-nommu.c
@@ -115,16 +115,28 @@ static void nommu_free_coherent(struct device *dev, size_t size, void *vaddr,
free_pages((unsigned long)vaddr, get_order(size));
}
+static int nommu_device_supported(struct device *dev)
+{
+ return 1;
+}
+
struct dma_mapping_ops nommu_dma_ops = {
.alloc_coherent = nommu_alloc_coherent,
.free_coherent = nommu_free_coherent,
.map_single = nommu_map_single,
.map_sg = nommu_map_sg,
+ .device_supported = nommu_device_supported,
.is_phys = 1,
};
void __init no_iommu_init(void)
{
+ /*
+ * we always want nommu to be the last fallback if no
+ * other dma_ops implementation feels responsible
+ */
+ x86_register_dma_ops(&nommu_dma_ops, DMA_OPS_TYPE_SOFT);
+
if (dma_ops)
return;
--
1.5.6.4
--
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/pci-gart_64.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 508ef47..c5891c9 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -533,6 +533,12 @@ gart_free_coherent(struct device *dev, size_t size, void *vaddr,
free_pages((unsigned long)vaddr, get_order(size));
}
+static int gart_device_supported(struct device *dev)
+{
+ /* GART remaps _everything_ (including CPU memory accesses) */
+ return 1;
+}
+
static int no_agp;
static __init unsigned long check_iommu_size(unsigned long aper, u64 aper_size)
@@ -736,6 +742,7 @@ static struct dma_mapping_ops gart_dma_ops = {
.unmap_sg = gart_unmap_sg,
.alloc_coherent = gart_alloc_coherent,
.free_coherent = gart_free_coherent,
+ .device_supported = gart_device_supported,
};
void gart_iommu_shutdown(void)
@@ -874,6 +881,8 @@ void __init gart_iommu_init(void)
iommu_gatt_base[i] = gart_unmapped_entry;
flush_gart();
+
+ x86_register_dma_ops(&gart_dma_ops, DMA_OPS_TYPE_HW);
dma_ops = &gart_dma_ops;
}
--
1.5.6.4
--
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/amd_iommu.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c
index 6f7b974..7aa9824 100644
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -1258,7 +1258,7 @@ free_mem:
* This function is called by the DMA layer to find out if we can handle a
* particular device. It is part of the dma_ops.
*/
-static int amd_iommu_dma_supported(struct device *dev, u64 mask)
+static int amd_iommu_device_supported(struct device *dev)
{
u16 bdf;
struct pci_dev *pcidev;
@@ -1320,7 +1320,7 @@ static struct dma_mapping_ops amd_iommu_dma_ops = {
.unmap_single = unmap_single,
.map_sg = map_sg,
.unmap_sg = unmap_sg,
- .dma_supported = amd_iommu_dma_supported,
+ .device_supported = amd_iommu_device_supported,
};
/*
@@ -1362,6 +1362,7 @@ int __init amd_iommu_init_dma_ops(void)
#endif
/* Make the driver finally visible to the drivers */
+ x86_register_dma_ops(&amd_iommu_dma_ops, DMA_OPS_TYPE_HW);
dma_ops = &amd_iommu_dma_ops;
return 0;
--
1.5.6.4
--
This patch enables stackable dma_ops on x86. To do this, it also enables
the per-device dma_ops on i386.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/pci-dma.c | 26 ++++++++++++++++++++++++++
include/asm-x86/device.h | 6 +++---
include/asm-x86/dma-mapping.h | 14 +++++++-------
3 files changed, 36 insertions(+), 10 deletions(-)
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index b990fb6..2e517c2 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -82,6 +82,32 @@ void x86_register_dma_ops(struct dma_mapping_ops *ops,
write_unlock_irqrestore(&dma_ops_list_lock, flags);
}
+struct dma_mapping_ops *find_dma_ops_for_device(struct device *dev)
+{
+ int i;
+ unsigned long flags;
+ struct dma_mapping_ops *entry, *ops = NULL;
+
+ read_lock_irqsave(&dma_ops_list_lock, flags);
+
+ for (i = 0; i < DMA_OPS_TYPE_MAX; ++i)
+ list_for_each_entry(entry, &dma_ops_list[i], list) {
+ if (!entry->device_supported)
+ continue;
+ if (entry->device_supported(dev)) {
+ ops = entry;
+ goto out;
+ }
+ }
+out:
+ read_unlock_irqrestore(&dma_ops_list_lock, flags);
+
+ BUG_ON(ops == NULL);
+
+ return ops;
+}
+EXPORT_SYMBOL(find_dma_ops_for_device);
+
int dma_set_mask(struct device *dev, u64 mask)
{
if (!dev->dma_mask || !dma_supported(dev, mask))
diff --git a/include/asm-x86/device.h b/include/asm-x86/device.h
index 3c034f4..9ed12e3 100644
--- a/include/asm-x86/device.h
+++ b/include/asm-x86/device.h
@@ -5,9 +5,9 @@ struct dev_archdata {
#ifdef CONFIG_ACPI
void *acpi_handle;
#endif
-#ifdef CONFIG_X86_64
-struct dma_mapping_ops *dma_ops;
-#endif
+
+ struct dma_mapping_ops *dma_ops;
+
#ifdef CONFIG_DMAR
void *iommu; /* hook for IOMMU specific extension */
#endif
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index 128786b..83d6d7f 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -82,19 +82,19 ...For PVDMA, we want the "native" dma_ops to succeed first, eg, nommu, and then do our "PV DMA", which is just translating gpa to hpa and then program the hardware. This isn't being done here. This can be done by extending the return type: DMA_DEV_NOT_SUPPORTED DMA_DEV_HANDLED DMA_DEV_PASS Where NOT_SUPPORTED means we should look for the next one in the chain (current return value 0), DEV_HANDLED means the dma operation has been handled successfully (current return value 1) and DEV_PASS means fall back to the next layer and then return back. --
I am not sure I fully understand what you mean? Why do we need to call
nommu handlers first for PVDMA devices?
I think that PVDMA devices must always be handled by a pv-dma_ops
implementation. So it makes more sense for me to assign the the dma_ops
of this implementation to the per-device dma_ops structure when we do
the first call to the dma api. So we pay this overhead of finding out
who is responsible only once and not at every call to the dma api.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
For the usual dma_alloc_coherent, dma_map_single, etc. routines. They return the gpa to the driver. We want to intercept this gpa and convert it to the hpa before passing on the value to the driver. So our dma_alloc_coherent will assume the real underlying alloc_coherent has succeeded and then make a hypercall. The PV dma_ops routines won't do the usual allocation, etc. that's already --
Ok, the allocation only matters for dma_alloc_coherent. Fujita
introduced a generic software-based dma_alloc_coherent recently which
you can use for that. I think implementing PVDMA into an own dma_ops
backend and multiplex it using my patches introduces less overhead than
an additional layer over the current dma_ops implementation.
Another two questions to your approach: What happens if a
dma_alloc_coherent allocation crosses page boundarys and the gpa's are
not contiguous in host memory? How will dma masks be handled?
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
I'm not sure what you have in mind, but I agree with Amit that conceptually pvdma should be called after the guest's "native" dma_ops have done their thing. This is not just for nommu, consider a guest that is using an (emulated) hardware IOMMU, or that wants to use swiotlb. We can't replicate their functionality in the pv_dma_ops layer, we have to let them run first and then pass deal with whatever That's a very good question. The host will need to be aware of a device's DMA capabilities in order to return I/O addresses (which could be hpa's if you don't have an IOMMU) that satisfy them. That's quite a pain. Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ xxx SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/ --
I have something in mind what I discussed with Amit at the last KVM forum. The idea was not ready at the event but meanwhile it has matured a bit. I think we should try to build a paravirtualized IOMMU for KVM guests. It should work this way: We reserve a configurable amount of contiguous guest physical memory and map it dma contiguous using some kind of hardware IOMMU. This is possible with all hardare IOMMUs we have in the field by now, also Calgary and GART. The guest does dma_coherent allocations from this memory directly and is done. For map_single and map_sg the guest can do bounce buffering. We avoid nearly all pvdma hypercalls with this approach, keep guest swapping working and solve also the problems with device dma_masks and guest memory that is not contigous on the host side. For systems without any kind of hardware IOMMU we can extend the interface to support bounce buffering between host and guest (in this case we can not avoid the hypercalls). This means that the host reserves the memory for the DMA transaction (also recognizing the dma_mask) and copies it from/to the guest directly upon the dma_*_sync calls. This is what I have in mind and want to propose. Maybe we can discuss these ideas here. I think since there are many systems out there with some kind of hardware IOMMUs (every 64bit AMD processor has a GART) we True. And I fear we don't get a simple and clean interface with this approach. Joerg --
The bounce buffering is needed for map_single/map_sg allocations. For dma_alloc_coherent we can directly allocate from that range. The performance loss of the bounce buffering may be lower than the hypercalls we need as the alternative (we need hypercalls for map, unmap and sync). Joerg --
On Mon, 29 Sep 2008 11:36:52 +0200 Nobody cares about the performance of dma_alloc_coherent. Only the performance of map_single/map_sg matters. I'm not sure how expensive the hypercalls are, but they are more expensive than bounce buffering coping lots of data for every I/Os? --
I don't think that we can avoid bounce buffering into the guests at all
(with and without my idea of a paravirtualized IOMMU) when we want to
handle dma_masks and requests that cross guest physical pages properly.
With mapping/unmapping through hypercalls we add the world-switch
overhead to the copy-overhead. We can't avoid this when we have no
hardware support at all. But already with older IOMMUs like Calgary and
GART we can at least avoid the world-switch. And since, for example,
every 64 bit capable AMD processor has a GART we can make use of it.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
It might be possible to have a per-device slow or fast path, where the fast path is for devices which have no DMA limitations (high-end It should be possible to reduce the number and overhead of hypercalls to the point where their cost is immaterial. I think that's fundamentally a better approach. Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ xxx SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/ --
This solves the problem with the DMA masks. But what happens to requests
Ok, we can queue map_sg allocations together an queue them into one
hypercall. But I remember a paper from you where you wrote that most
allocations are mapping only one area. Are there other ways to optimize
this? I must say that reducing the number of hypercalls was important
while thinking about my idea. If there are better ways I am all ears to
hear from them.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
I'm not sure I follow. If a buffer is contiguous in the guest space, it will remain contiguous (i.e., be mapped contiguously) in the IOMMU I/O address space, even if each I/O PTE ends up mapping a different I'm afraid that bit of the paper was poorly done (mea culpa). As far as I can recall, the majority of dma_alloc_coherent + scatter-gather list *element* mappings only map a single frame, but we didn't look at the time at the average length of a scatter gather list and the frequency of sg list mappings vs. single page mappings. If the length and frequency are high enough, and you map entire sg lists in a single There were a number of ideas mentioned in our paper (for example, switching drivers from the streaming DMA API to the persistent DMA API, which will be a big help to the scheme you propose), and Willman, Rixner and Cox also had some input to the problem[1]. Unfortunately no implementations exist yet AFAIK. [1] "Protection Strategies for Direct Access to Virtualized I/O Devices", by Paul Willmann, Scott Rixner and Alan L. Cox, USENIX '08. Cheers, Muli -- The First Workshop on I/O Virtualization (WIOV '08) Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/ <-> SYSTOR 2009---The Israeli Experimental Systems Conference http://www.haifa.il.ibm.com/conferences/systor2009/ --
Btw, how will dma_masks be handled when PV DMA just translates gpa to
hpa?
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
On Mon, 22 Sep 2008 20:21:21 +0200 Hmm, every time we call dma_sg/map_single, we call read_lock_irqsave(&dma_ops_list_lock, flags). It's likely that we see notable performance drop? --
Hmm, we should only call find_dma_ops_for_device() the first time a dma api call is done (look into get_dma_ops). But I also thought about how this lock can be avoided. In the real world it should not be necessary because the dma_ops list is initialized before dma api calls are done. But since there is now a register function which can be called its safer this way. What do you think, are we still safe enough without this lock? Joerg --
We could be, if we add a check to the register function that verifies
it isn't being called after DMAs have started. Something like:
in register:
if (dma_started)
yell loudly
before PCI device initialization and after IOMMU initialization:
dma_started = true
Cheers,
Muli
--
The First Workshop on I/O Virtualization (WIOV '08)
Dec 2008, San Diego, CA, http://www.usenix.org/wiov08/
xxx
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/
--
Good idea. I will change this in my patchset, thanks. Joerg --
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
arch/x86/kernel/pci-dma.c | 6 +++++-
arch/x86/kernel/pci-swiotlb_64.c | 16 +++++++++++++++-
include/asm-x86/swiotlb.h | 1 +
3 files changed, 21 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index c783b73..b990fb6 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -156,7 +156,7 @@ void __init pci_iommu_alloc(void)
amd_iommu_detect();
- pci_swiotlb_init();
+ pci_swiotlb_detect();
}
unsigned long iommu_num_pages(unsigned long addr, unsigned long len)
@@ -296,6 +296,10 @@ static int __init pci_iommu_init(void)
gart_iommu_init();
+#ifdef CONFIG_X86_64
+ pci_swiotlb_init();
+#endif
+
no_iommu_init();
return 0;
}
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
index c4ce033..26747e0 100644
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -18,6 +18,12 @@ swiotlb_map_single_phys(struct device *hwdev, phys_addr_t paddr, size_t size,
return swiotlb_map_single(hwdev, phys_to_virt(paddr), size, direction);
}
+static int
+swiotlb_device_supported(struct device *hwdev)
+{
+ return 1;
+}
+
struct dma_mapping_ops swiotlb_dma_ops = {
.mapping_error = swiotlb_dma_mapping_error,
.alloc_coherent = swiotlb_alloc_coherent,
@@ -33,9 +39,10 @@ struct dma_mapping_ops swiotlb_dma_ops = {
.map_sg = swiotlb_map_sg,
.unmap_sg = swiotlb_unmap_sg,
.dma_supported = NULL,
+ .device_supported = swiotlb_device_supported,
};
-void __init pci_swiotlb_init(void)
+void __init pci_swiotlb_detect(void)
{
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
if (!iommu_detected && !no_iommu && max_pfn > MAX_DMA32_PFN)
@@ -45,6 +52,13 @@ void __init pci_swiotlb_init(void)
if (swiotlb) {
printk(KERN_INFO "PCI-DMA: Using software bounce buffering for IO (SWIOTLB)\n");
swiotlb_init();
+ }
+}
+
+void __init ...This patch extends the x86 dma_ops structure so that we can queue it
into a list. It also adds a device_supported callback. It can be used to
find out if a registered dma_ops implementation can handle a given
device. Further it adds an enum with dma_ops implementation types
possible in the future.
Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
include/asm-x86/dma-mapping.h | 26 ++++++++++++++++++++++++++
1 files changed, 26 insertions(+), 0 deletions(-)
diff --git a/include/asm-x86/dma-mapping.h b/include/asm-x86/dma-mapping.h
index f408e6d..59d1101 100644
--- a/include/asm-x86/dma-mapping.h
+++ b/include/asm-x86/dma-mapping.h
@@ -17,6 +17,8 @@ extern struct device x86_dma_fallback_dev;
extern int panic_on_overflow;
struct dma_mapping_ops {
+ struct list_head list; /* for stacking dma_ops */
+
int (*mapping_error)(struct device *dev,
dma_addr_t dma_addr);
void* (*alloc_coherent)(struct device *dev, size_t size,
@@ -51,9 +53,33 @@ struct dma_mapping_ops {
struct scatterlist *sg, int nents,
int direction);
int (*dma_supported)(struct device *hwdev, u64 mask);
+ int (*device_supported)(struct device *hwdev);
int is_phys;
};
+/*
+ * This are the supported types of dma_ops implementations on x86. The
+ * different types mean different priority:
+ *
+ * DMA_OPS_TYPE_PV - paravirtualized dma_ops implementation
+ * DMA_OPS_TYPE_HW - dma_ops implementation using some kind of hardware
+ * support
+ * DMA_OPS_TYPE_SOFT - a software only dma_ops implementation
+ *
+ * When a device issues its first request to the DMA layer the available
+ * implementations are asked if they support the device by calling the specific
+ * dma_supported callback. The implementations are checked in order, first the
+ * DMA_OPS_TYPE_PV, then the DMA_OPS_TYPE_HW and at the end the
+ * DMA_OPS_TYPE_SOFT implementations. Within the implementation types they are
+ * called in registration ...Signed-off-by: Joerg Roedel <joerg.roedel@amd.com>
---
drivers/pci/intel-iommu.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c
index 6c4c1c3..283e65f 100644
--- a/drivers/pci/intel-iommu.c
+++ b/drivers/pci/intel-iommu.c
@@ -2255,6 +2255,12 @@ static int intel_map_sg(struct device *hwdev, struct scatterlist *sglist,
return nelems;
}
+static int intel_device_supported(struct device *dev)
+{
+ /* FIXME: is this correct? */
+ return dev && dev->bus == &pci_bus_type;
+}
+
static struct dma_mapping_ops intel_dma_ops = {
.alloc_coherent = intel_alloc_coherent,
.free_coherent = intel_free_coherent,
@@ -2262,6 +2268,7 @@ static struct dma_mapping_ops intel_dma_ops = {
.unmap_single = intel_unmap_single,
.map_sg = intel_map_sg,
.unmap_sg = intel_unmap_sg,
+ .device_supported = intel_device_supported,
};
static inline int iommu_domain_cache_init(void)
@@ -2449,6 +2456,7 @@ int __init intel_iommu_init(void)
init_timer(&unmap_timer);
force_iommu = 1;
+ x86_register_dma_ops(&intel_dma_ops, DMA_OPS_TYPE_HW);
dma_ops = &intel_dma_ops;
return 0;
}
--
1.5.6.4
--
On Mon, 22 Sep 2008 20:21:12 +0200 isn't the right answer here to have a per device DMA ops instead ? -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Its implemented using the per-device dma-ops already there. With this
patches there is a list of available dma_ops implementations which are
asked in a particular order if they can handle the device. The first
implementation which returns true is assigned to the device as the
per-device dma_ops structure.
(Hmm, maybe the name stackable is misleading, is "dma_ops multiplexing"
better?)
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
Is per-device the right level? Wouldn't per-bus make more sense? How
does a dma_ops implementation "know" whether it can handle a particular
device?
(I haven't had a chance to read the patches yet.)
J
--
On Mon, 22 Sep 2008 19:41:28 -0700 not really; all DMA functions get a device as argument already anyway; just going to bus makes no sense there. Even if you set it the same for the whole bus almost all of the time... the APIs just work per device. (and device assignment clearly is per device as well) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
On Mon, 22 Sep 2008 20:21:12 +0200 We already handle the latter. This patchset is more flexible but seems to incur more overheads. This feature will be used for only paravirtualized device passthrough? If so, I feel that there is more simpler (and specific) solutions for it. --
Its not only for device passthrough. It handles also the cases where a hardware IOMMU does not handle all devices in the system (like in some Calgary systems but also possible with AMD IOMMU). With this patchset we can handle these cases in a generic way without hacking it into the hardware drivers (these hacks are also in the AMD IOMMU code and I plan to remove them in the case this patchset will be accepted). Joerg --
On Sun, 28 Sep 2008 20:49:26 +0200 I know that. As I wrote in the previous mail, we already solved that problem with per-device-dma-ops. My question is what unsolved problems this patchset can fix? This patchset is named "stackable dma_ops" but it's different from what we discussed as "stackable dma_ops". This patchset provides IOMMUs a generic mechanism to set up "stackable dma_ops". But this patchset doesn't solve the problem that a hardware IOMMU does not handle all devices (it was already solved with per-device-dma-ops). If paravirtualized device passthrough still needs to call multiple --
Ok, the name "stackable" is misleading and was a bad choice. I will
rename it to "multiplexing". This should make it more clear what it is.
Like you pointed out, the problems are solved with per-device dma_ops,
but in the current implementation it needs special hacks in the IOMMU
drivers to use these per-device dma_ops.
I see this patchset as a continuation of the per-device dma_ops idea. It
moves the per-device handling out of the specific drivers to a common
place. So we can avoid or remove special hacks in the IOMMU drivers.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
On Mon, 29 Sep 2008 15:26:47 +0200 Basically, I'm not against this patchset. It simplify Calgary and AMD IOMMUs code to set up per-device-dma-ops (though it makes dma_ops a bit complicated). But it doesn't solve any problems including the paravirtualized device passthrough. When I wrote per-device-dma-ops, I expected that KVM people want more changes (such as stackable dma_ops) to dma_ops for the paravirtualized device passthrough. I'd like to hear what they want first. --
Yes. But mind that this patchset adds complexity to one point (at
dma_ops initialization) while we can avoid and remove it
Sure. Therefore this patchset is RFC and I cc'ed them.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
| Greg KH | Og dreams of kernels |
| Jens Axboe | [PATCH 31/33] Fusion: sg chaining support |
| Arnd Bergmann | Re: finding your own dead "CONFIG_" variables |
| Mark Brown | [PATCH 2/2] Subject: natsemi: Allow users to disable workaround for DspCfg reset |
| Tony Breeds | [LGUEST] Look in object dir for .config |
git: | |
| Brian Downing | Re: Git in a Nutshell guide |
| John Benes | Re: master has some toys |
| Matthias Lederhofer | [PATCH 4/7] introduce GIT_WORK_TREE to specify the work tree |
| Alexander Sulfrian | [RFC/PATCH] RE: git calls SSH_ASKPASS even if DISPLAY is not set |
| Junio C Hamano | Re: Rss produced by git is not valid xml? |
| Linux Kernel Mailing List | iSeries: fix section mismatch in iseries_veth |
| Linux Kernel Mailing List | ixbge: remove TX lock and redo TX accounting. |
| Linux Kernel Mailing List | ixgbe: fix several counter register errata |
| Linux Kernel Mailing List | b43: fix build with CONFIG_SSB_PCIHOST=n |
| Linux Kernel Mailing List | 9p: block-based virtio client |
