This is an updated version of the patchset to add per-device dma_mapping_ops support for CONFIG_X86_64 like POWER architecture does: http://lkml.org/lkml/2008/5/13/36 This is against 2.6.26-rc2-mm1 (the changes since the v1 are pretty trivial like dropping the change for v850 arch). This enables us to cleanly fix the Calgary IOMMU issue that some devices are not behind the IOMMU [1]. I think that per-device dma_mapping_ops support would be also helpful for KVM people to support PCI passthrough but Andi thinks that this makes it difficult to support the PCI passthrough (see the above thread). So I CC'ed this to KVM camp. Comments are appreciated. A pointer to dma_mapping_ops to struct dev_archdata is added. If the pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If it's NULL, the system-wide dma_ops pointer is used as before. If it's useful for KVM people, I plan to implement a mechanism to register a hook called when a new pci (or dma capable) device is created (it works with hot plugging). It enables IOMMUs to set up an appropriate dma_mapping_ops per device. The major obstacle is that dma_mapping_error doesn't take a pointer to the device unlike other DMA operations. So x86 can't have dma_mapping_ops per device. Note all the POWER IOMMUs use the same dma_mapping_error function so this is not a problem for POWER but x86 IOMMUs use different dma_mapping_error functions. The first patch adds the device argument to dma_mapping_error. The patch is trivial but large since it touches lots of drivers and dma-mapping.h in all the architecture. [1] http://lkml.org/lkml/2008/5/8/423 --
dma_mapping_error doesn't take a pointer to the device unlike other DMA operations. So we can't have dma_mapping_ops per device. Note that POWER already has dma_mapping_ops per device but all the POWER IOMMUs use the same dma_mapping_error function. x86 IOMMUs use different dma_mapping_error functions. So dma_mapping_error needs the device argument. Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp> --- Documentation/DMA-API.txt | 4 +- arch/arm/common/dmabounce.c | 2 +- arch/ia64/hp/common/hwsw_iommu.c | 5 ++- arch/ia64/hp/common/sba_iommu.c | 2 +- arch/ia64/sn/pci/pci_dma.c | 2 +- arch/mips/mm/dma-default.c | 2 +- arch/powerpc/platforms/cell/celleb_scc_pciex.c | 2 +- arch/powerpc/platforms/cell/spider-pci.c | 2 +- arch/powerpc/platforms/iseries/mf.c | 2 +- arch/x86/kernel/pci-gart_64.c | 1 - arch/x86/kernel/pci-nommu.c | 2 +- drivers/firewire/fw-iso.c | 2 +- drivers/firewire/fw-ohci.c | 2 +- drivers/firewire/fw-sbp2.c | 8 +++--- drivers/infiniband/hw/ipath/ipath_sdma.c | 2 +- drivers/infiniband/hw/ipath/ipath_user_sdma.c | 6 ++-- drivers/infiniband/hw/mthca/mthca_eq.c | 2 +- drivers/media/dvb/pluto2/pluto2.c | 2 +- drivers/net/arm/ep93xx_eth.c | 4 +- drivers/net/b44.c | 22 +++++++++------- drivers/net/bnx2x.c | 2 +- drivers/net/e100.c | 2 +- drivers/net/e1000e/ethtool.c | 4 +- drivers/net/e1000e/netdev.c | 11 ++++--- drivers/net/ibmveth.c | 32 ++++++++++++----------- drivers/net/iseries_veth.c | 4 +- drivers/net/mlx4/eq.c ...
This adds per-device dma_mapping_ops support for CONFIG_X86_64.
A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If
it's NULL, the system-wide dma_ops pointer is used as before.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
---
arch/x86/kernel/pci-calgary_64.c | 2 +-
arch/x86/kernel/pci-dma.c | 2 +-
arch/x86/kernel/pci-gart_64.c | 2 +-
arch/x86/kernel/pci-nommu.c | 14 +-----
arch/x86/kernel/pci-swiotlb_64.c | 2 +-
include/asm-x86/device.h | 3 +
include/asm-x86/dma-mapping.h | 95 ++++++++++++++++++++++++++------------
7 files changed, 74 insertions(+), 46 deletions(-)
diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index e28ec49..dca9a82 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -541,7 +541,7 @@ error:
return ret;
}
-static const struct dma_mapping_ops calgary_dma_ops = {
+static struct dma_mapping_ops calgary_dma_ops = {
.alloc_coherent = calgary_alloc_coherent,
.map_single = calgary_map_single,
.unmap_single = calgary_unmap_single,
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 9c4614a..d2f2304 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -11,7 +11,7 @@
int forbid_dac __read_mostly;
EXPORT_SYMBOL(forbid_dac);
-const struct dma_mapping_ops *dma_ops;
+struct dma_mapping_ops *dma_ops;
EXPORT_SYMBOL(dma_ops);
static int iommu_sac_force __read_mostly;
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 020cadd..638c214 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -619,7 +619,7 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
extern int agp_amd64_init(void);
-static const struct dma_mapping_ops gart_dma_ops = {
+static struct ...This patch continues to turn my hair grey.
I'm currently staring at this, in include/linux/ssb/ssb.h:
static inline int ssb_dma_mapping_error(struct ssb_device *dev, dma_addr_t addr)
{
switch (dev->bus->bustype) {
case SSB_BUSTYPE_PCI:
return pci_dma_mapping_error(dev->dev, addr);
case SSB_BUSTYPE_SSB:
return dma_mapping_error(dev->dev, addr);
default:
__ssb_dma_not_implemented(dev);
}
return -ENOSYS;
}
How do I go from an ssb_device* to a pci_dev*?
Dunno. I think I'll cheat and do:
static inline int ssb_dma_mapping_error(struct ssb_device *dev, dma_addr_t addr)
{
switch (dev->bus->bustype) {
case SSB_BUSTYPE_PCI:
return dma_mapping_error(dev->dev, addr);
case SSB_BUSTYPE_SSB:
return dma_mapping_error(dev->dev, addr);
default:
__ssb_dma_not_implemented(dev);
}
return -ENOSYS;
}
please take a look, see if we can do better?
--
The statement above is wrong. The PCI-specific dma-mapping-error function must be used here. (I hope such a thing exists. Otherwise the API is broken). So -- Greetings Michael. --
Oh wait I see you quoted the function above. So the statement should be: return pci_dma_mapping_error(dev->bus->host_pci, addr); -- Greetings Michael. --
I think what's more useful is a chain with a properly defined order or hierarchy (based on what Muli suggested last time we discussed this http://lkml.org/lkml/2007/11/12/44 ) The suggested order was (in calling order): pvdma->hardare->nommu/swiotlb OK; this sounds helpful. the hook can make a hypercall and confirm with the host kernel if the device in question is an assigned physical device. If yes, we replace the dma_ops. Though, the original intent of having stackable ops is that we might want to go through the swiotlb in the guest even for an assigned device if the guest dma addresses are not in the addressable range From what we've discussed so far, it looks like stackable dma ops will definitely be needed. Does this patchset provide something that stacking Amit. --
Yes---this patchset let's you have a per-device dma-ops, whereas with stackable you only get global dma-ops. I think it's clear we need both, and I think per-device dma-ops are the first thing that's needed. Stacking can then be introduced on a per-device basis. Cheers, Muli --
When we would want stacking, we'll want it globally and not per-device, isn't it? Or at least for devices on a particular bus. When an IOMMU driver registers itself, it should tell which devices it's interested in (each device behind a bus or by enumerating each device it cares for). This should take care of all the scenarios and we won't have the need for per-device dma_ops. For something like pvdma, we can walk through the list of pci devices and make a hypercall for each of them to get this information and have the pvdma version of dma_ops registered for that device. This sounds like it's per-device dma_ops, but it's not -- internally, the dma operations walk through each of the IOMMUs registered and call them in sequence. Does this work? --
On Mon, 26 May 2008 09:39:20 +0530 Well, without per-device dma_ops, IOMMUs could live. But it's pretty hacky. Every time a dma operation is called, IOMMUs need to figure out how a device should be handled. If IOMMUs can set dma_ops for the device when a new device is created, IOMMUs don't care anything any more. That's much clean. That's What As Muli poinsted out, For pvdma, you need stacking per-device dma_ops. With per-device dma_ops, you don't need hack like adding is_pv_device hook in dma_ops. You can set your dma_ops to only pci devices that you are interested. --
What if this information could be hidden behind (a slightly complicated) get_dma_ops()? Also, each of the operations in dma_ops will see if there's something else down the stack that might be interested in the current device. My contention is that we are going to need stackable ops, and a full-fledged stackable implementation is going to solve this problem as well. However, this current implementation of per-device dma_ops looks like a really simple and non-intrusive solution to one problem, that of getting rid of some The hack was added only because there's no stackable dma api we have now. Sure, per-device dma_ops is going to solve this problem and I like it. I'm only saying we're also going to need stacking ops and in effect, per-device dma_ops would just be replaced by them once we get the complete solution. --
On Mon, 26 May 2008 22:14:34 +0530 dma_ops can't do anything since only IOMMUs know what to do against a device. Whatever you implement in dma_ops, without per-device dma_ops, IOMMUs need to figure out what to do a device every time a dma operation is Again, stackable ops can't cleanly solve the problem that per-device dma_ops tries to solve. For example, you stack dma_ops like pvdma->hardare->nommu/swiotlb. How can pvdma_ops know if pvdma_ops needs to handle a device or not? pvdma_ops needs to skip some devices and handle some. per-device dma_ops enables us not to stack pvdma_ops for devices that pvdma_ops are not instrested in. That's much clean. --
Instead of each device calling a function to check which IOMMU is right, I am
OK; how about this:
An example with per-device dma_ops and stacking will look like this:
pvdma->hardware->nommu/swiotlb
^ ^
| |
e1000 rtl8139
And this scheme is going to suit everyone, agreed?
This is simple and doesn't need too many changes all around.
I was suggesting something more than this that can handle cases like an iommu
wanting to have each device behind a bus to pass through it (it's still
possible, but needs a per-device walk). Also, in the scenario depicted above,
each device will start by pointing to the first iommu in the chain (pvdma in
this case) and the iommu will then determine if that device needs to be
passed via its translations.
--
On Tue, 27 May 2008 10:23:21 +0530 It means that you need to register IOMMU information per device. That's same to per-device dma_ops. Or It means you need put devices (an IOMMU is interested in) to a list. Every time dma operation is called, you check the list to see Sorry, I'm not sure what this picture represents. BTW, without pvdma, there is no need to hardware->nommu/swiotlb No, IOMMUs doesn't need to do that. We need to put a stacking mechanism in dma-mapping.h. A stacking mechanism should not be visible to IOMMUs. --
It meant to show just e1000 needs to go through the pvdma translations. rtl8139 goes via the other iommus. e1000 also goes through the other iommus OK; then just per-device dma_ops will work and for the pvdma case, we'll have to have the stacking. Since this is a special case, any kind of generic APIs shouldn't be needed as well. What is the plan with this patch then? When do you plan to ask for mainline merging? --
On Tue, 27 May 2008 11:24:08 +0530 I'm not sure what you mean exactly, but there might be other people who want to use the dma_ops stacking though I'm not sure yet how they use it: Andrew already put this patchset in -mm. Unless someone comes with a new reason against this patchset, it will be merged, I think. BTW, Andrew, really sorry about several compile bugs due to the first patch (changing dma_mapping_error) in this patchset. And thanks a lot for fixing them. --
