[PATCH v2 -mm 0/2] x86: per-device dma_mapping_ops

Previous thread: Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition by Neil Brown on Sunday, May 18, 2008 - 10:27 pm. (3 messages)

Next thread: [PATCH] ahci: change the Device IDs of nvidia MCP7B AHCI controller in ahci.c by peerchen on Sunday, May 18, 2008 - 11:44 pm. (3 messages)
From: FUJITA Tomonori
Date: Sunday, May 18, 2008 - 11:31 pm

This is an updated version of the patchset to add per-device
dma_mapping_ops support for CONFIG_X86_64 like POWER architecture
does:

http://lkml.org/lkml/2008/5/13/36

This is against 2.6.26-rc2-mm1 (the changes since the v1 are pretty
trivial like dropping the change for v850 arch).

This enables us to cleanly fix the Calgary IOMMU issue that some
devices are not behind the IOMMU [1].

I think that per-device dma_mapping_ops support would be also helpful
for KVM people to support PCI passthrough but Andi thinks that this
makes it difficult to support the PCI passthrough (see the above
thread). So I CC'ed this to KVM camp. Comments are appreciated.

A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If
it's NULL, the system-wide dma_ops pointer is used as before.

If it's useful for KVM people, I plan to implement a mechanism to
register a hook called when a new pci (or dma capable) device is
created (it works with hot plugging). It enables IOMMUs to set up an
appropriate dma_mapping_ops per device.

The major obstacle is that dma_mapping_error doesn't take a pointer to
the device unlike other DMA operations. So x86 can't have
dma_mapping_ops per device. Note all the POWER IOMMUs use the same
dma_mapping_error function so this is not a problem for POWER but x86
IOMMUs use different dma_mapping_error functions.

The first patch adds the device argument to dma_mapping_error. The
patch is trivial but large since it touches lots of drivers and
dma-mapping.h in all the architecture.

[1] http://lkml.org/lkml/2008/5/8/423



--

From: FUJITA Tomonori
Date: Sunday, May 18, 2008 - 11:31 pm

dma_mapping_error doesn't take a pointer to the device unlike other
DMA operations. So we can't have dma_mapping_ops per device.

Note that POWER already has dma_mapping_ops per device but all the
POWER IOMMUs use the same dma_mapping_error function. x86 IOMMUs use
different dma_mapping_error functions. So dma_mapping_error needs the
device argument.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 Documentation/DMA-API.txt                      |    4 +-
 arch/arm/common/dmabounce.c                    |    2 +-
 arch/ia64/hp/common/hwsw_iommu.c               |    5 ++-
 arch/ia64/hp/common/sba_iommu.c                |    2 +-
 arch/ia64/sn/pci/pci_dma.c                     |    2 +-
 arch/mips/mm/dma-default.c                     |    2 +-
 arch/powerpc/platforms/cell/celleb_scc_pciex.c |    2 +-
 arch/powerpc/platforms/cell/spider-pci.c       |    2 +-
 arch/powerpc/platforms/iseries/mf.c            |    2 +-
 arch/x86/kernel/pci-gart_64.c                  |    1 -
 arch/x86/kernel/pci-nommu.c                    |    2 +-
 drivers/firewire/fw-iso.c                      |    2 +-
 drivers/firewire/fw-ohci.c                     |    2 +-
 drivers/firewire/fw-sbp2.c                     |    8 +++---
 drivers/infiniband/hw/ipath/ipath_sdma.c       |    2 +-
 drivers/infiniband/hw/ipath/ipath_user_sdma.c  |    6 ++--
 drivers/infiniband/hw/mthca/mthca_eq.c         |    2 +-
 drivers/media/dvb/pluto2/pluto2.c              |    2 +-
 drivers/net/arm/ep93xx_eth.c                   |    4 +-
 drivers/net/b44.c                              |   22 +++++++++-------
 drivers/net/bnx2x.c                            |    2 +-
 drivers/net/e100.c                             |    2 +-
 drivers/net/e1000e/ethtool.c                   |    4 +-
 drivers/net/e1000e/netdev.c                    |   11 ++++---
 drivers/net/ibmveth.c                          |   32 ++++++++++++-----------
 drivers/net/iseries_veth.c                     |    4 +-
 drivers/net/mlx4/eq.c            ...
From: FUJITA Tomonori
Date: Sunday, May 18, 2008 - 11:31 pm

This adds per-device dma_mapping_ops support for CONFIG_X86_64.

A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If
it's NULL, the system-wide dma_ops pointer is used as before.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>
---
 arch/x86/kernel/pci-calgary_64.c |    2 +-
 arch/x86/kernel/pci-dma.c        |    2 +-
 arch/x86/kernel/pci-gart_64.c    |    2 +-
 arch/x86/kernel/pci-nommu.c      |   14 +-----
 arch/x86/kernel/pci-swiotlb_64.c |    2 +-
 include/asm-x86/device.h         |    3 +
 include/asm-x86/dma-mapping.h    |   95 ++++++++++++++++++++++++++------------
 7 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index e28ec49..dca9a82 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -541,7 +541,7 @@ error:
 	return ret;
 }
 
-static const struct dma_mapping_ops calgary_dma_ops = {
+static struct dma_mapping_ops calgary_dma_ops = {
 	.alloc_coherent = calgary_alloc_coherent,
 	.map_single = calgary_map_single,
 	.unmap_single = calgary_unmap_single,
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 9c4614a..d2f2304 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -11,7 +11,7 @@
 int forbid_dac __read_mostly;
 EXPORT_SYMBOL(forbid_dac);
 
-const struct dma_mapping_ops *dma_ops;
+struct dma_mapping_ops *dma_ops;
 EXPORT_SYMBOL(dma_ops);
 
 static int iommu_sac_force __read_mostly;
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 020cadd..638c214 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -619,7 +619,7 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
 
 extern int agp_amd64_init(void);
 
-static const struct dma_mapping_ops gart_dma_ops = {
+static struct ...
From: Andrew Morton
Date: Wednesday, July 2, 2008 - 3:07 am

This patch continues to turn my hair grey.

I'm currently staring at this, in include/linux/ssb/ssb.h:

static inline int ssb_dma_mapping_error(struct ssb_device *dev, dma_addr_t addr)
{
	switch (dev->bus->bustype) {
	case SSB_BUSTYPE_PCI:
		return pci_dma_mapping_error(dev->dev, addr);
	case SSB_BUSTYPE_SSB:
		return dma_mapping_error(dev->dev, addr);
	default:
		__ssb_dma_not_implemented(dev);
	}
	return -ENOSYS;
}

How do I go from an ssb_device* to a pci_dev*?

Dunno.  I think I'll cheat and do:

static inline int ssb_dma_mapping_error(struct ssb_device *dev, dma_addr_t addr)
{
	switch (dev->bus->bustype) {
	case SSB_BUSTYPE_PCI:
		return dma_mapping_error(dev->dev, addr);
	case SSB_BUSTYPE_SSB:
		return dma_mapping_error(dev->dev, addr);
	default:
		__ssb_dma_not_implemented(dev);
	}
	return -ENOSYS;
}

please take a look, see if we can do better?
--

From: Michael Buesch
Date: Wednesday, July 2, 2008 - 3:18 am

The statement above is wrong. The PCI-specific dma-mapping-error function
must be used here. (I hope such a thing exists. Otherwise the API is broken).
So



-- 
Greetings Michael.
--

From: Michael Buesch
Date: Wednesday, July 2, 2008 - 3:20 am

Oh wait I see you quoted the function above. So the statement should be:

		return pci_dma_mapping_error(dev->bus->host_pci, addr);


-- 
Greetings Michael.
--

From: Andrew Morton
Date: Wednesday, July 2, 2008 - 3:41 am

that works, thanks.
--

From: Amit Shah
Date: Thursday, May 22, 2008 - 3:43 am

I think what's more useful is a chain with a properly defined order or 
hierarchy (based on what Muli suggested last time we discussed this

http://lkml.org/lkml/2007/11/12/44 )

The suggested order was (in calling order):
pvdma->hardare->nommu/swiotlb


OK; this sounds helpful. the hook can make a hypercall and confirm with the 
host kernel if the device in question is an assigned physical device. If yes, 
we replace the dma_ops. Though, the original intent of having stackable ops 
is that we might want to go through the swiotlb in the guest even for an 
assigned device if the guest dma addresses are not in the addressable range 

From what we've discussed so far, it looks like stackable dma ops will 
definitely be needed. Does this patchset provide something that stacking 

Amit.
--

From: Muli Ben-Yehuda
Date: Sunday, May 25, 2008 - 12:20 am

Yes---this patchset let's you have a per-device dma-ops, whereas with
stackable you only get global dma-ops. I think it's clear we need
both, and I think per-device dma-ops are the first thing that's
needed. Stacking can then be introduced on a per-device basis.

Cheers,
Muli

--

From: Amit Shah
Date: Sunday, May 25, 2008 - 9:09 pm

When we would want stacking, we'll want it globally and not per-device, isn't 
it? Or at least for devices on a particular bus.

When an IOMMU driver registers itself, it should tell which devices it's 
interested in (each device behind a bus or by enumerating each device it 
cares for). This should take care of all the scenarios and we won't have the 
need for per-device dma_ops.

For something like pvdma, we can walk through the list of pci devices and make 
a hypercall for each of them to get this information and have the pvdma 
version of dma_ops registered for that device. This sounds like it's 
per-device dma_ops, but it's not -- internally, the dma operations walk 
through each of the IOMMUs registered and call them in sequence.

Does this work?
--

From: FUJITA Tomonori
Date: Sunday, May 25, 2008 - 11:11 pm

On Mon, 26 May 2008 09:39:20 +0530

Well, without per-device dma_ops, IOMMUs could live. But it's pretty
hacky. Every time a dma operation is called, IOMMUs need to figure out
how a device should be handled.

If IOMMUs can set dma_ops for the device when a new device is created,
IOMMUs don't care anything any more. That's much clean. That's What

As Muli poinsted out, For pvdma, you need stacking per-device
dma_ops. With per-device dma_ops, you don't need hack like adding
is_pv_device hook in dma_ops. You can set your dma_ops to only pci
devices that you are interested.
--

From: Amit Shah
Date: Monday, May 26, 2008 - 9:44 am

What if this information could be hidden behind (a slightly complicated) 
get_dma_ops()? Also, each of the operations in dma_ops will see if there's 
something else down the stack that might be interested in the current device.

My contention is that we are going to need stackable ops, and a full-fledged 
stackable implementation is going to solve this problem as well. However, 
this current implementation of per-device dma_ops looks like a really simple 
and non-intrusive solution to one problem, that of getting rid of some 

The hack was added only because there's no stackable dma api we have now. 
Sure, per-device dma_ops is going to solve this problem and I like it. I'm 
only saying we're also going to need stacking ops and in effect, per-device 
dma_ops would just be replaced by them once we get the complete solution.
--

From: FUJITA Tomonori
Date: Monday, May 26, 2008 - 4:50 pm

On Mon, 26 May 2008 22:14:34 +0530

dma_ops can't do anything since only IOMMUs know what to do against a
device.

Whatever you implement in dma_ops, without per-device dma_ops, IOMMUs
need to figure out what to do a device every time a dma operation is


Again, stackable ops can't cleanly solve the problem that per-device
dma_ops tries to solve. For example, you stack dma_ops like
pvdma->hardare->nommu/swiotlb. How can pvdma_ops know if pvdma_ops
needs to handle a device or not? pvdma_ops needs to skip some devices
and handle some. per-device dma_ops enables us not to stack pvdma_ops
for devices that pvdma_ops are not instrested in. That's much clean.
--

From: Amit Shah
Date: Monday, May 26, 2008 - 9:53 pm

Instead of each device calling a function to check which IOMMU is right, I am 

OK; how about this:

An example with per-device dma_ops and stacking will look like this:

pvdma->hardware->nommu/swiotlb
    ^             ^
     |              |
 e1000     rtl8139

And this scheme is going to suit everyone, agreed?

This is simple and doesn't need too many changes all around.

I was suggesting something more than this that can handle cases like an iommu 
wanting to have each device behind a bus to pass through it (it's still 
possible, but needs a per-device walk). Also, in the scenario depicted above, 
each device will start by pointing to the first iommu in the chain (pvdma in 
this case) and the iommu will then determine if that device needs to be 
passed via its translations.
--

From: FUJITA Tomonori
Date: Monday, May 26, 2008 - 10:24 pm

On Tue, 27 May 2008 10:23:21 +0530

It means that you need to register IOMMU information per
device. That's same to per-device dma_ops.

Or It means you need put devices (an IOMMU is interested in) to a
list. Every time dma operation is called, you check the list to see

Sorry, I'm not sure what this picture represents.

BTW, without pvdma, there is no need to hardware->nommu/swiotlb

No, IOMMUs doesn't need to do that. We need to put a stacking
mechanism in dma-mapping.h. A stacking mechanism should not be visible
to IOMMUs.
--

From: Amit Shah
Date: Monday, May 26, 2008 - 10:54 pm

It meant to show just e1000 needs to go through the pvdma translations. 
rtl8139 goes via the other iommus. e1000 also goes through the other iommus 


OK; then just per-device dma_ops will work and for the pvdma case, we'll have 
to have the stacking. Since this is a special case, any kind of generic APIs 
shouldn't be needed as well.

What is the plan with this patch then? When do you plan to ask for mainline 
merging?
--

From: FUJITA Tomonori
Date: Wednesday, May 28, 2008 - 3:19 am

On Tue, 27 May 2008 11:24:08 +0530

I'm not sure what you mean exactly, but there might be other people
who want to use the dma_ops stacking though I'm not sure yet how they
use it:


Andrew already put this patchset in -mm. Unless someone comes with a
new reason against this patchset, it will be merged, I think.

BTW, Andrew, really sorry about several compile bugs due to the first
patch (changing dma_mapping_error) in this patchset. And thanks a lot
for fixing them.
--

Previous thread: Re: [RFC][PATCH] md: avoid fullsync if a faulty member missed a dirty transition by Neil Brown on Sunday, May 18, 2008 - 10:27 pm. (3 messages)

Next thread: [PATCH] ahci: change the Device IDs of nvidia MCP7B AHCI controller in ahci.c by peerchen on Sunday, May 18, 2008 - 11:44 pm. (3 messages)