Re: [PATCH 0/2] x86: per-device dma_mapping_ops

Previous thread: none

Next thread: [PATCH] memcg: better migration handling by KAMEZAWA Hiroyuki on Monday, May 12, 2008 - 11:31 pm. (1 message)
From: FUJITA Tomonori
Date: Monday, May 12, 2008 - 11:04 pm

This patchset adds per-device dma_mapping_ops support for
CONFIG_X86_64 like POWER architecture does. This change enables us to
cleanly fix the Calgary IOMMU issue that some devices are not behind
the IOMMU [1]. It also would be helpful to handle KVM PCI passthrough.

A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If
it's NULL, the system-wide dma_ops pointer is used as before.

The major obstacle is that dma_mapping_error doesn't take a pointer to
the device unlike other DMA operations. So x86 can't have
dma_mapping_ops per device. Note all the POWER IOMMUs use the same
dma_mapping_error function so this is not a problem for POWER but x86
IOMMUs use different dma_mapping_error functions.

The first patch adds the device argument to dma_mapping_error. The
patch is trivial but large since it touches lots of drivers and
dma-mapping.h in all the architecture.

[1] http://lkml.org/lkml/2008/5/8/423


--

From: FUJITA Tomonori
Date: Monday, May 12, 2008 - 11:04 pm

dma_mapping_error doesn't take a pointer to the device unlike other
DMA operations. So we can't have dma_mapping_ops per device.

Note that POWER already has dma_mapping_ops per device but all the
POWER IOMMUs use the same dma_mapping_error function. x86 IOMMUs use
different dma_mapping_error functions. So dma_mapping_error needs the
device argument.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 arch/arm/common/dmabounce.c                    |    2 +-
 arch/ia64/hp/common/hwsw_iommu.c               |    5 ++-
 arch/ia64/hp/common/sba_iommu.c                |    2 +-
 arch/ia64/sn/pci/pci_dma.c                     |    2 +-
 arch/mips/mm/dma-default.c                     |    2 +-
 arch/powerpc/platforms/cell/celleb_scc_pciex.c |    2 +-
 arch/powerpc/platforms/cell/spider-pci.c       |    2 +-
 arch/powerpc/platforms/iseries/mf.c            |    2 +-
 arch/x86/kernel/pci-gart_64.c                  |    1 -
 arch/x86/kernel/pci-nommu.c                    |    2 +-
 drivers/firewire/fw-iso.c                      |    2 +-
 drivers/firewire/fw-ohci.c                     |    2 +-
 drivers/firewire/fw-sbp2.c                     |    8 +++---
 drivers/infiniband/hw/ipath/ipath_sdma.c       |    2 +-
 drivers/infiniband/hw/ipath/ipath_user_sdma.c  |    6 ++--
 drivers/infiniband/hw/mthca/mthca_eq.c         |    2 +-
 drivers/media/dvb/pluto2/pluto2.c              |    2 +-
 drivers/net/arm/ep93xx_eth.c                   |    4 +-
 drivers/net/b44.c                              |   22 +++++++++-------
 drivers/net/bnx2x.c                            |    2 +-
 drivers/net/e100.c                             |    2 +-
 drivers/net/e1000e/ethtool.c                   |    4 +-
 drivers/net/e1000e/netdev.c                    |   11 ++++---
 drivers/net/ibmveth.c                          |   32 ++++++++++++-----------
 drivers/net/iseries_veth.c                     |    4 +-
 drivers/net/mlx4/eq.c                          |    2 +-
 drivers/net/pasemi_mac.c         ...
From: FUJITA Tomonori
Date: Monday, May 12, 2008 - 11:04 pm

This adds per-device dma_mapping_ops support for CONFIG_X86_64.

A pointer to dma_mapping_ops to struct dev_archdata is added. If the
pointer is non NULL, DMA operations in asm/dma-mapping.h use it. If
it's NULL, the system-wide dma_ops pointer is used as before.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 arch/x86/kernel/pci-calgary_64.c |    2 +-
 arch/x86/kernel/pci-dma.c        |    2 +-
 arch/x86/kernel/pci-gart_64.c    |    2 +-
 arch/x86/kernel/pci-nommu.c      |   14 +-----
 arch/x86/kernel/pci-swiotlb_64.c |    2 +-
 include/asm-x86/device.h         |    3 +
 include/asm-x86/dma-mapping.h    |   95 ++++++++++++++++++++++++++------------
 7 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
index e28ec49..dca9a82 100644
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -541,7 +541,7 @@ error:
 	return ret;
 }
 
-static const struct dma_mapping_ops calgary_dma_ops = {
+static struct dma_mapping_ops calgary_dma_ops = {
 	.alloc_coherent = calgary_alloc_coherent,
 	.map_single = calgary_map_single,
 	.unmap_single = calgary_unmap_single,
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
index 0c37f16..33ded93 100644
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -11,7 +11,7 @@
 int forbid_dac __read_mostly;
 EXPORT_SYMBOL(forbid_dac);
 
-const struct dma_mapping_ops *dma_ops;
+struct dma_mapping_ops *dma_ops;
 EXPORT_SYMBOL(dma_ops);
 
 static int iommu_sac_force __read_mostly;
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index e1a6954..8469f76 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -621,7 +621,7 @@ static __init int init_k8_gatt(struct agp_kern_info *info)
 
 extern int agp_amd64_init(void);
 
-static const struct dma_mapping_ops gart_dma_ops = {
+static struct dma_mapping_ops gart_dma_ops = {
 ...
From: Muli Ben-Yehuda
Date: Tuesday, May 13, 2008 - 10:55 pm

Acked-by: Muli Ben-Yehuda <muli@il.ibm.com>

Cheers,
Muli
--

From: Yinghai Lu
Date: Thursday, May 15, 2008 - 1:45 pm

On Mon, May 12, 2008 at 11:04 PM, FUJITA Tomonori

do you have other patch to assign value to dma_ops for every device?
it is always NULL at this time.

YH
--

From: FUJITA Tomonori
Date: Thursday, May 15, 2008 - 8:44 pm

On Thu, 15 May 2008 13:45:58 -0700


Not yet.

As I said, I plan to implement a mechanism to register a hook called
when a new pci (or dma capable) device is created. It enables IOMMUs
to set up an appropriate dma_mapping_ops per device:

http://lkml.org/lkml/2008/5/14/519

But I wanted to make sure whether per-device dma_ops is useful for
everyone.
--

From: Randy Dunlap
Date: Tuesday, May 13, 2008 - 8:39 am

Seems to be missing updates to Documentation/DMA-API.txt ??

---
~Randy
--

From: FUJITA Tomonori
Date: Wednesday, May 14, 2008 - 6:12 pm

On Tue, 13 May 2008 08:39:01 -0700


Oops, I'll post an updated version against -mm soon.

Thanks,
--

From: Muli Ben-Yehuda
Date: Tuesday, May 13, 2008 - 10:49 pm

Awesome! Much needed, thank you for doing this.

Cheers,
Muli


--

From: FUJITA Tomonori
Date: Wednesday, May 14, 2008 - 6:12 pm

On Wed, 14 May 2008 08:49:24 +0300

No problem. Well, as you know, it's just a base. We need more work to
solve the problems on the top of this.

I'd like to have a mechanism to register a hook called when a new pci
(or dma capable) device is created. It enables IOMMUs to set up an
appropriate dma_mapping_ops per device. It could also enables us to
simplify the IOMMUs code to initilize devices at startup (for exmple,
intel-iommu checks all the pci devices and creates a domain per device
if necessary).

I'll post an updated version against -mm. If people seems to be fine
with per-device dma_mapping_ops, then I'll work on further issues.
--

From: Alexis Bruemmer
Date: Wednesday, May 14, 2008 - 7:00 pm

I wanted to try and create a test version of this, at least for the
calgary side, but ran into an error with this patch set:

Fusion MPT SAS Host driver 3.04.06
ACPI: PCI Interrupt 0000:34:00.0[A] -> GSI 142 (level, low) -> IRQ 142
mptbase: ioc0: Initiating bringup
ioc0: LSISAS1078 C1: Capabilities={Initiator}
mptbase: ioc0: PCI-MSI enabled
Calgary: DMA error on CalIOC2 PHB 0x33
Calgary: 0x02000000@CSR 0x00000000@PLSSR 0xb0008000@CSMR 0x00000000@MCK
Calgary: 0x00000000@0x810 0xfee0c000@0x820 0x00000000@0x830
0x00000000@0x840 0x03804a00@0x850 0x00000000@0x860 0x00000000@0x870
Calgary: 0x00000000@0xcb0
mptbase: ioc0: Initiating recovery
mptbase: ioc0: WARNING - Unexpected doorbell active!
mptbase: ioc0: WARNING - NOT READY!
mptbase: ioc0: WARNING - Cannot recover rc = -1!
mptbase: ioc0: WARNING - Firmware Reload FAILED!
Clocksource tsc unstable (delta = 48333779773 ns)

Not sure what is causing this error, but will keep digging.  Any ideas?

--Alexis

--

From: FUJITA Tomonori
Date: Wednesday, May 14, 2008 - 7:30 pm

On Wed, 14 May 2008 19:00:13 -0700

Hmm, you don't hit this without my patches, right?
--

From: Alexis Bruemmer
Date: Thursday, May 15, 2008 - 11:21 am

oops, you are right, this failure was due to a .config change not your
patch set (MSI not behaving on MegaRAID).  Sorry for the false alarm.
Your patches work as expected.  

--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 8:17 am

Calgary caught an errant DMA. If it was working before, we are
probably mis-programming the TCE tables. The first thing to check is
whether all dma_map_xxx calls by mptbase end up calling the Calgary
dma_ops.

Cheers,
Muli
--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 8:15 am

That's great---it will be needed to support hot-plugging of devices on

I'm not sure if it will be easier than the current "loop over all
devices" method, but since we will need it for hotplug anyway, we

Excellent.

Cheers,
Muli

--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 2:01 am

This makes it basically impossible to do stack ops, which some 
people have been doing.

-Andi
--

From: FUJITA Tomonori
Date: Thursday, May 15, 2008 - 2:16 am

On Thu, 15 May 2008 11:01:09 +0200

Seems that I misunderstand what those people want.

What those people want to do and how the stack ops achieve it? Or can
you tell me where their patches are?
--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 2:30 am

I've seen it in two cases: first was for KVM IO bypass and the other was a 
(unfinished) patch to support the NoDMA bitmaps on some systems.

In this case you really want to do a wrapper around the existing ops
and extend the mapping. 

That worked fine by just replacing the global pointer, but will
be quite hard in your set up.

-Andi


--

From: FUJITA Tomonori
Date: Thursday, May 15, 2008 - 2:41 am

On Thu, 15 May 2008 11:30:12 +0200

Thanks,

I thought that KVM people want to do it per device (in the first
case). So with my patchse, they can replace the dma_ops pointer in
dev_archdata with what they want.
--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 3:48 am

But where would they save the original pointer?

-Andi

--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 8:32 am

See my other mail on this subject, we don't need the original pointer,
we just use stackable ops as a poor man's replacement for per-device
ops.

Cheers,
Muli
--

From: FUJITA Tomonori
Date: Thursday, May 15, 2008 - 8:44 pm

On Thu, 15 May 2008 12:48:04 +0200

Yeah, we need an extra mechanism for that but it's same for the
system-wide dma_ops pointer (i.e. without my patches), isn't it?

I'm still not sure how this patchset make it impossible to have stack
dma_ops. These people need per-device dma_ops and we can do stack
per-device dma_ops?
--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 10:24 pm

Anybody who does stack ops in your scheme would need to hook into new
device creation and an own per device saving pointer. Also there
are livetime issues when to wrap.

It's certainly possible, but likely complicated

-Andi


--

From: FUJITA Tomonori
Date: Sunday, May 18, 2008 - 11:32 pm

On Fri, 16 May 2008 07:24:33 +0200

But is that what those people want, setting up dma_ops per device?

I think that creating a hook is not difficult and we could put a

Hmm, let's see what those people think about on this. I'll repost a
patchset for -mm with CC'ing those people.
--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 8:26 am

That's my understanding too. We use stackable ops as a poor man's
replacement for per-device ops (depending on what kind of device it
is, call the original ops or our pvdma ops).

Cheers,
Muli



--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 2:03 pm

But in the KVM case you still need to support the underlying ops too, e.g. in 
case of bouncing through swiotlb needed

-Andi
--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 2:39 pm

Good point, although "secondary" DMA-ops should only be needed in rare
cases (the only one I can think of is bouncing through swiotlb for a
pass-through device which has a limited DMA mask), whereas per-device
ops are needed for every pass-through device.

Cheers,
Muli
--

From: Andi Kleen
Date: Thursday, May 15, 2008 - 2:45 pm

Even rare cases need to be handled. In the nodma bitmap case it was the
same always btw

-Andi

--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 2:53 pm

Not sure what you mean?

Cheers,
Muli
--

From: Muli Ben-Yehuda
Date: Thursday, May 15, 2008 - 3:05 pm

... which is not to say of course we can't do both, dev_archdata could
have a list of struct dma_mapping_ops instead of (or in addition to) a
struct dma_mapping_ops pointer. If space was at a premium we could
even multiplex both on the same pointer by using one of the low bits
to select between them.

Cheers,
Muli
--

From: Alexis Bruemmer
Date: Thursday, May 15, 2008 - 11:25 am

Yes an explanation of stack ops, or a pointer to a patch set that
implements them would be very useful.


--

Previous thread: none

Next thread: [PATCH] memcg: better migration handling by KAMEZAWA Hiroyuki on Monday, May 12, 2008 - 11:31 pm. (1 message)