Re: [PATCH] x86: avoid unnecessary low zone allocation in AMD IOMMU's alloc_coherent

Previous thread: Re: forcedeth: option to disable 100Hz timer (try 2) by Bodo Eggert on Wednesday, September 10, 2008 - 4:16 am. (1 message)

Next thread: [PATCH] include <linux/fs.h> into linux/ext2_fs.h by Kirill A. Shutemov on Wednesday, September 10, 2008 - 4:22 am. (18 messages)
From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 4:19 am

This is for tip/iommu.

btw, Joery, what happens if map_sg (or map_single) gets
not-DMA-capable buffer in the case of !iommu || !domain?


==
From: FUJITA Tomonori &lt;fujita.tomonori@lab.ntt.co.jp&gt;
Subject: [PATCH] x86: avoid unnecessary low zone allocation in AMD IOMMU's alloc_coherent

x86's common alloc_coherent (dma_alloc_coherent in dma-mapping.h) sets
up the gfp flag according to the device dma_mask but AMD IOMMU doesn't
need it for devices that the IOMMU can do virtual mappings for. This
patch avoids unnecessary low zone allocation.

Signed-off-by: FUJITA Tomonori &lt;fujita.tomonori@lab.ntt.co.jp&gt;
---
 arch/x86/kernel/amd_iommu.c |    7 +++++--
 1 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c
index 01c68c3..8efd249 100644
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -999,6 +999,11 @@ static void *alloc_coherent(struct device *dev, size_t size,
 	u16 devid;
 	phys_addr_t paddr;
 
+	get_device_resources(dev, &amp;iommu, &amp;domain, &amp;devid);
+
+	if (iommu &amp;&amp; domain)
+		flag &amp;= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
+
 	virt_addr = (void *)__get_free_pages(flag, get_order(size));
 	if (!virt_addr)
 		return 0;
@@ -1006,8 +1011,6 @@ static void *alloc_coherent(struct device *dev, size_t size,
 	memset(virt_addr, 0, size);
 	paddr = virt_to_phys(virt_addr);
 
-	get_device_resources(dev, &amp;iommu, &amp;domain, &amp;devid);
-
 	if (!iommu || !domain) {
 		*dma_addr = (dma_addr_t)paddr;
 		return virt_addr;
-- 
1.5.4.2

--

From: Ingo Molnar
Date: Wednesday, September 10, 2008 - 4:57 am

applied these commits to tip/x86/iommu:

edee956: x86: avoid unnecessary low zone allocation in AMD IOMMU's alloc_coherent
9821626: x86: convert dma_alloc_coherent to use is_device_dma_capable
8a493d3: x86: remove duplicated extern force_iommu

thanks!

	Ingo
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 5:03 am

It blows up :-)
I know it but currently it is no problem because this will never happen
in any real AMD IOMMU system (because the hardware just remaps every
device in the system). It needs a fix anyway and the
right solution here is to fall back to one of the software iommu
implementations. The stackable dma_ops patches I have currently in work

These flags are already removed in the dma_alloc_coherent function which
calls this one. Further I think in the case of a remapping IOMMU like
this one we should avoid implementing any gfp hacks and just trust the

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: Ingo Molnar
Date: Wednesday, September 10, 2008 - 5:18 am

or add a WARN_ON_ONCE() instead, to document and enforce that 
assumption.

	Ingo
--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 5:38 am

On Wed, 10 Sep 2008 14:03:10 +0200


I'm not sure you need the stackable dma_ops support. Calgary IOMMU had

Not true about x86/tip/iommu. dma_alloc_coherent in dma-mapping.h does
that so that swiotlb and pci-nommu don't need the gfp hack. Clearing
the gfp flags is much simpler than setting up the flags correctly
mainly because of the fallback device, setting up the flags is really
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 5:48 am

We need stackable dma_ops anyway for paravirt IOMMU support in KVM. And

Yes, dma_alloc_coherent in dma-mapping.h clears the flags. And this
function also calls ops-&gt;alloc_coherent which points to the AMD IOMMUs
alloc_coherent function if the driver is in place.

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 6:03 am

On Wed, 10 Sep 2008 14:48:22 +0200



Hmm, I'm not sure what code you look at. Here's dma_alloc_coherent()
in tip/x86/iommu:

dma_alloc_coherent(struct device *dev, size_t size, dma_addr_t *dma_handle,
		gfp_t gfp)
{
	struct dma_mapping_ops *ops = get_dma_ops(dev);
	void *memory;

	gfp &amp;= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);

Surely we here clear the flag but...

	if (dma_alloc_from_coherent(dev, size, dma_handle, &amp;memory))
		return memory;

	if (!dev) {
		dev = &amp;x86_dma_fallback_dev;
		gfp |= GFP_DMA;
	}

we play with it here though (not happens with pci devices),

	if (!dev-&gt;dma_mask)
		return NULL;

	if (!ops-&gt;alloc_coherent)
		return NULL;

Then dma_alloc_coherent_gfp_flags() sets it again according to
device-&gt;coherent_dma_mask and gfp before ops-&gt;alloc_coherent hook:

	return ops-&gt;alloc_coherent(dev, size, dma_handle,
				   dma_alloc_coherent_gfp_flags(dev, gfp));


This code can set up the exact same gfp flag for swiotbl and nommu as
before.
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 6:10 am

On Wed, Sep 10, 2008 at 10:03:32PM +0900, FUJITA Tomonori wrote:


So its possible that alloc_coherent is called with region specifiers in
the gfp flags. Can't we simply make the gfp hacks depend on
dma_ops-&gt;is_phys and avoid further gfp hacks in the hardware iommu
implementations?

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 6:37 am

On Wed, 10 Sep 2008 15:10:32 +0200

Yes, that's how we solved the ZONE_DMA exhaustion problem in
swiotlb. And we don't duplicate the gfp hack in both swiotlb and

I thought about it but adding a new dma_ops-&gt;we_don't_want_gfp_flag
hook doesn't make the code simpler much. Currently, we have the gfp
setting hack in just one place. It's not bad. Adding such new hook
means adding more lines than we can remove.

Yeah, I was against your patch to adding the gfp setting hack to
swiotlb but it's because gfp is kinda architecture specific stuff and
swiotlb should not. It's the bad design IMO. It's ok for me that
architecture specific IOMMUs can do the architecture specific stuff
(and it's about just clearing the gfp flag).

Intel IOMMU already clears up the flag so I'll send a patch to do that
for Calgary shortly then I think that we can finish x86 alloc_coherent
rewrite.
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 6:53 am

The is_phys flas is already in place and its meaning is &quot;the dma_ops
return bus addresses equal to physical addresses&quot;. This is exactly the
case when we need the gfp hacks. So I don't see a problem in just
skipping the gfp rewrite if is_phys is zero. I don't see a point in
adding gfp flags in dma_alloc_coherent and remove them again
dma_ops-&gt;alloc_coherent code. Specially in this case where we already

The generic swiotlb code already contained a gfp hack for IA64 and I
added another one for x86 which is ok in my opinion. But the #ifdef was
ugly, I agree with that now :-)
Your solution of removing the flag hacks from the generic code completly
was the other possible way.

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 7:24 am

On Wed, 10 Sep 2008 15:53:47 +0200

dma_ops-&gt;is_phys doesn't work well for GART and Intel IOMMU, that do
virtual mappings for some devices and doesn't for some.

We need to a hook that can pass a point to a device to IOMMUs like:

dma_ops-&gt;is_phys(struct device *dev)


Because they need to look at a device to know if they will do virtual
mappings or not for it.
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 7:38 am

Ok, thats a valid point. I queue your patch with the AMD IOMMU updates
for 2.6.28. Thanks.

Joerg

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 7:45 am

On Wed, 10 Sep 2008 16:38:18 +0200

Ingo already has queued it his tree, I think.
--

From: Joerg Roedel
Date: Thursday, September 11, 2008 - 2:10 am

I asked Ingo to remove it from his tree to avoid merge conflicts with
updates I am currently preparing. Here is the patch I queued into my
2.6.28 update chain.

Joerg

===
commit 42a7fa8d8a7f1680cba76e4824dba30bfd1c9d70
Author: FUJITA Tomonori &lt;fujita.tomonori@lab.ntt.co.jp&gt;
Date:   Wed Sep 10 20:19:40 2008 +0900

    x86: avoid unnecessary low zone allocation in AMD IOMMU's alloc_coherent
    
    x86's common alloc_coherent (dma_alloc_coherent in dma-mapping.h) sets
    up the gfp flag according to the device dma_mask but AMD IOMMU doesn't
    need it for devices that the IOMMU can do virtual mappings for. This
    patch avoids unnecessary low zone allocation.
    
    Signed-off-by: FUJITA Tomonori &lt;fujita.tomonori@lab.ntt.co.jp&gt;
    Signed-off-by: Joerg Roedel &lt;joerg.roedel@amd.com&gt;

diff --git a/arch/x86/kernel/amd_iommu.c b/arch/x86/kernel/amd_iommu.c
index ae38b24..09261dc 100644
--- a/arch/x86/kernel/amd_iommu.c
+++ b/arch/x86/kernel/amd_iommu.c
@@ -1360,6 +1360,9 @@ static void *alloc_coherent(struct device *dev, size_t size,
 	if (!check_device(dev))
 		return NULL;
 
+	if (!get_device_resources(dev, &amp;iommu, &amp;domain, &amp;devid))
+		flag &amp;= ~(__GFP_DMA | __GFP_HIGHMEM | __GFP_DMA32);
+
 	virt_addr = (void *)__get_free_pages(flag, get_order(size));
 	if (!virt_addr)
 		return 0;
@@ -1367,8 +1370,6 @@ static void *alloc_coherent(struct device *dev, size_t size,
 	memset(virt_addr, 0, size);
 	paddr = virt_to_phys(virt_addr);
 
-	get_device_resources(dev, &amp;iommu, &amp;domain, &amp;devid);
-
 	if (!iommu || !domain) {
 		*dma_addr = (dma_addr_t)paddr;
 		return virt_addr;

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. ...
From: FUJITA Tomonori
Date: Thursday, September 11, 2008 - 6:36 am

On Thu, 11 Sep 2008 11:10:00 +0200

I see, thanks.
--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 7:39 am

On Wed, 10 Sep 2008 23:24:57 +0900

btw, in tip/x86/iommu, GART's alloc_coherent always does virtual
mappings to allocate a size-aligned memory (as DMA-mapping.txt
defines).

Because someone strongly insisted, I modified GART's alloc_coherent to
do so but as I said again and again, it's completely meaningless (only
POWER IOMMU does it and drivers don't depend on such requirement).

I guess that it would be better to do virtual mappings only when
necessary as the current mainline does since GART I/O space is
precious in some systems. But I don't care much. What's your opinion
(as a AMD developer)?
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 7:52 am

Very true. My original rewrite did the mapping only when necessary too.
What were the reasons to do the mapping always?

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 8:09 am

On Wed, 10 Sep 2008 16:52:49 +0200

As I said above, it's for allocating a size-aligned memory. Look at
the description of pci_alloc_consistent in DMA-mapping.txt:

The cpu return address and the DMA bus master address are both
guaranteed to be aligned to the smallest PAGE_SIZE order which
is greater than or equal to the requested size.  This invariant
exists (for example) to guarantee that if you allocate a chunk
which is smaller than or equal to 64 kilobytes, the extent of the
buffer you receive will not cross a 64K boundary.

You can't do this with __get_free_pages easily (you need some hacks to
do this). You can do this via iommu_area_alloc() for free.

Well, actually you agreed with adding such requirement (though I said
again and again that it's totally meaningless...):

http://lkml.org/lkml/2008/7/24/162


--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 8:29 am

What hacks do you need with __get_free_pages? The memory it returns is

The other possible way is removing this requirement from the
documentation. But if the spec has this requirement the code _has_ to

This message stated nothing about _always_ map GART alloc_coherent
allocations using the aperture.

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 9:29 am

On Wed, 10 Sep 2008 17:29:18 +0200

Is it guaranteed (documented somewhere) ?
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 10:05 am

I don't know if there is a formal definition for it. It is documented in
some books about the Linux kernel (I read this in some book the first
time). This alignment results from the buddy algorithm the page alloctor
uses. You can definitly rely on that.

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Wednesday, September 10, 2008 - 10:15 am

On Wed, 10 Sep 2008 19:05:04 +0200

I meant, if it's not documented as a guaranteed feature (not just the
characteristic of the current code), it could change any time.
--

From: Joerg Roedel
Date: Wednesday, September 10, 2008 - 10:25 am

It will never change. I am sure about this. A lot of code relys on
this like the HugeTLBfs implementations for example. In fact, it may be
possible that the requirement to the dma_alloc_coherent function derive
from this behavior of the Linux page allocator (to not break anything
when the dma api was introduced).

Joerg

-- 
           |           AMD Saxony Limited Liability Company &amp; Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

Previous thread: Re: forcedeth: option to disable 100Hz timer (try 2) by Bodo Eggert on Wednesday, September 10, 2008 - 4:16 am. (1 message)

Next thread: [PATCH] include <linux/fs.h> into linux/ext2_fs.h by Kirill A. Shutemov on Wednesday, September 10, 2008 - 4:22 am. (18 messages)