[PATCH 3/3] x86: make GART to respect device's dma_mask about virtual mappings

Previous thread: none

Next thread: [PATCH] Fixup deb-pkg target to generate separate firmware deb. by Jonathan McDowell on Friday, September 12, 2008 - 4:20 am. (5 messages)
From: FUJITA Tomonori
Date: Friday, September 12, 2008 - 3:42 am

Currently, gart IOMMU ignores device's dma_mask when it does virtual
mappings. So it could give a device a virtual address that the device
can't access to.

Some IOMMUs, in x86 IOMMUs GART and Calgary, have this issue. This
patchset fixes GART.

The first and second patches add helper functions, useful for some
IOMMUs (both are taken from POWER IOMMU code). I'll convert some
IOMMUs to use the functions after they are merged into mainline (to
avoid complicated dependence on multiple trees).

This is against tip/x86/iommu plus a patch that I sent yesterday:

http://lkml.org/lkml/2008/9/11/147.



--

From: FUJITA Tomonori
Date: Friday, September 12, 2008 - 3:42 am

This function helps IOMMUs to know the highest address that a device
can access to.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 include/linux/iommu-helper.h |   10 ++++++++++
 1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/include/linux/iommu-helper.h b/include/linux/iommu-helper.h
index c975caf..58f4110 100644
--- a/include/linux/iommu-helper.h
+++ b/include/linux/iommu-helper.h
@@ -1,3 +1,13 @@
+static inline unsigned long iommu_device_max_index(unsigned long size,
+						   unsigned long offset,
+						   u64 dma_mask)
+{
+	if (size + offset > dma_mask)
+		return dma_mask - offset + 1;
+	else
+		return size;
+}
+
 extern int iommu_is_span_boundary(unsigned int index, unsigned int nr,
 				  unsigned long shift,
 				  unsigned long boundary_size);
-- 
1.5.5.GIT

--

From: FUJITA Tomonori
Date: Friday, September 12, 2008 - 3:42 am

Several IOMMUs do the same thing to get the dma_mask of a device. This
adds a helper function to do the same thing to sweep them.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 include/linux/dma-mapping.h |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 6ed50c1..0dba743 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -63,6 +63,13 @@ static inline int is_buffer_dma_capable(u64 mask, dma_addr_t addr, size_t size)
 #define dma_sync_single		dma_sync_single_for_cpu
 #define dma_sync_sg		dma_sync_sg_for_cpu
 
+static inline u64 dma_get_mask(struct device *dev)
+{
+	if (dev->dma_mask && *dev->dma_mask)
+		return *dev->dma_mask;
+	return DMA_32BIT_MASK;
+}
+
 extern u64 dma_get_required_mask(struct device *dev);
 
 static inline unsigned int dma_get_max_seg_size(struct device *dev)
-- 
1.5.5.GIT

--

From: FUJITA Tomonori
Date: Friday, September 12, 2008 - 3:42 am

Currently, GART IOMMU ingores device's dma_mask when it does virtual
mappings. So it could give a device a virtual address that the device
can't access to.

This patch fixes the above problem.

Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
 arch/x86/kernel/pci-gart_64.c |   39 ++++++++++++++++++++++++++++-----------
 1 files changed, 28 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 47abe43..9739d56 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -83,23 +83,34 @@ static unsigned long next_bit;  /* protected by iommu_bitmap_lock */
 static int need_flush;		/* global flush state. set for each gart wrap */
 
 static unsigned long alloc_iommu(struct device *dev, int size,
-				 unsigned long align_mask)
+				 unsigned long align_mask, u64 dma_mask)
 {
 	unsigned long offset, flags;
 	unsigned long boundary_size;
 	unsigned long base_index;
+	unsigned long limit;
 
 	base_index = ALIGN(iommu_bus_base & dma_get_seg_boundary(dev),
 			   PAGE_SIZE) >> PAGE_SHIFT;
 	boundary_size = ALIGN((unsigned long long)dma_get_seg_boundary(dev) + 1,
 			      PAGE_SIZE) >> PAGE_SHIFT;
 
+	limit = iommu_device_max_index(iommu_pages,
+				       DIV_ROUND_UP(iommu_bus_base, PAGE_SIZE),
+				       dma_mask >> PAGE_SHIFT);
+
 	spin_lock_irqsave(&iommu_bitmap_lock, flags);
-	offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, next_bit,
+
+	if (limit <= next_bit) {
+		need_flush = 1;
+		next_bit = 0;
+	}
+
+	offset = iommu_area_alloc(iommu_gart_bitmap, limit, next_bit,
 				  size, base_index, boundary_size, align_mask);
-	if (offset == -1) {
+	if (offset == -1 && next_bit) {
 		need_flush = 1;
-		offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, 0,
+		offset = iommu_area_alloc(iommu_gart_bitmap, limit, 0,
 					  size, base_index, boundary_size,
 					  align_mask);
 	}
@@ -228,12 +239,14 @@ nonforced_iommu(struct device *dev, unsigned long addr, ...
From: Joerg Roedel
Date: Friday, September 12, 2008 - 7:52 am

You can calculate the dma_mask in this function from the dev parameter.
There is no need to pass it two levels down to this function extending

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: FUJITA Tomonori
Date: Friday, September 12, 2008 - 8:11 am

On Fri, 12 Sep 2008 16:52:27 +0200

No, we can't because we need to use dev->coherent_dma_mask for
alloc_coherent and dev->dma_mask for the rest.
--

From: Ingo Molnar
Date: Sunday, September 14, 2008 - 7:45 am

applied them to tip/x86/iommu:

bee44f2: x86: make GART to respect device's dma_mask about virtual mappings
589fc9a: iommu: add dma_get_mask helper function
eecfffc: iommu: add iommu_device_max_index IOMMU helper function

thanks!

	Ingo
--

From: Andi Kleen
Date: Monday, September 15, 2008 - 5:54 pm

Huh? That is what the need_iommu() logic in gart_map_sg()
does. An I'm not aware of any bugs in this area.

Did you actually see that failure in practice? I don't see
how it could happen.

-Andi
-- 
ak@linux.intel.com
--

From: FUJITA Tomonori
Date: Tuesday, September 16, 2008 - 6:20 am

On Tue, 16 Sep 2008 02:54:32 +0200

What the need_iommu() does is seeing if GART needs to do virtual
mappings or not.

(After need_iommu() checking) What this patchset does is to guarantee

No, I did not. This patchset does the right thing theoretically, I
think, but if such problem never happens for GART, I'll drop the patch
for GART. Joerg?
--

From: Andi Kleen
Date: Tuesday, September 16, 2008 - 6:43 am

Ah you care about masks < 32bit?

Those always are handled elsewhere in the block layer (using the bounce_pfn
mechanism) or in various other ways in other subsystems. e.g. on networking
the rule is that you just don't announce direct SG when you have 
less than 32bit mask. And the others like sound generally don't use 
map_sg()/map_single(), but instead pre allocate something with 
dma_alloc_coherent() or similar.

Also what would you do if it this check fails? There is no suitable
fallback path.

Can you describe a concrete use case your patch fixes? 

Anyways I'm aware the semantics are a little non untuitive (I didn't
invent them, but inherited them from other 64bit IOMMU implementations),
but fully general handling of < 32bit masks would be extremly complicated
because you would always need to have a full arbitary fallback
swiotlb implemention that is able to allocate arbitary low memory.

-Andi
--

From: FUJITA Tomonori
Date: Tuesday, September 16, 2008 - 10:13 am

On Tue, 16 Sep 2008 15:43:35 +0200


I don't think that the bounce guarantees that dma_alloc_coherent()

I'm not familiar with what the networking does, for example, seems
that b44 sets dev->dma_mask to DMA_30BIT_MASK and b44_start_xmit()
does:

mapping = ssb_dma_map_single(bp->sdev, skb->data, len, DMA_TO_DEVICE);
if (ssb_dma_mapping_error(bp->sdev, mapping) || mapping + len > DMA_30BIT_MASK) {
	struct sk_buff *bounce_skb;

	/* Chip can't handle DMA to/from >1GB, use bounce buffer */
	if (!ssb_dma_mapping_error(bp->sdev, mapping))
		ssb_dma_unmap_single(bp->sdev, mapping, len,
				     DMA_TO_DEVICE);

IOMMUs can try to return an address that the NIC can access to.
--

From: Andi Kleen
Date: Tuesday, September 16, 2008 - 10:58 am

dma_alloc_coherent() is not used for block IO data. And dma_alloc_coherent()

b44 (and related designs like the bcm wireless chipset) 

It's not worth to handle this strange case. The drivers do anyways.
These are very cheap devices which are only rarely used in systems
with >2GB and for those cases the existing bouncing setup works fine.

-Andi

-- 
ak@linux.intel.com
--

From: FUJITA Tomonori
Date: Tuesday, September 16, 2008 - 4:53 pm

On Tue, 16 Sep 2008 19:58:24 +0200

What do you mean? For example, some aacraid cards have 31bit dma
mask. What guarantees that IOMMUs's dma_alloc_coherent don't return a

I think that the patch is a pretty straightforward, it just the same
thing that IOMMUs with > 32bits virtual address space do. We can do
better with the simple patch. But I'm ok with dropping the patch for
GART since we can live without the patch, as you said.
--

From: Andi Kleen
Date: Tuesday, September 16, 2008 - 5:24 pm

At least the old IOMMU implementations (GART, non GART) handled this
by falling back to GFP_DMA. I haven't checked if that didn't get broken
in the recent reorganization, but if it got it should be fixed of course.
But hopefully it still works.

But that has nothing to do with _map_sg() anyways. The semantics here
are quite different, matching traditional i386 semantics. On i386
_alloc_coherent always did this logic, and map_sg/map_single didn't

The semantics are a little weird and non intuitive I agree. Perhaps it would 
make sense to document them properly? 

With the DMA allocator rework I did some time ago it would be actually
relatively straight forward to do. But to do it properly would
require doing it for 32bit x86 too, and frankly I don't think it's
worth doing there. Devices with such weird masks are on their way out
and after we survived this plague for years and finally things are 
getting better I don't think it's worth complicating basic infrastructure
for this.

-Andi
--

From: FUJITA Tomonori
Date: Wednesday, September 17, 2008 - 12:20 pm

On Wed, 17 Sep 2008 02:24:04 +0200

The falling back mechanism was moved to pci-nommu from the common code
since it doesn't work for other IOMMUs that always need virtual
mappings. Calgary needs this dma_mask trick too but I guess that it's
unlikely that the IBM servers with Calgary have weird hardware.
--

From: Andi Kleen
Date: Thursday, September 18, 2008 - 11:20 am

There's no fallback for _map_sg/_map_single. All the fallback to GFP
only works for coherent allocations, but not for streaming mappings.

To make this "fully robust" for masks < 32bit you would need to implement 
a new swiotlb that uses GFP_DMA allocations as fallback (or use the DMA 
allocator's swiotlb which can actually handle this)

So you're right now basically checking for something that you cannot
fix. And also you try to check for (but not handle) something that even 
32bit x86 doesn't handle. So if some driver relied on you checking
for it on 64bit it wouldn't work on 32bit x86 which would be a bad 

On a full IOMMU like calgary it's easier to do because you don't
 need to deal with GFP_DMA at least. But as you say it's unlikely
to be worth the effort on these systems.

Also the earlier problem that it wouldn't work on 32bit x86 and other
systems would make it also not too useful. Having drivers that only
work on Calgary wouldn't be a good thing.

-Andi

-- 
ak@linux.intel.com
--

From: FUJITA Tomonori
Date: Thursday, September 18, 2008 - 3:15 pm

On Thu, 18 Sep 2008 20:20:29 +0200

Yeah, so the falling back mechanism was moved to pci-nommu's

Do you mean if GART's alloc_coherent can't find a virtual address that
a device can access to, it should try GFP_DMA allocations as fallback?

GART could but why GART should do? If full IOMMUs' alloc_coherent
can't find a virtual address that a device can access to, it's
failure. No fallback is for them. Why can't GART use the same logic?
Yeah, GART is not a full IOMMU, so it can have a fallback for this

What does '32bit x86 doesn't handle' mean? pci-nommu's alloc_coherent
can handle < 32bit bit mask in the fallback path.

Or you are talking about '_map_sg/_map_single'? If so, as we
discussed, < 32bit bit mask can be handled in else where. The patch
just tries to return an address that such tricks are not necessary
with.
--

From: Andi Kleen
Date: Thursday, September 18, 2008 - 5:44 pm

Sure, but that doesn't help for map_single/map_sg. The two cases are

It used to at least, that is how I wrote it. That is it did all GFP_DMA,

The GART is somewhere in the 4GB range so you cannot use it to 
map anything < 4GB.

Also GART is pretty small (and it's not a isolating) IOMMU so 
if you can get direct memory allocation that fits you should 

GART uses the same logic, but only for alloc_cohernet, not for

Because GART cannot remap to addresses < 4GB reliably.

The big difference to the other IOMMUs is that it's only a tiny memory
hole somewhere near the 4GB boundary, not a full remapper of the full 

Yes it does, just map_sg/map_single doesn't.  And your patch changed

I don't hink it can, unless you want to write another swiotlb using
GFP_DMA (or use the dma allocator). That is because the swiotlb
has the same limitation as GART. It cannot reliably remap to < 4GB.

-Andi
--

From: FUJITA Tomonori
Date: Monday, September 22, 2008 - 12:12 pm

On Fri, 19 Sep 2008 02:44:31 +0200


Ok, after you told me that GART cannot remap to addresses < 4GB
reliably, I understand that GART's alloc_coherent needs to work in the
old way.

I'll take care of it. Well, I need to take care of SWIOTLB about this

Sure, pci-nommu's map_sg/map_single doesn't handle it. But we handle
this issue somewhere else (like b44 keeps own DMA buffer)?

If I understand your logic correctly,

1. not all map_sg/map_single (e.g. pci-nommu) can't handle it.
2. we already have workarounds for it somewhere else so
map_sg/map_single don't need to handle it.
3. I changed GART map_sg/map_signle to handle it. I thought if it
can handle it, for example, b44 doesn't go the workaround path. It
would be a good thing.
4. But GART cannot remap to addresses < 4GB reliably, so my above
argument doesn't always work.
5. Then my patch doesn't break anything but it's almost meaningless.


Again GART cannot remap to addresses < 4GB reliably, then I have to

As I wrote above, 'else where' meant to something like b44's own DMA
hack.
--

From: Andi Kleen
Date: Monday, September 22, 2008 - 1:35 pm

Yes, b44 handles it on its own. It has to for 32bit which always
has a nop map_sg/single. Also some other subsystems like the block layer


Correct.

-Andi
-- 
ak@linux.intel.com
--

From: FUJITA Tomonori
Date: Monday, September 22, 2008 - 9:02 pm

On Mon, 22 Sep 2008 22:35:18 +0200


Thanks for the clarification.
--

From: Ingo Molnar
Date: Wednesday, September 17, 2008 - 3:43 am

no - any extra layer of robustness is good in such a critical piece of 
code. Perhaps also emit a one-time printk if you really consider the use 
as a potential sign of bugs, but dont remove robustness.

	Ingo
--

From: Andi Kleen
Date: Thursday, September 18, 2008 - 11:25 am

> no - any extra layer of robustness is good in such a critical piece of 

robustness would require to handle it everywhere (as in always fall
back to 24bit DMA bouncing if someone tries a map_* on a < 32bit mapping), 
but this patch doesn't do that at all.

-Andi
-- 
ak@linux.intel.com
--

From: Joerg Roedel
Date: Tuesday, September 16, 2008 - 8:52 am

I am not aware of any failures which are fixed by these patches. But in
theory there could be failures.

Joerg

-- 
           |           AMD Saxony Limited Liability Company & Co. KG
 Operating |         Wilschdorfer Landstr. 101, 01109 Dresden, Germany
 System    |                  Register Court Dresden: HRA 4896
 Research  |              General Partner authorized to represent:
 Center    |             AMD Saxony LLC (Wilmington, Delaware, US)
           | General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy

--

From: Andi Kleen
Date: Tuesday, September 16, 2008 - 9:20 am

AFAIK all subsystems deal with it on their own. That is because i386
is the same (no remapping pci_map_* at all) and subsystems are usually 

They will stay failures because GFP_DMA bouncing can not be really
done today in the pci_map_* layer. With a lot of effort you could
probably fix all that, but I doubt it would be worth the effort for
the few devices left with DMA masks < 32bit. 


-- 
ak@linux.intel.com
--

Previous thread: none

Next thread: [PATCH] Fixup deb-pkg target to generate separate firmware deb. by Jonathan McDowell on Friday, September 12, 2008 - 4:20 am. (5 messages)