Currently, gart IOMMU ignores device's dma_mask when it does virtual mappings. So it could give a device a virtual address that the device can't access to. Some IOMMUs, in x86 IOMMUs GART and Calgary, have this issue. This patchset fixes GART. The first and second patches add helper functions, useful for some IOMMUs (both are taken from POWER IOMMU code). I'll convert some IOMMUs to use the functions after they are merged into mainline (to avoid complicated dependence on multiple trees). This is against tip/x86/iommu plus a patch that I sent yesterday: http://lkml.org/lkml/2008/9/11/147. --
This function helps IOMMUs to know the highest address that a device
can access to.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
include/linux/iommu-helper.h | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/include/linux/iommu-helper.h b/include/linux/iommu-helper.h
index c975caf..58f4110 100644
--- a/include/linux/iommu-helper.h
+++ b/include/linux/iommu-helper.h
@@ -1,3 +1,13 @@
+static inline unsigned long iommu_device_max_index(unsigned long size,
+ unsigned long offset,
+ u64 dma_mask)
+{
+ if (size + offset > dma_mask)
+ return dma_mask - offset + 1;
+ else
+ return size;
+}
+
extern int iommu_is_span_boundary(unsigned int index, unsigned int nr,
unsigned long shift,
unsigned long boundary_size);
--
1.5.5.GIT
--
Several IOMMUs do the same thing to get the dma_mask of a device. This
adds a helper function to do the same thing to sweep them.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
include/linux/dma-mapping.h | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/include/linux/dma-mapping.h b/include/linux/dma-mapping.h
index 6ed50c1..0dba743 100644
--- a/include/linux/dma-mapping.h
+++ b/include/linux/dma-mapping.h
@@ -63,6 +63,13 @@ static inline int is_buffer_dma_capable(u64 mask, dma_addr_t addr, size_t size)
#define dma_sync_single dma_sync_single_for_cpu
#define dma_sync_sg dma_sync_sg_for_cpu
+static inline u64 dma_get_mask(struct device *dev)
+{
+ if (dev->dma_mask && *dev->dma_mask)
+ return *dev->dma_mask;
+ return DMA_32BIT_MASK;
+}
+
extern u64 dma_get_required_mask(struct device *dev);
static inline unsigned int dma_get_max_seg_size(struct device *dev)
--
1.5.5.GIT
--
Currently, GART IOMMU ingores device's dma_mask when it does virtual
mappings. So it could give a device a virtual address that the device
can't access to.
This patch fixes the above problem.
Signed-off-by: FUJITA Tomonori <fujita.tomonori@lab.ntt.co.jp>
---
arch/x86/kernel/pci-gart_64.c | 39 ++++++++++++++++++++++++++++-----------
1 files changed, 28 insertions(+), 11 deletions(-)
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
index 47abe43..9739d56 100644
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -83,23 +83,34 @@ static unsigned long next_bit; /* protected by iommu_bitmap_lock */
static int need_flush; /* global flush state. set for each gart wrap */
static unsigned long alloc_iommu(struct device *dev, int size,
- unsigned long align_mask)
+ unsigned long align_mask, u64 dma_mask)
{
unsigned long offset, flags;
unsigned long boundary_size;
unsigned long base_index;
+ unsigned long limit;
base_index = ALIGN(iommu_bus_base & dma_get_seg_boundary(dev),
PAGE_SIZE) >> PAGE_SHIFT;
boundary_size = ALIGN((unsigned long long)dma_get_seg_boundary(dev) + 1,
PAGE_SIZE) >> PAGE_SHIFT;
+ limit = iommu_device_max_index(iommu_pages,
+ DIV_ROUND_UP(iommu_bus_base, PAGE_SIZE),
+ dma_mask >> PAGE_SHIFT);
+
spin_lock_irqsave(&iommu_bitmap_lock, flags);
- offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, next_bit,
+
+ if (limit <= next_bit) {
+ need_flush = 1;
+ next_bit = 0;
+ }
+
+ offset = iommu_area_alloc(iommu_gart_bitmap, limit, next_bit,
size, base_index, boundary_size, align_mask);
- if (offset == -1) {
+ if (offset == -1 && next_bit) {
need_flush = 1;
- offset = iommu_area_alloc(iommu_gart_bitmap, iommu_pages, 0,
+ offset = iommu_area_alloc(iommu_gart_bitmap, limit, 0,
size, base_index, boundary_size,
align_mask);
}
@@ -228,12 +239,14 @@ nonforced_iommu(struct device *dev, unsigned long addr, ...You can calculate the dma_mask in this function from the dev parameter.
There is no need to pass it two levels down to this function extending
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
On Fri, 12 Sep 2008 16:52:27 +0200 No, we can't because we need to use dev->coherent_dma_mask for alloc_coherent and dev->dma_mask for the rest. --
applied them to tip/x86/iommu: bee44f2: x86: make GART to respect device's dma_mask about virtual mappings 589fc9a: iommu: add dma_get_mask helper function eecfffc: iommu: add iommu_device_max_index IOMMU helper function thanks! Ingo --
Huh? That is what the need_iommu() logic in gart_map_sg() does. An I'm not aware of any bugs in this area. Did you actually see that failure in practice? I don't see how it could happen. -Andi -- ak@linux.intel.com --
On Tue, 16 Sep 2008 02:54:32 +0200 What the need_iommu() does is seeing if GART needs to do virtual mappings or not. (After need_iommu() checking) What this patchset does is to guarantee No, I did not. This patchset does the right thing theoretically, I think, but if such problem never happens for GART, I'll drop the patch for GART. Joerg? --
Ah you care about masks < 32bit? Those always are handled elsewhere in the block layer (using the bounce_pfn mechanism) or in various other ways in other subsystems. e.g. on networking the rule is that you just don't announce direct SG when you have less than 32bit mask. And the others like sound generally don't use map_sg()/map_single(), but instead pre allocate something with dma_alloc_coherent() or similar. Also what would you do if it this check fails? There is no suitable fallback path. Can you describe a concrete use case your patch fixes? Anyways I'm aware the semantics are a little non untuitive (I didn't invent them, but inherited them from other 64bit IOMMU implementations), but fully general handling of < 32bit masks would be extremly complicated because you would always need to have a full arbitary fallback swiotlb implemention that is able to allocate arbitary low memory. -Andi --
On Tue, 16 Sep 2008 15:43:35 +0200
I don't think that the bounce guarantees that dma_alloc_coherent()
I'm not familiar with what the networking does, for example, seems
that b44 sets dev->dma_mask to DMA_30BIT_MASK and b44_start_xmit()
does:
mapping = ssb_dma_map_single(bp->sdev, skb->data, len, DMA_TO_DEVICE);
if (ssb_dma_mapping_error(bp->sdev, mapping) || mapping + len > DMA_30BIT_MASK) {
struct sk_buff *bounce_skb;
/* Chip can't handle DMA to/from >1GB, use bounce buffer */
if (!ssb_dma_mapping_error(bp->sdev, mapping))
ssb_dma_unmap_single(bp->sdev, mapping, len,
DMA_TO_DEVICE);
IOMMUs can try to return an address that the NIC can access to.
--
dma_alloc_coherent() is not used for block IO data. And dma_alloc_coherent() b44 (and related designs like the bcm wireless chipset) It's not worth to handle this strange case. The drivers do anyways. These are very cheap devices which are only rarely used in systems with >2GB and for those cases the existing bouncing setup works fine. -Andi -- ak@linux.intel.com --
On Tue, 16 Sep 2008 19:58:24 +0200 What do you mean? For example, some aacraid cards have 31bit dma mask. What guarantees that IOMMUs's dma_alloc_coherent don't return a I think that the patch is a pretty straightforward, it just the same thing that IOMMUs with > 32bits virtual address space do. We can do better with the simple patch. But I'm ok with dropping the patch for GART since we can live without the patch, as you said. --
At least the old IOMMU implementations (GART, non GART) handled this by falling back to GFP_DMA. I haven't checked if that didn't get broken in the recent reorganization, but if it got it should be fixed of course. But hopefully it still works. But that has nothing to do with _map_sg() anyways. The semantics here are quite different, matching traditional i386 semantics. On i386 _alloc_coherent always did this logic, and map_sg/map_single didn't The semantics are a little weird and non intuitive I agree. Perhaps it would make sense to document them properly? With the DMA allocator rework I did some time ago it would be actually relatively straight forward to do. But to do it properly would require doing it for 32bit x86 too, and frankly I don't think it's worth doing there. Devices with such weird masks are on their way out and after we survived this plague for years and finally things are getting better I don't think it's worth complicating basic infrastructure for this. -Andi --
On Wed, 17 Sep 2008 02:24:04 +0200 The falling back mechanism was moved to pci-nommu from the common code since it doesn't work for other IOMMUs that always need virtual mappings. Calgary needs this dma_mask trick too but I guess that it's unlikely that the IBM servers with Calgary have weird hardware. --
There's no fallback for _map_sg/_map_single. All the fallback to GFP only works for coherent allocations, but not for streaming mappings. To make this "fully robust" for masks < 32bit you would need to implement a new swiotlb that uses GFP_DMA allocations as fallback (or use the DMA allocator's swiotlb which can actually handle this) So you're right now basically checking for something that you cannot fix. And also you try to check for (but not handle) something that even 32bit x86 doesn't handle. So if some driver relied on you checking for it on 64bit it wouldn't work on 32bit x86 which would be a bad On a full IOMMU like calgary it's easier to do because you don't need to deal with GFP_DMA at least. But as you say it's unlikely to be worth the effort on these systems. Also the earlier problem that it wouldn't work on 32bit x86 and other systems would make it also not too useful. Having drivers that only work on Calgary wouldn't be a good thing. -Andi -- ak@linux.intel.com --
On Thu, 18 Sep 2008 20:20:29 +0200 Yeah, so the falling back mechanism was moved to pci-nommu's Do you mean if GART's alloc_coherent can't find a virtual address that a device can access to, it should try GFP_DMA allocations as fallback? GART could but why GART should do? If full IOMMUs' alloc_coherent can't find a virtual address that a device can access to, it's failure. No fallback is for them. Why can't GART use the same logic? Yeah, GART is not a full IOMMU, so it can have a fallback for this What does '32bit x86 doesn't handle' mean? pci-nommu's alloc_coherent can handle < 32bit bit mask in the fallback path. Or you are talking about '_map_sg/_map_single'? If so, as we discussed, < 32bit bit mask can be handled in else where. The patch just tries to return an address that such tricks are not necessary with. --
Sure, but that doesn't help for map_single/map_sg. The two cases are It used to at least, that is how I wrote it. That is it did all GFP_DMA, The GART is somewhere in the 4GB range so you cannot use it to map anything < 4GB. Also GART is pretty small (and it's not a isolating) IOMMU so if you can get direct memory allocation that fits you should GART uses the same logic, but only for alloc_cohernet, not for Because GART cannot remap to addresses < 4GB reliably. The big difference to the other IOMMUs is that it's only a tiny memory hole somewhere near the 4GB boundary, not a full remapper of the full Yes it does, just map_sg/map_single doesn't. And your patch changed I don't hink it can, unless you want to write another swiotlb using GFP_DMA (or use the dma allocator). That is because the swiotlb has the same limitation as GART. It cannot reliably remap to < 4GB. -Andi --
On Fri, 19 Sep 2008 02:44:31 +0200 Ok, after you told me that GART cannot remap to addresses < 4GB reliably, I understand that GART's alloc_coherent needs to work in the old way. I'll take care of it. Well, I need to take care of SWIOTLB about this Sure, pci-nommu's map_sg/map_single doesn't handle it. But we handle this issue somewhere else (like b44 keeps own DMA buffer)? If I understand your logic correctly, 1. not all map_sg/map_single (e.g. pci-nommu) can't handle it. 2. we already have workarounds for it somewhere else so map_sg/map_single don't need to handle it. 3. I changed GART map_sg/map_signle to handle it. I thought if it can handle it, for example, b44 doesn't go the workaround path. It would be a good thing. 4. But GART cannot remap to addresses < 4GB reliably, so my above argument doesn't always work. 5. Then my patch doesn't break anything but it's almost meaningless. Again GART cannot remap to addresses < 4GB reliably, then I have to As I wrote above, 'else where' meant to something like b44's own DMA hack. --
Yes, b44 handles it on its own. It has to for 32bit which always has a nop map_sg/single. Also some other subsystems like the block layer Correct. -Andi -- ak@linux.intel.com --
On Mon, 22 Sep 2008 22:35:18 +0200 Thanks for the clarification. --
no - any extra layer of robustness is good in such a critical piece of code. Perhaps also emit a one-time printk if you really consider the use as a potential sign of bugs, but dont remove robustness. Ingo --
> no - any extra layer of robustness is good in such a critical piece of robustness would require to handle it everywhere (as in always fall back to 24bit DMA bouncing if someone tries a map_* on a < 32bit mapping), but this patch doesn't do that at all. -Andi -- ak@linux.intel.com --
I am not aware of any failures which are fixed by these patches. But in
theory there could be failures.
Joerg
--
| AMD Saxony Limited Liability Company & Co. KG
Operating | Wilschdorfer Landstr. 101, 01109 Dresden, Germany
System | Register Court Dresden: HRA 4896
Research | General Partner authorized to represent:
Center | AMD Saxony LLC (Wilmington, Delaware, US)
| General Manager of AMD Saxony LLC: Dr. Hans-R. Deppe, Thomas McCoy
--
AFAIK all subsystems deal with it on their own. That is because i386 is the same (no remapping pci_map_* at all) and subsystems are usually They will stay failures because GFP_DMA bouncing can not be really done today in the pci_map_* layer. With a lot of effort you could probably fix all that, but I doubt it would be worth the effort for the few devices left with DMA masks < 32bit. -- ak@linux.intel.com --
