On Fri, Feb 29, 2008 at 03:18:41PM -0800, mark gross wrote:
I didn't see the original patch - thanks for cc'ing linux-pci.
PA-RISC and IA64 have been doing this for several years now.
See DELAYED_RESOURCE_CNT usage in drivers/parisc/sba_iommu.c
and in arch/ia64/hp/common/sba_iommu.c.
I prototyped the same concept for x86_64 GART support but it didn't
seem to matter on benchmarks since most of the devices I use are
64-bit and don't need to use the IOMMU. IDE/SATA controllers
are 32-bit but there is LOTS of overhead in so many other places,
this change made less than 0.3 difference for them.
(And the decision to use uncached memory for the IO Pdir made this moot).
I'd be happy to post the patch if someone wants to see it though.
One can do a few things to limit how much the protections
are weakened:
1) Invalidate the IO TLB entry and IO Pdir entries (just don't force
synchronization). At least this was possible on the IOMMUs
I'm familiar with.
2) Release the IO Pdir entry for re-use once synchronization has been forced.
Synchronization/flushing of the IO TLB shoot-down is the most expensive
op AFAICT on the IOMMUs I've worked with.
I don't ever recall finding a bug where the device was DMA'ing to a
buffer again shortly after the unmap call. Most DMA errors come from
device drivers not programming the DMA engines correctly.
I suggest adding an "unlikely()" around this test so the compiler
can generate better code for this and it's clear what your intent is.
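A minimal userspace sketch of what I mean by the unlikely() hint. The names
queue_unmap(), struct deferred_state and HIGH_WATERMARK are hypothetical and
are not taken from the patch; unlikely() is redefined here on top of the
GCC/Clang builtin only so the fragment compiles outside the kernel.

```c
/* Userspace stand-in for the kernel's unlikely() (GCC/Clang builtin). */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical threshold, matching the 250 discussed in this thread. */
#define HIGH_WATERMARK 250

/* Hypothetical bookkeeping: unmaps queued since the last flush. */
struct deferred_state {
    unsigned int pending;
    unsigned int flushes;
};

/* Queue one deferred unmap; returns 1 if this call triggered a flush. */
static int queue_unmap(struct deferred_state *s)
{
    if (unlikely(++s->pending >= HIGH_WATERMARK)) {
        /* Rare path: one flush per HIGH_WATERMARK calls. */
        s->pending = 0;
        s->flushes++;
        return 1;
    }
    return 0; /* Common path: bookkeeping only. */
}
```

The hint tells the compiler to lay out the common path as the fall-through
case, and it documents for the reader that the flush is the exception.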
I just looked at the implementation of flush_unmaps().
I strongly recommend comparing this implementation with the
DELAYED_RESOURCE_CNT since this looks like it will need 2X or more
of the CPU cycles to execute for each entry. Use of "list_del()"
is substantially more expensive than a simple linear array.
(Fewer entries per cacheline by 4X maybe?)
Another problem is every now and then some IO is going to burn a bunch
of CPU cycles and introduce inconsistent performance for particular
unmap operations. One has to trade off amortizing the cost of
IOMMU flushing against the number of "unusable" IO Pdir entries and more
consistent CPU utilization for each unmap() call. My gut feeling
is 250 is rather high for high_watermark unless flushing the IO TLB
is extremely expensive.
...
I prefer a compile time constant since we are talking about fixed costs
for this implementation. The compiler can do a better job with a constant.
I know this sounds nitpicky but I can't sufficiently emphasize how
perf critical DMA map/unmap code is.
If it has to be a variable, I prefer sysfs but don't have a good reason
for that.
Ah! Maybe you can get together with Matthew Wilcox (also) and consider
how the IOMMU code might be included in the scsi_ram driver he wrote.
Or maybe just directly use the ram disk (rd.c) instead.
At LSF, I suggested using the IOAT DMA engine (if available) to do real DMA.
Running the IO requests through the IOMMU would expose the added CPU cost
of the IOMMU in its worst case. This doesn't need to go to kernel.org
but might help you isolate perf bottlenecks in this iommu code.
...
Why bother with the timer?
This just adds more overhead and won't help much to improve
protection. If someone needs tighter protection, the "strict"
unmapping/flushing parameter should be sufficient to track
down the issues they might be seeing. And it's perfectly OK
to be sitting on a few "unusable" IOMMU entries.
...
When we hit a highwater mark, would it make sense to only
flush the iommu associated with the device in question?
I'm trying to limit the amount of time spent in any single
call to flush_unmaps(). If iommu_flush_iotlb_global() is really
fast (ie not 1000s of cycles), then this might be still ok.
But it would certainly make more sense to only flush the
iommu associated with the IO device calling unmap.
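A rough sketch of the per-IOMMU idea, assuming each unit keeps its own
pending count. Everything here (struct iommu_unit, flush_one(),
deferred_unmap(), NR_IOMMUS, UNIT_HIGH_WATERMARK) is hypothetical and only
illustrates the bookkeeping; it is not the intel-iommu code and the hardware
flush is a stub.

```c
#define NR_IOMMUS 4               /* hypothetical number of IOMMU units */
#define UNIT_HIGH_WATERMARK 250   /* hypothetical per-unit threshold */

struct iommu_unit {
    unsigned int pending;   /* unmaps deferred on this unit */
    unsigned int flushes;   /* times this unit's IO TLB was flushed */
};

static struct iommu_unit units[NR_IOMMUS];

/* Stub standing in for the hardware flush of one unit's IO TLB. */
static void flush_one(struct iommu_unit *u)
{
    u->flushes++;
    u->pending = 0;
}

/* Defer an unmap on the unit that owns the device doing the unmap. */
static void deferred_unmap(unsigned int unit_idx)
{
    struct iommu_unit *u = &units[unit_idx];

    /* Only the calling device's unit is flushed; the others are untouched. */
    if (++u->pending >= UNIT_HIGH_WATERMARK)
        flush_one(u);
}
```

With this shape the worst-case cost of a single unmap is one unit's flush
rather than a walk over every IOMMU in the system.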
Using a linear array would be a lot more efficient than list_del().
One could increment a local index (we are holding the spinlock) and
not touch the data again. See code snippet from sba_iommu.c below.
...
I'd like to compare the above with code from parisc sba_iommu.c
which does the same thing:
...
d = &(ioc->saved[ioc->saved_cnt]);
d->iova = iova;
d->size = size;
if (++(ioc->saved_cnt) >= DELAYED_RESOURCE_CNT) {
...
(plus spinlock acquire/release)
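For anyone without the parisc source handy, here is a self-contained
userspace sketch of the same pattern. It is not the sba_iommu.c code:
struct ioc_sketch, release_range(), flush_iotlb() and delayed_unmap() are
hypothetical stand-ins, the value of DELAYED_RESOURCE_CNT is illustrative
only, and the spinlock is left out.

```c
#define DELAYED_RESOURCE_CNT 16   /* illustrative value, not from the kernel */

struct delayed_entry {
    unsigned long iova;
    unsigned long size;
};

struct ioc_sketch {
    struct delayed_entry saved[DELAYED_RESOURCE_CNT]; /* simple linear array */
    int saved_cnt;
    unsigned long released;   /* ranges handed back for re-use */
    unsigned long flushes;    /* forced synchronizations */
};

/* Stub: hand an IO Pdir range back to the allocator. */
static void release_range(struct ioc_sketch *ioc,
                          unsigned long iova, unsigned long size)
{
    (void)iova;
    (void)size;
    ioc->released++;
}

/* Stub: force synchronization of the IO TLB. */
static void flush_iotlb(struct ioc_sketch *ioc)
{
    ioc->flushes++;
}

/* Record the unmap; pay for one flush only when the array fills up. */
static void delayed_unmap(struct ioc_sketch *ioc,
                          unsigned long iova, unsigned long size)
{
    struct delayed_entry *d = &ioc->saved[ioc->saved_cnt];

    d->iova = iova;
    d->size = size;
    if (++ioc->saved_cnt >= DELAYED_RESOURCE_CNT) {
        int cnt = ioc->saved_cnt;

        flush_iotlb(ioc);   /* synchronize first ... */
        for (int i = 0; i < cnt; i++)   /* ... then release for re-use */
            release_range(ioc, ioc->saved[i].iova, ioc->saved[i].size);
        ioc->saved_cnt = 0;
    }
}
```

The common path is two stores and an increment on consecutive array slots,
and the threshold is a compile-time constant.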
map/unmap_single is called _a lot_. It's probably called more often
than the low level CPU interrupt handler on a busy NIC.
Removing stats gathering from the SBA code yielded an easily
measurable difference in perf. I don't recall exactly what it was.
Apologies for not quantifying the difference when I made the change:
http://lists.parisc-linux.org/pipermail/parisc-linux-cvs/2004-January/033739.html
Can we have a more descriptive name than "list"?
Why isn't this declared "struct dmar_domain *"?
I'm looking at this usage in the patch:
+ __free_iova(&((struct dmar_domain *)node->dmar)->iovad, node);
Given this is all for intel IOMMUs, I expect a struct could be visible.
But I might be overlooking some higher design here.
thanks and I hope the above helps,
grant
--