Re: [PATCH]iommu-iotlb-flushing

Previous thread: Questions on rtc drivers... by J.A. on Wednesday, February 20, 2008 - 5:07 pm. (1 message)

Next thread: [PATCH]iova-lockdep-false-alarm-fix. by mark gross on Wednesday, February 20, 2008 - 5:35 pm. (3 messages)
From: mark gross
Date: Wednesday, February 20, 2008 - 5:06 pm

The following patch is for batching up the flushing of the IOTLB for
the DMAR implementation found in the Intel VT-d hardware.  It works by
building a list of to be flushed IOTLB entries and a bitmap list of
which DMAR engine they are from.

After either a high water mark (250 accessible via debugfs) or 10ms the
list of iova's will be reclaimed and the DMAR engines associated are
IOTLB-flushed.

This approach recovers 15 to 20% of the performance lost when using the
IOMMU for my netperf udp stream benchmark with small packets.  It can be
disabled with a kernel boot parameter
"intel_iommu=strict".

Its use does weaken the IOMMU protections a bit.

I would like to see this go into MM for a while and then onto mainline.

Thanks,


Signed-off-by: <mgross@linux.intel.com>


Index: linux-2.6.24-mm1/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.24-mm1.orig/drivers/pci/intel-iommu.c	2008-02-12 07:12:06.000000000 -0800
+++ linux-2.6.24-mm1/drivers/pci/intel-iommu.c	2008-02-12 11:35:42.000000000 -0800
@@ -21,6 +21,7 @@
 
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/debugfs.h>
 #include <linux/slab.h>
 #include <linux/irq.h>
 #include <linux/interrupt.h>
@@ -30,6 +31,7 @@
 #include <linux/dmar.h>
 #include <linux/dma-mapping.h>
 #include <linux/mempool.h>
+#include <linux/timer.h>
 #include "iova.h"
 #include "intel-iommu.h"
 #include <asm/proto.h> /* force_iommu in this header in x86-64*/
@@ -50,11 +52,39 @@
 
 #define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
 
+
+static void flush_unmaps_timeout(unsigned long data);
+
+static struct timer_list unmap_timer =
+	TIMER_INITIALIZER(flush_unmaps_timeout, 0, 0);
+
+struct unmap_list {
+	struct list_head list;
+	struct dmar_domain *domain;
+	struct iova *iova;
+};
+
+static struct intel_iommu *g_iommus;
+/* bitmap for indexing intel_iommus */
+static unsigned long 	*g_iommus_to_flush;
+static int g_num_of_iommus;
+
+static ...
From: Andrew Morton
Date: Saturday, February 23, 2008 - 1:05 am

No, this modifies unmap_timer.expires twice.  Might be racy too.  You want


I see timers being added, but I see no del_timer_sync()s added on cleanup

boot-time options suck.  Is it not possible to tweak this at runtime?
--

From: mark gross
Date: Monday, February 25, 2008 - 9:28 am

I'll double check if its needed, but it wouldn't hurt.  This code is on


Once you set up the IOMMU's you never take them down or re-set them up.


Not very.  But, its better than blocking on the hardware poll IOTLB
flush operation on every unmap.  Is there a lock less way to insert
into this list?  I suppose I could have one lock per DMAR-engine but,
that would still have the scalability issue.  (perhaps a list per IOVA?
/me needs to think on this a bit)

The best way is to get the network stack to reuse dma buffers when using
an iommu and avoid the unmap operation altogether.  But thats a longer


This code doesn't really tear down well.  However; at this point in

Yes they do.  There may be a way to enable / disable this behavior at
runtime.  Let me think on it a bit.

Thank you for looking at this!

--mgross

--

From: Andrew Morton
Date: Monday, February 25, 2008 - 11:40 am

We normally put a space after "/*" and before "*/"
--

From: mark gross
Date: Friday, February 29, 2008 - 4:18 pm

Yup, not needed.  Only the single threaded __init path increments this

I just looked, I don't think bitmap.h helps me with my bitmap use in



Its not very, but I'm doing an insert walk of a list and need the locks.
On the bright side, its way better than spin locking and poling the


The only error path to delete the single timer created is at the bottom
of intel_iommu_init.  I don't think there is a place for the timer to be

Yes, I could easily create a sysfs or debugfs mechanism for turning it
on / off at run time.  I would like input on the preferred way to do this.
sysfs or debugfs?


Thanks again for your review on this.  The following is the updated
patch.

--mgross

Signed-off-by: <mgross@linux.intel.com>



Index: linux-2.6.25-rc2-mm1/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.25-rc2-mm1.orig/drivers/pci/intel-iommu.c	2008-02-26 11:15:46.000000000 -0800
+++ linux-2.6.25-rc2-mm1/drivers/pci/intel-iommu.c	2008-02-29 14:36:55.000000000 -0800
@@ -21,6 +21,7 @@
 
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/debugfs.h>
 #include <linux/slab.h>
 #include <linux/irq.h>
 #include <linux/interrupt.h>
@@ -30,6 +31,7 @@
 #include <linux/dmar.h>
 #include <linux/dma-mapping.h>
 #include <linux/mempool.h>
+#include <linux/timer.h>
 #include "iova.h"
 #include "intel-iommu.h"
 #include <asm/proto.h> /* force_iommu in this header in x86-64*/
@@ -50,11 +52,32 @@
 
 #define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
 
+
+static void flush_unmaps_timeout(unsigned long data);
+
+DEFINE_TIMER(unmap_timer,  flush_unmaps_timeout, 0, 0);
+
+static struct intel_iommu *g_iommus;
+/* bitmap for indexing intel_iommus */
+static unsigned long 	*g_iommus_to_flush;
+static int g_num_of_iommus;
+
+static DEFINE_SPINLOCK(async_umap_flush_lock);
+static LIST_HEAD(unmaps_to_do);
+
+static int timer_on;
+static long list_size;
+static int high_watermark;
+
+static ...
From: Greg KH
Date: Friday, February 29, 2008 - 10:54 pm

Which is it, a debugging option, or something that anyone would want to
change?  If debugging, make it a debugfs file.  Otherwise sysfs if fine,
but be sure to add a Documentation/ABI/ file for it.

thanks,

greg k-h
--

From: Grant Grundler
Date: Saturday, March 1, 2008 - 12:10 am

I didn't see the original patch - thanks for cc'ing linux-pci.

PA-RISC and IA64 have been doing this for several years now.
See DELAYED_RESOURCE_CNT usage in drivers/parisc/sba_iommu.c
and in arch/ia64/hp/common/sba_iommu.c.

I prototyped the same concept for x86_64 GART support but it didn't
seem to matter on benchmarks since most of the devices I use are
64-bit and don't need to use the IOMMU. IDE/SATA controllers 
are 32-bit but there is LOTS of overhead in so many other places,
this change made less than 0.3 difference for them.
(And the decision to use uncached memory for the IO Pdir made this moot).

One can do a few things to limit how much the protections
are weakened:
1) Invalidate the IO TLB entry and IO Pdir entries (just don't force
  syncronization). At least this was possible on the IOMMU's
  I'm familiar with.

2) release the IO Pdir entry for re-use once syncronization has been forced.
   Syncronization/flushing of the IO TLB shoot-down is the most expensive
   ops AFAICT on the IOMMU's I've worked with.

I don't ever recall finding a bug where the device was DMA'ing to a
buffer again shortly after unmap call. Most DMA errors are device driver

I suggest adding an "unlikely()" around this test so the compiler
can generate better code for this and it's clear what your intent is.


I just looked at the implementation of flush_unmaps().
I strongly reccomend comparing this implementation with the
DELAYED_RESOURCE_CNT since this looks like it will need 2X or more
of the CPU cycles to execute for each entry. Use of "list_del()"
is substantially more expensive than a simple linear arrary.
(Fewer entries per cacheline by 4X maybe?)

Another problem is every now and then some IO is going to burn a bunch
of CPU cycles and introduce inconsistent performance for particular
unmap operations. One has to tradeoff amortizing the cost of the
IOMMU flushing with the number of "unusable" IO Pdir entries and more
consistent CPU utilization for each unmap() call.  My gut ...
From: mark gross
Date: Monday, March 3, 2008 - 11:34 am

The intel DMAR engines with the TLB's do have a measurable overhead
flushing TLB entries on every unmapsingle.  This hardware forces a bit
poll on TLB flush completion that adds up on high rates of unmap
operations.  (about 15% on my netperf small packet 32 byte UDP stream
over a 1GigE adapter)

FWIW: I think IOMMU use for avoiding bounce buffers is a thing of the
past.  (at least I have a "show me where it helps" point of view for
modern hardware.  what new hardware needs bounce buffers anymore?)

Also, to be fair to the Intel hardware the performance overhead is
really only noticeable on small packed and 10Gig benchmarking.  Most

The default is the TLB invalidation. It has a nasty polling loop that




/me looks, an array of pointers to iova's to clean up makes good sense.
I don't know why I didn't think of this approach first.  Working with

the 250 high water mark came from 1gig small packet netperf udp
streaming testing.  The 1000 I initially used didn't perform quite as

Compile time options lock out OSV's from using the feature.  These
IOMMU's are in desktop systems now.  My job is to make it not suck so
OSV's will leave DMAR enabled in production kernels.  (well at least

Yup, Willi has just started to look over my shoulder on some of this.
He pointed me at some OLS papers and whatnot to look at.  I botched my
chance to work face to face with him last week on his last visit to

Oh the worst cases are easy.  Just run a 10gig NIC and see what you get.
With a 1Gig NIC you need to drop the packet size down to 32 bytes or
something tiny like that to really notice it.  16K packets
have only a smallish overhead that are iffy as benchmarks because the
CPU utilization is too low (<25%) to be a good measure.

FWIW : throughput isn't effected so much as the CPU overhead.
iommu=strict: 16K UDP UNIDIRECTIONAL SEND TEST 826Mbps at 25% cpu 
with this patch: 16K UDP UNIDIRECTIONAL SEND TEST 826Mbps at 18% cpu 

/me thinks about this.  You have a point.  I put in the timer ...
From: Grant Grundler
Date: Wednesday, March 5, 2008 - 11:23 am

So if we flushed the IOTLB every 16 unmap calls, it should be < 1% penalty?

Using that logic is how I decided the current calue of DELAYED_RESOURCE_CNT


Otherwise, Anything running in a legacy IDE mode is using 32-bit DMA.
So linux  still has plenty of SATA drivers using 32-bit DMA.

All the NICs, FC, Infiniband, et al are PCI-e and thus by definition

*nod* - I know. That why I use pktgen to measure the dma map/unmap overhead.
Note that with solid state storage (should be more visible in the next year
or so), the transaction rate is going to look alot more like a NIC than
the traditional HBA storage controller. So map/unmap performance will

Ok - I wasn't sure which step was the "syncronize step".

BTW, I hope you (and others from Intel - go willy! ;) can give feedback
to the Intel/AMD chipset designers to ditch this design ASAP.

*shrug* I didn't think too much about it on the original implementation
and it happened to work out nicely. Think cacheline "life cycle" when
picking array sizes (or when comparing to linked lists) and you'll

Well, it can be more scientific. Determine the overhead (with vs without).
Then divide by how many times one can afford to ignore the sync op.

Huh? I think you misunderstood. Decide how _many_ you want to defer
at compile time and leave the general feature enabled as a runtime option.

Ok - my suggestion still applies. "compile time constant" only refers
to the number of unmaps the code will defer. e.g. "borrow" DELAY_RESOURCE_CNT.
I understand the DMAR needs to be enabled at runtime for production.
(I've spent ~5 years dealing with RH/SuSE/Debian IA64-linux kernels...)

If you can reduce the overhead to < 1% for pktgen, TPC-C won't

Excellent. I'm also happy to answer questions on linux-pci about this.
And willy will be back in oregon I'm sure. :)

Well, you still need a consistent benchmark/workload (e.g. pktgen) and
a precise tool to measure the overhead (e.g. oprofile or perfmon2).



Understood. That's why netperf (see ...

The HW implementation IS evolving.  Especially as the MCH and more of
the chipsets are moved into the CPU package.  It will get better over

Sadly the overhead only a fraction of the overhead is due to the IOTLB
flushes.  I can't wave away the IOVA management overhead with batched

I'm not going to OLS this year :(  travel budget is severely restricted

I've been using oprofile, and some TSC counters I have in an out-of tree
patch for instrumenting the code and dumping the cycles per
code-path-of-interest.  Its been pretty effective, but it affects the


The following patch is an update to use an array instead of a list of
IOVA's in the implementation of defered iotlb flushes.  It takes
inspiration from sba_iommu.c

I like this implementation better as it encapsulates the batch process
within intel-iommu.c, and no longer touches iova.h (which is shared)

Performance data:  Netperf 32byte UDP streaming
2.6.25-rc3-mm1:
IOMMU-strict : 58Mps @ 62% cpu
NO-IOMMU : 71Mbs @ 41% cpu
List-based IOMMU-default-batched-IOTLB flush: 66Mbps @ 57% cpu

with this patch:
IOMMU-strict : 73Mps @ 75% cpu
NO-IOMMU : 74Mbs @ 42% cpu
Array-based IOMMU-default-batched-IOTLB flush: 72Mbps @ 62% cpu

--mgross


Signed-off-by: <mgross@linux.intel.com>


Index: linux-2.6.25-rc3-mm1/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/drivers/pci/intel-iommu.c	2008-03-05 11:34:41.000000000 -0800
+++ linux-2.6.25-rc3-mm1/drivers/pci/intel-iommu.c	2008-03-05 12:50:21.000000000 -0800
@@ -59,8 +59,17 @@
 DEFINE_TIMER(unmap_timer,  flush_unmaps_timeout, 0, 0);
 
 static struct intel_iommu *g_iommus;
+
+#define HIGH_WATER_MARK 250
+struct deferred_flush_tables {
+	int next;
+	struct iova *iova[HIGH_WATER_MARK];
+	struct dmar_domain *domain[HIGH_WATER_MARK];
+};
+
+static struct deferred_flush_tables *deferred_flush;
+
 /* bitmap for indexing intel_iommus */
-static unsigned long 	*g_iommus_to_flush;
 static int ...

On Wed, Mar 05, 2008 at 03:01:57PM -0800, mark gross wrote:

*chuckle* Glad I could be of assistance :P


Agreed. But IO TLB shoot-down/invalidate could be alot cheaper. Intel knows
how to do it for CPU MMU (I hope they do at least given IA64 experience).
IO MMU should be no different (well, not too much different). IOMMU is
like the CPU MMU except it's shared by many IO devices instead of

Right. IOVA management is CPU intensive.  But stalling on IO TLB flush
syncronize is a major part, an easy target and should be reduced.
Taking advantage of "warm cache" and other "normal" coding methods
will help minimize the CPU overhead.


I removed stats from the parisc IOMMU code exactly for that reason.
It was useful for evaluating details of specific algorithms (comparative)

Nice! :)
66/57 == 1.15
72/62 == 1.16
~10% higher throughput with essentially no change in service demand.

But I'm wondering why IOMMU-strict gets better throughput. Something
else is going on here. I suspect better CPU cache utilization and
perhaps lowering the high water mark to 32 would be a test to prove that.

BTW, can you clarify what the units are?
I see "Mps", "Mbs", and "Mbps". Ideally we'd be using
a single unit of measure to compare. "Mpps" would be my preferred one
(Million packets per second) for small, fixed sized packets.

Ditch the "debug" code (stats pr0n) and I'll bet this will go up
a few more percentage points and reduce the service demand.

cheers,
grant
--

Previous thread: Questions on rtc drivers... by J.A. on Wednesday, February 20, 2008 - 5:07 pm. (1 message)

Next thread: [PATCH]iova-lockdep-false-alarm-fix. by mark gross on Wednesday, February 20, 2008 - 5:35 pm. (3 messages)