The following patch is for batching up the flushing of the IOTLB for
the DMAR implementation found in the Intel VT-d hardware. It works by
building a list of to be flushed IOTLB entries and a bitmap list of
which DMAR engine they are from.
After either a high water mark (250 accessible via debugfs) or 10ms the
list of iova's will be reclaimed and the DMAR engines associated are
IOTLB-flushed.
This approach recovers 15 to 20% of the performance lost when using the
IOMMU for my netperf udp stream benchmark with small packets. It can be
disabled with a kernel boot parameter
"intel_iommu=strict".
Its use does weaken the IOMMU protections a bit.
I would like to see this go into MM for a while and then onto mainline.
Thanks,
Signed-off-by: <mgross@linux.intel.com>
Index: linux-2.6.24-mm1/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.24-mm1.orig/drivers/pci/intel-iommu.c 2008-02-12 07:12:06.000000000 -0800
+++ linux-2.6.24-mm1/drivers/pci/intel-iommu.c 2008-02-12 11:35:42.000000000 -0800
@@ -21,6 +21,7 @@
#include <linux/init.h>
#include <linux/bitmap.h>
+#include <linux/debugfs.h>
#include <linux/slab.h>
#include <linux/irq.h>
#include <linux/interrupt.h>
@@ -30,6 +31,7 @@
#include <linux/dmar.h>
#include <linux/dma-mapping.h>
#include <linux/mempool.h>
+#include <linux/timer.h>
#include "iova.h"
#include "intel-iommu.h"
#include <asm/proto.h> /* force_iommu in this header in x86-64*/
@@ -50,11 +52,39 @@
#define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1)
+
+static void flush_unmaps_timeout(unsigned long data);
+
+static struct timer_list unmap_timer =
+ TIMER_INITIALIZER(flush_unmaps_timeout, 0, 0);
+
+struct unmap_list {
+ struct list_head list;
+ struct dmar_domain *domain;
+ struct iova *iova;
+};
+
+static struct intel_iommu *g_iommus;
+/* bitmap for indexing intel_iommus */
+static unsigned long *g_iommus_to_flush;
+static int g_num_of_iommus;
+
+static ...No, this modifies unmap_timer.expires twice. Might be racy too. You want I see timers being added, but I see no del_timer_sync()s added on cleanup boot-time options suck. Is it not possible to tweak this at runtime? --
I'll double check if its needed, but it wouldn't hurt. This code is on Once you set up the IOMMU's you never take them down or re-set them up. Not very. But, its better than blocking on the hardware poll IOTLB flush operation on every unmap. Is there a lock less way to insert into this list? I suppose I could have one lock per DMAR-engine but, that would still have the scalability issue. (perhaps a list per IOVA? /me needs to think on this a bit) The best way is to get the network stack to reuse dma buffers when using an iommu and avoid the unmap operation altogether. But thats a longer This code doesn't really tear down well. However; at this point in Yes they do. There may be a way to enable / disable this behavior at runtime. Let me think on it a bit. Thank you for looking at this! --mgross --
We normally put a space after "/*" and before "*/" --
Yup, not needed. Only the single threaded __init path increments this I just looked, I don't think bitmap.h helps me with my bitmap use in Its not very, but I'm doing an insert walk of a list and need the locks. On the bright side, its way better than spin locking and poling the The only error path to delete the single timer created is at the bottom of intel_iommu_init. I don't think there is a place for the timer to be Yes, I could easily create a sysfs or debugfs mechanism for turning it on / off at run time. I would like input on the preferred way to do this. sysfs or debugfs? Thanks again for your review on this. The following is the updated patch. --mgross Signed-off-by: <mgross@linux.intel.com> Index: linux-2.6.25-rc2-mm1/drivers/pci/intel-iommu.c =================================================================== --- linux-2.6.25-rc2-mm1.orig/drivers/pci/intel-iommu.c 2008-02-26 11:15:46.000000000 -0800 +++ linux-2.6.25-rc2-mm1/drivers/pci/intel-iommu.c 2008-02-29 14:36:55.000000000 -0800 @@ -21,6 +21,7 @@ #include <linux/init.h> #include <linux/bitmap.h> +#include <linux/debugfs.h> #include <linux/slab.h> #include <linux/irq.h> #include <linux/interrupt.h> @@ -30,6 +31,7 @@ #include <linux/dmar.h> #include <linux/dma-mapping.h> #include <linux/mempool.h> +#include <linux/timer.h> #include "iova.h" #include "intel-iommu.h" #include <asm/proto.h> /* force_iommu in this header in x86-64*/ @@ -50,11 +52,32 @@ #define DOMAIN_MAX_ADDR(gaw) ((((u64)1) << gaw) - 1) + +static void flush_unmaps_timeout(unsigned long data); + +DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0); + +static struct intel_iommu *g_iommus; +/* bitmap for indexing intel_iommus */ +static unsigned long *g_iommus_to_flush; +static int g_num_of_iommus; + +static DEFINE_SPINLOCK(async_umap_flush_lock); +static LIST_HEAD(unmaps_to_do); + +static int timer_on; +static long list_size; +static int high_watermark; + +static ...
Which is it, a debugging option, or something that anyone would want to change? If debugging, make it a debugfs file. Otherwise sysfs if fine, but be sure to add a Documentation/ABI/ file for it. thanks, greg k-h --
I didn't see the original patch - thanks for cc'ing linux-pci. PA-RISC and IA64 have been doing this for several years now. See DELAYED_RESOURCE_CNT usage in drivers/parisc/sba_iommu.c and in arch/ia64/hp/common/sba_iommu.c. I prototyped the same concept for x86_64 GART support but it didn't seem to matter on benchmarks since most of the devices I use are 64-bit and don't need to use the IOMMU. IDE/SATA controllers are 32-bit but there is LOTS of overhead in so many other places, this change made less than 0.3 difference for them. (And the decision to use uncached memory for the IO Pdir made this moot). One can do a few things to limit how much the protections are weakened: 1) Invalidate the IO TLB entry and IO Pdir entries (just don't force syncronization). At least this was possible on the IOMMU's I'm familiar with. 2) release the IO Pdir entry for re-use once syncronization has been forced. Syncronization/flushing of the IO TLB shoot-down is the most expensive ops AFAICT on the IOMMU's I've worked with. I don't ever recall finding a bug where the device was DMA'ing to a buffer again shortly after unmap call. Most DMA errors are device driver I suggest adding an "unlikely()" around this test so the compiler can generate better code for this and it's clear what your intent is. I just looked at the implementation of flush_unmaps(). I strongly reccomend comparing this implementation with the DELAYED_RESOURCE_CNT since this looks like it will need 2X or more of the CPU cycles to execute for each entry. Use of "list_del()" is substantially more expensive than a simple linear arrary. (Fewer entries per cacheline by 4X maybe?) Another problem is every now and then some IO is going to burn a bunch of CPU cycles and introduce inconsistent performance for particular unmap operations. One has to tradeoff amortizing the cost of the IOMMU flushing with the number of "unusable" IO Pdir entries and more consistent CPU utilization for each unmap() call. My gut ...
The intel DMAR engines with the TLB's do have a measurable overhead flushing TLB entries on every unmapsingle. This hardware forces a bit poll on TLB flush completion that adds up on high rates of unmap operations. (about 15% on my netperf small packet 32 byte UDP stream over a 1GigE adapter) FWIW: I think IOMMU use for avoiding bounce buffers is a thing of the past. (at least I have a "show me where it helps" point of view for modern hardware. what new hardware needs bounce buffers anymore?) Also, to be fair to the Intel hardware the performance overhead is really only noticeable on small packed and 10Gig benchmarking. Most The default is the TLB invalidation. It has a nasty polling loop that /me looks, an array of pointers to iova's to clean up makes good sense. I don't know why I didn't think of this approach first. Working with the 250 high water mark came from 1gig small packet netperf udp streaming testing. The 1000 I initially used didn't perform quite as Compile time options lock out OSV's from using the feature. These IOMMU's are in desktop systems now. My job is to make it not suck so OSV's will leave DMAR enabled in production kernels. (well at least Yup, Willi has just started to look over my shoulder on some of this. He pointed me at some OLS papers and whatnot to look at. I botched my chance to work face to face with him last week on his last visit to Oh the worst cases are easy. Just run a 10gig NIC and see what you get. With a 1Gig NIC you need to drop the packet size down to 32 bytes or something tiny like that to really notice it. 16K packets have only a smallish overhead that are iffy as benchmarks because the CPU utilization is too low (<25%) to be a good measure. FWIW : throughput isn't effected so much as the CPU overhead. iommu=strict: 16K UDP UNIDIRECTIONAL SEND TEST 826Mbps at 25% cpu with this patch: 16K UDP UNIDIRECTIONAL SEND TEST 826Mbps at 18% cpu /me thinks about this. You have a point. I put in the timer ...
So if we flushed the IOTLB every 16 unmap calls, it should be < 1% penalty? Using that logic is how I decided the current calue of DELAYED_RESOURCE_CNT Otherwise, Anything running in a legacy IDE mode is using 32-bit DMA. So linux still has plenty of SATA drivers using 32-bit DMA. All the NICs, FC, Infiniband, et al are PCI-e and thus by definition *nod* - I know. That why I use pktgen to measure the dma map/unmap overhead. Note that with solid state storage (should be more visible in the next year or so), the transaction rate is going to look alot more like a NIC than the traditional HBA storage controller. So map/unmap performance will Ok - I wasn't sure which step was the "syncronize step". BTW, I hope you (and others from Intel - go willy! ;) can give feedback to the Intel/AMD chipset designers to ditch this design ASAP. *shrug* I didn't think too much about it on the original implementation and it happened to work out nicely. Think cacheline "life cycle" when picking array sizes (or when comparing to linked lists) and you'll Well, it can be more scientific. Determine the overhead (with vs without). Then divide by how many times one can afford to ignore the sync op. Huh? I think you misunderstood. Decide how _many_ you want to defer at compile time and leave the general feature enabled as a runtime option. Ok - my suggestion still applies. "compile time constant" only refers to the number of unmaps the code will defer. e.g. "borrow" DELAY_RESOURCE_CNT. I understand the DMAR needs to be enabled at runtime for production. (I've spent ~5 years dealing with RH/SuSE/Debian IA64-linux kernels...) If you can reduce the overhead to < 1% for pktgen, TPC-C won't Excellent. I'm also happy to answer questions on linux-pci about this. And willy will be back in oregon I'm sure. :) Well, you still need a consistent benchmark/workload (e.g. pktgen) and a precise tool to measure the overhead (e.g. oprofile or perfmon2). Understood. That's why netperf (see ...
The HW implementation IS evolving. Especially as the MCH and more of
the chipsets are moved into the CPU package. It will get better over
Sadly the overhead only a fraction of the overhead is due to the IOTLB
flushes. I can't wave away the IOVA management overhead with batched
I'm not going to OLS this year :( travel budget is severely restricted
I've been using oprofile, and some TSC counters I have in an out-of tree
patch for instrumenting the code and dumping the cycles per
code-path-of-interest. Its been pretty effective, but it affects the
The following patch is an update to use an array instead of a list of
IOVA's in the implementation of defered iotlb flushes. It takes
inspiration from sba_iommu.c
I like this implementation better as it encapsulates the batch process
within intel-iommu.c, and no longer touches iova.h (which is shared)
Performance data: Netperf 32byte UDP streaming
2.6.25-rc3-mm1:
IOMMU-strict : 58Mps @ 62% cpu
NO-IOMMU : 71Mbs @ 41% cpu
List-based IOMMU-default-batched-IOTLB flush: 66Mbps @ 57% cpu
with this patch:
IOMMU-strict : 73Mps @ 75% cpu
NO-IOMMU : 74Mbs @ 42% cpu
Array-based IOMMU-default-batched-IOTLB flush: 72Mbps @ 62% cpu
--mgross
Signed-off-by: <mgross@linux.intel.com>
Index: linux-2.6.25-rc3-mm1/drivers/pci/intel-iommu.c
===================================================================
--- linux-2.6.25-rc3-mm1.orig/drivers/pci/intel-iommu.c 2008-03-05 11:34:41.000000000 -0800
+++ linux-2.6.25-rc3-mm1/drivers/pci/intel-iommu.c 2008-03-05 12:50:21.000000000 -0800
@@ -59,8 +59,17 @@
DEFINE_TIMER(unmap_timer, flush_unmaps_timeout, 0, 0);
static struct intel_iommu *g_iommus;
+
+#define HIGH_WATER_MARK 250
+struct deferred_flush_tables {
+ int next;
+ struct iova *iova[HIGH_WATER_MARK];
+ struct dmar_domain *domain[HIGH_WATER_MARK];
+};
+
+static struct deferred_flush_tables *deferred_flush;
+
/* bitmap for indexing intel_iommus */
-static unsigned long *g_iommus_to_flush;
static int ...On Wed, Mar 05, 2008 at 03:01:57PM -0800, mark gross wrote: *chuckle* Glad I could be of assistance :P Agreed. But IO TLB shoot-down/invalidate could be alot cheaper. Intel knows how to do it for CPU MMU (I hope they do at least given IA64 experience). IO MMU should be no different (well, not too much different). IOMMU is like the CPU MMU except it's shared by many IO devices instead of Right. IOVA management is CPU intensive. But stalling on IO TLB flush syncronize is a major part, an easy target and should be reduced. Taking advantage of "warm cache" and other "normal" coding methods will help minimize the CPU overhead. I removed stats from the parisc IOMMU code exactly for that reason. It was useful for evaluating details of specific algorithms (comparative) Nice! :) 66/57 == 1.15 72/62 == 1.16 ~10% higher throughput with essentially no change in service demand. But I'm wondering why IOMMU-strict gets better throughput. Something else is going on here. I suspect better CPU cache utilization and perhaps lowering the high water mark to 32 would be a test to prove that. BTW, can you clarify what the units are? I see "Mps", "Mbs", and "Mbps". Ideally we'd be using a single unit of measure to compare. "Mpps" would be my preferred one (Million packets per second) for small, fixed sized packets. Ditch the "debug" code (stats pr0n) and I'll bet this will go up a few more percentage points and reduce the service demand. cheers, grant --
