Linux: Reducing Kernel Latency

Submitted by Jeremy
on September 10, 2004 - 9:43pm

With much feedback on the lkml, Ingo Molnar [interview] has continued to improve his voluntary kernel preemption patch [story]. Testing the patch has revealed a number of areas in the 2.6 Linux kernel [forum] that were causing high latency. Fixes have been created and merged as these areas have been located. For example, following Ingo's release of the -R6 version of the patch, Lee Revell reported that he was still able to cause measurable latencies by driving the machine into swap. Ingo acknowledged that this was due to the get_swap_page() function, which he described as "pretty stupid, it does a near linear search for a free slot in the swap bitmap - this not only is a latency issue but also an overhead thing as we do it for every other page that touches swap." He went on to add, "this is pretty much the only latency that we still having during heavy VM load".
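
To make the complaint concrete, here is a minimal sketch of a linear free-slot search. It is not the kernel's get_swap_page(); the byte-per-slot map and the name find_free_slot_linear() are assumptions made for illustration only. The point is that the worst case touches every slot, so the fuller the map gets, the longer the search runs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a swap map: one byte per slot, 0 means free. */
#define SLOT_FREE 0

/*
 * Return the index of the first free slot at or after `start`,
 * wrapping around once, or -1 if every slot is in use.
 * Worst case is O(nslots): a nearly full map costs a full pass.
 */
long find_free_slot_linear(const unsigned char *map, size_t nslots, size_t start)
{
    assert(map != NULL);
    if (nslots == 0)
        return -1;
    for (size_t i = 0; i < nslots; i++) {
        size_t idx = (start + i) % nslots;
        if (map[idx] == SLOT_FREE)
            return (long)idx;
    }
    return -1;
}
```

If a search like this runs with a lock held and preemption disabled, its running time is a direct lower bound on the scheduling latency, which is consistent with the traces in the thread below growing as the machine goes deeper into swap.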

Andrew Morton [interview] agreed, going on to say, "someone needs to get down and redesign the swap block allocator. I bet latency improvements would fall out of that automatically. The main problem is that swap blocks are now physically clustered according to the page lru ordering, which doesn't have much relationship to process-virtual-address-ordering." He attached a rough patch he'd written earlier to cluster swap blocks by process virtual address instead. Ingo took the patch and merged it with his own work. Lee retested and reported positive results: "OK, Andrew's patch seems to be an improvement. I can still cause unbounded latencies, but these only seem to happen when we fill all available RAM and swap space, at which point we start spending milliseconds at a time in scan_swap_map". This particular fix is still a work in progress, but the end result should be reduced latency on a machine that is actively using swap.


From: Ingo Molnar [email blocked]
To:  linux-kernel
Subject: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: 	Mon, 6 Sep 2004 13:06:26 +0200


i've released the -R6 patch:

  http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk12-R6

Changes in -R6:

 - fixed a CONFIG_SMP + CONFIG_PREEMPT bug that had the potential to
   cause spinlock related lockups. (UP kernels are unaffected.) This bug 
   got introduced in -R5.

2.6.9-rc1-bk12 patching order is:
 
    http://kernel.org/pub/linux/kernel/v2.6/linux-2.6.8.tar.bz2
  + http://kernel.org/pub/linux/kernel/v2.6/testing/patch-2.6.9-rc1.bz2
  + http://redhat.com/~mingo/voluntary-preempt/patch-2.6.9-rc1-bk12.bz2
 
	Ingo


From: Lee Revell [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Wed, 08 Sep 2004 02:56:03 -0400

On Mon, 2004-09-06 at 07:06, Ingo Molnar wrote:
> i've released the -R6 patch:
> 
>   http://redhat.com/~mingo/voluntary-preempt/voluntary-preempt-2.6.9-rc1-bk12-R6

I get these latencies when I cause the machine to swap by compiling a
kernel with make -j32. They get bigger as the machine gets further
into swap.

Every 2.0s: head -60 /proc/latency_trace
Wed Sep 8 02:51:40 2004

preemption latency trace v1.0.6 on 2.6.9-rc1-bk12-VP-R6
--------------------------------------------------
 latency: 605 us, entries: 5 (5) [VP:1 KP:1 SP:1 HP:1 #CPUS:1]
    -----------------
    | task: kswapd0/35, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: get_swap_page+0x23/0x490
 => ended at:   get_swap_page+0x13f/0x490
=======>
00000001 0.000ms (+0.606ms): get_swap_page (add_to_swap)
00000001 0.606ms (+0.000ms): sub_preempt_count (get_swap_page)
00000001 0.606ms (+0.000ms): update_max_trace (check_preempt_timing)
00000001 0.606ms (+0.000ms): _mmx_memcpy (update_max_trace)
00000001 0.607ms (+0.000ms): kernel_fpu_begin (_mmx_memcpy)

Lee


From: Ingo Molnar [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Thu, 9 Sep 2004 21:29:24 +0200

* Lee Revell <rlrevell@joe-job.com> wrote:

> I get these latencies when I cause the machine to swap by compiling a
> kernel with make -j32. They get bigger as the machine gets further
> into swap.
> 
> Every 2.0s: head -60 /proc/latency_trace
> Wed Sep 8 02:51:40 2004
> 
> preemption latency trace v1.0.6 on 2.6.9-rc1-bk12-VP-R6
> --------------------------------------------------
>  latency: 605 us, entries: 5 (5) [VP:1 KP:1 SP:1 HP:1 #CPUS:1]
>     -----------------
>     | task: kswapd0/35, uid:0 nice:0 policy:0 rt_prio:0
>     -----------------
>  => started at: get_swap_page+0x23/0x490
>  => ended at:   get_swap_page+0x13f/0x490
> =======>
> 00000001 0.000ms (+0.606ms): get_swap_page (add_to_swap)
> 00000001 0.606ms (+0.000ms): sub_preempt_count (get_swap_page)
> 00000001 0.606ms (+0.000ms): update_max_trace (check_preempt_timing)
> 00000001 0.606ms (+0.000ms): _mmx_memcpy (update_max_trace)
> 00000001 0.607ms (+0.000ms): kernel_fpu_begin (_mmx_memcpy)

yep, the get_swap_page() latency. I can easily trigger 10+ msec
latencies on a box with alot of swap by just letting stuff swap out. I
had a quick look but there was no obvious way to break the lock. Maybe
Andrew has better ideas? get_swap_page() is pretty stupid, it does a
near linear search for a free slot in the swap bitmap - this not only is
a latency issue but also an overhead thing as we do it for every other
page that touches swap.

rationale: this is pretty much the only latency that we still having
during heavy VM load and it would Just Be Cool if we fixed this final
one. audio daemons and apps like jackd use mlockall() so they are not
affected by swapping.

	Ingo
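
Ingo's closing remark about mlockall() can be illustrated with a short sketch. mlockall(), MCL_CURRENT and MCL_FUTURE are the standard POSIX interface; the wrapper name lock_process_memory() is an assumption introduced here, and this is not taken from jackd's source.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Pin all current and future pages of the calling process in RAM so
 * they cannot be swapped out.
 * Returns 0 on success, or the errno value on failure. An unprivileged
 * process commonly fails with EPERM or ENOMEM when its RLIMIT_MEMLOCK
 * limit is too low.
 */
int lock_process_memory(void)
{
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        int err = errno;
        fprintf(stderr, "mlockall failed: %s\n", strerror(err));
        return err;
    }
    return 0;
}
```

A latency-sensitive program would typically call something like this once at startup, before entering its real-time loop. Locking memory keeps the program's own pages resident, but it does not shorten a kernel code path that runs with preemption disabled, which is the problem the rest of the thread is about.
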


From: Andrew Morton [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Thu, 9 Sep 2004 13:05:26 -0700

Ingo Molnar [email blocked] wrote:
>
> yep, the get_swap_page() latency. I can easily trigger 10+ msec
> latencies on a box with alot of swap by just letting stuff swap out. I
> had a quick look but there was no obvious way to break the lock. Maybe
> Andrew has better ideas? get_swap_page() is pretty stupid, it does a
> near linear search for a free slot in the swap bitmap - this not only is
> a latency issue but also an overhead thing as we do it for every other
> page that touches swap.

Someone needs to get down and redesign the swap block allocator. I bet
latency improvements would fall out of that automatically.

The main problem is that swap blocks are now physically clustered according
to the page lru ordering, which doesn't have much relationship to
process-virtual-address-ordering.

The swap allocator made sense when we were doing a virtual scan. It
doesn't make much sense now.

I did a patch a while back which switches the swapspace allocator over to
perform program-virtual-address clustering, but it didn't help much in
brief testing and I haven't got back onto it.

And contrary to my above assertion, I don't think it'll help latency ;)

A short-term bodge would be to scan the map without locks held, take the
lock just to actually claim the block, retry if we raced. Use swapon_sem to
avoid races. After checking that we never perform GFP_WAIT allocations
while holding swapon_sem.

The whole thing needs work.

[patch]
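
The "short-term bodge" Andrew describes (scan without the lock, lock only to claim, retry on a lost race) can be sketched in user space as follows. This is an illustration under assumptions, not kernel code: struct slot_map and claim_free_slot() are hypothetical names, a pthread mutex stands in for the kernel lock, and the swapon_sem handling he mentions is not modelled.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical slot map: 0 means free, non-zero means in use. */
struct slot_map {
    pthread_mutex_t lock;
    unsigned char *slots;
    size_t nslots;
};

/*
 * Scan without the lock, then take the lock only long enough to
 * re-check and claim one slot. If another thread claimed the slot in
 * between, drop the lock and keep scanning.
 * Returns the claimed index, or -1 if no free slot was found.
 */
long claim_free_slot(struct slot_map *m)
{
    assert(m != NULL);
    for (size_t i = 0; i < m->nslots; i++) {
        /*
         * Unlocked peek: the value may be stale, so it is only a hint.
         * (Strict C11 code would use an atomic load here.)
         */
        if (*(volatile unsigned char *)&m->slots[i] != 0)
            continue;

        pthread_mutex_lock(&m->lock);
        if (m->slots[i] == 0) {          /* still free: claim it */
            m->slots[i] = 1;
            pthread_mutex_unlock(&m->lock);
            return (long)i;
        }
        pthread_mutex_unlock(&m->lock);  /* lost the race: keep scanning */
    }
    return -1;
}
```

The design choice is that the lock hold time becomes constant per attempt instead of proportional to the length of the scan. The total work is unchanged, so this addresses latency rather than the overhead Ingo also complained about.
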


From: Alan Cox [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Thu, 09 Sep 2004 21:09:57 +0100

On Iau, 2004-09-09 at 21:05, Andrew Morton wrote:
> I did a patch a while back which switches the swapspace allocator over to
> perform program-virtual-address clustering, but it didn't help much in
> brief testing and I haven't got back onto it.
> 
> And contrary to my above assertion, I don't think it'll help latency ;)

I would still expect the only thing to materially improve swap latency
to be a log structured swap, possibly with a cleaner which tidies
together pages that are referenced together.

You also want contiguous runs of at least 64K and probaly a lot more on
bigger memory systems.


From: Andrew Morton [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Thu, 9 Sep 2004 14:28:01 -0700

Alan Cox [email blocked] wrote:
>
> I would still expect the only thing to materially improve swap latency
> to be a log structured swap, possibly with a cleaner which tidies
> together pages that are referenced together.
> 

Maybe. It'd be nice to show some benefit from the "organise pages by
virtual address" patch first. But then, maybe that doesn't help because
there is little correlation between address congruency and
time-of-reference. That's hard to believe though.

hm. The patch _does_ do what I wanted it to do. Maybe I tested it with
silly workloads.

> 
> You also want contiguous runs of at least 64K and probaly a lot more on
> bigger memory systems.

I used 1MB.

+/*
+ * We divide the swapdev into 1024 kilobyte chunks. We use the cookie and the
+ * upper bits of the index to select a chunk and the rest of the index as the
+ * offset into the selected chunk.
+ */
+#define CHUNK_SHIFT	(20 - PAGE_SHIFT)
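
The arithmetic behind CHUNK_SHIFT can be worked through in a few lines. This sketch assumes 4 KiB pages (PAGE_SHIFT of 12, as on i386). The helpers chunk_of() and offset_in_chunk() are hypothetical and only show how a page offset splits into a chunk number and a position inside the chunk; how the patch combines the "cookie" with the index is not shown in the excerpt and is not reproduced here.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, as on i386. */
#define PAGE_SHIFT      12
#define CHUNK_SHIFT     (20 - PAGE_SHIFT)        /* from the quoted patch */
#define PAGES_PER_CHUNK (1UL << CHUNK_SHIFT)     /* 2^8 = 256 pages */

/* 256 pages of 4 KiB each is exactly 1 MiB, the "1024 kilobyte chunk". */
static_assert((PAGES_PER_CHUNK << PAGE_SHIFT) == (1UL << 20),
              "a chunk must cover 1 MiB");

/* Hypothetical helper: which chunk a swap page offset falls in. */
unsigned long chunk_of(unsigned long page_offset)
{
    return page_offset >> CHUNK_SHIFT;
}

/* Hypothetical helper: position of the page inside its chunk. */
unsigned long offset_in_chunk(unsigned long page_offset)
{
    return page_offset & (PAGES_PER_CHUNK - 1);
}
```

With these assumptions, page offset 300 lands in chunk 1 at position 44, since 300 = 256 + 44.
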


From: Ingo Molnar [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Fri, 10 Sep 2004 15:28:41 +0200

* Andrew Morton [email blocked] wrote:

> diff -puN mm/vmscan.c~swapspace-layout-improvements mm/vmscan.c
> --- 25/mm/vmscan.c~swapspace-layout-improvements 2004-06-03 21:32:51.087602712 -0700
> +++ 25-akpm/mm/vmscan.c 2004-06-03 21:32:51.102600432 -0700

i've attached a merge against current BK-ish kernels. Lee, would you be
interested in testing it? It applies cleanly to an -S0 VP tree. I've
tested it only lightly - it compiles and boots and survives some simple
swapping but that's all.

[patch]

	Ingo


From: Paolo Ciarrocchi [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Fri, 10 Sep 2004 16:28:37 +0200

On Fri, 10 Sep 2004 15:28:41 +0200, Ingo Molnar [email blocked] wrote:
> 
> i've attached a merge against current BK-ish kernels. Lee, would you be
> interested in testing it? It applies cleanly to an -S0 VP tree. I've
> tested it only lightly - it compiles and boots and survives some simple
> swapping but that's all.

Hello kernel folks,
what's the plan regarding the inclusion of VP in mainstream ?

-- 
Paolo
Personal home page: paoloc.doesntexist.org
Buy cool stuff here: http://www.cafepress.com/paoloc


From: Lee Revell <rlrevell@joe-job.com>
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Fri, 10 Sep 2004 12:45:12 -0400

On Fri, 2004-09-10 at 10:28, Paolo Ciarrocchi wrote:
> 
> Hello kernel folks,
> what's the plan regarding the inclusion of VP in mainstream ?
> 

I believe the plan is to merge the individual fixes one at a time. See
Ingo's recent non-VP-related posts. Once the fixes for all of the real
deficiencies in the kernel that the VP patches revealed are merged, then
we will have a very small patch.

Lee


From: Lee Revell [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Fri, 10 Sep 2004 18:54:48 -0400

On Fri, 2004-09-10 at 09:28, Ingo Molnar wrote:
> * Andrew Morton [email blocked] wrote:
> 
> > diff -puN mm/vmscan.c~swapspace-layout-improvements mm/vmscan.c
> > --- 25/mm/vmscan.c~swapspace-layout-improvements 2004-06-03 21:32:51.087602712 -0700
> > +++ 25-akpm/mm/vmscan.c 2004-06-03 21:32:51.102600432 -0700
> 

OK, Andrew's patch seems to be an improvement. I can still cause
unbounded latencies, but these only seem to happen when we fill all
available RAM and swap space, at which point we start spending
milliseconds at a time in scan_swap_map:

preemption latency trace v1.0.7 on 2.6.9-rc1-bk12-VP-S0
-------------------------------------------------------
 latency: 6032 us, entries: 550 (550) | [VP:1 KP:1 SP:1 HP:1 #CPUS:1]
    -----------------
    | task: xfs/1098, uid:0 nice:0 policy:0 rt_prio:0
    -----------------
 => started at: rtc_interrupt+0x294/0x450
 => ended at:   get_swap_page+0x13f/0x350
=======>
00010002 0.000ms (+0.000ms): touch_preempt_timing (rtc_interrupt)
00010002 0.000ms (+0.000ms): printk (rtc_interrupt)
00010002 0.000ms (+0.001ms): vprintk (printk)
00010003 0.002ms (+0.000ms): vscnprintf (vprintk)
00010003 0.002ms (+0.002ms): vsnprintf (vscnprintf)
00010003 0.005ms (+0.004ms): number (vsnprintf)
00010003 0.009ms (+0.001ms): number (vsnprintf)
00010003 0.010ms (+0.001ms): number (vsnprintf)
00010003 0.011ms (+0.000ms): emit_log_char (vprintk)
[...]
00010002 1.983ms (+0.000ms): preempt_schedule (do_IRQ)
00000003 1.984ms (+0.000ms): do_softirq (do_IRQ)
00000003 1.984ms (+0.911ms): __do_softirq (do_softirq)
00010002 2.896ms (+0.000ms): do_IRQ (scan_swap_map)
00010002 2.896ms (+0.000ms): do_IRQ (<00000008>)
00010003 2.897ms (+0.004ms): mask_and_ack_8259A (do_IRQ)
00010003 2.901ms (+0.000ms): preempt_schedule (do_IRQ)

Full trace:

http://krustophenia.net/testresults.php?dataset=2.6.9-rc1-bk12-S0#/var/www/2.6.9-rc1-bk12-S0/swapspace-layout-improvements-A1.txt

The above are just the initial results; I am still testing this. It
certainly seems like it can take a beating.

Lee


From: K.R. Foley [email blocked]
Subject: Re: [patch] voluntary-preempt-2.6.9-rc1-bk12-R6
Date: Fri, 10 Sep 2004 19:21:33 -0500

Lee Revell wrote:
> On Fri, 2004-09-10 at 09:28, Ingo Molnar wrote:
> 
>>* Andrew Morton [email blocked] wrote:
>>
>>
>>>diff -puN mm/vmscan.c~swapspace-layout-improvements mm/vmscan.c
>>>--- 25/mm/vmscan.c~swapspace-layout-improvements 2004-06-03 21:32:51.087602712 -0700
>>>+++ 25-akpm/mm/vmscan.c 2004-06-03 21:32:51.102600432 -0700
>>
> 
> OK, Andrew's patch seems to be an improvement. I can still cause
> unbounded latencies, but these only seem to happen when we fill all
> available RAM and swap space, at which point we start spending
> milliseconds at a time in scan_swap_map:
> 
> 

I see much improved performance so far. Been running for about 3 hours
and the highest latency I've seen thus far is ~260 usec and that was
mmap not swap. The highest latency I've seen from swapping is ~198 and
we have been in and out of swap at least several times. The latency
trace can be seen here:

http://www.cybsft.com/testresults/2.6.9-rc1-bk12-S0/latencytrace1.txt

kr

linear search

Anonymous
on
September 11, 2004 - 10:06am

Is a linear search in a bitmap really the best algorithm for this? Quoting Matt Dillon (before the DFBSD fork):

"The swap bitmap code I wrote uses a radix tree with size hinting for allocations"

I guess that there's room for improvement in that area then.
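
For readers wondering how a tree helps, here is a minimal sketch of the general idea, assuming a two-level structure: a per-leaf free counter lets the allocator skip a whole block of full slots with one comparison. This is a much-simplified relative of a radix tree, not the code Matt Dillon refers to, and it does not implement size hinting. All names and the leaf size of 64 are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define LEAF_SLOTS 64   /* hypothetical number of slots per leaf */

/*
 * Two-level map. free_in_leaf[k] counts the free slots among
 * slots[k * LEAF_SLOTS] .. slots[k * LEAF_SLOTS + LEAF_SLOTS - 1].
 */
struct summary_map {
    unsigned char *slots;        /* 0 = free, non-zero = in use */
    unsigned int *free_in_leaf;  /* one counter per leaf */
    size_t nleaves;
};

/*
 * Allocate one slot. Full leaves are skipped using their counter, so
 * the cost is roughly nleaves + LEAF_SLOTS steps instead of
 * nleaves * LEAF_SLOTS for a flat linear scan.
 * Returns the slot index, or -1 if the map is full.
 */
long summary_alloc(struct summary_map *m)
{
    assert(m != NULL);
    for (size_t k = 0; k < m->nleaves; k++) {
        if (m->free_in_leaf[k] == 0)
            continue;                    /* whole leaf is full: skip it */
        for (size_t i = 0; i < LEAF_SLOTS; i++) {
            size_t idx = k * LEAF_SLOTS + i;
            if (m->slots[idx] == 0) {
                m->slots[idx] = 1;
                m->free_in_leaf[k]--;
                return (long)idx;
            }
        }
    }
    return -1;
}
```

Adding further levels of counters on top turns this into a tree whose lookup cost grows with the logarithm of the map size rather than linearly.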

Hm

Anonymous
on
September 12, 2004 - 7:59am

When your machine is swapping that heavily, scheduling latency is the least of your problems.

Realtime systems

Anonymous
on
September 12, 2004 - 11:30am

Not if you are running a real time control system and you are trying to save $100 000 worth of steel - I've had systems totally loaded but still able to keep control - would love Linux to be able to do this...

I disagree

renoX
on
September 13, 2004 - 2:22pm

I disagree: while your machine may be swapping like hell because of an errant process, that should not prevent a real-time process from running without too much scheduling latency.

A real-time process that is pinned in memory can still keep its "real-time" behaviour under heavy swap usage, provided the scheduling latency stays low.

This will fix part of the problem

kmerley
on
September 14, 2004 - 11:33am

If the routine that searches for open swap pages is improved, this will help speed up swap, at least when swapping out. However, swapping out often goes unnoticed in desktop use, whereas swapping back in does get noticed. Because swap-in speed from a hard drive also depends on disk seek times, as well as on how efficiently the swapped-out pages are found in a table, there will still be latency when swapping pages back in.

One thing you can do about this is to get some more RAM, make it into a ramdisk, and swap to that. All the kernel latencies are still there, but the time spent on swap I/O drops by many orders of magnitude. Even if the swap is fragmented, the data transfer will be very fast. A fast silicon storage solution will also work, but it is not quite as fast and costs much more.

It is not that the kernel is written badly or has major flaws, it is just that hard drive data transfer speeds are so slow and cause noticeable delays. From the beginning of swap use, everybody accepted that latency so they could run more programs on their machines without having to buy extra RAM. We had to accept that, it was the only practical way.

So swapping is how it is, and it is well done for the most part.

Now we are able to get a large amount of RAM for less relative cost, and it can be partly used for swap. For now, with the kernel VM as it is, you can reduce any hard drive swap latencies by using a ramdisk for swap (perhaps by replacing your hard-drive swap with the new RAM, and maybe choosing a different split between RAM and swap). A previous Kerneltrap article gives one set of steps for using RAM as swap.

Kim

Useless without compression

Ano Nymous
on
September 16, 2004 - 3:53am

As said before, using RAM as swap makes no sense, because you might as well not use swap at all and just use the "extra" RAM directly. That said, it would make sense if the data in the RAM swap were compressed. The latency is then slightly higher, but it is still much lower than from a hard drive, and you actually use less RAM.

You seem to be confusing two

Anonymous
on
September 20, 2004 - 5:35pm

You seem to be confusing two kinds of latency: the latency of swapping (how long an application waits before it is swapped back in) and the latency of the kernel (how long the whole system is frozen because it is swapping).

Simply speaking: the first kind is how long you wait for Mozilla to swap back in. The second kind is whether xmms will skip just because Mozilla is being swapped in.

What the fine folks at the LKML are speaking about is the second kind.
