login
Header Space

 
 

Re: [PATCH] mm: use pagevec to rotate reclaimable page

Previous thread: SYSFS: need a noncaching read by Heiko Schocher on Tuesday, September 11, 2007 - 5:43 am. (10 messages)

Next thread: [Patch] Fix panic of cpu online with memory less node by Yasunori Goto on Tuesday, September 11, 2007 - 6:19 am. (1 message)
To: <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, September 11, 2007 - 5:31 am

Hi.
While running some memory intensive load, system response
deteriorated just after swap-out started.

The cause of this problem is that when a PG_reclaim page is
moved to the tail of the inactive LRU list in rotate_reclaimable_page(),
lru_lock spin lock is acquired every page writeback . This deteriorates
system performance and makes interrupt hold off time longer when
swap-out started.

Following patch solves this problem. I use pagevec in rotating reclaimable
pages to mitigate LRU spin lock contention and reduce interrupt
hold off time.

I did a test that allocating and touching pages in multiple processes, and
pinging to the test machine in flooding mode to measure response under
memory intensive load.
The test result is:

	-2.6.23-rc5
	--- testmachine ping statistics ---
	3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
	rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 
17.746/0.092 ms

	-2.6.23-rc5-patched
	--- testmachine ping statistics ---
	3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
	rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 
17.314/0.091 ms

Max round-trip-time was improved.

The test machine spec is that 4CPU(3.16GHz, Hyper-threading enabled)
8GB memory , 8GB swap.

Thanks.

Signed-off-by :Hisashi Hifumi &lt;hifumi.hisashi@oss.ntt.co.jp&gt;

diff -Nrup linux-2.6.23-rc5.org/include/linux/swap.h 
linux-2.6.23-rc5/include/linux/swap.h
--- linux-2.6.23-rc5.org/include/linux/swap.h	2007-09-06 18:44:06.000000000 +0900
+++ linux-2.6.23-rc5/include/linux/swap.h	2007-09-06 18:45:28.000000000 +0900
@@ -185,6 +185,7 @@ extern void FASTCALL(mark_page_accessed(
  extern void lru_add_drain(void);
  extern int lru_add_drain_all(void);
  extern int rotate_reclaimable_page(struct page *page);
+extern void move_tail_pages(void);
  extern void swap_setup(void);

  /* linux/mm/vmscan.c */
diff -Nrup linux-2.6.23-rc5.org/mm/swap.c linux-2.6.23-rc5/mm/swap.c
--- linux-2.6.23-...
To: Hisashi Hifumi <hifumi.hisashi@...>
Cc: <linux-kernel@...>, <linux-mm@...>
Date: Thursday, September 13, 2007 - 10:37 pm

Your email client performs space-stuffing.  That means the recipient needs

I really don't like the games we play with page refcounts here.  By the
time we consider a page, that page could have been freed and could be
reused for any purpose whatever.  We don't have any business playing around
with the page's internal state when we're unsure of what it is being used
for.


The page_count() test here is a bit of a worry, too.  Why do we need it? 
The caller must have pinned the page in some fashion else we couldn't use
it safely in this function at all.

I assume that you discovered that once we've cleared PageWriteback(), the
page can get reclaimed elsewhere?  If so, that could still happen
immediately after the page_count() test.  It's all a bit of a worry. 
Deferring the ClearPageWriteback() will fix any race concerns, but I do

The tricky part will be working out whether you've found all the places
which need to have move_tail_pages() added to them.

pagevec_move_tail() needs a comment explaining that it must be called with
preemption disabled.  Otherwise it has nasty races.

Actually, given that pagevec_move_tail() is called from both interrupt and
non-interrupt context, I guess that it needs local_irq_save() protection as
well.  In which case the preempt_disable() and preempt_enable() in
move_tail_pages() can be optimised away (use local_irq_save() and
__get_cpu_var()).



So I do think that for safety and sanity's sake, we should be taking a ref
on the pages when they are in a pagevec.  That's going to hurt your nice
performance numbers :(

Please consider doing a single call to __count_vm_events() in
pagevec_move_tail(), instead of multiple calls to __count_vm_event().

It's a bit of a concern that we can have ((num_online_cpus - 1) *
PAGEVEC_SIZE) pages outstanding in the other CPU's pagevecs.  But I _guess_
that'll be OK - it's a relatively small number of pages.

What happens during cpu hot-unplug?  If we take a ref on the pageved'ed
pages then we have page...
To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, September 18, 2007 - 6:41 am

I modified my patch based on your comment.

At 11:37 07/09/14, Andrew Morton wrote:
 &gt;
 &gt;So I do think that for safety and sanity's sake, we should be taking a ref
 &gt;on the pages when they are in a pagevec.  That's going to hurt your nice
 &gt;performance numbers :(
 &gt;

I did ping test again to observe performance deterioration caused by taking 
a ref.

	-2.6.23-rc6-with-modifiedpatch
	--- testmachine ping statistics ---
	3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
	rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 
17.801/0.129 ms

The result for my original patch is as follows.

	-2.6.23-rc5-with-originalpatch
	--- testmachine ping statistics ---
	3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
	rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 
17.314/0.091 ms


The influence to response was small.

Thanks.

Signed-off-by :Hisashi Hifumi &lt;hifumi.hisashi@oss.ntt.co.jp&gt;


diff -Nrup linux-2.6.23-rc6.org/include/linux/swap.h 
linux-2.6.23-rc6/include/linux/swap.h
--- linux-2.6.23-rc6.org/include/linux/swap.h	2007-09-14 16:49:57.000000000 +0900
+++ linux-2.6.23-rc6/include/linux/swap.h	2007-09-14 16:58:48.000000000 +0900
@@ -185,6 +185,7 @@ extern void FASTCALL(mark_page_accessed(
  extern void lru_add_drain(void);
  extern int lru_add_drain_all(void);
  extern int rotate_reclaimable_page(struct page *page);
+extern void move_tail_pages(void);
  extern void swap_setup(void);

  /* linux/mm/vmscan.c */
diff -Nrup linux-2.6.23-rc6.org/mm/swap.c linux-2.6.23-rc6/mm/swap.c
--- linux-2.6.23-rc6.org/mm/swap.c	2007-07-09 08:32:17.000000000 +0900
+++ linux-2.6.23-rc6/mm/swap.c	2007-09-18 18:49:07.000000000 +0900
@@ -94,24 +94,62 @@ void put_pages_list(struct list_head *pa
  EXPORT_SYMBOL(put_pages_list);

  /*
+ * pagevec_move_tail() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
+ */
+static void pagevec_move_tail(struct pagevec *pvec)
...
To: Hisashi Hifumi <hifumi.hisashi@...>
Cc: <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, September 19, 2007 - 6:17 pm

On Tue, 18 Sep 2007 19:41:14 +0900

well.. that's not really the test which will show up any regressions.

The extra get_page/put_page will affect things like kernel CPU utilisation
on fast writeout on a single CPU.  Say, run a huge write against a fast
storage system on a single pentium 4 CPU and see how much the system CPU
time is increased.

The kernel's internal cpu accounting probably won't be accurate enough to
get believeable numbers from a test like this - better to use the
subtractive approach: see http://www.zipworld.com.au/~akpm/linux/#zc
-
To: Hisashi Hifumi <hifumi.hisashi@...>
Cc: Andrew Morton <akpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, September 17, 2007 - 9:29 pm

It would be interesting to test -mm kernels. They have a patch which reduces
zone lock contention quite a lot.

I think your patch is a nice idea, and with less zone lock contention in other
areas, it is possible that it might produce a relatively larger improvement.
-
To: Nick Piggin <nickpiggin@...>
Cc: Hisashi Hifumi <hifumi.hisashi@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, September 18, 2007 - 1:44 pm

I'm a bit wobbly about this patch - it adds additional single-cpu overhead
to reduce multiple-cpu overhead and latency.
-
To: Andrew Morton <akpm@...>
Cc: Hisashi Hifumi <hifumi.hisashi@...>, <linux-kernel@...>, <linux-mm@...>
Date: Monday, September 17, 2007 - 9:47 pm

Yeah, that's true. Although maybe it gets significantly more after the
patch in -mm.

Possibly other page batching sites have similar issues on UP... I wonder
if a type of pagevec that turns into a noop on UP would be interesting...
probably totally unmeasurable and not worth the cost of code
maintenance ;)
-
To: Nick Piggin <nickpiggin@...>
Cc: Hisashi Hifumi <hifumi.hisashi@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, September 18, 2007 - 2:03 pm

Yes, I wonder that.  Some of the additional overhead will come from
the additional get_page/put_page which is needed for pagevec ownership.
-
To: Andrew Morton <akpm@...>
Cc: <linux-kernel@...>, <linux-mm@...>
Date: Friday, September 14, 2007 - 3:42 am

Thank you for your comment.

At 11:37 07/09/14, Andrew Morton wrote:

 &gt;The page_count() test here is a bit of a worry, too.  Why do we need it?
 &gt;The caller must have pinned the page in some fashion else we couldn't use
 &gt;it safely in this function at all.
 &gt;
 &gt;I assume that you discovered that once we've cleared PageWriteback(), the
 &gt;page can get reclaimed elsewhere?  If so, that could still happen
 &gt;immediately after the page_count() test.  It's all a bit of a worry.
 &gt;Deferring the ClearPageWriteback() will fix any race concerns, but I do
 &gt;think that we need to take a ref on the page for the pagevec ownership.
 &gt;

Actually, I considered taking a ref to pin pages. But this could prevent 
the page
reclaiming activity so I did not use it.

I reflect your comment and send you modified patch.

-
Previous thread: SYSFS: need a noncaching read by Heiko Schocher on Tuesday, September 11, 2007 - 5:43 am. (10 messages)

Next thread: [Patch] Fix panic of cpu online with memory less node by Yasunori Goto on Tuesday, September 11, 2007 - 6:19 am. (1 message)
speck-geostationary