Hi.
While running some memory intensive load, system response
deteriorated just after swap-out started.The cause of this problem is that when a PG_reclaim page is
moved to the tail of the inactive LRU list in rotate_reclaimable_page(),
lru_lock spin lock is acquired every page writeback . This deteriorates
system performance and makes interrupt hold off time longer when
swap-out started.Following patch solves this problem. I use pagevec in rotating reclaimable
pages to mitigate LRU spin lock contention and reduce interrupt
hold off time.I did a test that allocating and touching pages in multiple processes, and
pinging to the test machine in flooding mode to measure response under
memory intensive load.
The test result is:-2.6.23-rc5
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms
rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma
17.746/0.092 ms-2.6.23-rc5-patched
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma
17.314/0.091 msMax round-trip-time was improved.
The test machine spec is that 4CPU(3.16GHz, Hyper-threading enabled)
8GB memory , 8GB swap.Thanks.
Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
diff -Nrup linux-2.6.23-rc5.org/include/linux/swap.h
linux-2.6.23-rc5/include/linux/swap.h
--- linux-2.6.23-rc5.org/include/linux/swap.h 2007-09-06 18:44:06.000000000 +0900
+++ linux-2.6.23-rc5/include/linux/swap.h 2007-09-06 18:45:28.000000000 +0900
@@ -185,6 +185,7 @@ extern void FASTCALL(mark_page_accessed(
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern int rotate_reclaimable_page(struct page *page);
+extern void move_tail_pages(void);
extern void swap_setup(void);/* linux/mm/vmscan.c */
diff -Nrup linux-2.6.23-rc5.org/mm/swap.c linux-2.6.23-rc5/mm/swap.c
--- linux-2.6.23-...
Your email client performs space-stuffing. That means the recipient needs
I really don't like the games we play with page refcounts here. By the
time we consider a page, that page could have been freed and could be
reused for any purpose whatever. We don't have any business playing around
with the page's internal state when we're unsure of what it is being used
for.The page_count() test here is a bit of a worry, too. Why do we need it?
The caller must have pinned the page in some fashion else we couldn't use
it safely in this function at all.I assume that you discovered that once we've cleared PageWriteback(), the
page can get reclaimed elsewhere? If so, that could still happen
immediately after the page_count() test. It's all a bit of a worry.
Deferring the ClearPageWriteback() will fix any race concerns, but I doThe tricky part will be working out whether you've found all the places
which need to have move_tail_pages() added to them.pagevec_move_tail() needs a comment explaining that it must be called with
preemption disabled. Otherwise it has nasty races.Actually, given that pagevec_move_tail() is called from both interrupt and
non-interrupt context, I guess that it needs local_irq_save() protection as
well. In which case the preempt_disable() and preempt_enable() in
move_tail_pages() can be optimised away (use local_irq_save() and
__get_cpu_var()).So I do think that for safety and sanity's sake, we should be taking a ref
on the pages when they are in a pagevec. That's going to hurt your nice
performance numbers :(Please consider doing a single call to __count_vm_events() in
pagevec_move_tail(), instead of multiple calls to __count_vm_event().It's a bit of a concern that we can have ((num_online_cpus - 1) *
PAGEVEC_SIZE) pages outstanding in the other CPU's pagevecs. But I _guess_
that'll be OK - it's a relatively small number of pages.What happens during cpu hot-unplug? If we take a ref on the pageved'ed
pages then we have page...
I modified my patch based on your comment.
At 11:37 07/09/14, Andrew Morton wrote:
>
>So I do think that for safety and sanity's sake, we should be taking a ref
>on the pages when they are in a pagevec. That's going to hurt your nice
>performance numbers :(
>I did ping test again to observe performance deterioration caused by taking
a ref.-2.6.23-rc6-with-modifiedpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms
rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma
17.801/0.129 msThe result for my original patch is as follows.
-2.6.23-rc5-with-originalpatch
--- testmachine ping statistics ---
3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms
rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma
17.314/0.091 msThe influence to response was small.
Thanks.
Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp>
diff -Nrup linux-2.6.23-rc6.org/include/linux/swap.h
linux-2.6.23-rc6/include/linux/swap.h
--- linux-2.6.23-rc6.org/include/linux/swap.h 2007-09-14 16:49:57.000000000 +0900
+++ linux-2.6.23-rc6/include/linux/swap.h 2007-09-14 16:58:48.000000000 +0900
@@ -185,6 +185,7 @@ extern void FASTCALL(mark_page_accessed(
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern int rotate_reclaimable_page(struct page *page);
+extern void move_tail_pages(void);
extern void swap_setup(void);/* linux/mm/vmscan.c */
diff -Nrup linux-2.6.23-rc6.org/mm/swap.c linux-2.6.23-rc6/mm/swap.c
--- linux-2.6.23-rc6.org/mm/swap.c 2007-07-09 08:32:17.000000000 +0900
+++ linux-2.6.23-rc6/mm/swap.c 2007-09-18 18:49:07.000000000 +0900
@@ -94,24 +94,62 @@ void put_pages_list(struct list_head *pa
EXPORT_SYMBOL(put_pages_list);/*
+ * pagevec_move_tail() must be called with IRQ disabled.
+ * Otherwise this may cause nasty races.
+ */
+static void pagevec_move_tail(struct pagevec *pvec)
...
On Tue, 18 Sep 2007 19:41:14 +0900
well.. that's not really the test which will show up any regressions.
The extra get_page/put_page will affect things like kernel CPU utilisation
on fast writeout on a single CPU. Say, run a huge write against a fast
storage system on a single pentium 4 CPU and see how much the system CPU
time is increased.The kernel's internal cpu accounting probably won't be accurate enough to
get believeable numbers from a test like this - better to use the
subtractive approach: see http://www.zipworld.com.au/~akpm/linux/#zc
-
It would be interesting to test -mm kernels. They have a patch which reduces
zone lock contention quite a lot.I think your patch is a nice idea, and with less zone lock contention in other
areas, it is possible that it might produce a relatively larger improvement.
-
I'm a bit wobbly about this patch - it adds additional single-cpu overhead
to reduce multiple-cpu overhead and latency.
-
Yeah, that's true. Although maybe it gets significantly more after the
patch in -mm.Possibly other page batching sites have similar issues on UP... I wonder
if a type of pagevec that turns into a noop on UP would be interesting...
probably totally unmeasurable and not worth the cost of code
maintenance ;)
-
Yes, I wonder that. Some of the additional overhead will come from
the additional get_page/put_page which is needed for pagevec ownership.
-
Thank you for your comment.
At 11:37 07/09/14, Andrew Morton wrote:
>The page_count() test here is a bit of a worry, too. Why do we need it?
>The caller must have pinned the page in some fashion else we couldn't use
>it safely in this function at all.
>
>I assume that you discovered that once we've cleared PageWriteback(), the
>page can get reclaimed elsewhere? If so, that could still happen
>immediately after the page_count() test. It's all a bit of a worry.
>Deferring the ClearPageWriteback() will fix any race concerns, but I do
>think that we need to take a ref on the page for the pagevec ownership.
>Actually, I considered taking a ref to pin pages. But this could prevent
the page
reclaiming activity so I did not use it.I reflect your comment and send you modified patch.
-
| Greg KH | Re: Announce: Linux-next (Or Andrew's dream :-)) |
| Greg KH | [patch 26/73] NET: Correct two mistaken skb_reset_mac_header() conversions. |
| Greg Kroah-Hartman | [PATCH 007/196] Chinese: add translation of stable_kernel_rules.txt |
| Alan Cox | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
git: | |
| Alexey Dobriyan | Re: [GIT]: Networking |
| Gerrit Renker | [PATCH 03/37] dccp: List management for new feature negotiation |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Andrew Morton | Re: [BUG] New Kernel Bugs |
