Hi. While running some memory intensive load, system response deteriorated just after swap-out started. The cause of this problem is that when a PG_reclaim page is moved to the tail of the inactive LRU list in rotate_reclaimable_page(), lru_lock spin lock is acquired every page writeback . This deteriorates system performance and makes interrupt hold off time longer when swap-out started. Following patch solves this problem. I use pagevec in rotating reclaimable pages to mitigate LRU spin lock contention and reduce interrupt hold off time. I did a test that allocating and touching pages in multiple processes, and pinging to the test machine in flooding mode to measure response under memory intensive load. The test result is: -2.6.23-rc5 --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 53222ms rtt min/avg/max/mdev = 0.074/0.652/172.228/7.176 ms, pipe 11, ipg/ewma 17.746/0.092 ms -2.6.23-rc5-patched --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms Max round-trip-time was improved. The test machine spec is that 4CPU(3.16GHz, Hyper-threading enabled) 8GB memory , 8GB swap. Thanks. Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> diff -Nrup linux-2.6.23-rc5.org/include/linux/swap.h linux-2.6.23-rc5/include/linux/swap.h --- linux-2.6.23-rc5.org/include/linux/swap.h 2007-09-06 18:44:06.000000000 +0900 +++ linux-2.6.23-rc5/include/linux/swap.h 2007-09-06 18:45:28.000000000 +0900 @@ -185,6 +185,7 @@ extern void FASTCALL(mark_page_accessed( extern void lru_add_drain(void); extern int lru_add_drain_all(void); extern int rotate_reclaimable_page(struct page *page); +extern void move_tail_pages(void); extern void swap_setup(void); /* linux/mm/vmscan.c */ diff -Nrup linux-2.6.23-rc5.org/mm/swap.c linux-2.6.23-rc5/mm/swap.c --- linux-2.6.23-...
Your email client performs space-stuffing. That means the recipient needs I really don't like the games we play with page refcounts here. By the time we consider a page, that page could have been freed and could be reused for any purpose whatever. We don't have any business playing around with the page's internal state when we're unsure of what it is being used for. The page_count() test here is a bit of a worry, too. Why do we need it? The caller must have pinned the page in some fashion else we couldn't use it safely in this function at all. I assume that you discovered that once we've cleared PageWriteback(), the page can get reclaimed elsewhere? If so, that could still happen immediately after the page_count() test. It's all a bit of a worry. Deferring the ClearPageWriteback() will fix any race concerns, but I do The tricky part will be working out whether you've found all the places which need to have move_tail_pages() added to them. pagevec_move_tail() needs a comment explaining that it must be called with preemption disabled. Otherwise it has nasty races. Actually, given that pagevec_move_tail() is called from both interrupt and non-interrupt context, I guess that it needs local_irq_save() protection as well. In which case the preempt_disable() and preempt_enable() in move_tail_pages() can be optimised away (use local_irq_save() and __get_cpu_var()). So I do think that for safety and sanity's sake, we should be taking a ref on the pages when they are in a pagevec. That's going to hurt your nice performance numbers :( Please consider doing a single call to __count_vm_events() in pagevec_move_tail(), instead of multiple calls to __count_vm_event(). It's a bit of a concern that we can have ((num_online_cpus - 1) * PAGEVEC_SIZE) pages outstanding in the other CPU's pagevecs. But I _guess_ that'll be OK - it's a relatively small number of pages. What happens during cpu hot-unplug? If we take a ref on the pageved'ed pages then we have page...
I modified my patch based on your comment. At 11:37 07/09/14, Andrew Morton wrote: > >So I do think that for safety and sanity's sake, we should be taking a ref >on the pages when they are in a pagevec. That's going to hurt your nice >performance numbers :( > I did ping test again to observe performance deterioration caused by taking a ref. -2.6.23-rc6-with-modifiedpatch --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 53386ms rtt min/avg/max/mdev = 0.074/0.110/4.716/0.147 ms, pipe 2, ipg/ewma 17.801/0.129 ms The result for my original patch is as follows. -2.6.23-rc5-with-originalpatch --- testmachine ping statistics --- 3000 packets transmitted, 3000 received, 0% packet loss, time 51924ms rtt min/avg/max/mdev = 0.072/0.108/3.884/0.114 ms, pipe 2, ipg/ewma 17.314/0.091 ms The influence to response was small. Thanks. Signed-off-by :Hisashi Hifumi <hifumi.hisashi@oss.ntt.co.jp> diff -Nrup linux-2.6.23-rc6.org/include/linux/swap.h linux-2.6.23-rc6/include/linux/swap.h --- linux-2.6.23-rc6.org/include/linux/swap.h 2007-09-14 16:49:57.000000000 +0900 +++ linux-2.6.23-rc6/include/linux/swap.h 2007-09-14 16:58:48.000000000 +0900 @@ -185,6 +185,7 @@ extern void FASTCALL(mark_page_accessed( extern void lru_add_drain(void); extern int lru_add_drain_all(void); extern int rotate_reclaimable_page(struct page *page); +extern void move_tail_pages(void); extern void swap_setup(void); /* linux/mm/vmscan.c */ diff -Nrup linux-2.6.23-rc6.org/mm/swap.c linux-2.6.23-rc6/mm/swap.c --- linux-2.6.23-rc6.org/mm/swap.c 2007-07-09 08:32:17.000000000 +0900 +++ linux-2.6.23-rc6/mm/swap.c 2007-09-18 18:49:07.000000000 +0900 @@ -94,24 +94,62 @@ void put_pages_list(struct list_head *pa EXPORT_SYMBOL(put_pages_list); /* + * pagevec_move_tail() must be called with IRQ disabled. + * Otherwise this may cause nasty races. + */ +static void pagevec_move_tail(struct pagevec *pvec) ...
On Tue, 18 Sep 2007 19:41:14 +0900 well.. that's not really the test which will show up any regressions. The extra get_page/put_page will affect things like kernel CPU utilisation on fast writeout on a single CPU. Say, run a huge write against a fast storage system on a single pentium 4 CPU and see how much the system CPU time is increased. The kernel's internal cpu accounting probably won't be accurate enough to get believeable numbers from a test like this - better to use the subtractive approach: see http://www.zipworld.com.au/~akpm/linux/#zc -
It would be interesting to test -mm kernels. They have a patch which reduces zone lock contention quite a lot. I think your patch is a nice idea, and with less zone lock contention in other areas, it is possible that it might produce a relatively larger improvement. -
I'm a bit wobbly about this patch - it adds additional single-cpu overhead to reduce multiple-cpu overhead and latency. -
Yeah, that's true. Although maybe it gets significantly more after the patch in -mm. Possibly other page batching sites have similar issues on UP... I wonder if a type of pagevec that turns into a noop on UP would be interesting... probably totally unmeasurable and not worth the cost of code maintenance ;) -
Yes, I wonder that. Some of the additional overhead will come from the additional get_page/put_page which is needed for pagevec ownership. -
Thank you for your comment. At 11:37 07/09/14, Andrew Morton wrote: >The page_count() test here is a bit of a worry, too. Why do we need it? >The caller must have pinned the page in some fashion else we couldn't use >it safely in this function at all. > >I assume that you discovered that once we've cleared PageWriteback(), the >page can get reclaimed elsewhere? If so, that could still happen >immediately after the page_count() test. It's all a bit of a worry. >Deferring the ClearPageWriteback() will fix any race concerns, but I do >think that we need to take a ref on the page for the pagevec ownership. > Actually, I considered taking a ref to pin pages. But this could prevent the page reclaiming activity so I did not use it. I reflect your comment and send you modified patch. -
| Linus Torvalds | Linux 2.6.21-rc5 |
| Linus Torvalds | Linux 2.6.26-rc4 |
| Christoph Hellwig | Re: [PATCH] Version 3 (2.6.23-rc8) Smack: Simplified Mandatory Access Control Kernel |
| Bryan Woods | Stardom SATA HSM violation |
git: | |
| Linus Torvalds | People unaware of the importance of "git gc"? |
| Jan Holesovsky | [PATCH] RFC: git lazy clone proof-of-concept |
| Linus Torvalds | cleaner/better zlib sources? |
| martin f krafft | Re: Track /etc directory using Git |
| Chris | OpenBSD 4.4 installation error: write failed; file system full |
| Brian A. Seklecki | sshd_config(5) PermitRootLogin yes |
| steve szmidt | Re: The Atheros story in much fewer words |
| David Newman | setting dscp or tos bits |
| Jim Winstead Jr. | Re: Root Disk/Book Disk Compatibility |
| Jan Nicolai Langfeldt | Re: Hypenation problems under LaTeX. |
| Linus Torvalds | Re: Missing linux/delay.h??? |
| Stew Ellis | Trouble with minicom scripts |
