From: Dave Chinner <dchinner@redhat.com>
When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns. This problem was recently encountered on x86_64 systems with
8k stacks running XFS with simple storage configurations.
Writeback from direct reclaim also adversely affects background writeback. The
background flusher threads should already be taking care of cleaning dirty
pages, and direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO patterns from
the background flusher threads to be upset by LRU-order writeback from
pageout() which can be effectively random IO. Having competing sources of IO
trying to clean pages on the same backing device reduces throughput by
increasing the number of seeks that the backing device has to do to write back
the pages.
Hence for direct reclaim we should not allow ->writepage to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.
Reported-by: John Berthels <john@humyo.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
mm/vmscan.c | 13 ++++++-------
1 files changed, 6 insertions(+), 7 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..5321ac4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* writeout. So in laptop mode, write out the whole world.
*/
writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
- if (total_scanned > writeback_threshold) {
+ if (total_scanned > writeback_threshold)
wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
- sc->may_writepage = 1;
- }
/* Take a nap, wait for some writeback to complete */
if (!sc->hibernation_mode && ...

Ummm.. This patch is harder to ack. Its pros and cons seem to be:

Pros:
  1) prevents XFS stack overflow
  2) improves IO workload performance

Cons:
  3) TOTALLY kills lumpy reclaim (i.e. high order allocation)

So, if we only needed to consider IO workloads, there would be no downside. But we can't.

I think (1) is an XFS issue. XFS should take care of it itself. But (2) is really a VM issue. Right now our VM calls pageout() too aggressively and decreases IO throughput. I've heard about this issue from Chris (cc'ed). I'd like to fix this, but we can never kill pageout() completely because we can't assume users don't run high order allocation workloads.

(Perhaps Mel's memory compaction code will improve a lot and we can kill lumpy reclaim in the future, but that's another story.)
--
The filesystem is irrelevant, IMO.

The traces from the reporter showed that we've got close to a 2k stack footprint for memory allocation to direct reclaim and then we can put the entire writeback path on top of that. This is roughly 3.5k for XFS, and then depending on the storage subsystem configuration and transport can be another 2k of stack needed below XFS.

IOWs, if we completely ignore the filesystem stack usage, there's still up to 4k of stack needed in the direct reclaim path. Given that one of the stack traces supplied shows direct reclaim being entered with over 3k of stack already used, pretty much any filesystem is capable of blowing an 8k stack.

So, this is not an XFS issue, even though XFS is the first to

I didn't expect this to be easy. ;)

I had a good look at what the code was doing before I wrote the patch, and IMO, there is no good reason for issuing IO from direct reclaim.

My reasoning is as follows - consider a system with a typical SATA disk where the machine is low on memory and in direct reclaim.

Direct reclaim is taking pages off the end of the LRU and writing them one at a time from there. It is scanning thousands of pages and it triggers IO on the dirty ones it comes across. This is done with no regard to the IO patterns it generates - it can (and frequently does) result in completely random single page IO patterns hitting the disk, and as a result cleaning pages happens really, really slowly. If we are in an OOM situation, the machine will grind to a halt as it struggles to clean maybe 1MB of RAM per second.

On the other hand, if the IO is well formed then the disk might be capable of 100MB/s. The background flusher threads and filesystems try very hard to issue well formed IOs, so the difference in the rate that memory can be cleaned may be a couple of orders of magnitude.

(Of course, the difference will typically be somewhere in between these two extremes, but I'm simply trying to illustrate how big the difference in ...
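The stack arithmetic in the message above can be checked with a short sketch. The byte counts are the approximate figures quoted in the text; the variable and function names are illustrative only and do not correspond to kernel identifiers.

```python
# Rough stack budget for writeback issued from direct reclaim.
# All figures are the approximate byte counts quoted in the message
# above; the names are illustrative, not kernel identifiers.
STACK_SIZE = 8 * 1024          # x86_64 kernel stack discussed in the thread

used_on_entry = 3 * 1024       # "over 3k of stack already used"
alloc_and_reclaim = 2 * 1024   # memory allocation down to direct reclaim
storage_below_fs = 2 * 1024    # storage subsystem and transport, upper figure
xfs_writeback = 3584           # "roughly 3.5k for XFS"


def headroom_for_filesystem() -> int:
    """Bytes left for the filesystem's own writeback path."""
    return STACK_SIZE - (used_on_entry + alloc_and_reclaim + storage_below_fs)


def overruns(fs_bytes: int) -> bool:
    """True if a filesystem needing fs_bytes would exceed the stack."""
    return fs_bytes > headroom_for_filesystem()


print("headroom for filesystem:", headroom_for_filesystem(), "bytes")
print("XFS at ~3.5k overruns:", overruns(xfs_writeback))
```

With these inputs only about 1k is left for the filesystem itself, which is the sense in which "pretty much any filesystem" could overrun an 8k stack in that trace.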
Thanks for the explanation. I hadn't noticed that direct reclaim consumes 2k of stack. I'll investigate it and try to put it on a diet.

Well, you seem to be continuing to discuss IO workloads. I don't disagree on that point.

For example, if only order-0 reclaim skips pageout(), we will get the above

Lumpy reclaim is for allocating high order pages. So it reclaims not only the LRU head page, but also its PFN neighborhood. PFN neighbors are often new pages and still dirty, so we force pageout() to clean them and then discard them.

When a high order allocation occurs, we don't only need to free enough memory, we also need to free a large enough contiguous memory block.

If we needed to consider _only_ IO throughput, waiting for the flusher thread might be faster, but actually we also need to consider reclaim

It does. Lumpy reclaim doesn't grab the last N pages. Instead it grabs contiguous

So, can you please run two workloads concurrently?

 - Normal IO workload (fio, iozone, etc..)
 - echo $NUM > /proc/sys/vm/nr_hugepages

The most typical high order allocations come from brutal wireless LAN drivers (or some cheap LAN cards). But sadly, if the test depends on specific hardware, our discussion could easily turn into a mess. So I hope to use the hugepage feature instead.

Thanks.
--
It hasn't grown in the last 2 years after the last major diet where all the fat was trimmed from it in the last round of the i386 4k stack vs XFS saga. It seems that everything else around XFS has

Ok, I see that now - I missed the second call to __isolate_lru_pages()

Agreed, that was why I was kind of surprised not to find it was

True, but without knowing how to test and measure such things I can't

What do I measure/observe/record that is meaningful?

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
So, a rough as guts first pass - just run a large dd (8 times the
size of memory - 8GB file vs 1GB RAM) and repeatedly try to allocate
the entirety of memory in huge pages (500) every 5 seconds. The IO
rate is roughly 100MB/s, so it takes 75-85s to complete the dd.
The script:
$ cat t.sh
#!/bin/bash
echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches
dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &
(
for i in `seq 1 1 20`; do
sleep 5
/usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
grep HugePages_Total /proc/meminfo
done
) | awk '
/wall/ { wall += $2; cnt += 1 }
/Pages/ { pages[cnt] = $2 }
END { printf "average wall time %f\nPages step: ", wall / cnt ;
for (i = 1; i <= cnt; i++) {
printf "%d ", pages[i];
}
}'
----
And the output looks like:
$ sudo ./t.sh
average wall time 0.954500
Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
$
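Output in this two-line format can be averaged across several runs. The av.awk script used below is not shown in the thread, so the following is only a hypothetical stand-in that assumes the exact output format printed by t.sh above; the function names are made up for illustration.

```python
# Hypothetical stand-in for an averaging step over several t.sh runs.
# It assumes the two-line output format shown above and is not the
# av.awk script referred to in the thread.
def parse_run(text):
    """Return (wall_time, [pages per step]) from one run's output."""
    wall = None
    pages = []
    for line in text.splitlines():
        if line.startswith("average wall time"):
            wall = float(line.split()[-1])
        elif line.startswith("Pages step:"):
            pages = [int(tok) for tok in line.split(":", 1)[1].split()]
    if wall is None or not pages:
        raise ValueError("unrecognised run output")
    return wall, pages


def average_runs(runs):
    """Average wall time and per-step page counts across runs."""
    parsed = [parse_run(r) for r in runs]
    n = len(parsed)
    steps = min(len(p) for _, p in parsed)
    avg_wall = sum(w for w, _ in parsed) / n
    avg_pages = [sum(p[i] for _, p in parsed) / n for i in range(steps)]
    return avg_wall, avg_pages


sample = (
    "average wall time 0.954500\n"
    "Pages step: 97 101 101 121 173 173 173 173 173 173 "
    "175 194 195 195 202 220 226 419 423 426\n"
)
wall, pages = average_runs([sample])
print("av. wall = %f secs" % wall)
print("av Pages step:", " ".join("%d" % p for p in pages))
```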
Run 50 times in a loop with the outputs averaged, the existing lumpy
reclaim resulted in:
dave@test-1:~$ cat current.txt | awk -f av.awk
av. wall = 0.519385 secs
av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420
And with my patch that disables ->writepage:
dave@test-1:~$ cat no-direct.txt | awk -f av.awk
av. wall = 0.554163 secs
av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439
Basically, with my patch lumpy reclaim was *substantially* more
effective with only a slight increase in average allocation latency
with this test case.
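The two averaged result sets above can be compared step by step. This is just arithmetic on the quoted numbers; the variable names are illustrative.

```python
# Averaged results quoted above (50 runs each): huge pages allocated
# at each 5 second step, and average wall time per allocation attempt.
current = [192, 228, 242, 255, 265, 272, 279, 284, 289, 294,
           298, 303, 307, 322, 342, 366, 383, 401, 412, 420]
patched = [231, 283, 310, 316, 323, 328, 336, 340, 345, 351,
           356, 359, 364, 377, 388, 397, 413, 423, 432, 439]
current_wall = 0.519385
patched_wall = 0.554163

# Extra huge pages obtained with the patch at each step.
gains = [p - c for c, p in zip(current, patched)]
mean_gain = sum(gains) / len(gains)
latency_ratio = patched_wall / current_wall

print("per-step gain:", gains)
print("mean gain: %.2f pages" % mean_gain)
print("wall time ratio: %.3f" % latency_ratio)
```

In this data set the patched kernel is ahead at every step, while the average wall time rises by a few percent.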
I need to add a marker to the output that records when the dd
completes, but from monitoring the writeback rates via PCP, they
were in the ballpark of 85-100MB/s for the existing code, and
95-110MB/s with my patch. Hence it improved both IO throughput and
the effectiveness ...

Ummm... Probably, I have to say I'm sorry. I guess my last mail gave you the wrong impression. To be honest, I'm not interested in this artificial non-fragmented case. The above test case 1) discards all cache and 2) fills pages by streaming IO. That creates an artificial "file offset neighbor == block neighbor == PFN neighbor" situation, so file-offset-order writeout by the flusher thread can effectively produce PFN-contiguous pages.

Why am I not interested in it? Because lumpy reclaim is a technique for avoiding an external fragmentation mess. IOW, it is for avoiding the worst case, but your test case seems to measure the best one.
--
And to be brutally honest, I'm not interested in wasting my time trying to come up with a test case that you are interested in.

Instead, can you please provide me with your test cases (scripts, preferably) that you use to measure the effectiveness of

Yes, that's true, but it does indicate that in that situation, it is more effective than the current code. FWIW, in the case of HPC applications (which often use huge pages and clear the cache before starting a new job), large streaming IO is a pretty common IO pattern, so I don't think this situation is as artificial as you are

Then please provide test cases that you consider valid.

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I have a dumb question: if XFS hasn't bloated its stack usage, why does 3.5k of stack usage work fine on a 4k stack kernel? It seems impossible.

Please don't think I'm blaming you. I don't know what the "4k stack vs XFS saga" is.

Agreed. I know making a VM measurement benchmark is very difficult, but it is probably necessary....
--
Because on a 32 bit kernel it's somewhere between 2-2.5k of stack space. That being said, XFS _will_ blow a 4k stack on anything other than the most basic storage configurations, and if you run out of

Over a period of years there were repeated attempts to make the default stack size on i386 4k, despite it being known to cause problems on relatively common configurations. Every time it was brought up it was rejected, but every few months somebody else made an attempt to make it the default. There was a lot of flamage directed at XFS because it was seen as the reason that 4k stacks were not made the default....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
It's already known that the VM requesting specific pages be cleaned and reclaimed is a bad IO pattern but unfortunately it is still required by lumpy reclaim. This change would appear to break that although I haven't tested it to be 100% sure.

Even without high-order considerations, this patch would appear to make fairly large changes to how direct reclaim behaves. It would no longer wait on page writeback for example so direct reclaim will return sooner than it did potentially going OOM if there were a lot of dirty pages and

If an FS caller cannot re-enter the FS, it should be using GFP_NOFS

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
AFAICT it still waits for pages under writeback in exactly the same manner
it does now. shrink_page_list() does the following completely
separately to the sc->may_writepage flag:
        may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
                (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

        if (PageWriteback(page)) {
                /*
                 * Synchronous reclaim is performed in two passes,
                 * first an asynchronous pass over the list to
                 * start parallel writeback, and a second synchronous
                 * pass to wait for the IO to complete. Wait here
                 * for any page for which writeback has already
                 * started.
                 */
                if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
                        wait_on_page_writeback(page);
                else
                        goto keep_locked;
        }

So if the page is under writeback, PAGEOUT_IO_SYNC is set and
we can enter the fs, it will still wait for writeback to complete
just like it does now.
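The decision made by the quoted fragment can be modelled in a few lines. This is a toy model only: the flag bit values are placeholders rather than the kernel's definitions, and the function name and returned strings are made up for illustration.

```python
# Toy model of the shrink_page_list() fragment quoted above.
# The flag bit values are placeholders, not the kernel's definitions.
GFP_IO = 0x1   # stands in for __GFP_IO
GFP_FS = 0x2   # stands in for __GFP_FS

PAGEOUT_IO_ASYNC = 0
PAGEOUT_IO_SYNC = 1


def writeback_action(gfp_mask, in_swap_cache, under_writeback, sync_writeback):
    """Return what the quoted code does with a page.

    Note that sc->may_writepage is not an input to this decision.
    """
    may_enter_fs = bool(gfp_mask & GFP_FS) or (
        in_swap_cache and bool(gfp_mask & GFP_IO))
    if under_writeback:
        if sync_writeback == PAGEOUT_IO_SYNC and may_enter_fs:
            return "wait_on_page_writeback"
        return "keep_locked"
    return "continue"


print(writeback_action(GFP_FS | GFP_IO, False, True, PAGEOUT_IO_SYNC))
print(writeback_action(GFP_FS | GFP_IO, False, True, PAGEOUT_IO_ASYNC))
```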
However, the current code only uses PAGEOUT_IO_SYNC in lumpy
reclaim, so for most typical workloads direct reclaim does not wait
on page writeback, either. Hence, this patch doesn't appear to
change the actions taken on a page under writeback in direct

I did a fair bit of low/small memory testing. This is a subjective
observation, but I definitely seemed to get less severe OOM
situations and better overall responsiveness with this patch than

This problem is not a filesystem recursion problem which is, as I
understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
code that uses significant stack before trying to allocate memory
that is the problem. e.g. a select() system call:
...

Depends. For raw effectiveness, I run a series of performance-related benchmarks with a final test that

o Starts a number of parallel compiles that in combination are 1.25 times physical memory in total size
o Sleeps three minutes
o Starts allocating huge pages, recording the latency required for each one
o Records overall success rate and graphs latency over time

Right, so it'll still wait on writeback but won't kick it off. That would still be a fairly significant change in behaviour though. Think of synchronous lumpy reclaim for example where it queues up a contiguous

But it would be no longer queueing them for writeback so it'd be depending heavily on kswapd or a background cleaning daemon to clean

No, but it does queue them back on the LRU where they might be clean the next time they are found on the list. How significant a problem this is I couldn't tell you but it could show a corner case where a large number of direct reclaimers are encountering dirty pages frequently and

It does, but indirectly. The impact is very direct for lumpy reclaim obviously. For other direct reclaim, pages that were at the end of the LRU list are no longer getting cleaned before doing another lap through the LRU list.

And it is possible that it is best overall if only kswapd and the background cleaner are queueing pages for IO. All I can say for sure is that this does appear to hurt lumpy reclaim and does affect normal

I'm not denying the evidence but how has it been gotten away with for years then? Prevention of writeback isn't the answer without figuring out how direct reclaimers can queue pages for IO and in the case of lumpy reclaim doing sync IO, then waiting on those pages.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
So, I've been reading along, nodding my head to Dave's side of things because seeks are evil and direct reclaim makes seeks. I'd really love for direct reclaim to somehow trigger writepages on large chunks instead of doing page by page spatters of IO to the drive.

But, somewhere along the line I overlooked the part of Dave's stack trace that said:

 43)     1568     912   do_select+0x3d6/0x700

Huh, 912 bytes...for select, really? From poll.h:

/* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
   additional memory. */
#define MAX_STACK_ALLOC 832
#define FRONTEND_STACK_ALLOC    256
#define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
#define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
#define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))

So, select is intentionally trying to use that much stack. It should be using GFP_NOFS if it really wants to suck down that much stack...if only the kernel had some sort of way to dynamically allocate ram, it could try that too.

-chris
--
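The poll.h arithmetic quoted above works out as follows. The size of struct poll_table_entry is not given in the thread, so the 64 bytes used here is purely an assumption for illustration; the real value depends on architecture and kernel version.

```python
# Constants copied from the poll.h excerpt quoted above.
MAX_STACK_ALLOC = 832
FRONTEND_STACK_ALLOC = 256
WQUEUES_STACK_ALLOC = MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC

# Frame size of do_select() from the quoted stack trace line.
DO_SELECT_FRAME = 912

# ASSUMPTION: sizeof(struct poll_table_entry) is taken as 64 bytes.
# This value is not stated in the thread.
ASSUMED_ENTRY_SIZE = 64


def inline_poll_entries(entry_size):
    """Entries that fit in the on-stack wait queue area (integer division)."""
    return WQUEUES_STACK_ALLOC // entry_size


print("wait queue area:", WQUEUES_STACK_ALLOC, "bytes")
print("inline entries (assumed size):", inline_poll_entries(ASSUMED_ENTRY_SIZE))
print("frame beyond MAX_STACK_ALLOC:", DO_SELECT_FRAME - MAX_STACK_ALLOC, "bytes")
```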
Perhaps drop the lock on the page if it is held and call one of the helpers that filesystems use to do this, like:

Sure, it's bad, but focussing on the specific case misses the point that even code that is using minimal stack can enter direct reclaim after consuming 1.5k of stack. e.g.:

 50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
 51)     3104     384   shrink_page_list+0x65e/0x840
 52)     2720     528   shrink_zone+0x63f/0xe10
 53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
 54)     2080     128   try_to_free_pages+0x77/0x80
 55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
 56)     1712      48   alloc_pages_current+0x8c/0xe0
 57)     1664      32   __page_cache_alloc+0x67/0x70
 58)     1632     144   __do_page_cache_readahead+0xd3/0x220
 59)     1488      16   ra_submit+0x21/0x30
 60)     1472      80   ondemand_readahead+0x11d/0x250
 61)     1392      64   page_cache_async_readahead+0xa9/0xe0
 62)     1328     592   __generic_file_splice_read+0x48a/0x530
 63)      736      48   generic_file_splice_read+0x4f/0x90
 64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
 65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
 66)      560      64   do_splice_to+0x77/0xb0
 67)      496     112   splice_direct_to_actor+0xcc/0x1c0
 68)      384      80   do_splice_direct+0x57/0x80
 69)      304      96   do_sendfile+0x16c/0x1e0
 70)      208      80   sys_sendfile64+0x8d/0xb0
 71)      128     128   system_call_fastpath+0x16/0x1b

Yes, __generic_file_splice_read() is a hog, but they seem to be

The code that did the allocation is called from multiple different contexts - how is it supposed to know that in some of those contexts it is supposed to treat memory allocation differently?

This is my point - if you introduce a new semantic to memory allocation that is "use GFP_NOFS when you are using too much stack" and too much stack is more than 15% of the stack, then pretty much every code path

Sure, but to play the devil's ...
On Wed, 14 Apr 2010 11:40:41 +1000

A bit OFF TOPIC.

Could you share the disassembly of shrink_zone()?

In my environment:

00000000000115a0 <shrink_zone>:
   115a0:       55                      push   %rbp
   115a1:       48 89 e5                mov    %rsp,%rbp
   115a4:       41 57                   push   %r15
   115a6:       41 56                   push   %r14
   115a8:       41 55                   push   %r13
   115aa:       41 54                   push   %r12
   115ac:       53                      push   %rbx
   115ad:       48 83 ec 78             sub    $0x78,%rsp
   115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
   115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)

The disassembly seems to show 0x78 bytes of stack, and no changes to %rsp until return. I may be misunderstanding something...

Thanks,
-Kame
--
I see the same. I didn't compile those kernels, though. IIUC, they were built through the Ubuntu build infrastructure, so there is something different in terms of compiler, compiler options or config to what we are both using. Most likely it is the compiler inlining, though Chris's patches to prevent that didn't seem to change the stack usage.

I'm trying to get a stack trace from the kernel that has shrink_zone in it, but I haven't succeeded yet....

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
I also got 0x78 bytes of stack usage. Umm.. Are we discussing the real issue now?
--
On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro
In my case, it's 0x110 bytes on a 32 bit machine.
I think it's possible on a 64 bit machine.
00001830 <shrink_zone>:
1830: 55 push %ebp
1831: 89 e5 mov %esp,%ebp
1833: 57 push %edi
1834: 56 push %esi
1835: 53 push %ebx
1836: 81 ec 10 01 00 00 sub $0x110,%esp
183c: 89 85 24 ff ff ff mov %eax,-0xdc(%ebp)
1842: 89 95 20 ff ff ff mov %edx,-0xe0(%ebp)
1848: 89 8d 1c ff ff ff mov %ecx,-0xe4(%ebp)
184e: 8b 41 04 mov 0x4(%ecx)
My gcc is as follows.
barrios@barriostarget:~/mmotm$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.3.3-5ubuntu4'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)
Does it depend on the config?
--
Kind regards,
Minchan Kim
I marked shrink_list as noinline_for_stack.
The result is as follows.
00001fe0 <shrink_zone>:
1fe0: 55 push %ebp
1fe1: 89 e5 mov %esp,%ebp
1fe3: 57 push %edi
1fe4: 56 push %esi
1fe5: 53 push %ebx
1fe6: 83 ec 4c sub $0x4c,%esp
1fe9: 89 45 c0 mov %eax,-0x40(%ebp)
1fec: 89 55 bc mov %edx,-0x44(%ebp)
1fef: 89 4d b8 mov %ecx,-0x48(%ebp)
0x110 -> 0x4c.
Should we add noinline_for_stack to shrink_list?
--
Kind regards,
Minchan Kim
--
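For reference, the frame reservations quoted in the disassembly listings above convert to decimal as follows. This only restates the immediate operands of the "sub" instructions shown in the thread; it does not include the pushed registers or return address.

```python
# Immediate operands of "sub $imm,%rsp" / "sub $imm,%esp" quoted above.
frames = {
    "x86_64 shrink_zone": 0x78,
    "i386 shrink_zone, default inlining": 0x110,
    "i386 shrink_zone, noinline_for_stack": 0x4c,
}

for name, size in frames.items():
    print("%-40s %4d bytes" % (name, size))

# Bytes no longer reserved in shrink_zone()'s own frame on i386.
saved = frames["i386 shrink_zone, default inlining"] - \
    frames["i386 shrink_zone, noinline_for_stack"]
print("reduction in shrink_zone frame: %d bytes" % saved)
```

As noted later in the thread, this reduction applies to shrink_zone()'s own frame; the uninlined function still needs a frame of its own when it is called.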
On Wed, 14 Apr 2010 16:19:02 +0900

Hmm. About shrink_zone(), I don't think uninlining functions directly called by shrink_zone() will help. The total stack size of the call chain will still be big.

Thanks,
-Kame
--
On Wed, Apr 14, 2010 at 6:42 PM, KAMEZAWA Hiroyuki

Absolutely. But stack usage above 500 bytes makes it one of the hogs, and uninlining is not critical to reclaim performance, so I think we have more to gain than to lose.

But I'm in no hurry. An ad hoc approach is not good. I hope that when Mel tackles stack consumption in the reclaim path, he modifies this part, too.

--
Kind regards,
Minchan Kim
--
Bear in mind that uninlining can slightly increase the stack usage in some cases because arguments, return addresses and the like have to be pushed onto the stack. Inlining or uninlining is only the answer when it reduces the

It'll be at least two days before I get the chance to try. A lot of the temporary variables used in the reclaim path have existed for some time so it will take a while.

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--
Yes, I had totally missed that. Thanks, Mel.

--
Kind regards,
Minchan Kim
--
Ok, so here's a trace at the top of the stack from a kernel with
the above shrink_zone disassembly:
$ cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (49 entries)
----- ---- --------
0) 6152 112 force_qs_rnp+0x58/0x150
1) 6040 48 force_quiescent_state+0x1a7/0x1f0
2) 5992 48 __call_rcu+0x13d/0x190
3) 5944 16 call_rcu_sched+0x15/0x20
4) 5928 16 call_rcu+0xe/0x10
5) 5912 240 radix_tree_delete+0x14a/0x2d0
6) 5672 32 __remove_from_page_cache+0x21/0x110
7) 5640 64 __remove_mapping+0x86/0x100
8) 5576 272 shrink_page_list+0x2fd/0x5a0
9) 5304 400 shrink_inactive_list+0x313/0x730
10) 4904 176 shrink_zone+0x3d1/0x490
11) 4728 128 do_try_to_free_pages+0x2b6/0x380
12) 4600 112 try_to_free_pages+0x5e/0x60
13) 4488 272 __alloc_pages_nodemask+0x3fb/0x730
14) 4216 48 alloc_pages_current+0x87/0xd0
15) 4168 32 __page_cache_alloc+0x67/0x70
16) 4136 80 find_or_create_page+0x4f/0xb0
17) 4056 160 _xfs_buf_lookup_pages+0x150/0x390
.....
So the differences are most likely from the compiler doing
automatic inlining of static functions...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
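The Depth and Size columns of the stack_trace excerpt above can be cross-checked mechanically. The sketch below copies the 18 rows shown in the message, verifies that each depth minus its frame size equals the next depth, and sums the frames belonging to the allocation and reclaim path. Which functions count as "the reclaim path" is a choice made here for illustration.

```python
# Rows copied from the stack_trace excerpt above.
TRACE = """\
  0)     6152     112   force_qs_rnp+0x58/0x150
  1)     6040      48   force_quiescent_state+0x1a7/0x1f0
  2)     5992      48   __call_rcu+0x13d/0x190
  3)     5944      16   call_rcu_sched+0x15/0x20
  4)     5928      16   call_rcu+0xe/0x10
  5)     5912     240   radix_tree_delete+0x14a/0x2d0
  6)     5672      32   __remove_from_page_cache+0x21/0x110
  7)     5640      64   __remove_mapping+0x86/0x100
  8)     5576     272   shrink_page_list+0x2fd/0x5a0
  9)     5304     400   shrink_inactive_list+0x313/0x730
 10)     4904     176   shrink_zone+0x3d1/0x490
 11)     4728     128   do_try_to_free_pages+0x2b6/0x380
 12)     4600     112   try_to_free_pages+0x5e/0x60
 13)     4488     272   __alloc_pages_nodemask+0x3fb/0x730
 14)     4216      48   alloc_pages_current+0x87/0xd0
 15)     4168      32   __page_cache_alloc+0x67/0x70
 16)     4136      80   find_or_create_page+0x4f/0xb0
 17)     4056     160   _xfs_buf_lookup_pages+0x150/0x390
"""

# Illustrative choice of which frames to attribute to allocation/reclaim.
RECLAIM_PATH = {
    "__alloc_pages_nodemask", "try_to_free_pages", "do_try_to_free_pages",
    "shrink_zone", "shrink_inactive_list", "shrink_page_list",
}


def parse_trace(text):
    """Return a list of (depth, size, function) tuples."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 4 or not fields[0].endswith(")"):
            continue
        rows.append((int(fields[1]), int(fields[2]), fields[3].split("+")[0]))
    return rows


def is_consistent(rows):
    """Each depth minus its own frame size should equal the next depth."""
    return all(d - s == nd
               for (d, s, _f), (nd, _s, _g) in zip(rows, rows[1:]))


rows = parse_trace(TRACE)
reclaim_bytes = sum(size for _, size, func in rows if func in RECLAIM_PATH)
print("rows:", len(rows), "consistent:", is_consistent(rows))
print("allocation + reclaim frames:", reclaim_bytes, "bytes")
```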
I agree that "seeks are evil and direct reclaim makes seeks". Actually,

Sorry, I've lost track of what you're talking about. Why do we need per-file waiting?

checkstack.pl says do_select() and __generic_file_splice_read() are among the worst stack consumers. Both should be fixed.

Nodding my head to Dave's side. Changing the caller's arguments doesn't seem like a good solution. I mean

 - do_select() should use GFP_KERNEL instead of the stack (i.e. revert 70674f95c0)
 - reclaim and xfs (and anything else) need to diet.

Your explanation is very interesting. I have a (probably dumb) question. Why has nobody faced a stack overflow issue in the past? Now I think every user

Yeah. My answer is simple: all stack eaters should be fixed. But XFS doesn't seem innocent either. 3.5K is big enough, although XFS has used that amount for a very long time.

===========================================================
Subject: [PATCH] kconfig: reduce FRAME_WARN default value to 512

Surprisingly, several odd functions now use a lot of stack.

% objdump -d vmlinux | ./scripts/checkstack.pl
0xffffffff81e3db07 get_next_block [vmlinux]:            1976
0xffffffff8130b9bd node_read_meminfo [vmlinux]:         1240
0xffffffff811553fd do_sys_poll [vmlinux]:               1000
0xffffffff8122b49d test_aead [vmlinux]:                 904
0xffffffff81154c9d do_select [vmlinux]:                 888
0xffffffff81168d9d default_file_splice_read [vmlinux]:  760

Oh well, every developer has to pay attention to stack usage! Thus, this patch reduces the FRAME_WARN default value to 512.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 lib/Kconfig.debug |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ff01710..44ebba6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -28,8 +28,7 @@ config ENABLE_MUST_CHECK
 config FRAME_WARN
 	int "Warn for stack frames larger than (needs gcc 4.4)"
 	range 0 8192
-	default 1024 if !64BIT
-	default 2048 if ...
So use filemap_fdatawrite(page->mapping), or if it's better only to start IO on a segment of the file, use

the deepest call chain in queue_work() needs 700 bytes of stack to complete, wait_for_completion() requires almost 2k of stack space at its deepest, the scheduler has some heavy stack users, etc,

Yeah, but when we have a callchain 70 or more functions deep,

The list I'm seeing so far includes:
	- scheduler
	- completion interfaces
	- radix tree
	- memory allocation, memory reclaim
	- anything that implements ->writepage
	- select

Good start, but 512 bytes will only catch select and splice read, and there are 300-400 byte functions in the above list that sit near

It's always a problem, but the focus on minimising stack usage has gone away since i386 has mostly disappeared from server rooms.

XFS has always been the thing that triggered stack usage problems first - the first reports of problems on x86_64 with 8k stacks in low memory situations have only just come in, and this is the first time in a couple of years I've paid close attention to stack usage

XFS used to use much more than that - significant effort has been put into reducing the stack footprint over many years. There's not much left to trim without rewriting half the filesystem...

Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
That does not help the stack usage issue, the caller ends up in ->writepages. From an IO perspective, it'll be better from a seek point of view but from a VM perspective, it may or may not be cleaning the right pages.

The real issue here then is that stack usage has gone out of control. Disabling ->writepage in direct reclaim does not guarantee that stack usage will not be a problem again. From your traces, page reclaim itself seems to be a big dirty hog.

Differences in what people see in their machines may be down to architecture, compiler but most likely inlining. Changing inlining will not fix the problem,

They will need to be tackled in turn then but obviously there should be a focus on the common paths. The reclaim paths do seem particularly heavy and it's down to a lot of temporary variables. I might not get the time today but what I'm going to try to do some time this week is

o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control

The last one is the trickiest. Basically, the idea would be to move as much into scan_control as possible. Then, instead of allocating it on the stack, allocate a fixed number of them at boot-time (NR_CPU probably) protected by a semaphore. Limit the number of direct reclaimers that can be active at a time to the number of scan_control variables. kswapd could still allocate its own on the stack or with kmalloc.

If it works out, it would have two main benefits. It limits the number of processes in direct reclaim - if there is NR_CPU-worth of processes in direct reclaim, there is too much going on. It would also shrink the stack usage, particularly if some of the stack variables are moved into scan_control.

I don't think he is levelling a complaint at XFS in particular - just pointing out that it's heavy too. Still, we ...
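The fixed-pool idea described above (a boot-time set of scan_control structures guarded by a semaphore) can be illustrated in user space. This is only a sketch of the general technique using Python threading primitives; the class name, the dictionary standing in for scan_control, and the pool size are all hypothetical and say nothing about how it would be done in the kernel.

```python
import threading
from contextlib import contextmanager


class ScanControlPool:
    """Fixed-size pool of preallocated control blocks (hypothetical)."""

    def __init__(self, count):
        self._count = count
        # Preallocate every block up front; none are created later.
        self._free = [{"slot": i, "nr_scanned": 0} for i in range(count)]
        self._lock = threading.Lock()
        self._slots = threading.Semaphore(count)

    def in_use(self):
        with self._lock:
            return self._count - len(self._free)

    @contextmanager
    def acquire(self):
        self._slots.acquire()          # blocks while all blocks are taken
        with self._lock:
            sc = self._free.pop()
        try:
            yield sc
        finally:
            sc["nr_scanned"] = 0       # reset before returning to the pool
            with self._lock:
                self._free.append(sc)
            self._slots.release()


pool = ScanControlPool(2)              # stand-in for "NR_CPU probably"
with pool.acquire() as a, pool.acquire() as b:
    a["nr_scanned"] += 32
    held = pool.in_use()
    # A third reclaimer would have to wait here.
    third_would_block = not pool._slots.acquire(blocking=False)
print("held:", held, "third would block:", third_would_block)
print("in use after release:", pool.in_use())
```

The semaphore gives the concurrency limit and the preallocated list keeps the control block off the caller's stack, which are the two benefits named in the message.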
If you ask it to clean a bunch of pages around the one you want to reclaim on the LRU, there is a good chance it will also be cleaning pages that are near the end of the LRU or physically close by as well. It's not a guarantee, but for the additional IO cost of about 10% wall time on that IO to clean the page you need, you also get 1-2 orders of magnitude more pages cleaned. That sounds like a win any way you look at it...

I agree that it doesn't solve the stack problem (Chris' suggestion that we enable the bdi flusher interface would fix this); what I'm pointing out is that the arguments that it is too hard or there are no interfaces available to issue larger IO from reclaim are not at

That's definitely true, but it shouldn't cloud the fact that most ppl want to kill writeback from direct reclaim, too, so killing two birds with one stone seems like a good idea.

How about this? For now, we stop direct reclaim from doing writeback only on order zero allocations, but allow it for higher order allocations. That will prevent the majority of situations where direct reclaim blows the stack and interferes with background writeout, but won't cause lumpy reclaim to change behaviour. This reduces the scope of impact and hence the testing and validation that needs to be done.

Then we can work towards allowing lumpy reclaim to use background threads as Chris suggested for doing specific writeback operations to solve the remaining problems being seen. Does this seem like a

I couldn't agree more - the kernel still needs to be put on a stack usage diet, but the above would give us some breathing space to attack the

I like the idea - it really sounds like you want a fixed size, preallocated mempool that can't be enlarged. In fact, I can probably use something like this in XFS to save a couple of hundred bytes of

Yeah, true. Sorry if I'm being a bit too defensive here - the scars from previous discussions like this are showing through....

Cheers,
Dave.
--
Dave ...
Tend to agree, but I would propose a slightly different algorithm to avoid incorrect OOMs:

for high order allocation
	allow lumpy reclaim and pageout() for both kswapd and direct reclaim

for low order allocation
	- kswapd:          always delegate IO to the flusher thread
	- direct reclaim:  delegate IO to the flusher thread only if VM pressure is low

This seems safer. I mean, who wants to see an incorrect OOM regression?

Tend to agree. We are probably now discussing the right approach, but this definitely needs deep thought, so I can't give an exact answer yet.
--
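The proposed policy above amounts to a small decision table. The sketch below encodes it under two assumptions that the message does not spell out: "low order" is taken to mean order 0, and "VM pressure" is reduced to a boolean supplied by the caller. The function name is hypothetical.

```python
def may_pageout(order, is_kswapd, pressure_high):
    """True if reclaim may call pageout() itself under the proposal.

    ASSUMPTIONS: order 0 is treated as "low order", and pressure_high
    is a caller-supplied flag; the message defines neither precisely.
    """
    if order > 0:
        return True            # high order: lumpy reclaim keeps pageout()
    if is_kswapd:
        return False           # kswapd always delegates to the flusher
    return pressure_high       # direct reclaim delegates only if pressure is low


for order, kswapd, high in [(0, True, True), (0, False, False),
                            (0, False, True), (3, True, False)]:
    print(order, kswapd, high, "->", may_pageout(order, kswapd, high))
```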
Now, vmscan pageout() is one source of IO throughput degradation. Some IO workloads do a lot of order-0 allocation and reclaim, and pageout()'s 4K IOs cause an annoying number of seeks.

At least kswapd can avoid such pageout() because kswapd doesn't need to consider the OOM-killer situation. There's no risk in that.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..d392a50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
 	if (referenced_page)
 		return PAGEREF_RECLAIM_CLEAN;
 
+	/*
+	 * Delegate pageout IO to flusher thread. They can make more
+	 * effective IO pattern.
+	 */
+	if (current_is_kswapd())
+		return PAGEREF_RECLAIM_CLEAN;
+
 	return PAGEREF_RECLAIM;
 }
-- 
1.6.5.2
--
What's your opinion on trying to cluster the writes done by pageout,
instead of not doing any paging out in kswapd?
Something along these lines:
Cluster writes to disk due to memory pressure.
Write out logically adjacent pages to the one we're paging out
so that we may get better IOs in these situations:
These pages are likely to be contiguous on disk to the one we're
writing out, so they should get merged into a single disk IO.
Signed-off-by: Suleiman Souhlal <suleiman@google.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4e5a613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,8 @@
#include "internal.h"
+#define PAGEOUT_CLUSTER_PAGES 16
+
struct scan_control {
/* Incremented by the number of inactive pages that were scanned */
unsigned long nr_scanned;
@@ -350,6 +352,8 @@ typedef enum {
static pageout_t pageout(struct page *page, struct address_space
*mapping,
enum pageout_io sync_writeback)
{
+ int i;
+
/*
* If the page is dirty, only perform writeback if that write
* will be non-blocking. To prevent this allocation from being
@@ -408,6 +412,37 @@ static pageout_t pageout(struct page *page,
struct address_space *mapping,
}
/*
+ * Try to write out logically adjacent dirty pages too, if
+ * possible, to get better IOs, as the IO scheduler should
+ * merge them with the original one, if the file is not too
+ * fragmented.
+ */
+ for (i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
+ struct page *p2;
+ int err;
+
+ p2 = find_get_page(mapping, page->index + i);
+ if (p2) {
+ if (trylock_page(p2) == 0) {
+ page_cache_release(p2);
+ break;
+ }
+ if (page_mapped(p2))
+ try_to_unmap(p2, 0);
+ if (PageDirty(p2)) {
+ err = write_one_page(p2, 0);
+ page_cache_release(p2);
+ if (err)
+ break;
+ } else ...
Interesting. So, I'd like to review your patch carefully. Can you please give me one --
Hannes, if I remember correctly, you tried similar swap-cluster IO a long time ago. I can't remember now why we didn't merge that patch. --
Oh, quite vividly in fact :) For a lot of swap loads the LRU order diverged heavily from swap slot order and readaround was a waste of time. Of course, the patch looked good, too, but it did not match reality that well. I guess 'how about this patch?' won't get us as far as 'how about those numbers/graphs of several real-life workloads? oh and here For random IO, LRU order will have nothing to do with mapping/disk order. --
Right, that's why the patch writes out contiguous pages in mapping order. If they are contiguous on disk with the original page, then writing them out as well should be essentially free (when it comes to disk time). There is almost no waste of memory regardless of the access patterns, as far as I can tell. This patch is just a proof of concept and could be improved by getting help from the filesystem/swap code to ensure that the additional pages we're writing out really are contiguous with the original one. -- Suleiman --
Hannes,
We recently ran into this problem while running some experiments on the ext4 filesystem. We hit a scenario where we are writing a large file, or just opening a large file, with a limited memory allocation (using containers), and the process got OOM-killed. The memory assigned to the container is reasonably large, and the OOM cannot be reproduced on ext2 with the same configuration.
Later we figured this might be due to ext4's delayed block allocation. Vmscan sends a single page to ext4->writepage(), then ext4 punts if the block is delayed-allocated (DA) and re-dirties the page. The flusher thread, on the other hand, uses ext4->writepages(), which does include the block allocation.
We looked at the OOM log under ext4: all pages within the container were on the inactive list and either Dirty or Writeback. Also, the zones are all marked "all_unreclaimable", which indicates the reclaim path has scanned the LRU quite a lot of times without making progress.
If delayed block allocation is the reason pageout() cannot flush dirty pages and then triggers OOMs, should we signal the fs to force write-out of dirty pages under memory pressure? --
XFS already does this in ->writepage to try to minimise the impact of the way pageout issues IO. It helps, but it is still not as good as having all the writeback come from the flusher threads because it's still pretty much random IO. And, FWIW, it doesn't solve the stack usage problems, either. In fact, it will make them worse as write_one_page() puts another struct writeback_control on the stack... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I haven't reviewed that patch yet, so I'm talking in general terms. pageout() doesn't only write out file-backed pages, it also writes swap-backed pages. so, filesystem optimization nor flusher thread
Correct. We need to avoid a second writeback_control on the stack. Probably we need to split pageout() into some pieces. --
Doesn't the randomness become irrelevant if you can cluster enough Sorry, this patch was not meant to solve the stack usage problems. -- Suleiman --
No. If you are doing full disk seeks between random chunks, then you still lose a large amount of throughput. e.g. if the seek time is 10ms and your IO time is 10ms for each 4k page, then increasing the size to 64k makes it 10ms seek and 12ms for the IO. We might increase throughput but we are still limited to 100 IOs per second. We've gone from 400kB/s to 6MB/s, but that's still an order of magnitude short of the 100MB/s that full size IOs with little in the way of seeks between them will achieve on the same spindle... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
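The arithmetic behind these figures can be written out as a rough sketch. It assumes, as the mail's rounding does, that the IO rate is capped by seek time alone; the helper names `ios_per_sec()` and `throughput_kb_per_sec()` are made up for illustration.

```c
/* IOs per second when every IO pays one seek of seek_ms milliseconds. */
static unsigned int ios_per_sec(unsigned int seek_ms)
{
	if (seek_ms == 0)
		return 0;	/* avoid dividing by zero; no seek model applies */
	return 1000 / seek_ms;
}

/* Seek-bound throughput in kB/s for IOs of io_kb kilobytes each. */
static unsigned int throughput_kb_per_sec(unsigned int iops, unsigned int io_kb)
{
	return iops * io_kb;
}
```

With a 10ms seek this gives 100 IOs per second, so 4k IOs yield 400 kB/s and 64k IOs yield 6400 kB/s (the "6MB/s" above). Both are far below the 100MB/s quoted for large sequential IO.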
What I meant was that, theoretically speaking, you could increase the maximum amount of pages that get clustered so that you could get 100MB/s, although it most likely wouldn't be a good idea with the current patch. -- Suleiman --
The usual armwaving numbers for ops/sec for an ATA disk are in the 200 ops/sec range so that seems horribly credible. But then I've never quite understood why our anonymous paging isn't sorting stuff as best it can and then using the drive as a log structure with in memory metadata so it can stream the pages onto disk. Read performance is going to be similar (maybe better if you have a log tidy when idle), write ought to be far better. Alan --
Yeah, in my experience 7200rpm SATA will get you 200 ops/s when you are doing really small seeks as the typical minimum seek time is around 4-5ms. Average seek time, however, is usually in the range of 10ms, because full head sweep + spindle rotation seeks take in the order of 15ms. Hence small random IO tends to result in seek times nearer the average seek time than the minimum, so that's what i tend to use for Sounds like a worthy project for someone to sink their teeth into. Lots of people would like to have a system that can page out at hundreds of megabytes a second.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I've found one bug in this patch myself. flusher thread don't --
Well, there is some risk here. Direct reclaimers may not be cleaning more pages than it had to previously except it splices subsystems together increasing stack usage and causing further problems. It might not cause OOM-killer issues but it could increase the time dirty pages spend on the LRU. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
No, you are right. I fully agree with your previous mail. So I need to cool down a bit ;) --
This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it.
=============================================
Since 2.6.28 zone->prev_priority has been unused, so it can be removed
safely. This reduces stack usage slightly.
Now I have to say that I'm sorry. Two years ago, I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
have not produced good performance numbers, so I am giving up on that approach.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
include/linux/mmzone.h | 15 -------------
mm/page_alloc.c | 2 -
mm/vmscan.c | 54 ++---------------------------------------------
mm/vmstat.c | 2 -
4 files changed, 3 insertions(+), 70 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..ad76962 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,21 +339,6 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invocation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
-
- /*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..88513c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,8 +3862,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
...ditto
This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it.
=========================================
Currently a great many functions in vmscan take a `priority' argument, which
consumes a little stack. Moving it into struct scan_control reduces stack usage.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 83 ++++++++++++++++++++++++++--------------------------------
1 files changed, 37 insertions(+), 46 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index dadb461..8b78b49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
int order;
+ int priority;
+
/* Which cgroup do we reclaim from */
struct mem_cgroup *mem_cgroup;
@@ -1130,7 +1132,7 @@ static int too_many_isolated(struct zone *zone, int file,
*/
static unsigned long shrink_inactive_list(unsigned long max_scan,
struct zone *zone, struct scan_control *sc,
- int priority, int file)
+ int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
@@ -1156,7 +1158,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
*/
if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
lumpy_reclaim = 1;
- else if (sc->order && priority < DEF_PRIORITY - 2)
+ else if (sc->order && sc->priority < DEF_PRIORITY - 2)
lumpy_reclaim = 1;
pagevec_init(&pvec, 1);
@@ -1335,7 +1337,7 @@ static void move_active_pages_to_lru(struct zone *zone,
}
static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
- struct scan_control *sc, int priority, int file)
+ struct scan_control *sc, int file)
{
unsigned long nr_taken;
unsigned long pgscanned;
@@ -1498,17 +1500,17 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
}
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
- struct zone *zone, struct scan_control *sc, int priority)
+ struct zone *zone, struct scan_control *sc)
...
Even if pageout() is called from direct reclaim, we can delegate IO to the flusher thread if VM pressure is low.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 7 +++++++
1 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b78b49..eab6028 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,6 +623,13 @@ static enum page_references page_check_references(struct page *page,
if (current_is_kswapd())
return PAGEREF_RECLAIM_CLEAN;
+ /*
+ * Now VM pressure is not so high. then we can delegate
+ * page cleaning to flusher thread safely.
+ */
+ if (!sc->order && sc->priority > DEF_PRIORITY/2)
+ return PAGEREF_RECLAIM_CLEAN;
+
return PAGEREF_RECLAIM;
}
--
1.6.5.2
--
Now a kernel compile and/or a backup operation seems to keep nr_vmscan_write==0. Dave, can you please try running your workload that suffers from pageout? --
It's just as easy for you to run and observe the effects. Start with a VM
with 1GB RAM and a 10GB scratch block device:
# mkfs.xfs -f /dev/<blah>
# mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch
in one shell:
# while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done
in another shell, if you have fs_mark installed, run:
# ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &
otherwise run a couple of these in parallel on different directories:
# for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
A filesystem on a loopback device will work just as well ;) Cheers, Dave. -- Dave Chinner david@fromorbit.com --
IMO, this really doesn't fix either of the problems - the bad IO patterns nor the stack usage. All it will take is a bit more memory pressure to trigger stack and IO problems, and the user reporting the problems is generating an awful lot of memory pressure... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Agreed (again), but we've already come to the conclusion that a
Yes, I suggested it *as a first step*, not as the end goal. Your patches don't reach the first step which is fixing the reported
Given that I haven't been able to trigger OOM without writeback from direct reclaim so far (*) I'm not finding any evidence that it is a problem or that there are regressions. I want to be able to say that this change has no known regressions. I want to find the regressions and work to fix them, but without test cases there's no way I can do this.
This is what I'm getting frustrated about - I want to fix this problem once and for all, but I can't find out what I need to do to robustly test such a change so we can have a high degree of confidence that it doesn't introduce major regressions. Can anyone help here?
(*) except in one case I've already described where it managed to allocate enough huge pages to starve the system of order zero pages,
You're asking me? I've been asking you for workloads that wind up reclaim priority.... :/
All I can say is that the most common trigger I see for OOM is copying a large file on a busy system that is running off a single spindle. When that happens on my laptop I walk away and get a cup of coffee, and when I come back I pick up all the broken bits the OOM killer left behind.....
Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I have some stack diet patches as a separate series. I'll post todays diet patch
Agreed. I'm sorry about that. Probably nobody in the world has enough VM test cases, even counting non-Linux people. Modern general purpose OSes are used for a really wide variety of purposes on a wide variety of machines, so I have never seen a VM change with perfectly zero regressions. I feel the same frustration all the time, because much of the VM mess exists to avoid extreme starvation cases. but If
??? Did I misunderstand your last mail? I asked which "bad IO patterns" you meant. If that's not what you intended, what were you saying about IO patterns? If my understanding is correct, you asked me about cases where vmscan hurts, and I asked you about your bad IO pattern. As far as I understand, you are talking about a general issue rather than a specific one, so I am also talking in general.
In general, I think a slowdown is better than the OOM-killer. So, even though we need more and more improvement, we always care about avoiding incorrect OOM. IOW, I'd prefer step-by-step development. --
The max_scan value passed to shrink_inactive_list() is now never more than
SWAP_CLUSTER_MAX, so we can remove the page scanning loop in it.
This patch also helps the stack diet.
Details:
- remove "while (nr_scanned < max_scan)" loop
- remove nr_freed (now, we use nr_reclaimed directly)
- remove nr_scan (now, we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead
using SWAP_CLUSTER_MAX
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 190 ++++++++++++++++++++++++++++-------------------------------
1 files changed, 89 insertions(+), 101 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eab6028..4de4029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
- unsigned long nr_scanned = 0;
+ unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
int lumpy_reclaim = 0;
+ struct page *page;
+ unsigned long nr_taken;
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned long nr_anon;
+ unsigned long nr_file;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- do {
- struct page *page;
- unsigned long nr_taken;
- unsigned long nr_scan;
- unsigned long nr_freed;
- unsigned long nr_active;
- unsigned int count[NR_LRU_LISTS] = { 0, };
- int ...
Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it stack-o-meter) and got the following. The prereq patches are from earlier in the thread with the subjects
vmscan: kill prev_priority completely
vmscan: move priority variable into scan_control
It gets
$ stack-o-meter vmlinux-vanilla vmlinux-1-2patchprereq
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-72 (-72)
function                                     old     new   delta
kswapd                                       748     676     -72
and with this patch on top
$ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
function                                     old     new   delta
shrink_zone                                 1232    1160     -72
kswapd                                       748     676     -72
I couldn't spot any problems. I'd consider throwing a WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX) in case some future change breaks the assumptions but otherwise.
-- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
And the next time someone adds a new feature to these code paths or the compiler inlines differently these 72 bytes are easily there again. It's not really a long term solution. Code is tending to get more complicated all the time. I consider it unlikely this trend will stop any time soon. So just doing some stack micro optimizations doesn't really help all that much. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
The same logic applies when/if page writeback is split so that it is
It's a buying-time venture, I'll agree, but as both approaches are only about reducing stack usage they wouldn't be long-term solutions by your criteria. What do you suggest? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
(from easy to more complicated):
- Disable direct reclaim with 4K stacks
- Do direct reclaim only on separate stacks
- Add interrupt stacks to any 8K stack architectures.
- Get rid of 4K stacks completely
- Think about any other stackings that could give large scale recursion and find ways to run them on separate stacks too.
- Long term: maybe we need 16K stacks at some point, depending on how good the VM gets. Alternative would be to stop making Linux more complicated, but that's unlikely to happen.
-Andi -- ak@linux.intel.com -- Speaking for myself only. --
Just to re-iterate: we're blowing the stack with direct reclaim on x86_64 w/ 8k stacks. The old i386/4k stack problem is a red herring. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Yes that's known, but on 4K it will definitely not work at all. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Yep, that is not being disputed. By the way, what did you use to generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else? I used a modified bloat-o-meter to gather my data but it'd be nice to be sure I'm seeing the same things as you (minus XFS unless I -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I'm using the tracing subsystem to get them. Doesn't everyone use
that now? ;)
$ grep STACK .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_STACK_TRACER=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set
Then:
# echo 1 > /proc/sys/kernel/stack_tracer_enabled
<run workloads>
Monitor the worst recorded stack usage as it changes via:
# cat /sys/kernel/debug/tracing/stack_trace
Depth Size Location (44 entries)
----- ---- --------
0) 5584 288 get_page_from_freelist+0x5c0/0x830
1) 5296 272 __alloc_pages_nodemask+0x102/0x730
2) 5024 48 kmem_getpages+0x62/0x160
3) 4976 96 cache_grow+0x308/0x330
4) 4880 96 cache_alloc_refill+0x27f/0x2c0
5) 4784 96 __kmalloc+0x241/0x250
6) 4688 112 vring_add_buf+0x233/0x420
......
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
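For anyone who wants to post-process the stack tracer output shown above, a minimal userspace sketch of parsing one entry line follows. It assumes the column layout from the sample output (index, depth, size, location); `struct stack_entry` and `parse_stack_trace_line()` are hypothetical names for illustration, not part of any kernel tool.

```c
#include <stdio.h>

/* One row of /sys/kernel/debug/tracing/stack_trace, as shown above. */
struct stack_entry {
	int index;            /* position in the call chain, 0 is deepest */
	unsigned long depth;  /* cumulative stack bytes used at this frame */
	unsigned long size;   /* bytes used by this frame alone */
	char location[128];   /* symbol+offset/length */
};

/*
 * Parse a line such as
 *   "  0)     5584     288   get_page_from_freelist+0x5c0/0x830"
 * Returns 1 on success, 0 for headers, separators and other lines.
 */
static int parse_stack_trace_line(const char *line, struct stack_entry *e)
{
	return sscanf(line, " %d) %lu %lu %127s",
		      &e->index, &e->depth, &e->size, e->location) == 4;
}
```

Feeding every line of the file through this and keeping the largest `depth` gives the worst recorded stack usage, which is the number to compare against the stack size.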
Do not like. While I can see why 4K stacks are a serious problem, I'd sooner see 4K stacks disabled than have the kernel behave so differently for direct reclaim. It'd be tricky to spot regressions in reclaim that
This is a similar but separate problem. It's similar in that interrupt
Why would we *not* do this? I can't remember the original reasoning behind 4K stacks but am guessing it helped fork-orientated workloads in startup times in the days before lumpy reclaim and better fragmentation control.
The patch series I threw up about reducing stack was a cut-down approach. Instead of using separate stacks, keep the stack usage out of
Make this Plan D if nothing else works out and we still hit a wall? -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Think that's a script worth having in-tree?
Ahh, it's a hatchet-job at the moment. I copied bloat-o-meter and altered one function. I made a TODO note to extend bloat-o-meter properly and that would be worth merging. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This patch is used from [3/4]
===================================
free_hot_cold_page() and __free_pages_ok() do very similar
preparation before freeing. This patch consolidates it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/page_alloc.c | 40 +++++++++++++++++++++-------------------
1 files changed, 21 insertions(+), 19 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88513c0..ba9aea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
spin_unlock(&zone->lock);
}
-static void __free_pages_ok(struct page *page, unsigned int order)
+static int free_pages_prepare(struct page *page, unsigned int order)
{
- unsigned long flags;
int i;
int bad = 0;
- int wasMlocked = __TestClearPageMlocked(page);
trace_mm_page_free_direct(page, order);
kmemcheck_free_shadow(page, order);
- for (i = 0 ; i < (1 << order) ; ++i)
- bad += free_pages_check(page + i);
+ for (i = 0 ; i < (1 << order) ; ++i) {
+ struct page *pg = page + i;
+
+ if (PageAnon(pg))
+ pg->mapping = NULL;
+ bad += free_pages_check(pg);
+ }
if (bad)
- return;
+ return -EINVAL;
if (!PageHighMem(page)) {
debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
arch_free_page(page, order);
kernel_map_pages(page, 1 << order, 0);
+ return 0;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+ unsigned long flags;
+ int wasMlocked = __TestClearPageMlocked(page);
+
+ if (free_pages_prepare(page, order))
+ return;
+
local_irq_save(flags);
if (unlikely(wasMlocked))
free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
int migratetype;
int wasMlocked = __TestClearPageMlocked(page);
- trace_mm_page_free_direct(page, ...
You don't appear to do anything with the return value. bool? Otherwise I see no problems -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On x86_64, sizeof(struct pagevec) is 8*16=128, but
sizeof(struct list_head) is 8*2=16. So replacing the pagevec with a list
saves 112 bytes of stack.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 22 ++++++++++++++--------
1 files changed, 14 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4de4029..fbc26d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -93,6 +93,8 @@ struct scan_control {
unsigned long *scanned, int order, int mode,
struct zone *z, struct mem_cgroup *mem_cont,
int active, int file);
+
+ struct list_head free_batch_list;
};
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -641,13 +643,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
- struct pagevec freed_pvec;
int pgactivate = 0;
unsigned long nr_reclaimed = 0;
cond_resched();
- pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -822,10 +822,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
- pagevec_reinit(&freed_pvec);
- }
+ list_add(&page->lru, &sc->free_batch_list);
continue;
cull_mlocked:
@@ -849,8 +846,6 @@ keep:
VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
}
list_splice(&ret_pages, page_list);
- if (pagevec_count(&freed_pvec))
- __pagevec_free(&freed_pvec);
count_vm_events(PGACTIVATE, pgactivate);
return nr_reclaimed;
}
@@ -1238,6 +1233,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
PAGEOUT_IO_SYNC);
}
+ /*
+ * Free unused pages.
+ */
+ free_pages_bulk(zone, &sc->free_batch_list);
+
local_irq_disable();
if (current_is_kswapd())
...
You could clear this under the zone->lock below before calling __free_one_page. It'd avoid a large number of IRQ enables and disables which are a problem on some CPUs (P4 and Itanium both blow in this regard according
This has the effect of bypassing the per-cpu lists as well as making the zone lock hotter. The cache hotness of the data within the page is probably not a factor but the cache hotness of the struct page is. The zone lock getting hotter is a greater problem. Large amounts of page reclaim or dumping of page cache will now contend on the zone lock whereas previously it would have dumped into the per-cpu lists (potentially but not necessarily avoiding the zone lock).
While there might be a stack saving in the next patch, there would appear to be definite performance implications in taking this patch.
Functionally, I see no problem but I'd put this sort of patch on the -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
At worst, it'll distort the LRU ordering slightly. Let's say the file-adjacent page you clean was near the end of the LRU. Before such a patch, it may have gotten cleaned and done another lap of the LRU. After, it would be reclaimed sooner. I don't know if we depend on such behaviour (very doubtful) but it's a subtle enough change. I can't predict what it'll do for IO congestion. Simplistically, there is more IO so it's bad but if the write pattern is less seeky and we needed to
I'm afraid I'm not familiar with this interface. Can you point me at some previous discussion so that I am sure I am looking at the right
Sure, I'm not resisting fixing this, just your first patch :) There are four goals here
1. Reduce stack usage
2. Avoid the splicing of subsystem stack usage with direct reclaim
3. Preserve lumpy reclaim's cleaning of contiguous pages
4. Try and not drastically alter LRU aging
1 and 2 are important for you, 3 is important for me and 4 will have to be dealt with on a case-by-case basis. Your patch fixes 2, avoids 1, breaks 3 and haven't thought about 4 but I
Ah yes, but I at least will resist killing of writeback from direct reclaim because of lumpy reclaim. Again, I recognise the seek pattern
I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage enough or come up with an alternative fix. From the goals above it mitigates 1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on the LRU with 4 until the background cleaner or kswapd comes along.
One reason why I am edgy about this is that lumpy reclaim can kick in for low-enough orders too, like order-1 pages for stacks in some cases or order-2 pages for network cards using jumbo frames or some wireless cards. The network cards in particular could still cause the stack
I'd like stack reduction to be plan a because it buys time without making the problem exclusively lumpy reclaim's where it can still hit,
Yep. It would cut down around 1K of stack usage when ...
vi fs/direct-reclaim-helper.c, it has a few placeholders for where the real code needs to go....just look for the ~ marks.
I mostly meant that the bdi helper threads were the best place to add knowledge about which pages we want to write for reclaim. We might need to add a thread dedicated to just doing the VM's dirty work, but that's
I'd like to add one more:
5. Don't dive into filesystem locks during reclaim.
This is different from splicing code paths together, but the filesystem writepage code has become the center of our attempts at doing big fat contiguous writes on disk. We push off work as late as we can until just before the pages go down to disk.
I'll pick on ext4 and btrfs for a minute, just to broaden the scope outside of XFS. Writepage comes along and the filesystem needs to actually find blocks on disk for all the dirty pages it has promised to write. So, we start a transaction, we take various allocator locks, modify different metadata, log changed blocks, take a break (logging is hard work you know, need_resched() triggered by now), stuff it all into the file's metadata, log that, and finally return.
Each of the steps above can block for a long time. Ext4 solves this by not doing them. ext4_writepage only writes pages that are already fully allocated on disk. Btrfs is much more efficient at not doing them, it just returns right away for PF_MEMALLOC.
This is a long way of saying the filesystem writepage code is the opposite of what direct reclaim wants. Direct reclaim wants to find free ram now, and if it does end up in the mess described above, it'll just get stuck for a long time on work entirely unrelated to finding free pages.
-chris --
This is a real problem, BTW. One of the problems we've been fighting
inside Google is because ext4_writepage() refuses to write pages that
are subject to delayed allocation, it can cause the OOM killer to get
invoked.
I had thought this was because of some evil games we're playing for
container support that makes zones small, but just last night at the
LF Collaboration Summit reception, I ran into a technologist from a
major financial industry customer who reported to me that when they tried
using ext4, they ran into the exact same problem because they were
running Oracle which was pinning down 3 gigs of memory, and then when
they tried writing a very big file using ext4, they had the same
problem of writepage() not being able to reclaim enough pages, so the
kernel fell back to invoking the OOM killer, and things got ugly in a
hurry...
One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint. Maybe the VM guys
don't care about this, but this sort of thing tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.
- Ted
--
I must be blind. What tree is this in? I can't see it v2.6.34-rc4,
Good add. It's not a new problem either. This came up at least two years ago at around the first VM/FS summit and the response was along the lines
Ok, good summary, thanks. I was only partially aware of some of these. i.e. I knew it was a problem but was not sensitive to how bad it was. Your last point is interesting because lumpy reclaim for large orders under heavy pressure can make the system stutter badly (e.g. during a huge page pool resize). I had blamed just plain IO but messing around with locks and transactions could have been a large factor and I didn't go looking for it. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Bah, Johannes corrected my literal mind. har de har har :) -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Fundamentally, we have so many pages on the LRU, getting a few out of order at the back end of it is going to be in the noise. If we trade off "perfect" LRU behaviour for cleaning pages an order of magnitude faster, reclaim will find candidate pages a whole lot faster. And if we have more clean pages available, faster, overall system throughput is going to improve and be much less likely to fall into deep, dark holes where the OOM-killer is the light at the end.....
#4 is important to me, too, because that has direct impact on large file IO workloads. However, it is gross changes in behaviour that concern me, not subtle, probably-in-the-noise changes that you're
Well, you keep saying that they break #3, but I haven't seen any test cases or results showing that. I've been unable to confirm that lumpy reclaim is broken by disallowing writeback in my testing, so I'm interested to know what tests you are running that show it is
We've been through this already, but I'll repeat it again in the hope it sinks in: reducing stack usage is not sufficient to stay within an 8k stack if we can enter writeback with an arbitrary amount of stack already consumed. We've already got a report of 9k of stack usage (7200 bytes left on an order-2 stack) and this is without a complex storage stack - it's just a partition on a SATA drive. We can easily add another 1k, possibly 2k to that stack depth with a complex storage subsystem. Trimming this much (3-4k) is simply not feasible in a callchain that
So push lumpy reclaim into a separate thread. It already blocks, so waiting for some other thread to do the work won't change anything. Separating high-order reclaim from LRU reclaim is probably a good idea, anyway - they use different algorithms and while the two are intertwined it's hard to optimise/improve either....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
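The stack figures quoted in the message above can be checked with a little arithmetic. The following is a minimal sketch, assuming 4 KiB pages (so an order-2 stack is 16 KiB and an 8k stack is order-1); the function names are made up for illustration and are not kernel interfaces.

```c
#include <assert.h>

/* Assumption: 4 KiB pages, as on x86_64. */
enum { PAGE_SIZE_BYTES = 4096 };

/* Size in bytes of an order-N stack, i.e. 2^N contiguous pages. */
unsigned long stack_size(unsigned int order)
{
	return (unsigned long)PAGE_SIZE_BYTES << order;
}

/* Bytes consumed, given the stack order and the bytes reported as still free. */
unsigned long stack_used(unsigned int order, unsigned long bytes_left)
{
	assert(bytes_left <= stack_size(order));
	return stack_size(order) - bytes_left;
}

/* How far a given usage overshoots a stack of the given order (0 if it fits). */
unsigned long stack_overrun(unsigned int order, unsigned long used)
{
	unsigned long size = stack_size(order);

	return used > size ? used - size : 0;
}
```

With the reported 7200 bytes left on an order-2 stack, `stack_used(2, 7200)` gives 9184 bytes, which is the "9k" in the report. Against an 8k stack that is already 992 bytes over, and adding the further 1k to 2k mentioned for a complex storage subsystem pushes the overrun to between 2016 and 3040 bytes, before leaving any headroom.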
haha, I don't think anyone pretends the LRU behaviour is perfect. Altering its existing behaviour tends to be done with great care but I'm also less concerned with this aspect. I brought it up because it was a factor. I don't think it'll cause us problems but if problems do arise, it's nice to have a few potential candidates to examine in Ok, I haven't actually tested this. The machines I use are tied up retesting the compaction patches at the moment. The reason why I reckon it'll be a problem is that when these sync-writeback changes were introduced, it significantly helped lumpy reclaim for huge pages. I am making an assumption that backing out those changes will hurt it. Ok, based on this, I'll stop working on the stack-reduction patches. I'll test what I have and push it but I won't bring it further for the moment and instead look at putting writeback into its own thread. If someone else works on it in the meantime, I'll review and test from the No, it wouldn't. As long as it can wait on the right pages, it doesn't They are not a million miles apart either. Lumpy reclaim uses the LRU to select a cursor page and then reclaims around it. Improvements on LRU tend to help lumpy reclaim as well. It's why during the tests I run I can often allocate 80-95% of memory as huge pages on x86-64 as opposed to when anti-frag was being developed first where getting 30% was a cause for celebration :) -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
There are two issues here: stack utilisation and poor IO patterns in direct reclaim. They are different.
The poor IO patterns thing is a regression. Some time several years ago (around 2.6.16, perhaps), page reclaim started to do a LOT more dirty-page writeback than it used to. AFAIK nobody attempted to work out why, nor attempted to try to fix it.
Doing writearound in pageout() might help. The kernel was in fact doing that around 2.5.10, but I took it out again because it wasn't obviously beneficial.
Writearound is hard to do, because direct-reclaim doesn't have an easy way of pinning the address_space: it can disappear and get freed under your feet. I was able to make this happen under intense MM loads. The current page-at-a-time pageout code pins the address_space by taking a lock on one of its pages. Once that lock is released, we cannot touch *mapping.
And lo, the pageout() code is presently buggy:
	res = mapping->a_ops->writepage(page, &wbc);
	if (res < 0)
		handle_write_error(mapping, page, res);
The ->writepage can/will unlock the page, and we're passing a hand grenade into handle_write_error().
Any attempt to implement writearound in pageout will need to find a way to safely pin that address_space. One way is to take a temporary ref on mapping->host, but IIRC that introduced nasties with inode_lock. Certainly it'll put more load on that worrisomely-singleton lock.
Regarding simply not doing any writeout in direct reclaim (Dave's initial proposal): the problem is that pageout() will clean a page in the target zone. Normal writeout won't do that, so we could get into a situation where vast amounts of writeout is happening, but none of it is cleaning pages in the zone which we're trying to allocate from. It's quite possibly livelockable, too.
Doing writearound (if we can get it going) will solve that adequately (assuming that the target page gets reliably written), but it won't help the stack usage problem.
To solve the ...
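The hazard described above (->writepage unlocks the page, so the mapping pointer may be stale by the time the error is handled) can be illustrated with a user-space toy model. This is a minimal sketch and not kernel code: struct toy_mapping, struct toy_page and the toy_* helpers are hypothetical stand-ins, and it assumes, as the message states, that holding the lock on a page that still belongs to the mapping is what pins the address_space. It shows one possible shape of a check, namely retaking the page lock and comparing pointers before touching the mapping; it is not presented as the fix that was or should be applied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins for the kernel structures; not the real definitions. */
struct toy_mapping {
	int error;                   /* last recorded write error */
};

struct toy_page {
	struct toy_mapping *mapping; /* NULL once the page has left the mapping */
	bool locked;
};

void toy_lock_page(struct toy_page *page)
{
	assert(!page->locked);
	page->locked = true;
}

void toy_unlock_page(struct toy_page *page)
{
	assert(page->locked);
	page->locked = false;
}

/*
 * Record a write error only if the page still belongs to the mapping we
 * started with. The caller's 'mapping' pointer is only compared, and is
 * dereferenced solely after the comparison succeeds under the page lock.
 * Returns true if the error was recorded.
 */
bool toy_handle_write_error(struct toy_mapping *mapping,
			    struct toy_page *page, int error)
{
	bool recorded = false;

	toy_lock_page(page);
	if (page->mapping == mapping) {
		mapping->error = error;
		recorded = true;
	}
	toy_unlock_page(page);
	return recorded;
}
```

If the page was detached while unlocked, the stale pointer is never dereferenced and the error is simply dropped. This does nothing for writearound, which would still need a way to pin the address_space across several pages.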
I just know that we XFS guys have been complaining about it a lot..
But that was mostly a tuning issue - before writeout mostly happened from pdflush. If we got into kswapd or direct reclaim we already
As Chris mentioned currently btrfs and ext4 do not actually do delalloc conversions from this path, so for typical workloads the amount of writeout that can happen from this path is extremely limited. And unless we get things fixed we will have to do the same for XFS. I'd be much more happy if we could just sort it out at the VM level, because this means we have one sane place for this kind of policy instead of three or more hacks down inside the filesystems. It's rather interesting that all people on the modern fs side completely agree here what the problem is, but it seems rather hard to convince the VM side to do
Allowing the flusher threads to do targeted writeout would be the best from the FS POV. We'll still have one source of the I/O, just with another knob on how to select the exact region to write out. We can still synchronously wait for the I/O for lumpy reclaim if really necessary.
--
On Sun, 18 Apr 2010 15:05:26 -0400, Christoph Hellwig <hch@infradead.org>
I know also that the ext3 and reiserfs guys complained about this issue
--
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group
EMC²
where information lives
Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfaibish@emc.com
--
Right. It's intended that the great majority of writeout be performed by the fs flusher threads and by the write()r in balance_dirty_pages(). Writeout off the LRU is supposed to be a rare emergency case. Yeah, but it's all bandaids. The first thing we should do is work out why writeout-off-the-LRU increased so much and fix that. Handing writeout off to separate threads might be used to solve the stack consumption problem but we shouldn't use it to "solve" the excess-writeout-from-page-reclaim problem. --
I think both of them are really serious issues. Exposing the whole stack and lock problems with direct reclaim is a bit of a positive side-effect of the writeout tuning messup. Without it the problems would still be just as harmful, just happening even less often and thus getting even less attention. --
On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
I for one am looking very seriously at this problem together with Bruce.
We plan to have a discussion on this topic at the next LSF meeting
--
Best Regards
Sorin Faibish
--
On Sun, 18 Apr 2010 17:30:36 -0400, James Bottomley
Let's work together to get this done. This is a very good idea. I will try
to bring some facts about the current state by instrumenting the kernel
to sample the dirty page dynamics with finer time granularity. This will
allow us to better expose the problem, or the lack of one. :)
--
Best Regards
Sorin Faibish
--
I'd personally hope that this is solved long before the LSF/VM
workshops.... but if not, yes, we should definitely tackle it then.
- Ted
--
I think that part of the problem is that at roughly the same time writeback started on a long down hill slide as well, and we've really only fixed that in the last couple of kernel releases. Also, it tends to take more than just writing a few large files to invoke the LRU-based writeback code, so it is generally not invoked in filesystem "performance" testing. Hence my bet is on the fact that the effects of LRU-based writeback are rarely noticed in common testing. IOWs, low memory testing is not something a lot of people do. Add to that the fact that most fs people, including me, have been treating the VM as a black box that a bunch of other people have been taking care of and hence really just been hoping it does the right thing, and we've got a recipe for an unnoticed descent into a Bad Place.
That's true, but seeing as we can't safely do writeback from reclaim, we need some method of telling the background threads to write a certain region of an inode. Perhaps some extension of a
Which, if we have to set it as low as 1.5k of stack used, may as
I'm fundamentally opposed to pushing IO to another place in the VM when it could be just as easily handed to the flusher threads. Also, consider that there's only one kswapd thread in a given context (e.g. per CPU), but we can scale the number of flusher threads as need be....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
On Mon, 19 Apr 2010 10:35:56 +1000 Would this also be the time where we started real dirty accounting, and started playing with the dirty page thresholds? Background writeback is that interesting tradeoff between writing out to make the VM easier (and the data safe) and the chance of someone either rewriting the same data (as benchmarks do regularly... not sure about real workloads) or deleting the temporary file. Maybe we need to do the background dirty writes a bit more aggressive... or play with heuristics where we get an adaptive timeout (say, if the file got closed by the last opener, then do a shorter timeout) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
Yes, I think that was introduced in 2.6.16/17, so it's definitely in
Realistically, I'm concerned about preventing the worst case behaviour from occurring - making the background writes more aggressive without preventing writeback in LRU order simply means it will be harder to test the VM corner case that triggers these writeout patterns...
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
On Mon, 19 Apr 2010 11:08:05 +1000 while I appreciate that the worst case should not be uber horrific... I care a LOT about getting the normal case right... and am willing to sacrifice the worst case for that.. (obviously not to infinity, it needs to be bounded) -- Arjan van de Ven Intel Open Source Technology Centre For development, discussion and tips for power savings, visit http://www.lesswatts.org --
One machine has completed the test and the results are as expected. When
allocating huge pages under stress, your patch drops the success rates
significantly. On X86-64, it showed
STRESS-HIGHALLOC
                stress-highalloc        stress-highalloc
                enable-directreclaim    disable-directreclaim
Under Load 1    89.00 ( 0.00)           73.00 (-16.00)
Under Load 2    90.00 ( 0.00)           85.00 ( -5.00)
At Rest         90.00 ( 0.00)           90.00 (  0.00)
So with direct reclaim, it gets 89% of memory as huge pages at the first
attempt but 73% with your patch applied. The "Under Load 2" test happens
immediately after. With the start kernel, the first and second attempts
are usually the same or very close together. With your patch applied,
there are big differences as it was no longer trying to clean pages.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
What was the machine config you were testing on (RAM, CPUs, etc)? And what are these loads? Do you have a script that generates them? If so, can you share them, please?
OOC, what was the effect on the background load - did it go faster or slower when writeback was disabled? i.e. did we trade off more large pages for better overall throughput?
Also, I'm curious as to the repeatability of the tests you are doing. I found that from run to run I could see a *massive* variance in the results. e.g. one run might only get ~80 huge pages at the first attempt, the next test run from the same initial conditions might get 440 huge pages at the first attempt. I saw the same variance with or without writeback from direct reclaim enabled. Hence only after averaging over tens of runs could I see any sort of trend emerge, and it makes me wonder if your testing is also seeing this sort of variance....
FWIW, if we look at the results of the test I did, it showed a 20% improvement in large page allocation with a 15% increase in load throughput, while you're showing a 16% degradation in large page allocation. Effectively we've got two workloads that show results at either end of the spectrum (perhaps they are best case vs worst case) but there's no real in-between. What other tests can we run to get a better picture of the effect?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
Compile-based loads that fill up memory and put it under heavy memory pressure that also dirties memory. While they are running, a kernel module is loaded that starts allocating huge pages one at a time so that accurate timing and the state of the system can be gathered at allocation time. The number of allocation attempts is 90% of the number of huge pages that exist
Yes, but unfortunately they are not in a publishable state. Parts of
Unfortunately, I don't know what the effect on the underlying load is as it takes longer than the huge page allocation attempts do. The test's objective is to check how well lumpy reclaim works under memory pressure. However, the time it takes to allocate a huge page increases with direct reclaim disabled (i.e. your patch) early in the test up until about 40% of memory was allocated as huge pages. After that, the latencies with disable-directreclaim are lower until it gives up while the latencies with enable-directreclaim increase. In other words, with direct reclaim writing back pages, lumpy reclaim is a lot more determined to get the pages cleaned and wait on them if necessary. A compromise patch might be to have a wait_on_page_dirty to be cleared instead of queueing the IO and wait_on_page_writeback? How long it stalled would
You are using the nr_hugepages interface and writing a large number to it so you are also triggering the hugetlbfs retry-logic and have little control over how many times the allocator gets called on each attempt. How many huge pages it allocates depends on how much progress it is able to make during lumpy reclaim. It's why the tests I run allocate huge pages one at a time and measure the latencies as it goes. The results tend to be quite reproducible. Success figures would be the same between runs and the rate of allocation success would generally be comparable as well. Your test could do something similar by only ever requesting one additional page. It will be good enough to measure allocation latency.
The ...
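The suggestion above, to only ever request one additional huge page and measure the latency of each request, could be sketched from user space roughly as follows. This is an illustration under stated assumptions and not the test harness used by anyone in this thread: the helper names are made up, the path is passed in by the caller (on a real system it would be /proc/sys/vm/nr_hugepages and would need root), and wall-clock time around the write is taken as the allocation latency.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <stdio.h>
#include <time.h>

/* Read a single non-negative integer from a sysctl-style file; -1 on failure. */
long read_count(const char *path)
{
	FILE *f = fopen(path, "r");
	long v = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &v) != 1)
		v = -1;
	fclose(f);
	return v;
}

/* Write a single integer to the file; 0 on success, -1 on failure. */
int write_count(const char *path, long v)
{
	FILE *f = fopen(path, "w");
	int ok;

	if (!f)
		return -1;
	ok = fprintf(f, "%ld\n", v) > 0;
	if (fclose(f) != 0)	/* the actual write may only fail at flush time */
		ok = 0;
	return ok ? 0 : -1;
}

/*
 * Ask for exactly one more page than the pool currently holds and time the
 * request. Returns 1 if the pool grew, 0 if it did not, -1 on I/O error.
 * On a 1 or 0 return, *latency_ns holds the elapsed time of the write.
 */
int grow_pool_by_one(const char *path, long long *latency_ns)
{
	struct timespec t0, t1;
	long before, after;

	before = read_count(path);
	if (before < 0)
		return -1;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	if (write_count(path, before + 1) != 0)
		return -1;
	clock_gettime(CLOCK_MONOTONIC, &t1);

	after = read_count(path);
	if (after < 0)
		return -1;

	*latency_ns = (t1.tv_sec - t0.tv_sec) * 1000000000LL +
		      (t1.tv_nsec - t0.tv_nsec);
	return after > before;
}
```

Calling `grow_pool_by_one()` in a loop while the background load runs, and logging the return value and latency of each call, would give per-page success and latency figures rather than a single count after one large write.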
-Andi -- ak@linux.intel.com -- Speaking for myself only. --
I already have some patches to remove trivial parts of struct scan_control, namely may_unmap, may_swap, all_unreclaimable and isolate_pages. The rest needs a deeper look. A rather big offender in there is the combination of shrink_active_list (360 bytes here) and shrink_page_list (200 bytes). I am currently looking at breaking out all the accounting stuff from shrink_active_list into a separate leaf function so that the stack footprint does not add up. Your idea of per-cpu allocated scan controls reminds me of an idea I have had for some time now: moving reclaim into its own threads (per cpu?). Not only would it separate the allocator's stack from the writeback stack, we could also get rid of that too_many_isolated() workaround and coordinate reclaim work better to prevent overreclaim. But that is not a quick fix either... --
Seems interesting, but a scan_control diet is not so effective. How much
So, I hadn't thought of it this way. It probably seems good, but I'd like to do the simple diet first. --
Not much, it cuts 16 bytes on x86 32 bit. The bigger gain is the code clarification it comes with. There is too much state to keep track of in reclaim. --
Yeah, of course, much. I would propose to revert 70674f95c0. But I doubt GFP_NOFS solves our issue. --
There are lots of other call chains which use multiple KB by themselves, so why not give select() that measly 832 bytes? You think only file systems are allowed to use stack? :) Basically if you cannot tolerate 1K (or more likely more) of stack used before your fs is called you're toast in lots of other situations
It does this for large inputs, but the whole point of the stack fast path is to avoid it for common cases when only a small number of fds is needed. It's significantly slower to go to any external allocator.
-Andi
--
ak@linux.intel.com -- Speaking for myself only.
--
Well, on a 4K stack kernel, 832 bytes is a very large percentage for just one function. Direct reclaim is a problem because it splices parts of the kernel that normally aren't connected together. The people that code in select see 832 bytes and say that's teeny, I should have taken 3832 bytes. But they don't realize their function can dive down into ecryptfs then the filesystem then maybe loop and then perhaps raid6 on top of a Yeah, but since the call chain does eventually go into the allocator, this function needs to be more stack friendly. I do agree that we can't really solve this with noinline_for_stack pixie dust, the long call chains are going to be a problem no matter what. Reading through all the comments so far, I think the short summary is: Cleaning pages in direct reclaim helps the VM because it is able to make sure that lumpy reclaim finds adjacent pages. This isn't a fast operation, it has to wait for IO (infinitely slow compared to the CPU). Will it be good enough for the VM if we add a hint to the bdi writeback threads to work on a general area of the file? The filesystem will get writepages(), the VM will get the IO it needs started. I know Mel mentioned before he wasn't interested in waiting for helper threads, but I don't see how we can work without it. -chris --
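To make the idea of "a hint to the bdi writeback threads to work on a general area of the file" a little more concrete, here is a purely hypothetical user-space sketch. Nothing like struct wb_hint or coalesce_hints() is proposed in the thread or exists in the kernel under these names; the sketch only assumes that reclaim would queue (inode, byte range) requests and that the writeback side would sort and merge them, so that adjacent 4k pages become one larger contiguous request.

```c
#include <stdlib.h>

/* Hypothetical request from reclaim: "please write back this range". */
struct wb_hint {
	unsigned long ino;		/* inode number */
	unsigned long long start;	/* first byte offset */
	unsigned long long end;		/* one past the last byte */
};

/* Order hints by inode, then by starting offset. */
static int hint_cmp(const void *a, const void *b)
{
	const struct wb_hint *x = a, *y = b;

	if (x->ino != y->ino)
		return x->ino < y->ino ? -1 : 1;
	if (x->start != y->start)
		return x->start < y->start ? -1 : 1;
	return 0;
}

/*
 * Sort the hints and merge ranges on the same inode that overlap or touch.
 * The merged hints are left at the front of the array; returns their count.
 */
size_t coalesce_hints(struct wb_hint *h, size_t n)
{
	size_t out = 0;

	if (n == 0)
		return 0;

	qsort(h, n, sizeof(*h), hint_cmp);
	for (size_t i = 1; i < n; i++) {
		if (h[i].ino == h[out].ino && h[i].start <= h[out].end) {
			/* Extend the current range if this one reaches further. */
			if (h[i].end > h[out].end)
				h[out].end = h[i].end;
		} else {
			h[++out] = h[i];
		}
	}
	return out + 1;
}
```

Three single-page hints for one inode at offsets 8192, 0 and 4096 collapse into a single 12288-byte range. The sketch only clusters by file offset; it does not address whether the pages written are the ones the allocator needs in memory.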
The reality is that if you are blowing a 4K process stack you are probably playing russian roulette on the current 8K x86-32 stack as well because of the non IRQ split. So it needs fixing either way --
Yes I think the 8K stack on 32bit should be combined with an interrupt stack too. There's no reason not to have an interrupt stack ever. Again the problem with fixing it is that you won't have any safety net for a slightly different stacking etc. path that you didn't cover. That said extreme examples (like some of those Chris listed) definitely need fixing by moving them to different threads. But even after that you still want a safety net. 4K is just too near the edge. Maybe it would work if we never used any indirect calls, but that's clearly not the case. -Andi -- ak@linux.intel.com -- Speaking for myself only. --
Even without direct reclaim, I doubt stack usage is often at the top of people's minds except for truly criminal large usages of it. Direct reclaim splicing is somewhat of a problem but it's separate to stack
Bear in mind that in the context of lumpy reclaim, the VM doesn't care about where the data is on the file or filesystem. It's only concerned about where the data is located in memory. There *may* be a correlation between location-of-data-in-file and location-of-data-in-memory but only if readahead was a factor and readahead happened to hit at a time the page
I'm not against the idea as such. It would have advantages in that the thread could reorder the IO for better seeks for example and lumpy reclaim is already potentially waiting a long time so another delay won't hurt. I would worry that it's just hiding the stack usage by moving it to another thread and that there would be communication cost between a direct reclaimer and this writeback thread. The main gain would be in hiding the "splicing" effect between subsystems that direct reclaim can have.
--
Mel Gorman
Part-time Phd Student Linux Technology Center
University of Limerick IBM Dublin Software Lab
--
The big gain from the helper threads is that storage operates at a roughly fixed iop rate. This is true for ssd as well, it's just a much higher rate. So the threads can send down 4K ios and recover clean pages at exactly the same rate it would sending down 64KB ios. I know that for lumpy purposes it might not be the best 64KB, but the other side of it is that we have to write those pages eventually anyway. We might as well write them when it is more or less free. The per-bdi writeback threads are a pretty good base for changing the ordering for writeback, it seems like a good place to integrate requests from the VM about which files (and which offsets in those files) to write back first. -chris --
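One way to read the fixed-IOP argument above is as simple arithmetic: if the device completes roughly the same number of writes per second regardless of size, the number of pages cleaned per second scales with the size of each write. The sketch below assumes 4 KiB pages; the function name and the figure of 100 IOs per second used in the note are made up for illustration.

```c
#include <assert.h>

/*
 * Pages cleaned per second when the device completes 'iops' writes per
 * second and each write covers 'io_bytes' of 'page_bytes'-sized pages.
 */
unsigned long pages_cleaned_per_sec(unsigned long iops,
				    unsigned long io_bytes,
				    unsigned long page_bytes)
{
	assert(page_bytes > 0 && io_bytes % page_bytes == 0);
	return iops * (io_bytes / page_bytes);
}
```

At an assumed 100 IOs per second, 4K writes clean 100 pages per second while 64KB writes clean 1600, for the same number of trips to the device. That is the sense in which the extra 15 pages in each 64KB write are "more or less free".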
Hi, Dave.
I think your solution is a rather aggressive change, as Mel and Kosaki already pointed out. Is the flusher thread aware of the LRU order of dirty pages, i.e. system-level recency rather than dirty-page recency? Of course the flusher thread can clean dirty pages faster than a direct reclaimer. But if it isn't aware of LRU order, hot page thrashing can happen in corner cases. It could also lose write merging. And on non-rotational storage the seek cost might not be big. I think we have to consider that case if we decide to change direct reclaim I/O.
How about we separate the problems?
1. stack hogging problem.
2. direct reclaim random write.
And try to solve them one by one instead of all at once.
--
Kind regards,
Minchan Kim
--
It may be aggressive, but writeback from direct reclaim is, IMO, one of the worst aspects of the current VM design because of its adverse effect on the IO subsystem. I'd prefer to remove it completely than continue to try and patch around it, especially given that everyone seems to agree that it
It writes back in the order inodes were dirtied. i.e. the LRU is a coarser measure, but it is still definitely there. It also takes into account fairness of IO between dirty inodes, so no one dirty inode prevents IO being issued on other dirty inodes on the
Non-rotational storage still goes faster when it is fed large, well
AFAICT, the only way to _reliably_ avoid the stack usage problem is to avoid writeback in direct reclaim. That has the side effect of fixing #2 as well, so do they really need separating?
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
Tend to agree. But do we need it as a last resort if the flusher thread can't keep up with the write stream? Of course, if everybody agrees, we can do it. For that, we need many benchmark results, which is very hard. Thanks.
It seems recency is lost.
Agreed. I missed that. NAND devices are stronger than HDDs at random reads.
If we can do it, it's good. But problem 2 is not easy to fix, I think. Compared to 2, 1 is rather easy. So I thought we could solve 1 first and then focus on 2. If your suggestion is right, then we can apply your idea. Then we don't need to revert the patch for 1, since small stack usage is always good
--
Kind regards,
Minchan Kim
--
On Tue, 13 Apr 2010 10:17:58 +1000
Hmm. Then, if a memory cgroup is filled by dirty pages, it can't kick writeback and has to wait for someone else's writeback? How long will this take?
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/A
# echo 20M > /cgroup/A/memory.limit_in_bytes
# echo $$ > /cgroup/A/tasks
# dd if=/dev/zero of=./tmpfile bs=4096 count=1000000
Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?
Thanks,
--
On Fri, 16 Apr 2010 10:13:39 +0900 Hmm.. I saw an oom-kill while testing several cases but performance itself seems not to be far different with or without patch. But I'm unhappy with oom-kill, so some tweak for memcg will be necessary if we'll go with this. Thanks, -Kame --
