Re: [PATCH 1/4] vmscan: delegate pageout io to flusher thread if current is kswapd

From: Dave Chinner
Date: Monday, April 12, 2010 - 5:17 pm

From: Dave Chinner <dchinner@redhat.com>

When we enter direct reclaim we may have used an arbitrary amount of stack
space, and hence entering the filesystem to do writeback can then lead to
stack overruns. This problem was recently encountered on x86_64 systems with
8k stacks running XFS with simple storage configurations.

Writeback from direct reclaim also adversely affects background writeback. The
background flusher threads should already be taking care of cleaning dirty
pages, and direct reclaim will kick them if they aren't already doing work. If
direct reclaim is also calling ->writepage, it will cause the IO patterns from
the background flusher threads to be upset by LRU-order writeback from
pageout() which can be effectively random IO. Having competing sources of IO
trying to clean pages on the same backing device reduces throughput by
increasing the number of seeks that the backing device has to do to write back
the pages.

Hence for direct reclaim we should not allow ->writepage to be entered at all.
Set up the relevant scan_control structures to enforce this, and prevent
sc->may_writepage from being set in other places in the direct reclaim path in
response to other events.

Reported-by: John Berthels <john@humyo.com>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 mm/vmscan.c |   13 ++++++-------
 1 files changed, 6 insertions(+), 7 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..5321ac4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1826,10 +1826,8 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
 		 * writeout.  So in laptop mode, write out the whole world.
 		 */
 		writeback_threshold = sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
-		if (total_scanned > writeback_threshold) {
+		if (total_scanned > writeback_threshold)
 			wakeup_flusher_threads(laptop_mode ? 0 : total_scanned);
-			sc->may_writepage = 1;
-		}
 
 		/* Take a nap, wait for some writeback to complete */
 		if (!sc->hibernation_mode && ...
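
A minimal userspace sketch of the check as it reads after the hunk above. It assumes a cut-down stand-in for struct scan_control that carries only the fields the hunk touches; the struct name and the helper names are hypothetical and do not exist in mm/vmscan.c.

```c
#include <stdbool.h>

/* Hypothetical, cut-down stand-in for the kernel's struct scan_control. */
struct scan_control_sketch {
	unsigned long nr_to_reclaim;
	int may_writepage;
};

/* Same arithmetic as the hunk: 1.5 times the reclaim target. */
static unsigned long writeback_threshold(const struct scan_control_sketch *sc)
{
	return sc->nr_to_reclaim + sc->nr_to_reclaim / 2;
}

/*
 * After the patch, crossing the threshold only wakes the flusher
 * threads; sc->may_writepage is no longer set here.
 */
static bool should_wake_flushers(const struct scan_control_sketch *sc,
				 unsigned long total_scanned)
{
	return total_scanned > writeback_threshold(sc);
}

/* Argument passed to wakeup_flusher_threads() in the hunk. */
static unsigned long flusher_pages_to_write(bool laptop_mode,
					    unsigned long total_scanned)
{
	return laptop_mode ? 0 : total_scanned;
}
```

With a reclaim target of 32 pages the threshold is 48, so the flushers are only woken once more than 48 pages have been scanned, and may_writepage keeps whatever value the caller gave it.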
From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 1:31 am

Ummm..
This patch is harder to ack. This patch's pros/cons seem to be

Pros:
	1) prevent XFS stack overflow
	2) improve io workload performance

Cons:
	3) TOTALLY kill lumpy reclaim (i.e. high order allocation)

So, if we only needed to consider IO workloads there would be no downside,
but we can't.

I think (1) is an XFS issue; XFS should take care of it itself. But (2) is
really a VM issue. Our VM currently calls pageout() too aggressively and that
decreases IO throughput. I've heard about this issue from Chris (cc'ed). I'd
like to fix this, but we can never kill pageout() completely because we can't
assume users don't run high-order allocation workloads.
(Perhaps Mel's memory compaction code will improve enough that
 we can kill lumpy reclaim in the future, but that's another story.)




--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 3:29 am

The filesystem is irrelevant, IMO.

The traces from the reporter showed that we've got close to a 2k
stack footprint for memory allocation to direct reclaim and then we
can put the entire writeback path on top of that. This is roughly
3.5k for XFS, and then depending on the storage subsystem
configuration and transport can be another 2k of stack needed below
XFS.

IOWs, if we completely ignore the filesystem stack usage, there's
still up to 4k of stack needed in the direct reclaim path. Given
that one of the stack traces supplied shows direct reclaim being
entered with over 3k of stack already used, pretty much any
filesystem is capable of blowing an 8k stack.
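
The budget above can be written out as a small worked example. This is only a sketch using the rounded figures quoted in this mail (1k taken as 1024 bytes); the enum and function names are made up for illustration.

```c
/* Rounded figures from the text above, in bytes. */
enum {
	STACK_BYTES         = 8192,	/* 8k kernel stack */
	USED_BEFORE_RECLAIM = 3072,	/* "over 3k of stack already used" */
	RECLAIM_BYTES       = 2048,	/* allocation down to direct reclaim */
	STORAGE_BYTES       = 2048,	/* storage stack below the filesystem */
	XFS_BYTES           = 3584,	/* roughly 3.5k for XFS writeback */
};

/* Total stack depth if the filesystem's writeback path uses fs_bytes. */
static int stack_needed(int fs_bytes)
{
	return USED_BEFORE_RECLAIM + RECLAIM_BYTES + fs_bytes + STORAGE_BYTES;
}

/* Bytes left over on an 8k stack; negative means an overrun. */
static int headroom(int fs_bytes)
{
	return STACK_BYTES - stack_needed(fs_bytes);
}
```

Under these assumptions a filesystem that used no stack at all would leave 1024 bytes spare, so any writeback path deeper than about 1k overruns, and the XFS figure overruns by 2560 bytes.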

So, this is not an XFS issue, even though XFS is the first to

I didn't expect this to be easy. ;)

I had a good look at what the code was doing before I wrote the
patch, and IMO, there is no good reason for issuing IO from direct
reclaim.

My reasoning is as follows - consider a system with a typical
sata disk and the machine is low on memory and in direct reclaim.

direct reclaim is taking pages off the end of the LRU and writing
them one at a time from there. It is scanning thousands of
pages and it triggers IO on the dirty ones it comes across.
This is done with no regard to the IO patterns it generates - it can
(and frequently does) result in completely random single page IO
patterns hitting the disk, and as a result cleaning pages happens
really, really slowly. If we are in an OOM situation, the machine
will grind to a halt as it struggles to clean maybe 1MB of RAM per
second.

On the other hand, if the IO is well formed then the disk might be
capable of 100MB/s. The background flusher threads and filesystems
try very hard to issue well formed IOs, so the difference in the
rate that memory can be cleaned may be a couple of orders of
magnitude.

(Of course, the difference will typically be somewhere in between
these two extremes, but I'm simply trying to illustrate how big
the difference in ...
From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 4:39 am

Thanks for the explanation. I hadn't noticed that direct reclaim consumes
2k of stack. I'll investigate it and try to put it on a diet.

Well, you seem to be continuing to discuss the IO workload. I don't disagree
on that point.

For example, if only order-0 reclaim skips pageout(), we will get the above

Lumpy reclaim is for allocating high-order pages. It reclaims not only
the LRU head page but also its PFN neighbourhood. The PFN neighbours
are often new pages and still dirty, so we force pageout() to clean
them and then discard them.

When a high-order allocation occurs, we don't only need to free enough
memory, we also need to free a large enough contiguous memory block.

If we needed to consider _only_ IO throughput, waiting for the flusher thread
might perhaps be faster, but actually we also need to consider reclaim

It does. Lumpy reclaim doesn't grab the last N pages; instead it grabs contiguous

So, can you please run two workloads concurrently?
 - Normal IO workload (fio, iozone, etc..)
 - echo $NUM > /proc/sys/vm/nr_hugepages

The most typical high-order allocations come from brutal wireless LAN drivers
(or some cheap LAN cards).
But sadly, if the test depends on specific hardware, our discussion could
easily turn into a messy maze. So I'd prefer to use the hugepage feature instead.


Thanks.



--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 7:36 am

It hasn't grown in the last 2 years after the last major diet where
all the fat was trimmed from it in the last round of the i386 4k
stack vs XFS saga. It seems that everything else around XFS has


Ok, I see that now - I missed the second call to __isolate_lru_pages()

Agreed, that was why I was kind of surprised not to find it was

True, but without knowing how to test and measure such things I can't

What do I measure/observe/record that is meaningful?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 8:12 pm

So, a rough as guts first pass - just run a large dd (8 times the
size of memory - 8GB file vs 1GB RAM) and repeatedly try to allocate
the entirety of memory in huge pages (500) every 5 seconds. The IO
rate is roughly 100MB/s, so it takes 75-85s to complete the dd.

The script:

$ cat t.sh
#!/bin/bash

echo 0 > /proc/sys/vm/nr_hugepages
echo 3 > /proc/sys/vm/drop_caches

dd if=/dev/zero of=/mnt/scratch/test bs=1024k count=8000 > /dev/null 2>&1 &

(
for i in `seq 1 1 20`; do
        sleep 5
        /usr/bin/time --format="wall %e" sh -c "echo 500 > /proc/sys/vm/nr_hugepages" 2>&1
        grep HugePages_Total /proc/meminfo
done
) | awk '
        /wall/ { wall += $2; cnt += 1 }
        /Pages/ { pages[cnt] = $2 }
        END { printf "average wall time %f\nPages step: ", wall / cnt ;
                for (i = 1; i <= cnt; i++) {
                        printf "%d ", pages[i];
                }
        }'
----

And the output looks like:

$ sudo ./t.sh
average wall time 0.954500
Pages step: 97 101 101 121 173 173 173 173 173 173 175 194 195 195 202 220 226 419 423 426
$

Run 50 times in a loop, and the outputs averaged, the existing lumpy
reclaim resulted in:

dave@test-1:~$ cat current.txt | awk -f av.awk
av. wall = 0.519385 secs
av Pages step: 192 228 242 255 265 272 279 284 289 294 298 303 307 322 342 366 383 401 412 420

And with my patch that disables ->writepage:

dave@test-1:~$ cat no-direct.txt | awk -f av.awk
av. wall = 0.554163 secs
av Pages step: 231 283 310 316 323 328 336 340 345 351 356 359 364 377 388 397 413 423 432 439

Basically, with my patch lumpy reclaim was *substantially* more
effective with only a slight increase in average allocation latency
with this test case.
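
To put rough numbers on that comparison, the two averaged "Pages step" series above can be summed and compared step by step. A minimal sketch; the array and function names are made up, and the data is copied from the two av.awk outputs quoted above.

```c
#include <stddef.h>

#define STEPS 20

/* Averaged HugePages_Total per step with the existing code (current.txt). */
static const int current_pages[STEPS] = {
	192, 228, 242, 255, 265, 272, 279, 284, 289, 294,
	298, 303, 307, 322, 342, 366, 383, 401, 412, 420,
};

/* The same measurement with ->writepage disabled (no-direct.txt). */
static const int patched_pages[STEPS] = {
	231, 283, 310, 316, 323, 328, 336, 340, 345, 351,
	356, 359, 364, 377, 388, 397, 413, 423, 432, 439,
};

/* Sum of one series; divide by STEPS for the mean. */
static int series_sum(const int *v, size_t n)
{
	int sum = 0;

	for (size_t i = 0; i < n; i++)
		sum += v[i];
	return sum;
}

/* Number of steps at which the patched kernel had more huge pages. */
static int steps_where_patched_leads(void)
{
	int count = 0;

	for (size_t i = 0; i < STEPS; i++)
		if (patched_pages[i] > current_pages[i])
			count++;
	return count;
}
```

The sums are 6154 and 7111, i.e. a mean of roughly 308 huge pages per step for the existing code against roughly 356 with the patch, and the patched series is ahead at all 20 steps.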

I need to add a marker to the output that records when the dd
completes, but from monitoring the writeback rates via PCP, they
were in the ballpark of 85-100MB/s for the existing code, and
95-110MB/s with my patch.  Hence it improved both IO throughput and
the effectiveness ...
From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 11:52 pm

Ummm...

Probably I have to say I'm sorry. I guess my last mail gave you
the wrong impression.
To be honest, I'm not interested in this artificial non-fragmented case.
The above test case 1) discards all cache and 2) fills pages with streaming
IO. That creates an artificial "file offset neighbor == block neighbor == PFN neighbor"
situation, in which file-offset-order writeout by the flusher thread can
produce PFN-contiguous pages effectively.

Why am I not interested in it? Because lumpy reclaim is a technique for
avoiding the external fragmentation mess. IOW, it is for avoiding the worst
case, but your test case seems to measure the best one.



--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 6:56 pm

And to be brutally honest, I'm not interested in wasting my time
trying to come up with a test case that you are interested in.

Instead, can you please provide me with your test cases
(scripts, preferably) that you use to measure the effectiveness of

Yes, that's true, but it does indicate that in that situation, it is
more effective than the current code. FWIW, in the case of HPC
applications (which often use huge pages and clear the cache before
starting a new job), large streaming IO is a pretty common IO
pattern, so I don't think this situation is as artificial as you are

Then please provide test cases that you consider valid.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 11:52 pm

I have a dumb question: if XFS hasn't bloated its stack usage, why does 3.5k
of stack usage work fine on a 4k stack kernel? It seems impossible.

Please don't think I'm blaming you. I don't know what the "4k stack vs XFS saga" is.


Agreed. I know making a VM measurement benchmark is very difficult, but
it is probably necessary....



--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 12:36 am

Because on a 32 bit kernel it's somewhere between 2-2.5k of stack
space. That being said, XFS _will_ blow a 4k stack on anything other
than the most basic storage configurations, and if you run out of

Over a period of years there were repeated attempts to make the
default stack size on i386 4k, despite it being known to cause
problems on relatively common configurations. Every time it was
brought up it was rejected, but every few months somebody else made
an attempt to make it the default. There was a lot of flamage
directed at XFS because it was seen as the reason that 4k stacks
were not made the default....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Mel Gorman
Date: Tuesday, April 13, 2010 - 2:58 am

It's already known that the VM requesting specific pages be cleaned and
reclaimed is a bad IO pattern but unfortunately it is still required by
lumpy reclaim. This change would appear to break that although I haven't
tested it to be 100% sure.

Even without high-order considerations, this patch would appear to make
fairly large changes to how direct reclaim behaves. It would no longer
wait on page writeback for example, so direct reclaim will return sooner
than it did, potentially going OOM if there were a lot of dirty pages and

If an FS caller cannot re-enter the FS, it should be using GFP_NOFS

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 4:19 am

AFAICT it still waits for pages under writeback in exactly the same manner
it does now. shrink_page_list() does the following completely
separately to the sc->may_writepage flag:

		may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
			(PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

		if (PageWriteback(page)) {
			/*
			 * Synchronous reclaim is performed in two passes,
			 * first an asynchronous pass over the list to
			 * start parallel writeback, and a second synchronous
			 * pass to wait for the IO to complete.  Wait here
			 * for any page for which writeback has already
			 * started.
			 */
			if (sync_writeback == PAGEOUT_IO_SYNC && may_enter_fs)
				wait_on_page_writeback(page);
			else
				goto keep_locked;
		}

So if the page is under writeback, PAGEOUT_IO_SYNC is set and
we can enter the fs, it will still wait for writeback to complete
just like it does now.
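
The decision described above can be modelled in a few lines of userspace C. This is a sketch only: the flag values are illustrative stand-ins for __GFP_IO and __GFP_FS, and the enum writeback_action and both function names are hypothetical. Note that sc->may_writepage does not appear anywhere in it.

```c
#include <stdbool.h>

enum pageout_io { PAGEOUT_IO_ASYNC, PAGEOUT_IO_SYNC };
enum writeback_action { KEEP_LOCKED, WAIT_ON_WRITEBACK };

/* Illustrative bit values only, not the kernel's real __GFP_* values. */
#define SKETCH_GFP_IO 0x1u
#define SKETCH_GFP_FS 0x2u

/* Mirrors the may_enter_fs expression in shrink_page_list(). */
static bool may_enter_fs(unsigned int gfp_mask, bool page_in_swap_cache)
{
	return (gfp_mask & SKETCH_GFP_FS) ||
	       (page_in_swap_cache && (gfp_mask & SKETCH_GFP_IO));
}

/* What happens to a page that is already under writeback. */
static enum writeback_action
page_under_writeback(enum pageout_io sync_writeback, unsigned int gfp_mask,
		     bool page_in_swap_cache)
{
	if (sync_writeback == PAGEOUT_IO_SYNC &&
	    may_enter_fs(gfp_mask, page_in_swap_cache))
		return WAIT_ON_WRITEBACK;
	return KEEP_LOCKED;
}
```

Only the synchronous pass with a mask that allows entering the filesystem waits; every other combination leaves the page locked on the list.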

However, the current code only uses PAGEOUT_IO_SYNC in lumpy
reclaim, so for most typical workloads direct reclaim does not wait
on page writeback, either. Hence, this patch doesn't appear to
change the actions taken on a page under writeback in direct

I did a fair bit of low/small memory testing. This is a subjective
observation, but I definitely seemed to get less severe OOM
situations and better overall responsiveness with this patch than

This problem is not a filesystem recursion problem which is, as I
understand it, what GFP_NOFS is used to prevent. It's _any_ kernel
code that uses significant stack before trying to allocate memory
that is the problem. e.g. a select() system call:

       ...
From: Mel Gorman
Date: Tuesday, April 13, 2010 - 12:34 pm

Depends. For raw effectiveness, I run a series of performance-related
benchmarks with a final test that

o Starts a number of parallel compiles that in combination are 1.25 times
  of physical memory in total size
o Sleep three minutes
o Start allocating huge pages recording the latency required for each one
o Record overall success rate and graph latency over time


Right, so it'll still wait on writeback but won't kick it off. That
would still be a fairly significant change in behaviour though. Think of
synchronous lumpy reclaim for example where it queues up a contiguous

But it would be no longer queueing them for writeback so it'd be
depending heavily on kswapd or a background cleaning daemon to clean

No, but it does queue them back on the LRU where they might be clean the
next time they are found on the list. How significant a problem this is
I couldn't tell you but it could show a corner case where a large number
of direct reclaimers are encountering dirty pages frequently and

It does, but indirectly. The impact is very direct for lumpy reclaim
obviously. For other direct reclaim, pages that were at the end of the
LRU list are no longer getting cleaned before doing another lap through
the LRU list.


And it is possible that it is best overall if only kswapd and the
background cleaner are queueing pages for IO. All I can say for sure is
that this does appear to hurt lumpy reclaim and does affect normal

I'm not denying the evidence but how has it been gotten away with for years
then? Prevention of writeback isn't the answer without figuring out how
direct reclaimers can queue pages for IO and in the case of lumpy reclaim
doing sync IO, then waiting on those pages.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Chris Mason
Date: Tuesday, April 13, 2010 - 1:20 pm

So, I've been reading along, nodding my head to Dave's side of things
because seeks are evil and direct reclaim makes seeks.  I'd really love
for direct reclaim to somehow trigger writepages on large chunks instead
of doing page by page spatters of IO to the drive.

But, somewhere along the line I overlooked the part of Dave's stack trace
that said:

43)     1568     912   do_select+0x3d6/0x700

Huh, 912 bytes...for select, really?  From poll.h:

/* ~832 bytes of stack space used max in sys_select/sys_poll before allocating
   additional memory. */
#define MAX_STACK_ALLOC 832
#define FRONTEND_STACK_ALLOC    256
#define SELECT_STACK_ALLOC      FRONTEND_STACK_ALLOC
#define POLL_STACK_ALLOC        FRONTEND_STACK_ALLOC
#define WQUEUES_STACK_ALLOC     (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)
#define N_INLINE_POLL_ENTRIES   (WQUEUES_STACK_ALLOC / sizeof(struct poll_table_entry))

So, select is intentionally trying to use that much stack.  It should be using
GFP_NOFS if it really wants to suck down that much stack...if only the
kernel had some sort of way to dynamically allocate ram, it could try
that too.
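
The arithmetic behind those defines, as a small sketch. The three macros are copied from the poll.h excerpt above; the 64-byte entry size used in the note below is an assumption for illustration, since sizeof(struct poll_table_entry) depends on the architecture and config, and inline_poll_entries() is a made-up helper.

```c
#include <stddef.h>

#define MAX_STACK_ALLOC      832
#define FRONTEND_STACK_ALLOC 256
#define WQUEUES_STACK_ALLOC  (MAX_STACK_ALLOC - FRONTEND_STACK_ALLOC)

/*
 * Number of poll table entries that fit in the on-stack wait queue area
 * for a given entry size (integer division, as in N_INLINE_POLL_ENTRIES).
 */
static size_t inline_poll_entries(size_t entry_size)
{
	return WQUEUES_STACK_ALLOC / entry_size;
}
```

That leaves 576 bytes for inline wait queue entries, which would be 9 entries if each took 64 bytes. The 912-byte do_select frame in the trace is 80 bytes above the 832-byte budget named in the comment.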

-chris
--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 6:40 pm

Perhaps drop the lock on the page if it is held and call one of the
helpers that filesystems use to do this, like:


Sure, it's bad, but focussing on the specific case misses the
point that even code that is using minimal stack can enter direct
reclaim after consuming 1.5k of stack. e.g.:

 50)     3168      64   xfs_vm_writepage+0xab/0x160 [xfs]
 51)     3104     384   shrink_page_list+0x65e/0x840
 52)     2720     528   shrink_zone+0x63f/0xe10
 53)     2192     112   do_try_to_free_pages+0xc2/0x3c0
 54)     2080     128   try_to_free_pages+0x77/0x80
 55)     1952     240   __alloc_pages_nodemask+0x3e4/0x710
 56)     1712      48   alloc_pages_current+0x8c/0xe0
 57)     1664      32   __page_cache_alloc+0x67/0x70
 58)     1632     144   __do_page_cache_readahead+0xd3/0x220
 59)     1488      16   ra_submit+0x21/0x30
 60)     1472      80   ondemand_readahead+0x11d/0x250
 61)     1392      64   page_cache_async_readahead+0xa9/0xe0
 62)     1328     592   __generic_file_splice_read+0x48a/0x530
 63)      736      48   generic_file_splice_read+0x4f/0x90
 64)      688      96   xfs_splice_read+0xf2/0x130 [xfs]
 65)      592      32   xfs_file_splice_read+0x4b/0x50 [xfs]
 66)      560      64   do_splice_to+0x77/0xb0
 67)      496     112   splice_direct_to_actor+0xcc/0x1c0
 68)      384      80   do_splice_direct+0x57/0x80
 69)      304      96   do_sendfile+0x16c/0x1e0
 70)      208      80   sys_sendfile64+0x8d/0xb0
 71)      128     128   system_call_fastpath+0x16/0x1b

Yes, __generic_file_splice_read() is a hog, but they seem to be

The code that did the allocation is called from multiple different
contexts - how is it supposed to know that in some of those contexts
it is supposed to treat memory allocation differently?

This is my point - if you introduce a new semantic to memory allocation
that is "use GFP_NOFS when you are using too much stack" and too much
stack is more than 15% of the stack, then pretty much every code path

Sure, but to play the devil's ...
From: KAMEZAWA Hiroyuki
Date: Tuesday, April 13, 2010 - 9:59 pm

On Wed, 14 Apr 2010 11:40:41 +1000

A bit OFF TOPIC.

Could you share the disassembly of shrink_zone()?

In my environment:
00000000000115a0 <shrink_zone>:
   115a0:       55                      push   %rbp
   115a1:       48 89 e5                mov    %rsp,%rbp
   115a4:       41 57                   push   %r15
   115a6:       41 56                   push   %r14
   115a8:       41 55                   push   %r13
   115aa:       41 54                   push   %r12
   115ac:       53                      push   %rbx
   115ad:       48 83 ec 78             sub    $0x78,%rsp
   115b1:       e8 00 00 00 00          callq  115b6 <shrink_zone+0x16>
   115b6:       48 89 75 80             mov    %rsi,-0x80(%rbp)

The disassembly seems to show 0x78 bytes of stack, and no changes to %rsp
until return.

I may be misunderstanding something...

Thanks,
-Kame

--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 10:41 pm

I see the same. I didn't compile those kernels, though. IIUC,
they were built through the Ubuntu build infrastructure, so there is
something different in terms of compiler, compiler options or config
to what we are both using. Most likely it is the compiler inlining,
though Chris's patches to prevent that didn't seem to change the
stack usage.

I'm trying to get a stack trace from the kernel that has shrink_zone
in it, but I haven't succeeded yet....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 10:54 pm

I also got 0x78 bytes of stack usage. Umm.. are we discussing a real issue now?




--

From: Minchan Kim
Date: Tuesday, April 13, 2010 - 11:13 pm

On Wed, Apr 14, 2010 at 2:54 PM, KOSAKI Motohiro

In my case, 0x110 bytes on a 32-bit machine.
I think it's possible on a 64-bit machine.

00001830 <shrink_zone>:
    1830:       55                      push   %ebp
    1831:       89 e5                   mov    %esp,%ebp
    1833:       57                      push   %edi
    1834:       56                      push   %esi
    1835:       53                      push   %ebx
    1836:       81 ec 10 01 00 00       sub    $0x110,%esp
    183c:       89 85 24 ff ff ff       mov    %eax,-0xdc(%ebp)
    1842:       89 95 20 ff ff ff       mov    %edx,-0xe0(%ebp)
    1848:       89 8d 1c ff ff ff       mov    %ecx,-0xe4(%ebp)
    184e:       8b 41 04                mov    0x4(%ecx)

My gcc is as follows:

barrios@barriostarget:~/mmotm$ gcc -v
Using built-in specs.
Target: i486-linux-gnu
Configured with: ../src/configure -v --with-pkgversion='Ubuntu
4.3.3-5ubuntu4'
--with-bugurl=file:///usr/share/doc/gcc-4.3/README.Bugs
--enable-languages=c,c++,fortran,objc,obj-c++ --prefix=/usr
--enable-shared --with-system-zlib --libexecdir=/usr/lib
--without-included-gettext --enable-threads=posix --enable-nls
--with-gxx-include-dir=/usr/include/c++/4.3 --program-suffix=-4.3
--enable-clocale=gnu --enable-libstdcxx-debug --enable-objc-gc
--enable-mpfr --enable-targets=all --with-tune=generic
--enable-checking=release --build=i486-linux-gnu --host=i486-linux-gnu
--target=i486-linux-gnu
Thread model: posix
gcc version 4.3.3 (Ubuntu 4.3.3-5ubuntu4)


Does it depend on the config?



-- 
Kind regards,
Minchan Kim
From: Minchan Kim
Date: Wednesday, April 14, 2010 - 12:19 am

I marked shrink_list as noinline_for_stack.
The result is as follows:


00001fe0 <shrink_zone>:
    1fe0:       55                      push   %ebp
    1fe1:       89 e5                   mov    %esp,%ebp
    1fe3:       57                      push   %edi
    1fe4:       56                      push   %esi
    1fe5:       53                      push   %ebx
    1fe6:       83 ec 4c                sub    $0x4c,%esp
    1fe9:       89 45 c0                mov    %eax,-0x40(%ebp)
    1fec:       89 55 bc                mov    %edx,-0x44(%ebp)
    1fef:       89 4d b8                mov    %ecx,-0x48(%ebp)

0x110 -> 0x4c.

Should we add noinline_for_stack to shrink_list?


-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Wednesday, April 14, 2010 - 2:42 am

On Wed, 14 Apr 2010 16:19:02 +0900

Hmm. About shrink_zone(), I don't think uninlining the functions directly called
by shrink_zone() will help.
The total stack size of the call chain will still be big.

Thanks,
-Kame


--

From: Minchan Kim
Date: Wednesday, April 14, 2010 - 3:01 am

On Wed, Apr 14, 2010 at 6:42 PM, KAMEZAWA Hiroyuki

Absolutely.
But the 500+ byte usage above is one of the hogs, and uninlining is not
critical to reclaim performance, so I think we gain more than we
lose.

But I'm not in a hurry; an ad hoc approach is not good.
I hope that when Mel tackles stack consumption in the reclaim path, he
modifies this part too.




-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Wednesday, April 14, 2010 - 3:07 am

Bear in mind that uninlining can slightly increase the stack usage in some
cases because arguments, return addresses and the like have to be pushed
onto the stack. Inlining or uninlining is only the answer when it reduces the

It'll be at least two days before I get the chance to try. A lot of the
temporary variables used in the reclaim path have existed for some time so
it will take a while.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Wednesday, April 14, 2010 - 3:16 am

Yes. I totally missed it.
Thanks, Mel.

-- 
Kind regards,
Minchan Kim
--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 12:06 am

Ok, so here's a trace at the top of the stack from a kernel with
the above shrink_zone disassembly:

$ cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (49 entries)
        -----    ----   --------
  0)     6152     112   force_qs_rnp+0x58/0x150
  1)     6040      48   force_quiescent_state+0x1a7/0x1f0
  2)     5992      48   __call_rcu+0x13d/0x190
  3)     5944      16   call_rcu_sched+0x15/0x20
  4)     5928      16   call_rcu+0xe/0x10
  5)     5912     240   radix_tree_delete+0x14a/0x2d0
  6)     5672      32   __remove_from_page_cache+0x21/0x110
  7)     5640      64   __remove_mapping+0x86/0x100
  8)     5576     272   shrink_page_list+0x2fd/0x5a0
  9)     5304     400   shrink_inactive_list+0x313/0x730
 10)     4904     176   shrink_zone+0x3d1/0x490
 11)     4728     128   do_try_to_free_pages+0x2b6/0x380
 12)     4600     112   try_to_free_pages+0x5e/0x60
 13)     4488     272   __alloc_pages_nodemask+0x3fb/0x730
 14)     4216      48   alloc_pages_current+0x87/0xd0
 15)     4168      32   __page_cache_alloc+0x67/0x70
 16)     4136      80   find_or_create_page+0x4f/0xb0
 17)     4056     160   _xfs_buf_lookup_pages+0x150/0x390
.....

So the differences are most likely from the compiler doing
automatic inlining of static functions...
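
As a cross-check between the disassembly listings earlier in the thread and the Size column of the trace, the bytes a prologue of that shape takes can be added up by hand. A minimal sketch; frame_bytes() is a hypothetical helper, and treating the tracer's Size as "return address plus pushes plus sub" is an assumption that happens to agree with the numbers here.

```c
/*
 * Stack bytes taken by a prologue of the shape shown in the listings:
 * the caller's return address, each pushed register, and the explicit
 * "sub" from the stack pointer.
 */
static unsigned int frame_bytes(unsigned int word_size,
				unsigned int pushed_regs,
				unsigned int sub_bytes)
{
	return word_size + pushed_regs * word_size + sub_bytes;
}
```

For the x86_64 listing (six pushes, sub $0x78) this gives 8 + 48 + 120 = 176 bytes, the same as the shrink_zone entry in the trace above. The 32-bit listings (four pushes) come out at 292 bytes with sub $0x110 and 96 bytes with sub $0x4c. The earlier trace has a single 528-byte shrink_zone frame and no shrink_inactive_list entry, whereas here the two frames are separate at 400 and 176 bytes.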

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 11:52 pm

I agree that "seeks are evil and direct reclaim makes seeks". Actually,

Sorry, I've lost track of what you're talking about. Why do we need per-file waiting?

checkstack.pl says do_select() and __generic_file_splice_read() are among
the worst stack consumers. Both should be fixed.


Nodding my head to Dave's side. Changing the caller's argument doesn't seem
a good solution. I mean
 - do_select() should use GFP_KERNEL allocation instead of the stack (i.e. revert 70674f95c0)
 - reclaim and XFS (and others) need to go on a diet.


Your explanation is very interesting. I have a (probably dumb) question.
Why has nobody faced the stack overflow issue in the past? now I think every users

Yeah, my answer is simple: all stack eaters should be fixed.
But XFS doesn't seem innocent either. 3.5K is big enough, although
XFS has used that amount for a very long time.


===========================================================
Subject: [PATCH] kconfig: reduce FRAME_WARN default value to 512

Surprisingly, several odd functions now use a great deal of stack.

% objdump -d vmlinux | ./scripts/checkstack.pl

0xffffffff81e3db07 get_next_block [vmlinux]:            1976
0xffffffff8130b9bd node_read_meminfo [vmlinux]:         1240
0xffffffff811553fd do_sys_poll [vmlinux]:               1000
0xffffffff8122b49d test_aead [vmlinux]:                 904
0xffffffff81154c9d do_select [vmlinux]:                 888
0xffffffff81168d9d default_file_splice_read [vmlinux]:  760

Oh well, every developer has to pay attention to stack usage!
Thus, this patch reduces the FRAME_WARN default value to 512.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 lib/Kconfig.debug |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index ff01710..44ebba6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -28,8 +28,7 @@ config ENABLE_MUST_CHECK
 config FRAME_WARN
 	int "Warn for stack frames larger than (needs gcc 4.4)"
 	range 0 8192
-	default 1024 if !64BIT
-	default 2048 if ...
From: Dave Chinner
Date: Wednesday, April 14, 2010 - 12:28 am

So use filemap_fdatawrite(page->mapping), or if it's better only
to start IO on a segment of the file, use

the deepest call chain in queue_work() needs 700 bytes of stack
to complete, wait_for_completion() requires almost 2k of stack space
at its deepest, the scheduler has some heavy stack users, etc,

Yeah, but when we have a callchain 70 or more functions deep,

The list I'm seeing so far includes:
	- scheduler
	- completion interfaces
	- radix tree
	- memory allocation, memory reclaim
	- anything that implements ->writepage
	- select

Good start, but 512 bytes will only catch select and splice read,
and there are 300-400 byte functions in the above list that sit near

It's always a problem, but the focus on minimising stack usage has
gone away since i386 has mostly disappeared from server rooms.

XFS has always been the thing that triggered stack usage problems
first - the first reports of problems on x86_64 with 8k stacks in low
memory situations have only just come in, and this is the first time
in a couple of years I've paid close attention to stack usage

XFS used to use much more than that - significant effort has been
put into reducing the stack footprint over many years. There's not
much left to trim without rewriting half the filesystem...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Mel Gorman
Date: Wednesday, April 14, 2010 - 1:51 am

That does not help the stack usage issue, the caller ends up in
->writepages. From an IO perspective, it'll be better from a seek point of
view but from a VM perspective, it may or may not be cleaning the right pages.

The real issue here then is that stack usage has gone out of control.
Disabling ->writepage in direct reclaim does not guarantee that stack
usage will not be a problem again. From your traces, page reclaim itself
seems to be a big dirty hog.

Differences in what people see in their machines may be down to architecture,
compiler but most likely inlining. Changing inlining will not fix the problem,

They will need to be tackled in turn then but obviously there should be
a focus on the common paths. The reclaim paths do seem particularly
heavy and it's down to a lot of temporary variables. I might not get the
time today but what I'm going to try to do some time this week is

o Look at what temporary variables are copies of other pieces of information
o See what variables live for the duration of reclaim but are not needed
  for all of it (i.e. uninline parts of it so variables do not persist)
o See if it's possible to dynamically allocate scan_control

The last one is the trickiest. Basically, the idea would be to move as much
into scan_control as possible. Then, instead of allocating it on the stack,
allocate a fixed number of them at boot-time (NR_CPU probably) protected by
a semaphore. Limit the number of direct reclaimers that can be active at a
time to the number of scan_control variables. kswapd could still allocate
its own on the stack or with kmalloc.

If it works out, it would have two main benefits. Limits the number of
processes in direct reclaim - if there is NR_CPU-worth of processes in direct
reclaim, there is too much going on. It would also shrink the stack usage
particularly if some of the stack variables are moved into scan_control.
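
A rough userspace sketch of the fixed-pool idea above, for illustration only.
It is not kernel code: struct scan_control_model, POOL_SLOTS and the pool_*
helpers are hypothetical names, and per-slot atomic flags stand in for the
semaphore mentioned in the mail so that the example has no dependencies
beyond C11. A real version would more likely sleep on a semaphore when the
pool is exhausted instead of returning NULL.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical pool size; the mail suggests roughly one slot per CPU. */
#define POOL_SLOTS 4

/* Stand-in for the kernel's struct scan_control; fields are illustrative. */
struct scan_control_model {
	unsigned long nr_scanned;
	int order;
	int priority;
};

static struct scan_control_model pool[POOL_SLOTS];
static atomic_flag slot_busy[POOL_SLOTS];

/* Call once at start-up, mirroring "allocate at boot time". */
static void pool_init(void)
{
	for (size_t i = 0; i < POOL_SLOTS; i++)
		atomic_flag_clear(&slot_busy[i]);
}

/* Claim a free slot, or return NULL when every slot is in use. */
static struct scan_control_model *pool_try_get(void)
{
	for (size_t i = 0; i < POOL_SLOTS; i++) {
		/* test_and_set returns the old value: false means we won the slot */
		if (!atomic_flag_test_and_set(&slot_busy[i])) {
			pool[i] = (struct scan_control_model){ 0 };
			return &pool[i];
		}
	}
	return NULL;
}

/* Hand a slot back so another reclaimer can use it. */
static void pool_put(struct scan_control_model *sc)
{
	assert(sc >= pool && sc < pool + POOL_SLOTS);
	atomic_flag_clear(&slot_busy[sc - pool]);
}
```

With this shape the number of concurrent direct reclaimers can never exceed
POOL_SLOTS, which is the first benefit described above; the stack saving comes
from the structure no longer living in the caller's frame.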


I don't think he is levelling a complaint at XFS in particular - just pointing
out that it's heavy too. Still, we ...
From: Dave Chinner
Date: Wednesday, April 14, 2010 - 6:34 pm

If you ask it to clean a bunch of pages around the one you want to
reclaim on the LRU, there is a good chance it will also be cleaning
pages that are near the end of the LRU or physically close by as
well. It's not a guarantee, but for the additional IO cost of about
10% wall time on that IO to clean the page you need, you also get
1-2 orders of magnitude other pages cleaned. That sounds like a
win any way you look at it...

I agree that it doesn't solve the stack problem (Chris' suggestion
that we enable the bdi flusher interface would fix this); what I'm
pointing out is that the arguments that it is too hard or there are
no interfaces available to issue larger IO from reclaim are not at

That's definitely true, but it shouldn't cloud the fact that most
ppl want to kill writeback from direct reclaim, too, so killing two
birds with one stone seems like a good idea.

How about this? For now, we stop direct reclaim from doing writeback
only on order zero allocations, but allow it for higher order
allocations. That will prevent the majority of situations where
direct reclaim blows the stack and interferes with background
writeout, but won't cause lumpy reclaim to change behaviour.
This reduces the scope of impact and hence the testing and validation
that needs to be done.

Then we can work towards allowing lumpy reclaim to use background
threads as Chris suggested for doing specific writeback operations
to solve the remaining problems being seen. Does this seem like a

I couldn't agree more - the kernel still needs to be put on a stack
usage diet, but the above would give us some breathing space to attack the


I like the idea - it really sounds like you want a fixed size,
preallocated mempool that can't be enlarged. In fact, I can probably
use something like this in XFS to save a couple of hundred bytes of

Yeah, true. Sorry if I'm being a bit too defensive here - the scars
from previous discussions like this are showing through....

Cheers,

Dave.
-- 
Dave ...
From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:09 pm

Tend to agree, but I would propose a slightly different algorithm to
avoid incorrect OOM.

for high order allocation
	allow to use lumpy reclaim and pageout() for both kswapd and direct reclaim

for low order allocation
	- kswapd:          always delegate io to flusher thread
	- direct reclaim:  delegate io to flusher thread only if vm pressure is low

This seems safer. I mean, who wants to see an incorrect OOM regression?
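
The proposal above can be sketched as a single decision function. This is an
illustration, not the posted patch: delegate_to_flusher() is a hypothetical
name, and the "pressure is low" test borrows the sc->priority > DEF_PRIORITY/2
check from the [4/4] patch later in this thread. DEF_PRIORITY is 12 in the
kernel, and reclaim priority counts down towards 0 as pressure rises.

```c
#include <assert.h>
#include <stdbool.h>

/* Same value as the kernel's DEF_PRIORITY; lower priority means more pressure. */
#define DEF_PRIORITY 12

/* Returns true when reclaim should leave the dirty page to the flusher threads. */
static bool delegate_to_flusher(int order, int priority, bool is_kswapd)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);

	if (order > 0)
		return false;	/* high order: keep lumpy reclaim and pageout() */
	if (is_kswapd)
		return true;	/* low order, kswapd: always delegate */
	/* low order, direct reclaim: delegate only while pressure is low */
	return priority > DEF_PRIORITY / 2;
}
```

Note that the sketch follows the wording of the proposal, where high-order
requests keep pageout() for both kswapd and direct reclaim.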

Tend to agree. We are probably discussing the right approach now, but
this definitely needs deep thinking, so I can't give an exact
answer yet.




--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:11 pm

Now, vmscan pageout() is one source of IO throughput degradation.
Some IO workloads do a great many order-0 allocations and reclaims,
and pageout()'s 4K IOs cause an annoying number of seeks.

At least kswapd can avoid such pageout() calls, because kswapd doesn't
need to consider the OOM-killer situation, so there is no risk.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..d392a50 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -614,6 +614,13 @@ static enum page_references page_check_references(struct page *page,
 	if (referenced_page)
 		return PAGEREF_RECLAIM_CLEAN;
 
+	/*
+	 * Delegate pageout IO to flusher thread. They can make more
+	 * effective IO pattern.
+	 */
+	if (current_is_kswapd())
+		return PAGEREF_RECLAIM_CLEAN;
+
 	return PAGEREF_RECLAIM;
 }
 
-- 
1.6.5.2



--

From: Suleiman Souhlal
Date: Thursday, April 15, 2010 - 1:05 am

What's your opinion on trying to cluster the writes done by pageout,  
instead of not doing any paging out in kswapd?
Something along these lines:

     Cluster writes to disk due to memory pressure.

     Write out logically adjacent pages to the one we're paging out
     so that we may get better IOs in these situations:
     These pages are likely to be contiguous on disk to the one we're
     writing out, so they should get merged into a single disk IO.

     Signed-off-by: Suleiman Souhlal <suleiman@google.com>

diff --git a/mm/vmscan.c b/mm/vmscan.c
index c26986c..4e5a613 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -48,6 +48,8 @@

  #include "internal.h"

+#define PAGEOUT_CLUSTER_PAGES	16
+
  struct scan_control {
  	/* Incremented by the number of inactive pages that were scanned */
  	unsigned long nr_scanned;
@@ -350,6 +352,8 @@ typedef enum {
  static pageout_t pageout(struct page *page, struct address_space *mapping,
  						enum pageout_io sync_writeback)
  {
+	int i;
+
  	/*
  	 * If the page is dirty, only perform writeback if that write
  	 * will be non-blocking.  To prevent this allocation from being
@@ -408,6 +412,37 @@ static pageout_t pageout(struct page *page, struct address_space *mapping,
  		}

  		/*
+		 * Try to write out logically adjacent dirty pages too, if
+		 * possible, to get better IOs, as the IO scheduler should
+		 * merge them with the original one, if the file is not too
+		 * fragmented.
+		 */
+		for (i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
+			struct page *p2;
+			int err;
+
+			p2 = find_get_page(mapping, page->index + i);
+			if (p2) {
+				if (trylock_page(p2) == 0) {
+					page_cache_release(p2);
+					break;
+				}
+				if (page_mapped(p2))
+					try_to_unmap(p2, 0);
+				if (PageDirty(p2)) {
+					err = write_one_page(p2, 0);
+					page_cache_release(p2);
+					if (err)
+						break;
+				} else ...
From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 1:17 am

Interesting. 
So, I'd like to review your patch carefully. can you please give me one




--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 1:26 am

Hannes, if I remember correctly, you tried similar swap-cluster IO a
long time ago. Now I can't remember why we didn't merge that patch.



--

From: Johannes Weiner
Date: Thursday, April 15, 2010 - 3:30 am

Oh, quite vividly in fact :)  For a lot of swap loads the LRU order
diverged heavily from swap slot order and readaround was a waste of
time.

Of course, the patch looked good, too, but it did not match reality
that well.

I guess 'how about this patch?' won't get us as far as 'how about
those numbers/graphs of several real-life workloads?  oh and here

For random IO, LRU order will have nothing to do with mapping/disk order.
--

From: Suleiman Souhlal
Date: Thursday, April 15, 2010 - 10:24 am

Right, that's why the patch writes out contiguous pages in mapping order.

If they are contiguous on disk with the original page, then writing them out
as well should be essentially free (when it comes to disk time). There is
almost no waste of memory regardless of the access patterns, as far as I
can tell.

This patch is just a proof of concept and could be improved by getting help
from the filesystem/swap code to ensure that the additional pages we're
writing out really are contiguous with the original one.
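
To make "contiguous pages in mapping order" concrete, here is a small
userspace model of how such a cluster could be chosen. It is only a sketch
under assumptions: the dirty[] array stands in for the page cache,
cluster_extra_pages() is a hypothetical helper, and stopping at the first
clean page is an assumption of this sketch, since the posted hunk is cut off
before that branch. Only PAGEOUT_CLUSTER_PAGES and its value come from the
patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define PAGEOUT_CLUSTER_PAGES 16	/* value taken from the posted patch */

/*
 * Count how many pages directly after `index` could join the write.
 * The run ends at the first clean page (an assumption of this sketch),
 * at the end of the mapping, or once the cluster is full.
 */
static size_t cluster_extra_pages(const bool *dirty, size_t nr_pages,
				  size_t index)
{
	size_t extra = 0;

	assert(dirty != NULL && index < nr_pages);
	for (size_t i = 1; i < PAGEOUT_CLUSTER_PAGES; i++) {
		if (index + i >= nr_pages || !dirty[index + i])
			break;
		extra++;
	}
	return extra;
}
```

Because the walk is by file offset rather than LRU position, the extra pages
are neighbours in the mapping; whether they are also neighbours on disk still
depends on the filesystem, as noted above.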

-- Suleiman
--

From: Ying Han
Date: Monday, April 19, 2010 - 7:56 pm

Hannes,

We recently ran into this problem while running some experiments on
ext4 filesystem. We experienced the scenario where we are writing a
large file or just opening a large file with limited memory allocation
(using containers), and the process got OOMed. The memory assigned to
the container is reasonably large, and the OOM can not be reproduced
on ext2 with the same configurations.

Later we figured this might be due to the delayed block allocation
from ext4. Vmscan sends a single page to ext4->writepage(), then ext4
punts if the block is DA'ed and re-dirties the page. On the other
hand, the flusher thread uses ext4->writepages() which does include the
block allocation.

We looked at the OOM log under ext4, all pages within the container
were in inactive list and either Dirty or WriteBack. Also, the zones
are all marked as "all_unreclaimable" which indicates the reclaim path
has scanned the LRU quite a lot of times without making progress. If the
delayed block allocation is the cause for pageout() not being able to
flush dirty pages and then triggers OOMs, should we signal the fs to
force write out dirty pages under memory pressure?

--

From: Dave Chinner
Date: Thursday, April 15, 2010 - 2:32 am

XFS already does this in ->writepage to try to minimise the impact
of the way pageout issues IO. It helps, but it is still not as good
as having all the writeback come from the flusher threads because
it's still pretty much random IO.

And, FWIW, it doesn't solve the stack usage problems, either. In
fact, it will make them worse as write_one_page() puts another
struct writeback_control on the stack...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 2:41 am

I haven't reviewed that patch yet, so I'm speaking generally.
pageout() doesn't only write out file-backed pages, but also
swap-backed pages. so, filesystem optimization nor flusher thread

Correct. We need to avoid a double writeback_control on the stack.
We probably need to split pageout() into several pieces.



--

From: Suleiman Souhlal
Date: Thursday, April 15, 2010 - 10:27 am

Doesn't the randomness become irrelevant if you can cluster enough

Sorry, this patch was not meant to solve the stack usage problems.

-- Suleiman
--

From: Dave Chinner
Date: Thursday, April 15, 2010 - 4:33 pm

No. If you are doing full disk seeks between random chunks, then you
still lose a large amount of throughput. e.g. if the seek time is
10ms and your IO time is 10ms for each 4k page, then increasing the
size to 64k makes it 10ms seek and 12ms for the IO. We might increase
throughput but we are still limited to 100 IOs per second. We've
gone from 400kB/s to 6MB/s, but that's still an order of magnitude
short of the 100MB/s that full size IOs with little in the way of seeks
between them will achieve on the same spindle...
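
The figures above follow from treating the disk as seek-bound at roughly 100
IOs per second. A minimal sketch of that arithmetic, with
seek_bound_kb_per_sec() as a hypothetical helper name:

```c
#include <assert.h>

/* Throughput in kB/s for a seek-bound device that completes one IO per seek. */
static unsigned long seek_bound_kb_per_sec(unsigned long io_per_sec,
					   unsigned long io_size_kb)
{
	assert(io_per_sec > 0 && io_size_kb > 0);
	return io_per_sec * io_size_kb;
}
```

At 100 IOs/s, 4k IOs give 400 kB/s and 64k IOs give 6400 kB/s, i.e. the
400kB/s and roughly 6MB/s quoted above. Growing the IO size helps, but the
per-IO seek still caps the result well below streaming speed.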

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Suleiman Souhlal
Date: Thursday, April 15, 2010 - 4:41 pm

What I meant was that, theoretically speaking, you could increase the
maximum amount of pages that get clustered so that you could get
100MB/s, although it most likely wouldn't be a good idea with the
current patch.

-- Suleiman
--

From: Alan Cox
Date: Friday, April 16, 2010 - 2:50 am

The usual armwaving numbers for ops/sec for an ATA disk are in the 200
ops/sec range so that seems horribly credible.

But then I've never quite understood why our anonymous paging isn't
sorting stuff as best it can and then using the drive as a log structure
with in memory metadata so it can stream the pages onto disk. Read
performance is going to be similar (maybe better if you have a log tidy
when idle), write ought to be far better.

Alan
--

From: Dave Chinner
Date: Friday, April 16, 2010 - 8:06 pm

Yeah, in my experience 7200rpm SATA will get you 200 ops/s when you
are doing really small seeks as the typical minimum seek time is
around 4-5ms. Average seek time, however, is usually in the range of
10ms, because full head sweep + spindle rotation seeks take in the
order of 15ms.

Hence small random IO tends to result in seek times nearer the
average seek time than the minimum, so that's what i tend to use for

Sounds like a worthy project for someone to sink their teeth into.
Lots of people would like to have a system that can page out at
hundreds of megabytes a second....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 1:18 am

I've found one bug in this patch myself. flusher thread don't



--

From: Mel Gorman
Date: Thursday, April 15, 2010 - 3:31 am

Well, there is some risk here. Direct reclaimers may not be cleaning
more pages than it had to previously except it splices subsystems
together increasing stack usage and causing further problems.

It might not cause OOM-killer issues but it could increase the time
dirty pages spend on the LRU.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 4:26 am

No, you are right. I fully agree with your previous mail, so I need to cool down a bit ;)







--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:13 pm

This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it.

=============================================
Since 2.6.28 zone->prev_priority has been unused, so it can be removed
safely. This reduces stack usage slightly.

Now I have to say that I'm sorry. Two years ago I thought prev_priority
could be integrated again and would be useful, but four (or more) attempts
have not produced good performance numbers, so I am giving up on that approach.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 include/linux/mmzone.h |   15 -------------
 mm/page_alloc.c        |    2 -
 mm/vmscan.c            |   54 ++---------------------------------------------
 mm/vmstat.c            |    2 -
 4 files changed, 3 insertions(+), 70 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index cf9e458..ad76962 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -339,21 +339,6 @@ struct zone {
 	atomic_long_t		vm_stat[NR_VM_ZONE_STAT_ITEMS];
 
 	/*
-	 * prev_priority holds the scanning priority for this zone.  It is
-	 * defined as the scanning priority at which we achieved our reclaim
-	 * target at the previous try_to_free_pages() or balance_pgdat()
-	 * invocation.
-	 *
-	 * We use prev_priority as a measure of how much stress page reclaim is
-	 * under - it drives the swappiness decision: whether to unmap mapped
-	 * pages.
-	 *
-	 * Access to both this field is quite racy even on uniprocessor.  But
-	 * it is expected to average out OK.
-	 */
-	int prev_priority;
-
-	/*
 	 * The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
 	 * this zone's LRU.  Maintained by the pageout code.
 	 */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d03c946..88513c0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3862,8 +3862,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
 		zone_seqlock_init(zone);
 		zone->zone_pgdat = pgdat;
 ...
From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:14 pm

ditto

This patch is not directly related to the patch series,
but [4/4] depends on scan_control having a `priority' member,
so I'm including it.
=========================================

Many functions in vmscan now take a `priority' argument, which consumes
a little stack each time. Moving it into struct scan_control reduces stack usage.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   83 ++++++++++++++++++++++++++--------------------------------
 1 files changed, 37 insertions(+), 46 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index dadb461..8b78b49 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
 
 	int order;
 
+	int priority;
+
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
 
@@ -1130,7 +1132,7 @@ static int too_many_isolated(struct zone *zone, int file,
  */
 static unsigned long shrink_inactive_list(unsigned long max_scan,
 			struct zone *zone, struct scan_control *sc,
-			int priority, int file)
+			int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
@@ -1156,7 +1158,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		lumpy_reclaim = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if (sc->order && sc->priority < DEF_PRIORITY - 2)
 		lumpy_reclaim = 1;
 
 	pagevec_init(&pvec, 1);
@@ -1335,7 +1337,7 @@ static void move_active_pages_to_lru(struct zone *zone,
 }
 
 static void shrink_active_list(unsigned long nr_pages, struct zone *zone,
-			struct scan_control *sc, int priority, int file)
+			struct scan_control *sc, int file)
 {
 	unsigned long nr_taken;
 	unsigned long pgscanned;
@@ -1498,17 +1500,17 @@ static int inactive_list_is_low(struct zone *zone, struct scan_control *sc,
 }
 
 static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
-	struct zone *zone, struct scan_control *sc, int priority)
+	struct zone *zone, struct scan_control *sc)
 ...
From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:15 pm

Even if pageout() is called from direct reclaim, we can delegate IO to
the flusher thread if VM pressure is low.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |    7 +++++++
 1 files changed, 7 insertions(+), 0 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 8b78b49..eab6028 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -623,6 +623,13 @@ static enum page_references page_check_references(struct page *page,
 	if (current_is_kswapd())
 		return PAGEREF_RECLAIM_CLEAN;
 
+	/*
+	 * Now VM pressure is not so high. then we can delegate
+	 * page cleaning to flusher thread safely.
+	 */
+	if (!sc->order && sc->priority > DEF_PRIORITY/2)
+		return PAGEREF_RECLAIM_CLEAN;
+
 	return PAGEREF_RECLAIM;
 }
 
-- 
1.6.5.2



--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:35 pm

Now, a kernel compile and/or backup operation seems to keep nr_vmscan_write==0.
Dave, can you please try running your pageout-annoying workload?



--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 11:32 pm

It's just as easy for you to run and observe the effects. Start with a VM
with 1GB RAM and a 10GB scratch block device:

# mkfs.xfs -f /dev/<blah>
# mount -o logbsize=262144,nobarrier /dev/<blah> /mnt/scratch

in one shell:

# while [ 1 ]; do dd if=/dev/zero of=/mnt/scratch/foo bs=1024k ; done

in another shell, if you have fs_mark installed, run:

# ./fs_mark -S0 -n 100000 -F -s 0 -d /mnt/scratch/0 -d /mnt/scratch/1 -d /mnt/scratch/3 -d /mnt/scratch/2 &

otherwise run a couple of these in parallel on different directories:

# for i in `seq 1 1 100000`; do echo > /mnt/scratch/0/foo.$i ; done

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 11:58 pm

A filesystem on a loopback device will work just as well ;)

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Wednesday, April 14, 2010 - 11:20 pm

IMO, this really doesn't fix either of the problems - the bad IO
patterns nor the stack usage. All it will take is a bit more memory
pressure to trigger stack and IO problems, and the user reporting the
problems is generating an awful lot of memory pressure...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Dave Chinner
Date: Thursday, April 15, 2010 - 1:54 am

Agreed (again), but we've already come to the conclusion that a

Yes, I suggested it *as a first step*, not as the end goal. Your
patches don't reach the first step which is fixing the reported

Given that I haven't been able to trigger OOM without writeback from
direct reclaim so far (*) I'm not finding any evidence that it is a
problem or that there are regressions.  I want to be able to say
that this change has no known regressions. I want to find the
regressions and work to fix them, but without test cases there's no
way I can do this.

This is what I'm getting frustrated about - I want to fix this
problem once and for all, but I can't find out what I need to do to
robustly test such a change so we can have a high degree of
confidence that it doesn't introduce major regressions. Can anyone
help here?

(*) except in one case I've already described where it managed to
allocate enough huge pages to starve the system of order zero pages,

You're asking me? I've been asking you for workloads that wind up
reclaim priority.... :/

All I can say is that the most common trigger I see for OOM is
copying a large file on a busy system that is running off a single
spindle.  When that happens on my laptop I walk away and get a cup
of coffee, and when I come back I pick up all the
broken bits the OOM killer left behind.....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 3:21 am

I have some diet patches as a separate series. I'll post today's diet patch

Agreed. I'm sorry about that. Probably nobody in the world has
enough VM test cases, even counting non-Linux people. Modern general
purpose OSes are used for a really wide variety of purposes and machines,
so I have never seen a VM change with perfectly zero regressions. I feel
the same frustration all the time.

Because much of the VM mess exists to avoid extreme starvation cases. But if

??? Do I misunderstand your last mail?

And I asked which are "the bad IO patterns". If that was not your intention,
what did you mean by IO pattern?

If my understanding is correct, you asked me about the case where vmscan hurts,
and I asked you about your bad IO pattern.


As far as I understand, you are talking in general terms rather than about a
specific case, so I am also talking in general terms. In general, I think a
slowdown is better than the OOM killer. So, even though we need more and more
improvement, we must always take care to avoid incorrect OOM. IOW, I'd prefer
step by step development.




--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 3:23 am

Now, the max_scan passed to shrink_inactive_list() is always less than
SWAP_CLUSTER_MAX, so we can remove the page scanning loop in it.
This patch also helps the stack diet.

detail
 - remove "while (nr_scanned < max_scan)" loop
 - remove nr_freed (now, we use nr_reclaimed directly)
 - remove nr_scan (now, we use nr_scanned directly)
 - rename max_scan to nr_to_scan
 - pass nr_to_scan into isolate_pages() directly instead of
   using SWAP_CLUSTER_MAX

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |  190 ++++++++++++++++++++++++++++-------------------------------
 1 files changed, 89 insertions(+), 101 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index eab6028..4de4029 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1137,16 +1137,22 @@ static int too_many_isolated(struct zone *zone, int file,
  * shrink_inactive_list() is a helper for shrink_zone().  It returns the number
  * of reclaimed pages
  */
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 			struct zone *zone, struct scan_control *sc,
 			int file)
 {
 	LIST_HEAD(page_list);
 	struct pagevec pvec;
-	unsigned long nr_scanned = 0;
+	unsigned long nr_scanned;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
 	int lumpy_reclaim = 0;
+	struct page *page;
+	unsigned long nr_taken;
+	unsigned long nr_active;
+	unsigned int count[NR_LRU_LISTS] = { 0, };
+	unsigned long nr_anon;
+	unsigned long nr_file;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1172,119 +1178,101 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 
 	lru_add_drain();
 	spin_lock_irq(&zone->lru_lock);
-	do {
-		struct page *page;
-		unsigned long nr_taken;
-		unsigned long nr_scan;
-		unsigned long nr_freed;
-		unsigned long nr_active;
-		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int ...
From: Mel Gorman
Date: Thursday, April 15, 2010 - 6:15 am

Yep. I modified bloat-o-meter to work with stacks (imaginatively calling it
stack-o-meter) and got the following. The prereq patches are from
earlier in the thread with the subjects

vmscan: kill prev_priority completely
vmscan: move priority variable into scan_control

It gets

$ stack-o-meter vmlinux-vanilla vmlinux-1-2patchprereq 
add/remove: 0/0 grow/shrink: 0/1 up/down: 0/-72 (-72)
function                                     old     new   delta
kswapd                                       748     676     -72

and with this patch on top

$ stack-o-meter vmlinux-vanilla vmlinux-2-simplfy-shrink 
add/remove: 0/0 grow/shrink: 0/2 up/down: 0/-144 (-144)
function                                     old     new   delta
shrink_zone                                 1232    1160     -72
kswapd                                       748     676     -72


I couldn't spot any problems. I'd consider throwing a

WARN_ON(nr_to_scan > SWAP_CLUSTER_MAX) in case some future change breaks
the assumptions but otherwise.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andi Kleen
Date: Thursday, April 15, 2010 - 8:01 am

And the next time someone adds a new feature to these code paths or
the compiler inlines differently these 72 bytes are easily there
again. It's not really a long term solution. Code is tending to get
more complicated all the time. I consider it unlikely this trend will
stop any time soon.

So just doing some stack micro optimizations doesn't really help 
all that much.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Mel Gorman
Date: Thursday, April 15, 2010 - 8:44 am

The same logic applies when/if page writeback is split so that it is

It's a buying-time venture, I'll agree but as both approaches are only
about reducing stack usage, they wouldn't be long-term solutions by your
criteria. What do you suggest?


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andi Kleen
Date: Thursday, April 15, 2010 - 9:54 am

(from easy to more complicated):

- Disable direct reclaim with 4K stacks
- Do direct reclaim only on separate stacks
- Add interrupt stacks to any 8K stack architectures.
- Get rid of 4K stacks completely
- Think about any other stackings that could give large scale recursion
and find ways to run them on separate stacks too.
- Long term: maybe we need 16K stacks at some point, depending on how
good the VM gets. Alternative would be to stop making Linux more complicated,
but that's unlikely to happen.


-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Dave Chinner
Date: Thursday, April 15, 2010 - 4:40 pm

Just to re-iterate: we're blowing the stack with direct reclaim on
x86_64  w/ 8k stacks.  The old i386/4k stack problem is a red
herring.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Andi Kleen
Date: Friday, April 16, 2010 - 12:13 am

Yes that's known, but on 4K it will definitely not work at all.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Mel Gorman
Date: Friday, April 16, 2010 - 7:57 am

Yep, that is not being disputed. By the way, what did you use to
generate your report? Was it CONFIG_DEBUG_STACK_USAGE or something else?
I used a modified bloat-o-meter to gather my data but it'd be nice to
be sure I'm seeing the same things as you (minus XFS unless I

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Chinner
Date: Friday, April 16, 2010 - 7:37 pm

I'm using the tracing subsystem to get them. Doesn't everyone use
that now? ;)

$ grep STACK .config
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_REGS_AND_STACK_ACCESS_API=y
# CONFIG_CC_STACKPROTECTOR is not set
CONFIG_STACKTRACE=y
CONFIG_USER_STACKTRACE_SUPPORT=y
CONFIG_STACK_TRACER=y
# CONFIG_DEBUG_STACKOVERFLOW is not set
# CONFIG_DEBUG_STACK_USAGE is not set

Then:

# echo 1 > /proc/sys/kernel/stack_tracer_enabled

<run workloads>

Monitor the worst recorded stack usage as it changes via:

# cat /sys/kernel/debug/tracing/stack_trace
        Depth    Size   Location    (44 entries)
        -----    ----   --------
  0)     5584     288   get_page_from_freelist+0x5c0/0x830
  1)     5296     272   __alloc_pages_nodemask+0x102/0x730
  2)     5024      48   kmem_getpages+0x62/0x160
  3)     4976      96   cache_grow+0x308/0x330
  4)     4880      96   cache_alloc_refill+0x27f/0x2c0
  5)     4784      96   __kmalloc+0x241/0x250
  6)     4688     112   vring_add_buf+0x233/0x420
......
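
For anyone reading the table: in the rows shown, each Depth value is the Depth
of the row above minus that row's Size, so entry 0 carries the total. A small
sketch that rebuilds the Depth column from the Size column (fill_depths() is a
hypothetical helper, not part of the tracer):

```c
#include <assert.h>
#include <stddef.h>

/*
 * Rebuild the Depth column: entry 0 holds the total stack depth, and each
 * later entry is the previous depth minus the previous entry's frame size.
 */
static void fill_depths(unsigned total, const unsigned *size,
			unsigned *depth, size_t n)
{
	assert(size != NULL && depth != NULL);
	for (size_t i = 0; i < n; i++) {
		depth[i] = total;
		assert(size[i] <= total);
		total -= size[i];
	}
}
```

Feeding it the seven sizes above with a total of 5584 reproduces the Depth
values 5584, 5296, 5024, 4976, 4880, 4784 and 4688.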


Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Mel Gorman
Date: Friday, April 16, 2010 - 7:55 am

Do not like. While I can see why 4K stacks are a serious problem, I'd
sooner see 4K stacks disabled than have the kernel behave so differently
for direct reclaim. It'd be tricky to spot regressions in reclaim that


This is a similar but separate problem. It's similar in that interrupt

Why would we *not* do this? I can't remember the original reasoning
behind 4K stacks but am guessing it helped fork-orientated workloads in
startup times in the days before lumpy reclaim and better fragmentation
control.


The patch series I threw up about reducing stack was a cut-down
approach. Instead of using separate stacks, keep the stack usage out of

Make this Plan D if nothing else works out and we still hit a wall?

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Valdis.Kletnieks
Date: Thursday, April 15, 2010 - 11:22 am

Think that's a script worth having in-tree?
From: Mel Gorman
Date: Friday, April 16, 2010 - 2:39 am

Ahh, it's a hatchet-job at the moment. I copied bloat-o-meter and
altered one function. I made a TODO note to extend bloat-o-meter
properly and that would be worth merging.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 3:24 am

This patch is used by [3/4]

===================================
free_hot_cold_page() and __free_pages_ok() have very similar
freeing preparation. This patch consolidates it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/page_alloc.c |   40 +++++++++++++++++++++-------------------
 1 files changed, 21 insertions(+), 19 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88513c0..ba9aea7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -599,20 +599,23 @@ static void free_one_page(struct zone *zone, struct page *page, int order,
 	spin_unlock(&zone->lock);
 }
 
-static void __free_pages_ok(struct page *page, unsigned int order)
+static int free_pages_prepare(struct page *page, unsigned int order)
 {
-	unsigned long flags;
 	int i;
 	int bad = 0;
-	int wasMlocked = __TestClearPageMlocked(page);
 
 	trace_mm_page_free_direct(page, order);
 	kmemcheck_free_shadow(page, order);
 
-	for (i = 0 ; i < (1 << order) ; ++i)
-		bad += free_pages_check(page + i);
+	for (i = 0 ; i < (1 << order) ; ++i) {
+		struct page *pg = page + i;
+
+		if (PageAnon(pg))
+			pg->mapping = NULL;
+		bad += free_pages_check(pg);
+	}
 	if (bad)
-		return;
+		return -EINVAL;
 
 	if (!PageHighMem(page)) {
 		debug_check_no_locks_freed(page_address(page),PAGE_SIZE<<order);
@@ -622,6 +625,17 @@ static void __free_pages_ok(struct page *page, unsigned int order)
 	arch_free_page(page, order);
 	kernel_map_pages(page, 1 << order, 0);
 
+	return 0;
+}
+
+static void __free_pages_ok(struct page *page, unsigned int order)
+{
+	unsigned long flags;
+	int wasMlocked = __TestClearPageMlocked(page);
+
+	if (free_pages_prepare(page, order))
+		return;
+
 	local_irq_save(flags);
 	if (unlikely(wasMlocked))
 		free_page_mlock(page);
@@ -1107,21 +1121,9 @@ void free_hot_cold_page(struct page *page, int cold)
 	int migratetype;
 	int wasMlocked = __TestClearPageMlocked(page);
 
-	trace_mm_page_free_direct(page, ...
From: Mel Gorman
Date: Thursday, April 15, 2010 - 6:33 am

You don't appear to do anything with the return value. bool? Otherwise I
see no problems
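
A minimal user-space sketch of what the bool suggestion could look like. The struct and function names below are stand-ins invented for illustration (they are not the kernel's `struct page` or the patch's `free_pages_prepare()`); only the shape of the return value is the point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct page_sketch {
	bool bad;	/* models a page that free_pages_check() would reject */
	void *mapping;
	bool anon;	/* models PageAnon() */
};

/* Returns true when every page in the 2^order block may be freed. */
static bool free_pages_prepare_sketch(struct page_sketch *page,
				      unsigned int order)
{
	unsigned int i;
	int bad = 0;

	assert(page != NULL);
	for (i = 0; i < (1u << order); i++) {
		struct page_sketch *pg = page + i;

		if (pg->anon)
			pg->mapping = NULL;
		bad += pg->bad;
	}
	return bad == 0;
}
```

A caller would then read `if (!free_pages_prepare_sketch(page, order)) return;`, which states the intent without inventing an errno value that nobody inspects.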


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 3:26 am

On x86_64, sizeof(struct pagevec) is 8*16=128, but
sizeof(struct list_head) is 8*2=16. So replacing the pagevec with a list
saves 112 bytes of stack.
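
The arithmetic above can be checked with a small user-space sketch. The two structs are stand-ins that mirror the layout the figures imply (two unsigned longs plus 14 page pointers, and two list pointers); the `_sketch` names and the value of `PAGEVEC_SIZE` are assumptions made for illustration, not copied from the tree under discussion.

```c
#include <assert.h>
#include <stddef.h>

#define PAGEVEC_SIZE 14		/* assumed; gives the 8*16 = 128 quoted above */

struct page;			/* opaque, never dereferenced here */

/* Stand-in for struct pagevec: two counters plus an array of pointers. */
struct pagevec_sketch {
	unsigned long nr;
	unsigned long cold;
	struct page *pages[PAGEVEC_SIZE];
};

/* Stand-in for struct list_head: just two pointers. */
struct list_head_sketch {
	struct list_head_sketch *next, *prev;
};

static_assert(sizeof(struct pagevec_sketch) > sizeof(struct list_head_sketch),
	      "the pagevec is expected to be the larger object");

/* Bytes of stack saved by swapping an on-stack pagevec for a list head. */
static size_t pagevec_to_list_saving(void)
{
	return sizeof(struct pagevec_sketch) - sizeof(struct list_head_sketch);
}
```

On an LP64 target such as x86_64 this evaluates to 128 - 16 = 112 bytes.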

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   22 ++++++++++++++--------
 1 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4de4029..fbc26d8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -93,6 +93,8 @@ struct scan_control {
 			unsigned long *scanned, int order, int mode,
 			struct zone *z, struct mem_cgroup *mem_cont,
 			int active, int file);
+
+	struct list_head free_batch_list;
 };
 
 #define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
@@ -641,13 +643,11 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 					enum pageout_io sync_writeback)
 {
 	LIST_HEAD(ret_pages);
-	struct pagevec freed_pvec;
 	int pgactivate = 0;
 	unsigned long nr_reclaimed = 0;
 
 	cond_resched();
 
-	pagevec_init(&freed_pvec, 1);
 	while (!list_empty(page_list)) {
 		enum page_references references;
 		struct address_space *mapping;
@@ -822,10 +822,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
 		__clear_page_locked(page);
 free_it:
 		nr_reclaimed++;
-		if (!pagevec_add(&freed_pvec, page)) {
-			__pagevec_free(&freed_pvec);
-			pagevec_reinit(&freed_pvec);
-		}
+		list_add(&page->lru, &sc->free_batch_list);
 		continue;
 
 cull_mlocked:
@@ -849,8 +846,6 @@ keep:
 		VM_BUG_ON(PageLRU(page) || PageUnevictable(page));
 	}
 	list_splice(&ret_pages, page_list);
-	if (pagevec_count(&freed_pvec))
-		__pagevec_free(&freed_pvec);
 	count_vm_events(PGACTIVATE, pgactivate);
 	return nr_reclaimed;
 }
@@ -1238,6 +1233,11 @@ static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
 						 PAGEOUT_IO_SYNC);
 	}
 
+	/*
+	 * Free unused pages.
+	 */
+	free_pages_bulk(zone, &sc->free_batch_list);
+
 	local_irq_disable();
 	if (current_is_kswapd())
 ...
From: Mel Gorman
Date: Thursday, April 15, 2010 - 6:46 am

You could clear this under the zone->lock below before calling
__free_one_page. It'd avoid a large number of IRQ enables and disables which
are a problem on some CPUs (P4 and Itanium both blow in this regard according

This has the effect of bypassing the per-cpu lists as well as making the
zone lock hotter. The cache hotness of the data within the page is
probably not a factor but the cache hotness of the struct page is.

The zone lock getting hotter is a greater problem. Large amounts of page
reclaim or dumping of page cache will now contend on the zone lock where
as previously it would have dumped into the per-cpu lists (potentially
but not necessarily avoiding the zone lock).

While there might be a stack saving in the next patch, there would appear
to be definite performance implications in taking this patch.

Functionally, I see no problem but I'd put this sort of patch on the

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Thursday, April 15, 2010 - 3:28 am

At worst, it'll distort the LRU ordering slightly. Let's say the
file-adjacent-page you clean was near the end of the LRU. Before such a
patch, it may have gotten cleaned and done another lap of the LRU.
After, it would be reclaimed sooner. I don't know if we depend on such
behaviour (very doubtful) but it's a subtle enough change. I can't
predict what it'll do for IO congestion. Simplistically, there is more
IO so it's bad but if the write pattern is less seeky and we needed to

I'm afraid I'm not familiar with this interface. Can you point me at
some previous discussion so that I am sure I am looking at the right

Sure, I'm not resisting fixing this, just your first patch :) There are four
goals here

1. Reduce stack usage
2. Avoid the splicing of subsystem stack usage with direct reclaim
3. Preserve lumpy reclaim's cleaning of contiguous pages
4. Try and not drastically alter LRU aging

1 and 2 are important for you, 3 is important for me and 4 will have to
be dealt with on a case-by-case basis.

Your patch fixes 2, avoids 1, breaks 3 and I haven't thought about 4 but I

Ah yes, but I at least will resist killing of writeback from direct
reclaim because of lumpy reclaim. Again, I recognise the seek pattern

I'd like this to be plan b (or maybe c or d) if we cannot reduce stack usage
enough or come up with an alternative fix. From the goals above it mitigates
1, mitigates 2, addresses 3 but potentially allows dirty pages to remain on
the LRU with 4 until the background cleaner or kswapd comes along.

One reason why I am edgy about this is that lumpy reclaim can kick in
for low-enough orders too like order-1 pages for stacks in some cases or
order-2 pages for network cards using jumbo frames or some wireless
cards. The network cards in particular could still cause the stack

I'd like stack reduction to be plan a because it buys time without
making the problem exclusively lumpy reclaims where it can still hit,


Yep. It would cut down around 1K of stack usage when ...
From: Chris Mason
Date: Thursday, April 15, 2010 - 6:42 am

vi fs/direct-reclaim-helper.c, it has a few placeholders for where the
real code needs to go....just look for the ~ marks.

I mostly meant that the bdi helper threads were the best place to add
knowledge about which pages we want to write for reclaim.  We might need
to add a thread dedicated to just doing the VM's dirty work, but that's

I'd like to add one more:

5. Don't dive into filesystem locks during reclaim.

This is different from splicing code paths together, but
the filesystem writepage code has become the center of our attempts at
doing big fat contiguous writes on disk.  We push off work as late as we
can until just before the pages go down to disk.

I'll pick on ext4 and btrfs for a minute, just to broaden the scope
outside of XFS.  Writepage comes along and the filesystem needs to
actually find blocks on disk for all the dirty pages it has promised to
write.

So, we start a transaction, we take various allocator locks, modify
different metadata, log changed blocks, take a break (logging is hard
work you know, need_resched() triggered by now), stuff it
all into the file's metadata, log that, and finally return.

Each of the steps above can block for a long time.  Ext4 solves
this by not doing them.  ext4_writepage only writes pages that
are already fully allocated on disk.

Btrfs is much more efficient at not doing them, it just returns right
away for PF_MEMALLOC.

This is a long way of saying the filesystem writepage code is the
opposite of what direct reclaim wants.  Direct reclaim wants to
find free ram now, and if it does end up in the mess described above,
it'll just get stuck for a long time on work entirely unrelated to
finding free pages.
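
The two filesystem behaviours described above can be modelled in a few lines of user-space C. This is only a sketch of the policies as stated in this message: every type and function name is hypothetical, and none of it is taken from the ext4 or btrfs sources.

```c
#include <assert.h>
#include <stdbool.h>

enum wp_result { WP_WRITTEN, WP_REDIRTIED };

/* Hypothetical page model: just the two bits the policies look at. */
struct wp_page {
	bool blocks_allocated;	/* false models a delayed-allocation page */
	bool dirty;
};

/* ext4-like policy as described: only write pages already allocated on disk. */
static enum wp_result writepage_allocated_only(struct wp_page *page)
{
	assert(page && page->dirty);
	if (!page->blocks_allocated)
		return WP_REDIRTIED;	/* stays dirty, left for the flusher */
	page->dirty = false;
	return WP_WRITTEN;
}

/* btrfs-like policy as described: return right away in reclaim context. */
static enum wp_result writepage_skip_in_reclaim(struct wp_page *page,
						bool in_reclaim)
{
	assert(page && page->dirty);
	if (in_reclaim)		/* models a PF_MEMALLOC caller */
		return WP_REDIRTIED;
	page->dirty = false;
	return WP_WRITTEN;
}
```

In both models the reclaim caller gets its answer quickly, but the page is still dirty afterwards, so it has reclaimed nothing.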

-chris

--

From: tytso
Date: Thursday, April 15, 2010 - 10:50 am

This is a real problem, BTW.  One of the problems we've been fighting
inside Google is because ext4_writepage() refuses to write pages that
are subject to delayed allocation, it can cause the OOM killer to get
invoked.  

I had thought this was because of some evil games we're playing for
container support that makes zones small, but just last night at the
LF Collaboration Summit reception, I ran into a technologist from a
major financial industry customer who reported to me that when they tried
using ext4, they ran into the exact same problem because they were
running Oracle which was pinning down 3 gigs of memory, and then when
they tried writing a very big file using ext4, they had the same
problem of writepage() not being able to reclaim enough pages, so the
kernel fell back to invoking the OOM killer, and things got ugly in a
hurry...

One of the things I was proposing internally to try as a long-term
we-gotta-fix writeback is that we need some kind of signal so that we
can do the lumpy reclaim (a) in a separate process, to avoid a lock
inversion problem and the gee-its-going-to-take-a-long-time problem
which Chris mentioned, and (b) to try to cluster I/O so that we're not
dribbling out writes to the disk in small, seeky, 4k writes, which is
really a disaster from a performance standpoint.  Maybe the VM guys
don't care about this, but this sort of thing tends to get us
filesystem guys all up in a lather not just because of the really
sucky performance, but also because it tends to mean that the system
can thrash itself to death in low memory situations.

    	       	      	     	      	 - Ted
--

From: Mel Gorman
Date: Friday, April 16, 2010 - 8:05 am

I must be blind. What tree is this in? I can't see it in v2.6.34-rc4,

Good add. It's not a new problem either. This came up at least two years
ago at around the first VM/FS summit and the response was along the lines

Ok, good summary, thanks. I was only partially aware of some of these.
i.e. I knew it was a problem but was not sensitive to how bad it was.
Your last point is interesting because lumpy reclaim for large orders under
heavy pressure can make the system stutter badly (e.g. during a huge
page pool resize). I had blamed just plain IO but messing around with
locks and transactions could have been a large factor and I didn't go
looking for it.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Mel Gorman
Date: Monday, April 19, 2010 - 8:15 am

Bah, Johannes corrected my literal mind. har de har har :)

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Chinner
Date: Thursday, April 15, 2010 - 9:14 pm

Fundamentally, we have so many pages on the LRU, getting a few out
of order at the back end of it is going to be in the noise. If we
trade off "perfect" LRU behaviour for cleaning pages an order of
magnitude faster, reclaim will find candidate pages a whole lot
faster. And if we have more clean pages available, faster, overall
system throughput is going to improve and be much less likely to
fall into deep, dark holes where the OOM-killer is the light at the
end.....


#4 is important to me, too, because that has direct impact on large
file IO workloads. However, it is gross changes in behaviour that
concern me, not subtle, probably-in-the-noise changes that you're

Well, you keep saying that they break #3, but I haven't seen any
test cases or results showing that. I've been unable to confirm that
lumpy reclaim is broken by disallowing writeback in my testing, so
I'm interested to know what tests you are running that show it is

We've been through this already, but I'll repeat it again in the
hope it sinks in: reducing stack usage is not sufficient to stay
within an 8k stack if we can enter writeback with an arbitrary
amount of stack already consumed.

We've already got a report of 9k of stack usage (7200 bytes left on
an order-2 stack) and this is without a complex storage stack - it's
just a partition on a SATA drive. We can easily add another 1k,
possibly 2k to that stack depth with a complex storage subsystem.
Trimming this much (3-4k) is simply not feasible in a callchain that
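
The figures in this paragraph can be worked through with a short sketch. It assumes 4 KiB pages, so that an order-2 stack is 16 KiB and the 8k stack is order-1; the helper names are hypothetical and the inputs are simply the numbers quoted above.

```c
#include <assert.h>

#define PAGE_SIZE_SKETCH 4096UL	/* assumed page size */

/* Size in bytes of a stack made of 2^order pages. */
static unsigned long stack_bytes(unsigned int order)
{
	return PAGE_SIZE_SKETCH << order;
}

/* Bytes consumed, given how many bytes were reported as still free. */
static unsigned long stack_used(unsigned int order, unsigned long bytes_left)
{
	assert(bytes_left <= stack_bytes(order));
	return stack_bytes(order) - bytes_left;
}

/* Bytes by which `used` exceeds a stack of the given order, 0 if it fits. */
static unsigned long stack_overrun(unsigned long used, unsigned int order)
{
	unsigned long size = stack_bytes(order);

	return used > size ? used - size : 0;
}
```

With 7200 bytes left on an order-2 stack, `stack_used(2, 7200)` is 9184 bytes, which overruns an 8k stack by 992 bytes. Adding the 2k that a complex storage subsystem might cost brings the overrun to about 3k, the low end of the 3-4k range mentioned.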

So push lumpy reclaim into a separate thread. It already blocks, so
waiting for some other thread to do the work won't change anything.
Separating high-order reclaim from LRU reclaim is probably a good
idea, anyway - they use different algorithms and while the two are
intertwined it's hard to optimise/improve either....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Mel Gorman
Date: Friday, April 16, 2010 - 8:14 am

haha, I don't think anyone pretends the LRU behaviour is perfect.
Altering its existing behaviour tends to be done with great care but

I'm also less concerned with this aspect. I brought it up because it was
a factor. I don't think it'll cause us problems but if problems do
arise, it's nice to have a few potential candidates to examine in

Ok, I haven't actually tested this. The machines I use are tied up
retesting the compaction patches at the moment. The reason why I reckon
it'll be a problem is that when these sync-writeback changes were
introduced, it significantly helped lumpy reclaim for huge pages. I am
making an assumption that backing out those changes will hurt it.


Ok, based on this, I'll stop working on the stack-reduction patches.
I'll test what I have and push it but I won't bring it further for the
moment and instead look at putting writeback into its own thread. If
someone else works on it in the meantime, I'll review and test from the

No, it wouldn't. As long as it can wait on the right pages, it doesn't

They are not a million miles apart either. Lumpy reclaim uses the LRU to
select a cursor page and then reclaims around it. Improvements on LRU tend
to help lumpy reclaim as well. It's why during the tests I run I can often
allocate 80-95% of memory as huge pages on x86-64 as opposed to when anti-frag
was being developed first where getting 30% was a cause for celebration :)
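
The "cursor page, then reclaim around it" idea can be sketched as a pfn calculation. This assumes the block reclaimed around the cursor is the naturally aligned 2^order block containing it; the struct and function names are hypothetical.

```c
#include <assert.h>

/* Half-open range of page frame numbers: [start, end). */
struct pfn_range {
	unsigned long start;
	unsigned long end;
};

/* Aligned 2^order block of page frames that contains the cursor page. */
static struct pfn_range lumpy_range(unsigned long cursor_pfn,
				    unsigned int order)
{
	struct pfn_range r;

	assert(order < 8 * sizeof(unsigned long));
	r.start = cursor_pfn & ~((1UL << order) - 1);
	r.end = r.start + (1UL << order);
	return r;
}
```

For an order-9 request (one 2 MiB huge page made of 4 KiB pages) and a cursor at pfn 0x1234, the candidate block is pfns 0x1200 up to but not including 0x1400. Every page in that block has to become free for the allocation to succeed, which is why a single dirty page in the block matters.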

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Andrew Morton
Date: Saturday, April 17, 2010 - 5:32 pm

There are two issues here: stack utilisation and poor IO patterns in
direct reclaim.  They are different.

The poor IO patterns thing is a regression.  Some time several years
ago (around 2.6.16, perhaps), page reclaim started to do a LOT more
dirty-page writeback than it used to.  AFAIK nobody attempted to work
out why, nor attempted to try to fix it.


Doing writearound in pageout() might help.  The kernel was in fact
doing that around 2.5.10, but I took it out again because it wasn't
obviously beneficial.

Writearound is hard to do, because direct-reclaim doesn't have an easy
way of pinning the address_space: it can disappear and get freed under
your feet.  I was able to make this happen under intense MM loads.  The
current page-at-a-time pageout code pins the address_space by taking a
lock on one of its pages.  Once that lock is released, we cannot touch
*mapping.

And lo, the pageout() code is presently buggy:

		res = mapping->a_ops->writepage(page, &wbc);
		if (res < 0)
			handle_write_error(mapping, page, res);

The ->writepage can/will unlock the page, and we're passing a hand
grenade into handle_write_error().

Any attempt to implement writearound in pageout will need to find a way
to safely pin that address_space.  One way is to take a temporary ref
on mapping->host, but IIRC that introduced nasties with inode_lock. 
Certainly it'll put more load on that worrisomely-singleton lock.


Regarding simply not doing any writeout in direct reclaim (Dave's
initial proposal): the problem is that pageout() will clean a page in
the target zone.  Normal writeout won't do that, so we could get into a
situation where vast amounts of writeout is happening, but none of it
is cleaning pages in the zone which we're trying to allocate from. 
It's quite possibly livelockable, too.

Doing writearound (if we can get it going) will solve that adequately
(assuming that the target page gets reliably written), but it won't
help the stack usage problem.


To solve the ...
From: Christoph Hellwig
Date: Sunday, April 18, 2010 - 12:05 pm

I just know that we XFS guys have been complaining about it a lot..

But that was mostly a tuning issue - before writeout mostly happened
from pdflush.  If we got into kswapd or direct reclaim we already

As Chris mentioned currently btrfs and ext4 do not actually do delalloc
conversions from this path, so for typical workloads the amount of
writeout that can happen from this path is extremly limited.  And unless
we get things fixed we will have to do the same for XFS.  I'd be much
more happy if we could just sort it out at the VM level, because this
means we have one sane place for this kind of policy instead of three
or more hacks down inside the filesystems.  It's rather interesting
that all people on the modern fs side completely agree here what the
problem is, but it seems rather hard to convince the VM side to do


Allowing the flusher threads to do targeted writeout would be the
best from the FS POV.  We'll still have one source of the I/O, just
with another knob on how to select the exact region to write out.
We can still synchronously wait for the I/O for lumpy reclaim if really
necessary.

--

From: Sorin Faibish
Date: Sunday, April 18, 2010 - 12:11 pm

On Sun, 18 Apr 2010 15:05:26 -0400, Christoph Hellwig <hch@infradead.org>  
I know also that the ext3 and reiserfs guys complained about this issue



-- 
Best Regards
Sorin Faibish
Corporate Distinguished Engineer
Network Storage Group

        EMC²
where information lives

Phone: 508-435-1000 x 48545
Cellphone: 617-510-0422
Email : sfaibish@emc.com
--

From: Andrew Morton
Date: Sunday, April 18, 2010 - 9:31 am

Right.  It's intended that the great majority of writeout be performed
by the fs flusher threads and by the write()r in balance_dirty_pages().
Writeout off the LRU is supposed to be a rare emergency case.


Yeah, but it's all bandaids.  The first thing we should do is work out
why writeout-off-the-LRU increased so much and fix that.

Handing writeout off to separate threads might be used to solve the
stack consumption problem but we shouldn't use it to "solve" the
excess-writeout-from-page-reclaim problem.

--

From: Christoph Hellwig
Date: Sunday, April 18, 2010 - 12:35 pm

I think both of them are really serious issues.  Exposing the whole
stack and lock problems with direct reclaim is a bit of a positive
side-effect of the writeout tuning mess-up.  Without it the problems
would still be just as harmful, just happening even less often and
thus getting even less attention.

--

From: Sorin Faibish
Date: Sunday, April 18, 2010 - 12:10 pm

On Sat, 17 Apr 2010 20:32:39 -0400, Andrew Morton
I for one am looking very seriously at this problem together with Bruce.
We plan to have a discussion on this topic at the next LSF meeting



--

From: Sorin Faibish
Date: Sunday, April 18, 2010 - 4:34 pm

On Sun, 18 Apr 2010 17:30:36 -0400, James Bottomley  
Let's work together to get this done. This is a very good idea. I will try
to bring some facts about the current state by instrumenting the kernel
to sample the dirty page dynamics at a finer time granularity. This will
allow us to better expose the problem, or the lack of one. :)




--

From: tytso
Date: Sunday, April 18, 2010 - 8:08 pm

I'd personally hope that this is solved long before the LSF/VM
workshops.... but if not, yes, we should definitely tackle it then.

      	     	       	  	 	    	   - Ted
--

From: Dave Chinner
Date: Sunday, April 18, 2010 - 5:35 pm

I think that part of the problem is that at roughly the same time
writeback started on a long down hill slide as well, and we've
really only fixed that in the last couple of kernel releases. Also,
it tends to take more than just writing a few large files to invoke
the LRU-based writeback code, so it is generally not invoked in
filesystem "performance" testing. Hence my bet is on the fact that
the effects of LRU-based writeback are rarely noticed in common
testing.

IOWs, low memory testing is not something a lot of people do. Add to
that the fact that most fs people, including me, have been treating
the VM as a black box that a bunch of other people have been taking
care of and hence really just been hoping it does the right thing,
and we've got a recipe for an unnoticed descent into a Bad Place.



That's true, but seeing as we can't safely do writeback from
reclaim, we need some method of telling the background threads to
write a certain region of an inode. Perhaps some extension of a


Which, if we have to set it as low as 1.5k of stack used, may as

I'm fundamentally opposed to pushing IO to another place in the VM
when it could be just as easily handed to the flusher threads.
Also, consider that there's only one kswapd thread in a given
context (e.g. per CPU), but we can scale the number of flusher
threads as need be....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Arjan van de Ven
Date: Sunday, April 18, 2010 - 5:49 pm

On Mon, 19 Apr 2010 10:35:56 +1000


Would this also be the time where we started real dirty accounting, and
started playing with the dirty page thresholds?

Background writeback is that interesting tradeoff between writing out
to make the VM easier (and the data safe) and the chance of someone
either rewriting the same data (as benchmarks do regularly... not sure
about real workloads) or deleting the temporary file.


Maybe we need to do the background dirty writes a bit more aggressively...
or play with heuristics where we get an adaptive timeout (say, if the
file got closed by the last opener, then do a shorter timeout)
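
One way to picture the adaptive-timeout heuristic is the sketch below. It is purely illustrative: the struct, the function, and the shortened value are hypothetical, and only the 30 second figure is meant to match the usual default of the `dirty_expire_centisecs` tunable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEFAULT_EXPIRE_CS 3000U	/* 30 s, usual dirty_expire_centisecs default */
#define SHORT_EXPIRE_CS    500U	/* hypothetical shorter timeout */

/* Hypothetical per-file state the heuristic would look at. */
struct file_state_sketch {
	unsigned int open_count;	/* openers still holding the file */
	bool dirty;
};

/* Pick how long dirty data may sit before background writeback wants it. */
static unsigned int dirty_expire_cs(const struct file_state_sketch *f)
{
	assert(f != NULL);
	if (f->dirty && f->open_count == 0)
		return SHORT_EXPIRE_CS;	/* last opener closed: flush sooner */
	return DEFAULT_EXPIRE_CS;
}
```

The tradeoff named above still applies: a shorter timeout helps the VM, but wastes IO if the file is about to be rewritten or deleted.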


-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Dave Chinner
Date: Sunday, April 18, 2010 - 6:08 pm

Yes, I think that was introduced in 2.6.16/17, so it's definitely in

Realistically, I'm concerned about preventing the worst case
behaviour from occurring - making the background writes more
aggressive without preventing writeback in LRU order simply means it
will be harder to test the VM corner case that triggers these
writeout patterns...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Arjan van de Ven
Date: Sunday, April 18, 2010 - 9:32 pm

On Mon, 19 Apr 2010 11:08:05 +1000


while I appreciate that the worst case should not be uber horrific...
I care a LOT about getting the normal case right... and am willing to
sacrifice the worst case for that.. (obviously not to infinity, it
needs to be bounded)

-- 
Arjan van de Ven 	Intel Open Source Technology Centre
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
--

From: Mel Gorman
Date: Monday, April 19, 2010 - 8:20 am

One machine has completed the test and the results are as expected. When
allocating huge pages under stress, your patch drops the success rates
significantly. On X86-64, it showed

STRESS-HIGHALLOC
              stress-highalloc   stress-highalloc
            enable-directreclaim disable-directreclaim
Under Load 1    89.00 ( 0.00)    73.00 (-16.00)
Under Load 2    90.00 ( 0.00)    85.00 (-5.00)
At Rest         90.00 ( 0.00)    90.00 ( 0.00)

So with direct reclaim, it gets 89% of memory as huge pages at the first
attempt but 73% with your patch applied. The "Under Load 2" test happens
immediately after. With the start kernel, the first and second attempts
are usually the same or very close together. With your patch applied,
there are big differences as it was no longer trying to clean pages.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Dave Chinner
Date: Thursday, April 22, 2010 - 6:06 pm

What was the machine config you were testing on (RAM, CPUs, etc)?
And what are these loads? Do you have a script that generates
them? If so, can you share them, please?

OOC, what was the effect on the background load - did it go faster
or slower when writeback was disabled? i.e. did we trade off more
large pages for better overall throughput?

Also, I'm curious as to the repeatability of the tests you are
doing. I found that from run to run I could see a *massive*
variance in the results. e.g. one run might only get ~80 huge
pages at the first attempt, the test run from the same initial
conditions next might get 440 huge pages at the first attempt. I saw
the same variance with or without writeback from direct reclaim
enabled. Hence only after averaging over tens of runs could I see
any sort of trend emerge, and it makes me wonder if your testing is
also seeing this sort of variance....

FWIW, if we look at the results of the test I did, it showed a 20%
improvement in large page allocation with a 15% increase in load
throughput, while you're showing a 16% degradation in large page
allocation.  Effectively we've got two workloads that show results
at either end of the spectrum (perhaps they are best case vs worst
case) but there's no real in-between. What other tests can we run to
get a better picture of the effect?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Mel Gorman
Date: Friday, April 23, 2010 - 3:50 am

Compile-based loads that fill up memory and put it under heavy memory
pressure that also dirties memory. While they are running, a kernel module
is loaded that starts allocating huge pages one at a time so that accurate
timing and the state of the system can be gathered at allocation time. The
number of allocation attempts is 90% of the number of huge pages that exist

Yes, but unfortunately they are not in a publishable state. Parts of

Unfortunately, I don't know what the effect on the underlying load is
as it takes longer than the huge page allocation attempts do. The test's
objective is to check how well lumpy reclaim works under memory pressure.

However, the time it takes to allocate a huge page increases with direct
reclaim disabled (i.e. your patch) early in the test up until about 40%
of memory was allocated as huge pages. After that, the latencies with
disable-directreclaim are lower until it gives up while the latencies with
enable-directreclaim increase.

In other words, with direct reclaim writing back pages, lumpy reclaim is a
lot more determined to get the pages cleaned and wait on them if necessary. A
compromise patch might be to have a wait_on_page_dirty to be cleared instead
of queueing the IO and wait_on_page_writeback? How long it stalled would

You are using the nr_hugepages interface and writing a large number to it
so you are also triggering the hugetlbfs retry-logic and have little control
over how many times the allocator gets called on each attempt. How many huge
pages it allocates depends on how much progress it is able to make during
lumpy reclaim.

It's why the tests I run allocate huge pages one at a time and measure
the latencies as it goes. The results tend to be quite reproducible.
Success figures would be the same between runs and the rate of
allocation success would generally be comparable as well.

Your test could do something similar by only ever requesting one additional
page. It will be good enough to measure allocation latency.  The ...
From: Andi Kleen
Date: Thursday, April 15, 2010 - 7:57 am

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Johannes Weiner
Date: Wednesday, April 14, 2010 - 7:37 pm

I already have some patches to remove trivial parts of struct scan_control,
namely may_unmap, may_swap, all_unreclaimable and isolate_pages.  The rest
needs a deeper look.

A rather big offender in there is the combination of shrink_active_list (360
bytes here) and shrink_page_list (200 bytes).  I am currently looking at
breaking out all the accounting stuff from shrink_active_list into a separate
leaf function so that the stack footprint does not add up.

Your idea of per-cpu allocated scan controls reminds me of an idea I have
had for some time now: moving reclaim into its own threads (per cpu?).

Not only would it separate the allocator's stack from the writeback stack,
we could also get rid of that too_many_isolated() workaround and coordinate
reclaim work better to prevent overreclaim.

But that is not a quick fix either...
--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 7:43 pm

Seems interesting, but a scan_control diet is not so effective. How much


I hadn't thought of it this way. It probably seems good, but I'd like to do
a simple diet first.



--

From: Johannes Weiner
Date: Friday, April 16, 2010 - 4:56 pm

Not much, it cuts 16 bytes on x86 32 bit.  The bigger gain is the code
clarification it comes with.  There is too much state to keep track of
in reclaim.
--

From: KOSAKI Motohiro
Date: Tuesday, April 13, 2010 - 11:52 pm

Yeah, of course, much. I would propose reverting 70674f95c0.
But I doubt GFP_NOFS solves our issue.



--

From: Andi Kleen
Date: Wednesday, April 14, 2010 - 3:06 am

There are lots of other call chains which use multiple KB by themselves,
so why not give select() that measly 832 bytes?

You think only file systems are allowed to use stack? :)

Basically if you cannot tolerate 1K (or more likely more) of stack
used before your fs is called you're toast in lots of other situations

It does this for large inputs, but the whole point of the stack fast
path is to avoid it for common cases when only a small number of fds is
needed.

It's significantly slower to go to any external allocator.

-Andi
-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Chris Mason
Date: Wednesday, April 14, 2010 - 4:20 am

Well, on a 4K stack kernel, 832 bytes is a very large percentage for
just one function.

Direct reclaim is a problem because it splices parts of the kernel that
normally aren't connected together.  The people that code in select see
832 bytes and say that's teeny, I should have taken 3832 bytes.

But they don't realize their function can dive down into ecryptfs then
the filesystem then maybe loop and then perhaps raid6 on top of a

Yeah, but since the call chain does eventually go into the allocator,
this function needs to be more stack friendly.

I do agree that we can't really solve this with noinline_for_stack pixie
dust, the long call chains are going to be a problem no matter what.

Reading through all the comments so far, I think the short summary is:

Cleaning pages in direct reclaim helps the VM because it is able to make
sure that lumpy reclaim finds adjacent pages.  This isn't a fast
operation, it has to wait for IO (infinitely slow compared to the CPU).

Will it be good enough for the VM if we add a hint to the bdi writeback
threads to work on a general area of the file?  The filesystem will get
writepages(), the VM will get the IO it needs started.
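
A hint of the kind proposed here might carry no more than an inode and a page window. The sketch below is hypothetical throughout: no such interface exists in the thread, and the choice of an aligned window around the page reclaim wants is an assumption made for illustration.

```c
#include <assert.h>

/* Hypothetical request from reclaim to a writeback thread. */
struct wb_hint_sketch {
	unsigned long ino;	/* inode whose pages reclaim is interested in */
	unsigned long start;	/* first page index of the window */
	unsigned long nr_pages;	/* window length in pages */
};

/* Build a window-aligned hint that covers the given page index. */
static struct wb_hint_sketch make_wb_hint(unsigned long ino,
					  unsigned long index,
					  unsigned long window)
{
	struct wb_hint_sketch h;

	assert(window > 0);
	h.ino = ino;
	h.start = index - (index % window);
	h.nr_pages = window;
	return h;
}
```

The filesystem would then see one contiguous range to hand to writepages(), rather than a single page picked in LRU order.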

I know Mel mentioned before he wasn't interested in waiting for helper
threads, but I don't see how we can work without it.

-chris
--

From: Alan Cox
Date: Wednesday, April 14, 2010 - 5:32 am

The reality is that if you are blowing a 4K process stack you are
probably playing russian roulette on the current 8K x86-32 stack as well
because of the non IRQ split. So it needs fixing either way
--

From: Andi Kleen
Date: Wednesday, April 14, 2010 - 5:34 am

Yes I think the 8K stack on 32bit should be combined with an interrupt 
stack too. There's no reason not to have an interrupt stack ever. 

Again the problem with fixing it is that you won't have any safety net
for a slightly different stacking etc. path that you didn't cover.

That said extreme examples (like some of those Chris listed) definitely
need fixing by moving them to different threads. But even after that
you still want a safety net. 4K is just too near the edge.

Maybe it would work if we never used any indirect calls, but that's
clearly not the case.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.
--

From: Mel Gorman
Date: Wednesday, April 14, 2010 - 6:23 am

Even without direct reclaim, I doubt stack usage is often at the top of
people's minds except for truly criminally large usages of it. Direct
reclaim splicing is somewhat of a problem but it's separate to stack

Bear in mind that in the context of lumpy reclaim the VM doesn't care
about where the data is on the file or filesystem. It's only concerned
about where the data is located in memory. There *may* be a correlation
between location-of-data-in-file and location-of-data-in-memory but only
if readahead was a factor and readahead happened to hit at a time the page

I'm not against the idea as such. It would have advantages in that the
thread could reorder the IO for better seeks for example and lumpy
reclaim is already potentially waiting a long time so another delay
won't hurt. I would worry that it's just hiding the stack usage by
moving it to another thread and that there would be communication cost
between a direct reclaimer and this writeback thread. The main gain
would be in hiding the "splicing" effect between subsystems that direct
reclaim can have.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Chris Mason
Date: Wednesday, April 14, 2010 - 7:07 am

The big gain from the helper threads is that storage operates at a
roughly fixed iop rate.  This is true for ssd as well, it's just a much
higher rate.  So the threads can send down 4K ios and recover clean pages at
exactly the same rate as they would sending down 64KB ios.
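
To put rough numbers on the fixed-iop argument, here is a minimal sketch. It assumes the device completes the same number of IOs per second whatever their size, and it assumes 4K pages. The helper pages_cleaned_per_sec() and the constant PAGE_BYTES are made up for illustration and are not kernel interfaces.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_BYTES 4096u /* assumed page size, not taken from the thread */

/*
 * Pages cleaned per second under the simplifying assumption that the
 * device completes a fixed number of IOs per second regardless of size.
 * Hypothetical helper for illustration only.
 */
static uint64_t pages_cleaned_per_sec(uint64_t iops, uint64_t io_bytes)
{
    /* Only whole-page IO sizes make sense in this model. */
    assert(io_bytes >= PAGE_BYTES && io_bytes % PAGE_BYTES == 0);
    return iops * (io_bytes / PAGE_BYTES);
}
```

With an invented 200 iops device, pages_cleaned_per_sec(200, 4096) gives 200 pages and pages_cleaned_per_sec(200, 65536) gives 3200 pages. The IO count is identical in both cases, so each 64KB IO cleans the page that was wanted plus 15 neighbours at no extra iop cost.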

I know that for lumpy purposes it might not be the best 64KB, but the
other side of it is that we have to write those pages eventually anyway.
We might as well write them when it is more or less free.

The per-bdi writeback threads are a pretty good base for changing the
ordering for writeback, it seems like a good place to integrate requests
from the VM about which files (and which offsets in those files) to
write back first.
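
One way such requests could be batched, as a userspace sketch only: the VM queues (file, offset, length) hints, and the writeback side sorts them and merges ranges that touch, so that nearby pages go out in one larger IO. The struct wb_hint and both functions are hypothetical names; nothing like this exists in the per-bdi writeback code discussed here.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical hint from the VM: "please write this range of this file". */
struct wb_hint {
    uint64_t ino;    /* inode number of the file */
    uint64_t offset; /* byte offset where the range starts */
    uint64_t len;    /* length of the range in bytes */
};

/* Order hints by inode first, then by offset within the file. */
static int wb_hint_cmp(const void *a, const void *b)
{
    const struct wb_hint *x = a, *y = b;

    if (x->ino != y->ino)
        return x->ino < y->ino ? -1 : 1;
    if (x->offset != y->offset)
        return x->offset < y->offset ? -1 : 1;
    return 0;
}

/*
 * Sort the hints and merge ranges of the same file that touch or
 * overlap. Works in place and returns the number of hints remaining.
 */
static size_t wb_hints_coalesce(struct wb_hint *h, size_t n)
{
    size_t out = 0;

    if (n == 0)
        return 0;
    assert(h != NULL);
    qsort(h, n, sizeof(*h), wb_hint_cmp);

    for (size_t i = 1; i < n; i++) {
        uint64_t end = h[out].offset + h[out].len;

        if (h[i].ino == h[out].ino && h[i].offset <= end) {
            /* Same file and adjacent or overlapping: extend the range. */
            uint64_t new_end = h[i].offset + h[i].len;

            if (new_end > end)
                h[out].len = new_end - h[out].offset;
        } else {
            h[++out] = h[i];
        }
    }
    return out + 1;
}
```

Sorting by file offset is a design choice that favours the IO pattern, not the LRU order the VM cares about; a real implementation would have to weigh the two.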

-chris

--

From: Minchan Kim
Date: Tuesday, April 13, 2010 - 5:24 pm

Hi, Dave.


I think your solution is a rather aggressive change, as Mel and Kosaki
already pointed out.
Is the flusher thread aware of the LRU order of dirty pages, i.e. of
system-level recency rather than dirty-page recency?
Of course the flusher thread can clean dirty pages faster than a direct
reclaimer. But if it isn't aware of LRU order, hot page thrashing can
happen in corner cases.
It could also lose write merging.

And on non-rotational storage the seek cost might not be large.
I think we have to consider that case if we decide to change direct reclaim I/O.

How about separating the problems?

1. stack hogging problem.
2. direct reclaim random write.

And then try to solve them one by one instead of all at once.

-- 
Kind regards,
Minchan Kim
--

From: Dave Chinner
Date: Tuesday, April 13, 2010 - 9:44 pm

It may be aggressive, but writeback from direct reclaim is, IMO, one
of the worst aspects of the current VM design because of its
adverse effect on the IO subsystem.

I'd prefer to remove it completely than continue to try and patch
around it, especially given that everyone seems to agree that it

It writes back in the order inodes were dirtied. i.e. the LRU is a
coarser measure, but it is still definitely there. It also takes
into account fairness of IO between dirty inodes, so no one dirty
inode prevents IO being issued on other dirty inodes on the

Non-rotational storage still goes faster when it is fed large, well

AFAICT, the only way to _reliably_ avoid the stack usage problem is
to avoid writeback in direct reclaim. That has the side effect of
fixing #2 as well, so do they really need separating?
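
A userspace model of the idea, not the actual patch: the reclaim context decides once whether ->writepage may be called, and dirty pages met by a context that may not write are left for the flusher threads. Only the field name may_writepage comes from the patch description; the enums, the struct name and both functions are hypothetical, and letting the kswapd context write is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the single scan_control field discussed in this thread. */
struct scan_control_model {
    bool may_writepage;
};

/* Hypothetical labels for who is doing the reclaim. */
enum reclaim_ctx { RECLAIM_KSWAPD, RECLAIM_DIRECT };

/* Hypothetical outcomes for one page looked at by reclaim. */
enum page_action { PAGE_FREE, PAGE_WRITE_NOW, PAGE_LEAVE_FOR_FLUSHER };

/* Assumption: only the background reclaimer is allowed to issue writeback. */
static void init_scan_control(struct scan_control_model *sc,
                              enum reclaim_ctx ctx)
{
    assert(sc != NULL);
    sc->may_writepage = (ctx == RECLAIM_KSWAPD);
}

/* Clean pages can be freed; dirty pages are written only if permitted. */
static enum page_action reclaim_one_page(const struct scan_control_model *sc,
                                         bool dirty)
{
    if (!dirty)
        return PAGE_FREE;
    return sc->may_writepage ? PAGE_WRITE_NOW : PAGE_LEAVE_FOR_FLUSHER;
}
```

In this model a direct reclaimer never enters the filesystem, so its stack depth at that point no longer matters. The cost is that it can only wait for someone else to clean dirty pages.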

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Minchan Kim
Date: Wednesday, April 14, 2010 - 12:54 am

Tend to agree. But do we still need it as a last resort if the flusher
thread can't keep up with the write stream?

Of course, if everybody agrees, we can do it.
For that, we need many benchmark results, which is very hard.

Thanks.
It seems recency would be lost.

Agreed, I missed that. NAND devices are better than HDDs at random reads.

If we can do it, that's good.
But problem 2 is not easy to fix, I think.
Compared to 2, 1 is rather easy.
So I thought we could solve 1 first and then focus on 2.
If your suggestion is right, then we can apply your idea.
Then we don't need to revert the patch for 1, since small stack usage is
always good.



-- 
Kind regards,
Minchan Kim
--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 15, 2010 - 6:13 pm

On Tue, 13 Apr 2010 10:17:58 +1000

Hmm. Then, if a memory cgroup is filled with dirty pages, it can't kick writeback
and has to wait for someone else's writeback?

How long will this take?
# mount -t cgroup none /cgroup -o memory
# mkdir /cgroup/A
# echo 20M > /cgroup/A/memory.limit_in_bytes
# echo $$ > /cgroup/A/tasks
# dd if=/dev/zero of=./tmpfile bs=4096 count=1000000

Can memcg ask the writeback thread to "Wake Up Now! and Write this out!" effectively?

Thanks,

--

From: KAMEZAWA Hiroyuki
Date: Thursday, April 15, 2010 - 9:18 pm

On Fri, 16 Apr 2010 10:13:39 +0900

Hmm.. I saw an oom-kill while testing several cases, but performance itself
does not seem to differ much with or without the patch.
But I'm unhappy with the oom-kill, so some tweak for memcg will be necessary
if we go with this.

Thanks,
-Kame

--
