This is a merging of two series - the first of which reduces stack usage in page reclaim and the second which writes contiguous pages during reclaim and avoids writeback in direct reclaimers. Changelog since V1 o Merge with series that reduces stack usage in page reclaim in general o Allow memcg to writeback pages as they are not expected to overflow stack o Drop the contiguous-write patch for the moment There is a problem in the stack depth usage of page reclaim. Particularly during direct reclaim, it is possible to overflow the stack if it calls into the filesystems writepage function. This patch series aims to trace writebacks so it can be evaulated how many dirty pages are being written, reduce stack usage of page reclaim in general and avoid direct reclaim writing back pages and overflowing the stack. The first 4 patches are a forward-port of trace points that are partly based on trace points defined by Larry Woodman but never merged. They trace parts of kswapd, direct reclaim, LRU page isolation and page writeback. The tracepoints can be used to evaluate what is happening within reclaim and whether things are getting better or worse. They do not have to be part of the final series but might be useful during discussion and for later regression testing - particularly around the percentage of time spent in reclaim. The 6 patches after that reduce the stack footprint of page reclaim by moving large allocations out of the main call path. Functionally they should be similar although there is a timing change on when pages get freed exactly. This is aimed at giving filesystems as much stack as possible if kswapd is to writeback pages directly. Patch 11 puts dirty pages as it finds them onto a temporary list and then writes them all out with a helper function. This simplifies patch 12 and also increases the chances that IO requests can be optimally merged. Patch 12 prevents direct reclaim writing out pages at all and instead dirty pages are put back on the LRU. For ...
This patch adds a trace event for when page reclaim queues a page for IO and records whether it is synchronous or asynchronous. Excessive synchronous IO for a process can result in noticeable stalls during direct reclaim. Excessive IO from page reclaim may indicate that the system is seriously under provisioned for the amount of dirty pages that exist. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/trace/events/vmscan.h | 23 +++++++++++++++++++++++ mm/vmscan.c | 2 ++ 2 files changed, 25 insertions(+), 0 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index a331454..b26daa9 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -154,6 +154,29 @@ TRACE_EVENT(mm_vmscan_lru_isolate, __entry->nr_lumpy_dirty, __entry->nr_lumpy_failed) ); + +TRACE_EVENT(mm_vmscan_writepage, + + TP_PROTO(struct page *page, + int sync_io), + + TP_ARGS(page, sync_io), + + TP_STRUCT__entry( + __field(struct page *, page) + __field(int, sync_io) + ), + + TP_fast_assign( + __entry->page = page; + __entry->sync_io = sync_io; + ), + + TP_printk("page=%p pfn=%lu sync_io=%d", + __entry->page, + page_to_pfn(__entry->page), + __entry->sync_io) +); #endif /* _TRACE_VMSCAN_H */ diff --git a/mm/vmscan.c b/mm/vmscan.c index 25bf05a..58527c4 100644 --- a/mm/vmscan.c +++ b/mm/vmscan.c @@ -399,6 +399,8 @@ static pageout_t pageout(struct page *page, struct address_space *mapping, /* synchronous write or broken a_ops? */ ClearPageReclaim(page); } + trace_mm_vmscan_writepage(page, + sync_writeback == PAGEOUT_IO_SYNC); inc_zone_page_state(page, NR_VMSCAN_WRITE); return PAGE_SUCCESS; } -- 1.7.1 --
Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
Acked-by: Larry Woodman <lwoodman@redhat.com> --
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Now, max_scan of shrink_inactive_list() is always passed less than
SWAP_CLUSTER_MAX. then, we can remove scanning pages loop in it.
This patch also help stack diet.
detail
- remove "while (nr_scanned < max_scan)" loop
- remove nr_freed (now, we use nr_reclaimed directly)
- remove nr_scan (now, we use nr_scanned directly)
- rename max_scan to nr_to_scan
- pass nr_to_scan into isolate_pages() directly instead
using SWAP_CLUSTER_MAX
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 211 ++++++++++++++++++++++++++++-------------------------------
1 files changed, 100 insertions(+), 111 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 29e1ecd..6a42ebc 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1132,15 +1132,21 @@ static int too_many_isolated(struct zone *zone, int file,
* shrink_inactive_list() is a helper for shrink_zone(). It returns the number
* of reclaimed pages
*/
-static unsigned long shrink_inactive_list(unsigned long max_scan,
+static unsigned long shrink_inactive_list(unsigned long nr_to_scan,
struct zone *zone, struct scan_control *sc,
int priority, int file)
{
LIST_HEAD(page_list);
struct pagevec pvec;
- unsigned long nr_scanned = 0;
+ unsigned long nr_scanned;
unsigned long nr_reclaimed = 0;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+ struct page *page;
+ unsigned long nr_taken;
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ unsigned long nr_anon;
+ unsigned long nr_file;
while (unlikely(too_many_isolated(zone, file, sc))) {
congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1155,129 +1161,112 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- do {
- struct page *page;
- unsigned long nr_taken;
- unsigned long nr_scan;
- unsigned ...Other than the weird whitespace below, -- All rights reversed --
Not sure where this came out of. It's not in my local patch file nor in -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Remove temporary variable that is only used once and does not help
clarify code.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 8 +++-----
1 files changed, 3 insertions(+), 5 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 6a42ebc..246e084 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1721,13 +1721,12 @@ static void shrink_zone(int priority, struct zone *zone,
static bool shrink_zones(int priority, struct zonelist *zonelist,
struct scan_control *sc)
{
- enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
struct zoneref *z;
struct zone *zone;
bool all_unreclaimable = true;
- for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx,
- sc->nodemask) {
+ for_each_zone_zonelist_nodemask(zone, z, zonelist,
+ gfp_zone(sc->gfp_mask), sc->nodemask) {
if (!populated_zone(zone))
continue;
/*
@@ -1773,7 +1772,6 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
unsigned long lru_pages = 0;
struct zoneref *z;
struct zone *zone;
- enum zone_type high_zoneidx = gfp_zone(sc->gfp_mask);
unsigned long writeback_threshold;
get_mems_allowed();
@@ -1785,7 +1783,7 @@ static unsigned long do_try_to_free_pages(struct zonelist *zonelist,
* mem_cgroup will not do shrink_slab.
*/
if (scanning_global_lru(sc)) {
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx) {
+ for_each_zone_zonelist(zone, z, zonelist, gfp_zone(sc->gfp_mask)) {
if (!cpuset_zone_allowed_hardwall(zone, GFP_KERNEL))
continue;
--
1.7.1
--
Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
This patch adds an event for when pages are isolated en-masse from the LRU lists. This event augments the information available on LRU traffic and can be used to evaluate lumpy reclaim. Signed-off-by: Mel Gorman <mel@csn.ul.ie> --- include/trace/events/vmscan.h | 46 +++++++++++++++++++++++++++++++++++++++++ mm/vmscan.c | 14 ++++++++++++ 2 files changed, 60 insertions(+), 0 deletions(-) diff --git a/include/trace/events/vmscan.h b/include/trace/events/vmscan.h index f76521f..a331454 100644 --- a/include/trace/events/vmscan.h +++ b/include/trace/events/vmscan.h @@ -109,6 +109,52 @@ TRACE_EVENT(mm_vmscan_direct_reclaim_end, TP_printk("nr_reclaimed=%lu", __entry->nr_reclaimed) ); +TRACE_EVENT(mm_vmscan_lru_isolate, + + TP_PROTO(int order, + unsigned long nr_requested, + unsigned long nr_scanned, + unsigned long nr_taken, + unsigned long nr_lumpy_taken, + unsigned long nr_lumpy_dirty, + unsigned long nr_lumpy_failed, + int isolate_mode), + + TP_ARGS(order, nr_requested, nr_scanned, nr_taken, nr_lumpy_taken, nr_lumpy_dirty, nr_lumpy_failed, isolate_mode), + + TP_STRUCT__entry( + __field(int, order) + __field(unsigned long, nr_requested) + __field(unsigned long, nr_scanned) + __field(unsigned long, nr_taken) + __field(unsigned long, nr_lumpy_taken) + __field(unsigned long, nr_lumpy_dirty) + __field(unsigned long, nr_lumpy_failed) + __field(int, isolate_mode) + ), + + TP_fast_assign( + __entry->order = order; + __entry->nr_requested = nr_requested; + __entry->nr_scanned = nr_scanned; + __entry->nr_taken = nr_taken; + __entry->nr_lumpy_taken = nr_lumpy_taken; + __entry->nr_lumpy_dirty = nr_lumpy_dirty; + __entry->nr_lumpy_failed = nr_lumpy_failed; + __entry->isolate_mode = isolate_mode; + ), + + TP_printk("isolate_mode=%d order=%d nr_requested=%lu nr_scanned=%lu nr_taken=%lu contig_taken=%lu contig_dirty=%lu ...
Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
Acked-by: Larry Woodman <lwoodman@redhat.com> --
Page reclaim cleans individual pages using a_ops->writepage() because from
the VM perspective, it is known that pages in a particular zone must be freed
soon, it considers the target page to be the oldest and it does not want
to wait while background flushers cleans other pages. From a filesystem
perspective this is extremely inefficient as it generates a very seeky
IO pattern leading to the perverse situation where it can take longer to
clean all dirty pages than it would have otherwise.
This patch queues all dirty pages at once to maximise the chances that
the write requests get merged efficiently. It also makes the next patch
that avoids writeout from direct reclaim more straight-forward.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 175 ++++++++++++++++++++++++++++++++++++++++++++---------------
1 files changed, 131 insertions(+), 44 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 019f0af..4856a2a 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -323,6 +323,55 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;
+int write_reclaim_page(struct page *page, struct address_space *mapping,
+ enum pageout_io sync_writeback)
+{
+ int res;
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_NONE,
+ .nr_to_write = SWAP_CLUSTER_MAX,
+ .range_start = 0,
+ .range_end = LLONG_MAX,
+ .nonblocking = 1,
+ .for_reclaim = 1,
+ };
+
+ if (!clear_page_dirty_for_io(page))
+ return PAGE_CLEAN;
+
+ SetPageReclaim(page);
+ res = mapping->a_ops->writepage(page, &wbc);
+ /*
+ * XXX: This is the Holy Hand Grenade of PotentiallyInvalidMapping. As
+ * the page lock has been dropped by ->writepage, that mapping could
+ * be anything
+ */
+ if (res < 0)
+ handle_write_error(mapping, page, res);
+ if (res == AOP_WRITEPAGE_ACTIVATE) {
+ ClearPageReclaim(page);
+ return PAGE_ACTIVATE;
+ }
+
+ /*
+ * Wait on writeback if requested to. This happens when
+ * direct reclaiming a large contiguous area and the
+ * first attempt to free a ...Reclaiming clean pages should be fast enough that this should Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
Indeed, this was a bit weak. The original point of the patch was to write contiguous pages belonging to the same inode when they were encountered in that batch which made a bit more sense but didn't work out at first Thanks again :) -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Seeing as you have a list of pages for IO, perhaps they could be sorted before issuing ->writepage on them. That is, while this patch issues all the IO in one hit, it doesn't change the order in which the IO is issued - it is still issued in LRU order. Given that they are issued in a short period of time now, rather than across a longer scan period, it is likely that it will not be any faster as: a) IO will not be started as soon, and b) the IO scheduler still only has a small re-ordering window and will choke just as much on random IO patterns. However, there is a list_sort() function that could be used to sort the list; sorting the list of pages by mapping and page->index within the mapping would result in all the pages on each mapping being sent down in ascending offset order at once - exactly how the filesystems want IO to be sent to it. Perhaps this is a simple improvement that can be made to this code that will make a big difference to worst case performance. FWIW, I did this for delayed metadata buffer writeback in XFS recently (i.e. sort the queue of (potentially tens of thousands of) buffers in ascending block order before dispatch) and that showed a 10-15% reduction in seeks on simple kernel compile workloads. This shows that if we optimise IO patterns at higher layers where the sort window is much, much larger than in the IO scheduler, then overall system performance improves.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
On Tue, 15 Jun 2010 09:11:44 +1000 Yup. But then, this all really should be done at the block layer so other io-submitting-paths can benefit from it. IOW, maybe "the sort queue is the submission queue" wasn't a good idea. --
That was what we did in the past with really, really deep IO scheduler queues. That leads to IO latency and OOM problems because we could lock gigabytes of memory away under IO and take minutes to clean it. Besides, there really isn't the right context in the block layer to be able to queue and prioritise large amounts of IO without Perhaps, but IMO sorting should be done where the context allows it to be done most efficiently. Sorting is most effective when ever a significant queue of IO is formed, whether it be in the filesystem, the VFS, the VM or the block layer because the IO stack is very much a GIGO queue. Simply put, there's nothing the lower layers can do to optimise bad IO patterns from the higher layers because they have small sort windows which are necessary to keep IO latency in check. Hence if the higher layers feed the lower layers crap they simply don't have the context or depth to perform the same level of optimistations we can do easily higher up the stack. IOWs, IMO anywhere there is a context with significant queue of IO, that's where we should be doing a better job of sorting before that IO is dispatched to the lower layers. This is still no guarantee of better IO (e.g. if the filesystem fragments the file) but it does give the lower layers a far better chance at optimal allocation and scheduling of IO... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Can we kick flushing for the whole inode at once from
vmscan.c?
That way we should:
1) ensure that the page we want is written to disk, and
2) we flush out related pages at the same time, getting
a decent IO pattern
Chances are that if we want to evict one page from a
file, we'll also want to evict other pages from that
same file. In fact, chances are a good number of them
will live nearby on the LRU list.
Does this make sense?
Would it be hard to add a "please flush this file"
way to call the filesystem flushing threads?
--
All rights reversed
--
Passing the igrab()bed inode into the flusher threads would fix the iput_final() problems, as long as the alloc_pages() caller never blocks indefinitely waiting for the work which the flusher threads are doing. Otherwise we get (very hard-to-hit) deadlocks where the alloc_pages() caller holds VFS locks and is waiting for the flusher threads while all the flusher threads are stuck under iput_final() waiting for those VFS locks. That's fixable by not using igrab()/iput(). You can use lock_page() to pin the address_space. Pass the address of the locked page across to the flusher threads so they don't try to lock it a second time, or just use trylocking on that writeback path or whatever. --
Any thread that does not have __GFP_FS set in its gfp_mask cannot wait for the flusher to complete. This is regardless of the mechanism used to kick the flusher. Then again, those threads cannot call ->writepage today either, so we should be fine keeping that behaviour. Threads that do have __GFP_FS in their gfp_mask can wait for the flusher in various ways. Maybe the lock_page() method can be simplified by having the flusher thread unlock the page the moment it gets it, and then run the normal flusher code? The pageout code (in shrink_page_list) already unlocks the page anyway before putting it back on the relevant LRU list. It would be easy enough to skip that unlock and let the flusher thread take care of it. -- All rights reversed --
mm... kinda. A bare order-zero __GFP_WAIT allocation can still wait I'm not sure. iput_final() can take a lot of locks, both VFS and heaven knows what within the individual filesystems. Is it the case that all allocations which occur under all of those locks is always Well, _something_ has to pin the address_space. A single locked page Once that page is unlocked, we can't touch *mapping - its inode can be concurrently reclaimed. Although I guess the technique in handle_write_error() can be reused. --
__GFP_FS is set with i_mutex held in places, and there is nothing to Nasty. That guy needs to be using lock_page_nosync(). --
--
Need lock_page_nosync here because we have no reference to the mapping when
taking the page lock.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
mm/vmscan.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c
+++ linux-2.6/mm/vmscan.c
@@ -296,7 +296,7 @@ static int may_write_to_queue(struct bac
static void handle_write_error(struct address_space *mapping,
struct page *page, int error)
{
- lock_page(page);
+ lock_page_nosync(page);
if (page_mapping(page) == mapping)
mapping_set_error(mapping, error);
unlock_page(page);
--
Thanks, I've picked this up and merged it into the series and removed -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
kswapd really should be a last effort tool to clean filesystem pages. If it does enough I/O for this to matter significantly we need to fix the VM to move more work to the flusher threads instead of trying We already have that API, in Jens' latest tree that's sync_inodes_sb/writeback_inodes_sb. We could also add a non-waiting variant if required, but I think the big problem with kswapd is that we want to wait on I/O completion under circumstances. --
However, kswapd does not need to wait on I/O completion of any page in particular - it just wants to wait on I/O completion of any inactive pages in the zone (or memcg) where memory is being freed. -- All rights reversed --
None of what you said had much to do with what I said. What you've described are implementation problems in the current block layer because it conflates "sorting" with "queueing". I'm saying "fix that". And... sorting at the block layer will always be superior to sorting at the pagecache layer because the block layer sorts at the physical block level and can handle not-well-laid-out files and can sort and merge pages from different address_spaces. Still, I suspect none of it will improve anything anyway. Those pages are still dirty, possibly-locked and need to go to disk. It doesn't matter from the MM POV whether they sit in some VM list or in the request queue. Possibly there may be some benefit to not putting so many of these unreclaimable pages into the queue all at the the same time. But that's a shortcoming in the block code: we should be able to shove arbitrary numbers of dirty page (segments) into the queue and not gum the system up. Don't try to work around that in the VM. --
Yes it, can do that. And it still does that even if the higher
layers sort their I/O dispatch better,
Filesystems try very hard to allocate adjacent logical offsets in a
file in adjacent physical blocks on disk - that's the whole point of
extent-indexed filesystems. Hence with modern filesystems there is
generally a direct correlation between the page {mapping,index}
tuple and the physical location of the mapped block.
i.e. there is generally zero physical correlation between pages in
different mappings, but there is a high physical correlation
between the index of pages on the same mapping. Hence by sorting
where we have a {mapping,index} context, we push out IO that is
much more likely to be in contiguous physical chunks that the
current random page shootdown.
We optimise applications to use these sorts of correlations all the
time to improve IO patterns. Why can't we make the same sort of
I think you know perfectly well why the system gums up when we
increase block layer queue depth: it's the fact that the _VM_ relies
on block layer queue congestion to limit the amount of dirty memory
in the system.
We've got a feedback loop between the block layer and the VM that
only works if block device queues are kept shallow. Keeping the
number of dirty pages under control is a VM responsibility, but it
is putting limitations on the block layer to ensure that the VM
works correctly. If you want the block layer to have deep queues,
then someone needs to fix the VM not to require knowledge of the
internal operation of the block layer for correct operation.
Adding a few lines of code to sort a list in the VM is far, far
easier than redesigning the write throttling code....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
--
Yes you can. That's exactly what you're recommending! Only you're recommending doing it at the wrong level. The fs-writeback radix-tree walks do it at the wrong level too. Sorting should be done within, or in a layer above the block queues, not within the large number of Nope. Large-number-of-small-files is a pretty common case. If the fs doesn't handle that well (ie: by placing them nearby on disk), it's We can, but it shouldn't be specific to page reclaim. Other places The only difference is that pages which are in the queue (current mm, a little bit still, I guess. Mainly because dirty memory management isn't zone aware, so even though we limit dirty memory globally, a particular zone(set) can get excessively dirtied. Most of this problem happen on the balance_dirty_pages() path, where we It's a hack and a workaround. And I suspect it won't make any difference, especially given Mel's measurements of the number of dirty pages he's seeing coming off the LRU. Although those numbers may well be due to the new quite-low dirty memory thresholds. It would be interesting to code up a little test patch though, see if there's benefit to be had going down this path. --
Umm, I suggested sorting a queue dirty pages that was build by reclaim before dispatching them. How does that translate to If you feed a filesystem garbage IO, you'll get garbage performance and there's nothing that a block layer sort queue can do to fix the damage it does to both performance and filesystem fragmentation levels. It's not just about IO issue - delayed allocation pretty much requires writeback to be issuing well formed IOs to reap the Filesystems already handle this case just fine as we see it from writeback all the time. Untarring a kernel is a good example of this... I suggested sorting all the IO to be issued into per-mapping page groups because: a) makes IO issued from reclaim look almost exactly the same to the filesytem as if writeback is pushing out the IO. b) it looks to be a trivial addition to the new code. I doubt Mel's tests cases will show anything - they simply didn't show enough IO issued from reclaim to make any difference. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
The solution is not to sort pages on their way to be submitted either, really. What I do in fsblock is to maintain a block-nr sorted tree of dirty blocks. This works nicely because fsblock dirty state is properly synchronized with page dirty state. So writeout can just walk this in order and it provides pretty optimal submission pattern of any interleavings of data and metadata. No need for buffer boundary or hacks like that. (needs some intelligence for delalloc, though). But even with all that, it's not the complete story. It doesn't know about direct IO, sync IO, or fsyncs, and it would be very hard and ugly to try to synchronise and sort all that from the pagecache level. It also is a heuristic in terms of optimal block scheduling behaviour. With smarter devices and drivers there might be better ways to go. So what is needed is to get as much info into the block layer as possible. As Andrew says, there shouldn't be such a big difference between pages being writeback or dirty in pagecache. --
I think worrying about indirect blocks really doesn't matter much these days. For one thing extent based filesystems have a lot less of these, and second for a journaling filesystem we only need to log modification to the indirect blocks and not actually write them back in place during the sync. At least for XFS the actual writeback can happen a lot later, as part of the ordered list of delwri buffers. --
That's true, more importantly I meant any interleavings of data from more than one file too. --
How does this work with delayed allocation where there is no block number associated with the page until writeback calls the allocation routine? Cheers, Dave. -- Dave Chinner david@fromorbit.com --
It doesn't. I have been thinking about how best to make that work. The mm/writeback is not in a good position to know what to do, so the fs would have to help. So either an fs callback, or the fs would have to insert the blocks (or some marker) into the tree itself. It's relatively easy to do for a single file (just walk the radix-tree and do delalloc conversions), but between multiple files is harder (current code has the same problem though). --
That's why we still have block layer sorting. But for the problem of larger files doing the sorting above the filesystem is a lot more efficient, not just primarily due to the I/O patters but because it makes life for the filesystem writeback code and allocator a lot Which is actually more or less true - if we do larger amounts of writeback from kswapd we're toast anyway and performance and allocation patters go down the toilet. Then again throwing a list_sort in is a rather trivial addition. Note that in addition to page->index we can also sort by the inode number in the sort function. At least for XFS and the traditional ufs derived allocators that will give you additional locality for small files. --
I tested with a dirty ratio of 40 but I didn't see a major problem.
It's still a case with the tests I saw that direct reclaim of dirty pages
was a relatively rare event except when lumpy reclaim was involved. What
did change is the amount of scanning the direct reclaim and kswapd had to do
(both increased quite a bit) but the percentage of dirty pages encountered was
roughly the same (1-2% of scanned pages were dirty in the case of sysbench).
This is sysbench only rather than flooding with more data.
DIRTY RATIO == 20
FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 9843 13398 51651
Direct reclaim pages scanned 871367 1008709 3080593
Direct reclaim write async I/O 24883 30699 0
Direct reclaim write sync I/O 0 0 0
Wake kswapd requests 7070819 6961672 11268341
Kswapd wakeups 1578 1500 943
Kswapd pages scanned 22016558 21779455 17393431
Kswapd reclaim write async I/O 1161346 1101641 1717759
Kswapd reclaim write sync I/O 0 0 0
Time stalled direct reclaim (ms) 26.11 45.04 2.97
Time kswapd awake (ms) 5105.06 5135.93 6086.32
User/Sys Time Running Test (seconds) 734.52 712.39 703.9
Percentage Time Spent Direct Reclaim 0.00% 0.00% 0.00%
Total Elapsed Time (seconds) 9710.02 9589.20 9334.45
Percentage Time kswapd Awake 0.06% 0.00% 0.00%
DIRTY RATIO == 40
FTrace Reclaim Statistics
traceonly-v2r5 stackreduce-v2r5 nodirect-v2r5
Direct reclaims 29945 41887 163006
Direct reclaim pages scanned 2853804 3075288 13142072
Direct reclaim write async ...I suspect the right answer is we need to sort both at the block layer and either (a) before you pass things to the filesystem layer, or if you don't do that (b) the filesystem will be forced to do its own queuing/sorting at the very least for delayed allocation pages, so the allocator can do something sane. And given that there are multiple file systems that support delayed allocation, it would be nice if this could be recognized by the writeback code, as opposed to having btrfs, xfs, ext4, all having to implement something very similar at the fs layer. - Ted --
The patch to sort the list being cleaned by reclaim looks like this.
It's not actually tested
vmscan: Sort pages being queued for IO before submitting to the filesystem
While page reclaim submits dirty pages in batch, it doesn't change the
order in which the IO is issued - it is still issued in LRU order. Given
that they are issued in a short period of time now, rather than across a
longer scan period, it is likely that it will not be any faster as:
a) IO will not be started as soon, and
b) the IO scheduler still only has a small re-ordering
window and will choke just as much on random IO patterns.
This patch uses list_sort() function to sort
the list; sorting the list of pages by mapping and page->index
within the mapping would result in all the pages on each mapping
being sent down in ascending offset order at once - exactly how the
filesystems want IO to be sent to it.
Credit mostly goes to Dave Chinner for this idea and the changelog text.
----
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 68b3d22..02ab246 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -32,6 +32,7 @@
#include <linux/topology.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
+#include <linux/list_sort.h>
#include <linux/notifier.h>
#include <linux/rwsem.h>
#include <linux/delay.h>
@@ -651,6 +652,34 @@ static noinline_for_stack void free_page_list(struct list_head *free_pages)
__pagevec_free(&freed_pvec);
}
+/* Sort based on mapping then index */
+static int page_writeback_cmp(void *data, struct list_head *a, struct list_head *b)
+{
+ struct page *ap = list_entry(a, struct page, lru);
+ struct page *bp = list_entry(b, struct page, lru);
+ pgoff_t diff;
+
+ /*
+ * Page not locked but it's not critical, the mapping is just for sorting
+ * If the mapping is no longer valid, it's of little consequence
+ */
+ if (ap->mapping != bp->mapping) {
+ if (ap->mapping < bp->mapping)
+ return -1;
+ if (ap->mapping > bp->mapping)
+ return ...Even if has not effect on the actual I/O patters it has a massive benefit for the filesystem. When probing delalloc/unwritten space at least XFS does try to convert a larger extent forward from the index, but doesn't bother to go backwards. By providing the trivial sort here we make life a lot easier for the filesystem. In addition to that we do get better I/O patters especially with short queues as smart writepage implementatons will also submit the next few pages, which is essentially free given how the storage works. This means we already have a page cleaned before we might even submit it without sorting. --
how about:
if (PageWriteback(page) {
if (sync_writeback == PAGEOUT_IO_SYNC)
wait_on_page_writeback(page);
} else {
/* synchronous write or broken a_ops? */
ClearPageReclaim(page);
A rather pointless line of 80 chars. I see the point for long string
I think there is. There's quite a few places that do hold multiple
pages locked, but they always lock pages in increasing page->inxex order.
Given that this locks basically in random order it could cause problems
for those places.
--
With the page lock released, the mapping may be no longer valid. Nick posted a patch in relation to it that I need to look at. The comment was because Andrew highlight that this was buggy and I wanted to make sure I Hmm, ok. In that case, I'll have to release the locks on the list and reacquire them. It was something I would have preferred to avoid. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
It's just a pretty simple use-after-free. Maybe people forget it because There shouldn't be a problem _holding_ the locks, but there is a problem waiting for multiple locks out of page->index order. But there is a problem with holding the lock of a lot of pages while calling ->writepage on them. So yeah, you can't do that. Hmm, I should rediff that lockdep page_lock patch and get it merged. (although I don't know if that can catch these all these problems easily) --
When shrink_inactive_list() isolates pages, it updates a number of
counters using temporary variables to gather them. These consume stack
and it's in the main path that calls ->writepage(). This patch moves the
accounting updates outside of the main path to reduce stack usage.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 63 +++++++++++++++++++++++++++++++++++-----------------------
1 files changed, 38 insertions(+), 25 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 165c2f5..019f0af 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1073,7 +1073,8 @@ static unsigned long clear_active_flags(struct list_head *page_list,
ClearPageActive(page);
nr_active++;
}
- count[lru]++;
+ if (count)
+ count[lru]++;
}
return nr_active;
@@ -1153,12 +1154,13 @@ static int too_many_isolated(struct zone *zone, int file,
* TODO: Try merging with migrations version of putback_lru_pages
*/
static noinline_for_stack void putback_lru_pages(struct zone *zone,
- struct zone_reclaim_stat *reclaim_stat,
+ struct scan_control *sc,
unsigned long nr_anon, unsigned long nr_file,
struct list_head *page_list)
{
struct page *page;
struct pagevec pvec;
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
pagevec_init(&pvec, 1);
@@ -1197,6 +1199,37 @@ static noinline_for_stack void putback_lru_pages(struct zone *zone,
pagevec_release(&pvec);
}
+static noinline_for_stack void update_isolated_counts(struct zone *zone,
+ struct scan_control *sc,
+ unsigned long *nr_anon,
+ unsigned long *nr_file,
+ struct list_head *isolated_list)
+{
+ unsigned long nr_active;
+ unsigned int count[NR_LRU_LISTS] = { 0, };
+ struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
+
+ nr_active = clear_active_flags(isolated_list, count);
+ __count_vm_events(PGDEACTIVATE, nr_active);
+
+ __mod_zone_page_state(zone, ...Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
shrink_page_list() sets up a pagevec to release pages as according as they
are free. It uses significant amounts of stack on the pagevec. This
patch adds pages to be freed via pagevec to a linked list which is then
freed en-masse at the end. This avoids using stack in the main path that
potentially calls writepage().
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
mm/vmscan.c | 37 +++++++++++++++++++++++++++++--------
1 files changed, 29 insertions(+), 8 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 34c5c87..165c2f5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -620,6 +620,25 @@ static enum page_references page_check_references(struct page *page,
return PAGEREF_RECLAIM;
}
+static noinline_for_stack void free_page_list(struct list_head *free_pages)
+{
+ struct pagevec freed_pvec;
+ struct page *page, *tmp;
+
+ pagevec_init(&freed_pvec, 1);
+
+ list_for_each_entry_safe(page, tmp, free_pages, lru) {
+ list_del(&page->lru);
+ if (!pagevec_add(&freed_pvec, page)) {
+ __pagevec_free(&freed_pvec);
+ pagevec_reinit(&freed_pvec);
+ }
+ }
+
+ if (pagevec_count(&freed_pvec))
+ __pagevec_free(&freed_pvec);
+}
+
/*
* shrink_page_list() returns the number of reclaimed pages
*/
@@ -628,13 +647,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
enum pageout_io sync_writeback)
{
LIST_HEAD(ret_pages);
- struct pagevec freed_pvec;
+ LIST_HEAD(free_pages);
int pgactivate = 0;
unsigned long nr_reclaimed = 0;
cond_resched();
- pagevec_init(&freed_pvec, 1);
while (!list_empty(page_list)) {
enum page_references references;
struct address_space *mapping;
@@ -809,10 +827,12 @@ static unsigned long shrink_page_list(struct list_head *page_list,
__clear_page_locked(page);
free_it:
nr_reclaimed++;
- if (!pagevec_add(&freed_pvec, page)) {
- __pagevec_free(&freed_pvec);
- pagevec_reinit(&freed_pvec);
- }
+
+ /*
+ * Is there need to periodically free_page_list? It would
+ * ...Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
On Mon, 14 Jun 2010 12:17:50 +0100 hm, spose so. I cen't see any trivial way to eliminate the local That's an open-coded pagevec_free(). --
Fair point, will correct. Thanks -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This patch adds two trace events for kswapd waking up and going asleep for
the purposes of tracking kswapd activity and two trace events for direct
reclaim beginning and ending. The information can be used to work out how
much time a process or the system is spending on the reclamation of pages
and in the case of direct reclaim, how many pages were reclaimed for that
process. High frequency triggering of these events could point to memory
pressure problems.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/gfpflags.h | 37 +++++++++++++
include/trace/events/kmem.h | 38 +-------------
include/trace/events/vmscan.h | 115 +++++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 24 +++++++--
4 files changed, 173 insertions(+), 41 deletions(-)
create mode 100644 include/trace/events/gfpflags.h
create mode 100644 include/trace/events/vmscan.h
diff --git a/include/trace/events/gfpflags.h b/include/trace/events/gfpflags.h
new file mode 100644
index 0000000..e3615c0
--- /dev/null
+++ b/include/trace/events/gfpflags.h
@@ -0,0 +1,37 @@
+/*
+ * The order of these masks is important. Matching masks will be seen
+ * first and the left over flags will end up showing by themselves.
+ *
+ * For example, if we have GFP_KERNEL before GFP_USER we wil get:
+ *
+ * GFP_KERNEL|GFP_HARDWALL
+ *
+ * Thus most bits set go first.
+ */
+#define show_gfp_flags(flags) \
+ (flags) ? __print_flags(flags, "|", \
+ {(unsigned long)GFP_HIGHUSER_MOVABLE, "GFP_HIGHUSER_MOVABLE"}, \
+ {(unsigned long)GFP_HIGHUSER, "GFP_HIGHUSER"}, \
+ {(unsigned long)GFP_USER, "GFP_USER"}, \
+ {(unsigned long)GFP_TEMPORARY, "GFP_TEMPORARY"}, \
+ {(unsigned long)GFP_KERNEL, "GFP_KERNEL"}, \
+ {(unsigned long)GFP_NOFS, "GFP_NOFS"}, \
+ {(unsigned long)GFP_ATOMIC, "GFP_ATOMIC"}, \
+ {(unsigned long)GFP_NOIO, "GFP_NOIO"}, \
+ {(unsigned long)__GFP_HIGH, "GFP_HIGH"}, \
+ {(unsigned long)__GFP_WAIT, "GFP_WAIT"}, \
+ {(unsigned ...Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
Acked-by: Larry Woodman <lwoodman@redhat.com> --
shrink_inactive_list() sets up a pagevec to release unfreeable pages. It
uses significant amounts of stack doing this. This patch splits
shrink_inactive_list() to take the stack usage out of the main path so
that callers to writepage() do not contain an unused pagevec on the
stack.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
---
mm/vmscan.c | 95 +++++++++++++++++++++++++++++++++-------------------------
1 files changed, 54 insertions(+), 41 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 246e084..34c5c87 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1129,19 +1129,65 @@ static int too_many_isolated(struct zone *zone, int file,
}
/*
+ * TODO: Try merging with migrations version of putback_lru_pages
+ */
+static noinline_for_stack void putback_lru_pages(struct zone *zone,
+ struct zone_reclaim_stat *reclaim_stat,
+ unsigned long nr_anon, unsigned long nr_file,
+ struct list_head *page_list)
+{
+ struct page *page;
+ struct pagevec pvec;
+
+ pagevec_init(&pvec, 1);
+
+ /*
+ * Put back any unfreeable pages.
+ */
+ spin_lock(&zone->lru_lock);
+ while (!list_empty(page_list)) {
+ int lru;
+ page = lru_to_page(page_list);
+ VM_BUG_ON(PageLRU(page));
+ list_del(&page->lru);
+ if (unlikely(!page_evictable(page, NULL))) {
+ spin_unlock_irq(&zone->lru_lock);
+ putback_lru_page(page);
+ spin_lock_irq(&zone->lru_lock);
+ continue;
+ }
+ SetPageLRU(page);
+ lru = page_lru(page);
+ add_page_to_lru_list(zone, page, lru);
+ if (is_active_lru(lru)) {
+ int file = is_file_lru(lru);
+ reclaim_stat->recent_rotated[file]++;
+ }
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ __mod_zone_page_state(zone, NR_ISOLATED_ANON, -nr_anon);
+ __mod_zone_page_state(zone, NR_ISOLATED_FILE, -nr_file);
+
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+/*
...I hate to nitpick on this, but with noinline_for_stack a prototype really is unreadbale unless the function identifier goes to the next line. Compare the one above with: static noinline_for_stack void putback_lru_pages(struct zone *zone, struct zone_reclaim_stat *reclaim_stat, unsigned long nr_anon, unsigned long nr_file, Same here, just even worse due to the spill over ove 80 characters. --
It's a fair nitpick and so is cleaned up. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On Mon, 14 Jun 2010 12:17:49 +0100 You can get the entire pagevec off the stack - just make it a static-to-shrink_inactive_list() pagevec-per-cpu. Locking just requires pinning to a CPU. We could trivially co-opt shrink_inactive_list()'s spin_lock_irq() for that, but pagevec_release() can be relatively expensive so it'd be sad to move that inside spin_lock_irq(). It'd be better to slap a get_cpu()/put_cpu() around the whole thing. --
That idea has been floated as well. I didn't pursue it because Dave said that giving page reclaim a stack diet was never going to be the full solution so I didn't think the complexity was justified. I kept some of the stack reduction stuff because a) it was there and b) It'd be something interesting to try out when nothing else was happening but I'm not going to focus on it for the moment unless I think it will really help this stack overflow problem. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Since 2.6.28 zone->prev_priority is unused. Then it can be removed
safely. It reduce stack usage slightly.
Now I have to say that I'm sorry. 2 years ago, I thought prev_priority
can be integrate again, it's useful. but four (or more) times trying
haven't got good performance number. Thus I give up such approach.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Reviewed-by: Johannes Weiner <hannes@cmpxchg.org>
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/linux/mmzone.h | 15 ------------
mm/page_alloc.c | 2 -
mm/vmscan.c | 57 ------------------------------------------------
mm/vmstat.c | 2 -
4 files changed, 0 insertions(+), 76 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4d109e..b578eee 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -348,21 +348,6 @@ struct zone {
atomic_long_t vm_stat[NR_VM_ZONE_STAT_ITEMS];
/*
- * prev_priority holds the scanning priority for this zone. It is
- * defined as the scanning priority at which we achieved our reclaim
- * target at the previous try_to_free_pages() or balance_pgdat()
- * invocation.
- *
- * We use prev_priority as a measure of how much stress page reclaim is
- * under - it drives the swappiness decision: whether to unmap mapped
- * pages.
- *
- * Access to both this field is quite racy even on uniprocessor. But
- * it is expected to average out OK.
- */
- int prev_priority;
-
- /*
* The target ratio of ACTIVE_ANON to INACTIVE_ANON pages on
* this zone's LRU. Maintained by the pageout code.
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 431214b..0b0b629 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4081,8 +4081,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone_seqlock_init(zone);
zone->zone_pgdat = pgdat;
- zone->prev_priority = ...Reviewed-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
On Mon, 14 Jun 2010 12:17:46 +0100 This would have been badder in earlier days when we were using the scanning priority to decide when to start unmapping pte-mapped pages - page reclaim would have been recirculating large blobs of mapped pages around the LRU until the priority had built to the level where we started to unmap them. However that priority-based decision got removed and right now I don't recall what it got replaced with. Aren't we now unmapping pages way too early and suffering an increased major&minor fault rate? Worried. Things which are still broken after we broke prev_priority: - If page reclaim is having a lot of trouble, prev_priority would have permitted do_try_to_free_pages() to call disable_swap_token() earlier on. As things presently stand, we'll do a lot of thrash-detection stuff before (presumably correctly) dropping the swap token. So. What's up with that? I don't even remember _why_ we disable the swap token once the scanning priority gets severe and the code comments there are risible. And why do we wait until priority==0 rather than priority==1? - Busted prev_priority means that lumpy reclaim will act oddly. Every time someone goes into do some recalim, they'll start out not doing lumpy reclaim. Then, after a while, they'll get a clue and will start doing the lumpy thing. Then they return from reclaim and the next recalim caller will again forget that he should have done lumpy reclaim. I dunno what the effects of this are in the real world, but it seems dumb. And one has to wonder: if we're making these incorrect decisions based upon a bogus view of the current scanning difficulty, why are these various priority-based thresholding heuristics even in there? Are they doing anything useful? So.. either we have a load of useless-crap-and-cruft in there which should be lopped out, or we don't have a load of useless-crap-and-cruft The patch forgot to remove mem_cgroup_get_reclaim_priority() ...
We keep a different set of statistics to decide whether to reclaim only page cache pages, or both page cache and anonymous pages. The function get_scan_ratio parses those The reason is that we never page out the pages belonging to the process owning the swap token (with the exception of that process evicting its own pages). If that process has a really large RSS in the current zone, and we are having problems freeing pages, it may be beneficial to also evict pages from that process. Now that the LRU lists are split out into file backed and swap backed, it may be a lot easier to find pages to evict. That may mean we could notice we're getting into trouble at much higher priority levels and disable the swap token at a higher priority level. I do not believe prev_priority will be very useful here, since How common are lumpy reclaims, anyway? Isn't it more likely that in-between every two higher-order reclaims, a number of order zero reclaims will be happening? In that case, the prev_priority logic may have introduced the The prev_priority code was useful when we had filesystem and swap backed pages mixed on the same LRU list. I am not convinced it still has any use. -- All rights reversed --
On Wed, 16 Jun 2010 19:45:29 -0400 I wasn't talking about anon-vs-file. I was referring to mapped-file versus not-mapped file. If the code sees a mapped page come off the tail of the LRU it'll just unmap and reclaim the thing. This policy caused awful amounts of paging activity when someone started doing lots of read() activity, which is why the VM was changed to value mapped pagecache higher than unmapped pagecache. Did this biasing get hm, lots of "may"s there. A lot more than they should be, I suspect, given the recent trend towards asking for higher-order allocations. Kernel developers should No, stop saying swap! ;) It's all to do with mapped pagecache versus unmapped pagecache. "ytf does my browser get paged out all the time". --
It changed a little, but we still have it:
1) we do not deactivate active file pages if the active file
list is smaller than the inactive file list - this protects
the working set from streaming IO
2) we keep mapped referenced executable pages on the active file
list if they got accessed while on the active list, while
I suspect it does, but I have not actually tested that code
For one, memory sizes today are a lot larger than they were
when 2.6.0 came out.
Secondly, we now know more exactly what is on each LRU list.
That should greatly reduce unnecessary turnover of the list.
For example, if we know there is no swap space available, we
will not bother scanning the anon LRU lists.
If we know there is not enough file cache left to get us up
to the zone high water mark, we will not bother scanning the
few remaining file pages.
Because of those simple checks (in get_scan_priority), I do
not expect that we will have to scan through all of memory
as frequently as we had to do in 2.6.0.
Furthermore, we unconditionally deactivate most active pages
and have a working used-once scheme for pages on the anon
lists. This should also contribute to a reduction in the
We have other measures in place now to protect the working set
on the file LRU lists (see above). We are able to have those
measures in the kernel because we no longer have mixed LRU
lists.
--
All rights reversed
--
sorry for the long delay.
May I explain my experience? I'd like to explain why prev_priority wouldn't
works nowadays.
First of all, Yes, current vmscan still a lot of UP centric code. it
expose some weakness on some dozens CPUs machine. I think we need
more and more improvement.
The problem is, current vmscan mix up per-system-pressure, per-zone-pressure
and per-task-pressure a bit. example, prev_priority try to boost priority to
other concurrent priority. but If the another task have mempolicy restriction,
It's unnecessary, but also makes wrong big latency and exceeding reclaim.
per-task based priority + prev_priority adjustment make the emulation of
per-system pressure. but it have two issue 1) too rough and brutal emulation
2) we need per-zone pressure, not per-system.
another example, currently DEF_PRIORITY is 12. it mean the lru rotate about
2 cycle (1/4096 + 1/2048 + 1/1024 + .. + 1) before invoking OOM-Killer.
but if 10,0000 thrreads enter DEF_PRIORITY reclaim at the same time, the
system have higher memory pressure than priority==0 (1/4096*10,000 > 2).
prev_priority can't solve such multithreads workload issue.
In other word, prev_priority concept assume the sysmtem don't have lots
threads.
And, I don't think lumpy reclaim threshold is big matter, because It was
introduced to case aim7 corner case issue. I don't think such situation
will occur frequently in the real workload. thus end users can't observe
such logic.
For mapped-vs-unmapped thing, I dunnno the exactly reason. That was
introduced by Rik, unfortunatelly I had not joined its activity at
making design time. I can only say, while my testing the current code
works good.
That said, my conclusion is opposite. For long term view, we should
consider to kill reclaim priority completely. Instead, we should
Sure. thanks.
Will fix.
btw, current zone reclaim have wrong swap token usage.
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
(snip)
...My fault, I unintentionally deleted your name from the send script. The intention of the code was to note that orders < PAGE_ALLOC_COSTLY_ORDER, there was an expectatation that those pages would be free or nearly free without page reclaim taking special steps with lumpy reclaim. If this has changed, it's almost certainly because of a greater dependence on high-order pages than previously which should be resisted (it has cropped up a few times recently). I do have a script that uses ftrace to count call sites using high-order allocations and how often they occur which would be of use I'm not aware of current problems with lumpy reclaim related stalls or problems but it's not something I have specifically investigated. If there is a known example workload that is felt to trigger lumpy reclaim more than it should, someone point me in the general direction and I'll -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
This patch adds a simple post-processing script for the reclaim-related trace events. It can be used to give an indication of how much traffic there is on the LRU lists and how severe latencies due to reclaim are. Example output looks like the following Reclaim latencies expressed as order-latency_in_ms uname-3942 9-200.179000000004 9-98.7900000000373 9-99.8330000001006 kswapd0-311 0-662.097999999998 0-2.79700000002049 \ 0-149.100000000035 0-3295.73600000003 0-9806.31799999997 0-35528.833 \ 0-10043.197 0-129740.979 0-3.50500000000466 0-3.54899999999907 \ 0-9297.78999999992 0-3.48499999998603 0-3596.97999999998 0-3.92799999995623 \ 0-3.35000000009313 0-16729.017 0-3.57799999997951 0-47435.0630000001 \ 0-3.7819999998901 0-5864.06999999995 0-18635.334 0-10541.289 9-186011.565 \ 9-3680.86300000001 9-1379.06499999994 9-958571.115 9-66215.474 \ 9-6721.14699999988 9-1962.15299999993 9-1094806.125 9-2267.83199999994 \ 9-47120.9029999999 9-427653.886 9-2.6359999999404 9-632.148999999976 \ 9-476.753000000026 9-495.577000000048 9-8.45900000003166 9-6.6820000000298 \ 9-1.30500000016764 9-251.746000000043 9-383.905000000028 9-80.1419999999925 \ 9-281.160000000149 9-14.8780000000261 9-381.45299999998 9-512.07799999998 \ 9-49.5519999999087 9-167.439000000013 9-183.820999999996 9-239.527999999933 \ 9-19.9479999998584 9-148.747999999905 9-164.583000000101 9-16.9480000000913 \ 9-192.376000000164 9-64.1010000000242 9-1.40800000005402 9-3.60800000000745 \ 9-17.1359999999404 9-4.69500000006519 9-2.06400000001304 9-1582488.554 \ 9-6244.19499999983 9-348153.812 9-2.0999999998603 9-0.987999999895692 \ 0-32218.473 0-1.6140000000596 0-1.28100000019185 0-1.41300000017509 \ 0-1.32299999985844 0-602.584000000032 0-1.34400000004098 0-1.6929999999702 \ 1-22101.8190000001 9-174876.724 9-16.2420000000857 9-175.165999999736 \ 9-15.8589999997057 9-0.604999999981374 9-3061.09000000032 9-479.277000000235 \ 9-1.54499999992549 9-771.985000000335 9-4.88700000010431 ...
Acked-by: Rik van Riel <riel@redhat.com> -- All rights reversed --
Acked-by: Larry Woodman <lwoodman@redhat.com> --
This stuff looks good to me from the filesystem POV. You might want to throw in a follow on patch to remove the PF_MEMALLOC checks from the various ->writepage methods. --
Will do, thanks. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
On Mon, 14 Jun 2010 12:17:41 +0100 Execution time is reduced. Does this shows removing "I/O noise" by direct reclaim makes the system happy ? or writeback in direct reclaim give us too much costs ? It seems I'll have to consider about avoiding direct-reciam in memcg, later. BTW, I think we'll have to add wait-for-pages-to-be-cleaned trick in direct reclaim if we want to avoid too much scanning, later. Thank you for interesting test. Regards, -Kame --
It could be accounted for by the fact that the direct reclaimers are stalled less in direct reclaim. They make more forward progress needing I think it's accounted for by just making more forward progress rather than IO noise. The throughput results for sysbench are all over the place because the disk was maxed so I'm shying away from drawing any conclusions on the This happens for lumpy reclaim. I didn't think it was justified for normal reclaim based on the percentage of dirty pages encountered during scanning. If the percentage of dirty pages scanned, we'll need to first figure out why that happened and then if stalling when they are You're welcome. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
