Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
With it, our tmpfs test always oom. The test has a lot of rotated anon
pages and cause percent[0] zero. Actually the percent[0] is a very small
value, but our calculation round it to zero. The commit makes vmscan
completely skip anon pages and cause oops.
An option is if percent[x] is zero in get_scan_ratio(), forces it
to 1. See below patch.
But the offending commit still changes behavior. Without the commit, we scan
all pages if priority is zero, below patch doesn't fix this. Don't know if
It's required to fix this too.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..d5cc34e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1604,6 +1604,18 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
/* Normalize to percentages */
percent[0] = 100 * ap / (ap + fp + 1);
percent[1] = 100 - percent[0];
+ /*
+ * if percent[x] is small and rounded to 0, this case doesn't mean we
+ * should skip scan. Give it at least 1% share.
+ */
+ if (percent[0] == 0) {
+ percent[0] = 1;
+ percent[1] = 99;
+ }
+ if (percent[1] == 0) {
+ percent[0] = 99;
+ percent[1] = 1;
+ }
}
/*
--
Can you please post your /proc/meminfo and reproduce program? I'll digg it. Very unfortunately, this patch isn't acceptable. In past time, vmscan had similar logic, but 1% swap-out made lots bug reports. --
our test is quite sample. mount tmpfs with double memory size and store several copies (memory size * 2/G) of kernel in tmpfs, and then do kernel build. for example, there is 3G memory and then tmpfs size is 6G and there is 6 can you elaborate this? Completely restore previous behavior (do full scan with priority 0) is ok too.
This is a option. but we need to know the root cause anyway. if not, we might reintroduce this issue again in the future. --
I thought I mentioned the root cause in first mail. My debug shows recent_rotated[0] is big, but recent_rotated[1] is almost zero, which makes percent[0] 0. But you can double check too. --
To revert can save percent[0]==0 && priority==0 case. but it shouldn't
happen, I think. It mean to happen big latency issue.
Can you please try following patch? Also, I'll prepare reproduce environment soon.
---
mm/vmscan.c | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..abf7f79 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1571,15 +1571,19 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
*/
if (unlikely(reclaim_stat->recent_scanned[0] > anon / 4)) {
spin_lock_irq(&zone->lru_lock);
- reclaim_stat->recent_scanned[0] /= 2;
- reclaim_stat->recent_rotated[0] /= 2;
+ while (reclaim_stat->recent_scanned[0] > anon / 4) {
+ reclaim_stat->recent_scanned[0] /= 2;
+ reclaim_stat->recent_rotated[0] /= 2;
+ }
spin_unlock_irq(&zone->lru_lock);
}
if (unlikely(reclaim_stat->recent_scanned[1] > file / 4)) {
spin_lock_irq(&zone->lru_lock);
- reclaim_stat->recent_scanned[1] /= 2;
- reclaim_stat->recent_rotated[1] /= 2;
+ while (reclaim_stat->recent_scanned[1] > file / 4) {
+ reclaim_stat->recent_scanned[1] /= 2;
+ reclaim_stat->recent_rotated[1] /= 2;
+ }
spin_unlock_irq(&zone->lru_lock);
}
--
1.6.5.2
--
if 1% is still big, how about below patch?
Commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26 introduces a regression.
With it, our tmpfs test always oom. The test has a lot of rotated anon
pages and cause percent[0] zero. Actually the percent[0] is a very small
value, but our calculation round it to zero. The commit makes vmscan
completely skip anon pages and cause oops.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..80a7ed5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,27 +1519,50 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}
/*
+ * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
+ * until we collected @swap_cluster_max pages to scan.
+ */
+static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
+ unsigned long *nr_saved_scan)
+{
+ unsigned long nr;
+
+ *nr_saved_scan += nr_to_scan;
+ nr = *nr_saved_scan;
+
+ if (nr >= SWAP_CLUSTER_MAX)
+ *nr_saved_scan = 0;
+ else
+ nr = 0;
+
+ return nr;
+}
+
+/*
* Determine how aggressively the anon and file LRU lists should be
* scanned. The relative value of each set of LRU lists is determined
* by looking at the fraction of the pages scanned we did rotate back
* onto the active list instead of evict.
*
- * percent[0] specifies how much pressure to put on ram/swap backed
- * memory, while percent[1] determines pressure on the file LRUs.
+ * nr[x] specifies how many pages should be scaned
*/
-static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
- unsigned long *percent)
+static void get_scan_count(struct zone *zone, struct scan_control *sc,
+ unsigned long *nr, int priority)
{
unsigned long anon, file, free;
unsigned long anon_prio, file_prio;
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, ...This patch makes a lot of sense than previous. however I think <1% anon ratio shouldn't happen anyway because file lru doesn't have reclaimable pages. <1% seems no good reclaim rate. perhaps I'll take your patch for stable tree. but we need to attack the root --
KOSAKI-san, I tend to regard this patch as a general improvement for both .33-stable and .34. I do agree with you that it's desirable to do more test&analyze and check further for possibly hidden problems. Thanks, --
Yeah, I don't want ignore .33-stable too. if I can't find the root cause in 2-3 days, I'll revert guilty patch anyway. --
On Wed, 31 Mar 2010 15:00:52 +0900 (JST) It's a good idea to avoid fixing a bug one-way-in-stable, other-way-in-mainline. Because then we have new code in both trees which is different. And the -stable guys sensibly like to see code get a bit of a shakedown in mainline before backporting it. So it would be better to merge the "simple" patch into mainline, tagged for -stable backporting. Then we can later implement the larger fix in mainline, perhaps starting by reverting the "simple" fix. --
.....ok. I don't have to prevent your code maintainship. although I still
think we need to fix the issue completely.
===================================================================
From 52358cbccdfe94e0381974cd6e937bcc6b1c608b Mon Sep 17 00:00:00 2001
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Date: Fri, 2 Apr 2010 17:13:48 +0900
Subject: [PATCH] Revert "vmscan: get_scan_ratio() cleanup"
Shaohua Li reported his tmpfs streaming I/O test can lead to make oom.
The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs,
there are 6 copies of kernel source and the test does kbuild for each
copy. His investigation shows the test has a lot of rotated anon
pages and quite few file pages, so get_scan_ratio calculates percent[0]
(i.e. scanning percent for anon) to be zero. Actually the percent[0]
shoule be a big value, but our calculation round it to zero.
Although before commit 84b18490, we have the same sick too. but the old
logic can rescue percent[0]==0 case only when priority==0. It had hided
the real issue. I didn't think merely streaming io can makes percent[0]==0
&& priority==0 situation. but I was wrong.
So, definitely we have to fix such tmpfs streaming io issue. but anyway
I revert the regression commit at first.
This reverts commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26.
Reported-by: Shaohua Li <shaohua.li@intel.com>
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 23 +++++++++--------------
1 files changed, 9 insertions(+), 14 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..cb3947e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1535,13 +1535,6 @@ static void get_scan_ratio(struct zone *zone, struct scan_control *sc,
unsigned long ap, fp;
struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
- /* If we have no swap space, do not bother scanning anon pages. */
- if (!sc->may_swap || (nr_swap_pages <= 0)) {
- percent[0] = 0;
- percent[1] = ...should small :) Acked-by: Wu Fengguang <fengguang.wu@intel.com> Thanks, Fengguang --
Agreed on the revert. Acked-by: Rik van Riel <riel@redhat.com> --
Oops, the above mention is wrong. sorry. only 1 page is still too big. because under streaming io workload, the number of scanning anon pages should be zero. this is very strong requirement. if not, backup operation will makes a lot of swapping out. --
Sounds there is no big impact for the workload which you mentioned with the patch.
please see below descriptions.
I updated the description of the patch as fengguang suggested.
Commit 84b18490d introduces a regression. With it, our tmpfs test always oom.
The test uses a 6G tmpfs in a system with 3G memory. In the tmpfs, there are
6 copies of kernel source and the test does kbuild for each copy. My
investigation shows the test has a lot of rotated anon pages and quite few
file pages, so get_scan_ratio calculates percent[0] to be zero. Actually
the percent[0] shoule be a very small value, but our calculation round it
to zero. The commit makes vmscan completely skip anon pages and cause oops.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned. In this way, we should get several scan pages
for < 1% percent. With this fix, my test doesn't oom any more.
Note, this patch doesn't really change logics, but just increase precise. For
system with a lot of memory, this might slightly changes behavior. For example,
in a sequential file read workload, without the patch, we don't swap any anon
pages. With it, if anon memory size is bigger than 16G, we will say one anon page
swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
is assumed to be 1024 which is common in this workload. So the impact sounds not
a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..80a7ed5 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,27 +1519,50 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}
/*
+ * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
+ * until we collected @swap_cluster_max pages to scan.
+ */
+static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
+ unsigned long *nr_saved_scan)
+{
+ unsigned long nr;
+
+ *nr_saved_scan += nr_to_scan;
+ nr = *nr_saved_scan;
+
+ if (nr ...Umm.. sorry, no. "one fix but introduce another one bug" is not good deal. instead, I'll revert the guilty commit at first as akpm mentioned. thanks. --
Even we revert the commit, the patch still has its benefit, as it increases calculation precision, right? Thanks, Shaohua --
no, you shouldn't ignore the regression case. If we can remove the streaming io corner case by another patch, this patch can be considered to merge. thanks. --
I don't think this is serious. In my calculation, there is only 1 page swapped out for 6G anonmous memory. 1 page should haven't any performance impact. Thanks, Shaohua --
there is. I had received exactly opposite claim. because shrink_zone() is not called only once. it is called very much time. --
1 anon page scanned for every N file pages scanned? Is N a _huge_ enough ratio so that the anon list will be very light scanned? Rik: here is a little background. Under streaming IO, the current get_scan_ratio() will get a percent[0] that is (much) less than 1, so underflows to 0. It has the bad effect of completely disabling the scan of anon list, which leads to OOM in Shaohua's test case. OTOH, it also has the good side effect of keeping anon pages in memory and totally prevent swap IO. Shaohua's patch improves the computation precision by computing nr[] directly in get_scan_ratio(). This is good in general, however will enable light scan of the anon list on streaming IO. Thanks, Fengguang --
The problem is, the VM are couteniously discarding no longer used file cache. if we are scan extra anon 1 page, we will observe tons swap usage after few days. In such case, percent[0] should be big. I think underflowing is not point. His test case is merely streaming io copy, why can't we drop tmpfs cached page? his /proc/meminfo describe his machine didn't have droppable file cache. so, big percent[1] value seems makes no sense. no? I'm not sure we need either below detection. I need more investigate. 1) detect no discardable file cache 2) detect streaming io on tmpfs (as regular file) --
OK the days-of-streaming-io typically happen in file servers. Suppose a file server with 16GB memory, 1GB of which is consumed by anonymous pages, others are for page cache. Assume that the exact file:anon ratio computed by the get_scan_ratio() algorithm is 1000:1. In that case percent[0]=0.1 and is rounded down to 0, which keeps the anon pages in memory for the few days. Now with Shaohua's patch, nr[0] = (262144/4096)/1000 = 0.06 will also be rounded down to 0. It only becomes >=1 when - reclaim runs into trouble and priority goes low - anon list goes huge So I guess Shaohua's patch still has reasonable "underflow" threshold :) Thanks, --
Again, I didn't said his patch is no worth. I only said we don't have to ignore the downside. --
Right, we should document both the upside and downside. The main difference happens when file:anon scan ratio > 100:1. For the current percent[] based computing, percent[0]=0 hence nr[0]=0 which disables anon list scan unconditionally, for good or for bad. For the direct nr[] computing, - nr[0] will be 0 for typical file servers, because with priority=12 and anon lru size < 1.6GB, nr[0] = (anon_size/4096)/100 < 0 - nr[0] will be non-zero when priority=1 and anon_size > 100 pages, this stops OOM for Shaohua's test case, however may not be enough to guarantee safety (your previous reverting patch can provide this guarantee). I liked Shaohua's patch a lot -- it adapts well to both the file-server case and the mostly-anon-pages case :) Thanks, Fengguang --
The downside is obvious: streaming IO (used once data that does not fit in the cache) can push out data that is used more often - requiring that it be swapped in at a later point in time. I understand what Shaohua's patch does, but I do not understand the upside. What good does it do to increase the size of the cache for streaming IO data, which is generally touched only once? What kind of performance benefits can we get by doing --
Not that bad :) With Shaohua's patch the anon list will typically _never_ get scanned, just like before. If it's mostly use-once IO, file:anon will be 1000 or even 10000, and priority=12. Then only anon lists larger than 16GB or 160GB will get So vmscan behavior and performance remain the same as before. For really large anon list, such workload is beyond our imagination. So we cannot assert "don't scan anon list" will be a benefit. On the other hand, in the test case of "do stream IO when most memory occupied by tmpfs pages", it is very bad behavior refuse to scan anon list in normal and suddenly start scanning _the whole_ anon list when priority hits 0. Shaohua's patch helps it by gradually increasing the scan nr of anon list as memory pressure increases. Thanks, --
Yep, the gradually increasing scan nr is the main advantage in my mind. Thanks, --
denominator[2] can be reduced to denominator. The "()" is not necessary here, or better end it here^ Thanks, --
Thanks, will send a updated against -mm since we reverted the offending patch. Thanks, Shaohua --
Ah, the (scan * fraction[file]) may overflow in 32bit kernel! Thanks, Fengguang --
I updated the patch to address previous issues.
is it possible to put this to -mm tree to see if there is anything wield happen?
get_scan_ratio() calculates percentage and if the percentage is < 1%, it will
round percentage down to 0% and cause we completely ignore scanning anon/file
pages to reclaim memory even the total anon/file pages are very big.
To avoid underflow, we don't use percentage, instead we directly calculate
how many pages should be scaned. In this way, we should get several scanned pages
for < 1% percent.
This has some benefits:
1. increase our calculation precision
2. making our scan more smoothly. Without this, if percent[x] is underflow,
shrink_zone() doesn't scan any pages and suddenly it scans all pages when priority
is zero. With this, even priority isn't zero, shrink_zone() gets chance to scan
some pages.
Note, this patch doesn't really change logics, but just increase precision. For
system with a lot of memory, this might slightly changes behavior. For example,
in a sequential file read workload, without the patch, we don't swap any anon
pages. With it, if anon memory size is bigger than 16G, we will see one anon page
swapped. The 16G is calculated as PAGE_SIZE * priority(4096) * (fp/ap). fp/ap
is assumed to be 1024 which is common in this workload. So the impact sounds not
a big deal.
Signed-off-by: Shaohua Li <shaohua.li@intel.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Cc: Wu Fengguang <fengguang.wu@intel.com>
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..1070f83 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1519,21 +1519,52 @@ static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
}
/*
+ * Smallish @nr_to_scan's are deposited in @nr_saved_scan,
+ * until we collected @swap_cluster_max pages to scan.
+ */
+static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
+ unsigned long *nr_saved_scan)
+{
+ unsigned long ...On Fri, 9 Apr 2010 14:51:04 +0800 I grabbed this. Did we decide that this needed to be backported into 2.6.33.x? If so, some words explaining the reasoning would be needed. Come to that, it's not obvious that we need this in 2.6.34 either. What is the user-visible impact here? --
I suspect very little impact, especially during workloads where we can just reclaim clean page cache at DEF_PRIORITY. FWIW, the patch looks good to me, so: Acked-by: Rik van Riel <riel@redhat.com> --
I'm surprised this ack a bit. Rik, do you have any improvement plan about streaming io detection logic? I think the patch have a slightly marginal benefit, it help to <1% scan ratio case. but it have big regression, it cause streaming io (e.g. backup operation) makes tons swap. So, I thought we sould do either, 1) drop this one 2) merge to change stream io detection logic improvement at first, and merge this one at second. Am i missing something? --
How? From the description I believe it took 16GB in a zone before we start scanning anon pages when reclaiming at DEF_PRIORITY? We may need better streaming IO detection, anyway. I have noticed that while heavy sequential reads are fine, the virtual machines on my desktop system do a lot of whole block writes. Presumably, a lot of those writes are to the same blocks, over and over again. This causes the blocks to be promoted to the active file list, which ends up growing the active file list to the point where things from the working set get evicted. All for file pages that may only get WRITTEN to by the guests, because the guests cache their own copy whenever they need to read them! I'll have to check the page cache code to see if it keeps frequently written pages as accessed. We may be better off evicting frequently written pages, and keeping our cache space for data that is read... --
Please remember, 2.6.27 has following +1 scanning modifier.
zone->nr_scan_active += (zone_page_state(zone, NR_ACTIVE) >> priority) + 1;
^^^^
and, early (ano not yet merged) SplitLRU VM has similar +1. likes
scan = zone_nr_lru_pages(zone, sc, l);
scan >>= priority;
scan = (scan * percent[file]) / 100 + 1;
^^^
We didn't think only one page scanning is not big matter. but it was not
correct. we got streaming io bug report. the above +1 makes annoying swap
io. because some server need big backup operation rather much much than
physical memory size.
example, If vm are dropping 1TB use once pages, 0.1% anon scanning makes
1GB scan. and almost server only have some gigabyte swap although it
has >1TB memory.
If my memory is not correct, please correct me.
My point is, greater or smaller than 16GB isn't essential. all patches
should have big worth than the downside. The description said "the impact
sounds not a big deal", nobody disagree it. but it's worth is more little.
One question, In such case your guest don't use DirectIO?
Or do you talk about guest VM behabior?
I guess inactive_file_is_low_global() can be improvement a lot.
but I'm not sure.
--
And now I've merged this patch into my local vmscan patch queue. After solving streaming io issue, I'll put it to mainline. Thanks. --
if the streaming io issue is popular, how about below patch against my last one?
we take priority == DEF_PRIORITY an exception.
Index: linux/mm/vmscan.c
===================================================================
--- linux.orig/mm/vmscan.c 2010-04-14 09:03:28.000000000 +0800
+++ linux/mm/vmscan.c 2010-04-14 09:19:56.000000000 +0800
@@ -1629,6 +1629,22 @@ static void get_scan_count(struct zone *
fraction[0] = ap;
fraction[1] = fp;
denominator = ap + fp + 1;
+
+ /*
+ * memory pressure isn't high, we allow percentage underflow. This
+ * avoids swap in stream io case.
+ */
+ if (priority == DEF_PRIORITY) {
+ if (fraction[0] * 99 < fraction[1]) {
+ fraction[0] = 0;
+ fraction[1] = 1;
+ denominator = 1;
+ } else if (fraction[1] * 99 < fraction[0]) {
+ fraction[0] = 1;
+ fraction[1] = 0;
+ denominator = 1;
+ }
+ }
out:
for_each_evictable_lru(l) {
int file = is_file_lru(l);
--
Hi Your patch seems works. but it is obviously ugly and bandaid patch. So, I like single your previous patch rather than combinate this one. Even though both dropping makes sense rather than both merge. --
Should be very small I think. Thanks, Shaohua --
The changelog can be improved. For example, to describe these items in separate paragraphs: - the behavior change introduced by 84b18490d1f (which claims to be cleanup) - the tmpfs test case - the root cause - the solution The denominator is shared, so one scaler would be sufficient. Also ap, fp can be removed and to use fraction[] directly. And it's better to retain this comment: scan = (scan * fraction[file]) / denominator[file]; Thanks, --
Yes. It made subtle change.
But we should not depend that change.
Current logic seems to be good and clear than old.
I think you were lucky at that time by not-good and not-clear logic.
BTW, How about this?
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 79c8098..f0df563 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1646,11 +1646,6 @@ static void shrink_zone(int priority, struct zone *zone,
int file = is_file_lru(l);
unsigned long scan;
- if (percent[file] == 0) {
- nr[l] = 0;
- continue;
- }
-
scan = zone_nr_lru_pages(zone, sc, l);
if (priority) {
scan >>= priority;
--
Kind regards,
Minchan Kim
--
Unfortunatelly, memcg need your removed code. if removed, swapping out might happen although sc->may_swap==0 when priority==0. --
Can you please post the meminfo before and after the changes (diff maybe?). Can you also please share the ap and fp data from which the percent figures are being calculated Is your swappiness set to 60? Can you please share the OOM/panic message as well. Balbir --
