[CC'd some lucky candidates]

Hello,

I was just running
    mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O ^has_journal /dev/sdb2
on my SSD18M connected via USB1.1, and the result was, well, absolutely, positively _DEVASTATING_.

The entire system became _FULLY_ unresponsive, not even switching back down to tty1 via Ctrl-Alt-F1 worked (took 20 seconds for even this key to be respected).

Once back on ttys, invoking any command locked up for minutes (note that I'm talking about attempted additional I/O to the _other_, _unaffected_ main system HDD - such as loading some shell binaries -, NOT the external SSD18M!!).

Having an attempt at writing a 300M /dev/zero file to the SSD's filesystem was even worse (again tons of unresponsiveness), combined with multiple OOM conditions flying by (I/O to the main HDD was minimal, its LED was almost always _off_, yet everything stuck to an absolute standstill).

Clearly there's a very, very important limiter somewhere in bio layer missing or broken, a 300M dd /dev/zero should never manage to put such an onerous penalty on a system, IMHO.

I've got SysRq-W traces of these lockup conditions if wanted.

Not sure whether this is a 2.6.34-rc3 thing, might be a general issue.

Likely the lockup behaviour is a symptom of very high memory pressure. But this memory pressure shouldn't even be allowed to happen in the first place, since the dd submission rate should immediately get limited by the kernel's bio layer / elevators.

Also, I'm wondering whether perhaps additionally there are some cond_resched() to be inserted in some places, to try to improve coping with such a broken situation at least.

Thanks,

Andreas Mohr
--
It seems to be a major issue with recent kernels; at least I see the same with kernels >2.6.30, mainly with pendrives on a USB port. Even if I only download from the net directly to the pendrive, no I/O towards the HDD seems to be served at a reasonable speed or in a responsive way; the window manager took 10 mins to draw window borders, and so on ...
--
Seems this issue is a variation of the usual "ext3 sync" problem, but in overly critical and unexpected ways (full lockup of almost everything, and multiple OOMs).

I retried writing the 300M file with a freshly booted system, and there were _no_ suspicious issues to be observed (free memory went all down to 5M, not too problematic), well, that is, until I launched Firefox (the famous sync-happy beast). After Firefox startup, I had these long freezes again when trying to do transfers with the _UNRELATED_ main HDD of the system (plus some OOMs, again)

Setup: USB SSD ext4 non-journal, system HDD ext3, SSD unused except for this one ext4 partition (no swap partition activated there).

Of course I can understand and tolerate the existing "ext3 sync" issue, but what's special about this case is that large numbers of bio to a _separate_ _non_-ext3 device seem to put so much memory and I/O pressure on a system that the existing _lightly_ loaded ext3 device gets completely stuck for much longer than I'd usually naively expect an ext3 sync to an isolated device to take - not to mention the OOMs (which are probably causing swap partition handling on the main HDD to contribute to the contention).

IOW, we seem to still have too much ugly lock contention interaction between expectedly isolated parts of the system.

OTOH the main problem likely still is overly large pressure induced by a thoroughly unthrottled dd 300M, resulting in sync-challenged ext3 and swap activity (this time on the same device!) to break completely, and also OOMs to occur.

Probably overly global ext3 sync handling manages to grab a couple more global system locks (bdi, swapping, page handling, ...) before being contended, causing other, non-ext3-challenged parts of the system (e.g. the swap partition on the _same_ device) to not make any progress in the meantime.

per-bdi writeback patches (see http://www.serverphorums.com/read.php?12,32355,33238,page=2 ) might have handled a related issue.

Following is ...
Andreas, shmem=56 is ignorable, and active_file+inactive_file=13576+27884=41460 < 56122 total pagecache pages. Many applications (this one and below) are stuck in wait_on_page_writeback(). I guess this is why "heavy write to irrelevant partition stalls the whole system". They are stuck on page allocation. Your 512MB system memory is a bit tight, so reclaim pressure is a bit high, which triggers the wait-on-writeback logic. Thanks, --
I wonder if this hacking patch may help.

When creating 300MB dirty file with dd, it is creating continuous region of hard-to-reclaim pages in the LRU list. priority can easily go low when irrelevant applications' direct reclaim run into these regions..

Thanks,
Fengguang
---
diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..f7179cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1149,7 +1149,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		lumpy_reclaim = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if (sc->order && priority < DEF_PRIORITY / 2)
 		lumpy_reclaim = 1;
 
 	pagevec_init(&pvec, 1);
--
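To make the effect of the one-line change easier to see, here is a minimal userspace sketch of the condition the patch touches. It is not kernel code: `wants_lumpy_reclaim()` and its `low_order_threshold` parameter are hypothetical names introduced for illustration, and the constants are assumed to be the values `include/linux/mmzone.h` used in kernels of that era.

```c
#include <stdbool.h>

/* Assumed values from kernels of that era (include/linux/mmzone.h). */
#define DEF_PRIORITY            12
#define PAGE_ALLOC_COSTLY_ORDER 3

/*
 * Userspace model of the condition in shrink_inactive_list().
 * low_order_threshold is DEF_PRIORITY - 2 before the patch and
 * DEF_PRIORITY / 2 after it. Hypothetical helper, illustration only.
 */
bool wants_lumpy_reclaim(int order, int priority, int low_order_threshold)
{
	/* Costly (large) orders always use lumpy reclaim. */
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return true;
	/* Low non-zero orders use it only once priority has dropped far enough. */
	return order != 0 && priority < low_order_threshold;
}
```

Under these assumptions, an order-1 allocation at priority 9 enters lumpy reclaim with the old threshold (9 < 10) but not with the patched one (9 is not below 6), so the patch only delays the point at which low-order reclaim starts waiting.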
Sorry, I'm confused now. Can you please give us a more detailed explanation? Why did lumpy reclaim cause the OOM? Lumpy reclaim might slow down direct reclaim, but IIUC it doesn't cause OOM, because OOM only occurs when priority-0 reclaim fails. IO getting stuck also prevents
--
No, I'm not talking about OOM. Nor lumpy reclaim.

I mean the direct reclaim can get stuck for a long time, when we do

Sure. But we can wait for IO a bit later -- after scanning 1/64 of the LRU (the below patch) instead of the current 1/1024.

In Andreas' case, 512MB/1024 = 512KB, which is way too low compared to the 22MB of writeback pages. There can easily be a continuous range of 512KB dirty/writeback pages in the LRU, which will trigger the wait logic.

Thanks,
--
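The 1/1024 versus 1/64 arithmetic can be checked with a small sketch. It assumes, as the message does, that the whole 512MB is treated as one LRU and that one reclaim pass at a given priority covers the LRU size shifted right by that priority; `scan_window_bytes()` is a hypothetical helper, not a kernel function.

```c
#include <stdint.h>

/*
 * Portion of an LRU covered by one reclaim pass at a given priority,
 * modelled as lru_bytes >> priority. Simplified userspace model.
 */
uint64_t scan_window_bytes(uint64_t lru_bytes, int priority)
{
	return lru_bytes >> priority;
}
```

With DEF_PRIORITY assumed to be 12, priority 10 (DEF_PRIORITY - 2) gives `scan_window_bytes(512ULL << 20, 10)`, which is 512KB, while priority 6 (DEF_PRIORITY / 2) gives 8MB. A 512KB window fits entirely inside a 22MB run of writeback pages far more easily than an 8MB window does.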
From your explanation, my feeling is that we need an automatic adjustment mechanism rather than a changed default value for one particular machine, no?
--
You mean the dumb DEF_PRIORITY/2 may be too large for a 1TB memory box?

However, for such boxes, whether it is DEF_PRIORITY-2 or DEF_PRIORITY/2 should be irrelevant: it's trivial anyway to reclaim an order-1 or order-2 page. In other words, lumpy_reclaim will hardly ever go to 1. Do you think so?

Thanks,
Fengguang
--
If I remember correctly, order-1 lumpy reclaim was introduced to solve exactly such a case: a big box + AIM7 workload caused kernel stack (order-1 page) allocation failures.

We are living under Moore's law, so we probably need to pay attention to scalability at all times. Today's big box is going to become a desktop box in 3-5 years.

Lee probably knows this problem better than I do. CC'ing him.
--
In Andreas' trace, the processes are blocked in
- do_fork: console-kit-d
- __alloc_skb: x-terminal-em, konqueror
- handle_mm_fault: tclsh
- filemap_fault: ls

I'm a bit confused by the last one, and wonder what's the typical gfp order of __alloc_skb().

Thanks,
Fengguang
--
I've probably found one of the reasons for the low-order lumpy reclaim slowdown.
Let's fix the obvious bug first!
============================================================
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH] vmscan: page_check_references() check low order lumpy reclaim properly
If vmscan is in lumpy reclaim mode, it has to ignore the referenced bit
in order to create contiguous free pages, but the current
page_check_references() doesn't.
Fix it.
Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
mm/vmscan.c | 32 +++++++++++++++++---------------
1 files changed, 17 insertions(+), 15 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..13d9546 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
 
 	int order;
 
+	int lumpy_reclaim;
+
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
 
@@ -575,7 +577,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+	if (sc->lumpy_reclaim)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -1130,7 +1132,6 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	unsigned long nr_scanned = 0;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	int lumpy_reclaim = 0;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1140,17 +1141,6 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 			return SWAP_CLUSTER_MAX;
 	}
 
-	/*
-	 * If we need a large contiguous chunk of memory, or have
-	 * trouble getting a small set of contiguous pages, we
-	 * will reclaim both active and inactive pages.
-	 *
-	 * We use the same threshold as pageout congestion_wait below.
-	 */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		lumpy_reclaim = ...
On Fri, Apr 16, 2010 at 12:16 PM, KOSAKI Motohiro
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>
I am not sure how the patch affects this problem.
But I think the patch is reasonable.
Nice catch, Kosaki.
How about making a new function for readability instead of nesting the else?
int is_lumpy_reclaim(struct scan_control *sc)
{
....
}
If you merge the patch that reduces stack usage of the reclaim path, I think
the scan_control argument alone is enough.
It's just nitpick. :)
If you don't mind, ignore, please.
--
Kind regards,
Minchan Kim
--
Good idea. I'd rather not introduce a dependency on the "reduced stack usage" series, but I agree and will push your proposal later, separately.
--
On Fri, 16 Apr 2010 12:16:18 +0900 (JST)

Needs a comment explaining its role, please. Something like "direct this reclaim run to perform lumpy reclaim"?

A clearer name might be "lumpy_reclaim_mode"?

Making it a `bool' would clarify things too.
--
Sorry, I missed this review comment of yours.
How about this?
---
mm/vmscan.c | 39 ++++++++++++++++++++++++---------------
1 files changed, 24 insertions(+), 15 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d9546..c3bcdd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,7 +77,11 @@ struct scan_control {
 
 	int order;
 
-	int lumpy_reclaim;
+	/*
+	 * Intend to reclaim enough contiguous memory rather than to reclaim
+	 * enough amount of memory. I.e., it's the mode for high order allocation.
+	 */
+	bool lumpy_reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -577,7 +581,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim)
+	if (sc->lumpy_reclaim_mode)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -1153,7 +1157,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		unsigned long nr_freed;
 		unsigned long nr_active;
 		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
+		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
 		unsigned long nr_anon;
 		unsigned long nr_file;
 
@@ -1206,7 +1210,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		 * but that should be acceptable to the caller
 		 */
 		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim) {
+		    sc->lumpy_reclaim_mode) {
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*
@@ -1609,6 +1613,21 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
 	return nr;
 }
 
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
+{
+	/*
+	 * If we need a large contiguous chunk of memory, or have
+	 * trouble getting a small set of contiguous pages, we
+	 * will reclaim both active and inactive pages.
+	 */
+	if (sc->order > ...
swapcache?
--
Kind regards,
Minchan Kim
--
Hi,

"Your 512MB system memory is a bit tight". Heh, try to survive making such a statement 15 years ago ;) (but you likely meant this in the context of inducing a whopping 300MB write)

Thank you for your reply, I'll test the patch ASAP (with large writes and Firefox sync mixed in), maybe this will improve things already.

Andreas Mohr
--
Indeed, AFAICS this definitely seems MUCH better than before. I had to do a full kernel rebuild (due to CONFIG_LOCALVERSION_AUTO changes; with no changed configs though).

I threw some extra load into the mix (read 400MB instead of 300 through USB1.1, ran gimp, grepped over the entire /usr partition etc.pp.), so far not nearly as severe as before, and no OOMs either. Launched Firefox some time after starting 400MB creation, pretty ok still. Some annoying lags sometimes of course, but nothing absolutely earth-shattering as experienced before.

Things really appear to be a LOT better. OK, so which way to go?

Thanks a lot,

Andreas Mohr
--
You are using a USB 1.1 connection, about the same speed as a floppy. If you have not tuned your system to prevent all of the memory from being used to cache writes, it will be used that way. I don't have my notes handy, but I believe you need to tune the "dirty" parameters of /proc/sys/vm so that it makes better use of memory.

Of course putting a fast device like SSD on a super slow connection makes no
--
Bill Davidsen <davidsen@tmr.com>
"We have more to fear from the bungling of the incompetent than from the machinations of the wicked." - from Slashdot
--
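For readers who want to see what the "dirty" parameters amount to on their own machine, here is a minimal sketch. It assumes the knobs meant are `dirty_ratio` and `dirty_background_ratio` under /proc/sys/vm (the message does not name them), and it applies the percentage to a caller-supplied memory size, which is only a rough upper-end estimate since the kernel uses its own notion of dirtyable memory. All three function names are hypothetical.

```c
#include <stdio.h>
#include <stdint.h>

/* Read an integer knob from /proc/sys/vm; returns -1 if it cannot be read. */
long read_vm_knob(const char *name)
{
	char path[256];
	long value;
	FILE *f;
	int n;

	n = snprintf(path, sizeof(path), "/proc/sys/vm/%s", name);
	if (n < 0 || n >= (int)sizeof(path))
		return -1;
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Rough estimate: the given percentage of mem_bytes may sit dirty in the cache. */
uint64_t approx_dirty_limit(uint64_t mem_bytes, unsigned percent)
{
	return mem_bytes / 100 * percent;
}

/* Seconds needed to flush `bytes` to a device sustaining bytes_per_sec. */
double drain_seconds(uint64_t bytes, double bytes_per_sec)
{
	return (double)bytes / bytes_per_sec;
}
```

A caller could feed `read_vm_knob("dirty_ratio")` into `approx_dirty_limit()` for a 512MB machine and then pass the result to `drain_seconds()` with the measured throughput of the slow device, to get a feel for how long the page cache can stay clogged with pages waiting for that device.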
Ahahahaaa. A rather distant approximation given a speed of 20kB/s vs. 987kB/s ;) (but I get the point you're making here)

I'm not at all convinced that USB2.0 would fare any better here, though: after all we are buffering the file that is written to the device - after the fact! (plus there are many existing complaints of people that copying of large files manages to break entire machines, and I doubt many of those were using USB1.1)
https://bugzilla.kernel.org/show_bug.cgi?id=13347
https://bugzilla.kernel.org/show_bug.cgi?id=7372

Hmmmm. I don't believe that there should be much in need of being tuned, especially in light of default settings being so problematic. Of course things here are similar to the shell ulimit philosophy, "because I can" (tm) :)

And because I like to break systems that happen to work moderately wonderfully for the mainstream(?)(?!?) case of quad cores with 16GB of RAM ;) [well in fact I don't, but of course that just happens to happen...]

Thanks for your input,

Andreas Mohr
--
I will tell you one more thing you can do to test my thought that you are totally filling memory, copy data to the device using DIRECT to keep from dirtying cache. It will slow the copy (to a slight degree) and keep the system responsive.

I used to have a USB 2.0 disk, and you are right, it will show the same problems. That's why I have some ideas of tuning.

And during the 2.5 development phase I played with "per fd" limits on memory per file, which solved the problem for me. I had some educational discussions with several developers, but this is one of those things which has limited usefulness and development was very busy at that time with things deemed more important, so I never tried to get it ready for inclusion in the kernel.
--
Bill Davidsen <davidsen@tmr.com>
"We can't solve today's problems by using the same thinking we used in creating them." - Einstein
--
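The DIRECT experiment could be sketched roughly as follows, assuming "DIRECT" means opening the target with O_DIRECT. This is an illustration rather than the procedure anyone in the thread actually ran: `write_zeros_direct()` and `align_up()` are hypothetical helpers, the block size is left to the caller (4096 is a common guess), and on systems whose headers do not expose O_DIRECT the sketch silently degrades to a normal buffered write.

```c
#define _GNU_SOURCE /* glibc exposes O_DIRECT only with this defined */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0 /* not available here: the write stays buffered */
#endif

/* Round n up to the next multiple of align (align must be non-zero). */
size_t align_up(size_t n, size_t align)
{
	return (n + align - 1) / align * align;
}

/*
 * Write zero bytes to path while trying to bypass the page cache.
 * total is rounded up to a multiple of block, because O_DIRECT wants
 * aligned sizes; block is also used as the buffer alignment.
 * Returns 0 on success, -1 on any failure (including a short write).
 */
int write_zeros_direct(const char *path, size_t total, size_t block)
{
	void *buf = NULL;
	size_t left;
	int fd, ret = -1;

	if (block == 0)
		return -1;
	left = align_up(total, block);

	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
	if (fd < 0)
		return -1;
	/* O_DIRECT requires a suitably aligned user buffer. */
	if (posix_memalign(&buf, block, block) != 0) {
		buf = NULL;
		goto out;
	}
	memset(buf, 0, block);
	while (left > 0) {
		if (write(fd, buf, block) != (ssize_t)block)
			goto out;
		left -= block;
	}
	ret = 0;
out:
	free(buf);
	close(fd);
	return ret;
}
```

Calling `write_zeros_direct(path, 300u << 20, 4096)` on a file on the USB device would approximate the 300M test without filling the cache with dirty pages. Some filesystems reject O_DIRECT, in which case `open()` fails and the function returns -1.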
Indeed. I have found this to be a persistent problem and I really wish there were more interest in debugging this. I have tried bringing the community's resources to bear on this issue several[1] times, and each time we fail to get enough of the right eyes looking at it or developer interest simply vanishes.

I've started putting together a list[2] of pertinent threads/patches/bugs/data in hopes that this will lower the energy barrier of getting up to speed on this issue. Hopefully this will help.

Cheers,

- Ben

[1] https://bugzilla.kernel.org/show_bug.cgi?id=12309
[2] http://goldnerlab.physics.umass.edu/wiki/BenGamari/IoWaitLatency
--
