Re: 32GB SSD on USB1.1 P3/700 == ___HELL___ (2.6.34-rc3)

From: Andreas Mohr
Date: Sunday, April 4, 2010 - 3:13 pm

[CC'd some lucky candidates]

Hello,

I was just running
    mkfs.ext4 -b 4096 -E stride=128 -E stripe-width=128 -O ^has_journal /dev/sdb2
on my SSD18M connected via USB1.1, and the result was, well,
absolutely, positively _DEVASTATING_.

The entire system became _FULLY_ unresponsive, not even switching back
down to tty1 via Ctrl-Alt-F1 worked (took 20 seconds for even this key
to be respected).

Once back on ttys, invoking any command locked up for minutes
(note that I'm talking about attempted additional I/O to the _other_,
_unaffected_ main system HDD - such as loading some shell binaries -,
NOT the external SSD18M!!).

Having an attempt at writing a 300M /dev/zero file to the SSD's filesystem
was even worse (again tons of unresponsiveness), combined with multiple
OOM conditions flying by (I/O to the main HDD was minimal, its LED was
almost always _off_, yet everything stuck to an absolute standstill).

Clearly there's a very, very important limiter somewhere in the bio layer
that is missing or broken; a 300M dd from /dev/zero should never manage to put
such an onerous penalty on a system, IMHO.


I've got SysRq-W traces of these lockup conditions if wanted.


Not sure whether this is a 2.6.34-rc3 thing, might be a general issue.

Likely the lockup behaviour is a symptom of very high memory pressure.
But this memory pressure shouldn't even be allowed to happen in the first
place, since the dd submission rate should immediately get limited by the kernel's
bio layer / elevators.

Also, I'm wondering whether some additional cond_resched() calls should
be inserted in some places, to at least cope better with such a
broken situation.

Thanks,

Andreas Mohr
--

From: Gábor Lénárt
Date: Sunday, April 4, 2010 - 4:31 pm

It seems to be a major issue with recent kernels; at least I see the same
with kernels >2.6.30, mainly with pendrives on a USB port. Even if I only
download from the net directly to the pendrive, it seems no I/O towards the
HDD can be served at a reasonable speed or in a responsive way either; the
window manager took 10 mins to draw window borders and so on ...
--

From: Andreas Mohr
Date: Monday, April 5, 2010 - 3:53 am

Seems this issue is a variation of the usual "ext3 sync" problem,
but in overly critical and unexpected ways (full lockup of almost everything,
and multiple OOMs).

I retried writing the 300M file on a freshly booted system, and there
were _no_ suspicious issues to be observed (free memory went all the way down to
5M, not too problematic), well, that is, until I launched Firefox
(the famous sync-happy beast).
After Firefox startup, I had these long freezes again when trying to
do transfers with the _UNRELATED_ main HDD of the system
(plus some OOMs, again)

Setup: USB SSD ext4 non-journal, system HDD ext3, SSD unused except for
this one ext4 partition (no swap partition activated there).

Of course I can understand and tolerate the existing "ext3 sync" issue,
but what's special about this case is that large numbers of bios to
a _separate_ _non_-ext3 device seem to put so much memory and I/O pressure
on a system that the existing _lightly_ loaded ext3 device gets completely
stuck for much longer than I'd usually naively expect an ext3 sync to an isolated
device to take - not to mention the OOMs (which are probably causing
swap partition handling on the main HDD to contribute to the contention).

IOW, we seem to still have too much ugly lock contention interaction
between expectedly isolated parts of the system.

OTOH the main problem likely still is overly large pressure induced by a
thoroughly unthrottled dd 300M, resulting in sync-challenged ext3 and swap
activity (this time on the same device!) to break completely, and also OOMs to occur.

Probably overly global ext3 sync handling manages to grab a couple
more global system locks (bdi, swapping, page handling, ...)
before being contended, causing other, non-ext3-challenged
parts of the system (e.g. the swap partition on the _same_ device)
to not make any progress in the meantime.

per-bdi writeback patches (see
http://www.serverphorums.com/read.php?12,32355,33238,page=2 ) might
have handled a related issue.


Following is ...
--

From: Wu Fengguang
Date: Wednesday, April 7, 2010 - 12:00 am

Andreas,



shmem=56 is ignorable, and 
active_file+inactive_file=13576+27884=41460 < 56122 total pagecache pages.


Many applications (this one and below) are stuck in
wait_on_page_writeback(). I guess this is why "heavy write to
irrelevant partition stalls the whole system".  They are stuck on page
allocation. Your 512MB system memory is a bit tight, so reclaim
pressure is a bit high, which triggers the wait-on-writeback logic.

Thanks,
--

From: Wu Fengguang
Date: Wednesday, April 7, 2010 - 12:08 am

I wonder if this hacking patch may help.

Creating a 300MB dirty file with dd creates a contiguous region of
hard-to-reclaim pages in the LRU list. The reclaim priority can easily
drop low when unrelated applications' direct reclaim runs into these
regions.

Thanks,
Fengguang
---

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e0e5f15..f7179cf 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1149,7 +1149,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	 */
 	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
 		lumpy_reclaim = 1;
-	else if (sc->order && priority < DEF_PRIORITY - 2)
+	else if (sc->order && priority < DEF_PRIORITY / 2)
 		lumpy_reclaim = 1;
 
 	pagevec_init(&pvec, 1);
--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 8:31 pm

Sorry, I'm confused now. Can you please give a more detailed explanation?
Why did lumpy reclaim cause OOM? Lumpy reclaim might slow down
direct reclaim, but IIUC it does not cause OOM, because OOM
only occurs when priority-0 reclaim fails. IO getting stuck also prevents



--

From: Wu Fengguang
Date: Wednesday, April 14, 2010 - 9:19 pm

No, I'm not talking about OOM. Nor lumpy reclaim.

I mean that direct reclaim can get stuck for a long time, when we do

Sure. But we can wait for IO a bit later -- after scanning 1/64 LRU
(the below patch) instead of the current 1/1024.

In Andreas' case, 512MB/1024 = 512KB, which is far too low compared to
the 22MB of writeback pages. There can easily be a contiguous range of
512KB of dirty/writeback pages in the LRU, which will trigger the wait
logic.
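
The numbers above follow from the way the scanner sizes its pass: roughly
the LRU size shifted right by the reclaim priority. A minimal userspace
sketch, assuming DEF_PRIORITY is 12 as in kernels of this era and treating
all 512MB as a single LRU (both are simplifications; scan_window_bytes() is
a made-up helper for illustration, not a kernel function):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value of the kernel constant; check your tree's mmzone.h. */
#define DEF_PRIORITY 12

/*
 * Approximate amount of LRU covered by one scan pass at a given
 * reclaim priority: lru_bytes >> priority. Lower priority values
 * mean a larger share of the list is scanned.
 */
static uint64_t scan_window_bytes(uint64_t lru_bytes, int priority)
{
	assert(priority >= 0 && priority <= DEF_PRIORITY);
	return lru_bytes >> priority;
}
```

Under these assumptions, scan_window_bytes(512MB, DEF_PRIORITY - 2) comes
out at 512KB (1/1024 of the list) and scan_window_bytes(512MB,
DEF_PRIORITY / 2) at 8MB (1/64 of the list), which is the difference
between the current threshold and the proposed one.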

Thanks,
--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:32 pm

From your explanation, I feel we need an automatic adjustment mechanism
instead of changing the default value for one special machine, no?



--

From: Wu Fengguang
Date: Wednesday, April 14, 2010 - 9:41 pm

You mean the dumb DEF_PRIORITY/2 may be too large for a 1TB memory box?

However, for such boxes, whether it is DEF_PRIORITY-2 or DEF_PRIORITY/2
should be irrelevant: it's trivial anyway to reclaim an order-1 or
order-2 page. In other words, lumpy_reclaim will hardly ever be set to 1.
Do you think so?

Thanks,
Fengguang
--

From: KOSAKI Motohiro
Date: Wednesday, April 14, 2010 - 9:55 pm

If I remember correctly, order-1 lumpy reclaim was introduced to solve
kernel stack (order-1 page) allocation failures seen with such a big box
under an AIM7 workload.

We live under Moore's law now, so we probably always need to pay attention
to scalability. Today's big box is going to become a desktop box in
3-5 years.

Lee probably knows this problem better than I do. CC'ing him.



--

From: Wu Fengguang
Date: Wednesday, April 14, 2010 - 10:19 pm

In Andreas' trace, the processes are blocked in
- do_fork:              console-kit-d
- __alloc_skb:          x-terminal-em, konqueror
- handle_mm_fault:      tclsh
- filemap_fault:        ls

I'm a bit confused by the last one, and wonder what's the typical
gfp order of __alloc_skb().

Thanks,
Fengguang
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 8:16 pm

I've probably found one of the reasons for the low-order lumpy reclaim slowdown.
Let's fix the obvious bug first!


============================================================
From: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Subject: [PATCH] vmscan: page_check_references() check low order lumpy reclaim properly

If vmscan is in lumpy reclaim mode, it has to ignore the referenced bit
in order to make contiguous free pages, but the current
page_check_references() doesn't.

Fix it.

Signed-off-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
---
 mm/vmscan.c |   32 +++++++++++++++++---------------
 1 files changed, 17 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..13d9546 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,6 +77,8 @@ struct scan_control {
 
 	int order;
 
+	int lumpy_reclaim;
+
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
 
@@ -575,7 +577,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
+	if (sc->lumpy_reclaim)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -1130,7 +1132,6 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 	unsigned long nr_scanned = 0;
 	unsigned long nr_reclaimed = 0;
 	struct zone_reclaim_stat *reclaim_stat = get_reclaim_stat(zone, sc);
-	int lumpy_reclaim = 0;
 
 	while (unlikely(too_many_isolated(zone, file, sc))) {
 		congestion_wait(BLK_RW_ASYNC, HZ/10);
@@ -1140,17 +1141,6 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 			return SWAP_CLUSTER_MAX;
 	}
 
-	/*
-	 * If we need a large contiguous chunk of memory, or have
-	 * trouble getting a small set of contiguous pages, we
-	 * will reclaim both active and inactive pages.
-	 *
-	 * We use the same threshold as pageout congestion_wait below.
-	 */
-	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
-		lumpy_reclaim = ...
--

From: Minchan Kim
Date: Thursday, April 15, 2010 - 9:26 pm

On Fri, Apr 16, 2010 at 12:16 PM, KOSAKI Motohiro
Reviewed-by: Minchan Kim <minchan.kim@gmail.com>

I am not sure how the patch affects this problem.
But I think the patch is reasonable.

Nice catch, Kosaki.

How about making a new function for readability instead of nesting else?
int is_lumpy_reclaim(struct scan_control *sc)
{
....
}

If you merge the patch that reduces stack usage of the reclaim path, I think
the scan_control argument alone is enough.
It's just a nitpick. :)
Feel free to ignore it.
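
One possible shape for such a helper, as a userspace sketch only. The
condition is the one already used in shrink_inactive_list(); the cut-down
struct, the priority field inside it, and the constant values
(DEF_PRIORITY 12, PAGE_ALLOC_COSTLY_ORDER 3) are assumptions made for
illustration and are not taken from any posted patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed values of the kernel constants. */
#define DEF_PRIORITY            12
#define PAGE_ALLOC_COSTLY_ORDER 3

/* Hypothetical, cut-down stand-in for the kernel's struct scan_control. */
struct scan_control_sketch {
	int order;     /* allocation order we are reclaiming for */
	int priority;  /* assumed to be carried in the struct */
};

/*
 * Lumpy reclaim is used for costly high-order allocations, or for
 * smaller non-zero orders once reclaim priority has dropped far enough.
 */
static bool is_lumpy_reclaim(const struct scan_control_sketch *sc)
{
	assert(sc != NULL);
	if (sc->order > PAGE_ALLOC_COSTLY_ORDER)
		return true;
	return sc->order > 0 && sc->priority < DEF_PRIORITY - 2;
}
```

With the priority stored in the struct, callers such as
page_check_references() would only need the one argument.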


-- 
Kind regards,
Minchan Kim
--

From: KOSAKI Motohiro
Date: Thursday, April 15, 2010 - 10:33 pm

Good idea. I'd rather not introduce a dependency on the "reduced stack usage"
series, but I agree, and I'll push your proposal later and separately.



--

From: Andrew Morton
Date: Friday, April 16, 2010 - 2:18 pm

On Fri, 16 Apr 2010 12:16:18 +0900 (JST)

Needs a comment explaining its role, please.  Something like "direct
this reclaim run to perform lumpy reclaim"?

A clearer name might be "lumpy_reclaim_mode"?

Making it a `bool' would clarify things too.
--

From: KOSAKI Motohiro
Date: Wednesday, May 12, 2010 - 7:54 pm

Sorry, I missed this review comment of yours.
How about this?


---
 mm/vmscan.c |   39 ++++++++++++++++++++++++---------------
 1 files changed, 24 insertions(+), 15 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 13d9546..c3bcdd4 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -77,7 +77,11 @@ struct scan_control {
 
 	int order;
 
-	int lumpy_reclaim;
+	/*
+	 * Intend to reclaim enough contenious memory rather than to reclaim
+	 * enough amount memory. I.e, it's the mode for high order allocation.
+	 */
+	bool lumpy_reclaim_mode;
 
 	/* Which cgroup do we reclaim from */
 	struct mem_cgroup *mem_cgroup;
@@ -577,7 +581,7 @@ static enum page_references page_check_references(struct page *page,
 	referenced_page = TestClearPageReferenced(page);
 
 	/* Lumpy reclaim - ignore references */
-	if (sc->lumpy_reclaim)
+	if (sc->lumpy_reclaim_mode)
 		return PAGEREF_RECLAIM;
 
 	/*
@@ -1153,7 +1157,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		unsigned long nr_freed;
 		unsigned long nr_active;
 		unsigned int count[NR_LRU_LISTS] = { 0, };
-		int mode = sc->lumpy_reclaim ? ISOLATE_BOTH : ISOLATE_INACTIVE;
+		int mode = sc->lumpy_reclaim_mode ? ISOLATE_BOTH : ISOLATE_INACTIVE;
 		unsigned long nr_anon;
 		unsigned long nr_file;
 
@@ -1206,7 +1210,7 @@ static unsigned long shrink_inactive_list(unsigned long max_scan,
 		 * but that should be acceptable to the caller
 		 */
 		if (nr_freed < nr_taken && !current_is_kswapd() &&
-		    sc->lumpy_reclaim) {
+		    sc->lumpy_reclaim_mode) {
 			congestion_wait(BLK_RW_ASYNC, HZ/10);
 
 			/*
@@ -1609,6 +1613,21 @@ static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
 	return nr;
 }
 
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
+{
+	/*
+	 * If we need a large contiguous chunk of memory, or have
+	 * trouble getting a small set of contiguous pages, we
+	 * will reclaim both active and inactive pages.
+	 */
+	if (sc->order > ...
--

From: Minchan Kim
Date: Wednesday, April 7, 2010 - 1:39 am

swapcache?


-- 
Kind regards,
Minchan Kim
--

From: Wu Fengguang
Date: Wednesday, April 7, 2010 - 1:52 am

Ah exactly!

Thanks,
Fengguang
--

From: Andreas Mohr
Date: Wednesday, April 7, 2010 - 4:17 am

Hi,


"Your 512MB system memory is a bit tight".
Heh, try to survive making such a statement 15 years ago ;)
(but you likely meant this in the context of inducing a whopping 300MB write)

Thank you for your reply, I'll test the patch ASAP (with large writes
and Firefox sync mixed in), maybe this will improve things already.

Andreas Mohr
--

From: Andreas Mohr
Date: Thursday, April 8, 2010 - 12:46 pm

Indeed, AFAICS this definitely seems MUCH better than before.
I had to do a full kernel rebuild (due to CONFIG_LOCALVERSION_AUTO
changes; with no changed configs though).
I threw some extra load into the mix (read 400MB instead of 300 through
USB1.1, ran gimp, grepped over the entire /usr partition etc.pp.),
so far not nearly as severe as before, and no OOMs either.
Launched Firefox some time after starting 400MB creation, pretty ok
still. Some annoying lags sometimes of course, but nothing absolutely
earth-shattering as experienced before.
Things really appear to be a LOT better.

OK, so which way to go?

Thanks a lot,

Andreas Mohr
--

From: Bill Davidsen
Date: Thursday, April 8, 2010 - 1:12 pm

You are using a USB 1.1 connection, about the same speed as a floppy. If you 
have not tuned your system to prevent all of the memory from being used to cache 
writes, it will be used that way. I don't have my notes handy, but I believe you 
need to tune the "dirty" parameters of /proc/sys/vm so that it makes better use 
of memory.
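
To get a feel for the numbers involved, here is a rough userspace sketch.
The files /proc/sys/vm/dirty_ratio and /proc/sys/vm/dirty_background_ratio
exist on Linux; the helper names, the simplification that the ratio applies
to all of RAM (the kernel works from a smaller "dirtyable" figure), and the
example values are assumptions:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Read an integer knob such as /proc/sys/vm/dirty_ratio; -1 on failure. */
static long read_vm_sysctl(const char *path)
{
	FILE *f = fopen(path, "r");
	long value = -1;

	if (f == NULL)
		return -1;
	if (fscanf(f, "%ld", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Upper bound on dirty page cache if the ratio applied to all of RAM. */
static uint64_t dirty_limit_bytes(uint64_t mem_bytes, unsigned ratio_percent)
{
	assert(ratio_percent <= 100);
	return mem_bytes / 100 * ratio_percent;
}

/* Whole seconds needed to write that much data at a given device speed. */
static uint64_t drain_seconds(uint64_t dirty_bytes, uint64_t bytes_per_sec)
{
	assert(bytes_per_sec > 0);
	return dirty_bytes / bytes_per_sec;
}
```

For example, with an assumed dirty_ratio of 20 on a 512MB machine the bound
is about 102MB, and draining that at roughly 1MB/s takes well over a minute
and a half. Lowering the ratio shrinks both figures proportionally.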

Of course putting a fast device like SSD on a super slow connection makes no 


-- 
Bill Davidsen <davidsen@tmr.com>
   "We have more to fear from the bungling of the incompetent than from
the machinations of the wicked."  - from Slashdot
--

From: Andreas Mohr
Date: Thursday, April 8, 2010 - 1:35 pm

Ahahahaaa. A rather distant approximation given a speed of 20kB/s vs. 987kB/s ;)
(but I get the point you're making here)

I'm not at all convinced that USB2.0 would fare any better here, though:
after all we are buffering the file that is written to the device
- after the fact!
(plus there are many existing complaints from people that copying large files
manages to break entire machines, and I doubt many of those were using
USB1.1)
https://bugzilla.kernel.org/show_bug.cgi?id=13347
https://bugzilla.kernel.org/show_bug.cgi?id=7372

Hmmmm. I don't believe that there should be much in need of being
tuned, especially in light of default settings being so problematic.
Of course things here are similar to the shell ulimit philosophy,

"because I can" (tm) :)

And because I like to break systems that happen to work moderately wonderfully
for the mainstream(?)(?!?) case of quad cores with 16GB of RAM ;)
[well in fact I don't, but of course that just happens to happen...]

Thanks for your input,

Andreas Mohr
--

From: Bill Davidsen
Date: Thursday, April 8, 2010 - 3:01 pm

I will tell you one more thing you can do to test my thought that you 
are totally filling memory: copy data to the device using direct I/O 
(O_DIRECT) to keep from dirtying cache. It will slow the copy (to a slight 
degree) and keep the system responsive. I used to have a USB 2.0 disk, and 
you are right, it will show the same problems. That's why I have some ideas 
of tuning.
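
A minimal sketch of such a test writer, assuming Linux with glibc and a
4096-byte alignment requirement (the real requirement depends on the device
and filesystem). The function names are made up for illustration, and some
filesystems reject O_DIRECT entirely, in which case the open simply fails:

```c
#define _GNU_SOURCE   /* exposes O_DIRECT in <fcntl.h> on Linux/glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLOCK 4096    /* assumed alignment for buffer, offset and length */

/* Round n up to the next multiple of align (align must be a power of two). */
static size_t align_up(size_t n, size_t align)
{
	assert(align != 0 && (align & (align - 1)) == 0);
	return (n + align - 1) & ~(align - 1);
}

/*
 * Write len zero bytes (rounded up to whole blocks) to path while
 * bypassing the page cache. Returns 0 on success, -1 on any failure.
 */
static int write_zeros_direct(const char *path, size_t len)
{
	void *buf = NULL;
	size_t chunk = 256 * BLOCK;
	size_t left = align_up(len, BLOCK);
	int fd, rc = 0;

	if (posix_memalign(&buf, BLOCK, chunk) != 0)
		return -1;
	memset(buf, 0, chunk);

#ifdef O_DIRECT
	fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
#else
	fd = -1;	/* O_DIRECT is not exposed on this platform */
#endif
	if (fd < 0) {
		free(buf);
		return -1;
	}
	while (left > 0) {
		size_t n = left < chunk ? left : chunk;
		ssize_t w = write(fd, buf, n);

		if (w <= 0) {
			rc = -1;
			break;
		}
		left -= (size_t)w;
	}
	if (close(fd) != 0)
		rc = -1;
	free(buf);
	return rc;
}
```

Calling write_zeros_direct() with a path on the USB device and a 300MB
length would approximate the dd test from earlier in the thread without
building up dirty pages.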

And during the 2.5 development phase I played with "per fd" limits on 
memory per file, which solved the problem for me. I had some educational 
discussions with several developers, but this is one of those things 
which has limited usefulness and development was very busy at that time 
with things deemed more important, so I never tried to get it ready for 
inclusion in the kernel.

-- 
Bill Davidsen <davidsen@tmr.com>
  "We can't solve today's problems by using the same thinking we
   used in creating them." - Einstein

--

From: Ben Gamari
Date: Friday, April 9, 2010 - 8:56 am

Indeed. I have found this to be a persistent problem and I really wish there
were more interest in debugging this. I have tried bringing the community's
resources to bear on this issue several[1] times, and each time we fail to
get enough of the right eyes looking at it or developer interest simply vanishes.

I've started putting together a list[2] of pertinent threads/patches/bugs/data
in hopes that this will lower the energy barrier of getting up to speed on
this issue. Hopefully this will help.

Cheers,

- Ben


[1] https://bugzilla.kernel.org/show_bug.cgi?id=12309
[2] http://goldnerlab.physics.umass.edu/wiki/BenGamari/IoWaitLatency

--
