[PATCH 2/3] writeback: Record if the congestion was unnecessary

Previous thread: [PATCH 1/3] writeback: Account for time spent congestion_waited by Mel Gorman on Thursday, August 26, 2010 - 8:14 am. (3 messages)

Next thread: [PATCH] i915: build error by Mark Stanovich on Thursday, August 26, 2010 - 8:38 am. (1 message)
From: Mel Gorman
Date: Thursday, August 26, 2010 - 8:14 am

If congestion_wait() is called when there is no congestion, the caller
will wait for the full timeout. This can cause unreasonable and
unnecessary stalls. There are a number of potential modifications that
could be made to wake sleepers but this patch measures how serious the
problem is. It keeps count of how many congested BDIs there are. If
congestion_wait() is called with no BDIs congested, the tracepoint will
record that the wait was unnecessary.

Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
 include/trace/events/writeback.h |   11 ++++++++---
 mm/backing-dev.c                 |   15 ++++++++++++---
 2 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e3bee61..03bb04b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
 
 TRACE_EVENT(writeback_congest_waited,
 
-	TP_PROTO(unsigned int usec_delayed),
+	TP_PROTO(unsigned int usec_delayed, bool unnecessary),
 
-	TP_ARGS(usec_delayed),
+	TP_ARGS(usec_delayed, unnecessary),
 
 	TP_STRUCT__entry(
 		__field(	unsigned int,	usec_delayed	)
+		__field(	unsigned int,	unnecessary	)
 	),
 
 	TP_fast_assign(
 		__entry->usec_delayed	= usec_delayed;
+		__entry->unnecessary	= unnecessary;
 	),
 
-	TP_printk("usec_delayed=%u", __entry->usec_delayed)
+	TP_printk("usec_delayed=%u unnecessary=%d",
+		__entry->usec_delayed,
+		__entry->unnecessary
+	)
 );
 
 #endif /* _TRACE_WRITEBACK_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7ae33e2..a49167f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
 		__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
 	};
+static atomic_t nr_bdi_congested[2];
 
 void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
 {
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct ...
From: Minchan Kim
Date: Thursday, August 26, 2010 - 10:35 am

Hmm.. Now congestion_wait's semantics "wait for _a_ backing_dev to become uncongested"
But this seems to consider whole backing dev. Is your intention? or Am I missing now?

-- 
Kind regards,
Minchan Kim
--

From: Mel Gorman
Date: Thursday, August 26, 2010 - 10:41 am

Not whole backing devs, all backing devs. This is intentional.

If congestion_wait() is called with 0 BDIs congested, we sleep the full timeout
because a wakeup event will not occur - this is a bad scenario. To know if
0 BDIs were congested, one could either walk all the BDIs checking their
status or maintain a counter like nr_bdi_congested which is what I decided on.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Johannes Weiner
Date: Thursday, August 26, 2010 - 11:29 am

I am not convinced that unnecessary is the right word.  On a workload
without any IO (i.e. no congestion_wait() necessary, ever), I noticed
the VM regressing both in time and in reclaiming the right pages when
simply removing congestion_wait() from the direct reclaim paths (the
one in __alloc_pages_slowpath and the other one in
do_try_to_free_pages).

So just being stupid and waiting for the timeout in direct reclaim
while kswapd can make progress seemed to do a better job for that
load.

I can not exactly pinpoint the reason for that behaviour, it would be
nice if somebody had an idea.

So personally I think it's a good idea to get an insight on the use of
congestion_wait() [patch 1] but I don't agree with changing its
behaviour just yet, or judging its usefulness solely on whether it
correctly waits for bdi congestion.
--

From: Mel Gorman
Date: Thursday, August 26, 2010 - 1:31 pm

There is a possibility that the behaviour in that case was due to flusher
threads doing the writes rather than direct reclaim queueing pages for IO
in an inefficient manner. So the stall is stupid but happens to work out

Unfortunately, I strongly suspect that some of the desktop stalls seen during
IO (one of which involved no writes) were due to calling congestion_wait
and waiting the full timeout where no writes are going on.

It gets potentially worse too. Lets say we have a system with many BDIs of
different speed - e.g. SSD on one end of the spectrum and USB flash drive
on the other. The congestion for writes could be on the USB flash drive but
due to low memory, the allocator, direct reclaimers and kswapd go to sleep
periodically on congestion_wait for USB even though the bulk of the pages
need reclaiming are backed by an SSD.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Shaohua Li
Date: Thursday, August 26, 2010 - 7:12 pm

If this is the case, we already have queue congested. removing
congestion_wait() might cause regression but either your change or the
congestion_wait_check() should not have the regression, as we do check
if the bdi is congested.

Thanks,
Shaohua

--

From: Mel Gorman
Date: Friday, August 27, 2010 - 2:20 am

Not necessarily. The fact that with the full series we sometimes call
cond_sched() indicating that there was no congestion when congestion_wait()
was called proves that. We might have some IO on the queue but it's not
congested. Also, there is no guarantee that the congested queue is one we
care about. If we are reclaiming main memory and the congested queue is a

What congestion_wait_check()? If there is no congestion and no writes,
congestion is the wrong event to sleep on.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Johannes Weiner
Date: Friday, August 27, 2010 - 1:16 am

The workload was accessing a large sparse-file through mmap, so there
wasn't much IO in the first place.

And I experimented on the latest -mmotm where direct reclaim wouldn't

Oh, I am in full agreement here!  Removing those congestion_wait() as
described above showed a reduction in peak latency.  The dilemma is
only that it increased the overall walltime of the load.

And the scanning behaviour deteriorated, as in having increased
scanning pressure on other zones than the unpatched kernel did.

So I think very much that we need a fix.  congestion_wait() causes
stalls and relying on random sleeps for the current reclaim behaviour
can not be the solution, at all.

I just don't think we can remove it based on the argument that it
doesn't do what it is supposed to do, when it does other things right
that it is not supposed to do ;-)
--

From: Mel Gorman
Date: Friday, August 27, 2010 - 2:24 am

Then waiting on congestion was the totally wrong thing to do. We were
effectively calling sleep(HZ/10) and magically this was helping in some
undefined manner. Do you know *which* called of congestion_wait() was

What were the results? I'm preparing a full series incorporating a

Do you know why because leaving in random sleeps() hardly seems to be

Probably because it was scanning more but not finding what it needed.
There is a condition other than congestion it is having trouble with. In
some respects, I think if we change congestion_wait() as I propose,
we may see a case where CPU usage is higher because it's now

We are not removing it, we are just stopping it going to sleep for
stupid reasons. If we find that wall time is increasing as a result, we
have a path to figuring out what the real underlying problem is instead
of sweeping it under the rug.

congestion_wait() is causing other problems such as Christian's bug of
massive IO regressions because it was sleeping when it shouldn't.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Johannes Weiner
Date: Monday, August 30, 2010 - 6:19 am

Removing congestion_wait() in do_try_to_free_pages() definitely
worsens reclaim behaviour for this workload:

1. wallclock time of the testrun increases by 11%

2. the scanners do a worse job and go for the wrong zone:

-pgalloc_dma 79597
-pgalloc_dma32 134465902
+pgalloc_dma 297089
+pgalloc_dma32 134247237

-pgsteal_dma 77501
-pgsteal_dma32 133939446
+pgsteal_dma 294998
+pgsteal_dma32 133722312

-pgscan_kswapd_dma 145897
-pgscan_kswapd_dma32 266141381
+pgscan_kswapd_dma 287981
+pgscan_kswapd_dma32 186647637

-pgscan_direct_dma 9666
-pgscan_direct_dma32 1758655
+pgscan_direct_dma 302495
+pgscan_direct_dma32 80947179

-pageoutrun 1768531
-allocstall 614
+pageoutrun 1927451
+allocstall 8566

I attached the full vmstat contents below.  Also the test program,



Well, for that testcase it is in effect the same as a removal as
there's never congestion.

But again: I agree with your changes per-se, I just don't think they
should get merged as long as they knowingly catalyze a problem that
has yet to be identified.
From: Mel Gorman
Date: Tuesday, August 31, 2010 - 8:02 am

Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is

Ok, well there was some significant feedback on why wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback. I
also integrated your workload into my testsuite (btw, because there is no
license the script has to download it from a google archive. I might get
back to you on licensing this so it can be made a permanent part of the suite).

These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of different series posted recently. The nocongest column is an
unreleased patch that has congestion_wait() and wait_iff_congested() that
only goes to sleep if there is real congestion or a lot of writeback going
on. Rather than worrying about the patch contents for now, lets consider
the results for just your workload.

The report is in 4 parts. The first is the vmstat counter differences as
a result of running your test. The exact interpretation of good and bad
here is open to interpretation. The second part is based on the vmscan
tracepoints. The third part is based on the congestion tracepoints and
the final part reports CPU usage and elapsed time.

MICRO
                                      traceonly-v1r4          nocongest-v1r4           lowlumpy-v1r4     nodirect-v1r4
pgalloc_dma                      89409.00 (   0.00%)     47750.00 ( -87.24%)     47430.00 ( -88.51%)     47246.00 ( -89.24%)
pgalloc_dma32                101407571.00 (   0.00%) 101518722.00 (   0.11%) 101502059.00 (   0.09%) 101511868.00 (   0.10%)
pgalloc_normal                       0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)         0.00 (   0.00%)
pgsteal_dma                      74529.00 (   0.00%)     43386.00 ( -71.78%)     43213.00 ( -72.47%) ...
From: Johannes Weiner
Date: Thursday, September 2, 2010 - 8:49 am

Oh, certainly, feel free to add the following file header:

/*
 * Copyright (c) 2010 Johannes Weiner
 * Code released under the GNU GPLv2.

Agreed.  Also the dma zone is less allocated from, which I suppose is
only the second zone in the zonelist, after dma32.  So allocations

I do not reach the same conclusion here.  More pages are scanned
overall on the same workload, so we _are_ doing more work.

The result for this single-threaded workload improves because CPU-time
is not the issue when the only runnable process needs memory.

But we are in fact becoming less efficient at reclaim, so it would
make sense to also test how this interacts with other processes that

I think one interesting piece that is missing is whether the
scanned/reclaimed ratio went up.  Do you have the kswapd_steal counter
value still available to calculate that ratio?

A "good" result would be, IMO, if that ratio did not get worse, while
at the same time having reclaim perform better due to reduced sleeps.

Another aspect to look out for is increased overreclaim: the total
number of allocations went up (I suppose the sum of reclaimed pages as
well), which means reclaim became more eager and created more
throwout-refault churn.  Those were refaults from a sparse-file, but a
slow backing dev will have more impact on wall clock time.
--

From: Mel Gorman
Date: Thursday, September 2, 2010 - 11:28 am

We did more scanning but we finished sooner so the processor would sleep
sooner, consume less power etc. How good or bad this is depends on what
we mean by work but I am of the opinion that finishing sooner is better

This is indeed a problem. Measuring multiple workloads and drawing
sensible conclusions is going to be a real pain. I'd at least try
running multiple instances of your workload first and seeing how it

Knowing the scanning/reclaimed ration would be nice. The tracepoints as-is
are orientiated around scanning/write ratio and assumed discarding clean
pages was not that big a deal. I did capture vmstat before and after the
test which is not as fine-grained as the tracepoints but it'll do

Vanilla scan:	203050870
Vanilla steal:	102099263
Nocongest scan:	203117496
Nocongest stal:	102114301

Vanilla ratio:		0.5028
Nocongest ratio:	0.5027

So here is an interesting point. While direct scanning rates went up,
overall scanning rates were approximately the same. All that changed
really was who was doing the scanning which is a timing issue. The
reclaim to scanning ratio were *very* close together for the vanilla and

So, the ratio did not get worse. The scanning rates were the same although
who was scanning changed and I'm undecided as to whether that is significant

I can capture the reclaim rates handy enough - just needs another trace
point to keep everything in step. Refaults would be trickier so I would
have a tendency to rely on wall time to tell me if the wrong reclaim
decisions were being made.

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Minchan Kim
Date: Sunday, August 29, 2010 - 9:03 am

Hi, Hannes. 


Not exactly same your experiment but I had a simillar experince. 
I had a experiement about swapout. System has lots of anon pages but
almost no file pages and it already started to swap out. It means
system have no memory. In this case, I forked new process which mmap
some MB pages and touch the pages. It means VM should swapout some MB page 
for the process. And I measured the time until completing touching the pages. 

Sometime it's fast, sometime it's slow. time gap is almost two. 
Interesting thing is when it is fast, many of pages are reclaimed by kswapd. 
Ah.. I used swap to ramdisk and reserve the swap pages by touching before 

I just thought the cause is direct reclaim just reclaims by 32 pages 
but kswapd could reclaim many pages by batch. But i didn't look at it any more
due to busy. Does it make sense?

-- 
Kind regards,
Minchan Kim
--

Previous thread: [PATCH 1/3] writeback: Account for time spent congestion_waited by Mel Gorman on Thursday, August 26, 2010 - 8:14 am. (3 messages)

Next thread: [PATCH] i915: build error by Mark Stanovich on Thursday, August 26, 2010 - 8:38 am. (1 message)