If congestion_wait() is called when there is no congestion, the caller
will wait for the full timeout. This can cause unreasonable and
unnecessary stalls. There are a number of potential modifications that
could be made to wake sleepers but this patch measures how serious the
problem is. It keeps count of how many congested BDIs there are. If
congestion_wait() is called with no BDIs congested, the tracepoint will
record that the wait was unnecessary.
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
---
include/trace/events/writeback.h | 11 ++++++++---
mm/backing-dev.c | 15 ++++++++++++---
2 files changed, 20 insertions(+), 6 deletions(-)
diff --git a/include/trace/events/writeback.h b/include/trace/events/writeback.h
index e3bee61..03bb04b 100644
--- a/include/trace/events/writeback.h
+++ b/include/trace/events/writeback.h
@@ -155,19 +155,24 @@ DEFINE_WBC_EVENT(wbc_writepage);
TRACE_EVENT(writeback_congest_waited,
- TP_PROTO(unsigned int usec_delayed),
+ TP_PROTO(unsigned int usec_delayed, bool unnecessary),
- TP_ARGS(usec_delayed),
+ TP_ARGS(usec_delayed, unnecessary),
TP_STRUCT__entry(
__field( unsigned int, usec_delayed )
+ __field( unsigned int, unnecessary )
),
TP_fast_assign(
__entry->usec_delayed = usec_delayed;
+ __entry->unnecessary = unnecessary;
),
- TP_printk("usec_delayed=%u", __entry->usec_delayed)
+ TP_printk("usec_delayed=%u unnecessary=%d",
+ __entry->usec_delayed,
+ __entry->unnecessary
+ )
);
#endif /* _TRACE_WRITEBACK_H */
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 7ae33e2..a49167f 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -724,6 +724,7 @@ static wait_queue_head_t congestion_wqh[2] = {
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[0]),
__WAIT_QUEUE_HEAD_INITIALIZER(congestion_wqh[1])
};
+static atomic_t nr_bdi_congested[2];
void clear_bdi_congested(struct backing_dev_info *bdi, int sync)
{
@@ -731,7 +732,8 @@ void clear_bdi_congested(struct ...Hmm.. Now congestion_wait's semantics "wait for _a_ backing_dev to become uncongested" But this seems to consider whole backing dev. Is your intention? or Am I missing now? -- Kind regards, Minchan Kim --
Not whole backing devs, all backing devs. This is intentional. If congestion_wait() is called with 0 BDIs congested, we sleep the full timeout because a wakeup event will not occur - this is a bad scenario. To know if 0 BDIs were congested, one could either walk all the BDIs checking their status or maintain a counter like nr_bdi_congested which is what I decided on. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
I am not convinced that unnecessary is the right word. On a workload without any IO (i.e. no congestion_wait() necessary, ever), I noticed the VM regressing both in time and in reclaiming the right pages when simply removing congestion_wait() from the direct reclaim paths (the one in __alloc_pages_slowpath and the other one in do_try_to_free_pages). So just being stupid and waiting for the timeout in direct reclaim while kswapd can make progress seemed to do a better job for that load. I can not exactly pinpoint the reason for that behaviour, it would be nice if somebody had an idea. So personally I think it's a good idea to get an insight on the use of congestion_wait() [patch 1] but I don't agree with changing its behaviour just yet, or judging its usefulness solely on whether it correctly waits for bdi congestion. --
There is a possibility that the behaviour in that case was due to flusher threads doing the writes rather than direct reclaim queueing pages for IO in an inefficient manner. So the stall is stupid but happens to work out Unfortunately, I strongly suspect that some of the desktop stalls seen during IO (one of which involved no writes) were due to calling congestion_wait and waiting the full timeout where no writes are going on. It gets potentially worse too. Lets say we have a system with many BDIs of different speed - e.g. SSD on one end of the spectrum and USB flash drive on the other. The congestion for writes could be on the USB flash drive but due to low memory, the allocator, direct reclaimers and kswapd go to sleep periodically on congestion_wait for USB even though the bulk of the pages need reclaiming are backed by an SSD. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
If this is the case, we already have queue congested. removing congestion_wait() might cause regression but either your change or the congestion_wait_check() should not have the regression, as we do check if the bdi is congested. Thanks, Shaohua --
Not necessarily. The fact that with the full series we sometimes call cond_sched() indicating that there was no congestion when congestion_wait() was called proves that. We might have some IO on the queue but it's not congested. Also, there is no guarantee that the congested queue is one we care about. If we are reclaiming main memory and the congested queue is a What congestion_wait_check()? If there is no congestion and no writes, congestion is the wrong event to sleep on. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
The workload was accessing a large sparse-file through mmap, so there wasn't much IO in the first place. And I experimented on the latest -mmotm where direct reclaim wouldn't Oh, I am in full agreement here! Removing those congestion_wait() as described above showed a reduction in peak latency. The dilemma is only that it increased the overall walltime of the load. And the scanning behaviour deteriorated, as in having increased scanning pressure on other zones than the unpatched kernel did. So I think very much that we need a fix. congestion_wait() causes stalls and relying on random sleeps for the current reclaim behaviour can not be the solution, at all. I just don't think we can remove it based on the argument that it doesn't do what it is supposed to do, when it does other things right that it is not supposed to do ;-) --
Then waiting on congestion was the totally wrong thing to do. We were effectively calling sleep(HZ/10) and magically this was helping in some undefined manner. Do you know *which* called of congestion_wait() was What were the results? I'm preparing a full series incorporating a Do you know why because leaving in random sleeps() hardly seems to be Probably because it was scanning more but not finding what it needed. There is a condition other than congestion it is having trouble with. In some respects, I think if we change congestion_wait() as I propose, we may see a case where CPU usage is higher because it's now We are not removing it, we are just stopping it going to sleep for stupid reasons. If we find that wall time is increasing as a result, we have a path to figuring out what the real underlying problem is instead of sweeping it under the rug. congestion_wait() is causing other problems such as Christian's bug of massive IO regressions because it was sleeping when it shouldn't. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Removing congestion_wait() in do_try_to_free_pages() definitely worsens reclaim behaviour for this workload: 1. wallclock time of the testrun increases by 11% 2. the scanners do a worse job and go for the wrong zone: -pgalloc_dma 79597 -pgalloc_dma32 134465902 +pgalloc_dma 297089 +pgalloc_dma32 134247237 -pgsteal_dma 77501 -pgsteal_dma32 133939446 +pgsteal_dma 294998 +pgsteal_dma32 133722312 -pgscan_kswapd_dma 145897 -pgscan_kswapd_dma32 266141381 +pgscan_kswapd_dma 287981 +pgscan_kswapd_dma32 186647637 -pgscan_direct_dma 9666 -pgscan_direct_dma32 1758655 +pgscan_direct_dma 302495 +pgscan_direct_dma32 80947179 -pageoutrun 1768531 -allocstall 614 +pageoutrun 1927451 +allocstall 8566 I attached the full vmstat contents below. Also the test program, Well, for that testcase it is in effect the same as a removal as there's never congestion. But again: I agree with your changes per-se, I just don't think they should get merged as long as they knowingly catalyze a problem that has yet to be identified.
Excellent stuff. I didn't look at your vmstat output because it was for
an old patch and you have already highlighted the problems related to
the workload. Chances are, I'd just reach the same conclusions. What is
Ok, well there was some significant feedback on why wholesale changing of
congestion_wait() reached too far and I've incorporated that feedback. I
also integrated your workload into my testsuite (btw, because there is no
license the script has to download it from a google archive. I might get
back to you on licensing this so it can be made a permanent part of the suite).
These are the results just for your workload on the only machine I had
available with a lot of disk. There are a bunch of kernels because I'm testing
a superset of different series posted recently. The nocongest column is an
unreleased patch that has congestion_wait() and wait_iff_congested() that
only goes to sleep if there is real congestion or a lot of writeback going
on. Rather than worrying about the patch contents for now, lets consider
the results for just your workload.
The report is in 4 parts. The first is the vmstat counter differences as
a result of running your test. The exact interpretation of good and bad
here is open to interpretation. The second part is based on the vmscan
tracepoints. The third part is based on the congestion tracepoints and
the final part reports CPU usage and elapsed time.
MICRO
traceonly-v1r4 nocongest-v1r4 lowlumpy-v1r4 nodirect-v1r4
pgalloc_dma 89409.00 ( 0.00%) 47750.00 ( -87.24%) 47430.00 ( -88.51%) 47246.00 ( -89.24%)
pgalloc_dma32 101407571.00 ( 0.00%) 101518722.00 ( 0.11%) 101502059.00 ( 0.09%) 101511868.00 ( 0.10%)
pgalloc_normal 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%) 0.00 ( 0.00%)
pgsteal_dma 74529.00 ( 0.00%) 43386.00 ( -71.78%) 43213.00 ( -72.47%) ...Oh, certainly, feel free to add the following file header: /* * Copyright (c) 2010 Johannes Weiner * Code released under the GNU GPLv2. Agreed. Also the dma zone is less allocated from, which I suppose is only the second zone in the zonelist, after dma32. So allocations I do not reach the same conclusion here. More pages are scanned overall on the same workload, so we _are_ doing more work. The result for this single-threaded workload improves because CPU-time is not the issue when the only runnable process needs memory. But we are in fact becoming less efficient at reclaim, so it would make sense to also test how this interacts with other processes that I think one interesting piece that is missing is whether the scanned/reclaimed ratio went up. Do you have the kswapd_steal counter value still available to calculate that ratio? A "good" result would be, IMO, if that ratio did not get worse, while at the same time having reclaim perform better due to reduced sleeps. Another aspect to look out for is increased overreclaim: the total number of allocations went up (I suppose the sum of reclaimed pages as well), which means reclaim became more eager and created more throwout-refault churn. Those were refaults from a sparse-file, but a slow backing dev will have more impact on wall clock time. --
We did more scanning but we finished sooner so the processor would sleep sooner, consume less power etc. How good or bad this is depends on what we mean by work but I am of the opinion that finishing sooner is better This is indeed a problem. Measuring multiple workloads and drawing sensible conclusions is going to be a real pain. I'd at least try running multiple instances of your workload first and seeing how it Knowing the scanning/reclaimed ration would be nice. The tracepoints as-is are orientiated around scanning/write ratio and assumed discarding clean pages was not that big a deal. I did capture vmstat before and after the test which is not as fine-grained as the tracepoints but it'll do Vanilla scan: 203050870 Vanilla steal: 102099263 Nocongest scan: 203117496 Nocongest stal: 102114301 Vanilla ratio: 0.5028 Nocongest ratio: 0.5027 So here is an interesting point. While direct scanning rates went up, overall scanning rates were approximately the same. All that changed really was who was doing the scanning which is a timing issue. The reclaim to scanning ratio were *very* close together for the vanilla and So, the ratio did not get worse. The scanning rates were the same although who was scanning changed and I'm undecided as to whether that is significant I can capture the reclaim rates handy enough - just needs another trace point to keep everything in step. Refaults would be trickier so I would have a tendency to rely on wall time to tell me if the wrong reclaim decisions were being made. -- Mel Gorman Part-time Phd Student Linux Technology Center University of Limerick IBM Dublin Software Lab --
Hi, Hannes. Not exactly same your experiment but I had a simillar experince. I had a experiement about swapout. System has lots of anon pages but almost no file pages and it already started to swap out. It means system have no memory. In this case, I forked new process which mmap some MB pages and touch the pages. It means VM should swapout some MB page for the process. And I measured the time until completing touching the pages. Sometime it's fast, sometime it's slow. time gap is almost two. Interesting thing is when it is fast, many of pages are reclaimed by kswapd. Ah.. I used swap to ramdisk and reserve the swap pages by touching before I just thought the cause is direct reclaim just reclaims by 32 pages but kswapd could reclaim many pages by batch. But i didn't look at it any more due to busy. Does it make sense? -- Kind regards, Minchan Kim --
