Re: Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Mel Gorman
Date: Monday, March 22, 2010 - 4:50 pm

120+ kernels and a lot of hurt later;

Short summary - The number of times kswapd and the page allocator have been
	calling congestion_wait and the length of time they spend in there
	have been increasing since 2.6.29. Oddly, it has little to do
	with the page allocator itself.

Test scenario
=============
X86-64 machine 1 socket 4 cores
4 consumer-grade disks connected as RAID-0 - software raid. RAID controller
	on-board and a piece of crap, and a decent RAID card could blow
	the budget.
Booted mem=256 to ensure it is fully IO-bound and match closer to what
	Christian was doing

At each test, the disks are partitioned, the raid arrays created and an
ext2 filesystem created. iozone sequential read/write tests are run with
increasing number of processes up to 64. Each test creates 8G of files. i.e.
1 process = 8G. 2 processes = 2x4G etc

	iozone -s 8388608 -t 1 -r 64 -i 0 -i 1
	iozone -s 4194304 -t 2 -r 64 -i 0 -i 1
	etc.
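
The per-process sizes above follow from splitting a fixed 8G total evenly across the processes. A minimal sketch that generates the command lines, assuming the process count doubles from 1 up to 64 (only the first two invocations are shown in the message, so the doubling is an assumption):

```python
TOTAL_KB = 8 * 1024 * 1024  # 8G expressed in KB, matching "-s 8388608" above


def iozone_command(processes, total_kb=TOTAL_KB, record_kb=64):
    """Build one iozone command line for the given number of processes."""
    per_process_kb = total_kb // processes
    return f"iozone -s {per_process_kb} -t {processes} -r {record_kb} -i 0 -i 1"


def iozone_commands(max_processes=64):
    """Yield commands for 1, 2, 4, ... processes (doubling is an assumption)."""
    processes = 1
    while processes <= max_processes:
        yield iozone_command(processes)
        processes *= 2


if __name__ == "__main__":
    for command in iozone_commands():
        print(command)
```

The first two generated lines are the two commands quoted above; the rest only illustrate how the file size shrinks as the process count grows.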

Metrics
=======

Each kernel was instrumented to collect the following stats

	pg-Stall	Page allocator stalled calling congestion_wait
	pg-Wait		The amount of time spent in congestion_wait
	pg-Rclm		Pages reclaimed by direct reclaim
	ksd-stall	balance_pgdat() (ie kswapd) stalled on congestion_wait
	ksd-wait	Time spent by balance_pgdat in congestion_wait

Large differences in this do not necessarily show up in iozone because the
disks are so slow that the stalls are a tiny percentage overall. However, in
the event that there are many disks, it might be a greater problem. I believe
Christian is hitting a corner case where small delays trigger a much larger
stall.

Why The Increases
=================

The big problem here is that there was no one change. Instead, it has been
a steady build-up of a number of problems. The ones I identified are in the
block IO, CFQ IO scheduler, tty and page reclaim. Some of these are fixed
but need backporting and others I expect are a major surprise. Whether they
are worth backporting or not heavily depends on ...
From: Christian Ehrhardt
Date: Tuesday, March 23, 2010 - 7:35 am

Thanks for all your effort in searching the real cause behind 


While your tty&evict patch might fix something as seen by your numbers, 
it unfortunately doesn't affect my big throughput loss.

Again the scenario was 4, 8 and 16 threads iozone sequential read with
2GB files and one disk per process, running on an s390x machine with 4
cpus and 256m.
My table shows the throughput deviation to plain 2.6.32 git in percent.

percentage                       4thr     8thr    16thr
2.6.32                          0.00%    0.00%    0.00%
2.6.32.10 (stable)              4.44%    7.97%    4.11%
2.6.32.10-ttyfix-revertevict    3.33%    6.64%    5.07%
2.6.33                          5.33%   -2.82%  -10.87%
2.6.33-ttyfix-revertevict       3.33%   -3.32%  -10.51%
2.6.32-watermarkwait           40.00%   58.47%   42.03%
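
For reference, the percentages in this table are relative deviations from the plain 2.6.32 baseline. A small sketch of that calculation follows; the raw throughput numbers are not part of this message, so the inputs in the example are hypothetical and only chosen to show the sign convention:

```python
def deviation_percent(baseline, measured):
    """Relative throughput change versus the baseline, in percent."""
    if baseline <= 0:
        raise ValueError("baseline throughput must be positive")
    return (measured - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical throughputs in MB/s, for illustration only.
    print(f"{deviation_percent(100.0, 140.0):+.2f}%")
    print(f"{deviation_percent(100.0, 90.0):+.2f}%")
```

A positive value means the kernel is faster than plain 2.6.32, a negative value means it is slower.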

In terms of throughput for my load your patch doesn't change anything 
significantly above the noise level of the test case (which is around 
~1%). The fix probably even has a slight performance decrease in low 
thread cases.

For better comparison I added a 2.6.32 run with your watermark wait 
patch which is still the only one fixing the issue.

That said I'd still love to see watermark wait getting accepted :-)

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
--

From: Corrado Zoccolo
Date: Tuesday, March 23, 2010 - 2:35 pm

Hi Mel,

The major changes in I/O scheduling behaviour are:
* buffered writes:
  * before we could schedule a few writes, then interrupt them to do
some reads, and then go back to writes; now we guarantee some
uninterruptible time slice for writes, but the delay between two
slices is increased. The total write throughput averaged over a time
window larger than 300ms should be comparable, or even better with
2.6.33. Note that the commit you cite has introduced a bug regarding
write throughput on NCQ disks that was later fixed by 1efe8fe1, merged
before 2.6.33 (this may lead to confusing bisection results).
* reads (and sync writes):
  * before, we serviced a single process for 100ms, then switched to
another, and so on.
  * after, we go round robin for random requests (they get a unified
time slice, like buffered writes do), and we have consecutive time
slices for sequential requests, but the length of the slice is reduced
when the number of concurrent processes doing I/O increases.

This means that with 16 processes doing sequential I/O on the same
disk, before you were switching between processes every 100ms, and now
every 32ms. The old behaviour can be brought back by setting
/sys/block/sd*/queue/iosched/low_latency to 0.
For random I/O, the situation (going round robin, it will translate to

If my intuition is right that switching between processes too often is
detrimental when you have memory pressure (higher probability of needing
to re-page-in some of the pages that were just discarded), I suggest
trying setting low_latency to 0, and maybe increasing slice_sync
(to give more slice to a single process before switching to another),
slice_async (to give more uninterruptible time to buffered writes) and
slice_async_rq (to raise the limit on consecutive write requests that can
be sent to disk).
While this would normally lead to a bad user experience on a system
with plenty of memory, it should keep things acceptable when paging in
/ swapping / dirty page writeback is overwhelming.
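
A minimal sketch of how the tunables named above could be collected per device before applying them. It only builds the sysfs paths and prints the equivalent shell commands; nothing is written. The device name and the slice values are placeholders, not values recommended in this thread:

```python
from pathlib import PurePosixPath


def cfq_tunable_writes(device, settings):
    """Return (path, value) pairs for CFQ tunables of one block device.

    Nothing is written here; the caller decides whether to apply them.
    """
    base = PurePosixPath("/sys/block") / device / "queue" / "iosched"
    return [(str(base / name), str(value)) for name, value in settings.items()]


if __name__ == "__main__":
    # low_latency=0 restores the old behaviour; the slice values below are
    # placeholders, not recommendations from the thread.
    settings = {
        "low_latency": 0,
        "slice_sync": 200,
        "slice_async": 80,
        "slice_async_rq": 4,
    }
    for path, value in cfq_tunable_writes("sdb", settings):
        print(f"echo {value} > {path}")
```

Keeping this as a dry run makes it easy to review the changes for each disk of the array before touching a test machine.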

From: Mel Gorman
Date: Wednesday, March 24, 2010 - 4:48 am

This is true. The CFQ and block IO changes in that window are almost
impossible to properly bisect and isolate individual changes. There were
multiple dependent patches that modified each other's changes. It's unclear
if this modification can even be isolated although your suggestion below


At the moment, I'm not testing random IO so it shouldn't be a factor in

Christian, would you be able to follow the same instructions and see can
you make a difference to your test? It is known for your situation that
memory is unusually low for the size of your workload so it's a possibility.


-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Corrado Zoccolo
Date: Wednesday, March 24, 2010 - 5:56 am

Another parameter that is worth tweaking in this case is the
readahead size. If readahead size is too large for the available
memory, we might be reading, then discarding, and then reading again
the same pages.

I would also like to see some iostat output (iostat -kx 5 >
iostat.log) during the experiment run, to better understand what's
happening.

Thanks,
--

From: Rik van Riel
Date: Tuesday, March 23, 2010 - 3:29 pm

With that many disks, you can easily have dozens of megabytes
of data in flight to the disk at once.  That is a major
fraction of memory.

In fact, you might have all of the inactive file pages under

The patch helped IO tests with reasonable amounts of memory
available, because the VM can cache frequently used data
much more effectively.

This comes at the cost of caching less recently accessed
use-once data, which should not be an issue since the data

No real theories yet, just the observation that your revert
appears to be buggy (see below) and the possibility that your
test may have all of the inactive file pages under IO...


Your revert is buggy.  With this change, anonymous pages will
never get deactivated via shrink_list.
--

From: Mel Gorman
Date: Wednesday, March 24, 2010 - 7:50 am

That is easily possible. Note, I'm not maintaining that this workload
configuration is a good idea.

The background to this problem is Christian running a disk-intensive iozone
workload over many CPUs and disks with limited memory. It's already known
that if he added a small amount of extra memory, the problem went away.
The problem was a massive throughput regression and a bisect pinpointed
two patches (both mine) but neither makes sense. One altered the order pages
come back from lists but not availability and his hardware does no automatic
merging. A second does alter the availability of pages via the per-cpu lists
but reverting the behaviour didn't help.

The first fix to this was to replace congestion_wait with a waitqueue
that woke up processes if the watermarks were met. This fixed
Christian's problem but Andrew wants to pin the underlying cause.

I strongly suspect that evict-once behaves sensibly when memory is ample

Possibly. The tests have a write and a read phase but I wasn't
collecting the data with sufficient granularity to see which of the

Indeed. With or without evict-once, I'd have an expectation of all the

Bah. I had the initial revert right and screwed up reverting from
2.6.32.10 on. I'm rerunning the tests. Is this right?

-       if (is_active_lru(lru)) {
-               if (inactive_list_is_low(zone, sc, file))
-                   shrink_active_list(nr_to_scan, zone, sc, priority, file);
+       if (is_active_lru(lru)) {
+               shrink_active_list(nr_to_scan, zone, sc, priority, file);

I'm rerunning the revertevict patches at the moment. When they complete,
I'll experiment with dirty limits. Any suggested values or will I just
increase it by some arbitrary amount and see what falls out? e.g.

/me slaps self

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Christian Ehrhardt
Date: Monday, April 19, 2010 - 5:22 am

Sorry for replying that late, but after digging through another pile of tasks I'm happy to come back to this issue and I'll try to answer all open questions.
Fortunately I'm also able to add a few new insights that might resurrect this discussion^^

For the requested CFQ scheduler tuning: it's deadline that is in use here :-)
So I can't apply all that. But in the past I was already able to show that all the "slowdown" occurs above the block device layer (read back through our threads if interested in the details). But eventually that leaves all lower layer tuning out of the critical zone.

Corrado also asked for iostat data, due to the reason explained above (issue above BDL) it doesn't contain anything much useful as expected.
So I'll just add a one liner of good/bad case to show that things like req-sz etc are the same, but just slower.
This "being slower" is caused by the request arriving in the BDL at a lower rate - caused by our beloved full timeouts in congestion_wait.

Device:         rrqm/s   wrqm/s     r/s     w/s     rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
bad sdb         0.00     0.00    154.50    0.00  70144.00     0.00   908.01     0.62    4.05   2.72  42.00
good sdb        0.00     0.00    270.50    0.00 122624.00     0.00   906.65     1.32    4.94   2.92  79.00
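
The claim that the request size is unchanged while the rate drops can be cross-checked from the two lines above. A short sketch, using only the r/s and rkB/s columns and the fact that iostat reports avgrq-sz in 512-byte sectors:

```python
# Values copied from the two iostat lines above (device sdb).
BAD = {"r_per_s": 154.50, "rkB_per_s": 70144.00}
GOOD = {"r_per_s": 270.50, "rkB_per_s": 122624.00}

SECTOR_BYTES = 512  # iostat reports avgrq-sz in 512-byte sectors


def avg_request_sectors(sample):
    """Average request size in sectors, derived from rate and throughput."""
    bytes_per_request = sample["rkB_per_s"] * 1024 / sample["r_per_s"]
    return bytes_per_request / SECTOR_BYTES


def throughput_loss_percent(good, bad):
    """How much lower the bad case's read throughput is, in percent."""
    return (1 - bad["rkB_per_s"] / good["rkB_per_s"]) * 100


if __name__ == "__main__":
    print(f"bad  avgrq-sz: {avg_request_sectors(BAD):.2f} sectors")
    print(f"good avgrq-sz: {avg_request_sectors(GOOD):.2f} sectors")
    print(f"throughput loss: {throughput_loss_percent(GOOD, BAD):.1f}%")
```

The derived request sizes agree with the avgrq-sz column (about 908 and 907 sectors), so the difference between the two cases is the request rate, not the request shape.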


So now coming to the probably most critical part - the evict once discussion in this thread.
I'll try to explain what I found in the meanwhile - let me know what's unclear and I'll add data etc.

In the past we identified that "echo 3 > /proc/sys/vm/drop_caches" helps to improve the accuracy of the used testcase by lowering the noise from 5-8% to <1%.
Therefore I ran all tests and verifications with those drops.
In the meanwhile I unfortunately discovered that Mel's fix only helps for the cases when the caches are dropped.
Without the drops it seems to be bad all the time. So don't cast the patch away due to that discovery :-)

On the good side I was also able to analyze a few more things due to that insight - and it ...
From: Johannes Weiner
Date: Monday, April 19, 2010 - 2:44 pm

Ok, so I am the idiot that got quoted on 'the active set is not too big, so
buffer heads are not a problem when avoiding to scan it' in eternal history.

But the threshold inactive/active ratio for skipping active file pages is
actually 1:1.

The easiest 'fix' is probably to change that ratio, 2:1 (or even 3:1?) appears
to be a bit more natural anyway?  Below is a patch that changes it to 2:1.
Christian, can you check if it fixes your regression?

Additionally, we can always scan active file pages but only deactivate them
when the ratio is off and otherwise strip buffers of clean pages.

What do people think?

	Hannes

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f4ede99..a4aea76 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -898,7 +898,7 @@ int mem_cgroup_inactive_file_is_low(struct mem_cgroup *memcg)
 	inactive = mem_cgroup_get_local_zonestat(memcg, LRU_INACTIVE_FILE);
 	active = mem_cgroup_get_local_zonestat(memcg, LRU_ACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > inactive / 2);
 }
 
 unsigned long mem_cgroup_zone_nr_pages(struct mem_cgroup *memcg,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3ff3311..8f1a846 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1466,7 +1466,7 @@ static int inactive_file_is_low_global(struct zone *zone)
 	active = zone_page_state(zone, NR_ACTIVE_FILE);
 	inactive = zone_page_state(zone, NR_INACTIVE_FILE);
 
-	return (active > inactive);
+	return (active > inactive / 2);
 }
 
 /**

--

From: Christian Ehrhardt
Date: Tuesday, April 20, 2010 - 12:20 am

I'll check it out.
From the numbers I have up to now I know that the good->bad transition
for my case is somewhere between 30M/60M e.g. first and second write.
The ratio 2:1 will eat max 53M of my ~160M that gets split up.
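
The 53M figure follows from the patched check: the active list is skipped as long as active <= inactive / 2, so at most a third of the file pages can sit protected on the active list. A small model of that arithmetic (a sketch of the ratio logic only, not kernel code; the 160M total is the approximate figure quoted above):

```python
def inactive_file_is_low(active, inactive, ratio=2):
    """Model of the patched check: scan the active list once it exceeds
    inactive / ratio (integer division, as in the C code)."""
    return active > inactive // ratio


def max_protected(total, ratio):
    """Largest active size that is still skipped for an inactive:active
    ratio of ratio:1, i.e. active == inactive / ratio."""
    return total / (ratio + 1)


if __name__ == "__main__":
    total_mb = 160  # the "~160M that gets split up"
    for ratio in (1, 2, 3, 4):
        print(f"{ratio}:1 protects up to {max_protected(total_mb, ratio):.1f}M")
```

With these numbers the protected portion is 80M at 1:1, about 53M at 2:1, 40M at 3:1 and 32M at 4:1.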

That means setting the ratio to 2:1 or whatever else might help or not, 
but eventually there is just another setting of workload vs. memory 
constraints that would still be affected. Still I guess 3:1 (and I'll 

I think we need something that allows the system to forget its history
somewhen - be it 1:1 or x:1 - if the workload changes "long enough"(tm) 
it should eventually throw all old things out.
Like I described before many systems have different usage patterns when 
e.g. comparing day/night workload. So it is far from optimal if e.g. day 
write loads eat so much cache and never give it back for nightly huge 
read tasks or something similar.

Would your suggestion achieve that already?

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
--

From: Christian Ehrhardt
Date: Tuesday, April 20, 2010 - 1:54 am

For "my case" 2:1 is not enough, 3:1 almost and 4:1 fixes the issue.
Still as I mentioned before I think any value carved in stone can and 
will be bad to some use case - as 1:1 is for mine.

If we end up being unable to fix it internally by allowing the system to 
"forget" and eventually free old unused buffers at least somewhen - then 
we should neither implement it as 2:1 nor 3:1 nor whatsoever, but as 
userspace configurable e.g. /proc/sys/vm/active_inactive_ratio.

I hope your suggestion below or an extension to it will allow the kernel 
to free the buffers somewhen. Depending on how good/fast this solution 
[...]
-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
--

From: Johannes Weiner
Date: Tuesday, April 20, 2010 - 8:32 am

The idea is that it pans out on its own.  If the workload changes, new
pages get activated and when that set grows too large, we start shrinking
it again.

Of course, right now this unscanned set is way too large and we can end
up wasting up to 50% of usable page cache on false active pages.

A fixed ratio does not scale with varying workloads, obviously, but having
it at a safe level still seems like a good trade-off.

We can still do the optimization, and in the worst case the amount of
memory wasted on false active pages is small enough that it should leave
the system performant.

You have a rather extreme page cache load.  If 4:1 works for you, I think
this is a safe bet for now because we only frob the knobs into the
direction of earlier kernel behaviour.

We still have a nice amount of pages we do not need to scan regularly
(up to 50k file pages for a streaming IO load on a 1G machine).

	Hannes
--

From: Rik van Riel
Date: Tuesday, April 20, 2010 - 10:22 am

Thing is, changing workloads often change back.

Specifically, think of a desktop system that is doing
work for the user during the day and gets backed up
at night.

You do not want the backup to kick the working set
out of memory, because when the user returns in the
morning the desktop should come back quickly after
the screensaver is unlocked.

The big question is, what workload suffers from
having the inactive list at 50% of the page cache?

So far the only big problem we have seen is on a
very unbalanced virtual machine, with 256MB RAM
and 4 fast disks.  The disks simply have more IO
in flight at once than what fits in the inactive
list.

This is a very untypical situation, and we can
probably solve it by excluding the in-flight pages
from the active/inactive file calculation.
--

From: Christian Ehrhardt
Date: Tuesday, April 20, 2010 - 9:23 pm

IMHO it is fine to prevent that nightly backup job from not being
finished when the user arrives in the morning because we didn't give it
some more cache - and e.g. a 30 sec transition from/to both optimized
states is fine.
But eventually I guess the point is that both behaviors are reasonable 
to achieve - depending on the users needs.

What we could do is combine all our thoughts we had so far:
a) Rik could create an experimental patch that excludes the in flight pages
b) Johannes could create one for his suggestion to "always scan active 
file pages but only deactivate them when the ratio is off and otherwise 
strip buffers of clean pages"
c) I would extend the patch from Johannes setting the ratio of 
active/inactive pages to be a userspace tunable

a,b,a+b would then need to be tested if they achieve a better behavior.

c on the other hand would be a fine tunable to let administrators 
(knowing their workloads) or distributions (e.g. different values for 
Desktop/Server defaults) adapt their installations.


Did I get you right that this means the write case - explaining why it 
is building up buffers to the 50% max?

Note: It even uses up to 64 disks, with 1 disk per thread so e.g. 16 
threads => 16 disks.

For being "unbalanced" I'd like to mention that over the years I learned 
that sometimes, after a while, virtualized systems look that way without 
being intended - this happens by adding more and more guests and let 

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
--

From: Christian Ehrhardt
Date: Wednesday, April 21, 2010 - 12:35 am

A first revision of patch c is attached.
I tested assigning different percentages, so far e.g. 50 really behaves
like before and 25 protects ~42M Buffers in my example which would match 
the intended behavior - see patch for more details.
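
The attached patch is not reproduced in this archive, so the following is only a guess at the shape of a percentage-based check, written to show how a value of 50 reduces to the existing 1:1 comparison. The function names and the exact comparison are hypothetical:

```python
def active_list_is_scanned(active, inactive, protected_percent):
    """Hypothetical check: scan the active file list once it holds more
    than protected_percent of all file pages."""
    total = active + inactive
    return active * 100 > total * protected_percent


def protected_limit(total, protected_percent):
    """Upper bound on the protected active portion, same units as total."""
    return total * protected_percent / 100


if __name__ == "__main__":
    total_mb = 160  # rough file cache size used earlier in the thread
    for percent in (50, 25, 0):
        print(f"{percent}% protects up to {protected_limit(total_mb, percent):.0f}M")
```

Under this model 50 is equivalent to "active > inactive", and 25 caps the protected part at a quarter of the file pages, which for roughly 160M lands in the same range as the ~42M reported above.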

Checkpatch and some basic function tests went fine.

Thinking about it I wondered for what these Buffers are protected.
If the intention to save these buffers is for reuse with similar loads I 
wonder why I "need" three iozones to build up the 85M in my case.

Buffers start at ~0, after iozone run 1 they are at ~35, then after #2 
~65 and after run #3 ~85.
Shouldn't that either allocate 85M for the first directly in case that 
much is needed for a single run - or if not the second and third run 
just "reuse" the 35M Buffers from the first run still held?

Note - "1 iozone run" means "iozone ... -i 0" which sequentially writes 
and then rewrites a 2GB file on 16 disks in my current case.

looking forward especially to patch b as I'd really like to see a kernel 
able to win back these buffers if they are no more used for a longer 
period while still allowing to grow&protect them while needed.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
From: Rik van Riel
Date: Wednesday, April 21, 2010 - 6:19 am

I think you are confusing "buffer heads" with "buffers".

You can strip buffer heads off pages, but that is not
your problem.

"buffers" in /proc/meminfo stands for cached metadata,
eg. the filesystem journal, inodes, directories, etc...
Caching such metadata is legitimate, because it reduces
the number of disk seeks down the line.
--

From: Christian Ehrhardt
Date: Wednesday, April 21, 2010 - 11:21 pm

Trying to answer and consolidate all open parts of this thread down below.


Yeah I mixed that as well, thanks for clarification (Johannes wrote a 
similar response effectively kicking b) from the list of things we could 
do).

Regarding your question from thread reply#3
 > How on earth would a backup job benefit from cache?
 >
 > It only accesses each bit of data once, so caching the
 > to-be-backed-up data is a waste of memory.

If it is a low memory system with a lot of disks (like in my case) 
giving it more cache allows e.g. larger readaheads or less cache 
trashing - but it might be ok, as it might be rare case to hit all those 
constraints at once.
But as we discussed before on virtual servers it can happen from time to 
time due to balooning and much more disk attachments etc.



So definitely not the majority of cases around, but some corner cases 
here and there that would benefit at least from making the preserved 
ratio configurable if we don't find a good way to let it take the memory 
back without hurting the intended preservation functionality.

For that reason - how about the patch I posted yesterday (to consolidate 
this spread out thread I attach it here again)



And finally I still would like to understand why writing the same files 
three times increase the active file pages each time instead of reusing 
those already brought into memory by the first run.
To collect that last open thread as well I'll cite my own question here:

 > Thinking about it I wondered for what these Buffers are protected.
 > If the intention to save these buffers is for reuse with similar
 > loads I wonder why I "need" three iozones to build up the 85M in my case.
 >
 > Buffers start at ~0, after iozone run 1 they are at ~35, then after
 > #2 ~65 and after run #3 ~85.
 > Shouldn't that either allocate 85M for the first directly in case
 > that much is needed for a single run - or if not the second and third
 > run just "reuse" the 35M Buffers from the first run still ...
From: Christian Ehrhardt
Date: Monday, April 26, 2010 - 3:59 am

Subject: [PATCH][RFC] mm: make working set portion that is protected tunable v2

From: Christian Ehrhardt <ehrhardt@linux.vnet.ibm.com>

*updates in v2*
- use do_div

This patch creates a knob to help users that have workloads suffering from the
fixed 1:1 active inactive ratio brought into the kernel by "56e49d21 vmscan:
evict use-once pages first".
It also provides the tuning mechanisms for other users that want an even bigger
working set to be protected.

To be honest the best solution would be to allow a system not using the working
set to regain that memory *somewhen*, and therefore without drawbacks to the
scenarios it was implemented for e.g. UI interactivity while copying a lot of
data. But up to now there was no idea how to get that behaviour implemented.

In the old thread started by Elladan that finally led to 56e49d21 Wu Fengguang
wrote:
 "In the worse scenario, it could waste half the memory that could
 otherwise be used for readahead buffer and to prevent thrashing, in a
 server serving large datasets that are hardly reused, but still slowly
 builds up its active list during the long uptime (think about a slowly
 performance downgrade that can be fixed by a crude dropcache action).

 That said, the actual performance degradation could be much smaller -
 say 15% - all memories are not equal."

We now identified a case with up to -60% Throughput, therefore this patch tries
to provide a more gentle interface than drop_caches to help a system stuck in
this.

In discussion with Rik van Riel and Johannes Weiner we concluded that there are
cases that want the current "save 50%" for the working set all the time and
others that would benefit from protecting only a smaller amount.

Eventually no "carved in stone" in-kernel ratio will match all use cases,
therefore this patch makes the value tunable via a /proc/sys/vm/ interface
named active_inactive_ratio.

Example configurations might be:
- 50% - like the current kernel
- 0%  - like a kernel pre 56e49d21
- x%  - allow ...
From: KOSAKI Motohiro
Date: Monday, April 26, 2010 - 4:59 am

Hi

I've quick reviewed your patch. but unfortunately I can't write my

We certainly need no knob, because typical desktop users run various
applications and various workloads, so the knob doesn't help them.

Probably, I've missed previous discussion. I'm going to find your previous mail.
--

From: Christian Ehrhardt
Date: Monday, April 26, 2010 - 5:43 am

Briefly - We had discussed non-desktop scenarios, e.g. a day load
that builds up the working set to 50% and a nightly backup job which
then is unable to use that protected 50% when sequentially reading a lot
of disks and due to that doesn't finish before morning.

The knob should help those people that know their system would suffer 
from this or similar cases to e.g. set the protected ratio smaller or 
even to zero if wanted.

As mentioned before, being able to gain back those protected 50% would 
be even better - if it can be done in a way not hurting the original 
intention of protecting them.

I personally just don't feel too good knowing that 50% of my memory 
might hang around unused for many hours while they could be of some use.
I absolutely agree with the old intention and see how the patch helped 
with the latency issue Elladan brought up in the past - but it just 

The discussion ends at http://lkml.org/lkml/2010/4/22/38 - feel free to 
click through it.

-- 

Grüsse / regards, Christian Ehrhardt
IBM Linux Technology Center, System z Linux Performance
--

From: Rik van Riel
Date: Monday, April 26, 2010 - 7:20 am

This is a red herring.  A backup touches all of the
data once, so it does not need a lot of page cache
and will not "not finish before morning" due to the
working set being protected.

You're going to have to come up with a more realistic

So far we have seen exactly one workload where it helps
to reduce the size of the active file list, and that is
not due to any need for caching more inactive pages.

On the contrary, it is because ALL OF THE INACTIVE PAGES
are in flight to disk, all under IO at the same time.

Caching has absolutely nothing to do with the regression
you ran into.

--

From: Christian Ehrhardt
Date: Tuesday, April 27, 2010 - 7:00 am

I completely agree that a backup case is read once and therefore doesn't
benefit from caching itself, but you know my scenario from the thread
where this patch emerged from.
="Parallel iozone sequential read - resembling the classic backup case
(read once + sequential)."

While caching isn't helping the classic way, by having data in cache
ready on the next access it is still used transparently as the system
is reading ahead into page cache to assist the sequentially reading
process.
Yes it doesn't happen with direct IO and some, but unfortunately not
all backup tools use DIO. Additionally not all backup jobs have a whole
night, and this can really be a decision maker if you can quickly pump
out your 100 TB main database in 10 or 20 minutes.

So here comes the problem, due to the 50% preserved I assume it comes
into trouble allocating that page cache memory in time. So much that it
even slows down the load - meaning long enough to let the application
completely consume the data already read and then still letting it wait.
More about that below.

Now IMHO this feels comparable to a classic backup job, and by losing
60% throughput (more than a Gb/s) it seems neither red nor smells like

Ok this time I think I got your point much better - sorry for 
being confused.
Discard my patch, but I'd really like to clarify and verify your 
assumption in conjunction with my findings and would be happy
if you can help me with that.

As mentioned the case that suffers from the 50% memory protected is
iozone read - so it would be "in flight FROM disk", but I guess that
it is not important if it is from or to, right?

Effectively I have two read cases, one with caches dropped which then 
has almost full memory for page cache in the read case. And the other 
one with a few writes before filling up the protected 50% leading to a 
read case with only half of the memory for page cache.
Now if I really got you right this time the issue is caused by the
fact that the parallel read ahead ...
From: Johannes Weiner
Date: Wednesday, April 21, 2010 - 2:03 am

Please drop that idea, that 'Buffers:' is a red herring.  It's just pages
that do not back files but block devices.  Stripping buffer_heads won't
achieve anything, we need to get rid of the pages.  Sorry, I should have
slept and thought before writing that suggestion.
--

From: Rik van Riel
Date: Wednesday, April 21, 2010 - 6:20 am

How on earth would a backup job benefit from cache?

It only accesses each bit of data once, so caching the
to-be-backed-up data is a waste of memory.
--

From: Rik van Riel
Date: Tuesday, April 20, 2010 - 7:40 am

It has potential advantages and disadvantages.

On smaller desktop systems, it is entirely possible that
the working set is close to half of the page cache.  Your
patch reduces the amount of memory that is protected on
the active file list, so it may cause part of the working
set to get evicted.

On the other hand, having a smaller active list frees up
more memory for sequential (streaming, use-once) disk IO.
This can be useful on systems with large IO subsystems
and small memory (like Christian's s390 virtual machine,
with 256MB RAM and 4 disks!).

I wonder if we could not find some automatic way to
balance between these two situations, for example by
excluding currently-in-flight pages from the calculations.

In Christian's case, he could have 160MB of cache (buffer
+ page cache), of which 70MB is in flight to disk at a
time.  It may be worthwhile to exclude that 70MB from the
total and aim for 45MB active file and 45MB inactive file
pages on his system.  That way IO does not get starved.

On a desktop system, which needs the working set protected
and does less IO, we will automatically protect more of
the working set - since there is no IO to starve.
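
The arithmetic of this proposal, as a small sketch: subtract the pages currently under IO from the cache total and split the remainder evenly between the two file lists. This only models the numbers in the example above; how in-flight pages would actually be counted in the kernel is left open here, and the function name is made up for illustration:

```python
def file_list_targets(cache_mb, in_flight_mb):
    """Split the cache that is not under IO evenly between the active and
    inactive file lists. Returns (active_target, inactive_target)."""
    if in_flight_mb > cache_mb:
        raise ValueError("in-flight pages cannot exceed the cache size")
    usable = cache_mb - in_flight_mb
    return usable / 2, usable / 2


if __name__ == "__main__":
    active, inactive = file_list_targets(cache_mb=160, in_flight_mb=70)
    print(f"active target: {active:.0f}MB, inactive target: {inactive:.0f}MB")
```

With no IO in flight the same function gives 80MB for each list, which matches the desktop case where more of the working set stays protected.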
--

From: Greg KH
Date: Tuesday, March 23, 2010 - 7:38 pm

It will go to the other stable kernels for their next round of releases

No, .30 is no longer being maintained.

thanks,

greg k-h
--

From: Mel Gorman
Date: Wednesday, March 24, 2010 - 4:49 am

Right, I won't lose any sleep over 2.6.30.dodo so :)

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
--

From: Johannes Weiner
Date: Wednesday, March 24, 2010 - 6:13 am

Hi,





I was wondering why kswapd would not make any progress and stall without
dirty pages, luckily Rik has better eyes than me.

So if he is right and most inactive pages are under IO (thus locked and
skipped) when kswapd is running, we have two choices:

  1) deactivate pages and reclaim them instead
  2) sleep and wait for IO to finish

The patch in question changes 1) to 2) because it won't scan small active
lists and the inactive list does not shrink in size when rotating busy
pages.

You said pg-Rclm is only direct reclaim.  I assume the sum of reclaimed
pages from kswapd and direct reclaim stays in the same ballpark, only
the ratio shifted towards direct reclaim?

Waiting for the disks seems to be better than going after the working set
but I have a feeling we are waiting for the wrong event to happen there.

I am amazingly ignorant when it comes to the block layer, but glancing over
the queue congestion code, it seems we are waiting for the queue to shrink
below a certain threshold.  Is this correct?

When it comes to the reclaim scanner, however, aren't we more interested in
single completions than in the overall state of the queue?

With such a constant stream of IO as in Mel's test, I could imagine that
the queue never really gets below that threshold (here goes the ignorance part)
and we always hit the timeout.  While what we really want is to be woken
up when, say, SWAP_CLUSTER_MAX pages finished since we went to sleep.

Because at that point there is a chance to reclaim some pages again,
even if a lot of requests are still pending.
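
To make the idea concrete, here is a userspace sketch of "wake up after a batch of completions, or on timeout" using a condition variable. It is an analogy only, not kernel code: the class and method names are invented, and the batch size of 32 merely stands in for SWAP_CLUSTER_MAX:

```python
import threading

SWAP_CLUSTER_MAX = 32  # stand-in batch size; treat the value as illustrative


class CompletionWaiter:
    """Wake a sleeper once enough pages finished IO, or on timeout."""

    def __init__(self, batch=SWAP_CLUSTER_MAX):
        self._cond = threading.Condition()
        self._completed = 0
        self._batch = batch

    def page_completed(self, count=1):
        """Called from the IO completion side."""
        with self._cond:
            self._completed += count
            if self._completed >= self._batch:
                self._cond.notify_all()

    def wait(self, timeout):
        """Sleep until a full batch completed since going to sleep.

        Returns True if woken by completions, False if the timeout hit.
        """
        with self._cond:
            self._completed = 0  # only count completions from now on
            return self._cond.wait_for(
                lambda: self._completed >= self._batch, timeout
            )


if __name__ == "__main__":
    waiter = CompletionWaiter()
    timer = threading.Timer(0.1, waiter.page_completed, args=(SWAP_CLUSTER_MAX,))
    timer.start()
    print("woken by completions:", waiter.wait(timeout=2.0))
    timer.join()
```

The difference from waiting on queue congestion is that the sleeper is woken by individual completions adding up, regardless of how many requests are still queued.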

	Hannes
--
