Re: The performance and behaviour of the anti-fragmentation related patches

From: Rik van Riel
Date: Friday, March 2, 2007 - 8:29 am

The RSS bits really worry me, since it looks like they could
exacerbate the scalability problems that we are already running
into on very large memory systems.

Linux is *not* happy on 256GB systems.  Even on some 32GB systems
the swappiness setting *needs* to be tweaked before Linux will even
run in a reasonable way.

Pageout scanning needs to be more efficient, not less.  The RSS
bits are worrisome...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 9:58 am

Using a zone-per-container or N-64MB-zones-per-container should actually
move us in the direction of *fixing* any such problems.  Because, to a
first-order, the scanning of such a zone has the same behaviour as a 64MB
machine.

(We'd run into a few other problems, some related to the globalness of the

Please send testcases.

-

From: Mel Gorman
Date: Friday, March 2, 2007 - 10:09 am

Quite possibly. Taking software zones from the other large mail I sent,
one could get the 64MB effect by increasing MAX_ORDER_NR_PAGES to be 64MB
in pages. To avoid external fragmentation issues, I'd prefer of course
if these container zones consisted of mainly contiguous memory but with

It would be fixable, especially if containers do their own reclaim on their
container zones and not kswapd. Writing dirty data back periodically would

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
-

From: Christoph Lameter
Date: Friday, March 2, 2007 - 10:23 am

It is not happy if you put 256GB into one zone. We are fine with 1k nodes
with 8GB each and a 16k page size (which reduces the number of
page_structs to manage to a fourth). So the total memory is 8TB, which is
significantly larger than 256GB.
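
As a rough illustration of the arithmetic above, the sketch below counts how many page structs are needed for a given amount of memory and page size. The 4k baseline page size is an assumption (the message only says that 16k pages cut the count to a fourth), and nr_page_structs() is a made-up helper name, not a kernel function.

```c
#include <assert.h>
#include <stdint.h>

#define KiB (1024ULL)
#define GiB (1024ULL * 1024 * 1024)

/* Hypothetical helper: number of page structs needed to describe
 * mem_bytes of memory when each page is page_bytes large. */
static uint64_t nr_page_structs(uint64_t mem_bytes, uint64_t page_bytes)
{
    assert(page_bytes != 0);
    return mem_bytes / page_bytes;
}
```

Under these assumptions, a single 256GB zone with 4k pages has 67,108,864 page structs to scan, while one 8GB node with 16k pages has 524,288.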

If we do this node/zone merging and reassign MAX_ORDER blocks to virtual 
node/zones for containers (with their own LRU etc) then this would also 
reduce the number of page_structs on the list and may make things a bit 
easier.

We would then produce the same effect as the partitioning via NUMA nodes
on our 8TB boxes. However, then you still have a bandwidth issue since
your 256GB system likely only has a single bus and all memory traffic for
the node/zones has to go through this single bottleneck. That bottleneck
does not exist on NUMA machines.


-

From: Andrew Morton
Date: Friday, March 2, 2007 - 10:35 am

Oh come on.  What's the workload?  What happens?  system time?  user time?
kernel profiles?
-

From: Rik van Riel
Date: Friday, March 2, 2007 - 10:43 am

I can't share all the details, since a lot of the problems are customer
workloads.

One particular case is a 32GB system with a database that takes most
of memory.  The amount of actually freeable page cache memory is in
the hundreds of MB.   With swappiness at the default level of 60, kswapd
ends up eating most of a CPU, and other tasks also dive into the pageout
code.  Even with swappiness as high as 98, that system still has
problems with the CPU use in the pageout code!

Another typical problem is that people want to back up their database
servers.  During the backup, parts of the working set get evicted from
the VM and performance is horrible.

A third scenario is where a system has way more RAM than swap, and not
a whole lot of freeable page cache.  In this case, the VM ends up
spending WAY too much CPU time scanning and shuffling around essentially
unswappable anonymous memory and tmpfs files.

I have briefly characterized some of these working sets on:

http://linux-mm.org/ProblemWorkloads

One thing I do not yet have are easily runnable test cases.  I know
the problems that happen because customers run into them, but it is
not as easy to reproduce on test systems...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 11:06 am

On Fri, 02 Mar 2007 12:43:42 -0500


userspace fixes for this are far, far better than any magic goo the kernel
can implement.  We really need to get off our butts and start educating

Well we've allegedly fixed that, but it isn't going anywhere without
testing.


-

From: Christoph Lameter
Date: Friday, March 2, 2007 - 11:15 am

The memory is likely in use but there is enough memory free in unmapped 
clean pagecache pages so that we occasionally are able to free pages. Then 
the app is reading more from disk replenishing that ...
Thus we are forever cycling through the LRU lists moving pages between 

We have fixed the case in which we compile the kernel without swap. Then 
anonymous pages behave like mlocked pages. Did we do more than that?

-

From: Rik van Riel
Date: Friday, March 2, 2007 - 11:23 am

In this particular case, the system even has swap free.

The kernel just chooses not to use it until it has scanned
some memory, due to the way the swappiness algorithm works.
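
A simplified user-space paraphrase of the swappiness heuristic may help here. It follows the swap_tendency calculation as I understand it from 2.6-era mm/vmscan.c; the function name and parameters are illustrative assumptions, and the real code reads its inputs from zone and global counters.

```c
#include <assert.h>

/* Sketch only: returns 1 when the heuristic would start reclaiming
 * mapped (anonymous and mmapped) pages, 0 otherwise.
 * prev_priority starts at 12 and drops as reclaim gets more desperate. */
static int reclaim_mapped(unsigned long long mapped_pages,
                          unsigned long long total_pages,
                          int prev_priority, int swappiness)
{
    int distress, mapped_ratio, swap_tendency;

    assert(total_pages != 0);
    assert(prev_priority >= 0 && prev_priority <= 12);

    /* 0 when reclaim is easy, approaching 100 when it is in trouble. */
    distress = 100 >> prev_priority;
    /* Percentage of memory that is mapped. */
    mapped_ratio = (int)((mapped_pages * 100) / total_pages);

    swap_tendency = mapped_ratio / 2 + distress + swappiness;
    return swap_tendency >= 100;
}
```

With 70% of memory mapped and swappiness at 60, this sketch does not touch mapped pages until the priority has fallen to 4, which means a lot of scanning has already happened before anything is swapped out.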

With 32 CPUs diving into the page reclaim simultaneously,
each trying to scan a fraction of memory, this is disastrous

Not AFAIK.

I would like to see separate pageout selection queues
for anonymous/tmpfs and page cache backed pages.  That
way we can simply scan only what we want to scan.

There are several ways available to balance pressure
between both sets of lists.

Splitting them out will also make it possible to do
proper use-once replacement for the page cache pages.
Ie. leaving the really active page cache pages on the
page cache active list, instead of deactivating them
because they're lower priority than anonymous pages.

That way we can do a backup without losing the page
cache working set.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Christoph Lameter
Date: Friday, March 2, 2007 - 12:31 pm

Well I would expect this to have marginal improvements and delay the 
inevitable for a while until we have even bigger memory. If the app uses 
mmapped data areas then the problem is still there. And such tinkering 
does not solve the issue of large scale I/O requiring the handling of 
gazillions of page structs. I do not think that there is a way around 
somehow handling larger chunks of memory in an easier way. We already do 
handle larger page sizes for some limited purposes and with huge pages we 
already have a larger page size. Mel's defrag/anti-frag patches are 
necessary to allow us to deal with the resulting fragmentation problems.

-

From: Rik van Riel
Date: Friday, March 2, 2007 - 12:40 pm

I suspect we would not need to treat mapped file backed memory any
different from page cache that's not mapped.  After all, if we do
proper use-once accounting, the working set will be on the active
list and other cache will be flushed out the inactive list quickly.

Also, the IO cost for mmapped data areas is the same as the IO
cost for unmapped files, so there's no IO reason to treat them
differently, either.


-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Bill Irwin
Date: Friday, March 2, 2007 - 2:12 pm

Thundering herds of a sort pounding the LRU locks from direct reclaim
have set off the NMI oopser for users here.


-- wli
-

From: Rik van Riel
Date: Friday, March 2, 2007 - 2:19 pm

Ditto here.

The main reason they end up pounding the LRU locks is the
swappiness heuristic.  They scan too much before deciding
that it would be a good idea to actually swap something
out, and with 32 CPUs doing such scanning simultaneously...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 2:52 pm

On Fri, 02 Mar 2007 16:19:19 -0500


What kernel version?
-

From: Rik van Riel
Date: Friday, March 2, 2007 - 3:03 pm

Customers are on the 2.6.9 based RHEL4 kernel, but I believe
we have reproduced the problem on 2.6.18 too during stress
tests.

I have no reason to believe we should stick our heads in the
sand and pretend it no longer exists on 2.6.21.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 3:22 pm

On Fri, 02 Mar 2007 17:03:10 -0500

Opterons seem to be particularly prone to lock starvation where a cacheline


I have no reason to believe anything.  All I see is handwaviness,
speculation and grand plans to rewrite vast amounts of stuff without even a
testcase to demonstrate that said rewrite improved anything.

None of this is going anywhere, is it?
-

From: Rik van Riel
Date: Friday, March 2, 2007 - 3:34 pm

We tested them.  They only alleviate the problem slightly in
good situations, but things still fall apart badly with less

Your attitude is exactly why the VM keeps falling apart over
and over again.

Fixing "a testcase" in the VM tends to introduce problems for
other test cases, ad infinitum. There's a reason we end up
fixing the same bugs over and over again.

I have been looking through a few hundred VM related bugzillas
and have found the same bugs persist over many different
versions of Linux, sometimes temporarily fixed, but they seem

I will test my changes before I send them to you, but I cannot
promise you that you'll have the computers or software needed
to reproduce the problems.  I doubt I'll have full time access
to such systems myself, either.

32GB is pretty much the minimum size to reproduce some of these
problems. Some workloads may need larger systems to easily trigger
them.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Martin Bligh
Date: Friday, March 2, 2007 - 3:51 pm

We can find a 32GB system here pretty easily to test things on if
need be.  Setting up large commercial databases is much harder.

I don't have such a machine in the public set of machines we're going
to push to test.kernel.org from at the moment, but will see if I can
arrange it in the future if it's important.


M.
-

From: Rik van Riel
Date: Friday, March 2, 2007 - 3:54 pm

That's my problem, too.

There does not seem to exist any single set of test cases that
accurately predicts how the VM will behave with customer
workloads.

The one thing I can do relatively easily is go through a few
hundred bugzillas and figure out what kinds of problems have
been plaguing the VM consistently over the last few years.
I just finished doing that, and am trying to come up with
fixes for the problems that just don't seem to be easily
fixable with bandaids...

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: Martin J. Bligh
Date: Friday, March 2, 2007 - 4:28 pm

Tracing might help? Showing Andrew traces of what happened in
production for the prev_priority change made it much easier to
demonstrate and explain the real problem ...

M.


-

From: Andrew Morton
Date: Friday, March 2, 2007 - 5:24 pm

On Fri, 02 Mar 2007 15:28:43 -0800

Tracing is one way.

The other way is the old scientific method:

- develop a theory
- add sufficient instrumentation to prove or disprove that theory
- run workload, crunch on numbers
- repeat

Of course, multiple theories can be proven/disproven in a single pass.

Practically, this means adding one new /proc/vmstat entry for each `goto
keep*' in shrink_page_list().  And more instrumentation in
shrink_active_list() to determine the behaviour of swap_tendency.
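
For the "run workload, crunch on numbers" step, a small user-space reader is enough to sample such counters before and after a run. This is a minimal sketch, assuming the usual "name value" line format of /proc/vmstat; both helper names are made up for illustration, and any new per-`goto keep*' counter names would be whatever the instrumentation patch chooses.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one "name value" line. Returns 0 on success, -1 otherwise. */
static int parse_vmstat_line(const char *line, char name[64],
                             unsigned long long *value)
{
    assert(line && name && value);
    return sscanf(line, "%63s %llu", name, value) == 2 ? 0 : -1;
}

/* Look up one counter in a vmstat-style file (normally "/proc/vmstat").
 * Returns 0 and stores the value on success, -1 if it is not found. */
static int read_vmstat_counter(const char *path, const char *key,
                               unsigned long long *value)
{
    char line[128], name[64];
    unsigned long long v;
    int found = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (parse_vmstat_line(line, name, &v) == 0 &&
            strcmp(name, key) == 0) {
            *value = v;
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}
```

Sampling a counter before and after the workload and subtracting gives the number of events attributable to that run.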

Once that process is finished, we should have a thorough understanding of
what the problem is.  We can then construct a testcase (it'll be a couple
hundred lines only) and use that testcase to determine what implementation
changes are needed, and whether it actually worked.

Then go back to the real workload, verify that it's still fixed.

Then do whitebox testing of other workloads to check that they haven't
regressed.

-

From: Chuck Ebbert
Date: Friday, March 2, 2007 - 3:52 pm

Hundreds of disks all doing IO at once may also be needed, as
wli points out. Such systems are not readily available for testing.



-

From: Andrew Morton
Date: Friday, March 2, 2007 - 3:59 pm

On Fri, 02 Mar 2007 17:34:31 -0500

What is it with vendors finding MM problems and either not fixing them or
kludging around them and not telling the upstream maintainers about *any*

In that case it was a bad fix.  The aim is to fix known problems without
introducing regressions in other areas.  A perfectly legitimate approach.



32GB isn't particularly large.

Somehow I don't believe that a person or organisation which is incapable of
preparing even a simple testcase will be capable of fixing problems such as
this without breaking things.
-

From: Rik van Riel
Date: Friday, March 2, 2007 - 4:20 pm

I don't believe anybody who relies on one simple test case will
ever be capable of evaluating a patch without breaking things.

Test cases can show problems, but fixing a test case is no
guarantee at all that your VM will behave ok with real world
workloads.  Test cases for the VM can *never* be relied on
to show that a problem went away.

I'll do my best, but I can't promise a simple test case
for every single problem that's plaguing the VM.

-- 
Politics is the struggle between those who want to make their country
the best in the world, and those who believe it already is.  Each group
calls the other unpatriotic.
-

From: William Lee Irwin III
Date: Friday, March 2, 2007 - 6:40 pm

I'm not in the business of defending vendors, but a lot of times the
base is so far downrev it's difficult to relate it to much of anything
current. It may be best not to say precisely how far downrev things can
get, since some of these things are so old even distro vendors won't
touch them.



My gut feeling is to agree, but I get nagging doubts when I try to
think of how to boil things like [major benchmarks whose names are
trademarked/copyrighted/etc. censored] down to simple testcases. Some
other things are obvious but require vast resources, like zillions of
disks fooling throttling/etc. heuristics of ancient downrev kernels.
I guess for those sorts of things the voodoo incantations, chicken
blood, and carcasses of freshly slaughtered goats come out. Might as
well throw in a Tarot reading and some tea leaves while I'm at it.

My tack on basic stability was usually testbooting on several arches,
which various people have an active disinterest in (suggesting, for
example, that I throw out all of my sparc32 systems and replace them
with Opterons, or that anything that goes wrong on ia64 is not only
irrelevant but also that neither I nor anyone else should ever fix them;
you know who you are). It's become clear to me that this is insufficient,
and that I'll need to start using some sort of suite of regression tests,
at the very least to save myself the embarrassment of acking a patch that
oopses when exercised, but also to elevate the standard.


-- wli
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 6:58 pm

On Fri, 2 Mar 2007 17:40:04 -0800

noooooooooo.  You're approaching it from the wrong direction.

Step 1 is to understand what is happening on the affected production
system.  Completely.  Once that is fully understood then it is a relatively
simple matter to concoct a test case which triggers the same failure mode.

It is very hard to go the other way: to poke around with various stress
tests which you think are doing something similar to what you think the
application does in the hope that similar symptoms will trigger so you can
then work out what the kernel is doing.  yuk.

-

From: William Lee Irwin III
Date: Friday, March 2, 2007 - 8:55 pm

Yeah, it's really great when it's possible to get debug info out of
people e.g. they're willing to boot into a kernel instrumented with
the appropriate printk's/etc. Most of the time it's all guesswork.
People who post to lkml are much better about all this on average.

I never truly understood the point of kprobes/jprobes/dprobes (or
whatever the probing letter is), crash dumps, and so on until I ran
into this, not that I personally use them (though I may yet start).
Most of the time I just read the code instead and smoke out what
could be going on by something like the process of devising
counterexamples. For instance, I told that colouroff patch guy about
the possibility of getting the wrong page for the start of the buffer
from virt_to_page() on a cache colored buffer pointer (clearly
cache->gfporder >= 4 in such a case). Deriving the head page without
__GFP_COMP might be considered to be ugly-looking, though.


-- wli
-

From: Eric Dumazet
Date: Friday, March 2, 2007 - 4:16 pm

The first thing done by timespec_trunc() is :

  if (gran <= jiffies_to_usecs(1) * 1000)

This should really be a test against a constant known at compile time.

Alas, it isn't. jiffies_to_usecs() was uninlined, so the C compiler emits a
function call and a multiply to compute a CONSTANT.

mov    $0x1,%edi
mov    %rbx,0xffffffffffffffe8(%rbp)
mov    %r12,0xfffffffffffffff0(%rbp)
mov    %edx,%ebx
mov    %rsi,0xffffffffffffffc8(%rbp)
mov    %rsi,%r12
callq  ffffffff80232010 <jiffies_to_usecs>
imul   $0x3e8,%eax,%eax
cmp    %ebx,%eax

This patch reorders kernel/time.c a bit so that jiffies_to_usecs() is defined 
before timespec_trunc() so that compiler now generates :

cmp    $0x3d0900,%edx  (HZ=250 on my machine)

This gives better code (timespec_trunc() becoming a leaf function), and 
shorter kernel size as well.
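
The effect can be reproduced outside the kernel with a stripped-down sketch. The jiffies_to_usecs() below is a simplified stand-in that assumes HZ divides 1000000 evenly (the real function handles more cases), and gran_is_at_most_one_jiffy() is a hypothetical name for the test quoted above. Because the helper's body is visible before its caller, an optimizing compiler can fold the whole right-hand side into one immediate.

```c
#include <assert.h>

#define HZ 250  /* same value as on the machine quoted above */

/* Simplified stand-in, defined before its caller so the body is visible. */
static inline unsigned int jiffies_to_usecs(unsigned long j)
{
    return (1000000U / HZ) * (unsigned int)j;
}

/* Hypothetical wrapper around the comparison from timespec_trunc(). */
static int gran_is_at_most_one_jiffy(unsigned int gran_ns)
{
    return gran_ns <= jiffies_to_usecs(1) * 1000;
}

/* One jiffy at HZ=250 is 4000 usecs, i.e. 4000000 ns == 0x3d0900. */
static void check_threshold(void)
{
    assert(jiffies_to_usecs(1) * 1000 == 0x3d0900);
}
```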

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
-

From: William Lee Irwin III
Date: Friday, March 2, 2007 - 5:33 pm

AIUI that phenomenon is universal to NUMA. Maybe it's time we
reexamined our locking algorithms in the light of fairness
considerations.


-- wli
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 5:54 pm

On Fri, 2 Mar 2007 16:33:19 -0800

It's also a multicore thing.  iirc Kiran was seeing it on Intel CPUs.

I expect the phenomenon would be observable on a number of locks in the
kernel, given the appropriate workload.  We just hit it first on lru_lock.

I'd have thought that increasing SWAP_CLUSTER_MAX by two or four orders of
magnitude would plug it, simply by decreasing the acquisition frequency but
I think Kiran fiddled with that to no effect.
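
To put numbers on "decreasing the acquisition frequency": if the lock is taken once per batch of pages, the number of acquisitions is the page count divided by the batch size, rounded up. The sketch below assumes a batch of 32 pages for SWAP_CLUSTER_MAX, as I recall it from kernels of this era; lru_lock_acquisitions() is a made-up helper, not kernel code.

```c
#include <assert.h>

/* Lock acquisitions needed to scan pages_to_scan pages when the lock
 * is taken once per batch (rounded up for a partial last batch). */
static unsigned long lru_lock_acquisitions(unsigned long pages_to_scan,
                                           unsigned long batch)
{
    assert(batch > 0);
    return (pages_to_scan + batch - 1) / batch;
}
```

Scanning 1048576 pages takes 32768 acquisitions with a batch of 32, and 328 with a batch of 3200 (two orders of magnitude larger). That lowers how often the lock is taken, but as the forwarded message below argues, it does nothing for fairness between the CPUs that do take it.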


See below for Linus's thoughts, forwarded without permission..





Begin forwarded message:

Date: Mon, 22 Jan 2007 13:49:02 -0800 (PST)
From: Linus Torvalds <torvalds@linux-foundation.org>
To: Andrew Morton <akpm@osdl.org>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>, Ravikiran G Thirumalai <kiran@scalex86.org>
Subject: Re: High lock spin time for zone->lru_lock under extreme conditions




I think people need to realize that spinlocks are always going to be 
unfair, and *extremely* so under some conditions. And yes, multi-core 
brought those conditions home to roost for some people (two or more cores 
much closer to each other than others, and able to basically ping-pong the 
spinlock to each other, with nobody else ever able to get it).

There's only a few possible solutions:

 - use the much slower semaphores, which actually try to do fairness. 

 - if you cannot sleep, introduce a separate "fair spinlock" type. It's 
   going to be appreciably slower (and will possibly have a bigger memory 
   footprint) than a regular spinlock, though. But it's certainly a 
   possible thing to do.

 - make sure no lock that you care about ever has high enough contention 
   to matter. NOTE! back-off etc simply will not help. This is not a 
   back-off issue. Back-off helps keep down coherency traffic, but it 
   doesn't help fairness.
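
One well-known way to build the "fair spinlock" from the second option is a ticket lock, which hands out the lock in strict arrival order. The sketch below is a user-space illustration using C11 atomics, not the approach the message goes on to describe and not kernel code; all names are illustrative.

```c
#include <assert.h>
#include <stdatomic.h>

struct ticket_lock {
    atomic_uint next;   /* next ticket to hand out */
    atomic_uint owner;  /* ticket currently allowed to hold the lock */
};

static void ticket_lock_init(struct ticket_lock *l)
{
    atomic_init(&l->next, 0);
    atomic_init(&l->owner, 0);
}

static void ticket_lock_acquire(struct ticket_lock *l)
{
    /* Take a ticket; arrival order is fixed at this point. */
    unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
                                                memory_order_relaxed);

    /* Spin until our ticket comes up. */
    while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
        ;
}

static void ticket_lock_release(struct ticket_lock *l)
{
    /* Debug check: the lock must currently be held by someone. */
    assert(atomic_load(&l->owner) != atomic_load(&l->next));
    /* Pass the lock to the next ticket holder, if any. */
    atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

Two words instead of one and a strict FIFO hand-off are the kind of cost the message warns about: every waiter spins on the same cacheline, and a waiter that is slow to notice its turn delays everyone behind it.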

If somebody wants to play with fair spinlocks, go wild. I looked at it at 
one point, and it was not wonderful. It's pretty complicated to do, and 
the best way I could come up with was ...
-

From: Christoph Lameter
Date: Friday, March 2, 2007 - 8:15 pm

This is a phenomenon that is usually addressed at the cache logic level. 
It's a hardware maturation issue. A certain package should not be allowed
to hold onto a cacheline forever and other packages must have a minimum 
time when they can operate on that cacheline.



-

From: William Lee Irwin III
Date: Friday, March 2, 2007 - 9:19 pm

I think when I last asked about that I was told "cache directories are
too expensive" or something on that order, if I'm not botching this,
too. In any event, the above shows a gross inaccuracy in my statement.


-- wli
-

From: Martin J. Bligh
Date: Saturday, March 3, 2007 - 10:16 am

That'd be nice. Unfortunately we're stuck in the real world with
real hardware, and the situation is likely to remain thus for
quite some time ...

M.
-

From: Christoph Lameter
Date: Saturday, March 3, 2007 - 10:50 am

Our real hardware does behave as described and therefore does not suffer 
from the problem.

If you want a software solution then you may want to look at Zoran 
Radovic's work on Hierarchical Backoff locks. I had a draft of a patch a 
couple of years back that showed some promise to reduce lock contention. 
HBO locks can solve starvation issues by stopping local lock takers.

See Zoran Radovic "Software Techniques for Distributed Shared Memory", 
Uppsala Universitet, 2005 ISBN 91-554-6385-1.

http://www.gelato.org/pdf/may2005/gelato_may2005_numa_lameter_sgi.pdf

http://www.gelato.unsw.edu.au/archives/linux-ia64/0506/14368.html
-

From: Andrew Morton
Date: Friday, March 2, 2007 - 11:23 am

On Fri, 2 Mar 2007 10:15:36 -0800 (PST)


oh yeah, we took the ran-out-of-swapcache code out.  But if we're going to
do this thing, we should find some way to bring it back.

-

From: Bill Irwin
Date: Friday, March 2, 2007 - 1:59 pm

I know of one sounding similar to this where unreclaimable pages are
pinned by refcounts held by bio's spread across about 850 spindles.
It's mostly read traffic. Several different tunables could be used
to work around it, nr_requests in particular, but also clamping down
on dirty limits to preposterously low levels and setting preposterously
large values of min_free_kbytes. Their kernel is, of course,
substantially downrev (2.6.9-based IIRC), so douse things heavily with
grains of salt.


-- wli
-
