On Sun, 8 Jun 2008 16:54:34 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
Nowhere near as intrusive or risky as e.g. the timer changes that went
in a few releases ago.
Actually, memory is now getting so large that the current code no
longer works right. On machines with 16GB of RAM and up, we have
discovered really pathetic behaviour by the VM currently upstream.
Things like the VM scanning over the (locked) shared memory segment
over and over and over again, to get at the 1GB of freeable pagecache
memory in the system. Or the system scanning over all anonymous
memory over and over again, despite the fact that there is no more
swap space left.
With heavy anonymous memory workloads, Linux can stall for minutes
once memory runs low and something needs to be swapped out, because
pretty much all memory is anonymous and everything has the referenced
bit set. We have seen systems with 128GB of RAM hang overnight, once
every CPU got wedged in the pageout scanning code. Typically the VM
decides on a first page to swap out in 2-3 minutes though, and then
it will start several gigabytes of swap IO at once...
Definitely not acceptable behaviour.
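
To illustrate why "everything has the referenced bit set" is so costly,
here is a minimal sketch of a second-chance (clock) scan. It is a toy
model and not the kernel's pageout code: the function name and the
list-of-booleans representation are assumptions made for illustration
only.

```python
def scans_until_first_victim(referenced):
    """Count page visits until a second-chance scan finds a victim.

    `referenced` holds one boolean per page (hypothetical model): True
    means the referenced bit is set. A referenced page gets its bit
    cleared and is skipped; the first unreferenced page is the victim.
    """
    pages = list(referenced)
    if not pages:
        raise ValueError("need at least one page")
    visits = 0
    hand = 0
    while True:
        visits += 1
        if pages[hand]:
            pages[hand] = False              # clear the bit, give a second chance
            hand = (hand + 1) % len(pages)   # advance the clock hand
        else:
            return visits                    # first evictable page found


# With every page referenced, the scan has to touch all pages once
# before it can pick even a single victim.
all_referenced = scans_until_first_victim([True] * 1000)
```

In this model the cost of finding the first victim grows linearly with
the number of pages, which matches the observation above that the first
swap-out decision can take minutes on a large machine.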
Hardware gets larger. 4 years ago few people cared about systems
with more than 4GB of memory, but nowadays people have that in their
desktops.
32 bit systems will still get the file/anon LRU split. The only
thing that is 64 bit only in the current patch set is keeping the
unevictable pages off of the LRU lists.
This means that balancing between file and anon eviction will be
the same on 32 and 64 bit systems and things should get sorted out
on both systems at the same time.
People with large Linux servers are experiencing system stalls
of several minutes, or at worst complete livelocks, with the
current VM.
I believe that those issues need to be fixed.
After discussing this for a long time with Larry Woodman,
Lee Schermerhorn and others, I am convinced that they cannot
be fixed by putting a bandaid on the current code.
After all, the fundamental problem often is that the file backed
and mem/swap backed pages are on the same LRU.
Think of a case that is becoming more and more common: a database
server with 128GB of RAM, 2GB of (hardly ever used) swap, 80GB of
locked shared memory segment, 30GB of other anonymous memory and
5GB of page cache.
Do you think it is reasonable for the VM to have to scan over
110GB of essentially unevictable memory, just to get at the 5GB
of page cache?
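
The arithmetic behind that question can be sketched as follows. This is
a toy model with one list entry per GB, not kernel code. It assumes the
worst-case ordering, where the page cache sits behind everything else
on the shared list, and it treats the locked shared memory and the
anonymous memory as not reclaimable. The function and variable names
are hypothetical.

```python
def pages_scanned(lru, want, can_evict):
    """Walk `lru` in order until `want` entries have been reclaimed.

    Returns (scanned, reclaimed). `can_evict` decides whether an entry
    can be reclaimed; everything else is scanned and skipped.
    """
    scanned = reclaimed = 0
    for page in lru:
        if reclaimed == want:
            break
        scanned += 1
        if can_evict(page):
            reclaimed += 1
    return scanned, reclaimed


def is_file(page):
    # Assumption: only page cache is reclaimable in this scenario.
    return page == "file"


# One entry per GB: 80GB locked shm, 30GB anonymous, 5GB page cache.
mixed_lru = ["shm"] * 80 + ["anon"] * 30 + ["file"] * 5
file_lru = ["file"] * 5          # what a separate file LRU would hold

mixed_result = pages_scanned(mixed_lru, 5, is_file)
split_result = pages_scanned(file_lru, 5, is_file)
```

Under these assumptions the shared list needs 115 units of scanning to
free the 5 units of page cache, while a separate file list needs only 5.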
We have. We failed to come up with anything that avoids the
problem without actually fixing the fundamental issues.
If you have an idea, please let us know.
Otherwise, please give us a chance to shake things out in -mm.
I will prepare kernel RPMs for Fedora so users in the community can
easily test these patches too, and help find scenarios where these
patches do not perform as well as what the current kernel has.
I have time to track down and fix any issues that people find.
--
All rights reversed.