Andrea Arcangeli released the 2.6.5-rc1-aa1 patchset, offering an alternative virtual memory subsystem that particularly shines on high-end servers. Andrea notes that he has removed rmap, saying, "I'm running this kernel while writing this and it's under 500M swap load without problems." He explains:
"This implements anon_vma for the anonymous memory unmapping and objrmap for the file mappings, effectively removing rmap completely and replacing it with more efficient algorithms (in terms of memory utilization and cpu cost for the fast paths)."
Read on for his complete email, which explains the current implementation and describes where he intends to go with his -aa patchset. He also includes some basic benchmarking results, noting, "in terms of performance I hope with this effort we'll get back the two digit percent loss compared to 2.4 based kernels".
From: Andrea Arcangeli [email blocked]
To: linux-kernel
Subject: 2.6.5-rc1-aa1
Date: Thu, 18 Mar 2004 03:22:01 +0100
This implements anon_vma for the anonymous memory unmapping and objrmap
for the file mappings, effectively removing rmap completely and
replacing it with more efficient algorithms (in terms of memory
utilization and cpu cost for the fast paths).
I attached the results for a basic benchmark comparing 2.4, 2.6 and
2.6-aa.
Next thing to fix are the nonlinear mappings (probably I will use the
sysctl for the short term, sysctl may be needed anyways for allowing
mlock to all users), and then the rbtree for the i_mmap{shared} (the
prio_tree isn't stable yet, over time we can replace it with the
prio_tree of course).
I'm running this kernel while writing this and it's under 500M swap load
without problems. Ingo complains that some workloads with zillions of vmas
in the same file will not work well, but 1) those workloads are supposed
to use remap_file_pages on 32bit archs, and 2) they want mlock anyway,
and this vm design is optimal on 64bit without requiring either
remap_file_pages or mlock there.
This is better than anonmm because it's fine-grained, so applications
using local "malloc" in a process-group will scale better during heavy
swapping (in presence of tons of processes in the same group), and
secondly it gets mremap right (tons of programs use cows heavily, the
most obvious example is kde, I don't know if they ever run mremap but
this sounds safer overall). It costs a bit more memory but I don't
think it matters, an anon_vma is only 12 bytes and it can cover
gigabytes of address space.
I did some robustness improvements as well in the ->nopage handling,
and a fix in bttv that didn't set VM_RESERVE, plus some cleanup in the
vma merging.
vma merging for mprotect and mremap is disabled for now, but it can be
enabled again, and after we re-enable it file backed vmas should be
supported too in the merging (so far they aren't).
In terms of performance I hope with this effort we'll get back the two
digit percent loss compared to 2.4 based kernels, if this doesn't bring
all performance back the next suspect is the scheduler lacking HT
awareness or stuff like that (but I doubt it, see the bench page and
imagine a much bigger box with quite some more ram than 1G).
With regard to the x86 32bit arch this releases 4 bytes of page_t, plus it
saves tons of zone-normal to make it possible to use >=4G boxes with the
optimal 3:1 model like we do in 2.4 today (something not doable in 2.6
mainline due to the huge rmap overhead, which in practice limits the 2.6
kernel to non-PAE boxes [depending on the workload]).
Credit for the original objrmap idea goes to David Miller, the first
good implementation for 2.6 is from Dave McCracken but only for the file
mappings, and the first effort to cover the anonymous memory too is from
Hugh Dickins; Wli also maintained the anonmm code later. I coalesced
the good stuff together plus I designed and implemented the anon_vma
solution for anonymous memory.
Alternate solutions to anon_vma have been proposed and they may be
considered as alternatives to this. Ideally we should split the anon_vma
patch in two parts, one that could be re-used by the anonmm design,
though I have had no time to split it so far. I'm not claiming anon_vma is
definitely superior to anonmm but it's the solution I prefer. It is clearly
more efficient in some very high end workloads I have in mind, but on
small boxes it takes a bit more memory so for the simpler workloads
anonmm is preferable, plus anonmm allows full vma merging (though it
requires cow during mremap).
URL:
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.6/2.6.5-rc1-aa1.gz
http://www.us.kernel.org/pub/linux/kernel/people/andrea/kernels/v2.6/2.6.5-rc1-aa1/
Only in 2.6.5-rc1-aa1: 00000_extraversion-1
Add -aa1.
Only in 2.6.5-rc1-aa1: 00000_twofish-2.6.gz
Compatibility cryptoloop.
Only in 2.6.5-rc1-aa1: 00100_objrmap-core-1.gz
Objrmap core from Dave McCracken implementing objrmap
methods for file mappings reusing the truncate
infrastructure (as in DaveM's effort). This doesn't
obsolete rmap yet, anonymous memory is still tracked
via rmap.
Only in 2.6.5-rc1-aa1: 00101_anon_vma-1.gz
Implements innovative objrmap methods for anonymous
memory using the fine-grained anon_vma design that handles
mremap gracefully and avoids scanning every mm in a child-group
to find the right vmas as happens with anonmm. This effectively
obsoletes rmap and in fact it releases 4 bytes per page_t by dropping the
pte_chains.
The combination of anon_vma and objrmap-core avoids huge waste
of memory and cpu resources and it boosts smp scalability too.
This makes it possible to use >=4G boxes without the 4:4 slowdown.
The nonlinear vmas aren't covered yet, so this is an insecure
kernel at this time (only in terms of local security
DoS, no root compromise or anything like that can happen, it's
like allowing a bad user to use mlock, so he can take
down the machine locally). The nonlinear vmas will be covered too with
further patches, this is already more than good enough to start
beating on it with real loads like single user desktop usage.
The file based methods will have to be optimized further too with
a rbtree.
Both the nonlinear vmas and the rbtree for the file methods
i_mmap{shared} will be covered in the next -aa.
If you find compile breakages s/->mapping/->as.mapping/ will fix it.
Only in 2.6.5-rc1-aa1: 00200_kgdb-ga-1.gz
Only in 2.6.5-rc1-aa1: 00201_kgdb-ga-recent-gcc-fix-1.gz
Only in 2.6.5-rc1-aa1: 00201_kgdb-THREAD_SIZE-fixes-1.gz
Only in 2.6.5-rc1-aa1: 00201_kgdb-x86_64-support-1.gz
kgdb from Andrew's -mm tree.
testprogram:
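#include <sys/mman.h>
#include <stdio.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/wait.h>

#define SIZE (512*1024*1024)

int main(int argc, char ** argv)
{
	int fd = -1, level, max_level;
	char * start, * end, * tmp;

	/* number of fork levels, taken from the command line */
	max_level = atoi(argv[1]);

	/* 512M of private anonymous memory */
	if ((start = mmap(0, SIZE, PROT_READ|PROT_WRITE, MAP_ANONYMOUS|MAP_PRIVATE, fd, 0)) == MAP_FAILED)
		perror("mmap"), exit(1);
	end = start + SIZE;

	/* write to every page so each one is really allocated */
	for (tmp = start; tmp < end; tmp += 4096) {
		*tmp = 0;
	}

	/*
	 * Parent and child both continue the loop after each fork, so
	 * up to 2^max_level processes end up sharing the same pages.
	 */
	for (level = 0; level < max_level; level++) {
		int pid = fork();
		if (pid < 0)
			perror("fork"), exit(1);
		/* read every page in both parent and child */
		for (tmp = start; tmp < end; tmp += 4096) {
			*(volatile char *)tmp;
		}
	}

	/* reap all direct children before exiting */
	for (;;)
		if (wait(NULL) < 0)
			break;
	return 0;
}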
2.4.21 based kernel:
6:
real 0m6.548s
user 0m11.340s
sys 0m10.730s
real 0m6.684s
user 0m11.120s
sys 0m11.090s
7:
real 0m12.145s
user 0m22.180s
sys 0m21.330s
real 0m11.862s
user 0m21.680s
sys 0m21.100s
2.6.5-rc2-aa1 (== 2.6 + objrmap + anon_vma):
6:
real 0m6.520s
user 0m9.926s
sys 0m11.791s
real 0m6.527s
user 0m9.823s
sys 0m11.975s
7:
real 0m12.072s
user 0m21.913s
sys 0m22.034s
real 0m12.062s
user 0m21.331s
sys 0m22.161s
2.6.2 (the last non-objrmap kernel I had installed on the test box)
6:
real 0m8.359s
user 0m10.490s
sys 0m17.840s
real 0m8.185s
user 0m9.684s
sys 0m18.078s
7:
real 0m16.150s
user 0m21.184s
sys 0m38.473s
real 0m16.082s
user 0m21.441s
sys 0m38.025s
From level 8 on, 2.6.2 runs out of memory in fork; 2.6.5-rc1-aa1
reaches 9 without problems, filling around 400M with pagetables.
What does this test program do? In particular, what does this line do?
*(volatile char *)tmp;
The (volatile char *) cast treats tmp as a pointer to a volatile character. Volatile tells the compiler that the value in this memory location can change without you touching it, so it must not make any assumptions when optimizing. The expression then dereferences that pointer to read the byte that tmp points to.
... right.
I understand this part completely, but what is the access for?
Shouldn't it be more like:
?
I am interested in the asm that is generated for the first variant as opposed to the second one. Won't the compiler see that the value is not used and generate no code for that line?
Well, it "should" be, but this program isn't designed to do something useful; it's designed for benchmarking. The dereference takes an assembly instruction or two, so it serves its purpose. A statement in C may be an expression, that is, anything that represents a value (indeed, assignment in C is an expression: x = 1 produces the value assigned to x, in this case 1). So it's perfectly legal to say *(volatile char *)tmp. Obviously more assembly instructions would be generated for the assignment, as the contents of the register holding the dereferenced value would have to be stored to the address in memory where c resides. The compiler would generate code for the simple dereference, unless you asked for it to be optimized away with -O#.
Perhaps "volatile" protects it
I'm guessing that by casting to a volatile pointer, the compiler may not optimise away the dereference. Otherwise I can't see any reason for the volatile cast, and as you say, without the assignment it could be optimised away.
Such constructs can be necessary for memory mapped hardware access where you sometimes need to just read a register to trigger an operation. Volatile is probably the way to ensure the memory access does not get optimised away.
Just guessing, though.
Is this the right thing to do?
Is removing rmap the right thing to do? Would it be a better approach to just fix rmap for these situations?
...
It isn't removing rmap. It is fixing rmap for these situations ;)
Rmap is a way to go from physical page -> virtual pages. Well actually if we go by that definition, scanning all processes' page tables as in non-rmap kernels is also a method of "rmap", but you know what I mean.
RE: Is this the right thing to do?
No, it wouldn't. rmap uses memory to reduce CPU overhead when swapping a page out (without rmap you use a lot of CPU time to find where that page is used; with rmap there is a list of users). If that uses too much memory, to save memory you just have to avoid rmap. If you have a more clever solution I'm not aware of, let the community know of Your Idea.
Note: Ingo Molnar has actually proposed another solution, which does not touch rmap directly: instead of using 4 KB pages, use 2 MB pages (they exist on i386) when a process has a lot of pages. So you have fewer pages and less memory used by rmap, and there are a lot of other reasons to do this.
However, if they do not want to repeat the 2.4 VM hell on a wider scale, they must delay this to 2.7. And nobody has yet written any code trying to do this (at least I'm not aware of it, search for "page clustering" to check).
More details on this at http://lwn.net/Articles/74295/.
On 2.6.4, the test program exits with the error message:
mmap: Cannot allocate memory
Program exited with code 01.
Any pointers as to why this would happen? I am assuming that this is a fairly generic program that can run on any vanilla kernel version.
Buy another stick of DDR
The test program attempts to allocate 500+MB of memory:
512 * 1024 * 1024 = 536870912
You need a lot of RAM and/or swap to run this test. Since it's mmap() and not malloc(), I don't believe that you actually need that much free. I think mmap'd pages are allocated only as needed, so having swap should be enough to cover it.