The likelihood of pagetable sharing in 2.6 rose dramatically as Andrew Morton, the neighbourhood VM IO system god, integrated Dave McCracken's implementation into his mm patchset, which can be found here.
Many contributed to this feature in various threads, notably Daniel Phillips, who came up with the first implementation (which had a few locking problems), and all agreed it would be a very good idea if the implementation's locking was acceptable. The main benefits are increased fork speed and reduced memory (ZONE_NORMAL) consumption, which is a critical problem for some big workloads. Unfortunately the reverse mapping system introduced into 2.5 made both attributes significantly worse, despite the best efforts of numerous talented kernel hackers!
There were not a lot of numbers being thrown around on the list, so I took the time to do some of my own! Benchmarks run on 2xPIII 1Gn mem=256MB, swap=
2.4.19 (no rmap, no pagetable sharing)
Overhead for creating 1000 tasks: 0.38058s
Time - overhead for creating 1000 tasks of:
     16 (extra) pages:  0.015s (0.945us per process per page)
     32 (extra) pages:  0.028s (0.890us per process per page)
     64 (extra) pages:  0.038s (0.600us per process per page)
    128 (extra) pages:  0.059s (0.463us per process per page)
    256 (extra) pages:  0.105s (0.410us per process per page)
    512 (extra) pages:  0.207s (0.404us per process per page)
   1024 (extra) pages:  0.443s (0.433us per process per page)
   2048 (extra) pages:  0.986s (0.482us per process per page)
   4096 (extra) pages:  2.397s (0.585us per process per page)
   8192 (extra) pages:  5.917s (0.722us per process per page)
  16384 (extra) pages: 11.958s (0.730us per process per page)
  32768 (extra) pages: fork: Cannot allocate memory
2.5.40 (optimised reverse mapping)
Overhead for creating 1000 tasks: 0.32972s
Time - overhead for creating 1000 tasks of:
     16 (extra) pages:  0.037s (2.322us per process per page)
     32 (extra) pages:  0.064s (1.994us per process per page)
     64 (extra) pages:  0.089s (1.384us per process per page)
    128 (extra) pages:  0.145s (1.133us per process per page)
    256 (extra) pages:  0.267s (1.043us per process per page)
    512 (extra) pages:  0.503s (0.982us per process per page)
   1024 (extra) pages:  1.051s (1.027us per process per page)
   2048 (extra) pages:  2.382s (1.163us per process per page)
   4096 (extra) pages:  5.571s (1.360us per process per page)
   8192 (extra) pages: 11.244s (1.373us per process per page)
  16384 (extra) pages: fork: Cannot allocate memory
2.5.41-mm2 (rmap + pagetable sharing)
Overhead for creating 1000 tasks: 0.30261s
Time - overhead for creating 1000 tasks of:
     16 (extra) pages:  0.028s (1.741us per process per page)
     32 (extra) pages:  0.048s (1.495us per process per page)
     64 (extra) pages:  0.068s (1.055us per process per page)
    128 (extra) pages:  0.103s (0.802us per process per page)
    256 (extra) pages:  0.236s (0.924us per process per page)
    512 (extra) pages:  0.413s (0.807us per process per page)
   1024 (extra) pages:  0.549s (0.536us per process per page)
   2048 (extra) pages:  0.553s (0.270us per process per page)
   4096 (extra) pages:  0.557s (0.136us per process per page)
   8192 (extra) pages:  0.562s (0.069us per process per page)
  16384 (extra) pages:  0.558s (0.034us per process per page)
  32768 (extra) pages: fork: Cannot allocate memory
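To make the comparison between the three kernels easier to read, here is a small sketch that divides one kernel's net time by another's at the same page count. The numbers are copied from the results above; the names NET_SECONDS and ratio are only illustrative and not part of any benchmark tool.

```python
# Net times in seconds (time minus overhead) for creating 1000 tasks,
# copied from the three result listings above for a few page counts.
NET_SECONDS = {
    "2.4.19":     {1024: 0.443, 4096: 2.397, 8192: 5.917},
    "2.5.40":     {1024: 1.051, 4096: 5.571, 8192: 11.244},
    "2.5.41-mm2": {1024: 0.549, 4096: 0.557, 8192: 0.562},
}


def ratio(slower, faster, pages):
    """How many times longer `slower` took than `faster` at `pages` pages."""
    return NET_SECONDS[slower][pages] / NET_SECONDS[faster][pages]


if __name__ == "__main__":
    for pages in (1024, 4096, 8192):
        print(pages,
              "rmap vs 2.4.19: %.2fx" % ratio("2.5.40", "2.4.19", pages),
              "rmap vs sharing: %.2fx" % ratio("2.5.40", "2.5.41-mm2", pages))
```

At 8192 extra pages this works out to 2.5.40 taking roughly 1.9 times as long as 2.4.19, and roughly 20 times as long as 2.5.41-mm2 with pagetable sharing.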
The benchmark times creating 1000 processes and then exiting 1000 processes. It's not meant to be very accurate, but it should be accurate enough to compare the differences. It is also about the most favourable measurement for this feature: expect a percent or two on your regular kernel compile. Some things to note:
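The benchmark program itself isn't included here. A minimal sketch of a test with the same shape might look like the following, assuming a Linux system. It is not the original program; the function names, the pipe used to keep the children alive, and the small task counts are all assumptions made for illustration.

```python
import mmap
import os
import time


def time_fork_exit(tasks, extra_pages):
    """Seconds taken to fork `tasks` children and then let them all exit."""
    region = None
    if extra_pages:
        # Private anonymous mapping. Writing one byte per page makes the
        # kernel populate a page table entry for every page, so each fork
        # has that many extra entries to copy (or share).
        region = mmap.mmap(-1, extra_pages * mmap.PAGESIZE,
                           flags=mmap.MAP_PRIVATE | mmap.MAP_ANONYMOUS)
        for offset in range(0, extra_pages * mmap.PAGESIZE, mmap.PAGESIZE):
            region[offset] = 1

    read_fd, write_fd = os.pipe()
    pids = []
    start = time.perf_counter()
    for _ in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: wait until the parent closes the pipe, then exit.
            try:
                os.close(write_fd)
                os.read(read_fd, 1)
            finally:
                os._exit(0)
        pids.append(pid)
    os.close(write_fd)          # EOF on the pipe releases every child
    for pid in pids:
        os.waitpid(pid, 0)
    elapsed = time.perf_counter() - start

    os.close(read_fd)
    if region is not None:
        region.close()
    return elapsed


def per_process_per_page_us(net_seconds, tasks, extra_pages):
    """Convert a net time (overhead already subtracted) to microseconds."""
    return net_seconds / tasks / extra_pages * 1e6


if __name__ == "__main__":
    tasks = 100                              # the article used 1000
    overhead = time_fork_exit(tasks, 0)      # baseline with no extra pages
    for pages in (16, 256, 4096):
        net = time_fork_exit(tasks, pages) - overhead
        print("%5d (extra) pages: %.3fs (%.3fus per process per page)"
              % (pages, net, per_process_per_page_us(net, tasks, pages)))
```

Python's own fork overhead is much larger than a C program's, so the absolute numbers are not comparable with the results above; only the shape of the measurement is. The per-page figure is simply the net time divided by the number of tasks and pages, for example 0.105s / 1000 / 256 is about 0.410us. The published figures for the smallest runs differ slightly from this, presumably because they were computed from unrounded times.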
Post benchmarks on lkml
Did you post your benchmarks on lkml? They seem to provide useful information.
I haven't actually
no. perhaps I should.
Sure!
don't hesitate to post it there. Those numbers are very informative.
Heh.
Stories like these are why I love this site.
What do you do? Just lurk around lkml and wait till you find an interesting thread?
Keep up the good work!
hehe. :)
Re
Yeah, well, I lurk on lkml but I don't wait for anything. The point of my little story was that there wasn't really an interesting thread at all; I just came up with some interesting numbers, which is why I submitted.
Contest effect less impressive
I've benchmarked the effect of this with contest on 2.5.41-mm3:
[BENCHMARK] 2.5.41-mm3 +/- shared pagetables with contest
Suffice to say that in no area did it perform better in its current incarnation.
Why the slowdown?
Any insight on why shared pagetables might slow down contest? Were you running on a single-proc or a multi-proc machine?
I could see shared pagetables screwing up a multi-proc machine due to increased line sharing between CPUs, but on a single-proc, I don't understand it. Unless, of course, there's increased lock contention or something.
RE: Why the slowdown?
Well, the increased line sharing isn't too much of a problem as they are generally read only AFAIK; shared cache lines only "bounce" when they are dirtied. When you look closely at the contest results, the regression in compile time is usually made up for by a corresponding increase in load work being done. This seems a bit strange to me, as contest is normally pretty reproducible. It could be an accounting type bug in the shared pte code that changes the balancing. Remember this is only beta code.
Actually, I had though
Actually, I had thought about the number of increased 'load' iterations. However, that isn't evident in most of the cases. Because fork gets faster, I'd imagine that any kind of heavy forking load (does 'process load' fit that?) would shift performance widely towards the forks. That type of shift does show up in the 'process load', but without a commensurate increase in LCPU%. All the others show about the same number of load iterations for the other loads. Also, the LCPU% numbers seem to be pretty equal between -mm3 and -mm3s. The additional load iterations for process_load seem only to be due to the load running more times because the load got cheaper.
As for dirtiness: how does the 'referenced' bit for a page factor into the dirtiness equation? Doesn't that live in the page table contents? If so, wouldn't first-reference to a page also dirty the page table info that maps that page?
Kernel [runs] Time
Kernel        [runs]  Time   CPU%  Loads  LCPU%  Ratio
noload:
uninteresting
process_load:
2.5.41-mm3    [1]     95.5   75    31     28     1.42
2.5.41-mm3s   [2]     96.1   74    42     28     1.43
io_load:
2.5.41-mm3    [1]     312.4  25    20     11     4.65
2.5.41-mm3s   [2]     425.5  18    27     10     6.33
mem_load:
2.5.41-mm3    [2]     107.1  68    27     2      1.59
2.5.41-mm3s   [2]     116.5  62    28     2      1.73
Actually the process_load looks good and seems correct as you say. We were talking about the "regressions" though - io_load & mem_load. At first thought, there shouldn't be a reason why the IO bound workloads should shift around so much. It could be a sensitivity in contest however.
And I would assume changing the referenced bit would dirty the cache. I don't think it would flip around much unless the machine is under heavy page replacement, in which case it shouldn't matter too much.
Not a heavy forking load
Process_load initially forks its processes (4*num_cpus) and then is heavy on context switching. An attempt at making a fork_load for contest was met with disinterest, so I abandoned it. The equivalence of kernel compile time in process_load with a rise in the number of loads is indicative of an improvement in performance though, as you said.
Machine specs
The machine used for the testing is static. As the original aim of contest was to not forget desktop performance, the machine chosen is about average for a recent machine: single CPU, 1133MHz P3 with 256MB RAM, 5400 rpm ATA100 drive.
One question
I never really trusted benchmark programs, and I noticed many people feeling the same way, but it seems that Contest has been well received, and that many people now praise it for its real-world-like results...
Could you possibly be convinced to give us some insight into how Contest works (and doesn't)?
re: One question
Con, if you're interested in sharing this insight, perhaps you could drop me an email first? (jeremy AT kerneltrap.org)
Why? I'd rather put your insight into its own story than hide it in a comment... :)
Contest insight
For the time being you can read the info on the contest homepage http://contest.kolivas.net then maybe we can work on something specific for kerneltrap.