Linux: Shared Pagetables

Submitted by Anonymous
on October 10, 2002 - 11:55pm

The likelihood of pagetable sharing landing in 2.6 rose dramatically when Andrew Morton, the neighbourhood VM/IO system god, integrated Dave McCracken's implementation into his mm patchset, which can be found here.

Many people contributed to this feature in various threads, notably Daniel Phillips, who came up with the first implementation (which had a few locking problems), and all agreed it would be a very good idea if the implementation's locking was acceptable. The main benefits are increased fork speed and reduced memory (ZONE_NORMAL) consumption, which is a critical problem for some big workloads. Unfortunately, the reverse mapping system introduced in 2.5 made both attributes significantly worse, despite the best efforts of numerous talented kernel hackers!
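To get a feel for why pagetable memory matters, here is a rough back-of-the-envelope sketch. It assumes 4KB pages and 4-byte pagetable entries (classic two-level x86 paging without PAE), so one pagetable page maps 4MB; none of these constants come from the post, and the "shared" figure is the ideal case where no process has yet written to the region and forced a private copy.

```python
# Rough pagetable arithmetic. Assumptions (not from the post): 4 KB pages
# and 4-byte PTEs, i.e. two-level x86 paging without PAE.
PAGE_SIZE = 4096
PTE_SIZE = 4
PTES_PER_TABLE = PAGE_SIZE // PTE_SIZE  # 1024 entries, mapping 4 MB


def pagetable_pages(mapped_pages):
    """PTE pages needed to map `mapped_pages` contiguous, aligned pages."""
    return -(-mapped_pages // PTES_PER_TABLE)  # ceiling division


def pagetable_bytes(mapped_pages, processes, shared):
    """Total memory spent on PTE pages; shared tables are counted once."""
    copies = 1 if shared else processes
    return pagetable_pages(mapped_pages) * PAGE_SIZE * copies


if __name__ == "__main__":
    for pages in (1024, 16384, 32768):
        private = pagetable_bytes(pages, 1000, shared=False)
        shared = pagetable_bytes(pages, 1000, shared=True)
        print(f"{pages:6d} pages x 1000 procs: "
              f"{private / 2**20:7.1f} MB private, "
              f"{shared / 2**20:6.3f} MB shared")
```

Under those assumptions, private pagetables for 1000 processes of 32768 pages each come to roughly half of a 256MB machine, all of it kernel memory that cannot be swapped.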

There were not a lot of numbers being thrown around on the list, so I took the time to do some of my own! Benchmarks run on 2xPIII 1Gn mem=256MB, swap=

2.4.19 (no rmap, no pagetable sharing)
Overhead for creating 1000 tasks:  0.38058s
Time - overhead for creating 1000 tasks of:
   16 (extra) pages:  0.015s (0.945us per process per page)
   32 (extra) pages:  0.028s (0.890us per process per page)
   64 (extra) pages:  0.038s (0.600us per process per page)
  128 (extra) pages:  0.059s (0.463us per process per page)
  256 (extra) pages:  0.105s (0.410us per process per page)
  512 (extra) pages:  0.207s (0.404us per process per page)
 1024 (extra) pages:  0.443s (0.433us per process per page)
 2048 (extra) pages:  0.986s (0.482us per process per page)
 4096 (extra) pages:  2.397s (0.585us per process per page)
 8192 (extra) pages:  5.917s (0.722us per process per page)
16384 (extra) pages:  11.958s (0.730us per process per page)
32768 (extra) pages: fork: Cannot allocate memory

2.5.40 (optimised reverse mapping)
Overhead for creating 1000 tasks:  0.32972s
Time - overhead for creating 1000 tasks of:
   16 (extra) pages:  0.037s (2.322us per process per page)
   32 (extra) pages:  0.064s (1.994us per process per page)
   64 (extra) pages:  0.089s (1.384us per process per page)
  128 (extra) pages:  0.145s (1.133us per process per page)
  256 (extra) pages:  0.267s (1.043us per process per page)
  512 (extra) pages:  0.503s (0.982us per process per page)
 1024 (extra) pages:  1.051s (1.027us per process per page)
 2048 (extra) pages:  2.382s (1.163us per process per page)
 4096 (extra) pages:  5.571s (1.360us per process per page)
 8192 (extra) pages:  11.244s (1.373us per process per page)
16384 (extra) pages: fork: Cannot allocate memory

2.5.41-mm2 (rmap + pagetable sharing)
Overhead for creating 1000 tasks:  0.30261s
Time - overhead for creating 1000 tasks of:
   16 (extra) pages:  0.028s (1.741us per process per page)
   32 (extra) pages:  0.048s (1.495us per process per page)
   64 (extra) pages:  0.068s (1.055us per process per page)
  128 (extra) pages:  0.103s (0.802us per process per page)
  256 (extra) pages:  0.236s (0.924us per process per page)
  512 (extra) pages:  0.413s (0.807us per process per page)
 1024 (extra) pages:  0.549s (0.536us per process per page)
 2048 (extra) pages:  0.553s (0.270us per process per page)
 4096 (extra) pages:  0.557s (0.136us per process per page)
 8192 (extra) pages:  0.562s (0.069us per process per page)
16384 (extra) pages:  0.558s (0.034us per process per page)
32768 (extra) pages: fork: Cannot allocate memory
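The program that produced these tables isn't included in the post. As a minimal sketch of this kind of measurement, assuming all tasks are created first and only then allowed to exit, something like the following would do. The function name, the pipe used to hold the children, and the use of Python at all are my own choices for illustration; a Python interpreter carries a lot of address space of its own, so absolute numbers will not be comparable to the ones above.

```python
import mmap
import os
import time

PAGE = mmap.PAGESIZE


def time_fork_exit(tasks, extra_pages):
    """Seconds to fork `tasks` children and reap them after they all exit."""
    # Touch one byte per page so every extra page is really mapped
    # in the parent before it forks.
    buf = bytearray(extra_pages * PAGE)
    for offset in range(0, len(buf), PAGE):
        buf[offset] = 1

    release_r, release_w = os.pipe()
    start = time.perf_counter()
    pids = []
    for _ in range(tasks):
        pid = os.fork()
        if pid == 0:
            # Child: block until the parent closes the write end, then exit.
            os.close(release_w)
            os.read(release_r, 1)
            os._exit(0)
        pids.append(pid)
    os.close(release_w)  # EOF on the pipe releases every child at once
    for pid in pids:
        os.waitpid(pid, 0)
    elapsed = time.perf_counter() - start
    os.close(release_r)
    return elapsed


if __name__ == "__main__":
    tasks = 50  # kept small here; the tables above use 1000
    overhead = time_fork_exit(tasks, 0)
    print(f"Overhead for creating {tasks} tasks: {overhead:.5f}s")
    for pages in (16, 256, 4096):
        extra = time_fork_exit(tasks, pages) - overhead
        print(f"{pages:5d} (extra) pages: {extra:.3f}s")
```

Running with zero extra pages gives the "overhead" figure, and subtracting it from the other runs gives the "time - overhead" column.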

The benchmark times the creation of 1000 processes followed by the exit of all 1000. It's not meant to be very accurate, but it should be accurate enough to compare the differences. It is also about the most favourable measurement for this feature - expect a percent or two on your regular kernel compile. Some things to note:

  • 2.5.41-mm2 beats 2.4.19 even in the low end in absolute time (time+overhead)!
  • Time to fork+exit 1000 processes with 8192 extra pages each (32MB, assuming 4KB pages): 2.5.41-mm2 is 20x faster than 2.5.40 and 10x faster than 2.4.19. That ratio would probably double each time the size doubles from then on
  • "Overhead" would consist of things such as allocating each task's kernel stack, setting up the task_struct, the syscalls, and copying (or not) pagetables for the program code + shared libraries - Ingo's 2.5 work would explain why both 2.5s are faster than 2.4.19.
  • The per process per page measure for both 2.4.19 and 2.5.40 decreases for a time, then increases. This is probably due to economies of scale helping initially, then data structures (probably pagetables / reverse maps) outgrowing the L1 cache. 2.5.41-mm2's per page cost initially improves, and from 1024 pages on its total time levels out at about 0.55s, indicating the L1 cache isn't being exceeded.
  • 2.5.40 is consistently around 2x slower than 2.4.19, due mostly to the overhead of keeping the reverse maps. It also runs out of memory sooner (again, rmap overhead).
  • Both 2.5.40 and 2.4.19 are O(number of pages), taking L1 cache effects into account. 2.5.41-mm2 ends up being O(1), so presumably even multi-gigabyte processes would not be much of a worry.
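The notes above can be checked against the tables with a little arithmetic. The sketch below copies a handful of rows (the function names are mine) and derives the per process per page figure as time / (1000 tasks x pages), the absolute time as overhead + time, and the speedup between kernels at 8192 pages, the largest size all three kernels completed.

```python
TASKS = 1000

# Seconds of fixed overhead for creating 1000 tasks, from the tables above.
OVERHEAD = {"2.4.19": 0.38058, "2.5.40": 0.32972, "2.5.41-mm2": 0.30261}

# Seconds on top of that overhead, for a few of the table rows.
EXTRA = {
    "2.4.19":     {16: 0.015, 128: 0.059, 512: 0.207, 8192: 5.917},
    "2.5.40":     {16: 0.037, 128: 0.145, 512: 0.503, 8192: 11.244},
    "2.5.41-mm2": {16: 0.028, 128: 0.103, 512: 0.413, 8192: 0.562},
}


def us_per_process_per_page(kernel, pages):
    """Microseconds per process per page, as in the bracketed column."""
    return EXTRA[kernel][pages] * 1e6 / (TASKS * pages)


def total_seconds(kernel, pages):
    """Absolute time: fixed overhead plus the size-dependent part."""
    return OVERHEAD[kernel] + EXTRA[kernel][pages]


def speedup(baseline, candidate, pages):
    """How many times faster `candidate` is, ignoring the fixed overhead."""
    return EXTRA[baseline][pages] / EXTRA[candidate][pages]


if __name__ == "__main__":
    for pages in (16, 128, 512, 8192):
        for kernel in OVERHEAD:
            print(f"{kernel:11s} {pages:5d} pages: "
                  f"{total_seconds(kernel, pages):7.3f}s total, "
                  f"{us_per_process_per_page(kernel, pages):.3f}us/proc/page")
    for baseline in ("2.5.40", "2.4.19"):
        ratio = speedup(baseline, "2.5.41-mm2", 8192)
        print(f"2.5.41-mm2 vs {baseline} at 8192 pages: {ratio:.1f}x")
```

Going by these rows, 2.5.41-mm2 is ahead of 2.4.19 in absolute time at 16 and 128 pages, falls behind at 512 pages, and is far ahead by 8192 pages. The smallest rows don't reproduce the bracketed figures exactly, since the printed times are rounded to the millisecond.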

Post benchmarks on lkml

on
October 10, 2002 - 5:47pm

Did you post your benchmarks on lkml? They seem to provide useful information.

I haven't actually

Anonymous
on
October 11, 2002 - 12:39am

no. perhaps I should.

Sure!

Anonymous
on
October 11, 2002 - 7:27am

don't hesitate to post it there. Those numbers are very informative.

Heh.

on
October 11, 2002 - 12:52am

Stories like these are why I love this site.

What do you do? Just lurk around lkml and wait till you find an interesting thread?

Keep up the good work!

hehe. :)

Re

Anonymous
on
October 11, 2002 - 1:17am

Yeah, well, I lurk on lkml, but I don't wait for anything. The point of my little story was that there wasn't really an interesting thread at all; I just came up with some interesting numbers, which is why I submitted.

Contest effect less impressive

on
October 11, 2002 - 12:58pm

I've benchmarked the effect of this with contest on 2.5.41-mm3:

[BENCHMARK] 2.5.41-mm3 +/- shared pagetables with contest

Suffice it to say that in no area did it perform better in its current incarnation.

Why the slowdown?

Anonymous
on
October 12, 2002 - 12:45am

Any insight on why shared pagetables might slow down contest? Were you running on a single-proc or a multi-proc machine?

I could see shared pagetables screwing up a multi-proc machine due to increased line sharing between CPUs, but on a single-proc, I don't understand it. Unless, of course, there's increased lock contention or something.

RE: Why the slowdown?

Anonymous
on
October 12, 2002 - 1:46am

Well, the increased line sharing isn't too much of a problem, as the lines are generally read-only AFAIK; shared cache lines only "bounce" when they are dirtied. When you look closely at the contest results, the regression in compile time is usually made up for by a corresponding increase in load work being done. This seems a bit strange to me, as contest is normally pretty reproducible. It could be an accounting-type bug in the shared pte code that changes the balancing. Remember, this is only beta code.

Actually, I had though

Anonymous
on
October 12, 2002 - 2:51am

Actually, I had thought about the number of increased 'load' iterations. However, that isn't evident in most of the cases. Because fork gets faster, I'd imagine that any kind of heavy forking load (does 'process load' fit that?) would shift performance wide towards the forks. That type of shift does show up in the 'process load', but without a commensurate increase in LCPU%. All the others show about the same number of load iterations for the other loads. Also, the LCPU% numbers seem to be pretty equal between -mm3 and -mm3s. The additional load iterations for process_load seem only to be due to the load running more times because the load got cheaper.

As for dirtiness: how does the 'referenced' bit for a page factor into the dirtiness equation? Doesn't that live in the page table contents? If so, wouldn't the first reference to a page also dirty the page table info that maps that page?

Kernel [runs] Time

Anonymous
on
October 12, 2002 - 11:02am

Kernel        [runs]  Time    CPU%  Loads  LCPU%  Ratio
noload:
uninteresting
process_load:
2.5.41-mm3    [1]     95.5    75    31     28     1.42
2.5.41-mm3s   [2]     96.1    74    42     28     1.43
io_load:
2.5.41-mm3    [1]     312.4   25    20     11     4.65
2.5.41-mm3s   [2]     425.5   18    27     10     6.33
mem_load:
2.5.41-mm3    [2]     107.1   68    27     2      1.59
2.5.41-mm3s   [2]     116.5   62    28     2      1.73

Actually, the process_load result looks good and seems correct, as you say. We were talking about the "regressions" though - io_load & mem_load. At first thought, there shouldn't be a reason why the IO-bound workloads should shift around so much. It could be a sensitivity in contest, however.

And I would assume changing the referenced bit would dirty the cache. I don't think it would flip around much unless the machine is under heavy page replacement, in which case it shouldn't matter too much.
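To put numbers on those "regressions", here is a quick sketch using the times from the table above. I'm assuming -mm3s is the shared pagetable build and that the Ratio column is the compile time divided by a noload reference time; the second assumption is only a guess from the fact that every row implies nearly the same noload time.

```python
# (compile time in seconds, ratio) per load, copied from the contest table.
RESULTS = {
    "process_load": {"2.5.41-mm3": (95.5, 1.42), "2.5.41-mm3s": (96.1, 1.43)},
    "io_load":      {"2.5.41-mm3": (312.4, 4.65), "2.5.41-mm3s": (425.5, 6.33)},
    "mem_load":     {"2.5.41-mm3": (107.1, 1.59), "2.5.41-mm3s": (116.5, 1.73)},
}


def slowdown_percent(load):
    """Extra compile time of -mm3s over -mm3, as a percentage."""
    base = RESULTS[load]["2.5.41-mm3"][0]
    shared = RESULTS[load]["2.5.41-mm3s"][0]
    return (shared / base - 1.0) * 100.0


def implied_noload_seconds(load, kernel):
    """Reference time implied by a row, IF ratio = time / noload time."""
    seconds, ratio = RESULTS[load][kernel]
    return seconds / ratio


if __name__ == "__main__":
    for load in RESULTS:
        print(f"{load:13s} -mm3s is {slowdown_percent(load):5.1f}% slower")
        for kernel in RESULTS[load]:
            print(f"    {kernel:12s} implies noload of "
                  f"{implied_noload_seconds(load, kernel):.1f}s")
```

By this arithmetic the io_load compile takes roughly a third longer with shared pagetables and the mem_load compile a bit under a tenth longer, while process_load is within a percent.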

Not a heavy forking load

on
October 12, 2002 - 11:15am

Process_load initially forks its processes (4*num_cpus) and then is heavy on context switching. An attempt at making a fork_load for contest was met with disinterest, so I abandoned it. The equivalence of kernel compile time in process_load with a rise in the number of loads is indicative of a performance improvement, though, as you said.

Machine specs

on
October 12, 2002 - 11:12am

The machine used for the testing is static. As the original aim of contest was to not forget desktop performance, the machine chosen is about average for a recent machine - single CPU, 1133MHz P3 with 256MB RAM, 5400 rpm ATA100 drive.

One question

on
October 12, 2002 - 1:29pm

I never really trusted benchmark programs, and I noticed many people feeling the same way - but it seems that Contest has been well received, and that many people now praise it for its real-world-like results...

Could you possibly be convinced to give us some insight into how Contest works (and doesn't)?

re: One question

on
October 12, 2002 - 1:55pm

Con, if you're interested in sharing this insight, perhaps you could drop me an email first? (jeremy AT kerneltrap.org)

Why? I'd rather put your insight into its own story than have it hidden in a comment... :)

Contest insight

on
October 13, 2002 - 11:42am

For the time being, you can read the info on the contest homepage, http://contest.kolivas.net. Then maybe we can work on something specific for kerneltrap.
