[RFC/PATH 1/2] MM: Make Page Tables Relocatable -- conditional flush

Previous thread: [PATCH] 8250: Switch 8250 drivers to use _nocache ioremaps by Alan Cox on Tuesday, April 29, 2008 - 6:34 am. (1 message)

Next thread: [RFC/PATH 2/2] MM: Make Page Tables Relocatable -- relocation code. by Ross Biro on Tuesday, April 29, 2008 - 6:43 am. (1 message)
From: Ross Biro
Date: Tuesday, April 29, 2008 - 6:42 am

These patches make page tables relocatable for NUMA, memory
defragmentation, and memory hotplug.  The potential need to rewalk the
page tables before making any changes causes a 1.6% performance
degradation in the lmbench page miss micro benchmark.

Page table relocation is critical for several projects.

1) NUMA system process relocation.  Currently, when a process is migrated
from one node to another the page tables are left behind.  This means 
increased latency and inter-node traffic on all page faults.  Migrating the
page tables with the process will be a performance win.

2) Memory hotplug.  Currently memory hotplug cannot move page tables out of
the memory that is about to be removed.  This code is a first step to 
being able to move page tables out of memory that is going to be unplugged.

3) Memory defragmentation. Currently page tables are the largest chunk
of non-moveable memory (needs verification).  By making page tables
relocatable, we can decrease the number of memory fragments and allow for
higher order allocations.  This is important for supporting huge pages 
that can greatly improve performance in some circumstances.

Currently the high level memory relocation code only supports 1 above.
The low level routines can be used to migrate any page table to any
new page.  However, 1 seems to be the best case for correctness
testing and is also the easiest place to hook into existing code, so
that is what is currently supported.

The low level page table relocation routines are generic and will be easy
to use for 2 and 3.

Signed-off-by: rossb@google.com

-----

Major changes from the previous version include more comments and
replacing the semaphore that serialized access to the relocation code
with an integer count of the number of times it has been reentered.  The
latter led to an improvement from a 3% performance loss to a 1.6% performance
loss vs no relocation code (There was also a version change from 2.6.23 to
2.6.25-rc9 since the last benchmark, so some ...
From: Jeremy Fitzhardinge
Date: Wednesday, April 30, 2008 - 10:54 am

So you mean the check to see if there's a migration currently in

 
I've read through this patch a couple of times so far, but I still
don't quite get it.  The "why" rationale is good, but it would be nice
to have a high-level "how" paragraph which explains the overall
principle of operation.  (OK, I think I see how all this fits
together now.)

From looking at it, a few points to note:

- It only tries to move usermode pagetables.  For the most part (at
  least on x86) the kernel pagetables are fairly static (and
  effectively statically allocated), but vmalloc does allocate new
  kernel pagetable memory.
  
  As a consequence, it doesn't need to worry about tlb-flushing global
  pages or unlocked updates to init_mm.

- It would be nice to explain the "delimbo" terminology.  I got it in
  the end, but it took me a while to work out what you meant.

Open questions in my mind:

- How does it deal with migrating the accessed/dirty bits in ptes if
  cpus can be using old versions of the pte for a while after the
  copy?  Losing dirty updates can lose data, so explicitly addressing
  this point in code and/or comments is important.

- Is this deeply incompatible with shared ptes?

- It assumes that each pagetable level is a page in size.  This isn't
  even true on x86 (32-bit PAE pgds are not), and definitely not true
  on other architectures.  It would make sense to skip migrating
  non-page-sized pagetable levels, but the code could/should check for
  it.

- Does it work on 2 and 3-level pagetable systems?  Ideally the clever
  folding stuff would make it all fall out naturally, but somehow that
  never seems to end up working.

- Could you use the existing tlb batching machinery rather than
  MMF_NEED_FLUSH?  They seem to overlap somewhat.

- What architectures does this support?  You change a lot of arch
  files, but it looks to me like you've only implemented this for
  x86-64.  Is that right?  A lot of this patch won't apply to x86 at
  the moment because of ...
From: Ross Biro
Date: Wednesday, April 30, 2008 - 11:40 am

Yup.  But the page fault code is so efficient, that a test and


There are comments in migrate.c on the how.  If they are insufficient,
please indicate what you would like to see.  I've been staring at the


I never liked the delimbo terminology, but it's the best I've been
able to come up with so far.  I'm open to changing it. Otherwise I can

It doesn't currently.  Although it's easy to fix.  Just before the
free, we just have to copy the dirty bits again.  Slow, but not in a

Not deeply.  It just doesn't support them at the moment (although it
doesn't check either.)  It would just need to do all the pmd's


I've never tried to compile it on anything other than a 4 level
system.  I suspect it will fail, but a couple of well placed #ifdef's

As of 2.6.22, I couldn't use any of the existing batching.  They do
overlap, but not 100% and I didn't want to impact the other batching

It currently only supports X86_64.  There are only a couple of missing
things to support other architectures.  The tlb_reload code needs to
be created on all architectures and the node specific page table
allocation code needs to be created.

I'm waiting for the x86 unification to settle out before doing another
merge.  My guess is that it should support all 4 level page table x86
variants at that point.  The 3 level variants will take a little

It's had tons of testing on moving a few simple programs around on a
fake NUMA system.  It's had no testing on a real NUMA system and I

I think it was a simple little macro at one point.  At this point, it

Let me know what you decide to do here.  It shouldn't be too hard to

Not anymore.  It's been replaced by the nesting count and just didn't

When I started, none of the other page table walkers were pure
walkers.  I'll take a look and see if I can find one I can use.
Otherwise this is only really guaranteed to work on 4 level page
tables at this point.  It should

Correct, but I included it for completeness.  We could eliminate it

Mostly I ...
From: Jeremy Fitzhardinge
Date: Wednesday, April 30, 2008 - 12:56 pm

Yeah, it's easy for that to happen.  Those comments are helpful, I was 
thinking very high-level things like:

    * what initiates migration?
    * how can an mm be under multiple levels of migration?
    * ...?


Just a comment saying something like "a pagetable page is considered to 
be in limbo if it has been copied, but may still be in use.  It may be 
either in a cpu's stale tlb entry, or in use by the kernel on another 
cpu with a transient reference." would clarify what the delimbo is 

But the issue I'm concerned about is what happens if a process writes 
the page, causing its cpu to mark the (old, in-limbo) pte dirty.  
Meanwhile someone else is scanning the pagetables looking for things to 
evict.  It checks the (shiny new) pte, finds it not dirty, and decides to 
evict the apparently clean page.
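
To make the race above concrete, here is a minimal userspace sketch. It is not code from the patch: the table size, the helper names (copy_table, cpu_write_via, looks_clean, merge_ad_bits) and the plain uint64_t pte are all assumptions made for illustration; only the bit positions follow the x86 pte layout. It models a pagetable page that has been copied, a cpu that still dirties the old (in-limbo) copy, and the "copy the dirty bits again just before the free" idea mentioned earlier in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PTRS_PER_TABLE 8                    /* hypothetical, tiny for illustration */
#define PTE_PRESENT  ((uint64_t)1 << 0)     /* x86-style bit positions */
#define PTE_ACCESSED ((uint64_t)1 << 5)
#define PTE_DIRTY    ((uint64_t)1 << 6)

typedef uint64_t pte_t;                     /* stand-in for the kernel's pte_t */

/* Relocation step: copy the old pagetable page into its new home. */
static void copy_table(pte_t *new_tbl, const pte_t *old_tbl)
{
    memcpy(new_tbl, old_tbl, PTRS_PER_TABLE * sizeof(pte_t));
}

/* A cpu that still references the given table sets A and D on a write. */
static void cpu_write_via(pte_t *tbl, size_t idx)
{
    assert(idx < PTRS_PER_TABLE);
    tbl[idx] |= PTE_ACCESSED | PTE_DIRTY;
}

/* What a reclaim scan concludes when it looks at one table. */
static bool looks_clean(const pte_t *tbl, size_t idx)
{
    return (tbl[idx] & PTE_DIRTY) == 0;
}

/* Proposed fix: fold A/D bits from the old copy into the new one
 * before the old page is freed. */
static void merge_ad_bits(pte_t *new_tbl, const pte_t *old_tbl)
{
    for (size_t i = 0; i < PTRS_PER_TABLE; i++)
        new_tbl[i] |= old_tbl[i] & (PTE_ACCESSED | PTE_DIRTY);
}
```

If a write goes through the old table after copy_table(), looks_clean() on the new table still returns true until merge_ad_bits() runs. The merge therefore only helps if nothing acts on the new table's dirty bit in between, which is exactly the window being asked about here.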

What, for that matter, stops a page from being evicted from under a 
limboed mapping?  Does it get accounted for (I guess the existing tlb 
flushing should be sufficient to keep it under control).

Also, what happens if a page happens to get migrated twice in quick 
succession (ie, while there's still an in-limbo page from the first 




Well, x86-64 is 4 level, x86-32 PAE is 3 level, and x86-32 non-PAE is 2 
level.  I don't think there'd be too much crying if you didn't support 
32-bit non-PAE, but 32-bit PAE is useful.

I would say that x86 unification is still a fair way from "done", but 

OK.  I was planning on making the change anyway, and this is just 

I wouldn't eliminate it for speed, but it could be misleading if someone 
thought that it would actually do something.  Don't know; no clear 
answer.  Given that delimbo_X are inlines, you could easily put a "if 
(mm == &init_mm)" to skip everything, which would make these cases 
compile to nothing (and "if (__builtin_constant_p(mm) && mm == 
&init_mm)" if you really want to make sure there's no additional 

It's OK for them to be using the old mm while you're migrating because 
they'll be using the ...
From: Ross Biro
Date: Wednesday, April 30, 2008 - 3:15 pm

The delimbo functions can be extended to deal with the dirty bit.
They already have to be called to make sure the cpu is looking at the
proper page flags.  The easiest solution to the races is probably to
make the delimbo pte functions flush the tlb cache to make sure the
cpu will also be looking at the correct entry to update flags.
Otherwise the atomic ptep* functions would probably need to be
modified.
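
For illustration, the flush-in-delimbo idea might look roughly like the userspace sketch below. Everything in it is an assumption rather than the patch's code: struct limbo_table, the tlb_flushes counter, flush_tlb_sim and the fixed TABLE_ENTRIES size are made up for the example, and the real functions would have to deal with locking and with every pagetable level.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TABLE_ENTRIES 8                     /* hypothetical, tiny for illustration */
#define PTE_ACCESSED ((uint64_t)1 << 5)     /* x86-style bit positions */
#define PTE_DIRTY    ((uint64_t)1 << 6)

typedef uint64_t pte_t;

/* Hypothetical stand-in for the kernel's struct mm_struct. */
struct mm_struct {
    int relocation_count;       /* assumed: >0 while a relocation is in flight */
    unsigned long tlb_flushes;  /* counts simulated flushes */
};

/* One pagetable page that has been copied but not yet freed. */
struct limbo_table {
    pte_t *old_tbl;
    pte_t *new_tbl;
};

/* Simulated flush; a real one would force cpus to re-walk the tables. */
static void flush_tlb_sim(struct mm_struct *mm)
{
    mm->tlb_flushes++;
}

/* Return the pte pointer the caller should use. If ptep points into the
 * in-limbo copy, flush, carry A/D bits over, and redirect to the new copy. */
static pte_t *delimbo_pte(struct mm_struct *mm, const struct limbo_table *lt,
                          pte_t *ptep)
{
    uintptr_t p = (uintptr_t)ptep;
    uintptr_t lo = (uintptr_t)lt->old_tbl;
    uintptr_t hi = (uintptr_t)(lt->old_tbl + TABLE_ENTRIES);
    size_t idx;

    if (mm->relocation_count == 0)      /* fast path: nothing in limbo */
        return ptep;
    if (p < lo || p >= hi)              /* not in the old table */
        return ptep;

    idx = (size_t)(ptep - lt->old_tbl);
    assert(idx < TABLE_ENTRIES);
    flush_tlb_sim(mm);                  /* cpus stop updating the old entry */
    lt->new_tbl[idx] |= lt->old_tbl[idx] & (PTE_ACCESSED | PTE_DIRTY);
    return &lt->new_tbl[idx];
}
```

The order matters in this sketch: the flush comes first so that no cpu keeps setting bits in the old entry, and only then are the accessed/dirty bits carried over to the entry the caller goes on to use.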

    Ross
--
