Re: [RFC/PATH 1/2] MM: Make Page Tables Relocatable -- conditional flush

!MAILaRCHIVE_VOTE_RePLACE
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]
To: Ross Biro <rossb@...>
Cc: <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, April 30, 2008 - 1:54 pm

Hi Ross,


So you mean the check to see if there's a migration currently in
progress?  Surely that's a single test+branch?


I would have thought cross-node TLB misses would be a bigger factor.

 
I've read through this patch a couple of times so far, but I still
don't quite get it.  The "why" rationale is good, but it would be nice
to have a high-level "how" paragraph which explains the overall
principle of operation.  (OK, I think I see how all this fits
together now.)

From looking at it, a few points to note:

- It only tries to move usermode pagetables.  For the most part (at
  least on x86) the kernel pagetables are fairly static (and
  effectively statically allocated), but vmalloc does allocate new
  kernel pagetable memory.
  
  As a consequence, it doesn't need to worry about tlb-flushing global
  pages or unlocked updates to init_mm.

- It would be nice to explain the "delimbo" terminology.  I got it in
  the end, but it took me a while to work out what you meant.

Open questions in my mind:

- How does it deal with migrating the accessed/dirty bits in ptes if
  cpus can be using old versions of the pte for a while after the
  copy?  Losing dirty updates can lose data, so explicitly addressing
  this point in code and/or comments is important.

- Is this deeply incompatible with shared ptes?

- It assumes that each pagetable level is a page in size.  This isn't
  even true on x86 (32-bit PAE pgds are not), and definitely not true
  on other architectures.  It would make sense to skip migrating
  non-page-sized pagetable levels, but the code could/should check for
  it.

- Does it work on 2 and 3-level pagetable systems?  Ideally the clever
  folding stuff would make it all fall out naturally, but somehow that
  never seems to end up working.

- Could you use the existing tlb batching machinery rather than
  MMF_NEED_FLUSH?  They seem to overlap somewhat.

- What architectures does this support?  You change a lot of arch
  files, but it looks to me like you've only implemented this for
  x86-64.  Is that right?  A lot of this patch won't apply to x86 at
  the moment because of the pagetable unifications I've been doing.
  Will you be able to adapt it to the unified pagetable code?  Will it
  support all x86 variants in the process?

- How much have you tested it?





This is actually identical for 32 and 64 bit?  Good candidate for unification.






Seems a bit large to be static inline in a header.  Why not just put
it in mm/migrate.c?




A pgd isn't necessarily a page size.  x86-32 PAE its only 64 bytes,
though it happens to allocate a whole page for it at the moment.  I'm
pretty sure other architectures have non-page-sized pgds as well, so
you can't make this assumption in asm-generic/*.



The pgd_list is an x86-specific notion.  And at the moment
pgd_list_add/del aren't even visible outside arch/x86/mm/pgtable.c.


Again, you can't rely on pgd being a page size, so you can't use
page->lru to chain it onto this list.


Unfortunately this won't work well in x86-paravirt_ops; specifically,
under Xen it just plain won't work, and on other virtualization
systems it may just be inefficient.

The problem is that pages which are part of a pagetable must be
treated specially because the hypervisor needs to maintain a guest to
host page mapping, so pagetable entries need to be translated in some
way.  This is managed with a series of hooks to let the hypervisor
have (and change) the information it needs.

Normally this is dealt with in activate_mm and arch_exit_mmap.  These
operate on whole mm-s because its assumed that the lifetime of a pgd
is the same as the lifetime of its mm.  I guess we'd need new hooks to
manage the lifetime of pgds explicitly.  (Hm, though if you switch to
init_mm, update mm->pgd, and then re-activate it, that may just work.)

On the other hand, I've got plans to change the way Xen manages pgds
which would alleviate this problem and allow this code to work as-is,
but it would still require careful handling of the other pagetable
levels (update, which look mostly ok already).


Does this get used anywhere?





enum?



Hm, another pagetable walker.  Sure this one is necessary?  Or does it
replace one of the others?  Is it guaranteed to work on 2, 3 and 4
level pagetables?


Unified now.


Needed?



I think you're never migrating anything in init_mm, so this should be a no-op, right?



Why the indirection?  Do you expect to be passing another function in
here at some point?


latter


A number of places in the kernel update init_mm without holding any
locks.  But I guess since you restrict yourself to the usermode parts
of the pagetable, this is OK.


flush


thrashing

Why not switch to init_mm, do all the migrations on the target mm,
then switch back and get all the other cpus to do a reload/flush?
Wouldn't that achieve the same effect?


So you're saying that you've copied the pte pages, updated the
pagetable to point to them, but the cpu could still have the old
pagetable state in its tlb.

How do you migrate the accessed/dirty state from the old ptes to the
new one?  Losing accessed isn't a huge problem, but losing dirty can
cause data loss.


Elsewhere you use
	  if (ptl != &mm->page_table_lock)
	     ...

which seems cleaner than the #if.


optl == outer_ptl?


So really its migrate_pgd_entry?  It migrates a single thing that a
pgd entry points to?


A pud isn't necessarily a page size either.  I don't think you can
assume that any pagetable level has page-sized elements, though I
guess those levels will necessarily be non-migratable.


As above: a pud isn't necessarily a page.  Also, you need to
specifically deallocate it as a pud to make sure the page is free for
generally useful again (but not until you're sure there are no
lingering users on all cpus).  I think think means you need to queue a
(type, page) tuple on your old_pages list so they can be deallocated
properly.


Good (this will do the right thing in paravirt_ops).


Ditto migrate_pgd.


This seems like a nicer idiom than the "#if NR_CPUS >=
CONFIG_SPLIT_PTLOCK_CPUS" you used in _delimbo_pte.


You need to use the appropriate pgd/pud/pmd/pte_free() function here.
There may be other things needing to be done to the page before it can
be released back into the general kernel page pool.



--
Previous message: [thread] [date] [author]
Next message: [thread] [date] [author]

Messages in current thread:
Re: [RFC/PATH 1/2] MM: Make Page Tables Relocatable -- condi..., Jeremy Fitzhardinge, (Wed Apr 30, 1:54 pm)
Re: [RFC/PATH 1/2] MM: Make Page Tables Relocatable -- condi..., Jeremy Fitzhardinge, (Wed Apr 30, 3:56 pm)