Re: Populating multiple ptes at fault time

From: Jeremy Fitzhardinge
Date: Wednesday, September 17, 2008 - 10:47 am

Avi and I were discussing whether we should populate multiple ptes at
pagefault time, rather than one at a time as we do now.

When Linux is operating as a virtual guest, pte population will
generally involve some kind of trap to the hypervisor, either to
validate the pte contents (in Xen's case) or to update the shadow
pagetable (kvm).  This is relatively expensive, and it would be good to
amortise the cost by populating multiple ptes at once.

Xen and kvm already batch pte updates where multiple ptes are explicitly
updated at once (mprotect and unmap, mostly), but in practice that's
relatively rare.  Most pages are demand faulted into a process one at a
time.

It seems to me there are two cases: major faults, and minor faults:

Major faults: the page in question is physically missing, and so the
fault invokes IO.  If we blindly pull in a lot of extra pages that are
never used, then we'll end up wasting a lot of memory.  However, page at
a time IO is pretty bad performance-wise too, so I guess we do clustered
fault-time IO?  If we can distinguish between random and linear fault
patterns, then we can use that as a basis for deciding how much
speculative mapping to do.  Certainly, we should create mappings for any
nearby page which does become physically present.
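
As a rough illustration of telling linear from random fault patterns,
here is a minimal user-space sketch.  It is not the kernel's readahead
logic; struct fault_tracker, fault_window() and MAX_WINDOW are
hypothetical names, and the doubling policy and the cap of 16 are
assumptions chosen only to show the shape of the heuristic.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WINDOW 16u  /* arbitrary cap, chosen for illustration only */

/* Hypothetical per-mapping state; not a kernel structure. */
struct fault_tracker {
    unsigned long next_expected; /* where a linear reader would fault next */
    unsigned int  window;        /* extra pages mapped on the last fault */
    int           primed;        /* non-zero once one fault has been seen */
};

/*
 * Record a fault at page offset pgoff and return how many extra pages
 * to map speculatively.  A fault that lands exactly where a linear scan
 * would land doubles the window; anything else resets it to zero.
 */
unsigned int fault_window(struct fault_tracker *t, unsigned long pgoff)
{
    assert(t != NULL);

    if (t->primed && pgoff == t->next_expected) {
        t->window = t->window ? t->window * 2 : 1;
        if (t->window > MAX_WINDOW)
            t->window = MAX_WINDOW;
    } else {
        t->window = 0;  /* looks random: map only the faulting page */
    }

    t->primed = 1;
    /* We map pgoff..pgoff+window, so a linear reader faults just past it. */
    t->next_expected = pgoff + t->window + 1;
    return t->window;
}
```

With a zero-initialised tracker, faults at page offsets 100, 101, 103
and 106 return 0, 1, 2 and 4 extra pages; a jump to an unrelated offset
drops the window back to 0.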

Minor faults are easier; if the page already exists in memory, we should
just create mappings to it.  If neighbouring pages are also already
present, then we can cheaply create mappings for them too.
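
A minimal sketch of the minor-fault case, modelled in user space with
two boolean arrays standing in for "page is resident" and "pte is
populated".  map_around() and its parameters are hypothetical; the real
fault path has locking, permissions and vma bounds that are ignored
here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Handle a minor fault at fault_idx: map the faulting page, then map up
 * to max_extra following pages that are already resident.  Pages that
 * are not resident are skipped, since mapping them would need IO or
 * allocation.  Returns the number of ptes populated by this call.
 */
size_t map_around(const bool *resident, bool *mapped, size_t npages,
                  size_t fault_idx, size_t max_extra)
{
    /* Minor fault by definition: the faulting page is already in memory. */
    assert(fault_idx < npages && resident[fault_idx]);

    size_t populated = 0;

    if (!mapped[fault_idx]) {
        mapped[fault_idx] = true;
        populated++;
    }

    for (size_t i = fault_idx + 1;
         i < npages && i - fault_idx <= max_extra; i++) {
        if (resident[i] && !mapped[i]) {
            mapped[i] = true;  /* cheap: no IO, no allocation */
            populated++;
        }
    }
    return populated;
}
```

All ptes populated by one call could then be handed to the hypervisor
as a single batch rather than one trap each.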


This seems like an obvious idea, so I'm wondering if someone has
prototyped it already to see what effects there are.  In the native
case, pte updates are much cheaper, so perhaps it doesn't help much
there, though it would potentially reduce the number of faults needed. 
But I think there's scope for measurable benefits in the virtual case.

Thanks,
    J
--

From: Rik van Riel
Date: Wednesday, September 17, 2008 - 11:28 am

On Wed, 17 Sep 2008 10:47:30 -0700

This is especially true for mmaped files, where we do not have to
allocate anything to create the mapping.

Populating multiple PTEs at a time is questionable for anonymous
memory, where we'd have to allocate extra pages.

-- 
All rights reversed.
--

From: Jeremy Fitzhardinge
Date: Wednesday, September 17, 2008 - 2:47 pm

It might be worthwhile if the memory access pattern to anonymous memory
is linear.  I agree that speculatively allocating pages on a random
access region would be a bad idea.

    J
--

From: Chris Snook
Date: Wednesday, September 17, 2008 - 1:02 pm

We already have rather well-tested code in the VM to detect fault patterns, 
complete with userspace hints to set readahead policy.  It seems to me that if 
we're going to read nearby pages into pagecache, we might as well actually map 

If we're mapping pagecache, then sure, this is really cheap, but speculatively 

Sounds like something we might want to enable conditionally on the use of pv_ops 
features.

-- Chris
--

From: Jeremy Fitzhardinge
Date: Wednesday, September 17, 2008 - 2:45 pm

No, nested pagetables are the same as native to update, so the main

Right, that was my point.  I'm assuming that that machinery already

OK, makes sense.  Does the access pattern detecting code measure access

Perhaps, but I'd rather avoid it.  I'm hoping this is something we could
do that has - at worst - no effect on the native case, while improving
the virtual case.  The test matrix is already large enough without
adding another stateful switch.  After all, any side effect which makes
it a bad idea for the native case will probably be bad enough to
overwhelm any benefit in the virtual case.

    J
--

From: Jeremy Fitzhardinge
Date: Thursday, September 18, 2008 - 11:53 am

Thanks, that was exactly what I was hoping to see.  I didn't see any
definitive statements against the patch set, other than a concern that
it could make things worse.  Was the upshot that no consensus was
reached about how to detect when it's beneficial to preallocate anonymous
pages?

Martin, in that thread you mentioned that you had tried pre-populating
file-backed mappings as well: "Mmmm ... we tried doing this before for
filebacked pages by sniffing the pagecache, but it crippled forky
workloads (like kernel compile) with the extra cost in zap_pte_range,
etc."

Could you describe, or have a pointer to, what you tried and how it
turned out?  Did you end up populating so many (unused) ptes that
zap_pte_range needed to do lots more work?

Christoph (and others): do you think vm changes in the last 4 years
would have changed the outcome of these results?


Thanks,
    J
--

From: Christoph Lameter
Date: Thursday, September 18, 2008 - 12:39 pm

There were multiple discussions on the subject. The consensus was that it was
difficult to generalize this and it would only work on special loads. Plus it

Seems that the code today is similar. So it would still work.
--

From: KOSAKI Motohiro
Date: Thursday, September 18, 2008 - 3:21 pm

But at that time, large x86_64 servers didn't exist yet.
I think measuring again is valuable because the typical server
environment has changed since then.



--

From: Martin Bligh
Date: Thursday, September 18, 2008 - 1:52 pm

Don't have the patches still, but it was fairly simple - just faulted in
the next 3 pages whenever we took a fault, if the pages were already
in pagecache. I would have thought that was pretty lightweight and

Yup, basically you're assuming good locality of reference, but it turns
out that (as davej would say) "userspace sucks".
--

From: Chris Snook
Date: Thursday, September 18, 2008 - 1:53 pm

Well, *most* userspace sucks.  It might still be worthwhile to do this when 
userspace is using madvise().
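
For reference, a minimal sketch of how userspace passes such a hint
with madvise().  mmap(), madvise() and the MADV_* constants are the
standard interfaces; map_hinted() is a hypothetical helper, and an
anonymous mapping is used only to keep the example self-contained.
Whether the kernel would also use the hint to decide on speculative pte
population is the open question here, not existing behaviour.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and attach an access-pattern hint
 * such as MADV_SEQUENTIAL or MADV_RANDOM.
 * Returns the mapping, or NULL on failure.
 */
void *map_hinted(size_t len, int advice)
{
    assert(len > 0);

    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Tell the kernel how we intend to touch this range. */
    if (madvise(p, len, advice) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

A caller that scans a buffer front to back would use
map_hinted(len, MADV_SEQUENTIAL); one doing hash-table style lookups
would pass MADV_RANDOM instead.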

-- Chris
--

From: Martin Bligh
Date: Thursday, September 18, 2008 - 2:11 pm

Quite possibly true ... something to benchmark.
--

From: Christoph Lameter
Date: Thursday, September 18, 2008 - 2:13 pm

Well, I guess we need a new binary format that allows one to execute binaries
in kernel address space with full powers.


--

From: Martin Bligh
Date: Thursday, September 18, 2008 - 2:21 pm

Seems ... extreme ;-)
Maybe we just do it if we're in readahead? (or similar)
--

From: Christoph Lameter
Date: Thursday, September 18, 2008 - 2:32 pm

If we are in kernel space then the binary can call the readahead function as
needed ... ;-O

Ok, seriously: Anonymous pages are not subject to readahead so it won't work.


--

From: MinChan Kim
Date: Thursday, September 18, 2008 - 2:49 pm

On Fri, Sep 19, 2008 at 6:32 AM, Christoph Lameter

In the case of file-mapped pages, shouldn't we just use the on-demand
readahead mechanism in the kernel?
If it is inefficient, that means we have to change the on-demand
readahead mechanism itself.



-- 
Kind regards,
MinChan Kim
--

From: Christoph Lameter
Date: Thursday, September 18, 2008 - 2:58 pm

Right.

My patches were only for anonymous pages not for file backed because readahead
is available for file backed mappings.

--

From: Martin Bligh
Date: Thursday, September 18, 2008 - 3:08 pm

Do we populate the PTEs though? I didn't think that was batched, but I
might well be wrong.
--

From: Christoph Lameter
Date: Thursday, September 18, 2008 - 3:11 pm

We do not populate the PTEs and AFAICT PTE population was assumed not to be
performance critical since the backing media is comparatively slow.

--

From: Martin Bligh
Date: Thursday, September 18, 2008 - 3:18 pm

I think the times when this matters are things like glibc, which are
heavily shared - we were only 'prefaulting' when the pagecache was
already there. So it's a case for a "readahead like algorithm", not
necessarily a direct hook.

Anonymous pages seem much riskier, as presumably there's no backing page
except in the fork case.

I presume the reason Jeremy is interested is because his pagefaults are
more expensive than most (under virtualization), so he may well find a
different tradeoff than I did (try running kernbench?)
--

From: Jeremy Fitzhardinge
Date: Thursday, September 18, 2008 - 3:22 pm

Yes.  My thought was that there should be very little cost to
opportunistically populating the pte for a page which is already

Right.  The faults themselves are more or less the same as the native
case, but setting a pte requires a hypercall compared to a memory write
in the native case.  But I can set any number of ptes in one hypercall,
so batching would amortize the cost.
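
To make the amortization concrete, here is a small user-space model of
a batched update queue.  It does not use any real Xen or kvm interface:
struct pte_batch, batch_set_pte() and batch_flush() are hypothetical,
and the "hypercall" is just a counter incremented once per flush.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BATCH_MAX 32  /* arbitrary queue capacity for this sketch */

struct pte_update {
    uint64_t *ptep;  /* pte slot to write */
    uint64_t  val;   /* new pte value */
};

struct pte_batch {
    struct pte_update pending[BATCH_MAX];
    size_t        count;       /* queued, not yet applied */
    unsigned long hypercalls;  /* traps taken so far (model only) */
};

/* Stand-in for one hypercall: apply every queued update in a single trap. */
void batch_flush(struct pte_batch *b)
{
    assert(b != NULL);
    if (b->count == 0)
        return;
    for (size_t i = 0; i < b->count; i++)
        *b->pending[i].ptep = b->pending[i].val;
    b->hypercalls++;
    b->count = 0;
}

/* Queue a pte update; only trap when the queue is full. */
void batch_set_pte(struct pte_batch *b, uint64_t *ptep, uint64_t val)
{
    assert(b != NULL && ptep != NULL);
    if (b->count == BATCH_MAX)
        batch_flush(b);
    b->pending[b->count].ptep = ptep;
    b->pending[b->count].val = val;
    b->count++;
}
```

Queuing eight ptes and flushing once costs one trap in this model,
where setting them one at a time would cost eight.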

    J
--

From: Chris Snook
Date: Thursday, September 18, 2008 - 3:23 pm

Perhaps we should.  In a virtual guest, the backing media is often an emulated 
IDE device, or something similarly inefficient, such that the bottleneck is the CPU.

-- Chris
--

From: MinChan Kim
Date: Thursday, September 18, 2008 - 4:16 pm

In embedded environments, many people use NAND-like devices as storage.
The read cost of a NAND-like device is lower than that of IDE.
Also, embedded systems are nowadays gradually moving to multi-core.



-- 
Kind regards,
MinChan Kim
--

From: Avi Kivity
Date: Wednesday, September 17, 2008 - 3:02 pm

One problem is the accessed bit.  If it's unset, the shadow code cannot 
make the pte present (since it has to trap in order to set the accessed 
bit); if it's set, we're lying to the vm.

This doesn't affect Xen, only kvm.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Jeremy Fitzhardinge
Date: Wednesday, September 17, 2008 - 3:30 pm

(Just to clarify an ambiguity here: by "present" I mean "exists in

So even if the guest pte were present but non-accessed, the shadow pte
would have to be non-present and you'd end up taking the fault anyway?

Hm, that does undermine the benefits.  Does that mean that when the vm
clears the access bit, you always have to make the shadow non-present? 
I guess so.  And similarly with dirty and writable shadow.

The counter-argument is that something has gone wrong if we start
populating ptes that aren't going to be used in the near future anyway -
if they're never used then any effort taken to populate them is wasted. 
Therefore, setting accessed on them from the outset isn't terribly bad.

(I'm not very convinced by that argument either, and it makes the
potential for bad side-effects much worse if the apparent RSS of a
process is multiplied by some factor.)

    J
--

From: Avi Kivity
Date: Wednesday, September 17, 2008 - 3:47 pm

We don't know whether the page will be used or not.  Keeping the 
accessed bit clear allows the vm to reclaim it early, and in preference 
to the pages it actually used.

We could work around it by having a hypercall to read and clear accessed 
bits.  If we know the guest will only do that via the hypercall, we can 
keep the accessed (and dirty) bits in the host, and not update them in 
the guest at all.  Given good batching, there's potential for a large 
win there.

(If the host throws away a shadow page, it could sync the bits back into 
the guest pte for safekeeping)
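
A minimal model of that idea, assuming the x86 bit positions for
accessed (bit 5) and dirty (bit 6).  Both functions are hypothetical
host-side helpers, not existing kvm code: one is the body of a batched
"read and clear accessed" hypercall, the other folds the bits back into
the guest ptes when a shadow page is thrown away.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED 0x20ULL  /* x86 accessed bit (bit 5) */
#define PTE_DIRTY    0x40ULL  /* x86 dirty bit (bit 6) */

/*
 * Hypothetical hypercall body: for n shadow ptes, report in young[]
 * whether the accessed bit was set, then clear it.  Returns the number
 * of ptes that were young.  One trap covers the whole range.
 */
size_t hc_read_clear_accessed(uint64_t *shadow, size_t n, uint8_t *young)
{
    assert(shadow != NULL && young != NULL);

    size_t hits = 0;
    for (size_t i = 0; i < n; i++) {
        young[i] = (shadow[i] & PTE_ACCESSED) != 0;
        hits += young[i];
        shadow[i] &= ~PTE_ACCESSED;
    }
    return hits;
}

/*
 * When the host discards a shadow page, copy any A/D bits it holds
 * into the guest ptes so the information is not lost.
 */
void shadow_sync_back(const uint64_t *shadow, uint64_t *guest, size_t n)
{
    assert(shadow != NULL && guest != NULL);
    for (size_t i = 0; i < n; i++)
        guest[i] |= shadow[i] & (PTE_ACCESSED | PTE_DIRTY);
}
```

The guest ptes stay untouched until the sync-back, which is why the
guest must agree to query the bits only through the hypercall.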

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Jeremy Fitzhardinge
Date: Wednesday, September 17, 2008 - 4:02 pm

We added a hypercall to update just the AD bits, though it was primarily
to update D without losing the hardware-set A bit.

I don't think it would be practical to add a hypercall to read the A
bit.  There's too much code which just assumes it can grab a pte and
test the bit state.  There's no pv_op for reading a pte in general, and
even if there were you'd need to have a specialized pv-op for
specifically reading the A bit to avoid unnecessary hypercalls.

Setting/clearing the A bit could be done via the normal set_pte pv_op,
so that's not a big deal.

Do you need to set the A bit synchronously?  What happens if you install
the guest and shadow pte with A clear, and then lazily transfer the A
bit state from the shadow to guest pte?  Maybe at some significant event


    J
--

From: Avi Kivity
Date: Thursday, September 18, 2008 - 1:26 pm

(potential victim cc'ed)


I didn't think so much code would be interested in the accessed bit.  I 
can think of

 - pte teardown (to mark the page accessed)
 - scanning the active list


I'll fail my own unit tests.

If we add an async mode for guests that can cope, maybe this is 
workable.  I guess this is what you're suggesting.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Jeremy Fitzhardinge
Date: Thursday, September 18, 2008 - 3:18 pm

Is the A bit architecturally guaranteed to be synchronously set?  Can

Yes.  At worst Linux would underestimate the process RSS a bit
(depending on how many unsynchronized ptes you leave lying around).  I
bet there's an appropriate pvop hook you could use to force
synchronization just before the kernel actually inspects the bits
(leaving lazy mode sounds good).

    J
--

From: Avi Kivity
Date: Thursday, September 18, 2008 - 4:38 pm

I believe so.  The cpu won't cache tlb entries with the A bit clear 


Not the RSS (that's pte.present pages) but the working set (aka active 

It would have to be a new lazy mode, not the existing one, I think.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Jeremy Fitzhardinge
Date: Thursday, September 18, 2008 - 5:00 pm

The only direct use of pte_young() is in zap_pte_range, within a
mmu_lazy region.  So syncing the A bit state on entering lazy mmu mode
would work fine there.
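
A sketch of that ordering, again as a user-space model.  The sketch_*
names are hypothetical and only mirror the roles of the lazy mmu hooks
and pte_young(); in a real system the sync would be done by the
hypervisor when the guest announces the lazy region, and PTE_ACCESSED
assumes the x86 bit layout.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED 0x20ULL  /* x86 accessed bit (bit 5) */

/* Hypothetical shared state: guest ptes and the shadow ptes behind them. */
struct sketch_mmu {
    uint64_t       *guest;   /* what the kernel reads */
    const uint64_t *shadow;  /* where the hardware actually sets A */
    size_t          n;
    int             lazy;    /* non-zero while inside a lazy region */
};

/* Entering the lazy region pulls pending A bits into the guest ptes. */
void sketch_enter_lazy_mmu(struct sketch_mmu *m)
{
    assert(m != NULL && !m->lazy);
    for (size_t i = 0; i < m->n; i++)
        m->guest[i] |= m->shadow[i] & PTE_ACCESSED;
    m->lazy = 1;
}

void sketch_leave_lazy_mmu(struct sketch_mmu *m)
{
    assert(m != NULL && m->lazy);
    m->lazy = 0;
}

/* Accessed-bit test on a guest pte value. */
int sketch_pte_young(uint64_t pte)
{
    return (pte & PTE_ACCESSED) != 0;
}
```

Code that tests the bit between enter and leave sees up-to-date state;
code that tests it outside such a region may see a stale, clear A bit.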

The call via page_referenced_one() doesn't seem to have a very
convenient hook though.  Perhaps putting something in
page_check_address() would do the job.

    J
--

From: Avi Kivity
Date: Thursday, September 18, 2008 - 5:20 pm

Why there?

Why not explicitly in the callers?  We need more than to exit lazy pte.a 
mode, we also need to enter it again later.

-- 
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.

--

From: Jeremy Fitzhardinge
Date: Thursday, September 18, 2008 - 5:42 pm

Well, sort of but not quite.  The kernel's announcing it's about to start
processing a batch of ptes, so the hypervisor can take the opportunity
to update their state before processing.  "Lazy-mode" is from the
perspective of the kernel lazily updating some state the hypervisor
might care about, and the sync happens when leaving mode.

The flip-side is when the hypervisor is lazily updating some state the
kernel cares about, so it makes sense that the sync happens when the kernel
enters its lazy mode.  But the analogy isn't very good because we don't
really have an explicit notion of "hypervisor lazy mode", or a formal
handoff of shared state between the kernel and hypervisor.  But in this

Because that's the code that actually walks the pagetable and has the
address of the pte; it just returns a pte_t, not a pte_t *.  It depends
on whether you want to fetch the A bit via ptep or vaddr (in general we
pass mm, ptep and vaddr to ops which operate on the current pagetable).

    J
--

From: Avi Kivity
Date: Wednesday, September 24, 2008 - 5:31 am

Handwavy.  I think the two notions are separate <insert handwavy 

pte_clear_flush_young_notify_etc() seems even closer.

-- 
error compiling committee.c: too many arguments to function

--

From: Jeremy Fitzhardinge
Date: Thursday, September 25, 2008 - 11:32 am

Perhaps this helps:

Context switches between guest<->hypervisor are relatively expensive. 
The more work we can make each context switch perform the better,
because we can amortize the cost.  Rather than synchronously switching
between the two every time one wants to express a state change to the
other, we batch those changes up and only sync when it's important.
While there are batched outstanding changes in one, the other will have
a somewhat out of date view of the state.  At this level, the idea of
batching is completely symmetrical.

One of the ways we amortize the cost of guest->hypervisor transitions is
by batching multiple pagetable updates together.  This works at two
levels: within explicit arch_enter/leave_lazy_mmu lazy regions, and also
because it is analogous to the architectural requirement that you must
flush the tlb before updates "really" happen.

KVM - and other shadow pagetable implementations - have the additional
problem of transmitting A/D state updates from the shadow pagetable into
the guest pagetable.  Doing this synchronously has the costs we've been
discussing in this thread (namely, extra faults we would like to
avoid).  Doing this in a deferred or batched way is awkward because
there's no analogous architectural asynchrony in updating these pte
flags, and we don't have any existing mechanisms or hooks to support
this kind of deferred update.

However, given that we're talking about cleaning up the pagetable api
anyway, there's no reason we couldn't incorporate this kind of deferred
update in a more formal way.  It definitely makes sense when you have
shadow pagetables, and it probably makes sense on other architectures too.

Very few places actually care about the state of the A/D bits; would it
be expensive to make those places explicitly ask for synchronization
before testing the bits (or alternatively, have an explicit query
operation rather than just poking about in the ptes)?  Martin, does this
help with s390's per-page (vs per-pte) A/D ...
--

From: Martin Schwidefsky
Date: Friday, September 26, 2008 - 3:26 am

With the kvm support the situation on s390 recently has grown a tad more
complicated. We now have dirty bits in the per-page storage key and in
the pgste (page table entry extension) for the kvm guests. For the A/D
bits in the storage key the new pte operations won't help, for the kvm
related bits they could make a difference.

-- 
blue skies,
  Martin.

"Reality continues to ruin my life." - Calvin.


--

From: Benjamin Herrenschmidt
Date: Friday, September 19, 2008 - 10:45 am

Other archs too. On powerpc, !accessed -> not hashed (or not in the TLB
for SW loaded TLB platforms). 

Ben.


--

From: MinChan Kim
Date: Wednesday, September 17, 2008 - 4:50 pm

Hi, all

I have been thinking about this idea for the native case.
I didn't consider it for minor page faults.
As you know, they are much cheaper than major faults.
However, page faults are one of the big bottlenecks in a demand-paging system.
I think major faults might be a rather big overhead on many-core systems.

What do you think about this idea for the native case?
Do you really think that it doesn't help much there?

If I implement it for native, what kinds of benchmarks do I need?
Could you recommend any?





-- 
Kind regards,
MinChan Kim
--

From: KOSAKI Motohiro
Date: Wednesday, September 17, 2008 - 11:58 pm

I guess it is also useful for native.
So if you post a patch and benchmark results, I'll review them with pleasure.




--

From: KAMEZAWA Hiroyuki
Date: Thursday, September 18, 2008 - 12:26 am

On Thu, 18 Sep 2008 08:50:05 +0900
Hmm, is enlarging the page size for anonymous pages more difficult?

Testing some kinds of scripts (shell, perl, etc.) is a good candidate.

I use unixbench's exec/shell test to see the charge/uncharge overhead of
the memory resource controller, which happens at major page fault.

Thanks,
-Kame

--
