Avi and I were discussing whether we should populate multiple ptes at
pagefault time, rather than one at a time as we do now.
When Linux is operating as a virtual guest, pte population will
generally involve some kind of trap to the hypervisor, either to
validate the pte contents (in Xen's case) or to update the shadow
pagetable (kvm). This is relatively expensive, and it would be good to
amortise the cost by populating multiple ptes at once.
Xen and kvm already batch pte updates where multiple ptes are explicitly
updated at once (mprotect and unmap, mostly), but in practice that's
relatively rare. Most pages are demand faulted into a process one at a
time.
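To make the amortisation concrete, here is a minimal user-space sketch in
C. It is not Xen or kvm code: the queue depth BATCH_MAX, the struct names
and the "traps" counter are all assumptions made for illustration, and
batch_flush() merely stands in for a single trap to the hypervisor.

```c
#include <stddef.h>

#define BATCH_MAX 32    /* assumed queue depth, not taken from Xen or kvm */

struct pte_update {
    unsigned long *ptep;    /* entry to write */
    unsigned long val;      /* new contents */
};

struct pte_batch {
    struct pte_update q[BATCH_MAX];
    size_t n;               /* updates queued but not yet applied */
    unsigned long traps;    /* simulated hypervisor entries */
};

/* Stand-in for one trap to the hypervisor: applies every queued update. */
static void batch_flush(struct pte_batch *b)
{
    if (b->n == 0)
        return;
    for (size_t i = 0; i < b->n; i++)
        *b->q[i].ptep = b->q[i].val;
    b->traps++;
    b->n = 0;
}

/* Queue one update; trap only when the queue is already full. */
static void batch_set_pte(struct pte_batch *b, unsigned long *ptep,
                          unsigned long val)
{
    if (b->n == BATCH_MAX)
        batch_flush(b);
    b->q[b->n].ptep = ptep;
    b->q[b->n].val = val;
    b->n++;
}

/* Populate npte entries and return how many simulated traps it took. */
static unsigned long populate(unsigned long *table, size_t npte, int batched)
{
    struct pte_batch b = { .n = 0, .traps = 0 };

    for (size_t i = 0; i < npte; i++) {
        batch_set_pte(&b, &table[i], (unsigned long)(i + 1));
        if (!batched)
            batch_flush(&b);    /* one trap per pte, as at fault time */
    }
    batch_flush(&b);            /* apply whatever is still queued */
    return b.traps;
}
```

With these assumed numbers, populating 64 entries costs 64 simulated
traps one at a time and 2 when batched.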
It seems to me there are two cases: major faults, and minor faults:
Major faults: the page in question is physically missing, and so the
fault invokes IO. If we blindly pull in a lot of extra pages that are
never used, then we'll end up wasting a lot of memory. However, page at
a time IO is pretty bad performance-wise too, so I guess we do clustered
fault-time IO? If we can distinguish between random and linear fault
patterns, then we can use that as a basis for deciding how much
speculative mapping to do. Certainly, we should create mappings for any
nearby page which does become physically present.
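One way to tell linear from random fault patterns can be sketched as
below. This is a toy heuristic in user-space C, not the kernel's
readahead logic: struct fault_state, fault_window() and the WINDOW_MAX
cap are hypothetical names and values. The idea is that if the previous
fault mapped "window" extra pages, a linear scan will next fault exactly
one page past that window, so the window is doubled; any other fault
address resets speculation to zero.

```c
#define WINDOW_MAX 16   /* assumed cap on speculatively mapped pages */

struct fault_state {
    unsigned long last_pgoff;   /* page offset of the previous fault */
    unsigned int window;        /* extra pages mapped at that fault */
    int have_last;              /* zero until the first fault is seen */
};

/* Returns how many extra pages to populate for a fault at pgoff. */
static unsigned int fault_window(struct fault_state *s, unsigned long pgoff)
{
    if (s->have_last && pgoff == s->last_pgoff + s->window + 1) {
        /* Landed just past what was mapped last time: looks linear. */
        s->window = s->window ? s->window * 2 : 1;
        if (s->window > WINDOW_MAX)
            s->window = WINDOW_MAX;
    } else {
        /* First fault or a jump elsewhere: do not speculate. */
        s->window = 0;
    }
    s->last_pgoff = pgoff;
    s->have_last = 1;
    return s->window;
}
```

Starting from a zeroed state, faults at page offsets 100, 101, 103 and
106 yield windows of 0, 1, 2 and 4; a subsequent fault at 5000 drops the
window back to 0.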
Minor faults are easier; if the page already exists in memory, we should
just create mappings to it. If neighbouring pages are also already
present, then we can cheaply create mappings for them too.
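The minor-fault case can be sketched as follows, as a user-space model
rather than kernel code. The two byte arrays stand in for "page is
already in memory" and "pte is populated"; map_around() and its
max_extra parameter are hypothetical. The walk stops at the first page
that is not present, so it never triggers IO or allocation
speculatively.

```c
#include <stddef.h>

/*
 * Handle a minor fault at page idx: map it, plus up to max_extra
 * following pages that are already present. Returns the number of
 * ptes newly populated (0 if idx itself is not present, which would
 * be a major fault and is not handled here).
 */
static size_t map_around(const unsigned char *present, unsigned char *mapped,
                         size_t npages, size_t idx, size_t max_extra)
{
    size_t done = 0;

    if (idx >= npages || !present[idx])
        return 0;

    for (size_t i = idx; i < npages && i - idx <= max_extra; i++) {
        if (!present[i])
            break;          /* stop rather than fault the page in */
        if (!mapped[i]) {
            mapped[i] = 1;  /* stands in for setting the pte */
            done++;
        }
    }
    return done;
}
```

All the ptes populated by one call could then be handed to the
hypervisor as a single batch.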
This seems like an obvious idea, so I'm wondering if someone has
prototyped it already to see what effects there are. In the native
case, pte updates are much cheaper, so perhaps it doesn't help much
there, though it would potentially reduce the number of faults needed.
But I think there's scope for measurable benefits in the virtual case.
Thanks,
J
--
On Wed, 17 Sep 2008 10:47:30 -0700
This is especially true for mmaped files, where we do not have to
allocate anything to create the mapping.
Populating multiple PTEs at a time is questionable for anonymous
memory, where we'd have to allocate extra pages.
--
All rights reversed.
--
It might be worthwhile if the memory access pattern to anonymous memory
is linear. I agree that speculatively allocating pages on a random
access region would be a bad idea.
J
--
We already have rather well-tested code in the VM to detect fault
patterns, complete with userspace hints to set readahead policy. It
seems to me that if we're going to read nearby pages into pagecache, we
might as well actually map
If we're mapping pagecache, then sure, this is really cheap, but
speculatively
Sounds like something we might want to enable conditionally on the use
of pv_ops features.
-- Chris
--
No, nested pagetables are the same as native to update, so the main
Right, that was my point. I'm assuming that that machinery already
OK, makes sense. Does the access pattern detecting code measure access
Perhaps, but I'd rather avoid it. I'm hoping this is something we could
do that has - at worst - no effect on the native case, while improving
the virtual case. The test matrix is already large enough without
adding another stateful switch. After all, any side effect which makes
it a bad idea for the native case will probably be bad enough to
overwhelm any benefit in the virtual case.
J
--
I had a patch like that a couple of years back but it was not accepted.
http://www.kernel.org/pub/linux/kernel/people/christoph/prefault/
http://readlist.com/lists/vger.kernel.org/linux-kernel/14/70942.html
http://www.ussg.iu.edu/hypermail/linux/kernel/0503.1/1292.html
--
Thanks, that was exactly what I was hoping to see. I didn't see any
definitive statements against the patch set, other than a concern that
it could make things worse. Was the upshot that no consensus was
reached about how to detect when it's beneficial to preallocate anonymous
pages?
Martin, in that thread you mentioned that you had tried pre-populating
file-backed mappings as well: "Mmmm ... we tried doing this before for
filebacked pages by sniffing the pagecache, but it crippled forky
workloads (like kernel compile) with the extra cost in zap_pte_range,
etc."
Could you describe, or have a pointer to, what you tried and how it
turned out? Did you end up populating so many (unused) ptes that
zap_pte_range needed to do lots more work?
Christoph (and others): do you think vm changes in the last 4 years
would have changed the outcome of these results?
Thanks,
J
--
There were multiple discussions on the subject. The consensus was that
it was difficult to generalize this and it would only work on special
loads. Plus it
Seems that the code today is similar. So it would still work.
--
But at that time, large x86_64 servers didn't exist yet. I think
measuring again would be valuable, because the typical server
environment has changed since then.
--
Don't have the patches still, but it was fairly simple - just faulted in
the next 3 pages whenever we took a fault, if the pages were already in
pagecache. I would have thought that was pretty lightweight and
Yup, basically you're assuming good locality of reference, but it turns
out that (as davej would say) "userspace sucks".
--
Well, *most* userspace sucks. It might still be worthwhile to do this when userspace is using madvise(). -- Chris --
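For reference, this is what such a hint looks like from userspace. The
mmap() and madvise() calls and the MADV_SEQUENTIAL flag are the
existing Linux interfaces; the helper name map_sequential() is made up
for this sketch, and whether fault-time pte population would actually
consult the hint is exactly the open question here.

```c
#define _DEFAULT_SOURCE     /* for MAP_ANONYMOUS and madvise() on glibc */
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory and declare a sequential access
 * pattern for it. Returns NULL on failure.
 */
static void *map_sequential(size_t len)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (p == MAP_FAILED)
        return NULL;

    /* Advisory only: the kernel is free to ignore the hint. */
    if (madvise(p, len, MADV_SEQUENTIAL) != 0) {
        munmap(p, len);
        return NULL;
    }
    return p;
}
```

An application that knows it will walk a buffer linearly would call
map_sequential() once and then touch the pages in order.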
Quite possibly true ... something to benchmark. --
Well, I guess we need a new binary format that allows one to execute binaries in kernel address space with full powers. --
Seems ... extreme ;-) Maybe we just do it if we're in readahead? (or similar) --
If we are in kernel space then the binary can call the readahead function as needed ... ;-O Ok, seriously: Anonymous pages are not subject to readahead so it won't work. --
On Fri, Sep 19, 2008 at 6:32 AM, Christoph Lameter
In the case of file-mapped pages, shouldn't we just use the on-demand
readahead mechanism in the kernel? If that is inefficient, it means we
have to change the on-demand readahead mechanism itself.
--
Kind regards,
MinChan Kim
--
Right. My patches were only for anonymous pages not for file backed because readahead is available for file backed mappings. --
Do we populate the PTEs though? I didn't think that was batched, but I might well be wrong. --
We do not populate the PTEs and AFAICT PTE population was assumed not to be performance critical since the backing media is comparatively slow. --
I think the times when this matters are things like glibc, which are
heavily shared - we were only 'prefaulting' when the pagecache was
already there. So it's a case for a "readahead like algorithm", not
necessarily a direct hook.
Anonymous pages seem much riskier, as presumably there's no backing
page except in the fork case.
I presume the reason Jeremy is interested is because his pagefaults are
more expensive than most (under virtualization), so he may well find a
different tradeoff than I did (try running kernbench?)
--
Yes. My thought was that there should be very little cost to
opportunistically populating the pte for a page which is already
Right. The faults themselves are more or less the same as the native
case, but setting a pte requires a hypercall compared to a memory write
in the native case. But I can set any number of ptes in one hypercall,
so batching would amortize the cost.
J
--
Perhaps we should. In a virtual guest, the backing media is often an emulated IDE device, or something similarly inefficient, such that the bottleneck is the CPU. -- Chris --
In embedded environments, many people use NAND-like devices as storage.
The read cost of a NAND-like device is lower than that of IDE. Also,
embedded systems are nowadays gradually moving to multi-core.
--
Kind regards,
MinChan Kim
--
One problem is the accessed bit. If it's unset, the shadow code cannot make the pte present (since it has to trap in order to set the accessed bit); if it's set, we're lying to the vm. This doesn't affect Xen, only kvm. -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
(Just to clarify an ambiguity here: by "present" I mean "exists in
So even if the guest pte were present but non-accessed, the shadow pte
would have to be non-present and you'd end up taking the fault anyway?
Hm, that does undermine the benefits. Does that mean that when the vm
clears the access bit, you always have to make the shadow non-present?
I guess so. And similarly with dirty and writable shadow.
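The constraint can be written down as a small function. This is a
sketch of the rule as described in this exchange, not kvm's actual
shadow code: shadow_flags() is a hypothetical name, and the bit
positions follow the usual x86 layout (P=0, R/W=1, A=5, D=6).

```c
#include <stdint.h>

#define PTE_PRESENT  (1u << 0)
#define PTE_WRITE    (1u << 1)
#define PTE_ACCESSED (1u << 5)
#define PTE_DIRTY    (1u << 6)

/*
 * Derive the flags a shadow pte may carry from the guest pte's flags.
 * The shadow must fault whenever the hardware would have needed to
 * set A or D in the guest pte.
 */
static uint32_t shadow_flags(uint32_t guest)
{
    uint32_t shadow = guest;

    if (!(guest & PTE_PRESENT))
        return 0;
    if (!(guest & PTE_ACCESSED))
        shadow &= ~PTE_PRESENT;     /* trap the first access to set A */
    if (!(guest & PTE_DIRTY))
        shadow &= ~PTE_WRITE;       /* trap the first write to set D */
    return shadow;
}
```

A speculatively populated guest pte with A clear therefore gets a
non-present shadow, and the fault that prefaulting was meant to avoid
is taken anyway.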
The counter-argument is that something has gone wrong if we start
populating ptes that aren't going to be used in the near future anyway -
if they're never used then any effort taken to populate them is wasted.
Therefore, setting accessed on them from the outset isn't terribly bad.
(I'm not very convinced by that argument either, and it makes the
potential for bad side-effects much worse if the apparent RSS of a
process is multiplied by some factor.)
J
--
We don't know whether the page will be used or not. Keeping the accessed bit clear allows the vm to reclaim it early, and in preference to the pages it actually used. We could work around it by having a hypercall to read and clear accessed bits. If we know the guest will only do that via the hypercall, we can keep the accessed (and dirty) bits in the host, and not update them in the guest at all. Given good batching, there's potential for a large win there. (If the host throws away a shadow page, it could sync the bits back into the guest pte for safekeeping) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. --
We added a hypercall to update just the AD bits, though it was primarily
to update D without losing the hardware-set A bit.
I don't think it would be practical to add a hypercall to read the A
bit. There's too much code which just assumes it can grab a pte and
test the bit state. There's no pv_op for reading a pte in general, and
even if there were you'd need to have a specialized pv-op for
specifically reading the A bit to avoid unnecessary hypercalls.
Setting/clearing the A bit could be done via the normal set_pte pv_op,
so that's not a big deal.
Do you need to set the A bit synchronously? What happens if you install
the guest and shadow pte with A clear, and then lazily transfer the A
bit state from the shadow to guest pte? Maybe at some significant event
J
--
(potential victim cc'ed)
I didn't think so much code would be interested in the accessed bit. I
can think of
- pte teardown (to mark the page accessed)
- scanning the active list
I'll fail my own unit tests.
If we add an async mode for guests that can cope, maybe this is
workable. I guess this is what you're suggesting.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
Is the A bit architecturally guaranteed to be synchronously set? Can
Yes. At worst Linux would underestimate the process RSS a bit
(depending on how many unsynchronized ptes you leave lying around). I
bet there's an appropriate pvop hook you could use to force
synchronization just before the kernel actually inspects the bits
(leaving lazy mode sounds good).
J
--
I believe so. The cpu won't cache tlb entries with the A bit clear
Not the RSS (that's pte.present pages) but the working set (aka active
It would have to be a new lazy mode, not the existing one, I think.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
The only direct use of pte_young() is in zap_pte_range, within a
mmu_lazy region. So syncing the A bit state on entering lazy mmu mode
would work fine there.
The call via page_referenced_one() doesn't seem to have a very
convenient hook though. Perhaps putting something in
page_check_address() would do the job.
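The deferred scheme can be modelled in a few lines of user-space C.
Nothing here is an existing kernel or kvm interface: struct pte_pair,
hw_access(), sync_accessed() and guest_young() are hypothetical names,
and sync_accessed() stands in for whatever hook (entering lazy mmu
mode, or something in page_check_address()) would trigger the transfer.

```c
#include <stddef.h>
#include <stdint.h>

#define PTE_ACCESSED (1u << 5)

struct pte_pair {
    uint32_t guest;     /* what the guest kernel reads */
    uint32_t shadow;    /* what the hardware walks and updates */
};

/* A hardware access sets A in the shadow pte only. */
static void hw_access(struct pte_pair *p)
{
    p->shadow |= PTE_ACCESSED;
}

/*
 * Sync point: fold shadow A bits into the guest ptes in one pass.
 * Returns how many guest ptes were updated.
 */
static size_t sync_accessed(struct pte_pair *tbl, size_t n)
{
    size_t updated = 0;

    for (size_t i = 0; i < n; i++) {
        if ((tbl[i].shadow & PTE_ACCESSED) &&
            !(tbl[i].guest & PTE_ACCESSED)) {
            tbl[i].guest |= PTE_ACCESSED;
            updated++;
        }
    }
    return updated;
}

/* pte_young()-style test; only trustworthy after sync_accessed(). */
static int guest_young(const struct pte_pair *p)
{
    return (p->guest & PTE_ACCESSED) != 0;
}
```

Between sync points the guest view lags behind the shadow, which is the
window in which the kernel would underestimate which pages have been
touched.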
J
--
Why there? Why not explicitly in the callers?
We need more than to exit lazy pte.a mode, we also need to enter it
again later.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
--
Well, sort of but not quite. The kernel's announcing it's about to start
processing a batch of ptes, so the hypervisor can take the opportunity
to update their state before processing. "Lazy-mode" is from the
perspective of the kernel lazily updating some state the hypervisor
might care about, and the sync happens when leaving that mode.
The flip-side is when the hypervisor is lazily updating some state the
kernel cares about, so it makes sense that the sync happens when the
kernel enters its lazy mode. But the analogy isn't very good because we don't
really have an explicit notion of "hypervisor lazy mode", or a formal
handoff of shared state between the kernel and hypervisor. But in this
Because that's the code that actually walks the pagetable and has the
address of the pte; it just returns a pte_t, not a pte_t *. It depends
on whether you want to fetch the A bit via ptep or vaddr (in general we
pass mm, ptep and vaddr to ops which operate on the current pagetable).
J
--
Handwavy. I think the two notions are separate <insert handwavy
pte_clear_flush_young_notify_etc() seems even closer.
--
error compiling committee.c: too many arguments to function
--
Perhaps this helps:
Context switches between guest<->hypervisor are relatively expensive.
The more work we can make each context switch perform the better,
because we can amortize the cost. Rather than synchronously switching
between the two every time one wants to express a state change to the
other, we batch those changes up and only sync when it's important.
While there are batched outstanding changes in one, the other will have
a somewhat out of date view of the state. At this level, the idea of
batching is completely symmetrical.
One of the ways we amortize the cost of guest->hypervisor transitions is
by batching multiple pagetable updates together. This works at two
levels: within explicit arch_enter/leave_lazy_mmu lazy regions, and also
because it is analogous to the architectural requirement that you must
flush the tlb before updates "really" happen.
KVM - and other shadow pagetable implementations - have the additional
problem of transmitting A/D state updates from the shadow pagetable into
the guest pagetable. Doing this synchronously has the costs we've been
discussing in this thread (namely, extra faults we would like to avoid).
Doing this in a deferred or batched way is awkward because there's no
analogous architectural asynchrony in updating these pte flags, and we
don't have any existing mechanisms or hooks to support this kind of
deferred update.
However, given that we're talking about cleaning up the pagetable api
anyway, there's no reason we couldn't incorporate this kind of deferred
update in a more formal way. It definitely makes sense when you have
shadow pagetables, and it probably makes sense on other architectures
too.
Very few places actually care about the state of the A/D bits; would it
be expensive to make those places explicitly ask for synchronization
before testing the bits (or alternatively, have an explicit query
operation rather than just poking about in the ptes)?
Martin, does this help with s390's per-page (vs per-pte) A/D ...
With the kvm support the situation on s390 recently has grown a tad more complicated. We now have dirty bits in the per-page storage key and in the pgste (page table entry extension) for the kvm guests. For the A/D bits in the storage key the new pte operations won't help, for the kvm related bits they could make a difference. -- blue skies, Martin. "Reality continues to ruin my life." - Calvin. --
Other archs too. On powerpc, !accessed -> not hashed (or not in the TLB for SW loaded TLB platforms). Ben. --
Hi, all
I have been thinking about this idea for the native case. I hadn't
considered it for minor page faults; as you know, they are much cheaper
than major faults. However, page faults are one of the big bottlenecks
in a demand-paging system, and I think major faults might be a rather
big overhead on many-core systems.
What do you think about this idea for the native case? Do you really
think it wouldn't help much there?
If I implement it for native, what kind of benchmarks do I need? Could
you recommend any?
--
Kind regards,
MinChan Kim
--
I guess it is also useful for native. If you post a patch and benchmark
results, I'll review it with pleasure.
--
On Thu, 18 Sep 2008 08:50:05 +0900
Hmm, is enlarging the page size for anonymous pages more difficult?
Testing some kind of scripts (shell/perl etc.) is a candidate. I use
unixbench's exec/shell test to see the charge/uncharge overhead of the
memory resource controller, which happens at major page fault.
Thanks,
-Kame
--
