From: Martin Schwidefsky <schwidefsky@de.ibm.com>
From: Hubertus Franke <frankeh@watson.ibm.com>
From: Himanshu Raj
s390 uses the milli-coded ESSA instruction to set the page state. The
page state is formed by four guest page states called block usage states
and three host page states called block content states.
The guest states are:
- stable (S): there is essential content in the page
- unused (U): there is no useful content and any access to the page will
cause an addressing exception
- volatile (V): there is useful content in the page. The host system is
allowed to discard the content anytime, but has to deliver a discard
fault with the absolute address of the page if the guest tries to
access it.
- potential volatile (P): the page has useful content. The host system
is allowed to discard the content after it has checked the dirty bit
of the page. It has to deliver a discard fault with the absolute
address of the page if the guest tries to access it.
The host states are:
- resident: the page is present in real memory.
- preserved: the page is not present in real memory but the content is
preserved elsewhere by the machine, e.g. on the paging device.
- zero: the page is not present in real memory. The content of the page
is logically-zero.
There are 12 combinations of guest and host state; currently only 8 are
valid page states:
Sr: a stable, resident page.
Sp: a stable, preserved page.
Sz: a stable, logically zero page. A page filled with zeroes will be
allocated on first access.
Ur: an unused but resident page. The host could make it Uz anytime but
it doesn't have to.
Uz: an unused, logically zero page.
Vr: a volatile, resident page. The guest can access it normally.
Vz: a volatile, logically zero page. This is a discarded page. The host
will deliver a discard fault for any access to the page.
Pr: a potential volatile, resident page. The guest can access it normally.
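The eight valid combinations listed above can be written down as a small
lookup table. What follows is only a minimal sketch in C: the enum names, the
table and the helpers page_state_is_valid() and count_valid_states() are made
up for illustration and say nothing about the real ESSA encoding or the code
in the patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Guest-controlled block usage states (illustrative names only). */
enum guest_state { G_STABLE, G_UNUSED, G_VOLATILE, G_POT_VOLATILE, G_NSTATES };

/* Host-controlled block content states (illustrative names only). */
enum host_state { H_RESIDENT, H_PRESERVED, H_ZERO, H_NSTATES };

/* Rows: S, U, V, P. Columns: r, p, z. True marks the 8 valid states. */
static const bool valid_state[G_NSTATES][H_NSTATES] = {
	[G_STABLE]       = { true, true,  true  },	/* Sr Sp Sz */
	[G_UNUSED]       = { true, false, true  },	/* Ur    Uz */
	[G_VOLATILE]     = { true, false, true  },	/* Vr    Vz */
	[G_POT_VOLATILE] = { true, false, false },	/* Pr       */
};

/* Returns true if the guest/host pair is one of the listed valid states. */
static bool page_state_is_valid(enum guest_state g, enum host_state h)
{
	assert(g < G_NSTATES && h < H_NSTATES);
	return valid_state[g][h];
}

/* Counts the valid entries in the table. */
static int count_valid_states(void)
{
	int n = 0;

	for (int g = 0; g < G_NSTATES; g++)
		for (int h = 0; h < H_NSTATES; h++)
			n += valid_state[g][h];
	return n;
}
```

A table keeps the list of valid states in one place, so a debugging check
only has to index it with the two halves of the state.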
The remaining 4 combinations can't ...
I created the attached .dot graph based purely on this description. It
looks reasonable, but I didn't see how a page enters a Pr state.
J
That is the first block of state transitions: {Ur,Sr,Vr,Pr}
You can go from any of the four states to any of the remaining three.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
--
You only mention page_set_{unused,stable,volatile}. Is
page_set_stable_if_present() the fourth? And shouldn't that be
"stable_if_clean":
- potential volatile (P): the page has useful content. The host system
is allowed to discard the content after it has checked the dirty bit
of the page. It has to deliver a discard fault with the absolute
address of the page if the guest tries to access it.
The use of "stable" in the function call and "volatile" in this
description is a bit confusing. My understanding is that a page in this
state is either stable or volatile depending on whether it's dirty, which
makes sense, but it would be good to consistently refer to it in the
same way.
Updated .dot attached.
J
page_set_volatile has a "writable" argument. For writable==0 you get a Vx
page, for writable==1 you get a Px page.
With stable_if_clean you are referring to stable_if_present? If yes, the
answer is that this operation is used to get a page from Vx/Px back to Sx,
but only if the page has not been discarded. The operation will fail if
the page state is Vz/Pz. The dirty bit only matters for the host's
decision to discard the page, these are the state transitions from Vr/Pr
Your understanding is good, but how can I make this less confusing? A Px
page that is dirty may not be discarded, which makes it basically stable.
The guest state still is potential volatile though, as it does not have a
state of Sx.
--
blue skies,
Martin.
--
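The two operations described in this reply can be modelled in a few lines of
C. This is a sketch under assumptions, not the interface from the patch: the
struct page_model type and the model_* function names are hypothetical, the
host state is left untouched by both operations, and the behaviour of
stable_if_present on a page that is not volatile is not stated in the thread
(the sketch simply makes such a page stable).

```c
#include <stdbool.h>

enum guest_state { G_STABLE, G_UNUSED, G_VOLATILE, G_POT_VOLATILE };
enum host_state { H_RESIDENT, H_PRESERVED, H_ZERO };

/* Hypothetical model of one page: guest half and host half of the state. */
struct page_model {
	enum guest_state g;
	enum host_state h;
};

/* writable == 0 gives a Vx page, writable == 1 gives a Px page. */
static void model_set_volatile(struct page_model *p, int writable)
{
	p->g = writable ? G_POT_VOLATILE : G_VOLATILE;
}

/*
 * Moves Vx/Px back to Sx unless the page has been discarded (Vz/Pz), in
 * which case the operation fails and the state is left as it was.
 * Assumption: non-volatile pages are simply made stable.
 */
static bool model_set_stable_if_present(struct page_model *p)
{
	bool is_volatile = p->g == G_VOLATILE || p->g == G_POT_VOLATILE;

	if (is_volatile && p->h == H_ZERO)
		return false;
	p->g = G_STABLE;
	return true;
}
```

A caller would check the return value and, on failure, treat the page
content as lost.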
Hm. But a Vx page is writable isn't it? It's just that its contents
can go away at any time. Or does the kernel treat Vx pages as strictly
RO cached copies of other things?
It also seems to me that, given you're talking about "potentially volatile"
as a distinct state, it would be best to have a distinct
state-setting function associated with it, so there's a 1:1
No. I misunderstood and thought that stable_if_present sets the Px
So you mean it will change Vr/Pr to Sr but everything else will fail?
Mainly, use identical terminology in code and description so they can be
easily compared. I found the diagram was quite helpful in understanding
what's going on; feel free to include it in your documentation.
Updated .dot attached; I've updated it to include the page_set_volatile
writable argument and the stable_if_present transitions; commented it,
removed the self-edges which were cluttering things up.
Also, does a page go from Vz->Vr on guest memory write? If so, does a
clean page which goes from Pr->Vz->Vr lose its Px state in the process?
J
Well presumably Vp/Pr => Sp?
Is it true that from the guest's perspective, all of the 'p' states are
identical to the 'r' states?
Do the host states even really need visibility to the guest at all? It may
be useful for the guest to be able to distinguish between Ur and Uz but it
doesn't seem necessary.
BTW Jeremy, the .dot was very useful!
Regards,
Anthony Liguori
--
Vp should never happen, since you'd never preserve a V page. And surely
it would be Pr -> Sr, since the hypervisor wouldn't push the page to
Well, you implicitly see the hypervisor state. If you touch a [UV]z
page then you get a fault telling you that the page has been taken away
from you (I think). And it would definitely help with debugging (seems
likely there's lots of scope for race conditions if you prematurely tell
Yes, there's no way I'd be able to get my head around this otherwise.
BTW, here's an updated one with the host-driven events as dashed lines,
and a couple of extra transitions I think should be in there (but
waiting for Martin's confirmation).
J
You're right, I meant Vp/Pp but they are invalid states.
I think one of the things that keeps tripping me up is that the host can
change both the host and guest page states. My initial impression was
that the host
I was thinking that it may be useful to know a Ur versus a Uz when
allocating memory. In this case, you'd rather allocate Ur pages versus Uz
to avoid the fault. I don't read s390 arch code well, is the host
Excellent!
Regards,
--
Yes. And it seems to me that you get unfortunate outcomes if you have a
Yes, reusing Ur pages might well be better, but who knows - they've
probably got an instruction which makes Uz cheap...
Stuff like this suggests that both parts of the state are packed
together, and are guest-visible:
+ return (state & ESSA_USTATE_MASK) == ESSA_USTATE_VOLATILE &&
+ (state & ESSA_CSTATE_MASK) == ESSA_CSTATE_ZERO;
J
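The quoted check can be turned into a self-contained function to show how a
packed state value is split into its two halves. This is a sketch only: the
DEMO_* constants mirror the names in the quoted lines, but their numeric
values are placeholders chosen for the example and are NOT the real ESSA
return-value encoding; demo_page_discarded() is a hypothetical name.

```c
#include <stdbool.h>

/* Placeholder bit layout, invented for this example. */
#define DEMO_ESSA_USTATE_MASK		0x0cUL	/* guest (usage) state bits */
#define DEMO_ESSA_USTATE_STABLE		0x00UL
#define DEMO_ESSA_USTATE_UNUSED		0x04UL
#define DEMO_ESSA_USTATE_VOLATILE	0x08UL
#define DEMO_ESSA_CSTATE_MASK		0x03UL	/* host (content) state bits */
#define DEMO_ESSA_CSTATE_RESIDENT	0x00UL
#define DEMO_ESSA_CSTATE_ZERO		0x02UL

/* True if the packed state is Vz, i.e. a volatile page that was discarded. */
static bool demo_page_discarded(unsigned long state)
{
	return (state & DEMO_ESSA_USTATE_MASK) == DEMO_ESSA_USTATE_VOLATILE &&
	       (state & DEMO_ESSA_CSTATE_MASK) == DEMO_ESSA_CSTATE_ZERO;
}
```

Because both halves sit in one value, a single returned state is enough for
the guest to tell Vr from Vz.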
--
Yes, faulting in a Uz page is cheap on s390. Isn't it a lovely
Yes, the return value of the ESSA instruction has both the guest state and
the host state.
--
blue skies,
Martin.
--
Does that mean that Vz is effectively identical to Uz?
J
--
Hm, on further thought:
If guest writes to Vz pages are disallowed, then the only way out of Vz
is if the guest sets it to something else (Uz,Sz). If so, what's the
point of using that state? Why not make:
Vr -> Uz host discard
Pr -> Uz host discard clean
Sp -> Uz set volatile
Uz -> Uz set volatile
But given how you've described V-state pages, I really would expect
writes to a Vz to work, or alternatively, all writes to V-state pages to
be disallowed. Are there any real uses for a writable Vr page?
On the other hand, removing Vz->Vr does clean up the dot graph a lot...
J
Vz is the page discarded state. The difference to Uz is slim, both states
will cause a program check on access. Vz generates a discard fault, Uz
generates an addressing exception which is nice for debugging. But I don't
see a reason why an implementation that uses Uz instead of
You mean in the section that speaks about the guest states S/U/V/P?
Always keep in mind that you can access a V/P page only until it gets
discarded. Then the useful content of the page frame is lost and any read
or write to the now Vz page will be answered with a discard fault.
A Vr page is read-only. If a page gets mapped for writing it needs to get
into the Pr state. This is the hint for the host to look at the dirty bit
before it discards a page. So yes, there is no use for a writable Vr page.
--
blue skies,
Martin.
--
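The difference between touching an unused page and touching a discarded one
can be summarised as a small decision function. This is a sketch based only
on the descriptions in this thread; the enum names and model_guest_access()
are hypothetical, and the sketch does not distinguish reads from writes.

```c
enum guest_state { G_STABLE, G_UNUSED, G_VOLATILE, G_POT_VOLATILE };
enum host_state { H_RESIDENT, H_PRESERVED, H_ZERO };

enum access_result {
	ACCESS_OK,			/* guest sees the page content */
	ACCESS_DISCARD_FAULT,		/* content of a V/P page was discarded */
	ACCESS_ADDRESSING_EXCEPTION,	/* page is unused: use after free */
};

/* What the guest observes when it touches a page in the given state. */
static enum access_result model_guest_access(enum guest_state g,
					     enum host_state h)
{
	/* The state list says any access to an unused page is an error. */
	if (g == G_UNUSED)
		return ACCESS_ADDRESSING_EXCEPTION;
	/* A volatile page whose content is gone reports a discard fault. */
	if ((g == G_VOLATILE || g == G_POT_VOLATILE) && h == H_ZERO)
		return ACCESS_DISCARD_FAULT;
	/* Sr, Sp, Sz, Vr, Pr: the host makes the content available. */
	return ACCESS_OK;
}
```

Keeping the two error results distinct is what makes the Uz case useful for
debugging, as noted above.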
How do you handle these different cases in Linux? Do you use Vr pages
in the pagecache, and then shoot down the pagecache entry if the host
steals the page?
The Uz access exception presumably just generates a normal oops.
OK, thanks, that clears things up. I was assuming that Vr was
technically writable but that writes could be discarded at any time (ie,
allowing guests to merrily shoot themselves in the foot ;). Making it
forced RO is much more sensible.
J
--
The environment where we currently run all this is z/VM as the host and
Linux as the guest. We have two page tables on s390, a host page table and
a guest page table. If the host discards a page it simply removes the entry
for the page in the host page table. If the guest comes along and accesses
the page the host gets the fault and generates the
Yes, the handler for an addressing exception will call die() for a
Well, technically you could write to a Vr page via the kernel address
space. The thing is that the host can just discard the page although it is
dirty. The Vr state is used for page cache pages which do not have any
writable mapping.
--
blue skies,
Martin.
--
In principle only the guest changes the guest state and only the host
changes the host state. The simplified state diagram shows exceptions
This is the second optimization you might want to think about. The other
is to avoid the page clearing for Uz.
--
blue skies,
Martin.
--
Vp does not happen in the current implementation. But it actually may be
useful. z/VM has multiple layers of paging, the first goes to expanded
storage which is very fast. If you make the page Vz and the guest needs it
you have to do a standard Linux I/O to retrieve the page. This
You get an addressing exception if you touch a Uz page. This indicates a
BUG in the Linux code because this is a use after free. If the guest
touches a Vz page you get a discard fault.
--
blue skies,
Martin.
--
In the extended version Vp/Pp to Sr as well but the current z/VM code
It is very useful for debugging to have the host state in the guest as
well. There is one possible optimization: if the guest finds a Uz page in
the free list, it can make it Sz and doesn't have to clear it because the
host will provide an already empty page (not yet implemented
I've searched my disk and found the state diagrams we've used for the OLS
paper. You may find these useful as well.
--
blue skies,
Martin.
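The free-list optimization mentioned in this message (make a Uz page Sz and
skip the clearing) could look roughly like the sketch below. It is a
user-space model only: struct page_model, MODEL_PAGE_SIZE and
model_alloc_zeroed() are hypothetical names, and the message itself says the
optimization was not yet implemented.

```c
#include <stdbool.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumed page size for the model */

enum guest_state { G_STABLE, G_UNUSED, G_VOLATILE, G_POT_VOLATILE };
enum host_state { H_RESIDENT, H_PRESERVED, H_ZERO };

/* Hypothetical model of a page on the guest's free list. */
struct page_model {
	enum guest_state g;
	enum host_state h;
	unsigned char data[MODEL_PAGE_SIZE];
};

/*
 * Hands out a zeroed page. Returns true if the explicit clearing could be
 * skipped because the page was Uz: it becomes Sz and the host supplies a
 * zero-filled page on first access.
 */
static bool model_alloc_zeroed(struct page_model *p)
{
	bool was_uz = p->g == G_UNUSED && p->h == H_ZERO;

	p->g = G_STABLE;
	if (was_uz)
		return true;
	memset(p->data, 0, sizeof(p->data));
	return false;
}
```

Only the Ur case pays for the memset; the Uz case leaves the zeroing to the
host.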
