Re: [PATCH] x86: set PAE PHYSICAL_MASK_SHIFT to match 64-bit

Previous thread: [PATCH 5/5] pagemap: Wrote some userspace-oriented documentation for pagemap by Thomas Tuttle on Thursday, June 5, 2008 - 8:09 am. (1 message)

Next thread: [PATCH 0 of 3] mmu notifier v18 by Andrea Arcangeli on Thursday, June 5, 2008 - 8:36 am. (6 messages)
From: Jeremy Fitzhardinge
Date: Thursday, June 5, 2008 - 8:21 am

When a 64-bit x86 processor runs in 32-bit PAE mode, a pte can
potentially have the same number of physical address bits as the
64-bit host ("Enhanced Legacy PAE Paging").

This is a bugfix for two cases:
1. running a 32-bit PAE kernel on a machine with
   more than 64GB RAM.
2. running a 32-bit PAE Xen guest on a host machine with
   more than 64GB RAM

In both cases, a pte could need to have more than 36 bits of physical
address, and masking it to 36 bits will cause fairly severe havoc.

The 46-bit mask used in 64-bit seems pretty arbitrary.  The physical
size could be between 40 and 52 bits.  Setting the mask to 40 bits
would restrict the physical size to 1TB, which is definitely too
small.  Setting it to 52 would be ridiculously large, and runs the
risk that one of the vendors may decide to put flags rather than
physical address in one of the upper reserved bits.

Doing it "properly" would require testing cpuid leaf 0x80000008, but
it would mean that we would lose the ability to make all these
compile-time constants.

So, stick with 46 bits.  It's enough for now.

[ Ingo: This needs a test, but I think it should be fairly low-risk.
   If it checks out OK, it should be slipped to Linus fairly soon,
   since it is a bugfix.  It's probably worth putting into stable
   too. ]

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jan Beulich <jbeulich@novell.com>
Cc: Stable Kernel <stable@kernel.org>

diff -r 0eebd30011dc include/asm-x86/page_32.h
--- a/include/asm-x86/page_32.h	Wed Jun 04 10:32:01 2008 +0100
+++ b/include/asm-x86/page_32.h	Thu Jun 05 16:09:53 2008 +0100
@@ -22,7 +22,7 @@
 
 
 #ifdef CONFIG_X86_PAE
-#define __PHYSICAL_MASK_SHIFT	36
+#define __PHYSICAL_MASK_SHIFT	46
 #define __VIRTUAL_MASK_SHIFT	32
 #define PAGETABLE_LEVELS	3
 

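A minimal sketch of the truncation described in the patch text, assuming a pte layout where the physical address occupies the bits between the page offset and the mask shift. The helper names `pte_phys_mask` and `pte_phys_addr` are hypothetical and are not the kernel's own macros; they only illustrate what a 36-bit mask does to a frame at or above 64GB.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Hypothetical helper: mask covering pte bits [shift-1 : PAGE_SHIFT]. */
static uint64_t pte_phys_mask(unsigned shift)
{
    return ((1ULL << shift) - 1) & ~((1ULL << PAGE_SHIFT) - 1);
}

/* Hypothetical helper: physical address recovered from a pte value. */
static uint64_t pte_phys_addr(uint64_t pte, unsigned shift)
{
    return pte & pte_phys_mask(shift);
}
```

With a pte of `0x1000000063ULL` (a frame at exactly 64GB plus some low flag bits), `pte_phys_addr(pte, 36)` drops bit 36 and yields 0, while `pte_phys_addr(pte, 46)` keeps the full address.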

--

From: Jan Beulich
Date: Thursday, June 5, 2008 - 8:34 am

Hmm? There's 11 bits available - why would anyone want to assign bits
from the sufficiently official (at least as far as AMD is concerned, I'm not
sure I saw a precise statement on Intel's side) frame number bits? And
even if they would, it would certainly take some control register bit to
enable the feature, so shrinking the mask if that would ever happen
would seem more appropriate.

Bottom line - I'd suggest pushing both 32- and 64-bits up to 52.

Jan

--

From: Jeremy Fitzhardinge
Date: Thursday, June 5, 2008 - 8:42 am

The Intel docs list those 11 bits as available to software, and they are not 
reserved for any future flags they may want to add.  I was a bit 


We could have an auction:

    Do I hear 46? 47? 48?  50?  52!  Going once, twice, 52 bits!

Anyway, we can fix it later in a separate patch.  This is a 
change-as-little-as-possible bugfix patch.

    J
--

From: H. Peter Anvin
Date: Thursday, June 5, 2008 - 9:45 am

It should either be 52 bits or dynamic based on CPUID information.  The 
latter is very expensive.

If there end up being additional control bits assigned in this space we 
won't use them, since we know the size of the address space (which won't 
include the control bits) and thus will leave them at zero.

It's largely theoretical, since I believe Linux on x86-64 relies on 
virtual >= physical+N, where I believe N is about 3 bits, and the page 
table format or page size need to change to support more than 48 bits of 
virtual address space.

	-hpa
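For reference, a sketch of what the "dynamic based on CPUID information" option would have to decode. CPUID leaf 0x80000008 reports the physical address width in EAX[7:0] and the linear (virtual) address width in EAX[15:8]. The struct and function names here are hypothetical, and the EAX value is passed in as a plain parameter so the sketch does not depend on executing CPUID.

```c
#include <stdint.h>

/* Hypothetical container for the two widths reported by the leaf. */
struct addr_bits {
    unsigned phys;  /* physical address bits, EAX[7:0]  */
    unsigned virt;  /* linear address bits,   EAX[15:8] */
};

/* Decode the EAX value returned by CPUID leaf 0x80000008. */
static struct addr_bits decode_leaf_80000008_eax(uint32_t eax)
{
    struct addr_bits b;

    b.phys = eax & 0xff;
    b.virt = (eax >> 8) & 0xff;
    return b;
}
```

An EAX value of `0x3028`, for instance, decodes to 40 physical bits and 48 virtual bits. A mask derived this way is a runtime value, which is the cost being discussed: it can no longer be a compile-time constant.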
--

From: Jeremy Fitzhardinge
Date: Thursday, June 5, 2008 - 2:14 pm

I'm more concerned that it might not be possible.  I'm trying to think 
how many places have compile-time constants derived from this mask.  

You mean, if new bits appear we can just adjust the mask accordingly to 

I don't see any relationship between the physical and virtual size.  
Certainly virtual is fixed at 48 bits (4*9+12), but I don't think 
there's any deep reason why physical needs to be within 3 bits.

    J
--

From: H. Peter Anvin
Date: Saturday, June 7, 2008 - 11:35 am

Correct. Remember, the page table entries come from the kernel - not 

Identity-mapping.  1 bit goes to kernel/user split, then the kernel area 
is split into multiple regions, one of which is identity-mapping.  It 
may be just 2.

	-hpa
--

From: Jeremy Fitzhardinge
Date: Friday, June 6, 2008 - 2:21 am

When a 64-bit x86 processor runs in 32-bit PAE mode, a pte can
potentially have the same number of physical address bits as the
64-bit host ("Enhanced Legacy PAE Paging").  This means, in theory,
we could have up to 52 bits of physical address in a pte.

The 32-bit kernel uses a 32-bit unsigned long to represent a pfn.
This means that it can only represent physical addresses up to 32+12=44
bits wide.  Rather than widening pfns everywhere, just set 2^44 as the
Linux x86_32-PAE architectural limit for physical address size.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
Cc: Jan Beulich <jbeulich@novell.com>
---
 include/asm-x86/page_32.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

===================================================================
--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -22,7 +22,8 @@
 
 
 #ifdef CONFIG_X86_PAE
-#define __PHYSICAL_MASK_SHIFT	36
+/* 44=32+12, the limit we can fit into an unsigned long pfn */
+#define __PHYSICAL_MASK_SHIFT	44
 #define __VIRTUAL_MASK_SHIFT	32
 #define PAGETABLE_LEVELS	3
 

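The 32+12=44 arithmetic can be sketched as follows, assuming a pfn held in a 32-bit unsigned type (modelled here with `uint32_t` as a stand-in for the 32-bit kernel's `unsigned long`). The typedef and function names are hypothetical.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Stand-in for a 32-bit unsigned long pfn (assumption for illustration). */
typedef uint32_t pfn32_t;

/* Physical address to pfn; bits above 2^44 are silently lost. */
static pfn32_t phys_to_pfn(uint64_t phys)
{
    return (pfn32_t)(phys >> PAGE_SHIFT);
}

/* pfn back to physical address, widened before the shift. */
static uint64_t pfn_to_phys(pfn32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}
```

The last page below 2^44 round-trips intact, whereas an address of exactly 2^44 becomes pfn 0 and comes back as physical address 0.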

--

From: Jan Beulich
Date: Friday, June 6, 2008 - 2:58 am

Ah, yes!

--

From: Andi Kleen
Date: Friday, June 6, 2008 - 6:15 am

43 bits might actually be safer because of potential sign bugs.

But of course it likely won't work anyway.

-Andi
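A sketch of the kind of sign bug meant here, assuming some code path stores a pfn in a signed 32-bit variable. Function names are hypothetical. Converting a value with bit 31 set to `int32_t` is implementation-defined in C; common compilers wrap it to a negative number, which then sign-extends when widened.

```c
#include <stdint.h>

#define PAGE_SHIFT 12

/* Correct path: pfn stays unsigned and is widened before the shift. */
static uint64_t pfn_to_phys(uint32_t pfn)
{
    return (uint64_t)pfn << PAGE_SHIFT;
}

/* Hypothetical bug: pfn passes through a signed 32-bit variable. */
static uint64_t pfn_to_phys_signed_bug(uint32_t pfn)
{
    /* Implementation-defined when bit 31 is set; typically negative. */
    int32_t spfn = (int32_t)pfn;

    /* Sign extension fills the upper bits before the shift. */
    return (uint64_t)(int64_t)spfn << PAGE_SHIFT;
}
```

A pfn of `0x80000000` corresponds to physical address 2^43, the first one that needs bit 31 of the pfn. Capping the mask at 43 bits would keep that bit clear, so the two functions above would always agree.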
--

From: Jeremy Fitzhardinge
Date: Friday, June 6, 2008 - 6:50 am

I thought about it, but I think we're pretty consistent about putting 
pfns into unsigned types.  But, yes, you're very marginal at that point.

    J
--

From: Ingo Molnar
Date: Tuesday, June 10, 2008 - 3:31 am

applied to tip/x86/cleanups - thanks Jeremy. No urgency for v2.6.26, 
right?

	Ingo
--

From: Jeremy Fitzhardinge
Date: Tuesday, June 10, 2008 - 6:06 am

Not urgent, but it would be nice to have.

    J

--

From: Ingo Molnar
Date: Friday, June 13, 2008 - 12:24 am

ok, cherry-picked it into x86/urgent. This aspect makes it eligible for 
v2.6.26:

| This is a bugfix for two cases:
| 1. running a 32-bit PAE kernel on a machine with
|   more than 64GB RAM.
| 2. running a 32-bit PAE Xen guest on a host machine with
|   more than 64GB RAM
|
| In both cases, a pte could need to have more than 36 bits of physical
| address, and masking it to 36 bits will cause fairly severe havoc.

also added a stable@kernel.org Cc: to the commit, so it will be picked 
up in stable as well.

	Ingo
--

From: Andi Kleen
Date: Thursday, June 5, 2008 - 6:40 pm

The rationale for the 46 bits is that the kernel needs roughly 4x as 
much virtual space as physical space and the virtual space is limited
to 48 bits.

To be exact 47 bits is always user space and the 47 bits remaining
for the kernel are split into half, with one half for the direct mapping
and the other half for random mappings.  With some pushing you could
extend it to 46.5 bits or so, but beyond that you'll be in trouble.

It's not arbitrary at all.

-Andi
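The arithmetic above, written out as a sketch. It assumes exactly the split described in this message (half of the virtual space for user, half of the kernel's share for the direct mapping); the function name is hypothetical.

```c
/* Bits of physical memory the direct mapping can cover, given the
 * virtual address width and the split described above. */
static unsigned direct_map_bits(unsigned virtual_bits)
{
    /* One bit goes to the user/kernel split. */
    unsigned kernel_bits = virtual_bits - 1;

    /* Half of the kernel's share is the direct (1:1) mapping. */
    return kernel_bits - 1;
}
```

For a 48-bit virtual address space this gives 46 bits, i.e. 2^46 bytes (64TB) of directly mapped physical memory.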

--

From: Jan Beulich
Date: Friday, June 6, 2008 - 12:14 am

That is only half of it. Since PHYSICAL_MASK also controls mappings other than
RAM mappings, there's really two constants that are needed here:
One (46) to indicate how large the 1:1 mapping can possibly get (and
hence what the upper boundary of usable RAM is - without introducing
highmem), and another (52) to indicate how wide a physical address
(perhaps from a 64-bit PCI BAR) can possibly be (i.e. used to validate
physical addresses / page table entries).

Jan
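A sketch of the two-constant idea, assuming the values 46 and 52 named in the message. The macro and function names are hypothetical and do not correspond to existing kernel identifiers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical: upper bound on what the 1:1 mapping can cover (RAM). */
#define DIRECT_MAP_SHIFT 46
/* Hypothetical: widest physical address a pte may legitimately hold. */
#define PHYS_ADDR_SHIFT  52

/* True if the address can be reached through the 1:1 mapping. */
static bool fits_direct_map(uint64_t phys)
{
    return (phys >> DIRECT_MAP_SHIFT) == 0;
}

/* True if the address is a well-formed physical address at all. */
static bool is_valid_phys(uint64_t phys)
{
    return (phys >> PHYS_ADDR_SHIFT) == 0;
}
```

Under this split, an address such as `1ULL << 50` (say, from a 64-bit PCI BAR) is a valid physical address but lies outside the range usable as directly mapped RAM.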

--

From: Jeremy Fitzhardinge
Date: Friday, June 6, 2008 - 12:59 am

Why's that?  Is the issue the amount of memory needed for pagetables and 
I didn't say it was.  That was the introduction to my explanation of why 
I didn't think it was arbitrary.  Of course, if there had been a comment 
there explaining the rationale, I wouldn't have had to make one up...

    J
--

From: Jan Beulich
Date: Friday, June 6, 2008 - 1:14 am

No, it's the fact that the 1:1 mapping needs as much virtual space as
the physical range covered (including all holes).

Jan

--

From: Jeremy Fitzhardinge
Date: Friday, June 6, 2008 - 1:15 am

Right, I see.  And suddenly 64-bits seems... constrained. ;)

    J

--

From: H. Peter Anvin
Date: Saturday, June 7, 2008 - 11:39 am

Not really.  The vendors are aware of this constraint -- it's hardly 
unique to Linux.  The reason for canonical addresses and all that jazz 
is to keep people from doing stupid things like store stuff in the upper 
16 bits of a pointer (happened a lot on the 68000, where the first 
implementation had only 24 address bits.)  Thus, all changes needed to 
go to a larger virtual address space are all internal to the kernel.

	-hpa
--

From: Arjan van de Ven
Date: Thursday, June 5, 2008 - 9:45 pm

the problem on 32 bit is that if you have that much ram, you run out of
lowmem FAST.... so you have bigger problems.

--

From: Jeremy Fitzhardinge
Date: Friday, June 6, 2008 - 1:08 am

Sure, you'd have to be barking mad to give a 32-bit system 2^40 bytes of 
RAM.  But under Xen the host's physical addresses are used in guest 
pagetables, so you could have a reasonably sized 32-bit PAE Xen guest be 
exposed to huge host physical addresses.

But the basic point is that, given that Enhanced Legacy PAE Paging 
exists, 36 bits is not correct, so we should fix it.  And if the 
platform allows addressable hardware to be physically discontiguous - 
either memory or devices - then you may end up using large numbers of 
physical bits without having a stupid amount of memory actually present.

    J
--