Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Previous thread: [PATCH][RESEND] sh: termios ioctl definitions by Alan Cox on Saturday, January 19, 2008 - 12:05 pm. (2 messages)

Next thread: [PATCH] ata_generic: Cenatek support by Alan Cox on Saturday, January 19, 2008 - 12:07 pm. (1 message)
To: <linux-kernel@...>
Cc: Ian Campbell <ijc@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, James Bottomley <James.Bottomley@...>, Eric W. Biederman <ebiederm@...>
Date: Saturday, January 19, 2008 - 12:08 pm

Extracted from an earlier patch by Eric Biederman.

Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
CC: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
arch/x86/mach-voyager/voyager_basic.c | 19 +++++++------------
1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mach-voyager/voyager_basic.c b/arch/x86/mach-voyager/voyager_basic.c
index 6a949e4..ed41fd8 100644
--- a/arch/x86/mach-voyager/voyager_basic.c
+++ b/arch/x86/mach-voyager/voyager_basic.c
@@ -110,8 +110,9 @@ typedef struct ClickMap {
} Entry[CLICK_ENTRIES];
} ClickMap_t;

-/* This routine is pretty much an awful hack to read the bios clickmap by
- * mapping it into page 0. There are usually three regions in the map:
+/*
+ * This routine reads the bios clickmap. There are usually three
+ * regions in the map:
* Base Memory
* Extended Memory
* zero length marker for end of map
@@ -125,7 +126,6 @@ int __init voyager_memory_detect(int region, __u32 * start, __u32 * length)
__u8 cmos[4];
ClickMap_t *map;
unsigned long map_addr;
- unsigned long old;

if (region >= CLICK_ENTRIES) {
printk("Voyager: Illegal ClickMap region %d\n", region);
@@ -138,12 +138,8 @@ int __init voyager_memory_detect(int region, __u32 * start, __u32 * length)

map_addr = *(unsigned long *)cmos;

- /* steal page 0 for this */
- old = pg0[0];
- pg0[0] = ((map_addr & PAGE_MASK) | _PAGE_RW | _PAGE_PRESENT);
- local_flush_tlb();
- /* now clear everything out but page 0 */
- map = (ClickMap_t *) (map_addr & (~PAGE_MASK));
+ /* Setup a temporary mapping for the clickmap */
+ map = early_ioremap(map_addr, sizeof(*map));

/* zero length is the end of the clickmap */
if (map->Entry[region].Length != 0) {
@@ -152,9 +148,8 @@ int __init voyager_memory_detect(int region, __...

To: <linux-kernel@...>
Cc: Ian Campbell <ijc@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Eric W. Biederman <ebiederm@...>
Date: Saturday, January 19, 2008 - 12:08 pm

Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.

early_ioremap is updated to use the standard page table accessors.

Derived from an earlier patch by Eric Biederman.

Signed-off-by: Ian Campbell <ijc@hellion.org.uk>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Eric W. Biederman <ebiederm@xmission.com>
---
arch/x86/kernel/head_32.S | 116 +++++++++++++------------------------
arch/x86/kernel/setup_32.c | 4 +
arch/x86/mm/Makefile_32 | 2 +-
arch/x86/mm/early_pgtable_32.c | 125 ++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/init_32.c | 45 --------------
arch/x86/mm/ioremap_32.c | 53 ++++++++++-------
include/asm-x86/page_32.h | 1 -
include/asm-x86/pgtable_32.h | 4 -
8 files changed, 201 insertions(+), 149 deletions(-)
create mode 100644 arch/x86/mm/early_pgtable_32.c

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index f409fe2..2090aa4 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -33,44 +33,6 @@
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id

/*
- * This is how much memory *in addition to the memory covered up to
- * and including _end* we need mapped initially.
- * We need:
- * - one bit for each possible page, but only in low memory, which means
- * 2^32/4096/8 = 128K worst case (4G/4G split.)
- * - enough space to map all low memory, which means
- * (2^32/4096) / 1024 pages (worst case, non PAE)
- * (2^32/4096) / 512 + 4 pages (worst case for PAE)
- * - a few pages for allocator use before the kernel pagetable has
- * been set up
- *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- */
-LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)
-
-/*
- * To preserve the DMA po...

To: Ian Campbell <ijc@...>
Cc: <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Eric W. Biederman <ebiederm@...>
Date: Sunday, January 20, 2008 - 2:30 pm

You have dropped the requirement to map all of low memory (the boot
allocator is used for instance to construct physical mem mapping).
Either you should fix your INIT_MAP_BEYOND_END or make a big comment
telling us why it isn't necessary anymore to map low mem.

--Mika
--

To: Mika <mika.penttila@...>
Cc: <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 21, 2008 - 5:23 pm

I think you are right. The patch ensures that all the initial page
tables themselves have mappings but won't map the additional pages
needed for mapping the rest of lowmem.

However, I think it is no longer necessary to map a whole new 4G worth
of page table pages because the code in kernel_physical_mapping_init now
extends the initial mappings rather than replacing them (see changes to
native_pagetable_setup_start). So now we only need to map 4G worth of
page tables including the initial page tables. That means we only need
to map a fixed set of extra pages rather than the sliding limit
currently used in the patch.

I'm not convinced by the additional 16MB for CONFIG_DEBUG_PAGEALLOC --
we map enough pages for page tables for 4G of lowmem -- adding space for
an extra 16M seems pointless.

Ian.
--
Ian Campbell

Good-bye. I am leaving because I am bored.
-- George Saunders' dying words

--

To: Ian Campbell <ijc@...>
Cc: Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 21, 2008 - 5:38 pm

We still need to be able to construct those page tables, which is what

If so, adjusting the limit should be a separate patch.

Either way, I'm increasingly thinking that setting up the initial page
tables via an assembly loop instead of worrying about the C accessors is
actually cleaner (I prototyped it yesterday, although I still need the
rest of the machinery.)

--

To: H. Peter Anvin <hpa@...>
Cc: Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 21, 2008 - 5:46 pm

Yes, my initial patch was wrong. However with the patch we no longer
throw away the non-PAE initial page tables and replace them with PAE
ones, instead we augment the initial PAE page tables. This means we only
need initial mappings of 4G worth of page tables rather than 4G plus
what is needed for the non-PAE initial page tables.

I don't think I explained that at all well on either attempt...
Hopefully what I mean will be clearer in patch form -- coming in a

I'm just preparing to send out a version which uses the native_* way of
doing things, its not actually as clean as I would like so I'd be
interested to see the ASM variant.

Ian.
--
Ian Campbell

You shall be rewarded for a dastardly deed.

--

To: Ian Campbell <ijc@...>
Cc: Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 21, 2008 - 10:16 pm

This is the asm version I came up with. This is only the actual
assembly part; it doesn't require the (obviously necessary) bootmem
adjustments.

-hpa

To: H. Peter Anvin <hpa@...>
Cc: Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 1:36 pm

I moderately prefer the C version, even if it is in a restricted
environment where care is needed to access global variables. I like that
it avoids multiple copies of the code and also find the structure of
what's going on is more obviously apparent (even to someone who has done
plenty of ASM mode page table frobbing in the past).

Anyhow, I don't feel all that strongly about it so if the opinion of the
early start of day maintainer(s) is strongly in favour of ASM I'll defer

Do you mean the native_pagetable_setup_start/done changes? I'm a bit
confused by not requiring obviously necessary changes -- I presume you
just mean that those changes are desirable but should be deferred into a
separate patch?

The C way doesn't inherently require those two changes to happen in the
same patch either -- probably worth splitting out if we go that route.

Ian.

--
Ian Campbell
Current Noise: Pelican - Bliss In Concrete

Your life would be very empty if you had nothing to regret.

--

To: Ian Campbell <ijc@...>
Cc: Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 2:23 pm

My opinion is that I want it done properly (PIC and all that jazz) or
not at all, and certainly would not want to mix linear and
paging-enabled code in the same file. When it comes to assembly code,
at least people can *see* that there there be dragons.

The plus *and* minus of a C version is that it's easier for people to
modify. The plus side of that is that if we really need it, it's a lot
cleaner; the minus side is that it may encourage more code to creep into
the pre-paging code, which would not be a good thing IMO.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 3:48 pm

Seems reasonable to me. I'll integrate your asm diff with the other
changes and give it a whirl.

Ian.
--
Ian Campbell

Never go to bed mad. Stay up and fight.
-- Phyllis Diller, "Phyllis Diller's Housekeeping Hints"

--

To: Ian Campbell <ijc@...>
Cc: Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 4:00 pm

This version boots into userspace on both PAE and !PAE. You want to
take it from here?

-hpa

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 4:36 pm

ok, i'll wait for Ian to submit the final (tested) version then. A few
possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
with tons of RAM, NX-less boxes and NX-able boxes :)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: H. Peter Anvin <hpa@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Wednesday, January 23, 2008 - 4:52 pm

FYI, CONFIG_DEBUG_PAGEALLOC+PAE is broken. I'll dig in but it might be
the weekend before I get a chance (there's a beer festival in town ;-)).

Ian.
--
Ian Campbell

Flattery will get you everywhere.

--

To: Ian Campbell <ijc@...>
Cc: Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Wednesday, January 23, 2008 - 9:06 pm

I'm poking around trying to get Xen working again as well; I may end up
fixing it in passing.

At the moment I've got a problem with early_ioremap's bt_pte[] array
ending up hanging around in init's pagetable, which Xen is most unhappy
about.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 5:39 am

Turns out that the initial mapping is extending well past the end of the
128M of physical memory when DEBUG_PAGEALLOC is enabled due to the extra
16M mapping. The overshoot is enough that it interferes with the vmalloc
area. I see the same problem without the patch on non PAE if I reduce
the RAM to 64M. There will be ~twice as many PAE boot page tables which
explains why it triggers with 128M with the patch (INIT_MAP_BEYOND_END
is 70M for !PAE and 138M with when DEBUG_PAGEALLOC is on).

I don't see the crash with PAE before this patch. I think because the
kernel page tables are wiped in native_pagetable_setup_start which would
have blown away the mappings past the end of physical memory, avoiding
the problem. I've added code to wipe those mappings past max_low_pfn
from the boot page tables in pagetable_setup_start which has solved the
problem and I think is the right way to do it since we don't know the
correct limit of physical RAM in head_32.

Tested patch follows. All four of {pae,!pae}x{paravirt,!
paravirt}+DEBUG_PAGEALLOC boot to userspace on qemu (smp:2 mem:128)
using cpu type 486 (no PSE or PAE or anything), pentium (PSE only),
pentium3 (everything but NX) and qemu64 (everything, running 32 bit mode
despite name). Exceptions are obviously 486 and pentium which didn't
boot the PAE versions to userspace -- they correctly bailed during
setup.

I'm not sure how PSE comes to be used ever though -- an EFI only thing?
Using the qemu monitor I could see a bunch of NX bits used when NX was
available.

I also booted both {!PAE,PAE}xPARAVIRT+DEBUG_PAGEALLOC on a physical

That sounds a lot like the problem I was having with the patch I sent
you as well, although I never identified where the problematic mapping
was.

Ian.

---
x86_32: Construct 32 bit boot time page tables in native format.

Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.

early_ioremap is updated to use the standard page table accessors.

Clear any mappings...

To: Ian Campbell <ijc@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 6:06 pm

This is part of the trickiness with re-using the early pagetables
instead of rebuilding them from scratch - if PSE is available, we have
two options:

- either we build PSE page tables early (which means detecting PSE,
which means if there are any chip-specific CPUID workarounds they have
to be present in the early code), or

- when building the "complete" page tables, coalesce !PSE pagetables
into PSE entries where appropriate.

For PAT to work right, the first chunk probably should *not* be a PSE
page table, which complicates things further. (There is no TLB impact,
since a PSE page table at offset zero or that otherwise have an MTRR
conflict will be broken apart in hardware.) In the former case, it
means splitting it apart later; in the latter case it just means
excluding it from coalescing.

In other words, reusing the early page tables isn't all that
straightforward. It may easily be that it's better to build a new set
of page tables from scratch, however, it would *still* be beneficial to
have the early page tables be in the same format as the later one, since
it lets us use the fixmap area, and therefore {bt,early}_ioremap() much
sooner.

-hpa

--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 6:35 pm

Yes, and it simplifies Xen as it always starts guest domains in the
appropriate pagetable mode and doesn't let the guest change it on the
fly. If early_ioremap depends on non-PAE early pagetables in an
otherwise PAE kernel, we'd need to go to some effort to make sure all
the early_ioremap stuff is skipped (which would be possible but
unpleasant for domU, but very bad in dom0).

J

--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 6:39 pm

Yeah, and it's ugly for the kernel proper, so that bit is a no-brainer.
It's just a matter of hammering out the details.

It doesn't sound from the above that you have any opinion either way
about reusing the initial page tables or creating a new set, as long as
they're in the same format.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 6:58 pm

Right.

Xen provides a initial set of pagetables in the appropriate format, so
what head.S generates is moot. For simplicity I graft the Xen-provided
pagetables into swapper_pg_dir in xen_start_kernel, so it is the
functional equivalent to the head.S pagetable construction.

We also don't (yet) support PSE, so that's a non-issue for us too.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 7:08 pm

While we're mucking around in this area, there is another thing which we
should eventually get around to fixing:

we need a set of page tables with an identity mapping as well as the
kernel mapping, for trampolining (during startup, but also during things
like ACPI suspend/resume.) Right now, we let those be the swapper page
tables, but that's probably not really a good idea, since it can hide bugs.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 7:40 pm

So you're suggesting a second system pagetable which has a P=V alias as
well as the normal kernel mapping, used only when we actually need that
alias? Sounds simple enough to arrange.

J
--

To: Jeremy Fitzhardinge <jeremy@...>, Pavel Machek <pavel@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 7:51 pm

I just looked at the ACPI suspend code, and it looks like it hacks its
own identity map at runtime. Pavel, am I reading that code right?

-hpa
--

To: H. Peter Anvin <hpa@...>, Rafael J. Wysocki <rjw@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:20 pm

Yes, I think so, I believe we do it on both 32 and 64 bit now.

(It is early here. And I almost got the .c wakeup code to work... it
already sets the mode).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Pavel Machek <pavel@...>
Cc: Rafael J. Wysocki <rjw@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:27 pm

So the background to this... we need an identity map to trampoline at
early boot, obviously, but we'd like it to not stick around more than
necessary. We have zap_low_mappings() now but it's not really sufficient.

Secondary SMP processors need these mappings during trampolining --
presumably including CPU hotplug -- and I'm suspecting it might simply
make sense to use a separate set of page tables (with both the identity
and the kernel map) for trampolining and just keep them around. That

Sweet!

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Rafael J. Wysocki <rjw@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Friday, January 25, 2008 - 3:49 am

That would enable some cleanups, yes.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Pavel Machek <pavel@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Friday, January 25, 2008 - 6:02 pm

Speaking of cleanups, the following one is applicable IMO.

Greetings,
Rafael

---
arch/x86/mm/init_32.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -444,23 +444,23 @@ static void __init pagetable_init (void)
paravirt_pagetable_setup_done(pgd_base);
}

-#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
+#ifdef CONFIG_SUSPEND
/*
* Swap suspend & friends need this for resume because things like the intel-agp
* driver might have split up a kernel 4MB mapping.
*/
-char __nosavedata swsusp_pg_dir[PAGE_SIZE]
+char swsusp_pg_dir[PAGE_SIZE]
__attribute__ ((aligned (PAGE_SIZE)));

static inline void save_pg_dir(void)
{
memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
}
-#else
+#else /* !CONFIG_SUSPEND */
static inline void save_pg_dir(void)
{
}
-#endif
+#endif /* !CONFIG_SUSPEND */

void zap_low_mappings (void)
{
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 11:00 am

thanks, applied.

Ingo
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttil? <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 12:12 pm

hm, random-qa found build breakage with this patch:

arch/x86/kernel/built-in.o: In function `wakeup_start':
: undefined reference to `swsusp_pg_dir'

config attached.

Ingo

To: Ingo Molnar <mingo@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttil? <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>, Len Brown <lenb@...>
Date: Monday, January 28, 2008 - 1:02 pm

I see. CONFIG_HIBERNATION && CONFIG_ACPI -> CONFIG_ACPI_SLEEP
and the Makefile in arch/x86/kernel/acpi/ wants to build wakeup.S, which is
not necessary. Hmm.

We can do a couple of things:
(1) make wakeup_$(BITS).o depend on CONFIG_SUSPEND (alone)
This will build it if CONFIG_SUSPEND is set, but CONFIG_ACPI is not
(still, that's consistent with the change in question).
(2) make wakeup_$(BITS).o depend on CONFIG_SUSPEND and CONFIG_ACPI
(3) define CONFIG_ACPI_SUSPEND depending on ACPI and SUSPEND and
make wakeup_$(BITS).o as well as swsusp_pg_dir depend on that (most
elegant)

Which one do you prefer?

In case you choose (3), please drop the patch and I'll send a new one to Len.

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttil? <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>, Len Brown <lenb@...>
Date: Friday, February 1, 2008 - 9:51 am

no strong preference here - pick the one you like best and send a patch
please :-)

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttil? <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>, Len Brown <lenb@...>
Date: Friday, February 1, 2008 - 10:28 am

Here you go, but I think it falls into the ACPI category.

---
From: Rafael J. Wysocki <rjw@sisk.pl>

Since hibernation uses its own temporary page tables for restoring
the image kernel, swsusp_pg_dir is only needed for ACPI resume from
RAM. Also, some files under arch/x86/kernel/acpi need only be compiled
if ACPI suspend to RAM is going to be used.

Signed-off-by: Rafael J. Wysocki <rjw@sisk.pl>
---
arch/x86/kernel/acpi/Makefile | 2 +-
arch/x86/mm/init_32.c | 10 +++++-----
drivers/acpi/Kconfig | 5 +++++
3 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -423,23 +423,23 @@ static void __init pagetable_init(void)
paravirt_pagetable_setup_done(pgd_base);
}

-#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
+#ifdef CONFIG_ACPI_SUSPEND
/*
- * Swap suspend & friends need this for resume because things like the intel-agp
+ * ACPI suspend needs this for resume, because things like the intel-agp
* driver might have split up a kernel 4MB mapping.
*/
-char __nosavedata swsusp_pg_dir[PAGE_SIZE]
+char swsusp_pg_dir[PAGE_SIZE]
__attribute__ ((aligned(PAGE_SIZE)));

static inline void save_pg_dir(void)
{
memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
}
-#else
+#else /* !CONFIG_ACPI_SUSPEND */
static inline void save_pg_dir(void)
{
}
-#endif
+#endif /* !CONFIG_ACPI_SUSPEND */

void zap_low_mappings(void)
{
Index: linux-2.6/drivers/acpi/Kconfig
===================================================================
--- linux-2.6.orig/drivers/acpi/Kconfig
+++ linux-2.6/drivers/acpi/Kconfig
@@ -42,6 +42,11 @@ menuconfig ACPI

if ACPI

+config ACPI_SUSPEND
+ bool
+ depends on SUSPEND
+ default y
+
config ACPI_SLEEP
bool
depends on PM_SLEEP
Index: linux-2.6/arch/x86/kernel/acpi/Makefile
=============...

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttil? <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>, Len Brown <lenb@...>
Date: Friday, February 1, 2008 - 10:54 am

agreed - Len, would you mind to pick this patch up?

Acked-by: Ingo Molnar <mingo@elte.hu>

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Rafael J. Wysocki <rjw@...>, Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttil? <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Friday, February 1, 2008 - 6:55 pm

This won't work as written -- for the ACPI code doesn't currently optimize
for the HIBERNATE && ! SUSPEND case, and so both code paths are under ACPI_SLEEP.

While some day there may be a justification to make that optimization,
this isn't the day, and this isn't the patch.

So Rafael and I talked about it and decided to go with the simpler
patch below -- which I'll push.

thanks,
-Len

---
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index da524fb..f2f36f8 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -423,23 +423,23 @@ static void __init pagetable_init(void)
paravirt_pagetable_setup_done(pgd_base);
}

-#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
+#ifdef CONFIG_ACPI_SLEEP
/*
- * Swap suspend & friends need this for resume because things like the intel-agp
+ * ACPI suspend needs this for resume, because things like the intel-agp
* driver might have split up a kernel 4MB mapping.
*/
-char __nosavedata swsusp_pg_dir[PAGE_SIZE]
+char swsusp_pg_dir[PAGE_SIZE]
__attribute__ ((aligned(PAGE_SIZE)));

static inline void save_pg_dir(void)
{
memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
}
-#else
+#else /* !CONFIG_ACPI_SLEEP */
static inline void save_pg_dir(void)
{
}
-#endif
+#endif /* !CONFIG_ACPI_SLEEP */

void zap_low_mappings(void)
{
--

To: Ingo Molnar <mingo@...>
Cc: Pavel Machek <pavel@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 11:25 am

Thanks. Well, the comment should also be updated as I can see now. I'll send
a separate patch for that.

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 3:40 pm

Sorry, this is subtle and I've overlooked it before.

(I thought you were only changing ifdef).

Now you memcpy() over pg_dir when that pgdir is in use during swsusp
resume. Granted, you memcpy() with same data that already are there,
but it may still do some funny effects.

Hmm, but same argument applies to lower levels of paging in 64-bit and

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: Pavel Machek <pavel@...>
Cc: Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:26 pm

It is not. swsusp hasn't been using swsusp_pg_dir for several months.

Actually, no. We only do that with the kernel code mapping which should be
safe as long as TLBs are not flushed (and they aren't).

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:44 pm

Er, what? Assuming the TLB will retain some mappings while you
overwrite the pagetable is a highly dubious prospect. Are you copying
the same values over, or something else?

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:50 pm

As long as a relocatable kernel is not used to restore a non-relocatable one
(or vice versa), we're copying the same values over.

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 5:28 pm

So that case is deliberately considered broken?

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 6:02 pm

Not deliberately, but the fix I had caused a regression. It's just a pending
issue.

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:31 pm

Okay... does that in any way affect using the kernel code mapping
synchronization code to maintain a set of trampoline pagetables?

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:59 pm

I really don't think so.

Rafael
--

To: Pavel Machek <pavel@...>
Cc: Rafael J. Wysocki <rjw@...>, Ingo Molnar <mingo@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 3:51 pm

This really comes down to the concept that we should keep an
identity-mapped page table set around and keep it maintained.
Maintenance should be relatively cheap -- we don't care about the
vmalloc area (but if it's easier to have it, it won't cause any harm),
and we already have to have code to synchronize the PGDs on !PAE and the
PMDs on Xen (although that was supposedly getting fixed). This is
nothing very different than synchronizing yet another PGD[*] offset.

This obviously relates to (and needs to be on top of) the
always-native-pagetables work.

[*] = Almost. There is one exception: for 3 GB kernel:1 GB userspace,
we must ensure that only 1 GB of the kernel area is synced.

-hpa

--

To: H. Peter Anvin <hpa@...>
Cc: Pavel Machek <pavel@...>, Rafael J. Wysocki <rjw@...>, Ingo Molnar <mingo@...>, Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:03 pm

No, I don't have any plans there. Xen will continue to require
non-shared kernel pmd, at least for a 32-bit host. I think the point is
that nothing that requires an identity mapping will work under Xen
anyway, so Xen just doesn't care about this case.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Pavel Machek <pavel@...>, Rafael J. Wysocki <rjw@...>, Ingo Molnar <mingo@...>, Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:06 pm

Still makes it a special case, not just for this.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Pavel Machek <pavel@...>, Ingo Molnar <mingo@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Monday, January 28, 2008 - 4:28 pm

In fact swsusp creates its own temporary page tables for restoring the last
part of the image. Please have a look at
arch/x86/kernel/suspend_*_64.c and the files in arch/x86/power (most
importantly suspend.c).

Thanks,
Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: H. Peter Anvin <hpa@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Friday, January 25, 2008 - 6:11 pm

ACK... and BTW ack for that deferred device removal series.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html
--

To: H. Peter Anvin <hpa@...>
Cc: Pavel Machek <pavel@...>, Rafael J. Wysocki <rjw@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>
Date: Thursday, January 24, 2008 - 10:16 pm

We already do this on the 64bit side. We reuse the kernel and the
identity parts from the core kernel page tables but it is actually
a distinct page table.

x86_64 has not had the identity mappings mapped in any of the
normal page tables since the relocatable kernel support was merged
a while ago.

Only on the 32bit side does this still remain an issue. I don't know
if what we can do optimization wise there. Emulating the 64bit code
and having a dedicated top level pgd (as part of the trampoline)
and then a mapping into it the kernel identity mapping and the kernel
mapping (which are the same on 32bit) should work fairly easily.

It is just a handful of pgd entries, and then in the actual kernel
entry code we reload %cr3 with the appropriate kernel page table
and we should be fine. No need for an explicit zap there at all.

Eric
--

To: Eric W. Biederman <ebiederm@...>
Cc: Pavel Machek <pavel@...>, Rafael J. Wysocki <rjw@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>
Date: Thursday, January 24, 2008 - 10:25 pm

That's pretty much what I figure. The one issue is that on non-PAE
32-bit (or if we actually have to deal with unsharable PMDs on PAE
kernels) then the PGD (PMD) kernel mappings at least formally should
really be put in sync. This could be done either by the same code which
keeps the PGDs of various processes in sync already or on demand; I
believe my personal preference would be to have all that in the same
place, since we have to do it anyway, and this is nothing different
except for the offset.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Pavel Machek <pavel@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:46 pm

For clarity, are you referring to the code in arch/x86/kernel/acpi ?

Rafael
--

To: Rafael J. Wysocki <rjw@...>
Cc: Pavel Machek <pavel@...>, Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 9:08 pm

Yes.

-hpa
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 7:44 pm

Yes. We'd use it during initialization and at other times when we need
trampolining, but give the swapper something which only has the kernel map.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 7:51 pm

Hm, though Xen makes it all a bit more complex, as usual. In the PAE
case it wouldn't allow the pmd to be shared, so you'd have to allocate a
new pmd and copy into it. There's probably a way to deal with it within
the existing paravirt hooks...

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:02 pm

Yeah, I'm aware of this particular piece of Xen braindamage, and
although I had some very unkind words to say about it, it mirrors what
we have to do for the !PAE case anyway, so it can be sort of glossed over.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:11 pm

Sort of. If Xen weren't an issue, then both cases are a matter of
copying a set of entries from one place in the pgd to another.

It would be easy enough to add some code on xen side to look for pmd
aliases when using/pinning a pagetable, and allocate'n'copy a new pmd
page as needed. That way the core code can ignore the issue.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:15 pm

No, if Xen wasn't an issue there wouldn't be anything to do for the PAE
case at all (since the PGD is trivial.)

Copying PMDs is more or less an analogous case of the !PAE case, once
the allocation is already done. The allocation should be trivial

As much as I'd rather see Xen fixing this than having it continue to
impact the kernel, I presume it will take some time to flush the broken
hypervisors out?

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:31 pm

I think we're in vehement agreement here. In either case, its just a
matter of something like:

memcpy(pgd, &pgd[USER_PTRS_PER_PGD], sizeof(pgd_t) * KERNEL_PTRS_PER_PGD);

Sorry, I was unclear. I meant in the purely Xen-specific parts of the
kernel (arch/x86/xen). It wouldn't require a hypervisor change.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Thursday, January 24, 2008 - 8:37 pm

Oh, that makes that option much more viable and probably preferrable.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Jeremy Fitzhardinge <jeremy@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>
Date: Thursday, January 24, 2008 - 10:56 pm

[Empty message]
To: Eric W. Biederman <ebiederm@...>
Cc: H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>
Date: Friday, January 25, 2008 - 12:41 am

Indeed. The alias mapping can be set up in

Quite so.

J
--

To: Jeremy Fitzhardinge <jeremy@...>
Cc: H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>
Date: Friday, January 25, 2008 - 7:07 am

Good. Then this case gets easy.

We just need a pgd that has pgd entries that duplicate the kernel pgd entries
at both address 0 and at the normal kernel address.

In 64bit mode we make this part of the trampoline because we need a pgt below
4G so that we can point a 32bit %cr3 value at it. We can either use that
technique for the 32bit kernel (and be consistent) or we can have a single
trampoline/wakeup pgd that we use. As all pgd entries must be below 4G in
32bit mode.

Although if we really wanted to be restrictive we could have a much more limited
set of identity page table entries that only map the low 1M, or possibly just
640K.

Eric
--

To: Ingo Molnar <mingo@...>
Cc: H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 5:00 pm

nx_enabled can only be set to non-zero when CONFIG_X86_PAE is
set. The only use not currently inside a CONFIG_X86_PAE block
is the definition, the declaration and a conditional unlikely
test in fault_32.c (is_prefetch).

When !CONFIG_X86_PAE, is_prefetch always returns 0 immediately
as nx_enabled is always 0.

When CONFIG_X86_PAE, the test is preserved, but the test against
the cpu model and stepping is deleted, this may not be correct.

Signed-off-by: Harvey Harrison <harvey.harrison@gmail.com>
---
Ingo, further to your nx vs !nx comment, had this lying around,
needs testing, only affects the CONFIG_X86_PAE case.

arch/x86/mm/fault_32.c | 17 ++++++++---------
arch/x86/mm/fault_64.c | 17 ++++++++---------
arch/x86/mm/init_32.c | 4 +---
include/asm-x86/page_32.h | 2 ++
4 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/fault_32.c b/arch/x86/mm/fault_32.c
index 0bd2417..049c3bb 100644
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -81,16 +81,15 @@ static int is_prefetch(struct pt_regs *regs, unsigned long addr,
unsigned char *max_instr;

#ifdef CONFIG_X86_32
- if (unlikely(boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 >= 6)) {
- /* Catch an obscure case of prefetch inside an NX page. */
- if (nx_enabled && (error_code & PF_INSTR))
- return 0;
- } else {
+# ifdef CONFIG_X86_PAE
+ /* If it was a exec fault on NX page, ignore */
+ if (nx_enabled && (error_code & PF_INSTR))
return 0;
- }
-#else
- /* If it was a exec fault ignore */
+# else
+ return 0;
+# endif
+#else /* CONFIG_X86_64 */
+ /* If it was a exec fault on NX page, ignore */
if (error_code & PF_INSTR)
return 0;
#endif
diff --git a/arch/x86/mm/fault_64.c b/arch/x86/mm/fault_64.c
index ccbb8e3..33e8ced 100644
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -84,16 +84,15 @@ static int is_prefetch(struct pt_regs *regs, unsigned long addr,
un...

To: Ingo Molnar <mingo@...>
Cc: H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 5:07 pm

Sorry, I missed the usage in kernel/acpi/wakeup_32.S, that's the only
other user. I don't know that code well enough to comment on the usage
there, but if anybody knows if that could be conditionalized, please
advise.

Harvey

--

To: Harvey Harrison <harvey.harrison@...>
Cc: H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 5:04 pm

will keep an eye on it.

How far away are you from unifying fault_32.c and fault_64.c? You
already managed to line up their sizes:

$ wc -l arch/x86/mm/fault_*.c
742 arch/x86/mm/fault_32.c
734 arch/x86/mm/fault_64.c

;-)

and the raw diff between them doesnt look that bad either:

1 file changed, 127 insertions(+), 135 deletions(-)

so we might as well take a shot at that?

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: H. Peter Anvin <hpa@...>, Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 5:35 pm

Hmmm, the extern nx_enabled in page_32.h is already within an
ifndef __ASSEMBLY__ block, so I'm not sure how wakeup_32.S could
be using it. Thoughts?

Harvey

--

To: Ingo Molnar <mingo@...>
Cc: H. Peter Anvin <hpa@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 4:52 pm

I'm not sure I can promise that sort of coverage ;-) Will test on what
hardware I've got available...

Ian.
--
Ian Campbell

Modesty is a vastly overrated virtue.
-- J. K. Galbraith

--

To: Ian Campbell <ijc@...>
Cc: Ingo Molnar <mingo@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 5:00 pm

I tend to use simulators (e.g. Qemu) quite a bit. They let you tune
this kind of stuff.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ingo Molnar <mingo@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 6:21 pm

So do I but I'd never really investigated the option to fiddle with the
CPU type -- very useful though, thanks for the tip!

Ian.

--
Ian Campbell

No one can feel as helpless as the owner of a sick goldfish.

--

To: Ingo Molnar <mingo@...>
Cc: Ian Campbell <ijc@...>, Mika Penttilä <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 4:43 pm

PSE-less should be less of an issue than making sure we switch to using
large pages where appropriate, and enable the PGE and NX bits where
appropriate.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Ian Campbell <ijc@...>, Mika <mika.penttila@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 4:45 pm

yeah - and that would be the right point to enable gigapages as well -
once we have all this stuff consolidated and unified from grounds up.

Ingo
--

To: Ian Campbell <ijc@...>
Cc: <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Eric W. Biederman <ebiederm@...>
Date: Saturday, January 19, 2008 - 7:07 pm

Without a low identity mapping early_printk will not work and printk

You should do the test in the 16 bit boot code. In fact it should
already do it by testing the CPUID REQUIRED_MASK.

The only way this could be entered is if someone skips the 16bit boot code
by using kexec, but has the wrong flags. I'm not sure how to handle

That will break if the kernel is > 4GB won't it? Also same for pmd.

Are you sure there will be enough memory here? You might need to use
an e820 allocator similar to x86-64.

Typical problems is you running into some other memory used by

Unrelated?

-Andi
--

To: Andi Kleen <andi@...>
Cc: <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Eric W. Biederman <ebiederm@...>
Date: Sunday, January 20, 2008 - 12:44 pm

Indeed it does. I don't have any non-PAE to test it but I turned the
failure case into a simple jmp to hlt_loop since we ought never to get

The problem here is that we explicitly want native accessors because
it's too early to use the pv ops since we are still running P==V. A PV
kernel boot will never come down this path -- it is diverted earlier in
head_32.S so using the native versions are appropriate.

I'll try again to use the native_{make,set}_xxx functions but originally
I found the necessary variants weren't defined in all combinations of
PAE/not and PARAVIRT/not.

FWIW we use the same undef trick under arch/x86/boot too and this early

As hpa says we can't be above 4G at this point. Probably I can use some

I was trying to fit in with the native_foo stuff that is available and
happened to be using pud on my last attempt before I switched to the
#undef CONFIG_PARAVIRT approach. I'll switch to pgd if I can get it to
work.

pgd_none (and pud_none) are hardcoded to 0 in the 32 bit case (in
asm-generic/pgtable-nopud.h and asm-generic/pgtable-nopmd.h or
asm-x86/pgtable-3level.h). Presumably this is because at regular runtime
these entries are guaranteed to exist which isn't true this early at
startup.

In fact since we are always going to need a PMD in the PAE case there is
probably not much wrong with simply unconditionally allocating the pmd

True. However the assembly being replaced makes the same assumptions so
I don't think that should block this patch, it's a fixup that can be

Nope, the return type of early_ioremap_pte() changed unsigned long ->
pte_t and that is what is assigned to pte.

I'll spin another version.

Ian.
--
Ian Campbell

"I go on working for the same reason a hen goes on laying eggs."
-- H. L. Mencken

--

To: Ian Campbell <ijc@...>
Cc: Andi Kleen <andi@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, H. Peter Anvin <hpa@...>, Eric W. Biederman <ebiederm@...>
Date: Sunday, January 20, 2008 - 1:39 pm

There are various loaders (kexec, elilo, ...) that skip the 16bit code
and jump directly to 32bit head.S. So in theory those could hit it.

Then i think it would be cleaner to just open code everything without

The 32bit cast still feels unclean. After all the PTE is not 32bit.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Ian Campbell <ijc@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Sunday, January 20, 2008 - 2:48 pm

It's probably just as well, since we don't really know how to get a

I was thinking about this yesterday, and it seems to me that there are
two cleaner options here...

- either we should put in the full machinery to be able to run C code
compiled with -fPIC/-fPIE before paging is enabled. Unfortunately gcc
generates R_386_GOT32 relocations for external references even with
-fPIE, so we'll have to put in some code to adjust the GOT (easy enough
to do.)

As far as the native accessors are concerned, the right thing to do
would is to use the native_ forms thereof, not #undef CONFIG_PARAVIRT.

- alternatively, we recognize that this isn't all that big of a piece of
code and doing it in C really isn't necessary. We can have a small
assembly loop for PAE that matches the small assembly loop we already

No, but (pte_t *) is 32 bits. To be more "Linuxy" it probably should be
(long) or (unsigned long) though.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Andi Kleen <andi@...>, Ian Campbell <ijc@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 6:05 am

i'd _love_ to have this approach instead of the assembly routines. While
'constructing pagetables' might not look like a big deal in isolation -
C is still 10 times more programmable than assembly. Pushing more of the
early boot code into a sane, non-assembly environment will have positive
long-term effects all across.

Ingo
--

To: Ingo Molnar <mingo@...>
Cc: Andi Kleen <andi@...>, Ian Campbell <ijc@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Tuesday, January 22, 2008 - 12:23 pm

Yes, but that doesn't mean that this particular task is the right thing
for that job. In particular, the GOT adjustment wll be almost the same
size as the whole task.

On the other hand, there is a whole bunch of post-paging code in
head_32.S which doesn't need to be there.

-hpa
--

To: H. Peter Anvin <hpa@...>
Cc: Andi Kleen <andi@...>, Ian Campbell <ijc@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Sunday, January 20, 2008 - 2:55 pm

It would be robably possible to extend the 32bit protocol to some
way to error out in such a case. On the other hand I'm not sure it's really

That's not 32bit either.

-Andi
--

To: Andi Kleen <andi@...>
Cc: Ian Campbell <ijc@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Sunday, January 20, 2008 - 2:54 pm

Looked at the subject line?

-hpa

--

To: Andi Kleen <andi@...>
Cc: Ian Campbell <ijc@...>, <linux-kernel@...>, Thomas Gleixner <tglx@...>, Ingo Molnar <mingo@...>, Eric W. Biederman <ebiederm@...>
Date: Saturday, January 19, 2008 - 7:50 pm

[Empty message]
Previous thread: [PATCH][RESEND] sh: termios ioctl definitions by Alan Cox on Saturday, January 19, 2008 - 12:05 pm. (2 messages)

Next thread: [PATCH] ata_generic: Cenatek support by Alan Cox on Saturday, January 19, 2008 - 12:07 pm. (1 message)