Re: [PATCH 1 of 7] x86: add _PAGE_IOMAP pte flag for IO mappings

Previous thread: [PATCH 00/18] ide: add generic ATA/ATAPI disk driver by Bartlomiej Zolnierkiewicz on Sunday, September 7, 2008 - 3:14 pm. (22 messages)

Next thread: [regression] __tick_program_event of hpet is stuck by Frans Pop on Sunday, September 7, 2008 - 4:14 pm. (11 messages)
From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

Hi Ingo,

This series begins to lay the groundwork for Xen domain 0 support.
Domain 0 is much like a normal Xen domain, but it is also allowed to
have direct access to the underlying hardware (including both IO space
and various BIOS tables), so it can run device drivers, etc.  This
means that we need to be able to distinguish between a Xen domain's
normal pseudo-physical address space, and the real machine physical
address space.

This series:
 - Adds a new pte flag - _PAGE_IOMAP - used to indicate that a mapping
   is intended to map an IO device, and the physical address is
   actually a real machine physical address rather than a
   pseudo-physical address.  This is exposed as PAGE_KERNEL_IO and so
   on.  ioremap() and early_ioremap() are modified to set this flag,
   as they (should) always be used to map IO devices and not other
   memory.  By default __supported_pte_mask masks this flag out, so it
   won't end up in the final pagetable.  The Xen (and any other) code
   which cares about this flag unmask it from __supported_pte_mask.

 - Add early_memremap() to deal with the cases where early_ioremap()
   actually being used to map normal memory.

 - Remove a bogus check in x86-64's implementation of
   set_pte_vaddr_pud(), which prevents memory mappings from being
   updated.

 - Convert __acpi_map_table to use early_ioremap(), rather than have
   its own implementation.

 - Make __acpi_map_table always map the memory rather than assuming
   that the linear map maps some of the acpi tables.  This won't be
   true in the virtual case, and always mapping doesn't hurt in the
   native case.

I've tested these patches on both 32 and 64 native booting, and it all
works for me.

Thanks,
	J

--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

early_ioremap() is redeclared in several places; remove them.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 include/asm-x86/io.h    |   14 --------------
 include/asm-x86/io_64.h |    3 ---
 2 files changed, 17 deletions(-)

diff --git a/include/asm-x86/io.h b/include/asm-x86/io.h
--- a/include/asm-x86/io.h
+++ b/include/asm-x86/io.h
@@ -4,20 +4,6 @@
 #define ARCH_HAS_IOREMAP_WC
 
 #include <linux/compiler.h>
-
-/*
- * early_ioremap() and early_iounmap() are for temporary early boot-time
- * mappings, before the real ioremap() is functional.
- * A boot-time mapping is currently limited to at most 16 pages.
- */
-#ifndef __ASSEMBLY__
-extern void early_ioremap_init(void);
-extern void early_ioremap_clear(void);
-extern void early_ioremap_reset(void);
-extern void *early_ioremap(unsigned long offset, unsigned long size);
-extern void early_iounmap(void *addr, unsigned long size);
-extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys);
-#endif
 
 #define build_mmio_read(name, size, type, reg, barrier) \
 static inline type name(const volatile void __iomem *addr) \
diff --git a/include/asm-x86/io_64.h b/include/asm-x86/io_64.h
--- a/include/asm-x86/io_64.h
+++ b/include/asm-x86/io_64.h
@@ -165,9 +165,6 @@
 
 #include <asm-generic/iomap.h>
 
-extern void *early_ioremap(unsigned long addr, unsigned long size);
-extern void early_iounmap(void *addr, unsigned long size);
-
 /*
  * This one maps high address device memory and turns off caching for that area.
  * it's useful if some control registers are in such an area and write combining


--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

Use one of the software-defined PTE bits to indicate that a mapping is
intended for an IO address.  On native hardware this is irrelevent,
since a physical address is a physical address.  But in a virtual
environment, physical addresses are also virtualized, so there needs
to be some way to distinguish between pseudo-physical addresses and
actual hardware addresses; _PAGE_IOMAP indicates this intent.

By default, __supported_pte_mask masks out _PAGE_IOMAP, so it doesn't
even appear in the final pagetable.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/mm/init_32.c     |    2 +-
 arch/x86/mm/init_64.c     |    2 +-
 arch/x86/mm/ioremap.c     |    8 ++++----
 include/asm-x86/fixmap.h  |    2 +-
 include/asm-x86/pgtable.h |   14 ++++++++++++--
 5 files changed, 19 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -503,7 +503,7 @@
 
 int nx_enabled;
 
-pteval_t __supported_pte_mask __read_mostly = ~(_PAGE_NX | _PAGE_GLOBAL);
+pteval_t __supported_pte_mask __read_mostly = ~(_PAGE_NX | _PAGE_GLOBAL | _PAGE_IOMAP);
 EXPORT_SYMBOL_GPL(__supported_pte_mask);
 
 #ifdef CONFIG_X86_PAE
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -88,7 +88,7 @@
 
 int after_bootmem;
 
-unsigned long __supported_pte_mask __read_mostly = ~0UL;
+pteval_t __supported_pte_mask __read_mostly = ~_PAGE_IOMAP;
 EXPORT_SYMBOL_GPL(__supported_pte_mask);
 
 static int do_not_nx __cpuinitdata;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -223,16 +223,16 @@
 	switch (prot_val) {
 	case _PAGE_CACHE_UC:
 	default:
-		prot = PAGE_KERNEL_NOCACHE;
+		prot = PAGE_KERNEL_IO_NOCACHE;
 		break;
 	case _PAGE_CACHE_UC_MINUS:
-		prot = PAGE_KERNEL_UC_MINUS;
+		prot = PAGE_KERNEL_IO_UC_MINUS;
 		break;
 	case ...
From: Avi Kivity
Date: Tuesday, September 9, 2008 - 6:32 am

Could PTE_SPECIAL, added for get_user_pages_really_fast(), be reused for 
this?

-- 
error compiling committee.c: too many arguments to function

--

From: Jeremy Fitzhardinge
Date: Tuesday, September 9, 2008 - 7:47 am

I'm not sure; I still don't really understand how _PAGE_SPECIAL gets
used, other than being user-mode mapping only.  But in principle,
_PAGE_IOMAP could be set on both kernel and user mappings (if you direct
map a device into a process address space), so I think they would
conflict then?

Also, _PAGE_SPECIAL is also shared with _PAGE_CPA_TEST, which is only
used on kernel mappings, so they can co-exist happily.

Is _PAGE_IOMAP at all useful for device passthrough in kvm?

    J
--

From: Avi Kivity
Date: Tuesday, September 9, 2008 - 7:56 am

It's a "don't refcount me" flag, which is not sematically the same as 

No.  We don't care if a page is an I/O page or RAM (other than 
refcounting correctness and page attributes, which are handled by other 
means).

-- 
error compiling committee.c: too many arguments to function

--

From: Keir Fraser
Date: Tuesday, September 9, 2008 - 8:29 am

That's basically what our _PAGE_IO flag (in our old Linux patchset) means.
We use it to cause pte_pfn() to return an invalid pfn and hence avoid
reference counting that way. Since kernel mappings are never reference
counted (I think?) perhaps we could use _PAGE_SPECIAL even if it is
restricted to use on user mappings.

 -- Keir


--

From: Jeremy Fitzhardinge
Date: Tuesday, September 9, 2008 - 8:48 am

Well, _PAGE_IOMAP's most important semantic from Xen's perspective is
that the frame number is considered to already be an MFN and so isn't
converted.  It may be that _PAGE_SPECIAL is also useful for its "no
refcount" properties, but we could set both in that case.

At present we can't set _PAGE_SPECIAL on kernel mappings because we
assume its usermode only, and its shared with _PAGE_CPA_TEST which is
kernel-only.  Most _PAGE_IOMAP mappings are in kernel space, but I
wouldn't discount user mappings if we end up allowing direct access to
passthrough devices.

    J
--

From: Keir Fraser
Date: Tuesday, September 9, 2008 - 9:05 am

Oh, yes, our _PAGE_IO has the same meaning.

 -- Keir


--

From: Avi Kivity
Date: Wednesday, September 10, 2008 - 2:55 am

Actually it's more of a "no struct page" flag, which implies no 
refcounting.  And not having a struct page should correspond well to a 
pte not requiring pfn->mfn conversion and being an I/O page.

-- 
error compiling committee.c: too many arguments to function

--

From: Jeremy Fitzhardinge
Date: Wednesday, September 10, 2008 - 9:38 am

Hm, is that actually true enough to define it?  Could we rename it
something like _PAGE_NOSTRUCTPAGE or something a bit more specific than

But _PAGE_SPECIAL is only set in a few places.  It's not set in ioremap
mappings and so on.  Should it be?

There's also the hiccup that it gets set in a pte with pte_mkspecial() -
but at that point its too late because you've already constructed the
pte and done the pfn->mfn conversion.  _PAGE_IOMAP can only be set when
you initially construct the pte out of a frame number and a pgprot.

    J
--

From: Nick Piggin
Date: Wednesday, September 10, 2008 - 9:55 am

It complements vm_normal_page, which was there first (and coined by
Linus). It is the opposite of normal. This question always comes up
and my answer is always yes, if you can convince Linus to rename
vm_normal_page to the corresponding term :)

It's not exactly _PAGE_NOSTRUCTPAGE. There can be struct pages under

Kernel address space, you mean? No, it is only ever used on user

I don't see this would be any problem because the pte is always constructed
in a single line in both places where it is used.
--

From: Jeremy Fitzhardinge
Date: Wednesday, September 10, 2008 - 10:27 am

Not really.  Normal is normal, but "special" doesn't tell us what kind

To the extent that the struct page may as well not exist?  Does it
contain any meaningful state?  Are they always IO mappings?  Could we

Right.  But if we fold _PAGE_SPECIAL and _PAGE_IOMAP together, it would
start getting used on kernel addresses (and obviously we'd need to

OK.  If we were to fold these two together, then pte_mkspecial() would
have to go, since it wouldn't possible to use correctly in my use case.

    J
--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

The remappings in setup.c are all just ordinary memory, so use
early_memremap() rather than early_ioremap().

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/setup.c |    8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -302,7 +302,7 @@
 		if (clen > MAX_MAP_CHUNK-slop)
 			clen = MAX_MAP_CHUNK-slop;
 		mapaddr = ramdisk_image & PAGE_MASK;
-		p = early_ioremap(mapaddr, clen+slop);
+		p = early_memremap(mapaddr, clen+slop);
 		memcpy(q, p+slop, clen);
 		early_iounmap(p, clen+slop);
 		q += clen;
@@ -379,7 +379,7 @@
 		return;
 	pa_data = boot_params.hdr.setup_data;
 	while (pa_data) {
-		data = early_ioremap(pa_data, PAGE_SIZE);
+		data = early_memremap(pa_data, PAGE_SIZE);
 		switch (data->type) {
 		case SETUP_E820_EXT:
 			parse_e820_ext(data, pa_data);
@@ -402,7 +402,7 @@
 		return;
 	pa_data = boot_params.hdr.setup_data;
 	while (pa_data) {
-		data = early_ioremap(pa_data, sizeof(*data));
+		data = early_memremap(pa_data, sizeof(*data));
 		e820_update_range(pa_data, sizeof(*data)+data->len,
 			 E820_RAM, E820_RESERVED_KERN);
 		found = 1;
@@ -428,7 +428,7 @@
 		return;
 	pa_data = boot_params.hdr.setup_data;
 	while (pa_data) {
-		data = early_ioremap(pa_data, sizeof(*data));
+		data = early_memremap(pa_data, sizeof(*data));
 		sprintf(buf, "setup data %x", data->type);
 		reserve_early(pa_data, pa_data+sizeof(*data)+data->len, buf);
 		pa_data = data->next;


--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

early_ioremap() is also used to map normal memory when constructing
the linear memory mapping.  However, since we sometimes need to be able
to distinguish between actual IO mappings and normal memory mappings,
add a early_memremap() call, which maps with PAGE_KERNEL (as opposed
to PAGE_KERNEL_IO for early_ioremap()), and use it when constructing
pagetables.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/mm/init_64.c |    2 +-
 arch/x86/mm/ioremap.c |   22 +++++++++++++++++-----
 include/asm-x86/io.h  |    1 +
 3 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -312,7 +312,7 @@
 	if (pfn >= table_top)
 		panic("alloc_low_page: ran out of memory");
 
-	adr = early_ioremap(pfn * PAGE_SIZE, PAGE_SIZE);
+	adr = early_memremap(pfn * PAGE_SIZE, PAGE_SIZE);
 	memset(adr, 0, PAGE_SIZE);
 	*phys  = pfn * PAGE_SIZE;
 	return adr;
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -549,12 +549,12 @@
 }
 
 static inline void __init early_set_fixmap(enum fixed_addresses idx,
-					unsigned long phys)
+					   unsigned long phys, pgprot_t prot)
 {
 	if (after_paging_init)
-		set_fixmap(idx, phys);
+		__set_fixmap(idx, phys, prot);
 	else
-		__early_set_fixmap(idx, phys, PAGE_KERNEL);
+		__early_set_fixmap(idx, phys, prot);
 }
 
 static inline void __init early_clear_fixmap(enum fixed_addresses idx)
@@ -582,7 +582,7 @@
 }
 late_initcall(check_early_ioremap_leak);
 
-void __init *early_ioremap(unsigned long phys_addr, unsigned long size)
+static void __init *__early_ioremap(unsigned long phys_addr, unsigned long size, pgprot_t prot)
 {
 	unsigned long offset, last_addr;
 	unsigned int nrpages, nesting;
@@ -631,7 +631,7 @@
 	idx0 = FIX_BTMAP_BEGIN - NR_FIX_BTMAPS*nesting;
 	idx = idx0;
 	while (nrpages > 0) {
-		early_set_fixmap(idx, ...
From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

The check prevents flags on mappings from being changed, which is not
desireable.  There's no need to check for replacing a mapping, and
x86-32 does not do this check.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/mm/init_64.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -195,9 +195,6 @@
 	}
 
 	pte = pte_offset_kernel(pmd, vaddr);
-	if (!pte_none(*pte) && pte_val(new_pte) &&
-	    pte_val(*pte) != (pte_val(new_pte) & __supported_pte_mask))
-		pte_ERROR(*pte);
 	set_pte(pte, new_pte);
 
 	/*


--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

__acpi_map_table() effectively reimplements early_ioremap().  Rather
than have that duplication, just implement it in terms of
early_ioremap().

However, unlike early_ioremap(), __acpi_map_table() just maintains a
single mapping which gets replaced each call, and has no corresponding
unmap function.  Implement this by just removing the previous mapping
each time its called.  Unfortunately, this will leave a stray mapping
at the end.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/acpi/boot.c |   27 +++++++--------------------
 include/asm-x86/acpi.h      |    3 ---
 include/asm-x86/fixmap_32.h |    4 ----
 include/asm-x86/fixmap_64.h |    4 ----
 4 files changed, 7 insertions(+), 31 deletions(-)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -121,8 +121,8 @@
  */
 char *__init __acpi_map_table(unsigned long phys, unsigned long size)
 {
-	unsigned long base, offset, mapped_size;
-	int idx;
+	static char *prev_map;
+	static unsigned long prev_size;
 
 	if (!phys || !size)
 		return NULL;
@@ -130,26 +130,13 @@
 	if (phys+size <= (max_low_pfn_mapped << PAGE_SHIFT))
 		return __va(phys);
 
-	offset = phys & (PAGE_SIZE - 1);
-	mapped_size = PAGE_SIZE - offset;
-	clear_fixmap(FIX_ACPI_END);
-	set_fixmap(FIX_ACPI_END, phys);
-	base = fix_to_virt(FIX_ACPI_END);
+	if (prev_map)
+		early_iounmap(prev_map, prev_size);
 
-	/*
-	 * Most cases can be covered by the below.
-	 */
-	idx = FIX_ACPI_END;
-	while (mapped_size < size) {
-		if (--idx < FIX_ACPI_BEGIN)
-			return NULL;	/* cannot handle this */
-		phys += PAGE_SIZE;
-		clear_fixmap(idx);
-		set_fixmap(idx, phys);
-		mapped_size += PAGE_SIZE;
-	}
+	prev_size = size;
+	prev_map = early_ioremap(phys, size);
 
-	return ((unsigned char *)base + offset);
+	return prev_map;
 }
 
 #ifdef CONFIG_PCI_MMCONFIG
diff --git a/include/asm-x86/acpi.h b/include/asm-x86/acpi.h
--- ...
From: Andi Kleen
Date: Sunday, September 7, 2008 - 4:44 pm

It would be better to just fix the ACPI code to unmap.

-Andi
--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 5:03 pm

I was concerned that would cause lots of cross-arch churn, but of course
the only other relevant architecture is ia64.  I'll prep a followup patch.

    J
--

From: Ingo Molnar
Date: Monday, September 8, 2008 - 7:26 am

uhm, there's a nasty trap in that route: it can potentially cause a lot 
of breakage.

It's not robust to assume that the ACPI code is sane wrt. 
mapping/unmapping, because it currently simply doesnt rely on robust 
unmapping (in the linear range).

I tried it in the past and i found tons of crappy ACPI code all around 
that just never unmapped tables. Leaking ACPI maps are hard to find as 
well, and it can occur anytime during bootup.

As a general principle it might be worth fixing those places, and we've 
hardened up the early-ioremap code for leaks during the PAT rewrite, 
still please realize that it can become non-trivial and it might cause a 
lot of unhappy users.

So i'd suggest a different, more carful approach: keep the new code you 
wrote, but print a WARN()ing if prev_map is not unmapped yet when the 
next mapping is acquired. That way the ACPI code can be fixed gradually 
and without breaking existing functionality.

There's another complication: ACPI might rely on multiple mappings being 
present at once, so unmapping the previous one might not be safe. But it 
_should_ be fine most of the time as __acpi_map_table() is only used 
inearly init code - and we fixed most of these things in the PAT 
patchset in any case.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Monday, September 8, 2008 - 9:29 am

You mean there's code which just assumes that it can keep using a
linear-mapped acpi even after __acpi_map_table() should have implicitly

__acpi_map_table() is called by acpi_map_table(), which does have a
acpi_unmap_table() counterpart.  But it only calls iounmap() once we're
past the stage of calling early_*().  I could easily make it call
__acpi_unmap_table()->early_iounmap().  But if the concern is that the
early boot callers of acpi_map_table() "know" that they never need to


And the current behaviour of __acpi_map_table() is to remove the
previous mapping (implicitly, by overwriting the same fixmap slots), so
its only an issue if the callers assume they can keep using
linear-mapped acpi tables after a subsequent call to __acpi_map_table().

    J
--

From: Jeremy Fitzhardinge
Date: Monday, September 8, 2008 - 12:41 pm

Yes, you're right, it's a mess.  I put a unmap and warning in there, but
there's lots of acpi code which doesn't unmap after using its table.  It
doesn't seem to intermix using two tables, fortunately, so the "unmap
previous" behaviour of __acpi_map_table works OK, at least as far as I
can see.

    J
--

From: Ingo Molnar
Date: Wednesday, September 10, 2008 - 4:55 am

ok, i stuck in your patches into tip/master today and -tip testing 
quickly found an early-ioremap leak:

[   36.625100] calling  check_early_ioremap_leak+0x0/0x3d
[   36.630253] ------------[ cut here ]------------
[   36.634884] WARNING: at arch/x86/mm/ioremap.c:577 check_early_ioremap_leak+0x28/0x3d()
[   36.642811] Debug warning: early ioremap leak of 1 areas detected.

find the full log below with ioremap-leak-tracing turned on. I've 
excluded these commits for now from tip/master.

	Ingo

---------------->
[    0.000000] Linux version 2.6.27-rc6-tip-00189-gafeef79-dirty (mingo@titan) (gcc version 4.2.3) #1 SMP Wed Sep 10 13:53:53 CEST 2008
[    0.000000] Command line: root=/dev/sda1 earlyprintk=serial,ttyS0,115200 console=ttyS0,115200 console=tty 5 profile=0 debug initcall_debug apic=debug apic=verbose ignore_loglevel sysrq_always_enabled pci=nomsi early_ioremap_debug
[    0.000000] KERNEL supported cpus:
[    0.000000]   Intel GenuineIntel
[    0.000000]   AMD AuthenticAMD
[    0.000000] BIOS-provided physical RAM map:
[    0.000000]  BIOS-e820: 0000000000000000 - 000000000009fc00 (usable)
[    0.000000]  BIOS-e820: 000000000009fc00 - 00000000000a0000 (reserved)
[    0.000000]  BIOS-e820: 00000000000e0000 - 0000000000100000 (reserved)
[    0.000000]  BIOS-e820: 0000000000100000 - 000000003ed94000 (usable)
[    0.000000]  BIOS-e820: 000000003ed94000 - 000000003ee4e000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000003ee4e000 - 000000003fea2000 (usable)
[    0.000000]  BIOS-e820: 000000003fea2000 - 000000003fee9000 (ACPI NVS)
[    0.000000]  BIOS-e820: 000000003fee9000 - 000000003feed000 (usable)
[    0.000000]  BIOS-e820: 000000003feed000 - 000000003feff000 (ACPI data)
[    0.000000]  BIOS-e820: 000000003feff000 - 000000003ff00000 (usable)
[    0.000000] console [earlyser0] enabled
[    0.000000] debug: ignoring loglevel setting.
[    0.000000] last_pfn = 0x3ff00 max_arch_pfn = 0x3ffffffff
[    0.000000] init_memory_mapping
[    0.000000]  0000000000 - 003fe00000 page ...
From: Jeremy Fitzhardinge
Date: Wednesday, September 10, 2008 - 9:49 am

Yes, that leak is expected, unfortunately.  __acpi_map_table() has no
corresponding unmap, and only maintains one mapping.  So it will leak
its last mapping when it switches over from using __acpi_map_table() to
ioremap().

So, yes, its ugly, but its guaranteed to be a single leaked mapping. 
But I'm not sure what the best approach to deal with it is.

(All those other backtraces are just informational, right?)

    J
--

From: Ingo Molnar
Date: Thursday, September 11, 2008 - 12:33 am

the false positive should be avoided, for example by unmapping the 

no, they cause hard failures in my test setup. Nor do we want to litter 
the bootup log with messages that are not correct.

	Ingo
--

From: Jeremy Fitzhardinge
Date: Thursday, September 11, 2008 - 11:36 am

Yep.  I'll make it unmap just before setting acpi_gbl_permanent_mmap,

No, I mean that early_ioremap_debug causes all early_ioremap()s to dump
a stack trace, so most of the stack dumps in the boot log are
diagnostic/informational rather than indications of lots of problems.

    J
--

From: Ingo Molnar
Date: Thursday, September 11, 2008 - 11:56 am

ah - now i understand what you mean, sure. The only non-diagnostic 
message that the test barfed on was the leak warning itself. The others 
were there because i did a second boot with the diagnostic messages 
enabled as well. (so that there's a full track of what is mapped where)

	Ingo
--

From: Yinghai Lu
Date: Thursday, September 11, 2008 - 1:34 pm

acpi_os_map_memory is the only user for __acpi_map_table (except es7000_32.c)

void __iomem *__init_refok
acpi_os_map_memory(acpi_physical_address phys, acpi_size size)
{
        if (phys > ULONG_MAX) {
                printk(KERN_ERR PREFIX "Cannot map memory that high\n");
                return NULL;
        }
        if (acpi_gbl_permanent_mmap)
                /*
                * ioremap checks to ensure this is in reserved space
                */
                return ioremap((unsigned long)phys, size);
        else
                return __acpi_map_table((unsigned long)phys, size);
}
EXPORT_SYMBOL_GPL(acpi_os_map_memory);

void acpi_os_unmap_memory(void __iomem * virt, acpi_size size)
{
        if (acpi_gbl_permanent_mmap) {
                iounmap(virt);
        }
}
EXPORT_SYMBOL_GPL(acpi_os_unmap_memory);

just let acpi_os_unmap_memory to call __acpi_unmap_table...

YH
--

From: Jeremy Fitzhardinge
Date: Thursday, September 11, 2008 - 2:07 pm

That was my first attempt.  Unfortunately a lot of the acpi code seems
pretty sloppy about unmapping its tables, and basically relies on
__acpi_map_table's current behaviour of removing the previous mapping
when creating the new mapping.

    J
--

From: Ingo Molnar
Date: Friday, September 12, 2008 - 2:49 am

could we just emit a WARN_ON_ONCE() warning when we have to remove a 
previous mapping, so that the ACPI code can be fixed eventually?

	Ingo
--

From: Yinghai Lu
Date: Friday, September 12, 2008 - 10:31 am

could be some code is shared between early and permanent, and later
may not need unmap.

anyway those code need to be cleanup, to double check if some fix-map
is overwrited unexpected..

YH
--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 3:21 pm

Always map acpi tables, rather than assuming we can use the normal
linear mapping to access the acpi tables.  This is necessary in a
virtual environment where the linear mappings are to pseudo-physical
memory, but the acpi tables exist at a real physical address.  It
doesn't hurt to map in the normal non-virtual case, so just do it
unconditionally.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 arch/x86/kernel/acpi/boot.c |    3 ---
 1 file changed, 3 deletions(-)

diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -126,9 +126,6 @@
 
 	if (!phys || !size)
 		return NULL;
-
-	if (phys+size <= (max_low_pfn_mapped << PAGE_SHIFT))
-		return __va(phys);
 
 	if (prev_map)
 		early_iounmap(prev_map, prev_size);


--

From: Yinghai Lu
Date: Sunday, September 7, 2008 - 4:35 pm

actually,
case 1: acpi tables near mmio, range, we don't map them from
2.6.27-rc1, and it is bigger than max_low_mapped...
case 2: some strange system put acpi in the middle of RAM... like when
8G ram installed, but MMIO is 3.5G, BIOS put acpi tables around 2G..

YH
--

From: Jeremy Fitzhardinge
Date: Sunday, September 7, 2008 - 5:02 pm

OK, so what's your conclusion?  Is this change OK or not?

    J
--

From: Yinghai Lu
Date: Sunday, September 7, 2008 - 5:14 pm

Acked-by: Yinghai Lu <yhlu.kernel@gmail.com>
--

Previous thread: [PATCH 00/18] ide: add generic ATA/ATAPI disk driver by Bartlomiej Zolnierkiewicz on Sunday, September 7, 2008 - 3:14 pm. (22 messages)

Next thread: [regression] __tick_program_event of hpet is stuck by Frans Pop on Sunday, September 7, 2008 - 4:14 pm. (11 messages)