login
Header Space

 
 

Re: Linux 2.6.26-rc1 - pgtable_32.c:178 pmd_bad

Previous thread: [PATCH 1/4] do_wait reorganization by Roland McGrath on Monday, May 5, 2008 - 8:32 pm. (6 messages)

Next thread: [PATCH 1/3] PNP: cleanup pnp_fixup_device() by Rene Herman on Monday, May 5, 2008 - 9:08 pm. (4 messages)
To: Linus Torvalds <torvalds@...>
Cc: Linux Kernel Mailing List <linux-kernel@...>
Date: Monday, May 5, 2008 - 9:06 pm

On Sun, May 4, 2008 at 3:38 AM, Linus Torvalds

I'm seeing this on a Dell 2950 (quad-core) during boot up, but not on
my IBM X60s (dual-core). This is using the lastest git 2.6.26-rc1.

Just checking to see if this is a known bug before doing the bisect --
more painful on the 2950 to reboot.

Thanks,
Jeff.


Freeing unused kernel memory: 252k freed
------------[ cut here ]------------
WARNING: at arch/x86/mm/pgtable_32.c:178 pmd_bad+0xa3/0xe3()
Modules linked in:
Pid: 1226, comm: khelper Not tainted 2.6.26-rc1 #4
 [&lt;c0121614&gt;] warn_on_slowpath+0x40/0x65
 [&lt;c011823e&gt;] kmap_atomic+0x11/0x14
 [&lt;c01183b7&gt;] kunmap_atomic+0x4f/0x78
 [&lt;c014a0db&gt;] get_page_from_freelist+0x33a/0x3da
 [&lt;c014a225&gt;] __alloc_pages_internal+0x98/0x356
 [&lt;c01181f8&gt;] kmap_atomic_prot+0x102/0x137
 [&lt;c01183b7&gt;] kunmap_atomic+0x4f/0x78
 [&lt;c0154dd0&gt;] handle_mm_fault+0xb70/0xbba
 [&lt;c011770c&gt;] pmd_bad+0xa3/0xe3
 [&lt;c0151cad&gt;] follow_page+0x109/0x30a
 [&lt;c015516b&gt;] get_user_pages+0x351/0x3c0
 [&lt;c016aec4&gt;] get_arg_page+0x2b/0x7b
 [&lt;c016afe1&gt;] copy_strings+0xcd/0x173
 [&lt;c016b0a0&gt;] copy_strings_kernel+0x19/0x27
 [&lt;c016c28a&gt;] do_execve+0xd9/0x18f
 [&lt;c010136a&gt;] sys_execve+0x2a/0x4f
 [&lt;c0102d22&gt;] syscall_call+0x7/0xb
 [&lt;c038007b&gt;] i8042_probe+0x29f/0x508
 [&lt;c0106034&gt;] kernel_execve+0x14/0x18
 [&lt;c012dcfc&gt;] ____call_usermodehelper+0x0/0x105
 [&lt;c012ddf7&gt;] ____call_usermodehelper+0xfb/0x105
 [&lt;c012dcfc&gt;] ____call_usermodehelper+0x0/0x105
 [&lt;c0103917&gt;] kernel_thread_helper+0x7/0x10
 =======================
---[ end trace 6aae36751415b63a ]---
Adding 27350652k swap on /dev/sda3.  Priority:-1 extents:1 across:27350652k
--
To: Jeff Chua <jeff.chua.linux@...>
Cc: Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>
Date: Tuesday, May 6, 2008 - 8:49 am

no, i have not seen this reported yet.

we have an strace/ptrace fix pending in x86.git - but that should not 
affect khelper.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>
Date: Tuesday, May 6, 2008 - 9:56 am

You need PAE (HIGHMEM64G), and probably HIGHPTE and &gt;4G, to see it.

Don't spend any time on bisection: it bothered me too,


No, it comes from your pmd_bad_v1/pmd_bad_v2 debug patch, which
Linus objected to, yet has crept into his tree nonetheless.

I looked back through the mails and logs on that, and disagree
with most of what was done there.  Here's my pending fix below,
but I've not yet tested whether it builds correctly on all relevant
configs, nor whether my pmd_huge does what's intended, nor written
the necessary comments, nor worked out the Cc list - I'm going out
for a short while, all those will follow later, this sent as a
heads up to relieve Jeff from bisecting.

Not-yet-Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
---

 arch/x86/mm/hugetlbpage.c    |    3 +++
 arch/x86/mm/pgtable_32.c     |    7 -------
 include/asm-x86/pgtable_32.h |    9 +--------
 include/asm-x86/pgtable_64.h |    6 ++----
 mm/memory.c                  |    5 ++++-
 5 files changed, 10 insertions(+), 20 deletions(-)

--- 2.6.26-rc1/arch/x86/mm/hugetlbpage.c	2008-04-17 03:49:44.000000000 +0100
+++ linux/arch/x86/mm/hugetlbpage.c	2008-05-06 14:13:24.000000000 +0100
@@ -205,6 +205,9 @@ follow_huge_addr(struct mm_struct *mm, u
 
 int pmd_huge(pmd_t pmd)
 {
+	pmd_t unhuge_pmd = __pmd(pmd_val(pmd) &amp; ~(_PAGE_PSE | _PAGE_NX));
+	if (pmd_bad(unhuge_pmd))
+		return 0;
 	return !!(pmd_val(pmd) &amp; _PAGE_PSE);
 }
 
--- 2.6.26-rc1/arch/x86/mm/pgtable_32.c	2008-05-03 21:54:41.000000000 +0100
+++ linux/arch/x86/mm/pgtable_32.c	2008-05-06 14:13:24.000000000 +0100
@@ -172,10 +172,3 @@ void reserve_top_address(unsigned long r
 	__FIXADDR_TOP = -reserve - PAGE_SIZE;
 	__VMALLOC_RESERVE += reserve;
 }
-
-int pmd_bad(pmd_t pmd)
-{
-	WARN_ON_ONCE(pmd_bad_v1(pmd) != pmd_bad_v2(pmd));
-
-	return pmd_bad_v1(pmd);
-}
--- 2.6.26-rc1/include/asm-x86/pgtable_32.h	2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable_32.h	2008-05-06 14:13:24.000000000 +0100
@@ -88,14 +88,7...
To: Hugh Dickins <hugh@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>
Date: Tuesday, May 6, 2008 - 2:39 pm

Ok, please let me know when there is a final patch I can apply,

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Hans Rosenfeld <hans.rosenfeld@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, May 6, 2008 - 3:49 pm

Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32.

That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad();
but we understand now that the typecasting was wrong for PAE in the previous
version: pagetable pages above 4GB looked bad and stopped Arjan from booting.

And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad
and pud_bad to support huge pages.  It was the wrong way round: we shouldn't
weaken every pmd_bad and pud_bad check to let huge pages slip through - in
part they check that we _don't_ have a huge page where it's not expected.

Put the x86 pmd_bad() and pud_bad() definitions back to what they have long
been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking
junk in the upper word is good; and x86_64 should follow x86_32's stricter
comparison, to stop thinking any subset of required bits is good); but that
should be a later patch.

Fix Hans' good observation that follow_page() will never find pmd_huge()
because that would have already failed the pmd_bad test: test pmd_huge in
between the pmd_none and pmd_bad tests.  Tighten x86's pmd_huge() check?
No, once it's a hugepage entry, it can get quite far from a good pmd: for
example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits.

However... though follow_page() contains this and another test for huge
pages, so it's nice to keep it working on them, where does it actually get
called on a huge page?  get_user_pages() checks is_vm_hugetlb_page(vma) to
to call alternative hugetlb processing, as does unmap_vmas() and others.

Signed-off-by: Hugh Dickins &lt;hugh@veritas.com&gt;
---
So Hans' original hugepage leak remains unexplained and unfixed.
Hans, did you find that hugepage leak with a standard kernel, or were
you perhaps trying out some hugepage-using patch of your own, without
marking the vma VM_HUGETLB?  Or were you expecting the hugetlbfs file
to truncate itself once all mmappers had gone?  If the standard kernel
leaks huge...
To: Hugh Dickins <hugh@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, May 7, 2008 - 12:40 am

Hugh,

Thanks for the patch. It's in Linus's git now and I just booted 2 Dell
2950 and both are no showing the errors.


Thanks,
Jeff.
--
To: Jeff Chua <jeff.chua.linux@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Ingo Molnar <mingo@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, May 7, 2008 - 1:30 am

^ 
... I'll assume the missing letter is "t" rather than "w" ;)

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, May 6, 2008 - 4:22 pm

I used a standard kernel (well, not quite, I had made some changes to
the /proc/pid/pagemap code, but nothing that would affect the hugepage
stuff) and some simple test program that would just mmap a hugepage.

I expected that any hugepage that a process had mmapped would
automatically be returned to the system when the process exits. That was
not the case, the process exited and the hugepage was lost (unless I
changed the program to explicitly munmap the hugepage before exiting).
Removing the hugetlbfs file containing the hugepage also didn't free the
page.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, May 6, 2008 - 4:42 pm

Could you post the code you're using to do this?  I have to wonder if
you're leaving a fd open somewhere.  Even if you rm the hugepage file,
it'll stay allocated if you have a fd open, or if *someone* is still
mapping it. 

Can you umount your hugetlbfs?

-- Dave

--
To: Dave Hansen <dave@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 10:34 am

A stripped-down program exposing the leak is attached. It doesn't fork
or do any other fancy stuff, so I don't see a way it could leave any fd
open.

While trying to reproduce this, I noticed that the huge page wouldn't
leak when I just mmapped it and exited without explicitly unmapping, as
I described before. The huge page is leaked only when the
/proc/self/pagemap entry for the huge page is read. The kernel I tested

Yes, but the pages remain lost.

Hans


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 10:52 am

Well, that's an interesting data point! :)

Are you running any of your /proc/&lt;pid&gt;/pagemap patches?

-- Dave

--
To: Dave Hansen <dave@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 11:44 am

Indeed it is! Would explain why we've not seen any leaks until now. I
will try to investigate, as well.

Thanks,
Nish

-- 
Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
IBM Linux Technology Center
--
To: Dave Hansen <dave@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 11:11 am

No additional patches. The problem already existed before we agreed on
the change to the pagemap code to just include the page size in the
values returned, and not doing any special huge page handling. I suspect
the page walking code used by /proc/pid/pagemap is doing something nasty
when it sees a huge page as it doesn't know how to handle it.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 11:51 am

Is there anything in your dmesg?

static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
                          const struct mm_walk *walk, void *private)
{
        pmd_t *pmd;
        unsigned long next;
        int err = 0;

        pmd = pmd_offset(pud, addr);
        do {
                next = pmd_addr_end(addr, end);
                if (pmd_none_or_clear_bad(pmd)) {
                        if (walk-&gt;pte_hole)
                                err = walk-&gt;pte_hole(addr, next, private);
                        if (err)
                                break;
                        continue;


There was a discussion on LKML in the last couple of days about
pmd_bad() triggering on huge pages.  Perhaps we're clearing the mapping
with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.

-- Dave

--
To: Dave Hansen <dave@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 12:19 pm

That makes sense. I remember that explicitly munmapping the huge page
would still work, but it doesn't. I don't quite remember what I did back
then to test this, but I probably made some mistake there that led me to
some false conclusions.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 12:42 pm

I can't see how it would possibly work with the code that we have today,
so I guess it was just a false assumption.

static inline int pmd_none_or_clear_bad(pmd_t *pmd)
{
        if (pmd_none(*pmd))
                return 1;
        if (unlikely(pmd_bad(*pmd))) {
                pmd_clear_bad(pmd);
                return 1;
        }
        return 0;
}

void pmd_clear_bad(pmd_t *pmd)
{
        pmd_ERROR(*pmd);
        pmd_clear(pmd);
}

That pmd_clear() will simply zero out the pmd and leak the page.


-- Dave

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Dave Hansen <dave@...>, Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 12:33 pm

So this seems to lend credence to Dave's hypothesis. Without, as you
were trying before, teaching pagemap all about hugepages, what are our
options?

Can we just skip over the current iteration of the PMD loop (would we
need something similar for the PTE loop for power?) if pmd_huge(pmd)?

Thanks,
Nish

-- 
Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
IBM Linux Technology Center
--
To: Nishanth Aravamudan <nacc@...>
Cc: Dave Hansen <dave@...>, Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 12:51 pm

Allowing huge pages in the page walker would affect both walk_pmd_range
and walk_pud_range. Then either the users of the page walker need to
know how to handle huge pages themselves (in the pmd_entry and pud_entry
callback functions), or the page walker treats huge pages as any other
pages (calling the pte_entry callback function).


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Dave Hansen <dave@...>, Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 1:16 pm

Right, I agree *if* we allow huge pages in the walker. But AIUI, things
are broken now with hugepages in the process' address space. This is a
bug upstream and leads to hugepages leaking out of the kernel when
/proc/pid/pagemap is read. Why not, instead (as a short-term fix), skip
hugepage mappings altogether in the page-walker code?

Hrm, upon further investigation, this seems to be a pretty clear
limitation of walk_page_range(). One that is avoided in the other two
callers, i.e.

static int show_smap(struct seq_file *m, void *v)
{
...
	if (vma-&gt;vm_mm &amp;&amp; !is_vm_hugetlb_page(vma))
		walk_page_range(vma-&gt;vm_mm, vma-&gt;vm_start, vma-&gt;vm_end,
				&amp;smaps_walk, &amp;mss);
...
}

static ssize_t clear_refs_write(struct file *file, const char __user *buf,
				size_t count, loff_t *ppos)
{
...
		for (vma = mm-&gt;mmap; vma; vma = vma-&gt;vm_next)
			if (!is_vm_hugetlb_page(vma))
				walk_page_range(mm, vma-&gt;vm_start, vma-&gt;vm_end,
						&amp;clear_refs_walk, vma);
...
}

No such protection exists for

static ssize_t pagemap_read(struct file *file, char __user *buf,
			    size_t count, loff_t *ppos);

So, is there any way to either add a is_vm_hugetlb_page(vma) check into
pagemap_read()? Or can we modify walk_page_range to take the a vma and
skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
complications down the road until hugepage walking is fixed]. I guess
the latter isn't possible for pagemap_read(), since we are just looking
at arbitrary addresses in the process space?

Dunno, seems quite clear that the bug is in pagemap_read(), not any
hugepage code, and that the simplest fix is to make pagemap_read() do
what the other walker-callers do, and skip hugepage regions.

Thanks,
Nish

-- 
Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
IBM Linux Technology Center
--
To: Nishanth Aravamudan <nacc@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Dave Hansen <dave@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 2:48 pm

Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow:
as you observe, that's what everything else uses to avoid huge issues.

A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86:
we've carefully left it undefined what happens to the pgd/pud/pmd/pte
hierarchy in the general arch case, once you're amongst hugepages.

Might follow_huge_addr() be helpful, to avoid the need for a vma?
Perhaps, but my reading is that actually we've never really been
testing that path's success case (because get_user_pages already
skipped is_vm_hugetlb_page), so it might hold further surprises
on one architecture or another.

Many thanks to Hans for persisting, and pointing us to pagemap
to explain this hugepage leak: yes, the pmd_none_or_clear_bad
will be losing it - and corrupting target user address space.

Cc'ed Matt: he may have a view on what he wants his pagewalker
to do with hugepages: I fear it would differ from one usage to
another.  Skip over them has to be safest, though not ideal.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Nishanth Aravamudan <nacc@...>, Dave Hansen <dave@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 4:02 pm

AFAIK the reason for this is that pmd_huge() and pud_huge() are
completely x86-specific. When I looked at the huge page support for
other archs in Linux the last time, all of them marked hugepages with
some page size bits in the PTE, using several PTEs for a single huge
page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work

Proper hugepage support in /proc/pid/pagemap would be very helpful for
certain kinds of performance analysis, like TLB usage, where you need
page size information when you have a TLB for each supported page size.
That was the reason why I was working on supporting huge pages in
/proc/pid/pagemap in the first place.


-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Nishanth Aravamudan <nacc@...>, Dave Hansen <dave@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Friday, May 9, 2008 - 5:03 am

s390 also does hugepages at the pmd level, so it's not only x86. And
while it's not an issue today, it's worth noting that ARM also has the
same characteristics for larger sizes. Should someone feel compelled to
implement hugepages there, this will almost certainly come up again -- at
least in so far as pmd_huge() is concerned.

At a quick glance, sparc64 also looks like it might need some special
handling in the pagemap case, too..
--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Nishanth Aravamudan <nacc@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 7:15 pm

Here's one quick stab at a solution.  I figured that we already pass
that 'private' variable around.  This patch just sticks that variable
*in* the mm_walk and also makes the caller fill in an 'mm' as well.
Then, we just pass the actual mm_walk around.

Maybe we should just stick the VMA in the mm_walk as well, and have the
common code keep it up to date with the addresses currently being
walked.

Sadly, I didn't quite get enough time to flesh this idea out very far
today, and I'll be offline for a couple of days now.  But, if someone
wants to go this route, I thought this might be useful.  

---

 linux-2.6.git-dave/fs/proc/task_mmu.c |   45 +++++++++++++++++++---------------
 linux-2.6.git-dave/include/linux/mm.h |   17 ++++++------
 linux-2.6.git-dave/mm/pagewalk.c      |   41 +++++++++++++++---------------
 3 files changed, 56 insertions(+), 47 deletions(-)

diff -puN mm/pagewalk.c~pass-mm-into-pagewalkers mm/pagewalk.c
--- linux-2.6.git/mm/pagewalk.c~pass-mm-into-pagewalkers	2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/mm/pagewalk.c	2008-05-08 15:49:54.000000000 -0700
@@ -3,14 +3,14 @@
 #include &lt;linux/sched.h&gt;
 
 static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pte_t *pte;
 	int err = 0;
 
 	pte = pte_offset_map(pmd, addr);
 	for (;;) {
-		err = walk-&gt;pte_entry(pte, addr, addr + PAGE_SIZE, private);
+		err = walk-&gt;pte_entry(pte, addr, addr + PAGE_SIZE, walk);
 		if (err)
 		       break;
 		addr += PAGE_SIZE;
@@ -24,7 +24,7 @@ static int walk_pte_range(pmd_t *pmd, un
 }
 
 static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
-			  const struct mm_walk *walk, void *private)
+			  struct mm_walk *walk)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -35,15 +35,15 @@ static int walk_pmd_range(pud_t *pud, un
 		next = pmd_addr_end(addr, end);
 		if (pmd_none_or_clear_bad(pmd)) {
 			if (walk-&gt;pte_hole)
-	...
To: Dave Hansen <dave@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Hugh Dickins <hugh@...>, Nishanth Aravamudan <nacc@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, May 14, 2008 - 3:01 pm

This much looks reasonable. But the real test of course is to actually
teach it about detecting huge pages efficiently. And I suspect that
means tracking the current VMA all the time in the walk. Am I wrong?

-- 
Mathematics is the supreme nostalgia of our time.

--
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Hugh Dickins <hugh@...>, Nishanth Aravamudan <nacc@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Matt Mackall <mpm@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 4:16 pm

powerpc kinda puts them in pmds, although Adam calls them ptes in his
diagram.  See Adam's very nice pictures here:

	http://linux-mm.org/PageTableStructure

In the arch code, they have a concept of "slices" for each mm that you
can look up the page size for.  That's what they use when the mm/vmas
aren't around.  Their pmd_ts really are just pointers.  I don't think
they have any flags in them at all like _PAGE_PSE.

They just do a special pagetable walk instead of looking *at* the
pagetables.

-- Dave

--
To: Hugh Dickins <hugh@...>
Cc: Nishanth Aravamudan <nacc@...>, Hans Rosenfeld <hans.rosenfeld@...>, Dave Hansen <dave@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 3:49 pm

There are folks who want to be able to get information about hugepages
from pagemap. Thanks to Hans, there's room in the output format for it.

I'd gone to some lengths to pull VMAs out of the picture as it's quite
ugly to have to simultaneously walk VMAs and pagetables. But I may have
to concede that living with hugepages requires it.

-- 
Mathematics is the supreme nostalgia of our time.

--
To: Matt Mackall <mpm@...>
Cc: Hugh Dickins <hugh@...>, Nishanth Aravamudan <nacc@...>, Hans Rosenfeld <hans.rosenfeld@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 4:08 pm

Yeah, it will definitely change the way that we have to do the pagetable
walk.

Should we just pass the mm around and make anyone that really wants to
get the VMAs do the lookup themselves?  Or, should we just provide the
VMA?

I'll start with just the mm and see where it goes...

-- Dave

--
To: Nishanth Aravamudan <nacc@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 2:42 pm

Agreed, this certainly isn't a huge page bug.

But, I do think it is absolutely insane to have pmd_clear_bad() going
after perfectly good hugetlb pmds.  The way it is set up now, people are
bound to miss the hugetlb pages because just about every single
pagetable walk has to be specially coded to handle or avoid them.  We
obviously missed it, here, and we had two good examples in the same
file! :)

-- Dave

--
To: Dave Hansen <dave@...>
Cc: Nishanth Aravamudan <nacc@...>, Hans Rosenfeld <hans.rosenfeld@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 2:58 pm

Like it or not, the pgd/pud/pmd/pte hierarchy cannot be assumed once
you're amongst hugepages.  What happens varies from architecture to
architecture.  Perhaps the hugepage specialists could look at what
in fact the different architectures we know today are doing, and
come up with a better abstraction to encompass them all.  But it's
simply wrong for a "generic" pagewalker to be going blindly in there.

Two good examples in the same file??

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Nishanth Aravamudan <nacc@...>, Hans Rosenfeld <hans.rosenfeld@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 3:06 pm

I was just noting that the other two pagewalking users did the right
vm_huge..() check, and we missed it in the third pagewalker user.

-- Dave

--
To: Dave Hansen <dave@...>
Cc: Hugh Dickins <hugh@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 10:39 am

It is now :)

-- 
%SYSTEM-F-ANARCHISM, The operating system has been overthrown
To: Hans Rosenfeld <hans.rosenfeld@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, May 6, 2008 - 4:36 pm

Hmm.  That doesn't fit with my experience: I've not found an explicit
munmap makes any difference (I wouldn't expect it to), but removing
the file once all openers gone does free everything up.  I guess I'm
overlooking something more experienced hugepagers will soon light upon.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Hans Rosenfeld <hans.rosenfeld@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Arjan van de Ven <arjan@...>, <linux-kernel@...>, <linux-mm@...>
Date: Wednesday, May 7, 2008 - 7:39 pm

Nothing strikes me immediately. What you described is what I expected.
As Dave pointed out separately, are you able to unmount hugetlbfs at
this point?

Hans, can you send out a sample application's source? What kernel were
you testing on?

Thanks,
Nish

-- 
Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
IBM Linux Technology Center
--
To: Hugh Dickins <hugh@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Hans Rosenfeld <hans.rosenfeld@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, May 6, 2008 - 4:06 pm

I'd much rather have pdm_bad() etc fixed up instead, so that they do a 
more proper test (not thinking that a PSE page is bad, since it clearly 
isn't). And then, make them dependent on DEBUG_VM, because doing the 
proper test will be more expensive.

Hmm?

		Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Hans Rosenfeld <hans.rosenfeld@...>, Arjan van de Ven <arjan@...>, Nishanth Aravamudan <nacc@...>, Jeremy Fitzhardinge <jeremy@...>, <linux-kernel@...>, <linux-mm@...>
Date: Tuesday, May 6, 2008 - 4:30 pm

But everywhere we use pmd_bad() etc (most are hidden inside
pmd_none_or_clear_bad() etc) we are expecting never to encounter
a pmd_huge, unless there's corruption.  follow_page() is the one
exception, and even in its case I can't find a current user that
actually could meet a hugepage.  I'd rather tighten up pmd_bad
(in the PAE and x86_64 cases), than weaken it so far as to let
hugepages slip through.  Not that pmd_bad often catches anything:
just coincidentally that 90909090 one today.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Hans Rosenfeld <hans.rosenfeld@...>, Arjan van de Ven <arjan@...>, Jeremy Fitzhardinge <jeremy@...>, <linux-kernel@...>, <linux-mm@...>
Date: Thursday, May 8, 2008 - 12:07 pm

There is one case that seems to the source of Hans' problem, as Dave has
figured out: /proc/pid/pagemap, where we fairly straight-forwardly walk
the pagetables.

In there, we unconditionally call pmd_none_or_clear_bad(pmd). And any
userspace process that maps hugepages and then reads in
/proc/pid/pagemap will invoke that path, I think (at least with 2M
pages).

So I agree, you're fixing a potential issue in follow_page() [might
deserve a comment, so someone doesn't go and combine them back later?],
but Hans' issue is most likely related to the pagemap code?

Thanks,
Nish

-- 
Nishanth Aravamudan &lt;nacc@us.ibm.com&gt;
IBM Linux Technology Center
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>
Date: Tuesday, May 6, 2008 - 3:43 pm

Thanks to Ingo for all his testing, but thanks to you for holding
off: I just now thought to see what a hugetlb "pmd" looks like,
and it does NOT look quite as my tightened pmd_huge() expected -
I've had to drop that part of the patch.  Final patch, to longer
Cc list, follows now.

Hugh
--
To: Hugh Dickins <hugh@...>
Cc: Jeff Chua <jeff.chua.linux@...>, Linus Torvalds <torvalds@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, May 6, 2008 - 11:04 am

hm, the main objection was about whether to turn PAGE_SIZE from unsigned 
into signed though - and that we didnt do.

regarding the mm/memory.c changes you did - that's something i proposed 
before but didnt really feel positive about it because it affected every 
architecture.


i've thrown this into my queue to test it some more. I wont complain if 
it breaks :)

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Hugh Dickins <hugh@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, May 6, 2008 - 11:09 am

No it wasn't.

The main objection was that you SHOULD NOT USE PAGE_SIZE_MASK AT ALL!

You should have used a pagetable-specific macro, becuse PAGE_MASK is 
simply fundamentally WRONG, and has absolutely nothing to do with the PFN 
bits, whether sign-extended or not!

The fact is, the page frane number bits are *not* ~PAGE_MASK or anything 
like that. They share only the low bits - not the high bits.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Ingo Molnar <mingo@...>, Hugh Dickins <hugh@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>
Date: Tuesday, May 6, 2008 - 11:32 am

Yes.  There's already a PTE_MASK for masking out the PFN vs the flags in 
a pte, but it isn't used consistently (perhaps at all).

    J
--
To: Jeremy Fitzhardinge <jeremy@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Molnar <mingo@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>
Date: Tuesday, May 6, 2008 - 12:12 pm

My patch sidesteps this issue by simply restoring most things to how
they were in 2.6.25 (using PTE_MASK in x86_64 but PAGE_MASK in x86_32).
We've contentedly used PAGE_MASK there for many many years.

Seeing the PTE_MASK discussion I did briefly consider using PTE_MASK
at the x86_32 end (it would have the nice effect of testing for lots
of badly set bits it the upper PAE half?), but I wasn't sure it has
the right type at present (there seems to be a "long" type in there
which I'd have expected to need to be long long for PAE), and it was
all even more of a maze than the last time I looked.

I decided it's safer to go back to what we had before to fix the
current warning, together with the original point which got us
into this, and leave any PTE_MASK-ification to some other patch.

I've more to say on it when commenting the patch.

Hugh
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, May 6, 2008 - 11:15 am

yeah, indeed, sorry about that. Will sort this type mixing out today.

	Ingo
--
To: Ingo Molnar <mingo@...>
Cc: Hugh Dickins <hugh@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, May 6, 2008 - 12:16 pm

Just take Hugh's patch. It's correct. Your code wasn't. Just admit it, and 
let others fix things.

			Linus
--
To: Linus Torvalds <torvalds@...>
Cc: Hugh Dickins <hugh@...>, Jeff Chua <jeff.chua.linux@...>, Linux Kernel Mailing List <linux-kernel@...>, Thomas Gleixner <tglx@...>, H. Peter Anvin <hpa@...>, Gabriel C <nix.or.die@...>, Jeremy Fitzhardinge <jeremy@...>
Date: Tuesday, May 6, 2008 - 12:30 pm

yes - Hugh's patch is correct and fixes my hackery, and it works as well 
- it just passed 40 build/boot iterations here. Thanks Hugh!

Tested-by: Ingo Molnar &lt;mingo@elte.hu&gt;

	Ingo
--
Previous thread: [PATCH 1/4] do_wait reorganization by Roland McGrath on Monday, May 5, 2008 - 8:32 pm. (6 messages)

Next thread: [PATCH 1/3] PNP: cleanup pnp_fixup_device() by Rene Herman on Monday, May 5, 2008 - 9:08 pm. (4 messages)
speck-geostationary