On Sun, May 4, 2008 at 3:38 AM, Linus Torvalds I'm seeing this on a Dell 2950 (quad-core) during boot up, but not on my IBM X60s (dual-core). This is using the lastest git 2.6.26-rc1. Just checking to see if this is a known bug before doing the bisect -- more painful on the 2950 to reboot. Thanks, Jeff. Freeing unused kernel memory: 252k freed ------------[ cut here ]------------ WARNING: at arch/x86/mm/pgtable_32.c:178 pmd_bad+0xa3/0xe3() Modules linked in: Pid: 1226, comm: khelper Not tainted 2.6.26-rc1 #4 [<c0121614>] warn_on_slowpath+0x40/0x65 [<c011823e>] kmap_atomic+0x11/0x14 [<c01183b7>] kunmap_atomic+0x4f/0x78 [<c014a0db>] get_page_from_freelist+0x33a/0x3da [<c014a225>] __alloc_pages_internal+0x98/0x356 [<c01181f8>] kmap_atomic_prot+0x102/0x137 [<c01183b7>] kunmap_atomic+0x4f/0x78 [<c0154dd0>] handle_mm_fault+0xb70/0xbba [<c011770c>] pmd_bad+0xa3/0xe3 [<c0151cad>] follow_page+0x109/0x30a [<c015516b>] get_user_pages+0x351/0x3c0 [<c016aec4>] get_arg_page+0x2b/0x7b [<c016afe1>] copy_strings+0xcd/0x173 [<c016b0a0>] copy_strings_kernel+0x19/0x27 [<c016c28a>] do_execve+0xd9/0x18f [<c010136a>] sys_execve+0x2a/0x4f [<c0102d22>] syscall_call+0x7/0xb [<c038007b>] i8042_probe+0x29f/0x508 [<c0106034>] kernel_execve+0x14/0x18 [<c012dcfc>] ____call_usermodehelper+0x0/0x105 [<c012ddf7>] ____call_usermodehelper+0xfb/0x105 [<c012dcfc>] ____call_usermodehelper+0x0/0x105 [<c0103917>] kernel_thread_helper+0x7/0x10 ======================= ---[ end trace 6aae36751415b63a ]--- Adding 27350652k swap on /dev/sda3. Priority:-1 extents:1 across:27350652k --
no, i have not seen this reported yet. we have an strace/ptrace fix pending in x86.git - but that should not affect khelper. Ingo --
You need PAE (HIGHMEM64G), and probably HIGHPTE and >4G, to see it.
Don't spend any time on bisection: it bothered me too,
No, it comes from your pmd_bad_v1/pmd_bad_v2 debug patch, which
Linus objected to, yet has crept into his tree nonetheless.
I looked back through the mails and logs on that, and disagree
with most of what was done there. Here's my pending fix below,
but I've not yet tested whether it builds correctly on all relevant
configs, nor whether my pmd_huge does what's intended, nor written
the necessary comments, nor worked out the Cc list - I'm going out
for a short while, all those will follow later, this sent as a
heads up to relieve Jeff from bisecting.
Not-yet-Signed-off-by: Hugh Dickins <hugh@veritas.com>
---
arch/x86/mm/hugetlbpage.c | 3 +++
arch/x86/mm/pgtable_32.c | 7 -------
include/asm-x86/pgtable_32.h | 9 +--------
include/asm-x86/pgtable_64.h | 6 ++----
mm/memory.c | 5 ++++-
5 files changed, 10 insertions(+), 20 deletions(-)
--- 2.6.26-rc1/arch/x86/mm/hugetlbpage.c 2008-04-17 03:49:44.000000000 +0100
+++ linux/arch/x86/mm/hugetlbpage.c 2008-05-06 14:13:24.000000000 +0100
@@ -205,6 +205,9 @@ follow_huge_addr(struct mm_struct *mm, u
int pmd_huge(pmd_t pmd)
{
+ pmd_t unhuge_pmd = __pmd(pmd_val(pmd) & ~(_PAGE_PSE | _PAGE_NX));
+ if (pmd_bad(unhuge_pmd))
+ return 0;
return !!(pmd_val(pmd) & _PAGE_PSE);
}
--- 2.6.26-rc1/arch/x86/mm/pgtable_32.c 2008-05-03 21:54:41.000000000 +0100
+++ linux/arch/x86/mm/pgtable_32.c 2008-05-06 14:13:24.000000000 +0100
@@ -172,10 +172,3 @@ void reserve_top_address(unsigned long r
__FIXADDR_TOP = -reserve - PAGE_SIZE;
__VMALLOC_RESERVE += reserve;
}
-
-int pmd_bad(pmd_t pmd)
-{
- WARN_ON_ONCE(pmd_bad_v1(pmd) != pmd_bad_v2(pmd));
-
- return pmd_bad_v1(pmd);
-}
--- 2.6.26-rc1/include/asm-x86/pgtable_32.h 2008-05-03 21:55:10.000000000 +0100
+++ linux/include/asm-x86/pgtable_32.h 2008-05-06 14:13:24.000000000 +0100
@@ -88,14 +88,7 @@ extern ...hm, the main objection was about whether to turn PAGE_SIZE from unsigned into signed though - and that we didnt do. regarding the mm/memory.c changes you did - that's something i proposed before but didnt really feel positive about it because it affected every architecture. i've thrown this into my queue to test it some more. I wont complain if it breaks :) Ingo --
No it wasn't. The main objection was that you SHOULD NOT USE PAGE_SIZE_MASK AT ALL! You should have used a pagetable-specific macro, becuse PAGE_MASK is simply fundamentally WRONG, and has absolutely nothing to do with the PFN bits, whether sign-extended or not! The fact is, the page frane number bits are *not* ~PAGE_MASK or anything like that. They share only the low bits - not the high bits. Linus --
yeah, indeed, sorry about that. Will sort this type mixing out today. Ingo --
Just take Hugh's patch. It's correct. Your code wasn't. Just admit it, and let others fix things. Linus --
yes - Hugh's patch is correct and fixes my hackery, and it works as well - it just passed 40 build/boot iterations here. Thanks Hugh! Tested-by: Ingo Molnar <mingo@elte.hu> Ingo --
Yes. There's already a PTE_MASK for masking out the PFN vs the flags in
a pte, but it isn't used consistently (perhaps at all).
J
--
My patch sidesteps this issue by simply restoring most things to how they were in 2.6.25 (using PTE_MASK in x86_64 but PAGE_MASK in x86_32). We've contentedly used PAGE_MASK there for many many years. Seeing the PTE_MASK discussion I did briefly consider using PTE_MASK at the x86_32 end (it would have the nice effect of testing for lots of badly set bits it the upper PAE half?), but I wasn't sure it has the right type at present (there seems to be a "long" type in there which I'd have expected to need to be long long for PAE), and it was all even more of a maze than the last time I looked. I decided it's safer to go back to what we had before to fix the current warning, together with the original point which got us into this, and leave any PTE_MASK-ification to some other patch. I've more to say on it when commenting the patch. Hugh --
Ok, please let me know when there is a final patch I can apply, Linus --
Thanks to Ingo for all his testing, but thanks to you for holding off: I just now thought to see what a hugetlb "pmd" looks like, and it does NOT look quite as my tightened pmd_huge() expected - I've had to drop that part of the patch. Final patch, to longer Cc list, follows now. Hugh --
Fix warning from pmd_bad() at bootup on a HIGHMEM64G HIGHPTE x86_32. That came from 9fc34113f6880b215cbea4e7017fc818700384c2 x86: debug pmd_bad(); but we understand now that the typecasting was wrong for PAE in the previous version: pagetable pages above 4GB looked bad and stopped Arjan from booting. And revert that cded932b75ab0a5f9181ee3da34a0a488d1a14fd x86: fix pmd_bad and pud_bad to support huge pages. It was the wrong way round: we shouldn't weaken every pmd_bad and pud_bad check to let huge pages slip through - in part they check that we _don't_ have a huge page where it's not expected. Put the x86 pmd_bad() and pud_bad() definitions back to what they have long been: they can be improved (x86_32 should use PTE_MASK, to stop PAE thinking junk in the upper word is good; and x86_64 should follow x86_32's stricter comparison, to stop thinking any subset of required bits is good); but that should be a later patch. Fix Hans' good observation that follow_page() will never find pmd_huge() because that would have already failed the pmd_bad test: test pmd_huge in between the pmd_none and pmd_bad tests. Tighten x86's pmd_huge() check? No, once it's a hugepage entry, it can get quite far from a good pmd: for example, PROT_NONE leaves it with only ACCESSED of the KERN_PGTABLE bits. However... though follow_page() contains this and another test for huge pages, so it's nice to keep it working on them, where does it actually get called on a huge page? get_user_pages() checks is_vm_hugetlb_page(vma) to to call alternative hugetlb processing, as does unmap_vmas() and others. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- So Hans' original hugepage leak remains unexplained and unfixed. Hans, did you find that hugepage leak with a standard kernel, or were you perhaps trying out some hugepage-using patch of your own, without marking the vma VM_HUGETLB? Or were you expecting the hugetlbfs file to truncate itself once all mmappers had gone? If the standard kernel leaks ...
I'd much rather have pdm_bad() etc fixed up instead, so that they do a more proper test (not thinking that a PSE page is bad, since it clearly isn't). And then, make them dependent on DEBUG_VM, because doing the proper test will be more expensive. Hmm? Linus --
But everywhere we use pmd_bad() etc (most are hidden inside pmd_none_or_clear_bad() etc) we are expecting never to encounter a pmd_huge, unless there's corruption. follow_page() is the one exception, and even in its case I can't find a current user that actually could meet a hugepage. I'd rather tighten up pmd_bad (in the PAE and x86_64 cases), than weaken it so far as to let hugepages slip through. Not that pmd_bad often catches anything: just coincidentally that 90909090 one today. Hugh --
There is one case that seems to the source of Hans' problem, as Dave has figured out: /proc/pid/pagemap, where we fairly straight-forwardly walk the pagetables. In there, we unconditionally call pmd_none_or_clear_bad(pmd). And any userspace process that maps hugepages and then reads in /proc/pid/pagemap will invoke that path, I think (at least with 2M pages). So I agree, you're fixing a potential issue in follow_page() [might deserve a comment, so someone doesn't go and combine them back later?], but Hans' issue is most likely related to the pagemap code? Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center --
I used a standard kernel (well, not quite, I had made some changes to the /proc/pid/pagemap code, but nothing that would affect the hugepage stuff) and some simple test program that would just mmap a hugepage. I expected that any hugepage that a process had mmapped would automatically be returned to the system when the process exits. That was not the case, the process exited and the hugepage was lost (unless I changed the program to explicitly munmap the hugepage before exiting). Removing the hugetlbfs file containing the hugepage also didn't free the page. -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown --
Hmm. That doesn't fit with my experience: I've not found an explicit munmap makes any difference (I wouldn't expect it to), but removing the file once all openers gone does free everything up. I guess I'm overlooking something more experienced hugepagers will soon light upon. Hugh --
Nothing strikes me immediately. What you described is what I expected. As Dave pointed out separately, are you able to unmount hugetlbfs at this point? Hans, can you send out a sample application's source? What kernel were you testing on? Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center --
Could you post the code you're using to do this? I have to wonder if you're leaving a fd open somewhere. Even if you rm the hugepage file, it'll stay allocated if you have a fd open, or if *someone* is still mapping it. Can you umount your hugetlbfs? -- Dave --
A stripped-down program exposing the leak is attached. It doesn't fork or do any other fancy stuff, so I don't see a way it could leave any fd open. While trying to reproduce this, I noticed that the huge page wouldn't leak when I just mmapped it and exited without explicitly unmapping, as I described before. The huge page is leaked only when the /proc/self/pagemap entry for the huge page is read. The kernel I tested Yes, but the pages remain lost. Hans -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown --
It is now :) -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown
Well, that's an interesting data point! :) Are you running any of your /proc/<pid>/pagemap patches? -- Dave --
No additional patches. The problem already existed before we agreed on the change to the pagemap code to just include the page size in the values returned, and not doing any special huge page handling. I suspect the page walking code used by /proc/pid/pagemap is doing something nasty when it sees a huge page as it doesn't know how to handle it. -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown --
Is there anything in your dmesg?
static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
const struct mm_walk *walk, void *private)
{
pmd_t *pmd;
unsigned long next;
int err = 0;
pmd = pmd_offset(pud, addr);
do {
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, private);
if (err)
break;
continue;
There was a discussion on LKML in the last couple of days about
pmd_bad() triggering on huge pages. Perhaps we're clearing the mapping
with the pmd_none_or_clear_bad(), and *THAT* is leaking the page.
-- Dave
--
That makes sense. I remember that explicitly munmapping the huge page would still work, but it doesn't. I don't quite remember what I did back then to test this, but I probably made some mistake there that led me to some false conclusions. -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown --
So this seems to lend credence to Dave's hypothesis. Without, as you were trying before, teaching pagemap all about hugepages, what are our options? Can we just skip over the current iteration of the PMD loop (would we need something similar for the PTE loop for power?) if pmd_huge(pmd)? Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center --
Allowing huge pages in the page walker would affect both walk_pmd_range and walk_pud_range. Then either the users of the page walker need to know how to handle huge pages themselves (in the pmd_entry and pud_entry callback functions), or the page walker treats huge pages as any other pages (calling the pte_entry callback function). -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown --
Right, I agree *if* we allow huge pages in the walker. But AIUI, things
are broken now with hugepages in the process' address space. This is a
bug upstream and leads to hugepages leaking out of the kernel when
/proc/pid/pagemap is read. Why not, instead (as a short-term fix), skip
hugepage mappings altogether in the page-walker code?
Hrm, upon further investigation, this seems to be a pretty clear
limitation of walk_page_range(). One that is avoided in the other two
callers, i.e.
static int show_smap(struct seq_file *m, void *v)
{
...
if (vma->vm_mm && !is_vm_hugetlb_page(vma))
walk_page_range(vma->vm_mm, vma->vm_start, vma->vm_end,
&smaps_walk, &mss);
...
}
static ssize_t clear_refs_write(struct file *file, const char __user *buf,
size_t count, loff_t *ppos)
{
...
for (vma = mm->mmap; vma; vma = vma->vm_next)
if (!is_vm_hugetlb_page(vma))
walk_page_range(mm, vma->vm_start, vma->vm_end,
&clear_refs_walk, vma);
...
}
No such protection exists for
static ssize_t pagemap_read(struct file *file, char __user *buf,
size_t count, loff_t *ppos);
So, is there any way to either add a is_vm_hugetlb_page(vma) check into
pagemap_read()? Or can we modify walk_page_range to take the a vma and
skip the walking if is_vm_hugetlb_page(vma) is set [to avoid
complications down the road until hugepage walking is fixed]. I guess
the latter isn't possible for pagemap_read(), since we are just looking
at arbitrary addresses in the process space?
Dunno, seems quite clear that the bug is in pagemap_read(), not any
hugepage code, and that the simplest fix is to make pagemap_read() do
what the other walker-callers do, and skip hugepage regions.
Thanks,
Nish
--
Nishanth Aravamudan <nacc@us.ibm.com>
IBM Linux Technology Center
--
Agreed, this certainly isn't a huge page bug. But, I do think it is absolutely insane to have pmd_clear_bad() going after perfectly good hugetlb pmds. The way it is set up now, people are bound to miss the hugetlb pages because just about every single pagetable walk has to be specially coded to handle or avoid them. We obviously missed it, here, and we had two good examples in the same file! :) -- Dave --
Like it or not, the pgd/pud/pmd/pte hierarchy cannot be assumed once you're amongst hugepages. What happens varies from architecture to architecture. Perhaps the hugepage specialists could look at what in fact the different architectures we know today are doing, and come up with a better abstraction to encompass them all. But it's simply wrong for a "generic" pagewalker to be going blindly in there. Two good examples in the same file?? Hugh --
I was just noting that the other two pagewalking users did the right vm_huge..() check, and we missed it in the third pagewalker user. -- Dave --
Yes, I'm afraid it needs an is_vm_hugetlb_page(vma) in there somehow: as you observe, that's what everything else uses to avoid huge issues. A pmd_huge(*pmd) test is tempting, but it only ever says "yes" on x86: we've carefully left it undefined what happens to the pgd/pud/pmd/pte hierarchy in the general arch case, once you're amongst hugepages. Might follow_huge_addr() be helpful, to avoid the need for a vma? Perhaps, but my reading is that actually we've never really been testing that path's success case (because get_user_pages already skipped is_vm_hugetlb_page), so it might hold further surprises on one architecture or another. Many thanks to Hans for persisting, and pointing us to pagemap to explain this hugepage leak: yes, the pmd_none_or_clear_bad will be losing it - and corrupting target user address space. Cc'ed Matt: he may have a view on what he wants his pagewalker to do with hugepages: I fear it would differ from one usage to another. Skip over them has to be safest, though not ideal. Hugh --
There are folks who want to be able to get information about hugepages from pagemap. Thanks to Hans, there's room in the output format for it. I'd gone to some lengths to pull VMAs out of the picture as it's quite ugly to have to simultaneously walk VMAs and pagetables. But I may have to concede that living with hugepages requires it. -- Mathematics is the supreme nostalgia of our time. --
Yeah, it will definitely change the way that we have to do the pagetable walk. Should we just pass the mm around and make anyone that really wants to get the VMAs do the lookup themselves? Or, should we just provide the VMA? I'll start with just the mm and see where it goes... -- Dave --
AFAIK the reason for this is that pmd_huge() and pud_huge() are completely x86-specific. When I looked at the huge page support for other archs in Linux the last time, all of them marked hugepages with some page size bits in the PTE, using several PTEs for a single huge page. So for anything but x86, the pgd/pud/pmd/pte hierarchy should work Proper hugepage support in /proc/pid/pagemap would be very helpful for certain kinds of performance analysis, like TLB usage, where you need page size information when you have a TLB for each supported page size. That was the reason why I was working on supporting huge pages in /proc/pid/pagemap in the first place. -- %SYSTEM-F-ANARCHISM, The operating system has been overthrown --
powerpc kinda puts them in pmds, although Adam calls them ptes in his diagram. See Adam's very nice pictures here: http://linux-mm.org/PageTableStructure In the arch code, they have a concept of "slices" for each mm that you can look up the page size for. That's what they use when the mm/vmas aren't around. Their pmd_ts really are just pointers. I don't think they have any flags in them at all like _PAGE_PSE. They just do a special pagetable walk instead of looking *at* the pagetables. -- Dave --
Here's one quick stab at a solution. I figured that we already pass
that 'private' variable around. This patch just sticks that variable
*in* the mm_walk and also makes the caller fill in an 'mm' as well.
Then, we just pass the actual mm_walk around.
Maybe we should just stick the VMA in the mm_walk as well, and have the
common code keep it up to date with the addresses currently being
walked.
Sadly, I didn't quite get enough time to flesh this idea out very far
today, and I'll be offline for a couple of days now. But, if someone
wants to go this route, I thought this might be useful.
---
linux-2.6.git-dave/fs/proc/task_mmu.c | 45 +++++++++++++++++++---------------
linux-2.6.git-dave/include/linux/mm.h | 17 ++++++------
linux-2.6.git-dave/mm/pagewalk.c | 41 +++++++++++++++---------------
3 files changed, 56 insertions(+), 47 deletions(-)
diff -puN mm/pagewalk.c~pass-mm-into-pagewalkers mm/pagewalk.c
--- linux-2.6.git/mm/pagewalk.c~pass-mm-into-pagewalkers 2008-05-08 15:49:47.000000000 -0700
+++ linux-2.6.git-dave/mm/pagewalk.c 2008-05-08 15:49:54.000000000 -0700
@@ -3,14 +3,14 @@
#include <linux/sched.h>
static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
- const struct mm_walk *walk, void *private)
+ struct mm_walk *walk)
{
pte_t *pte;
int err = 0;
pte = pte_offset_map(pmd, addr);
for (;;) {
- err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, private);
+ err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
addr += PAGE_SIZE;
@@ -24,7 +24,7 @@ static int walk_pte_range(pmd_t *pmd, un
}
static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- const struct mm_walk *walk, void *private)
+ struct mm_walk *walk)
{
pmd_t *pmd;
unsigned long next;
@@ -35,15 +35,15 @@ static int walk_pmd_range(pud_t *pud, un
next = pmd_addr_end(addr, end);
if (pmd_none_or_clear_bad(pmd)) {
if (walk->pte_hole)
- err = ...This much looks reasonable. But the real test of course is to actually teach it about detecting huge pages efficiently. And I suspect that means tracking the current VMA all the time in the walk. Am I wrong? -- Mathematics is the supreme nostalgia of our time. --
s390 also does hugepages at the pmd level, so it's not only x86. And while it's not an issue today, it's worth noting that ARM also has the same characteristics for larger sizes. Should someone feel compelled to implement hugepages there, this will almost certainly come up again -- at least in so far as pmd_huge() is concerned. At a quick glance, sparc64 also looks like it might need some special handling in the pagemap case, too.. --
I can't see how it would possibly work with the code that we have today,
so I guess it was just a false assumption.
static inline int pmd_none_or_clear_bad(pmd_t *pmd)
{
if (pmd_none(*pmd))
return 1;
if (unlikely(pmd_bad(*pmd))) {
pmd_clear_bad(pmd);
return 1;
}
return 0;
}
void pmd_clear_bad(pmd_t *pmd)
{
pmd_ERROR(*pmd);
pmd_clear(pmd);
}
That pmd_clear() will simply zero out the pmd and leak the page.
-- Dave
--
Indeed it is! Would explain why we've not seen any leaks until now. I will try to investigate, as well. Thanks, Nish -- Nishanth Aravamudan <nacc@us.ibm.com> IBM Linux Technology Center --
Hugh, Thanks for the patch. It's in Linus's git now and I just booted 2 Dell 2950 and both are no showing the errors. Thanks, Jeff. --
^ ... I'll assume the missing letter is "t" rather than "w" ;) Hugh --
