Hi Linus, Ingo, Hans,
Please revert commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd
( x86: fix pmd_bad and pud_bad to support huge pages )
since it prevents the kernel to finish booting on my (Penryn based)
laptop. The boot stops right after freeing the init memory.
Took a while to bisect (due to it touching page*.h, which forces a
full recompile), but it definitely is caused by this commit...
Please revert.
[arjan@localhost linux.trees.git]$ git-bisect good
cded932b75ab0a5f9181ee3da34a0a488d1a14fd is first bad commit
commit cded932b75ab0a5f9181ee3da34a0a488d1a14fd
Author: Hans Rosenfeld <hans.rosenfeld@amd.com>
Date: Mon Feb 18 18:10:47 2008 +0100
x86: fix pmd_bad and pud_bad to support huge pages
I recently stumbled upon a problem in the support for huge pages. If a
program using huge pages does not explicitly unmap them, they remain
mapped (and therefore, are lost) after the program exits.
I observed that the free huge page count in /proc/meminfo decreased when
running my program, and it did not increase after the program exited.
After running the program a few times, no more huge pages could be
allocated.
The reason for this seems to be that the x86 pmd_bad and pud_bad
consider pmd/pud entries having the PSE bit set invalid. I think there
is nothing wrong with this bit being set, it just indicates that the
lowest level of translation has been reached. This bit has to be (and
is) checked after the basic validity of the entry has been checked, like
in this fragment from follow_page() in mm/memory.c:
if (pmd_none(*pmd) || unlikely(pmd_bad(*pmd)))
goto no_page_table;
if (pmd_huge(*pmd)) {
BUG_ON(flags & FOLL_GET);
page = follow_huge_pmd(mm, address, pmd, flags & FOLL_WRITE);
goto out;
}
Note that this code currently doesn't work as intended if the pmd refers
to a huge page, the pm...Hrm, not that I necessarily have the right to ask for this, but shouldn't hugepage related patches at least be cc'd to linux-mm? it seems this went via LKML to the x86 tree? There are a few of us that work on hugepages that would have seen this there, but it got lost in the noise (for me) on LKML. Thanks, Nish --
hm, lets figure out why this patch breaks your box, ok? We obviously
have to revert it if we cannot figure it out, but lets at least try -
because the patch itself fixes a real regression and it's not obviously
wrong either. I think there might be some bug hiding somewhere that we
really want to fix instead of this revert.
Could you try to hack up a debug patch perhaps? Uninline the fucntion(s)
then add two versions (one is that breaks on your box and one is that
works on your box) of this same pmd_bad()/pud_bad() functions and do
something like this (pseudocode):
pmd_bad()
{
if (pmd_bad_working(x) != pmd_bad_broken(x))
panic_timeout++;
return pmd_bad_working(x);
}
i.e. we actually use the working function so your box should boot up
just fine - but we instrument things with the broken function as well
and detect the cases where the two values differ. It is an anomaly if
either function ever returns true instead of false.
if after bootup you have a non-zero panic_timeout then there is a
material difference somewhere. In that case try to stick a dump_stack()
into the above case as well - and if the box still boots, send us the
stackdumps. (if the box doesnt boot then perhaps the printout itself
hangs - in that case try a save_stack_trace() hack and print it out
later)
Ingo
--i.e. something like the (tested) patch below. It is not clear whether
your testcase is on 32-bit or 64-bit - so i went for the more likely
32-bit case first.
Ingo
------------------>
Subject: x86: patches/x86-debug-bad-page.patch
From: Ingo Molnar <mingo@elte.hu>
Date: Mon Mar 03 09:53:17 CET 2008
Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
arch/x86/mm/pgtable_32.c | 7 +++++++
include/asm-x86/pgtable_32.h | 6 +++++-
2 files changed, 12 insertions(+), 1 deletion(-)
Index: linux-x86.q/arch/x86/mm/pgtable_32.c
===================================================================
--- linux-x86.q.orig/arch/x86/mm/pgtable_32.c
+++ linux-x86.q/arch/x86/mm/pgtable_32.c
@@ -381,3 +381,10 @@ void __pmd_free_tlb(struct mmu_gather *t
}
#endif
+
+int pmd_bad(pmd_t pmd)
+{
+ WARN_ON_ONCE(pmd_bad_v1(pmd) != pmd_bad_v2(pmd));
+
+ return pmd_bad_v1(pmd);
+}
Index: linux-x86.q/include/asm-x86/pgtable_32.h
===================================================================
--- linux-x86.q.orig/include/asm-x86/pgtable_32.h
+++ linux-x86.q/include/asm-x86/pgtable_32.h
@@ -92,7 +92,11 @@ extern unsigned long pg0[];
/* To avoid harmful races, pmd_none(x) should check only the lower when PAE */
#define pmd_none(x) (!(unsigned long)pmd_val(x))
#define pmd_present(x) (pmd_val(x) & _PAGE_PRESENT)
-#define pmd_bad(x) ((pmd_val(x) \
+
+extern int pmd_bad(pmd_t pmd);
+
+#define pmd_bad_v1(x) ((pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER)) != _KERNPG_TABLE)
+#define pmd_bad_v2(x) ((pmd_val(x) \
& ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX)) \
!= _KERNPG_TABLE)
--------------[ cut here ]------------ WARNING: at arch/x86/mm/pgtable_32.c:387 pmd_bad+0x44/0x53() Modules linked in: Pid: 1, comm: swapper Not tainted 2.6.25-rc3 #14 [<c0424ba5>] warn_on_slowpath+0x41/0x67 [<c0408c5c>] ? native_sched_clock+0x94/0xa6 [<c043f432>] ? lock_release_holdtime+0x1a/0x115 [<c04702d4>] ? handle_mm_fault+0x297/0x7e2 [<c063eee6>] ? _spin_unlock+0x1d/0x20 [<c04707f0>] ? handle_mm_fault+0x7b3/0x7e2 [<c04851c1>] ? do_sync_read+0xab/0xe9 [<c0417223>] pmd_bad+0x44/0x53 [<c046f37f>] follow_page+0x8b/0x1f2 [<c0470aa0>] get_user_pages+0x281/0x2ef [<c0488dec>] get_arg_page+0x2d/0x79 [<c04f63cd>] ? strnlen_user+0x2f/0x4d [<c0488efb>] copy_strings+0xc3/0x160 [<c0488fb4>] copy_strings_kernel+0x1c/0x2b [<c048a109>] do_execve+0x11a/0x1f0 [<c0403351>] sys_execve+0x29/0x51 [<c0404ac6>] syscall_call+0x7/0xb [<c0407697>] ? kernel_execve+0x17/0x1c [<c04021d1>] ? _stext+0x69/0x12c [<c040574f>] ? kernel_thread_helper+0x7/0x10 ======================= ---[ end trace 63b854b89f6f375d ]--- --
hm. I suspect some gcc related difference related to the handling of this masking: pmd_val(x) & ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX) versus: pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER) perhaps it will work if you change it to: pmd_val(x) & (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX) ? in any case, the commit has to be reverted as it clearly isnt a NOP on your box as it was intended to be. (it should only have made a difference in a rare hugetlbfs case) Ingo --
interesting observation: if I turn the macros into inlines... the difference goes away. I'm half tempted to just do it as inline period ... any objections ? --
include/asm-x86/pgtable.h has #define _PAGE_BIT_PSE 7 #define _PAGE_PSE (_AC(1, L)<<_PAGE_BIT_PSE) and #define _PAGE_BIT_NX 63 #if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE) #define _PAGE_NX (_AC(1, ULL) << _PAGE_BIT_NX) so (on 32-bit) ~_PAGE_PSE is ~0x80L is 0xffffff7f, which when cast to 64-bit is 0x00000000ffffff7f, so in (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX) all the high bits are lost, while the original ~(PAGE_MASK | _PAGE_USER | _PAGE_PSE | _PAGE_NX) works as intended, since the bit inversion is done on a 64-bit number. Maybe all those pagetable bit definitions should use 64-bit (ULL or a cast), as it is now some dangerous detail is hidden behind the macros. Using inline functions for simple constants seems like overkill to me, but would also work of course. Segher --
but we really are interested in the low bits here (lets ignore the NX bit for now) and the patch has been in the queue for a long time (more than a month), so if there was a trivial mask mixup problem it would have shown on the first day. So i suspect some gcc bug instead - and certainly the colorful mixture of types and signs in this expression might have surprised a new version of gcc somewhere. Ingo --
Actually, it is signed, so this isn't true. Comments about unsafeness still apply. Segher --
It turns out that PAGE_SIZE is unsigned. So this gives us for (~PAGE_MASK & ~_PAGE_USER & ~_PAGE_PSE & ~_PAGE_NX) the types UL, L, L, ULL resp. The associativity of & is left-to-right, so this in turn becomes UL, L, ULL UL, ULL ULL and that cast from UL to ULL doesn't sign-extend. Segher --
I went through and made a bunch of these flags signed so that they would
sign-extend properly in 32 vs 64 bit. We could make them
unconditionally 64-bit, but I'm worried that will have unforeseen
consequences too.
In this case, PAGE_MASK should probably be signed too, so the sign
extension happens properly.
Alternatively, they could be cast to pteval_t and/or phys_addr_t so that
they will have an inherent size which matches the pagetable format
(using _AT() rather than _AC()).
J
--Either that, or we could use a custom constructor macro (PTEVAL_C()). This is arguably The Right Thing, but the sign-extending version is OK, too. -hpa --
All the masks either need to be the proper size or signed. -hpa --
Yes, I object. I want to understand why it would matter. If this is a compiler bug, it's a really rather bad one. And if it's just some stupid bug in our pmd_bad() macro, I still want to know what the problem was. Can you compile both ways and look at what changed at the offending site (which is apparently "follow_page()")? And do you have some odd compiler version? Linus --
ok, i think i figured it out: on PAE 32-bit we dont properly sign-extend to a 64-bit pmd value in the new pmd_bad() macro - so if any physical RAM is above 4GB (Arjan's laptop had 4GB of RAM in it?) then we'll start seeing those high bits. This definitely needs PAE and more than 4GB of RAM to trigger. The best fix is the one below (it should solve Arjan's regression with that now-reverted patch redone), as it is the right thing to do [that way sign auto-extend trickles over into PAGE_MASK as well]. It boots fine on a >4GB box of mine but changing the type of PAGE_SIZE affects _everything_ so i'll keep testing it and i'd suggest to delay this fix to shortly after -rc5 is released instead of risking -rc5 with such a late commit. I'll send this and the re-done hugetlbfs fix together early next week. Ingo Signed-off-by: Ingo Molnar <mingo@elte.hu> Index: linux/include/asm-x86/page.h =================================================================== --- linux.orig/include/asm-x86/page.h +++ linux/include/asm-x86/page.h @@ -5,7 +5,7 @@ /* PAGE_SHIFT determines the page size */ #define PAGE_SHIFT 12 -#define PAGE_SIZE (_AC(1,UL) << PAGE_SHIFT) +#define PAGE_SIZE (_AC(1,L) << PAGE_SHIFT) #define PAGE_MASK (~(PAGE_SIZE-1)) #ifdef __KERNEL__ --
This *really* makes me worry.
I do agree that there are good reasons that "PAGE_MASK" should be signed
(since we do want the top bit to extend), but changing "PAGE_SIZE" to
signed seems to be a rather big change, considering that it's used
*everywhere*.
In particular, it's quite possibly used for things like
offset = something % PAGE_SIZE;
etc, where a signed divide is positively wrong.
But even for PAGE_MASK, we literally have code like this:
if ((size_avail & PAGE_MASK) < rg.size) {
where it so _happens_ that "size_avail" is unsigned, but what if it
wasn't? It could turn a unsigned comparison into a signed one, and
introduce any number of security bugs etc.
Sam goes even more for PAGE_SIZE. At least there are only about a thousand
users of PAGE_MASK in the kernel, PAGE_SIZE is used about six times as
many times, and just a _trivial_ grep like
git grep 'PAGE_SIZE.* [<>][= ]'
finds a lot of cases where I'm not at all sure that it's safe to change
PAGE_SIZE to a signed value.
In other words, there are lots of things like
if (x < PAGE_SIZE)
...
where we currently get a unsigned comparison, and where for all I know a
signed PAGE_SIZE means that we should use
if (x >= 0 && x < PAGE_SIZE)
instead.
In short, I refuse to apply this patch after an -rc1 release. I suspect
that I shouldn't apply something like this even *before* an -rc1, because
I think it's just a really bad idea to make these types signed even if it
were to give you magically easier sign extensions to "unsigned long long".
So I would *very* strongly instead argue:
- "unsigned long" is the native kernel type for all address manipulation,
and thus "PAGE_SIZE" and "PAGE_MASK" should continue to have that type.
- anything that uses any other type without explicitly making sure it's
safe is mis-using those macros. IOW, PAGE_MASk was *never* a type that
had anything what-so-ever to do with page table entry bits, and thi...We have had PAGE_MASK being signed on powerpc for ages, so if you do find any such bugs, please let me know. :) I'm not aware of any at present, though. The expression you quoted will be ok as long as either size_avail or rg.size is unsigned, as far as I can see. Our PAGE_SIZE is unsigned long. Paul. --
Yes. PTE_MASK already exists for masking the address bits out of a
pte. It should have the type pteval_t, which will deal with the PAE
case cleanly. It should also only have the middle pfn bits set, so that
_PAGE_NX is also not included.
It's currently defined in terms of PHYSICAL_PAGE_MASK, which 1) is only
used to define PTE_MASK, and 2) defined (wrongly) in terms of PAGE_MASK
(though __PHYSICAL_MASK is correctly defined).
J
--yeah, indeed my patch was sloppy, i didnt think it through - i fell for the lure of the easy-looking 'PAGE_SIZE is small, sign-extend it' hack. Will do it cleanly and will also clean up all the pte/address/pgprot type mixing that currently goes on in this maze of macros. Ingo --
it's the F9 gcc, so yeah it's really new. I'm staring at the disassembly and it's quite different but follow_page() is rather large; trying to make a smaller testcase now --
sadly a more isolated testcase doesn't show the same behavior yet; so it's some fun interaction ;( more staring at the assembly for me --
could you post the "bad" and the "good" disassembled functions, and the specific gcc version string? Ingo --
http://www.fenrus.org/memory.asm.macro <-- failing one http://www.fenrus.org/memory.asm.inline <-- working one --
On x86/{,_64}, _PAGE_PSE and _PAGE_PROTNONE are the same bit. Would
that have any effect here? We encountered that collision when adding
mprotect() support for hugepages.
Thanks,
Nish
--well, this is the PMD, and PROTNONE is a pte property. Ingo --
The PSE bit in the PTE becomes the PATX bit. It shouldn't ever be set in Linux. -hpa --
| Oleg Nesterov | Re: [PATCH, RFC] reimplement flush_workqueue() |
| Linus Torvalds | Re: Linux 2.6.27-rc8 |
| Pavel Roskin | ndiswrapper and GPL-only symbols redux |
| Greg Kroah-Hartman | [PATCH 017/196] aoechr: Convert from class_device to device |
git: | |
| David Symonds | Re: git and binary files |
| Matthieu Moy | git push to a non-bare repository |
| Felipe Oliveira Carvalho | Re: [RFC] Zit: the git-based single file content tracker |
| Jakub Narebski | Re: [VOTE] git versus mercurial (for DragonflyBSD) |
| Patrick McHardy | netfilter 05/29: netns ebtables: part 2 |
| Templin, Fred L | [Resend][PATCH 01/05] ipv6: RFC4214 Support (4) |
| Laszlo Attila Toth | [PATCHv7 0/5 + 3] Interface group patches |
| Jarek Poplawski | Re: [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Han Boetes | shutdown gets stuck at `syncing discs...' |
| Leon Dippenaar | New tcp stack attack |
| Richard Stallman | Real men don't attack straw men |
| GVG GVG | ssh_exchange_identification: Connection closed by remote host |
