Is this really correct?
I have an app which tries to mlock a portion of its stack. With this
patch (and a bunch of debug) in place I get:
[ 170.977782] sys_mlock 0xbfd8b000-0xbfd8c000 4096
[ 170.978200] sys_mlock aligned, range now 0xbfd8b000-0xbfd8c000 4096
[ 170.978209] do_mlock 0xbfd8b000-0xbfd8c000 4096 (locking)
[ 170.978216] do_mlock vma de47d8f0 0xbfd7e000-0xbfd94000
[ 170.978223] mlock_fixup split vma de47d8f0 0xbfd7e000-0xbfd94000 at start 0xbfd8b000
[ 170.978231] mlock_fixup split vma de47d8f0 0xbfd8b000-0xbfd94000 at end 0xbfd8c000
[ 170.978240] __mlock_vma_pages_range locking 0xbfd8b000-0xbfd8c000 (1 pages) in VMA bfd8b000 0xbfd8c000-0x0
[ 170.978248] __mlock_vma_pages_range adjusting start 0xbfd8b000->0xbfd8c000 to avoid guard
[ 170.978256] __mlock_vma_pages_range now locking 0xbfd8c000-0xbfd8c000 (0 pages)
[ 170.978263] do_mlock error = 0
Note how we end up locking 0 pages.
The stack layout is:
0xbfd94000 stack VMA end / base
0xbfd8c000 mlock requested end
0xbfd8b000 mlock requested start
0xbfd7f000 stack VMA start / top
0xbfd7e000 guard page
As part of the mlock_fixup the original VMA (0xbfd7e000-0xbfd94000) is
split into 3, 0xbfd7e000-0xbfd8b000 + 0xbfd8b000-0xbfd8c000 +
0xbfd8c000-0xbfd94000 in order to mlock the middle bit.
Since we have split the original VMA into 3, shouldn't only the bottom
one still have VM_GROWSDOWN set? (how can the top two grow down with the
bottom one in the way?) Certainly it seems wrong to enforce a guard page
on anything but the bottom VMA (which is what appears to be happening).
Although perhaps the larger issue is whether or not it is valid to mlock
below the current end of your current stack, I don't see why it wouldn't
be so perhaps the above is just completely bogus? Isn't it possible that
a process may try and mlock something on a stack page which ...Yes, it does seem like we should teach vma splitting to remove
It's a guard page, not magic. Some architecture ABI's specify that if
you expand the stack by more than a certain number, you need to touch
a page in between (for example, I think alpha had that rule), because
they don't grow the stack automatically by an arbitrary amount. But
x86 has never had that rule, and you can certainly defeat a guard page
by simply accessing by much more than a page.
As far as I'm concerned, the guard page thing is not - and shouldn't
be thought of - a "hard" feature. If it's needed, it's really a bug in
user space. But given that there are bugs in user space, the guard
page makes it a bit harder to abuse those bugs. But it's about "a bit
harder" rather than anything else.
IOW, it does _not_ make up for user space that willfully does crazy
things, and never will.
Linus
--
Agreed, I was just coding that up to check. Is there any VMA coalescing which we need to worry about? We don't appear to do anything like that on munlock at least but I didn't look elsewhere. FWIW the attached mlock.c exhibits the issue for me. Under 2.6.35 it takes 1 additional minor fault if you do not give the "lock" option and 0 minor faults if you do give "lock". Under 2.6.35.2 and 3.6.35.3-rc it works the same without "lock" but with "lock" under 2.6.35.2 the mlock fails and with 2.6.35.3 it thinks it Thanks, was just curious... Ian.
On Fri, Aug 20, 2010 at 8:54 AM, Linus Torvalds
Actually, thinking some more about it, that may not be a good idea.
Why? Simply because we may want to merge the vma's back together if
you do munlock. And it won't (and mustn't) merge if the vm_flags
differ in VM_GROWSDOWN.
So I do think we want to keep VM_GROWSDOWN (and VM_GROWSUP on PA-RISC)
even across a vma split.
Of course, we could set a flag whether the vma really does have a
guard page or not.
That said, it does strike me as rather odd to do VM ops on partial
stacks. What are you doing, exactly, to hit this?
Linus
--
On Fri, Aug 20, 2010 at 9:07 AM, Linus Torvalds
The reason I ask is that the _sane_ thing to do - if we really care
about this - is to change the 'vm_next' singly-linked list into using
'list.h'. It would clean up a fair amount of stuff, like removing the
need for that disgusting 'find_vma_prev()' thing. There are actually
several users of vma's that want to look at the previous vma, but
because it's hard to get at, they do something non-intuitive or odd.
At the same time, we've had that vm_next pointer since pretty much day
one, and I also get a strong feeling that it's not really worth the
churn.
Linus
--
I wasn't sure at first what you were getting at here, so let me see if I figured it out... If we could easily get at the previous VMA (instead of just the next one) then we could easily check if we were mlocking a VM_GROWSDOWN region which had another VM_GROWSDOWN region immediately below it and therefore avoid introducing a guard page at the boundary. Doing this check is currently too expensive because of the need to use It does look like a big task, but if it seems like the only sane option I'll take a look at it and see if can be broken down into manageable stages. You mentioned making this a tunable in your original commit message, that would at least help in the short term so I may look into that too. (prctl would be the right interface?) I wonder if there's any way to auto tune, for example automatically disabling the guard page for a process which mlocks only part of its stack VMA. That would obviously target the specific issue I'm seeing pretty directly and would only reopen the hole for applications which were already doing odd things (c.f. your earlier comment about the guard page not being magic or helping with wilfully crazy userspace). Ian. -- Ian Campbell Let he who takes the plunge remember to return it by Tuesday.
It should be a pretty straightforward search-and-replace. And almost
all of it would be real cleanups.
And it would be trivial to change the loops like
for (vma = mm->mmap; vma; vma = vma->vm_next)
into basically just
list_for_each_entry(vma, &mm->mmap, vm_list)
I'm not convinced a tunable is really the right thing, but in this
case it might be acceptable, since you really are doing something
rather specific and odd. Changing the VM to use doubly-linked lists
I'd really hate to try to do something subtle that doesn't have
obvious semantics. Subtle code is buggy code. Maybe not today, but two
years from now..
Linus
--
On Fri, Aug 20, 2010 at 11:59 AM, Linus Torvalds
I take that back. Cleanups - probably. Simple search-and-replace? No.
That vm_next thing is all over the place, and all of the code knows
that the last vma has a vm_next that is NULL.
So switching it to the "<linux/list.h>" kind of accessors would be a major pain.
There's also lots of really ugly code that is all about the "we can't
easily get to the 'prev' entry in the list". Stuff that would be
cleaned up if we just had a vm_prev, but where the cleanups is just
Yeah, no. It looks like adding a "vm_prev" and doing a regular doubly
linked list thing wouldn't be too bad. But switching over to the
list.h version looks like a nightmare.
Too bad.
Linus
--
I've had to convert normal linked lists to a dual linked list for a project of mine in the past, and for the same reason I could not use the lists how we use them in Linux. However I found an interesting tradeoff which consists in having only the ->prev list circular but keep the ->next one null- terminated. In the end, the API is not much different. A few tests just on the ->next pointer at some places, but you can still use ->prev everywhere to find the list tail, for inserting and removing. And overall that was pretty convenient. It was a lot better than the simple linked list and almost as easy to use as the circular ones. Just my 2 cents, Willy --
You mean something like the below? Should I look at refreshing that thing (yes, that's a 2007 patch)? --- Subject: mm: replace vm_next with a list_head Date: 19 Jul 2007 Replace the vma list with a proper list_head. Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl> --- arch/alpha/kernel/osf_sys.c | 2 arch/arm/mm/mmap.c | 2 arch/frv/mm/elf-fdpic.c | 4 - arch/i386/mm/hugetlbpage.c | 2 arch/ia64/kernel/sys_ia64.c | 2 arch/ia64/mm/hugetlbpage.c | 2 arch/mips/kernel/irixelf.c | 20 +++--- arch/mips/kernel/syscall.c | 2 arch/parisc/kernel/sys_parisc.c | 4 - arch/powerpc/mm/tlb_32.c | 2 arch/ppc/mm/tlb.c | 2 arch/sh/kernel/sys_sh.c | 2 arch/sh/mm/cache-sh4.c | 2 arch/sh64/kernel/sys_sh64.c | 2 arch/sparc/kernel/sys_sparc.c | 2 arch/sparc64/kernel/binfmt_aout32.c | 2 arch/sparc64/kernel/sys_sparc.c | 2 arch/sparc64/mm/hugetlbpage.c | 2 arch/um/kernel/skas/tlb.c | 6 - arch/x86_64/ia32/ia32_aout.c | 2 arch/x86_64/kernel/sys_x86_64.c | 2 drivers/char/mem.c | 2 drivers/oprofile/buffer_sync.c | 4 - fs/binfmt_aout.c | 2 fs/binfmt_elf.c | 6 - fs/binfmt_elf_fdpic.c | 4 - fs/hugetlbfs/inode.c | 2 fs/proc/task_mmu.c | 18 ++--- include/linux/init_task.h | 1 include/linux/mm.h | 44 +++++++++++++- include/linux/sched.h | 2 ipc/shm.c | 4 - kernel/acct.c | 5 - kernel/auditsc.c | 4 - kernel/fork.c | 12 +-- mm/madvise.c | 2 mm/memory.c | 20 +++--- ...
On Fri, Aug 20, 2010 at 2:17 PM, Linus Torvalds
Note that the "real" patch is actually even smaller than implied by
this thing. The diffstat
include/linux/mm_types.h | 2 +-
kernel/fork.c | 7 +++++--
mm/mmap.c | 37 +++++++++++++++++++++++++++++++++----
mm/nommu.c | 7 +++++--
4 files changed, 44 insertions(+), 9 deletions(-)
implies that it adds a lot more than it removes, but of the 35 new
lines it adds, about half of it - 16 lines - is just the vma list
verification hack. So it really adds less than 20 lines of code.
Hopefully those 20 lines would then buy themselves back with cleanups.
Still. A doubly-linked list is totally trivial, and I just bet I have
a bug in there somewhere. But looking at your 2007 patch and mine
side-by-side, I do think mine is still less scary.
Linus
--
For sure, and if you need it in a hurry your patch is much saner. So if you feel like its required for .36 go for it, we can look at cleaning it up for .37 or somesuch. --
Yeah. I looked at exactly something like that. With the same name for
vma_next() - except I just passed down 'mm' to it too. Your patch
takes it from vma->mm and then you have a few places do the whole
Interesting. I generated about half the patch, and then decided that
it's not worth it. But the fact that apparently you did the same thing
in 2007 means that maybe there really _is_ a pent-up demand for doing
this dang thing.
I do note that I also did a patch that just added "vm_prev", and it's
a lot smaller. It doesn't allow for the cleanups of using the generic
list iterators, though. And it has the usual problem with the
straightforward doubly-linked lists: you end up with those ugly
conditional assignments like
if (prev)
prev->vm_next = vma;
if (next)
next->vm_prev = vma;
which the list_head approach avoids.
Appended is that much smaller patch. It has an ugly and partial
"validate_vma()" hack there in find_vma() to catch the worst bugs, but
it's really not tested. I did try booting it, and it's not spewing out
errors, but who the heck knows. It doesn't look too horrid, but...
What do you think?
NOTE! I've done _zero_ cleanups. And one of the thing I noticed is
that "find_vma_prev()" is sometimes used to find a "prev" even when
there is no "vma" (ie growing a grow-up stack at the end of the VM, or
adding a new mapping at the end), so my hope that we could get rid of
it entirely was naïve. But there should be other things we should be
able to simplify when we can get at the prev pointer (all the
mprotect/mlock splitting code should work fine without having that
'prev' thing passed around etc).
Linus
I naively hacked something together and it did seem to work, but I I sent a contrived test program in my other mail. The actual use is to mlock a buffer on the stack in order to pass it to a Xen hypercall. The contract with the hypervisor is that the pages passed to the hypercall must be mapped. Ian. -- Ian Campbell Current Noise: Dinosaur Jr. - Green Mind We can defeat gravity. The problem is the paperwork involved. --
On the other hand the VMA merging is just an optimisation, isn't it? -- Ian Campbell Many people are desperately looking for some wise advice which will recommend that they do what they want to do. --
Well, yes and no. This would make it have semantic differences, if you
were to unmap the lower part of the stack.
I could imagine some crazy program wanting to basically return the
stack pages to the system after doing heavy recursion. IOW, they could
do
- use lots of stack because we're recursing 1000 levels deep
- know that we used lots of stack, so after returning do something like
stack = &local variable;
align stack down by two pages
munmap down from there to give memory back
and now it really would be a semantic change where the VM_GROWSDOWN
bit has literally disappeared.
Linus
--
