Re: [Stable-review] [2/3] mm: fix up some user-visible effects of the stack guard page

Previous thread: [3/3] vmware: fix build error in vmware.c by Greg KH on Wednesday, August 18, 2010 - 1:30 pm. (1 message)

Next thread: [0/3] 2.6.35.3 -stable review by Greg KH on Wednesday, August 18, 2010 - 1:31 pm. (1 message)
From: Greg KH
Date: Wednesday, August 18, 2010 - 1:30 pm

[Empty message]
From: Ian Campbell
Date: Friday, August 20, 2010 - 5:54 am

Is this really correct?

I have an app which tries to mlock a portion of its stack. With this
patch (and a bunch of debug) in place I get:
        [  170.977782] sys_mlock 0xbfd8b000-0xbfd8c000 4096
        [  170.978200] sys_mlock aligned, range now 0xbfd8b000-0xbfd8c000 4096
        [  170.978209] do_mlock 0xbfd8b000-0xbfd8c000 4096 (locking)
        [  170.978216] do_mlock vma de47d8f0 0xbfd7e000-0xbfd94000
        [  170.978223] mlock_fixup split vma de47d8f0 0xbfd7e000-0xbfd94000 at start 0xbfd8b000
        [  170.978231] mlock_fixup split vma de47d8f0 0xbfd8b000-0xbfd94000 at end 0xbfd8c000
        [  170.978240] __mlock_vma_pages_range locking 0xbfd8b000-0xbfd8c000 (1 pages) in VMA bfd8b000 0xbfd8c000-0x0
        [  170.978248] __mlock_vma_pages_range adjusting start 0xbfd8b000->0xbfd8c000 to avoid guard
        [  170.978256] __mlock_vma_pages_range now locking 0xbfd8c000-0xbfd8c000 (0 pages)
        [  170.978263] do_mlock error = 0

Note how we end up locking 0 pages.

The stack layout is:
         0xbfd94000 stack VMA end / base
        
         0xbfd8c000 mlock requested end
         0xbfd8b000 mlock requested start
        
         0xbfd7f000 stack VMA start / top
        
         0xbfd7e000 guard page

As part of the mlock_fixup the original VMA (0xbfd7e000-0xbfd94000) is
split into 3, 0xbfd7e000-0xbfd8b000 + 0xbfd8b000-0xbfd8c000 +
0xbfd8c000-0xbfd94000 in order to mlock the middle bit.

Since we have split the original VMA into 3, shouldn't only the bottom
one still have VM_GROWSDOWN set? (how can the top two grow down with the
bottom one in the way?) Certainly it seems wrong to enforce a guard page
on anything but the bottom VMA (which is what appears to be happening).

Although perhaps the larger issue is whether or not it is valid to mlock
below the current end of your current stack, I don't see why it wouldn't
be so perhaps the above is just completely bogus? Isn't it possible that
a process may try and mlock something on a stack page which ...
From: Linus Torvalds
Date: Friday, August 20, 2010 - 8:54 am

Yes, it does seem like we should teach vma splitting to remove

It's a guard page, not magic. Some architecture ABI's specify that if
you expand the stack by more than a certain number, you need to touch
a page in between (for example, I think alpha had that rule), because
they don't grow the stack automatically by an arbitrary amount. But
x86 has never had that rule, and you can certainly defeat a guard page
by simply accessing by much more than a page.

As far as I'm concerned, the guard page thing is not - and shouldn't
be thought of - a "hard" feature. If it's needed, it's really a bug in
user space. But given that there are bugs in user space, the guard
page makes it a bit harder to abuse those bugs. But it's about "a bit
harder" rather than anything else.

IOW, it does _not_ make up for user space that willfully does crazy
things, and never will.

                       Linus
--

From: Ian Campbell
Date: Friday, August 20, 2010 - 9:02 am

Agreed, I was just coding that up to check.

Is there any VMA coalescing which we need to worry about? We don't
appear to do anything like that on munlock at least but I didn't look
elsewhere.

FWIW the attached mlock.c exhibits the issue for me. Under 2.6.35 it
takes 1 additional minor fault if you do not give the "lock" option and
0 minor faults if you do give "lock".

Under 2.6.35.2 and 3.6.35.3-rc it works the same without "lock" but with
"lock" under 2.6.35.2 the mlock fails and with 2.6.35.3 it thinks it

Thanks, was just curious...

Ian.
From: Linus Torvalds
Date: Friday, August 20, 2010 - 9:07 am

On Fri, Aug 20, 2010 at 8:54 AM, Linus Torvalds

Actually, thinking some more about it, that may not be a good idea.
Why? Simply because we may want to merge the vma's back together if
you do munlock. And it won't (and mustn't) merge if the vm_flags
differ in VM_GROWSDOWN.

So I do think we want to keep VM_GROWSDOWN (and VM_GROWSUP on PA-RISC)
even across a vma split.

Of course, we could set a flag whether the vma really does have a
guard page or not.

That said, it does strike me as rather odd to do VM ops on partial
stacks. What are you doing, exactly, to hit this?

                         Linus
--

From: Linus Torvalds
Date: Friday, August 20, 2010 - 9:24 am

On Fri, Aug 20, 2010 at 9:07 AM, Linus Torvalds

The reason I ask is that the _sane_ thing to do - if we really care
about this - is to change the 'vm_next' singly-linked list into using
'list.h'. It would clean up a fair amount of stuff, like removing the
need for that disgusting 'find_vma_prev()' thing. There are actually
several users of vma's that want to look at the previous vma, but
because it's hard to get at, they do something non-intuitive or odd.

At the same time, we've had that vm_next pointer since pretty much day
one, and I also get a strong feeling that it's not really worth the
churn.

                         Linus
--

From: Ian Campbell
Date: Friday, August 20, 2010 - 10:43 am

I wasn't sure at first what you were getting at here, so let me see if I
figured it out...

If we could easily get at the previous VMA (instead of just the next
one) then we could easily check if we were mlocking a VM_GROWSDOWN
region which had another VM_GROWSDOWN region immediately below it and
therefore avoid introducing a guard page at the boundary. Doing this
check is currently too expensive because of the need to use

It does look like a big task, but if it seems like the only sane option
I'll take a look at it and see if can be broken down into manageable
stages.

You mentioned making this a tunable in your original commit message,
that would at least help in the short term so I may look into that too.
(prctl would be the right interface?)

I wonder if there's any way to auto tune, for example automatically
disabling the guard page for a process which mlocks only part of its
stack VMA. That would obviously target the specific issue I'm seeing
pretty directly and would only reopen the hole for applications which
were already doing odd things (c.f. your earlier comment about the guard
page not being magic or helping with wilfully crazy userspace).

Ian.

-- 
Ian Campbell

Let he who takes the plunge remember to return it by Tuesday.
From: Linus Torvalds
Date: Friday, August 20, 2010 - 11:59 am

It should be a pretty straightforward search-and-replace. And almost
all of it would be real cleanups.

And it would be trivial to change the loops like

    for (vma = mm->mmap; vma; vma = vma->vm_next)

into basically just

   list_for_each_entry(vma, &mm->mmap, vm_list)


I'm not convinced a tunable is really the right thing, but in this
case it might be acceptable, since you really are doing something
rather specific and odd. Changing the VM to use doubly-linked lists

I'd really hate to try to do something subtle that doesn't have
obvious semantics. Subtle code is buggy code. Maybe not today, but two
years from now..

                     Linus
--

From: Linus Torvalds
Date: Friday, August 20, 2010 - 12:43 pm

On Fri, Aug 20, 2010 at 11:59 AM, Linus Torvalds

I take that back. Cleanups - probably. Simple search-and-replace? No.
That vm_next thing is all over the place, and all of the code knows
that the last vma has a vm_next that is NULL.

So switching it to the "<linux/list.h>" kind of accessors would be a major pain.

There's also lots of really ugly code that is all about the "we can't
easily get to the 'prev' entry in the list". Stuff that would be
cleaned up if we just had a vm_prev, but where the cleanups is just

Yeah, no. It looks like adding a "vm_prev" and doing a regular doubly
linked list thing wouldn't be too bad. But switching over to the
list.h version looks like a nightmare.

Too bad.

                  Linus
--

From: Willy Tarreau
Date: Friday, August 20, 2010 - 1:34 pm

I've had to convert normal linked lists to a dual linked list for a project
of mine in the past, and for the same reason I could not use the lists how
we use them in Linux. However I found an interesting tradeoff which consists
in having only the ->prev list circular but keep the ->next one null-
terminated.

In the end, the API is not much different. A few tests just on the ->next
pointer at some places, but you can still use ->prev everywhere to find
the list tail, for inserting and removing. And overall that was pretty
convenient. It was a lot better than the simple linked list and almost
as easy to use as the circular ones.

Just my 2 cents,
Willy

--

From: Peter Zijlstra
Date: Friday, August 20, 2010 - 1:42 pm

You mean something like the below?

Should I look at refreshing that thing (yes, that's a 2007 patch)?

---
Subject: mm: replace vm_next with a list_head
Date: 19 Jul 2007

Replace the vma list with a proper list_head.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
 arch/alpha/kernel/osf_sys.c         |    2 
 arch/arm/mm/mmap.c                  |    2 
 arch/frv/mm/elf-fdpic.c             |    4 -
 arch/i386/mm/hugetlbpage.c          |    2 
 arch/ia64/kernel/sys_ia64.c         |    2 
 arch/ia64/mm/hugetlbpage.c          |    2 
 arch/mips/kernel/irixelf.c          |   20 +++---
 arch/mips/kernel/syscall.c          |    2 
 arch/parisc/kernel/sys_parisc.c     |    4 -
 arch/powerpc/mm/tlb_32.c            |    2 
 arch/ppc/mm/tlb.c                   |    2 
 arch/sh/kernel/sys_sh.c             |    2 
 arch/sh/mm/cache-sh4.c              |    2 
 arch/sh64/kernel/sys_sh64.c         |    2 
 arch/sparc/kernel/sys_sparc.c       |    2 
 arch/sparc64/kernel/binfmt_aout32.c |    2 
 arch/sparc64/kernel/sys_sparc.c     |    2 
 arch/sparc64/mm/hugetlbpage.c       |    2 
 arch/um/kernel/skas/tlb.c           |    6 -
 arch/x86_64/ia32/ia32_aout.c        |    2 
 arch/x86_64/kernel/sys_x86_64.c     |    2 
 drivers/char/mem.c                  |    2 
 drivers/oprofile/buffer_sync.c      |    4 -
 fs/binfmt_aout.c                    |    2 
 fs/binfmt_elf.c                     |    6 -
 fs/binfmt_elf_fdpic.c               |    4 -
 fs/hugetlbfs/inode.c                |    2 
 fs/proc/task_mmu.c                  |   18 ++---
 include/linux/init_task.h           |    1 
 include/linux/mm.h                  |   44 +++++++++++++-
 include/linux/sched.h               |    2 
 ipc/shm.c                           |    4 -
 kernel/acct.c                       |    5 -
 kernel/auditsc.c                    |    4 -
 kernel/fork.c                       |   12 +--
 mm/madvise.c                        |    2 
 mm/memory.c                         |   20 +++---
 ...
From: Linus Torvalds
Date: Friday, August 20, 2010 - 2:24 pm

On Fri, Aug 20, 2010 at 2:17 PM, Linus Torvalds

Note that the "real" patch is actually even smaller than implied by
this thing. The diffstat

 include/linux/mm_types.h |    2 +-
 kernel/fork.c            |    7 +++++--
 mm/mmap.c                |   37 +++++++++++++++++++++++++++++++++----
 mm/nommu.c               |    7 +++++--
 4 files changed, 44 insertions(+), 9 deletions(-)

implies that it adds a lot more than it removes, but of the 35 new
lines it adds, about half of it - 16 lines - is just the vma list
verification hack. So it really adds less than 20 lines of code.
Hopefully those 20 lines would then buy themselves back with cleanups.

Still. A doubly-linked list is totally trivial, and I just bet I have
a bug in there somewhere. But looking at your 2007 patch and mine
side-by-side, I do think mine is still less scary.

                     Linus
--

From: Peter Zijlstra
Date: Monday, August 23, 2010 - 1:58 am

For sure, and if you need it in a hurry your patch is much saner.

So if you feel like its required for .36 go for it, we can look at
cleaning it up for .37 or somesuch.


--

From: Linus Torvalds
Date: Friday, August 20, 2010 - 2:17 pm

Yeah. I looked at exactly something like that. With the same name for
vma_next() - except I just passed down 'mm' to it too. Your patch
takes it from vma->mm and then you have a few places do the whole

Interesting. I generated about half the patch, and then decided that
it's not worth it. But the fact that apparently you did the same thing
in 2007 means that maybe there really _is_ a pent-up demand for doing
this dang thing.

I do note that I also did a patch that just added "vm_prev", and it's
a lot smaller. It doesn't allow for the cleanups of using the generic
list iterators, though. And it has the usual problem with the
straightforward doubly-linked lists: you end up with those ugly
conditional assignments like

  if (prev)
     prev->vm_next = vma;
  if (next)
     next->vm_prev = vma;

which the list_head approach avoids.

Appended is that much smaller patch. It has an ugly and partial
"validate_vma()" hack there in find_vma() to catch the worst bugs, but
it's really not tested. I did try booting it, and it's not spewing out
errors, but who the heck knows. It doesn't look too horrid, but...

What do you think?

NOTE! I've done _zero_ cleanups. And one of the thing I noticed is
that "find_vma_prev()" is sometimes used to find a "prev" even when
there is no "vma" (ie growing a grow-up stack at the end of the VM, or
adding a new mapping at the end), so my hope that we could get rid of
it entirely was naïve. But there should be other things we should be
able to simplify when we can get at the prev pointer (all the
mprotect/mlock splitting code should work fine without having that
'prev' thing passed around etc).

                            Linus
From: Ian Campbell
Date: Friday, August 20, 2010 - 9:32 am

I naively hacked something together and it did seem to work, but I


I sent a contrived test program in my other mail.

The actual use is to mlock a buffer on the stack in order to pass it to
a Xen hypercall. The contract with the hypervisor is that the pages
passed to the hypercall must be mapped.

Ian.

-- 
Ian Campbell
Current Noise: Dinosaur Jr. - Green Mind

We can defeat gravity.  The problem is the paperwork involved.

--

From: Ian Campbell
Date: Friday, August 20, 2010 - 9:35 am

On the other hand the VMA merging is just an optimisation, isn't it?

-- 
Ian Campbell

Many people are desperately looking for some wise advice which will
recommend that they do what they want to do.

--

From: Linus Torvalds
Date: Friday, August 20, 2010 - 9:49 am

Well, yes and no. This would make it have semantic differences, if you
were to unmap the lower part of the stack.

I could imagine some crazy program wanting to basically return the
stack pages to the system after doing heavy recursion. IOW, they could
do

 - use lots of stack because we're recursing 1000 levels deep

 - know that we used lots of stack, so after returning do something like

    stack = &local variable;
    align stack down by two pages
    munmap down from there to give memory back

and now it really would be a semantic change where the VM_GROWSDOWN
bit has literally disappeared.

                                  Linus
--

Previous thread: [3/3] vmware: fix build error in vmware.c by Greg KH on Wednesday, August 18, 2010 - 1:30 pm. (1 message)

Next thread: [0/3] 2.6.35.3 -stable review by Greg KH on Wednesday, August 18, 2010 - 1:31 pm. (1 message)