Re: [PATCH, RFD]: Unbreak no-mmu mmap

Previous thread: [PATCH] Include <asm/termbits.h> from <linux/tty_driver.h>. by Robert P. J. Day on Friday, June 8, 2007 - 6:29 am. (15 messages)

Next thread: [GIT PULL] please pull infiniband.git by Roland Dreier on Friday, June 8, 2007 - 7:22 am. (2 messages)
From: Bernd Schmidt
Date: Friday, June 8, 2007 - 6:53 am

Here's a patch to move nommu mmap/munmap ever so slightly closer to mmu
behaviour.  The motivation for this is to be able to deselect uClibc's
UCLIBC_UCLINUX_BROKEN_MUNMAP config option, which speeds up malloc a
fair bit.  I'm interested in comments whether this is a good direction
to go.  The patch is against Linus' tree as of a few minutes ago.


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH      Wilhelm-Wagenfeld-Str. 6      80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
From: Matt Mackall
Date: Saturday, June 9, 2007 - 12:10 pm

That's worrisome. Breaking existing apps/libraries seems like a bad
idea.

-- 
Mathematics is the supreme nostalgia of our time.
-

From: Robin Getz
Date: Monday, June 11, 2007 - 2:08 pm

It is a bad idea - but on noMMU - breaking existing apps/libraries is 
(unfortunately) a pretty common thing...
  - the standard distribution - uClinux - rebuilds all things : kernel, apps,
     libc, shell, etc which run on the target - everytime you type 'make'
 - the standard library - uClibc - has little concept of binary 
     compatibility, and has none in their years of releasing things.
     (which is OK, since everything is re-built with a simple 'make' anyway)

Plus - what Bernd talked about - uClibc's simple malloc, isn't used that often 
(at all that I am aware of), on any modern uClibc systems. The overhead of 
calling the kernel to manage memory kills performance.

As a user of the kernel in a noMMU environment - I would rather have Bernd's 
patch - which makes things closer to a MMU environment - and allows standard 
applications to work better (at all) - rather than backwards binary 
compatibility.

-Robin
-

From: Mike Frysinger
Date: Monday, June 11, 2007 - 3:09 pm

it isnt breaking anything ... simplemalloc() will continue to execute
in newer kernels
-mike
-

From: Bernd Schmidt
Date: Monday, June 11, 2007 - 4:04 pm

While that's true, it'll have an even bigger memory overhead than it
already does (simplemalloc, by trapping into the kernel and creating
vm_area/vm_list structures for every malloc call, has huge overheads in
both time and space).
I've posted this as an RFD to get a feeling for whether we can change
behaviour here - I do expect that nommu embedded systems are less
constrained by backwards compatibility considerations, but I'd like to
hear from other embedded users whether this is a change they'd welcome.


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH      Wilhelm-Wagenfeld-Str. 6      80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
-

From: Mike Frysinger
Date: Monday, June 11, 2007 - 4:22 pm

yes, it does increase the runtime overhead, but the simplemalloc
implementation is already tagged as crappy, so i dont think it's that
big of a deal ... especially in light of all of the advantages the
other malloc implementation gets us nommu peeps
-mike
-

From: Robin Getz
Date: Tuesday, June 19, 2007 - 4:26 pm

I'm assuming that since no one had any large objections, that this is OK, and 
we should send to Andrew to live in -mm for awhile?

-Robin
-

From: Bryan Wu
Date: Tuesday, June 19, 2007 - 7:38 pm

Yes, IMO it is fine for kernel and uclibc. Is there any comments from
David and Greg?


Thanks
- Bryan
-

From: Paul Mundt
Date: Tuesday, June 19, 2007 - 8:00 pm

And now you've just broken every non-blackfin nommu platform, as you've

And what is this? It only shows up in the blackfin defconfig. This is not
the place to be putting board-specific hacks.

No real objections to the approach, but it would be nice if these sorts
of things were test compiled for at least one platform that isn't yours,
so the obviously broken stuff is fixed before it's posted and someone
else has to find out about it later.
-

From: Bryan Wu
Date: Tuesday, June 19, 2007 - 8:18 pm

Yes, it is our own NP2 memory allocator option. I think Bernd will fix

Exactly, Could please do some simple test on your SH-NOMMU platform? And
we are waiting for some feedback from other nommu arch maintainers.

David and Grep could you please help on this? Maybe Robin got some m68k
nommu by hand which can be used for testing, I only have Blackfin, -:))

Thanks, Paul.
Regards,
-

From: Greg Ungerer
Date: Tuesday, June 26, 2007 - 10:50 pm

Theres no reason you can't add the MAP_SPLIT_PAGES define in

I have compiled the patch on m68knommu (after adding a MAP_SPLIT_PAGES
define). And it seems to work ok with simple testing.

I don't have a problem with the change, though please do add that
MAP_SPLIT_PAGES define in the appropriate mman.h includes. And like Paul
said there is no place for CONFIG_NP2 in it currently. Please take
that out.

Regards
Greg




------------------------------------------------------------------------
Greg Ungerer  --  Chief Software Dude       EMAIL:     gerg@snapgear.com
Secure Computing Corporation                PHONE:       +61 7 3435 2888
825 Stanley St,                             FAX:         +61 7 3891 3630
Woolloongabba, QLD, 4102, Australia         WEB: http://www.SnapGear.com
-

From: David Howells
Date: Friday, June 22, 2007 - 5:59 am

Blackfin should really use asm-generic/mman.h if it can.

I'll produce my own version of the patch and pass it back to you once I've got
FRV compiling with it.

David
-

From: David Howells
Date: Friday, June 22, 2007 - 6:35 am

How about MAP_TRIM_EXCESS instead of MAP_SPLIT_PAGES?  I think it fits the
function better.

David
-

From: David Howells
Date: Friday, June 22, 2007 - 7:29 am

From: Bernd Schmidt &lt;bernds_cb1@t-online.de&gt;

Here's a patch to move nommu mmap/munmap ever so slightly closer to mmu
behaviour.  The motivation for this is to be able to deselect uClibc's
UCLIBC_UCLINUX_BROKEN_MUNMAP config option, which speeds up malloc a
fair bit.  I'm interested in comments whether this is a good direction
to go.  The patch is against Linus' tree as of a few minutes ago.

This changes nommu mmap/munmap in the following ways:
1. munmap can now unmap subparts of previously allocated blocks.  This
   makes behaviour more consistent with mmu Linux, and allows us to
   simplify and speed up the uClibc malloc implementation.
2. It is no longer possible to get blocks smaller than a page through
   mmap.  This behaviour was used by simplemalloc, which is an insane
   way of implementing malloc on nommu systems and hopefully not used
   by anyone anymore.
3. mmap can now be asked not to round up the allocation to the next
   power of 2 page size.  Excess pages will be freed if MAP_SPLIT_PAGES
   is passed to mmap.
   The latter is done for binfmt_flat and binfmt_elf_fdpic to save
   memory when loading executables.
   If this flag is used, more memory is kept available, but fragmentation
   may (or may not) be higher.

Every VMA can be in two states: either it manages a power-of-2 sized compound
page, or (if VM_SPLIT_PAGES) is set, a set of single pages exactly covering
the area between vm_start and vm_end.

Signed-off-by: Bernd Schmidt &lt;bernds_cb1@t-online.de&gt;
Signed-off-by: David Howells &lt;dhowells@redhat.com&gt;
---

 fs/binfmt_elf_fdpic.c       |   17 +-
 fs/binfmt_flat.c            |   40 ++----
 fs/proc/task_nommu.c        |   14 +-
 include/asm-blackfin/mman.h |    1 
 include/asm-frv/mman.h      |    3 
 include/linux/mm.h          |    2 
 mm/nommu.c                  |  294 ++++++++++++++++++++++++++++++++-----------
 mm/page_alloc.c             |   10 +
 8 files changed, 266 insertions(+), 115 deletions(-)

diff --git a/fs/binfmt_elf_fdpic.c ...
From: David Howells
Date: Wednesday, August 1, 2007 - 5:29 am

There's a problem with your alteration to do_munmap() to split VMAs.  What
happens if an area of a file is mmap'd twice in a process?  Consider the
situation where two independent parts of a process (say two different
libraries) each map an area of the same file.  do_mmap_pgoff() will attempt to
share these regions if it can.

This can lead to you splitting the wrong VMA when someone comes along later to 
do an apparent partial unmap.

I'm currently working on something to give each NOMMU process its own list
VMAs rather than holding them in common.  This is so that I can add VMA
reservation (and more to the point, unreservation) which'll make some stuff
easier.

David
-

From: David Howells
Date: Friday, August 3, 2007 - 7:03 am

Here's a preview of my patch to give each process a separate list of VMAs
under NOMMU mode, just as under MMU mode.  Could you have a look over it
please?

Could you also see if you get a memory leak on the blackfin CPU?  I see a leak
when I use this patch, but I'm not sure whether it's this patch, or whether
it's something else in the arch that is suppressed without this patch.

As far as I can tell by page counting there shouldn't be a leak.

I've taken some of Bernd's patch into this one, and I'll take a bit more when
I've done.

David

[PATCH] NOMMU: Make VMAs per MM as for MMU-mode linux

From: David Howells &lt;dhowells@redhat.com&gt;

Make VMAs per mm_struct as for MMU-mode linux.  This solves the nattch problem
for SYSV SHM where nattch for a segment does not reflect the number of shmat's
(and forks) done.

Signed-Off-By: David Howells &lt;dhowells@redhat.com&gt;
---

 arch/frv/Makefile        |    2 
 arch/frv/kernel/ptrace.c |   11 -
 fs/binfmt_elf_fdpic.c    |   29 --
 fs/proc/internal.h       |    2 
 fs/proc/nommu.c          |   71 ++--
 fs/proc/task_nommu.c     |  135 +++++--
 include/asm-frv/mmu.h    |    1 
 include/linux/mm.h       |   43 +-
 ipc/shm.c                |   12 +
 mm/nommu.c               |  852 +++++++++++++++++++++++++++++++---------------
 10 files changed, 740 insertions(+), 418 deletions(-)

diff --git a/arch/frv/Makefile b/arch/frv/Makefile
index 9bf7345..038e3a8 100644
--- a/arch/frv/Makefile
+++ b/arch/frv/Makefile
@@ -88,7 +88,7 @@ ASFLAGS		+= -mno-fdpic
 # make sure the .S files get compiled with debug info
 # and disable optimisations that are unhelpful whilst debugging
 ifdef CONFIG_DEBUG_INFO
-#CFLAGS		+= -O1
+CFLAGS		+= -O1
 AFLAGS		+= -Wa,--gdwarf2
 ASFLAGS		+= -Wa,--gdwarf2
 endif
diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c
index 709e9bd..e9af8de 100644
--- a/arch/frv/kernel/ptrace.c
+++ b/arch/frv/kernel/ptrace.c
@@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno,
 }
 ...
From: Bernd Schmidt
Date: Tuesday, August 7, 2007 - 6:12 am

There is a leak:

Node 0, zone      DMA     20      1      1      1      0      1      1
    1      0      0      0      1      2      0
Node 0, zone      DMA     32      1      1      0      0      0      1
    0      0      0      0      1      2      0
Node 0, zone      DMA     47      1      1      1      1      0      0
    1      1      1      1      0      2      0
Node 0, zone      DMA     62      1      1      0      1      1      1
    1      0      1      1      0      2      0
Node 0, zone      DMA     77      1      1      1      0      0      1
    0      0      1      1      0      2      0
Node 0, zone      DMA     92      1      1      0      0      1      0
    1      1      0      1      0      2      0
Node 0, zone      DMA    107      1      1      1      1      1      1
    1      0      0      1      0      2      0
Node 0, zone      DMA    122      1      1      0      1      0      1
    0      0      0      1      0      2      0
Node 0, zone      DMA    137      1      1      1      0      1      0
    1      1      1      0      0      2      0
... and so on.  It's a strange pattern of fragmentation, as if it keeps
allocating 8k pages and freeing one half of them.

Will play with this some more.  Thanks!


Bernd
-

From: David Howells
Date: Tuesday, August 7, 2007 - 6:17 am

Yes.  I found the major leak this morning.  There may be a minor leak, but I'm
not convinced it's in the mmap stuff.  See revised patch.

The leak was due to the fact that I wasn't setting the usage count on the 2nd+
pages when I made a &gt;0 order allocation.

David

---
[PATCH] NOMMU: Make VMAs per MM as for MMU-mode linux

From: David Howells &lt;dhowells@redhat.com&gt;

Make VMAs per mm_struct as for MMU-mode linux.  This solves the nattch problem
for SYSV SHM where nattch for a segment does not reflect the number of shmat's
(and forks) done.

Signed-Off-By: David Howells &lt;dhowells@redhat.com&gt;
---

 arch/frv/Makefile        |    2 
 arch/frv/kernel/ptrace.c |   11 -
 fs/binfmt_elf_fdpic.c    |   29 --
 fs/proc/internal.h       |    2 
 fs/proc/nommu.c          |   71 ++--
 fs/proc/task_nommu.c     |  135 +++++--
 include/asm-frv/mmu.h    |    1 
 include/linux/mm.h       |   43 +-
 ipc/shm.c                |   12 +
 mm/nommu.c               |  852 +++++++++++++++++++++++++++++++---------------
 10 files changed, 740 insertions(+), 418 deletions(-)

diff --git a/arch/frv/Makefile b/arch/frv/Makefile
index 9bf7345..038e3a8 100644
--- a/arch/frv/Makefile
+++ b/arch/frv/Makefile
@@ -88,7 +88,7 @@ ASFLAGS		+= -mno-fdpic
 # make sure the .S files get compiled with debug info
 # and disable optimisations that are unhelpful whilst debugging
 ifdef CONFIG_DEBUG_INFO
-#CFLAGS		+= -O1
+CFLAGS		+= -O1
 AFLAGS		+= -Wa,--gdwarf2
 ASFLAGS		+= -Wa,--gdwarf2
 endif
diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c
index 709e9bd..e9af8de 100644
--- a/arch/frv/kernel/ptrace.c
+++ b/arch/frv/kernel/ptrace.c
@@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno,
 }
 
 /*
- * check that an address falls within the bounds of the target process's memory mappings
+ * check that an address falls within the bounds of the target process's memory
+ * mappings
  */
 static inline int is_user_addr_valid(struct task_struct *child,
 				     ...
From: David Howells
Date: Tuesday, August 7, 2007 - 6:21 am

Oops.  That was the old patch.  Try this one instead.

David

[PATCH] NOMMU: Make VMAs per MM as for MMU-mode linux

From: David Howells &lt;dhowells@redhat.com&gt;

Make VMAs per mm_struct as for MMU-mode linux.  This solves the nattch problem
for SYSV SHM where nattch for a segment does not reflect the number of shmat's
(and forks) done.

Signed-Off-By: David Howells &lt;dhowells@redhat.com&gt;
---

 arch/frv/Makefile        |    2 
 arch/frv/kernel/ptrace.c |   11 -
 fs/binfmt_elf_fdpic.c    |   29 -
 fs/proc/internal.h       |    2 
 fs/proc/nommu.c          |   71 ++--
 fs/proc/proc_misc.c      |    6 
 fs/proc/task_nommu.c     |  135 ++++---
 include/asm-frv/mmu.h    |    1 
 include/linux/mm.h       |   47 +-
 ipc/shm.c                |   12 +
 kernel/fork.c            |    4 
 mm/mmap.c                |   10 +
 mm/nommu.c               |  900 +++++++++++++++++++++++++++++++---------------
 13 files changed, 784 insertions(+), 446 deletions(-)

diff --git a/arch/frv/Makefile b/arch/frv/Makefile
index 9bf7345..038e3a8 100644
--- a/arch/frv/Makefile
+++ b/arch/frv/Makefile
@@ -88,7 +88,7 @@ ASFLAGS		+= -mno-fdpic
 # make sure the .S files get compiled with debug info
 # and disable optimisations that are unhelpful whilst debugging
 ifdef CONFIG_DEBUG_INFO
-#CFLAGS		+= -O1
+CFLAGS		+= -O1
 AFLAGS		+= -Wa,--gdwarf2
 ASFLAGS		+= -Wa,--gdwarf2
 endif
diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c
index 709e9bd..e9af8de 100644
--- a/arch/frv/kernel/ptrace.c
+++ b/arch/frv/kernel/ptrace.c
@@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno,
 }
 
 /*
- * check that an address falls within the bounds of the target process's memory mappings
+ * check that an address falls within the bounds of the target process's memory
+ * mappings
  */
 static inline int is_user_addr_valid(struct task_struct *child,
 				     unsigned long start, unsigned long len)
@@ -79,11 +80,11 @@ static inline int is_user_addr_valid(struct ...
From: Bernd Schmidt
Date: Tuesday, August 7, 2007 - 6:37 am

Note that we've had a similar change in our kernel, and we've had to
revert it since the effect on fragmentation is just horrendous.  Without
this, we could run a complete gcc testsuite on the board; with it we'd
run out of high-order pages somewhere in the middle.

This probably wants to be dependent on something like MAP_TRIM_EXCESS.


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH      Wilhelm-Wagenfeld-Str. 6      80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
-

From: David Howells
Date: Tuesday, August 7, 2007 - 7:03 am

Hmmm...  In the interests of not adding more flags, I wonder if mremap()
should be used for that, with a global setting for mmaps made by binfmts as
they can't really be shrunk afterwards if they're shareable.

David
-

From: Bernd Schmidt
Date: Friday, August 17, 2007 - 4:49 am

Here are some changes to make it more stable on my system.  In 
do_mmap_private, I've commented out the logic to free excess pages, as 
it fragments terribly and causes a simple
  while true; do cat /proc/buddyinfo; done
loop to go oom.  Also, I think you're freeing high-order pages unaligned 
to their order?
In shrink_vma, we must save the mm across calls to remove_vma_from_mm 
(oops when telnetting into the box).
In do_munmap, we can deal with freeing more than one vma.  I've not 
touched the rb-tree logic in the shared file case, as I have no idea 
what it's trying to do given that only exact matches are allowed.

It still does not survive my mmap stress-tester, so I'll keep looking.

Why do we need vm_regions for anonymous memory?  Wouldn't it be enough 
to just have a VMA?


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH      Wilhelm-Wagenfeld-Str. 6      80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
From: David Howells
Date: Monday, August 20, 2007 - 8:12 am

I wonder if there's a good heuristic for this.  The problem is that whilst
not releasing excess pages _may_ seem like a good idea, if your system is
something like a single persistent app, then it really is not.

For instance, if such an app allocates a byte over 16MB (perhaps implicitly in
the binfmt driver), then you'd completely waste a large chunk of RAM.  In the


Yeah, but some of the pages might still be in use when we want to release


I'd generally rather not do this.  You can't use MAP_FIXED to request adjacent


It makes it simpler to have a common way of allocating memory for both anon
regions and file regions.

David
-

From: Bernd Schmidt
Date: Monday, August 20, 2007 - 9:02 am

I think it would be good to have a mechanism to group free pages by 
purpose - so that if we break up a high-order page in order to allocate 
memory for process A, then the remaining pages remain in a special pool 

Not following you here.  Is it valid to free an order-2 page that's not 

Adjacent regions can happen by accident, and the uClibc malloc will try 
to merge free areas when they are adjacent - there's a lot of 
special-case code in there to prevent this on uClinux systems by 
essentially duplicating VMA tracking.  That's something we want to 
avoid, because it eats performance (especially in threaded apps due to 
additional locking).


Bernd
-- 
This footer brought to you by insane German lawmakers.
Analog Devices GmbH      Wilhelm-Wagenfeld-Str. 6      80807 Muenchen
Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368
Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
-

Previous thread: [PATCH] Include <asm/termbits.h> from <linux/tty_driver.h>. by Robert P. J. Day on Friday, June 8, 2007 - 6:29 am. (15 messages)

Next thread: [GIT PULL] please pull infiniband.git by Roland Dreier on Friday, June 8, 2007 - 7:22 am. (2 messages)