Here's a patch to move nommu mmap/munmap ever so slightly closer to mmu behaviour. The motivation for this is to be able to deselect uClibc's UCLIBC_UCLINUX_BROKEN_MUNMAP config option, which speeds up malloc a fair bit. I'm interested in comments whether this is a good direction to go. The patch is against Linus' tree as of a few minutes ago. Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
That's worrisome. Breaking existing apps/libraries seems like a bad idea. -- Mathematics is the supreme nostalgia of our time. -
It is a bad idea - but on noMMU - breaking existing apps/libraries is
(unfortunately) a pretty common thing...
- the standard distribution - uClinux - rebuilds all things : kernel, apps,
libc, shell, etc which run on the target - everytime you type 'make'
- the standard library - uClibc - has little concept of binary
compatibility, and has none in their years of releasing things.
(which is OK, since everything is re-built with a simple 'make' anyway)
Plus - what Bernd talked about - uClibc's simple malloc, isn't used that often
(at all that I am aware of), on any modern uClibc systems. The overhead of
calling the kernel to manage memory kills performance.
As a user of the kernel in a noMMU environment - I would rather have Bernd's
patch - which makes things closer to a MMU environment - and allows standard
applications to work better (at all) - rather than backwards binary
compatibility.
-Robin
-
it isnt breaking anything ... simplemalloc() will continue to execute in newer kernels -mike -
While that's true, it'll have an even bigger memory overhead than it already does (simplemalloc, by trapping into the kernel and creating vm_area/vm_list structures for every malloc call, has huge overheads in both time and space). I've posted this as an RFD to get a feeling for whether we can change behaviour here - I do expect that nommu embedded systems are less constrained by backwards compatibility considerations, but I'd like to hear from other embedded users whether this is a change they'd welcome. Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif -
yes, it does increase the runtime overhead, but the simplemalloc implementation is already tagged as crappy, so i dont think it's that big of a deal ... especially in light of all of the advantages the other malloc implementation gets us nommu peeps -mike -
I'm assuming that since no one had any large objections, that this is OK, and we should send to Andrew to live in -mm for awhile? -Robin -
Yes, IMO it is fine for kernel and uclibc. Is there any comments from David and Greg? Thanks - Bryan -
And now you've just broken every non-blackfin nommu platform, as you've And what is this? It only shows up in the blackfin defconfig. This is not the place to be putting board-specific hacks. No real objections to the approach, but it would be nice if these sorts of things were test compiled for at least one platform that isn't yours, so the obviously broken stuff is fixed before it's posted and someone else has to find out about it later. -
Yes, it is our own NP2 memory allocator option. I think Bernd will fix Exactly, Could please do some simple test on your SH-NOMMU platform? And we are waiting for some feedback from other nommu arch maintainers. David and Grep could you please help on this? Maybe Robin got some m68k nommu by hand which can be used for testing, I only have Blackfin, -:)) Thanks, Paul. Regards, -
Theres no reason you can't add the MAP_SPLIT_PAGES define in I have compiled the patch on m68knommu (after adding a MAP_SPLIT_PAGES define). And it seems to work ok with simple testing. I don't have a problem with the change, though please do add that MAP_SPLIT_PAGES define in the appropriate mman.h includes. And like Paul said there is no place for CONFIG_NP2 in it currently. Please take that out. Regards Greg ------------------------------------------------------------------------ Greg Ungerer -- Chief Software Dude EMAIL: gerg@snapgear.com Secure Computing Corporation PHONE: +61 7 3435 2888 825 Stanley St, FAX: +61 7 3891 3630 Woolloongabba, QLD, 4102, Australia WEB: http://www.SnapGear.com -
Blackfin should really use asm-generic/mman.h if it can. I'll produce my own version of the patch and pass it back to you once I've got FRV compiling with it. David -
How about MAP_TRIM_EXCESS instead of MAP_SPLIT_PAGES? I think it fits the function better. David -
From: Bernd Schmidt <bernds_cb1@t-online.de> Here's a patch to move nommu mmap/munmap ever so slightly closer to mmu behaviour. The motivation for this is to be able to deselect uClibc's UCLIBC_UCLINUX_BROKEN_MUNMAP config option, which speeds up malloc a fair bit. I'm interested in comments whether this is a good direction to go. The patch is against Linus' tree as of a few minutes ago. This changes nommu mmap/munmap in the following ways: 1. munmap can now unmap subparts of previously allocated blocks. This makes behaviour more consistent with mmu Linux, and allows us to simplify and speed up the uClibc malloc implementation. 2. It is no longer possible to get blocks smaller than a page through mmap. This behaviour was used by simplemalloc, which is an insane way of implementing malloc on nommu systems and hopefully not used by anyone anymore. 3. mmap can now be asked not to round up the allocation to the next power of 2 page size. Excess pages will be freed if MAP_SPLIT_PAGES is passed to mmap. The latter is done for binfmt_flat and binfmt_elf_fdpic to save memory when loading executables. If this flag is used, more memory is kept available, but fragmentation may (or may not) be higher. Every VMA can be in two states: either it manages a power-of-2 sized compound page, or (if VM_SPLIT_PAGES) is set, a set of single pages exactly covering the area between vm_start and vm_end. Signed-off-by: Bernd Schmidt <bernds_cb1@t-online.de> Signed-off-by: David Howells <dhowells@redhat.com> --- fs/binfmt_elf_fdpic.c | 17 +- fs/binfmt_flat.c | 40 ++---- fs/proc/task_nommu.c | 14 +- include/asm-blackfin/mman.h | 1 include/asm-frv/mman.h | 3 include/linux/mm.h | 2 mm/nommu.c | 294 ++++++++++++++++++++++++++++++++----------- mm/page_alloc.c | 10 + 8 files changed, 266 insertions(+), 115 deletions(-) diff --git a/fs/binfmt_elf_fdpic.c ...
There's a problem with your alteration to do_munmap() to split VMAs. What happens if an area of a file is mmap'd twice in a process? Consider the situation where two independent parts of a process (say two different libraries) each map an area of the same file. do_mmap_pgoff() will attempt to share these regions if it can. This can lead to you splitting the wrong VMA when someone comes along later to do an apparent partial unmap. I'm currently working on something to give each NOMMU process its own list VMAs rather than holding them in common. This is so that I can add VMA reservation (and more to the point, unreservation) which'll make some stuff easier. David -
Here's a preview of my patch to give each process a separate list of VMAs under NOMMU mode, just as under MMU mode. Could you have a look over it please? Could you also see if you get a memory leak on the blackfin CPU? I see a leak when I use this patch, but I'm not sure whether it's this patch, or whether it's something else in the arch that is suppressed without this patch. As far as I can tell by page counting there shouldn't be a leak. I've taken some of Bernd's patch into this one, and I'll take a bit more when I've done. David [PATCH] NOMMU: Make VMAs per MM as for MMU-mode linux From: David Howells <dhowells@redhat.com> Make VMAs per mm_struct as for MMU-mode linux. This solves the nattch problem for SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. Signed-Off-By: David Howells <dhowells@redhat.com> --- arch/frv/Makefile | 2 arch/frv/kernel/ptrace.c | 11 - fs/binfmt_elf_fdpic.c | 29 -- fs/proc/internal.h | 2 fs/proc/nommu.c | 71 ++-- fs/proc/task_nommu.c | 135 +++++-- include/asm-frv/mmu.h | 1 include/linux/mm.h | 43 +- ipc/shm.c | 12 + mm/nommu.c | 852 +++++++++++++++++++++++++++++++--------------- 10 files changed, 740 insertions(+), 418 deletions(-) diff --git a/arch/frv/Makefile b/arch/frv/Makefile index 9bf7345..038e3a8 100644 --- a/arch/frv/Makefile +++ b/arch/frv/Makefile @@ -88,7 +88,7 @@ ASFLAGS += -mno-fdpic # make sure the .S files get compiled with debug info # and disable optimisations that are unhelpful whilst debugging ifdef CONFIG_DEBUG_INFO -#CFLAGS += -O1 +CFLAGS += -O1 AFLAGS += -Wa,--gdwarf2 ASFLAGS += -Wa,--gdwarf2 endif diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c index 709e9bd..e9af8de 100644 --- a/arch/frv/kernel/ptrace.c +++ b/arch/frv/kernel/ptrace.c @@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno, } ...
There is a leak:
Node 0, zone DMA 20 1 1 1 0 1 1
1 0 0 0 1 2 0
Node 0, zone DMA 32 1 1 0 0 0 1
0 0 0 0 1 2 0
Node 0, zone DMA 47 1 1 1 1 0 0
1 1 1 1 0 2 0
Node 0, zone DMA 62 1 1 0 1 1 1
1 0 1 1 0 2 0
Node 0, zone DMA 77 1 1 1 0 0 1
0 0 1 1 0 2 0
Node 0, zone DMA 92 1 1 0 0 1 0
1 1 0 1 0 2 0
Node 0, zone DMA 107 1 1 1 1 1 1
1 0 0 1 0 2 0
Node 0, zone DMA 122 1 1 0 1 0 1
0 0 0 1 0 2 0
Node 0, zone DMA 137 1 1 1 0 1 0
1 1 1 0 0 2 0
... and so on. It's a strange pattern of fragmentation, as if it keeps
allocating 8k pages and freeing one half of them.
Will play with this some more. Thanks!
Bernd
-
Yes. I found the major leak this morning. There may be a minor leak, but I'm not convinced it's in the mmap stuff. See revised patch. The leak was due to the fact that I wasn't setting the usage count on the 2nd+ pages when I made a >0 order allocation. David --- [PATCH] NOMMU: Make VMAs per MM as for MMU-mode linux From: David Howells <dhowells@redhat.com> Make VMAs per mm_struct as for MMU-mode linux. This solves the nattch problem for SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. Signed-Off-By: David Howells <dhowells@redhat.com> --- arch/frv/Makefile | 2 arch/frv/kernel/ptrace.c | 11 - fs/binfmt_elf_fdpic.c | 29 -- fs/proc/internal.h | 2 fs/proc/nommu.c | 71 ++-- fs/proc/task_nommu.c | 135 +++++-- include/asm-frv/mmu.h | 1 include/linux/mm.h | 43 +- ipc/shm.c | 12 + mm/nommu.c | 852 +++++++++++++++++++++++++++++++--------------- 10 files changed, 740 insertions(+), 418 deletions(-) diff --git a/arch/frv/Makefile b/arch/frv/Makefile index 9bf7345..038e3a8 100644 --- a/arch/frv/Makefile +++ b/arch/frv/Makefile @@ -88,7 +88,7 @@ ASFLAGS += -mno-fdpic # make sure the .S files get compiled with debug info # and disable optimisations that are unhelpful whilst debugging ifdef CONFIG_DEBUG_INFO -#CFLAGS += -O1 +CFLAGS += -O1 AFLAGS += -Wa,--gdwarf2 ASFLAGS += -Wa,--gdwarf2 endif diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c index 709e9bd..e9af8de 100644 --- a/arch/frv/kernel/ptrace.c +++ b/arch/frv/kernel/ptrace.c @@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno, } /* - * check that an address falls within the bounds of the target process's memory mappings + * check that an address falls within the bounds of the target process's memory + * mappings */ static inline int is_user_addr_valid(struct task_struct *child, ...
Oops. That was the old patch. Try this one instead. David [PATCH] NOMMU: Make VMAs per MM as for MMU-mode linux From: David Howells <dhowells@redhat.com> Make VMAs per mm_struct as for MMU-mode linux. This solves the nattch problem for SYSV SHM where nattch for a segment does not reflect the number of shmat's (and forks) done. Signed-Off-By: David Howells <dhowells@redhat.com> --- arch/frv/Makefile | 2 arch/frv/kernel/ptrace.c | 11 - fs/binfmt_elf_fdpic.c | 29 - fs/proc/internal.h | 2 fs/proc/nommu.c | 71 ++-- fs/proc/proc_misc.c | 6 fs/proc/task_nommu.c | 135 ++++--- include/asm-frv/mmu.h | 1 include/linux/mm.h | 47 +- ipc/shm.c | 12 + kernel/fork.c | 4 mm/mmap.c | 10 + mm/nommu.c | 900 +++++++++++++++++++++++++++++++--------------- 13 files changed, 784 insertions(+), 446 deletions(-) diff --git a/arch/frv/Makefile b/arch/frv/Makefile index 9bf7345..038e3a8 100644 --- a/arch/frv/Makefile +++ b/arch/frv/Makefile @@ -88,7 +88,7 @@ ASFLAGS += -mno-fdpic # make sure the .S files get compiled with debug info # and disable optimisations that are unhelpful whilst debugging ifdef CONFIG_DEBUG_INFO -#CFLAGS += -O1 +CFLAGS += -O1 AFLAGS += -Wa,--gdwarf2 ASFLAGS += -Wa,--gdwarf2 endif diff --git a/arch/frv/kernel/ptrace.c b/arch/frv/kernel/ptrace.c index 709e9bd..e9af8de 100644 --- a/arch/frv/kernel/ptrace.c +++ b/arch/frv/kernel/ptrace.c @@ -69,7 +69,8 @@ static inline int put_reg(struct task_struct *task, int regno, } /* - * check that an address falls within the bounds of the target process's memory mappings + * check that an address falls within the bounds of the target process's memory + * mappings */ static inline int is_user_addr_valid(struct task_struct *child, unsigned long start, unsigned long len) @@ -79,11 +80,11 @@ static inline int is_user_addr_valid(struct ...
Note that we've had a similar change in our kernel, and we've had to revert it since the effect on fragmentation is just horrendous. Without this, we could run a complete gcc testsuite on the board; with it we'd run out of high-order pages somewhere in the middle. This probably wants to be dependent on something like MAP_TRIM_EXCESS. Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif -
Hmmm... In the interests of not adding more flags, I wonder if mremap() should be used for that, with a global setting for mmaps made by binfmts as they can't really be shrunk afterwards if they're shareable. David -
Here are some changes to make it more stable on my system. In do_mmap_private, I've commented out the logic to free excess pages, as it fragments terribly and causes a simple while true; do cat /proc/buddyinfo; done loop to go oom. Also, I think you're freeing high-order pages unaligned to their order? In shrink_vma, we must save the mm across calls to remove_vma_from_mm (oops when telnetting into the box). In do_munmap, we can deal with freeing more than one vma. I've not touched the rb-tree logic in the shared file case, as I have no idea what it's trying to do given that only exact matches are allowed. It still does not survive my mmap stress-tester, so I'll keep looking. Why do we need vm_regions for anonymous memory? Wouldn't it be enough to just have a VMA? Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif
I wonder if there's a good heuristic for this. The problem is that whilst not releasing excess pages _may_ seem like a good idea, if your system is something like a single persistent app, then it really is not. For instance, if such an app allocates a byte over 16MB (perhaps implicitly in the binfmt driver), then you'd completely waste a large chunk of RAM. In the Yeah, but some of the pages might still be in use when we want to release I'd generally rather not do this. You can't use MAP_FIXED to request adjacent It makes it simpler to have a common way of allocating memory for both anon regions and file regions. David -
I think it would be good to have a mechanism to group free pages by purpose - so that if we break up a high-order page in order to allocate memory for process A, then the remaining pages remain in a special pool Not following you here. Is it valid to free an order-2 page that's not Adjacent regions can happen by accident, and the uClibc malloc will try to merge free areas when they are adjacent - there's a lot of special-case code in there to prevent this on uClinux systems by essentially duplicating VMA tracking. That's something we want to avoid, because it eats performance (especially in threaded apps due to additional locking). Bernd -- This footer brought to you by insane German lawmakers. Analog Devices GmbH Wilhelm-Wagenfeld-Str. 6 80807 Muenchen Sitz der Gesellschaft Muenchen, Registergericht Muenchen HRB 40368 Geschaeftsfuehrer Thomas Wessel, William A. Martin, Margaret Seif -
