[First send rejected by vger.kernel.org due to HTML and/or test program
attachment. Re-sending without; please contact me for the test program.]

mmap() is slow on MAP_32BIT allocation failure, sometimes causing NPTL's
pthread_create() to run about three orders of magnitude slower. As an
example, in one case creating new threads goes from about 35,000 cycles
up to about 25,000,000 cycles -- which is under 100 threads per second.
Larger stacks reduce the severity of the slowdown but also make the
slowdown happen after allocating a few thousand threads. Costs vary with
platform, stack size, etc., but thread allocation rates drop suddenly on
all of a half-dozen platforms I tried.

The cause is that NPTL allocates stacks with code of the form (e.g.,
glibc 2.7 nptl/allocatestack.c):

    sto = mmap(0, ..., MAP_PRIVATE|MAP_32BIT, ...);
    if (sto == MAP_FAILED)
        sto = mmap(0, ..., MAP_PRIVATE, ...);

That is, try to allocate in the low 4GB, and when low addresses are
exhausted, allocate from any location. Thus, once low addresses run out,
every stack allocation does a failing mmap() followed by a successful
mmap(). The failing mmap() is slow because it does a linear search of
all low-space vma's.

Low-address stacks are preferred because some machines context switch
much faster when the stack address has only 32 significant bits. Slow
allocation was discussed in 2003 but without resolution. See, e.g.,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0321.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0517.html,
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0538.html, and
http://ussg.iu.edu/hypermail/linux/kernel/0305.1/0520.html.
With increasing use of threads, slow allocation is becoming a problem.

Some old machines were faster switching 32b stacks, but new machines
seem to switch as fast or faster using 64b stacks. I measured
thread-to-thread context switches on two AMD processors and five Intel
processors. Tests used the same code with 32b or 64b stack pointers;
tests covered ...
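To make the two-step allocation described in the report concrete, here is a minimal C sketch of the pattern. It is not glibc's code: the helper name alloc_stack_prefer_low() and its used_fallback out-parameter are invented for illustration, and the protection and flag choices are assumptions. Note also that the mmap(2) man page documents MAP_32BIT as confining the mapping to the first 2 GB, whereas the report speaks of the low 4GB.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE             /* exposes MAP_32BIT on x86-64 Linux */
#endif
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_32BIT
#define MAP_32BIT 0             /* assumption: elsewhere the first try is simply unconstrained */
#endif

/*
 * Hypothetical helper mirroring the reported pattern: ask for a low
 * address first, then retry anywhere if that fails. *used_fallback is
 * set to 1 when the second mmap() was needed, which is the slow path
 * the report measures. Returns MAP_FAILED if both attempts fail.
 */
void *alloc_stack_prefer_low(size_t size, int *used_fallback)
{
    const int prot = PROT_READ | PROT_WRITE;
    const int flags = MAP_PRIVATE | MAP_ANONYMOUS;
    void *sto;

    *used_fallback = 0;
    sto = mmap(NULL, size, prot, flags | MAP_32BIT, -1, 0);
    if (sto == MAP_FAILED) {
        /* Low addresses exhausted: every later call pays for the failed search too. */
        *used_fallback = 1;
        sto = mmap(NULL, size, prot, flags, -1, 0);
    }
    return sto;
}
```

Counting how often used_fallback comes back as 1 while creating stacks in a loop would show the point at which the low region fills up and each allocation starts costing two mmap() calls.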
Sigh, unfortunately MAP_32BIT use in 64-bit apps for stacks was
apparently created without foresight about what would happen in the MM
when thread stacks exhaust 4GB.
The problem is that MAP_32BIT is used both as a performance hack for
64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot
just start disregarding MAP_32BIT in the kernel - we'd break 32-bit
compat apps and/or compat 32-bit libraries.
There are various other options to solve the (severe!) performance
breakdown:
1- glibc could stop using MAP_32BIT for 64-bit thread stacks (the
boxes where context-switching is slow probably do not matter all that
much anymore - they were very slow at everything 64-bit anyway)
Pros: easiest solution.
Cons: slows down the affected machines and needs a new glibc.
2- We could introduce a new MAP_64BIT_STACK flag which we could
propagate into MAP_32BIT on those old CPUs. It would be
disregarded on modern CPUs and thread stacks would be 64-bit.
Pros: cleanest solution.
Cons: needs both new glibc and new kernel to take advantage of.
3- We could detect the first-4G-is-full condition and cache it. Problem
is, there will likely be small holes in it so it's rather hard to do
it in a sane way. Also, every munmap() of a thread stack will
invalidate this - triggering a slow linear search every now and then.
Pros: only needs a new kernel to take advantage of.
Cons: is the most complex and messiest solution with no clear
benefit to other workloads. Also, does not 100% solve the
performance problem and prolongs the 4GB stack threads
hack.
i'd go for 1) or 2).
Ingo
--
On Wed, 13 Aug 2008 12:44:45 +0200

I would go for 1) clearly; it's the cleanest thing going forward for
sure.

-- If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
I want to see numbers first. If there are problems visible I definitely
would want to see 2. Andi, at the time I wrote that code, was very
adamant that I use the flag.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
not sure exactly what numbers you mean, but there are lots of numbers in
the first mail, attached below. For example:

| As an example, in one case creating new threads goes from about 35,000
| cycles up to about 25,000,000 cycles -- which is under 100 threads per
| second. Larger stacks reduce the severity of slowdown but also make

being able to create only 100 threads per second brings us back to
33 MHz 386 DX Linux performance.

Ingo

---------------------->
[snip: copy of the original report, identical to the first message]
I mean numbers indicating that it doesn't hurt performance on any of
today's machines. If there are machines where it makes a difference then
we need the flag to indicate the _preference_ for a low stack, as
opposed to indicating a _requirement_.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
there were a few numbers about that as well, and a test-app. The test
app is below. The numbers were:
| I measured thread-to-thread context switches on two AMD processors and
| five Intel processors. Tests used the same code with 32b or 64b stack
| pointers; tests covered varying numbers of threads switched and
| varying methods of allocating stacks. Two systems gave
| indistinguishable performance with 32b or 64b stacks, four gave 5%-10%
| better performance using 64b stacks, and of the systems I tested, only
| the P4 microarchitecture x86-64 system gave better performance for 32b
| stacks, in that case vastly better. Most systems had thread-to-thread
| switch costs around 800-1200 cycles. The P4 microarchitecture system
| had 32b context switch costs around 3,000 cycles and 64b context
| switches around 4,800 cycles.
i find it pretty unacceptable these days that we limit any aspect of
pure 64-bit apps in any way to 4GB (or any other 32-bit-ish limit).
[other than the small execution model which is 2GB obviously.]
Ingo
--------------------->
// switch.cc -- measure thread-to-thread context switch times
// using either low-address stacks or high-address stacks
#include <sys/mman.h>
#include <sys/types.h>
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
const int kRequestedSwaps = 10000;
const int kNumThreads = 2;
const int kRequestedSwapsPerThread = kRequestedSwaps / kNumThreads;
const int kStackSize = 64 * 1024;
const int kTrials = 100;
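typedef long long Tsc;
// Parenthesized so the "- 1" applies to the shifted value, not to the
// shift count ("-" binds tighter than "<<").
#define LARGEST_TSC (static_cast<Tsc>((1ULL << (8 * sizeof(Tsc) - 2)) - 1))
// Read the CPU timestamp counter (x86 rdtsc: low half in eax, high in edx).
Tsc now() {
  unsigned int eax_lo, edx_hi;
  Tsc tsc;
  asm volatile("rdtsc" : "=a" (eax_lo), "=d" (edx_hi));
  tsc = ((Tsc)eax_lo) | ((Tsc)(edx_hi) << 32);
  return tsc;
}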
// Use 0/1 for size to allow array subscripting.
const int pointer_sizes[] = { 32, 64 };
#define SZ_N (sizeof(pointer_sizes) / sizeof(pointer_sizes[0]))
typedef int ...
--
Sure, but if we can pin-point the sub-archs for which it is the problem
then a flag to optionally request it is even easier to handle. You'd
simply ignore the flag for anything but the P4 architecture.

I personally have no problem removing the whole thing because I have no
such machine running anymore. But there are people out there who have.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
i suspect you are talking about option #2 i described. It is the option

hm, i think the set of people running on such boxes _and_ then upgrading
to a new glibc and expecting everything to be just as fast to the
microsecond as before should be minuscule. Those P4 derived 64-bit boxes
were astonishingly painful in 64-bit mode - most of that hw is running
32-bit i suspect, because 64-bit on it was really a joke.

Btw., can you see any problems with option #1: simply removing MAP_32BIT
from 64-bit stack allocations in glibc unconditionally? It's the fastest
to execute and also the most obvious solution. +1 usecs overhead in the
64-bit context-switch path on those old slow boxes won't matter much. 10
_millisecs_ to start a single thread on top-of-the-line hw is quite
unacceptable. (and there's little sane we can do in the kernel about
allocation overhead when we have an imperfectly filled 4GB box for all
allocations)

Ingo
--
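The "+1 usecs" and "10 _millisecs_" figures can be related back to the cycle counts quoted earlier with a little arithmetic. The thread never states a clock rate, so the 2.5 GHz used in the note below is purely an assumption chosen for illustration, and the helper names are invented:

```c
/* Convert a cycle count to microseconds; clock_mhz is cycles per microsecond. */
double cycles_to_usec(double cycles, double clock_mhz)
{
    return cycles / clock_mhz;
}

/* How many thread creations fit into one second at a given cost per creation. */
double threads_per_second(double cycles_per_create, double clock_mhz)
{
    return clock_mhz * 1e6 / cycles_per_create;
}
```

Under the assumed 2.5 GHz clock, cycles_to_usec(25000000, 2500) gives 10,000 usecs (10 msecs) per pthread_create(), and threads_per_second(25000000, 2500) gives 100, so any slower clock lands under 100 threads per second. The P4 penalty of 4,800 - 3,000 = 1,800 cycles per context switch comes to 0.72 usecs at the same assumed clock, in line with the "+1 usecs" estimate.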
Yes, as we both agree, there are still such machines out there.

The real problem is: what to do if somebody complains? If we would have
the extra flag such people could be accommodated. If there is no such
flag then distributions cannot just add the flag (it's part of the
kernel API) and they would be caught between a rock and a hard place.
Option #2 provides the biggest flexibility.

If the upstream kernel truly doesn't care about such machines anymore
there are two options:

- really do nothing at all

- at least reserve a flag in case somebody wants/has to implement
  option #2

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
do nothing at all is not an option - thread creation can take 10 msecs

yeah, i already had a patch for that when i wrote my first mail
[attached below] and listed it as option #4 - then erased the comment
figuring that we'd want to do #1 ;-)

As unimplemented flags just get ignored by the kernel, if this flag goes
into v2.6.27 as-is and is ignored by the kernel (i.e. we just use a
plain old 64-bit [47-bit] allocation), then you could do the glibc
change straight away, correct? So then if people complain we can fix it
in the kernel purely.

how about this then?

Ingo

--------------------->
Subject: mmap: add MAP_64BIT_STACK
From: Ingo Molnar <mingo@elte.hu>
Date: Wed Aug 13 12:41:54 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-x86/mman.h |    1 +
 1 file changed, 1 insertion(+)

Index: linux/include/asm-x86/mman.h
===================================================================
--- linux.orig/include/asm-x86/mman.h
+++ linux/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NORESERVE   0x4000   /* don't check for reservations */
 #define MAP_POPULATE    0x8000   /* populate (prefault) pagetables */
 #define MAP_NONBLOCK    0x10000  /* do not block on IO */
+#define MAP_64BIT_STACK 0x20000  /* give out 32bit addresses on old CPUs */

 #define MCL_CURRENT     1        /* lock all current mappings */
 #define MCL_FUTURE      2        /* lock all future mappings */
--
I think the flag makes sense but its name is confusing - 64BIT for a
flag which means "maybe request 32-bit stack"! Suggest:

+#define MAP_STACK 0x20000 /* 31bit or 64bit address for stack, */
+                          /* whichever is faster on this CPU */

Also, is this _only_ useful for thread stacks, or are there other memory
allocations where 31-bitness affects execution speed on old P4s?

-- Jamie
--
just about anything i guess - but since those CPUs do not really matter
anymore in terms of bleeding-edge performance, what we care about is the
intended current use of this flag: thread stacks.

Ingo

-------------------->
From 4812c2fddc7f5a3a4480d541a4cb2b7e4ec21dcb Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 13 Aug 2008 18:02:18 +0200
Subject: [PATCH] x86: add MAP_STACK mmap flag

as per this discussion:

   http://lkml.org/lkml/2008/8/12/423

Pardo reported that 64-bit threaded apps, if their stacks exceed the
combined size of ~4GB, slow down drastically in pthread_create() -
because glibc uses MAP_32BIT to allocate the stacks. The use of
MAP_32BIT is a legacy hack - to speed up context switching on certain
early model 64-bit P4 CPUs.

So introduce a new flag to be used by glibc instead, to not constrain
64-bit apps like this.

glibc can switch to this new flag straight away - it will be ignored
by the kernel. If those old CPUs ever matter to anyone, support for
it can be implemented.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 include/asm-x86/mman.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/mman.h b/include/asm-x86/mman.h
index c1682b5..e5852b5 100644
--- a/include/asm-x86/mman.h
+++ b/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NORESERVE   0x4000   /* don't check for reservations */
 #define MAP_POPULATE    0x8000   /* populate (prefault) pagetables */
 #define MAP_NONBLOCK    0x10000  /* do not block on IO */
+#define MAP_STACK       0x20000  /* give out 32bit stack addresses on old CPUs */

 #define MCL_CURRENT     1        /* lock all current mappings */
 #define MCL_FUTURE      2        /* lock all future mappings */
--
Actually, I would define the flag as "do whatever is best assuming the
allocation is used for stacks".

For instance, minimally the /proc/*/maps output could show "[user
stack]" or something like this. For security, perhaps, setting of
PROT_EXEC can be prevented.
--
makes sense. Updated patch below. I've also added your Acked-by. Queued
it up in tip/x86/urgent, for v2.6.27 merging.

( also, just to make sure: all Linux kernel versions will ignore such
  extra flags, so you can just update glibc to use this flag
  unconditionally, correct? )

Ingo

--------------------------->
From 2fdc86901d2ab30a12402b46238951d2a7891590 Mon Sep 17 00:00:00 2001
From: Ingo Molnar <mingo@elte.hu>
Date: Wed, 13 Aug 2008 18:02:18 +0200
Subject: [PATCH] x86: add MAP_STACK mmap flag

as per this discussion:

   http://lkml.org/lkml/2008/8/12/423

Pardo reported that 64-bit threaded apps, if their stacks exceed the
combined size of ~4GB, slow down drastically in pthread_create() -
because glibc uses MAP_32BIT to allocate the stacks. The use of
MAP_32BIT is a legacy hack - to speed up context switching on certain
early model 64-bit P4 CPUs.

So introduce a new flag to be used by glibc instead, to not constrain
64-bit apps like this.

glibc can switch to this new flag straight away - it will be ignored
by the kernel. If those old CPUs ever matter to anyone, support for
it can be implemented.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Acked-by: Ulrich Drepper <drepper@gmail.com>
---
 include/asm-x86/mman.h |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/include/asm-x86/mman.h b/include/asm-x86/mman.h
index c1682b5..90bc410 100644
--- a/include/asm-x86/mman.h
+++ b/include/asm-x86/mman.h
@@ -12,6 +12,7 @@
 #define MAP_NORESERVE   0x4000   /* don't check for reservations */
 #define MAP_POPULATE    0x8000   /* populate (prefault) pagetables */
 #define MAP_NONBLOCK    0x10000  /* do not block on IO */
+#define MAP_STACK       0x20000  /* give out an address that is best suited for process/thread stacks */

 #define MCL_CURRENT     1        /* lock all current mappings */
 #define MCL_FUTURE      2        /* lock all future mappings */
--
As soon as the patch hits Linus' tree I can change the code. --
it's upstream now:

| commit cd98a04a59e2f94fa64d5bf1e26498d27427d5e7
| Author: Ingo Molnar <mingo@elte.hu>
| Date:   Wed Aug 13 18:02:18 2008 +0200
|
|     x86: add MAP_STACK mmap flag

thanks everyone,

Ingo
--
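For readers wondering what the userspace side of the new flag might look like, the following is a rough C sketch of allocating a thread stack with MAP_STACK instead of MAP_32BIT. It is not the actual glibc change: alloc_thread_stack() and free_thread_stack() are hypothetical names, and the fallback #define simply reuses the x86 value 0x20000 from the patch above for headers that predate the flag.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_STACK
#define MAP_STACK 0x20000       /* x86 value from the patch above; an assumption on other setups */
#endif

/*
 * Hypothetical helper: one mmap() call, no MAP_32BIT and no retry.
 * Per the thread, a kernel that does not implement MAP_STACK just
 * ignores it. Returns NULL on failure.
 */
void *alloc_thread_stack(size_t size)
{
    void *sto = mmap(NULL, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);
    return sto == MAP_FAILED ? NULL : sto;
}

/* Release a stack obtained from alloc_thread_stack(); returns 0 on success. */
int free_thread_stack(void *sto, size_t size)
{
    return munmap(sto, size);
}
```

A caller managing its own stacks could hand the returned region to pthread_attr_setstack() before pthread_create(). Because the low-address constraint is gone, there is no failing first attempt and therefore no second mmap() on the slow path.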
Ulrich, I don't understand why you worry more about a _potential_ (and fairly unlikely) complaint, than about a real one today. Thinking ahead may be good, but you take it to absolutely ridiculous heights, to the point where you make potential problems be bigger than -actual- problems. Linus --
Of course I care. All I try to do is to prevent going from one extreme
(all focus on P4s) to the other (ignore P4s completely).

Even ignoring this one case here, I think it's in any case useful for
userlevel to tell the kernel that an anonymous memory region is needed
for a stack. This might allow better optimizations and/or security
implementations.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
On Wed, 13 Aug 2008 11:04:29 -0700

(fwiw as far as I know this is only about early 64 bit P4s, not later

yeah maybe we should also tell it we expect it to be used downwards.
Oh wait.. MAP_GROWSDOWN ?

-- If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
--
MAP_GROWSDOWN is unusable because we have to allocate the entire address
range for the stack. Otherwise some other allocation happens in that
range and all of a sudden the stack cannot grow as much as needed
anymore.

These flags really can be removed. They should not be used because they
are outright dangerous.

-- ➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
--
This could also be done entirely in glibc (thus removing the dependency
on the kernel): set the flag if and only if you detect a P4 CPU. You
don't even need to enumerate all the CPUs in the system (which would be
more painful) if you make the CPUID test wide enough that it catches all
compatible CPUs.

-hpa
--
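One way the userspace CPUID test suggested above could look is sketched below in C. This is only an illustration, not code from glibc or the kernel: prefers_low_stack() and cpuid_display_family() are invented names, and treating "Intel, display family 0xF" as the P4 (NetBurst) check is an assumption about what "wide enough" would mean in practice.

```c
#include <string.h>

/*
 * Decode the display family from CPUID leaf 1 EAX: bits 11:8 hold the
 * base family, and bits 27:20 (extended family) are added only when
 * the base family is 0xF.
 */
unsigned cpuid_display_family(unsigned leaf1_eax)
{
    unsigned base = (leaf1_eax >> 8) & 0xf;
    unsigned ext = (leaf1_eax >> 20) & 0xff;
    return base == 0xf ? base + ext : base;
}

/*
 * Hypothetical policy: request low stacks only on Intel family 0xF
 * parts. vendor is the 12-character CPUID vendor string, NUL-terminated.
 */
int prefers_low_stack(const char *vendor, unsigned leaf1_eax)
{
    return strcmp(vendor, "GenuineIntel") == 0 &&
           cpuid_display_family(leaf1_eax) == 0xf;
}
```

On GCC the inputs could be obtained with __get_cpuid() from <cpuid.h> (leaf 0 for the vendor string, leaf 1 for EAX). The vendor check matters because AMD K8 parts also report base family 0xF; whether they should be included is left open in the thread.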
It's not limited to 2GB, there's a fallback to >4GB of course. Ok
admittedly the fallback is slow, but it's there.

I would prefer to not slow down the P4s. There are **lots** of them in
the field. And they still ran 64bit quite well. Also back then I
benchmarked on early K8 and it also made a difference there (but I admit
I forgot the numbers)

I think it would be better to fix the VM because there are other use
cases of applications that prefer to allocate in a lower area. For
example Java JVMs now widely use a technique called pointer compression
where they dynamically adjust the pointer size based on how much memory
the process uses. For that you have to get low memory in the 47bit VM
too. The VM should deal with that gracefully.

To be honest I always thought the linear search in the VMA list was a
little dumb. I'm sure there are other cases where it hurts too. Perhaps
this would be really an opportunity to do something about it :)

-Andi
--
On Wed, 13 Aug 2008 22:42:48 +0200

Yes, the free_area_cache is always going to have failure modes - I think
we've been kind of waiting for it to explode.

I do think that we need an O(log(n)) search in there. It could still be
on the fallback path, so we retain the mostly-O(1) benefits of
free_area_cache.
--
The standard dumb way to do that would be to have two parallel trees,
one to index free space (similar to e.g. the free space btrees in XFS)
and the other to index the objects (like today). That would increase the
constant factor somewhat by bloating the VMAs, increasing cache overhead
etc, and also would be more brute force than elegant. But it would be
simple and straightforward.

Perhaps the combined data structure experience of linux-kernel can come
up with something better and some data structure that allows looking up
both efficiently?

This would be also an opportunity to reevaluate rbtrees for the object
index. One drawback of them is that they are not really optimized to be
cache friendly because their nodes are too small.

-Andi
--
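To illustrate what an O(log(n)) lookup over free space could look like, here is a small C sketch of a free-space index. It is neither the two-tree design described above nor kernel code: it swaps in a static segment tree over a sorted array of gaps, where each node caches the largest gap beneath it, so the lowest sufficiently large gap is found by a single root-to-leaf descent. All names (struct gap_index, gap_index_init, gap_index_find, MAX_GAPS) are invented, and insertion and removal of gaps are left out.

```c
#include <stddef.h>

#define MAX_GAPS 1024                   /* illustration-only capacity */

struct gap_index {
    size_t n;                           /* number of gaps */
    unsigned long start[MAX_GAPS];      /* gap start addresses, ascending */
    unsigned long max[4 * MAX_GAPS];    /* largest gap size under each tree node */
};

/* Fill in max[] for the node covering gaps lo..hi (node 1 is the root). */
static void gap_build(struct gap_index *gi, const unsigned long *size,
                      size_t node, size_t lo, size_t hi)
{
    if (lo == hi) {
        gi->max[node] = size[lo];
        return;
    }
    size_t mid = lo + (hi - lo) / 2;
    gap_build(gi, size, 2 * node, lo, mid);
    gap_build(gi, size, 2 * node + 1, mid + 1, hi);
    gi->max[node] = gi->max[2 * node] > gi->max[2 * node + 1]
                  ? gi->max[2 * node] : gi->max[2 * node + 1];
}

/* Build the index from n gaps sorted by address; returns 0 on success. */
int gap_index_init(struct gap_index *gi, const unsigned long *start,
                   const unsigned long *size, size_t n)
{
    if (n == 0 || n > MAX_GAPS)
        return -1;
    gi->n = n;
    for (size_t i = 0; i < n; i++)
        gi->start[i] = start[i];
    gap_build(gi, size, 1, 0, n - 1);
    return 0;
}

/*
 * Find the lowest-addressed gap of at least len bytes. Returns 1 and
 * stores its start in *addr, or returns 0 if no gap is big enough.
 * A failing request is answered at the root in O(1), which is the
 * case that costs a full linear scan in the report.
 */
int gap_index_find(const struct gap_index *gi, unsigned long len,
                   unsigned long *addr)
{
    size_t node = 1, lo = 0, hi;

    if (gi->n == 0 || gi->max[1] < len)
        return 0;
    hi = gi->n - 1;
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (gi->max[2 * node] >= len) {     /* prefer the lower half */
            node = 2 * node;
            hi = mid;
        } else {
            node = 2 * node + 1;
            lo = mid + 1;
        }
    }
    *addr = gi->start[lo];
    return 1;
}
```

A real VMA index would have to keep the cached maxima up to date on every mmap() and munmap(), which is where the bloat and constant-factor concerns raised above come in.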
Of course - what you are missing is that _10 milliseconds_ thread
creation overhead is completely unacceptable overhead: it is so bad as

Nonsense, i had such a P4 based 64-bit box and it was painful. Everyone
with half a brain used them as 32-bit machines. Nor is the
context-switch overhead in any way significant. Plus, as Arjan mentioned

that's a lot of handwaving with no actual numbers. The numbers in this
discussion show that the context-switch overhead is small and that the
overhead on perfectly good systems that hit this limit is obscurely
high.

I'd love to zap MAP_32BIT this very minute from the kernel, but you
originally shaped the whole thing in such a stupid way that makes its
elimination impossible now due to ABI constraints. It would have cost
you _nothing_ to have added MAP_64BIT_STACK back then, but the quick &
sloppy solution was to reuse MAP_32BIT for 64-bit tasks. And you are
stupid about it even now. Bleh.

The correct solution is to eliminate this flag from glibc right now, and
maybe add the MAP_64BIT_STACK flag as well, as i posted it - if anyone
with such old boxes still cares (i doubt anyone does). That flag then
will take its usual slow route.

Ulrich?

Ingo
--
MAP_32BIT was not actually added for this originally. It was originally
added for the X server's old dynamic loader, which needed 2GB memory.

Not sure what the semantics of that would be. For me it would seem ugly
to hardcode specific semantics in the kernel for this ("mechanism not
policy")

But for most possible semantics I can think of the data structure would
still

IMHO the correct solution is to fix the data structure to not have such
a bad complexity in this corner case. We typically do this for all other
data structures as we discover such cases. No reason the VMAs should be
any different.

-Andi
--
