Unknown mailing list, search.

Ingo Molnar

64-bit Application Thread Creation Performance

Submitted by Jeremy
on August 18, 2008 - 3:51pm

A recent discussion on the Linux Kernel mailing list noted that threaded 64-bit applications suffer a drastic slowdown in pthread_create performance when stack utilization goes above 4GB. Ingo Molnar offered an explanation of the problem, "unfortunately MAP_32BIT use in 64-bit apps for stacks was apparently created without foresight about what would happen in the MM when thread stacks exhaust 4GB. The problem is that MAP_32BIT is used both as a performance hack for 64-bit apps and as an ABI compat mechanism for 32-bit apps. So we cannot just start disregarding MAP_32BIT in the kernel - we'd break 32-bit compat apps and/or compat 32-bit libraries." The original report noted that once the shared stack goes above 4GB in size, thread creation can take as long as 10 milliseconds, a slowdown described as "quite unacceptable".

Ingo created a patch introducing a new MAP_STACK flag for glibc to be used instead of MAP_32BIT and avoid imposing the 32-bit performance limitation on threaded 64-bit applications. He noted, "glibc can switch to this new flag straight away - it will be ignored by the kernel." The new flag was quickly merged upstream, and changes were planned for glibc.

Quote: In A Few Years We'll Feel Some Sort Of Crunch

Submitted by Jeremy
on June 28, 2008 - 2:43pm

"This reduces native kernel max memory support from around 127 TB to around 120 TB. We also limit the Xen hypervisor to ~7 TB of physical memory - is that wise in the long run?

Quote: Random Kernel Boots

Submitted by Jeremy
on June 4, 2008 - 8:41am

"These random kernel boots found many 'impossible to trigger' bugs and races in the past. The reason for its race finding capability is the timing randomness of the resulting random kernel image: the delays caused by random combination of debugging facilities, build variants, kernel subsystem variants we have."

Removing the Big Kernel Lock

Submitted by Jeremy
on May 15, 2008 - 11:52am
Linux news

"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:

"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."

Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes, "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game", creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."

Defaulting To 4K Stacks

Submitted by Jeremy
on April 22, 2008 - 9:42pm
Linux news

Andrew Morton replied to a commit message making 4k stacks the default, saying, "this patch will cause kernels to crash." Ingo Molnar replied, "what mainline kernels crash and how will they crash? Fedora and other distros have had 4K stacks enabled for years." He added, "we've conducted tens of thousands of bootup tests with all sorts of drivers and kernel options enabled and have yet to see a single crash due to 4K stacks." During the lengthy discussion it was suggested that nfs+xfs+raid kernel configurations, and using ndiswrapper are the most common reasons for overflowing a 4K stack size.

Andi Kleen questioned the usefulness of 4k stacks, "as far as I can figure out they are not [a worthy goal]. They might have been a worthy goal on crappy 2.4 VMs, but these times are long gone." Arjan van de Ven suggested that though the 2.6 VM is much improved over the 2.4 VM, fragmentation with 8K stacks remains an unsolvable problem, "it's basic math; the Linux VM gets to deal with both short and long lasting allocations; no matter how hard you try to get some degree of fragmentation; especially due to the 15:1 acceleration you get due to the lowmem issue. And before you say 'you should use 64 bit on such machines'; I would love it if more people used 64 bit linux. Sadly the adoption rate of that is not very good still.... by far ;(" In another email, Arjan listed two advantages to 4K stacks, "1) less memory consumption in the lowmem zone (critical for enterprise use, also good for general performance), and 2) kernel stacks at 8K are one of the most prominent order-1 allocations in the kernel; again with big-memory systems the fragmentation of the lowmem zone is a problem (and the distros that ship 4K stacks went there because of customer complaints)".

2.6.25, "Long Promised"

Submitted by Jeremy
on April 17, 2008 - 6:48am
Linux news

"It's been long promised, but there it is now," began Linux creator Linus Torvalds, announcing the 2.6.25 Linux kernel. He continued, "special thanks to Ingo who found and fixed a nasty-looking regression that turned out to not be a regression at all, but an old bug that just had not been triggering as reliably before. That said, that was just the last particular regression fix I was holding things up for, and it's not like there weren't a lot of other fixes too, they just didn't end up being the final things that triggered my particular worries." Linus added:

"The full changelog from 2.6.24 is 7.5M, with a 12MB compressed patch. Tons and tons has changed, but if you've been following the -rc releases, you'll already know about the big things. The changes from the last rc (-rc9) are fairly small and mostly pretty trivial, and the shortlog is appended. So it's mostly one-liners, with some updates to drivers (net and usb) and to networking that are a bit larger (although a number of the driver updates are things like just new ID's etc)."

More information about the latest release can be found on the KernelNewbies Linux 2.6.25 wiki page.

Memory Corruption Bug Solved, 2.6.25 Expected Today

Submitted by Jeremy
on April 16, 2008 - 8:55am
Linux news

"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:

"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."

Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."

Quote: Get A Free Beer

Submitted by Jeremy
on April 15, 2008 - 10:30pm

"Anyone who can correctly guess the method with which i found the exact place that corrupted memory will get a free beer next time we meet :-)"

Quote: Consistent Coding Style

Submitted by Jeremy
on March 28, 2008 - 7:27am

"Is it your argument that consistent coding style is bad?

Kgdb Light

Submitted by Jeremy
on February 9, 2008 - 10:08pm
Linux news

"While this is probably one of the last days of the merge window, please still consider pulling the 'kgdb light' git tree," began Ingo Molnar, explaining:

"This is a slimmed-down and cleaned up version of KGDB that i've created out of the original patches that we submitted two weeks ago. I went over the kgdb patches with Thomas and we cut out everything that we did not like, and cleaned up the result. KGDB is still just as functional as it was before (i tested it on 32-bit and 64-bit x86) - and any desired extra capability or complexity should be added as a delta improvement, not in this initial merge."

Ingo noted that the previous merge request modified 41 files, while this new merge request modifies only 22 files. Among the changes, he highlighted, "removed _all_ critical path impact, even if KGDB is enabled and active; removed all the lowlevel serial drivers; added a redesigned and cleaned up version of the 'KGDB over polled consoles' approach; removed the longjump code; removed the module symbol hacks; removed the GTOD/clocksource hacks; removed the softlockup hacks; removed the toplevel Makefile changes; removed the might_sleep scheduler hack; and did lots of other cleanups and rewrites as well." Ingo summarized, "as a result, this kgdb series has _obviously_ zero impact on the kernel, because it just does not touch any dangerous codepath. From this point on KGDB can evolve in small, well-controlled baby steps, as all other kernel code as well. And the resulting kgdb is still very functional: it can still break into a kernel (via SysRq-G), can catch crashes, can single-step, etc. It's already a quite usable first step."

Debugging With kmemcheck

Submitted by Jeremy
on February 8, 2008 - 12:23pm
Linux news

"With a lot of help from Ingo Molnar and Pekka Enberg over the last couple of weeks, we've been able to produce a new version of kmemcheck!" announced Vegard Nossum, adding, "the current version of the patch boots on real hardware, but we've seen freezes on some machines, so it's not perfect yet. (In other words, this patch is HIGHLY experimental, and run at your own risk, etc.)". He also offered a high level summary of the patch:

"kmemcheck is a patch to the linux kernel that detects use of uninitialized memory. It does this by trapping every read and write to memory that was allocated dynamically (e.g. using kmalloc()). If a memory address is read that has not previously been written to, a message is printed to the kernel log."

Ingo Molnar credited the new patch with already finding 4 kernel bugs, and offered some more insights into how the patch works, and why it's useful, "it should also be made clear that not only does kmemcheck consume half of the RAM to do byte granular tracking of the other half of RAM, it's also slow, very slow, because almost every kernel-space instruction will generate a pagefault and then it will be single-stepped and it takes a debug fault as well. That's of course totally crazy, but that's also OK and it's what makes the feature so interesting and powerful."

Quote: Reporting Linux Bugs

Submitted by Jeremy
on February 4, 2008 - 9:55pm

"[The] lkml is the right mailing list for reporting Linux bugs. This is an extremely harmful trend I've seen lately: some kernel hackers going out on a limb directing the flow of bugreports _away_ from lkml, by suggesting to testers that lkml is somehow inappropriate for reporting Linux kernel bugs."

x86 Architecture Merges in 2.6.25

Submitted by Jeremy
on February 1, 2008 - 5:36pm
Linux news

Ingo Molnar summarized his pull request for changes to the x86 architecture bound for mainline inclusion in 2.6.25 noting, "it's not a small merge, it consists of 908 commits from 96 individual arch/x86 developers (!)". He continued, "a number of core files are changed as well: most notably percpu, debugging details, timers, the firewire remote debugging patch and ... the KGDB remote debugging stub in kernel/kgdb.c." He went on to detail the extent of the testing this tree has received, "in the past few weeks tens of thousands of random x86.git bzImages were successfully built and booted on a number of (commodity) 32-bit and
64-bit testsystems - and there has been a fair amount of test exposure on -mm as well.
" Regarding the remote kernel debugger, Ingo explained:

"We tested KGDB to be merge-worthy within the x86 architecture (the only supported architecture for now) and it's better to have kernel/kgdb.c than arch/x86/kernel/kgdb.c. The code is reasonably clean and the user-space exposure is small - the only real exposure is the decades-old remote GDB protocol. We are happy to fix up any further cleanliness comments that people might have - but we really wanted to start somewhere and get this thing moving. As an added bonus: finally a kernel debugger that can be read without puking too much ;-) [anyone remember KDB?]"

Scheduler Merges for 2.6.25

Submitted by Jeremy
on January 26, 2008 - 11:02am
Linux news

Ingo Molnar posted a merge request for the latest git scheduler tree summarizing, "it contains various enhancements to the scheduler - find the full shortlog is below. 96 commits from 19 authors - scheduler developers have been busy again. :-/" He added, "the scheduling behavior of the kernel to normal users should not change over v2.6.24, but there are a good number of new features and enhancements under the hood." Ingo went on to list a number of these new features, including:

"Various instrumentation and debugging enhancements from Arjan van de Ven; Peter Zijlstra's RT time limit and RT throttling code for the RT scheduling class; Paul E. McKenney's preemptible RCU code; refcount based CPU-hotplug rework by Gautham R Shenoy; there's serious interest in running RT tasks on enterprise-class hardware, so Steven Rostedt and Gregory Haskins wrote a large number of enhancements to the RT scheduling class and load-balancer; Peter Zijlstra's high-resolution scheduler tick code; [...] and a good number of other, smaller enhancements."

x86 Architecture Changes Merging in 2.6.25

Submitted by Jeremy
on January 22, 2008 - 3:11am
Linux news

The final 2.6.24 Linux kernel is expected any day now, so the various subsystem maintainers have begun summarizing what changes are expected to be merged into the mainline kernel during the 2.6.25 merge window. Ingo Molnar spoke to changes for the x86 architecture, "there are 763 commits in x86.git so far, from more than 90 contributors, so it would be difficult to mention and credit every contribution in this mail." Along with a lengthy list of other changes, he included:

"Continued, intense arch/x86 unification and cleanup work by lots of people; FIFO ticket spinlocks for better spinlock scalability; 'regset' generalizations - the most important step towards utrace support (==next-gen ptrace); support for more than 255 CPUs [up to 4096 - in theory up to 65535]; almost complete 64-bit paravirt guest support; KGDB support on x86, finally!"