"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:
"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."
Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes, "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game", creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."
Andrew Morton replied to a commit message making 4k stacks the default, saying, "this patch will cause kernels to crash." Ingo Molnar replied, "what mainline kernels crash and how will they crash? Fedora and other distros have had 4K stacks enabled for years." He added, "we've conducted tens of thousands of bootup tests with all sorts of drivers and kernel options enabled and have yet to see a single crash due to 4K stacks." During the lengthy discussion it was suggested that nfs+xfs+raid kernel configurations, and using ndiswrapper are the most common reasons for overflowing a 4K stack size.
Andi Kleen questioned the usefulness of 4k stacks, "as far as I can figure out they are not [a worthy goal]. They might have been a worthy goal on crappy 2.4 VMs, but these times are long gone." Arjan van de Ven suggested that though the 2.6 VM is much improved over the 2.4 VM, fragmentation with 8K stacks remains an unsolvable problem, "it's basic math; the Linux VM gets to deal with both short and long lasting allocations; no matter how hard you try to get some degree of fragmentation; especially due to the 15:1 acceleration you get due to the lowmem issue. And before you say 'you should use 64 bit on such machines'; I would love it if more people used 64 bit linux. Sadly the adoption rate of that is not very good still.... by far ;(" In another email, Arjan listed two advantages to 4K stacks, "1) less memory consumption in the lowmem zone (critical for enterprise use, also good for general performance), and 2) kernel stacks at 8K are one of the most prominent order-1 allocations in the kernel; again with big-memory systems the fragmentation of the lowmem zone is a problem (and the distros that ship 4K stacks went there because of customer complaints)".
"It's been long promised, but there it is now," began Linux creator Linus Torvalds, announcing the 2.6.25 Linux kernel. He continued, "special thanks to Ingo who found and fixed a nasty-looking regression that turned out to not be a regression at all, but an old bug that just had not been triggering as reliably before. That said, that was just the last particular regression fix I was holding things up for, and it's not like there weren't a lot of other fixes too, they just didn't end up being the final things that triggered my particular worries." Linus added:
"The full changelog from 2.6.24 is 7.5M, with a 12MB compressed patch. Tons and tons has changed, but if you've been following the -rc releases, you'll already know about the big things. The changes from the last rc (-rc9) are fairly small and mostly pretty trivial, and the shortlog is appended. So it's mostly one-liners, with some updates to drivers (net and usb) and to networking that are a bit larger (although a number of the driver updates are things like just new ID's etc)."
More information about the latest release can be found on the KernelNewbies Linux 2.6.25 wiki page.
"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:
"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."
Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."
"Anyone who can correctly guess the method with which i found the exact place that corrupted memory will get a free beer next time we meet :-)"
"Is it your argument that consistent coding style is bad? That argument has been settled long ago with the creation of Documentation/CodingStyle. All kernel code is supposed to follow that style, unless the resulting line of code looks clearly _wrong_. Your arguments seem to center around ''hey, my way it looks similarly good so i'll do that instead because i'm the maintainer'' - and that argument does not fly. CodingStyle is definitely not gospel and common sense should be applied, but _arbitrarily_ and _intentionally_ deviating from it is considered bad manners and hurts Linux as a whole."
"While this is probably one of the last days of the merge window, please still consider pulling the 'kgdb light' git tree," began Ingo Molnar, explaining:
"This is a slimmed-down and cleaned up version of KGDB that i've created out of the original patches that we submitted two weeks ago. I went over the kgdb patches with Thomas and we cut out everything that we did not like, and cleaned up the result. KGDB is still just as functional as it was before (i tested it on 32-bit and 64-bit x86) - and any desired extra capability or complexity should be added as a delta improvement, not in this initial merge."
Ingo noted that the previous merge request modified 41 files, while this new merge request modifies only 22 files. Among the changes, he highlighted, "removed _all_ critical path impact, even if KGDB is enabled and active; removed all the lowlevel serial drivers; added a redesigned and cleaned up version of the 'KGDB over polled consoles' approach; removed the longjump code; removed the module symbol hacks; removed the GTOD/clocksource hacks; removed the softlockup hacks; removed the toplevel Makefile changes; removed the might_sleep scheduler hack; and did lots of other cleanups and rewrites as well." Ingo summarized, "as a result, this kgdb series has _obviously_ zero impact on the kernel, because it just does not touch any dangerous codepath. From this point on KGDB can evolve in small, well-controlled baby steps, as all other kernel code as well. And the resulting kgdb is still very functional: it can still break into a kernel (via SysRq-G), can catch crashes, can single-step, etc. It's already a quite usable first step."
"With a lot of help from Ingo Molnar and Pekka Enberg over the last couple of weeks, we've been able to produce a new version of kmemcheck!" announced Vegard Nossum, adding, "the current version of the patch boots on real hardware, but we've seen freezes on some machines, so it's not perfect yet. (In other words, this patch is HIGHLY experimental, and run at your own risk, etc.)". He also offered a high level summary of the patch:
"kmemcheck is a patch to the linux kernel that detects use of uninitialized memory. It does this by trapping every read and write to memory that was allocated dynamically (e.g. using kmalloc()). If a memory address is read that has not previously been written to, a message is printed to the kernel log."
Ingo Molnar credited the new patch with already finding 4 kernel bugs, and offered some more insights into how the patch works, and why it's useful, "it should also be made clear that not only does kmemcheck consume half of the RAM to do byte granular tracking of the other half of RAM, it's also slow, very slow, because almost every kernel-space instruction will generate a pagefault and then it will be single-stepped and it takes a debug fault as well. That's of course totally crazy, but that's also OK and it's what makes the feature so interesting and powerful."
"[The] lkml is the right mailing list for reporting Linux bugs. This is an extremely harmful trend I've seen lately: some kernel hackers going out on a limb directing the flow of bugreports _away_ from lkml, by suggesting to testers that lkml is somehow inappropriate for reporting Linux kernel bugs."
Ingo Molnar summarized his pull request for changes to the x86 architecture bound for mainline inclusion in 2.6.25 noting, "it's not a small merge, it consists of 908 commits from 96 individual arch/x86 developers (!)". He continued, "a number of core files are changed as well: most notably percpu, debugging details, timers, the firewire remote debugging patch and ... the KGDB remote debugging stub in kernel/kgdb.c." He went on to detail the extent of the testing this tree has received, "in the past few weeks tens of thousands of random x86.git bzImages were successfully built and booted on a number of (commodity) 32-bit and
64-bit testsystems - and there has been a fair amount of test exposure on -mm as well." Regarding the remote kernel debugger, Ingo explained:
"We tested KGDB to be merge-worthy within the x86 architecture (the only supported architecture for now) and it's better to have kernel/kgdb.c than arch/x86/kernel/kgdb.c. The code is reasonably clean and the user-space exposure is small - the only real exposure is the decades-old remote GDB protocol. We are happy to fix up any further cleanliness comments that people might have - but we really wanted to start somewhere and get this thing moving. As an added bonus: finally a kernel debugger that can be read without puking too much ;-) [anyone remember KDB?]"
Ingo Molnar posted a merge request for the latest git scheduler tree summarizing, "it contains various enhancements to the scheduler - find the full shortlog is below. 96 commits from 19 authors - scheduler developers have been busy again. :-/" He added, "the scheduling behavior of the kernel to normal users should not change over v2.6.24, but there are a good number of new features and enhancements under the hood." Ingo went on to list a number of these new features, including:
"Various instrumentation and debugging enhancements from Arjan van de Ven; Peter Zijlstra's RT time limit and RT throttling code for the RT scheduling class; Paul E. McKenney's preemptible RCU code; refcount based CPU-hotplug rework by Gautham R Shenoy; there's serious interest in running RT tasks on enterprise-class hardware, so Steven Rostedt and Gregory Haskins wrote a large number of enhancements to the RT scheduling class and load-balancer; Peter Zijlstra's high-resolution scheduler tick code; [...] and a good number of other, smaller enhancements."
The final 2.6.24 Linux kernel is expected any day now, so the various subsystem maintainers have begun summarizing what changes are expected to be merged into the mainline kernel during the 2.6.25 merge window. Ingo Molnar spoke to changes for the x86 architecture, "there are 763 commits in x86.git so far, from more than 90 contributors, so it would be difficult to mention and credit every contribution in this mail." Along with a lengthy list of other changes, he included:
"Continued, intense arch/x86 unification and cleanup work by lots of people; FIFO ticket spinlocks for better spinlock scalability; 'regset' generalizations - the most important step towards utrace support (==next-gen ptrace); support for more than 255 CPUs [up to 4096 - in theory up to 65535]; almost complete 64-bit paravirt guest support; KGDB support on x86, finally!"
"'Too small' and 'too insignificant' is not a patch attribute that is in my vocabulary - by definition the best patches are very small and very insignificant (they just happen to end up doing something cool a 1000 steps later ;-) 99% of our problems come from 'too large' and 'too intrusive' patches."
Ingo Molnar announced that version 24 of his Completely Fair Scheduler patch is now available backported to the 2.6.24-rc3, 2.6.23.8, 2.6.22.13, and 2.6.21.7 kernels. He noted that there have been significant changes since the last backport, "36 files changed, 2359 insertions(+), 1082 deletions(-). That's 187 individual commits from 32 authors." Ingo noted, "99% of these changes are already upstream in Linus's git tree and they will be released as part of v2.6.24. (there are 4 pending commits that are in the small 2.6.24-rc3-v24 patch.)" He also highlighted some of the more significant improvements:
"Improved interactivity via Peter Ziljstra's 'virtual slices' feature. As load increases, the scheduler shortens the virtual timeslices that tasks get, so that applications observe the same constant latency for getting on the CPU. (This goes on until the slices reach a minimum granularity value).
"CONFIG_FAIR_USER_SCHED is now available across all backported kernels and the per user weights are configurable via /sys/kernel/uids/. Group scheduling got refined all around."
Ingo Molnar sent a merge request to Linus Torvalds for the latest CFS fixes. CFS, the Completely Fair Scheduler, was merged into the mainline Linux kernel in July of 2007. It was first included in the 2.6.23 kernel, released in October of 2007. The scheduler appears to be quickly stabilizing, visible in the minimal assortment of fixes contained in the latest source code push. Ingo Molnar summarized the changes:
"There are two cross-subsystem groups of fixes: three commits that resolve a KVM build fix on !SMP - acked by Avi to go in via the scheduler git tree because it changes a central include file. The other one is a powerpc CPU time accounting regression fix from Paul Mackerras.
"The remaining 14 commits: one crash fix (only triggerable via the new control-groups filesystem), a delay-accounting regression fix, two performance regression fixes, a latency fix, two small scheduling-behavior regression fixes and seven cleanups."