Ralf Baechle posted the Linux/MIPS architecture merge plans for the upcoming 2.6.24 kernel. The diffstat for all changes showed, "435 files changed, 14274 insertions(+), 10196 deletions(-)", about which Ralf noted, "the number of patch lines and files is inflated by two large whitespace cleanup patches." He continued:
"The biggest actual changes are the support for tickless kernels on MIPS and the rewrite for many of the timer devices previously used as clocksources and clockevents. Various cleanups, including some moving of code and support for 32-bit Broadcom BCM47XX processors, the return of support for LASAT which isn't quite as unused as previously thought."
"Last month, at the kernel summit, there was discussion of putting a Reviewed-by: tag onto patches to document the oversight they had received on their way into the mainline," began Jonathan Corbet in an effort to define the meaning of the recently introduced
reviewed-by tag. He continued, "that tag has made an occasional appearance since then, but there has not yet been a discussion of what it really means. So it has not yet brought a whole lot of value to the process."
In the continued discussion, it was requested that all commit tags be defined, prompting Jonathan to update his documentation to include Signed-off-by, Acked-by, Cc, and Tested-by along with his documentation for Reviewed-by. He offered the following definition for the new Reviewed-by tag:
"The patch has been reviewed and found acceptible according to the Reviewer's Statement as found at the bottom of this file. A Reviewed-by tag is a statement of opinion that the patch is an appropriate modification of the kernel without any remaining serious technical issues. Any interested reviewer (who has done the work) can offer a Reviewed-by tag for a patch."
"15 partitions (at least for sd_mod devices) are too few," Jan Engelhardt suggested along with a patch to try and make the mounting of an unlimited number of partitions possible. H. Peter Anvin proposed as an alternative, "now when we have 20-bit minors, can't we simply recycle some of the higher bits for additional partitions, across the board? 63 partitions seem to have been sufficient; at least I haven't heard anyone complain about that for 15 years."
Alan Cox explained, "this was proposed ages ago. Al Viro vetoed sparse minors and it has been stuck this way ever since. If you have > 15 partitions use device mapper for it. I'd prefer it fixed but it's arguable that device mapper is the right way to punt all our partitioning to userspace".
Paul Jackson described a new per-cpuset flag called 'sched_load_balance', "when enabled in a cpuset (the default value) it tells the kernel scheduler that the scheduler should provide the normal load balancing on the CPUs in that cpuset, sometimes moving tasks from one CPU to a second CPU if the second CPU is less loaded and if that task is allowed to run there. When disabled (write '0' to the file) then it tells the kernel scheduler that load balancing is not required for the CPUs in that cpuset." Paul went on to explain why the feature is useful:
"1) It provides a mechanism for real time isolation of some CPUs, and
"2) it can be used to improve performance on systems with many CPUs by supporting configurations in which load balancing is not done across all CPUs at once, but rather only done in several smaller disjoint sets of CPUs."
Jeff Garzik posted a series of five patches for the forcedeth driver which he described as, "several proposed updates for testing". Forcedeth is a GPL'd driver for the Ethernet interface of the NVIDIA nForce chipset, originally merged into the 2.4.26 and 2.6.5 Linux kernels. Jeff noted two main goals for the patches:
"1) move the driver towards a more sane, simple, easy to verify locking setup -- irq handler would often acquire/release the lock twice for each interrupt.
"2) to eliminate a rarely used, apparently fragile locking scheme that includes heavy use of disable_irq(). this tool is most often employed during NIC reset/reconfiguration, so satisfying this goal implies changing the way NIC reset and config are accomplished."
Jeff explained that he was looking to get the changes tested, "these are intended for feedback and testing, NOT for merging." He went on to explain that one of the changes included two independent napi_structs, one for receiving and one transmitting, "I feel TX NAPI is a useful tool, because it provides an independent TX process control point and system load feedback point. Thus I felt this was slightly superior to tasklets. But who knows if this is a good idea? :) I am interested in feedback and criticism on this issue."
When a Linux user reported a repeatedly high load average on an idle server, tracking the problem to a specific patch labeled, "user of the jiffies rounding code", Andrew Morton replied, "this is unexpected. High load average is due to either a task chewing a lot of CPU time or a task stuck in uninterruptible sleep." Linus Torvalds disagreed, explaining:
"We saw high loadaverages with the timer bogosity with 'gettimeofday()' and 'select()' not agreeing, so they would do things like
'date = time(..); select(.. , timeout = );'and when 'date' wasn't taking the jiffies offset into account, and thus mixing these kinds of different time sources, the select ended up returning immediately because they effectively used different clocks, and suddenly we had some applications chewing up 30% CPU time, because they were in a loop that *tried* to sleep."
Linus offered what he described as an "idiotic patch" to cause the load average to not be calculated exactly once every 5 seconds to prevent it from being in sync with something else waking up every 5 seconds, noting, "the load average is not calculated every tick, because that's not just expensive, but we also want to have some time-based decay." Arjan van de Ven pointed out that this shouldn't help, "I mean, the load gets only updated in actual timer interrupts... and on a tickless system there's very few of those around..... and usually at places round_jiffies() already put a timer on." Linus agreed with this reasoning, suggesting, "maybe Anders' problem stems partly from the fact that he really is using the tweaks to make that tickless theory more true than it tends to be on most systems?" Arjan pointed out that a lot of work has been successful in making tickless kernels wake up less, "we fixed a TON of stuff over the last months.. standard desktops (F8 / next Ubuntu) will be around 10 wakeups/sec, in a lab environment you can get below 2 ;)"
The OpenBSD project maintains a six month release cycle, with the upcoming 4.2 release officially scheduled for November 1'st. Each release includes a song relevant to current issues faced by the project. For this release the song is titled "100001 1010101", about which OpenBSD creator Theo de Raadt notes, "it is designed to sound like a mid-era Rush song, ie. something from Grace Under Pressure or such. And there's a few easter eggs hidden in the song as well. It also explains the inside sleeve image..." The referenced image shows a marathon between some of the different operating system mascots, running a a race through often hostile looking surroundings, fraught with distractions. Toward the bottom is an obvious reference to the recent issue of relicensing BSD code under the GPL, in which Puffy, the OpenBSD mascot, shows a map to Tux, the Linux mascot, and the latter takes off with it. The OpenBSD lyrics page explains that BSD code is shared with all, even non-open-sourced projects who respect the license and frequently return code, "we fully admit that some BSD licensed software has been taken and used by many commercial entities, but contributions come back more often than people seem to know, and when they do, they're always still properly attributed to the original authors, and given back in the same spirit that they were given in the first place." Theo noted, "that's the best we can expect from companies," going on to add, "but we can expect more from projects who talk about sharing -- such as the various Linux projects." He explained:
"Now rather than seeing us as friends who can cooperatively improve all codebases, we are seen as foes who oppose the GPL. The participants of "the race" are being manipulated by the FSF and their legal arm, the SFLC, for the FSF's aims, rather than the goal of getting good source into Linux (and all other code bases). We don't want this to come off as some conspiracy theory, but we simply urge those developers caution -- they should ensure that the path they are being shown by those who have positioned themselves as leaders is still true. Run for yourself, not for their agenda.
"The Race is there to be run, for ourselves, not for others. We do what we do to run our own race, and finish it the best we can. We don't rush off at every distraction, or worry how this will affect our image. We are here to have fun doing right."
Roland Dreier posted an updated overview of his InfiniBand driver merge plans during the upcoming 2.6.24-rc1 merge window. The InfiniBand driver was first merged into the 2.6.11 Linux kernel. Roland noted, "since 2.6.23 still isn't out, and I've managed to reduce my patch review backlog a bit, it's probably a good idea to give another update about what I have queued for 2.6.24 already and what I hope to get to before the merge window opens." His description of the upcoming merge queue is broken into three sections, one for each subdirectory within
drivers/infiniband. In addition to listing all patches intended for 2.6.24, Roland also listed two issues that would most likely wait for 2.6.25:
"1) Multiple CQ event vector support. I haven't seen any discussions about how ULPs or userspace apps should decide which vector to use, and hence no progress has been made since we deferred this during the 2.6.23 merge window.
"2) XRC. Given the length of the backlog above and the fact that a first draft of this code has not been posted yet, I don't see any way that we could have something this major ready in time."
"The following patch makes it possible to give kernel messages a selectable color which helps to distinguish it from other noise, such as boot messages," explained Jan Engelhardt, along with a 43 line patch to the
char driver. As justification for the patch he offered, "NetBSD has it, OpenBSD has it, FreeBSD to some extent, so I think Linux should too." He also noted that an earlier version of the small patch had been previously posted back in April.
Ingo Molnar responded favorably, "looks really good to me!" He went on to suggest, "feature request: would be interesting to have a color table (defined in the .config) dependent on message loglevel. That way KERN_CRIT messages could be red, KERN_INFO ones white, etc."
"This patch series also introduces a new small shell script wrapper, called scripts/checkfiles, that checks the compliance of source files (not in diff format); the script can operate on any mix of files and directories (recursively checking)," Erez Zadok stated describing a patch posted to the Linux Kernel mailing list. The small script works by recursively walking through a specified directory, running checkpatch on a diff of each source file against
/dev/null. Erez added, "I ran the above script and it found nearly 1.5 million reported warnings/errors, with drivers being the largest abuser, not surprisingly." He then offered a breakdown of the warnings within each top level directory, summarizing, "1,464,313 total, any volunteers? :-)"
A bug report filed by Ingo Molnar regarding a procfs crash in the recently released 2.6.23-rc9 kernel was quickly tracked down by Linus Torvalds as a compiler bug. The bug was ultimately determined to be from a compiler optimization generated with an older version of GCC. Ingo was skeptical at first, "it's 4.0.2. Not the latest & greatest but I've been using it for 2 years and this would be the first time it miscompiles a 32-bit kernel out of tens of thousands of successful kernel bootups." Linus replied, "I am 100% sure. I can look at the disassembly, and point to the fact that your Oops happens on code that is simply totally bogus." He continued on to offer an interesting review of the crash, explaining line by line what should have been generated versus what actually was, causing the crash. In the end, Ingo switched to a distribution compiled GCC 4.1.2 and confirmed that the crash went away, "so you are completely right, it's a compiler bug in 4.0.2."
During the thread, Linus suggested that the optimization made by the compiler wasn't "legal", to which Alan Cox retorted, "pedant: valid. Almost all optimizations are legal, nobody has yet written laws about compilers. Sorry but I'm forever fixing misuse of the word 'illegal' in printks, docs and the like and it gets annoying after a bit." Linus playfully responded, "heh. When I'm ruler of the universe, it *will* be illegal. I'm just getting a bit ahead of myself." When asked how long until he expected to be ruler, Linus added, "I'm working on it, I'm working on it. I'm just as frustrated as you are. It turns out to be a non-trivial problem."
Casey Schaufler posted an updated Smack patchset based on feedback from the previous posting, "I have broken the Smack patch into the netlabel changes from Paul Moore (1/2) and the Smack LSM (2/2), at Paul's kind suggestion." He added:
"The smackfs symlinks have proven too contentious. I have removed the facility. Al and Alan are correct that the rich set of mount options currently available can handle any of the use cases I was looking at without excessive difficulty."
Smack is the Simplified Mandatory Access Control Kernel, utilizing the LSM framework to implement label-based mandatory access control and slated for inclusion in the upcoming 2.6.24 mainline kernel during the 2.6.24-rc1 merge window.
Trond Myklebust noted the NFS client updates for the upcoming 2.6.24 kernel:
"Aside from the usual updates from Chuck for NFS-over-IPv6 (still incomplete) and a number of bugfixes for the text-based mount code, the main news in the NFS tree is the merging of support for the NFS/RDMA client code from Tom Talpey and the NetApp New England (NANE) team."
He continued, "we also have the 64-bit inode support from RedHat/Peter Staubach. There is also the addition of a nfs_vm_page_mkwrite() method in order to clean up the mmap() write code. Finally, I've been working on a number of updates for the attribute revalidation, having pulled apart most of the dentry and attribute revalidation into separate variables. A number of fixes that address existing bugs fell out of that review, which should hopefully result in more efficient dcache behaviour..." Actual source changes can be browsed in the NFS client git repository.
"I've just released the 2.6.23-rc9-ext4-1. It collapses some patches in preparation for pushing them to Linus, and adds some of the cleanup patches that had been incorporated into Andrew's broken-out-2007-10-01-04-09 series," announced Theodore Ts'o. He also noted of the current ext4 git tree, "it also has some new development patches in the unstable (not yet ready to push to mainline) portion of the patch series." In an earlier thread Theodore posted a series of patches specifically intended for inclusion in the upcoming 2.6.24 kernel. Included in the patch series was a patch for improving fsck performance, "in performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count." The patch included an explanation of how the feature works, enabled through a mkfs option:
"With this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation."
In a continuing discussion about the difference between pluggable security and pluggable schedulers, Linus Torvalds quoted himself:
"Another difference is that when it comes to schedulers, I feel like I actually can make an informed decision. Which means that I'm perfectly happy to just make that decision, and take the flak that I get for it. And I do (both decide, and get flak). That's my job."
He added, "which you seem to not have read or understood (neither did apparently anybody on slashdot)". Linus continued, "the arguments that 'servers' have a different profile than 'desktop' is pure and utter garbage, and is perpetuated by people who don't know what they are talking about." He then asked and answered his own question, "really: tell me what the difference is between 'desktop' and 'server' scheduling. There is absolutely *none*," going on to explain:
"Yes, there are differences in tuning, but those have nothing to do with the basic algorithm. They have to do with goals and trade-offs, and most of the time we should aim for those things to auto-tune (we do have the things in /proc/sys/kernel/, but I really hope very few people use them other than for testing or for some extreme benchmarking - at least I don't personally consider them meant primarily for 'production' use)."