"Currently in mainline the balancing of multiple RT threads is quite broken. That is to say that a high priority thread that is scheduled on a CPU with a higher priority thread, may need to unnecessarily wait while it can easily run on another CPU that's running a lower priority thread," began Steven Rostedt, describing his patchset to introduce improved real time task balancing. He explained:
"Balancing (or migrating) tasks in general is an art. Lots of considerations must be taken into account. Cache lines, NUMA and more. This is true with general processes which expect high through put and migration can be done in batch. But when it comes to RT tasks, we really need to put them off to a CPU that they can run on as soon as possible. Even if it means a bit of cache line flushing. Right now an RT task can wait several milliseconds before it gets scheduled to run. And perhaps even longer. The migration thread is not fast enough to take care of RT tasks."
Steven described his test cases and numerous issues he noticed with the current balancing code, noting, "to solve this issue, I've changed the RT task balancing from a passive method (migration thread) to an active method. This new method is to actively push or pull RT tasks when they are woken up or scheduled."
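The push half of this active method is easy to picture: when an RT task wakes up, scan the other runqueues for one whose current task can be preempted. The following user-space sketch illustrates only that decision; the names and the toy priority table are invented, and this is not Steven's kernel code:

    /* Toy illustration of the "push" side of active RT balancing: when an
     * RT task wakes on a busy CPU, look for another CPU whose current task
     * has a lower priority and run it there instead of waiting.  A sketch
     * of the idea only; all names are made up. */
    #include <stdio.h>

    #define NR_CPUS 4

    /* Priority of the task currently running on each CPU (higher = more urgent). */
    static int curr_prio[NR_CPUS] = { 90, 50, 70, 20 };

    /* Return the best CPU to push a woken task of priority 'prio' to, or -1. */
    static int find_push_target(int prio, int this_cpu)
    {
        int cpu, best_cpu = -1, lowest = prio;

        for (cpu = 0; cpu < NR_CPUS; cpu++) {
            if (cpu == this_cpu)
                continue;
            if (curr_prio[cpu] < lowest) {   /* preemptable by our task */
                lowest = curr_prio[cpu];
                best_cpu = cpu;
            }
        }
        return best_cpu;
    }

    int main(void)
    {
        int woken_prio = 60;   /* an RT task of priority 60 wakes on CPU 0 */
        int target = find_push_target(woken_prio, 0);

        if (target >= 0)
            printf("push prio-%d task to CPU %d (running prio %d)\n",
                   woken_prio, target, curr_prio[target]);
        else
            printf("no lower-priority CPU; task must wait\n");
        return 0;
    }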
"I'm trying to keep some external drivers up to date with the kernel, and the first two weeks after the release is the worst time for me. There is no way to distinguish the current git kernel from the latest release. It's only after rc1 is released that I can use the preprocessor to check LINUX_VERSION_CODE," explained Pavel Roskin, describing the ongoing effort to keep the out of tree MadWifi driver in sync with the latest released kernel. Rik Van Riel suggested:
"Consider this an incentive to submit your code for inclusion in the upstream kernel. Having all the common drivers integrated in the mainline kernel makes it much easier for users to use all their hardware, external drivers are not just a pain for the developers."
Pavel acknowledged, "the incentive has already worked for MadWifi, which has landed in the wireless-2.6 repository under the name 'ath5k'. Still, there is a lot of work to do, and some features won't appear in the kernel driver soon, partly because they rely on the chipset features that still need to be reverse engineered." In response to Pavel's original question, Dave Jones noted that Fedora kernels treat the development between a major release and the first release candidate as "rc0".
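For reference, the check Pavel describes is the usual external-module preprocessor guard, built on the kernel's LINUX_VERSION_CODE and KERNEL_VERSION macros; the interface split in this minimal example is a placeholder:

    /* Standard external-module version guard.  Note Pavel's complaint:
     * between a release and -rc1, LINUX_VERSION_CODE still reports the
     * previous version, so the current git kernel is indistinguishable
     * from the release. */
    #include <linux/version.h>

    #if LINUX_VERSION_CODE >= KERNEL_VERSION(2, 6, 24)
    /* use an interface introduced during the 2.6.24 cycle */
    #else
    /* fall back to the 2.6.23 interface */
    #endif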
In a brief follow-up to the earlier pluggable security discussion, Thomas Fricaccia reflected on the implications for the various security frameworks, "I noticed James Morris' proposal to eliminate the LSM in favor of ordaining SELinux as THE security framework forever and amen, followed by the definitive decision by Linus that LSM would remain." He then commented on a recently merged patch preventing the loading of security modules into a running kernel, "but then I noticed that, while the LSM would remain in existence, it was being closed to out-of-tree security frameworks. Yikes! Since then, I've been following the rush to put SMACK, TOMOYO and AppArmor 'in-tree'." Linus Torvalds replied:
"Yeah, it did come up. Andrew, when he sent it on to me, said that the SuSE people were ok with it (AppArmor), but I'm with you - I applied it, but I'm also perfectly willing to unapply it if there actually are valid out-of-tree users that people push for not merging. So Í don't think this is settled in any way - please keep discussing, and bringing it up. I'm definitely not in the camp that thinks that LSM needs to be 'controlled', but on the other hand, I'm also not going to undo that commit unless there are good real arguments for undoing it (not just theoretical ones).
"For example, I do kind of see the point that a 'real' security model might want to be compiled-in, and not something you override from a module. Of course, I'm personally trying to not use any modules at all, so I'm just odd and contrary, so whatever.. Real usage scenarios with LSM modules, please speak up!"
Jaroslav Sykora posted a series of five patches to handle the kernel portion of what he described as "shadow directories", providing an example which utilized FUSE to access the contents of a compressed file from the command line. His first example was cat hello.zip^/hello.c, about which he explained, "the '^' is an escape character and it tells the computer to treat the file as a directory. The kernel patch implements only a redirection of the request to another directory ('shadow directory') where a FUSE server must be mounted. The decompression of archives is entirely handled in the user space. More info can be found in the documentation patch in the series."
Numerous problems were pointed out. Jan Engelhardt noted, "too bad, since ^ is a valid character in a *file*name. Everything is, with the exception of '\0' and '/'. At the end of the day, there are no control characters you could use." Later in the thread an lwn.net article from a couple of years ago was quoted, "another branch, led by Al Viro, worries about the locking considerations of this whole scheme. Linux, like most Unix systems, has never allowed hard links to directories for a number of reasons." The article had been discussing Reiser4, which treats files as directories. In the current discussion, Al Viro added, "as for the posted patch, AFAICS it's FUBAR in handling of .. in such directories. Moreover, how are you going to keep that shadow tree in sync with the main one if somebody starts doing renames in the latter? Or mount --move, or..."
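Kernel plumbing aside, the name parsing itself is straightforward. Here is a small user-space sketch (not Jaroslav's patch) that splits a path at the '^/' escape:

    /* Split "hello.zip^/hello.c" into the archive ("hello.zip") and the
     * path inside it ("hello.c").  The real patch does the redirection
     * inside kernel path lookup and leaves decompression to a FUSE server;
     * this toy only shows the name parsing. */
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *path = "hello.zip^/hello.c";
        const char *mark = strstr(path, "^/"); /* first escape wins; per Jan's
                                                  objection, a literal '^' in a
                                                  filename is ambiguous here */

        if (!mark) {
            fprintf(stderr, "no shadow-directory escape in '%s'\n", path);
            return 1;
        }
        printf("archive: %.*s\n", (int)(mark - path), path); /* hello.zip */
        printf("inside:  %s\n", mark + 2);                   /* hello.c   */
        return 0;
    }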
The previous 2.4 Linux kernel maintainer, Marcelo Tosatti, resurrected a discussion on adding support for out-of-memory notifications to the Linux kernel. He explained, "AIX contains the SIGDANGER signal to notify applications to free up some unused cached memory," and noted, "there have been a few discussions on implementing such an idea on Linux, but nothing concrete has been achieved." In a request for discussion, Marcelo added, "on the kernel side Rik suggested two notification points: 'about to swap' (for desktop scenarios) and 'about to OOM' (for embedded-like scenarios)." Rik van Riel explained:
"The first threshold - 'we are about to swap' - means the application frees memory that it can. Eg. free()d memory that glibc has not yet given back to the kernel, or JVM running the garbage collector, or ...
"The second threshold - 'we are out of memory' - means that the first approach has failed and the system needs to do something else. On an embedded system, I would expect some application to exit or maybe restart itself."
Evgeniy Polyakov announced a new version of his distributed storage subsystem, "this release includes [a] mirroring algorithm extension, which allows [the subsystem] to store [the] 'age' of the given node on the underlying media." He went on to explain why this was useful:
"In this case, if [a] failed node gets new media, which does not contain [the] correct 'age' (unique id assigned to the whole storage during initialization time), the whole node will be marked as dirty and eventually resynced.
"This allows [it] to have [a] completely transparent failure recovery - [the] failed node can be just turned off, its hardware fixed and then turned on. DST core will detect [the] connection reset and automatically reconnect when [the] node is ready and resync if needed without any special administrator's steps."
"This is a request to merge KGDB into the mainline kernel," Jason Wessel announced, posting a series of patches aiming toward that goal. He continued, "as of right now KGDB is comprised of 21 different patches adding in the core api and docs first and then working up to add drivers and arch specific support to KGDB. The patches were broken down into logical pieces for review and comments." He went on to explain:
"The intent of the KGDB patches is to unify the KGDB support across all the architectures that elect to implement the KGDB functionality by providing a common core and an arch specific stub. For quite some time there has been different features and uses of KGDB across the most popular architectures. Having a common core that takes care of protocol parsing and the typical use case of software breakpoints should eliminate the inconsistencies across the archs as well as making it easier to add KGDB support to a new arch."
Andrew Morton, who has been supportive of getting a kernel debugger into the mainline kernel, explained that it was too late in the 2.6.24 review cycle to merge KGDB, meaning it would have to wait for 2.6.25 at the earliest, "this won't work very well. There's a lot of review work to be done here, and a lot of it by busy architecture maintainers. Expecting people to do all this review and test work late in the merge window when they're all madly scrambling to get their bugs^Wpatches into mainline is not reasonable. This should all have started a month ago. So we're looking at a 2.6.25 merge for this work."
Ken Chen submitted a patch to reduce the memory footprint of schedstat in a thread titled, "schedstat needs a diet". He explained, "schedstat is useful in investigating CPU scheduler behavior. Ideally, I think it is beneficial to have it on all the time. However, the cost of turning it on in production system is quite high, largely due to number of events it collects and also due to its large memory footprint." His patch converted numerous unsigned long variables to unsigned int, "most of the fields probably don't need to be [a] full 64-bits on 64-bit [architectures]. Rolling over 4 billion events will most likely take a long time and user space tools can be made to accommodate that."
Ingo Molnar merged the patch into his scheduler development tree, suggesting there were further conversions that could be made, "note that current -git has a whole bunch of new schedstats fields in /proc/<pid>/sched which can be used to track the exact balancing behavior of tasks. It can be cleared via echoing 0 to the file - so overflow is not an issue. Most of those new fields should probably be unsigned int too. (they are u64 right now.)"
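The arithmetic behind the diet is simple: on an LP64 architecture unsigned long is 8 bytes and unsigned int is 4, so a counter-heavy structure shrinks by half. A self-contained demonstration (the structure is illustrative, not the real schedstat layout):

    #include <stdio.h>

    struct stats_long { unsigned long cnt[12]; };  /* 96 bytes on LP64 */
    struct stats_int  { unsigned int  cnt[12]; };  /* 48 bytes */

    int main(void)
    {
        /* A 32-bit counter rolls over after ~4 billion events, which, as
         * Ken notes, user space tools can be made to accommodate. */
        printf("unsigned long version: %zu bytes\n", sizeof(struct stats_long));
        printf("unsigned int  version: %zu bytes\n", sizeof(struct stats_int));
        return 0;
    }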
"As most folks are probably now well aware, Junio has been offline for about 11 days and may still be offline for a little while more," Shawn Pearce explained regarding git maintainer Junio Hamano's recent absence from Git development. He noted, "I'm not going to get into the specific details as it is Junio's business and not mine. But I can say that my thoughts and prayers to $DEITY are with him and his family at this time, and I don't expect him to be rushing back to git work tomorrow. However I'm quite certain that Junio will return when he can."
Shawn continued, explaining, "I've decided to step up and try to fill Junio's shoes. To that end I am publishing a maint, master, next (and soon) pu branch from a new fork on repo.or.cz." He offered links to his new git development trees, and followed up in another email summarizing recent changes. He noted, "I based my branches on top of the last items published by Junio, and am hoping that he will be open to pulling directly from these before he starts working again. Junio obviously has the option not to pull from me, but if I do my job of interim maintainer well I can probably talk him into it. :)"
"I should be doing status reports, so here's my first cut at what is happening and what is going on so far. I'll try to do these every few weeks, and I also encourage the project managers of active projects to also do this," explained Greg KH, posting his October 16'th Status Report for the Linux Driver Project. He noted:
"Currently, I'm talking with about 3-4 new companies about more projects, and am working on a list of external modules that need to get cleaned up and added to the kernel tree that this project can help out with.
"But what we are really lacking right now is more companies involvement. If anyone can think of a way to drum up more company interest, please let me know."
Greg then offered a status summary of all six currently active projects, "here's the current projects, and what is going on with them." Drivers currently being developed by the driver project include "3 i2c devices, a VOIP gateway driver, USB/PCI driver for video timestamp device, highspeed datacapture device driver, and a video digital demodulator driver." Another project is listed as focusing on cleaning up the existing USB driver.
In a short discussion following a patch titled "fix adbhid mismerge", Al Viro noted troubles in tracking down the changeset that caused the problem, "what's the right way to trace the things like that? Linus?" Linus Torvalds, as the original author of git, replied, "in general, I'm afraid that merge errors are simply not very easy to find." He then offered some general tips for tracking down mis-merges, noting, "if anybody can come up with a better way to find these kinds of mis-merges, I'd love to hear about it." Regarding this particular case, he explained:
"'-c' is for regular combined merges: any file that was modified in both parents will show up as a combination of the diffs of both sides, while a file that was taken in its *entirety* is ignored.
"In this case that's exactly what you wanted. It's just too noisy to necessarily be the default, and you can still have a silent mis-merge if the merger picked *only* one side.
"But in general, I suspect that '-c' is often a good thing to try if you cannot find the cause of some change in a regular commit, and suspect a merge error."
"It contains lots of scheduler updates from lots of people - hopefully the last big one for quite some time," began Ingo Molnar, describing his merge request for the linux-2.6-sched git tree. He continued, "most of the focus was on performance (both micro-performance and scalability/balancing), but there's the fair-scheduling feature now Kconfig selectable too. Find the shortlog below." Ingo noted, "code that is touched outside of the scheduler: the KVM bits were acked by Avi, the net/unix change is trivial and only affects sync wakeups, ditto the fs/pipe.c changes - but i can push those separately if it needs an ack from David first." He then concluded:
"Testing status: the changes are chronological and all the interactivity-impacting changes are near the head of the queue and most of them were done weeks ago, and were thus part of the CFS-v22 backport series - which was tested by many people. There are no known regressions at the moment. It's all fully bisectable."
"I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based)," explained Chris Mason in a recent posting to the Linux Filesystem Development mailing list. He linked to some benchmark results and summarized, "the first round of benchmarking shows that larger block sizes do consume more CPU, especially in metadata intensive workloads, but overall read speeds are much better." Chris then noted, "Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads." David Chinner replied, "the basic conclusion is that different filesystems will benefit in different ways with large block sizes...." explaining:
"Btrfs linearises writes due to it's COW behaviour and this is trades off read speed. i.e. we take more seeks to read data so we can keep the write speed high. By using large blocks, you're reducing the number of seeks needed to find anything, and hence the read speed will increase. Write speed will be pretty much unchanged because btrfs does linear writes no matter the block size.
"XFS doesn't linearise writes and optimises it's layout for a large number of disks and a low number of seeks on reads - the opposite of btrfs. Hence large block sizes reduce the number of writes XFS needs to write a given set of data+metadata and hence write speed increases much more than the read speed (until you get to large tree traversals)."
In a recent discussion on the Linux Kernel mailing list, it was asked "which companies are helping develop the kernel?" One response pointed out that a recent article on lwn.net summarized this information nicely. Others pointed to a paper maintained by Greg KH, who responded:
"Yeah, but my paper didn't really track companies very well. The lwn.net article is the best, and below is my version of who did things in 2.6.23. Note, the lack of a company is not an indicator that they did nothing, just that I could not easily determine someone worked for them. I'll try to send out my 'who are you working for' emails in a week or so to see if I can further categorize the 'unknowns'."
According to Greg's email, organizations that contributed more than 100 changesets to the recently released 2.6.23 kernel included: Red Hat with 827 changesets (11.7%), IBM with 557 changesets (7.9%), the Linux Foundation with 528 changesets (7.5%), Novell with 449 changesets (6.3%), Intel with 242 changesets (3.4%), Oracle with 158 changesets (2.2%), MIPS Technologies with 143 changesets (2%), Nokia with 133 changesets (1.9%), and NetApp with 119 changesets (1.7%). Greg noted that there were a total of 7,075 changesets from 992 developers working for 126 different employers. 843 of the changesets (11.9%) were contributed by individuals reporting no sponsor for their work.
"Incidentally i was thinking about using KVM for automated testing. Important pieces of hardware should get an in-KVM simulator/emulator, that way developers who do not own that hardware can do functionality testing too," Ingo Molnar suggested during a thread discussing a SCSI driver bug fix. Linus Torvalds was originally unimpressed by the idea:
"Using emulators to test device drivers is almost certain to be pointless. The problem with device drivers tends to be timing issues, odd hardware interactions, and lots of strange (and sometimes undocumented) behaviour and dependencies (eg things like 'you have to wait 50us after setting the reset bit until the hardware has actually reset'). These are all things that you'd generally not catch in emulation - because the emulation by necessity is only going to be a very weak picture of the real thing."
Alan Cox countered, "for some things. I do it a bit because you can use it to fake failures that are tricky to do in the real world. It won't tell you the driver works but it's surprisingly good for testing for races (forcing IRQ delivery at specific points), buggy hardware you don't possess, and things like media failures and timeouts your real hardware refuses to do." Linus acquiesced conditionally, "I do agree that you likely find bugs, even if quite often it's exactly because the behaviour is something that will never happen on real hardware," then acknowledged previous debugging efforts by Alan, "but failure testing is very useful - I forget who it was who debugged some driver by taking a CD and just scratching it mercilessly to induce read errors ;)" Ingo added, "something like that won't enable 100% coverage (or even reasonable coverage for most hardware), so it's no replacement for actual hard[ware] testing, but it could push out the domain of minimally tested code quite a bit and increase the quality of the kernel."
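Very little machinery is needed for the kind of failure injection Alan describes; the purely illustrative sketch below (not KVM or QEMU code) gives an emulated read path that occasionally returns a media error, exercising error handling that healthy hardware never would:

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Pretend to read one sector from emulated media; fail ~1 time in 16. */
    static int emu_read_sector(unsigned long lba, unsigned char *buf)
    {
        if ((rand() & 15) == 0)
            return -EIO;          /* injected media error */
        /* ... in a real model, fill buf from the backing image ... */
        (void)lba; (void)buf;
        return 0;
    }

    int main(void)
    {
        unsigned char buf[512];

        for (unsigned long lba = 0; lba < 32; lba++)
            if (emu_read_sector(lba, buf) < 0)
                printf("injected read error at sector %lu\n", lba);
        return 0;
    }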