"It took me quite a while to realize the real root cause of the VAIO - and probably many other machines - suspend/resume regressions, which were unearthed by the dyntick / clockevents patches," Thomas Gleixner explained regarding two patches for fixing suspend issues that Andrew Morton experienced with his VAIO laptop. He continued, "we disable a lot of ACPI/BIOS functionality during suspend, but we keep the lower idle C-states functionality active across suspend/resume. It seems that this causes trouble with certain BIOSes, but I assume that the problem is more wide spread and just not surfacing due to the various scenarios in which a machine goes into suspend/resume." Thomas concluded, "I really hope that this two patches finally set an end to the 'jinxed VAIO heisenbug series', which started when we removed the periodic tick with the clockevents/dyntick patches."
Linus Torvalds expressed some concerns, "the patches look fine, but I somehow have this slight feeling that you gave up a bit too soon on the '*why* does this happen?' question." He agreed that at that point there was a problem with ACPI, but cautioned that this could be triggered by another bug, "in particular, I also suspect that this may not really fix the problem - maybe it just makes the window sufficiently small that it no longer triggers. Because we don't necessarily understand what the real background for the problem is, I'm not sure we can say that it is solved." Linus concluded, "but hey, I think I'll apply the patches as-is. I'd just feel even better if we actually understood *why* doing the CPU Cx states is not something we can do around the suspend code!"
Jens Axboe detailed the changes in his linux-2.6-block.git tree that he plans to merge into the upcoming 2.6.24 kernel. Among the changes were the necessary updates to enable SG chaining, which is used for large IO commands, "the goal of sg chaining is to allow support for very large sgtables, without requiring that they be allocated from one contiguous piece of memory." Andrew Morton asked for more information, "presumably sg chaining means more overhead on the IO submission paths? If so, has this been quantified?"
Jens explained that there is no overhead for existing logic which doesn't use sg chaining, "just cleanups to drivers to use for_each_sg() and so on." He continued:
"For actually using the sg chaining, there's some overhead of course. Say we support 256 entries without chaining, or 1mb with 4kb pages. A request with 1000 entries would require 4 trips to the allocator to allocate the chainable lists and 4 trips when freeing that list again. We don't loop the sg list on setup or freeing, just jump to the correct locations. So even for chaining, the cost isn't that big. It enables us to support much larger IO commands and potentially speed up some devices quite a lot, so CPU cost is less of a concern. And for small sglists, there isn't a noticeable overhead."
A short thread on the lkml discussed the lack of a memzero function in the Linux kernel. Cyrill Gorcunov asked, "could anyone tell me why there is no official memzero function (or macros) in the kernel?" Arjan van de Ven explained, "it doesn't add value.... memset with a constant 0 is just as fast (since the compiler knows it's 0) than any wrapper around it, and the syntax around it is otherwise the same." Linux creator Linus Torvalds went on to explain:
"The reason we have 'clear_page()' is not because the value we're writing is constant - that doesn't really help/change anything at all. We could have had a 'fill_page()' that sets the value to any random byte, it's just that zero is the only value that we really care about.
"So the reason we have 'clear_page()' is because the *size* and *alignment* is constant and known at compile time, and unlike the value you write, that actually matters. So 'memzero()' would never really make sense as anything but a syntactic wrapper around '
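To make the distinction concrete, here is a minimal userspace sketch. Both functions and the page-size constant are hypothetical illustrations, not kernel code: the first shows that a memzero would add nothing beyond memset with a constant 0, while the second shows the property Linus singles out for clear_page(), a size fixed at compile time.

```c
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096   /* assumed page size, for illustration only */

/* Hypothetical memzero(): nothing more than memset() with a constant 0. */
void *memzero_sketch(void *dst, size_t len)
{
    return memset(dst, 0, len);
}

/* Hypothetical stand-in for clear_page(): the length is a compile-time
 * constant and the caller is assumed to pass a page-aligned buffer.
 * Those two facts, not the zero value, are what an architecture-specific
 * implementation could exploit. */
void clear_page_sketch(void *page)
{
    memset(page, 0, SKETCH_PAGE_SIZE);
}

/* Small helper for checking results: returns 1 if all `len` bytes are 0. */
int is_zeroed(const unsigned char *p, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```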
"sched_yield() is not - and should not be - about 'recalculating the position in the scheduler queue' like you do now in CFS," Linus Torvalds stated in a discussion with Completely Fair Scheduler author Ingo Molnar, pointing to the man pages to back up his argument that sched_yield should instead move a thread to the end of its queue, adding, "quite frankly, the current CFS behaviour simply looks buggy. It should simply not move it to the 'right place' in the rbtree. It should move it *last*."
Ingo described how it worked with the pre-2.6.23 scheduler, "the O(1) implementation of yield() was pretty arbitrary: it did not move it last on the same priority level - it only did it within the active array. So expired tasks (such as CPU hogs) would come _after_ a yield()-ing task." He went on to compare this to the new process scheduler, "so the yield() implementation was so much tied to the data structures of the O(1) scheduler that it was impossible to fully emulate it in CFS. In CFS we don't have a per-nice-level rbtree, so we cannot move it dead last within the same priority group - but we can move it dead last in the whole tree. (then they'd be put even after nice +19 tasks.) People might complain about _that_." He also noted that this would change the behavior for some desktop applications that call sched_yield(), "there will be lots of regression reports about lost interactivity during load."
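The difference between the two behaviours can be illustrated with a toy model. This is only a sketch: a flat array of task ids stands in for the run order, which is far simpler than either scheduler's real data structures, and the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Toy model: queue[0] is the running task; the rest run in array order.
 * Moves the head of the first `n` slots to the end of those `n` slots
 * and shifts the others forward by one. */
void yield_to_tail(int *queue, size_t n)
{
    if (n < 2)
        return;                 /* nothing to reorder */
    int head = queue[0];
    memmove(queue, queue + 1, (n - 1) * sizeof *queue);
    queue[n - 1] = head;
}
```

If the array holds three active tasks followed by two expired ones, calling `yield_to_tail(queue, 3)` mimics the O(1) behaviour Ingo describes, where the yielding task goes last among the active tasks but still runs before the expired ones. Calling `yield_to_tail(queue, 5)` mimics moving the task dead last overall, as Linus argues it should be.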
"We, the MadWifi team, announce our decision to move away from the binary-only HAL and change the focus of our future development towards ath5k, a completely free (as in freedom) driver which will eventually become an integral part of the Linux kernel," Michael Renzmann posted to the MadWifi development mailing list. The decision comes during continued debate surrounding what is and what is not allowed by the BSD license, and with no official statement yet from the SFLC. Much of the debate was due to an attempt to release BSD licensed files under the GPL, visible for example in the ath5k_hw.c source file which is still labeled as available "under the terms of the GNU General Public License" in the latest version of the file checked into the source repository linked from the MadWifi project page. It appears that actual development of the ath5k driver has been moved to Linville's git tree, where the license is now purely BSD, though debate remains as to what's required to be able to add additional copyrights to source code as have been added to the reverse engineered HAL code originally written by Reyk Floeter. In an earlier confrontation with Atheros, the work done by Reyk was determined to be free of copyright infringement:
"A driver for Atheros wireless cards is available in OpenBSD that talks directly to the hardware, based on reverse engineering efforts done by Reyk Floeter. Relevant parts of the driver have been ported to Linux by Nick Kossifidis to start OpenHAL, a free (as in freedom) replacement of the proprietary HAL. Claims that the OpenBSD driver (and thus also OpenHAL) contains stolen code slowed down the OpenHAL efforts but finally could be voided. The Software Freedom Law Center (SFLC), with the help of Atheros, performed a thorough code review and concluded "that OpenHAL does not infringe copyrights held by Atheros". In other words, the way is clear now for the inclusion of an OpenHAL-based driver into the Linux kernel."
"Intel's Open Source Technology Center is pleased to announce the LessWatts.org project, an open source project for saving power on Linux," began an email posted to the lkml by Arjan van de Ven. The announcement continued:
"LessWatts.org is a place to bring users, developers and distribution makers together around power reduction for linux machines, from mobile to desktop to server to datacenter. LessWatts.org is about a system-level approach to power savings, from the lowest level device drivers in the kernel to the most advanced desktop applications. LessWatts.org is about things you can do to reduce power usage. LessWatts.org is about longer battery life, a lower airconditioning bill, about reducing the impact of computers on the environment."
The announcement went on to note, "at this time of launching the LessWatts.org project, the technology development projects are those that Intel has started, is involved in or has just started working on, such as PowerTOP, Tickless Idle, Graphics and various link power management techniques. We'd like to invite all developers and projects that focus on power saving to join the LessWatts.org effort and community."
Ulrich Drepper noted a difference between the Linux connect(2) man page and the POSIX specification. The former states, "connectionless sockets may dissolve the association by connecting to an address with the sa_family member of sockaddr set to AF_UNSPEC." The latter reads, "if address is a null address for the protocol, the socket's peer address shall be reset." Ulrich explained that he preferred the description in the Linux man page, but the Linux kernel seems to actually follow the POSIX specification, "is this functionality which got lost over time? Or is the man page wrong and this never was the case? Is this a worthwhile change?"
Alan Cox noted, "we got it from the 1003.4g draft socket specification if I remember rightly." David Miller suggested, "the whole AF_UNSPEC thing I'm almost certain comes from BSD, which has behaved that way for centuries." Alan concurred, "it's entirely plausible that [the 1003.4g draft socket specification] got it from 4BSD." Ulrich concluded, "I guess I'll just go ahead and file a problem report with the spec. Maybe the Unix vendors will test their implementations and provide feedback."
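For reference, the behaviour described in the man page text quoted above can be exercised with a short sketch. The helper names are hypothetical; only socket calls from the standard sockets API are used, and whether a particular system honors the request is the very question the thread raises.

```c
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: ask the kernel to drop the peer association of a
 * datagram socket by connecting to an address whose sa_family is
 * AF_UNSPEC. Returns whatever connect() returns. */
int dissolve_association(int fd)
{
    struct sockaddr sa;

    memset(&sa, 0, sizeof sa);
    sa.sa_family = AF_UNSPEC;
    return connect(fd, &sa, sizeof sa);
}

/* Hypothetical helper: returns 1 if the socket currently has a peer
 * address, 0 otherwise (getpeername() fails on an unconnected socket). */
int has_peer(int fd)
{
    struct sockaddr_storage ss;
    socklen_t len = sizeof ss;

    return getpeername(fd, (struct sockaddr *)&ss, &len) == 0;
}
```

A caller would connect() a UDP socket to a peer, later call `dissolve_association(fd)`, and use `has_peer(fd)` to confirm whether the association is gone.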
"What is going on whenever someone changes code is that they make a 'derivative work'," began Theodore Ts'o. "Whether or not you can even make a derivative work, and under what terms the derivative work can be licensed, is strictly up to the license of the original. For example, the BSD license says: '
redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met....' Note the 'with or without modification'. This is what allows people to change BSD licensed code and redistribute said changes." Regarding code that is GPL'd, he added, "it is not a relicencing, per se, since the original version of the file is still available under the original copyright; it is only the derived work which is under the more restrictive copyright."
Disagreement continued as to whether or not the BSD license allows the addition of new copyrights on unmodified or minimally modified code. Another disagreement was over the continued existence of improperly licensed files in developer source code repository histories from when BSD licensed files had been changed to the GPL, a problem since fixed. Jeff Garzik explained:
"In a purely open development environment, even personal developer trees are made public. That's the way we _want_ development to occur. Out in public, with a full audit trail. Your implied ideal scenario is tantamount to guaranteeing that mistakes are never committed to a public repository anywhere. Mistakes will happen. Even legal mistakes. In public.
"What you are seeing is an example of mistakes that were caught in review, and corrected. That's how any scalable review process works... the developer reviews his own work. the team reviews the developer's work. the maintainer reviews the team's work. the next maintainer reviews. and so on, to the top."
"Recently, the CE Linux forum has been working to revive the Linux-tiny project," stated Tim Bird on the Linux Kernel mailing list, adding that Michael Opdenacker has been selected as the project's new primary maintainer. The project's website explains:
"The linux-tiny patchset is a series of patches against the 2.6 mainline Linux kernel to reduce its memory and disk footprint, as well as to add features to aid working on small systems. Target users are developers of embedded system and users of small or legacy machines such as 386s and handhelds."
Andrew Morton suggested that patches should be sent to him to be merged into his -mm tree, aiming for inclusion in the mainline kernel, "seriously, putting this stuff into some private patch collection should be a complete last resort - you should only do this with patches which you (and the rest of us) agree have no hope of ever getting into mainline." Michael, the project's new maintainer, agreed, "you're completely right... The patches should all aim at being included into mainline or die." Tim added, "the patchkit gives a place for things to live while they are out of mainline, and still have multiple people use and work on them. Optimally the duration of being out-of-mainline would be short, but my experience is that sometimes what an embedded developer considers reasonable to hack off the kernel is not considered so reasonable by other developers (even with config options)."
"Ahoy me laddies (and beauties)," Linux creator Linus Torvalds began, announcing the seventh release candidate for the upcoming 2.6.23 kernel, "time for the traditional 'Talk Like a Pirate Day' kernel release!" He noted, "now, last year we had a full release (2.6.18 was immortalized on TLAP-2006), but this year I'm chickening out, and we're just doing what is hopefully going to be the last -rc release for the 2.6.23 series." Full source changes can be viewed via the gitweb interface. Linus also offered a brief summary of the changes:
"I'm not including the diffstat, because it got blown up by the resurrection of the sk98lin driver - because skge that is supposed to supplant it doesn't handle some of the hardware. Oh well.
"Apart from that, we had some mips, powerpc and xtense updates, and various driver tweaks. Things like the USB autosuspend revert should make people happier, and some more clockevents fixes should help suspend/restore on i386."
"In [the first pass] of e2fsck, every inode table in the fileystem is scanned and checked, regardless of whether it is in use," Avantika Mathur began. "This is the most time consuming part of the filesystem check. The unintialized block group feature can greatly reduce e2fsck time by eliminating checking of uninitialized inodes." She went on to explain how it works, "with this feature, there is a a high water mark of used inodes for each block group. Block and inode bitmaps can be uninitialized on disk via a flag in the group descriptor to avoid reading or scanning them at e2fsck time. A checksum of each group descriptor is used to ensure that corruption in the group descriptor's bit flags does not cause incorrect operation." Avantika attached a graph illustrating the advantage of the patch which she summarized as follows:
"The patches have been stress tested with fsstress and fsx. In performance tests testing e2fsck time, we have seen that e2fsck time on ext3 grows linearly with the total number of inodes in the filesytem. In ext4 with the uninitialized block groups feature, the e2fsck time is constant, based solely on the number of used inodes rather than the total inode count. Since typical ext4 filesystems only use 1-10% of their inodes, this feature can greatly reduce e2fsck time for users. With performance improvement of 2-20 times, depending on how full the filesystem is."
"There is a tension here between generality of support infrastructure, maintainability of the infrastructure, simplicity of the infrastructure and reliability of the infrastructure," began Eric Biederman, discussing the need for a common RAS infrastructure for dealing with kernel crashes and what would be involved in getting such tools merged into the mainline kernel. He continued, "the historical linux perspective is that anything that compromises the maintainability or the reliability of the kernel without the tools is unacceptable. There is also a historical perspective that using the single stepping mode of a debugger to diagnose problems frequently leads to symptoms being fixed and not the actual problems being fixed."
Eric compared the kexec on panic code and the kdb code, "on the kexec on panic path the philosophy is that the kernel is broken and as little as possible should be relied upon." He contrasted this to kdb, "from what I can tell the philosophy of the kdb code is that the kernel is mostly ok except for one or two little bugs so it is reasonable to rely on lots of kernel infrastructure." He then suggested that it was because of this difference and reduced maintenance overhead that kexec on panic was merged into the mainline kernel, "I will note that in some sense it is a harder approach to implement as it emphasizes the challenge of drivers that work starting from a random hardware state, and because it draws a clear line between the broken kernel and the recover kernel. But those things are exactly what encourage things to work well." As for what is the next step forward in RAS development, Eric noted, "if someone who is suggesting an implementation can absorb and understand the requirements of the different groups and come up with solutions that meet the requirements of the different projects I think progress can be made. That as far as I know takes talent."
"Here's a new version of my credentials patch. It's still very basic, with only Ext3, (V)FAT, NFS, AFS, SELinux and keyrings compiled in on an x86_64 arch kernel," stated David Howells. He described the patch as, "introduce a copy on write credentials record (struct cred). The fsuid, fsgid, supplementary groups list move into it (DAC security). The session, process and thread keyrings are reflected in it, but don't primarily reside there as they aren't per-thread and occasionally need to be instantiated or replaced by other threads or processes."
Casey Schaufler asked, "what I don't really understand is what value is gained by this exercise. Are the savings sufficiently significant to justify the effort?" Trond Myklebust explained, "it is not about savings, but about new functionality. Basically, the existence of reference-counted credentials will allow AFS and NFS to cache that information and use it for deferred writes etc." David added, "and also make it easier for cachefiles and hopefully NFSd to override the active security. There's a comment somewhere in, I think, the SunRPC code in the Linux kernel bemoaning the lack of this very feature:-)"
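The idea of a reference-counted, copy-on-write credentials record can be sketched in userspace C. Everything below is a hypothetical illustration: it is not the struct cred from David's patch, it carries only two fields, and it is not thread-safe.

```c
#include <stdlib.h>

/* Hypothetical, simplified credentials record. */
struct cred_sketch {
    unsigned refcount;
    unsigned fsuid;
    unsigned fsgid;
};

/* Allocate a record holding one reference; returns NULL on failure. */
struct cred_sketch *cred_new(unsigned fsuid, unsigned fsgid)
{
    struct cred_sketch *c = malloc(sizeof *c);

    if (c) {
        c->refcount = 1;
        c->fsuid = fsuid;
        c->fsgid = fsgid;
    }
    return c;
}

/* Take an extra reference, e.g. to remember who issued a deferred write. */
struct cred_sketch *cred_get(struct cred_sketch *c)
{
    c->refcount++;
    return c;
}

/* Drop a reference and free the record when the last one goes away. */
void cred_put(struct cred_sketch *c)
{
    if (c && --c->refcount == 0)
        free(c);
}

/* Copy on write: build a modified copy instead of changing `old` in
 * place, then drop the caller's reference to `old`. Other holders of
 * `old` keep seeing the values they captured. */
struct cred_sketch *cred_set_fsuid(struct cred_sketch *old, unsigned fsuid)
{
    struct cred_sketch *copy = cred_new(fsuid, old->fsgid);

    if (copy)
        cred_put(old);
    return copy;
}
```

A cache that took a reference with `cred_get()` before the task changed its fsuid would still hold the original values afterwards, which is the kind of use Trond describes for deferred writes.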
A frustrated-sounding Andrew Morton released the 2.6.23-rc6-mm1 kernel as "a 29MB diff against 2.6.23-rc6." Many patches are merged first into Andrew's -mm tree for testing before being pushed to Linus' mainline tree during the merge window. Andrew suggested that the -mm process wasn't working as well as it could:
"It took me over two solid days to get this lot compiling and booting on a few boxes. This required around ninety fixup patches and patch droppings. There are several bugs in here which I know of (details below) and presumably many more which I don't know of. I have to say that this just isn't working any more."
"I'm pleased to announce [the] fourth release of the distributed storage subsystem, which allows [you] to form a storage [block device] on top of remote and local nodes, which in turn can be exported to another storage [block device] as a node to form tree-like storage [block devices]," Evgeniy Polyakov stated on the Linux Kernel mailing list. The new release includes a new configuration interface and several bug fixes.
Network device driver and SATA subsystem maintainer Jeff Garzik was not impressed with the concept, "[distributed block devices] are not very useful, because it still relies on a useful filesystem sitting on top of the DBS." He went on to explain the problem, "it devolves into one of two cases: (1) multi-path much like today's SCSI, with distributed filesystem arbitration to ensure coherency, or (2) the filesystem running on top of the DBS is on a single host, and thus, a single point of failure (SPOF)." He proposed instead that time would be better spent developing a POSIX-only distributed filesystem, "in contrast, a distributed filesystem offers far more scalability, eliminates single points of failure, and offers more room for optimization and redundancy across the cluster." Jeff went on to caution, "a distributed filesystem is also much more complex, which is why distributed block devices are so appealing :)" When Lustre was pointed out as an existing option, Jeff noted, "Lustre is tilted far too much towards high-priced storage, and needs improvement before it could be considered for mainline."