Randall Stewart of Cisco Systems gave a talk titled SCTP, what it is and how to use it, discussing the Stream Control Transmission Protocol (SCTP). A paper that was displayed on the overhead projecter before the talk began summarized:
"Integrated into FreeBSD 7.0 -- first standardized by the Internet Engineering Task force (IETF) in October of 2000, in RFC 2960 and later updated by RFC 4960. SCTP is a message oriented protocol providing reliable end to end communication between two peers in an IP network."
Randall explained that SCTP is an alternative protocol to TCP, UDP. To describe SCTP, he suggested you start with TCP features, including: reliable retransmission, congestion control, flow control, connection oriented, and selective acknowledgements. You then add to it more features, including: "association" 4-way handshake, framing and ordered service, multistreaming, multihoming, and reachability.
Pawel Dawidek first ported ZFS to FreeBSD from OpenSolaris in April of 2007. He continues to actively port new ZFS features from OpenSolaris, and focuses on improving overall ZFS stability. During the introduction to his talk at BSDCan, he explained that his goal was to offer an accessible view of ZFS internals. His discussion was broken into three sections, a review of the layers ZFS is built from and how they work together, a look at unique features found in ZFS and how they work internally, and a report on the current status of ZFS in FreeBSD.
The BSDCan website notes that Pawel is a FreeBSD committer, adding:
"In the FreeBSD project, he works mostly in the storage subsystems area (GEOM, file systems), security (disk encryption, opencrypto framework, IPsec, jails), but his code is also in many other parts of the system. Pawel currently lives in Warsaw, Poland, running his small company."
BSDCan 2008 officially started this morning at 9AM with an opening talk by the event's organizer, Dan Langille. However, in reality the event has already been running for two days, with the FreeBSD tutorials having started on the 14'th. After arriving in Ottawa yesterday afternoon and finding my room in a 20 story University of Ottawa residence, I wandered down to the Royal Oak Pub for early registration, meeting several dozen BSD hackers from all over the world.
This morning's opening talk was well attended, filling up first with clusters of laptop users around the power outlets along both walls. By 15 minutes after the hour, the room was completely full, and Dan started with a humorous slideshow of example letters he's been receiving ever since posting the words "letter of invitation" somewhere on the BSDCan website two year back. Coming primarily from Nigeria, the letter's authors often claim to represent large groups of developers, yet always coming from "disposable" email addresses. After some laughs, he launched into his opening keynote.
"As some of the latency junkies on lkml already know, commit 8e3e076 in v2.6.26-rc2 removed the preemptible BKL feature and made the Big Kernel Lock a spinlock and thus turned it into non-preemptible code again. This commit returned the BKL code to the 2.6.7 state of affairs in essence," began Ingo Molnar. He noted that this had a very negative effect on the real time kernel efforts, adding that Linux creator Linus Torvalds indicated the only acceptable way forward was to completely remove the BKL. Ingo explained:
"This task is not easy at all. 12 years after Linux has been converted to an SMP OS we still have 1300+ legacy BKL using sites. There are 400+ lock_kernel() critical sections and 800+ ioctls. They are spread out across rather difficult areas of often legacy code that few people understand and few people dare to touch. It takes top people like Alan Cox to map the semantics and to remove BKL code, and even for Alan (who is doing this for the TTY code) it is a long and difficult task."
Ingo went on to describe how the BKL works, how it differs from other locking mechanisms, and why this complicates removing it permanently from the kernel. He noted that the various dependencies of the lock are lost in the haze of 15 years of code changes, "all this has built up to a kind of Fear, Uncertainty and Doubt about the BKL: nobody really knows it, nobody really dares to touch it and code can break silently and subtly if BKL locking is wrong." He then suggested "changing the rules of the game", creating a "kill-the-BKL" branch which "turns the BKL into an ordinary albeit somewhat big mutex, with a quirky lock/unlock interface called 'lock_kernel()' and 'unlock_kernel()'."
"I'm please to announce [the] POHMEL high performance network filesystem. POHMELFS stands for Parallel Optimized Host Message Exchange Layered File System," began Evgeniy Polyakov, explaining:
"This is a high performance network filesystem with local coherent cache of data and metadata. Its main goal is distributed parallel processing of data. Network filesystem is a client transport. POHMELFS protocol was proven to be superior to NFS in lots (if not all, then it is in a roadmap) operations."
This latest release prompted Jeff Garzik to reply, "this continues to be a neat and interesting project :)" New features include fast transactions, round-robin failover, and near-wire limit performance. This adds to existing features which include a local coherent data and metadata cache, async processing of most events, and a fast and scalable multi threaded user space server. Planned features include a server extension to allow mirroring data across multiple devices, strong authentication, and possible data encryption when transferring data over the network. Evgeniy linked to several benchmarks in his blog.
"About 45% architecture updates (counting the include files too), about 30% drivers, and about 25% odds-and-ends. The odds-and-ends are mainly Documentation, filesystems (mostly cifs) and core kernel (scheduler updates etc)," said Linux creator Linus Torvalds, announcing the 2.6.26-rc2 kernel. He added, "if you read the shortlog and get the feeling that most of it is pretty boring small details, you'd be right. There is little exciting there." He continued:
"A fairly small part of it, but quite possibly the most noticeable one, is how the semaphore changes impacted the BKL (the old 'big kernel lock' that is still used for some legacy code, for you non-core people out there), which in the past had different versions ('regular', 'preemptable'). A few months ago we dropped the regular BKL version, but in 2.6.25-rc1 we then had performance (and then correctness) issues with the interaction between the semaphore implementation and the preemptable BKL, so we're back to the old regular version for now."
"So this merge window was somewhat rocky in the sense that there was a lot of arguments about it, but at the same time I at least personally think that from a technical angle, we had somewhat less scary stuff going on than has been almost the rule lately," noted Linux creator Linus Torvalds, announcing the 2.6.26-rc1 kernel. He continued:
"Lots of changes, but nothing that really feels all that fragile to me. Famous last words. I expect that the x86 PAT support (which has been long in the making) has the potential to have some issues, but the obvious problems were hashed out long ago, and while the merge window already showed one bug, that one was fairly benign and quickly fixed."
Linus highlighted, "another feature that is notable not for its size, but because people have tried to get me to merge it for some long is kgdb support. Which really turned out pretty small and clean, once people started putting their effort into making it so." He concluded, "so go out and test it. The diffstat and shortlogs are too big to post here (7500+ commits and the compressed full patch is 8.5MB in size), but one interesting tidbit I found was that during this *one* merge window, we had almost 800 different authors."
"This is starting to get beyond frustrating for me," complained David Miller of the latest merge window, launching what turned into a very lengthy and ongoing discussion about the Linux kernel development process. The concept of a regular "merge window" was first discussed in July of 2005 with the release of the 2.6.14-rc4 kernel, following the 2005 Developers' Summit. From 2.6.14 on, the release of each official 2.6.y kernel has been followed by a two week period during which major changes are merged into the kernel, followed by a 2.6.y-rc1 release. David complained that this particular merge window has been more painful than others, "the tree breaks every day, and it's becoming an extremely non-fun environment to work in. We need to slow down the merging, we need to review things more, we need people to test their [...] changes!"
During the lengthy discussion, Linux creator Linus Torvalds explained:
"The notion that we should even _try_ to aim to slow things down, that one I find unlikely to be true, and I don't even understand why anybody would find it a logical goal? Of course, you will have fewer new bugs if you have fewer changes. But that's not a goal, that's a tautology and totally uninteresting. A small program is likely to have fewer bugs, but that doesn't make something small 'better' than something large that does more. Similarly, a stagnant development community will introduce new bugs more seldom. But does that make a stagnant one better than a vibrant one? Hell no. So what I'm arguing against here is not that we should aim for worse quality, but I'm arguing against the false dichotomy of believing that quality is incompatible with lots of change."
"We are pleased to announce the official release of OpenBSD 4.3," began OpenBSD creator Theo de Raadt. "This is our 23rd release on CD-ROM (and 24th via FTP). We remain proud of OpenBSD's record of more than ten years with only two remote holes in the default install." He added, "as in our previous releases, 4.3 provides significant improvements, including new features, in nearly all areas of the system". Four platforms were listed as new or extended, including: sparc64 gained SMP support, "this should work on all supported systems, with the exception of the Sun Enterprise 10000"; hppa K-class servers are now supported; mvme88k gained SMP support on a couple of systems, and support for the 88110 processor was added. Numerous drivers were listed as new or improved, including a huge list of network drivers:
"The bge(4) driver now supports BCM5906/BCM5906M 10/100 and BCM5755 10/100/Gigabit Ethernet devices; the cas(4) driver now supports Cassini+ 10/100/Gigabit Ethernet devices; the em(4) driver now supports ICH9 10/100 and 10/100/Gigabit Ethernet devices; the gem(4) driver now supports the onboard 1000base-SX interface on the Sun Fire V880 server; the ixgb(4) driver now supports the Sun 10Gb PCI-X Ethernet devices; the msk(4) driver now supports Yukon FE+ 10/100 and Yukon Supreme 10/100/Gigabit Ethernet devices; the nfe(4) driver now supports MCP73, MCP77 and MCP79 10/100/Gigabit Ethernet devices; the ral(4) driver now supports RT2800 based wireless network devices; the cmpci(4) driver now supports CMI8768 based audio adapters; the it(4) driver now supports ITE IT8705F/8712F/8716F/8718F/8726F and SiS SiS950 ICs; new bwi(4) driver for the Broadcom AirForce IEEE 802.11b/g wireless network device; new et(4) driver for the Agere/LSI ET1310 10/100/Gigabit Ethernet device; new etphy(4) driver for the Agere/LSI ET1011 TruePHY Gigabit Ethernet PHY; new iwn(4) driver for the Intel Wireless WiFi Link 4965AGN IEEE 802.11a/b/g/Draft-N wireless network device; new upgt(4) driver for the Conexant/Intersil PrismGT SoftMAC USB IEEE 802.11b/g wireless network device."
A more complete list of changes can be found here. ONLamp also recently posted an interview titled, "Puffy and the Cryptonauts: What's New in OpenBSD 4.3". Theo noted, "profits from CD sales are the primary income source for the OpenBSD project -- in essence selling these CD-ROM units ensures that OpenBSD will continue to make another release six months from now."
"Btrfs v0.14 is now available for download," Chris Mason announced, adding, "please note the disk format has changed, and it is not compatible with older versions of Btrfs." The project has gained a new wiki home page on the kernel.org domain, where it is explained, "Btrfs is a new copy on write filesystem for Linux aimed at implementing advanced features while focusing on fault tolerance, repair and easy administration. Initially developed by Oracle, Btrfs is licensed under the GPL and open for contribution from anyone." Regarding the latest release, Chris explained:
"v0.14 has a few performance fixes and closes some races that could have allowed corrupted metadata in v0.13. The major new feature is the ability to manage multiple devices under a single Btrfs mount. Raid0, raid1 and raid10 are supported. Even for single device filesystems, metadata is now duplicated by default. Checksums are verified after reads finish and duplicate copies are used if the checksums don't match."
Chris offered links to multi-device benchmarks summarizing, "in general these numbers show that Btrfs does a good job at scaling to this storage configuration, and that is it on par with both HW raid and MD." Looking forward, he concluded, "next up on the Btrfs todo list is finishing off the device removal and IO error handling code. After that I'll add more fine grained locking to the btrees."
"I've put together an automatic system for applying kernel security patches to the Linux kernel without rebooting it, and I wanted to share this system with the community in case others find it useful or interesting," said Jeff Arnold, announcing ksplice. He explained, "the system takes as input a kernel security patch (which can be a unified diff taken directly from Linus' GIT tree) and the source code corresponding to the running kernel, and it automatically creates a set of kernel modules to perform the update. The running kernel does not need to have been customized in advance in any way." The project's website notes, "ksplice cannot handle semantic changes to data structures—that is, changes that would require existing instances of kernel data structures to be transformed." With this limitation, Jeff suggested ksplice is still able to automatically apply 84% of the kernel security patches released between May 2005 and December 2007. He continued:
"I've been pursuing this project because I don't like dealing with reboots whenever a new local kernel security vulnerability is discovered. The rebootless update practices/systems that are already out there require manually constructing an update (through a process that can be tricky and error-prone), and they tend to have other disadvantages as well (such as requiring a custom kernel, not handling inline functions properly, etc). This new system works on existing kernels, and it simply takes a unified diff as input and does the rest on its own."
Discussing the latest breakage of the linux-next tree, Stephen Rothwell noted that the problem went unnoticed due to the arm tree not currently being included, "this is why I would have liked you to participate in the linux-next tree ...". Arm maintainer Russell King questioned the usefulness, saying, "linux-next will not give me anything which -mm isn't giving me. As I said in the discussion, linux-next value is _very_ small for me. Sorry but true." Several stepped in to offer some reasons that the linux-next tree is useful.
Andrew Morton noted, "putting arm into linux-next means that Stephen (and git) handle the merges rather than having me (and not-git) do it. Which helps me. I expect that linux-next will get a lot more cross-compilation testing than -mm. Which helps you." Greg KH added, "getting your stuff into linux-next would provide a public place for others to base off of, making it easier for them to send patches to you ensuring that they apply properly. Which in the end, will help others be able to contribute easier, and help you by getting patches you do not need to rebase yourself." Stephen Rothwell summarized the advantages for a maintainer:
"5 times a week your tree gets merged with lots of other code destined for Linus' next release. From this you get to find out about things in other trees that clash with yours. This tree gets built on several architectures for several configs (including arm). So you find out if other trees will break yours. I am happy to build more (basically all) the arm configs as I have offered before."
Andrew Morton replied to a commit message making 4k stacks the default, saying, "this patch will cause kernels to crash." Ingo Molnar replied, "what mainline kernels crash and how will they crash? Fedora and other distros have had 4K stacks enabled for years." He added, "we've conducted tens of thousands of bootup tests with all sorts of drivers and kernel options enabled and have yet to see a single crash due to 4K stacks." During the lengthy discussion it was suggested that nfs+xfs+raid kernel configurations, and using ndiswrapper are the most common reasons for overflowing a 4K stack size.
Andi Kleen questioned the usefulness of 4k stacks, "as far as I can figure out they are not [a worthy goal]. They might have been a worthy goal on crappy 2.4 VMs, but these times are long gone." Arjan van de Ven suggested that though the 2.6 VM is much improved over the 2.4 VM, fragmentation with 8K stacks remains an unsolvable problem, "it's basic math; the Linux VM gets to deal with both short and long lasting allocations; no matter how hard you try to get some degree of fragmentation; especially due to the 15:1 acceleration you get due to the lowmem issue. And before you say 'you should use 64 bit on such machines'; I would love it if more people used 64 bit linux. Sadly the adoption rate of that is not very good still.... by far ;(" In another email, Arjan listed two advantages to 4K stacks, "1) less memory consumption in the lowmem zone (critical for enterprise use, also good for general performance), and 2) kernel stacks at 8K are one of the most prominent order-1 allocations in the kernel; again with big-memory systems the fragmentation of the lowmem zone is a problem (and the distros that ship 4K stacks went there because of customer complaints)".
"It's been long promised, but there it is now," began Linux creator Linus Torvalds, announcing the 2.6.25 Linux kernel. He continued, "special thanks to Ingo who found and fixed a nasty-looking regression that turned out to not be a regression at all, but an old bug that just had not been triggering as reliably before. That said, that was just the last particular regression fix I was holding things up for, and it's not like there weren't a lot of other fixes too, they just didn't end up being the final things that triggered my particular worries." Linus added:
"The full changelog from 2.6.24 is 7.5M, with a 12MB compressed patch. Tons and tons has changed, but if you've been following the -rc releases, you'll already know about the big things. The changes from the last rc (-rc9) are fairly small and mostly pretty trivial, and the shortlog is appended. So it's mostly one-liners, with some updates to drivers (net and usb) and to networking that are a bit larger (although a number of the driver updates are things like just new ID's etc)."
More information about the latest release can be found on the KernelNewbies Linux 2.6.25 wiki page.
"Finally found it ... the patch below solves the sparsemem crash and the test system boots up fine now," announced Ingo Molnar. He described the patch as fixing a "memory corruption and crash on 32-bit x86 systems. If a !PAE x86 kernel is booted on a 32-bit system with more than 4GB of RAM, then we call memory_present() with a start/end that goes outside the scope of MAX_PHYSMEM_BITS." He included a source snippet with the loop that could corrupt memory, "depending on what that memory is, we might crash, misbehave or just not notice the bug." Ingo went on to note that the bug was first introduced with sparsemem support in the 2.6.16 kernel:
"I believe this was the reason why my many bisection attempts were unsuccessful: the bug pattern was not stable and seemingly working kernels had the memory corruption too. It was pure luck that v2.6.24 'worked' and v2.6.25-rc9 broke visibly."
Linux creator Linus Torvalds replied, "good job. I've pushed this out, and will let this simmer at least overnight to see if there are any brown-paper-bag issues (either with this or with some last changes from Andrew), but I'm happy, and I think I'll do the real 2.6.25 tomorrow."