"I've finally got some numbers to go along with the Btrfs variable blocksize feature. The basic idea is to create a read/write interface to map a range of bytes on the address space, and use it in Btrfs for all metadata operations (file operations have always been extent based)," explained Chris Mason in a recent posting to the Linux Filesystem Development mailing list. He linked to some benchmark results and summarized, "the first round of benchmarking shows that larger block sizes do consume more CPU, especially in metadata intensive workloads, but overall read speeds are much better." Chris then noted, "Dave reported that XFS saw much higher write throughput with large blocksizes, but so far I'm seeing the most benefits during reads." David Chinner replied, "the basic conclusion is that different filesystems will benefit in different ways with large block sizes...." explaining:
"Btrfs linearises writes due to it's COW behaviour and this is trades off read speed. i.e. we take more seeks to read data so we can keep the write speed high. By using large blocks, you're reducing the number of seeks needed to find anything, and hence the read speed will increase. Write speed will be pretty much unchanged because btrfs does linear writes no matter the block size.
"XFS doesn't linearise writes and optimises it's layout for a large number of disks and a low number of seeks on reads - the opposite of btrfs. Hence large block sizes reduce the number of writes XFS needs to write a given set of data+metadata and hence write speed increases much more than the read speed (until you get to large tree traversals)."
"As far as my testsystem goes, v2.6.23 beats v2.6.22.9 in sysbench," explained Ingo Molnar in response to a posting showing the opposite results. He referred to his own testing results and explained:
"As you can see it in the graph, v2.6.23 schedules much more consistently too. [ v2.6.22 has a small (but potentially statistically insignificant) edge at 4-6 clients, and CFS has a slightly better peak (which is statistically insignificant)."
Ingo noted that he was nuable to find information as to how the other benchmark was generated, "there are no .configs or other testing details at or around that URL that i could use to reproduce their result precisely, so at least a minimal bugreport would be nice." He then offered some tips on how sysbench works and some suggested tunings, "sysbench is a pretty 'batched' workload: it benefits most from batchy scheduling: the client doing as much work as it can, then server doing as much work as it can - and so on. The longer the client can work the more cache-efficient the workload is. Any round-trip to the server due to pesky preemption only blows up the cache footprint of the workload and gives lower throughput."
Andrew Doran posted some threading benchmark results to NetBSD's tech-kern mailing list, following up to some benchmarks he'd posted earlier. The results compared NetBSD -current with FreeBSD -current, and the Linux 2.6.21 kernel. Kris Kennaway was surprised by the results, and ran his own benchmarks with minimal configuration changes, summarizing, "this measurement shows that FreeBSD is performing 70-80% better than NetBSD in this 4 CPU configuration. This is in contrast to Andrew's findings which seem to show NetBSD performing 10% better than FreeBSD on a 4 CPU system (a very old one though)." He added, "the drop-off above 8 threads on FreeBSD is due to non-scalability of mysql itself. i.e. it comes from pthread mutex contention in userland."
Kris ran additional benchmarks with PostgreSQL instead of MySQL, showing much improved scalability above 8 threads, "postgresql is much more scalable than mysql on this workload and doesn't have silly scaling bottlenecks inside the application (cf the tail of the FreeBSD curve for mysql which is where pthread mutex contention kicked in)." He continued his testing, and found that on older 4CPU P3 hardware NetBSD did outperform FreeBSD, "but only by 3-4% (in particular I am not seeing the ~10% difference that Andrew observes on his 4*p3 700MHz). Given the age of the hardware and the fact that I am not seeing it on other workloads or on modern hardware it might just be due to a small scheduling difference on this configuration."
A potential bug reported against the Completely Fair Scheduler suggested that it was causing a network slowdown, measured with the 'Iperf' bandwidth performance benchmarking tool. The performance hit was quickly tracked to the previously discussed changes in how CFS handles sched_yield(). When it was suggested that this was a bug in the new process scheduler, Ingo explained:
"I had a quick look at the source code, and the reason for that weird yield usage was that there's a locking bug in iperf's 'Reporter thread' abstraction and apparently instead of fixing the bug it was worked around via a horrible yield() based user-space lock."
He then submit a small patch to fix the bug and remove the call to sched_yield() resulting in, "iperf uses _much_ less CPU time. On my Core2Duo test system, before the patch it used up 100% CPU time to saturate 1 gigabit of network traffic to another box. With the patch applied it now uses 9% of CPU time." He added playfully, "sched_yield() is almost always the symptom of broken locking or other bug. In that sense CFS does the right thing by exposing such bugs =B-)" Stephen Hemminger pointed out that a similar patch had been submitted to the Iperf project last month as it caused an identical problem with FreeBSD's scheduler.
"After posting some benchmarks involving cfs, I got some feedback, so I decided to do a follow-up that'll hopefully fill in the gaps many people wanted to see filled," Rob Hussey began. He added, "this time around I've done the benchmarks against 2.6.21, 2.6.22-ck1, and 2.6.23-rc6-cfs-devel (latest git as of 12 hours ago)." Rob briefly summarized, "the only analysis I'll offer is that both sd and cfs are improvements, and I'm glad that there is a lot of work being done in this area of linux development. Much respect to Con Kolivas, Ingo Molnar, and Roman Zippel, as well all the others who have contributed."
Referring to a chart in which the blue line represented the CFS process scheduler, and the green line represented the SD "staircase" process scheduler, Ingo Molnar noted, "heh - am i the only one impressed by the consistency of the blue line in this graph? :-) [ and the green line looks a bit like a .. staircase? ]" He acknowledged some slowdown in CFS compared to SD in one of the benchmarks, "-ck1 is 0.8% faster in this particular test." Ingo then explained, "many things happened between 2.6.22-ck1 and 2.6.23-cfs-devel that could affect performance of this test. My initial guess would be sched_clock() overhead." In further testing he applied a low-res-sched-clock that resulted in better performance for CFS leading him to conclude, "the performance difference between -ck and -cfs-devel seems to be mostly down to the more precise (but slower) sched_clock() introduced in v2.6.23 and to the startup penalty of freshly created tasks." When asked if the low-res-sched-clock was likely to be merged, Ingo replied:
"I don't think so - we want precise/accurate scheduling before performance. (otherwise tasks working off the timer tick could steal away cycles without being accounted for them fairly, and could starve out all other tasks.) Unless the difference was really huge in real life - but it isn't."
"Looking at these graphs (and the fixed one from your second email), it sure looks a lot like CFS is doing at *least* as well as the old scheduler in every single test, and doing much better in most of them (in addition it's much more consistent between runs)," Kyle Moffett noted regarding recent benchmarks run against the Completely Fair Scheduler by Rob Hussey. Kyle continued:
"This seems to jive with all the other benchmarks and overall empirical testing that everyone has been doing. Overall I have to say a job well done for Ingo, Peter, Con, and all the other major contributors to this impressive endeavor."
Nick Piggin used 'git bisect' to track a lmbench regression to the main CFS commit, leading to an interesting discussion between Nick and Ingo Molnar. Ultimately the regression was tracked down to the temporary configurability of the scheduler while it is tuned for optimal performance, "one reason for the extra overhead is the current tunability of CFS, but that is not fundamental, it's caused by the many knobs that CFS has at the moment." The solution, already coded but not yet merged in the mainline kernel "changes those knobs to constants, allowing the compiler to optimize the math better and reduce code size," and as a result result, "CFS can be faster at micro-context-switching than 2.6.22."
Ingo described the lmbench configuration in question as a "micro-benchmark", and noted that with a macro-benchmark better performance was more pronounced, "because with CFS the _quality_ of scheduling decisions has increased. So even if we had increased micro-costs (which we wont have once the current tuning period is over and we cast the CFS parameters into constants), the quality of macro-scheduling can offset that, and not only on the desktop!" He summarized, "that's why our main focus in CFS was on the macro-properties of scheduling _first_, and then the micro-properties are adjusted to the macro-constraints as a second layer."
Some of the concerns expressed about the Completely Fair Scheduler were reports that it might not handle 3D games as well as the SD scheduler. In a recent thread, Ingo Molnar noted, "people are regularly testing 3D smoothness, and they find CFS good enough and that matches my experience as well (as limited as it may be). In general my impression is that CFS and SD are roughly on par when it comes to 3D smoothness." He noted that all known regressions were reported against earlier versions of CFS that had long since been fixed, and that he was very interested in any new reports of regressions against the current version of the code, "what is more interesting (to me) is not the positive CFS feedback but negative CFS feedback (although positive feedback certain _feels_ good so don't hold it back intentionally ;-)," adding, "there are no open 3D related regressions for CFS at the moment." Ingo then offered benchmarks illustrating the improved 3D performance of CFS, with numbers showing it to perform as well and in some cases considerably better than the SD scheduler.
Linus Torvalds noted, "I don't think _any_ scheduler is perfect, and almost all of the time, the RightAnswer(tm) ends up being not 'one or the other', but 'somewhere in between'." He noted that he was confident that he'd made the right decision in merging CFS, then added, "but at the same time, no technical decision is ever written in stone. It's all a balancing act. I've replaced the scheduler before, I'm 100% sure we'll replace it again. Schedulers are actually not at all that important in the end: they are a very very small detail in the kernel."
Ingo Molnar [interview] posted a set of 11 patches introducing "the first release of the 'Syslet' kernel feature and kernel subsystem, which provides generic asynchrous system call support". Ingo explains:
"Syslets are small, simple, lightweight programs (consisting of system-calls, 'atoms') that the kernel can execute autonomously (and, not the least, asynchronously), without having to exit back into user-space. Syslets can be freely constructed and submitted by any unprivileged user-space context - and they have access to all the resources (and only those resources) that the original context has access to."
Ingo goes on in his email to explain in greater detail how syslets work, then adds, "as it might be obvious to some of you, the syslet subsystem takes many ideas and experience from my Tux in-kernel webserver :) The syslet code originates from a heavy rewrite of the Tux-atom and the Tux-cachemiss infrastructure." He also offered some benchmark results, showing a 33.9% speedup comparing uncached synchronous IO to syslets, and a 19.2% speedup comparing cached synchronous IO to syslets, "so syslets, in this particular workload, are a nice speedup /both/ in the uncached and in the cached case. (note that i used only a single disk, so the level of parallelism in the hardware is quite limited.)"
Mike Benoit recently posted a link to results from his new and improved file system shootout, using better hardware and running more tests. Using two benchmarks that are designed to measure hard drive and file system performance, Bonnie++ and IOZone, he's compared a number journaling filesystems found in the 2.6 kernel [forum]. Included in the lineup are EXT2 (not journaling, but an effective baseline [story]), JFS, XFS, ReiserFS, Reiser4, and EXT3 each compared head to head on both SCSI and IDE drives.
In Mike's summary he labels JFS and XFS as 'best bang for your buck' explaining, "While not the fastest file systems, both of them consistently perform close to EXT2, while using minimal CPU. XFS seems to be faster over a wider range of benchmarks, however it does use slightly more CPU than JFS. While JFS really starts to slow down with lots of files." As for pure speed, Mike points to Reiser4 which really shined in the Bonnie++ benchmarks, though not quite so much in the IOZone benchmarks. He suggests, "ReiserFS v4 will [definitely] be worth while keeping an eye on, especially considering some of the exciting new features it offers."
Grant Miner posted some interesting benchmark results to the lkml, comparing five journaling filesystems available with the current 2.6.0-test2 development kernel. The tests were conducted with a very simple shell script, mainly timing how long it takes to copy, tar, and remove directories, performing several syncs in between. He summarizes:
- ext3's syncs tended to take the longest [at] 10 seconds, except
- JFS took a whopping 38.18s on its final sync
- xfs used more CPU than ext3 but was slower than ext3
- reiser4 had highest throughput and most CPU usage
- jfs had lowest throughput and least CPU usage
Some interesting discussion follows, debating the results and offering further suggestions on making the tests more useful. For example, Andrew Morton [interview] proposed including ext2 in the tests as a baseline, and Hans Reiser noted that reiser4 continues to improve rapidly. Read on for the full test results and much of the following discussion.