Anticipatory I/O scheduler author Nick Piggin [interview] has been maintaining a lockless pagecache patch for many months. In March he provided a draft document that details the lockless radix-tree utilized by the pagecache [pdf], in which it describes the Read-Copy Update (RCU) algorithm used to share a dynamic data structure without using locks. Jens Axboe recently posted some benchmark results demonstrating some workloads that benefit from the lockless pagecache.
Andrew Morton reviewed Jens' tests and commented, "that doesn't sound like something which a real application is likely to do ;)", going on to add, "although the graph looks good, I wouldn't view this as a super-strong argument in favour of [a] lockless pagecache." Linus Torvalds noted, "true, but on the other hand, it does kind of 'distill' one (small) part of something that real apps _are_ likely to do," adding, "the question, of course, is whether the part that remains (the actual page lookup) is important enough to matter, once it is part of a bigger chain in a real application."
Another concern Andrew expressed was the complexity of the lockless pagecache. When discussing the benchmark results, Jens asked, "are there cases where the lockless page cache performs worse than the current one?" Andrew retorted, "yeah - when human beings try to understand and maintain it. The usual tradeoffs apply ;)" Nick acknowledge that the lockless pagecache code is complex, especially the RCU radix tree, "but I hope it isn't _too_ bad. It is basically a dozen line function at the core, and that gets used to implement find_get_page, find_lock_page. Their semantics remain the same, so that's where the line is drawn (plus minor things, like an addition for reclaim's remove-from-pagecache protocol)."
From: Jens Axboe [email blocked] To: linux-kernel Subject: Lockless page cache test results Date: Wed, 26 Apr 2006 15:53:10 +0200 Hi, Running a splice benchmark on a 4-way IPF box, I decided to give the lockless page cache patches from Nick a spin. I've attached the results as a png, it pretty much speaks for itself.The test in question splices a 1GiB file to a pipe and then splices that to some output. Normally that output would be something interesting, in this case it's simply /dev/null. So it tests the input side of things only, which is what I wanted to do here. To get adequate runtime, the operation is repeated a number of times (120 in this example). The benchmark does that number of loops with 1, 2, 3, and 4 clients each pinned to a private CPU. The pinning is mainly done for more stable results. -- Jens Axboe
From: Nick Piggin [email blocked] Subject: Re: Lockless page cache test results Date: Thu, 27 Apr 2006 00:43:32 +1000 Jens Axboe wrote: > Hi, > > Running a splice benchmark on a 4-way IPF box, I decided to give the > lockless page cache patches from Nick a spin. I've attached the results > as a png, it pretty much speaks for itself. > > The test in question splices a 1GiB file to a pipe and then splices that > to some output. Normally that output would be something interesting, in > this case it's simply /dev/null. So it tests the input side of things > only, which is what I wanted to do here. To get adequate runtime, the > operation is repeated a number of times (120 in this example). The > benchmark does that number of loops with 1, 2, 3, and 4 clients each > pinned to a private CPU. The pinning is mainly done for more stable > results. Thanks Jens! It's interesting, single threaded performance is down a little. Is this significant? In some other results you showed me with 3 splices each running on their own file (ie. no tree_lock contention), lockless looked slightly faster on the same machine. In my microbenchmarks, single threaded lockless is quite a bit faster than vanilla on both P4 and G5. It could well be that the speculative get_page operation is naturally a bit slower on Itanium CPUs -- there is a different mix of barriers, reads, writes, etc. If only someone gave me an IPF system... ;) As you said, it would be nice to see how this goes when the other end are 4 gigabit pipes or so... And then things like specweb and file serving workloads. Nick -- SUSE Labs, Novell Inc.
From: Chen, Kenneth W [email blocked] Subject: RE: Lockless page cache test results Date: Wed, 26 Apr 2006 22:39:30 -0700 Jens Axboe wrote on Wednesday, April 26, 2006 12:46 PM > > It's interesting, single threaded performance is down a little. Is > > this significant? In some other results you showed me with 3 splices > > each running on their own file (ie. no tree_lock contention), lockless > > looked slightly faster on the same machine. > > I can do the same numbers on a 2-way em64t for comparison, that should > get us a little better coverage. I throw the lockless patch and Jens splice-bench into our benchmark harness, here are the numbers I collected, on the following hardware: (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2 (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3 (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core) Here are the graph: (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2 http://kernel-perf.sourceforge.net/splice/2P-3.4Ghz.png (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3 http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz.png (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core) http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz-DCHT.png (4) everything on one graph: http://kernel-perf.sourceforge.net/splice/splice.png - Ken
From: Nick Piggin [email blocked] Subject: Re: Lockless page cache test results Date: Thu, 27 Apr 2006 16:07:34 +1000 Chen, Kenneth W wrote: > Jens Axboe wrote on Wednesday, April 26, 2006 12:46 PM > >>>It's interesting, single threaded performance is down a little. Is >>>this significant? In some other results you showed me with 3 splices >>>each running on their own file (ie. no tree_lock contention), lockless >>>looked slightly faster on the same machine. >> >>I can do the same numbers on a 2-way em64t for comparison, that should >>get us a little better coverage. > > > > I throw the lockless patch and Jens splice-bench into our benchmark harness, > here are the numbers I collected, on the following hardware: > > (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2 > (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3 > (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core) > > Here are the graph: Thanks a lot Ken. So pagecache lookup performance goes up about 15-25% in single threaded tests on your P4s. Phew, I wasn't dreaming it. It is a pity that ipf hasn't improved similarly (and even slowed down a bit, if Jens' numbers are significant to that range). Next time I spend some cycles on lockless pagecache, I'll try to scrounge an ipf and see if I can't improve it (I don't expect miracles). Thanks, Nick -- SUSE Labs, Novell Inc. Send instant messages to your online friends http://au.messenger.yahoo.com
From: Andi Kleen [email blocked] Subject: Re: Lockless page cache test results Date: Thu, 27 Apr 2006 08:15:18 +0200 On Thursday 27 April 2006 07:39, Chen, Kenneth W wrote: > (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2 > http://kernel-perf.sourceforge.net/splice/2P-3.4Ghz.png > > (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3 > http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz.png > > (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core) > http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz-DCHT.png > > (4) everything on one graph: > http://kernel-perf.sourceforge.net/splice/splice.png Looks like a clear improvement for lockless unless I'm misreading the graphs. (Can you please use different colors next time?) -Andi
From: Andi Kleen [email blocked] Subject: Re: Lockless page cache test results Date: Thu, 27 Apr 2006 08:15:18 +0200 On Thursday 27 April 2006 07:39, Chen, Kenneth W wrote: > (1) 2P Intel Xeon, 3.4 GHz/HT, 2M L2 > http://kernel-perf.sourceforge.net/splice/2P-3.4Ghz.png > > (2) 4P Intel Xeon, 3.0 GHz/HT, 8M L3 > http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz.png > > (3) 4P Intel Xeon, 3.0 GHz/DC/HT, 2M L2 (per core) > http://kernel-perf.sourceforge.net/splice/4P-3.0Ghz-DCHT.png > > (4) everything on one graph: > http://kernel-perf.sourceforge.net/splice/splice.png Looks like a clear improvement for lockless unless I'm misreading the graphs. (Can you please use different colors next time?) -Andi
From: Andrew Morton [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 09:55:11 -0700 Jens Axboe [email blocked] wrote: > > Running a splice benchmark on a 4-way IPF box, I decided to give the > lockless page cache patches from Nick a spin. I've attached the results > as a png, it pretty much speaks for itself. It does. What does the test do? In particular, does it cause the kernel to take tree_lock once per page, or once per batch-of-pages?
From: Jens Axboe [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 19:42:35 +0200 On Wed, Apr 26 2006, Andrew Morton wrote: > Jens Axboe [email blocked] wrote: > > > > Running a splice benchmark on a 4-way IPF box, I decided to give the > > lockless page cache patches from Nick a spin. I've attached the results > > as a png, it pretty much speaks for itself. > > It does. > > What does the test do? > > In particular, does it cause the kernel to take tree_lock once per > page, or once per batch-of-pages? Once per page, it's basically exercising the generic_file_splice_read() path. Basically X number of "clients" open the same file, and fill those pages into a pipe using splice. The output end of the pipe is then spliced to /dev/null to toss it away again. The top of the 4-client vanilla run profile looks like this: samples % symbol name 65328 47.8972 find_get_page Basically the machine is fully pegged, about 7% idle time. We can speedup the lookups with find_get_pages(). The test does 64k max, so with luck we should be able to pull 16 pages in at the time. I'll try and run such a test. But boy I wish find_get_pages_contig() was there for that. I think I'd prefer adding that instead of coding that logic in splice, it can get a little tricky. -- Jens Axboe
From: Andrew Morton [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 11:10:54 -0700 Jens Axboe [email blocked] wrote: > > On Wed, Apr 26 2006, Andrew Morton wrote: > > Jens Axboe [email blocked] wrote: > > > > > > Running a splice benchmark on a 4-way IPF box, I decided to give the > > > lockless page cache patches from Nick a spin. I've attached the results > > > as a png, it pretty much speaks for itself. > > > > It does. > > > > What does the test do? > > > > In particular, does it cause the kernel to take tree_lock once per > > page, or once per batch-of-pages? > > Once per page, it's basically exercising the generic_file_splice_read() > path. Basically X number of "clients" open the same file, and fill those > pages into a pipe using splice. The output end of the pipe is then > spliced to /dev/null to toss it away again. OK. That doesn't sound like something which a real application is likely to do ;) > The top of the 4-client > vanilla run profile looks like this: > > samples % symbol name > 65328 47.8972 find_get_page > > Basically the machine is fully pegged, about 7% idle time. Most of the time an acquisition of tree_lock is associated with a disk read, or a page-size memset, or a page-size memcpy. And often an acquisition of tree_lock is associated with multiple pages, not just a single page. So although the graph looks good, I wouldn't view this as a super-strong argument in favour of lockless pagecache. > We can speedup the lookups with find_get_pages(). The test does 64k max, > so with luck we should be able to pull 16 pages in at the time. I'll try > and run such a test. OK. > But boy I wish find_get_pages_contig() was there > for that. I think I'd prefer adding that instead of coding that logic in > splice, it can get a little tricky. I guess it'd make sense - we haven't had a need for such a thing before. umm, something like... unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, unsigned int nr_pages, struct page **pages) { unsigned int i; unsigned int ret; pgoff_t index = start; read_lock_irq(&mapping->tree_lock); ret = radix_tree_gang_lookup(&mapping->page_tree, (void **)pages, start, nr_pages); for (i = 0; i < ret; i++) { if (pages[i]->mapping == NULL || pages[i]->index != index) break; page_cache_get(pages[i]); index++; } read_unlock_irq(&mapping->tree_lock); return i; }
From: Jens Axboe [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 20:23:23 +0200 On Wed, Apr 26 2006, Andrew Morton wrote: > Jens Axboe [email blocked] wrote: > > > > On Wed, Apr 26 2006, Andrew Morton wrote: > > > Jens Axboe [email blocked] wrote: > > > > > > > > Running a splice benchmark on a 4-way IPF box, I decided to give the > > > > lockless page cache patches from Nick a spin. I've attached the results > > > > as a png, it pretty much speaks for itself. > > > > > > It does. > > > > > > What does the test do? > > > > > > In particular, does it cause the kernel to take tree_lock once per > > > page, or once per batch-of-pages? > > > > Once per page, it's basically exercising the generic_file_splice_read() > > path. Basically X number of "clients" open the same file, and fill those > > pages into a pipe using splice. The output end of the pipe is then > > spliced to /dev/null to toss it away again. > > OK. That doesn't sound like something which a real application is likely > to do ;) I don't think it's totally unlikely. Could be streaming a large media file to many clients, for instance. Of course you are not going to push gigabytes of data per seconds like this test, but it's still the same type of workload. > > The top of the 4-client > > vanilla run profile looks like this: > > > > samples % symbol name > > 65328 47.8972 find_get_page > > > > Basically the machine is fully pegged, about 7% idle time. > > Most of the time an acquisition of tree_lock is associated with a disk > read, or a page-size memset, or a page-size memcpy. And often an > acquisition of tree_lock is associated with multiple pages, not just a > single page. Yeah with mostly io then I'd be hard pressed to show a difference. > So although the graph looks good, I wouldn't view this as a super-strong > argument in favour of lockless pagecache. I didn't claim it was, just trying to show some data on at least one case where the lockless patches perform well and the stock kernel does not :-) Are there cases where the lockless page cache performs worse than the current one? > > But boy I wish find_get_pages_contig() was there > > for that. I think I'd prefer adding that instead of coding that logic in > > splice, it can get a little tricky. > > I guess it'd make sense - we haven't had a need for such a thing before. > > umm, something like... > > unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start, > unsigned int nr_pages, struct page **pages) > { > unsigned int i; > unsigned int ret; > pgoff_t index = start; > > read_lock_irq(&mapping->tree_lock); > ret = radix_tree_gang_lookup(&mapping->page_tree, > (void **)pages, start, nr_pages); > for (i = 0; i < ret; i++) { > if (pages[i]->mapping == NULL || pages[i]->index != index) > break; > page_cache_get(pages[i]); > index++; > } > read_unlock_irq(&mapping->tree_lock); > return i; > } Ah clever, I didn't think of stopping on the first hole. Works well since you need to manually get a reference on the page anyways. Let me redo the numbers with this splice updated. -- Jens Axboe
From: Andrew Morton [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 11:46:49 -0700 Jens Axboe [email blocked] wrote: > > Are there cases where the lockless page cache performs worse than the > current one? Yeah - when human beings try to understand and maintain it. The usual tradeoffs apply ;)
From: Nick Piggin [email blocked] Subject: Re: Lockless page cache test results Date: Thu, 27 Apr 2006 15:58:13 +1000 Andrew Morton wrote: > Jens Axboe [email blocked] wrote: > >>Are there cases where the lockless page cache performs worse than the >>current one? > > > Yeah - when human beings try to understand and maintain it. Have any tried yet? ;) I won't deny it is complex (because I don't like when I make the same point and people go on to take great trouble to convince me how simple it is!). But I hope it isn't _too_ bad. It is basically a dozen line function at the core, and that gets used to implement find_get_page, find_lock_page. Their semantics remain the same, so that's where the line is drawn (plus minor things, like an addition for reclaim's remove-from-pagecache protocol). IMO the rcu radix tree is probably the most complex bit... but that pales in comparison to things like our prio tree, or RCU trie. > > The usual tradeoffs apply ;) Definitely. It isn't fun if you just take the patch and merge it. -- SUSE Labs, Novell Inc.
From: Linus Torvalds [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 12:00:37 -0700 (PDT) On Wed, 26 Apr 2006, Andrew Morton wrote: > Jens Axboe [email blocked] wrote: > > > > Once per page, it's basically exercising the generic_file_splice_read() > > path. Basically X number of "clients" open the same file, and fill those > > pages into a pipe using splice. The output end of the pipe is then > > spliced to /dev/null to toss it away again. > > OK. That doesn't sound like something which a real application is likely > to do ;) True, but on the other hand, it does kind of "distill" one (small) part of something that real apps _are_ likely to do. The whole 'splice to /dev/null' part can be seen as totally irrelevant, but at the same time a way to ignore all the other parts of normal page cache usage (ie the other parts of page cache usage tend to be the "map it into user space" or the actual "memcpy_to/from_user()" or the "TCP send" part). The question, of course, is whether the part that remains (the actual page lookup) is important enough to matter, once it is part of a bigger chain in a real application. In other words, the splice() thing is just a way to isolate one part of a chain that is usually much more involved, and micro-benchmark just that one part. Splice itself can be optimized to do the lookup locking only once per N pages (where N currently is on the order of ~16), but that may not be as easy for some other paths (ie the normal read path). And the "reading from the same file in multiple threads" _is_ a real load. It may sound stupid, but it would happen for any server that has a lot of locality across clients (and that's very much true for web-servers, for example). That said, under most real loads, the page cach elookup is obviously always going to be just a tiny tiny part (as shown by the fact that Jens quotes 35 GB/s throughput - possible only because splice to /dev/null doesn't need to actually ever even _touch_ the data). The fact that it drops to "just" 3GB/s for four clients is somewhat interesting, though, since that does put a limit on how well we can serve the same file (of course, 3GB/s is still a lot faster than any modern network will ever be able to push things around, but it's getting closer to the possibilities for real hardware (ie IB over PCI-X should be able to do about 1GB/s in "real life") So the fact that basically just lookup/locking overhead can limit things to 3GB/s is absolutely not totally uninteresting. Even if in practice there are other limits that would probably hit us much earlier. It would be interesting to see where doing gang-lookup moves the target, but on the other hand, with smaller files (and small files are still common), gang lookup isn't going to help as much. Of course, with small files, the actual filename lookup is likely to be the real limiter. Linus
From: Jens Axboe [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 21:15:58 +0200 On Wed, Apr 26 2006, Linus Torvalds wrote: > > > On Wed, 26 Apr 2006, Andrew Morton wrote: > > > Jens Axboe [email blocked] wrote: > > > > > > Once per page, it's basically exercising the generic_file_splice_read() > > > path. Basically X number of "clients" open the same file, and fill those > > > pages into a pipe using splice. The output end of the pipe is then > > > spliced to /dev/null to toss it away again. > > > > OK. That doesn't sound like something which a real application is likely > > to do ;) > > True, but on the other hand, it does kind of "distill" one (small) part of > something that real apps _are_ likely to do. > > The whole 'splice to /dev/null' part can be seen as totally irrelevant, > but at the same time a way to ignore all the other parts of normal page > cache usage (ie the other parts of page cache usage tend to be the "map it > into user space" or the actual "memcpy_to/from_user()" or the "TCP send" > part). > > The question, of course, is whether the part that remains (the actual page > lookup) is important enough to matter, once it is part of a bigger chain > in a real application. > > In other words, the splice() thing is just a way to isolate one part of a > chain that is usually much more involved, and micro-benchmark just that > one part. Nick called it a find_get_page() micro benchmark, which is pretty might spot on. So naturally it shows the absolute best side of the lockless page cache, but that is also very interesting. The /dev/null output can just be seen as a "infinitely" fast output method, both from a throughput and light weight POV. > It would be interesting to see where doing gang-lookup moves the target, > but on the other hand, with smaller files (and small files are still > common), gang lookup isn't going to help as much. With a 16-page gang lookup in splice, the top profile for the 4-client case (which is now at 4GiB/sec instead of 3) are: samples % symbol name 30396 36.7217 __do_page_cache_readahead 25843 31.2212 find_get_pages_contig 9699 11.7174 default_idle Even disregarding that readahead contender that could probably be made a little more clever, we are still spending an awful lot of time in the page lookup. I didn't mention this before, but the get_page/put_page overhead is also a lot smaller with the lockless patches. -- Jens Axboe
From: Andrew Morton [email blocked] Subject: Re: Lockless page cache test results Date: Wed, 26 Apr 2006 13:12:00 -0700 Jens Axboe [email blocked] wrote: > > With a 16-page gang lookup in splice, the top profile for the 4-client > case (which is now at 4GiB/sec instead of 3) are: > > samples % symbol name > 30396 36.7217 __do_page_cache_readahead > 25843 31.2212 find_get_pages_contig > 9699 11.7174 default_idle __do_page_cache_readahead() should use gang lookup. We never got around to that, mainly because nothing really demonstrated a need. It's a problem that __do_page_cache_readahead() is being called at all - with everything in pagecache we should be auto-turning-off readahead. This happens because splice is calling the low-level do_pagecache_readahead(). If you convert it to use page_cache_readahead(), that will all vanish if readahead is working right.
lockless patch makes differen
lockless patch makes difference only on SMP right ?
or on UP too ?
It may affect preempt-compile
It may affect preempt-compiled kernels also, but I don't know for sure.
Does help UP
It should help UP too.
Ken Chen benchmarked single threaded performance:
http://kernel-perf.sourceforge.net/splice/splice.png
This is with a SMP compiled kernel, however my microbenchmarks on
P4 and G5 systems suggest that the speedup for on UP kernels should
be similar to the speedup Ken's single CPU tests (1 on the x-axis).
Errr...
Unless I'm mistaken, that graph only has 2P and 4P performance. The only gain on UP should be for preempt, unless I'm mistaken.
Right
The graph only has 2P and 4P machines, however each of those is
tested in a single client configuration - ie. only one CPU in
use.
Due to Intel's FSB type interconnect, this should be fairly similar
to the same workload on a similar single CPU system. From there, I
further guestimated that UP kernel should show similar speedups to
SMP kernel.
The gain for non preempt UP kernels is that it saves an irqs-off
which is pretty expensive on my P4 and G5.
Didn't realize IRQ-off was expensive.
Wow. I didn't realize that it was that expensive to disable interrupts on the P4 and the G5. I'm spoiled. I work with DSPs most of the time. The one I'm most closely involved with has DINT/RINT instructions that have no delay and still allow you to issue 7 other instructions that cycle.
(In case you're curious, RINT means "Restore INTerrupt enable," allowing you to nest DINT/RINT pairs.)