Published on KernelTrap (http://kerneltrap.org)

Linux: vmsplice() versus COW

By Jeremy
Created Apr 21 2006 - 10:48

While explaining the new splice() and tee() buffer management system calls [story [1]], Linus Torvalds made reference to some possible future extensions. These included vmsplice(), a system call since implemented by Jens Axboe "to basically do a 'write to the buffer', but using the reference counting and VM traversal to actually fill the buffer." Reviewing the implications of such a system call led to a comparison with FreeBSD's ZERO_COPY_SOCKET, which uses COW (copy on write).
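
To make the "write to the buffer" idea concrete, here is a minimal sketch in C of handing a user buffer to a pipe with vmsplice() and reading it back with an ordinary read(). It assumes a Linux system with glibc; the helper name `vmsplice_roundtrip` is hypothetical and is not taken from the thread or from Jens Axboe's patch.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* vmsplice() */
#include <string.h>
#include <sys/uio.h>    /* struct iovec */
#include <unistd.h>

/* Hypothetical helper: hand `len` bytes at `buf` to a fresh pipe with
 * vmsplice(), then read them back out with read(). Returns the number of
 * bytes read into `out`, or -1 on error. */
static ssize_t vmsplice_roundtrip(const char *buf, size_t len,
                                  char *out, size_t out_len)
{
    int fds[2];
    ssize_t got = -1;

    assert(buf != NULL && out != NULL);
    if (pipe(fds) < 0)
        return -1;

    /* The pipe ends up referencing the user pages instead of copying them. */
    struct iovec iov = { .iov_base = (void *)buf, .iov_len = len };
    ssize_t sent = vmsplice(fds[1], &iov, 1, 0);
    if (sent > 0) {
        size_t want = out_len < (size_t)sent ? out_len : (size_t)sent;
        got = read(fds[0], out, want);
    }

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

Here the buffer is only safe to reuse because the same function drains the pipe before returning. In a real producer the caller has to know by some other means that the consumer is done with the pages.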

Linus explained that while this may look good on specific benchmarks, it actually introduces extra overhead: "the thing is, the cost of marking things COW is not just the cost of the initial page table invalidate: it's also the cost of the fault eventually when you _do_ write to the page, even if at that point you decide that the page is no longer shared, and the fault can just mark the page writable again." He went on to explain, "The COW approach does generate some really nice benchmark numbers, because the way you benchmark this thing is that you never actually write to the user page in the first place, so you end up having a nice benchmark loop that has to do the TLB invalidate just the _first_ time, and never has to do any work ever again later on." Linus didn't pull any punches when he summarized:

"I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home."


From: "David S. Miller" [email blocked]
To:  axboe
Subject: Re: Linux 2.6.17-rc2
Date:	Thu, 20 Apr 2006 12:26:47 -0700 (PDT)

From: Jens Axboe [email blocked]
Date: Thu, 20 Apr 2006 16:50:42 +0200

> On Wed, Apr 19 2006, Linus Torvalds wrote:
> >  - vmsplice() system call to basically do a "write to the buffer", but 
> >    using the reference counting and VM traversal to actually fill the 
> >    buffer. This means that the user needs to be careful not to re-use the 
> >    user-space buffer it spliced into the kernel-space one (contrast this 
> >    to "write()", which copies the actual data, and you can thus re-use the 
> >    buffer immediately after a successful write), but that is often easy to 
> >    do.
> 
> This I already did, it was pretty easy and straight forward. I'll post
> it soonish.

Do we plan to do vmsplice() to sockets?  That's interesting, but
requires some serious cooperation from things like TCP so that
the indication of "buffer can be reused now, thanks" is explicit
and indicated as soon as ACK's come back for those parts of the
data stream.

Even UDP would need to wait until the card is done with transmit,
and we have DCCP and SCTP too.

People would want to be able to get event notifications of this,
or do we plan to just block?  Blocking could be problematic,
performance wise.

Anyways, I'm just stabbing in the dark.  It would be useful, because
there is no real clean way to use sendfile() for zero copy of anonymous
user data, and this vmsplice() thing seems like it could bridge that
gap if we do it right.


From: Piet Delaney [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 14:37:11 -0700

On Thu, 2006-04-20 at 21:34 +0200, Jens Axboe wrote:
> > Anyways, I'm just stabbing in the dark.  It would be useful, because
> > there is no real clean way to use sendfile() for zero copy of anonymous
> > user data, and this vmsplice() thing seems like it could bridge that
> > gap if we do it right.
>
> It should be able to, yes. Seems to me it should just work like regular
> splicing, with the difference that you'd have to wait for the reference
> count to drop before reusing. One way would be to do as Linus suggests
> and make the vmsplice call block or just return -EAGAIN if we are not
> ready yet. With that pollable, that should suffice?

What about marking the pages Read-Only while it's being used by the
kernel and if the user tries to write into them letting the VM dup the
page with the COW code?

Often you can use a FILO memory allocator in user space to minimize
the odds of trying to reuse the page while the kernel is using it.
FreeBSD folks developed a ZERO_COPY_SOCKET facility that uses COW;
code looked great.

-piet

--
---
[email blocked]


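The "return -EAGAIN and make it pollable" idea quoted above can be sketched with the flags that vmsplice() exposes today. Note the assumption: this example only shows the pipe-full case, where SPLICE_F_NONBLOCK makes vmsplice() fail with EAGAIN and poll() reports when there is room again. It does not show a notification that the kernel has dropped its reference to the pages, which is what the thread is really asking for. The helper names are hypothetical.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>      /* vmsplice(), SPLICE_F_NONBLOCK */
#include <poll.h>
#include <stdlib.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: vmsplice() the same page into the pipe, without
 * blocking, until the kernel reports EAGAIN. Returns how many calls
 * succeeded, or -1 on an unexpected error. */
static int fill_pipe_nonblocking(int wfd, void *page, size_t page_size)
{
    struct iovec iov = { .iov_base = page, .iov_len = page_size };
    int ok = 0;

    assert(page != NULL && page_size > 0);
    for (int i = 0; i < 65536; i++) {        /* safety cap on iterations */
        ssize_t n = vmsplice(wfd, &iov, 1, SPLICE_F_NONBLOCK);
        if (n < 0)
            return errno == EAGAIN ? ok : -1;
        ok++;
    }
    return -1;                               /* pipe never filled up */
}

/* Returns 1 if the pipe can accept more data right now, 0 otherwise. */
static int pipe_writable(int wfd)
{
    struct pollfd p = { .fd = wfd, .events = POLLOUT };
    return poll(&p, 1, 0) == 1 && (p.revents & POLLOUT) != 0;
}
```

A caller would typically loop on `fill_pipe_nonblocking()` and fall back to `poll()` with a real timeout once it sees EAGAIN, rather than spinning.
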
From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 15:20:14 -0700 (PDT)

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> What about marking the pages Read-Only while it's being used by the
> kernel

NO!

That's a huge mistake, and anybody that does it that way (FreeBSD) is
totally incompetent.

Once you play games with page tables, you are generally better off copying
the data. The cost of doing page table updates and the associated TLB
invalidates is simply not worth it, both from a performance standpoint and
a complexity standpoint.

Basically, if you want the highest possible performance, you do not want
to do TLB invalidates. And if you _don't_ want the highest possible
performance, you should just use regular write(), which is actually good
enough for most uses, and is portable and easy.

The thing is, the cost of marking things COW is not just the cost of the
initial page table invalidate: it's also the cost of the fault eventually
when you _do_ write to the page, even if at that point you decide that the
page is no longer shared, and the fault can just mark the page writable
again.

That cost is _bigger_ than the cost of just copying the page in the first
place.

The COW approach does generate some really nice benchmark numbers, because
the way you benchmark this thing is that you never actually write to the
user page in the first place, so you end up having a nice benchmark loop
that has to do the TLB invalidate just the _first_ time, and never has to
do any work ever again later on.

But you do have to realize that that is _purely_ a benchmark load. It has
absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
life, COW-faulting overhead is expensive. In real life, TLB invalidates
(with a threaded program, and all users of this had better be threaded, or
they are leaving more performance on the floor) are expensive.

I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
Playing games with VM is bad. memory copies are _also_ bad, but quite
frankly, memory copies often have _less_ downside than VM games, and
bigger caches will only continue to drive that point home.

		Linus


From: Piet Delaney [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 16:39:03 -0700

On Thu, 2006-04-20 at 15:20 -0700, Linus Torvalds wrote:
>
> On Thu, 20 Apr 2006, Piet Delaney wrote:
> >
> > What about marking the pages Read-Only while it's being used by the
> > kernel
>
> NO!
>
> That's a huge mistake, and anybody that does it that way (FreeBSD) is
> totally incompetent.

Yea, we're not using it either.

>
> Once you play games with page tables, you are generally better off copying
> the data. The cost of doing page table updates and the associated TLB
> invalidates is simply not worth it, both from a performance standpoint and
> a complexity standpoint.

I once wrote some code to find the PTE entries for user buffers;
and as I recall the code was only about 20 lines of code. I thought
only a small part of the TLB had to be invalidated. I never tested
or profiled it and didn't consider the multi-threading issues.

Instead of COW, I just returned information in recvmsg control
structure indicating that the buffer wasn't being used by the kernel
any longer. I kept the list of pages involved in the zero copy in a
structure and when the kernel was done with the pages it decremented
the page count via a callback, similar to what yzy [email blocked]
discussed two weeks ago on the linux-net mailing list. I thought this
structure could have pointers to the PTE's and mmu context to clear
the PTE entries. Unfortunately it gets messy if the zero copies
overlap onto a shared page. I didn't study the BSD implementation
well enough to appreciate how their COW implementation worked.

>
> Basically, if you want the highest possible performance, you do not want
> to do TLB invalidates. And if you _don't_ want the highest possible
> performance, you should just use regular write(), which is actually good
> enough for most uses, and is portable and easy.

We use a zero copy, and also don't mess with the TLB. In our
application 99.99% of the data is looked at but not modified
(we are looking through TCP streams for security exploitations).

>
> The thing is, the cost of marking things COW is not just the cost of the
> initial page table invalidate: it's also the cost of the fault eventually
> when you _do_ write to the page, even if at that point you decide that the
> page is no longer shared, and the fault can just mark the page writable
> again.

Right, it's difficult for the kernel code to change the involved
PTE's when it's done with a page. Then flushing the TLB's of
involved CPU's adds to the problem.

>
> That cost is _bigger_ than the cost of just copying the page in the first
> place.
>
> The COW approach does generate some really nice benchmark numbers, because
> the way you benchmark this thing is that you never actually write to the
> user page in the first place, so you end up having a nice benchmark loop
> that has to do the TLB invalidate just the _first_ time, and never has to
> do any work ever again later on.
>
> But you do have to realize that that is _purely_ a benchmark load. It has
> absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
> life, COW-faulting overhead is expensive. In real life, TLB invalidates
> (with a threaded program, and all users of this had better be threaded, or
> they are leaving more performance on the floor) are expensive.

Yea, you're right, the multi-threading is a real problem; you would
have to send an interrupt with information about which part of the
TLB needs to be invalidated to each CPU.

>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
> Playing games with VM is bad. memory copies are _also_ bad, but quite
> frankly, memory copies often have _less_ downside than VM games, and
> bigger caches will only continue to drive that point home.

Yep, both of the zero copy implementations that I've worked on
have used non-VM techniques to synchronize socket buffer state
between the kernel and user space.

-piet

>
> 		Linus

--
---
[email blocked]


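The non-VM bookkeeping Piet describes, where the kernel tells user space when it has finished with a buffer, can be pictured from the user-space side with a small pool that tracks which buffers are still in flight. This is only an illustrative sketch: every name in it is hypothetical, and the completion notice is modeled as a plain function call rather than the recvmsg control message mentioned in the email.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4

/* Hypothetical user-space pool: one counter per buffer recording how many
 * references the kernel is assumed to still hold on it. */
struct zc_pool {
    int in_flight[POOL_SLOTS];
};

/* Pick a buffer with no outstanding references and mark it as submitted.
 * Returns the slot index, or -1 if every buffer is still in flight. */
static int zc_acquire(struct zc_pool *p)
{
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p->in_flight[i] == 0) {
            p->in_flight[i] = 1;
            return i;
        }
    }
    return -1;
}

/* Called when a completion notice for `slot` arrives: the kernel is done
 * with the buffer, so it may be handed out again. */
static void zc_complete(struct zc_pool *p, int slot)
{
    assert(slot >= 0 && slot < POOL_SLOTS);
    assert(p->in_flight[slot] > 0);
    p->in_flight[slot]--;
}
```

No page tables are touched anywhere in this scheme. The cost is that the application must check the pool before writing into a buffer.
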
From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 17:09:00 -0700 (PDT)

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> I once wrote some code to find the PTE entries for user buffers;
> and as I recall the code was only about 20 lines of code. I thought
> only a small part of the TLB had to be invalidated. I never tested
> or profiled it and didn't consider the multi-threading issues.

Looking up the page table entry is fairly quick, and is definitely worth
it. It's usually just a few memory loads, and it may even be cached. So
that part of the "VM tricks" is fine.

The cost comes when you modify it. Part of it is the initial TLB
invalidate cost, but that actually tends to be the smaller part (although
it can be pretty steep already, if you have to do a cross-CPU invalidate:
that alone may already have taken more time than it would to just do a
straightforward copy).

The bigger part tends to be that any COW approach will obviously have to
be undone later, usually when the user writes to the page. Even if (by the
time the fault is taken) the page is no longer shared, and undoing the COW
is just a matter of touching the page tables again, just the cost of
taking the fault is easily thousands of cycles.

At which point the optimization is very debatable indeed. If the COW
actually causes a real copy and a new page to be allocated, you've lost
everything, and you're solidly in "that sucks" territory.

> Instead of COW, I just returned information in recvmsg control
> structure indicating that the buffer wasn't being used by the kernel
> any longer.

That is very close to what I propose with vmsplice(), and yes, once you
avoid the COW, it's a clear win to just look up the page in the page
tables and increment a usage count.

So basically:

 - just looking up the page is cheap, and that's what vmsplice() does
   (if people want to actually play with it, Jens now has a vmsplice()
   implementation in his "splice" branch in his git tree on
   brick.kernel.dk).

   It does mean that it's up to the _user_ to not write to the page again
   until the page is no longer shared, and there are different approaches
   to handling that. Sometimes the answer may even be that synchronization
   is done at a much higher level (ie there's some much higher-level
   protocol where the other end acknowledges the data).

   The fact that it's up to the user obviously means that the user has to
   be more careful, but the upside is that you really _do_ get very high
   performance. If there are no good synchronization mechanisms, the
   answer may well be "don't use vmsplice()", but the point is that if you
   _can_ synchronize some other way, vmsplice() runs like a bat out of
   hell.

 - playing VM games where you actually modify the VM is almost always a
   loss. It does have the advantage that the user doesn't have to be aware
   of the VM games, but if it means that performance isn't really all that
   much better than just a regular "write()" call, what's the point?

I'm of the opinion that we already have robust and user-friendly
interfaces (the regular read()/write()/recvmsg/sendmsg() interfaces that
are "synchronous" wrt data copies, and that are obviously portable). We've
even optimized them as much as we can, so they actually perform pretty
well.

So there's no point in a half-assed "safe VM" trick with COW, which isn't
all that much faster. Playing tricks with zero-copy only makes sense if
they are a _lot_ faster, and that implies that you cannot do COW. You
really expose the fact that user-space gave a real reference to its own
pages away, and that if user space writes to it, it writes to a buffer
that is already in flight.

(Some users may even be able to take _advantage_ of the fact that the
buffer is "in flight" _and_ mapped into user space after it has been
submitted. You could imagine code that actually goes on modifying the
buffer even while it's being queued for sending. Under some strange
circumstances that may actually be useful, although with things like
checksums that get invalidated by you changing the data while it's queued
up, it may not be acceptable for everything, of course).

		Linus


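The warning that user space "writes to a buffer that is already in flight" can be tried out with a small experiment: splice a page into a pipe, overwrite the page while it is still queued, and then see what the reader gets. The sketch below assumes Linux with glibc and uses a hypothetical function name. Because the pipe is expected to reference the user page rather than copy it, the reader would be expected to see the later write, but the code deliberately only reports what it observed instead of assuming that outcome.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>      /* vmsplice() */
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical demo: queue one page filled with 'A' into a pipe, overwrite
 * the page with 'B' while it is still queued, then read the pipe. Returns
 * the first byte the reader saw, or -1 on error. */
static int first_byte_after_inflight_write(void)
{
    long ps_l = sysconf(_SC_PAGESIZE);
    assert(ps_l > 0);
    size_t ps = (size_t)ps_l;

    void *page = NULL;
    char *out = malloc(ps);
    int fds[2];
    int result = -1;

    if (out == NULL || posix_memalign(&page, ps, ps) != 0) {
        free(out);
        return -1;
    }
    if (pipe(fds) < 0) {
        free(page);
        free(out);
        return -1;
    }

    memset(page, 'A', ps);
    struct iovec iov = { .iov_base = page, .iov_len = ps };
    if (vmsplice(fds[1], &iov, 1, 0) == (ssize_t)ps) {
        memset(page, 'B', ps);           /* write while the page is queued */
        if (read(fds[0], out, ps) == (ssize_t)ps)
            result = (unsigned char)out[0];
    }

    close(fds[0]);
    close(fds[1]);
    free(page);
    free(out);
    return result;
}
```

A result of 'B' means the reader saw the modification made after submission, which is exactly the hazard (or, per the parenthetical above, the occasional opportunity) being described.
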
From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Fri, 21 Apr 2006 10:58:46 -0700 (PDT)

I got slashdotted! Yay!

On Thu, 20 Apr 2006, Linus Torvalds wrote:
>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.

I also claim that Slashdot people usually are smelly and eat their
boogers, and have an IQ slightly lower than my daughter's pet hamster
(that's "hamster" without a "p", btw, for any slashdot posters out
there. Try to follow me, ok?).

Furthermore, I claim that anybody that hasn't noticed by now that I'm an
opinionated bastard, and that "impolite" is my middle name, is lacking a
few clues.

Finally, it's clear that I'm not only the smartest person around, I'm also
incredibly good-looking, and that my infallible charm is also second only
to my becoming modesty.

So there. Just to clarify.

		Linus "bow down before me, you scum" Torvalds



Source URL:
http://kerneltrap.org/node/6506