While explaining the new splice() and tee() buffer management system calls [story], Linus Torvalds made reference to some possible future extensions. These included vmsplice(), a system call since implemented by Jens Axboe "to basically do a 'write to the buffer', but using the reference counting and VM traversal to actually fill the buffer." Reviewing the implications of using such a system call led to a comparison with FreeBSD's ZERO_COPY_SOCKET, which uses COW (copy on write).
Linus explained that while this may look good on specific benchmarks, it actually introduces extra overhead: "the thing is, the cost of marking things COW is not just the cost of the initial page table invalidate: it's also the cost of the fault eventually when you _do_ write to the page, even if at that point you decide that the page is no longer shared, and the fault can just mark the page writable again." He went on to explain, "The COW approach does generate some really nice benchmark numbers, because the way you benchmark this thing is that you never actually write to the user page in the first place, so you end up having a nice benchmark loop that has to do the TLB invalidate just the _first_ time, and never has to do any work ever again later on." Linus didn't pull any punches when he summarized:
"I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home."
From: "David S. Miller" [email blocked]
To: axboe
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 12:26:47 -0700 (PDT)

From: Jens Axboe [email blocked]
Date: Thu, 20 Apr 2006 16:50:42 +0200

> On Wed, Apr 19 2006, Linus Torvalds wrote:
> > - vmsplice() system call to basically do a "write to the buffer", but
> > using the reference counting and VM traversal to actually fill the
> > buffer. This means that the user needs to be careful not to re-use the
> > user-space buffer it spliced into the kernel-space one (contrast this
> > to "write()", which copies the actual data, and you can thus re-use the
> > buffer immediately after a successful write), but that is often easy to
> > do.
>
> This I already did, it was pretty easy and straight forward. I'll post
> it soonish.

Do we plan to do vmsplice() to sockets?

That's interesting, but requires some serious cooperation from things like TCP so that the indication of "buffer can be reused now, thanks" is explicit and indicated as soon as ACK's come back for those parts of the data stream.

Even UDP would need to wait until the card is done with transmit, and we have DCCP and SCTP too.

People would want to be able to get event notifications of this, or do we plan to just block? Blocking could be problematic, performance wise.

Anyways, I'm just stabbing in the dark. It would be useful, because there is no real clean way to use sendfile() for zero copy of anonymous user data, and this vmsplice() thing seems like it could bridge that gap if we do it right.
From: Piet Delaney [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 14:37:11 -0700

On Thu, 2006-04-20 at 21:34 +0200, Jens Axboe wrote:
> > Anyways, I'm just stabbing in the dark. It would be useful, because
> > there is no real clean way to use sendfile() for zero copy of anonymous
> > user data, and this vmsplice() thing seems like it could bridge that
> > gap if we do it right.
>
> It should be able to, yes. Seems to me it should just work like regular
> splicing, with the difference that you'd have to wait for the reference
> count to drop before reusing. One way would be to do as Linus suggests
> and make the vmsplice call block or just return -EAGAIN if we are not
> ready yet. With that pollable, that should suffice?

What about marking the pages Read-Only while it's being used by the kernel and if the user tries to write into them letting the VM dup the page with the COW code?

Often you can use a FILO memory allocator in user space to minimize the odds of trying to reuse the page while the kernel is using it.

FreeBSD folks developed a ZERO_COPY_SOCKET facility that uses COW; code looked great.

-piet

--
---
[email blocked]
From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 15:20:14 -0700 (PDT)

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> What about marking the pages Read-Only while it's being used by the
> kernel

NO!

That's a huge mistake, and anybody that does it that way (FreeBSD) is totally incompetent.

Once you play games with page tables, you are generally better off copying the data. The cost of doing page table updates and the associated TLB invalidates is simply not worth it, both from a performance standpoing and a complexity standpoint.

Basically, if you want the highest possible performance, you do not want to do TLB invalidates. And if you _don't_ want the highest possible performance, you should just use regular write(), which is actually good enough for most uses, and is portable and easy.

The thing is, the cost of marking things COW is not just the cost of the initial page table invalidate: it's also the cost of the fault eventually when you _do_ write to the page, even if at that point you decide that the page is no longer shared, and the fault can just mark the page writable again.

That cost is _bigger_ than the cost of just copying the page in the first place.

The COW approach does generate some really nice benchmark numbers, because the way you benchmark this thing is that you never actually write to the user page in the first place, so you end up having a nice benchmark loop that has to do the TLB invalidate just the _first_ time, and never has to do any work ever again later on.

But you do have to realize that that is _purely_ a benchmark load. It has absolutely _zero_ relevance to any real life. Zero. Nada. None. In real life, COW-faulting overhead is expensive. In real life, TLB invalidates (with a threaded program, and all users of this had better be threaded, or they are leaving more performance on the floor) are expensive.

I claim that Mach people (and apparently FreeBSD) are incompetent idiots. Playing games with VM is bad. memory copies are _also_ bad, but quite frankly, memory copies often have _less_ downside than VM games, and bigger caches will only continue to drive that point home.

Linus
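To make the shape of that argument concrete, here is a minimal cost-model sketch in C. It is not taken from the thread: the struct, the function names and the idea of expressing each step as a caller-supplied cycle count are all assumptions made for illustration, and no measured numbers are implied. It only encodes the accounting described in the message above: the COW path always pays for the page table change and TLB invalidate, pays for a fault if the user later writes to the page, and pays for a real copy on top if the page is still shared at that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page costs in cycles. These are inputs supplied by the
 * caller, not measurements from any real system. */
struct vm_costs {
    unsigned long copy_page;      /* copying one page, as write() does   */
    unsigned long tlb_invalidate; /* marking the page COW + TLB shootdown */
    unsigned long fault;          /* taking a write fault on the page     */
};

/* Plain copy path: one page copy, nothing to undo later. */
unsigned long cost_copy(const struct vm_costs *c)
{
    assert(c != NULL);
    return c->copy_page;
}

/* COW path: the invalidate is always paid; the fault is paid if the user
 * ever writes to the page again; a full copy is paid as well if the page
 * is still shared when that fault happens. */
unsigned long cost_cow(const struct vm_costs *c,
                       bool user_writes_later, bool still_shared)
{
    unsigned long total;

    assert(c != NULL);
    total = c->tlb_invalidate;
    if (user_writes_later) {
        total += c->fault;
        if (still_shared)
            total += c->copy_page;
    }
    return total;
}
```

A benchmark loop that never writes to the buffer corresponds to calling cost_cow() with user_writes_later set to false, which is the cheapest case and the one the message above says is unrepresentative. Whether the COW path ever beats cost_copy() for a real workload depends entirely on the numbers fed in.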
From: Piet Delaney [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 16:39:03 -0700

On Thu, 2006-04-20 at 15:20 -0700, Linus Torvalds wrote:
>
> On Thu, 20 Apr 2006, Piet Delaney wrote:
> >
> > What about marking the pages Read-Only while it's being used by the
> > kernel
>
> NO!
>
> That's a huge mistake, and anybody that does it that way (FreeBSD) is
> totally incompetent.

Yea, we're not using it either.

>
> Once you play games with page tables, you are generally better off copying
> the data. The cost of doing page table updates and the associated TLB
> invalidates is simply not worth it, both from a performance standpoing and
> a complexity standpoint.

I once wrote some code to find the PTE entries for user buffers; and as I recall the code was only about 20 lines of code. I thought only a small part of the TLB had to be invalidated. I never tested or profiled it and didn't consider the multi-threading issues.

Instead of COW, I just returned information in recvmsg control structure indicating that the buffer wasn't being use by the kernel any longer. I kept the list of pages involved in the zero copy in a structure and when the kernel was done with the pages it decremented the page count via a callback, similar to what yzy [email blocked] discussed two weeks ago on the linux-net mailing list.

I thought this structure could have pointers to the PTE's and mmu context to clear the PTE entries. Unfortunately it gets messy if the zero copy's overlap onto a shared page. I didn't study the BSD implementation well enough to appreciate how their COW implementation worked.

>
> Basically, if you want the highest possible performance, you do not want
> to do TLB invalidates. And if you _don't_ want the highest possible
> performance, you should just use regular write(), which is actually good
> enough for most uses, and is portable and easy.

We use a zero copy, and also don't mess with the TLB. In our application 99.99% of the data is looked at but not modified (we are looking through TCP streams for a security exploitations).

>
> The thing is, the cost of marking things COW is not just the cost of the
> initial page table invalidate: it's also the cost of the fault eventually
> when you _do_ write to the page, even if at that point you decide that the
> page is no longer shared, and the fault can just mark the page writable
> again.

Right, it's difficult for the kernel code to change the involved PTE's when it's done with a page. Then flushing the TLB's of involved CPU's adds to the problem.

>
> That cost is _bigger_ than the cost of just copying the page in the first
> place.
>
> The COW approach does generate some really nice benchmark numbers, because
> the way you benchmark this thing is that you never actually write to the
> user page in the first place, so you end up having a nice benchmark loop
> that has to do the TLB invalidate just the _first_ time, and never has to
> do any work ever again later on.
>
> But you do have to realize that that is _purely_ a benchmark load. It has
> absolutely _zero_ relevance to any real life. Zero. Nada. None. In real
> life, COW-faulting overhead is expensive. In real life, TLB invalidates
> (with a threaded program, and all users of this had better be threaded, or
> they are leaving more performance on the floor) are expensive.

Yea, your right, the multi-threading it a real problem, you would have to send a interrupt with information about which part of the TLB needs to be invalidated to each CPU.

>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.
> Playing games with VM is bad. memory copies are _also_ bad, but quite
> frankly, memory copies often have _less_ downside than VM games, and
> bigger caches will only continue to drive that point home.

Yep, both of the zero copy implementations that I've worked on have used non-VM techniques to synchronize socket buffer state between the kernel and user space.

-piet

>
> Linus

--
---
[email blocked]
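The non-VM bookkeeping described in this message (keep a list of the pages involved, drop a count through a callback when the kernel is done, then tell user space the buffer is free) can be pictured with a small sketch. This is not the author's code, which the thread does not show; every name below (zc_region, zc_get, zc_put, mark_reusable) is hypothetical, and a plain integer stands in for whatever per-page reference counting a real kernel would use.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record for one zero-copy submission: how many references
 * are still outstanding, and whom to tell when the last one goes away. */
struct zc_region {
    int refs;                         /* outstanding references   */
    void (*on_release)(void *cookie); /* called when refs hits 0  */
    void *cookie;                     /* opaque argument for it   */
};

/* Take one reference, e.g. when a page is queued for transmit. */
void zc_get(struct zc_region *r)
{
    assert(r != NULL);
    r->refs++;
}

/* Drop one reference; on the last drop, notify the owner that the buffer
 * may be reused. No page tables are touched at any point. */
void zc_put(struct zc_region *r)
{
    assert(r != NULL && r->refs > 0);
    if (--r->refs == 0 && r->on_release != NULL)
        r->on_release(r->cookie);
}

/* Example callback: set a flag that user-space code could poll, standing
 * in for the "buffer no longer in use" indication mentioned above. */
void mark_reusable(void *cookie)
{
    *(int *)cookie = 1;
}
```

The point of the sketch is only that the "you may reuse this buffer" signal comes from explicit accounting rather than from a write fault, which is the distinction the surrounding messages turn on.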
From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Thu, 20 Apr 2006 17:09:00 -0700 (PDT)

On Thu, 20 Apr 2006, Piet Delaney wrote:
>
> I once wrote some code to find the PTE entries for user buffers;
> and as I recall the code was only about 20 lines of code. I thought
> only a small part of the TLB had to be invalidated. I never tested
> or profiled it and didn't consider the multi-threading issues.

Looking up the page table entry is fairly quick, and is definitely worth it. It's usually just a few memory loads, and it may even be cached. So that part of the "VM tricks" is fine.

The cost comes when you modify it. Part of it is the initial TLB invalidate cost, but that actually tends to be the smaller part (although it can be pretty steep already, if you have to do a cross-CPU invalidate: that alone may already have taken more time than it would to just do a straightforward copy).

The bigger part tends to be that any COW approach will obviously have to be undone later, usually when the user writes to the page. Even if (by the time the fault is taken) the page is no longer shared, and undoing the COW is just a matter of touching the page tables again, just the cost of taking the fault is easily thousands of cycles.

At which point the optimization is very debatable indeed. If the COW actually causes a real copy and a new page to be allocated, you've lost everything, and you're solidly in "that sucks" territory.

> Instead of COW, I just returned information in recvmsg control
> structure indicating that the buffer wasn't being use by the kernel
> any longer.

That is very close to what I propose with vmsplice(), and yes, once you avoid the COW, it's a clear win to just look up the page in the page tables and increment a usage count.

So basically:

- just looking up the page is cheap, and that's what vmsplice() does (if people want to actually play with it, Jens now has a vmsplice() implementation in his "splice" branch in his git tree on brick.kernel.dk).

  It does mean that it's up to the _user_ to not write to the page again until the page is no longer shared, and there are different approaches to handling that. Sometimes the answer may even be that synchronization is done at a much higher level (ie there's some much higher-level protocol where the other end acknowledges the data).

  The fact that it's up to the user obviously means that the user has to be more careful, but the upside is that you really _do_ get very high performance. If there are no good synchronization mechanisms, the answer may well be "don't use vmsplice()", but the point is that if you _can_ synchronize some other way, vmsplice() runs like a bat out of hell.

- playing VM games where you actually modify the VM is almost always a loss. It does have the advantage that the user doesn't have to be aware of the VM games, but if it means that performance isn't really all that much better than just a regular "write()" call, what's the point?

I'm of the opinion that we already have robust and user-friendly interfaces (the regular read()/write()/recvmsg/sendmgs() interfaces that are "synchronous" wrt data copies, and that are obviously portable). We've even optimized them as much as we can, so they actually perform pretty well.

So there's no point in a half-assed "safe VM" trick with COW, which isn't all that much faster. Playing tricks with zero-copy only makes sense if they are a _lot_ faster, and that implies that you cannot do COW. You really expose the fact that user-space gave a real reference to its own pages away, and that if user space writes to it, it writes to a buffer that is already in flight.

(Some users may even be able to take _advantage_ of the fact that the buffer is "in flight" _and_ mapped into user space after it has been submitted. You could imagine code that actually goes on modifying the buffer even while it's being queued for sending. Under some strange circumstances that may actually be useful, although with things like checksums that get invalidated by you changing the data while it's queued up, it may not be acceptable for everything, of course).

Linus
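For readers who want to see what the user side of this looks like, the following is a minimal sketch in C of handing a user buffer to a pipe with vmsplice() and draining it again. It assumes a Linux system whose C library exposes vmsplice() through <fcntl.h> when _GNU_SOURCE is defined; the helper name vmsplice_roundtrip is hypothetical, and the code is an illustration rather than anything posted in the thread.

```c
#define _GNU_SOURCE        /* vmsplice() is a Linux-specific extension */
#include <assert.h>
#include <fcntl.h>         /* vmsplice() */
#include <sys/types.h>     /* ssize_t */
#include <sys/uio.h>       /* struct iovec */
#include <unistd.h>        /* pipe(), read(), close() */

/* Hypothetical helper: queue `len` bytes at `buf` into a pipe with
 * vmsplice(), then read them back out of the pipe into `out`.
 * Returns the number of bytes read back, or -1 on error.
 * `len` is assumed to be small enough to fit in the pipe. */
ssize_t vmsplice_roundtrip(char *buf, size_t len, char *out, size_t outlen)
{
    int fds[2];
    struct iovec iov;
    ssize_t queued, got;
    size_t want;

    assert(buf != NULL && out != NULL);
    if (pipe(fds) < 0)
        return -1;

    iov.iov_base = buf;
    iov.iov_len = len;

    /* Hand the user pages to the pipe. From here until the reader has
     * consumed the data, the pipe may refer to `buf` itself rather than
     * to a copy, so the caller must not modify `buf`. */
    queued = vmsplice(fds[1], &iov, 1, 0);
    if (queued < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    /* Drain the pipe. In this toy example that is the only
     * synchronization: once read() has returned the bytes, nothing in
     * the pipe refers to `buf` any more. */
    want = (size_t)queued < outlen ? (size_t)queued : outlen;
    got = read(fds[0], out, want);

    close(fds[0]);
    close(fds[1]);
    return got;
}
```

In a real program the consumer would be another process, a file or a socket reached through splice(), and knowing when the buffer is safe to reuse is exactly the synchronization problem discussed above; the sketch sidesteps it by reading the data back in the same function.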
From: Linus Torvalds [email blocked]
Subject: Re: Linux 2.6.17-rc2
Date: Fri, 21 Apr 2006 10:58:46 -0700 (PDT)

I got slashdotted! Yay!

On Thu, 20 Apr 2006, Linus Torvalds wrote:
>
> I claim that Mach people (and apparently FreeBSD) are incompetent idiots.

I also claim that Slashdot people usually are smelly and eat their boogers, and have an IQ slightly lower than my daughters pet hamster (that's "hamster" without a "p", btw, for any slashdot posters out there. Try to follow me, ok?).

Furthermore, I claim that anybody that hasn't noticed by now that I'm an opinionated bastard, and that "impolite" is my middle name, is lacking a few clues.

Finally, it's clear that I'm not only the smartest person around, I'm also incredibly good-looking, and that my infallible charm is also second only to my becoming modesty.

So there. Just to clarify.

Linus "bow down before me, you scum" Torvalds
C'mon
I'm sorry, but Linus is starting to turn into a horse's ass. That'd be like turning to a friend that runs Gentoo and saying "Gentoo doesn't have a package manager!" Just because the BSD guys have a different implementation doesn't make them incompetent idiots. Stop buying into your own hype.
I see this as a technical matter between engineers.
Not an occasion for community warring.
Leaving it to the kernel hackers and ignoring feelings of anger is the way to go, IMHO.
I personally like the Linux kernel, but I hate the idea of Linux distros. FreeBSD also offers me better hardware support and unified ports/packages; [url=http://www.pcbsd.org/]PC-BSD[/url] even supports "Newb proof" installs via the PBI System. (Not to mention one of the best RTFM's to read.)
You should really try DesktopBSD www.desktopbsd.net .
This is not new. Linus has publicly questioned the intellect of FreeBSD/Mach folks in the past. I recall some issue with a kernel API involving events; Linux took a minimalist approach, whereas FreeBSD used some elaborate function. This offended Linus's notion of 'taste' enough that he reamed them out for it. Another incident involved coherency between the filesystem and VM; the Mach approach leads to managing multiversion pages, which Linus called 'insane'.
Proof is in the Pudding
If the Mach/*BSD folks can write a benchmark to prove Linus wrong, do it. Performance is a measurable thing. If what Linus is saying is wrong, write the test and run it. If you can prove his points wrong, then Linus is the ID10T. If you can't, then maybe he's right.
Proof is in the Pudding
Usually the burden of proof lies with the one making the claim.
PS: I don't know enough on this specific subject to have an educated opinion about the validity of Linus' claim. My only point here is that it is not the burden of the Mach/*BSD folks to prove Linus wrong, and the absence of such effort on their part cannot be construed as a 'Truth by default'.
Optimization requires measurement
Optimization is a quantitative thing. Benchmarks, before and after, should exist. That means that a) numbers should be available and b) code should be available. The interface should be proofed regardless of anyone else's opinions.
Burden of Proof
Isn't the burden on the FreeBSD engineers to show that their COW approach is better than what existed before?
They did
Linus claims their benchmark is foul.
They solve different problems
Linus wants user-space code to adapt to use a higher-performance model. BSD/Mach want to adapt the existing API behaviours to maximize performance of unmodified code. Linus' approach will result in higher performance, at the expense of more labor for the application writer. It's a perfectly reasonable choice to put the trade-off anywhere along this spectrum, and it is sensible to use the system which best supports your purposes. If Linus doesn't want to do both (provide the higher performance solution with more onerous and unportable interfaces, and increase the performance of those cases which BSD/Mach address with COW) I think it's difficult to know why, since he's inclined to theatrics in this area, but one can guess that he considers the weight of the implementation to be excessive for Linux.
Re: They solve different problems
I think Linus' point is that the Mach/FreeBSD way of doing things is only fast for (micro?)benchmarks. I would suspect that he is claiming that the write() based system on Linux is faster than COW methods on Mach/FreeBSD for real world loads. And it would seem from my reading that he is also claiming that vmsplice() on Linux will run even faster still.
So I guess his overall claim is that writing a COW type zero copy method is a waste of time because it is slower than a well written write() based system (let alone the speed improvements for a vmsplice() type system).
Sensible interfaces and paradigms, please.
Why should applications (Apache, MySQL, etc) have to adapt to something that is broken from the start? Torvalds himself has claimed that there are synchronization issues with the API. Maybe a sane interface that works should be produced before flinging dirt around.
The FreeBSD socket paradigm makes more sense from an application's point of view. Once data is sent to the kernel, the kernel is tasked with the delivery of the data and the user-space buffers should be reusable for the next batch of data. Obviously, you want to allocate enough buffer space that you are not using the same vm page to do all of your socket operations, if you hope to get any sort of performance out of your app (Or you end up page-faulting, then copying the data into KVA anyhow).
This said, there are many improvements that could be brought about to the FreeBSD Way(tm) of doing things, and I can assure you that work is being done on it in -CURRENT.
It isn't broken
It isn't broken at all, where did you get that from?
Synchronisation "issues" doesn't mean it is broken, or that the issues
are a problem. Actually the synchronisation issues are exactly why it
is a better interface.
> The FreeBSD socket paradigm makes more sense from an application's point of view. Once data is sent to the kernel, the kernel is tasked with the delivery of the data and the user-space buffers should be reusable for the next batch of data. Obviously, you want to allocate enough buffer space that you are not using the same vm page to do all of your socket operations, if you hope to get any sort of performance out of your app (Or you end up page-faulting, then copying the data into KVA anyhow).
Why does that make more sense from an application's point of view?
Having to keep that buffer space around is just symptomatic that
it is a hack for an interface that can't cope with it well.
A new interface that is explicitly designed for zero copy seems to
make much more sense to me.
> This said, there are many improvements that could be brought about to the FreeBSD Way(tm) of doing things,
Like what? Care to link any mailing list posts, or tech papers?
> and I can assure you that work is being done on it in -CURRENT.
By who? Under what commits?
Mudslinging not becoming to Linus
Right, if you don't understand something, the people who did it must be idiots. Or even if you just don't like it. Hasn't everyone gotten tired of reading about whatever slur Linus is tossing off this week? Oh, wait, we're talking about slashdot kiddies here; whatever Linus says is their holy writ, handed down on god's own toilet paper.
re: C'mon
"Just because the BSD guys have a different implementation doesn't make them incompetent idiots."
I didn't read it like that...
I read it as "this technique (COW) is not worth it in the real world, because it really doesn't provide improved performance except under specific circumstances. Anyone using this braindead technique is incompetent."
Now, while his comments on why this is braindead make sense to me, that doesn't mean that he is right. I would like to see some proof that there is insignificant performance improvement from COW. Argument ad auctoritatem is not a valid technique.
Ad Hominem attacks are no better.
Your analogy is flawed; a better analogy would be to say to your friend "the performance loss from having to spend so much time compiling and recompiling packages more than covers the performance gains you get from running your 'optimised' software." It may or may not be true (and I'm not making that claim) but it provides a potential technical reason why you consider something worse. The fact that the claim is not backed up leaves it up to the rest of us either to test using a realistic "benchmark" or to demand more proof than "I say so".
Z.
I happen to agree w/ Linus
First off, I don't think Linus really believes the BSD guys are incompetent. He gives pretty valid reasoning as to why their approach to zero-copy is bad. In the end, it's a technical discussion decided on facts.
Re: I happen to agree w/ Linus
I'm not sure about that... The term incompetent is a bit strong if you don't think someone has done something fundamentally wrong.
While he might accept their competence as programmers I definitely get the feeling that he doesn't trust their competence as OS designers. This doesn't mean that they don't have good ideas, just that sometimes the first thoughts on an idea aren't borne out by further analysis or actual implementation. COW sounds like a good idea in principle (let's only make a copy when the copy or the original are going to change) but the actual implementation is more complicated than that, and tracking which pages need to be copied becomes more complicated. If the added complexity in checking if this page needs to be copied is greater than the amount of work in just copying everything then it's not worth it (always copy would be simpler, faster and easier to understand).
You are right, it is a technical discussion that should be decided on facts. In that sense the ad hominem attacks serve no purpose.
Z.
His impropriety knows no bounds
That Linus! Here he is at it again but this time wiping the floor with the very kernel hackers that bow before His Infallibleness.
http://kerneltrap.org/node/6505
>There are some other buffer management system calls that I haven't done yet (and when I say "I haven't done yet", I obviously mean "that I hope some other sucker will do for me, since I'm lazy") ....
So disrespectful to such a committed and brainy bunch (unlike the BSD gang).. but then, Linus is the smartest person on earth.
Where we going?
There's a large difference between choosing a source-based distribution like Gentoo vs a package-managed distro and calling someone on a lousy programming choice that could:
A: Destabilize the system
B: Be slower than more traditional methods
C: Be invalid due to data integrity checking
Wah! Someone famous said something not politically correct about something dumb as hell! I'm gonna get my .0000000000000000001 nanoseconds of fame in by flaming him over it!
Enjoy!
Nihilist wrote:
> I'm sorry, but Linus is starting to turn into a horse's ass.
Starting?
How sarcastic :)
Uh. There is a typo in topic.
Uh. There is a typo in topic... vmsplice, not vmslplice...
What are they talking about?
Did you understand what they are talking about?
I am just a newbie and don't have the slightest idea what they are discussing..
Cool if I can get up to speed...
We are in the same boat
There are times when both the OS and a program need to share a section of writeable memory. The OS and the program don't really know what the other is doing with that shared memory at any given instant. If it were left as that, it could mean that the contents of the memory could be changed by one while the other was in the middle of doing something with it, which would lead to unpredictable and undefined results (read: bad!). So, there needs to be some kind of coordination between the OS and the program so they don't step all over each other. There are several ways to do this coordination, and three different ones are discussed in the email exchange.
A major difference between the ways, from the perspective of a user who doesn't care how it is implemented, is the time it takes to do this coordination. The simplest way to coordinate the memory is to keep two copies of the data, one copy for the OS and one for the program, and let each work with its own copy. This involves copying from memory to memory, so it can be very slow.
Two other methods are discussed -- the way FreeBSD does it and the way Linux should do it. The way FreeBSD does it can be extremely fast in benchmarks, but in a real-world case it can take a very long time (for technical reasons having to do both with how the computer's processor works with memory and with how programs tend to access memory) compared both to the way Linux should do it (which is relatively fast) and even to keeping two separate copies (which is slow).
In summary, the way FreeBSD does it is more complicated and can often be slower than just making and using two copies.
Part of the issue is that BSD always syncs up, regardless of whether that's needed. Synchronization code can be very expensive, so that slows down their optimization to the point that it's no longer any faster than with optimizations turned off. It's like driving down a 25 mph street in (short) bursts of 100 mph and (long) 5 mph, hoping that your overall speed is faster.
Linus is proposing that Linux never sync up. Always fast, but can be wrong. Say, driving at 55 mph, but without looking if there are kids in the road. His argument is that if you use the implementation you should know if you need to sync. If that's the case, he'll let you watch for the kids.
Linus is proposing that Linux
I may be wrong, but that's not how I read it. To me it appears he's saying the proposed vmsplice function shouldn't try to do all the work itself, but instead should be used in conjunction with the already existing messaging system in the kernel. If a program uses vmsplice() then it is up to the program to ask the kernel if it is safe to use that memory rather than assume it is safe.
Yeah, vmsplice() is optional.
Yes. That's how I read it as well. If I'm reading it correctly, vmsplice() is an optional interface that programmers can use when they're willing to handle the synchronization work themselves, and want/need very high performance.
The standard, old, safe, slower, traditional, synchronized interfaces are still there. Linus said:
A lot of the comments seem to assume that vmsplice() Is The Way That Linux Will Do Things From Now On. I was initially under that impression myself. But, no, it's just one way that Linux can do things when the need arises.
I think that Linus may get too much of his emotion into this.
I think that when you are building something you must stay impartial so that your personal likings do not go against your work.
However, what he said seems right and if you read all emails you will also understand that the new vmsplice implementation is kinda bloated when it comes to modification.
*Long life to linux*
Loaned memory from the kernel
Instead of the user telling the kernel that it can use its buffer, and
then needing to be told when the kernel has finished with that buffer,
why not have a system call that returns I/O buffer space that the user can submit once? There would be a memory pool of I/O buffer space managed by the kernel, and when a user process wanted to send a message, it would make a system call to request a given number of bytes of temporary I/O buffer space, fill that buffer, then hand it back to the kernel. For the next I/O transaction, it would do the same, and if the kernel happened to have finished with the previously submitted buffer, and had returned it to the I/O memory pool, then it could hand back that buffer again. Otherwise it would hand back a different part of the I/O-buffer memory pool. This has the advantage over the proposed vmsplice scheme of not making the user process have to be told that it can start writing to its own memory again, and of allowing the user process to compose the next message before the previous one has been sent.
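The loaned-buffer idea in this comment can be sketched as a small state machine. The code below is a user-space simulation only: no such system call is described anywhere in the thread, and the names (io_pool, io_pool_get, io_pool_submit, io_pool_complete), the slot count and the slot size are all hypothetical. It shows a buffer moving from free, to loaned out to the process, to in flight in the kernel, and back to free, with the process simply asking for another slot instead of waiting for the previous one.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SLOTS 4     /* hypothetical pool size   */
#define SLOT_SIZE  4096  /* hypothetical buffer size */

enum slot_state { SLOT_FREE = 0, SLOT_LOANED, SLOT_IN_FLIGHT };

/* Simulated kernel-managed pool of I/O buffers. */
struct io_pool {
    enum slot_state state[POOL_SLOTS];
    unsigned char data[POOL_SLOTS][SLOT_SIZE];
};

/* "Request a buffer": loan the first free slot to the caller.
 * Returns the slot index, or -1 if every slot is busy. */
int io_pool_get(struct io_pool *p)
{
    assert(p != NULL);
    for (int i = 0; i < POOL_SLOTS; i++) {
        if (p->state[i] == SLOT_FREE) {
            p->state[i] = SLOT_LOANED;
            return i;
        }
    }
    return -1;
}

/* "Hand it back for sending": the caller gives up the slot and must not
 * touch it again. Returns 0 on success, -1 if the slot was not loaned. */
int io_pool_submit(struct io_pool *p, int slot)
{
    assert(p != NULL);
    if (slot < 0 || slot >= POOL_SLOTS || p->state[slot] != SLOT_LOANED)
        return -1;
    p->state[slot] = SLOT_IN_FLIGHT;
    return 0;
}

/* "Kernel finished": the transfer is done, so the slot rejoins the pool. */
int io_pool_complete(struct io_pool *p, int slot)
{
    assert(p != NULL);
    if (slot < 0 || slot >= POOL_SLOTS || p->state[slot] != SLOT_IN_FLIGHT)
        return -1;
    p->state[slot] = SLOT_FREE;
    return 0;
}
```

With this shape the process never needs a "you may write again" notification for its own memory; it only needs the pool to have a free slot, which matches the advantage the comment claims for the scheme.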
NetBSD has page loaning.
True zero-copy, with none of the problems of FreeBSD, or Linus's
proposed solution.
See "UVM" and "page loaning" in google.
wrong
No, page loaning still needs to shoot down TLBs and it is basically
the same as FreeBSD's COW sharing.
Except UVM is much less optimised than the FreeBSD VM (the downside of supporting all the architectures out there?), so UVM even ends up being slower.
Too Personal
I think you're taking this too personally. I think Linus' seeming flame bait here is intended to make people think & respond. Once debate gets too academic and dry, lots of people don't care as much. But state things like this and you'll get feedback :) And notice here that Linus didn't say anything like, "Linux is better then *BSD". All he said was that people doing this COW thing were dumb. If you're one of those developers then I can understand your unappreciative reply, otherwise, quit whining.
Of course he didn't say anything like that
He would know the difference between "then" and "than".
Clam down Linus
nice -1 linus
Uh, shouldn't that be:
nice -n19 linux
?
Efficiency is everything
IMHO one of Linus' most enduring qualities is his commitment to efficiency. Yes, one could take the Microsoft approach and simply throw more computing power at the problem but why would you really want to? Look at Windows, for example: Every release requires a beefier CPU and more RAM simply to maintain the same level of performance. That's the lazy way and it's stupid. Linux outperforms for a reason folks -- let us not forget that.
Efficiency is f*ck all when stability suffers
When there's a choice between a safe, simple way of doing something, and an unsafe, complex but marginally faster way, Linus will always opt for the latter. This is why Linux is marginally faster than FreeBSD or NetBSD in some macro-benchmarks, but suffers from data loss problems and unpredictable scalability. Whereas an OS like Solaris will default to synchronous filesystem writes, Linux defaults to asynchronous - with the inherent risk of data loss - because it gives it a dubious edge in benchmarks.
Linux kernel hackers also trumpeted the widespread use of O(1) algorithms, and yet again showed their lack of experience, as the algorithms did not in themselves ensure a smooth performance degradation as load increases. Other factors come into play, and these were ignored in the quest for an appealing but delusional bit of one-upmanship.
The mantra of "those that misunderstand Unix are condemned to reinvent it poorly" applies to Linux as much as any OS I've come across.
The whole pissing match is stupid.
And it depends on the relative speeds of TLBs and memory copies.
Perhaps memory wins today, perhaps it loses 3 years from now.
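A toy cost model makes that trade-off concrete. It is only a sketch: the function names are made up for illustration, and every cost is an arbitrary unit supplied by the caller, not a measurement of any real machine.

```python
def cow_cost(tlb_invalidate, write_fault, writes_later):
    """Cost of sharing one page via COW.

    The TLB invalidate is always paid; the fault is paid only if the
    sender writes to the page again afterwards.
    """
    return tlb_invalidate + (write_fault if writes_later else 0)


def copy_cost(page_bytes, cost_per_byte):
    """Cost of simply copying the page into a kernel buffer."""
    return page_bytes * cost_per_byte


def cheaper(tlb_invalidate, write_fault, writes_later, page_bytes, cost_per_byte):
    """Return 'cow' or 'copy', whichever is cheaper under the given costs."""
    cow = cow_cost(tlb_invalidate, write_fault, writes_later)
    cpy = copy_cost(page_bytes, cost_per_byte)
    return "cow" if cow < cpy else "copy"


if __name__ == "__main__":
    # Arbitrary illustrative units: the answer flips depending on whether
    # the sender ever writes to the page again.
    print(cheaper(3000, 5000, False, 4096, 1))
    print(cheaper(3000, 5000, True, 4096, 1))
```

Which side wins depends entirely on the ratios plugged in, which is the point being made: those ratios change from one hardware generation to the next.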
nice troll
This is why Linux is marginally faster than FreeBSD or NetBSD in some macro-benchmarks, but suffers from data loss problems and unpredictable scalability.
Ho ho ho.
http://marc.theaimsgroup.com/?t=114416927400001&r=1&w=2
A 2 year old Linux kernel is faster than FreeBSD 6 head by a factor of
7 here, and FreeBSD is actually twice as fast on a single core as it
is on 8 cores!
You were saying something about marginally faster? Unpredictable
scalability?
Linux defaults to asynchronous - with the inherent risk of data loss - because it gives it a dubious edge in benchmarks.
Linux uses journalling filesystems, same as any other modern OS. And
they are not for a "dubious edge in benchmarks". Get a clue.
Linux kernel hackers also trumpeted the widespread use of O(1) algorithms, and yet again showed their lack of experience, as the algorithms did not in themselves ensure a smooth performance degradation as load increases.
I'm sure you could have done much better, right? Let's see some of
your proof that Linux's algorithms are poor.
The mantra of "those that misunderstand Unix are condemned to reinvent it poorly" applies to Linux as much as any OS I've come across.
http://www.stdlib.net/~colmmacc/
So does that apply to Solaris 10 too, which is 50% slower than Linux
on their own hardware.
So does that apply to Solaris 10 too, which is 50% slower than Linux on their own hardware.
I've compared Linux and Solaris 10 on dual processor Sun Opteron boxes (I can't speak for their SPARC based machines), and Solaris scaled much better than Linux. This was for an application that placed particular emphasis on I/O throughput, and the output from the stats package I used (Orca) showed a smooth degradation in performance for Solaris, but Linux becoming less and less predictable as the load increased. My team saw similar unpredictability with Linux on Dell Xeon machines; however, we didn't try Solaris on the Dells, as the difference in performance between the Opterons and Xeons made the minimal cost savings of the Dells a moot point.
No, just trying to shut up the troll
I have no reason to doubt your claim, or to think that yours is
an isolated case. But as you can see there is published data (and
not only the linked benchmark) to show Linux can handily
outperform Solaris on some workloads.
Also, Solaris uses journalling filesystems for speed too.
So the blanket statement that Linux "sacrifices" stability for raw
performance is just a plain old troll (old being the operative word).
"Linux outperforms for a reason folks -- let us not forget that."
Why is the video playback in the 2.6.16 kernel so jerky then?
It's not like Linux is perfect and Linus is god.
It's not like Linux is perfect and Linus is god.
What?! NOOOOOOOOOOOOO!!!
Of course...
The fact that the FreeBSD implementation looks good in benchmarks obviously means that under specific circumstances, it improves performance... under other circumstances it doesn't and should therefore not be used. While the (not yet implemented) Linux method may be better for the general case, or even just as good for a superset of where the FreeBSD method looks good, that doesn't mean that the FreeBSD method is completely worthless. The facts are apparently that FreeBSD has an optimization that makes a significant difference for a specific small set of circumstances RIGHT NOW and Linux is going to get an optimization which will work with those circumstances and more (perhaps even better) EVENTUALLY. It's not as if all sockets are created as a ZERO_COPY_SOCKET or anything, some serious code changes can be required.
On the whole though, adding optimizations which make a small subset of real world usage faster and the rest of them slower is a waste of time. However, since FreeBSD and Linux are volunteer driven, if someone has a case that will benefit from ZERO_COPY_SOCKET and is willing to waste the time adding support for that, and it doesn't break anything else, there's no reason not to include it.
Speaking of COW, and how stupid it is, I'm pretty sure that the Linux fork() implementation uses COW and that the performance improvement was big enough that a real vfork() was removed since it made no sense to maintain it separately (re-added later because, of course, it does). Since apparently the rant is about COW in general, Linus is saying that the person/people who did the COW fork() work are "incompetent idiots". IIRC, Linus himself did a large hunk of that work (not positive though).
In summary then:
The fact that the FreeBSD implementation...
Exactly! Under the specific circumstance of using a large loop to repeatedly allocate a page but never actually do anything with it then discard it, thousands of times a second, it just SCREAMS!
No, it doesn't. It uses copy_process(). See linux/kernel/fork.c
Linux uses COW
No, it doesn't. It uses copy_process(). See linux/kernel/fork.c
It uses COW, but that's not the "VM games" Linus is talking about. It
does not have to shoot down TLBs (and it is a *much* *much* heavier-weight
operation than sending a page of data to the network stack).
And shoots down TLBs!
I stand corrected, it does have to flush TLBs, of course.
The difference is that a) COW in this sense _really_ does save
memory, and b) the flush is amortized over probably many hundreds
or thousands of pages, making it fairly insignificant.
(c) you have to flush the TLB anyway, since you're creating a new process with a new set of page tables
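The amortization argument above comes down to simple arithmetic, sketched here. The function name and the numbers are hypothetical, chosen only to show the shape of the argument; they are not measured flush costs or real process sizes.

```python
def amortized_flush_cost(flush_cost, pages_shared):
    """Per-page share of one TLB flush spread across pages_shared pages."""
    if pages_shared <= 0:
        raise ValueError("pages_shared must be positive")
    return flush_cost / pages_shared


if __name__ == "__main__":
    flush = 1000.0  # arbitrary units, assumed for illustration only

    # Sharing a single page with the network stack: the page bears the
    # whole cost of the flush.
    print(amortized_flush_cost(flush, 1))

    # fork() of a process with (say) a thousand pages: the same flush is
    # spread over all of them.
    print(amortized_flush_cost(flush, 1000))
```

Under these assumed numbers the per-page overhead differs by three orders of magnitude, which is why the same mechanism can be reasonable for fork() and questionable for a one-page socket send.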