And I appear to have sent the one without the usage comments at the top. Here they are: * USAGE: * $ gcc -g maptest.c * $ ./a.out z * $ hexdump -C z.NEW * * Notice the block of zero byte records appearing within * the output. * * Run with a second argument: * * $ ./a.out z 1 * $ hexdump -C z.NEW * * No more zeros! Bron. -- Bron Gondwana brong@fastmail.fm --
Very interesting. There's certainly something there. That said, there's a distracting bug which is visible when doing an strace lseek(4, 140333890921392, SEEK_SET) = -1 EINVAL (Invalid argument) write(4, "\0\0\0\0", 4) = 4 which is from that lseek(newfd, mapbase + offset + size - 8, 0); write(newfd, (char *) &zero, 4); where the addition of "mapbase" is insane. So that will write zeroes to the wrong part of the file (offset 64, to be exact). And that will get overwritten by the next write, making it all look entirely insane. That said, that bug may be distracting, but it seems to have nothign at all to do with the actual problem. The bug seems to happen only when the file is not pre-paged in. Nick? Linus --
I don't think this patch is correct, because it doesn't really fix the
basic issue (the code should do the right thing even if a page isn't
there), but it might hide it by faulting in the whole "bytes" range rather
than just the first iov.
So Nick, it's still over to you, but if this does hide it, then that's an
interesting detail in itself.
Linus
---
mm/filemap.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6a7d3..0080a27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1808,9 +1808,20 @@ EXPORT_SYMBOL(iov_iter_advance);
*/
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
- return fault_in_pages_readable(buf, bytes);
+ unsigned long offset = i->iov_offset;
+ const struct iovec *iov = i->iov;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + offset;
+ size_t n = min(bytes, iov->iov_len - offset);
+
+ if (fault_in_pages_readable(buf, n))
+ return -EFAULT;
+ bytes -= n;
+ offset = 0;
+ iov++;
+ }
+ return 0;
}
EXPORT_SYMBOL(iov_iter_fault_in_readable);
--
I actually am starting to think that the bug is in __copy_to_user_inatomic_nocache(). The return value of that thing in the case of a failure seems to be totally wrong. In particular, since that function does an unrolled loop of accesses like so: .Ls1: movq (%rsi),%r11 .Ls2: movq 1*8(%rsi),%r8 .Ls3: movq 2*8(%rsi),%r9 .Ls4: movq 3*8(%rsi),%r10 .Ld1: movnti %r11,(%rdi) .Ld2: movnti %r8,1*8(%rdi) .Ld3: movnti %r9,2*8(%rdi) .Ld4: movnti %r10,3*8(%rdi) it may have done three of the 64-bit loads, and then trap on the fouth: but since it hasn't done a _single_ of the stores, it shouldn't count as any different whether it traps on any of .Ls[1-4]. But that code definitely seems to think it makes a difference whether the exception happened at Ls1 or at Ls4, even though both points have copied _exactly_ as many bytes when the exception happens. So I'm starting to think the bug is all in there, not in the VM itself. See arch/x86/lib/copy_user_nocache.S. In fact, even the comment there implies that the author didn't know or care about returning the correct value. Andi and Ingo added to Cc. Linus --
Confirmed. The uncached user copies are totally broken. The number of bytes left uncopied is simply wrong, because of how it does that unrolled loop and doesn't account for the fact that just doing loads does not actually increase the number of bytes copied at all. So because the "copy_to_user_inatomic()" logic cares _deeply_ about how many bytes were actually copied, when the copy count is wrong, the code ends up thinking that it copied more bytes than it actually did, resulting in the corruption in the page cache. Nasty. That whole file is a mess. Sadly, so is the regular "copy_user_64.S" too (it has the same totally broken comment, too!), this is not just the uncached version. And the only reason that it only shows up with the uncached version in _practice_ is that the routine that uses the x86 string instructions (ie the "rep movsq" in copy_user_generic_string) actually gets this all right. So the bug is hidden in that case - which is most CPU's out there (all CPU's that have X86_FEATURE_REP_GOOD set). I don't think that code is reasonably salvageable. Damn. Linus --
Hmm. Something like this *may* salvage it. Untested, so far (I'll reboot and test soon enough), but even if it fixes things, it's not really very good. Why? Because of the unrolling, we may be doing 32 bytes worth of reads from user space before doing a *single* write, so if we have a user copy from the end of a page (say, 16 bytes from the end), we'd take the fault on the third 8-byte load, and not copy anything at all. So instead of copying 16 bytes and then returning "I copied 16 bytes", it will copy _zero_ bytes, and now return "I copied 0 bytes" (it used to incorrectly return that it copied 16 bytes because it could _read_ 16 bytes even though it never wrote them! Totally buggy!). But I think the mm/filemap.c routines handle this case correctly (because of how it probes at least the first iovec), so at least copied will generally not be zero. But as mentioned, this isn't tested yet. It would be better to not do the unrolling, or at least do the faulting behaviour correctly so that we fall back on a byte-for-byte copy when it faults. But not even the "rep movs" version has ever been _that_ careful, so I guess that's ok. Somebody should _really_ double-check this, and it would be wonderful if somebody can come up with something better than this patch. Linus --- arch/x86/lib/copy_user_64.S | 25 +++++++++++-------------- arch/x86/lib/copy_user_nocache_64.S | 25 +++++++++++-------------- 2 files changed, 22 insertions(+), 28 deletions(-) diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S index 70bebd3..ee1c3f6 100644 --- a/arch/x86/lib/copy_user_64.S +++ b/arch/x86/lib/copy_user_64.S @@ -217,19 +217,19 @@ ENTRY(copy_user_generic_unrolled) /* table sorted by exception address */ .section __ex_table,"a" .align 8 - .quad .Ls1,.Ls1e - .quad .Ls2,.Ls2e - .quad .Ls3,.Ls3e - .quad .Ls4,.Ls4e - .quad .Ld1,.Ls1e + .quad .Ls1,.Ls1e /* Ls1-Ls4 have copied zero bytes */ + .quad .Ls2,.Ls1e + .quad ...
If that fixes anything: - The caller is broken because it shouldn't pass a faulting source to copy_to_user() - And you broken copy_from_user error reporting which shares the same code -Andi --
Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug seems to be that you never even understood what copy_from_user() is supposed to do. The whole *and*only* reason for copy_to/from_user() existing AT ALL is exactly the fact that the source or destination access can fault. I don't really see why you continually start arguing about things that are OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a buggy routine, you shouldn't argue against it. So here's a hint: next time I claim some code of yours is buggy, either just acknowledge the bug, or stay silent. You'll look smarter that way. Linus --
yes, but only one of them (destination for copy_to_user and source for copy_from_user) Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry) for my person edification: what's a legitimate case for copy_to_user() where the source can fault? -Andi --
Andi, read the patch already. And then shut up. I'm so tired of you always trying to argue against fixes, just because you wrote the code. The fact is, that code was buggy. I've explained _why_ it was buggy. I've sent a patch. I've even talked about how I _tested_ the patch. You've read none of that, you just fixated on the fact that I said "to" instead of "from" by mistake, even though it's the exact same code. At no point did you at all bother to read the patch, nor did you try to understand the problem, nor did you follow the original report or try it out, or think about it. So tell me why I shouldn't just put you in my "idiots" filter? The aggravation just isn't worth it for me. When you grow up and can admit that your code was buggy, or at least *look* at the patches, feel free to start sending me email again. Until then, just don't bother. Linus --
Erm... The reason for copy_to_user() existing is that dereferencing a user-supplied address in the kernel does not have to access any memory reachable in user mode, regardless of any faults. WTF are you guys talking about? --
Linus seems to think that copy_to_user() should have copy_in_user semantics(). It happens to be in some cases (when string instructions are used), but not for the unrolled case. What seems also confusing him is that x86-64 copy_from/to_user use a shared subfunction. The trick that this subfunction uses is to assume that either the destination faults or the source, but never both. It's legal because the caller should never pass in a faulting source for copy to or a faulting destination for copy from. Actually they handle it, but the return value is not correct. Now he "fixed" copy_to_user to return a kind of correct return value for source faults, but it'll of course break copy_from_user()'s return value. It's still unclear why his patch fixes the test case. The caller should be using copy_in_user perhaps? Or is it just buggy by passing something unmapped to copy_to_user? -Andi --
No. I *know* that copy_to/from_user() is supposed to return the number of bytes not copied. And if you loaded a value but didn't store it to the destination, then BY DEFINITION you didn't copy it. Comprende? Obviously you don't. Gaah. Why do I keep bothering to even _try_ to educate you? Can't you accept that I just fixed a bug, which had a test-case an everything? Andi, really. Take it from me. If I tell you something, I'm usually right. Think about that for a moment, instead of living in your own little dream-world where you think you understand everything. Linus --
AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed with references to copy_*TO*_user(). With proposed patch indeed not affecting any legitimate calls of the latter. Does affect the former and that, from my reading of the code in question, correctly. IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they make sense. --
It looks like mount does need an exact copy, so they've rolled their own (exact_copy_from_user). I guess if you need an exact copy, then it doesn't really matter how inexact an inexact one is, it's still unusable :) All else being equal, a smaller maximum error is preferable, but surely that is outweighed by the correctness issue of returning a valid number of bytes left to operate on. BTW. we already have lots (although steadily declining number) of corner case issues around this whole area, but if we want to get really strict, even an inexact report may be wrong for filemap. Suppose we copy 10 bytes into the pagecache, but report that 5 were copied. That means, we'll subsequently re-copy the delta. Between these two copies, a 2nd writer might come in and write something over those 5 bytes. Then a reader might see the following sequence of those 10 bytes "0000000000" "1111111111" "2222222222" "2222211111" --
Ok, so I just rebooted with this, and it does indeed fix the bug. I'd be happier with a more complete fix (ie being byte-accurate and actually doing the partial copy when it hits a fault in the middle), but this seems to be the minimal fix, and at least fixes the totally bogus return values from the x86-64 __copy_user*() functions. Not that I checked that I got _all_ cases correct (and maybe there are other versions of __copy_user that I missed entirely), but Bron's test-case at least seems to work properly for me now. Bron? If you have a more complete test-suite (ie the real-world case that made you find this), it would be good to verify the whole thing. And again, this is the kind of thing that really wants others looking at it. I personally think that this thing should likely be written as C code, with just the inner loops done as asm (ie the inner "rep movsq" and the inner 64-byte unrolled cached/uncached copies done as inline asms, and then the outer layers could be C). Linus --
I have a real world test case using "cyr_dbtool" from Cyrus 2.3.12 on a known-bug-inducing piece of data (the key and value sizes in the example code were taken from that. Indeed, the example code started off building byte-for-byte identical files, then I changed it to write only a single character across the entire key and value so the hexdump was shorter) I'll give it a go. Bron. --
Is there any reason it doesn't use mmap(MAP_SHARED) and make the modifications that way too? Because quite frankly, the mixture of doing mmap() and write() system calls is quite fragile - and I'm not saying that just because of this particular bug, but because there are all kinds of nasty cache aliasing issues with virtually indexed caches etc that just fundamentally mean that it's often a mistake to mix mmap with read/write at the same time. (For the same reason it's not a good idea to mix writing through an mmap() and then using read() to read it - again, you can have some nasty aliasing going on there). So this particular issue was definitely a kernel bug (and big thanks for making such a good test-case), but in general, it does sound like Cyrus is actively trying to dig itself into a nasty hole there. Linus --
Portability[tm].
It actually does use MAP_SHARED already, but only for reading.
Writing is all done with seeks and writes, followed by a map
"refresh", which is really just an unmmap/mmap if the file has
extended past the "SLOP" (between 8 and 16 k after the end of
the file length at last mapping).
I can't just right now find the place where the reasoning behind
this was explained to me. The theory I think was that mmap write
support is unreliable across systems, but read support is pretty
I'm not actually a maintainer for Cyrus, I just write patches to
Yeah, indeed.
I suspect the response from the Cyrus side might include a
small dose of "POSIX says it's valid to do this, and making
it work is the kernel programmers' lookout".
Ahh - I found the explaination in doc/internal/hacking in
the Cyrus source tree. While 'ack' is a nice tool, it
doesn't check files with no extention by default. Ho hum:
- map_refresh and map_free
- In many cases, it is far more effective to read a file via the operating
system's mmap facility than it is to via the traditional read() and
lseek system calls. To this end, Cyrus provides an operating system
independent wrapper around the mmap() services (or lack thereof) of the
operating system.
- Cyrus currently only supports read-only memory maps, all writes back
to a file need to be done via the more traditional facilities. This is
to enable very low-performance support for operating systems which do not
provide an mmap() facility via a fake userspace mmap.
- To create a map, simply call map_refresh on the map (details are in
lib/map.h). To free it, call map_free on the same map.
- Despite the fact that the maps are read-only, it is often useful to open
the file descriptors O_RDWR, especially if the file decriptors could
possibly be used for writing elsewhere in the code. Some operating
systems REQUIRE file descriptors that are mmap()ed to be opened
O_RDWR, so just do ...Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable than mixing mmap() and "write()" - or at least more _consistent_. That said, it's probably six one way, and half a dozen the other. The shared writable mmap() doesn't work well on unix-lookalikes (ie "not real unix"). That does include really really old Linux versions (ie 1.x series), but more relevantly probably includes things like QNX etc. On the other hand, the mmap()+write(), as mentioned, doesn't work well on various hardware platforms where theer can be cache aliases, and that includes HP-UX (as you apparently have noticed), but I'm pretty certain there are other cases too. The cache alias issue can actually be really thorny, because it's going to be very hard to see and essentially random: if your working set is big enough (or the cache is small enough) that the cache basically gets flushed between the write and the access through the mmap (and vice versa), you'll never see any problems. But then, _occasionally_, you'll have really hard-to-replicate corruption due to cache aliases (ie you read something from the mmap() after the write, but you don't actually see the newly written data, because it's cached at a different virtual address). Linux tries really hard to be coherent between mmap and read/write even on those kinds of platforms, but I would definitely not call it "portable". It really is a fundamentally nasty thing, and depends deeply on the CPU Yeah, I can certainly see that working. That said, I can also see it failing, partly because of the CPU virtual indexing cache issues, but partly because it's such an unusual thing to do (partly because it simply is known not to work on some systems, ie HP-UX). And that will mean that it is probably not a well-tested path.. As you found out. (Side note: I mention HP-UX just because it is known to historically have totally and utterly brain-damaged and useless mmap support. It _may_ be that they've fixed it ...
As noted above, one thing cyrus does which does seem to be plain "wrong" is that it mmaps a region greater the file size (rounds to an 8k boundary, but 8k-16k past the current end of the file) and then assumes that when it writes to the end of the file (but less than the end of the mmap region) that there's no need to remmap and that data is immediately available within the previous mmaped region. Apparently that works on most OS's (but is what this bug actually exposed), but according to the mmap docs: --- If the size of the mapped file changes after the call to mmap() as a result of some other operation on the mapped file, the effect of references to portions of the mapped region that correspond to added or removed portions of the file is unspecified. --- The way I read that, even if you mmap a file with a size past the end of the file currently, if you subsequently write to the end of that file, you shouldn't assume that written data is available in the region you previously mmaped, which cyrus definitely does do. Amazingly (apart from HP/UX) no OS actually seems to have a problem with this since there would be massive cyrus bug reports otherwise. Rob ---------- robm@fastmail.fm Sign up at http://fastmail.fm for fast, ad free, IMAP accessible email --
Pretty much any OS that tries to be make mmap() coherent with regular read/write accesses will automatically also have to be coherent wrt file size updates. IOW, I don't think that cyrus is real any more "wrong" in this than in assuming that you can mix read/write and mmap() accesses. In fact, I suspect that Cyrus is probably _more_ conservative than most, in that it would not be totally unheard of to just do a much bigger mmap(), and not even bother to re-do it until the file grows past that size (ie no 8k/16k Note that if you really want to be portable, you simply must not mix mmap() with *any* other operations without sprinking in a healthy amount of "msync()" or unmapping/remapping entirely. So _in_practice_ - because everybody tries to do a good job - you can actually expect to have mmap() be coherent, even though there are no real Yeah. Over the years, the pain from having a non-coherent mmap() generally has pushed everybody into just making mmap() easy to use. Which means that mixing things generally works fine, even if it is not at all _guaranteed_. So I'd expect mmap+write to work and be coherent almost always. But it's still a fairly unusual combination, and I would personally think that using MAP_SHARED and writing through the mmap() would be the less surprising model. Linus --
On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
[reminder from way back: this bug was caused by writev containing
mmaped pages that weren't paged in, it's 64 bit only. It
Has this been revisited since? I haven't noticed, but I really only
skim LKML - have to save some time in the day for my real job[tm] of
It's been fine for us since, but unfortunately most of the world is
still running distribution "stable" kernels. I've just been helping a
user who's getting corrupted flat file databases on Ubuntu's stable 64
bit xen kernels, and it looks like it's the same issue.
Is there a standard way to tell backporters "you really need to add this
patch for your users' sanity"?
Bron ( I tried reporting it in Launchpad, but kept getting timeout
errors, so I figured reposting here might get noticed. Besides,
I can follow up on the "more complete fix" at the same time! )
--
Yes, there is. We backport the patch into earlier kernel releases and that action _should_ wake the distros up to take a look at the fix. This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25. So we did backport it into 2.6.25.x. Maybe distros were slow or errant in picking up the patch. --
It went into 2.6.25.8, which was the stable line at the time. It
didn't get backported to 2.6.24. Ubuntu's stable line is called
2.6.24-19.41, and I'm guessing they just didn't realise it was
a fix that actually needs to be applied to everything back to at
least 2.6.23.
I'll go try launchpad again and see if it's unbroken enough to
accept my bug report.
Bron ( really not wanting to keep fielding Cyrus questions on this
for the 4 1/2 years support remaining on that LTS distro! )
--
Bron Gondwana
brong@fastmail.fm
--
How can a load fault legitimately in copy_to_user? -Andi --
The x86-64 copy_*_user functions were always designed to return errors both ways (as in both for load and for store). That's needed because the loops are shared for copy_to_user and copy_from_user. That's normally ok because when you do _to_user you shouldn't fault on the loads and vice versa. If a caller does that it's buggy. -Andi --
That's not the problem, Andi. The problem is that it returns THE WRONG VALUE! If the fault happened on the second load, but the first load was never actually paired up with a store (because of unrolling the loop), then YOU MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly _zero_ bytes, even though you _loaded_ 8 bytes successfully! See? Claiming that you copied 8 bytes when you didn't do anything at all is WRONG. It's so incredibly wrong that it is scary. Linus --
Loads are not supposed to fault in copy_to_user(). Only stores are. The way it works is that it assumes that either loads fault (when used If your patch fixes something then the main wrong thing is the caller who passes a faulting source address. And again it always breaks the other case. -Andi --
Andi, just shut up already. It is "copy_user". Notice the lack of "from" or "to". That code handles Andi, SHUT THE F*CK UP. Read the code. Read the patch. Sorry for ever involving you. I don't want to hear your idiotic whining. Linus --
Oh, and yes, I said "copy_to_user_inatomic()" by mistake. The fact is that "write()" obviously does copy_from_user, but more importantly that on x86 it's THE EXACT SAME ROUTINE. Linus --
I did exactly the same patch first on your first email and then I realized The patch is wrong because it'll break the other case (in this case copy_from_user) -Andi --
Ok, I'm now putting you in my idiots filter. No, it will not break the other case. You're an idiot. Loading a value without using it will not break anything, quite the reverse. What the patch does is to _fix_ copy_from_user(), because if the second load traps, then the fact that we did the first load IS IMMATERIAL, because its result was never stored! So the patch _fixes_ copy_from_user(), exactly because it says that even if you've loaded 24 bytes, but you faulted on the fourth load, you've still _copied_ exactly zero bytes, because you didn't actually store the 24 bytes you loaded. And it is a no-op for copy_to_user, since copy_to_user will never fault on the load (not the first one, not the second one, not _any_ load), so the exception table entries for the loads are unusued. IOW: - the patch fixes a bug - you refuse to acknowledge this - I'll put you in my "flamers" filter that goes into another mailbox, because it's not worth my time even arguing with you any more. Sorry for ever adding you to the cc. I thought it might be a good idea, since you were the author of the code. But clearly the bug was not because you made a mistake, but because you simply don't seem to understand what the function is supposed to return, and you're not even interested in learning. Linus --
Yes, the new filemap.c code does not require an exact byte count, but it won't work if there is an under-estimation of the number of bytes left to copy. The old filemap.c code actually also relies on the byte count in some cases, I can't remember off the top of my head, but I *think* it was a security measure to prevent uninitialized data leak. Most other cases of course only care about complete success or not, but there are others. filemap_xip, splice are a couple that pop up. Thanks for working that all out before I even read my email :) --
Bah - crap. Sorry about that. I sent you the wrong version of the code. That's what I get for not putting it in revision control. That bug exists because I did an htonl on a value that I shouldn't have. Bron ( now to read the rest of the thread! ) --
