And I appear to have sent the one without the usage comments at the top.
Here they are:* USAGE:
* $ gcc -g maptest.c
* $ ./a.out z
* $ hexdump -C z.NEW
*
* Notice the block of zero byte records appearing within
* the output.
*
* Run with a second argument:
*
* $ ./a.out z 1
* $ hexdump -C z.NEW
*
* No more zeros!Bron.
--
Bron Gondwana
brong@fastmail.fm--
Very interesting. There's certainly something there.
That said, there's a distracting bug which is visible when doing an strace
lseek(4, 140333890921392, SEEK_SET) = -1 EINVAL (Invalid argument)
write(4, "\0\0\0\0", 4) = 4which is from that
lseek(newfd, mapbase + offset + size - 8, 0);
write(newfd, (char *) &zero, 4);where the addition of "mapbase" is insane. So that will write zeroes to
the wrong part of the file (offset 64, to be exact). And that will get
overwritten by the next write, making it all look entirely insane.That said, that bug may be distracting, but it seems to have nothign at
all to do with the actual problem. The bug seems to happen only when the
file is not pre-paged in.Nick?
Linus
--
Bah - crap. Sorry about that. I sent you the wrong version of the
code. That's what I get for not putting it in revision control.That bug exists because I did an htonl on a value that I shouldn't
have.Bron ( now to read the rest of the thread! )
--
I don't think this patch is correct, because it doesn't really fix the
basic issue (the code should do the right thing even if a page isn't
there), but it might hide it by faulting in the whole "bytes" range rather
than just the first iov.So Nick, it's still over to you, but if this does hide it, then that's an
interesting detail in itself.Linus
---
mm/filemap.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6a7d3..0080a27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1808,9 +1808,20 @@ EXPORT_SYMBOL(iov_iter_advance);
*/
int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
{
- char __user *buf = i->iov->iov_base + i->iov_offset;
- bytes = min(bytes, i->iov->iov_len - i->iov_offset);
- return fault_in_pages_readable(buf, bytes);
+ unsigned long offset = i->iov_offset;
+ const struct iovec *iov = i->iov;
+
+ while (bytes) {
+ char __user *buf = iov->iov_base + offset;
+ size_t n = min(bytes, iov->iov_len - offset);
+
+ if (fault_in_pages_readable(buf, n))
+ return -EFAULT;
+ bytes -= n;
+ offset = 0;
+ iov++;
+ }
+ return 0;
}
EXPORT_SYMBOL(iov_iter_fault_in_readable);--
I actually am starting to think that the bug is in
__copy_to_user_inatomic_nocache().The return value of that thing in the case of a failure seems to be
totally wrong. In particular, since that function does an unrolled loop of
accesses like so:.Ls1: movq (%rsi),%r11
.Ls2: movq 1*8(%rsi),%r8
.Ls3: movq 2*8(%rsi),%r9
.Ls4: movq 3*8(%rsi),%r10
.Ld1: movnti %r11,(%rdi)
.Ld2: movnti %r8,1*8(%rdi)
.Ld3: movnti %r9,2*8(%rdi)
.Ld4: movnti %r10,3*8(%rdi)it may have done three of the 64-bit loads, and then trap on the fouth:
but since it hasn't done a _single_ of the stores, it shouldn't count as
any different whether it traps on any of .Ls[1-4]. But that code
definitely seems to think it makes a difference whether the exception
happened at Ls1 or at Ls4, even though both points have copied _exactly_
as many bytes when the exception happens.So I'm starting to think the bug is all in there, not in the VM itself.
See arch/x86/lib/copy_user_nocache.S.In fact, even the comment there implies that the author didn't know or
care about returning the correct value.Andi and Ingo added to Cc.
Linus
--
The x86-64 copy_*_user functions were always designed to return errors
both ways (as in both for load and for store). That's needed because
the loops are shared for copy_to_user and copy_from_user. That's normally
ok because when you do _to_user you shouldn't fault on the loads
and vice versa. If a caller does that it's buggy.-Andi
--
That's not the problem, Andi.
The problem is that it returns THE WRONG VALUE!
If the fault happened on the second load, but the first load was never
actually paired up with a store (because of unrolling the loop), then YOU
MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly
_zero_ bytes, even though you _loaded_ 8 bytes successfully!See?
Claiming that you copied 8 bytes when you didn't do anything at all is
WRONG. It's so incredibly wrong that it is scary.Linus
--
Loads are not supposed to fault in copy_to_user(). Only stores are.
The way it works is that it assumes that either loads fault (when used
If your patch fixes something then the main wrong thing is the caller
who passes a faulting source address.And again it always breaks the other case.
-Andi
--
Andi, just shut up already.
It is "copy_user". Notice the lack of "from" or "to". That code handles
Andi, SHUT THE F*CK UP. Read the code. Read the patch. Sorry for ever
involving you. I don't want to hear your idiotic whining.Linus
--
I did exactly the same patch first on your first email and then I realized
The patch is wrong because it'll break the other case (in this case
copy_from_user)-Andi
--
Ok, I'm now putting you in my idiots filter.
No, it will not break the other case. You're an idiot. Loading a value
without using it will not break anything, quite the reverse. What the
patch does is to _fix_ copy_from_user(), because if the second load traps,
then the fact that we did the first load IS IMMATERIAL, because its result
was never stored!So the patch _fixes_ copy_from_user(), exactly because it says that even
if you've loaded 24 bytes, but you faulted on the fourth load, you've
still _copied_ exactly zero bytes, because you didn't actually store the
24 bytes you loaded.And it is a no-op for copy_to_user, since copy_to_user will never fault on
the load (not the first one, not the second one, not _any_ load), so the
exception table entries for the loads are unusued.IOW:
- the patch fixes a bug
- you refuse to acknowledge this
- I'll put you in my "flamers" filter that goes into another mailbox,
because it's not worth my time even arguing with you any more.Sorry for ever adding you to the cc. I thought it might be a good idea,
since you were the author of the code. But clearly the bug was not because
you made a mistake, but because you simply don't seem to understand what
the function is supposed to return, and you're not even interested in
learning.Linus
--
Yes, the new filemap.c code does not require an exact byte count, but
it won't work if there is an under-estimation of the number of bytes
left to copy.The old filemap.c code actually also relies on the byte count in some
cases, I can't remember off the top of my head, but I *think* it was a
security measure to prevent uninitialized data leak.Most other cases of course only care about complete success or not, but
there are others. filemap_xip, splice are a couple that pop up.Thanks for working that all out before I even read my email :)
--
Oh, and yes, I said "copy_to_user_inatomic()" by mistake. The fact is that
"write()" obviously does copy_from_user, but more importantly that on x86
it's THE EXACT SAME ROUTINE.Linus
--
Confirmed.
The uncached user copies are totally broken. The number of bytes left
uncopied is simply wrong, because of how it does that unrolled loop and
doesn't account for the fact that just doing loads does not actually
increase the number of bytes copied at all.So because the "copy_to_user_inatomic()" logic cares _deeply_ about how
many bytes were actually copied, when the copy count is wrong, the code
ends up thinking that it copied more bytes than it actually did, resulting
in the corruption in the page cache.Nasty.
That whole file is a mess. Sadly, so is the regular "copy_user_64.S" too
(it has the same totally broken comment, too!), this is not just the
uncached version.And the only reason that it only shows up with the uncached version in
_practice_ is that the routine that uses the x86 string instructions (ie
the "rep movsq" in copy_user_generic_string) actually gets this all right.So the bug is hidden in that case - which is most CPU's out there (all
CPU's that have X86_FEATURE_REP_GOOD set).I don't think that code is reasonably salvageable. Damn.
Linus
--
How can a load fault legitimately in copy_to_user?
-Andi
--
Hmm. Something like this *may* salvage it.
Untested, so far (I'll reboot and test soon enough), but even if it fixes
things, it's not really very good.Why?
Because of the unrolling, we may be doing 32 bytes worth of reads from
user space before doing a *single* write, so if we have a user copy from
the end of a page (say, 16 bytes from the end), we'd take the fault on the
third 8-byte load, and not copy anything at all.So instead of copying 16 bytes and then returning "I copied 16 bytes", it
will copy _zero_ bytes, and now return "I copied 0 bytes" (it used to
incorrectly return that it copied 16 bytes because it could _read_ 16
bytes even though it never wrote them! Totally buggy!).But I think the mm/filemap.c routines handle this case correctly (because
of how it probes at least the first iovec), so at least copied will
generally not be zero. But as mentioned, this isn't tested yet.It would be better to not do the unrolling, or at least do the faulting
behaviour correctly so that we fall back on a byte-for-byte copy when it
faults. But not even the "rep movs" version has ever been _that_ careful,
so I guess that's ok.Somebody should _really_ double-check this, and it would be wonderful if
somebody can come up with something better than this patch.Linus
---
arch/x86/lib/copy_user_64.S | 25 +++++++++++--------------
arch/x86/lib/copy_user_nocache_64.S | 25 +++++++++++--------------
2 files changed, 22 insertions(+), 28 deletions(-)diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 70bebd3..ee1c3f6 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -217,19 +217,19 @@ ENTRY(copy_user_generic_unrolled)
/* table sorted by exception address */
.section __ex_table,"a"
.align 8
- .quad .Ls1,.Ls1e
- .quad .Ls2,.Ls2e
- .quad .Ls3,.Ls3e
- .quad .Ls4,.Ls4e
- .quad .Ld1,.Ls1e
+ .quad .Ls1,.Ls1e /* Ls1-Ls4 have copied zero bytes */
+ .quad .Ls2,.Ls1e
+ .quad .Ls3,.Ls...
Ok, so I just rebooted with this, and it does indeed fix the bug.
I'd be happier with a more complete fix (ie being byte-accurate and
actually doing the partial copy when it hits a fault in the middle), but
this seems to be the minimal fix, and at least fixes the totally bogus
return values from the x86-64 __copy_user*() functions.Not that I checked that I got _all_ cases correct (and maybe there are
other versions of __copy_user that I missed entirely), but Bron's
test-case at least seems to work properly for me now.Bron? If you have a more complete test-suite (ie the real-world case that
made you find this), it would be good to verify the whole thing.And again, this is the kind of thing that really wants others looking at
it. I personally think that this thing should likely be written as C code,
with just the inner loops done as asm (ie the inner "rep movsq" and the
inner 64-byte unrolled cached/uncached copies done as inline asms, and
then the outer layers could be C).Linus
--
On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:
[reminder from way back: this bug was caused by writev containing
mmaped pages that weren't paged in, it's 64 bit only. ItHas this been revisited since? I haven't noticed, but I really only
skim LKML - have to save some time in the day for my real job[tm] ofIt's been fine for us since, but unfortunately most of the world is
still running distribution "stable" kernels. I've just been helping a
user who's getting corrupted flat file databases on Ubuntu's stable 64
bit xen kernels, and it looks like it's the same issue.Is there a standard way to tell backporters "you really need to add this
patch for your users' sanity"?Bron ( I tried reporting it in Launchpad, but kept getting timeout
errors, so I figured reposting here might get noticed. Besides,
I can follow up on the "more complete fix" at the same time! )
--
Yes, there is. We backport the patch into earlier kernel releases and
that action _should_ wake the distros up to take a look at the fix.This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was
included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25.
So we did backport it into 2.6.25.x. Maybe distros were slow or errant
in picking up the patch.--
It went into 2.6.25.8, which was the stable line at the time. It
didn't get backported to 2.6.24. Ubuntu's stable line is called
2.6.24-19.41, and I'm guessing they just didn't realise it was
a fix that actually needs to be applied to everything back to at
least 2.6.23.I'll go try launchpad again and see if it's unbroken enough to
accept my bug report.Bron ( really not wanting to keep fielding Cyrus questions on this
for the 4 1/2 years support remaining on that LTS distro! )
--
Bron Gondwana
brong@fastmail.fm--
Is there any reason it doesn't use mmap(MAP_SHARED) and make the
modifications that way too?Because quite frankly, the mixture of doing mmap() and write() system
calls is quite fragile - and I'm not saying that just because of this
particular bug, but because there are all kinds of nasty cache aliasing
issues with virtually indexed caches etc that just fundamentally mean that
it's often a mistake to mix mmap with read/write at the same time.(For the same reason it's not a good idea to mix writing through an mmap()
and then using read() to read it - again, you can have some nasty aliasing
going on there).So this particular issue was definitely a kernel bug (and big thanks for
making such a good test-case), but in general, it does sound like Cyrus is
actively trying to dig itself into a nasty hole there.Linus
--
Portability[tm].
It actually does use MAP_SHARED already, but only for reading.
Writing is all done with seeks and writes, followed by a map
"refresh", which is really just an unmmap/mmap if the file has
extended past the "SLOP" (between 8 and 16 k after the end of
the file length at last mapping).I can't just right now find the place where the reasoning behind
this was explained to me. The theory I think was that mmap write
support is unreliable across systems, but read support is prettyI'm not actually a maintainer for Cyrus, I just write patches to
Yeah, indeed.
I suspect the response from the Cyrus side might include a
small dose of "POSIX says it's valid to do this, and making
it work is the kernel programmers' lookout".Ahh - I found the explaination in doc/internal/hacking in
the Cyrus source tree. While 'ack' is a nice tool, it
doesn't check files with no extention by default. Ho hum:- map_refresh and map_free
- In many cases, it is far more effective to read a file via the operating
system's mmap facility than it is to via the traditional read() and
lseek system calls. To this end, Cyrus provides an operating system
independent wrapper around the mmap() services (or lack thereof) of the
operating system.- Cyrus currently only supports read-only memory maps, all writes back
to a file need to be done via the more traditional facilities. This is
to enable very low-performance support for operating systems which do not
provide an mmap() facility via a fake userspace mmap.- To create a map, simply call map_refresh on the map (details are in
lib/map.h). To free it, call map_free on the same map.- Despite the fact that the maps are read-only, it is often useful to open
the file descriptors O_RDWR, especially if the file decriptors could
possibly be used for writing elsewhere in the code. Some operating
systems REQUIRE file descriptors that are mmap()ed to be opened
O_RDWR, so just do it....
Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable
than mixing mmap() and "write()" - or at least more _consistent_.That said, it's probably six one way, and half a dozen the other. The
shared writable mmap() doesn't work well on unix-lookalikes (ie "not real
unix"). That does include really really old Linux versions (ie 1.x
series), but more relevantly probably includes things like QNX etc.On the other hand, the mmap()+write(), as mentioned, doesn't work well on
various hardware platforms where theer can be cache aliases, and that
includes HP-UX (as you apparently have noticed), but I'm pretty certain
there are other cases too.The cache alias issue can actually be really thorny, because it's going to
be very hard to see and essentially random: if your working set is big
enough (or the cache is small enough) that the cache basically gets
flushed between the write and the access through the mmap (and vice
versa), you'll never see any problems.But then, _occasionally_, you'll have really hard-to-replicate corruption
due to cache aliases (ie you read something from the mmap() after the
write, but you don't actually see the newly written data, because it's
cached at a different virtual address).Linux tries really hard to be coherent between mmap and read/write even on
those kinds of platforms, but I would definitely not call it "portable".
It really is a fundamentally nasty thing, and depends deeply on the CPUYeah, I can certainly see that working. That said, I can also see it
failing, partly because of the CPU virtual indexing cache issues, but
partly because it's such an unusual thing to do (partly because it simply
is known not to work on some systems, ie HP-UX). And that will mean that
it is probably not a well-tested path.. As you found out.(Side note: I mention HP-UX just because it is known to historically have
totally and utterly brain-damaged and useless mmap support. It _may_ be
that they've fixed it in mo...
As noted above, one thing cyrus does which does seem to be plain "wrong"
is that it mmaps a region greater the file size (rounds to an 8k
boundary, but 8k-16k past the current end of the file) and then assumes
that when it writes to the end of the file (but less than the end of the
mmap region) that there's no need to remmap and that data is immediately
available within the previous mmaped region.Apparently that works on most OS's (but is what this bug actually
exposed), but according to the mmap docs:---
If the size of the mapped file changes after the call to mmap() as a
result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added or
removed portions of the file is unspecified.
---The way I read that, even if you mmap a file with a size past the end of
the file currently, if you subsequently write to the end of that file,
you shouldn't assume that written data is available in the region you
previously mmaped, which cyrus definitely does do.Amazingly (apart from HP/UX) no OS actually seems to have a problem with
this since there would be massive cyrus bug reports otherwise.Rob
----------
robm@fastmail.fm
Sign up at http://fastmail.fm for fast, ad free, IMAP accessible email--
Pretty much any OS that tries to be make mmap() coherent with regular
read/write accesses will automatically also have to be coherent wrt file
size updates.IOW, I don't think that cyrus is real any more "wrong" in this than in
assuming that you can mix read/write and mmap() accesses. In fact, I
suspect that Cyrus is probably _more_ conservative than most, in that it
would not be totally unheard of to just do a much bigger mmap(), and not
even bother to re-do it until the file grows past that size (ie no 8k/16kNote that if you really want to be portable, you simply must not mix
mmap() with *any* other operations without sprinking in a healthy amount
of "msync()" or unmapping/remapping entirely.So _in_practice_ - because everybody tries to do a good job - you can
actually expect to have mmap() be coherent, even though there are no realYeah. Over the years, the pain from having a non-coherent mmap() generally
has pushed everybody into just making mmap() easy to use. Which means that
mixing things generally works fine, even if it is not at all _guaranteed_.So I'd expect mmap+write to work and be coherent almost always. But it's
still a fairly unusual combination, and I would personally think that
using MAP_SHARED and writing through the mmap() would be the less
surprising model.Linus
--
I have a real world test case using "cyr_dbtool" from Cyrus 2.3.12 on a
known-bug-inducing piece of data (the key and value sizes in the example
code were taken from that. Indeed, the example code started off
building byte-for-byte identical files, then I changed it to write only
a single character across the entire key and value so the hexdump was
shorter)I'll give it a go.
Bron.
--
If that fixes anything:
- The caller is broken because it shouldn't pass a faulting source to copy_to_user()
- And you broken copy_from_user error reporting which shares the same code-Andi
--
Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug
seems to be that you never even understood what copy_from_user() is
supposed to do.The whole *and*only* reason for copy_to/from_user() existing AT ALL is
exactly the fact that the source or destination access can fault.I don't really see why you continually start arguing about things that are
OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a
buggy routine, you shouldn't argue against it.So here's a hint: next time I claim some code of yours is buggy, either
just acknowledge the bug, or stay silent. You'll look smarter that way.Linus
--
Erm... The reason for copy_to_user() existing is that dereferencing
a user-supplied address in the kernel does not have to access any
memory reachable in user mode, regardless of any faults. WTF are you
guys talking about?
--
Linus seems to think that copy_to_user() should have
copy_in_user semantics(). It happens to be in some cases (when string instructions
are used), but not for the unrolled case.What seems also confusing him is that x86-64 copy_from/to_user use a shared
subfunction. The trick that this subfunction uses is to assume that
either the destination faults or the source, but never both. It's legal
because the caller should never pass in a faulting source for copy to
or a faulting destination for copy from.Actually they handle it, but the return value is not correct.
Now he "fixed" copy_to_user to return a kind of correct return value
for source faults, but it'll of course break copy_from_user()'s return value.It's still unclear why his patch fixes the test case. The caller should
be using copy_in_user perhaps? Or is it just buggy by passing something
unmapped to copy_to_user?-Andi
--
AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed
with references to copy_*TO*_user(). With proposed patch indeed not affecting
any legitimate calls of the latter. Does affect the former and that, from
my reading of the code in question, correctly.IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
make sense.
--
It looks like mount does need an exact copy, so they've rolled their own
(exact_copy_from_user). I guess if you need an exact copy, then it doesn't
really matter how inexact an inexact one is, it's still unusable :)All else being equal, a smaller maximum error is preferable, but surely
that is outweighed by the correctness issue of returning a valid number of
bytes left to operate on.BTW. we already have lots (although steadily declining number) of corner
case issues around this whole area, but if we want to get really strict,
even an inexact report may be wrong for filemap.Suppose we copy 10 bytes into the pagecache, but report that 5 were copied.
That means, we'll subsequently re-copy the delta. Between these two copies,
a 2nd writer might come in and write something over those 5 bytes. Then a
reader might see the following sequence of those 10 bytes
"0000000000"
"1111111111"
"2222222222"
"2222211111"
--
No.
I *know* that copy_to/from_user() is supposed to return the number of
bytes not copied.And if you loaded a value but didn't store it to the destination, then BY
DEFINITION you didn't copy it.Comprende?
Obviously you don't.
Gaah. Why do I keep bothering to even _try_ to educate you? Can't you
accept that I just fixed a bug, which had a test-case an everything?Andi, really. Take it from me. If I tell you something, I'm usually right.
Think about that for a moment, instead of living in your own little
dream-world where you think you understand everything.Linus
--
yes, but only one of them (destination for copy_to_user and source for
copy_from_user)Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry)
for my person edification: what's a legitimate case for copy_to_user()
where the source can fault?-Andi
--
Andi, read the patch already. And then shut up. I'm so tired of you always
trying to argue against fixes, just because you wrote the code.The fact is, that code was buggy. I've explained _why_ it was buggy. I've
sent a patch. I've even talked about how I _tested_ the patch. You've read
none of that, you just fixated on the fact that I said "to" instead of
"from" by mistake, even though it's the exact same code.At no point did you at all bother to read the patch, nor did you try to
understand the problem, nor did you follow the original report or try it
out, or think about it.So tell me why I shouldn't just put you in my "idiots" filter? The
aggravation just isn't worth it for me.When you grow up and can admit that your code was buggy, or at least
*look* at the patches, feel free to start sending me email again.Until then, just don't bother.
Linus
--
| Thomas Gleixner | Re: Linux 2.6.21-rc1 |
| Tarkan Erimer | Re: Dual-Licensing Linux Kernel with GPL V2 and GPL V3 |
| James Bottomley | [Ksummit-2008-discuss] Fixing the Kernel Janitors project |
| James Morris | Re: LSM conversion to static interface |
git: | |
| Natalie Protasevich | [BUG] New Kernel Bugs |
| Christoph Hellwig | Re: [PATCH 06/32] IGET: Mark iget() and read_inode() as being obsolete [try #2] |
| Linus Torvalds | Re: [GIT]: Networking |
| Jarek Poplawski | [PATCH take 2] pkt_sched: Protect gen estimators under est_lock. |
