Re: BUG: mmapfile/writev spurious zero bytes (x86_64/not i386, bisected, reproducable)

Previous thread: [PATCH] cdrom: don't check CDC_PLAY_AUDIO in cdrom_count_tracks() by Tejun Heo on Monday, June 16, 2008 - 10:20 pm. (3 messages)

Next thread: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls. by Neil Brown on Tuesday, June 17, 2008 - 12:03 am. (2 messages)
From: Bron Gondwana
Date: Monday, June 16, 2008 - 11:00 pm

[Empty message]
From: Bron Gondwana
Date: Monday, June 16, 2008 - 11:02 pm

And I appear to have sent the one without the usage comments at the top.
Here they are:

 * USAGE:
 *    $ gcc -g maptest.c
 *    $ ./a.out z
 *    $ hexdump -C z.NEW
 *
 * Notice the block of zero byte records appearing within
 * the output.
 *
 * Run with a second argument:
 *
 *    $ ./a.out z 1
 *    $ hexdump -C z.NEW
 *
 * No more zeros!

Bron.
-- 
  Bron Gondwana
  brong@fastmail.fm

--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 10:08 am

Very interesting. There's certainly something there. 

That said, there's a distracting bug which is visible when doing an strace

 lseek(4, 140333890921392, SEEK_SET)     = -1 EINVAL (Invalid argument)
 write(4, "\0\0\0\0", 4)                 = 4

which is from that

	lseek(newfd, mapbase + offset + size - 8, 0);
	write(newfd, (char *) &zero, 4);

where the addition of "mapbase" is insane. So that will write zeroes to 
the wrong part of the file (offset 64, to be exact). And that will get 
overwritten by the next write, making it all look entirely insane.

That said, that bug may be distracting, but it seems to have nothign at 
all to do with the actual problem. The bug seems to happen only when the 
file is not pre-paged in.

Nick?

		Linus
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 10:45 am

I don't think this patch is correct, because it doesn't really fix the 
basic issue (the code should do the right thing even if a page isn't 
there), but it might hide it by faulting in the whole "bytes" range rather 
than just the first iov.

So Nick, it's still over to you, but if this does hide it, then that's an 
interesting detail in itself.

		Linus

---
 mm/filemap.c |   17 ++++++++++++++---
 1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1e6a7d3..0080a27 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1808,9 +1808,20 @@ EXPORT_SYMBOL(iov_iter_advance);
  */
 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 {
-	char __user *buf = i->iov->iov_base + i->iov_offset;
-	bytes = min(bytes, i->iov->iov_len - i->iov_offset);
-	return fault_in_pages_readable(buf, bytes);
+	unsigned long offset = i->iov_offset;
+	const struct iovec *iov = i->iov;
+
+	while (bytes) {
+		char __user *buf = iov->iov_base + offset;
+		size_t n = min(bytes, iov->iov_len - offset);
+
+		if (fault_in_pages_readable(buf, n))
+			return -EFAULT;
+		bytes -= n;
+		offset = 0;
+		iov++;
+	}
+	return 0;	
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 1:16 pm

I actually am starting to think that the bug is in 
__copy_to_user_inatomic_nocache().

The return value of that thing in the case of a failure seems to be 
totally wrong. In particular, since that function does an unrolled loop of 
accesses like so:

	.Ls1:   movq (%rsi),%r11
	.Ls2:   movq 1*8(%rsi),%r8
	.Ls3:   movq 2*8(%rsi),%r9
	.Ls4:   movq 3*8(%rsi),%r10
	.Ld1:   movnti %r11,(%rdi)
	.Ld2:   movnti %r8,1*8(%rdi)
	.Ld3:   movnti %r9,2*8(%rdi)
	.Ld4:   movnti %r10,3*8(%rdi)

it may have done three of the 64-bit loads, and then trap on the fouth: 
but since it hasn't done a _single_ of the stores, it shouldn't count as 
any different whether it traps on any of .Ls[1-4]. But that code 
definitely seems to think it makes a difference whether the exception 
happened at Ls1 or at Ls4, even though both points have copied _exactly_ 
as many bytes when the exception happens.

So I'm starting to think the bug is all in there, not in the VM itself. 
See arch/x86/lib/copy_user_nocache.S.

In fact, even the comment there implies that the author didn't know or 
care about returning the correct value.

Andi and Ingo added to Cc.

			Linus
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 1:41 pm

Confirmed.

The uncached user copies are totally broken. The number of bytes left 
uncopied is simply wrong, because of how it does that unrolled loop and 
doesn't account for the fact that just doing loads does not actually 
increase the number of bytes copied at all.

So because the "copy_to_user_inatomic()" logic cares _deeply_ about how 
many bytes were actually copied, when the copy count is wrong, the code 
ends up thinking that it copied more bytes than it actually did, resulting 
in the corruption in the page cache.

Nasty.

That whole file is a mess. Sadly, so is the regular "copy_user_64.S" too 
(it has the same totally broken comment, too!), this is not just the 
uncached version.

And the only reason that it only shows up with the uncached version in 
_practice_ is that the routine that uses the x86 string instructions (ie 
the "rep movsq" in copy_user_generic_string) actually gets this all right. 

So the bug is hidden in that case - which is most CPU's out there (all 
CPU's that have X86_FEATURE_REP_GOOD set).

I don't think that code is reasonably salvageable. Damn. 

		Linus
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:06 pm

Hmm. Something like this *may* salvage it.

Untested, so far (I'll reboot and test soon enough), but even if it fixes 
things, it's not really very good. 

Why?

Because of the unrolling, we may be doing 32 bytes worth of reads from 
user space before doing a *single* write, so if we have a user copy from 
the end of a page (say, 16 bytes from the end), we'd take the fault on the 
third 8-byte load, and not copy anything at all.

So instead of copying 16 bytes and then returning "I copied 16 bytes", it 
will copy _zero_ bytes, and now return "I copied 0 bytes" (it used to 
incorrectly return that it copied 16 bytes because it could _read_ 16 
bytes even though it never wrote them! Totally buggy!).

But I think the mm/filemap.c routines handle this case correctly (because 
of how it probes at least the first iovec), so at least copied will 
generally not be zero. But as mentioned, this isn't tested yet.

It would be better to not do the unrolling, or at least do the faulting 
behaviour correctly so that we fall back on a byte-for-byte copy when it 
faults. But not even the "rep movs" version has ever been _that_ careful, 
so I guess that's ok.

Somebody should _really_ double-check this, and it would be wonderful if 
somebody can come up with something better than this patch.

		Linus
---
 arch/x86/lib/copy_user_64.S         |   25 +++++++++++--------------
 arch/x86/lib/copy_user_nocache_64.S |   25 +++++++++++--------------
 2 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 70bebd3..ee1c3f6 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -217,19 +217,19 @@ ENTRY(copy_user_generic_unrolled)
 	/* table sorted by exception address */
 	.section __ex_table,"a"
 	.align 8
-	.quad .Ls1,.Ls1e
-	.quad .Ls2,.Ls2e
-	.quad .Ls3,.Ls3e
-	.quad .Ls4,.Ls4e
-	.quad .Ld1,.Ls1e
+	.quad .Ls1,.Ls1e	/* Ls1-Ls4 have copied zero bytes */
+	.quad .Ls2,.Ls1e
+	.quad ...
From: Andi Kleen
Date: Tuesday, June 17, 2008 - 2:16 pm

If that fixes anything:
- The caller is broken because it shouldn't pass a faulting source to copy_to_user()
- And you broken copy_from_user error reporting which shares the same code

-Andi


--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:24 pm

Andi, I'm sorry I cc'd you. You are the author of that crap, but the bug 
seems to be that you never even understood what copy_from_user() is 
supposed to do.

The whole *and*only* reason for copy_to/from_user() existing AT ALL is 
exactly the fact that the source or destination access can fault. 

I don't really see why you continually start arguing about things that are 
OBVIOUSLY BUGGY, as if they weren't buggy. Once somebody has debugged a 
buggy routine, you shouldn't argue against it. 

So here's a hint: next time I claim some code of yours is buggy, either 
just acknowledge the bug, or stay silent. You'll look smarter that way.

			Linus
--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 2:30 pm

yes, but only one of them (destination for copy_to_user and source for
copy_from_user)


Ok if I'm really wrong on this (but frankly I don't see the mistake, sorry)
for my person edification: what's a legitimate case for copy_to_user()
where the source can fault?

-Andi


--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:37 pm

Andi, read the patch already. And then shut up. I'm so tired of you always 
trying to argue against fixes, just because you wrote the code.

The fact is, that code was buggy. I've explained _why_ it was buggy. I've 
sent a patch. I've even talked about how I _tested_ the patch. You've read 
none of that, you just fixated on the fact that I said "to" instead of 
"from" by mistake, even though it's the exact same code.

At no point did you at all bother to read the patch, nor did you try to 
understand the problem, nor did you follow the original report or try it 
out, or think about it.

So tell me why I shouldn't just put you in my "idiots" filter? The 
aggravation just isn't worth it for me.

When you grow up and can admit that your code was buggy, or at least 
*look* at the patches, feel free to start sending me email again.

Until then, just don't bother.

		Linus
--

From: Al Viro
Date: Tuesday, June 17, 2008 - 2:36 pm

Erm...  The reason for copy_to_user() existing is that dereferencing
a user-supplied address in the kernel does not have to access any
memory reachable in user mode, regardless of any faults.  WTF are you
guys talking about?
--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 2:42 pm

Linus seems to think that copy_to_user() should have
copy_in_user semantics(). It happens to be in some cases (when string instructions
are used), but not for the unrolled case.

What seems also confusing him is that x86-64 copy_from/to_user use a shared
subfunction. The trick that this subfunction uses is to assume that
either the destination faults or the source, but never both. It's legal
because the caller should never pass in a faulting source for copy to
or a faulting destination for copy from.

Actually they handle it, but the return value is not correct.

Now he "fixed" copy_to_user to return a kind of correct return value
for source faults, but it'll of course break copy_from_user()'s return value.

It's still unclear why his patch fixes the test case. The caller should
be using copy_in_user perhaps? Or is it just buggy by passing something
unmapped to copy_to_user?

-Andi

--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:49 pm

No. 

I *know* that copy_to/from_user() is supposed to return the number of 
bytes not copied.

And if you loaded a value but didn't store it to the destination, then BY 
DEFINITION you didn't copy it.

Comprende?

Obviously you don't.

Gaah. Why do I keep bothering to even _try_ to educate you? Can't you 
accept that I just fixed a bug, which had a test-case an everything?

Andi, really. Take it from me. If I tell you something, I'm usually right. 
Think about that for a moment, instead of living in your own little 
dream-world where you think you understand everything.

		Linus
--

From: Al Viro
Date: Tuesday, June 17, 2008 - 3:11 pm

AFAICS, what happened is that b0rken copy_*FROM*_user() had been discussed
with references to copy_*TO*_user().  With proposed patch indeed not affecting
any legitimate calls of the latter.  Does affect the former and that, from
my reading of the code in question, correctly.

IOW, s/copy_to_user/copy_from_user/ in Linus' postings upthread and they
make sense.
--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 3:21 pm

[Empty message]
From: Nick Piggin
Date: Tuesday, June 17, 2008 - 11:22 pm

It looks like mount does need an exact copy, so they've rolled their own
(exact_copy_from_user). I guess if you need an exact copy, then it doesn't
really matter how inexact an inexact one is, it's still unusable :)

All else being equal, a smaller maximum error is preferable, but surely
that is outweighed by the correctness issue of returning a valid number of
bytes left to operate on.

BTW. we already have lots (although steadily declining number) of corner
case issues around this whole area, but if we want to get really strict,
even an inexact report may be wrong for filemap.

Suppose we copy 10 bytes into the pagecache, but report that 5 were copied.
That means, we'll subsequently re-copy the delta. Between these two copies,
a 2nd writer might come in and write something over those 5 bytes. Then a
reader might see the following sequence of those 10 bytes
"0000000000"
"1111111111"
"2222222222"
"2222211111"
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:20 pm

Ok, so I just rebooted with this, and it does indeed fix the bug.

I'd be happier with a more complete fix (ie being byte-accurate and 
actually doing the partial copy when it hits a fault in the middle), but 
this seems to be the minimal fix, and at least fixes the totally bogus 
return values from the x86-64 __copy_user*() functions.

Not that I checked that I got _all_ cases correct (and maybe there are 
other versions of __copy_user that I missed entirely), but Bron's 
test-case at least seems to work properly for me now.

Bron? If you have a more complete test-suite (ie the real-world case that 
made you find this), it would be good to verify the whole thing. 

And again, this is the kind of thing that really wants others looking at 
it. I personally think that this thing should likely be written as C code, 
with just the inner loops done as asm (ie the inner "rep movsq" and the 
inner 64-byte unrolled cached/uncached copies done as inline asms, and 
then the outer layers could be C).

		Linus
--

From: Bron Gondwana
Date: Tuesday, June 17, 2008 - 7:27 pm

I have a real world test case using "cyr_dbtool" from Cyrus 2.3.12 on a
known-bug-inducing piece of data (the key and value sizes in the example
code were taken from that.  Indeed, the example code started off
building byte-for-byte identical files, then I changed it to write only
a single character across the entire key and value so the hexdump was
shorter)

I'll give it a go.

Bron.
--

From: Bron Gondwana
Date: Tuesday, June 17, 2008 - 8:14 pm

[Empty message]
From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 9:03 pm

Is there any reason it doesn't use mmap(MAP_SHARED) and make the 
modifications that way too?

Because quite frankly, the mixture of doing mmap() and write() system 
calls is quite fragile - and I'm not saying that just because of this 
particular bug, but because there are all kinds of nasty cache aliasing 
issues with virtually indexed caches etc that just fundamentally mean that 
it's often a mistake to mix mmap with read/write at the same time.

(For the same reason it's not a good idea to mix writing through an mmap() 
and then using read() to read it - again, you can have some nasty aliasing 
going on there).

So this particular issue was definitely a kernel bug (and big thanks for 
making such a good test-case), but in general, it does sound like Cyrus is 
actively trying to dig itself into a nasty hole there.

			Linus
--


Portability[tm].

It actually does use MAP_SHARED already, but only for reading.
Writing is all done with seeks and writes, followed by a map
"refresh", which is really just an unmmap/mmap if the file has
extended past the "SLOP" (between 8 and 16 k after the end of
the file length at last mapping).

I can't just right now find the place where the reasoning behind
this was explained to me.  The theory I think was that mmap write
support is unreliable across systems, but read support is pretty 

I'm not actually a maintainer for Cyrus, I just write patches to

Yeah, indeed.

I suspect the response from the Cyrus side might include a
small dose of "POSIX says it's valid to do this, and making
it work is the kernel programmers' lookout".

Ahh - I found the explaination in doc/internal/hacking in
the Cyrus source tree.  While 'ack' is a nice tool, it
doesn't check files with no extention by default.  Ho hum:

- map_refresh and map_free

  - In many cases, it is far more effective to read a file via the operating
    system's mmap facility than it is to via the traditional read() and
    lseek system calls.  To this end, Cyrus provides an operating system
    independent wrapper around the mmap() services (or lack thereof) of the
    operating system.

  - Cyrus currently only supports read-only memory maps, all writes back
    to a file need to be done via the more traditional facilities.  This is
    to enable very low-performance support for operating systems which do not
    provide an mmap() facility via a fake userspace mmap.

  - To create a map, simply call map_refresh on the map (details are in
    lib/map.h).  To free it, call map_free on the same map.

  - Despite the fact that the maps are read-only, it is often useful to open
    the file descriptors O_RDWR, especially if the file decriptors could
    possibly be used for writing elsewhere in the code. Some operating
    systems REQUIRE file descriptors that are mmap()ed to be opened
    O_RDWR, so just do ...

Hmm.. I'm pretty sure that using MAP_SHARED for writing is _more_ portable 
than mixing mmap() and "write()" - or at least more _consistent_.

That said, it's probably six one way, and half a dozen the other. The 
shared writable mmap() doesn't work well on unix-lookalikes (ie "not real 
unix"). That does include really really old Linux versions (ie 1.x 
series), but more relevantly probably includes things like QNX etc.

On the other hand, the mmap()+write(), as mentioned, doesn't work well on 
various hardware platforms where theer can be cache aliases, and that 
includes HP-UX (as you apparently have noticed), but I'm pretty certain 
there are other cases too.

The cache alias issue can actually be really thorny, because it's going to 
be very hard to see and essentially random: if your working set is big 
enough (or the cache is small enough) that the cache basically gets 
flushed between the write and the access through the mmap (and vice 
versa), you'll never see any problems.

But then, _occasionally_, you'll have really hard-to-replicate corruption 
due to cache aliases (ie you read something from the mmap() after the 
write, but you don't actually see the newly written data, because it's 
cached at a different virtual address).

Linux tries really hard to be coherent between mmap and read/write even on 
those kinds of platforms, but I would definitely not call it "portable". 
It really is a fundamentally nasty thing, and depends deeply on the CPU 

Yeah, I can certainly see that working. That said, I can also see it 
failing, partly because of the CPU virtual indexing cache issues, but 
partly because it's such an unusual thing to do (partly because it simply 
is known not to work on some systems, ie HP-UX). And that will mean that 
it is probably not a well-tested path.. As you found out.

(Side note: I mention HP-UX just because it is known to historically have 
totally and utterly brain-damaged and useless mmap support. It _may_ be 
that they've fixed it ...

As noted above, one thing cyrus does which does seem to be plain "wrong"
is that it mmaps a region greater the file size (rounds to an 8k
boundary, but 8k-16k past the current end of the file) and then assumes
that when it writes to the end of the file (but less than the end of the
mmap region) that there's no need to remmap and that data is immediately
available within the previous mmaped region.

Apparently that works on most OS's (but is what this bug actually
exposed), but according to the mmap docs:

---
If the size of the mapped file changes after the call to mmap() as a
result of some other operation on the mapped file, the effect of
references to portions of the mapped region that correspond to added or
removed portions of the file is unspecified.
---

The way I read that, even if you mmap a file with a size past the end of
the file currently, if you subsequently write to the end of that file,
you shouldn't assume that written data is available in the region you
previously mmaped, which cyrus definitely does do.

Amazingly (apart from HP/UX) no OS actually seems to have a problem with
this since there would be massive cyrus bug reports otherwise.

Rob

----------
robm@fastmail.fm
Sign up at http://fastmail.fm for fast, ad free, IMAP accessible email

--


Pretty much any OS that tries to be make mmap() coherent with regular 
read/write accesses will automatically also have to be coherent wrt file 
size updates.

IOW, I don't think that cyrus is real any more "wrong" in this than in 
assuming that you can mix read/write and mmap() accesses. In fact, I 
suspect that Cyrus is probably _more_ conservative than most, in that it 
would not be totally unheard of to just do a much bigger mmap(), and not 
even bother to re-do it until the file grows past that size (ie no 8k/16k 

Note that if you really want to be portable, you simply must not mix 
mmap() with *any* other operations without sprinking in a healthy amount 
of "msync()" or unmapping/remapping entirely.

So _in_practice_ - because everybody tries to do a good job - you can 
actually expect to have mmap() be coherent, even though there are no real 

Yeah. Over the years, the pain from having a non-coherent mmap() generally 
has pushed everybody into just making mmap() easy to use. Which means that 
mixing things generally works fine, even if it is not at all _guaranteed_. 

So I'd expect mmap+write to work and be coherent almost always. But it's 
still a fairly unusual combination, and I would personally think that 
using MAP_SHARED and writing through the mmap() would be the less 
surprising model.

		Linus
--

From: Bron Gondwana
Date: Friday, October 3, 2008 - 4:44 am

On Tue, Jun 17, 2008 at 02:20:49PM -0700, Linus Torvalds wrote:

[reminder from way back: this bug was caused by writev containing 
mmaped pages that weren't paged in, it's 64 bit only.  It

Has this been revisited since?  I haven't noticed, but I really only
skim LKML - have to save some time in the day for my real job[tm] of

It's been fine for us since, but unfortunately most of the world is
still running distribution "stable" kernels.  I've just been helping a
user who's getting corrupted flat file databases on Ubuntu's stable 64
bit xen kernels, and it looks like it's the same issue.

Is there a standard way to tell backporters "you really need to add this
patch for your users' sanity"?

Bron ( I tried reporting it in Launchpad, but kept getting timeout
       errors, so I figured reposting here might get noticed.  Besides,
			 I can follow up on the "more complete fix" at the same time! )
--

From: Andrew Morton
Date: Friday, October 3, 2008 - 6:07 am

Yes, there is.  We backport the patch into earlier kernel releases and
that action _should_ wake the distros up to take a look at the fix.

This particular fix (42a886af728c089df8da1b0017b0e7e6c81b5335) was
included in 2.6.26 and also is present in 2.6.25.17, but not 2.6.25. 
So we did backport it into 2.6.25.x.  Maybe distros were slow or errant
in picking up the patch.

--

From: Bron Gondwana
Date: Friday, October 3, 2008 - 5:13 pm

It went into 2.6.25.8, which was the stable line at the time.  It
didn't get backported to 2.6.24.  Ubuntu's stable line is called
2.6.24-19.41, and I'm guessing they just didn't realise it was
a fix that actually needs to be applied to everything back to at
least 2.6.23.

I'll go try launchpad again and see if it's unbroken enough to
accept my bug report.

Bron ( really not wanting to keep fielding Cyrus questions on this
       for the 4 1/2 years support remaining on that LTS distro! )
-- 
  Bron Gondwana
  brong@fastmail.fm

--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 2:15 pm

How can a load fault legitimately in copy_to_user?

-Andi


--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 1:58 pm

The x86-64 copy_*_user functions were always designed to return errors
both ways (as in both for load and for store). That's needed because
the loops are shared for copy_to_user and copy_from_user. That's normally
ok because when you do _to_user you shouldn't fault on the loads
and vice versa. If a caller does that it's buggy.

-Andi
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:14 pm

That's not the problem, Andi.

The problem is that it returns THE WRONG VALUE!

If the fault happened on the second load, but the first load was never 
actually paired up with a store (because of unrolling the loop), then YOU 
MUST NOT CLAIM THAT YOU DID A 8-BYTE COPY! Because you have copied exactly 
_zero_ bytes, even though you _loaded_ 8 bytes successfully!

See?

Claiming that you copied 8 bytes when you didn't do anything at all is 
WRONG. It's so incredibly wrong that it is scary. 

			Linus
--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 2:26 pm

Loads are not supposed to fault in copy_to_user(). Only stores are.

The way it works is that it assumes that either loads fault (when used

If your patch fixes something then the main wrong thing is the caller
who passes a faulting source address.

And again it always breaks the other case.

-Andi


--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:31 pm

Andi, just shut up already.

It is "copy_user". Notice the lack of "from" or "to". That code handles 

Andi, SHUT THE F*CK UP. Read the code. Read the patch. Sorry for ever 
involving you. I don't want to hear your idiotic whining.

		Linus
--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:34 pm

Oh, and yes, I said "copy_to_user_inatomic()" by mistake. The fact is that 
"write()" obviously does copy_from_user, but more importantly that on x86 
it's THE EXACT SAME ROUTINE. 

			Linus
--

From: Andi Kleen
Date: Tuesday, June 17, 2008 - 2:34 pm

I did exactly the same patch first on your first email and then I realized

The patch is wrong because it'll break the other case (in this case
copy_from_user)

-Andi

--

From: Linus Torvalds
Date: Tuesday, June 17, 2008 - 2:46 pm

Ok, I'm now putting you in my idiots filter.

No, it will not break the other case. You're an idiot. Loading a value 
without using it will not break anything, quite the reverse. What the 
patch does is to _fix_ copy_from_user(), because if the second load traps, 
then the fact that we did the first load IS IMMATERIAL, because its result 
was never stored!

So the patch _fixes_ copy_from_user(), exactly because it says that even 
if you've loaded 24 bytes, but you faulted on the fourth load, you've 
still _copied_ exactly zero bytes, because you didn't actually store the 
24 bytes you loaded.

And it is a no-op for copy_to_user, since copy_to_user will never fault on 
the load (not the first one, not the second one, not _any_ load), so the 
exception table entries for the loads are unusued.

IOW: 
 - the patch fixes a bug
 - you refuse to acknowledge this
 - I'll put you in my "flamers" filter that goes into another mailbox, 
   because it's not worth my time even arguing with you any more.

Sorry for ever adding you to the cc. I thought it might be a good idea, 
since you were the author of the code. But clearly the bug was not because 
you made a mistake, but because you simply don't seem to understand what 
the function is supposed to return, and you're not even interested in 
learning.

			Linus
--

From: Nick Piggin
Date: Tuesday, June 17, 2008 - 11:10 pm

Yes, the new filemap.c code does not require an exact byte count, but
it won't work if there is an under-estimation of the number of bytes
left to copy.

The old filemap.c code actually also relies on the byte count in some
cases, I can't remember off the top of my head, but I *think* it was a
security measure to prevent uninitialized data leak.

Most other cases of course only care about complete success or not, but
there are others. filemap_xip, splice are a couple that pop up.

Thanks for working that all out before I even read my email :)
--

From: Bron Gondwana
Date: Tuesday, June 17, 2008 - 7:21 pm

Bah - crap.  Sorry about that.  I sent you the wrong version of the
code.  That's what I get for not putting it in revision control.

That bug exists because I did an htonl on a value that I shouldn't
have.

Bron ( now to read the rest of the thread! )
--

Previous thread: [PATCH] cdrom: don't check CDC_PLAY_AUDIO in cdrom_count_tracks() by Tejun Heo on Monday, June 16, 2008 - 10:20 pm. (3 messages)

Next thread: Re: [PATCH] knfsd: nfsd: Handle ERESTARTSYS from syscalls. by Neil Brown on Tuesday, June 17, 2008 - 12:03 am. (2 messages)