Re: [patch 0/6] fault vs truncate/invalidate race fix

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:49 am

The following set of patches are based on current git.

These fix the fault vs invalidate and fault vs truncate_range race for
filemap_nopage mappings, plus those and fault vs truncate race for nonlinear
mappings.

These patches fix silent data corruption that we've had several people hitting
in SUSE kernels. Our kernels have similar patches to lock the page over page
fault, and no problem.

I've also got rid of the horrible populate API, and integrated nonlinear pages
properly with the page fault path.

Downside is that this adds one more vector through which the buffered write
deadlock can occur. However this is just a very tiny one (pte being unmapped
for reclaim), compared to all the other ways that deadlock can occur (unmap,
reclaim, truncate, invalidate). I doubt it will be noticeable. At any rate, it
is better than data corruption.

I hope these can get merged (at least into -mm) soon.

Thanks,
Nick

--
SuSE Labs

-

To: Nick Piggin <npiggin@...>
Cc: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, February 27, 2007 - 12:36 am

Have these been put into -mm? Can I expect them in the next -mm so I
can start merging the DRM memory manager code into my -mm tree?

Dave.
-

To: Dave Airlie <airlied@...>
Cc: <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Tuesday, February 27, 2007 - 1:32 am

Not yet - I need to get back on the correct continent, review the code,
stuff like that. It still hurts that this work makes the write() deadlock

What is the linkage between these patches and DRM?
-

To: Andrew Morton <akpm@...>
Cc: Dave Airlie <airlied@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Tuesday, February 27, 2007 - 4:50 am

s/harder/easier of course...

I think there is good reason to assume the buffered write page lock
deadlocks would not occur in "normal" programs (or very very few),
because it would require writing from the same page you are writing to,
or 2 processes writing from the page the other is writing to. If any
innocent users do hit this, at least it is not data corrupting, and is
relatively easy to trace back to the kernel.
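
A minimal user-space sketch of the access pattern described above, "writing
from the same page you are writing to": the buffer handed to pwrite() is a
shared mapping of the very page being written. The function name and the
temporary path are illustrative assumptions. This is not a reliable
reproducer; whether such a call can hang depends on the kernel and on the
source page being unmapped or evicted at the wrong moment.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write one page of a file using, as the source buffer, a shared mapping
 * of that same page. Returns the byte count reported by pwrite(), or -1
 * if any setup step fails.
 */
ssize_t write_page_from_itself(void)
{
    char path[] = "/tmp/selfwrite-XXXXXX";   /* illustrative location */
    long page = sysconf(_SC_PAGESIZE);
    ssize_t ret = -1;
    int fd;

    assert(page > 0);
    fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* file goes away when fd closes */

    if (ftruncate(fd, (off_t)page) == 0) {
        char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (map != MAP_FAILED) {
            memset(map, 'x', (size_t)page);  /* fault the page in and dirty it */
            /* Source buffer and destination are the same pagecache page. */
            ret = pwrite(fd, map, (size_t)page, 0);
            munmap(map, (size_t)page);
        }
    }
    close(fd);
    return ret;
}
```

The two-process variant mentioned above would have each process pass a
mapping of the page the other one is writing to.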

In the case of local DoS exploits, the deadlocks already present in the
buffered write path are already trivial to exploit... locking the page
in the fault path doesn't make the deadlock exploit any more possible.

So the downside to merging is that we _may_ get some additional deadlocks.

What is being fixed is silent data corruption that has been reported by
several different users of the SLES kernel (because we have assertions
there to catch it), and can be triggered by DIO or NFS, or anything using
vmtruncate_range or invalidate_inode_pages2 on regular files. Or even a
regular truncate with nonlinear pages. These are known problems on
production workloads.

That's my argument for merging these. I think it's reasonable, but I'm
open to debate.

I did get some page fault performance numbers at one stage. Nothing
really exciting seemed to happen IIRC, but I can do another set of tests

To be fair, I have 2 ways to fix it. Unfortunately one is slow and the
other requires cooperation from filesystem developers. perform_write() is
still on track, but it is going to take a reasonable amount of time and
effort to convert filesystems. I just can't see any gain in holding these
patches back until that all happens.

Thanks,
Nick
-

To: Andrew Morton <akpm@...>
Cc: <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Tuesday, February 27, 2007 - 2:26 am

The new fault handler made the memory manager code a lot cleaner and
much less hacky in a lot of cases, so I'd rather merge the clean code
than have to fight with the current code...

Dave.
-

To: Dave Airlie <airlied@...>
Cc: Andrew Morton <akpm@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>
Date: Tuesday, February 27, 2007 - 2:54 am

Note that you can probably get away with NOPFN_REFAULT etc... like I did
for the SPEs in the meantime.

Ben.

-

To: Benjamin Herrenschmidt <benh@...>
Cc: Andrew Morton <akpm@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>
Date: Sunday, March 18, 2007 - 7:13 pm

Indeed, Thomas has done this work and I'm just lining up a TTM tree to
start the merge process..

Dave.
-

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:50 am

Nonlinear mappings are (AFAIKS) simply a virtual memory concept that
encodes the virtual address -> file offset differently from linear
mappings.

I can't see why the filesystem/pagecache code should need to know anything
about it, except for the fact that the ->nopage handler didn't quite pass
down enough information (ie. pgoff). But it is more logical to pass pgoff
rather than have the ->nopage function calculate it itself anyway. And
having the nopage handler install the pte itself is sort of nasty.
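
As a rough illustration of the difference, here is a toy user-space model of
how a fault address turns into a file page offset. For a linear mapping the
offset follows from the address alone; for a nonlinear one it has to be
looked up per page (in the kernel it is encoded in the not-present pte). The
struct, the function name and the 4 KiB page size are assumptions made for
this sketch, not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SHIFT 12   /* assumed 4 KiB pages */

/* Toy stand-in for the few vma fields that matter here. */
struct toy_vma {
    unsigned long vm_start;          /* first virtual address of the mapping */
    unsigned long vm_end;            /* one past the last mapped address */
    unsigned long vm_pgoff;          /* file page mapped at vm_start */
    const unsigned long *nonlinear;  /* per-page file offsets, NULL if linear */
};

/* Compute the pgoff that a fault handler would be asked for. */
unsigned long toy_fault_pgoff(const struct toy_vma *vma, unsigned long address)
{
    unsigned long idx;

    assert(address >= vma->vm_start && address < vma->vm_end);
    idx = (address - vma->vm_start) >> PAGE_SHIFT;

    if (vma->nonlinear)
        return vma->nonlinear[idx];  /* arbitrary offset stored per page */
    return vma->vm_pgoff + idx;      /* linear: offset implied by the address */
}
```

Either way the handler only needs the resulting pgoff, which is the argument
for computing it in the caller and passing it down.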

This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.

The rationale for doing this in the first place is that nonlinear mappings
are subject to the pagefault vs invalidate/truncate race too, and it seemed
stupid to duplicate the synchronisation logic rather than just consolidate
the two.

After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.

NOPAGE_REFAULT is removed. This should be implemented with ->fault, and
no users have hit mainline yet.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Documentation/feature-removal-schedule.txt | 27 ++++++
Documentation/filesystems/Locking | 2
fs/gfs2/ops_address.c | 2
fs/gfs2/ops_file.c | 2
fs/gfs2/ops_vm.c | 34 ++++---
fs/ncpfs/mmap.c | 23 ++---
fs/ocfs2/aops.c | 2
fs/ocfs2/mmap.c | 17 +--
fs/xfs/linux-2.6/xfs_file.c | 23 ++---
include/linux/mm.h | 36 ++++++--
ipc/shm.c | 2
mm/filemap.c | 93 ++++++++++++--------
mm/filemap_xip.c ...

To: Nick Piggin <npiggin@...>
Cc: Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 2:51 am

It's awkward to layer a largely do-nothing patch like this on top of a
significant functional change. Makes it harder to isolate the source of

Did benh agree with that?

The patch unchangeloggedly adds a basic new structure to core mm
(fault_data). Would be nice to document its fields, especially `flags'.

Please add fewer pointless blank lines.

How well has this been tested? The ocfs2 changes? gfs2? We should at
least give those guys a heads-up.

Does anybody really pass a NULL `type' arg into filemap_nopage()?

This patch seems to churn things around an awful lot for minimal benefit.

-

To: Andrew Morton <akpm@...>
Cc: Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 6:05 am

I won't use NOPAGE_REFAULT, I use NOPFN_REFAULT and that has hit
mainline. I will switch to ->fault when I have time to adapt the code,
in the meantime, NOPFN_REFAULT should stay.

Note that one thing we really want with the new ->fault (though I
haven't looked at the patches lately to see if it's available) is to be
able to differentiate faults coming from userspace from faults coming
from the kernel. The major difference is that the former can be
re-executed to handle signals, the latter can't. Thus waiting in the
fault handler can be made interruptible in the former case, not in the
latter case.
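
A toy sketch of that distinction, assuming a hypothetical "fault came from
user mode" flag. TOY_FAULT_FLAG_USER and the result codes are invented for
illustration; nothing in the thread says the patches provide such a flag.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_FAULT_FLAG_WRITE 0x01u  /* write access, as in the patches */
#define TOY_FAULT_FLAG_USER  0x08u  /* hypothetical: fault raised from user mode */
#define TOY_FAULT_FLAG_MASK  (TOY_FAULT_FLAG_WRITE | TOY_FAULT_FLAG_USER)

enum toy_fault_result {
    TOY_FAULT_COMPLETED,  /* waited for the resource and finished */
    TOY_FAULT_RESTART     /* backed out; the faulting instruction reruns */
};

/* Decide what a handler that has to wait does when a signal is pending. */
enum toy_fault_result toy_fault_wait(unsigned int flags, bool signal_pending)
{
    assert((flags & ~TOY_FAULT_FLAG_MASK) == 0);

    /* Only a user-mode fault can be abandoned and re-executed later. */
    if ((flags & TOY_FAULT_FLAG_USER) && signal_pending)
        return TOY_FAULT_RESTART;

    /* A kernel-mode fault has no restart point, so it waits to completion. */
    return TOY_FAULT_COMPLETED;
}
```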

Ben.

-

To: Benjamin Herrenschmidt <benh@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 6:17 am

I think I removed not only NOPFN_REFAULT, but also nopfn itself, *and*
adapted the code for you ;) it is in patch 5/6, sent a while ago.

-

To: Nick Piggin <npiggin@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 6:46 am

Ok, I need to look. I've been travelling, having meetings etc... for the
last couple of weeks and I'm taking a week off next week :-)

Ben.

-

To: Andrew Morton <akpm@...>
Cc: Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 3:19 am

The major vs. minor fault accounting patch that introduced the argument
didn't make non-NULL type arguments a requirement. It's essentially an
optional second return value and the NULL pointer represents the caller
choosing to ignore it. I'm not sure I actually liked that aspect of it,
but that's how it ended up going in. I think it had something to do
with driver churn clashing with the sweep at the time of the merge. I'd
rather the argument be mandatory and defaulted to VM_FAULT_MINOR.
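
The convention can be sketched like this: the type is computed regardless,
and only stored when the caller supplied somewhere to put it. The toy_ names
and constants are stand-ins for this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_FAULT_MINOR 1   /* page was already cached */
#define TOY_FAULT_MAJOR 2   /* page had to be read in */

struct toy_page {
    int in_cache;
};

/*
 * Toy ->nopage-style handler. `type` is an optional second return value:
 * passing NULL means the caller does not care about major vs. minor.
 */
struct toy_page *toy_nopage(struct toy_page *page, int *type)
{
    int fault_type = TOY_FAULT_MINOR;

    assert(page != NULL);
    if (!page->in_cache) {
        page->in_cache = 1;          /* pretend the page was read from disk */
        fault_type = TOY_FAULT_MAJOR;
    }
    if (type)
        *type = fault_type;          /* report only if the caller asked */
    return page;
}
```

Making the argument mandatory would simply drop the NULL test and require
every caller to pass a variable.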

It's something of a non-answer, though, since it only discusses a
convention as opposed to reviewing specific callers of filemap_nopage().
NULL type arguments to ->nopage() are rare at most, and could be easily
eliminated, at least for in-tree drivers.

egrep -nr 'nopage.*NULL' . 2>/dev/null | grep -v '^Bin' on a current
git tree yields zero matches.

-- wli
-

To: Andrew Morton <akpm@...>
Cc: Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 3:08 am

OK. This is actually something that I would like more people to review.
Do we need any different fields? Should it be passed as arguments instead

Dunno, it's exported. I remove that completely in a subsequent patch

Well it fixes the whole design of the nonlinear fault path.
-

To: Andrew Morton <akpm@...>
Cc: Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Ingo Molnar <mingo@...>
Date: Wednesday, March 7, 2007 - 4:19 am

If it doesn't look very impressive, it could be because it leaves all
the old crud around for backwards compatibility (the worst offenders
are removed in patch 6/6).

If you look at the patchset as a whole, it removes about 250 lines,
mostly of (non trivial) duplicated code in filemap.c memory.c shmem.c
fremap.c, that is nonlinear pages specific and doesn't get anywhere
near the testing that the linear fault path does.

A minimal fix for nonlinear pages would have required changing all
->populate handlers, which I simply thought was not very productive
considering the testing and coverage issues, and that I was going to
rewrite the nonlinear path anyway.

If you like, you can consider patches 1,2,3 as the fix, and ignore
nonlinear (hey, it doesn't even bother checking truncate_count today!).

Then 4,5,6 is the fault/nonlinear rewrite, take it or leave it. I thought
you would have liked the patches...

-

To: Nick Piggin <npiggin@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 4:27 am

btw., if we decide that nonlinear isn't worth the continuing maintenance
pain, we could internally implement/emulate sys_remap_file_pages() via a
call to mremap() and essentially deprecate it, without breaking the ABI
- and remove all the nonlinear code. (This would split fremap areas into
separate vmas)
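
For illustration, the same visible effect can be had from user space today
with one linear mapping per page. Note that this sketch uses mmap() with
MAP_FIXED rather than the mremap() call suggested above, and it is only a
guess at what such an emulation amounts to. The helper names and the
temporary path are assumptions.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Build an unlinked temporary file whose page i is filled with 'A' + i. */
int make_demo_file(size_t npages)
{
    char path[] = "/tmp/fremap-demo-XXXXXX";   /* illustrative location */
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    int fd = mkstemp(path);
    size_t i;

    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, (off_t)(npages * psz)) != 0) {
        close(fd);
        return -1;
    }
    for (i = 0; i < npages; i++) {
        char *p = mmap(NULL, psz, PROT_READ | PROT_WRITE, MAP_SHARED,
                       fd, (off_t)(i * psz));
        if (p == MAP_FAILED) {
            close(fd);
            return -1;
        }
        memset(p, 'A' + (int)i, psz);
        munmap(p, psz);
    }
    return fd;
}

/*
 * Map npages pages of fd into one contiguous window so that window page i
 * shows file page order[i]. Every page is its own linear mapping.
 * Returns the window base, or MAP_FAILED.
 */
void *map_pages_in_order(int fd, const size_t *order, size_t npages)
{
    size_t psz = (size_t)sysconf(_SC_PAGESIZE);
    char *base;
    size_t i;

    assert(order != NULL && npages > 0);

    /* Reserve the window first so MAP_FIXED only replaces our own memory. */
    base = mmap(NULL, npages * psz, PROT_NONE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return MAP_FAILED;

    for (i = 0; i < npages; i++) {
        void *p = mmap(base + i * psz, psz, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_FIXED, fd, (off_t)(order[i] * psz));
        if (p == MAP_FAILED) {
            munmap(base, npages * psz);
            return MAP_FAILED;
        }
    }
    return base;
}
```

The cost is the one discussed in this thread: each out-of-order page needs
its own vma instead of a single pte entry.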

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 4:59 am

Well I think it has a few possible uses outside the PAE database
workloads. UML for one seems to be interested... as much as I don't
use them, I think nonlinear mappings are kinda cool ;)

After these patches, I don't think there is too much burden. The main
thing left really is just the objrmap stuff, but that is just handled
with a minimal 'dumb' algorithm that doesn't cost much.

Then the core of it is just the file pte handling, which really doesn't
seem to be much problem.

Apart from a handful of trivial if (pte_file()) cases throughout mm/,
our maintenance burden basically now amounts to the following patch.
Even the rmap.c change looks bigger than it is because I split out
the nonlinear unmapping code from try_to_unmap_file. Not too bad, eh? :)

--

include/asm-powerpc/pgtable.h | 12 ++++
mm/Kconfig | 6 ++
mm/Makefile | 6 +-
mm/rmap.c | 101 +++++++++++++++++++++++++-----------------
4 files changed, 83 insertions(+), 42 deletions(-)

Index: linux-2.6/include/asm-powerpc/pgtable.h
===================================================================
--- linux-2.6.orig/include/asm-powerpc/pgtable.h
+++ linux-2.6/include/asm-powerpc/pgtable.h
@@ -243,7 +243,12 @@ static inline int pte_write(pte_t pte) {
static inline int pte_exec(pte_t pte) { return pte_val(pte) & _PAGE_EXEC;}
static inline int pte_dirty(pte_t pte) { return pte_val(pte) & _PAGE_DIRTY;}
static inline int pte_young(pte_t pte) { return pte_val(pte) & _PAGE_ACCESSED;}
+
+#ifdef CONFIG_NONLINEAR
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE;}
+#else
+static inline int pte_file(pte_t pte) { return 0; }
+#endif

static inline void pte_uncache(pte_t pte) { pte_val(pte) |= _PAGE_NO_CACHE; }
static inline void pte_cache(pte_t pte) { pte_val(pte) &= ~_PAGE_NO_CACHE; }
@@ -483,9 +488,16 @@ extern void update_mmu_cache(struct vm_a
#define __swp_entry(type, offset) ((swp_entry_t){((type)...

To: Nick Piggin <npiggin@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 5:22 am

ok. What do you think about the sys_remap_file_pages_prot() thing that
Paolo has done in a nicely split up form - does that complicate things
in any fundamental way? That is what is useful to UML.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 5:52 am

Last time I looked (a while ago), the only issue I had was that he was
doing a weird special case rather than using another !present pte bit
for his "nonlinear protection" ptes.

I think he fixed that now and so it should be quite good now.
-

To: Ingo Molnar <mingo@...>
Cc: Nick Piggin <npiggin@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 5:32 am

Oracle would love it. You don't want to know how far back I've been
asked to backport that.

-- wli
-

To: Bill Irwin <bill.irwin@...>, Nick Piggin <npiggin@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 5:35 am

ok, cool! Then the first step would be for you to talk to Paolo and to
pick up the patches, review them, nurse it in -mm, etc. Suffering in
silence is just a pointless act of masochism, not an efficient
upstream-merge tactic ;-)

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Bill Irwin <bill.irwin@...>, Nick Piggin <npiggin@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 5:50 am

It was intended for use in a debugging mode for the database, so given
the general mood where fighting backouts was an issue, I was relatively
loath to bring it up. With UML behind it I don't feel that's as much of
a concern.

-- wli
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 5:11 am

Oh, there is a bit more nonlinear mmap list manipulation I'd forgotten
about too... makes things a little bit worse, but not too much.
-

To: <mingo@...>
Cc: <npiggin@...>, <akpm@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 4:38 am

That would make sense. Dirty page accounting doesn't work either on
non-linear mappings, and I can't see how that could be fixed in any
other way.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <mingo@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 4:47 am

It doesn't? Confused - these things don't have anything to do with each
other do they?
-

To: <akpm@...>
Cc: <mingo@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 4:51 am

Look in page_mkclean(). Where does it handle non-linear mappings?

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <mingo@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, March 7, 2007 - 5:07 am

OK, I'd forgotten about that. It won't break dirty memory accounting,
but it'll potentially break dirty memory balancing.

If we have the wrong page (due to nonlinear), page_check_address() will
fail and we'll leave the pte dirty. That puts us back to the pre-2.6.17
algorithms and I guess it'll break the msync guarantees.

Peter, I thought we went through the nonlinear problem ages ago and decided
it was OK?
-

To: Andrew Morton <akpm@...>
Cc: Miklos Szeredi <miklos@...>, <mingo@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 5:32 am

Can recollect as much, I modelled it after page_referenced() and can't
find any VM_NONLINEAR specific code in there either.

Will have a hard look, but if it's broken, then page_referenced is
equally broken it seems, which would make page reclaim funny in the
light of nonlinear mappings.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Andrew Morton <akpm@...>, Miklos Szeredi <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 5:45 am

page_referenced is just a heuristic, and it ignores nonlinear mappings
and the page which will get filtered down to try_to_unmap.

Page reclaim is already "funny" for nonlinear mappings, page_referenced
is the least of its worries ;) It works, though.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Andrew Morton <akpm@...>, Miklos Szeredi <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:04 am

Or, to be more helpful, unmap_mapping_range is what it should be
modelled on.
-

To: Nick Piggin <npiggin@...>
Cc: Andrew Morton <akpm@...>, Miklos Szeredi <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:06 am

*sigh* yes was looking at all that code, that's gonna be darn slow
though, but I'll whip up a patch.

/me feels terribly bad about having missed this..

-

To: <a.p.zijlstra@...>
Cc: <npiggin@...>, <akpm@...>, <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:13 am

Well, if it's going to be darn slow, maybe it's better to go with
mingo's plan on emulating nonlinear vmas with linear ones. That'll be
darn slow as well, but at least it will be much less complicated.

Miklos
-

To: Andrew Morton <akpm@...>
Cc: <linux-mm@...>, <linux-kernel@...>, David Howells <dhowells@...>, <linux-fsdevel@...>
Date: Wednesday, March 7, 2007 - 6:30 am

Now that I'm making some progress on merging the basic stuff, I'd
like to get opinions about merging page_mkwrite functionality into
->fault().

I still don't see any callers in the tree, but I see no reason why
this won't work (or why it isn't better).

--
Like everything else in life, page_mkwrite()ing is just a primitive,
degenerate form of fault()ing.

Having FAULT_FLAG_WRITE in the fault operation allows us to just get
rid of the page_mkwrite call in do_fault, because filesystems can check
for that flag bit, and do the page_mkwrite thing before returning the
page (this will improve efficiency for everyone).

Then, we introduce another fault flag to signal that the fault is
an event notification for a page, rather than a request for a pgoff.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -176,6 +176,7 @@ extern unsigned int kobjsize(const void
* return with the page locked.
*/
#define VM_CAN_NONLINEAR 0x10000000 /* Has ->fault & does nonlinear pages */
+#define VM_NOTIFY_MKWRITE 0x20000000 /* Has ->fault & wants page writable notification */

#ifndef VM_STACK_DEFAULT_FLAGS /* arch can override this */
#define VM_STACK_DEFAULT_FLAGS VM_DATA_DEFAULT_FLAGS
@@ -201,6 +202,7 @@ extern pgprot_t protection_map[16];

#define FAULT_FLAG_WRITE 0x01
#define FAULT_FLAG_NONLINEAR 0x02
+#define FAULT_FLAG_NOTIFY 0x04 /* fault_data.page contains page */

/*
* fault_data is filled in the the pagefault handler and passed to the
@@ -213,7 +215,10 @@ extern pgprot_t protection_map[16];
* nonlinear mapping support.
*/
struct fault_data {
- unsigned long address;
+ union {
+ unsigned long address;
+ struct page *page;
+ };
pgoff_t pgoff;
unsigned int flags;

@@ -230,9 +235,6 @@ struct vm_operations_struct {
void (*close)(struct vm_area_str...
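
A toy model of how a handler might consume these flags, assuming the
fault_data layout shown in the hunk above. The flag values are copied from
the diff; the toy page cache, toy_prepare_write() and the handler itself are
invented for this sketch and are not the patch's code.

```c
#include <assert.h>
#include <stddef.h>

#define FAULT_FLAG_WRITE     0x01
#define FAULT_FLAG_NONLINEAR 0x02
#define FAULT_FLAG_NOTIFY    0x04   /* fault_data.page contains page */

#define TOY_NPAGES 4                /* size of the toy file, in pages */

struct toy_page {
    unsigned long index;            /* file page offset */
    int writable_prepared;          /* set once the "mkwrite" work has run */
};

struct toy_fault_data {
    union {
        unsigned long address;      /* normal fault: faulting address */
        struct toy_page *page;      /* notification: page becoming writable */
    };
    unsigned long pgoff;
    unsigned int flags;
};

static struct toy_page toy_cache[TOY_NPAGES];

/* Stand-in for whatever a filesystem does before a page may be written. */
static void toy_prepare_write(struct toy_page *page)
{
    page->writable_prepared = 1;
}

struct toy_page *toy_fault(struct toy_fault_data *fdata)
{
    struct toy_page *page;

    if (fdata->flags & FAULT_FLAG_NOTIFY) {
        /* Event for an existing page rather than a request for a pgoff. */
        page = fdata->page;
        assert(page != NULL);
        toy_prepare_write(page);
        return page;
    }

    if (fdata->pgoff >= TOY_NPAGES)
        return NULL;                /* beyond the end of the toy file */

    page = &toy_cache[fdata->pgoff];
    page->index = fdata->pgoff;
    if (fdata->flags & FAULT_FLAG_WRITE)
        toy_prepare_write(page);    /* mkwrite work folded into the fault */
    return page;
}
```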

To: Miklos Szeredi <miklos@...>
Cc: <a.p.zijlstra@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:21 am

IMO, the best thing to do is just restore msync behaviour, and comment
the fact that we ignore nonlinears. We need to restore msync behaviour
to fix races in regular mappings anyway, at least for now.

-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:24 am

Yeah, why don't we have a tree per nonlinear vma to find these pages?

Seems to be the best quick solution indeed.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:38 am

We could do something more efficient, but I thought that half the point
was that they didn't carry any of this extra memory, and they could be

If we fix the race in the linear mappings, then we can just do the full
msync for nonlinear vmas, and the fast noop version for everyone else.

I don't see it being a big deal. I doubt anybody is writing out huge
amounts of data via nonlinear mappings.
-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 6:47 am

I'm failing to understand this :-(

That extra memory, and apparently they don't want the inefficiency

Well, now they don't, but it could be done or even exploited as a DoS.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 7:00 am

But so could nonlinear page reclaim. I think we need to restrict nonlinear
mappings to root if we're worried about that.
-

To: Nick Piggin <npiggin@...>
Cc: Peter Zijlstra <a.p.zijlstra@...>, Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 8:22 am

Please not root. The users really don't want to be privileged. UML
itself is at least partly for use as privilege isolation of the guest
workload. Oracle has some of the same concerns itself, which is part of
why it uses separate processes heavily, even: to isolate instances from
each other.

-- wli
-

To: Bill Irwin <bill.irwin@...>, Peter Zijlstra <a.p.zijlstra@...>, Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 8:36 am

Well non-root users could be allowed to work on mlocked regions on
tmpfs/shm. That way they avoid the pathological nonlinear problems,
and can work within the mlock ulimit.

That is, if we are worried about such a DoS.
-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 7:48 am

Bah, my brain is thick and foggy today. Let us try again;

Nonlinear vmas exist because many vmas are expensive somehow, right?
Nonlinear vmas keep the page mapping in the page tables and screw rmaps.

This 'extra memory' you mentioned would be the overhead of tracking the
actual ranges?

And apparently now we want it to not suck on the rmap case :-(

Anyway, if used on a non writeback capable backing store (ramfs)
page_mkclean will never be called. If also mlocked (I think oracle does
this) then page reclaim will pass over too.

So we're only interested in the bdi_cap_accounting_dirty and VM_SHARED
case, right?

Tracking these ranges on a per-vma basis would avoid taking the mm wide
mmap_sem and so would be cheaper than regular vmas.

Can't we just 'fix' it?

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 8:17 am

Do we? I think just "work" is the way we've been handling them up until
now. Making them suck less for rmap makes them suck more for what they're

Well you can today remap N pages in a file, arbitrarily for
sizeof(pte_t)*tiny bit for the upper page tables + small constant
for the vma.

At best, you need an extra pointer to pte / vaddr, so you'd basically

The thing is, I don't think anybody who uses these things cares
about any of the 'problems' you want to fix, do they? We are
interested in dirty pages only for the correctness issue, rather
than performance. Same as reclaim.

-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 8:41 am

I was hoping some form of range compression would gain something, but if
it's a fully random mapping, then yes a shadow page table would be needed

If so, we can just stick to the dead slow but correct 'scan the full
vma' page_mkclean() and nobody would ever trigger it.

What is the DoS scenario wrt reclaim? We really ought to fix that if
real, those UML farms run on nothing but nonlinear reclaim I'd think.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 9:08 am

Not if we restricted it to root and mlocked tmpfs. But then why
wouldn't you just do it with the much more efficient msync walk,
so that if root does want to do writeout via these things, it does

I guess you can just increase the computational complexity of
reclaim quite easily.
-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 9:19 am

This is all used on ram based filesystems right, they all have
BDI_CAP_NO_WRITEBACK afaik, so page_mkclean will never get called
anyway. Mlock doesn't avoid getting page_mkclean called.

Those who use this on a 'real' filesystem will get hit in the face by a
linear scanning page_mkclean(), but AFAIK nobody does this anyway.

Restricting it to root for such filesystems is unwanted, that'd severely
handicap both UML and Oracle as I understand it (are there other users
of this feature around?)

msync() might never get called and then we're back with the old

Right, on first glance it doesn't look to be too bad, but I should take
a closer look.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 9:36 am

But somebody might do it. I just don't know why you'd want to make

But we're root. With your patch, root *can't* do nonlinear writeback

Well I don't think UML uses nonlinear yet anyway, does it? Can they
make do with restricting nonlinear to mlocked vmas, I wonder? Probably
not.

-

To: <npiggin@...>
Cc: <a.p.zijlstra@...>, <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>
Date: Wednesday, March 7, 2007 - 9:53 am

Restricting to root doesn't buy you much, nobody wants to be root.
Restricting to mlock is similarly pointless. UML _will_ want to get
swapped out if there's no activity.

Restricting to tmpfs makes sense, but it's probably not what UML
wants.

Conclusion: there's no good solution for UML in kernel-space.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <a.p.zijlstra@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>
Date: Wednesday, March 7, 2007 - 10:50 am

They could always not use nonlinear, or we could add a ulimit to the

I think it is OK. They might want some persistent storage to migrate
or something, but that can always be done by copying from tmpfs to
a block based filesystem.
-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 9:52 am

Ooh, you only want to restrict remap_file_pages on mappings from bdi's
without BDI_CAP_NO_WRITEBACK. Sure, I can live with that, and I suspect

True. We could even guesstimate the nonlinear dirty pages by subtracting
the result of page_mkclean() from page_mapcount() and force an
msync(MS_ASYNC) on said mapping (or all (nonlinear) mappings of the

I think it does, but let's ask, Jeff?

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Nick Piggin <npiggin@...>, Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>
Date: Wednesday, March 7, 2007 - 11:10 am

Nope, UML needs to be able to change permissions as well as locations.

Would be nice, though, there are apparently nice UML speedups with it.

Jeff

--
Work email - jdike at linux dot intel dot com
-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 10:34 am

Almost, but not quite, we'd need to extract another value from the
page_mkclean() run, the actual number of mappings encountered. The
return value only sums the number of dirty mappings encountered.
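
A toy model of why the second value is needed, assuming the rmap walk can
only visit linear mappings. The structures and names are invented for this
sketch; the real page_mkclean() has a different signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_mapping {
    bool nonlinear;   /* not reachable through the rmap walk */
    bool dirty;       /* pte dirty bit */
};

struct toy_mkclean_result {
    int dirty;        /* dirty ptes found and cleaned (the existing return) */
    int seen;         /* mappings actually visited (the extra value) */
};

/* Clean every mapping the walk can reach and count what it saw. */
struct toy_mkclean_result toy_page_mkclean(struct toy_mapping *maps, size_t n)
{
    struct toy_mkclean_result res = {0, 0};
    size_t i;

    for (i = 0; i < n; i++) {
        if (maps[i].nonlinear)
            continue;               /* walk cannot find this pte */
        res.seen++;
        if (maps[i].dirty) {
            maps[i].dirty = false;
            res.dirty++;
        }
    }
    return res;
}

/* Mappings the walk never reached; nonzero suggests forcing an msync scan. */
int toy_unvisited_mappings(int mapcount, struct toy_mkclean_result res)
{
    assert(res.seen <= mapcount);
    return mapcount - res.seen;
}
```

Subtracting the dirty count instead would also count clean linear mappings
as suspects, which is why the number of mappings seen is the useful figure.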

s390 would already work I guess.

Certainly doable.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>
Date: Wednesday, March 7, 2007 - 11:01 am

But if we restrict it to root only, and have a note in the man page
about it, then it really isn't worth cluttering up the kernel.

-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>, hugh <hugh@...>, Linus Torvalds <torvalds@...>
Date: Wednesday, March 7, 2007 - 12:58 pm

compile tested only so far

---

Partial revert of commit: 204ec841fbea3e5138168edbc3a76d46747cc987

Non-linear vmas aren't properly handled by page_mkclean() and fixing that
would result in linear scans of all related non-linear vmas per page_mkclean()
invocation.

This is deemed too costly, hence re-instate the msync scan for non-linear vmas.

However this can lead to double IO:

- pages get instantiated with RO mapping
- page takes write fault, and gets marked with PG_dirty
- page gets tagged for writeout and calls page_mkclean()
- page_mkclean() fails to find the dirty pte (and clean it)
- writeout happens and PG_dirty gets cleared.
- user calls msync, the dirty pte is found and the page marked with PG_dirty
- the page gets written out _again_ even though it's not re-dirtied.

To minimize this, reset the protection when creating a nonlinear vma.
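The double IO sequence listed above can be replayed with a toy state machine in C. This is a sketch, not kernel code: `toy_page` and the `toy_*` functions are hypothetical, and the `pte_reachable` flag stands in for whether page_mkclean() is able to find the pte.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page state; the field names only mimic the kernel concepts. */
struct toy_page {
    bool pg_dirty;   /* stands in for PG_dirty */
    bool pte_dirty;  /* dirty bit in the mapping pte */
    int writeouts;   /* number of times the page was written out */
};

/* Write fault: the pte and the page are both marked dirty. */
void toy_write_fault(struct toy_page *p)
{
    p->pte_dirty = true;
    p->pg_dirty = true;
}

/* Writeout: clean the pte only if the rmap walk can reach it. */
void toy_writeout(struct toy_page *p, bool pte_reachable)
{
    if (!p->pg_dirty)
        return;
    if (pte_reachable)
        p->pte_dirty = false;
    p->pg_dirty = false;
    p->writeouts++;
}

/* msync scan: move a dirty pte bit to the page, then write it out. */
void toy_msync(struct toy_page *p)
{
    if (p->pte_dirty) {
        p->pte_dirty = false;
        p->pg_dirty = true;
    }
    toy_writeout(p, true);
}

/* One write fault, one background writeout, one msync; returns IO count. */
int toy_scenario(bool pte_reachable)
{
    struct toy_page p = { false, false, 0 };

    toy_write_fault(&p);
    toy_writeout(&p, pte_reachable);
    toy_msync(&p);
    return p.writeouts;
}
```

In this model `toy_scenario(true)` performs one writeout, while `toy_scenario(false)`, the nonlinear case where the pte is never found, performs two for a single dirtying.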

I'm not at all happy with this, but plain disallowing remap_file_pages on bdis
without BDI_CAP_NO_WRITEBACK seems to offend some people, hence restrict it to
root only.

Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
---
mm/fremap.c | 21 ++++++++
mm/msync.c | 146 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 162 insertions(+), 5 deletions(-)

Index: linux-2.6-git/mm/msync.c
===================================================================
--- linux-2.6-git.orig/mm/msync.c 2007-03-07 17:18:09.000000000 +0100
+++ linux-2.6-git/mm/msync.c 2007-03-07 17:31:29.000000000 +0100
@@ -7,12 +7,123 @@
/*
* The msync() system call.
*/
+#include <linux/slab.h>
+#include <linux/pagemap.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
+#include <linux/hugetlb.h>
+#include <linux/writeback.h>
#include <linux/file.h>
#include <linux/syscalls.h>

+#include <asm/pgtable.h>
+#include <asm/tlbflush.h>
+
+static unsigned long msync_pte_range(struct vm_area_struct *vma, pmd_t *pm...

To: <a.p.zijlstra@...>
Cc: <npiggin@...>, <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 7:21 am

Root only for !BDI_CAP_NO_WRITEBACK mappings doesn't make sense
because:

- just encourages insecure applications

- there are no current users that want this and presumably no future
uses either

- it's a maintenance burden: I'll have to layer the m/ctime update
patch on top of this

- the only pro for this has been that Nick thinks it cool ;)

I think the proper way to deal with this is to

- allow BDI_CAP_NO_WRITEBACK (tmpfs/ramfs) uses, makes database
people happy

- for !BDI_CAP_NO_WRITEBACK emulate using do_mmap_pgoff(), should be
trivial, no userspace ABI breakage

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <a.p.zijlstra@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 7:58 am

But you have to update m/ctime for BDI_CAP_NO_WRITEBACK mappings anyway

Yeah that sounds OK.

-

To: <npiggin@...>
Cc: <a.p.zijlstra@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 8:09 am

Yes, but that's a different aspect of msync(), not about the data
writeback issues that nonlinear mappings have.

So a solution that solves both these problems would probably be more

Fair enough.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <npiggin@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 7:37 am

I can live with that.

However this still leaves the non-linear reclaim (Nick pointed it out as
a potential DoS and other people have corroborated this). I have no idea
what to do about that.

Oracle seems to mlock these things anyway, but UML surely would not.

-

To: <a.p.zijlstra@...>
Cc: <npiggin@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 7:48 am

OK, but that is a completely different problem, not affecting
page_mkclean() or msync().

And it doesn't sound too hard to solve: when current algorithm doesn't
seem to be making progress, then it will have to be done the hard way,
searching for all nonlinear ptes of a page to unmap.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <npiggin@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 8:11 am

Ah, you see, but that is when you've already lost.

The DoS is about the computational complexity of the reclaim, not if it
will ever come out of it with free pages.

-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 8:19 am

If we really want to, we could limit it to mlock for !root. This is
a reasonable way to solve the problem, and UML could fall back on
vma emulated version if they didn't want to use mlock memory...

Or we could limit the size/number of nonlinear vmas that could be
created.

But just quietly, I think there are probably a lot of other ways to
perform a local DoS anyway ;)

-

To: <npiggin@...>
Cc: <a.p.zijlstra@...>, <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>, <hugh@...>, <torvalds@...>
Date: Thursday, March 8, 2007 - 8:25 am

I agree, requiring apps to mlock would probably just make things
slightly worse for about 100% of users, without any gain. There could
be a

/proc/sys/vm/turn_off_nonlinear_for_paranoid_sysadmin

knob that would unconditionally emulate nonlinear vmas.

Miklos
-

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Nick Piggin <npiggin@...>, Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>, hugh <hugh@...>
Date: Wednesday, March 7, 2007 - 2:00 pm

I don't think that's a viable approach. Nonlinear mappings would normally
be used by databases, and you don't want to limit databases to be run by
root only.

Linus
-

To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <npiggin@...>, Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>, hugh <hugh@...>
Date: Wednesday, March 7, 2007 - 2:12 pm

It was claimed that they use it on tmpfs only, not on a 'real'
filesystem.

-

To: Linus Torvalds <torvalds@...>
Cc: Nick Piggin <npiggin@...>, Miklos Szeredi <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Jeff Dike <jdike@...>, hugh <hugh@...>
Date: Wednesday, March 7, 2007 - 2:24 pm

More specifically, databases want to use direct IO (I know you hate it)
and use the nonlinear vma as buffer area to feed this direct IO

Mapped IO is unsuited for databases in its current form due to the way
IO errors are handled.

-

To: <a.p.zijlstra@...>
Cc: <npiggin@...>, <miklos@...>, <akpm@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <jdike@...>
Date: Wednesday, March 7, 2007 - 9:56 am

Looks like it doesn't:

$ grep -r remap_file_pages arch/um/
$

Miklos
-

To: <akpm@...>
Cc: <mingo@...>, <npiggin@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <a.p.zijlstra@...>
Date: Wednesday, March 7, 2007 - 5:25 am

It won't even get that far, because it only looks at vmas on
mapping->i_mmap, and not on i_mmap_nonlinear.

Miklos
-

To: Andrew Morton <akpm@...>
Cc: Miklos Szeredi <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, March 7, 2007 - 5:18 am

msync breakage is bad, but otherwise I don't know that we care about
dirty page writeout efficiency.

But I think we discovered that those msync changes are bogus anyway
because there is a small race window where pte could be dirtied without
page being set dirty?

-

To: Nick Piggin <npiggin@...>
Cc: Miklos Szeredi <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, March 7, 2007 - 5:26 am

Well. We made so many changes to support the synchronous
dirty-the-page-when-we-dirty-the-pte thing that I'm rather doubtful that
the old-style approach still works. It might seem to, most of the time.
But if it _is_ subtly broken, boy it's going to take a long time for us to

Dunno, I don't recall that. We dirty the page before the pte...
-

To: Andrew Morton <akpm@...>
Cc: Miklos Szeredi <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, Peter Zijlstra <a.p.zijlstra@...>
Date: Wednesday, March 7, 2007 - 5:38 am

I can't think of anything that should have caused breakage (except for

I don't think it is really that simple. There is a big comment in
clear_page_dirty_for_io.

-

To: <akpm@...>
Cc: <npiggin@...>, <miklos@...>, <mingo@...>, <linux-mm@...>, <linux-kernel@...>, <benh@...>, <a.p.zijlstra@...>
Date: Wednesday, March 7, 2007 - 5:28 am

That's the one I just submitted a fix for ;)

http://lkml.org/lkml/2007/3/6/308

Miklos
-

To: Ingo Molnar <mingo@...>
Cc: Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 4:35 am

I'm rather regretting having merged it - I don't think it has been used for
much.

Paolo's UML speedup patches might use nonlinear though.
-

To: Andrew Morton <akpm@...>
Cc: Ingo Molnar <mingo@...>, Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 5:29 am

Guess what major real-life application not only uses nonlinear daily
but would even be very happy to see it extended with non-vma-creating
protections and more? It's not terribly typical for things to be
truncated while remap_file_pages() is doing its work, though it's been
proposed as a method of dynamism. It won't stress remap_file_pages() vs.
truncate() in any meaningful way, though, as userspace will be rather
diligent about clearing in-use data out of the file offset range to be
truncated away anyway, and all that via O_DIRECT.

-- wli
-

To: Bill Irwin <bill.irwin@...>
Cc: Ingo Molnar <mingo@...>, Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 5:39 am

The problem here isn't related to truncate or direct-IO. It's just
plain-old MAP_SHARED. nonlinear VMAs are now using the old-style
dirty-memory management. msync() is basically a no-op and the code is
wildly tricky and pretty much untested. The chances that we broke it are
considerable.

-

To: Andrew Morton <akpm@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 6:09 am

Close enough. ;)

This would be of concern for swapping out tmpfs-backed nonlinearly-
mapped files under extreme stress in Oracle's case, though it's rather
typical for it all to be mlock()'d in-core and cases where that's
necessary to be considered grossly underprovisioned. As far as I know,
msync() is not used to manage the nonlinearly-mapped objects, which are
most typically expected to be memory-backed, rendering writeback to
disk of questionable value. Also quite happily, I'm not aware of any
data integrity issues it would explain. Bug though it may be, it
requires a usage model very rarely used by Oracle to trigger, so we've
not run into it.

-- wli
-

To: Andrew Morton <akpm@...>
Cc: Nick Piggin <npiggin@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 4:53 am

yes, i wrote the first, prototype version of that for UML, it needs an
extended version of the syscall, sys_remap_file_pages_prot():

http://redhat.com/~mingo/remap-file-pages-patches/remap-file-pages-prot-...

i also wrote an x86 hypervisor kind of thing for UML, called
'sys_vcpu()', which allows UML to execute guest user-mode in a box,
which also relies on sys_remap_file_pages_prot():

http://redhat.com/~mingo/remap-file-pages-patches/vcpu-2.6.4-rc2-mm1-A2

which reduced the UML guest syscall overhead from 30 usecs to 4 usecs
(with native syscalls taking 2 usecs, on the box i tested, years ago).

So it certainly looked useful to me - but wasn't really picked up widely.

We'll always have the option to get rid of it (and hence completely
reverse the decision to merge it) without breaking the ABI, by emulating
the API via mremap(). That eliminates the UML speedup though. So no need
to feel sorry about having merged it, we can easily revisit that
years-old 'do we want it' decision, without any ABI worries.

Ingo
-

To: Ingo Molnar <mingo@...>
Cc: Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 5:28 am

Depending on whether anyone wants it, and what features they want, we
could emulate the old syscall, and make a new restricted one which is
much less intrusive.

For example, if we can operate only on MAP_ANONYMOUS memory and specify
that nonlinear mappings effectively mlock the pages, then we can get
rid of all the objrmap and unmap_mapping_range handling, forget about
the writeout and msync problems...

-

To: Nick Piggin <npiggin@...>
Cc: Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 5:44 am

Anonymous-only would make it a doorstop for Oracle, since its entire
motive for using it is to window into objects larger than user virtual
address spaces (this likely also applies to UML, though they should
really chime in to confirm). Restrictions to tmpfs and/or ramfs would
likely be liveable, though I suspect some things might want to do it to
shm segments (I'll ask about that one). There's definitely no need for a
persistent backing store for the object to be remapped in Oracle's case,
in any event. It's largely the in-core destination and source of IO, not
something saved on-disk itself.

-- wli
-

To: Bill Irwin <bill.irwin@...>
Cc: Nick Piggin <npiggin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Thursday, March 8, 2007 - 8:39 am

We need it for shared file mappings (for tmpfs only).

Our scenario is:
RAM is implemented through a shared mapped file, kept on tmpfs (except by dumb
users); various processes share an fd for this file (it's opened and
immediately deleted).

We maintain page tables in x86 style, and TLB flush is implemented through
mmap()/munmap()/mprotect().

Having a VMA per each 4K is not the intended VMA usage: for instance, the
default /proc/sys/vm/max_map_count (64K) is saturated by a UML process with

--
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
Chat with your friends in real time!
http://it.yahoo.com/mail_it/foot/*http://it.messenger.yahoo.com
-

To: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 5:49 am

Uh, duh yes I don't mean MAP_ANONYMOUS, I was just thinking of the shmem
inode that sits behind MAP_ANONYMOUS|MAP_SHARED. Of course if you don't
have a file descriptor to get a pgoff, then remap_file_pages is a doorstop

Yeah, tmpfs/shm segs are what I was thinking about. If UML can live with
that as well, then I think it might be a good option.

-

To: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>, Paolo 'Blaisorblade' Giarrusso <blaisorblade@...>
Date: Wednesday, March 7, 2007 - 6:02 am

Oh, hmm.... if you can truncate these things then you still need to
force unmap so you still need i_mmap_nonlinear.

But come to think of it, I still don't think nonlinear mappings are
too bad as they are ;)
-

To: Nick Piggin <npiggin@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 12, 2007 - 7:01 pm

Well, we don't need truncate(), but MADV_REMOVE for memory hotunplug, which is
way similar I guess.

About the restriction to tmpfs, I have just discovered
'[PATCH] mm: tracking shared dirty pages' (commit
d08b3851da41d0ee60851f2c75b118e1f7a5fc89), which already partially conflicts
with remap_file_pages for file-based mmaps (and that's fully fine, for now).

Even if UML does not need it, till now if there is a VMA protection and a page
hasn't been remapped with remap_file_pages, the VMA protection is used (just
because it makes sense).

However, it is only used when the PTE is first created - we can never change
protections on a VMA - so if vma_wants_writenotify() is true (on all
file-based and on no shmfs based mapping, right?), and we write-protect the
VMA, it will always be write-protected.

That's no problem for UML, but for any other user (I guess I'll have to
prevent callers from trying such stuff - I started from a pretty generic

Btw, I really like removing ->populate and merging the common code together.
filemap_populate and shmem_populate are so obnoxiously different that I
already wanted to do that (after merging remap_file_pages() core).

Also, I'm curious. Since my patches are already changing remap_file_pages()
code, should they be absolutely merged after yours?
--
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
-

To: Blaisorblade <blaisorblade@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 12, 2007 - 9:19 pm

Yes, I believe that is the case, however I wonder if that is going to be
a problem for you to distinguish between write faults for clean writable

Yeah they are also frustratingly similar to filemap_nopage and shmem_nopage,

Is there a big clash? I don't think I did a great deal to fremap.c (mainly
just removing stuff)...
-

To: Nick Piggin <npiggin@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Saturday, March 17, 2007 - 8:17 am

I wouldn't be able to distinguish them, but am I going to get write faults for
clean ptes when vma_wants_writenotify() is false (as seems to be for tmpfs)?
I guess not.

For tmpfs pages, clean writable PTEs are mapped as writable so they won't give
Hopefully, we just both modify sys_remap_file_pages(), I'll see soon.
--
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
-

To: Blaisorblade <blaisorblade@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Saturday, March 17, 2007 - 10:50 pm

Yes, that should be the case. So would this mean that nonlinear protections
don't work on regular files? I guess that's OK if Oracle and UML both use
tmpfs/shm?

-

To: Nick Piggin <npiggin@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 19, 2007 - 4:44 pm

They still work in most cases (including for UML), but if the initial mmap()
specified PROT_WRITE, that is ignored, for pages which are not remapped via
remap_file_pages(). UML uses PROT_NONE for the initial mmap, so that's no

--
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
-

To: Blaisorblade <blaisorblade@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Tuesday, March 20, 2007 - 2:00 am

But how are you going to distinguish a write fault on a readonly pte for
dirty page accounting vs a read-only nonlinear protection?

You can't store any more data in a present pte AFAIK, so you'd have to
have some out of band data. At which point, you may as well just forget
about vma_wants_writenotify vmas, considering that everybody is using
shmem/ramfs.
-

To: Nick Piggin <npiggin@...>
Cc: Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 21, 2007 - 3:45 pm

Hmm... I was only thinking of PTEs which hadn't been remapped via
remap_file_pages, but just faulted in with initial mmap() protection.

For the other PTEs, however, I overlooked that the current code ignores
vma_wants_writenotify(), i.e. breaks dirty page accounting for them, and I
refused to even consider this opportunity, even without knowing the purposes

I was going to do that anyway. I'd guess that I should just disallow in
remap_file_pages() the VM_MANYPROTS (i.e. MAP_CHGPROT in flags) &&
vma_wants_writenotify() combination, right? Ok, trivial (shouldn't even have
pointed this out).
--
Inform me of my mistakes, so I can add them to my list!
Paolo Giarrusso, aka Blaisorblade
http://www.user-mode-linux.org/~blaisorblade
-

To: Nick Piggin <npiggin@...>
Cc: Blaisorblade <blaisorblade@...>, Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Monday, March 19, 2007 - 8:04 am

Sometimes ramfs is also used in the Oracle case. I presume that's even
simpler than tmpfs. (Hugetlb, while also used for the same general
buffer pool, is never used in conjunction with remap_file_pages() etc.)

-- wli
-

To: Nick Piggin <npiggin@...>
Cc: Blaisorblade <blaisorblade@...>, Bill Irwin <bill.irwin@...>, Ingo Molnar <mingo@...>, Andrew Morton <akpm@...>, Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Sunday, March 18, 2007 - 9:09 am

It's OK for UML.

Jeff

--
Work email - jdike at linux dot intel dot com
-

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:50 am

Remove legacy filemap_nopage and all of the .populate API cruft.

This patch can be skipped if it will cause clashes in your tree, or you
disagree with removing these guys right now.

Signed-off-by: Nick Piggin <npiggin@suse.de>

Documentation/feature-removal-schedule.txt | 18 --
include/linux/mm.h | 8 -
mm/filemap.c | 195 -----------------------------
mm/fremap.c | 71 +---------
mm/memory.c | 36 +----
5 files changed, 19 insertions(+), 309 deletions(-)

Index: linux-2.6/include/linux/mm.h
===================================================================
--- linux-2.6.orig/include/linux/mm.h
+++ linux-2.6/include/linux/mm.h
@@ -230,8 +230,6 @@ struct vm_operations_struct {
void (*close)(struct vm_area_struct * area);
struct page * (*fault)(struct vm_area_struct *vma, struct fault_data * fdata);
struct page * (*nopage)(struct vm_area_struct * area, unsigned long address, int *type);
- int (*populate)(struct vm_area_struct * area, unsigned long address, unsigned long len, pgprot_t prot, unsigned long pgoff, int nonblock);
-
/* notification that a previously read-only page is about to become
* writable, if an error is returned it will cause a SIGBUS */
int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page);
@@ -767,8 +765,6 @@ static inline void unmap_shared_mapping_

extern int vmtruncate(struct inode * inode, loff_t offset);
extern int vmtruncate_range(struct inode * inode, loff_t offset, loff_t end);
-extern int install_page(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, struct page *page, pgprot_t prot);
-extern int install_file_pte(struct mm_struct *mm, struct vm_area_struct *vma, unsigned long addr, unsigned long pgoff, pgprot_t prot);

#ifdef CONFIG_MMU
extern int __handle_mm_fault(struct mm_struct *mm,struct vm_area_struct *vma,
@@ -1084,10 +1080,6 @@ extern void truncate_inode_page...

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:50 am

Remove ->nopfn and reimplement the existing handlers with ->fault

Signed-off-by: Nick Piggin <npiggin@suse.de>

arch/powerpc/platforms/cell/spufs/file.c | 90 ++++++++++++++++---------------
drivers/char/mspec.c | 29 ++++++---
include/linux/mm.h | 8 --
mm/memory.c | 58 +------------------
4 files changed, 71 insertions(+), 114 deletions(-)

Index: linux-2.6/drivers/char/mspec.c
===================================================================
--- linux-2.6.orig/drivers/char/mspec.c
+++ linux-2.6/drivers/char/mspec.c
@@ -182,24 +182,25 @@ mspec_close(struct vm_area_struct *vma)

/*
- * mspec_nopfn
+ * mspec_fault
*
* Creates a mspec page and maps it to user space.
*/
-static unsigned long
-mspec_nopfn(struct vm_area_struct *vma, unsigned long address)
+static struct page *
+mspec_fault(struct fault_data *fdata)
{
unsigned long paddr, maddr;
unsigned long pfn;
- int index;
- struct vma_data *vdata = vma->vm_private_data;
+ int index = fdata->pgoff;
+ struct vma_data *vdata = fdata->vma->vm_private_data;

- index = (address - vma->vm_start) >> PAGE_SHIFT;
maddr = (volatile unsigned long) vdata->maddr[index];
if (maddr == 0) {
maddr = uncached_alloc_page(numa_node_id());
- if (maddr == 0)
- return NOPFN_OOM;
+ if (maddr == 0) {
+ fdata->type = VM_FAULT_OOM;
+ return NULL;
+ }

spin_lock(&vdata->lock);
if (vdata->maddr[index] == 0) {
@@ -219,13 +220,21 @@ mspec_nopfn(struct vm_area_struct *vma,

pfn = paddr >> PAGE_SHIFT;

- return pfn;
+ fdata->type = VM_FAULT_MINOR;
+ /*
+ * vm_insert_pfn can fail with -EBUSY, but in that case it will
+ * be because another thread has installed the pte first, so it
+ * is no problem.
+ */
+ vm_insert_pfn(fdata->vma, fdata->address, pfn);
+
+ return NULL;
}

static struct vm_operations_struct mspec_vm_ops = {
.open = ms...

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 1:13 am

Dang, forgot to quilt refresh after fixing spufs compile.
--

Remove ->nopfn and reimplement the existing handlers with ->fault

Signed-off-by: Nick Piggin <npiggin@suse.de>

arch/powerpc/platforms/cell/spufs/file.c | 90 ++++++++++++++++---------------
drivers/char/mspec.c | 29 ++++++---
include/linux/mm.h | 8 --
mm/memory.c | 58 +------------------
4 files changed, 71 insertions(+), 114 deletions(-)

Index: linux-2.6/drivers/char/mspec.c
===================================================================
--- linux-2.6.orig/drivers/char/mspec.c
+++ linux-2.6/drivers/char/mspec.c
@@ -182,24 +182,25 @@ mspec_close(struct vm_area_struct *vma)

/*
- * mspec_nopfn
+ * mspec_fault
*
* Creates a mspec page and maps it to user space.
*/
-static unsigned long
-mspec_nopfn(struct vm_area_struct *vma, unsigned long address)
+static struct page *
+mspec_fault(struct fault_data *fdata)
{
unsigned long paddr, maddr;
unsigned long pfn;
- int index;
- struct vma_data *vdata = vma->vm_private_data;
+ int index = fdata->pgoff;
+ struct vma_data *vdata = fdata->vma->vm_private_data;

- index = (address - vma->vm_start) >> PAGE_SHIFT;
maddr = (volatile unsigned long) vdata->maddr[index];
if (maddr == 0) {
maddr = uncached_alloc_page(numa_node_id());
- if (maddr == 0)
- return NOPFN_OOM;
+ if (maddr == 0) {
+ fdata->type = VM_FAULT_OOM;
+ return NULL;
+ }

spin_lock(&vdata->lock);
if (vdata->maddr[index] == 0) {
@@ -219,13 +220,21 @@ mspec_nopfn(struct vm_area_struct *vma,

pfn = paddr >> PAGE_SHIFT;

- return pfn;
+ fdata->type = VM_FAULT_MINOR;
+ /*
+ * vm_insert_pfn can fail with -EBUSY, but in that case it will
+ * be because another thread has installed the pte first, so it
+ * is no problem.
+ */
+ vm_insert_pfn(fdata->vma, fdata->address, pfn);
+
+ return NULL;
}

...

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:50 am

Fix the race between invalidate_inode_pages and do_no_page.

Andrea Arcangeli identified a subtle race between invalidation of
pages from pagecache with userspace mappings, and do_no_page.

The issue is that invalidation has to shoot down all mappings to the
page, before it can be discarded from the pagecache. Between shooting
down ptes to a particular page, and actually dropping the struct page
from the pagecache, do_no_page from any process might fault on that
page and establish a new mapping to the page just before it gets
discarded from the pagecache.

The most common case where such invalidation is used is in file
truncation. This case was catered for by doing a sort of open-coded
seqlock between the file's i_size, and its truncate_count.

Truncation will decrease i_size, then increment truncate_count before
unmapping userspace pages; do_no_page will read truncate_count, then
find the page if it is within i_size, and then check truncate_count
under the page table lock and back out and retry if it had
subsequently been changed (ptl will serialise against unmapping, and
ensure a potentially updated truncate_count is actually visible).

Complexity and documentation issues aside, the locking protocol fails
in the case where we would like to invalidate pagecache inside i_size.
do_no_page can come in anytime and filemap_nopage is not aware of the
invalidation in progress (as it is when it is outside i_size). The
end result is that dangling (->mapping == NULL) pages that appear to
be from a particular file may be mapped into userspace with nonsense
data. Valid mappings to the same place will see a different page.
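The truncate_count protocol described above, and the gap it leaves for invalidation inside i_size, can be sketched as a single-threaded toy in C. Everything here is an assumption for illustration: `toy_file`, `toy_fault()` and the `racer` callback (which simulates another CPU running between the page lookup and the recheck) are hypothetical, and real locking and memory ordering are not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy file state; sizes are in pages. Not the kernel's structures. */
struct toy_file {
    unsigned long i_size;
    unsigned int truncate_count;
    bool page_cached;  /* is the page of interest still in the pagecache? */
};

/* Truncation: shrink i_size, then bump the counter, then drop the page. */
void toy_truncate_to_zero(struct toy_file *f)
{
    f->i_size = 0;
    f->truncate_count++;
    f->page_cached = false;
}

/* Invalidation inside i_size: drops the page, touches neither counter. */
void toy_invalidate(struct toy_file *f)
{
    f->page_cached = false;
}

/*
 * Fault path: sample the counter, look the page up if it is inside
 * i_size, then recheck the counter (under the page table lock in the
 * kernel) and retry on a change. Returns true if a pte was installed.
 */
bool toy_fault(struct toy_file *f, unsigned long pgoff,
               void (*racer)(struct toy_file *), int *retries)
{
    for (;;) {
        unsigned int seq = f->truncate_count;

        if (pgoff >= f->i_size)
            return false;  /* beyond end of file */
        if (racer != NULL) {
            racer(f);      /* concurrent activity, fires once */
            racer = NULL;
        }
        if (seq == f->truncate_count)
            return true;   /* counter unchanged: install the pte */
        (*retries)++;
    }
}
```

With `toy_truncate_to_zero` as the racer the fault backs out once and then fails cleanly. With `toy_invalidate` as the racer the fault succeeds even though `page_cached` is now false, which corresponds to the dangling page the changelog describes.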

Andrea implemented two working fixes, one using a real seqlock,
another using a page->flags bit. He also proposed using the page lock
in do_no_page, but that was initially considered too heavyweight.
However, it is not a global or per-file lock, and the page cacheline
is modified in do_no_page to increment _count and _mapcount anyway, so
a further modification sho...
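
As a rough illustration of the page lock idea, here is a hypothetical single-threaded model in which both the invalidating side and the faulting side take the lock and the fault rechecks ->mapping under it. The names are invented; "locked" stands in for PG_locked, and returning FAULT_RETRY on a held lock stands in for sleeping in lock_page().

```c
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	bool locked;		/* stand-in for PG_locked */
	const void *mapping;	/* NULL once removed from the pagecache */
	int mapcount;		/* user ptes, 0 when unmapped */
};

enum fault_result { FAULT_MAPPED, FAULT_RETRY };

/* Invalidate, first half: take the page lock and shoot down the ptes. */
bool invalidate_begin(struct page_model *p)
{
	if (p->locked)
		return false;
	p->locked = true;
	p->mapcount = 0;
	return true;
}

/* Invalidate, second half: drop the page, then release the lock. */
void invalidate_finish(struct page_model *p)
{
	p->mapping = NULL;
	p->locked = false;
}

/* Fault: map the page only while holding its lock and seeing ->mapping. */
enum fault_result locked_fault(struct page_model *p)
{
	if (p->locked)
		return FAULT_RETRY;	/* real code would sleep here */
	p->locked = true;
	if (!p->mapping) {		/* invalidated while we waited */
		p->locked = false;
		return FAULT_RETRY;
	}
	p->mapcount++;			/* pte installed under the page lock */
	p->locked = false;
	return FAULT_MAPPED;
}
```

A fault attempted between invalidate_begin() and invalidate_finish() cannot install a pte, and one attempted afterwards sees the NULL mapping and retries, so the window between unmapping and dropping is closed in this model.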

To: Nick Piggin <npiggin@...>
Cc: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 2:36 am

Why was truncate_inode_pages_range() altered to unmap the page if it got
mapped again?

Oh. Because the unmap_mapping_range() call got removed from vmtruncate().
Why? (Please send suitable updates to the changelog).

I guess truncate of a mmapped area isn't sufficiently common to worry about
the inefficiency of this change.

Lots of memory barriers got removed in memory.c, unchangeloggedly.

Gratuitous renaming of locals in do_no_page() makes the change hard to
review. Should have been a separate patch.

In fact, the patch would have been heaps clearer if that renaming had been
a separate patch.

-

To: Andrew Morton <akpm@...>
Cc: Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 2:57 am

We have to ensure it is unmapped, and be prepared to unmap it while under

Yeah, and it should be more efficient for files that aren't mmapped,

Yeah they were all for the lockless truncate_count checks. Now that

Shall I?
-

To: Nick Piggin <npiggin@...>
Cc: Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 3:08 am

But vmtruncate() dropped i_size, so nobody will map this page into

If you don't have anything better to do, yes please ;)

-

To: Andrew Morton <akpm@...>
Cc: Linux Memory Management <linux-mm@...>, Linux Kernel <linux-kernel@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, March 7, 2007 - 3:25 am

But there could be a fault in progress... the only way to know is

OK.
-

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:49 am

An identical block appears twice in a row: contrary to the comment, we have been
re-reading the page *twice* in filemap_nopage rather than once.

If any retry logic or anything is needed, it belongs in lower levels anyway.
Only retry once. Linus agrees.
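
The intended behaviour (one initial read plus a single re-read) can be sketched generically. This is not the filemap_nopage code: read_fn, read_page_retry_once() and the flaky_reader test double are hypothetical, and -1 stands in for whatever error the caller would report.

```c
/* Hypothetical read callback: returns 0 on success, nonzero on failure. */
typedef int (*read_fn)(void *ctx);

/* One initial attempt plus exactly one retry, then give up. */
int read_page_retry_once(read_fn readpage, void *ctx)
{
	int attempt;

	for (attempt = 0; attempt < 2; attempt++) {
		if (readpage(ctx) == 0)
			return 0;
	}
	return -1;	/* still not up to date: report the error */
}

/* Test double: fails a fixed number of times, then succeeds. */
struct flaky_reader {
	int failures_left;
	int calls;
};

int flaky_read(void *ctx)
{
	struct flaky_reader *r = ctx;

	r->calls++;
	if (r->failures_left > 0) {
		r->failures_left--;
		return -1;
	}
	return 0;
}
```

With the duplicated block in place, the equivalent loop bound would have been three attempts instead of two.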

Signed-off-by: Nick Piggin <npiggin@suse.de>

mm/filemap.c | 24 ------------------------
1 file changed, 24 deletions(-)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -1448,30 +1448,6 @@ page_not_uptodate:
 		majmin = VM_FAULT_MAJOR;
 		count_vm_event(PGMAJFAULT);
 	}
-	lock_page(page);
-
-	/* Did it get unhashed while we waited for it? */
-	if (!page->mapping) {
-		unlock_page(page);
-		page_cache_release(page);
-		goto retry_all;
-	}
-
-	/* Did somebody else get it up-to-date? */
-	if (PageUptodate(page)) {
-		unlock_page(page);
-		goto success;
-	}
-
-	error = mapping->a_ops->readpage(file, page);
-	if (!error) {
-		wait_on_page_locked(page);
-		if (PageUptodate(page))
-			goto success;
-	} else if (error == AOP_TRUNCATED_PAGE) {
-		page_cache_release(page);
-		goto retry_find;
-	}

 	/*
 	 * Umm, take care of errors if the page isn't up-to-date.
-

To: Linux Memory Management <linux-mm@...>, Andrew Morton <akpm@...>
Cc: Linux Kernel <linux-kernel@...>, Nick Piggin <npiggin@...>, Benjamin Herrenschmidt <benh@...>
Date: Wednesday, February 21, 2007 - 12:49 am

Add a bugcheck for Andrea's pagefault vs invalidate race. This is triggerable
for both linear and nonlinear pages with a userspace test harness (using
direct IO and truncate, respectively).
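
The invariant being checked can be stated as a small userspace model. The names below are hypothetical, mapcount counts user ptes from zero rather than following the kernel's _mapcount convention, and the function returns false at the point where the kernel would hit the BUG_ON.

```c
#include <stdbool.h>
#include <stddef.h>

struct page_model {
	const void *mapping;	/* NULL once removed from the pagecache */
	int mapcount;		/* user ptes pointing at this page */
};

bool model_page_mapped(const struct page_model *p)
{
	return p->mapcount > 0;
}

/*
 * Remove the page from the model pagecache. Returns true if the
 * invariant holds, false where the kernel bugcheck would fire.
 */
bool model_remove_from_page_cache(struct page_model *p)
{
	p->mapping = NULL;
	return !model_page_mapped(p);
}
```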

Signed-off-by: Nick Piggin <npiggin@suse.de>

mm/filemap.c | 2 ++
1 file changed, 2 insertions(+)

Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -120,6 +120,8 @@ void __remove_from_page_cache(struct pag
 	page->mapping = NULL;
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
+
+	BUG_ON(page_mapped(page));
 }

 void remove_from_page_cache(struct page *page)
-
