Re: [PATCH 3/3] mlock: avoid dirtying pages and triggering writeback

Previous thread: make xconfig broken on 2.6.37-rc2 by Ameya Palande on Wednesday, November 17, 2010 - 5:23 am. (2 messages)

Next thread: Are you interested in obtaining a loan from a legitimate lender at a very cheaper rate. by Jennifer Parks on Wednesday, November 17, 2010 - 5:18 am. (1 message)
From: Michel Lespinasse
Date: Wednesday, November 17, 2010 - 5:23 am

As discussed in linux-mm, mlocking a shared, writable vma currently causes
the corresponding pages to be marked as dirty and queued for writeback.
This seems rather unnecessary given that the pages are not being actually
modified during mlock. It is understood that for non-shared mappings
(file or anon) we want to use a write fault in order to break COW, but
there is just no such need for shared mappings.

The first two patches in this series do not introduce any behavior change.
The intent there is to make it obvious that dirtying file pages is
only done in the (writable, shared) case.  I think this clarifies the
code, but I wouldn't mind dropping these two patches if there is no
consensus about them.

The last patch is where we actually avoid dirtying shared mappings
during mlock. Note that as a side effect of this, we won't call
page_mkwrite() for the mappings that define it, and won't be
pre-allocating data blocks at the FS level if the mapped file was
sparsely allocated. My understanding is that mlock does not need to
provide such guarantee, as evidenced by the fact that it never did for
the filesystems that don't define page_mkwrite() - including some
common ones like ext3. However, I would like to gather feedback on this
from filesystem people as a precaution. If this turns out to be a
showstopper, maybe block preallocation can be added back on using a
different interface.

Large shared mlocks are getting significantly (>2x) faster in my tests,
as the disk can be fully used for reading the file instead of having to
share between this and writeback.

Michel Lespinasse (3):
  do_wp_page: remove the 'reuse' flag
  do_wp_page: clarify dirty_page handling
  mlock: avoid dirtying pages and triggering writeback

 mm/memory.c |   90 ++++++++++++++++++++++++++++++++---------------------------
 mm/mlock.c  |    7 ++++-
 2 files changed, 55 insertions(+), 42 deletions(-)

-- 
1.7.3.1

--

From: Michel Lespinasse
Date: Wednesday, November 17, 2010 - 5:23 am

Reorganize the code to remove the 'reuse' flag. No behavior changes.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/memory.c |   11 +++++------
 1 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..810a75f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2110,7 +2110,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 {
 	struct page *old_page, *new_page;
 	pte_t entry;
-	int reuse = 0, ret = 0;
+	int ret = 0;
 	int page_mkwrite = 0;
 	struct page *dirty_page = NULL;
 
@@ -2147,14 +2147,16 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			page_cache_release(old_page);
 		}
-		reuse = reuse_swap_page(old_page);
-		if (reuse)
+		if (reuse_swap_page(old_page)) {
 			/*
 			 * The page is all ours.  Move it to our anon_vma so
 			 * the rmap code will not search our parent or siblings.
 			 * Protected against the rmap code by the page lock.
 			 */
 			page_move_anon_rmap(old_page, vma, address);
+			unlock_page(old_page);
+			goto reuse;
+		}
 		unlock_page(old_page);
 	} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
 					(VM_WRITE|VM_SHARED))) {
@@ -2218,10 +2220,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
 		}
 		dirty_page = old_page;
 		get_page(dirty_page);
-		reuse = 1;
-	}
 
-	if (reuse) {
 reuse:
 		flush_cache_page(vma, address, pte_pfn(orig_pte));
 		entry = pte_mkyoung(orig_pte);
-- 
1.7.3.1

--

From: Michel Lespinasse
Date: Wednesday, November 17, 2010 - 5:23 am

When faulting in pages for mlock(), we want to break COW for anonymous
or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is
no need to write-fault into VM_SHARED vmas since shared file pages can
be mlocked first and dirtied later, when/if they actually get written to.
Skipping the write fault is desirable, as we don't want to unnecessarily
cause these pages to be dirtied and queued for writeback.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/memory.c |    7 ++++++-
 mm/mlock.c  |    7 ++++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index d4c0c2e..7f45085 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3291,7 +3291,12 @@ int make_pages_present(unsigned long addr, unsigned long end)
 	vma = find_vma(current->mm, addr);
 	if (!vma)
 		return -ENOMEM;
-	write = (vma->vm_flags & VM_WRITE) != 0;
+	/*
+	 * We want to touch writable mappings with a write fault in order
+	 * to break COW, except for shared mappings because these don't COW
+	 * and we would not want to dirty them for nothing.
+	 */
+	write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE;
 	BUG_ON(addr >= end);
 	BUG_ON(end > vma->vm_end);
 	len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE;
diff --git a/mm/mlock.c b/mm/mlock.c
index b70919c..4f31864 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -171,7 +171,12 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma,
 	VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem));
 
 	gup_flags = FOLL_TOUCH | FOLL_GET;
-	if (vma->vm_flags & VM_WRITE)
+	/*
+	 * We want to touch writable mappings with a write fault in order
+	 * to break COW, except for shared mappings because these don't COW
+	 * and we would not want to dirty them for nothing.
+	 */
+	if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE)
 		gup_flags |= FOLL_WRITE;
 
 	/* We don't try to access the guard page of a stack vma */
-- 
1.7.3.1

--

From: Nick Piggin
Date: Wednesday, November 17, 2010 - 5:57 am

It's not just to break COW, but to do block allocation and such
(filesystem's page_mkwrite op). That needs to at least be explained
in the changelog.

Filesystem doesn't have a good way to fully pin required things
according to mlock, but page_mkwrite provides some reasonable things
(like block allocation / reservation).

--

From: Peter Zijlstra
Date: Wednesday, November 17, 2010 - 8:28 am

Right, but marking all pages dirty isn't really sane. I can imagine
making the reservation but not marking things dirty solution, although
it might be lots harder to implement, esp since some filesystems don't
actually have a page_mkwrite() implementation.


--

From: Michel Lespinasse
Date: Wednesday, November 17, 2010 - 3:05 pm

Really, my understanding is that not pre-allocating filesystem blocks
is just fine. This is, after all, what happens with ext3 and it's
never been reported as a bug (that I know of).

If filesystem people's feedback is that they really want mlock() to
continue pre-allocating blocks, maybe we can just do it using
fallocate() rather than page_mkwrite() callbacks ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Peter Zijlstra
Date: Wednesday, November 17, 2010 - 3:18 pm

Sounds sensible..
--

From: Dave Chinner
Date: Wednesday, November 17, 2010 - 4:11 pm

It's not ext3 you have to worry about - it's the filesystems that
need special state set up on their pages/buffers for ->writepage to
work correctly that are the problem. You need to call
->write_begin/->write_end to get the state set up properly.

If this state is not set up properly, silent data loss will occur
during mmap writes either by ENOSPC or failing to set up writes into
unwritten extents correctly (i.e. we'll be back to where we were in
2.6.15).

I don't think ->page_mkwrite can be worked around - we need that to
be called on the first write fault of any mmap()d page to ensure it
is set up correctly for writeback.  If we don't get write faults
after the page is mlock()d, then we need the ->page_mkwrite() call

Fallocate is not good enough to avoid ->page_mkwrite callbacks.
Indeed, XFS (at least) requires the ->page_mkwrite() callout even
on preallocated space to correctly mark the buffers as unwritten so
extent conversion in ->writepage is triggered correctly (see test
#166 in xfstests).

Hence I think that avoiding ->page_mkwrite callouts is likely to
break some filesystems in subtle, undetected ways.  IMO, regardless
of what is done, it would be really good to start by writing a new
regression test to exercise and encode the expected the mlock
behaviour so we can detect regressions later on....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Michel Lespinasse
Date: Wednesday, November 17, 2010 - 4:31 pm

Just to be clear - I'm proposing to skip the entire do_wp_page() call
by doing a read fault rather than a write fault. If the page wasn't
dirty already, it will stay clean and with a non-writable PTE until it
gets actually written to, at which point we'll get a write fault and
do_wp_page will be invoked as usual.

I am not proposing to skip the page_mkwrite() while upgrading the PTE
permissions, which I think is what you were arguing against ?

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Dave Chinner
Date: Thursday, November 18, 2010 - 6:46 pm

I have no problem with that - I'm surprised that mlock didn't work

I wasn't arguing against anything, merely pointing out that the
->page_mkwrite call is aboslutely necessary. You've made clarified
that it still occurs, so I'm happy...

FWIW, what I was responding to was the assumption that "this is
alright for ext3, so it must be OK" extrapolation about
->page_mkwrite behaviour. Especially as ext3 does not even implement
the ->page_mkwrite operation (which means mmap writes into holes
can't detect ENOSPC correctly)...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Ted Ts'o
Date: Wednesday, November 17, 2010 - 4:52 pm

OK, so I'm not an mm hacker, so maybe I'm missing something.  Could
part of this be fixed by simply sending the write faults for
mlock()'ed pages, so page_mkwrite() gets called when the page is
dirtied.  Seems like a real waste to have the file system pre-allocate
all of the blocks for a mlock()'ed region.  Why does mlock() have to
result in the write faults getting suppressed when the page is
actually dirtied?

						- Ted
--

From: Andrew Morton
Date: Wednesday, November 17, 2010 - 5:53 pm

On Wed, 17 Nov 2010 18:52:30 -0500

Yup, I don't think it would be too bad to take a minor fault each time
an mlocked page transitions from clean->dirty.

In fact we should already be doing that, after the mlocked page gets
written back by kupdate?  Hope so!

--

From: Michel Lespinasse
Date: Thursday, November 18, 2010 - 4:03 am

This is actually what the patch does - by having mlock() use a read fault,
pages are loaded in memory and mlocked, but the ptes are not marked as
writable so that a later write access will be caught as a write fault at

Yes, handle_mm_fault() is careful to never create writable ptes pointing
to clean file pages, so that a later write fault will correctly dirty
the corresponding page.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Christoph Hellwig
Date: Thursday, November 18, 2010 - 6:37 am

I think it would help if we could drink a bit of the test driven design
coolaid here. Michel, can you write some testcases where pages on a
shared mapping are mlocked, then dirtied and then munlocked, and then
written out using msync/fsync.  Anything that fails this test on
btrfs/ext4/gfs/xfs/etc obviously doesn't work.

--

From: Hugh Dickins
Date: Thursday, November 18, 2010 - 10:41 am

Whilst it's hard to argue against a request for testing, Dave's worries
just sprang from a misunderstanding of all the talk about "avoiding ->
page_mkwrite".  There's nothing strange or risky about Michel's patch,
it does not avoid ->page_mkwrite when there is a write: it just stops
pretending that there was a write when locking down the shared area.

Hugh
--

From: Michel Lespinasse
Date: Friday, November 19, 2010 - 12:23 am

I think it's still under debate what's an acceptable result for this test
(i.e. what's supposed to happen during mlock of a shared mapping of
a sparsely allocated file - is a fallocate equivalent supposed to happen ?)

So, I decided to test this using memtoy. /data is a separate partition
where I had just 10GB free space, and /data/hole20G was created using
dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0.

memtoy>file /data/hole20G shared
memtoy>map hole20G

At this point the file is mapped using a writable, shared VMA.

memtoy>touch hole20G
memtoy:  touched 5242880 pages in 30.595 secs

At this point memtoy's address space is populated with zeroed
pages. The pages are distinct (meminfo does show 20G of allocated pages),
are classified as file pages, not anon, and are associated with the
struct address_space for /data/hole20G. That file still does not have
blocks allocated, as can be seen with du /data/hole20G.

memtoy>lock hole20G

memtoy tries to mlock the hole20G VMA.
This is where things get interesting.

Using ext2, without my patch (ext3 should be similar):
  - first, mlock does fast progress going though file pages, marking them
    as dirty. Eventually, it hits the dirty limit and gets throttled.
  - then, mlock does slow progress as it needs to wait for writeback.
    writeback occurs and allocates blocks for the /data/hole20G.
    Eventually, the /data partition gets full.
  - then, mlock does no progress as it's at the dirty limit and nothing
    gets written back.
  - mlock never terminates.

Using ext2, with my patch (ext3 should be similar):
  - mlock goes through all pages in ~5 seconds, marking them as mlocked
    (but still not dirty)
  - mlock completes. /data/hole20G still does not have blocks allocated.
  - if memtoy is then instructed to dirty all the pages
    (using 'touch hole20G write'):
    - first, memtoy does fast progress faulting through file pages, marking
      them as dirty. Eventually, it hits the dirty limit and gets ...
From: Theodore Tso
Date: Friday, November 19, 2010 - 6:42 am

My vote would be against.   If you if you mmap a sparse file and then
try writing to it willy-nilly, bad things will happen.  This is true without
a mlock().   Where is it written that mlock() has anything to do with
improving this situation?

If userspace wants to call fallocate() before it calls mlock(), it should
do that.  And in fact, in most cases, userspace should probably be
encouraged to do that.   But having mlock() call fallocate() and
then return ENOSPC if there's no room?  Isn't it confusing that mlock()
call ENOSPC?  Doesn't that give you cognitive dissonance?  It should
because fundamentally mlock() has nothing to do with block allocation!!
Read the API spec!

Look, it was an accident / bug of the implementation that mlock()
magically dirtied all these pages.  It might have made some situations
better, but I very much doubt applications depended upon it, and I'd
really rather not perpetuate this particular magic side effect of the
previously buggy implementation of mlock().

-- Ted
--

From: Christoph Hellwig
Date: Friday, November 19, 2010 - 8:06 am

Exactly.  Allocating space has been a side-effect on a handfull

Indeed.  There is no need to make mlock + flag a parallel-API to
fallocate.

--

From: Andrew Morton
Date: Friday, November 19, 2010 - 3:54 pm

On Thu, 18 Nov 2010 23:23:16 -0800

Wait.  You *tested* the kernel?



Dirtying all that memory at mlock() time is pretty obnoxious.

I'm inclined to agree that your patch implements the desirable
behaviour: don't dirty the page, don't do block allocation.  Take a
fault at first-dirtying and do it then.  This does degrade mlock a bit:
the user will find that the first touch of an mlocked page can cause
synchronous physical I/O, which isn't mlocky behaviour *at all*.  But
we have to be able to do this anyway - whenever the kupdate function
writes back the dirty pages it has to mark them read-only again so the
kernel knows when they get redirtied.

I do agree that this will result in worse file layout for some
reasonable userspace code patterns.  But it was always that way, except
for the eleven-release window where we kinda accidentally fixed that up
in-kernel.  Hopefully most apps which care are already ensuring that
the file is well laid-out.


So all that leaves me thinking that we merge your patches as-is.  Then
work out why users can fairly trivially use mlock to hang the kernel on
ext2 and ext3 (and others?) 
--

From: Ted Ts'o
Date: Friday, November 19, 2010 - 4:22 pm

So at least on RHEL 4 and 5 systems, pam_limits was configured so that
unprivileged processes could only mlock() at most 16k.  This was
deemed enough so that programs could protect crypto keys.  The
thinking when we added the mlock() ulimit setting was that
unprivileged users could very easily make a nuisance of themselves,
and grab way too much system resources, by using mlock() in obnoxious
ways.

I was just checking to see if my memory was correct, and to my
surprise, I've just found that Ubuntu deliberately sets the memlock
ulimit to be unlimited.  Which means that Ubuntu systems are
completely wide open for this particular DOS attack.  So if you
administer an Ubuntu-based server, it might be a good idea to make a
tiny little change to /etc/security/limits.conf....

							- Ted
--

From: Dustin Kirkland
Date: Friday, November 19, 2010 - 5:29 pm

Kees,

Copying you into this thread, in case you'd like to respond from the
Ubuntu side.  Thanks for the heads-up, Ted.


Dustin Kirkland
Canonical, LTD
kirkland@canonical.com
GPG: 1024D/83A61194
From: Michel Lespinasse
Date: Friday, November 19, 2010 - 4:31 pm

On Fri, Nov 19, 2010 at 2:54 PM, Andrew Morton



I would say the hang is not even mlock related - you see without it
also. All you need is mmap a large file with holes and write fault
pages until you run out of disk space. At that point additional write
faults wait for a writeback that can never complete. Sysadmin can
however kill -9 such processes and/or free some space, though.

-- 
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.
--

From: Dave Chinner
Date: Friday, November 19, 2010 - 4:54 pm

That's the typical result of ENOSPC during mmap writes into holes on
filesystems that don't implement ->page_mkwrite() - this effectively
overcommits the filesystem free space with more dirty pages than can
fit in the free space, and then the system will get stuck on ENOSPC
errors when trying to allocate in the writeback path, leaving you


Yup, ext4 implements page_mkwrite(), so can detect ENOSPC during the

Which is how all filesystems _should_ behave. Any filesystem that does
not implement ->page_mkwrite will break under this test, regardless
of your patch. It is exercising the exact problem that page_mkwrite

IMO, you do not need fallocate or any form of preallocation at all
during mlock(). As I've already pointed out, this still requires
filesystems to implement ->page_mkwrite to ensure mmap writes into
preallocated regions work correctly.

Fundamentally, filesystems need to implement ->page-mkwrite() to do
the right thing w.r.t. mmap writes and ENOSPC, and any filesystem
that doesn't is plain broken. You don't need to try to fix this
problem by making mlock() jump through hoops - it'll be fixed by
people implementing a working ->page_mkwrite function for their
filesystem.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com
--

From: Nick Piggin
Date: Wednesday, November 17, 2010 - 10:46 pm

I think it is sane enough. Going back to previous behaviour would be
a regression, wouldn't it?

The right way to fix this would not be to introduce the new regression
but either/both: a specific syscall to mlock-for-read which does not do
any reservations, fix filesystem hook to allow reservation without
implying dirtying. A simple flag to page_mkwrite will be enough (plus
the logic to call it from VM).

If an app has called mlock, presumably it doesn't want to SIGBUS from
out of space, if it can possibly help it. If it isn't going to write
to it, then PROT_READ would be appropriate.

If not, then a note to man page maintainer, and an idea of performance
improvement in an actual use case would be nice.

--

From: Theodore Tso
Date: Thursday, November 18, 2010 - 3:43 am

Why is it at all important that mlock() force block allocation for sparse blocks?    It's  not at all specified in the mlock() API definition that it does that.

Are there really programs that assume that mlock() == fallocate()?!?

-- Ted


--

From: Christoph Hellwig
Date: Thursday, November 18, 2010 - 6:39 am

If there are programs that do they can't predate linux 2.6.15, and only
work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs.  Seems rather
unlikely to me.

--

From: Hugh Dickins
Date: Thursday, November 18, 2010 - 11:00 am

Yes, almost.  I'm very much on this side, that mlocking should not dirty
all those pages; but better admit one argument for the opposition - it's
possible that we'd find a case somewhere, which has always (i.e. even pre-
page_mkwrite) relied upon mlock of an entirely sparse file to result in
a nicely ordered allocation of blocks to the file (as would often have
happened with pdflush, I think), to give good sequential read patterns
ever after; but with this patch would now get much more random block
ordering, according to where the real writes actually fall.

It would be possible for a filesystem's ->fault(vma, &vmf) to observe
that it's being called on a VM_LOCKED|VM_SHARED vma, and make sure that
the page has backing in that case, to reproduce the old allocation behaviour
without all the unnecessary writing.  But that would be extra work in every
filesystem that cares.

Hugh
--

From: Michel Lespinasse
Date: Wednesday, November 17, 2010 - 5:23 am

Reorganize the code so that dirty pages are handled closer to the place
that makes them dirty (handling write fault into shared, writable VMAs).
No behavior changes.

Signed-off-by: Michel Lespinasse <walken@google.com>
---
 mm/memory.c |   72 +++++++++++++++++++++++++++++++---------------------------
 1 files changed, 38 insertions(+), 34 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 810a75f..d4c0c2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2227,8 +2227,45 @@ reuse:
 		entry = maybe_mkwrite(pte_mkdirty(entry), vma);
 		if (ptep_set_access_flags(vma, address, page_table, entry,1))
 			update_mmu_cache(vma, address, page_table);
+		pte_unmap_unlock(page_table, ptl);
 		ret |= VM_FAULT_WRITE;
-		goto unlock;
+
+		if (!dirty_page)
+			return ret;
+
+		/*
+		 * Yes, Virginia, this is actually required to prevent a race
+		 * with clear_page_dirty_for_io() from clearing the page dirty
+		 * bit after it clear all dirty ptes, but before a racing
+		 * do_wp_page installs a dirty pte.
+		 *
+		 * do_no_page is protected similarly.
+		 */
+		if (!page_mkwrite) {
+			wait_on_page_locked(dirty_page);
+			set_page_dirty_balance(dirty_page, page_mkwrite);
+		}
+		put_page(dirty_page);
+		if (page_mkwrite) {
+			struct address_space *mapping = dirty_page->mapping;
+
+			set_page_dirty(dirty_page);
+			unlock_page(dirty_page);
+			page_cache_release(dirty_page);
+			if (mapping)	{
+				/*
+				 * Some device drivers do not set page.mapping
+				 * but still dirty their pages
+				 */
+				balance_dirty_pages_ratelimited(mapping);
+			}
+		}
+
+		/* file_update_time outside page_lock */
+		if (vma->vm_file)
+			file_update_time(vma->vm_file);
+
+		return ret;
 	}
 
 	/*
@@ -2334,39 +2371,6 @@ gotten:
 		page_cache_release(old_page);
 unlock:
 	pte_unmap_unlock(page_table, ptl);
-	if (dirty_page) {
-		/*
-		 * Yes, Virginia, this is actually required to prevent a race
-		 * with clear_page_dirty_for_io() from clearing the page dirty
-		 * ...
Previous thread: make xconfig broken on 2.6.37-rc2 by Ameya Palande on Wednesday, November 17, 2010 - 5:23 am. (2 messages)

Next thread: Are you interested in obtaining a loan from a legitimate lender at a very cheaper rate. by Jennifer Parks on Wednesday, November 17, 2010 - 5:18 am. (1 message)