As discussed in linux-mm, mlocking a shared, writable vma currently causes the corresponding pages to be marked as dirty and queued for writeback. This seems rather unnecessary given that the pages are not being actually modified during mlock. It is understood that for non-shared mappings (file or anon) we want to use a write fault in order to break COW, but there is just no such need for shared mappings. The first two patches in this series do not introduce any behavior change. The intent there is to make it obvious that dirtying file pages is only done in the (writable, shared) case. I think this clarifies the code, but I wouldn't mind dropping these two patches if there is no consensus about them. The last patch is where we actually avoid dirtying shared mappings during mlock. Note that as a side effect of this, we won't call page_mkwrite() for the mappings that define it, and won't be pre-allocating data blocks at the FS level if the mapped file was sparsely allocated. My understanding is that mlock does not need to provide such guarantee, as evidenced by the fact that it never did for the filesystems that don't define page_mkwrite() - including some common ones like ext3. However, I would like to gather feedback on this from filesystem people as a precaution. If this turns out to be a showstopper, maybe block preallocation can be added back on using a different interface. Large shared mlocks are getting significantly (>2x) faster in my tests, as the disk can be fully used for reading the file instead of having to share between this and writeback. Michel Lespinasse (3): do_wp_page: remove the 'reuse' flag do_wp_page: clarify dirty_page handling mlock: avoid dirtying pages and triggering writeback mm/memory.c | 90 ++++++++++++++++++++++++++++++++--------------------------- mm/mlock.c | 7 ++++- 2 files changed, 55 insertions(+), 42 deletions(-) -- 1.7.3.1 --
Reorganize the code to remove the 'reuse' flag. No behavior changes.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
mm/memory.c | 11 +++++------
1 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 0e18b4d..810a75f 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2110,7 +2110,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
{
struct page *old_page, *new_page;
pte_t entry;
- int reuse = 0, ret = 0;
+ int ret = 0;
int page_mkwrite = 0;
struct page *dirty_page = NULL;
@@ -2147,14 +2147,16 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
page_cache_release(old_page);
}
- reuse = reuse_swap_page(old_page);
- if (reuse)
+ if (reuse_swap_page(old_page)) {
/*
* The page is all ours. Move it to our anon_vma so
* the rmap code will not search our parent or siblings.
* Protected against the rmap code by the page lock.
*/
page_move_anon_rmap(old_page, vma, address);
+ unlock_page(old_page);
+ goto reuse;
+ }
unlock_page(old_page);
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
(VM_WRITE|VM_SHARED))) {
@@ -2218,10 +2220,7 @@ static int do_wp_page(struct mm_struct *mm, struct vm_area_struct *vma,
}
dirty_page = old_page;
get_page(dirty_page);
- reuse = 1;
- }
- if (reuse) {
reuse:
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = pte_mkyoung(orig_pte);
--
1.7.3.1
--
When faulting in pages for mlock(), we want to break COW for anonymous or file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no need to write-fault into VM_SHARED vmas since shared file pages can be mlocked first and dirtied later, when/if they actually get written to. Skipping the write fault is desirable, as we don't want to unnecessarily cause these pages to be dirtied and queued for writeback. Signed-off-by: Michel Lespinasse <walken@google.com> --- mm/memory.c | 7 ++++++- mm/mlock.c | 7 ++++++- 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/mm/memory.c b/mm/memory.c index d4c0c2e..7f45085 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3291,7 +3291,12 @@ int make_pages_present(unsigned long addr, unsigned long end) vma = find_vma(current->mm, addr); if (!vma) return -ENOMEM; - write = (vma->vm_flags & VM_WRITE) != 0; + /* + * We want to touch writable mappings with a write fault in order + * to break COW, except for shared mappings because these don't COW + * and we would not want to dirty them for nothing. + */ + write = (vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE; BUG_ON(addr >= end); BUG_ON(end > vma->vm_end); len = DIV_ROUND_UP(end, PAGE_SIZE) - addr/PAGE_SIZE; diff --git a/mm/mlock.c b/mm/mlock.c index b70919c..4f31864 100644 --- a/mm/mlock.c +++ b/mm/mlock.c @@ -171,7 +171,12 @@ static long __mlock_vma_pages_range(struct vm_area_struct *vma, VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); gup_flags = FOLL_TOUCH | FOLL_GET; - if (vma->vm_flags & VM_WRITE) + /* + * We want to touch writable mappings with a write fault in order + * to break COW, except for shared mappings because these don't COW + * and we would not want to dirty them for nothing. + */ + if ((vma->vm_flags & (VM_WRITE | VM_SHARED)) == VM_WRITE) gup_flags |= FOLL_WRITE; /* We don't try to access the guard page of a stack vma */ -- 1.7.3.1 --
It's not just to break COW, but to do block allocation and such (filesystem's page_mkwrite op). That needs to at least be explained in the changelog. Filesystem doesn't have a good way to fully pin required things according to mlock, but page_mkwrite provides some reasonable things (like block allocation / reservation). --
Right, but marking all pages dirty isn't really sane. I can imagine making the reservation but not marking things dirty solution, although it might be lots harder to implement, esp since some filesystems don't actually have a page_mkwrite() implementation. --
Really, my understanding is that not pre-allocating filesystem blocks is just fine. This is, after all, what happens with ext3 and it's never been reported as a bug (that I know of). If filesystem people's feedback is that they really want mlock() to continue pre-allocating blocks, maybe we can just do it using fallocate() rather than page_mkwrite() callbacks ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
It's not ext3 you have to worry about - it's the filesystems that need special state set up on their pages/buffers for ->writepage to work correctly that are the problem. You need to call ->write_begin/->write_end to get the state set up properly. If this state is not set up properly, silent data loss will occur during mmap writes either by ENOSPC or failing to set up writes into unwritten extents correctly (i.e. we'll be back to where we were in 2.6.15). I don't think ->page_mkwrite can be worked around - we need that to be called on the first write fault of any mmap()d page to ensure it is set up correctly for writeback. If we don't get write faults after the page is mlock()d, then we need the ->page_mkwrite() call Fallocate is not good enough to avoid ->page_mkwrite callbacks. Indeed, XFS (at least) requires the ->page_mkwrite() callout even on preallocated space to correctly mark the buffers as unwritten so extent conversion in ->writepage is triggered correctly (see test #166 in xfstests). Hence I think that avoiding ->page_mkwrite callouts is likely to break some filesystems in subtle, undetected ways. IMO, regardless of what is done, it would be really good to start by writing a new regression test to exercise and encode the expected the mlock behaviour so we can detect regressions later on.... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
Just to be clear - I'm proposing to skip the entire do_wp_page() call by doing a read fault rather than a write fault. If the page wasn't dirty already, it will stay clean and with a non-writable PTE until it gets actually written to, at which point we'll get a write fault and do_wp_page will be invoked as usual. I am not proposing to skip the page_mkwrite() while upgrading the PTE permissions, which I think is what you were arguing against ? -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
I have no problem with that - I'm surprised that mlock didn't work I wasn't arguing against anything, merely pointing out that the ->page_mkwrite call is aboslutely necessary. You've made clarified that it still occurs, so I'm happy... FWIW, what I was responding to was the assumption that "this is alright for ext3, so it must be OK" extrapolation about ->page_mkwrite behaviour. Especially as ext3 does not even implement the ->page_mkwrite operation (which means mmap writes into holes can't detect ENOSPC correctly)... Cheers, Dave. -- Dave Chinner david@fromorbit.com --
OK, so I'm not an mm hacker, so maybe I'm missing something. Could part of this be fixed by simply sending the write faults for mlock()'ed pages, so page_mkwrite() gets called when the page is dirtied. Seems like a real waste to have the file system pre-allocate all of the blocks for a mlock()'ed region. Why does mlock() have to result in the write faults getting suppressed when the page is actually dirtied? - Ted --
On Wed, 17 Nov 2010 18:52:30 -0500 Yup, I don't think it would be too bad to take a minor fault each time an mlocked page transitions from clean->dirty. In fact we should already be doing that, after the mlocked page gets written back by kupdate? Hope so! --
This is actually what the patch does - by having mlock() use a read fault, pages are loaded in memory and mlocked, but the ptes are not marked as writable so that a later write access will be caught as a write fault at Yes, handle_mm_fault() is careful to never create writable ptes pointing to clean file pages, so that a later write fault will correctly dirty the corresponding page. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
I think it would help if we could drink a bit of the test driven design coolaid here. Michel, can you write some testcases where pages on a shared mapping are mlocked, then dirtied and then munlocked, and then written out using msync/fsync. Anything that fails this test on btrfs/ext4/gfs/xfs/etc obviously doesn't work. --
Whilst it's hard to argue against a request for testing, Dave's worries just sprang from a misunderstanding of all the talk about "avoiding -> page_mkwrite". There's nothing strange or risky about Michel's patch, it does not avoid ->page_mkwrite when there is a write: it just stops pretending that there was a write when locking down the shared area. Hugh --
I think it's still under debate what's an acceptable result for this test
(i.e. what's supposed to happen during mlock of a shared mapping of
a sparsely allocated file - is a fallocate equivalent supposed to happen ?)
So, I decided to test this using memtoy. /data is a separate partition
where I had just 10GB free space, and /data/hole20G was created using
dd if=/dev/zero of=/data/hole20G bs=1M seek=20480 count=0.
memtoy>file /data/hole20G shared
memtoy>map hole20G
At this point the file is mapped using a writable, shared VMA.
memtoy>touch hole20G
memtoy: touched 5242880 pages in 30.595 secs
At this point memtoy's address space is populated with zeroed
pages. The pages are distinct (meminfo does show 20G of allocated pages),
are classified as file pages, not anon, and are associated with the
struct address_space for /data/hole20G. That file still does not have
blocks allocated, as can be seen with du /data/hole20G.
memtoy>lock hole20G
memtoy tries to mlock the hole20G VMA.
This is where things get interesting.
Using ext2, without my patch (ext3 should be similar):
- first, mlock does fast progress going though file pages, marking them
as dirty. Eventually, it hits the dirty limit and gets throttled.
- then, mlock does slow progress as it needs to wait for writeback.
writeback occurs and allocates blocks for the /data/hole20G.
Eventually, the /data partition gets full.
- then, mlock does no progress as it's at the dirty limit and nothing
gets written back.
- mlock never terminates.
Using ext2, with my patch (ext3 should be similar):
- mlock goes through all pages in ~5 seconds, marking them as mlocked
(but still not dirty)
- mlock completes. /data/hole20G still does not have blocks allocated.
- if memtoy is then instructed to dirty all the pages
(using 'touch hole20G write'):
- first, memtoy does fast progress faulting through file pages, marking
them as dirty. Eventually, it hits the dirty limit and gets ...My vote would be against. If you if you mmap a sparse file and then try writing to it willy-nilly, bad things will happen. This is true without a mlock(). Where is it written that mlock() has anything to do with improving this situation? If userspace wants to call fallocate() before it calls mlock(), it should do that. And in fact, in most cases, userspace should probably be encouraged to do that. But having mlock() call fallocate() and then return ENOSPC if there's no room? Isn't it confusing that mlock() call ENOSPC? Doesn't that give you cognitive dissonance? It should because fundamentally mlock() has nothing to do with block allocation!! Read the API spec! Look, it was an accident / bug of the implementation that mlock() magically dirtied all these pages. It might have made some situations better, but I very much doubt applications depended upon it, and I'd really rather not perpetuate this particular magic side effect of the previously buggy implementation of mlock(). -- Ted --
Exactly. Allocating space has been a side-effect on a handfull Indeed. There is no need to make mlock + flag a parallel-API to fallocate. --
On Thu, 18 Nov 2010 23:23:16 -0800 Wait. You *tested* the kernel? Dirtying all that memory at mlock() time is pretty obnoxious. I'm inclined to agree that your patch implements the desirable behaviour: don't dirty the page, don't do block allocation. Take a fault at first-dirtying and do it then. This does degrade mlock a bit: the user will find that the first touch of an mlocked page can cause synchronous physical I/O, which isn't mlocky behaviour *at all*. But we have to be able to do this anyway - whenever the kupdate function writes back the dirty pages it has to mark them read-only again so the kernel knows when they get redirtied. I do agree that this will result in worse file layout for some reasonable userspace code patterns. But it was always that way, except for the eleven-release window where we kinda accidentally fixed that up in-kernel. Hopefully most apps which care are already ensuring that the file is well laid-out. So all that leaves me thinking that we merge your patches as-is. Then work out why users can fairly trivially use mlock to hang the kernel on ext2 and ext3 (and others?) --
So at least on RHEL 4 and 5 systems, pam_limits was configured so that unprivileged processes could only mlock() at most 16k. This was deemed enough so that programs could protect crypto keys. The thinking when we added the mlock() ulimit setting was that unprivileged users could very easily make a nuisance of themselves, and grab way too much system resources, by using mlock() in obnoxious ways. I was just checking to see if my memory was correct, and to my surprise, I've just found that Ubuntu deliberately sets the memlock ulimit to be unlimited. Which means that Ubuntu systems are completely wide open for this particular DOS attack. So if you administer an Ubuntu-based server, it might be a good idea to make a tiny little change to /etc/security/limits.conf.... - Ted --
Kees, Copying you into this thread, in case you'd like to respond from the Ubuntu side. Thanks for the heads-up, Ted. Dustin Kirkland Canonical, LTD kirkland@canonical.com GPG: 1024D/83A61194
On Fri, Nov 19, 2010 at 2:54 PM, Andrew Morton I would say the hang is not even mlock related - you see without it also. All you need is mmap a large file with holes and write fault pages until you run out of disk space. At that point additional write faults wait for a writeback that can never complete. Sysadmin can however kill -9 such processes and/or free some space, though. -- Michel "Walken" Lespinasse A program is never fully debugged until the last user dies. --
That's the typical result of ENOSPC during mmap writes into holes on filesystems that don't implement ->page_mkwrite() - this effectively overcommits the filesystem free space with more dirty pages than can fit in the free space, and then the system will get stuck on ENOSPC errors when trying to allocate in the writeback path, leaving you Yup, ext4 implements page_mkwrite(), so can detect ENOSPC during the Which is how all filesystems _should_ behave. Any filesystem that does not implement ->page_mkwrite will break under this test, regardless of your patch. It is exercising the exact problem that page_mkwrite IMO, you do not need fallocate or any form of preallocation at all during mlock(). As I've already pointed out, this still requires filesystems to implement ->page_mkwrite to ensure mmap writes into preallocated regions work correctly. Fundamentally, filesystems need to implement ->page-mkwrite() to do the right thing w.r.t. mmap writes and ENOSPC, and any filesystem that doesn't is plain broken. You don't need to try to fix this problem by making mlock() jump through hoops - it'll be fixed by people implementing a working ->page_mkwrite function for their filesystem. Cheers, Dave. -- Dave Chinner david@fromorbit.com --
I think it is sane enough. Going back to previous behaviour would be a regression, wouldn't it? The right way to fix this would not be to introduce the new regression but either/both: a specific syscall to mlock-for-read which does not do any reservations, fix filesystem hook to allow reservation without implying dirtying. A simple flag to page_mkwrite will be enough (plus the logic to call it from VM). If an app has called mlock, presumably it doesn't want to SIGBUS from out of space, if it can possibly help it. If it isn't going to write to it, then PROT_READ would be appropriate. If not, then a note to man page maintainer, and an idea of performance improvement in an actual use case would be nice. --
Why is it at all important that mlock() force block allocation for sparse blocks? It's not at all specified in the mlock() API definition that it does that. Are there really programs that assume that mlock() == fallocate()?!? -- Ted --
If there are programs that do they can't predate linux 2.6.15, and only work on btrfs/ext4/xfs/etc, but not ext2/ext3/reiserfs. Seems rather unlikely to me. --
Yes, almost. I'm very much on this side, that mlocking should not dirty all those pages; but better admit one argument for the opposition - it's possible that we'd find a case somewhere, which has always (i.e. even pre- page_mkwrite) relied upon mlock of an entirely sparse file to result in a nicely ordered allocation of blocks to the file (as would often have happened with pdflush, I think), to give good sequential read patterns ever after; but with this patch would now get much more random block ordering, according to where the real writes actually fall. It would be possible for a filesystem's ->fault(vma, &vmf) to observe that it's being called on a VM_LOCKED|VM_SHARED vma, and make sure that the page has backing in that case, to reproduce the old allocation behaviour without all the unnecessary writing. But that would be extra work in every filesystem that cares. Hugh --
Reorganize the code so that dirty pages are handled closer to the place
that makes them dirty (handling write fault into shared, writable VMAs).
No behavior changes.
Signed-off-by: Michel Lespinasse <walken@google.com>
---
mm/memory.c | 72 +++++++++++++++++++++++++++++++---------------------------
1 files changed, 38 insertions(+), 34 deletions(-)
diff --git a/mm/memory.c b/mm/memory.c
index 810a75f..d4c0c2e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2227,8 +2227,45 @@ reuse:
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (ptep_set_access_flags(vma, address, page_table, entry,1))
update_mmu_cache(vma, address, page_table);
+ pte_unmap_unlock(page_table, ptl);
ret |= VM_FAULT_WRITE;
- goto unlock;
+
+ if (!dirty_page)
+ return ret;
+
+ /*
+ * Yes, Virginia, this is actually required to prevent a race
+ * with clear_page_dirty_for_io() from clearing the page dirty
+ * bit after it clear all dirty ptes, but before a racing
+ * do_wp_page installs a dirty pte.
+ *
+ * do_no_page is protected similarly.
+ */
+ if (!page_mkwrite) {
+ wait_on_page_locked(dirty_page);
+ set_page_dirty_balance(dirty_page, page_mkwrite);
+ }
+ put_page(dirty_page);
+ if (page_mkwrite) {
+ struct address_space *mapping = dirty_page->mapping;
+
+ set_page_dirty(dirty_page);
+ unlock_page(dirty_page);
+ page_cache_release(dirty_page);
+ if (mapping) {
+ /*
+ * Some device drivers do not set page.mapping
+ * but still dirty their pages
+ */
+ balance_dirty_pages_ratelimited(mapping);
+ }
+ }
+
+ /* file_update_time outside page_lock */
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
+ return ret;
}
/*
@@ -2334,39 +2371,6 @@ gotten:
page_cache_release(old_page);
unlock:
pte_unmap_unlock(page_table, ptl);
- if (dirty_page) {
- /*
- * Yes, Virginia, this is actually required to prevent a race
- * with clear_page_dirty_for_io() from clearing the page dirty
- * ...