Re: [PATCH -v7 2/2] Update ctime and mtime for memory-mapped files

Previous thread: Re: PROBLEM: Celeron Core by Robert Hancock on Monday, January 21, 2008 - 7:59 pm. (1 message)

Next thread: none
To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 8:32 pm

This is the seventh version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Since the previous version, the following has changed: based on
Linus' comment, SMP-safe PTE update implemented.

Discussions, which followed my past submissions, showed that it was
tempting to approach this problem using very different assumptions.
However, many such designs have proved to be incomplete or inefficient.

Taking into account the obvious complexity of this problem, I prepared a
design document, the purpose of which is twofold. First, it summarizes
previous attempts to resolve the ctime/mtime issue. Second, it attempts
to prove that what the patches do is necessary and sufficient for mtime
and ctime update provided that we start from a certain sane set of
requirements. The design document is available via the following link:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

For the seventh version, comprehensive performance testing was performed.
The results of performance tests, including numbers, are available here:

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c43
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 9:34 pm

Hi Anton,

I applied your patches here and as far as my own test programs go,
these patches solve the previously observed problems I saw with mtime
not getting updated.

Thank you very much for so persistently working on these long standing issues.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
--

To: Jesper Juhl <jesper.juhl@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 9:40 pm

--

To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 8:32 pm

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c40

Update file times at write references to memory-mapped files.
Force file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
mm/memory.c | 6 ++++++
mm/msync.c | 57 ++++++++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 50 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 6dd1cd8..4b0144b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1670,6 +1670,9 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
/*
* Yes, Virginia, this is actually required to prevent a race
* with clear_page_dirty_for_io() from clearing the page dirty
@@ -2343,6 +2346,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..394130d 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -5,6 +5,7 @@
* Copyright (C) 2008 Anton Salikhmetov <salikhmetov@gmail.com>
*/

+#include <asm/tlbflush.h>
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
@@ -13,11 +14,37 @@
#include <linux/syscalls.h>

/*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ spinlock_t *ptl;
+ pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ pte_t *pte = pte_offset_map_lock(vma->vm_mm, p...

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>
Date: Tuesday, January 22, 2008 - 12:39 am

Anton Salikhmetov <salikhmetov@gmail.com> writes:

You should probably put your design document somewhere in Documentation

This means on i386 with highmem ptes you will map/flush tlb/unmap each
PTE individually. You will do 512 times as much work as really needed
per PTE leaf page.

The performance critical address space walkers use a different design

Flushing TLBs unbatched can also be very expensive because if the MM is
shared by several CPUs you'll have a inter-processor interrupt for
each iteration. They are quite costly even on smaller systems.

It would be better if you did a single flush_tlb_range() at the end.
This means on x86 this will currently always do a full flush, but that's
still better than really slowing down in the heavily multithreaded case.

-Andi
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, Linux Kernel Mailing List <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <a.p.zijlstra@...>, Andrew Morton <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:16 pm

This is extremely expensive over bigger areas, especially sparsely mapped
ones (it does all the lookups for all four levels over and over and over
again for eachg page).

I think Peter Zijlstra posted a version that uses the regular kind of
nested loop (with inline functions to keep the thing nice and clean),
which gets rid of that.

[ The sad/funny part is that this is all how we *used* to do msync(), back
in the days: we're literally going back to the "pre-cleanup" logic. See
commit 204ec841fbea3e5138168edbc3a76d46747cc987: "mm: msync() cleanup"
for details ]

Quite frankly, I really think you might be better off just doing a

git revert 204ec841fbea3e5138168edbc3a76d46747cc987

and working from there! I just checked, and it still reverts cleanly, and
you'd end up with a nice code-base that (a) has gotten years of testing
and (b) already has the looping-over-the-pagetables code.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: <linux-mm@...>, <jakob@...>, Linux Kernel Mailing List <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <a.p.zijlstra@...>, Andrew Morton <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:39 pm

Thanks for your feedback, Linus!

I will use Peter Zijlstra's version of such an operation in my next
--

To: <salikhmetov@...>
Cc: <torvalds@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Tuesday, January 22, 2008 - 4:52 am

But note, that those functions iterate over all the vmas for the given
page range, not just the one msync was performed on. This might get
even more expensive, if the file is mapped lots of times.

The old version, that Linus was referring to, needs some modification
as well, because it doesn't write protect the ptes, just marks them
clean.

Miklos
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 9:40 pm

Some very pedantic nitpicking below;

I know it's not the common "Linux Kernel way", but 'addr' could be
made to have just 'for' scope here according to C99;

for (unsigned long addr = vma->vm_start; addr < vma->vm_end;

I think keeping some version of the "up to ..." comments makes sense.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
--

To: Jesper Juhl <jesper.juhl@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 9:51 pm

I believe that the C89 style is more common for the Linux kernel, so

The MS_ASYNC and MS_SYNC flags are mutually exclusive, that is why I
did not use the "else-if" here. Moreover, this function itself checks
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 9:54 pm

I would say that them being mutually exclusive would be a reason *for*
using "else-if" here.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
--

To: Jesper Juhl <jesper.juhl@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 9:57 pm

This check is performed by the sys_msync() function itself in its very
beginning.

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:18 pm

Sure, it's just that, to me, using 'else-if' makes it explicit that
the two are mutually exclusive. Using "if (...), if (...)" doesn't.
Maybe it's just me, but I feel that 'else-if' here better shows the
intention... No big deal.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
--

To: Jesper Juhl <jesper.juhl@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:07 pm

debian:/tmp$ cat c.c
void f()
{
for (unsigned long i = 0; i < 10; i++)
continue;
}
debian:/tmp$ gcc -c -pedantic c.c
c.c: In function 'f':
c.c:3: error: 'for' loop initial declaration used outside C99 mode
debian:/tmp$

<<<

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:16 pm

Well, I just wrote the way I'd have writen the loop, I know it's not
the common kernel style. Just offering my feedback/input :)

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
--

To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 8:32 pm

Use the PAGE_ALIGN() macro instead of "manual" alignment.
Improve readability of the loop, which traverses the process
memory regions. Make code more symmetric and possibly boost
performance on some RISC CPUs by moving variable assignments.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
mm/msync.c | 77 ++++++++++++++++++++++++++++--------------------------------
1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
/*
- * linux/mm/msync.c
+ * The msync() system call.
*
- * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <salikhmetov@gmail.com>
*/

-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>

/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
asmlinkage lon...

Previous thread: Re: PROBLEM: Celeron Core by Robert Hancock on Monday, January 21, 2008 - 7:59 pm. (1 message)

Next thread: none