Re: [PATCH -v6 2/2] Updating ctime and mtime for memory-mapped files

Previous thread: [PATCH 1/1] Net: e100, fix iomap mem accesses by Jiri Slaby on Thursday, January 17, 2008 - 6:28 pm. (11 messages)

Next thread: [PATCH 1/6] Modules: Fold percpu_modcopy into module.c by travis on Thursday, January 17, 2008 - 6:35 pm. (1 message)
To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Thursday, January 17, 2008 - 6:31 pm

This is the sixth version of my solution for the bug #2645:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

New since the previous version:

1) a few cosmetic changes according to the latest feedback
for the cleanup patch;

2) implementation of the following suggestion by Miklos Szeredi:

http://lkml.org/lkml/2008/1/17/158

These changes were tested as explained below. Please note that all
tests were performed with all recommended kernel debug options
enabled. Also note that the tests were performed on regular files
residing on both an ext3 partition and a tmpfs filesystem. I also
checked the block device case, which worked for me as well.

1. My own unit test:

http://bugzilla.kernel.org/attachment.cgi?id=14430

Result: all test cases passed successfully.

2. Unit test provided by Miklos Szeredi:

http://lkml.org/lkml/2008/1/14/104

Result: this test produced the following output:

debian-64:~# ./miklos_test test
begin 1200598736 1200598736 1200598617
write 1200598737 1200598737 1200598617
mmap 1200598737 1200598737 1200598738
b 1200598739 1200598739 1200598738
msync b 1200598739 1200598739 1200598738
c 1200598741 1200598741 1200598738
msync c 1200598741 1200598741 1200598738
d 1200598743 1200598743 1200598738
munmap 1200598743 1200598743 1200598738
close 1200598743 1200598743 1200598738
sync 1200598743 1200598743 1200598738
debian-64:~#

3. Regression tests were performed using the following test cases from
the LTP test suite:

msync01
msync02
msync03
msync04
msync05
mmapstress01
mmapstress09
mmapstress10

Result: no regressions were found while running these test cases.

4. Performance test was done using the program available from the
following link:

http://bugzilla.kernel.org/attachment.cgi?id=14493

Result: the impact of the changes was negligible for files of a few
hundred megabytes.

I w...

To: <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 5:40 am

Could you also test with ext4 and post some numbers? Afaik, ext4 uses
nanosecond timestamps, so the time updating code would be exercised
more during the page faults.

What about performance impact on msync(MS_ASYNC)? Could you please do
some measurment of that as well?

Thanks,
Miklos

--

To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 3:48 pm

Did a quick test on an ext4 partition. This is how it looks like:

debian:~/miklos# ./miklos_test /mnt/file
begin 1200662360 1200662360 1200662353
write 1200662361 1200662361 1200662353
mmap 1200662361 1200662361 1200662362
b 1200662363 1200662363 1200662362
msync b 1200662363 1200662363 1200662362
c 1200662365 1200662365 1200662362
msync c 1200662365 1200662365 1200662362
d 1200662367 1200662367 1200662362
munmap 1200662367 1200662367 1200662362
close 1200662367 1200662367 1200662362
sync 1200662367 1200662367 1200662362
debian:~/miklos# mount | grep /mnt

Following are the results of the measurements. Here are the relevant

#define FILE_SIZE (1024 * 1024 * 512)

p = mmap(0, FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

/* Bring the pages in */
for (i = 0; i < FILE_SIZE; i += 4096)
tmp = p[i];

/* Dirty the pages */
for (i = 0; i < FILE_SIZE; i += 4096)
p[i] = i;

/* How long did we spend in msync(MS_ASYNC)? */
gettimeofday(&tv_start, NULL);
msync(p, FILE_SIZE, MS_ASYNC);
gettimeofday(&tv_stop, NULL);

<<<

For reference tests, the following platforms were used:

1. HP-UX B.11.31, PA-RISC 8800 processor (800 MHz, 64 MB), Memory: 4 GB.

2. HP-UX B.11.31, 2 Intel(R) Itanium 2 9000 series processors (1.59 GHz, 18 MB),
Memory: 15.98 GB.

3. FreeBSD 6.2-RELEASE, Intel(R) Pentium(R) III CPU family 1400MHz, 2 CPUs.
Memory: 4G.

The tests of my solution were performed using the following platform:

A KVM x86_64 guest OS, current Git kernel. Host system: Intel(R) Core(TM)2
Duo CPU T7300 @ 2.00GHz. Further referred to as "the first case".

The following tables give the time difference between the two calls
to gettimeofday(). The test program was run three times in a raw
with a delay of one second between consecutive runs. On Linux
systems, the following commands were issued...

To: <salikhmetov@...>
Cc: <miklos@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Saturday, January 19, 2008 - 6:45 am

Thanks for running these tests.

I was more interested in the slowdown on ext4 (checked with the above
mentioned program). Can you do such a test as well, and post

Interesting.

Thanks,
Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:31 am

--

To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Thursday, January 17, 2008 - 6:31 pm

Updating file times at write references to memory-mapped files and
forcing file times update at the next write reference after
calling the msync() system call with the MS_ASYNC flag.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
mm/memory.c | 6 ++++++
mm/msync.c | 52 +++++++++++++++++++++++++++++++++++++++-------------
2 files changed, 45 insertions(+), 13 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 4bf0b6d..13d5bbf 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1668,6 +1668,9 @@ gotten:
unlock:
pte_unmap_unlock(page_table, ptl);
if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
/*
* Yes, Virginia, this is actually required to prevent a race
* with clear_page_dirty_for_io() from clearing the page dirty
@@ -2341,6 +2344,9 @@ out_unlocked:
if (anon)
page_cache_release(vmf.page);
else if (dirty_page) {
+ if (vma->vm_file)
+ file_update_time(vma->vm_file);
+
set_page_dirty_balance(dirty_page, page_mkwrite);
put_page(dirty_page);
}
diff --git a/mm/msync.c b/mm/msync.c
index a4de868..a49af28 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -13,11 +13,33 @@
#include <linux/syscalls.h>

/*
+ * Scan the PTEs for pages belonging to the VMA and mark them read-only.
+ * It will force a pagefault on the next write access.
+ */
+static void vma_wrprotect(struct vm_area_struct *vma)
+{
+ unsigned long addr;
+
+ for (addr = vma->vm_start; addr < vma->vm_end; addr += PAGE_SIZE) {
+ spinlock_t *ptl;
+ pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+ pte_t *pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+
+ if (pte_dirty(*pte) && pte_write(*pte))
+ *pte = pte_wrprotect(*pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+}
+
+/*
* MS_SYNC syncs the entire file - including mappings.
*
- * MS_ASYNC does not start I/O (it used to, up to 2.5.6...

To: <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 5:51 am

What about ram based filesystems? They don't start out with read-only
pte's, so I think they don't want them read-protected now either.
Unless this is essential for correct mtime/ctime accounting on these
filesystems (I don't think it really is). But then the mapping should
start out read-only as well, otherwise the time update will only work
--

To: Miklos Szeredi <miklos@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:15 am

page_mkclean() has all the needed logic for this, it also walks the rmap
and cleans out all other users, which I think is needed too for
consistencies sake:

Process A Process B

mmap(foo.txt) mmap(foo.txt)

dirty page
dirty page

msync(MS_ASYNC)

dirty page

msync(MS_ASYNC) <--- now what?!

So what I would suggest is using the page table walkers from mm, and
walks the page range, obtain the page using vm_normal_page() and call
page_mkclean(). (Oh, and ensure you don't nest the pte lock :-)

All in all, that sounds rather expensive..

--

To: <a.p.zijlstra@...>
Cc: <miklos@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:38 am

Nothing. I think it's perfectly acceptable behavior, if msync in

Right. The advantage of Anton's current approach, is that it's at
least simple, and possibly not so expensive, while providing same
quite sane semantics for MS_ASYNC vs. mtime updates.

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 7:00 am

how about:

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a1b3fc6 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -14,6 +14,122 @@
#include <linux/syscalls.h>
#include <linux/sched.h>

+unsigned long masync_pte_range(struct vm_area_struct *vma, pmd_t *pdm,
+ unsigned long addr, unsigned long end)
+{
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
+ arch_enter_lazy_mmu_mode();
+ do {
+ pte_t ptent = *pte;
+
+ if (pte_none(ptent))
+ continue;
+
+ if (!pte_present(ptent))
+ continue;
+
+ if (pte_dirty(ptent) && pte_write(ptent)) {
+ flush_cache_page(vma, addr, pte_pfn(ptent));
+ ptent = ptep_clear_flush(vma, addr, pte);
+ ptent = pte_wrprotect(ptent);
+ set_pte_at(vma->vm_mnm, addr, pte, ptent);
+ }
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ arch_leave_lazy_mmu_mode();
+ pte_unmap_unlock(pte - 1, ptl);
+
+ return addr;
+}
+
+unsigned long masync_pmd_range(struct vm_area_struct *vma, pud_t *pud,
+ unsigned long addr, unsigned long end)
+{
+ pmd_t *pmd;
+ unsigned long next;
+
+ pmd = pmd_offset(pud, addr);
+ do {
+ next = pmd_addr_end(addr, end);
+ if (pmd_none_or_clear_bad(pmd))
+ continue;
+ next = masync_pte_range(vma, pmd, addr, next);
+ } while (pmd++, addr = next, addr != end);
+
+ return addr;
+}
+
+unsigned long masync_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
+ unsigned long addr, unsigned long end)
+{
+ pud_t *pud;
+ unsigned long next;
+
+ pud = pud_offset(pgd, addr);
+ do {
+ next = pud_addr_end(addr, end);
+ if (pud_none_or_clear_bad(pud))
+ continue;
+ next = masync_pmd_range(vma, pud, addr, next);
+ } while (pud++, addr = next, addr != end);
+
+ return addr;
+}
+
+unsigned long masync_pgd_range()
+{
+ pgd_t *pgd;
+ unsigned long next;
+
+ pgd = pgd_offset(vma->vm_mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+ if (pgd_none_of_clear_bad(pgd))
+ continue;
+ next = masync_...

To: <a.p.zijlstra@...>
Cc: <miklos@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 7:17 am

Hmm, I'm not sure flush_cache_page() is needed. Or does does dirty

This is hoding i_mmap_lock for possibly quite long. Isn't that going
to cause problems?

--

To: Miklos Szeredi <miklos@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 7:23 am

after releasing i_mmap_lock. Fixing that would be,.. 'fun'.

I also realized I forgot to copy/paste the prio_tree_iter declaration
and ought to make all these functions static.

But for a quick draft it conveys the idea pretty well, I guess :-)

--

To: <a.p.zijlstra@...>
Cc: <miklos@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 7:36 am

Maybe i_mmap_lock isn't needed at all, since msync holds mmap_sem,

Yes :)

There could also be nasty performance corner cases, like having a huge
file mapped thousands of times, and having only a couple of pages
dirtied between MS_ASYNC invocations. Then most of that page table
walking would be just unnecessary overhead.

There's something to be said for walking only the dirty pages, and
doing page_mkclean on them, even if in some cases that would be
slower.

But I have a strong feeling of deja vu, and last time it ended with
Andrew not liking the whole thing...

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:25 am

Bah, and will break on s390... so we'd need a page_mkclean() variant
that doesn't actually clear dirty.

--

To: Peter Zijlstra <peterz@...>
Cc: Miklos Szeredi <miklos@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 1:58 pm

No, we simply want to not play all these very expensive games with dirty
in the first place.

Guys, mmap access times aren't important enough for this. It's not
specified closely enough, and people don't care enough.

Of the patches around so far, the best one by far seems to be the simple
four-liner from Miklos.

And even in that four-liner, I suspect that the *last* two lines are
actually incorrect: there's no point in updating the file time when the
page *becomes* dirty, we should update the file time when it is marked
clean, and "msync(MS_SYNC)" should update it as part of *that*.

So I think the file time update should be part of just the page writeout
logic, not by msync() or page faulting itself or anything like that.

Linus
--

To: <torvalds@...>
Cc: <peterz@...>, <miklos@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 2:11 pm

Actually all four lines do that. The first two for a write access on
a present, read-only pte, the other two for a write on a non-present

That would need a new page flag (PG_mmap_dirty?). Do we have one
available?

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 2:43 pm

Yeah, that would be bad. We probably have flags free, but those page flags
are always a pain. Scratch that.

How about just setting a per-vma dirty flag, and then instead of updating
the mtime when taking the dirty-page fault, we just set that flag?

Then, on unmap and msync, we just do

if (vma->dirty-flag) {
vma->dirty_flag = 0;
update_file_times(vma->vm_file);
}

and be done with it?

Linus
--

To: <torvalds@...>
Cc: <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 2:57 pm

But then background writeout, sync(2), etc, wouldn't update the times.
Dunno. I don't think actual _physical_ writeout matters much, so it's
not worse to be 30s early with the timestamp, than to be 30s or more
late.

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 3:08 pm

Sure it would, but only when doing the final unmap.

Did you miss the "on unmap and msync" part?

Linus
--

To: <torvalds@...>
Cc: <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 3:22 pm

No :)

What I'm saying is that the times could be left un-updated for a long
time if program doesn't do munmap() or msync(MS_SYNC) for a long time.

If program has this pattern:

mmap()
write to map
msync(MS_ASYNC)
sleep(long)
write to map
msync(MS_ASYNC)
sleep(long)
...

Then we'd never see time updates (until the program exits, but that
could be years).

Maybe this doesn't matter, I'm just saying this is a disadvantage
compared to the "update on first dirtying" approach, which would
ensure, that times are updated at least once per 30s.

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 3:35 pm

Sure.

But in those circumstances, the programmer cannot depend on the mtime
*anyway* (because there is no synchronization), so what's the downside?

Let's face it, there's exactly three possible solutions:

- the insane one: trap EVERY SINGLE instruction that does a write to the
page, and update mtime each and every time.

This one is so obviously STUPID that it's not even worth discussing
further, except to say that "yes, there is an 'exact' algorithm, but
no, we are never EVER going to use it".

- the non-exact solutions that don't give you mtime updates every time
a write to the page happens, but give *some* guarantees for things that
will update it.

This is the one I think we can do, and the only things a programmer can
impact using it is "msync()" and "munmap()", since no other operations
really have any thing to do with it in a programmer-visible way (ie a
normal "sync" operation may happen in the background and has no
progam-relevant timing information)

Other things *may* or may not update mtime (some filesystems - take
most networked one as an example - will *always* update mtime on the
server on writeback, so we cannot ever guarantee that nothing but
msync/munmap does so), but at least we'll have a minimum set of things
that people can depend on.

- the "we don't care at all solutions".

mmap(MAP_WRITE) doesn't really update times reliably after the write
has happened (but might do it *before* - maybe the mmap() itself does).

Those are the three choices, I think. We currently approximate #3. We
*can* do #2 (and there are various flavors of it). And even *aiming* for
#1 is totally insane and stupid.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:32 pm

Hi Linus,

Can we get "if the write to the page hits the disk, the mtime has hit the disk
already no less than SOME_GRANULARITY before"?

That is very important for computer forensics. Esp. in saving your ass!

Ok, now back again to making that fast :-)

Best Regards

Ingo Oeser
--

To: Ingo Oeser <ioe-lkml@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:47 pm

I certainly don't mind it if we have some tighter guarantees, but what I'd
want is:

- keep it simple. Let's face it, Linux has never ever given those
guarantees before, and it's not is if anybody has really cared. Even
now, the issue seems to be more about paper standards conformance than
anything else.

- I get worried about people playing around with the dirty bit in
particular. We have had some really rather nasty bugs here. Most of
which are totally impossible to trigger under normal loads (for
example the old random-access utorrent writable mmap issue from about
a year ago).

So these two issues - the big red danger signs flashing in my brain,
coupled with the fact that no application has apparently ever really
noticed in the last 15 years - just makes it a case where I'd like each
step of the way to be obvious and simple and no larger than really
absolutely necessary.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Oeser <ioe-lkml@...>, Miklos Szeredi <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:25 am

I have been working on getting something supported here for
because I have some very large Wall Street customers who do
care about getting the mtime updated because their backups
are getting corrupted. They are incomplete because although
their applications update files, they don't get backed up

Simple is good. However, too simple is not good. I would suggest
that we implement file time updates which make sense and if they
happen to follow POSIX, then nifty, otherwise, oh well.

Thanx...

ps
--

To: Peter Staubach <staubach@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Oeser <ioe-lkml@...>, Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Monday, January 21, 2008 - 10:36 am

Thank you very much for your support, Peter!

I'm going to submit the design document, the next version of the patch
--

To: Linus Torvalds <torvalds@...>
Cc: Ingo Oeser <ioe-lkml@...>, Miklos Szeredi <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:54 pm

On Fri, 18 Jan 2008 14:47:33 -0800 (PST)

There is one issue which is way more than just standards conformance.

When a program changes file data through mmap(), at some point the
mtime needs to be update so that backup programs know to back up the
new version of the file.

Backup programs not seeing an updated mtime is a really big deal.

--
All rights reversed.
--

To: Rik van Riel <riel@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Oeser <ioe-lkml@...>, Miklos Szeredi <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 8:50 pm

And that's fixed with the 4-line approach.

Reminds me, I've got a patch here for addressing that problem with loop mounts:

Writes to loop should update the mtime of the underlying file.

Signed-off-by: Matt Mackall <mpm@selenic.com>

Index: l/drivers/block/loop.c
===================================================================
--- l.orig/drivers/block/loop.c 2007-11-05 17:50:07.000000000 -0600
+++ l/drivers/block/loop.c 2007-11-05 19:03:51.000000000 -0600
@@ -221,6 +221,7 @@ static int do_lo_send_aops(struct loop_d
offset = pos & ((pgoff_t)PAGE_CACHE_SIZE - 1);
bv_offs = bvec->bv_offset;
len = bvec->bv_len;
+ file_update_time(file);
while (len > 0) {
sector_t IV;
unsigned size;
@@ -299,6 +300,7 @@ static int __do_lo_send_write(struct fil

set_fs(get_ds());
bw = file->f_op->write(file, buf, len, &pos);
+ file_update_time(file);
set_fs(old_fs);
if (likely(bw == len))
return 0;

--
Mathematics is the supreme nostalgia of our time.

--

To: <mpm@...>
Cc: <riel@...>, <torvalds@...>, <ioe-lkml@...>, <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Saturday, January 19, 2008 - 6:22 am

--

To: Miklos Szeredi <miklos@...>
Cc: <riel@...>, <torvalds@...>, <ioe-lkml@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Saturday, January 19, 2008 - 11:49 am

Yes, this second case is redundant. Still needed in the first case.

--
Mathematics is the supreme nostalgia of our time.

--

To: Matt Mackall <mpm@...>
Cc: Linus Torvalds <torvalds@...>, Ingo Oeser <ioe-lkml@...>, Miklos Szeredi <miklos@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Saturday, January 19, 2008 - 12:25 am

On Fri, 18 Jan 2008 18:50:03 -0600

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed.
--

To: Linus Torvalds <torvalds@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 3:58 pm

The current solution doesn't hit the performance at all when compared to
the competitor POSIX-compliant systems. It is faster and does even more
than the POSIX standard requires.

Please see the test results I've sent into the thread "-v6 0/2":

http://lkml.org/lkml/2008/1/18/447

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 4:22 pm

Your current patches have two problems:
- they are simply unnecessarily invasive for a relatively simple issue
- all versions I've looked at closer are buggy too

Example:

+ if (pte_dirty(*pte) && pte_write(*pte))
+ *pte = pte_wrprotect(*pte);

Uhhuh. Looks simple enough. Except it does a non-atomic pte access while
other CPU's may be accessing it and updating it from their hw page table
walkers. What will happen? Who knows? I can see lost access bits at a
minimum.

IOW, this isn't simple code. It's code that it is simple to screw up. In
this case, you really need to use ptep_set_wrprotect(), for example.

So why not do it in many fewer lines with that simpler vma->dirty flag?

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 5:03 pm

Before using pte_wrprotect() the vma_wrprotect() routine uses the
pte_offset_map_lock() macro to get the PTE and to acquire the ptl
spinlock. Why did you say that this code was not SMP-safe? It should

Neither the dirty flag you suggest, nor the AS_MCTIME flag I've
introduced in my previous solutions solve the following problem:

- mmap()
- a write reference
- msync() with MS_ASYNC
- a write reference
- msync() with MS_ASYNC

The POSIX standard requires the ctime and mtime stamps to be updated
not later than at the second call to msync() with the MS_ASYNC flag.

Some other POSIX-compliant operating system such as HP-UX and FreeBSD
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 5:27 pm

It's atomic WITH RESPECT TO OTHER PEOPLE WHO GET THE LOCK.

Guess how much another x86 CPU cares when it sets the accessed bit in

.. and that is no excuse for bad code.

Linus
--

To: Linus Torvalds <torvalds@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:04 pm

Thank you very much for taking part in this discussion. Personally,
it's very important to me. But I'm not sure that I understand which
bit can be lost.

Please let me explain.

The logic for my vma_wrprotect() routine was taken from the
page_check_address() function in mm/rmap.c. Here is a code snippet of
the latter function:

pgd = pgd_offset(mm, address);
if (!pgd_present(*pgd))
return NULL;

pud = pud_offset(pgd, address);
if (!pud_present(*pud))
return NULL;

pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;

pte = pte_offset_map(pmd, address);
/* Make a quick check before getting the lock */
if (!pte_present(*pte)) {
pte_unmap(pte);
return NULL;
}

ptl = pte_lockptr(mm, pmd);
spin_lock(ptl);
if (pte_present(*pte) && page_to_pfn(page) == pte_pfn(*pte)) {
*ptlp = ptl;
return pte;
}
pte_unmap_unlock(pte, ptl);

The page_check_address() function is called from the
page_mkclean_one() routine as follows:

pte = page_check_address(page, mm, address, &ptl);
if (!pte)
goto out;

if (pte_dirty(*pte) || pte_write(*pte)) {
pte_t entry;

flush_cache_page(vma, address, pte_pfn(*pte));
entry = ptep_clear_flush(vma, address, pte);
entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);
set_pte_at(mm, address, pte, entry);
ret = 1;
}

pte_unmap_unlock(pte, ptl);

The write-protection of the PTE is done using the pte_wrprotect()
entity. I intended to do the same during msync() with MS_ASYNC. I
understand that I'm taking a risk of looking a complete idiot now,
however I don't see any difference between the two situations.

I presumed tha...

To: Anton Salikhmetov <salikhmetov@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:21 pm

.. and the page_mkclean_one() function is totally different.

That's a rather expensive sequence, but it's done exactly because it has
to be done that way. What it does is to

- *atomically* load the pte entry _and_ clear the old one in memory.

That's the

entry = ptep_clear_flush(vma, address, pte);

thing, and it basically means that it's doing some
architecture-specific magic to make sure that another CPU that accesses
the PTE at the same time will never actually modify the pte (because
it's clear and not valid)

- it then - while the page table is actually clear and invalid - takes
the old value and turns it into the new one:

entry = pte_wrprotect(entry);
entry = pte_mkclean(entry);

- and finally, it replaces the entry with the new one:

set_pte_at(mm, address, pte, entry);

which takes care to write the new entry in some specific way that is
atomic wrt other CPU's (ie on 32-bit x86 with a 64-bit page table
entry it writes the high word first, see the write barriers in
"native_set_pte()" in include/asm-x86/pgtable-3level.h

Now, compare that subtle and correct thing with what is *not* correct:

if (pte_dirty(*pte) && pte_write(*pte))
*pte = pte_wrprotect(*pte);

which makes no effort at all to make sure that it's safe in case another
CPU updates the accessed bit.

Now, arguably it's unlikely to cause horrible problems at least on x86,
because:

- we only do this if the pte is already marked dirty, so while we can
lose the accessed bit, we can *not* lose the dirty bit. And the
accessed bit isn't such a big deal.

- it's not doing any of the "be careful about" ordering things, but since
the really important bits aren't changing, ordering probably won't
practically matter.

But the problem is that we have something like 24 different architectures,
it's hard to make sure that none of them have issues.

In other words: it may well work in practice. But when these things go
...

To: Linus Torvalds <torvalds@...>
Cc: Miklos Szeredi <miklos@...>, <peterz@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:35 pm

Linus, I am very grateful to you for your extremely clear explanation
of the issue I have overlooked!

Back to the msync() issue, I'm going to come back with a new design
for the bug fix.

Thank you once again.

--

To: Miklos Szeredi <miklos@...>
Cc: <torvalds@...>, <peterz@...>, <miklos@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 2:28 pm

On Fri, 18 Jan 2008 19:11:47 +0100

I thought the page writing stuff looked at (and cleared) the pte
dirty bit, too?

--
All rights reversed.
--

To: <riel@...>
Cc: <miklos@...>, <torvalds@...>, <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 2:51 pm

Yeah, it does. Hmm...

What happens on munmap? The times _could_ get updated from there as
well, but it's getting complicated.

Miklos

--

To: Peter Zijlstra <peterz@...>
Cc: Miklos Szeredi <miklos@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:39 am

--

To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Thursday, January 17, 2008 - 6:31 pm

Using the PAGE_ALIGN() macro instead of "manual" alignment and
improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
mm/msync.c | 77 ++++++++++++++++++++++++++++--------------------------------
1 files changed, 36 insertions(+), 41 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..a4de868 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,85 +1,83 @@
/*
- * linux/mm/msync.c
+ * The msync() system call.
*
- * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 1994-1999 Linus Torvalds
+ * Copyright (C) 2008 Anton Salikhmetov <salikhmetov@gmail.com>
*/

-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>

/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
- * The application may now run fsync() to
- * write out the dirty pages and wait on the writeout and check the result.
- * Or the application may run fadvise(FADV_DONTNEED) against the fd to start
- * async writeout immediately.
+ * The application may now run fsync() to write out the dirty pages and
+ * wait on the writeout and check the result. Or the application may run
+ * fadvise(FADV_DONTNEED) against the fd to start async writeout immediately.
* So by _not_ starting I/O in MS_ASYNC we provide complete flexibility to
* applications.
*/
asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
{
unsigned long end;
- struct mm_struct *mm = ...

To: <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <miklos@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 5:33 am

I think you may have misunderstood my last comment. These are OK:

struct mm_struct *mm = current->mm;
int unmapped_error = 0;
int error = -EINVAL;

This is not so good:

int error, unmapped_error;

This is the worst:

int error = -EINVAL, unmapped_error = 0;

So I think the original code is fine as it is.

Othewise patch looks OK now.

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>, <r.e.wolff@...>, <hidave.darkstar@...>, <hch@...>
Date: Friday, January 18, 2008 - 6:30 am

I moved the initialization of the variables to the code where they are needed.

--

Previous thread: [PATCH 1/1] Net: e100, fix iomap mem accesses by Jiri Slaby on Thursday, January 17, 2008 - 6:28 pm. (11 messages)

Next thread: [PATCH 1/6] Modules: Fold percpu_modcopy into module.c by travis on Thursday, January 17, 2008 - 6:35 pm. (1 message)