[PATCH 0/2] yet another attempt to fix the ctime and mtime issue

Previous thread: Re: Huawei EC321 CDMA PCCARD support broken by Zhang Weiwu on Saturday, January 12, 2008 - 11:26 pm. (1 message)

Next thread: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.23.13] by Abhishek Rai on Sunday, January 13, 2008 - 1:41 am. (4 messages)
To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Sunday, January 13, 2008 - 12:39 am

The POSIX standard requires that the ctime and mtime fields
for memory-mapped files should be updated after a write
reference to the memory region where the file data is mapped.
At least FreeBSD 6.2 and HP-UX 11i implement this properly.
Linux does not, which leads to data loss problems in database
backup applications.

Kernel Bug Tracker contains more information about the problem:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

There have been several attempts in the past to address this
issue. Following are a few links to LKML discussions related
to this bug:

http://lkml.org/lkml/2006/5/17/138
http://lkml.org/lkml/2007/2/21/242
http://lkml.org/lkml/2008/1/7/234

All earlier solutions were criticized. Some solutions did not
handle memory-mapped block devices properly. Some led to forcing
applications to explicitly call msync() to update file metadata.
Some contained errors in using kernel synchronization primitives.

In the two patches that follow, I would like to propose a new
solution.

This is the third version of my changes. This version takes
into account all feedback I received for the two previous versions.
The overall design remains basically the same as the one that
was acked by Rick van Riel:

http://lkml.org/lkml/2008/1/11/208

To the best of my knowledge, these patches are free of all the
drawbacks found during previous attempts by Peter Staubach,
Miklos Szeredi and myself.

New since the previous version:

1) no need to explicitly call msync() to update file times;
2) changing block device data is visible to all device files
associated with the block device;
3) in the cleanup part, the error checks are separated out as
suggested by Rik van Riel;
4) some small refinements accodring to the LKML comments.

This is how I tested the patches.

1. To test the features mentioned above, I wrote a unit test
available from

http://bugzilla.kernel.org/attachment.cgi?id=14430

I verified that the unit test passed successfully for both
r...

To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Sunday, January 13, 2008 - 12:39 am

http://bugzilla.kernel.org/show_bug.cgi?id=2645

Changes for updating the ctime and mtime fields for memory-mapped files:

1) new flag triggering update of the inode data;
2) new function to update ctime and mtime for block device files;
3) new helper function to update ctime and mtime when needed;
4) updating time stamps for mapped files in sys_msync() and do_fsync();
5) implementing the feature of auto-updating ctime and mtime.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
fs/buffer.c | 1 +
fs/fs-writeback.c | 2 ++
fs/inode.c | 42 +++++++++++++++++++++++++++++++++++-------
fs/sync.c | 2 ++
include/linux/fs.h | 9 ++++++++-
include/linux/pagemap.h | 3 ++-
mm/msync.c | 24 ++++++++++++++++--------
mm/page-writeback.c | 1 +
8 files changed, 67 insertions(+), 17 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 7249e01..09adf7e 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -719,6 +719,7 @@ static int __set_page_dirty(struct page *page,
}
write_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+ set_bit(AS_MCTIME, &mapping->flags);

return 1;
}
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 0fca820..c25ebd5 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -243,6 +243,8 @@ __sync_single_inode(struct inode *inode, struct writeback_control *wbc)

spin_unlock(&inode_lock);

+ mapping_update_time(mapping);
+
ret = do_writepages(mapping, wbc);

/* Don't write the inode if only I_DIRTY_PAGES was set */
diff --git a/fs/inode.c b/fs/inode.c
index ed35383..c02bfab 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1243,8 +1243,8 @@ void touch_atime(struct vfsmount *mnt, struct dentry *dentry)
EXPORT_SYMBOL(touch_atime);

/**
- * file_update_time - update mtime and ctime time
- * @file: file accessed
+ * inode_update_time - update mtime and ctime time
+ * @i...

To: <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 7:08 am

How exactly is this done?

Is this catering for this case:

1 page is dirtied through mapping
2 app calls msync(MS_ASYNC)
3 page is written again through mapping
4 app calls msync(MS_ASYNC)
5 ...
6 page is written back

What happens at 4? Do we care about this one at all?

Umm, why not just call with file->f_dentry->d_inode, so that you don't
need to do this ugly search for the physical inode? The file pointer
--

To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 8:22 am

The POSIX standard requires updating the file times every time when msync()
is called with MS_ASYNC. I.e. the time stamps should be updated even
when no physical synchronization is being done immediately.

I'm not sure if I undestood your question. I see two possible
interpretations for this
question, and I'm answering both.

The intention was to make the data changes in the block device data visible to
all device files associated with the block device. Hence the search,
because the time
stamps for all such device files should be updated as well.

Not only the sys_msync() and do_fsync() routines call the helper function
mapping_update_time(). Not all call sites of the helper function have the file
pointer available. Therefore, I pass the mapping pointer to the helper function
--

To: <salikhmetov@...>
Cc: <miklos@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 9:14 am

[Empty message]
To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 2:59 pm

The __sync_single_inode() function calls mapping_update_time()
--

To: Miklos Szeredi <miklos@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 9:35 am

I must agree, doing the mmap dirty, MS_ASYNC, mmap retouch, MS_ASYNC
case correctly would need a lot more code which I doubt is worth the
effort.

It would require scanning the PTEs and marking them read-only again on
MS_ASYNC, and some more logic in set_page_dirty() because that currently
bails out early if the page in question is already dirty.

--

To: Peter Zijlstra <a.p.zijlstra@...>
Cc: Miklos Szeredi <miklos@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 10:17 am

Thanks for your review, Peter and Miklos!

I overlooked this case when AS_MCTIME flag has been turned off and the
page is still dirty.

On the other hand, the words "shall be marked for update" may be
considered as just setting the AS_MCTIME flag, not updating the time
stamps.

What do you think about calling mapping_update_time() inside of "if
(MS_SYNC & flags)"? I suggest such change because the code for
--

To: <salikhmetov@...>
Cc: <a.p.zijlstra@...>, <miklos@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>
Date: Tuesday, January 15, 2008 - 5:53 am

I think that's a good idea. As a first iteration, just updating the
mtime/ctime in msync(MS_SYNC) and remove_vma() (called at munmap time)
would be a big improvement over what we currently have.

I would also recommend, that you drop mapping_update_time() and the
related functions from the patch, and just use file_update_time()
instead.

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <a.p.zijlstra@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>
Date: Tuesday, January 15, 2008 - 6:46 am

Thank you for your recommendations. I will submit my new solution shortly.

--

To: Miklos Szeredi <miklos@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, Nick Piggin <nickpiggin@...>
Date: Monday, January 14, 2008 - 9:39 am

More fun, it would require marking them RO but leaving the dirty bit
set, because this ext3 fudge where we confuse the page dirty state - or
did that get fixed?

--

To: <peterz@...>
Cc: <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <nickpiggin@...>
Date: Monday, January 14, 2008 - 9:45 am

That got fixed by Nick, I think.

The alternative to marking pages RO, is to walk the PTEs in MS_ASYNC,
note the dirty bit and mark pages clean. But it's possibly even more
complicated.

Miklos
--

To: <miklos@...>
Cc: <peterz@...>, <salikhmetov@...>, <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <akpm@...>, <protasnb@...>, <nickpiggin@...>
Date: Monday, January 14, 2008 - 9:47 am

^^^^
ptes, I mean
--

To: <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 7:15 am

Oh, and here's a test program I wrote, that can be used to check this
behavior. It has two options:

-s use MS_SYNC instead of MS_ASYNC
-f fork and do the msync on a different mapping

Back then I haven't found a single OS, that fully conformed to all the
stupid POSIX rules regarding mmaps and ctime/mtime.

Miklos
----

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <string.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/wait.h>

static const char *filename;
static int msync_flag = MS_ASYNC;
static int msync_fork = 0;

static void print_times(const char *msg)
{
struct stat stbuf;
stat(filename, &stbuf);
printf("%s\t%li\t%li\t%li\n", msg, stbuf.st_ctime, stbuf.st_mtime,
stbuf.st_atime);
}

static void do_msync(void *addr, int len)
{
int res;
if (!msync_fork) {
res = msync(addr, len, msync_flag);
if (res == -1) {
perror("msync");
exit(1);
}
} else {
int pid = fork();
if (pid == -1) {
perror("fork");
exit(1);
}
if (!pid) {
int fd = open(filename, O_RDWR);
if (fd == -1) {
perror("open");
exit(1);
}
addr = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (addr == MAP_FAILED) {
perror("mmap");
exit(1);
}
res = msync(addr, len, msync_flag);
if (res == -1) {
perror("msync");
exit(1);
}
exit(0);
}
wait(NULL);
}
}

static void usage(const char *progname)
{
fprintf(stderr, "usage: %s filename [-sf]\n", progname);
exit(1);
}

int main(int argc, char *argv[])
{
int res;
char *addr;
int fd;

if (argc < 2)
usage(argv...

To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 8:25 am

Thank you very much for sharing your code.

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Sunday, January 13, 2008 - 12:59 am

On Sun, 13 Jan 2008 07:39:59 +0300

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed.
--

To: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Sunday, January 13, 2008 - 12:39 am

Substantial code cleanup of the sys_msync() function:

1) using the PAGE_ALIGN() macro instead of "manual" alignment;
2) improved readability of the loop traversing the process memory regions.

Signed-off-by: Anton Salikhmetov <salikhmetov@gmail.com>
---
mm/msync.c | 78 +++++++++++++++++++++++++++---------------------------------
1 files changed, 35 insertions(+), 43 deletions(-)

diff --git a/mm/msync.c b/mm/msync.c
index 144a757..ff654c9 100644
--- a/mm/msync.c
+++ b/mm/msync.c
@@ -1,24 +1,25 @@
/*
* linux/mm/msync.c
*
+ * The msync() system call.
* Copyright (C) 1994-1999 Linus Torvalds
+ *
+ * Substantial code cleanup.
+ * Copyright (C) 2008 Anton Salikhmetov <salikhmetov@gmail.com>
*/

-/*
- * The msync() system call.
- */
+#include <linux/file.h>
#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/mman.h>
-#include <linux/file.h>
-#include <linux/syscalls.h>
#include <linux/sched.h>
+#include <linux/syscalls.h>

/*
* MS_SYNC syncs the entire file - including mappings.
*
* MS_ASYNC does not start I/O (it used to, up to 2.5.67).
- * Nor does it marks the relevant pages dirty (it used to up to 2.6.17).
+ * Nor does it mark the relevant pages dirty (it used to up to 2.6.17).
* Now it doesn't do anything, since dirty pages are properly tracked.
*
* The application may now run fsync() to
@@ -33,71 +34,62 @@ asmlinkage long sys_msync(unsigned long start, size_t len, int flags)
unsigned long end;
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- int unmapped_error = 0;
- int error = -EINVAL;
+ int error = 0, unmapped_error = 0;

if (flags & ~(MS_ASYNC | MS_INVALIDATE | MS_SYNC))
- goto out;
+ return -EINVAL;
if (start & ~PAGE_MASK)
- goto out;
+ return -EINVAL;
if ((flags & MS_ASYNC) && (flags & MS_SYNC))
- goto out;
- error = -ENOMEM;
- len = (len + ~PAGE_MASK) & PAGE_MASK;
+ return -EIN...

To: <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 6:49 am

Where did this line go? It's needed because after releasing and
reacquiring the mmap sem, the old vma may have gone away.

I suggest, that when doing such a massive cleanup, that you split it
up even further into easily understandable pieces, so such bugs cannot

Miklos
--

To: Miklos Szeredi <miklos@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <riel@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Monday, January 14, 2008 - 7:56 am

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <jakob@...>, <linux-kernel@...>, <valdis.kletnieks@...>, <ksm@...>, <staubach@...>, <jesper.juhl@...>, <torvalds@...>, <a.p.zijlstra@...>, <akpm@...>, <protasnb@...>
Date: Sunday, January 13, 2008 - 12:46 am

On Sun, 13 Jan 2008 07:39:58 +0300

Acked-by: Rik van Riel <riel@redhat.com>

--
All rights reversed.
--

Previous thread: Re: Huawei EC321 CDMA PCCARD support broken by Zhang Weiwu on Saturday, January 12, 2008 - 11:26 pm. (1 message)

Next thread: [CALL FOR TESTING] Make Ext3 fsck way faster [2.6.23.13] by Abhishek Rai on Sunday, January 13, 2008 - 1:41 am. (4 messages)