Re: [PATCH][RFC][BUG] updating the ctime and mtime time stamps in msync()

Previous thread: STT_FUNC for assembler checksum and semaphore ops by John Reiser on Monday, January 7, 2008 - 1:23 pm. (3 messages)

Next thread: Writing tests for write back by Michael Rubin on Monday, January 7, 2008 - 2:47 pm. (1 message)
To: <linux-mm@...>
Cc: <linux-kernel@...>
Date: Monday, January 7, 2008 - 1:54 pm

From: Anton Salikhmetov <salikhmetov@gmail.com>

Due to the lack of reaction on LKML, I presume the message was lost
in the high traffic of that list. I am resending it now, addressed
to the memory management mailing list.

I would like to propose my solution for the bug #2645 from the kernel bug tracker:

http://bugzilla.kernel.org/show_bug.cgi?id=2645

The Open Group defines the behavior of the mmap() function as follows.

The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
and PROT_WRITE shall be marked for update at some point in the interval
between a write reference to the mapped region and the next call to msync()
with MS_ASYNC or MS_SYNC for that portion of the file by any process.
If there is no such call and if the underlying file is modified as a result
of a write reference, then these fields shall be marked for update at some
time after the write reference.

The above citation was taken from the following link:

http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html

Therefore, the msync() function should be called before verifying the time
stamps st_mtime and st_ctime in the test program Badari wrote in the context
of the bug #2645. Otherwise, the time stamps may be updated
at some unspecified moment according to the POSIX standard.

I changed his test program a little. The changed unit test can be downloaded
using the following link:

http://pygx.sourceforge.net/mmap.c

This program showed that the msync() function had a bug:
it did not update the st_mtime and st_ctime fields.

The program shows appropriate behavior of the msync()
function using the kernel with the proposed patch applied.
Specifically, the ctime and mtime time stamps do change
when modifying the mapped memory and do not change when
there have been no write references between the mmap()
and msync() system calls.
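For readers who cannot fetch the linked mmap.c, the kind of check described above can be sketched roughly as follows. This is not the test program from the thread: the helper name mtime_changed_after_msync, the 4096-byte mapping length and the back-dated timestamp are all assumptions made for illustration. The helper back-dates the file times, optionally stores one byte through a MAP_SHARED mapping, calls msync(MS_SYNC), and reports whether st_mtime moved.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper, not the mmap.c from the thread.
 * Returns 1 if st_mtime changed across the msync() call, 0 if it did not,
 * and -1 if any system call failed.
 */
int mtime_changed_after_msync(const char *path, int do_write)
{
    const size_t len = 4096;
    /* An arbitrary time in the past, so that any later update is visible
     * regardless of timestamp granularity. */
    const struct timespec old[2] = {
        { .tv_sec = 1000000000, .tv_nsec = 0 },   /* atime */
        { .tv_sec = 1000000000, .tv_nsec = 0 },   /* mtime */
    };
    struct stat before, after;
    char *p;
    int fd, result = -1;

    assert(path != NULL);

    fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    /* Size the file, then back-date it and record the starting mtime. */
    if (ftruncate(fd, (off_t)len) != 0 ||
        futimens(fd, old) != 0 ||
        fstat(fd, &before) != 0)
        goto out;

    p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (p == MAP_FAILED)
        goto out;

    if (do_write)
        p[0] = 'x';   /* the "write reference" from the quoted text */

    if (msync(p, len, MS_SYNC) == 0 && fstat(fd, &after) == 0)
        result = (after.st_mtime != before.st_mtime);

    munmap(p, len);
out:
    close(fd);
    return result;
}
```

Under the behavior quoted from the Open Group text, a call with do_write set to 1 would be expected to return 1, and a call with do_write set to 0 would be expected to return 0. The bug discussed in this thread corresponds to the first call returning 0.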

Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
...

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 5:18 pm

Anton Salikhmetov wrote:
> From: Anton Salikhmetov <salikhmetov@gmail.com>
>
> I would like to propose my solution for the bug #2645 from the kernel
> bug tracker:
>
> http://bugzilla.kernel.org/show_bug.cgi?id=2645
>
> The Open Group defines the behavior of the mmap() function as follows.
>
> The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED
> and PROT_WRITE shall be marked for update at some point in the interval
> between a write reference to the mapped region and the next call to msync()
> with MS_ASYNC or MS_SYNC for that portion of the file by any process.
> If there is no such call and if the underlying file is modified as a result
> of a write reference, then these fields shall be marked for update at some
> time after the write reference.
>
> The above citation was taken from the following link:
>
> http://www.opengroup.org/onlinepubs/009695399/functions/mmap.html
>
> Therefore, the msync() function should be called before verifying the time
> stamps st_mtime and st_ctime in the test program Badari wrote in the context
> of the bug #2645. Otherwise, the time stamps may be updated
> at some unspecified moment according to the POSIX standard.
>
> I changed his test program a little. The changed unit test can be downloaded
> using the following link:
>
> http://pygx.sourceforge.net/mmap.c
>
> This program showed that the msync() function had a bug:
> it did not update the st_mtime and st_ctime fields.
>
> The program shows the appropriate behavior of the msync()
> function using the kernel with the proposed patch applied.
> Specifically, the ctime and mtime time stamps do change
> when modifying the mapped memory and do not change when
> there have been no write references between the mmap()
> and msync() system calls.
>
>

Sorry, I don't see where the test pro...

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 4:50 pm

On Mon, 07 Jan 2008 20:54:19 +0300

As long as the ctime and mtime stamps change when the memory is
written to, what exactly is the problem?

Is it that the ctime and mtime do not change again when the memory
is written to again?

Is there a way for backup programs to miss file modification times?

Could you explain (using short words and simple sentences) what the
exact problem is?

Eg.

1) program mmaps file
2) program writes to mmaped area
3) ??? <=== this part, in equally simple words :)
4) data loss

An explanation like that will help people understand exactly what the

Due to the various cleanups all being in one file, it took me a while
to understand the patch. In an area of code this subtle, it may be
better to submit the cleanups in (a) separate patch(es) from the patch

I wonder if calling file_update_time() from inside the loop is the
best idea. Why not call that function just once, after msync breaks
from the loop?

thanks,

Rik
--
All Rights Reversed
--

To: Rik van Riel <riel@...>
Cc: <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 8:40 pm

Now I'm working on my next solution for this bug. It will probably

That function should be called inside of the loop because the memory
region, which msync() is called with, may contain pages mapped for
--

To: Rik van Riel <riel@...>
Cc: Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 5:01 pm

A "not" is missing from the sentence above. The quote above should have

So essentially the problem is that mtime stamps are _never_ changed when
the file is only modified through mmap. Not even when calling msync().

--
Kind regards,
Klaus S. Madsen
--

To: Rik van Riel <riel@...>
Cc: Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 5:06 pm

It's like this:

Monday 9:04AM: System boots, database server starts up, mmaps file
Monday 9:06AM: Database server writes to mmap area, updates mtime/ctime
Monday <many times> Database server writes to mmap area, no further update..
Monday 11:45PM: Backup sees "file modified 9:06AM, let's back it up"
Tuesday 9:00AM-5:00PM: Database server touches it another 5,398 times, no mtime
Tuesday 11:45PM: Backup sees "file modified back on Monday, we backed this up..
Wed 9:00AM-5:00PM: More updates, more not touching the mtime
Wed 11:45PM: *yawn* It hasn't been touched in 2 days, no sense in backing it up..

Lather, rinse, repeat....
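The timeline above can be replayed with a small model of an mtime-driven incremental backup. This is only a sketch: the struct and function names are hypothetical, times are counted in minutes since Monday 00:00, and real backup tools such as those named elsewhere in this thread use more elaborate rules than a single mtime comparison.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model; times are minutes since Monday 00:00. */
struct backup_log {
    long last_run;   /* time of the previous backup run, -1 if none yet */
    int copies;      /* how many runs actually copied the file */
};

/* One nightly incremental run: the file is copied only if its mtime is
 * newer than the previous run. */
void nightly_backup(struct backup_log *log, long file_mtime, long now)
{
    assert(log != NULL);
    if (file_mtime > log->last_run)
        log->copies++;
    log->last_run = now;
}

/* Replays Monday to Wednesday. If mtime_tracks_writes is 0, the mtime stays
 * stuck at Monday 9:06AM; otherwise it follows the last write of each day. */
int replay_three_nights(int mtime_tracks_writes)
{
    const long day = 24 * 60;
    long mtime = 9 * 60 + 6;                 /* Monday 9:06AM: first write */
    struct backup_log log = { -1, 0 };

    for (int d = 0; d < 3; d++) {
        if (mtime_tracks_writes)
            mtime = d * day + 17 * 60;       /* last write that day, 5:00PM */
        nightly_backup(&log, mtime, d * day + 23 * 60 + 45);
    }
    return log.copies;
}
```

In this model replay_three_nights(0) returns 1, meaning only Monday's run copies the file, while replay_three_nights(1) returns 3.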

To: <Valdis.Kletnieks@...>
Cc: Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 6:06 pm

On Wed, 09 Jan 2008 16:06:17 -0500

On the other hand, updating the mtime and ctime whenever a page is dirtied
also does not work right. Apparently that can break mutt.

Calling msync() every once in a while with Anton's patch does not look like a
foolproof method to me either, because the VM can write all the dirty pages
to disk by itself, leaving nothing for msync() to detect. (I think...)

Can we get by with simply updating the ctime and mtime every time msync()
is called, regardless of whether or not the mmaped pages were still dirty
by the time we called msync() ?

--
All Rights Reversed
--

To: Rik van Riel <riel@...>
Cc: <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 8:48 pm

Please tell me why you think that can break mutt. Such an approach was
--

To: Rik van Riel <riel@...>
Cc: <Valdis.Kletnieks@...>, Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 6:33 pm

On Wed, Jan 09, 2008 at 05:06:33PM -0500, Rik van Riel wrote:

Just verified this at one customer site; they had a db that was last backed up

Thinking back on the atime discussion, I bet there would be some performance

Reading the man page:
"The st_ctime and st_mtime fields of a file that is mapped with MAP_SHARED and
PROT_WRITE will be marked for update at some point in the interval between a
write reference to the mapped region and the next call to msync() with
MS_ASYNC or MS_SYNC for that portion of the file by any process. If there is no
such call, these fields may be marked for update at any time after a write
reference if the underlying file is modified as a result."

So, whenever someone writes in the region, this must cause us to update the
mtime/ctime no later than at the time of the next call to msync().

Could one do something like the lazy atime updates, coupled with a forced flush

The update must still happen, eventually, after a write to the mapped region
followed by an unmap/close even if no msync is ever called.

The msync only serves as a "no later than" deadline. The write to the region
triggers the need for the update.

At least this is how I read the standard - please feel free to correct me if I
am mistaken.

--

/ jakob

--

To: Jakob Oestergaard <jakob@...>
Cc: <Valdis.Kletnieks@...>, Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 7:41 pm

On Wed, 9 Jan 2008 23:33:40 +0100

You are absolutely right. If we wrote dirty pages to disk, the ctime
and mtime updates must happen no later than msync or close time.

I guess a third possible time (if we want to minimize the number of
updates) would be when natural syncing of the file data to disk, by
other things in the VM, would be about to clear the I_DIRTY_PAGES
flag on the inode. That way we do not need to remember any special
"we already flushed all dirty data, but we have not updated the mtime
and ctime yet" state.

Does this sound reasonable?

--
All rights reversed.
--

To: Rik van Riel <riel@...>
Cc: Jakob Oestergaard <jakob@...>, Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 4:48 pm

Is it possible that a *very* large file (multi-gigabyte or even bigger database,
for example) would never get out of I_DIRTY_PAGES, because there's always a
few dozen just-recently dirtied pages that haven't made it out to disk yet?

Of course, getting a *consistent* backup of a file like that is quite the
challenge already, because of the high likelihood of the file being changed
while the backup runs - that's why big sites often do a 'quiesce/snapshot/wakeup'
on a database and then backup the snapshot...

To: Rik van Riel <riel@...>
Cc: Jakob Oestergaard <jakob@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 8:03 pm

[Empty message]
To: Anton Salikhmetov <salikhmetov@...>
Cc: Rik van Riel <riel@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 4:51 am

On Thu, Jan 10, 2008 at 03:03:03AM +0300, Anton Salikhmetov wrote:

If the update was done as Rik suggested, with the addition that msync()
triggered an explicit sync of the inode data, then everything would be ok,
right?

--

/ jakob

--

To: Jakob Oestergaard <jakob@...>, Anton Salikhmetov <salikhmetov@...>, Rik van Riel <riel@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 6:53 am

Indeed, if msync() is called with MS_SYNC an explicit sync is
triggered, and Rik's suggestion would work. However, the POSIX
standard requires a call to msync() with MS_ASYNC to update the
st_ctime and st_mtime stamps too. No explicit sync of the inode data
is triggered in the current implementation of msync(). Hence Rik's
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: Jakob Oestergaard <jakob@...>, Anton Salikhmetov <salikhmetov@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 11:45 am

On Thu, 10 Jan 2008 13:53:59 +0300

Since your patch is already changing msync(), has it occurred
to you that your patch could change msync() to do the right
thing?

--
All rights reversed.
--

To: Rik van Riel <riel@...>
Cc: Jakob Oestergaard <jakob@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 11:56 am

These changes catch the simple case, where the file is mmap'd,
modified via the mmap'd region, and then an msync is done,
all on a mostly quiet system.

However, I don't see how they will work if something like a
sync(2) is done between the modification of the mmap'd region
and the msync call. When the inode is written out
as part of the sync process, I_DIRTY_PAGES will be cleared,
thus causing a miss in this code.

The I_DIRTY_PAGES check here is good, but I think that there
needs to be some code elsewhere too, to catch the case where
I_DIRTY_PAGES is being cleared, but the time fields still need
to be updated.

<<<

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: Jakob Oestergaard <jakob@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 12:07 pm

On Thu, 10 Jan 2008 18:56:07 +0300

Agreed. The mtime and ctime should probably also be updated
when I_DIRTY_PAGES is cleared.

The alternative would be to remember that the inode had been
dirty in the past, and have the mtime and ctime updated on
msync or close - which would be more complex.

--
All rights reversed.
--

To: Rik van Riel <riel@...>
Cc: Anton Salikhmetov <salikhmetov@...>, Jakob Oestergaard <jakob@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 12:46 pm

And also remembering that the file times should not be updated
if the pages were modified via a write(2) operation. Or if
there has been an intervening write(2) operation...

The number of cases to consider and the boundary conditions
quickly make this reasonably complex to get right. That's why
this is the 4th or 5th attempt in the last 18 months or so
to get this situation addressed.

ps
--

To: Rik van Riel <riel@...>
Cc: Jakob Oestergaard <jakob@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 12:40 pm

Adding the new flag (AS_MCTIME) has been already suggested by Peter
Staubach in his first solution for this bug. Now I understand that the
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: Rik van Riel <riel@...>, Jakob Oestergaard <jakob@...>, <Valdis.Kletnieks@...>, <linux-mm@...>, <linux-kernel@...>
Date: Thursday, January 10, 2008 - 12:52 pm

Well, that was the approach before we had I_DIRTY_PAGES. I am
still wondering whether we can get this approach to work, with
a little more support and heuristics. PeterZ's work to better
track dirty pages should be helpful in determining when and why
a page was dirty.

I keep thinking that by recording the time when a page was found
to be dirty and the file is mmap'd and then updating the mtime
and ctime fields in the inode during msync() and sync_single_inode()
if that time is newer than the mtime and ctime fields, then we
can solve the problem of when and when not to update those two
time fields.

I haven't had a chance to think it all through completely or do
the appropriate analysis yet though.

ps
--

To: Rik van Riel <riel@...>
Cc: <Valdis.Kletnieks@...>, Anton Salikhmetov <salikhmetov@...>, <linux-mm@...>, <linux-kernel@...>
Date: Wednesday, January 9, 2008 - 6:19 pm

Could you elaborate on why that would break mutt? I am assuming
that the pages being modified are mmap'd, but if they are not, then

As long as we can keep track of that information and then remember
it for an munmap so that eventually the file times do get updated,
then this should work.

It would seem that a better solution would be to update the file
times whenever the inode gets cleaned, ie. modified pages written
out and the inode synchronized to the disk. That way, long running
programs would not have to msync occasionally in order to have
the data file properly backed up.

Thanx...

ps
--

To: <linux-mm@...>
Cc: <linux-kernel@...>, <joe@...>
Date: Wednesday, January 9, 2008 - 7:32 am

Since no reaction was received on LKML for this message, it seemed
logical to suggest closing the bug #2645 as "WONTFIX":

http://bugzilla.kernel.org/show_bug.cgi?id=2645#c15

However, the reporter of the bug, Jakob Oestergaard, insisted the

Please re-submit to LKML.

This bug causes backup systems to *miss* changed files.

This bug does cause data loss in common real-world deployments (I gave an
example with a database when posting the bug, but this affects the data from
all mmap using applications with common backup systems).

Silent exclusion from backups is very very nasty.

<<<

Please comment on my solution or commit it if it's acceptable in its
present form.

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <linux-kernel@...>, <joe@...>
Date: Wednesday, January 9, 2008 - 5:28 pm

Yes, please!

Let's have the right discussion and get this bug addressed for real.
It is a real bug and is causing data corruption for some very large
Red Hat customers because their applications were architected to
use mmap, but their backups are not backing up the modified files
due to this aspect of the system.

This is the 4th or 5th attempt in the last 2 years to submit a
patch to address this situation. None have been able to make it
all of the way through the process and to be integrated.

I posted some comments.

Thanx...

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <linux-kernel@...>, <joe@...>
Date: Wednesday, January 9, 2008 - 10:41 am

Good idea. The bug is real and should be fixed IMHO.

Not just backup systems, but any application that relies on mtime
Agreed.

In fact if mtime is not reliable (which it is not) one could argue
that we might as well not update it at all, ever. But I think we can
I've only looked briefly at your patch but it seems reasonable. I'll
try to do some testing with it later.

Thank you for working on this long standing bug.

...

I agree that our current behaviour is certainly not what the standard
(sensibly) requires.

--
Jesper Juhl <jesper.juhl@gmail.com>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html
--

To: Jesper Juhl <jesper.juhl@...>
Cc: <linux-mm@...>, <linux-kernel@...>, <joe@...>
Date: Wednesday, January 9, 2008 - 11:31 am

Jesper, thank you very much for your answer!

In fact, I tested my change quite extensively using test cases for the
mmap() and msync() system calls from the LTP test suite. Please note

Additionally, the test cases for the msync() system call from
the LTP test suite (msync01 - msync05, mmapstress01, mmapstress09,
and mmapstress10) successfully passed using the kernel
with the patch included into this email.

<<<
--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <linux-kernel@...>, <joe@...>
Date: Wednesday, January 9, 2008 - 8:22 am

On Wed, Jan 09, 2008 at 02:32:53PM +0300, Anton Salikhmetov wrote:

This problem is seen with both Amanda and TSM (Tivoli Storage Manager).

A site running Amanda with, say, a full backup weekly and incremental backups
daily, will only get weekly backups of their mmap modified databases.

However, large sites running TSM will be hit even harder by this because TSM
will always perform incremental backups from the client (managing which
versions to keep for how long on the server side). TSM will *never* again take
a backup of the mmap modified database.

The really nasty part is: nothing is failing. Everything *appears* to work.
Your data is just not backed up because it appears to be untouched.

So, if you run TSM (or pretty much any other backup solution actually) on
Linux, maybe you should run a
    find / -type f -print0 | xargs -0 touch
before starting your backup job. Sort of removes the point of using proper
backup software, but at least you'll get your data backed up.

--

/ jakob

--

To: Anton Salikhmetov <salikhmetov@...>
Cc: <linux-mm@...>, <linux-kernel@...>, <joe@...>
Date: Wednesday, January 9, 2008 - 7:47 am

Thank you!

A quick run-down for those who don't know what this is about:

Some applications use mmap() to modify files. Common examples are databases.

Linux does not update the mtime of files that are modified using mmap, even if
msync() is called.

This is very clearly against OpenGroup specifications.

This misfeature causes such files to be silently *excluded* from normal backup
runs.

Solaris implements this properly.

NT has the same bug as Linux, using their private bastardisation of the mmap
interface - but since they don't care about SuS and are broken in so many other
ways, that really doesn't matter.

So, dear kernel developers, can we please integrate this patch to make Linux
stop silently excluding people's databases from their backup?

--

/ jakob

--
