Re: [RFC] [PATCH 3/3] Recursive mtime for ext3

Previous thread: [PATCH] r/o bind mounts: fix buggy loop by Dave Hansen on Tuesday, November 6, 2007 - 9:59 am. (1 message)

Next thread: Use of virtio device IDs by Anthony Liguori on Tuesday, November 6, 2007 - 10:16 am. (13 messages)
From: Jan Kara
Date: Tuesday, November 6, 2007 - 10:15 am

Hello,

  the following three patches implement a recursive mtime feature for
ext3. The first two patches are mostly clean-ups; the third patch
implements the feature itself. If somebody is interested in testing this
(or even in writing support for this feature in rsync and similar tools),
please contact me. Attached are the sources of two simple tools, set_recmod
and get_recmod, for testing the feature, and also a patch implementing basic
support for the feature in e2fsprogs. Comments welcome.

								Honza

-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
From: Jan Kara
Date: Tuesday, November 6, 2007 - 10:18 am

Hello,

  the following patch makes more lightweight handling of
EXT3_I(inode)->i_flags possible.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
---

Implement atomic updates of EXT3_I(inode)->i_flags. So far, access to i_flags
was guarded mostly by i_mutex, but this is quite heavy-weight. We now use
inode->i_lock to protect i_flags reads and updates in ext3. This patch
introduces a bogus warning that jflag and oldflags may be uninitialized -
does anyone know how to cleanly get rid of it?

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/dir.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c
--- linux-2.6.23/fs/ext3/dir.c	2007-10-11 12:01:23.000000000 +0200
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/dir.c	2007-11-05 14:04:56.000000000 +0100
@@ -108,10 +108,10 @@ static int ext3_readdir(struct file * fi
 	sb = inode->i_sb;
 
 #ifdef CONFIG_EXT3_INDEX
-	if (EXT3_HAS_COMPAT_FEATURE(inode->i_sb,
-				    EXT3_FEATURE_COMPAT_DIR_INDEX) &&
-	    ((EXT3_I(inode)->i_flags & EXT3_INDEX_FL) ||
-	     ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
+	if (is_dx(inode) ||
+	    (EXT3_HAS_COMPAT_FEATURE(inode->i_sb, \
+					EXT3_FEATURE_COMPAT_DIR_INDEX) &&
+	     (inode->i_size >> sb->s_blocksize_bits) == 1)) {
 		err = ext3_dx_readdir(filp, dirent, filldir);
 		if (err != ERR_BAD_DX_DIR) {
 			ret = err;
@@ -121,7 +121,9 @@ static int ext3_readdir(struct file * fi
 		 * We don't set the inode dirty flag since it's not
 		 * critical that it get flushed back to the disk.
 		 */
+		spin_lock(&inode->i_lock);
 		EXT3_I(filp->f_path.dentry->d_inode)->i_flags &= ~EXT3_INDEX_FL;
+		spin_unlock(&inode->i_lock);
 	}
 #endif
 	stored = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23/fs/ext3/ialloc.c linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c
--- linux-2.6.23/fs/ext3/ialloc.c	2006-11-29 22:57:37.000000000 +0100
+++ linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c	2007-11-05 14:14:50.000000000 +0100
@@ ...
From: Jan Kara
Date: Tuesday, November 6, 2007 - 10:19 am

Mark the space reserved for fragments as unused, as fragments were never
implemented. Also remove the related initializations. We later use the space
for recursive mtime.

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/ialloc.c	2007-11-05 14:14:50.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/ialloc.c	2007-11-05 14:37:33.000000000 +0100
@@ -576,11 +576,6 @@ got:
 	/* dirsync only applies to directories */
 	if (!S_ISDIR(mode))
 		ei->i_flags &= ~EXT3_DIRSYNC_FL;
-#ifdef EXT3_FRAGMENTS
-	ei->i_faddr = 0;
-	ei->i_frag_no = 0;
-	ei->i_frag_size = 0;
-#endif
 	ei->i_file_acl = 0;
 	ei->i_dir_acl = 0;
 	ei->i_dtime = 0;
diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c
--- linux-2.6.23-1-i_flags_atomicity/fs/ext3/inode.c	2007-11-05 14:24:39.000000000 +0100
+++ linux-2.6.23-2-make_flags_unused/fs/ext3/inode.c	2007-11-05 14:38:05.000000000 +0100
@@ -2651,11 +2651,6 @@ void ext3_read_inode(struct inode * inod
 	}
 	inode->i_blocks = le32_to_cpu(raw_inode->i_blocks);
 	ei->i_flags = le32_to_cpu(raw_inode->i_flags);
-#ifdef EXT3_FRAGMENTS
-	ei->i_faddr = le32_to_cpu(raw_inode->i_faddr);
-	ei->i_frag_no = raw_inode->i_frag;
-	ei->i_frag_size = raw_inode->i_fsize;
-#endif
 	ei->i_file_acl = le32_to_cpu(raw_inode->i_file_acl);
 	if (!S_ISREG(inode->i_mode)) {
 		ei->i_dir_acl = le32_to_cpu(raw_inode->i_dir_acl);
@@ -2790,11 +2785,6 @@ static int ext3_do_update_inode(handle_t
 	spin_lock(&inode->i_lock);
 	raw_inode->i_flags = cpu_to_le32(ei->i_flags);
 	spin_unlock(&inode->i_lock);
-#ifdef EXT3_FRAGMENTS
-	raw_inode->i_faddr = cpu_to_le32(ei->i_faddr);
-	raw_inode->i_frag = ei->i_frag_no;
-	raw_inode->i_fsize = ei->i_frag_size;
-#endif
 	raw_inode->i_file_acl = cpu_to_le32(ei->i_file_acl);
 	if ...
From: Jan Kara
Date: Tuesday, November 6, 2007 - 10:19 am

Implement recursive mtime (rtime) feature for ext3. The feature works as
follows: In each directory we keep a flag EXT3_RTIME_FL (modifiable by a user)
indicating whether rtime should be updated. When a directory or a file in it
is modified and the flag is set, the directory's rtime is updated, the flag is
cleared, and we move to the parent. If the flag is set there, we clear it,
update rtime and continue upwards, up to the root of the filesystem. When a
regular file or symlink is modified, we pick an arbitrary one of its parents
(actually the one that happens to be at the head of the i_dentry list) and
start the rtime update algorithm there.

As the flag is always cleared after updating rtime and we don't climb up the
tree if the flag is cleared, we have constant amortized complexity of rtime
updates. That's for theoretical time consumption ;) Practically, there's no
measurable performance impact for a test case like: touch every file in a
kernel tree where every directory has RTIME flag set.
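Editorial sketch, not part of the patch: the climbing rule described above
can be modelled in user space roughly as follows. The names struct dir and
propagate_rtime, and the plain parent pointer, are hypothetical stand-ins
for the ext3 inode fields and the dentry walk the kernel code would use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical in-memory stand-in for a directory inode. */
struct dir {
    struct dir *parent;  /* NULL for the root of the filesystem */
    bool rtime_flag;     /* models EXT3_RTIME_FL */
    time_t rtime;        /* recursive modification time */
};

/*
 * Record a modification that happened in directory `start`.
 * Climbs towards the root, updating rtime and clearing the flag on each
 * directory, and stops at the first directory whose flag is already clear.
 * Returns the number of directories whose rtime was updated.
 */
int propagate_rtime(struct dir *start, time_t now)
{
    int updated = 0;

    assert(start != NULL);
    for (struct dir *d = start; d != NULL && d->rtime_flag; d = d->parent) {
        d->rtime = now;
        d->rtime_flag = false;
        updated++;
    }
    return updated;
}
```

Every step of the loop clears one flag, so the total number of rtime
updates can never exceed the number of times flags were set. That is the
amortized bound mentioned above.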

The intended use case is that an application which wants to watch for any
modification in a subtree scans the subtree and sets flags for all inodes
there. Next time, it just needs to recurse into directories having rtime newer
than the start of the previous scan. There it can handle modifications and set
the flag again. It is up to the application to watch out for hardlinked files.
It can, e.g., build a list of them and check their mtime separately (when a
hardlink to a file is created, its inode is modified and rtimes are properly
updated, and thus any application has an effective way of finding new
hardlinked files).
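Editorial sketch, not part of the patch: the scan and rescan cycle described
above might look like this from the application side. The tree is an
in-memory model; struct dir, initial_scan, rescan and MAX_CHILDREN are
hypothetical names, and a real tool would read rtime and set the flag through
whatever interface the filesystem exposes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <time.h>

#define MAX_CHILDREN 8

/* Hypothetical in-memory model of a directory as seen by the scanner. */
struct dir {
    struct dir *children[MAX_CHILDREN];
    size_t nchildren;
    bool rtime_flag;  /* models EXT3_RTIME_FL */
    time_t rtime;     /* recursive modification time */
};

/* First pass: visit every directory and arm its flag. Returns the count. */
size_t initial_scan(struct dir *d)
{
    size_t visited = 1;

    assert(d->nchildren <= MAX_CHILDREN);
    d->rtime_flag = true;
    for (size_t i = 0; i < d->nchildren; i++)
        visited += initial_scan(d->children[i]);
    return visited;
}

/*
 * Later passes: descend only into directories whose rtime is newer than the
 * start of the previous scan, and re-arm the flag once the subtree is done.
 * Returns the number of directories that had to be visited.
 */
size_t rescan(struct dir *d, time_t last_scan)
{
    size_t visited = 1;

    assert(d->nchildren <= MAX_CHILDREN);
    if (d->rtime <= last_scan)
        return 0;  /* nothing below this directory changed */

    /* The application would process the changed entries of `d` here. */
    for (size_t i = 0; i < d->nchildren; i++)
        visited += rescan(d->children[i], last_scan);
    d->rtime_flag = true;
    return visited;
}
```

Unchanged subtrees are pruned at their topmost directory, so the cost of a
rescan depends on how much changed rather than on the size of the tree.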

Signed-off-by: Jan Kara <jack@suse.cz>

diff -rupX /home/jack/.kerndiffexclude linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c
--- linux-2.6.23-2-ext3_make_frags_unused/fs/ext3/ialloc.c	2007-11-05 16:58:10.000000000 +0100
+++ linux-2.6.23-3-ext3_recursive_mtime/fs/ext3/ialloc.c	2007-11-05 16:58:53.000000000 +0100
@@ -569,7 +569,7 @@ got:
 	/* Guard reading of ...
From: Arjan van de Ven
Date: Tuesday, November 6, 2007 - 10:40 am

On Tue, 6 Nov 2007 18:19:45 +0100

Ok since mtime (and rtime) are part of the inode and not the dentry...
how do you deal with hardlinks? And with cases of files that have been
unlinked? (ok the latter is a wash obviously other than not crashing)

-- 
If you want to reach me at my work email, use arjan@linux.intel.com
For development, discussion and tips for power savings, 
visit http://www.lesswatts.org
-

From: H. Peter Anvin
Date: Tuesday, November 6, 2007 - 11:04 am

There is only one possible answer... he only updates the directory path 
that was used to touch the particular file involved.  Thus, the 
semantics gets grotty not just in the presence of hard links, but also 
in the presence of bind- and other non-root mounts.

	-hpa
-

From: Jan Kara
Date: Wednesday, November 7, 2007 - 4:51 am

Unlinked files are easy - you just don't propagate the rtime anywhere.
  Update of recursive mtime does not pass filesystem boundaries (i.e.
mountpoints) so bind mounts and such are a non-issue (hmm, at least that was
my original idea but as I'm looking now I don't handle bind mounts properly
so that needs to be fixed). With hardlinks, you are right that the
behaviour is nondeterministic - I tried to argue in the text of the mail
that this does not actually matter - there are not many hardlinks on a usual
system and so the application can check hardlinked files in the old way -
i.e. look at mtime.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Al Viro
Date: Tuesday, November 6, 2007 - 11:01 am

*ewwww*


You know, you can do that with aush^H^Hdit right now...
-

From: Jan Kara
Date: Wednesday, November 7, 2007 - 7:54 am

Oh yes, there is :) But I tried to argue it does not really matter -
application would have to handle hardlinks in a special way but I find that
  Interesting idea, no I have not thought about this. I guess you mean
watching all the VFS modification events and then doing the checking and
propagation from user space... My first feeling is that the performance
penalty would be considerably higher (currently I am at a 1% performance
penalty for quite a pessimistic test case) but in case the current patch is
considered unacceptable, I can measure how large the penalty would be. Thanks
for the suggestion.

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-

From: Theodore Tso
Date: Tuesday, November 6, 2007 - 12:40 pm

Umm, yuck.

What if more than one application wants to use this facility?

The application is using a global per-inode flag that is written out
to disk.  So sweeping the entire subtree and setting this flag will
involve a lot of disk i/o; as does setting a mod-time, since it could
potentially require a large number of inode updates, and then the
application needs to sweep through the subtree and reset the flags
(resulting in more disk i/o).  The performance would seem to me to be
really pessimal.  

In addition, after you crash, there might not be any application
waiting to watch modifications in that subtree, and yet the flags
would still be set so the system would still be paying the performance
penalties of needing to propagate modtimes until all of the flags
disappear --- and for a large subtree, that might not be for a long,
long time.

So if the goal is some kind of modification notification system that
watches a subtree efficiently, avoiding some of the deficiencies of
inotify and dnotify, the interface doesn't seem to be the right way to
go about things.  The fact that only one application at a time can use
this interface, even if you ignore the issues of hard links and the
performance problems and the lack of cleanup after a reboot, seems in
my mind to just be an irreparable fatal flaw to this particular scheme.

Regards,

						- Ted
-

From: Jan Kara
Date: Wednesday, November 7, 2007 - 7:36 am

That should be fine - let's see: Each application keeps somewhere a time when
it started a scan of a subtree (or it can actually remember a time when it
set the flag for each directory), during the scan, it sets the flag on
each directory. When it wakes up to recheck the subtree it just compares
the rtime against the stored time - if rtime is greater, subtree has been
modified since the last scan and we recurse in it and when we are finished
with it we set the flag. Now notice that we don't care about the flag when
we check for changes - we care only for rtime - so if there are several
applications interested in the same subtree, the flag just gets set more
often and thus the update of rtime happens more often but the same scheme
  I don't get it here - you need to scan the whole subtree and set the flag
only during the initial scan. Later, you need to scan and set the flag only
for directories in whose subtree something changed. Similarly, rtime needs
to be updated for each inode at most once after the scan. Maybe we have
different ideas of use-cases: I consider this useful for larger
subtrees which change only seldom (or only their small parts) or where you
want to check for changes only once per some longer time - so uses like backup
with rsync, updatedb, cachefiles for trees with config files (like KDE has)
etc. There the penalty for additional IO during rtime updates is quite
negligible - if you have some usecase you'd like to measure, please propose
it and I'll measure it. I have tested the following:
  Create a tree of depth 5 where each directory has 5 subdirectories and
the leaf directories have 10 files in them. You set the flag on all
directories (umount and mount again) and then touch one file in every directory.
  With the feature enabled this takes 36.1176s (average from 5 tests) with
deviation 0.29509. Without the feature it takes 35.75480 with deviation
0.15433. So the difference in performance is 1% which is just slightly
above the error and I'd find this test ...
From: Theodore Tso
Date: Wednesday, November 7, 2007 - 5:20 pm

OK, so in this case you don't need to set rtime on every single
file inode, but only on directory inodes, right?  Because you're only
checking the rtime at the directory level, and not the flag.
And it's just as easy for you to check the rtime flag for the file's
containing directory (modulo magic vis-a-vis hard links) as the file's
inode.

I'm just really wishing that rtime and the rtime flag didn't have to live
on disk, but could rather be in memory.  If you only needed to save
the directory flags and rtimes, that might actually be doable.

Note by the way that since you need to own the file/directory to set
flags, this means that only programs that are running as root or
running as the uid who owns the entire subtree will be able to use
this scheme.  One advantage of doing it in kernel memory is that you
might be able to support watching a tree that is not owned by the

OK, so in the worst case every single file in a kernel source tree
might change after doing an extreme git checkout.  That means around
36k of files get updated.  So if you have to set/clear the rtime flag
during the checkout process 36k file inodes would have to have their
rtime flag cleared, plus 2k worth of directory inodes; but those would
probably be folded into other changes made to the inodes anyway.  But
then when trackerd goes back and scans the subtree, if you are
actually setting rtime flags for every single file inode, then that's
38k of inodes that need updating.  If you only need to set the rtime
flags for directories, that's only 2k worth of extra gratuitous inode
updates.

							- Ted
-

From: Jan Kara
Date: Thursday, November 8, 2007 - 3:56 am

Yes, that's actually what I'm doing - sorry if I didn't make it clear
  I already gave some thought to this but there seemed to be some
drawbacks. Query I want to support is: given a directory, tell me which of
its subdirectories (arbitrarily deep below) have been modified since time
T.  That is what you need to support faster rsync, updatedb and similar
loads.  Also I want to allow a reboot to happen in between the modification
and a query (handling a crash correctly would be nice too but honestly my
current implementation is not completely reliable in this regard either) so
some permanent storage is needed in any case. What I can imagine we could
do is to report all modifications to userspace - that has the problem that
there are *many* possible modifications but we are interested only in whether
any happened since time T. We could improve this by an in-memory
inode flag "I'm not interested in modifications any further" and reporting
the change only if the parent directory does not have this flag set (note
that this flag gets lost when we evict the inode from memory). But I would
say that in the end all this message passing, climbing the tree from
userspace and maintaining data structure in memory and on disk would cost
us more than the current implementation... Also it has the disadvantage
that we miss the modifications which happen before we start the userspace
daemon catching the events.
  Doing this in kernel memory has the problem of how to provide persistence
across reboots (dumping mods to userspace on request?) and also on my
system you'd have roughly a few MB of pinned memory for these purposes...
  Yes, that is the advantage. On the other hand we could allow setting that
particular flag even without being an owner of the inode. In fact, I
don't currently see a use case where you won't be either root (rsync,
updatedb) or an owner of the files (watching config file trees) but I guess
  Yes, here the impact is hardly measurable as I've written in the previous
  As I wrote ...
From: Theodore Tso
Date: Thursday, November 8, 2007 - 7:37 am

Ah, OK, so the two things that I didn't get from your patch
description are:

1) the rtime flag and rtime field are only set on directories
2) the intended use is not trackerd and its ilk, but rsync and updatedb,
   so it is desirable that scan/queries be persistent across reboots

But then the major hole in this scheme is still the issue of hard
links.  The rsync program is still going to have to scan the entire
subtree looking for hard links, since an inode with multiple links
into the directory tree can't guarantee that all of its parent
directories will have their rtime field updated.

A program like updatedb which only cares about filenames will be OK,
since that means it really only cares about knowing when directories
have changed, and you can't have hard links to directories.

The other problem, of course, is that this feature would become ext
2/3/4 specific, and I could see future filesystems possibly wanting
this.  So this raises the question of whether the interface should be
at the VFS layer or not --- and if so, how to handle querying whether
a particular filesystem supports it, and what happens if you have a
subtree which is covered by a filesystem that doesn't support rtime?

So a program like rsync would need to scan /proc/self/mounts to see
whether or not it would be safe to use this feature in the first
place.  And, of course, rsync would need to know whether it has write
access to the tree in order to set flags in the directory, and what to
do if some portion of the subtree isn't writeable by rsync.
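Editorial sketch, not from the thread: the mount-table check described above
could be written with the glibc getmntent() interface roughly as below. The
helper names are hypothetical, and the list of supporting filesystem types
(just "ext3" here) is an assumption, since the feature exists only as this
proposed patch.

```c
#include <assert.h>
#include <mntent.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumption: only ext3 with the proposed patch would support rtime. */
bool fstype_has_rtime(const char *fstype)
{
    return strcmp(fstype, "ext3") == 0;
}

/* True if `path` equals `base` or lies below it (plain string comparison). */
bool path_is_under(const char *path, const char *base)
{
    size_t n;

    assert(path != NULL && base != NULL);
    n = strlen(base);
    if (strncmp(path, base, n) != 0)
        return false;
    if (n > 0 && base[n - 1] == '/')  /* base is "/" or ends in a slash */
        return true;
    return path[n] == '\0' || path[n] == '/';
}

/*
 * Returns 1 if the filesystem holding `root` and every filesystem mounted
 * below it supports rtime, 0 if any of them does not, and -1 if the mount
 * table cannot be read.
 */
int subtree_supports_rtime(const char *root)
{
    FILE *mtab = setmntent("/proc/self/mounts", "r");
    struct mntent *m;
    size_t best = 0;      /* longest mount point that contains root */
    bool root_ok = false;
    bool below_ok = true;

    if (mtab == NULL)
        return -1;
    while ((m = getmntent(mtab)) != NULL) {
        bool ok = fstype_has_rtime(m->mnt_type);
        size_t len = strlen(m->mnt_dir);

        if (path_is_under(root, m->mnt_dir) && len >= best) {
            best = len;   /* later entries shadow earlier ones */
            root_ok = ok;
        }
        if (path_is_under(m->mnt_dir, root) &&
            strcmp(m->mnt_dir, root) != 0 && !ok)
            below_ok = false;
    }
    endmntent(mtab);
    return root_ok && below_ok;
}
```

A tool would call subtree_supports_rtime() once per source tree and fall
back to a full scan on 0 or -1. The sketch covers only the filesystem-type
question and does not check whether the caller may set flags on the
directories.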


Sometimes people like to use rsync to copy a subtree to which they
have read access but not write access.  (And here note that it's not
enough to have write access, you actually need to *own* all of the
directories in the subtree).

Yes, it's safe to let any user *set* the rtime flag, but we couldn't
let them clear the rtime flag, since then they would be able to hide a
file modification from some other (potentially privileged) process.
Speaking of ...
From: Jan Kara
Date: Thursday, November 8, 2007 - 8:28 am

Not really - initially rsync can scan a tree for hardlinks and remember
where they are. If a hardlink to a file is created, an rtime update is
sent up the tree via the path used to create the link. So during the next
scan, rsync will see that the file is modified, find out that its nlink is > 1,
and add it to the list of hardlinked files.
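Editorial sketch, not from the thread: the nlink check mentioned above maps
onto the standard lstat() call. struct hardlink_id and both function names
are hypothetical; a real tool would keep the returned identifiers in its own
list and compare the mtime of those files on every run.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Identifies an inode independently of the path used to reach it. */
struct hardlink_id {
    dev_t dev;
    ino_t ino;
};

/* True if `st` describes a non-directory that has more than one link. */
bool needs_hardlink_tracking(const struct stat *st)
{
    assert(st != NULL);
    return !S_ISDIR(st->st_mode) && st->st_nlink > 1;
}

/*
 * Examine `path` without following symlinks.
 * Returns 1 and fills `id` if the file must go on the hardlink list,
 * 0 if it needs no special handling, and -1 if lstat() fails.
 */
int classify_path(const char *path, struct hardlink_id *id)
{
    struct stat st;

    if (lstat(path, &st) != 0)
        return -1;
    if (!needs_hardlink_tracking(&st))
        return 0;
    id->dev = st.st_dev;
    id->ino = st.st_ino;
    return 1;
}
```

Directories are excluded because their link count reflects subdirectories
rather than hard links.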
  So for things like regular backups hardlinks can be dealt with
  Yes, being filesystem specific and thus requiring special handling of
  Yes, the cases where we cannot modify the flag in a tree would have to be
handled (similarly as the cases where the filesystem simply does not
support the feature). I don't think it would be too complicated but I have
  Yes, so in such cases my feature won't be able to help. But I think
  No, the patch does not allow this. But anyway in case user has enough
  Hardlinks can be worked-around as I wrote above and there would have to
be a fallback in case we cannot set the flag. So I agree the code would be
more complicated but I think it could be done in a quite clean way - but of
course that has to be proven by a patch which I don't have yet. I have not
spoken to rsync maintainers about this - first I want to have at least a
preliminary version of a patch for rsync so that we have something to
talk about...
								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR
-
