Re: [PATCH 14/38] fallthru: ext2 fallthru support

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

This version of union mounts implements two major changes requested by
Al Viro:

* Drastically simplify the union stack for a directory.  It is now a
  singly linked list rooted in the dentry of the topmost directory,
  instead of a set of path -> path mappings kept in a hash table.  The
  union hash table lookup routines have gone away, along with most of
  struct union_dir.

* On union mount, clone the underlying read-only mounts and keep them
  in a list hanging off the superblock of the topmost file system.

It also includes many other minor fixups, but those are the big
changes.
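
As a rough illustration of the first change, the new layout can be modelled in userspace C as a list hanging off the topmost directory. This is only a sketch under stated assumptions: every sketch_* name and field below is hypothetical and is not taken from the patches, and locking and reference counting are left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel objects. */
struct sketch_dentry;

/* One list element per lower layer of the union stack. */
struct sketch_union_dir {
	struct sketch_dentry *lower;    /* same directory on the next layer down */
	struct sketch_union_dir *next;  /* NULL at the bottom of the stack */
};

struct sketch_dentry {
	const char *name;
	/* Head of the stack; NULL when the directory is not unioned. */
	struct sketch_union_dir *union_stack;
};

/* Count the layers below the topmost directory by walking the list. */
static size_t sketch_union_depth(const struct sketch_dentry *topmost)
{
	const struct sketch_union_dir *ud;
	size_t depth = 0;

	assert(topmost != NULL);
	for (ud = topmost->union_stack; ud != NULL; ud = ud->next)
		depth++;
	return depth;
}
```

Because the head of the list is reachable from the topmost dentry itself, finding the lower layers is a pointer walk and needs no hash lookup.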

Patches are against 2.6.34.  Git version is in branch "linked_list" of:

git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git

Next up: Rewrite user_path_nd() and associated code, and implement the
rest of Al Viro's code review comments.

-VAL

Felix Fietkau (2):
  whiteout: jffs2 whiteout support
  fallthru: jffs2 fallthru support

Jan Blunck (11):
  VFS: Make lookup_hash() return a struct path
  autofs4: Save autofs trigger's vfsmount in super block info
  whiteout/NFSD: Don't return information about whiteouts to userspace
  whiteout: Add vfs_whiteout() and whiteout inode operation
  whiteout: Set S_OPAQUE inode flag when creating directories
  whiteout: Allow removal of a directory with whiteouts
  whiteout: tmpfs whiteout support
  whiteout: Split of ext2_append_link() from ext2_add_link()
  whiteout: ext2 whiteout support
  union-mount: Introduce MNT_UNION and MS_UNION flags
  union-mount: Call do_whiteout() on unlink and rmdir in unions

Valerie Aurora (25):
  VFS: Comment follow_mount() and friends
  VFS: Add read-only users count to superblock
  fallthru: Basic fallthru definitions
  fallthru: ext2 fallthru support
  fallthru: tmpfs fallthru support
  union-mount: Union mounts documentation
  union-mount: Introduce union_dir structure and basic operations
  union-mount: Free union dirs on removal from dcache
  union-mount: Support for mounting union mount file ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

XXX - This is broken and included just to make union mounts work.  See
discussion at:

http://kerneltrap.org/mailarchive/linux-fsdevel/2010/1/15/6708053/thread

Original commit message:

This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:

    During a path walk if an autofs trigger is mounted on a dentry,
    when the follow_link method is called, the nameidata struct
    contains the vfsmount and mountpoint dentry of the parent mount
    while the dentry that is passed in is the root of the autofs
    trigger mount.  I believe it is impossible to get the vfsmount of
    the trigger mount, within the follow_link method, when only the
    parent vfsmount and the root dentry of the trigger mount are
    known.

The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link().  This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).

A cleaner and easier to understand solution is to save the necessary
vfsmount in the autofs superblock info when it is mounted.  Then we
can easily update the vfsmount in autofs4_follow_link().

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: autofs@linux.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/autofs4/autofs_i.h |    1 +
 fs/autofs4/init.c     |   11 ++++++++++-
 fs/autofs4/root.c     |    6 ++++++
 fs/namei.c            |    7 ++-----
 4 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..de3af64 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -133,6 +133,7 @@ struct autofs_sb_info {
 	int reghost_enabled;
 	int ...
From: Ian Kent
Date: Tuesday, June 15, 2010 - 9:04 pm

Instead of saving the vfsmount we could save a pointer to the dentry of
the mount point in the autofs super block info struct. I think that's
the bit I don't have so it would be sufficient for a lookup_mnt() for
the needed vfsmount in ->follow_mount().

Objections?


--

From: Valerie Aurora
Date: Wednesday, June 16, 2010 - 4:14 pm

I'm not sure... it seems like it would have the same problem that Al
described with pinning the vfsmount forever.  But I don't know autofs
at all.

Could you run through a quick example of the case that triggers this
problem in the first place?  The problem is when you have a symlink
that triggers an automount, and you are trying to get from the target
of the symlink to the vfsmount of the file system containing the
symlink in the first place?  Or do I have that wrong?

Thanks,

-VAL
--

From: Ian Kent
Date: Wednesday, June 16, 2010 - 7:04 pm

That's why I asked.
But I don't see how the dentry can go away since it's covered by the

Ha!

Yes, you would think we were talking about a symlink but this dentry is
a directory, a trigger for a mount that uses ->follow_mount() to do the
mount, similar to the way the NFS client mounts nohide mounts when they
are crossed.

In the autofs case we have:

<path in fs>/dir
      <autofs fs (with type direct or offset) mounted on>/dir

When ->follow_link() is called the nameidata has the vfsmount of the
once removed mount because it hasn't yet been updated in (say)
link_path_walk(), but the dentry passed to ->follow_link() is the global
root of the autofs fs so we have no way of discovering the vfsmount or
the dentry upon which the autofs trigger mount is mounted. Which of
course prevents us from mounting and following the trigger.

The example is rather poor, sorry, hope it is sufficient.

--

From: Ian Kent
Date: Sunday, June 20, 2010 - 8:39 pm

No comments so far.

Before I dive into testing if this actually does what I need, can I get
an "in principle" ack or nack for the patch so union mounts can move on
please?

Note that this patch hasn't even been compile tested so the point is to
decide whether it is worth going ahead with it.


autofs4 - save autofs trigger mountpoint in super block info

From: Ian Kent <raven@themaw.net>

Adapted from the original patch from Jan Blunck <jblunck@suse.de>.

Original commit message:

This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:

    During a path walk if an autofs trigger is mounted on a dentry,
    when the follow_link method is called, the nameidata struct
    contains the vfsmount and mountpoint dentry of the parent mount
    while the dentry that is passed in is the root of the autofs
    trigger mount.  I believe it is impossible to get the vfsmount of
    the trigger mount, within the follow_link method, when only the
    parent vfsmount and the root dentry of the trigger mount are
    known.

The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link().  This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).

A cleaner and easier to understand solution is to save the necessary
mountpoint dentry in the autofs superblock info when it is mounted.
Then we can look up the needed vfsmount in autofs4_follow_link().

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: autofs@linux.kernel.org
---

 fs/autofs4/autofs_i.h |    1 +
 fs/autofs4/init.c     |   11 ++++++++++-
 fs/autofs4/root.c     |   13 +++++++++++++
 fs/namei.c            |    7 ++-----
 ...
From: Miklos Szeredi
Date: Monday, June 21, 2010 - 6:06 am

mnt_mountpoint is NULL at the point you try to save it, so this is not
--

From: Ian Kent
Date: Monday, June 21, 2010 - 9:46 pm

What about this approach then?


autofs4 - lookup vfsmount in follow_link()

From: Ian Kent <raven@themaw.net>

Adapted from the original patch from Jan Blunck <jblunck@suse.de>.

Original commit message:

This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:

    During a path walk if an autofs trigger is mounted on a dentry,
    when the follow_link method is called, the nameidata struct
    contains the vfsmount and mountpoint dentry of the parent mount
    while the dentry that is passed in is the root of the autofs
    trigger mount.  I believe it is impossible to get the vfsmount of
    the trigger mount, within the follow_link method, when only the
    parent vfsmount and the root dentry of the trigger mount are
    known.

The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link().  This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).

A better solution is to lookup the vfsmount when the mount is triggered,
which can be done because binding an autofs file system mount to another
location isn't valid (even though we can't actually veto this from the
autofs module).

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: autofs@linux.kernel.org
---

 fs/autofs4/root.c |   47 +++++++++++++++++++++++++++++++++++++++++++++++
 fs/namei.c        |    7 ++-----
 fs/namespace.c    |    1 +
 3 files changed, 50 insertions(+), 5 deletions(-)


diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index db4117e..62dbcef 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -208,6 +208,40 @@ static int try_to_fill_dentry(struct dentry *dentry, int flags)
 	return 0;
 ...
From: J. R. Okajima
Date: Monday, June 21, 2010 - 10:49 pm

dentry->d_subdirs?
parent->dentry->...?

Or how about iterate_mounts() instead of looping over dentries?
For example (just an example),

struct args {
	/* input */
	struct dentry *root;

	/* output */
	struct vfsmount *mnt;
};

static int compare_mnt(struct vfsmount *mnt, void *arg)
{
	struct args *a = arg;

	if (mnt->mnt_root != a->root)
		return 0;
	a->mnt = mntget(mnt);
	return 1;
}

struct vfsmount *autofs4_find_vfsmount(struct dentry *root)
{
	int err;
	struct args args = {
		.root = root
	};

	/* compare_mnt() returns 1 once it has stored a match in args.mnt */
	err = iterate_mounts(compare_mnt, &args, current->nsproxy->mnt_ns);
	return err ? args.mnt : NULL;
}
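
The callback pattern in the example above (return nonzero from the callback to stop the walk and report a match) can be tried out in userspace with a minimal sketch like the following. All sketch_* names are hypothetical stand-ins, the mount list is a plain singly linked list, and there is no locking or reference counting here.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a mount; root stands in for mnt->mnt_root. */
struct sketch_mount {
	const void *root;
	struct sketch_mount *next;
};

typedef int (*sketch_mount_fn)(struct sketch_mount *mnt, void *arg);

/* Call fn on each mount; stop at, and return, the first nonzero result. */
static int sketch_iterate_mounts(sketch_mount_fn fn, void *arg,
				 struct sketch_mount *head)
{
	struct sketch_mount *m;
	int res;

	assert(fn != NULL);
	for (m = head; m != NULL; m = m->next) {
		res = fn(m, arg);
		if (res)
			return res;
	}
	return 0;
}

struct sketch_args {
	const void *root;           /* input */
	struct sketch_mount *mnt;   /* output */
};

static int sketch_compare_mnt(struct sketch_mount *mnt, void *arg)
{
	struct sketch_args *a = arg;

	if (mnt->root != a->root)
		return 0;
	a->mnt = mnt;
	return 1;
}

/* Return the first mount whose root matches, or NULL if none does. */
static struct sketch_mount *sketch_find_mount(struct sketch_mount *head,
					      const void *root)
{
	struct sketch_args args = { .root = root, .mnt = NULL };

	if (!sketch_iterate_mounts(sketch_compare_mnt, &args, head))
		return NULL;
	return args.mnt;
}
```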


J. R. Okajima
--

From: Ian Kent
Date: Tuesday, June 22, 2010 - 6:11 am

Yep, thanks, cut and paste error.

Like I said, I don't want to go through the test process unless I have
something that is, in principle, OK.

If whatever approach we use is acceptable, and will work, then I'll put
the effort into it. I just don't want to spend a heap of time on
something that is basically not the right thing to do. For example,

Oh, I'm not up with this, I'll have to check this out, might be useful
for more than just this case, thanks for the comments.

Ian


--

From: Ian Kent
Date: Tuesday, June 22, 2010 - 6:23 pm

I may be missing something about this, but why is it safe to use
iterate_mounts(), since it doesn't take the vfsmount_lock when
traversing the list of mounts?

Ian


--

From: J. R. Okajima
Date: Tuesday, June 22, 2010 - 7:07 pm

The sample code was not correct.
We need to acquire vfsmount_lock or down_read(namespace_sem).

Or it may be better to extract the body of iterate_mounts() and create a
new function __iterate_mounts(), something like this:

__iterate_mounts()
{
	/* equiv to the current iterate_mounts */
}

iterate_mounts()
{
	down_read(&namespace_sem);
	/* or spin_lock(&vfsmount_lock); */

	__iterate_mounts();

	up_read(&namespace_sem);
	/* or spin_unlock(&vfsmount_lock); */
}


J. R. Okajima
--

From: Ian Kent
Date: Tuesday, June 22, 2010 - 7:37 pm

Yep, thought so.
That's a useful enough function to warrant that IMHO.
I'll continue checking its usages before I do it though.

Ian


--

From: Ian Kent
Date: Wednesday, June 23, 2010 - 10:16 pm

Ok, let's try this again.

The compiler is way smarter than I, so it probably isn't quite so bad
this time. Obviously I need to add a Cc for the audit system maintainer.


autofs4 - lookup vfsmount in follow_link()

From: Ian Kent <raven@themaw.net>

Adapted from the original patch from Jan Blunck <jblunck@suse.de>.

Original commit message:

This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:

    During a path walk if an autofs trigger is mounted on a dentry,
    when the follow_link method is called, the nameidata struct
    contains the vfsmount and mountpoint dentry of the parent mount
    while the dentry that is passed in is the root of the autofs
    trigger mount.  I believe it is impossible to get the vfsmount of
    the trigger mount, within the follow_link method, when only the
    parent vfsmount and the root dentry of the trigger mount are
    known.

The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link().  This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).

A better solution is to lookup the vfsmount when the mount is triggered,
which can be done because binding an autofs file system mount to another
location isn't valid (even though we can't actually veto this from the
autofs module).

Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: autofs@linux.kernel.org
---

 fs/autofs4/root.c |   49 +++++++++++++++++++++++++++++++++++++++++++++++++
 fs/namei.c        |    7 ++-----
 fs/namespace.c    |    8 ++++++--
 3 files changed, 57 insertions(+), 7 deletions(-)


diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index db4117e..114959b ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

In case of an union directory we don't want that the directories on lower
layers of the union "show through". So to prevent that the contents of
underlying directories magically shows up after a mkdir() we set the S_OPAQUE
flag if directories are created where a whiteout existed before.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 fs/namei.c         |   11 ++++++++++-
 include/linux/fs.h |    3 +++
 2 files changed, 13 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2c723e2..8c67636 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2107,6 +2107,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, int, mode, unsigned, dev)
 int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 {
 	int error = may_create(dir, dentry);
+	int opaque = 0;
 
 	if (error)
 		return error;
@@ -2119,9 +2120,17 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
 	if (error)
 		return error;
 
+	if (d_is_whiteout(dentry))
+		opaque = 1;
+
 	error = dir->i_op->mkdir(dir, dentry, mode);
-	if (!error)
+	if (!error) {
 		fsnotify_mkdir(dir, dentry);
+		if (opaque) {
+			dentry->d_inode->i_flags |= S_OPAQUE;
+			mark_inode_dirty(dentry->d_inode);
+		}
+	}
 	return error;
 }
 
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7afdbd4..e9aa650 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ struct inodes_stat_t {
 #define S_NOCMTIME	128	/* Do not update file c/mtime */
 #define S_SWAPFILE	256	/* Do not truncate: swapon got its bmaps */
 #define S_PRIVATE	512	/* Inode is fs-internal */
+#define S_OPAQUE	1024	/* Directory is opaque */
 
 /*
  * Note that nosuid etc flags are inode-specific: setting some file-system
@@ -271,6 +272,8 @@ struct inodes_stat_t {
 #define IS_SWAPFILE(inode)	((inode)->i_flags & S_SWAPFILE)
 #define IS_PRIVATE(inode)	((inode)->i_flags & S_PRIVATE)
 
+#define ...
From: Ian Kent
Date: Monday, July 12, 2010 - 9:05 pm

I found this hard to understand.

Do you mean:

For directories within a union that are whiteouts we don't want the entries of
lower layer file system to "show through". To achieve this we set the S_OPAQUE
--

From: Valerie Aurora
Date: Friday, July 16, 2010 - 1:12 pm

That is much clearer.  I ended up with this version, what do you think?

whiteout: Set opaque flag if new directory was previously a whiteout

If we mkdir() a directory on the top layer of a union, we don't want
entries from a matching directory on the lower layer to "show through"
suddenly.  To prevent this, we set the opaque flag on a directory if
there was previously a white-out with the same name. (If there is no
white-out and the directory exists in a lower layer, then mkdir() will
fail with EEXIST.)
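
The three cases in that description can be written down as a small userspace model. This is a sketch only: the sketch_* types are hypothetical, the state of the name is passed in directly instead of being looked up, and nothing here touches a real file system.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* What the top layer holds for the name being created. */
enum sketch_top_state {
	SKETCH_TOP_NEGATIVE,   /* nothing there */
	SKETCH_TOP_WHITEOUT,   /* a white-out with the same name */
	SKETCH_TOP_EXISTS      /* a real entry already exists */
};

struct sketch_name_state {
	enum sketch_top_state top;
	bool lower_has_dir;    /* a matching directory exists on a lower layer */
};

struct sketch_mkdir_result {
	int error;             /* 0 or a negative errno value */
	bool opaque;           /* new directory gets the opaque flag */
};

static struct sketch_mkdir_result
sketch_union_mkdir(const struct sketch_name_state *st)
{
	struct sketch_mkdir_result r = { 0, false };

	assert(st != NULL);
	switch (st->top) {
	case SKETCH_TOP_EXISTS:
		r.error = -EEXIST;
		break;
	case SKETCH_TOP_WHITEOUT:
		/* Replacing a white-out: hide whatever is below. */
		r.opaque = true;
		break;
	case SKETCH_TOP_NEGATIVE:
		/* No white-out, so a lower directory is still visible. */
		if (st->lower_has_dir)
			r.error = -EEXIST;
		break;
	}
	return r;
}
```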

-VAL
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

Add support for whiteout dentries to tmpfs.  This includes adding
support for whiteouts to d_genocide(), which is called to tear down
pinned tmpfs dentries.  Whiteouts have to be persistent, so they have
a pinning extra ref count that needs to be dropped by d_genocide().
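
The resulting skip rule can be modelled in isolation as below. This is a userspace sketch with a hypothetical sketch_dentry_state struct; it only mirrors the condition the patch puts into d_genocide() and does not do any reference counting itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flattened view of the dentry properties that matter here. */
struct sketch_dentry_state {
	bool unhashed;
	bool has_inode;     /* false for negative dentries and whiteouts */
	bool is_whiteout;
};

/*
 * True when the teardown walk should skip this dentry: unhashed entries
 * and ordinary negative dentries are skipped, while positive dentries and
 * whiteouts are processed so their pinning ref count can be dropped.
 */
static bool sketch_genocide_skips(const struct sketch_dentry_state *d)
{
	assert(d != NULL);
	return d->unhashed || (!d->has_inode && !d->is_whiteout);
}
```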

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: linux-mm@kvack.org
---
 fs/dcache.c |   13 +++++-
 mm/shmem.c  |  149 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 147 insertions(+), 15 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 265015d..3b0e525 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2229,7 +2229,18 @@ resume:
 		struct list_head *tmp = next;
 		struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
 		next = tmp->next;
-		if (d_unhashed(dentry)||!dentry->d_inode)
+		/*
+		 * Skip unhashed and negative dentries, but process
+		 * positive dentries and whiteouts.  A whiteout looks
+		 * kind of like a negative dentry for purposes of
+		 * lookup, but it has an extra pinning ref count
+		 * because it can't be evicted like a negative dentry
+		 * can.  What we care about here is ref counts - and
+		 * we need to drop the ref count on a whiteout before
+		 * we can evict it.
+		 */
+		if (d_unhashed(dentry)||(!dentry->d_inode &&
+					 !d_is_whiteout(dentry)))
 			continue;
 		if (!list_empty(&dentry->d_subdirs)) {
 			this_parent = dentry;
diff --git a/mm/shmem.c b/mm/shmem.c
index eef4ebe..c58ecf4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1805,6 +1805,76 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
 	return 0;
 }
 
+static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
+static int shmem_unlink(struct inode *dir, struct dentry *dentry);
+
+/*
+ * This is the whiteout support for tmpfs. It uses one ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/open.c |   25 +++++++++++++++++++++----
 1 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 3c1ae55..336fe01 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -669,18 +669,32 @@ out:
 SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
 {
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
 	struct inode *inode;
+	char *tmp;
 	int error;
 	struct iattr newattrs;
 
-	error = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+	error = user_path_nd(dfd, filename, LOOKUP_FOLLOW, &nd,
+				     &path, &tmp);
 	if (error)
 		goto out;
-	inode = path.dentry->d_inode;
 
-	error = mnt_want_write(path.mnt);
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
+	error = mnt_want_write(mnt);
 	if (error)
 		goto dput_and_out;
+
+	error = union_copyup(&nd, &path);
+	if (error)
+		goto mnt_drop_write_and_out;
+
+	inode = path.dentry->d_inode;
 	mutex_lock(&inode->i_mutex);
 	error = security_path_chmod(path.dentry, path.mnt, mode);
 	if (error)
@@ -692,9 +706,12 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
 	error = notify_change(path.dentry, &newattrs);
 out_unlock:
 	mutex_unlock(&inode->i_mutex);
-	mnt_drop_write(path.mnt);
+mnt_drop_write_and_out:
+	mnt_drop_write(mnt);
 dput_and_out:
 	path_put(&path);
+	path_put(&nd.path);
+	putname(tmp);
 out:
 	return error;
 }
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

This patch adds the basic structures and operations of VFS-based union
mounts (but not the ability to mount or lookup unioned file systems).
Each directory in a unioned file system has an associated union stack
created when the directory is first looked up.  The union stack is a
union_dir structure kept in a hash table indexed by mount and dentry
of the directory; thus, specific paths are unioned, not dentries
alone.  The union_dir keeps a pointer to the upper path and the lower
path and can be looked up by either path.  Currently only two layers
are supported, but the union_dir struct is flexible enough to allow
more than two layers.

This particular version of union mounts is based on ideas by Jan
Blunck, Bharata Rao, and many others.

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 fs/Kconfig             |   13 +++++
 fs/Makefile            |    1 +
 fs/dcache.c            |    3 +
 fs/union.c             |  119 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/union.h             |   66 ++++++++++++++++++++++++++
 include/linux/dcache.h |    4 +-
 include/linux/fs.h     |    1 +
 7 files changed, 206 insertions(+), 1 deletions(-)
 create mode 100644 fs/union.c
 create mode 100644 fs/union.h

diff --git a/fs/Kconfig b/fs/Kconfig
index 5f85b59..f99c3a9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,19 @@ source "fs/notify/Kconfig"
 
 source "fs/quota/Kconfig"
 
+config UNION_MOUNT
+       bool "Union mounts (writable overlays) (EXPERIMENTAL)"
+       depends on EXPERIMENTAL
+       help
+         Union mounts allow you to mount a transparent writable
+	 layer over a read-only file system, for example, an ext3
+	 partition on a hard drive over a CD-ROM root file system
+	 image.
+
+	 See <file:Documentation/filesystems/union-mounts.txt> for details.
+
+	 If unsure, say N.
+
 source "fs/autofs/Kconfig"
 source "fs/autofs4/Kconfig"
 source "fs/fuse/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 97f340f..1949af2 100644
--- ...
From: Valerie Aurora
Date: Friday, July 16, 2010 - 1:51 pm

I did a quick review and think this is right.  The SLAB_PANIC flag in
combination with this being called early in boot means it will panic

Nope, fixed.

Thanks,

-VAL
--

From: Miklos Szeredi
Date: Wednesday, August 4, 2010 - 7:51 am

This botches the carefully tuned length of struct dentry.  At least a
FIXME comment needs to be added that this is something to be
addressed.

Why was the hash table concept dropped?  The header comment still
talks about that?

Miklos
--

From: Valerie Aurora
Date: Wednesday, August 4, 2010 - 12:47 pm

Simply, Al Viro didn't like it.  But note that the current
implementation still uses part of the hash table solution.  You still
have union_dir structures external to dentries for the read-only
layers of the stack.  The change is from Al's observation that the
topmost dentry could only be part of one stack.  Why do a lookup on
the topmost dentry when you could keep a pointer to the stack in the
dentry itself and skip the lookup?  Once you have the head of the
stack, you don't need lookup for the rest of it.  This eliminates all
the lookup machinery and the union hash table lock, which seems like a
big win to me.

The biggest drawback of the hash table in my mind was that it
introduced a new global synchronization point in lookup.  Making it go
fast would be dcache lookup optimization all over again.

Thanks,

-VAL
--

From: Miklos Szeredi
Date: Thursday, August 5, 2010 - 3:28 am

That dentry field will be unused most of the time and we lose space
for d_iname for *all* filesystems.  On 64bit this results in max
inline name going from 32 down to 24 bytes.  On my root fs 7% of names
are 24-31 in length.  That's more than triple that of names which are
more than 32 in length.
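
The arithmetic behind the 32 to 24 figure can be sketched as follows. The helper name is hypothetical, and the 192-byte budget and 160 bytes of other fields used in the note after the code are illustrative numbers chosen to reproduce the figure above, not measurements of struct dentry.

```c
#include <assert.h>
#include <stddef.h>

/*
 * With the struct held to a fixed size, every byte taken by another field
 * comes out of the inline name buffer at the end of the struct.
 */
static size_t sketch_inline_name_len(size_t struct_budget, size_t other_fields)
{
	assert(other_fields <= struct_budget);
	return struct_budget - other_fields;
}
```

With those illustrative numbers, sketch_inline_name_len(192, 160) leaves 32 bytes for the name, and adding one 8-byte pointer, sketch_inline_name_len(192, 168), leaves 24.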

Yeah, union mounts can be configured out, but that's not much

I already asked this, but I'll ask again, what about doing this with a
union filesystem?  That solves this problem in one simple go, as well
as a host of others.

I'll do some experimenting because I feel it should be possible to do
all this in a union fs with most of the advantages of union mounts.
That doesn't mean it won't need any VFS support, but I think the
amount of VFS burden can be considerably reduced with that approach at
a small price (just dentry tree duplication).

Miklos
--

From: Valerie Aurora
Date: Friday, August 6, 2010 - 10:09 am

That would be great.  My theory on the current version is to do
everything in the VFS except when it is much cleaner to make minor
changes to the underlying fs.  I went this way because I'd worked on a
stacked file system version and couldn't see how to avoid the
complexity that unionfs ran into.  But a VFS/stacked fs hybrid might
look nicer than a VFS/low-level fs hybrid.

-VAL
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Implement unioned directories, whiteouts, and fallthrus in pathname
lookup routines.  do_lookup() and lookup_hash() call lookup_union()
after looking up the dentry from the top-level file system.
lookup_union() is centered around __lookup_hash(), which does cached
and/or real lookups and revalidates each dentry in the union stack.

XXX - implement negative union cache entries

XXX - What about different permissions on different layers on the same
directory name?  Should complain, fail, test permissions on all
layers, what?
---
 fs/namei.c |  171 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/union.c |   94 +++++++++++++++++++++++++++++++++
 fs/union.h |    7 +++
 3 files changed, 271 insertions(+), 1 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 06aad7e..45be5e5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -35,6 +35,7 @@
 #include <asm/uaccess.h>
 
 #include "internal.h"
+#include "union.h"
 
 /* [Feb-1997 T. Schoebel-Theuer]
  * Fundamental changes in the pathname lookup mechanisms (namei)
@@ -722,6 +723,160 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
 	follow_mount(&nd->path);
 }
 
+static struct dentry *__lookup_hash(struct qstr *name, struct dentry *base,
+				    struct nameidata *nd);
+
+/*
+ * __lookup_union - Given a path from the topmost layer, lookup and
+ * revalidate each dentry in its union stack, building it if necessary
+ *
+ * @nd - nameidata for the parent of @topmost
+ * @name - pathname from this element on
+ * @topmost - path of the topmost matching dentry
+ *
+ * Given the nameidata and the path of the topmost dentry for this
+ * pathname, lookup, revalidate, and build the associated union stack.
+ * @topmost must be either a negative dentry or a directory, and not a
+ * whiteout.
+ *
+ * This function may stomp nd->path with the path of the parent
+ * directory of lower layer, so the caller must save nd->path and
+ * restore it afterwards.  You probably want to use lookup_union(),
+ * ...
From: Ian Kent
Date: Monday, July 12, 2010 - 9:49 pm

From: Valerie Aurora
Date: Monday, July 19, 2010 - 2:58 pm

It's also the head of the list.  Good anti-comment, there.  Fixed, thanks!

-VAL
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

For union mounts, a file located on the lower layer will incorrectly
return EROFS on an access check.  To fix this, use the new
path_permission() call, which ignores a read-only lower layer file
system if the target will be copied up to the topmost file system.
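
The idea can be modelled in userspace as below. This is a sketch of the behaviour described above and of the mount selection visible in the diff, not of path_permission() itself, whose body is not shown here; the sketch_* names are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a mounted layer. */
struct sketch_layer {
	bool read_only;
};

/*
 * Decide which layer's read-only state applies to a write access check.
 * When the parent directory is unioned the target would be copied up, so
 * the topmost layer is the one that matters; otherwise the layer holding
 * the target is checked.
 */
static int sketch_write_access(const struct sketch_layer *target_layer,
			       const struct sketch_layer *topmost,
			       bool parent_is_unioned)
{
	const struct sketch_layer *checked;

	assert(target_layer != NULL && topmost != NULL);
	checked = parent_is_unioned ? topmost : target_layer;
	return checked->read_only ? -EROFS : 0;
}
```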
---
 fs/open.c |   21 +++++++++++++++++----
 1 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 74e5cd9..7f7958e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -32,6 +32,7 @@
 #include <linux/ima.h>
 
 #include "internal.h"
+#include "union.h"
 
 int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
 {
@@ -454,7 +455,10 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
 	const struct cred *old_cred;
 	struct cred *override_cred;
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
 	struct inode *inode;
+	char *tmp;
 	int res;
 
 	if (mode & ~S_IRWXO)	/* where's F_OK, X_OK, W_OK, R_OK? */
@@ -478,10 +482,17 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
 
 	old_cred = override_creds(override_cred);
 
-	res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+	res = user_path_nd(dfd, filename, LOOKUP_FOLLOW,
+				   &nd, &path, &tmp);
 	if (res)
 		goto out;
 
+	/* For union mounts, use the topmost mnt's permissions */
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
 	inode = path.dentry->d_inode;
 
 	if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
@@ -490,11 +501,11 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
 		 * with the "noexec" flag.
 		 */
 		res = -EACCES;
-		if (path.mnt->mnt_flags & MNT_NOEXEC)
+		if (mnt->mnt_flags & MNT_NOEXEC)
 			goto out_path_release;
 	}
 
-	res = inode_permission(inode, mode | MAY_ACCESS);
+	res = path_permission(&path, &nd.path, mode | MAY_ACCESS);
 	/* SuS v2 requires we report a read only fs too */
 	if (res || !(mode & S_IWOTH) || ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

---
 fs/namei.c |   24 ++++++++++++++++++++----
 1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 505b51d..d2f2618 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2938,16 +2938,18 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 {
 	struct dentry *new_dentry;
 	struct nameidata nd;
+	struct nameidata old_nd;
 	struct path old_path;
 	int error;
 	char *to;
+	char *from;
 
 	if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
 		return -EINVAL;
 
-	error = user_path_at(olddfd, oldname,
+	error = user_path_nd(olddfd, oldname,
 			     flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
-			     &old_path);
+			     &old_nd, &old_path, &from);
 	if (error)
 		return error;
 
@@ -2955,8 +2957,20 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
 	if (error)
 		goto out;
 	error = -EXDEV;
-	if (old_path.mnt != nd.path.mnt)
-		goto out_release;
+	if (old_path.mnt != nd.path.mnt) {
+		if (IS_DIR_UNIONED(old_nd.path.dentry) &&
+		    (old_nd.path.mnt == nd.path.mnt)) {
+			error = mnt_want_write(old_nd.path.mnt);
+			if (error)
+				goto out_release;
+			error = union_copyup(&old_nd, &old_path);
+			mnt_drop_write(old_nd.path.mnt);
+			if (error)
+				goto out_release;
+		} else {
+			goto out_release;
+		}
+	}
 	new_dentry = lookup_create(&nd, 0);
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
@@ -2979,6 +2993,8 @@ out_release:
 	putname(to);
 out:
 	path_put(&old_path);
+	path_put(&old_nd.path);
+	putname(from);
 
 	return error;
 }
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

Copy up a file when opened with write permissions.  Does not copy up
the file data when O_TRUNC is specified.
---
 fs/namei.c |   28 ++++++++++++++++++++++++++++
 1 files changed, 28 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 6096413..7514096 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1907,6 +1907,24 @@ exit:
 	return ERR_PTR(error);
 }
 
+static int open_union_copyup(struct nameidata *nd, struct path *path,
+			     int open_flag)
+{
+	struct vfsmount *oldmnt = path->mnt;
+	int error;
+
+	if (open_flag & O_TRUNC)
+		error = union_copyup_len(nd, path, 0);
+	else
+		error = union_copyup(nd, path);
+	if (error)
+		return error;
+	if (oldmnt != path->mnt)
+		mntput(nd->path.mnt);
+
+	return error;
+}
+
 static struct file *do_last(struct nameidata *nd, struct path *path,
 			    int open_flag, int acc_mode,
 			    int mode, const char *pathname)
@@ -1958,6 +1976,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 			if (!path->dentry->d_inode->i_op->lookup)
 				goto exit_dput;
 		}
+		if (acc_mode & MAY_WRITE) {
+			error = open_union_copyup(nd, path, open_flag);
+			if (error)
+				goto exit_dput;
+		}
 		path_to_nameidata(path, nd);
 		audit_inode(pathname, nd->path.dentry);
 		goto ok;
@@ -2029,6 +2052,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	if (path->dentry->d_inode->i_op->follow_link)
 		return NULL;
 
+	if (acc_mode & MAY_WRITE) {
+		error = open_union_copyup(nd, path, open_flag);
+		if (error)
+			goto exit_dput;
+	}
 	path_to_nameidata(path, nd);
 	error = -EISDIR;
 	if (S_ISDIR(path->dentry->d_inode->i_mode))
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

On rename() of a file on a union mount, copyup and whiteout the source
file.  Both are done under the rename mutex.  I believe this is
actually atomic.

XXX - May not need to do file copyup under the lock.
XXX - Convert newly empty unioned dirs to not-unioned
---
 fs/namei.c |   76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
 1 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index d2f2618..6096413 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3155,6 +3155,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 {
 	struct dentry *old_dir, *new_dir;
 	struct path old, new;
+	struct path to_whiteout = {NULL, NULL};
 	struct dentry *trap;
 	struct nameidata oldnd, newnd;
 	char *from;
@@ -3170,13 +3171,9 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 		goto exit1;
 
 	error = -EXDEV;
+	/* Union mounts will pass below test - dirs always on topmost */
 	if (oldnd.path.mnt != newnd.path.mnt)
 		goto exit2;
-	/* Rename on union mounts not implemented yet */
-	/* XXX much harsher check than necessary - can do some renames */
-	if (IS_DIR_UNIONED(oldnd.path.dentry) ||
-	    IS_DIR_UNIONED(newnd.path.dentry))
-		goto exit2;
 	old_dir = oldnd.path.dentry;
 	error = -EBUSY;
 	if (oldnd.last_type != LAST_NORM)
@@ -3199,7 +3196,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	error = -ENOENT;
 	if (!old.dentry->d_inode)
 		goto exit4;
-	/* unless the source is a directory trailing slashes give -ENOTDIR */
+	/* unless the source is a directory, trailing slashes give -ENOTDIR */
 	if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
 		error = -ENOTDIR;
 		if (oldnd.last.name[oldnd.last.len])
@@ -3211,6 +3208,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
 	error = -EINVAL;
 	if (old.dentry == trap)
 		goto exit4;
+	error = -EXDEV;
+	/* Can't rename a directory from a lower layer */
+	if (IS_DIR_UNIONED(oldnd.path.dentry) &&
+	    ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/open.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 7f7958e..68c97dd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -718,18 +718,35 @@ static int chown_common(struct path *path, uid_t user, gid_t group)
 SYSCALL_DEFINE3(chown, const char __user *, filename, uid_t, user, gid_t, group)
 {
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
+	char *tmp;
 	int error;
 
-	error = user_path(filename, &path);
+	error = user_path_nd(AT_FDCWD, filename, LOOKUP_FOLLOW,
+				     &nd, &path, &tmp);
 	if (error)
 		goto out;
-	error = mnt_want_write(path.mnt);
+
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
+	error = mnt_want_write(mnt);
 	if (error)
 		goto out_release;
+
+	error = union_copyup(&nd, &path);
+	if (error)
+		goto out_drop_write;
 	error = chown_common(&path, user, group);
-	mnt_drop_write(path.mnt);
+out_drop_write:
+	mnt_drop_write(mnt);
 out_release:
 	path_put(&path);
+	path_put(&nd.path);
+	putname(tmp);
 out:
 	return error;
 }
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/xattr.c |   31 +++++++++++++++++++++++++------
 1 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 66bb5c7..4e2b5f6 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -320,17 +320,36 @@ SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
 		size_t, size, int, flags)
 {
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
+	char *tmp;
 	int error;
 
-	error = user_lpath(pathname, &path);
+	error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
 	if (error)
 		return error;
-	error = mnt_want_write(path.mnt);
-	if (!error) {
-		error = setxattr(path.dentry, name, value, size, flags);
-		mnt_drop_write(path.mnt);
-	}
+
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
+	error = mnt_want_write(mnt);
+	if (error)
+		goto out;
+
+	error = union_copyup(&nd, &path);
+	if (error)
+		goto out_drop_write;
+
+	error = setxattr(path.dentry, name, value, size, flags);
+
+out_drop_write:
+	mnt_drop_write(mnt);
+out:
 	path_put(&path);
+	path_put(&nd.path);
+	putname(tmp);
 	return error;
 }
 
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

When a file on the read-only layer of a union mount is altered, it
must be copied up to the topmost read-write layer.  This patch creates
union_copyup() and its supporting routines.

Thanks to Valdis Kletnieks for a bug fix.

Cc: Valdis.Kletnieks@vt.edu
---
 fs/union.c |  323 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/union.h |    7 +-
 2 files changed, 329 insertions(+), 1 deletions(-)

diff --git a/fs/union.c b/fs/union.c
index 76a6c34..0982446 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -24,6 +24,8 @@
 #include <linux/namei.h>
 #include <linux/file.h>
 #include <linux/security.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>
 
 #include "union.h"
 
@@ -191,6 +193,72 @@ int needs_lookup_union(struct path *parent_path, struct path *path)
 	return 1;
 }
 
+/**
+ * union_copyup_xattr
+ *
+ * @old: dentry of original file
+ * @new: dentry of new copy
+ *
+ * Copy up extended attributes from the original file to the new one.
+ *
+ * XXX - Permissions?  For now, copying up every xattr.
+ */
+
+static int union_copyup_xattr(struct dentry *old, struct dentry *new)
+{
+	ssize_t list_size, size;
+	char *buf, *name, *value;
+	int error;
+
+	/* Check for xattr support */
+	if (!old->d_inode->i_op->getxattr ||
+	    !new->d_inode->i_op->getxattr)
+		return 0;
+
+	/* Find out how big the list of xattrs is */
+	list_size = vfs_listxattr(old, NULL, 0);
+	if (list_size <= 0)
+		return list_size;
+
+	/* Allocate memory for the list */
+	buf = kzalloc(list_size, GFP_KERNEL);
+	if (!buf)
+		return -ENOMEM;
+
+	/* Allocate memory for the xattr's value */
+	error = -ENOMEM;
+	value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+	if (!value)
+		goto out;
+
+	/* Actually get the list of xattrs */
+	list_size = vfs_listxattr(old, buf, list_size);
+	if (list_size <= 0) {
+		error = list_size;
+		goto out_free_value;
+	}
+
+	for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+		/* XXX Locking? old is on read-only fs ...
From: Ian Kent
Date: Monday, July 12, 2010 - 9:56 pm

From: Valerie Aurora
Date: Monday, July 19, 2010 - 3:41 pm

It checks if len (the size of the file to be copied up) will overflow
size_t or ssize_t on this machine.  The file could have been created
on a 64-bit box, and be too big to be manipulated on a 32-bit box.  It
could use a comment, fixed.
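
As a rough illustration of the kind of check described here, the following user-space sketch rejects a 64-bit file length that does not fit in the host's signed size type. The helper name copyup_len_fits and its max_ssize parameter are hypothetical and not taken from the patch; passing the limit explicitly only makes it possible to model a 32-bit machine on a 64-bit one.

```c
#include <stdint.h>

/*
 * Hypothetical helper, not from the patch: return 1 if a 64-bit file
 * length can be represented on a machine whose largest signed size
 * value is max_ssize, and 0 otherwise.  Negative lengths never fit.
 */
int copyup_len_fits(int64_t len, int64_t max_ssize)
{
	if (len < 0)
		return 0;
	return len <= max_ssize;
}
```

With max_ssize set to INT32_MAX this models a 32-bit box refusing a file created on a 64-bit one; a real build would presumably compare against the platform's own limit instead.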

-VAL
--

From: Miklos Szeredi
Date: Wednesday, August 4, 2010 - 8:26 am

What happens if there's a crash in the middle of the copyup?

Possible solution is using rename to atomically "replace" the
underlying file.  That however introduces namespace issues: where to
put the temporary file which then needs to be deleted on "fsck.union"?

Miklos
--

From: Valerie Aurora
Date: Thursday, August 5, 2010 - 12:54 pm

This kind of problem is what makes union mounts so much fun to work
on!! </sarcasm>

So far this version of union mounts has kept the namespace clean, so
I'd like to keep it that way.  One of my ideas is to mark the new file
as "copy-in-progress" and if we encounter such a file, we restart the
copyup again.  But how to mark it?  A new inode flag?

This applies in some form to directory copyup too.  However, we
already have a flag we use to indicate that it's copied up - the
opaque flag.  I moved that to be set after the directory entries are
copied up.  If it crashes in the middle, it can be safely restarted
the next time we call readdir() on that directory.
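
A toy user-space model of the "set the flag last" idea for directories might look like the sketch below. Everything in it (toy_dir, toy_copyup_dir, the crash_after parameter) is invented for illustration and is not the kernel code; it only shows why an interrupted copyup can be redone safely when the opaque marker is written after the entries.

```c
#include <stddef.h>
#include <string.h>

#define MAX_ENTRIES 8

/* Toy model of a directory: names only, plus an "opaque" marker. */
struct toy_dir {
	const char *names[MAX_ENTRIES];
	size_t count;
	int opaque;	/* set only once copyup has fully completed */
};

/* Return 1 if dir already has an entry called name, else 0. */
int toy_has_entry(const struct toy_dir *dir, const char *name)
{
	size_t i;

	for (i = 0; i < dir->count; i++)
		if (strcmp(dir->names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Copy entries from lower into upper.  crash_after simulates a crash:
 * the function returns -1 once that many new entries have been copied
 * (a negative value means "never crash").  The opaque flag is set
 * last, so an interrupted copyup is simply redone on the next call.
 */
int toy_copyup_dir(struct toy_dir *upper, const struct toy_dir *lower,
		   int crash_after)
{
	size_t i;
	int copied = 0;

	if (upper->opaque)
		return 0;	/* already complete, nothing to do */

	for (i = 0; i < lower->count; i++) {
		if (toy_has_entry(upper, lower->names[i]))
			continue;	/* upper entry covers the lower one */
		if (crash_after >= 0 && copied == crash_after)
			return -1;	/* simulated crash, flag not set */
		if (upper->count == MAX_ENTRIES)
			return -1;	/* toy model is full */
		upper->names[upper->count++] = lower->names[i];
		copied++;
	}
	upper->opaque = 1;
	return 0;
}
```

The restart works because the copy loop skips entries that are already present, so repeating it after a crash adds only what is still missing. File copyup has no equivalent marker yet, which is the open question raised above.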

I added a comment to the commit message describing the problem, so
it's at least documented.

-VAL
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Split inode_permission() into inode and file-system-dependent parts.
Create path_permission() to check permission based on the path to the
inode.  This is for union mounts, in which an inode can be located on
a read-only lower layer file system but is still writable, since we
will copy it up to the writable top layer file system.  So in that
case, we want to ignore MS_RDONLY on the lower layer.  To make this
decision, we must know the path (vfsmount, dentry) of both the target
and its parent.
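
The decision described above can be sketched in user-space C as follows. The types and the function toy_rdonly_check are stand-ins invented for this illustration; the real path_permission() is not reproduced here and may differ in detail. The sketch assumes that a target whose mount differs from that of its unioned parent directory lives on a lower layer.

```c
#include <errno.h>

/* Toy stand-ins for the kernel's vfsmount and path; not the real types. */
struct toy_mnt {
	int readonly;		/* file system is mounted read-only */
};

struct toy_path {
	struct toy_mnt *mnt;
	int dir_unioned;	/* only meaningful for the parent directory */
};

/*
 * Read-only part of a write permission check.  Returns 0 if the
 * write may proceed, -EROFS otherwise.
 */
int toy_rdonly_check(const struct toy_path *target,
		     const struct toy_path *parent)
{
	/*
	 * Target sits on a lower layer of a union: it will be copied
	 * up, so the read-only state of the parent's (topmost) mount
	 * is what matters, not that of the lower layer.
	 */
	if (parent->dir_unioned && target->mnt != parent->mnt)
		return parent->mnt->readonly ? -EROFS : 0;

	/* Ordinary case: the target's own mount decides. */
	return target->mnt->readonly ? -EROFS : 0;
}
```

This is why both paths are needed: the target alone cannot tell the check whether a writable layer sits above it.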

XXX - so ugly!
---
 fs/namei.c         |   92 ++++++++++++++++++++++++++++++++++++++++++++--------
 include/linux/fs.h |    1 +
 2 files changed, 79 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1e6adf7..4fd431e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -241,29 +241,20 @@ int generic_permission(struct inode *inode, int mask,
 }
 
 /**
- * inode_permission  -  check for access rights to a given inode
+ * __inode_permission  -  check for access rights to a given inode
  * @inode:	inode to check permission on
  * @mask:	right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
  *
  * Used to check for read/write/execute permissions on an inode.
- * We use "fsuid" for this, letting us set arbitrary permissions
- * for filesystem access without changing the "normal" uids which
- * are used for other things.
+ *
+ * This does not check for a read-only file system.  You probably want
+ * inode_permission().
  */
-int inode_permission(struct inode *inode, int mask)
+static int __inode_permission(struct inode *inode, int mask)
 {
 	int retval;
 
 	if (mask & MAY_WRITE) {
-		umode_t mode = inode->i_mode;
-
-		/*
-		 * Nobody gets write access to a read-only fs.
-		 */
-		if (IS_RDONLY(inode) &&
-		    (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
-			return -EROFS;
-
 		/*
 		 * Nobody gets write access to an immutable file.
 		 */
@@ -288,6 +279,79 @@ int inode_permission(struct inode *inode, int mask)
 }
 
 /**
+ * sb_permission  -  check ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

Add per mountpoint flag for Union Mount support. You need additional patches
to util-linux for that to work - see:

git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 fs/namespace.c        |    5 ++++-
 include/linux/fs.h    |    1 +
 include/linux/mount.h |    4 ++--
 3 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index b788cfa..7a399ba 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -808,6 +808,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
 		{ MNT_NODIRATIME, ",nodiratime" },
 		{ MNT_RELATIME, ",relatime" },
 		{ MNT_STRICTATIME, ",strictatime" },
+		{ MNT_UNION, ",union" },
 		{ 0, NULL }
 	};
 	const struct proc_fs_info *fs_infop;
@@ -2018,10 +2019,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
 		mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
 	if (flags & MS_RDONLY)
 		mnt_flags |= MNT_READONLY;
+	if (flags & MS_UNION)
+		mnt_flags |= MNT_UNION;
 
 	flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
 		   MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
-		   MS_STRICTATIME);
+		   MS_STRICTATIME | MS_UNION);
 
 	if (flags & MS_REMOUNT)
 		retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b59cd7b..dbd9881 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -192,6 +192,7 @@ struct inodes_stat_t {
 #define MS_REMOUNT	32	/* Alter flags of a mounted FS */
 #define MS_MANDLOCK	64	/* Allow mandatory locks on an FS */
 #define MS_DIRSYNC	128	/* Directory modifications are synchronous */
+#define MS_UNION	256	/* Merge namespace with FS mounted below */
 #define MS_NOATIME	1024	/* Do not update access times. */
 #define MS_NODIRATIME	2048	/* Do not update directory access times */
 #define MS_BIND		4096
diff --git ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

If a dentry is removed from dentry cache because its usage count drops
to zero, the union_dirs in its union stack are freed too.
---
 fs/dcache.c |   11 +++++++++++
 1 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 54ff5a3..ce54dc5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -34,6 +34,7 @@
 #include <linux/fs_struct.h>
 #include <linux/hardirq.h>
 #include "internal.h"
+#include "union.h"
 
 int sysctl_vfs_cache_pressure __read_mostly = 100;
 EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
@@ -175,6 +176,7 @@ static struct dentry *d_kill(struct dentry *dentry)
 	dentry_stat.nr_dentry--;	/* For d_free, below */
 	/*drops the locks, at that point nobody can reach this dentry */
 	dentry_iput(dentry);
+	d_free_unions(dentry);
 	if (IS_ROOT(dentry))
 		parent = NULL;
 	else
@@ -696,6 +698,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
 					iput(inode);
 			}
 
+			d_free_unions(dentry);
 			d_free(dentry);
 
 			/* finished when we fall off the top of the tree,
@@ -1535,6 +1538,7 @@ void d_delete(struct dentry * dentry)
 	isdir = S_ISDIR(dentry->d_inode->i_mode);
 	if (atomic_read(&dentry->d_count) == 1) {
 		dentry_iput(dentry);
+		d_free_unions(dentry);
 		fsnotify_nameremove(dentry, isdir);
 		return;
 	}
@@ -1545,6 +1549,13 @@ void d_delete(struct dentry * dentry)
 	spin_unlock(&dentry->d_lock);
 	spin_unlock(&dcache_lock);
 
+	/*
+	 * Remove any associated unions.  While someone still has this
+	 * directory open (ref count > 0), we could not have deleted
+	 * it unless it was empty, and therefore has no references to
+	 * directories below it.  So we don't need the unions.
+	 */
+	d_free_unions(dentry);
 	fsnotify_nameremove(dentry, isdir);
 }
 EXPORT_SYMBOL(d_delete);
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

Call do_whiteout() when removing files and directories from a union
mounted file system.

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 fs/namei.c |    8 ++++++++
 1 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 45be5e5..1e6adf7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2592,6 +2592,10 @@ static long do_rmdir(int dfd, const char __user *pathname)
 	error = security_path_rmdir(&nd.path, path.dentry);
 	if (error)
 		goto exit4;
+	if (IS_DIR_UNIONED(nd.path.dentry)) {
+		error = do_whiteout(&nd, &path, 1);
+		goto exit4;
+	}
 	error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
 exit4:
 	mnt_drop_write(nd.path.mnt);
@@ -2681,6 +2685,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
 		error = security_path_unlink(&nd.path, path.dentry);
 		if (error)
 			goto exit3;
+		if (IS_DIR_UNIONED(nd.path.dentry)) {
+			error = do_whiteout(&nd, &path, 0);
+			goto exit3;
+		}
 		error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
 exit3:
 		mnt_drop_write(nd.path.mnt);
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/open.c |   24 ++++++++++++++++++++----
 1 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 68c97dd..3c1ae55 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -230,14 +230,17 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
 static long do_sys_truncate(const char __user *pathname, loff_t length)
 {
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
 	struct inode *inode;
+	char *tmp;
 	int error;
 
 	error = -EINVAL;
 	if (length < 0)	/* sorry, but loff_t says... */
 		goto out;
 
-	error = user_path(pathname, &path);
+	error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
 	if (error)
 		goto out;
 	inode = path.dentry->d_inode;
@@ -251,11 +254,16 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
 	if (!S_ISREG(inode->i_mode))
 		goto dput_and_out;
 
-	error = mnt_want_write(path.mnt);
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
+	error = mnt_want_write(mnt);
 	if (error)
 		goto dput_and_out;
 
-	error = inode_permission(inode, MAY_WRITE);
+	error = path_permission(&path, &nd.path, MAY_WRITE);
 	if (error)
 		goto mnt_drop_write_and_out;
 
@@ -263,6 +271,12 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
 	if (IS_APPEND(inode))
 		goto mnt_drop_write_and_out;
 
+	error = union_copyup_len(&nd, &path, length);
+	if (error)
+		goto mnt_drop_write_and_out;
+
+	/* path may have changed after copyup */
+	inode = path.dentry->d_inode;
 	error = get_write_access(inode);
 	if (error)
 		goto mnt_drop_write_and_out;
@@ -284,9 +298,11 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
 put_write_and_out:
 	put_write_access(inode);
 mnt_drop_write_and_out:
-	mnt_drop_write(path.mnt);
+	mnt_drop_write(mnt);
 dput_and_out:
 	path_put(&path);
+	path_put(&nd.path);
+	putname(tmp);
 out:
 	return error;
 }
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Create and tear down union mount structures on mount.  Check
requirements for union mounts.  This version clones the read-only
mounts and puts them in an array hanging off the superblock of the
topmost layer.

XXX - need array? maybe use mnt_child or mnt_hash instead

Thanks to Felix Fietkau <nbd@openwrt.org> for a bug fix.
---
 fs/namespace.c        |  231 ++++++++++++++++++++++++++++++++++++++++++++++++-
 fs/super.c            |    1 +
 include/linux/fs.h    |    3 +
 include/linux/mount.h |    2 +
 4 files changed, 235 insertions(+), 2 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 7a399ba..9f3884c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -33,6 +33,7 @@
 #include <asm/unistd.h>
 #include "pnode.h"
 #include "internal.h"
+#include "union.h"
 
 #define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
 #define HASH_SIZE (1UL << HASH_SHIFT)
@@ -1049,6 +1050,7 @@ void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
 		propagate_umount(kill);
 
 	list_for_each_entry(p, kill, mnt_hash) {
+		d_free_unions(p->mnt_root);
 		list_del_init(&p->mnt_expire);
 		list_del_init(&p->mnt_list);
 		__touch_mnt_namespace(p->mnt_ns);
@@ -1334,6 +1336,193 @@ static int invent_group_ids(struct vfsmount *mnt, bool recurse)
 	return 0;
 }
 
+/**
+ * check_mnt_union - mount-time checks for union mount
+ *
+ * @mntpnt: path of the mountpoint the new mount will be on
+ * @topmost_mnt: vfsmount of the new file system to be mounted
+ * @mnt_flags: mount flags for the new file system
+ *
+ * Mount-time check of upper and lower layer file systems to see if we
+ * can union mount one on the other.
+ *
+ * The rules:
+ *
+ * Lower layer(s) read-only: We can't deal with namespace changes in
+ * the lower layers of a union, so the lower layer must be read-only.
+ * Note that we could possibly convert a read-write unioned mount into
+ * a read-only mount here, which would give us a way to union more
+ * than one layer with ...
From: Miklos Szeredi
Date: Wednesday, August 4, 2010 - 7:55 am

If I do

  mount -r fs1 /mnt
  mount -r fs2 /mnt
  mount -ounion fs3 /mnt

then only fs2 and fs3 will be unioned.

Or how are multiple read-only layers supposed to work?

Miklos
--

From: Ian Kent
Date: Monday, July 12, 2010 - 9:47 pm

Is there a need to check fallthru, umm ... that probably doesn't

Last sentence looks a bit odd, would this be better?

We union every underlying file system that is mounted read-only on the
--

From: Valerie Aurora
Date: Wednesday, August 4, 2010 - 9:26 pm

Try branch "for_miklos" in:

git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git

It's against 2.6.34, I'm rebasing against 2.6.35 tomorrow.

-VAL
--

From: Valerie Aurora
Date: Friday, July 16, 2010 - 2:02 pm

Actually, that's on my todo list - right now I'm assuming MS_WHITEOUT
implies fallthru support as well.  But it doesn't.

We're a little short on MS_* flags.  I'm thinking of just checking
->whiteout and ->fallthru for non-NULL on the root dir and getting rid
of MS_WHITEOUT entirely.  Thoughts?

-VAL
--

From: Valerie Aurora
Date: Friday, July 16, 2010 - 2:05 pm

Hm, I appear to have re-written that in the latest set of patches.

-VAL
--

From: Ian Kent
Date: Monday, July 19, 2010 - 8:12 pm

Checking for the methods is a good idea I think, since they are assumed
to be present by the code, at least in some places.

Although it shouldn't happen, it is possible for a file system to create
the root dentry with these methods defined but other dentries without
them defined, so a file system implementation error could cause some
unpleasant crashes. Maybe requiring the flags to indicate support would
help avoid unpleasant implementation problems like this, not sure
really. 

Also not sure if a method existence check should always be made prior to
use, regardless.

Ian


--

From: Valerie Aurora
Date: Wednesday, August 4, 2010 - 2:59 pm

I went for MS_WHITEOUT and MS_FALLTHRU, and added the checks for the
ops being non-null.
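
For illustration, a combined check of this shape (superblock flags plus non-NULL operations on the root directory) could look like the user-space sketch below. The TOY_ names, flag values, and structures are made up for this example and do not match the kernel definitions.

```c
#include <errno.h>
#include <stddef.h>

#define TOY_MS_WHITEOUT	(1u << 0)	/* illustrative values only */
#define TOY_MS_FALLTHRU	(1u << 1)

/* Stand-in for the two inode operations under discussion. */
struct toy_inode_ops {
	int (*whiteout)(void);
	int (*fallthru)(void);
};

/* Stand-in for a superblock and the ops of its root directory. */
struct toy_sb {
	unsigned int s_flags;
	const struct toy_inode_ops *root_ops;
};

/* Placeholder operation so the example has something to point at. */
int toy_noop(void)
{
	return 0;
}

/*
 * Accept the topmost file system only if it advertises both features
 * and the root directory actually provides both methods.
 */
int toy_check_topmost(const struct toy_sb *sb)
{
	if (!(sb->s_flags & TOY_MS_WHITEOUT) ||
	    !(sb->s_flags & TOY_MS_FALLTHRU))
		return -EINVAL;
	if (!sb->root_ops || !sb->root_ops->whiteout ||
	    !sb->root_ops->fallthru)
		return -EINVAL;
	return 0;
}
```

Checking only the root directory does not cover the case Ian raised, where other directories in the same file system lack the methods; that would still need a check at each point of use.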

-VAL
--

From: Miklos Szeredi
Date: Thursday, August 5, 2010 - 3:34 am

This bit me.  Mount failing with EINVAL is a big PITA.

Miklos


Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c	2010-08-05 11:06:56.000000000 +0200
+++ linux-2.6/fs/namespace.c	2010-08-05 11:39:19.000000000 +0200
@@ -1387,6 +1387,7 @@ check_mnt_union(struct path *mntpnt, str
 		return 0;
 
 #ifndef CONFIG_UNION_MOUNT
+	printk(KERN_INFO "union mount: not supported by the kernel\n");
 	return -EINVAL;
 #endif
 	for (p = lower_mnt; p; p = next_mnt(p, lower_mnt)) {
@@ -1396,17 +1397,23 @@ check_mnt_union(struct path *mntpnt, str
 			return -EBUSY;
 	}
 
-	if (!IS_ROOT(mntpnt->dentry))
+	if (!IS_ROOT(mntpnt->dentry)) {
+		printk(KERN_INFO "union mount: not root\n");
 		return -EINVAL;
+	}
 
 	if (mnt_flags & MNT_READONLY)
 		return -EROFS;
 
-	if (!(topmost_mnt->mnt_sb->s_flags & MS_WHITEOUT))
+	if (!(topmost_mnt->mnt_sb->s_flags & MS_WHITEOUT)) {
+		printk(KERN_INFO "union mount: whiteout not supported by fs\n");
 		return -EINVAL;
+	}
 
-	if (!(topmost_mnt->mnt_sb->s_flags & MS_FALLTHRU))
+	if (!(topmost_mnt->mnt_sb->s_flags & MS_FALLTHRU)) {
+		printk(KERN_INFO "union mount: fallthrough not supported by fs\n");
 		return -EINVAL;
+	}
 
 	/* XXX top level mount should only be mounted once */
 
--

From: Valerie Aurora
Date: Friday, August 6, 2010 - 9:33 am

Thanks, I merged this.

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Document design and implementation of union mounts (a.k.a. writable
overlays).
---
 Documentation/filesystems/union-mounts.txt |  759 ++++++++++++++++++++++++++++
 1 files changed, 759 insertions(+), 0 deletions(-)
 create mode 100644 Documentation/filesystems/union-mounts.txt

diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..2ada88d
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,759 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one read-only
+file system, with all writes going to the writable file system.  The
+namespace of both file systems appears as a combined whole to
+userland, with files and directories on the writable file system
+covering up any files or directories with matching pathnames on the
+read-only file system.  The read-write file system is the "topmost"
+or "upper" file system and the read-only file system is the "lower"
+file system.  A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts).  However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level.  COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable ...
From: Alex Riesen
Date: Thursday, June 17, 2010 - 1:01 am

This may be a dumb question (I must admit I did only very little research),
but how does one clean up the topmost layer of whiteouts and fallthroughs,
so that the entries of lower layer(s) can be made visible again?
--

From: Valerie Aurora
Date: Thursday, June 17, 2010 - 11:39 am

I'm not sure how best to do this.  We don't want to add more system
calls.  One thought of mine has been to do this offline, when the file
system is unmounted.  For example, e2fsck could add a feature to
delete whiteouts and fallthrus.  Another option is to add a flag to an
existing system call.

Any ideas?

-VAL
--

From: Alex Riesen
Date: Thursday, June 17, 2010 - 1:32 pm

But that means that if the topmost filesystem is getting full of whiteouts
and fallthroughs there will be no way to free up the space without taking
the volume offline! That makes operation of union mount on always-on
systems difficult. Many personal electronics are always-on today; it
will be annoying to have to shut them down on reconfigurations or just

That makes me think that the cleanup operation will be topmost
filesystem specific. Maybe this even means that one has to
have the filesystem specific tools installed on every system

Or calls, if the whiteouts (or even fallthroughs) are to be read
through directory file handles. unlinkat(2) ? It already has
dirfd and flags arguments.
--

From: Valerie Aurora
Date: Friday, June 18, 2010 - 2:06 pm

Whiteouts and fallthrus go away when a directory is deleted.  So, "rm
-rf /trash/" will actually free up disk space.  You can also move the
files you want to keep to a temp directory, rmdir the old one, and
move that dir back.

Unfortunately, union mounts run into a lot of bizarre ENOSPC
problems.  But in the degenerate case in which you delete every single
file from the lower layer file system, that information will take up
only one whiteout per top-level subdir.  You don't keep whiteouts for

Any union mount utilities would be distributed as part of the normal

Yeah, unlinkat() looks promising.

-VAL
--

From: Miklos Szeredi
Date: Monday, June 21, 2010 - 6:14 am

One more advantage of doing whiteouts, etc. with hard links and
extended attributes instead of as special filesystem objects.  That
way they are visible (unless part of a union) and can be treated as
normal filesystem objects.

Miklos
--

From: Valerie Aurora
Date: Monday, June 21, 2010 - 4:17 pm

This should be reasonably easy to prototype - the whiteout and
fallthru patches are pretty well separated from the rest of union
mounts.

-VAL
--

From: Alex Riesen
Date: Wednesday, June 23, 2010 - 1:43 am

But then you have to break the union to clean up the topmost filesystem.
That'll surely take the mounted filesystem (in its working configuration, at
least) offline. Not much better than using fsck.
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Add support for fallthru directory entries to ext2.

XXX What to do for d_ino for fallthrus?  If we return the inode from
the underlying file system, it comes from a different inode
"namespace" and that will produce spurious matches.  This argues for
implementation of fallthrus as symlinks because they have to allocate
an inode (and inode number) anyway, and we can later reuse it if we
copy the file up.

Cc: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: Jan Blunck <jblunck@suse.de>
---
 fs/ext2/dir.c           |   92 ++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext2/ext2.h          |    1 +
 fs/ext2/namei.c         |   22 +++++++++++
 include/linux/ext2_fs.h |    1 +
 4 files changed, 112 insertions(+), 4 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 030bd46..f3b4aff 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
 {
 	if (len != de->name_len)
 		return 0;
-	if (!de->inode && (de->file_type != EXT2_FT_WHT))
+	if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
+			   (de->file_type != EXT2_FT_FALLTHRU)))
 		return 0;
 	return !memcmp(name, de->name, len);
 }
@@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
 	[EXT2_FT_SOCK]		= DT_SOCK,
 	[EXT2_FT_SYMLINK]	= DT_LNK,
 	[EXT2_FT_WHT]		= DT_WHT,
+	[EXT2_FT_FALLTHRU]	= DT_UNKNOWN,
 };
 
 #define S_SHIFT 12
@@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
 					ext2_put_page(page);
 					return 0;
 				}
+			} else if (de->file_type == EXT2_FT_FALLTHRU) {
+				int over;
+				unsigned char d_type = DT_UNKNOWN;
+
+				offset = (char *)de - kaddr;
+				/* XXX We don't know the inode number
+				 * of the directory entry in the
+				 * underlying file system.  Should
+				 * look it up, either on fallthru
+				 * creation at first readdir or now at
+				 * ...
From: Ian Kent
Date: Monday, July 12, 2010 - 9:30 pm

From: Miklos Szeredi
Date: Wednesday, August 4, 2010 - 7:44 am

If a previously used ext2 filesystem is mounted again then
fallthroughs don't appear to work as expected.  Stat returns ENOENT
for these entries.


That's an idea, but I guess it won't make everyone happy since it
wastes both disk space and memory.

One of the key differentiators for the union mounts concept was that it
doesn't duplicate inodes and dentries from the layers.  With the
directory copyup on lookup that's already partially lost, but that can
be justified by the fact that non-directories usually far outnumber
directories.

Another idea is to use an internal inode and make all fallthroughs be
hard links to that.

I think the same would work for whiteouts as well.  I don't like the
fact that whiteouts are invisible even when not mounted as part of a
union.

Miklos
--

From: Valerie Aurora
Date: Wednesday, August 4, 2010 - 3:48 pm

Hm, I wrote one test case for this that worked (attached).  Can you
give me more details on your test case?  Thanks,

-VAL
From: Miklos Szeredi
Date: Thursday, August 5, 2010 - 3:36 am

uml:~# mount -oloop -r ext3-2.img /mnt/img/
uml:~# mount -oloop -r ext3.img /mnt/img/
uml:~# losetup -f ovl.img 
uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
uml:~# "ls" /mnt/img
bunion  lost+found  union
uml:~# "ls" /mnt/img/union
1  2  3
uml:~# "ls" /mnt/img/union/1
a  x
uml:~# umount /mnt/img/
uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
uml:~# ls -l /mnt/img/  
total 14
drwxr-xr-x 2 root root  1024 Aug  5 09:56 bunion
drwx------ 2 root root 12288 Aug  5 09:41 lost+found
drwxr-xr-x 3 root root  1024 Aug  5 09:56 union
uml:~# ls -l /mnt/img/union/
ls: cannot access /mnt/img/union/3: No such file or directory
ls: cannot access /mnt/img/union/2: No such file or directory
total 1
drwxr-xr-x 2 root root 1024 Aug  5 09:56 1
?????????? ? ?    ?       ?            ? 2
?????????? ? ?    ?       ?            ? 3
uml:~# ls -l /mnt/img/union/1
ls: cannot access /mnt/img/union/1/a: No such file or directory
ls: cannot access /mnt/img/union/1/x: No such file or directory
total 0
?????????? ? ? ? ?            ? a
?????????? ? ? ? ?            ? x
uml:~# 

Thanks,
Miklos
--

From: Valerie Aurora
Date: Thursday, August 5, 2010 - 4:30 pm

Cool, thanks.  Yes, I suppose the fallthrus should be ignored if they
don't fall through to anything.  If I do a proper lookup for d_ino, I
can kill two birds with one stone, since that will tell us whether
there is anything below the fallthru and thus whether to return this
directory entry.

--
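
The point made in the message above can be modelled in userspace. What follows is only a sketch under stated assumptions: `struct toy_entry`, `lower_lookup()` and `resolve_fallthru()` are hypothetical names, and the lower layer is a plain array rather than a real file system. It shows how a single lookup in the lower layer could yield both the inode number to report and the decision whether to return the entry at all.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for a directory entry in the lower layer. */
struct toy_entry {
	const char *name;
	unsigned long ino;
};

/* Look up name in the lower layer; return its inode number, or 0. */
static unsigned long lower_lookup(const struct toy_entry *lower, size_t n,
				  const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(lower[i].name, name) == 0)
			return lower[i].ino;
	return 0;
}

/*
 * Resolve a fallthru entry.  Returns 1 and stores the inode number if
 * the lower layer has the name; returns 0 (leaving *ino untouched) if
 * the fallthru is dangling and readdir should skip it.
 */
static int resolve_fallthru(const struct toy_entry *lower, size_t n,
			    const char *name, unsigned long *ino)
{
	unsigned long found;

	assert(ino != NULL);	/* userspace precondition check */
	found = lower_lookup(lower, n, name);
	if (!found)
		return 0;
	*ino = found;
	return 1;
}
```

A readdir loop would call `resolve_fallthru()` for each fallthru entry and emit only those for which it returns 1.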

From: Valerie Aurora
Date: Friday, August 6, 2010 - 10:16 am

Oh, "mmount -b 8" == "mount -o union".  Is this the mmount from mtools

Okay, I'll experiment more and see what I can do.

--

From: Miklos Szeredi
Date: Friday, August 6, 2010 - 10:44 am

It's a primitive utility that basically just wraps the mount(2) syscall
without any fstab/mtab support:

  http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount/

Miklos
--

From: Miklos Szeredi
Date: Thursday, August 5, 2010 - 4:13 am

Best would be if it didn't need any modification to filesystems.  All
this having to upgrade util-linux, e2fsprogs, having incompatible
filesystem features is a pain for users (just been through that).

What we already have in most filesystems:

 - extended attributes, e.g. use the system.union.* namespace and
   denote whiteouts and fallthroughs with such an attribute

 - hard links to make sure a separate inode is not necessary for each
   whiteout/fallthrough entry

 - some way for the user to easily identify such files when not
   mounted as part of a union e.g. make it a symlink pointing to
   "(deleted)" or whatever

Later the extended attributes can also be used for other things like
e.g. chmod()/chown() only copying up metadata, not data, and
indicating that data is still found on the lower layers.

Miklos
--
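
To make the extended-attribute proposal above concrete, here is a small sketch of how attribute names in such a namespace might be classified. Only the `system.union.` prefix comes from the message; the attribute names `whiteout` and `fallthru`, the enum and the function name are assumptions made for illustration.

```c
#include <assert.h>
#include <string.h>

/* Only the prefix is proposed in the thread; the names after it are
 * hypothetical. */
#define UNION_XATTR_PREFIX	"system.union."
#define UNION_XATTR_WHITEOUT	UNION_XATTR_PREFIX "whiteout"
#define UNION_XATTR_FALLTHRU	UNION_XATTR_PREFIX "fallthru"

enum union_kind {
	UNION_NONE,		/* not in the union namespace */
	UNION_WHITEOUT,
	UNION_FALLTHRU,
	UNION_OTHER,		/* in the namespace, meaning not defined here */
};

/* Classify an extended attribute name. */
static enum union_kind union_xattr_kind(const char *name)
{
	assert(name != NULL);	/* userspace precondition check */
	if (strncmp(name, UNION_XATTR_PREFIX, strlen(UNION_XATTR_PREFIX)) != 0)
		return UNION_NONE;
	if (strcmp(name, UNION_XATTR_WHITEOUT) == 0)
		return UNION_WHITEOUT;
	if (strcmp(name, UNION_XATTR_FALLTHRU) == 0)
		return UNION_FALLTHRU;
	return UNION_OTHER;
}
```

The `UNION_OTHER` case leaves room for the later uses mentioned above, such as marking a file whose metadata has been copied up while its data still lives on a lower layer.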

From: Valerie Aurora
Date: Friday, August 6, 2010 - 10:12 am

Just a quick note to say that my explicit design was to do as much as
possible in the VFS, except when adding a little support to the
low-level fs would make it significantly faster, simpler, and more
correct.  I think for union mounts to perform moderately well, and to
avoid namespace problems, we can't build it 100% out of existing file
system parts like xattrs.  However, I could be wrong and I will
definitely give any other implementation serious consideration.

-VAL
--

From: Valerie Aurora
Date: Tuesday, August 17, 2010 - 3:27 pm

Jan Kara helped convince me this might be better than fs-specific

The problem with hard links is that you run into hard link limits.  I
don't think we can do hard links for whiteouts and fallthrus.  Each
whiteout or fallthru will cost an inode if we implement them as
extended attributes.  This cost has to be balanced against the cost of
implementing them as dentries, which is mainly code complexity in

Perhaps we can simply not interpret the whiteout/fallthru extended
attributes when the file system is not unioned and let userland

It would certainly be more extensible than in-dentry flags.

-VAL
--

From: Miklos Szeredi
Date: Wednesday, August 18, 2010 - 1:26 am

get_unlinked_inode() is a great idea.  But I feel that an individual
inode for each fallthrough is excessive.  It'll make the first
readdir() really really expensive and waste a lot of disk and memory
for no good reason.

Not sure how to fix the hard link limits problem though...

Thanks,
Miklos
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Add support for fallthru directory entries to tmpfs

XXX - Makes up inode number for dirent

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 fs/dcache.c |    3 +-
 fs/libfs.c  |   21 +++++++++++++++++--
 mm/shmem.c  |   60 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
 3 files changed, 73 insertions(+), 11 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index b76f9e4..1575af4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2240,7 +2240,8 @@ resume:
 		 * we can evict it.
 		 */
 		if (d_unhashed(dentry)||(!dentry->d_inode &&
-					 !d_is_whiteout(dentry)))
+					 !d_is_whiteout(dentry) &&
+					 !d_is_fallthru(dentry)))
 			continue;
 		if (!list_empty(&dentry->d_subdirs)) {
 			this_parent = dentry;
diff --git a/fs/libfs.c b/fs/libfs.c
index ea9a6cc..2b28ca9 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -134,6 +134,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 	struct dentry *cursor = filp->private_data;
 	struct list_head *p, *q = &cursor->d_u.d_child;
 	ino_t ino;
+	int d_type;
 	int i = filp->f_pos;
 
 	switch (i) {
@@ -159,14 +160,28 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
 			for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
 				struct dentry *next;
 				next = list_entry(p, struct dentry, d_u.d_child);
-				if (d_unhashed(next) || !next->d_inode)
+				if (d_unhashed(next) || (!next->d_inode && !d_is_fallthru(next)))
 					continue;
 
+				if (d_is_fallthru(next)) {
+					/* XXX We don't know the inode
+					 * number of the directory
+					 * entry in the underlying
+					 * file system.  Should look
+					 * it up, either on fallthru
+					 * creation at first readdir
+					 * or now at filldir time. */
+					ino = 123; /* Made up ino */
+					d_type = DT_UNKNOWN;
+				} else {
+					ino = next->d_inode->i_ino;
+					d_type = dt_type(next->d_inode);
+				}
+
 				spin_unlock(&dcache_lock);
 				if (filldir(dirent, next->d_name.name, 
 					    ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/xattr.c |   34 +++++++++++++++++++++++++++-------
 1 files changed, 27 insertions(+), 7 deletions(-)

diff --git a/fs/xattr.c b/fs/xattr.c
index 46f87e8..66bb5c7 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -19,7 +19,7 @@
 #include <linux/fsnotify.h>
 #include <linux/audit.h>
 #include <asm/uaccess.h>
-
+#include "union.h"
 
 /*
  * Check permissions for extended attribute access.  This is a bit complicated
@@ -281,17 +281,37 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
 		size_t, size, int, flags)
 {
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
+	char *tmp;
 	int error;
 
-	error = user_path(pathname, &path);
+	error = user_path_nd(AT_FDCWD, pathname, LOOKUP_FOLLOW, &nd, &path,
+			     &tmp);
 	if (error)
 		return error;
-	error = mnt_want_write(path.mnt);
-	if (!error) {
-		error = setxattr(path.dentry, name, value, size, flags);
-		mnt_drop_write(path.mnt);
-	}
+
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
+	error = mnt_want_write(mnt);
+	if (error)
+		goto out;
+
+	error = union_copyup(&nd, &path);
+	if (error)
+		goto out_drop_write;
+
+	error = setxattr(path.dentry, name, value, size, flags);
+
+out_drop_write:
+	mnt_drop_write(mnt);
+out:
 	path_put(&path);
+	path_put(&nd.path);
+	putname(tmp);
 	return error;
 }
 
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/utimes.c |   14 ++++++++++++--
 1 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/fs/utimes.c b/fs/utimes.c
index e4c75db..e83b6bd 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -8,8 +8,10 @@
 #include <linux/stat.h>
 #include <linux/utime.h>
 #include <linux/syscalls.h>
+#include <linux/slab.h>
 #include <asm/uaccess.h>
 #include <asm/unistd.h>
+#include "union.h"
 
 #ifdef __ARCH_WANT_SYS_UTIME
 
@@ -152,18 +154,26 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags
 		error = utimes_common(&file->f_path, times);
 		fput(file);
 	} else {
+		struct nameidata nd;
+		char *tmp;
 		struct path path;
 		int lookup_flags = 0;
 
 		if (!(flags & AT_SYMLINK_NOFOLLOW))
 			lookup_flags |= LOOKUP_FOLLOW;
 
-		error = user_path_at(dfd, filename, lookup_flags, &path);
+		error = user_path_nd(dfd, filename, lookup_flags, &nd, &path,
+				     &tmp);
 		if (error)
 			goto out;
 
-		error = utimes_common(&path, times);
+		error = union_copyup(&nd, &path);
+
+		if (!error)
+			error = utimes_common(&path, times);
 		path_put(&path);
+		path_put(&nd.path);
+		putname(tmp);
 	}
 
 out:
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:40 am

---
 fs/open.c |   23 ++++++++++++++++++++---
 1 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 336fe01..b021dcb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -812,18 +812,35 @@ out:
 SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
 {
 	struct path path;
+	struct nameidata nd;
+	struct vfsmount *mnt;
+	char *tmp;
 	int error;
 
-	error = user_lpath(filename, &path);
+	error = user_path_nd(AT_FDCWD, filename, 0, &nd, &path, &tmp);
 	if (error)
 		goto out;
-	error = mnt_want_write(path.mnt);
+
+	if (IS_DIR_UNIONED(nd.path.dentry))
+		mnt = nd.path.mnt;
+	else
+		mnt = path.mnt;
+
+	error = mnt_want_write(mnt);
 	if (error)
 		goto out_release;
+
+	error = union_copyup(&nd, &path);
+	if (error)
+		goto out_drop_write;
+
 	error = chown_common(&path, user, group);
-	mnt_drop_write(path.mnt);
+out_drop_write:
+	mnt_drop_write(mnt);
 out_release:
 	path_put(&path);
+	path_put(&nd.path);
+	putname(tmp);
 out:
 	return error;
 }
-- 
1.6.3.3

--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Felix Fietkau <nbd@openwrt.org>

Add support for whiteout dentries to jffs2.

XXX - David Woodhouse suggests several changes and provides an
untested patch.  See:

http://patchwork.ozlabs.org/patch/50466/

Signed-off-by: Felix Fietkau <nbd@openwrt.org>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: linux-mtd@lists.infradead.org
---
 fs/jffs2/dir.c        |   72 +++++++++++++++++++++++++++++++++++++++++++++++-
 fs/jffs2/fs.c         |    4 +++
 fs/jffs2/super.c      |    2 +-
 include/linux/jffs2.h |    2 +
 4 files changed, 77 insertions(+), 3 deletions(-)

diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c
index 7aa4417..c259193 100644
--- a/fs/jffs2/dir.c
+++ b/fs/jffs2/dir.c
@@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t);
 static int jffs2_rename (struct inode *, struct dentry *,
 			 struct inode *, struct dentry *);
 
+static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *);
+
 const struct file_operations jffs2_dir_operations =
 {
 	.read =		generic_read_dir,
@@ -56,6 +58,7 @@ const struct inode_operations jffs2_dir_inode_operations =
 	.mknod =	jffs2_mknod,
 	.rename =	jffs2_rename,
 	.check_acl =	jffs2_check_acl,
+	.whiteout =     jffs2_whiteout,
 	.setattr =	jffs2_setattr,
 	.setxattr =	jffs2_setxattr,
 	.getxattr =	jffs2_getxattr,
@@ -98,8 +101,14 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target,
 			fd = fd_list;
 		}
 	}
-	if (fd)
-		ino = fd->ino;
+	if (fd) {
+		spin_lock(&target->d_lock);
+		if (fd->type == DT_WHT)
+			target->d_flags |= DCACHE_WHITEOUT;
+		else
+			ino = fd->ino;
+		spin_unlock(&target->d_lock);
+	}
 	mutex_unlock(&dir_f->sem);
 	if (ino) {
 		inode = jffs2_iget(dir_i->i_sb, ino);
@@ -498,6 +507,11 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, int mode)
 		return PTR_ERR(inode);
 	}
 
+	if (dentry->d_flags & DCACHE_WHITEOUT) {
+		inode->i_flags |= ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

This patch adds whiteout support to EXT2. A whiteout is an empty directory
entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it
allocates space in directories. Due to being implemented as a filetype it is
necessary to have the EXT2_FEATURE_INCOMPAT_FILETYPE flag set.

XXX - Needs serious review.  Al wonders: What happens with a delete at
the beginning of a block?  Will we find the matching dentry or the
first empty space?
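
The matching rule described above can be illustrated outside the kernel. This is a sketch only: `struct toy_dirent` is a simplified, hypothetical structure (not the on-disk `ext2_dir_entry_2` layout), and the numeric file type values are assumed. It shows that an entry with inode 0 still matches by name when its file type marks it as a whiteout, while an ordinary unused slot does not.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* File type values are assumed for this sketch. */
enum { TOY_FT_REG_FILE = 1, TOY_FT_WHT = 8 };

/* Simplified directory entry; not the real on-disk layout. */
struct toy_dirent {
	unsigned int inode;		/* 0: unused slot or whiteout */
	unsigned char name_len;
	unsigned char file_type;
	char name[255];
};

/* Returns 1 if de is a live entry or a whiteout with the given name. */
static int toy_match(int len, const char *name, const struct toy_dirent *de)
{
	assert(len >= 0);	/* userspace precondition check */
	if (len != de->name_len)
		return 0;
	/* inode == 0 normally means "empty", unless it is a whiteout */
	if (!de->inode && de->file_type != TOY_FT_WHT)
		return 0;
	return !memcmp(name, de->name, (size_t)len);
}
```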

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
---
 fs/ext2/dir.c           |   96 +++++++++++++++++++++++++++++++++++++++++++++--
 fs/ext2/ext2.h          |    3 +
 fs/ext2/inode.c         |   11 ++++-
 fs/ext2/namei.c         |   67 +++++++++++++++++++++++++++++++-
 fs/ext2/super.c         |    6 +++
 include/linux/ext2_fs.h |    4 ++
 6 files changed, 177 insertions(+), 10 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 57207a9..030bd46 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,7 @@ static inline int ext2_match (int len, const char * const name,
 {
 	if (len != de->name_len)
 		return 0;
-	if (!de->inode)
+	if (!de->inode && (de->file_type != EXT2_FT_WHT))
 		return 0;
 	return !memcmp(name, de->name, len);
 }
@@ -255,6 +255,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
 	[EXT2_FT_FIFO]		= DT_FIFO,
 	[EXT2_FT_SOCK]		= DT_SOCK,
 	[EXT2_FT_SYMLINK]	= DT_LNK,
+	[EXT2_FT_WHT]		= DT_WHT,
 };
 
 #define S_SHIFT 12
@@ -448,6 +449,26 @@ ino_t ext2_inode_by_name(struct inode *dir, struct qstr *child)
 	return res;
 }
 
+/* Special version for filetype based whiteout support */
+ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry)
+{
+	ino_t res = 0;
+	struct ext2_dir_entry_2 *de;
+	struct page *page;
+
+	de = ext2_find_entry (dir, &dentry->d_name, &page);
+	if (de) {
+		res = le32_to_cpu(de->inode);
+		if ...
From: Ian Kent
Date: Monday, July 12, 2010 - 9:24 pm

This looks odd, can someone tell me what's actually going on with de and de1

Is page "always" set in ext2_find_entry(), I couldn't quite make that out?
If dentry is negative, isn't this a use without initialization of page in
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

do_whiteout() allows removal of a directory when it has whiteouts but
is logically empty.

XXX - This patch abuses readdir() to check if the union directory is
logically empty - that is, all the entries are whiteouts (or "." or
"..").  Currently, we have no clean VFS interface to ask the lower
file system if a directory is empty.

Fixes:
 - Add ->is_directory_empty() op
 - Add is_directory_empty flag to dentry (ugly dcache populate)
 - Ask underlying fs to remove it and look for an error return
 - (your idea here)
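
The "logically empty" test described above can be sketched in userspace over an in-memory list of entries. The structure and function names below are hypothetical, and the `TOY_DT_*` constants are local definitions chosen to mirror the usual `d_type` values; this is an illustration of the rule, not the patch's implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Local constants mirroring the usual d_type values. */
#define TOY_DT_DIR	4
#define TOY_DT_REG	8
#define TOY_DT_WHT	14

struct toy_dirent {
	const char *name;
	unsigned char d_type;
};

static int is_dot_or_dotdot(const char *name)
{
	return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/*
 * A directory is logically empty if every entry is ".", ".." or a
 * whiteout.  Returns 1 if empty, 0 otherwise.
 */
static int logically_empty(const struct toy_dirent *ents, size_t n)
{
	size_t i;

	assert(ents != NULL || n == 0);	/* userspace precondition check */
	for (i = 0; i < n; i++) {
		if (is_dot_or_dotdot(ents[i].name))
			continue;
		if (ents[i].d_type == TOY_DT_WHT)
			continue;
		return 0;	/* found a real entry */
	}
	return 1;
}
```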

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 fs/namei.c |   84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 84 insertions(+), 0 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 8c67636..06aad7e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2249,6 +2249,90 @@ static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
 }
 
 /*
+ * XXX - We are abusing readdir to check if a union directory is
+ * logically empty.
+ */
+static int filldir_is_empty(void *__buf, const char *name, int namlen,
+			    loff_t offset, u64 ino, unsigned int d_type)
+{
+	int *is_empty = (int *)__buf;
+
+	switch (namlen) {
+	case 2:
+		if (name[1] != '.')
+			break;
+	case 1:
+		if (name[0] != '.')
+			break;
+		return 0;
+	}
+
+	if (d_type == DT_WHT)
+		return 0;
+
+	(*is_empty) = 0;
+	return 0;
+}
+
+static int directory_is_empty(struct path *path)
+{
+	struct file *file;
+	int err;
+	int is_empty = 1;
+
+	BUG_ON(!S_ISDIR(path->dentry->d_inode->i_mode));
+
+	/* references for the file pointer */
+	path_get(path);
+
+	file = dentry_open(path->dentry, path->mnt, O_RDONLY, current_cred());
+	if (IS_ERR(file))
+		return 0;
+
+	err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+	fput(file);
+	return is_empty;
+}
+
+static int do_whiteout(struct nameidata *nd, struct path *path, int ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

ext2_append_entry() is later used to find or append a directory
entry to white out.

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
---
 fs/ext2/dir.c |   70 ++++++++++++++++++++++++++++++++++++++++----------------
 1 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 7516957..57207a9 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -472,9 +472,10 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de,
 }
 
 /*
- *	Parent is locked.
+ * Find or append a given dentry to the parent directory
  */
-int ext2_add_link (struct dentry *dentry, struct inode *inode)
+static ext2_dirent * ext2_append_entry(struct dentry * dentry,
+				       struct page ** page)
 {
 	struct inode *dir = dentry->d_parent->d_inode;
 	const char *name = dentry->d_name.name;
@@ -482,13 +483,10 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
 	unsigned chunk_size = ext2_chunk_size(dir);
 	unsigned reclen = EXT2_DIR_REC_LEN(namelen);
 	unsigned short rec_len, name_len;
-	struct page *page = NULL;
-	ext2_dirent * de;
+	ext2_dirent * de = NULL;
 	unsigned long npages = dir_pages(dir);
 	unsigned long n;
 	char *kaddr;
-	loff_t pos;
-	int err;
 
 	/*
 	 * We take care of directory expansion in the same loop.
@@ -498,20 +496,19 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode)
 	for (n = 0; n <= npages; n++) {
 		char *dir_end;
 
-		page = ext2_get_page(dir, n, 0);
-		err = PTR_ERR(page);
-		if (IS_ERR(page))
+		*page = ext2_get_page(dir, n, 0);
+		de = ERR_PTR(PTR_ERR(*page));
+		if (IS_ERR(*page))
 			goto out;
-		lock_page(page);
-		kaddr = page_address(page);
+		lock_page(*page);
+		kaddr = page_address(*page);
 		dir_end = kaddr + ext2_last_byte(dir, n);
 		de = (ext2_dirent *)kaddr;
 		kaddr += PAGE_CACHE_SIZE - reclen;
 		while ((char ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

Whiteout a given directory entry.  File systems that support whiteouts
must implement the new ->whiteout() directory inode operation.

XXX - Only whiteout when there is a matching entry in a lower layer.

XXX - MS_WHITEOUT only indicates whiteouts, but we also use it for
fallthrus.  Can we just check root->i_op->whiteout and ->fallthru?  Or
do we need an MS_FALLTHRU?
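
As an illustration of an optional per-file-system operation like the one described above, here is a userspace sketch of dispatching through an operations table. All `toy_` names are hypothetical, and returning `-EOPNOTSUPP` when the operation is missing is an assumption of this sketch, not something confirmed by the text above.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct toy_inode;

/* Hypothetical operations table with one optional method. */
struct toy_inode_operations {
	int (*whiteout)(struct toy_inode *dir, const char *name);
};

struct toy_inode {
	const struct toy_inode_operations *i_op;
	int whiteouts;		/* number of whiteouts created, for the demo */
};

/* A file system that supports whiteouts provides the method. */
static int toy_fs_whiteout(struct toy_inode *dir, const char *name)
{
	(void)name;
	dir->whiteouts++;
	return 0;
}

static const struct toy_inode_operations ops_with_whiteout = {
	.whiteout = toy_fs_whiteout,
};

static const struct toy_inode_operations ops_without_whiteout = {
	.whiteout = NULL,
};

/* Generic entry point: call the method if the file system has one. */
static int toy_vfs_whiteout(struct toy_inode *dir, const char *name)
{
	assert(dir != NULL);	/* userspace precondition check */
	if (!dir->i_op || !dir->i_op->whiteout)
		return -EOPNOTSUPP;	/* assumed error code */
	return dir->i_op->whiteout(dir, name);
}
```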

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
 Documentation/filesystems/vfs.txt |   10 +++++-
 fs/dcache.c                       |    4 ++-
 fs/namei.c                        |   73 ++++++++++++++++++++++++++++++++++++-
 include/linux/dcache.h            |    6 +++
 include/linux/fs.h                |    2 +
 5 files changed, 92 insertions(+), 3 deletions(-)

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3de2f32..8846b4f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -308,7 +308,7 @@ struct inode_operations
 -----------------------
 
 This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. As of kernel 2.6.33, the following members are defined:
 
 struct inode_operations {
 	int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
@@ -319,6 +319,7 @@ struct inode_operations {
 	int (*mkdir) (struct inode *,struct dentry *,int);
 	int (*rmdir) (struct inode *,struct dentry *);
 	int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+	int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
 	int (*rename) (struct inode *, struct dentry *,
 			struct inode *, struct dentry *);
 	int (*readlink) (struct dentry *, char __user *,int);
@@ -382,6 +383,13 @@ otherwise noted.
 	will probably need to call d_instantiate() just as you would
 	in the create() method
 
+  ...
From: Ian Kent
Date: Monday, July 12, 2010 - 8:52 pm

Couple of comments below.




--

From: Valerie Aurora
Date: Friday, July 16, 2010 - 12:50 pm

That's a merge error, thanks!

-VAL
--

From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

Add comments describing what the directions "up" and "down" mean and
ref count handling to the VFS follow_mount() family of functions.
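
For readers unfamiliar with the direction "up", the following userspace sketch models it on a toy mount tree. The `toy_` structures are hypothetical simplifications (dentries are plain strings, there is no reference counting or locking); only the return convention, 1 if we went up a level and 0 if already at the root, is taken from the comment this patch adds.

```c
#include <assert.h>
#include <stddef.h>

/* Toy mount: the root mount is its own parent. */
struct toy_mount {
	struct toy_mount *parent;
	const char *mountpoint;	/* where this mount sits in the parent */
};

struct toy_path {
	struct toy_mount *mnt;
	const char *dentry;
};

/*
 * Replace path with the mountpoint of its mount in the parent mount.
 * Returns 1 if we went up a level, 0 if we were already at the root.
 */
static int toy_follow_up(struct toy_path *path)
{
	struct toy_mount *parent;

	assert(path->mnt != NULL);	/* userspace precondition check */
	parent = path->mnt->parent;
	if (parent == path->mnt)
		return 0;
	path->dentry = path->mnt->mountpoint;
	path->mnt = parent;
	return 1;
}
```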

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c     |   43 +++++++++++++++++++++++++++++++++++++++----
 fs/namespace.c |   16 ++++++++++++++--
 2 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index b86b96f..ec178f1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -596,6 +596,17 @@ loop:
 	return err;
 }
 
+/*
+ * follow_up - Find the mountpoint of path's vfsmount
+ *
+ * Given a path, find the mountpoint of its source file system.
+ * Replace @path with the path of the mountpoint in the parent mount.
+ * Up is towards /.
+ *
+ * Return 1 if we went up a level and 0 if we were already at the
+ * root.
+ */
+
 int follow_up(struct path *path)
 {
 	struct vfsmount *parent;
@@ -616,8 +627,22 @@ int follow_up(struct path *path)
 	return 1;
 }
 
-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * __follow_mount - Return the most recent mount at this mountpoint
+ *
+ * Given a mountpoint, find the most recently mounted file system at
+ * this mountpoint and return the path to its root dentry.  This is
+ * the file system that is visible, and it is in the direction of VFS
+ * "down" - away from the root of the mount tree.  See comments to
+ * lookup_mnt() for an example of "down."
+ *
+ * Does not decrement the refcount on the given mount even if it
+ * follows it to another mount and returns that path instead.
+ *
+ * Returns 0 if path was unchanged, 1 if we followed it to another mount.
+ *
+ * No need for dcache_lock, as serialization is taken care in
+ * namespace.c.
  */
 static int __follow_mount(struct path *path)
 {
@@ -636,6 +661,12 @@ static int __follow_mount(struct path *path)
 	return res;
 }
 
+/*
+ * Like __follow_mount, but no return value and drops references to
+ * both ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

Userspace isn't ready for handling another file type, so silently drop
whiteout directory entries before they leave the kernel.
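
The filtering described above can be sketched as a fill callback that silently skips whiteouts. This is a userspace illustration: the buffer type, the callback signature and the `TOY_DT_*` constants are hypothetical simplifications of what a filldir-style callback receives.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Local constants mirroring the usual d_type values. */
#define TOY_DT_REG	8
#define TOY_DT_WHT	14

#define TOY_MAX_NAMES	8

/* Hypothetical result buffer handed to the callback. */
struct toy_dirent_buf {
	const char *names[TOY_MAX_NAMES];
	size_t count;
};

/*
 * Fill callback: returning 0 tells the caller to continue with the
 * next entry.  Whiteouts are dropped without reporting an error.
 */
static int toy_filldir(void *ctx, const char *name, unsigned int d_type)
{
	struct toy_dirent_buf *buf = ctx;

	assert(buf != NULL);	/* userspace precondition check */
	if (d_type == TOY_DT_WHT)
		return 0;
	if (buf->count == TOY_MAX_NAMES)
		return -EINVAL;	/* buffer full */
	buf->names[buf->count++] = name;
	return 0;
}
```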

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: linux-nfs@vger.kernel.org
Cc: "J. Bruce Fields" <bfields@fieldses.org>
Cc: Neil Brown <neilb@suse.de>
---
 fs/compat.c       |    9 +++++++++
 fs/nfsd/nfs3xdr.c |    5 +++++
 fs/nfsd/nfs4xdr.c |    5 +++++
 fs/nfsd/nfsxdr.c  |    4 ++++
 fs/readdir.c      |    9 +++++++++
 5 files changed, 32 insertions(+), 0 deletions(-)

diff --git a/fs/compat.c b/fs/compat.c
index 0544873..5d88516 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -839,6 +839,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen,
 	struct compat_old_linux_dirent __user *dirent;
 	compat_ulong_t d_ino;
 
+	if (d_type == DT_WHT)
+		return 0;
+
 	if (buf->result)
 		return -EINVAL;
 	d_ino = ino;
@@ -910,6 +913,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen,
 	compat_ulong_t d_ino;
 	int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t));
 
+	if (d_type == DT_WHT)
+		return 0;
+
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
 		return -EINVAL;
@@ -999,6 +1005,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t
 	int reclen = ALIGN(jj + namlen + 1, sizeof(u64));
 	u64 off;
 
+	if (d_type == DT_WHT)
+		return 0;
+
 	buf->error = -EINVAL;	/* only used if we fail.. */
 	if (reclen > buf->count)
 		return -EINVAL;
diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 2a533a0..9b96f5a 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -885,6 +885,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen,
 	int		elen;		/* estimated entry length in words */
 	int		num_entry_words = 0;	/* actual number of words */
 
+	if (d_type ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

While we can check if a file system is currently read-only, we can't
guarantee that it will stay read-only.  The file system can be
remounted read-write at any time; it's also conceivable that a file
system can be mounted a second time and converted to read-write if the
underlying fs allows it.  This is a problem for union mounts, which
require the underlying file system be read-only.  Add a read-only
users count and, while there are any read-only users, refuse both
remounts to read-write and new read-write mounts.
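
The counting scheme described above can be modelled with a small userspace sketch. The `toy_` names are hypothetical, there is no locking, and the specific error codes (`-EBUSY`, `-EROFS`) are assumptions of this sketch rather than values taken from the patch.

```c
#include <assert.h>
#include <errno.h>

/* Toy superblock with a count of "hard" read-only users. */
struct toy_super {
	int readonly;			/* 1 if currently read-only */
	unsigned int hard_readonly_users;
};

/* Take a hard read-only reference; only allowed on a read-only fs. */
static int toy_inc_hard_readonly(struct toy_super *sb)
{
	if (!sb->readonly)
		return -EBUSY;	/* assumed error code */
	sb->hard_readonly_users++;
	return 0;
}

/* Drop a hard read-only reference. */
static void toy_dec_hard_readonly(struct toy_super *sb)
{
	assert(sb->hard_readonly_users > 0);	/* mirrors the BUG_ON idea */
	sb->hard_readonly_users--;
}

/* Refuse a transition to read-write while hard users exist. */
static int toy_remount(struct toy_super *sb, int want_readonly)
{
	if (!want_readonly && sb->hard_readonly_users)
		return -EROFS;	/* assumed error code */
	sb->readonly = want_readonly;
	return 0;
}
```

In this model a union mount would take the reference on each lower layer when it is set up and drop it when the union is unmounted.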

Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/namespace.c     |   11 +++++++++++
 fs/super.c         |   23 +++++++++++++++++++++++
 include/linux/fs.h |    8 ++++++++
 3 files changed, 42 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index d405444..b788cfa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -200,6 +200,17 @@ int __mnt_is_readonly(struct vfsmount *mnt)
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
+static void inc_hard_readonly_users(struct vfsmount *mnt)
+{
+	mnt->mnt_sb->s_hard_readonly_users++;
+}
+
+static void dec_hard_readonly_users(struct vfsmount *mnt)
+{
+	BUG_ON(mnt->mnt_sb->s_hard_readonly_users == 0);
+	mnt->mnt_sb->s_hard_readonly_users--;
+}
+
 static inline void inc_mnt_writers(struct vfsmount *mnt)
 {
 #ifdef CONFIG_SMP
diff --git a/fs/super.c b/fs/super.c
index 1527e6a..6add39b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -118,6 +118,7 @@ out:
  */
 static inline void destroy_super(struct super_block *s)
 {
+	BUG_ON(s->s_hard_readonly_users);
 	security_sb_free(s);
 	kfree(s->s_subtype);
 	kfree(s->s_options);
@@ -557,6 +558,21 @@ out:
 	return err;
 }
 
+/*
+ * Some uses of file systems require that they never be mounted
+ * read-write anywhere (e.g., the lower layers of union mounts must
+ * always be read-only).  If there are any of these "hard" read-only
+ * mounts, don't permit a transition to ...
From: Valerie Aurora
Date: Tuesday, June 15, 2010 - 11:39 am

From: Jan Blunck <jblunck@suse.de>

This patch changes lookup_hash() into returning a struct path.

XXX - Check for correctness, otherwise obvious
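
The shape of this conversion, from "pointer or encoded error" to "error code plus filled-in path", can be shown with a userspace sketch. Everything prefixed `toy_` is hypothetical; the helpers only imitate the kernel's ERR_PTR()/IS_ERR()/PTR_ERR() convention, and the lookup itself is faked with a flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_ERRNO	4095

/* Userspace imitations of the error-pointer helpers. */
static void *toy_err_ptr(long err)
{
	return (void *)(intptr_t)err;
}

static long toy_ptr_err(const void *p)
{
	return (long)(intptr_t)p;
}

static int toy_is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-TOY_MAX_ERRNO;
}

struct toy_path {
	const char *mnt;
	void *dentry;
};

/* Old style: the return value is either a dentry or an encoded errno. */
static void *toy_lookup_old(int fail)
{
	static int dummy_dentry;

	return fail ? toy_err_ptr(-ENOMEM) : (void *)&dummy_dentry;
}

/* New style: return an errno and fill in path; clear it on failure. */
static int toy_lookup(const char *mnt, int fail, struct toy_path *path)
{
	int err = 0;

	assert(path != NULL);	/* userspace precondition check */
	path->mnt = mnt;
	path->dentry = toy_lookup_old(fail);
	if (toy_is_err(path->dentry)) {
		err = (int)toy_ptr_err(path->dentry);
		path->dentry = NULL;
		path->mnt = NULL;
	}
	return err;
}
```

With the new style, callers test a plain integer and never see a half-initialized path.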

Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
 fs/namei.c |  113 ++++++++++++++++++++++++++++++-----------------------------
 1 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index ec178f1..3b43c48 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1155,7 +1155,7 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
 }
 
 static struct dentry *__lookup_hash(struct qstr *name,
-		struct dentry *base, struct nameidata *nd)
+				    struct dentry *base, struct nameidata *nd)
 {
 	struct dentry *dentry;
 	struct inode *inode;
@@ -1212,14 +1212,22 @@ out:
  * needs parent already locked. Doesn't follow mounts.
  * SMP-safe.
  */
-static struct dentry *lookup_hash(struct nameidata *nd)
+static int lookup_hash(struct nameidata *nd, struct qstr *name,
+		       struct path *path)
 {
 	int err;
 
 	err = exec_permission(nd->path.dentry->d_inode);
 	if (err)
-		return ERR_PTR(err);
-	return __lookup_hash(&nd->last, nd->path.dentry, nd);
+		return err;
+	path->mnt = nd->path.mnt;
+	path->dentry =  __lookup_hash(name, nd->path.dentry, nd);
+	if (IS_ERR(path->dentry)) {
+		err = PTR_ERR(path->dentry);
+		path->dentry = NULL;
+		path->mnt = NULL;
+	}
+	return err;
 }
 
 static int __lookup_one_len(const char *name, struct qstr *this,
@@ -1701,12 +1709,9 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 
 	/* OK, it's O_CREAT */
 	mutex_lock(&dir->d_inode->i_mutex);
+	error = lookup_hash(nd, &nd->last, path);
 
-	path->dentry = lookup_hash(nd);
-	path->mnt = nd->path.mnt;
-
-	error = PTR_ERR(path->dentry);
-	if (IS_ERR(path->dentry)) {
+	if (error) {
 		mutex_unlock(&dir->d_inode->i_mutex);
 		goto exit;
 	}
@@ -1958,7 +1963,8 @@ ...
From: Ian Kent
Date: Monday, July 12, 2010 - 9:51 pm

There's a bit of indirection going on here so it isn't clear to me if
--
