This version of union mounts implements two major changes requested by
Al Viro:
* Drastically simplify the union stack for a directory. It is now a
  singly linked list rooted in the dentry of the topmost directory,
  instead of a set of path -> path mappings kept in a hash table. The
  union hash table lookup routines have gone away, along with most of
  struct union_dir.
* On union mount, clone the underlying read-only mounts and keep them
  in a list hanging off the superblock of the topmost file system.
It also includes many other minor fixups, but those are the big
changes.
Patches are against 2.6.34. Git version is in branch "linked_list" of:
git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git
Next up: Rewrite user_path_nd() and associated code, and implement the
rest of Al Viro's code review comments.
-VAL
Felix Fietkau (2):
  whiteout: jffs2 whiteout support
  fallthru: jffs2 fallthru support
Jan Blunck (11):
  VFS: Make lookup_hash() return a struct path
  autofs4: Save autofs trigger's vfsmount in super block info
  whiteout/NFSD: Don't return information about whiteouts to userspace
  whiteout: Add vfs_whiteout() and whiteout inode operation
  whiteout: Set S_OPAQUE inode flag when creating directories
  whiteout: Allow removal of a directory with whiteouts
  whiteout: tmpfs whiteout support
  whiteout: Split of ext2_append_link() from ext2_add_link()
  whiteout: ext2 whiteout support
  union-mount: Introduce MNT_UNION and MS_UNION flags
  union-mount: Call do_whiteout() on unlink and rmdir in unions
Valerie Aurora (25):
  VFS: Comment follow_mount() and friends
  VFS: Add read-only users count to superblock
  fallthru: Basic fallthru definitions
  fallthru: ext2 fallthru support
  fallthru: tmpfs fallthru support
  union-mount: Union mounts documentation
  union-mount: Introduce union_dir structure and basic operations
  union-mount: Free union dirs on removal from dcache
  union-mount: Support for mounting union mount file ...
From: Jan Blunck <jblunck@suse.de>
XXX - This is broken and included just to make union mounts work. See
discussion at:
http://kerneltrap.org/mailarchive/linux-fsdevel/2010/1/15/6708053/thread
Original commit message:
This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:
During a path walk if an autofs trigger is mounted on a dentry,
when the follow_link method is called, the nameidata struct
contains the vfsmount and mountpoint dentry of the parent mount
while the dentry that is passed in is the root of the autofs
trigger mount. I believe it is impossible to get the vfsmount of
the trigger mount, within the follow_link method, when only the
parent vfsmount and the root dentry of the trigger mount are
known.
The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link(). This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).
A cleaner and easier to understand solution is to save the necessary
vfsmount in the autofs superblock info when it is mounted. Then we can
easily update the vfsmount in autofs4_follow_link().
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Acked-by: Ian Kent <raven@themaw.net>
Cc: autofs@linux.kernel.org
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
fs/autofs4/autofs_i.h | 1 +
fs/autofs4/init.c | 11 ++++++++++-
fs/autofs4/root.c | 6 ++++++
fs/namei.c | 7 ++-----
4 files changed, 19 insertions(+), 6 deletions(-)
diff --git a/fs/autofs4/autofs_i.h b/fs/autofs4/autofs_i.h
index 3d283ab..de3af64 100644
--- a/fs/autofs4/autofs_i.h
+++ b/fs/autofs4/autofs_i.h
@@ -133,6 +133,7 @@ struct autofs_sb_info {
int reghost_enabled;
int ...
Instead of saving the vfsmount we could save a pointer to the dentry of the mount point in the autofs super block info struct. I think that's the bit I don't have so it would be sufficient for a lookup_mnt() for the needed vfsmount in ->follow_mount(). Objections? --
I'm not sure... it seems like it would have the same problem that Al described with pinning the vfsmount forever. But I don't know autofs at all. Could you run through a quick example of the case that triggers this problem in the first place? The problem is when you have a symlink that triggers an automount, and you are trying to get from the target of the symlink to the vfsmount of the file system containing the symlink in the first place? Or do I have that wrong? Thanks, -VAL --
That's why I asked.
But I don't see how the dentry can go away since it's covered by the
Ha!
Yes, you would think we were talking about a symlink but this dentry is
a directory, a trigger for a mount that uses ->follow_mount() to do the
mount, similar to the way the NFS client mounts nohide mounts when they
are crossed.
In the autofs case we have:
<path in fs>/dir
<autofs fs (with type direct or offset) mounted on>/dir
When ->follow_link() is called the nameidata has the vfsmount of the
once removed mount because it hasn't yet been updated in (say)
link_path_walk(), but the dentry passed to ->follow_link() is the global
root of the autofs fs so we have no way of discovering the vfsmount or
the dentry upon which the autofs trigger mount is mounted. Which of
course prevents us from mounting and following the trigger.
The example is rather poor, sorry, hope it is sufficient.
--
No comments so far.
Before I dive into testing if this actually does what I need, can I get
an "in principle" ack or nack for the patch so union mounts can move on
please?
Note that this patch hasn't even been compile tested so the point is to
decide whether it is worth going ahead with it.
autofs4 - save autofs trigger mountpoint in super block info
From: Ian Kent <raven@themaw.net>
Adapted from the original patch from Jan Blunck <jblunck@suse.de>.
Original commit message:
This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:
During a path walk if an autofs trigger is mounted on a dentry,
when the follow_link method is called, the nameidata struct
contains the vfsmount and mountpoint dentry of the parent mount
while the dentry that is passed in is the root of the autofs
trigger mount. I believe it is impossible to get the vfsmount of
the trigger mount, within the follow_link method, when only the
parent vfsmount and the root dentry of the trigger mount are
known.
The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link(). This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).
A cleaner and easier to understand solution is to save the necessary
mountpoint dentry in the autofs superblock info when it is mounted.
Then we can look up the needed vfsmount in autofs4_follow_link().
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: autofs@linux.kernel.org
---
fs/autofs4/autofs_i.h | 1 +
fs/autofs4/init.c | 11 ++++++++++-
fs/autofs4/root.c | 13 +++++++++++++
fs/namei.c | 7 ++-----
...mnt_mountpoint is NULL at the point you try to save it, so this is not --
What about this approach then?
autofs4 - lookup vfsmount in follow_link()
From: Ian Kent <raven@themaw.net>
Adapted from the original patch from Jan Blunck <jblunck@suse.de>.
Original commit message:
This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:
During a path walk if an autofs trigger is mounted on a dentry,
when the follow_link method is called, the nameidata struct
contains the vfsmount and mountpoint dentry of the parent mount
while the dentry that is passed in is the root of the autofs
trigger mount. I believe it is impossible to get the vfsmount of
the trigger mount, within the follow_link method, when only the
parent vfsmount and the root dentry of the trigger mount are
known.
The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link(). This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).
A better solution is to lookup the vfsmount when the mount is triggered,
which can be done because binding an autofs file system mount to another
location isn't valid (even though we can't actually veto this from the
autofs module).
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: autofs@linux.kernel.org
---
fs/autofs4/root.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
fs/namei.c | 7 ++-----
fs/namespace.c | 1 +
3 files changed, 50 insertions(+), 5 deletions(-)
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index db4117e..62dbcef 100644
--- a/fs/autofs4/root.c
+++ b/fs/autofs4/root.c
@@ -208,6 +208,40 @@ static int try_to_fill_dentry(struct dentry *dentry, int flags)
return 0;
...dentry->d_subdirs?
parent->dentry->...?
Or how about iterate_mounts() instead of loop over dentries?
For example (just an example),
struct args {
	/* input */
	struct dentry *root;
	/* output */
	struct vfsmount *mnt;
};
static int compare_mnt(struct vfsmount *mnt, void *arg)
{
	struct args *a = arg;
	if (mnt->mnt_root != a->root)
		return 0;
	a->mnt = mntget(mnt);
	return 1;
}
struct vfsmount *autofs4_find_vfsmount(struct dentry *root)
{
	int err;
	struct args args = {
		.root = root
	};
	err = iterate_mounts(compare_mnt, &args, current->nsproxy->mnt_ns);
	/* compare_mnt() returns 1 once it has taken a reference on a match */
	return err ? args.mnt : NULL;
}
J. R. Okajima
--
Yep, thanks, cut and paste error. Like I said, I don't want to go through the test process unless I have something that is, in principle, OK. If whatever approach we use is acceptable, and will work, then I'll put the effort into it. I just don't want to spend a heap of time on something that is basically not the right thing to do. For example, Oh, I'm not up with this, I'll have to check this out, might be useful for more than just this case, thanks for the comments. Ian --
I may be missing something about this, but why is it safe to use iterate_mounts(), since it doesn't take the vfsmount_lock when traversing the list of mounts? Ian --
The sample code was not correct.
We need to acquire vfsmount_lock or down_read(namespace_sem).
Or it may be better to extract the body of iterate_mounts() and create a
new function __iterate_mounts(), something like this:
__iterate_mounts()
{
	/* equiv to the current iterate_mounts */
}
iterate_mounts()
{
	down_read(&namespace_sem);
	or spin_lock(&vfsmount_lock);
	__iterate_mounts();
	spin_unlock(&vfsmount_lock);
	or up_read(&namespace_sem);
}
J. R. Okajima
--
Yep, thought so. That's a useful enough function to warrant that IMHO. I'll continue checking its usages before I do it though. Ian --
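A rough userspace sketch of the split discussed above may help: an unlocked walker that expects the caller to hold the lock, plus a wrapper that takes it. Everything named model_* is hypothetical and only stands in for the kernel objects; the integer flag is a placeholder for namespace_sem or vfsmount_lock, not a real lock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a vfsmount on the namespace's mount list. */
struct model_mnt {
	const void *root;		/* stands in for mnt->mnt_root */
	struct model_mnt *next;
};

/* Placeholder for namespace_sem / vfsmount_lock; not a real lock. */
int model_lock_held;

void model_lock(void)
{
	assert(!model_lock_held);
	model_lock_held = 1;
}

void model_unlock(void)
{
	assert(model_lock_held);
	model_lock_held = 0;
}

/* Unlocked walk: the caller must already hold the lock. */
int model_iterate_mounts_unlocked(int (*f)(struct model_mnt *, void *),
				  void *arg, struct model_mnt *head)
{
	struct model_mnt *m;
	int res;

	assert(model_lock_held);
	for (m = head; m; m = m->next) {
		res = f(m, arg);
		if (res)
			return res;	/* callback asked to stop */
	}
	return 0;
}

/* Locked wrapper: what ordinary callers would use. */
int model_iterate_mounts(int (*f)(struct model_mnt *, void *),
			 void *arg, struct model_mnt *head)
{
	int res;

	model_lock();
	res = model_iterate_mounts_unlocked(f, arg, head);
	model_unlock();
	return res;
}

struct model_args {
	const void *root;		/* input */
	struct model_mnt *mnt;		/* output */
};

/* Same shape as compare_mnt() above; the kernel version would mntget(). */
int model_compare_mnt(struct model_mnt *mnt, void *arg)
{
	struct model_args *a = arg;

	if (mnt->root != a->root)
		return 0;
	a->mnt = mnt;
	return 1;
}
```

In the real kernel the match must be pinned (as the earlier sample does with mntget()) before the lock is dropped, otherwise the returned mount could go away.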
Ok, let's try this again.
The compiler is way smarter than I, so it probably isn't quite so bad
this time. Obviously I need to add a Cc for the audit system maintainer.
autofs4 - lookup vfsmount in follow_link()
From: Ian Kent <raven@themaw.net>
Adapted from the original patch from Jan Blunck <jblunck@suse.de>.
Original commit message:
This is a bugfix/replacement for commit
051d381259eb57d6074d02a6ba6e90e744f1a29f:
During a path walk if an autofs trigger is mounted on a dentry,
when the follow_link method is called, the nameidata struct
contains the vfsmount and mountpoint dentry of the parent mount
while the dentry that is passed in is the root of the autofs
trigger mount. I believe it is impossible to get the vfsmount of
the trigger mount, within the follow_link method, when only the
parent vfsmount and the root dentry of the trigger mount are
known.
The solution in this commit was to replace the path embedded in the
parent's nameidata with the path of the link itself in
__do_follow_link(). This is a relatively harmless misuse of the
field, but union mounts ran into a bug during follow_link() caused by
the nameidata containing the wrong path (we count on it being what it
is all other places - the path of the parent).
A better solution is to lookup the vfsmount when the mount is triggered,
which can be done because binding an autofs file system mount to another
location isn't valid (even though we can't actually veto this from the
autofs module).
Signed-off-by: Ian Kent <raven@themaw.net>
Cc: Jan Blunck <jblunck@suse.de>
Cc: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: autofs@linux.kernel.org
---
fs/autofs4/root.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++++
fs/namei.c | 7 ++-----
fs/namespace.c | 8 ++++++--
3 files changed, 57 insertions(+), 7 deletions(-)
diff --git a/fs/autofs4/root.c b/fs/autofs4/root.c
index db4117e..114959b ...
From: Jan Blunck <jblunck@suse.de>
In case of an union directory we don't want that the directories on lower
layers of the union "show through". So to prevent that the contents of
underlying directories magically shows up after a mkdir() we set the S_OPAQUE
flag if directories are created where a whiteout existed before.
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
fs/namei.c | 11 ++++++++++-
include/linux/fs.h | 3 +++
2 files changed, 13 insertions(+), 1 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 2c723e2..8c67636 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2107,6 +2107,7 @@ SYSCALL_DEFINE3(mknod, const char __user *, filename, int, mode, unsigned, dev)
int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
{
int error = may_create(dir, dentry);
+ int opaque = 0;
if (error)
return error;
@@ -2119,9 +2120,17 @@ int vfs_mkdir(struct inode *dir, struct dentry *dentry, int mode)
if (error)
return error;
+ if (d_is_whiteout(dentry))
+ opaque = 1;
+
error = dir->i_op->mkdir(dir, dentry, mode);
- if (!error)
+ if (!error) {
fsnotify_mkdir(dir, dentry);
+ if (opaque) {
+ dentry->d_inode->i_flags |= S_OPAQUE;
+ mark_inode_dirty(dentry->d_inode);
+ }
+ }
return error;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7afdbd4..e9aa650 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -236,6 +236,7 @@ struct inodes_stat_t {
#define S_NOCMTIME 128 /* Do not update file c/mtime */
#define S_SWAPFILE 256 /* Do not truncate: swapon got its bmaps */
#define S_PRIVATE 512 /* Inode is fs-internal */
+#define S_OPAQUE 1024 /* Directory is opaque */
/*
* Note that nosuid etc flags are inode-specific: setting some file-system
@@ -271,6 +272,8 @@ struct inodes_stat_t {
#define IS_SWAPFILE(inode) ((inode)->i_flags & S_SWAPFILE)
#define IS_PRIVATE(inode) ((inode)->i_flags & S_PRIVATE)
+#define ...
I found this hard to understand. Do you mean: For directories within a union that are whiteouts we don't want the entries of lower layer file system to "show through". To achieve this we set the S_OPAQUE --
That is much clearer. I ended up with this version, what do you think?
whiteout: Set opaque flag if new directory was previously a whiteout
If we mkdir() a directory on the top layer of a union, we don't want
entries from a matching directory on the lower layer to "show through"
suddenly. To prevent this, we set the opaque flag on a directory if
there was previously a white-out with the same name. (If there is no
white-out and the directory exists in a lower layer, then mkdir() will
fail with EEXIST.)
-VAL --
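The rule in that commit message can be modelled in a few lines of userspace C. This is only a sketch of the decision, assuming a single name with one top layer and one lower layer; the model_* types and functions are made up for illustration and are not the patch's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* State of one name on the topmost layer (hypothetical model). */
enum model_state { MODEL_NEGATIVE, MODEL_WHITEOUT, MODEL_DIRECTORY };

struct model_entry {
	enum model_state top;	/* what the topmost layer holds for the name */
	bool opaque;		/* opaque flag on the topmost directory */
	bool lower_exists;	/* same-named directory on the lower layer */
};

/* mkdir() on the top layer of the union. */
int model_mkdir(struct model_entry *e)
{
	bool was_whiteout;

	assert(e != NULL);
	if (e->top == MODEL_DIRECTORY)
		return -EEXIST;
	/* No whiteout hides the lower directory, so the name is taken. */
	if (e->top == MODEL_NEGATIVE && e->lower_exists)
		return -EEXIST;

	was_whiteout = (e->top == MODEL_WHITEOUT);
	e->top = MODEL_DIRECTORY;
	/* A directory replacing a whiteout must not expose lower entries. */
	e->opaque = was_whiteout;
	return 0;
}

/* Would entries of the lower directory be merged into the new one? */
bool model_lower_shows_through(const struct model_entry *e)
{
	return e->top == MODEL_DIRECTORY && e->lower_exists && !e->opaque;
}
```

Without the opaque flag, the first case (mkdir over a whiteout while the lower directory still exists) would make the old lower-layer contents reappear.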
From: Jan Blunck <jblunck@suse.de>
Add support for whiteout dentries to tmpfs. This includes adding
support for whiteouts to d_genocide(), which is called to tear down
pinned tmpfs dentries. Whiteouts have to be persistent, so they have a
pinning extra ref count that needs to be dropped by d_genocide().
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Hugh Dickins <hugh.dickins@tiscali.co.uk>
Cc: linux-mm@kvack.org
---
fs/dcache.c | 13 +++++-
mm/shmem.c | 149 +++++++++++++++++++++++++++++++++++++++++++++++++++++------
2 files changed, 147 insertions(+), 15 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 265015d..3b0e525 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2229,7 +2229,18 @@ resume:
struct list_head *tmp = next;
struct dentry *dentry = list_entry(tmp, struct dentry, d_u.d_child);
next = tmp->next;
- if (d_unhashed(dentry)||!dentry->d_inode)
+ /*
+ * Skip unhashed and negative dentries, but process
+ * positive dentries and whiteouts. A whiteout looks
+ * kind of like a negative dentry for purposes of
+ * lookup, but it has an extra pinning ref count
+ * because it can't be evicted like a negative dentry
+ * can. What we care about here is ref counts - and
+ * we need to drop the ref count on a whiteout before
+ * we can evict it.
+ */
+ if (d_unhashed(dentry)||(!dentry->d_inode &&
+ !d_is_whiteout(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/mm/shmem.c b/mm/shmem.c
index eef4ebe..c58ecf4 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1805,6 +1805,76 @@ static int shmem_statfs(struct dentry *dentry, struct kstatfs *buf)
return 0;
}
+static int shmem_rmdir(struct inode *dir, struct dentry *dentry);
+static int shmem_unlink(struct inode *dir, struct dentry *dentry);
+
+/*
+ * This is the whiteout support for tmpfs. It uses one ...
---
fs/open.c | 25 +++++++++++++++++++++----
1 files changed, 21 insertions(+), 4 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 3c1ae55..336fe01 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -669,18 +669,32 @@ out:
SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;
struct iattr newattrs;
- error = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ error = user_path_nd(dfd, filename, LOOKUP_FOLLOW, &nd,
+ &path, &tmp);
if (error)
goto out;
- inode = path.dentry->d_inode;
- error = mnt_want_write(path.mnt);
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ inode = path.dentry->d_inode;
mutex_lock(&inode->i_mutex);
error = security_path_chmod(path.dentry, path.mnt, mode);
if (error)
@@ -692,9 +706,12 @@ SYSCALL_DEFINE3(fchmodat, int, dfd, const char __user *, filename, mode_t, mode)
error = notify_change(path.dentry, &newattrs);
out_unlock:
mutex_unlock(&inode->i_mutex);
- mnt_drop_write(path.mnt);
+mnt_drop_write_and_out:
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3
--
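The fchmodat() change above follows a pattern that recurs in the chown and setxattr patches: pick the mount to check for write access, take write access, copy up, do the operation, then unwind in reverse order. A minimal userspace sketch of that ordering follows; all model_* names are hypothetical stand-ins, and model_copyup() only records that the target moved to the parent's mount.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical userspace stand-ins, not the kernel structures. */
struct model_mnt {
	bool read_only;
	int writers;		/* outstanding write references */
};

struct model_path {
	struct model_mnt *mnt;	/* mount the target currently lives on */
};

struct model_parent {
	struct model_mnt *mnt;	/* mount of the parent directory */
	bool unioned;		/* stands in for IS_DIR_UNIONED() */
};

int model_want_write(struct model_mnt *mnt)
{
	if (mnt->read_only)
		return -EROFS;
	mnt->writers++;
	return 0;
}

void model_drop_write(struct model_mnt *mnt)
{
	assert(mnt->writers > 0);
	mnt->writers--;
}

/* After a successful copyup the target lives on the parent's mount. */
int model_copyup(const struct model_parent *parent, struct model_path *path)
{
	if (parent->unioned)
		path->mnt = parent->mnt;
	return 0;
}

int model_setattr(const struct model_parent *parent, struct model_path *path)
{
	struct model_mnt *mnt;
	int error;

	/* In a union, write access is checked against the topmost mount. */
	mnt = parent->unioned ? parent->mnt : path->mnt;

	error = model_want_write(mnt);
	if (error)
		goto out;

	error = model_copyup(parent, path);
	if (error)
		goto drop_write;

	/* The attribute change itself would happen here. */
	assert(path->mnt == mnt);

drop_write:
	model_drop_write(mnt);
out:
	return error;
}
```

Checking the target's own mount instead would fail with EROFS for any file that still lives on the read-only lower layer, which is why the parent's mount is used when the parent directory is unioned.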
This patch adds the basic structures and operations of VFS-based union
mounts (but not the ability to mount or lookup unioned file systems).
Each directory in a unioned file system has an associated union stack
created when the directory is first looked up. The union stack is a
union_dir structure kept in a hash table indexed by mount and dentry of
the directory; thus, specific paths are unioned, not dentries alone.
The union_dir keeps a pointer to the upper path and the lower path and
can be looked up by either path. Currently only two layers are
supported, but the union_dir struct is flexible enough to allow more
than two layers.
This particular version of union mounts is based on ideas by Jan
Blunck, Bharata Rao, and many others.
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
fs/Kconfig | 13 +++++
fs/Makefile | 1 +
fs/dcache.c | 3 +
fs/union.c | 119 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 66 ++++++++++++++++++++++++++
include/linux/dcache.h | 4 +-
include/linux/fs.h | 1 +
7 files changed, 206 insertions(+), 1 deletions(-)
create mode 100644 fs/union.c
create mode 100644 fs/union.h
diff --git a/fs/Kconfig b/fs/Kconfig
index 5f85b59..f99c3a9 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -59,6 +59,19 @@ source "fs/notify/Kconfig"
source "fs/quota/Kconfig"
+config UNION_MOUNT
+ bool "Union mounts (writable overlays) (EXPERIMENTAL)"
+ depends on EXPERIMENTAL
+ help
+ Union mounts allow you to mount a transparent writable
+ layer over a read-only file system, for example, an ext3
+ partition on a hard drive over a CD-ROM root file system
+ image.
+
+ See <file:Documentation/filesystems/union-mounts.txt> for details.
+
+ If unsure, say N.
+
source "fs/autofs/Kconfig"
source "fs/autofs4/Kconfig"
source "fs/fuse/Kconfig"
diff --git a/fs/Makefile b/fs/Makefile
index 97f340f..1949af2 100644
--- ...
I did a quick review and think this is right. The SLAB_PANIC flag in combination with this being called early in boot means it will panic Nope, fixed. Thanks, -VAL --
This botches the carefully tuned length of struct dentry. At least a FIXME comment needs to be added that this is something to be addressed. Why was the hash table concept dropped? The header comment still talks about that? Miklos --
Simply, Al Viro didn't like it. But note that the current implementation still uses part of the hash table solution. You still have union_dir structures external to dentries for the read-only layers of the stack. The change is from Al's observation that the topmost dentry could only be part of one stack. Why do a lookup on the topmost dentry when you could keep a pointer to the stack in the dentry itself and skip the lookup? Once you have the head of the stack, you don't need lookup for the rest of it. This eliminates all the lookup machinery and the union hash table lock, which seems like a big win to me. The biggest drawback of the hash table in my mind was that it introduced a new global synchronization point in lookup. Making it go fast would be dcache lookup optimization all over again. Thanks, -VAL --
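To make the shape of that design concrete, here is a small userspace sketch of a union stack kept as a singly linked list hanging off the topmost directory. The model_dir and model_union types are invented for illustration; they are not struct dentry or struct union_dir, and the real field names may differ.

```c
#include <assert.h>
#include <stddef.h>

struct model_union;

/* Hypothetical stand-in for a directory dentry. */
struct model_dir {
	const char *layer;		/* which file system it lives on */
	struct model_union *stack;	/* set only on the topmost directory */
};

/* One element per read-only layer below the topmost directory. */
struct model_union {
	struct model_dir *lower;	/* directory on the next layer down */
	struct model_union *next;	/* further layers, or NULL */
};

/* Append a lower-layer directory to the stack rooted in @top. */
void model_union_push(struct model_dir *top, struct model_union *elem,
		      struct model_dir *lower)
{
	struct model_union **p;

	assert(top != NULL && elem != NULL && lower != NULL);
	elem->lower = lower;
	elem->next = NULL;
	/* Walk to the tail so layers stay in top-to-bottom order. */
	for (p = &top->stack; *p; p = &(*p)->next)
		;
	*p = elem;
}

/* Count layers by following the list: no hash lookup, no global lock. */
int model_union_depth(const struct model_dir *top)
{
	const struct model_union *u;
	int depth = 1;			/* the topmost directory itself */

	assert(top != NULL);
	for (u = top->stack; u; u = u->next)
		depth++;
	return depth;
}
```

The cost of this layout is the extra pointer in every directory entry, which is the space concern raised in the reply that follows.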
That dentry field will be unused most of the time and we lose space for d_iname for *all* filesystems. On 64bit this results in max inline name going from 32 down to 24 bytes. On my root fs 7% of names are 24-31 in length. That's more than triple that of names which are more than 32 in length. Yeah, union mounts can be configured out, but that's not much I already asked this, but I'll ask again, what about doing this with a union filesystem? That solves this problem in one simple go, as well as a host of others. I'll do some experimenting because I feel it should be possible to do all this in a union fs with most of the advantages of union mounts. That doesn't mean it won't need any VFS support, but I think the amount of VFS burden can be considerably reduced with that approach at a small price (just dentry tree duplication). Miklos --
That would be great. My theory on the current version is to do everything in the VFS except when it is much cleaner to make minor changes to the underlying fs. I went this way because I'd worked on a stacked file system version and couldn't see how to avoid the complexity that unionfs ran into. But a VFS/stacked fs hybrid might look nicer than a VFS/low-level fs hybrid. -VAL --
Implement unioned directories, whiteouts, and fallthrus in pathname
lookup routines. do_lookup() and lookup_hash() call lookup_union()
after looking up the dentry from the top-level file system.
lookup_union() is centered around __lookup_hash(), which does cached
and/or real lookups and revalidates each dentry in the union stack.
XXX - implement negative union cache entries
XXX - What about different permissions on different layers on the same
directory name? Should complain, fail, test permissions on all layers,
what?
---
fs/namei.c | 171 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
fs/union.c | 94 +++++++++++++++++++++++++++++++++
fs/union.h | 7 +++
3 files changed, 271 insertions(+), 1 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 06aad7e..45be5e5 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -35,6 +35,7 @@
#include <asm/uaccess.h>
#include "internal.h"
+#include "union.h"
/* [Feb-1997 T. Schoebel-Theuer]
* Fundamental changes in the pathname lookup mechanisms (namei)
@@ -722,6 +723,160 @@ static __always_inline void follow_dotdot(struct nameidata *nd)
follow_mount(&nd->path);
}
+static struct dentry *__lookup_hash(struct qstr *name, struct dentry *base,
+ struct nameidata *nd);
+
+/*
+ * __lookup_union - Given a path from the topmost layer, lookup and
+ * revalidate each dentry in its union stack, building it if necessary
+ *
+ * @nd - nameidata for the parent of @topmost
+ * @name - pathname from this element on
+ * @topmost - path of the topmost matching dentry
+ *
+ * Given the nameidata and the path of the topmost dentry for this
+ * pathname, lookup, revalidate, and build the associated union stack.
+ * @topmost must be either a negative dentry or a directory, and not a
+ * whiteout.
+ *
+ * This function may stomp nd->path with the path of the parent
+ * directory of lower layer, so the caller must save nd->path and
+ * restore it afterwards. You probably want to use lookup_union(),
+ * ...
It's also the head of the list. Good anti-comment, there. Fixed, thanks! -VAL --
For union mounts, a file located on the lower layer will incorrectly
return EROFS on an access check. To fix this, use the new
path_permission() call, which ignores a read-only lower layer file
system if the target will be copied up to the topmost file system.
---
fs/open.c | 21 +++++++++++++++++----
1 files changed, 17 insertions(+), 4 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 74e5cd9..7f7958e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -32,6 +32,7 @@
#include <linux/ima.h>
#include "internal.h"
+#include "union.h"
int vfs_statfs(struct dentry *dentry, struct kstatfs *buf)
{
@@ -454,7 +455,10 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
const struct cred *old_cred;
struct cred *override_cred;
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int res;
if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
@@ -478,10 +482,17 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
old_cred = override_creds(override_cred);
- res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ res = user_path_nd(dfd, filename, LOOKUP_FOLLOW,
+ &nd, &path, &tmp);
if (res)
goto out;
+ /* For union mounts, use the topmost mnt's permissions */
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
inode = path.dentry->d_inode;
if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
@@ -490,11 +501,11 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* with the "noexec" flag.
*/
res = -EACCES;
- if (path.mnt->mnt_flags & MNT_NOEXEC)
+ if (mnt->mnt_flags & MNT_NOEXEC)
goto out_path_release;
}
- res = inode_permission(inode, mode | MAY_ACCESS);
+ res = path_permission(&path, &nd.path, mode | MAY_ACCESS);
/* SuS v2 requires we report a read only fs too */
if (res || !(mode & S_IWOTH) || ...
---
fs/namei.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 505b51d..d2f2618 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2938,16 +2938,18 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
struct dentry *new_dentry;
struct nameidata nd;
+ struct nameidata old_nd;
struct path old_path;
int error;
char *to;
+ char *from;
if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
return -EINVAL;
- error = user_path_at(olddfd, oldname,
+ error = user_path_nd(olddfd, oldname,
flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
- &old_path);
+ &old_nd, &old_path, &from);
if (error)
return error;
@@ -2955,8 +2957,20 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
goto out;
error = -EXDEV;
- if (old_path.mnt != nd.path.mnt)
- goto out_release;
+ if (old_path.mnt != nd.path.mnt) {
+ if (IS_DIR_UNIONED(old_nd.path.dentry) &&
+ (old_nd.path.mnt == nd.path.mnt)) {
+ error = mnt_want_write(old_nd.path.mnt);
+ if (error)
+ goto out_release;
+ error = union_copyup(&old_nd, &old_path);
+ mnt_drop_write(old_nd.path.mnt);
+ if (error)
+ goto out_release;
+ } else {
+ goto out_release;
+ }
+ }
new_dentry = lookup_create(&nd, 0);
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
@@ -2979,6 +2993,8 @@ out_release:
putname(to);
out:
path_put(&old_path);
+ path_put(&old_nd.path);
+ putname(from);
return error;
}
--
1.6.3.3
--
Copy up a file when opened with write permissions. Does not copy up
the file data when O_TRUNC is specified.
---
fs/namei.c | 28 ++++++++++++++++++++++++++++
1 files changed, 28 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 6096413..7514096 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1907,6 +1907,24 @@ exit:
return ERR_PTR(error);
}
+static int open_union_copyup(struct nameidata *nd, struct path *path,
+ int open_flag)
+{
+ struct vfsmount *oldmnt = path->mnt;
+ int error;
+
+ if (open_flag & O_TRUNC)
+ error = union_copyup_len(nd, path, 0);
+ else
+ error = union_copyup(nd, path);
+ if (error)
+ return error;
+ if (oldmnt != path->mnt)
+ mntput(nd->path.mnt);
+
+ return error;
+}
+
static struct file *do_last(struct nameidata *nd, struct path *path,
int open_flag, int acc_mode,
int mode, const char *pathname)
@@ -1958,6 +1976,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (!path->dentry->d_inode->i_op->lookup)
goto exit_dput;
}
+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
path_to_nameidata(path, nd);
audit_inode(pathname, nd->path.dentry);
goto ok;
@@ -2029,6 +2052,11 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (path->dentry->d_inode->i_op->follow_link)
return NULL;
+ if (acc_mode & MAY_WRITE) {
+ error = open_union_copyup(nd, path, open_flag);
+ if (error)
+ goto exit_dput;
+ }
path_to_nameidata(path, nd);
error = -EISDIR;
if (S_ISDIR(path->dentry->d_inode->i_mode))
--
1.6.3.3
--
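The O_TRUNC shortcut in the patch above comes down to choosing how many bytes of file data to copy up. A tiny userspace sketch of that choice is below; model_copyup_len() is a hypothetical helper, and it assumes that a copyup without O_TRUNC copies the whole file, which the commit message implies but does not state.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>

/*
 * How many bytes of data to copy from the lower-layer file when it is
 * opened for writing (hypothetical helper, not the patch's function).
 */
off_t model_copyup_len(int open_flag, off_t lower_size)
{
	assert(lower_size >= 0);
	/* The data is about to be thrown away, so copy none of it. */
	if (open_flag & O_TRUNC)
		return 0;
	/* Assumption: otherwise the full file contents are copied up. */
	return lower_size;
}
```

This mirrors the union_copyup_len(nd, path, 0) call in open_union_copyup(): the file is still created on the topmost layer, only the data copy is skipped.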
On rename() of a file on union mount, copyup and whiteout the source
file. Both are done under the rename mutex. I believe this is
actually atomic.
XXX - May not need to do file copyup under the lock.
XXX - Convert newly empty unioned dirs to not-unioned
---
fs/namei.c | 76 +++++++++++++++++++++++++++++++++++++++++++++++++++++++----
1 files changed, 70 insertions(+), 6 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index d2f2618..6096413 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -3155,6 +3155,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
{
struct dentry *old_dir, *new_dir;
struct path old, new;
+ struct path to_whiteout = {NULL, NULL};
struct dentry *trap;
struct nameidata oldnd, newnd;
char *from;
@@ -3170,13 +3171,9 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
goto exit1;
error = -EXDEV;
+ /* Union mounts will pass below test - dirs always on topmost */
if (oldnd.path.mnt != newnd.path.mnt)
goto exit2;
- /* Rename on union mounts not implemented yet */
- /* XXX much harsher check than necessary - can do some renames */
- if (IS_DIR_UNIONED(oldnd.path.dentry) ||
- IS_DIR_UNIONED(newnd.path.dentry))
- goto exit2;
old_dir = oldnd.path.dentry;
error = -EBUSY;
if (oldnd.last_type != LAST_NORM)
@@ -3199,7 +3196,7 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -ENOENT;
if (!old.dentry->d_inode)
goto exit4;
- /* unless the source is a directory trailing slashes give -ENOTDIR */
+ /* unless the source is a directory, trailing slashes give -ENOTDIR */
if (!S_ISDIR(old.dentry->d_inode->i_mode)) {
error = -ENOTDIR;
if (oldnd.last.name[oldnd.last.len])
@@ -3211,6 +3208,11 @@ SYSCALL_DEFINE4(renameat, int, olddfd, const char __user *, oldname,
error = -EINVAL;
if (old.dentry == trap)
goto exit4;
+ error = -EXDEV;
+ /* Can't rename a directory from a lower layer */
+ if (IS_DIR_UNIONED(oldnd.path.dentry) &&
+ ...---
fs/open.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 7f7958e..68c97dd 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -718,18 +718,35 @@ static int chown_common(struct path *path, uid_t user, gid_t group)
SYSCALL_DEFINE3(chown, const char __user *, filename, uid_t, user, gid_t, group)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;
- error = user_path(filename, &path);
+ error = user_path_nd(AT_FDCWD, filename, LOOKUP_FOLLOW,
+ &nd, &path, &tmp);
if (error)
goto out;
- error = mnt_want_write(path.mnt);
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto out_release;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
+out_drop_write:
+ mnt_drop_write(mnt);
out_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3
--
---
fs/xattr.c | 31 +++++++++++++++++++++++++------
1 files changed, 25 insertions(+), 6 deletions(-)
diff --git a/fs/xattr.c b/fs/xattr.c
index 66bb5c7..4e2b5f6 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -320,17 +320,36 @@ SYSCALL_DEFINE5(lsetxattr, const char __user *, pathname,
size_t, size, int, flags)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;
- error = user_lpath(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
if (error)
return error;
- error = mnt_want_write(path.mnt);
- if (!error) {
- error = setxattr(path.dentry, name, value, size, flags);
- mnt_drop_write(path.mnt);
- }
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
+ if (error)
+ goto out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
+ error = setxattr(path.dentry, name, value, size, flags);
+
+out_drop_write:
+ mnt_drop_write(mnt);
+out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
return error;
}
--
1.6.3.3
--
When a file on the read-only layer of a union mount is altered, it
must be copied up to the topmost read-write layer. This patch creates
union_copyup() and its supporting routines.
Thanks to Valdis Kletnieks for a bug fix.
Cc: Valdis.Kletnieks@vt.edu
---
fs/union.c | 323 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/union.h | 7 +-
2 files changed, 329 insertions(+), 1 deletions(-)
diff --git a/fs/union.c b/fs/union.c
index 76a6c34..0982446 100644
--- a/fs/union.c
+++ b/fs/union.c
@@ -24,6 +24,8 @@
#include <linux/namei.h>
#include <linux/file.h>
#include <linux/security.h>
+#include <linux/splice.h>
+#include <linux/xattr.h>
#include "union.h"
@@ -191,6 +193,72 @@ int needs_lookup_union(struct path *parent_path, struct path *path)
return 1;
}
+/**
+ * union_copyup_xattr
+ *
+ * @old: dentry of original file
+ * @new: dentry of new copy
+ *
+ * Copy up extended attributes from the original file to the new one.
+ *
+ * XXX - Permissions? For now, copying up every xattr.
+ */
+
+static int union_copyup_xattr(struct dentry *old, struct dentry *new)
+{
+ ssize_t list_size, size;
+ char *buf, *name, *value;
+ int error;
+
+ /* Check for xattr support */
+ if (!old->d_inode->i_op->getxattr ||
+ !new->d_inode->i_op->getxattr)
+ return 0;
+
+ /* Find out how big the list of xattrs is */
+ list_size = vfs_listxattr(old, NULL, 0);
+ if (list_size <= 0)
+ return list_size;
+
+ /* Allocate memory for the list */
+ buf = kzalloc(list_size, GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+
+ /* Allocate memory for the xattr's value */
+ error = -ENOMEM;
+ value = kmalloc(XATTR_SIZE_MAX, GFP_KERNEL);
+ if (!value)
+ goto out;
+
+ /* Actually get the list of xattrs */
+ list_size = vfs_listxattr(old, buf, list_size);
+ if (list_size <= 0) {
+ error = list_size;
+ goto out_free_value;
+ }
+
+ for (name = buf; name < (buf + list_size); name += strlen(name) + 1) {
+ /* XXX Locking? old is on read-only fs ...
It checks if len (the size of the file to be copied up) will overflow size_t or ssize_t on this machine. The file could have been created on a 64-bit box, and be too big to be manipulated on a 32-bit box. It could use a comment, fixed. -VAL --
What happens if there's a crash in the middle of the copyup? Possible solution is using rename to atomically "replace" the underlying file. That however introduces namespace issues: where to put the temporary file which then needs to be deleted on "fsck.union"? Miklos --
This kind of problem is what makes union mounts so much fun to work on!! </sarcasm> So far this version of union mounts has kept the namespace clean, so I'd like to keep it that way. One of my ideas is to mark the new file as "copy-in-progress" and if we encounter such a file, we restart the copyup again. But how to mark it? A new inode flag? This applies in some form to directory copyup too. However, we already have a flag we use to indicate that it's copied up - the opaque flag. I moved that to be set after the directory entries are copied up. If it crashes in the middle, it can be safely restarted the next time we call readdir() on that directory. I added a comment to the commit message describing the problem, so it's at least documented. -VAL --
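To make the "set the opaque flag last" idea concrete, here is a small userspace model, under the assumption that copying an entry is idempotent and that the opaque flag is the only record of completion. All toy_* names and the crash_after parameter are invented for the illustration; none of this is kernel code.

```c
#include <assert.h>
#include <string.h>

#define MAX_ENTRIES 16

/* Toy in-memory directory; all names here are illustrative only. */
struct toy_dir {
	const char *names[MAX_ENTRIES];
	int count;
	int opaque;	/* set only once copyup has fully completed */
};

static int toy_has(const struct toy_dir *d, const char *name)
{
	int i;

	for (i = 0; i < d->count; i++)
		if (strcmp(d->names[i], name) == 0)
			return 1;
	return 0;
}

/*
 * Copy entries from lower into top. crash_after simulates a crash once
 * that many entries have been copied (negative means no crash).
 * Returns 0 when the copyup is complete, -1 if it was interrupted.
 */
static int toy_copyup_dir(struct toy_dir *top, const struct toy_dir *lower,
			  int crash_after)
{
	int i, copied = 0;

	if (top->opaque)
		return 0;	/* already complete, nothing to do */

	for (i = 0; i < lower->count; i++) {
		if (toy_has(top, lower->names[i]))
			continue;	/* copied by an earlier attempt */
		if (crash_after >= 0 && copied == crash_after)
			return -1;	/* "crash": opaque flag never set */
		if (top->count == MAX_ENTRIES)
			return -1;
		top->names[top->count++] = lower->names[i];
		copied++;
	}
	top->opaque = 1;	/* last step: mark the copyup as complete */
	return 0;
}
```

Because the flag is written only after the last entry, an interrupted run leaves the directory looking "not yet copied up", and the next readdir() simply runs the loop again and skips what is already there.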
Split inode_permission() into inode and file-system-dependent parts.
Create path_permission() to check permission based on the path to the
inode. This is for union mounts, in which an inode can be located on
a read-only lower layer file system but is still writable, since we
will copy it up to the writable top layer file system. So in that
case, we want to ignore MS_RDONLY on the lower layer. To make this
decision, we must know the path (vfsmount, dentry) of both the target
and its parent.
XXX - so ugly!
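A userspace sketch of the decision, for illustration only. It mirrors the "if the parent is unioned, use the parent's mount" pattern that the syscall patches in this series use before mnt_want_write(); the toy_* types and field names are stand-ins, not the kernel's.

```c
#include <assert.h>
#include <errno.h>

/* Toy stand-ins for kernel objects; field names are illustrative. */
struct toy_mount {
	int read_only;		/* models a read-only mount */
};

struct toy_path {
	struct toy_mount *mnt;
	int dir_unioned;	/* models IS_DIR_UNIONED() on this dentry */
};

/*
 * Read-only check for a write to target, given the path of its parent
 * directory. If the parent is unioned, the target will be copied up to
 * the topmost layer, so the parent's (topmost) mount decides and the
 * read-only state of the lower layer is ignored.
 */
static int toy_write_rofs_check(const struct toy_path *target,
				const struct toy_path *parent)
{
	if (parent->dir_unioned)
		return parent->mnt->read_only ? -EROFS : 0;
	return target->mnt->read_only ? -EROFS : 0;
}
```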
---
fs/namei.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--------
include/linux/fs.h | 1 +
2 files changed, 79 insertions(+), 14 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 1e6adf7..4fd431e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -241,29 +241,20 @@ int generic_permission(struct inode *inode, int mask,
}
/**
- * inode_permission - check for access rights to a given inode
+ * __inode_permission - check for access rights to a given inode
* @inode: inode to check permission on
* @mask: right to check for (%MAY_READ, %MAY_WRITE, %MAY_EXEC)
*
* Used to check for read/write/execute permissions on an inode.
- * We use "fsuid" for this, letting us set arbitrary permissions
- * for filesystem access without changing the "normal" uids which
- * are used for other things.
+ *
+ * This does not check for a read-only file system. You probably want
+ * inode_permission().
*/
-int inode_permission(struct inode *inode, int mask)
+static int __inode_permission(struct inode *inode, int mask)
{
int retval;
if (mask & MAY_WRITE) {
- umode_t mode = inode->i_mode;
-
- /*
- * Nobody gets write access to a read-only fs.
- */
- if (IS_RDONLY(inode) &&
- (S_ISREG(mode) || S_ISDIR(mode) || S_ISLNK(mode)))
- return -EROFS;
-
/*
* Nobody gets write access to an immutable file.
*/
@@ -288,6 +279,79 @@ int inode_permission(struct inode *inode, int mask)
}
/**
+ * sb_permission - check ...From: Jan Blunck <jblunck@suse.de>
Add per mountpoint flag for Union Mount support. You need additional patches
to util-linux for that to work - see:
git://git.kernel.org/pub/scm/utils/util-linux-ng/val/util-linux-ng.git
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
fs/namespace.c | 5 ++++-
include/linux/fs.h | 1 +
include/linux/mount.h | 4 ++--
3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index b788cfa..7a399ba 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -808,6 +808,7 @@ static void show_mnt_opts(struct seq_file *m, struct vfsmount *mnt)
{ MNT_NODIRATIME, ",nodiratime" },
{ MNT_RELATIME, ",relatime" },
{ MNT_STRICTATIME, ",strictatime" },
+ { MNT_UNION, ",union" },
{ 0, NULL }
};
const struct proc_fs_info *fs_infop;
@@ -2018,10 +2019,12 @@ long do_mount(char *dev_name, char *dir_name, char *type_page,
mnt_flags &= ~(MNT_RELATIME | MNT_NOATIME);
if (flags & MS_RDONLY)
mnt_flags |= MNT_READONLY;
+ if (flags & MS_UNION)
+ mnt_flags |= MNT_UNION;
flags &= ~(MS_NOSUID | MS_NOEXEC | MS_NODEV | MS_ACTIVE |
MS_NOATIME | MS_NODIRATIME | MS_RELATIME| MS_KERNMOUNT |
- MS_STRICTATIME);
+ MS_STRICTATIME | MS_UNION);
if (flags & MS_REMOUNT)
retval = do_remount(&path, flags & ~MS_REMOUNT, mnt_flags,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b59cd7b..dbd9881 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -192,6 +192,7 @@ struct inodes_stat_t {
#define MS_REMOUNT 32 /* Alter flags of a mounted FS */
#define MS_MANDLOCK 64 /* Allow mandatory locks on an FS */
#define MS_DIRSYNC 128 /* Directory modifications are synchronous */
+#define MS_UNION 256 /* Merge namespace with FS mounted below */
#define MS_NOATIME 1024 /* Do not update access times. */
#define MS_NODIRATIME 2048 /* Do not update directory access times */
#define MS_BIND 4096
diff --git ...If a dentry is removed from dentry cache because its usage count drops
to zero, the union_dirs in its union stack are freed too.
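Since the union stack is now a singly linked list rooted in the topmost dentry, freeing it amounts to a plain list walk. A userspace sketch of that walk follows; the toy_* structures are simplified stand-ins and do not reflect the real layout of struct union_dir or struct dentry.

```c
#include <assert.h>
#include <stdlib.h>

/* Illustrative node, not the kernel's struct union_dir. */
struct toy_union_dir {
	int layer;
	struct toy_union_dir *next;	/* next lower layer, NULL at bottom */
};

struct toy_dentry {
	struct toy_union_dir *union_stack;	/* head of the list */
};

/* Add one layer at the head of the stack; returns 0 or -1 on ENOMEM. */
static int toy_union_push(struct toy_dentry *dentry, int layer)
{
	struct toy_union_dir *ud = malloc(sizeof(*ud));

	if (!ud)
		return -1;
	ud->layer = layer;
	ud->next = dentry->union_stack;
	dentry->union_stack = ud;
	return 0;
}

/* Free every node in the stack; returns how many were freed. */
static int toy_free_unions(struct toy_dentry *dentry)
{
	struct toy_union_dir *ud = dentry->union_stack, *next;
	int freed = 0;

	while (ud) {
		next = ud->next;	/* save before the node goes away */
		free(ud);
		ud = next;
		freed++;
	}
	dentry->union_stack = NULL;
	return freed;
}
```

Calling the free routine a second time is harmless in this model because the head is reset to NULL, which matters when several teardown paths may reach the same dentry.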
---
fs/dcache.c | 11 +++++++++++
1 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 54ff5a3..ce54dc5 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -34,6 +34,7 @@
#include <linux/fs_struct.h>
#include <linux/hardirq.h>
#include "internal.h"
+#include "union.h"
int sysctl_vfs_cache_pressure __read_mostly = 100;
EXPORT_SYMBOL_GPL(sysctl_vfs_cache_pressure);
@@ -175,6 +176,7 @@ static struct dentry *d_kill(struct dentry *dentry)
dentry_stat.nr_dentry--; /* For d_free, below */
/*drops the locks, at that point nobody can reach this dentry */
dentry_iput(dentry);
+ d_free_unions(dentry);
if (IS_ROOT(dentry))
parent = NULL;
else
@@ -696,6 +698,7 @@ static void shrink_dcache_for_umount_subtree(struct dentry *dentry)
iput(inode);
}
+ d_free_unions(dentry);
d_free(dentry);
/* finished when we fall off the top of the tree,
@@ -1535,6 +1538,7 @@ void d_delete(struct dentry * dentry)
isdir = S_ISDIR(dentry->d_inode->i_mode);
if (atomic_read(&dentry->d_count) == 1) {
dentry_iput(dentry);
+ d_free_unions(dentry);
fsnotify_nameremove(dentry, isdir);
return;
}
@@ -1545,6 +1549,13 @@ void d_delete(struct dentry * dentry)
spin_unlock(&dentry->d_lock);
spin_unlock(&dcache_lock);
+ /*
+ * Remove any associated unions. While someone still has this
+ * directory open (ref count > 0), we could not have deleted
+ * it unless it was empty, and therefore has no references to
+ * directories below it. So we don't need the unions.
+ */
+ d_free_unions(dentry);
fsnotify_nameremove(dentry, isdir);
}
EXPORT_SYMBOL(d_delete);
--
1.6.3.3
--
From: Jan Blunck <jblunck@suse.de>
Call do_whiteout() when removing files and directories from a union
mounted file system.
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
fs/namei.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 45be5e5..1e6adf7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2592,6 +2592,10 @@ static long do_rmdir(int dfd, const char __user *pathname)
error = security_path_rmdir(&nd.path, path.dentry);
if (error)
goto exit4;
+ if (IS_DIR_UNIONED(nd.path.dentry)) {
+ error = do_whiteout(&nd, &path, 1);
+ goto exit4;
+ }
error = vfs_rmdir(nd.path.dentry->d_inode, path.dentry);
exit4:
mnt_drop_write(nd.path.mnt);
@@ -2681,6 +2685,10 @@ static long do_unlinkat(int dfd, const char __user *pathname)
error = security_path_unlink(&nd.path, path.dentry);
if (error)
goto exit3;
+ if (IS_DIR_UNIONED(nd.path.dentry)) {
+ error = do_whiteout(&nd, &path, 0);
+ goto exit3;
+ }
error = vfs_unlink(nd.path.dentry->d_inode, path.dentry);
exit3:
mnt_drop_write(nd.path.mnt);
--
1.6.3.3
--
---
fs/open.c | 24 ++++++++++++++++++++----
1 files changed, 20 insertions(+), 4 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 68c97dd..3c1ae55 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -230,14 +230,17 @@ int do_truncate(struct dentry *dentry, loff_t length, unsigned int time_attrs,
static long do_sys_truncate(const char __user *pathname, loff_t length)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
struct inode *inode;
+ char *tmp;
int error;
error = -EINVAL;
if (length < 0) /* sorry, but loff_t says... */
goto out;
- error = user_path(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, 0, &nd, &path, &tmp);
if (error)
goto out;
inode = path.dentry->d_inode;
@@ -251,11 +254,16 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (!S_ISREG(inode->i_mode))
goto dput_and_out;
- error = mnt_want_write(path.mnt);
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto dput_and_out;
- error = inode_permission(inode, MAY_WRITE);
+ error = path_permission(&path, &nd.path, MAY_WRITE);
if (error)
goto mnt_drop_write_and_out;
@@ -263,6 +271,12 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
if (IS_APPEND(inode))
goto mnt_drop_write_and_out;
+ error = union_copyup_len(&nd, &path, length);
+ if (error)
+ goto mnt_drop_write_and_out;
+
+ /* path may have changed after copyup */
+ inode = path.dentry->d_inode;
error = get_write_access(inode);
if (error)
goto mnt_drop_write_and_out;
@@ -284,9 +298,11 @@ static long do_sys_truncate(const char __user *pathname, loff_t length)
put_write_and_out:
put_write_access(inode);
mnt_drop_write_and_out:
- mnt_drop_write(path.mnt);
+ mnt_drop_write(mnt);
dput_and_out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3
--
Create and tear down union mount structures on mount. Check
requirements for union mounts. This version clones the read-only
mounts and puts them in an array hanging off the superblock of the
topmost layer.
XXX - need array? maybe use mnt_child or mnt_hash instead
Thanks to Felix Fietkau <nbd@openwrt.org> for a bug fix.
---
fs/namespace.c | 231 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/super.c | 1 +
include/linux/fs.h | 3 +
include/linux/mount.h | 2 +
4 files changed, 235 insertions(+), 2 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index 7a399ba..9f3884c 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -33,6 +33,7 @@
#include <asm/unistd.h>
#include "pnode.h"
#include "internal.h"
+#include "union.h"
#define HASH_SHIFT ilog2(PAGE_SIZE / sizeof(struct list_head))
#define HASH_SIZE (1UL << HASH_SHIFT)
@@ -1049,6 +1050,7 @@ void umount_tree(struct vfsmount *mnt, int propagate, struct list_head *kill)
propagate_umount(kill);
list_for_each_entry(p, kill, mnt_hash) {
+ d_free_unions(p->mnt_root);
list_del_init(&p->mnt_expire);
list_del_init(&p->mnt_list);
__touch_mnt_namespace(p->mnt_ns);
@@ -1334,6 +1336,193 @@ static int invent_group_ids(struct vfsmount *mnt, bool recurse)
return 0;
}
+/**
+ * check_mnt_union - mount-time checks for union mount
+ *
+ * @mntpnt: path of the mountpoint the new mount will be on
+ * @topmost_mnt: vfsmount of the new file system to be mounted
+ * @mnt_flags: mount flags for the new file system
+ *
+ * Mount-time check of upper and lower layer file systems to see if we
+ * can union mount one on the other.
+ *
+ * The rules:
+ *
+ * Lower layer(s) read-only: We can't deal with namespace changes in
+ * the lower layers of a union, so the lower layer must be read-only.
+ * Note that we could possibly convert a read-write unioned mount into
+ * a read-only mount here, which would give us a way to union more
+ * than one layer with ...
If I do
mount -r fs1 /mnt
mount -r fs2 /mnt
mount -ounion fs3 /mnt
then only fs2 and fs3 will be unioned. Or how are multiple read-only
layers supposed to work?
Miklos
--
Is there a need to check fallthru, umm ... that probably doesn't Last sentence looks a bit odd, would this be better? We union every underlying file system that is mounted read-only on the --
Try branch "for_miklos" in: git://git.kernel.org/pub/scm/linux/kernel/git/val/linux-2.6.git It's against 2.6.34, I'm rebasing against 2.6.35 tomorrow. -VAL --
Actually, that's on my todo list - right now I'm assuming MS_WHITEOUT implies fallthru support as well. But it doesn't. We're a little short on MS_* flags. I'm thinking of just checking ->whiteout and ->fallthru for non-NULL on the root dir and getting rid of MS_WHITEOUT entirely. Thoughts? -VAL --
Hm, I appear to have re-written that in the latest set of patches. -VAL --
Checking for the methods is a good idea I think, since they are assumed to be present by the code, at least in some places. Although it shouldn't happen, it is possible for a file system to create the root dentry with these methods defined but other dentrys without them defined, so a file system implementation error could cause some unpleasant crashes. Maybe requiring the flags to indicate support would help avoid unpleasant implementation problems like this, not sure really. Also not sure if a method existence check should always be made prior to use, regardless. Ian --
I went for MS_WHITEOUT and MS_FALLTHRU, and added the checks for the ops being non-null. -VAL --
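A sketch of what "flags plus non-NULL ops" could look like, written as a userspace model. The flag values, the toy_* structures and the function names are placeholders made up for the example; the real check lives in check_mnt_union() and may differ.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define TOY_MS_WHITEOUT	0x1	/* made-up values, not the kernel's */
#define TOY_MS_FALLTHRU	0x2

/* Stand-in for the inode operations of the root directory. */
struct toy_inode_ops {
	int (*whiteout)(void);
	int (*fallthru)(void);
};

struct toy_sb {
	unsigned long s_flags;
	const struct toy_inode_ops *root_ops;
};

/* Stub operation so an ops table can be filled in for testing. */
static int toy_noop(void)
{
	return 0;
}

/*
 * The topmost file system must both advertise support through its
 * flags and actually provide the methods on its root directory.
 */
static int toy_check_topmost(const struct toy_sb *sb)
{
	if (!(sb->s_flags & TOY_MS_WHITEOUT) ||
	    !(sb->s_flags & TOY_MS_FALLTHRU))
		return -EINVAL;
	if (!sb->root_ops || !sb->root_ops->whiteout ||
	    !sb->root_ops->fallthru)
		return -EINVAL;
	return 0;
}
```

As Ian points out, this only inspects the root directory; a file system that leaves the methods unset on other directories would still pass, so call sites may want their own NULL checks.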
This bit me. Mount failing with EINVAL is a big PITA.
Miklos
Index: linux-2.6/fs/namespace.c
===================================================================
--- linux-2.6.orig/fs/namespace.c 2010-08-05 11:06:56.000000000 +0200
+++ linux-2.6/fs/namespace.c 2010-08-05 11:39:19.000000000 +0200
@@ -1387,6 +1387,7 @@ check_mnt_union(struct path *mntpnt, str
return 0;
#ifndef CONFIG_UNION_MOUNT
+ printk(KERN_INFO "union mount: not supported by the kernel\n");
return -EINVAL;
#endif
for (p = lower_mnt; p; p = next_mnt(p, lower_mnt)) {
@@ -1396,17 +1397,23 @@ check_mnt_union(struct path *mntpnt, str
return -EBUSY;
}
- if (!IS_ROOT(mntpnt->dentry))
+ if (!IS_ROOT(mntpnt->dentry)) {
+ printk(KERN_INFO "union mount: not root\n");
return -EINVAL;
+ }
if (mnt_flags & MNT_READONLY)
return -EROFS;
- if (!(topmost_mnt->mnt_sb->s_flags & MS_WHITEOUT))
+ if (!(topmost_mnt->mnt_sb->s_flags & MS_WHITEOUT)) {
+ printk(KERN_INFO "union mount: whiteout not supported by fs\n");
return -EINVAL;
+ }
- if (!(topmost_mnt->mnt_sb->s_flags & MS_FALLTHRU))
+ if (!(topmost_mnt->mnt_sb->s_flags & MS_FALLTHRU)) {
+ printk(KERN_INFO "union mount: fallthrough not supported by fs\n");
return -EINVAL;
+ }
/* XXX top level mount should only be mounted once */
--
Document design and implementation of union mounts (a.k.a. writable
overlays).
---
Documentation/filesystems/union-mounts.txt | 759 ++++++++++++++++++++++++++++
1 files changed, 759 insertions(+), 0 deletions(-)
create mode 100644 Documentation/filesystems/union-mounts.txt
diff --git a/Documentation/filesystems/union-mounts.txt b/Documentation/filesystems/union-mounts.txt
new file mode 100644
index 0000000..2ada88d
--- /dev/null
+++ b/Documentation/filesystems/union-mounts.txt
@@ -0,0 +1,759 @@
+Union mounts (a.k.a. writable overlays)
+=======================================
+
+This document describes the architecture and current status of union
+mounts, also known as writable overlays.
+
+In this document:
+ - Overview of union mounts
+ - Terminology
+ - VFS implementation
+ - Locking strategy
+ - VFS/file system interface
+ - Userland interface
+ - NFS interaction
+ - Status
+ - Contributing to union mounts
+
+Overview
+========
+
+A union mount layers one read-write file system over one read-only
+file system, with all writes going to the writable file system. The
+namespace of both file systems appears as a combined whole to
+userland, with files and directories on the writable file system
+covering up any files or directories with matching pathnames on the
+read-only file system. The read-write file system is the "topmost"
+or "upper" file system and the read-only file system is the "lower"
+file system. A few use cases:
+
+- Root file system on CD with writes saved to hard drive (LiveCD)
+- Multiple virtual machines with the same starting root file system
+- Cluster with NFS mounted root on clients
+
+Most if not all of these problems could be solved with a COW block
+device or a clustered file system (including NFS mounts). However, for
+some use cases, sharing is more efficient and better performing if
+done at the file system namespace level. COW block devices only
+increase their divergence as time goes on, and a fully coherent
+writable ...
This may be a dumb question (I must admit I did only very little research), but how does one clean up the topmost layer of whiteouts and fallthroughs, so that the entries of the lower layer(s) can be made visible again? --
I'm not sure how best to do this. We don't want to add more system calls. One thought of mine has been to do this offline, when the file system is unmounted. For example, e2fsck could add a feature to delete whiteouts and fallthrus. Another option is to add a flag to an existing system call. Any ideas? -VAL --
But that means that if the topmost filesystem is getting full of whiteouts and fallthroughs there will be no way to free up the space without taking the volume offline! That makes operation of union mount on always-on systems difficult. Many personal electronics are always-on today, it will be annoying to have to shut them down on reconfigurations or just
That makes me think that the cleanup operation will be topmost filesystem specific. Maybe this even means that one has to have the filesystem specific tools installed on every system
Or calls, if the whiteouts (or even fallthroughs) are to be read through directory file handles.
unlinkat(2) ? It already has dirfd and flags arguments.
--
Whiteouts and fallthrus go away when a directory is deleted. So, "rm -rf /trash/" will actually free up disk space. You can also move the files you want to keep to a temp directory, rmdir the old one, and move that dir back. Unfortunately, union mounts runs into a lot of bizarre ENOSPC problems. But in the degenerate case in which you delete every single file from the lower layer file system, that information will take up only one whiteout per top-level subdir. You don't keep whiteouts for
Any union mount utilities would be distributed as part of the normal
Yeah, unlinkat() looks promising.
-VAL
--
One more advantage of doing whiteouts, etc. with hard links and extended attributes instead of as special filesystem objects. That way they are visible (unless part of a union) and can be treated as normal filesystem objects. Miklos --
This should be reasonably easy to prototype - the whiteout and fallthru patches are pretty well separated from the rest of union mounts. -VAL --
But then you have to break the union to clean up the topmost filesystem. That'll surely take the mounted filesystem (in its working configuration, at least) offline. Not much better than using fsck. --
Add support for fallthru directory entries to ext2.
XXX What to do for d_ino for fallthrus? If we return the inode from
the underlying file system, it comes from a different inode
"namespace" and that will produce spurious matches. This argues for
implementation of fallthrus as symlinks because they have to allocate
an inode (and inode number) anyway, and we can later reuse it if we
copy the file up.
Cc: Theodore Tso <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Signed-off-by: Jan Blunck <jblunck@suse.de>
---
fs/ext2/dir.c | 92 ++++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/ext2.h | 1 +
fs/ext2/namei.c | 22 +++++++++++
include/linux/ext2_fs.h | 1 +
4 files changed, 112 insertions(+), 4 deletions(-)
diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c
index 030bd46..f3b4aff 100644
--- a/fs/ext2/dir.c
+++ b/fs/ext2/dir.c
@@ -219,7 +219,8 @@ static inline int ext2_match (int len, const char * const name,
{
if (len != de->name_len)
return 0;
- if (!de->inode && (de->file_type != EXT2_FT_WHT))
+ if (!de->inode && ((de->file_type != EXT2_FT_WHT) &&
+ (de->file_type != EXT2_FT_FALLTHRU)))
return 0;
return !memcmp(name, de->name, len);
}
@@ -256,6 +257,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = {
[EXT2_FT_SOCK] = DT_SOCK,
[EXT2_FT_SYMLINK] = DT_LNK,
[EXT2_FT_WHT] = DT_WHT,
+ [EXT2_FT_FALLTHRU] = DT_UNKNOWN,
};
#define S_SHIFT 12
@@ -342,6 +344,24 @@ ext2_readdir (struct file * filp, void * dirent, filldir_t filldir)
ext2_put_page(page);
return 0;
}
+ } else if (de->file_type == EXT2_FT_FALLTHRU) {
+ int over;
+ unsigned char d_type = DT_UNKNOWN;
+
+ offset = (char *)de - kaddr;
+ /* XXX We don't know the inode number
+ * of the directory entry in the
+ * underlying file system. Should
+ * look it up, either on fallthru
+ * creation at first readdir or now at
+ * ...
If a previously used ext2 filesystem is mounted again then fallthroughs don't appear to work as expected. Stat returns ENOENT for these entries.
That's an idea, but I guess it won't make everyone happy since it wastes both disk space and memory. One of the key differentiators for the union mounts concept was that it doesn't duplicate inodes and dentries from the layers. With the directory copyup on lookup that's already partially lost, but that can be justified by the fact that non-directories usually far outnumber directories.
Another idea is to use an internal inode and make all fallthroughs be hard links to that. I think the same would work for whiteouts as well. I don't like the fact that whiteouts are invisible even when not mounted as part of a union.
Miklos
--
Hm, I wrote one test case for this that worked (attached). Can you give me more details on your test case? Thanks, -VAL
uml:~# mount -oloop -r ext3-2.img /mnt/img/
uml:~# mount -oloop -r ext3.img /mnt/img/
uml:~# losetup -f ovl.img
uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
uml:~# "ls" /mnt/img
bunion lost+found union
uml:~# "ls" /mnt/img/union
1 2 3
uml:~# "ls" /mnt/img/union/1
a x
uml:~# umount /mnt/img/
uml:~# mmount -b 8 -t ext2 /dev/loop2 /mnt/img/
uml:~# ls -l /mnt/img/
total 14
drwxr-xr-x 2 root root 1024 Aug 5 09:56 bunion
drwx------ 2 root root 12288 Aug 5 09:41 lost+found
drwxr-xr-x 3 root root 1024 Aug 5 09:56 union
uml:~# ls -l /mnt/img/union/
ls: cannot access /mnt/img/union/3: No such file or directory
ls: cannot access /mnt/img/union/2: No such file or directory
total 1
drwxr-xr-x 2 root root 1024 Aug 5 09:56 1
?????????? ? ? ? ? ? 2
?????????? ? ? ? ? ? 3
uml:~# ls -l /mnt/img/union/1
ls: cannot access /mnt/img/union/1/a: No such file or directory
ls: cannot access /mnt/img/union/1/x: No such file or directory
total 0
?????????? ? ? ? ? ? a
?????????? ? ? ? ? ? x
uml:~#
Thanks,
Miklos
--
Cool, thanks. Yes, I suppose the fallthrus should be ignored if they don't fall through to anything. If I do a proper lookup for d_ino, I can kill two birds with one stone, since that will tell us whether there is anything below the fallthru and thus whether to return this directory entry. --
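Roughly what "kill two birds with one stone" could look like, as a userspace model: one lookup in the lower layer both supplies the inode number for the dirent and tells readdir whether the fallthru still points at anything. The toy_* names are invented here, and treating inode number 0 as "not found" is an assumption of the sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy lower-layer directory entry; names and layout are illustrative. */
struct toy_lower_entry {
	const char *name;
	unsigned long ino;
};

/* Return the lower layer's inode number for name, or 0 if absent. */
static unsigned long toy_lower_lookup(const struct toy_lower_entry *lower,
				      size_t n, const char *name)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (strcmp(lower[i].name, name) == 0)
			return lower[i].ino;
	return 0;
}

/*
 * Decide whether readdir should emit a fallthru entry. If the name
 * exists below, store its inode number in *ino and return 1. A fallthru
 * with nothing underneath is skipped: return 0 and leave *ino alone.
 */
static int toy_emit_fallthru(const struct toy_lower_entry *lower, size_t n,
			     const char *name, unsigned long *ino)
{
	unsigned long found = toy_lower_lookup(lower, n, name);

	if (!found)
		return 0;
	*ino = found;
	return 1;
}
```

In Miklos's transcript the remounted top layer has fallthrus with nothing below them; under this model they would simply not be returned, instead of showing up as entries that stat() cannot resolve.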
Oh, "mmount -b 8" == "mount -o union". Is this the mmount from mtools
Okay, I'll experiment more and see what I can do.
--
It's a primitive utility that basically just wraps the mount(2) syscall without any fstab/mtab support: http://www.kernel.org/pub/linux/kernel/people/mszeredi/mmount/
Miklos
--
Best would be if it didn't need any modification to filesystems. All this having to upgrade util-linux, e2fsprogs, having incompatible filesystem features is a pain for users (just been through that).
What we already have in most filesystems:
- extended attributes, e.g. use the system.union.* namespace and denote whiteouts and fallthroughs with such an attribute
- hard links to make sure a separate inode is not necessary for each whiteout/fallthrough entry
- some way for the user to easily identify such files when not mounted as part of a union, e.g. make it a symlink pointing to "(deleted)" or whatever
Later the extended attributes can also be used for other things like e.g. chmod()/chown() only copying up metadata, not data, and indicating that data is still found on the lower layers.
Miklos
--
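For illustration, a userspace sketch of how entries might be classified under the xattr proposal. Only the "system.union." prefix comes from the mail above; the attribute names "whiteout" and "fallthru" under it, and all toy_* identifiers, are guesses made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_UNION_XATTR_PREFIX "system.union."

enum toy_kind {
	TOY_NORMAL,
	TOY_WHITEOUT,
	TOY_FALLTHRU,
};

/*
 * Classify an entry from the name of one of its extended attributes.
 * Anything outside the union namespace, or an unknown name inside it,
 * is treated as a normal entry.
 */
static enum toy_kind toy_classify(const char *xattr_name)
{
	size_t plen = strlen(TOY_UNION_XATTR_PREFIX);

	if (!xattr_name ||
	    strncmp(xattr_name, TOY_UNION_XATTR_PREFIX, plen) != 0)
		return TOY_NORMAL;
	if (strcmp(xattr_name + plen, "whiteout") == 0)
		return TOY_WHITEOUT;
	if (strcmp(xattr_name + plen, "fallthru") == 0)
		return TOY_FALLTHRU;
	return TOY_NORMAL;
}
```

A file system mounted outside a union could skip this classification entirely, which would leave the marker files visible as ordinary objects, as suggested above.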
Just a quick note to say that my explicit design was to do as much as possible in the VFS, except when adding a little support to the low-level fs would make it significantly faster, simpler, and more correct. I think for union mounts to perform moderately well, and to avoid namespace problems, we can't build it 100% out of existing file system parts like xattrs. However, I could be wrong and I will definitely give any other implementation serious consideration. -VAL --
Jan Kara helped convince me this might be better than fs-specific
The problem with hard links is that you run into hard link limits. I don't think we can do hard links for whiteouts and fallthrus. Each whiteout or fallthru will cost an inode if we implement them as extended attributes. This cost has to be balanced against the cost of implementing them as dentries, which is mainly code complexity in
Perhaps we can simply not interpret the whiteout/fallthru extended attributes when the file system is not unioned and let userland
It would certainly be more extensible than in-dentry flags.
-VAL
--
get_unlinked_inode() is a great idea. But I feel that an individual inode for each fallthrough is excessive. It'll make the first readdir() really really expensive and waste a lot of disk and memory for no good reason. Not sure how to fix the hard link limits problem though...
Thanks,
Miklos
--
Add support for fallthru directory entries to tmpfs
XXX - Makes up inode number for dirent
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
fs/dcache.c | 3 +-
fs/libfs.c | 21 +++++++++++++++++--
mm/shmem.c | 60 ++++++++++++++++++++++++++++++++++++++++++++++++++++------
3 files changed, 73 insertions(+), 11 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index b76f9e4..1575af4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -2240,7 +2240,8 @@ resume:
* we can evict it.
*/
if (d_unhashed(dentry)||(!dentry->d_inode &&
- !d_is_whiteout(dentry)))
+ !d_is_whiteout(dentry) &&
+ !d_is_fallthru(dentry)))
continue;
if (!list_empty(&dentry->d_subdirs)) {
this_parent = dentry;
diff --git a/fs/libfs.c b/fs/libfs.c
index ea9a6cc..2b28ca9 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -134,6 +134,7 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
struct dentry *cursor = filp->private_data;
struct list_head *p, *q = &cursor->d_u.d_child;
ino_t ino;
+ int d_type;
int i = filp->f_pos;
switch (i) {
@@ -159,14 +160,28 @@ int dcache_readdir(struct file * filp, void * dirent, filldir_t filldir)
for (p=q->next; p != &dentry->d_subdirs; p=p->next) {
struct dentry *next;
next = list_entry(p, struct dentry, d_u.d_child);
- if (d_unhashed(next) || !next->d_inode)
+ if (d_unhashed(next) || (!next->d_inode && !d_is_fallthru(next)))
continue;
+ if (d_is_fallthru(next)) {
+ /* XXX We don't know the inode
+ * number of the directory
+ * entry in the underlying
+ * file system. Should look
+ * it up, either on fallthru
+ * creation at first readdir
+ * or now at filldir time. */
+ ino = 123; /* Made up ino */
+ d_type = DT_UNKNOWN;
+ } else {
+ ino = next->d_inode->i_ino;
+ d_type = dt_type(next->d_inode);
+ }
+
spin_unlock(&dcache_lock);
if (filldir(dirent, next->d_name.name,
...---
fs/xattr.c | 34 +++++++++++++++++++++++++++-------
1 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/fs/xattr.c b/fs/xattr.c
index 46f87e8..66bb5c7 100644
--- a/fs/xattr.c
+++ b/fs/xattr.c
@@ -19,7 +19,7 @@
#include <linux/fsnotify.h>
#include <linux/audit.h>
#include <asm/uaccess.h>
-
+#include "union.h"
/*
* Check permissions for extended attribute access. This is a bit complicated
@@ -281,17 +281,37 @@ SYSCALL_DEFINE5(setxattr, const char __user *, pathname,
size_t, size, int, flags)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;
- error = user_path(pathname, &path);
+ error = user_path_nd(AT_FDCWD, pathname, LOOKUP_FOLLOW, &nd, &path,
+ &tmp);
if (error)
return error;
- error = mnt_want_write(path.mnt);
- if (!error) {
- error = setxattr(path.dentry, name, value, size, flags);
- mnt_drop_write(path.mnt);
- }
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
+ if (error)
+ goto out;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
+ error = setxattr(path.dentry, name, value, size, flags);
+
+out_drop_write:
+ mnt_drop_write(mnt);
+out:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
return error;
}
--
1.6.3.3
--
---
fs/utimes.c | 14 ++++++++++++--
1 files changed, 12 insertions(+), 2 deletions(-)
diff --git a/fs/utimes.c b/fs/utimes.c
index e4c75db..e83b6bd 100644
--- a/fs/utimes.c
+++ b/fs/utimes.c
@@ -8,8 +8,10 @@
#include <linux/stat.h>
#include <linux/utime.h>
#include <linux/syscalls.h>
+#include <linux/slab.h>
#include <asm/uaccess.h>
#include <asm/unistd.h>
+#include "union.h"
#ifdef __ARCH_WANT_SYS_UTIME
@@ -152,18 +154,26 @@ long do_utimes(int dfd, char __user *filename, struct timespec *times, int flags
error = utimes_common(&file->f_path, times);
fput(file);
} else {
+ struct nameidata nd;
+ char *tmp;
struct path path;
int lookup_flags = 0;
if (!(flags & AT_SYMLINK_NOFOLLOW))
lookup_flags |= LOOKUP_FOLLOW;
- error = user_path_at(dfd, filename, lookup_flags, &path);
+ error = user_path_nd(dfd, filename, lookup_flags, &nd, &path,
+ &tmp);
if (error)
goto out;
- error = utimes_common(&path, times);
+ error = union_copyup(&nd, &path);
+
+ if (!error)
+ error = utimes_common(&path, times);
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
}
out:
--
1.6.3.3
--
---
fs/open.c | 23 ++++++++++++++++++++---
1 files changed, 20 insertions(+), 3 deletions(-)
diff --git a/fs/open.c b/fs/open.c
index 336fe01..b021dcb 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -812,18 +812,35 @@ out:
SYSCALL_DEFINE3(lchown, const char __user *, filename, uid_t, user, gid_t, group)
{
struct path path;
+ struct nameidata nd;
+ struct vfsmount *mnt;
+ char *tmp;
int error;
- error = user_lpath(filename, &path);
+ error = user_path_nd(AT_FDCWD, filename, 0, &nd, &path, &tmp);
if (error)
goto out;
- error = mnt_want_write(path.mnt);
+
+ if (IS_DIR_UNIONED(nd.path.dentry))
+ mnt = nd.path.mnt;
+ else
+ mnt = path.mnt;
+
+ error = mnt_want_write(mnt);
if (error)
goto out_release;
+
+ error = union_copyup(&nd, &path);
+ if (error)
+ goto out_drop_write;
+
error = chown_common(&path, user, group);
- mnt_drop_write(path.mnt);
+out_drop_write:
+ mnt_drop_write(mnt);
out_release:
path_put(&path);
+ path_put(&nd.path);
+ putname(tmp);
out:
return error;
}
--
1.6.3.3
--
From: Felix Fietkau <nbd@openwrt.org> Add support for whiteout dentries to jffs2. XXX - David Woodhouse suggests several changes and provides an untested patch. See: http://patchwork.ozlabs.org/patch/50466/ Signed-off-by: Felix Fietkau <nbd@openwrt.org> Signed-off-by: Valerie Aurora <vaurora@redhat.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: linux-mtd@lists.infradead.org --- fs/jffs2/dir.c | 72 +++++++++++++++++++++++++++++++++++++++++++++++- fs/jffs2/fs.c | 4 +++ fs/jffs2/super.c | 2 +- include/linux/jffs2.h | 2 + 4 files changed, 77 insertions(+), 3 deletions(-) diff --git a/fs/jffs2/dir.c b/fs/jffs2/dir.c index 7aa4417..c259193 100644 --- a/fs/jffs2/dir.c +++ b/fs/jffs2/dir.c @@ -34,6 +34,8 @@ static int jffs2_mknod (struct inode *,struct dentry *,int,dev_t); static int jffs2_rename (struct inode *, struct dentry *, struct inode *, struct dentry *); +static int jffs2_whiteout (struct inode *, struct dentry *, struct dentry *); + const struct file_operations jffs2_dir_operations = { .read = generic_read_dir, @@ -56,6 +58,7 @@ const struct inode_operations jffs2_dir_inode_operations = .mknod = jffs2_mknod, .rename = jffs2_rename, .check_acl = jffs2_check_acl, + .whiteout = jffs2_whiteout, .setattr = jffs2_setattr, .setxattr = jffs2_setxattr, .getxattr = jffs2_getxattr, @@ -98,8 +101,14 @@ static struct dentry *jffs2_lookup(struct inode *dir_i, struct dentry *target, fd = fd_list; } } - if (fd) - ino = fd->ino; + if (fd) { + spin_lock(&target->d_lock); + if (fd->type == DT_WHT) + target->d_flags |= DCACHE_WHITEOUT; + else + ino = fd->ino; + spin_unlock(&target->d_lock); + } mutex_unlock(&dir_f->sem); if (ino) { inode = jffs2_iget(dir_i->i_sb, ino); @@ -498,6 +507,11 @@ static int jffs2_mkdir (struct inode *dir_i, struct dentry *dentry, int mode) return PTR_ERR(inode); } + if (dentry->d_flags & DCACHE_WHITEOUT) { + inode->i_flags |= ...
From: Jan Blunck <jblunck@suse.de> This patch adds whiteout support to EXT2. A whiteout is an empty directory entry (inode == 0) with the file type set to EXT2_FT_WHT. Therefore it allocates space in directories. Due to being implemented as a filetype it is necessary to have the EXT2_FEATURE_INCOMPAT_FILETYPE flag set. XXX - Needs serious review. Al wonders: What happens with a delete at the beginning of a block? Will we find the matching dentry or the first empty space? Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Valerie Aurora <vaurora@redhat.com> Cc: Theodore Tso <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org --- fs/ext2/dir.c | 96 +++++++++++++++++++++++++++++++++++++++++++++-- fs/ext2/ext2.h | 3 + fs/ext2/inode.c | 11 ++++- fs/ext2/namei.c | 67 +++++++++++++++++++++++++++++++- fs/ext2/super.c | 6 +++ include/linux/ext2_fs.h | 4 ++ 6 files changed, 177 insertions(+), 10 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 57207a9..030bd46 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -219,7 +219,7 @@ static inline int ext2_match (int len, const char * const name, { if (len != de->name_len) return 0; - if (!de->inode) + if (!de->inode && (de->file_type != EXT2_FT_WHT)) return 0; return !memcmp(name, de->name, len); } @@ -255,6 +255,7 @@ static unsigned char ext2_filetype_table[EXT2_FT_MAX] = { [EXT2_FT_FIFO] = DT_FIFO, [EXT2_FT_SOCK] = DT_SOCK, [EXT2_FT_SYMLINK] = DT_LNK, + [EXT2_FT_WHT] = DT_WHT, }; #define S_SHIFT 12 @@ -448,6 +449,26 @@ ino_t ext2_inode_by_name(struct inode *dir, struct qstr *child) return res; } +/* Special version for filetype based whiteout support */ +ino_t ext2_inode_by_dentry(struct inode *dir, struct dentry *dentry) +{ + ino_t res = 0; + struct ext2_dir_entry_2 *de; + struct page *page; + + de = ext2_find_entry (dir, &dentry->d_name, &page); + if (de) { + res = le32_to_cpu(de->inode); + if ...
This looks odd; can someone tell me what's actually going on with de and de1? Is page "always" set in ext2_find_entry()? I couldn't quite make that out. If dentry is negative, isn't this a use without initialization of page in --
From: Jan Blunck <jblunck@suse.de>
do_whiteout() allows removal of a directory when it has whiteouts but
is logically empty.
XXX - This patch abuses readdir() to check if the union directory is
logically empty - that is, all the entries are whiteouts (or "." or
".."). Currently, we have no clean VFS interface to ask the lower
file system if a directory is empty.
Fixes:
- Add ->is_directory_empty() op
- Add is_directory_empty flag to dentry (ugly dcache populate)
- Ask underlying fs to remove it and look for an error return
- (your idea here)
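
For readers who want the "logically empty" rule spelled out apart from
the readdir plumbing, here is a minimal userspace sketch. It is not
kernel code: the sketch_* names, the entry struct and the numeric type
constants are assumptions made for illustration. The rule itself is the
one stated above: a directory is logically empty when every entry is
".", ".." or a whiteout.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the d_type values a filldir callback sees. */
#define SKETCH_DT_DIR 4
#define SKETCH_DT_REG 8
#define SKETCH_DT_WHT 14

/* Hypothetical flattened directory entry: just a name and a type. */
struct sketch_dirent {
	const char *name;
	unsigned int d_type;
};

/* Return 1 if this entry does not count against emptiness. */
int sketch_entry_is_ignorable(const struct sketch_dirent *de)
{
	if (strcmp(de->name, ".") == 0 || strcmp(de->name, "..") == 0)
		return 1;
	return de->d_type == SKETCH_DT_WHT;
}

/* Return 1 if every entry is ".", ".." or a whiteout, else 0. */
int sketch_directory_is_empty(const struct sketch_dirent *entries, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		if (!sketch_entry_is_ignorable(&entries[i]))
			return 0;
	return 1;
}
```

A dedicated ->is_directory_empty() op, as proposed in the list above,
would let the lower file system answer the same question without
walking every entry through a callback.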
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
fs/namei.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 84 insertions(+), 0 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index 8c67636..06aad7e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2249,6 +2249,90 @@ static int vfs_whiteout(struct inode *dir, struct dentry *old_dentry, int isdir)
}
/*
+ * XXX - We are abusing readdir to check if a union directory is
+ * logically empty.
+ */
+static int filldir_is_empty(void *__buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ int *is_empty = (int *)__buf;
+
+ switch (namlen) {
+ case 2:
+ if (name[1] != '.')
+ break;
+ case 1:
+ if (name[0] != '.')
+ break;
+ return 0;
+ }
+
+ if (d_type == DT_WHT)
+ return 0;
+
+ (*is_empty) = 0;
+ return 0;
+}
+
+static int directory_is_empty(struct path *path)
+{
+ struct file *file;
+ int err;
+ int is_empty = 1;
+
+ BUG_ON(!S_ISDIR(path->dentry->d_inode->i_mode));
+
+ /* references for the file pointer */
+ path_get(path);
+
+ file = dentry_open(path->dentry, path->mnt, O_RDONLY, current_cred());
+ if (IS_ERR(file))
+ return 0;
+
+ err = vfs_readdir(file, filldir_is_empty, &is_empty);
+
+ fput(file);
+ return is_empty;
+}
+
+static int do_whiteout(struct nameidata *nd, struct path *path, int ...From: Jan Blunck <jblunck@suse.de> The ext2_append_link() is later used to find or append a directory entry to whiteout. Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: Valerie Aurora <vaurora@redhat.com> Cc: Theodore Tso <tytso@mit.edu> Cc: linux-ext4@vger.kernel.org --- fs/ext2/dir.c | 70 ++++++++++++++++++++++++++++++++++++++++---------------- 1 files changed, 50 insertions(+), 20 deletions(-) diff --git a/fs/ext2/dir.c b/fs/ext2/dir.c index 7516957..57207a9 100644 --- a/fs/ext2/dir.c +++ b/fs/ext2/dir.c @@ -472,9 +472,10 @@ void ext2_set_link(struct inode *dir, struct ext2_dir_entry_2 *de, } /* - * Parent is locked. + * Find or append a given dentry to the parent directory */ -int ext2_add_link (struct dentry *dentry, struct inode *inode) +static ext2_dirent * ext2_append_entry(struct dentry * dentry, + struct page ** page) { struct inode *dir = dentry->d_parent->d_inode; const char *name = dentry->d_name.name; @@ -482,13 +483,10 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode) unsigned chunk_size = ext2_chunk_size(dir); unsigned reclen = EXT2_DIR_REC_LEN(namelen); unsigned short rec_len, name_len; - struct page *page = NULL; - ext2_dirent * de; + ext2_dirent * de = NULL; unsigned long npages = dir_pages(dir); unsigned long n; char *kaddr; - loff_t pos; - int err; /* * We take care of directory expansion in the same loop. @@ -498,20 +496,19 @@ int ext2_add_link (struct dentry *dentry, struct inode *inode) for (n = 0; n <= npages; n++) { char *dir_end; - page = ext2_get_page(dir, n, 0); - err = PTR_ERR(page); - if (IS_ERR(page)) + *page = ext2_get_page(dir, n, 0); + de = ERR_PTR(PTR_ERR(*page)); + if (IS_ERR(*page)) goto out; - lock_page(page); - kaddr = page_address(page); + lock_page(*page); + kaddr = page_address(*page); dir_end = kaddr + ext2_last_byte(dir, n); de = (ext2_dirent *)kaddr; kaddr += PAGE_CACHE_SIZE - reclen; while ((char ...
From: Jan Blunck <jblunck@suse.de>
Whiteout a given directory entry. File systems that support whiteouts
must implement the new ->whiteout() directory inode operation.
XXX - Only whiteout when there is a matching entry in a lower layer.
XXX - MS_WHITEOUT only indicates whiteouts, but we also use it for
fallthrus. Can we just check root->i_op->whiteout and ->fallthru? Or
do we need an MS_FALLTHRU?
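
To make the second XXX concrete, here is a rough userspace sketch of
what "just check ->whiteout and ->fallthru" could look like. Everything
prefixed sketch_ is hypothetical. The whiteout signature mirrors the one
this patch adds to struct inode_operations; the fallthru signature is an
assumption, since that op is defined in a different patch.

```c
#include <stddef.h>

/* Opaque stand-ins for the kernel's struct inode and struct dentry. */
struct sketch_inode;
struct sketch_dentry;

/* Hypothetical, cut-down operations table with only the two ops of interest. */
struct sketch_inode_operations {
	int (*whiteout)(struct sketch_inode *, struct sketch_dentry *,
			struct sketch_dentry *);
	int (*fallthru)(struct sketch_inode *, struct sketch_dentry *);
};

/* Toy implementations so that a table can be populated in an example. */
int sketch_toy_whiteout(struct sketch_inode *dir, struct sketch_dentry *old,
			struct sketch_dentry *new)
{
	(void)dir; (void)old; (void)new;
	return 0;
}

int sketch_toy_fallthru(struct sketch_inode *dir, struct sketch_dentry *dentry)
{
	(void)dir; (void)dentry;
	return 0;
}

/* Return 1 if the root directory's ops provide a whiteout method. */
int sketch_supports_whiteout(const struct sketch_inode_operations *ops)
{
	return ops != NULL && ops->whiteout != NULL;
}

/* Return 1 if the root directory's ops provide a fallthru method. */
int sketch_supports_fallthru(const struct sketch_inode_operations *ops)
{
	return ops != NULL && ops->fallthru != NULL;
}

/* A topmost union layer would need both, instead of a single mount flag. */
int sketch_can_be_union_top(const struct sketch_inode_operations *ops)
{
	return sketch_supports_whiteout(ops) && sketch_supports_fallthru(ops);
}
```

Whether probing the ops table is enough depends on whether a file
system can provide the methods while the on-disk feature is disabled,
which this sketch does not model.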
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: David Woodhouse <dwmw2@infradead.org>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
---
Documentation/filesystems/vfs.txt | 10 +++++-
fs/dcache.c | 4 ++-
fs/namei.c | 73 ++++++++++++++++++++++++++++++++++++-
include/linux/dcache.h | 6 +++
include/linux/fs.h | 2 +
5 files changed, 92 insertions(+), 3 deletions(-)
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 3de2f32..8846b4f 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -308,7 +308,7 @@ struct inode_operations
-----------------------
This describes how the VFS can manipulate an inode in your
-filesystem. As of kernel 2.6.22, the following members are defined:
+filesystem. As of kernel 2.6.33, the following members are defined:
struct inode_operations {
int (*create) (struct inode *,struct dentry *,int, struct nameidata *);
@@ -319,6 +319,7 @@ struct inode_operations {
int (*mkdir) (struct inode *,struct dentry *,int);
int (*rmdir) (struct inode *,struct dentry *);
int (*mknod) (struct inode *,struct dentry *,int,dev_t);
+ int (*whiteout) (struct inode *, struct dentry *, struct dentry *);
int (*rename) (struct inode *, struct dentry *,
struct inode *, struct dentry *);
int (*readlink) (struct dentry *, char __user *,int);
@@ -382,6 +383,13 @@ otherwise noted.
will probably need to call d_instantiate() just as you would
in the create() method
+ ...Add comments describing what the directions "up" and "down" mean and
ref count handling to the VFS follow_mount() family of functions.
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
fs/namei.c | 43 +++++++++++++++++++++++++++++++++++++++----
fs/namespace.c | 16 ++++++++++++++--
2 files changed, 53 insertions(+), 6 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index b86b96f..ec178f1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -596,6 +596,17 @@ loop:
return err;
}
+/*
+ * follow_up - Find the mountpoint of path's vfsmount
+ *
+ * Given a path, find the mountpoint of its source file system.
+ * Replace @path with the path of the mountpoint in the parent mount.
+ * Up is towards /.
+ *
+ * Return 1 if we went up a level and 0 if we were already at the
+ * root.
+ */
+
int follow_up(struct path *path)
{
struct vfsmount *parent;
@@ -616,8 +627,22 @@ int follow_up(struct path *path)
return 1;
}
-/* no need for dcache_lock, as serialization is taken care in
- * namespace.c
+/*
+ * __follow_mount - Return the most recent mount at this mountpoint
+ *
+ * Given a mountpoint, find the most recently mounted file system at
+ * this mountpoint and return the path to its root dentry. This is
+ * the file system that is visible, and it is in the direction of VFS
+ * "down" - away from the root of the mount tree. See comments to
+ * lookup_mnt() for an example of "down."
+ *
+ * Does not decrement the refcount on the given mount even if it
+ * follows it to another mount and returns that path instead.
+ *
+ * Returns 0 if path was unchanged, 1 if we followed it to another mount.
+ *
+ * No need for dcache_lock, as serialization is taken care in
+ * namespace.c.
*/
static int __follow_mount(struct path *path)
{
@@ -636,6 +661,12 @@ static int __follow_mount(struct path *path)
return res;
}
+/*
+ * Like __follow_mount, but no return value and drops references to
+ * both ...From: Jan Blunck <jblunck@suse.de> Userspace isn't ready for handling another file type, so silently drop whiteout directory entries before they leave the kernel. Signed-off-by: Jan Blunck <jblunck@suse.de> Signed-off-by: David Woodhouse <dwmw2@infradead.org> Signed-off-by: Valerie Aurora <vaurora@redhat.com> Cc: linux-nfs@vger.kernel.org Cc: "J. Bruce Fields" <bfields@fieldses.org> Cc: Neil Brown <neilb@suse.de> --- fs/compat.c | 9 +++++++++ fs/nfsd/nfs3xdr.c | 5 +++++ fs/nfsd/nfs4xdr.c | 5 +++++ fs/nfsd/nfsxdr.c | 4 ++++ fs/readdir.c | 9 +++++++++ 5 files changed, 32 insertions(+), 0 deletions(-) diff --git a/fs/compat.c b/fs/compat.c index 0544873..5d88516 100644 --- a/fs/compat.c +++ b/fs/compat.c @@ -839,6 +839,9 @@ static int compat_fillonedir(void *__buf, const char *name, int namlen, struct compat_old_linux_dirent __user *dirent; compat_ulong_t d_ino; + if (d_type == DT_WHT) + return 0; + if (buf->result) return -EINVAL; d_ino = ino; @@ -910,6 +913,9 @@ static int compat_filldir(void *__buf, const char *name, int namlen, compat_ulong_t d_ino; int reclen = ALIGN(NAME_OFFSET(dirent) + namlen + 2, sizeof(compat_long_t)); + if (d_type == DT_WHT) + return 0; + buf->error = -EINVAL; /* only used if we fail.. */ if (reclen > buf->count) return -EINVAL; @@ -999,6 +1005,9 @@ static int compat_filldir64(void * __buf, const char * name, int namlen, loff_t int reclen = ALIGN(jj + namlen + 1, sizeof(u64)); u64 off; + if (d_type == DT_WHT) + return 0; + buf->error = -EINVAL; /* only used if we fail.. */ if (reclen > buf->count) return -EINVAL; diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c index 2a533a0..9b96f5a 100644 --- a/fs/nfsd/nfs3xdr.c +++ b/fs/nfsd/nfs3xdr.c @@ -885,6 +885,11 @@ encode_entry(struct readdir_cd *ccd, const char *name, int namlen, int elen; /* estimated entry length in words */ int num_entry_words = 0; /* actual number of words */ + if (d_type ...
While we can check if a file system is currently read-only, we can't
guarantee that it will stay read-only. The file system can be
remounted read-write at any time; it's also conceivable that a file
system can be mounted a second time and converted to read-write if the
underlying fs allows it. This is a problem for union mounts, which
require the underlying file system be read-only. Add a read-only
users count, and refuse both remounts that would change the file
system to read-write and new read-write mounts while there are any
read-only users.
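
As a rough illustration of the counting scheme, here is a userspace
model. The sketch_* names and the struct are hypothetical, and the
-EROFS return value is an assumption; only the counter and the BUG_ON
on underflow are taken from the diff below.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical model of the two superblock fields that matter here. */
struct sketch_super {
	int read_only;                    /* nonzero while mounted read-only */
	unsigned int hard_readonly_users; /* users that need it to stay that way */
};

/* A new hard read-only user (e.g. a union lower layer) takes a count. */
void sketch_inc_hard_readonly_users(struct sketch_super *sb)
{
	sb->hard_readonly_users++;
}

/* Dropping a count that was never taken is a bug, as with BUG_ON(). */
void sketch_dec_hard_readonly_users(struct sketch_super *sb)
{
	assert(sb->hard_readonly_users > 0);
	sb->hard_readonly_users--;
}

/* Refuse the read-only to read-write transition while any user remains. */
int sketch_remount_rw(struct sketch_super *sb)
{
	if (sb->hard_readonly_users)
		return -EROFS; /* assumed error code */
	sb->read_only = 0;
	return 0;
}
```

The same check would have to guard a second, read-write mount of the
same superblock, since that is the other path to a writable file system
named above.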
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
fs/namespace.c | 11 +++++++++++
fs/super.c | 23 +++++++++++++++++++++++
include/linux/fs.h | 8 ++++++++
3 files changed, 42 insertions(+), 0 deletions(-)
diff --git a/fs/namespace.c b/fs/namespace.c
index d405444..b788cfa 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -200,6 +200,17 @@ int __mnt_is_readonly(struct vfsmount *mnt)
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);
+static void inc_hard_readonly_users(struct vfsmount *mnt)
+{
+ mnt->mnt_sb->s_hard_readonly_users++;
+}
+
+static void dec_hard_readonly_users(struct vfsmount *mnt)
+{
+ BUG_ON(mnt->mnt_sb->s_hard_readonly_users == 0);
+ mnt->mnt_sb->s_hard_readonly_users--;
+}
+
static inline void inc_mnt_writers(struct vfsmount *mnt)
{
#ifdef CONFIG_SMP
diff --git a/fs/super.c b/fs/super.c
index 1527e6a..6add39b 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -118,6 +118,7 @@ out:
*/
static inline void destroy_super(struct super_block *s)
{
+ BUG_ON(s->s_hard_readonly_users);
security_sb_free(s);
kfree(s->s_subtype);
kfree(s->s_options);
@@ -557,6 +558,21 @@ out:
return err;
}
+/*
+ * Some uses of file systems require that they never be mounted
+ * read-write anywhere (e.g., the lower layers of union mounts must
+ * always be read-only). If there are any of these "hard" read-only
+ * mounts, don't permit a transition to ...From: Jan Blunck <jblunck@suse.de>
This patch changes lookup_hash() into returning a struct path.
XXX - Check for correctness, otherwise obvious
Signed-off-by: Jan Blunck <jblunck@suse.de>
Signed-off-by: Valerie Aurora <vaurora@redhat.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
---
fs/namei.c | 113 ++++++++++++++++++++++++++++++-----------------------------
1 files changed, 57 insertions(+), 56 deletions(-)
diff --git a/fs/namei.c b/fs/namei.c
index ec178f1..3b43c48 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1155,7 +1155,7 @@ int vfs_path_lookup(struct dentry *dentry, struct vfsmount *mnt,
}
static struct dentry *__lookup_hash(struct qstr *name,
- struct dentry *base, struct nameidata *nd)
+ struct dentry *base, struct nameidata *nd)
{
struct dentry *dentry;
struct inode *inode;
@@ -1212,14 +1212,22 @@ out:
* needs parent already locked. Doesn't follow mounts.
* SMP-safe.
*/
-static struct dentry *lookup_hash(struct nameidata *nd)
+static int lookup_hash(struct nameidata *nd, struct qstr *name,
+ struct path *path)
{
int err;
err = exec_permission(nd->path.dentry->d_inode);
if (err)
- return ERR_PTR(err);
- return __lookup_hash(&nd->last, nd->path.dentry, nd);
+ return err;
+ path->mnt = nd->path.mnt;
+ path->dentry = __lookup_hash(name, nd->path.dentry, nd);
+ if (IS_ERR(path->dentry)) {
+ err = PTR_ERR(path->dentry);
+ path->dentry = NULL;
+ path->mnt = NULL;
+ }
+ return err;
}
static int __lookup_one_len(const char *name, struct qstr *this,
@@ -1701,12 +1709,9 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
/* OK, it's O_CREAT */
mutex_lock(&dir->d_inode->i_mutex);
+ error = lookup_hash(nd, &nd->last, path);
- path->dentry = lookup_hash(nd);
- path->mnt = nd->path.mnt;
-
- error = PTR_ERR(path->dentry);
- if (IS_ERR(path->dentry)) {
+ if (error) {
mutex_unlock(&dir->d_inode->i_mutex);
goto exit;
}
@@ -1958,7 +1963,8 @@ ...There's a bit of indirection going on here so it isn't clear to me if --
