This is against current Linus git. Miklos, if you send me a copy of your current unprivledged mount code merged against mainline, I'll merge with this. git://git.kernel.org/pub/scm/linux/kernel/git/daveh/linux-2.6-robind.git This rolls up all the -mm bugfixes that were accumulated, and addresses some new review comments from Al. Also contains some reworking from hch and a patch from Jeff Dike. Just posting here to let everyone have a sniff before we resend it back to -mm. --- Why do we need r/o bind mounts? This feature allows a read-only view into a read-write filesystem. In the process of doing that, it also provides infrastructure for keeping track of the number of writers to any given mount. This has a number of uses. It allows chroots to have parts of filesystems writable. It will be useful for containers in the future because users may have root inside a container, but should not be allowed to write to somefilesystems. This also replaces patches that vserver has had out of the tree for several years. It allows security enhancement by making sure that parts of your filesystem are read-only (such as when you don't trust your FTP server), when you don't want to have entire new filesystems mounted, or when you want atime selectively updated. I've been using this script: http://sr71.net/~dave/linux/robind-test.sh to test that the feature is working as desired. It takes a directory and makes a regular bind and a r/o bind mount of it. It then performs some normal filesystem operations on the three directories, including ones that are expected to fail, like creating a file on the r/o mount. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --
Here's patch for hppfs that uses vfs_kern_mount to make sure it always has a procfs instance and passed the vfsmount on through the inode private data. Also fixes a procfs file_system_type leak for every attempted hppfs mount. [ jdike - gave this file a style workover, plus deleted hppfs_dentry_ops ] Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Jeff Dike <jdike@linux.intel.com> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/hppfs/hppfs_kern.c | 364 ++++++++++++++++--------------- 1 file changed, 188 insertions(+), 176 deletions(-) diff -puN fs/hppfs/hppfs_kern.c~hppfs-pass-vfsmount-to-dentry_open fs/hppfs/hppfs_kern.c --- linux-2.6.git/fs/hppfs/hppfs_kern.c~hppfs-pass-vfsmount-to-dentry_open 2008-02-15 13:25:42.000000000 -0800 +++ linux-2.6.git-dave/fs/hppfs/hppfs_kern.c 2008-02-15 13:25:42.000000000 -0800 @@ -1,23 +1,25 @@ /* - * Copyright (C) 2002 Jeff Dike (jdike@karaya.com) + * Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com) * Licensed under the GPL */ -#include <linux/fs.h> +#include <linux/ctype.h> +#include <linux/dcache.h> #include <linux/file.h> -#include <linux/module.h> +#include <linux/fs.h> #include <linux/init.h> -#include <linux/slab.h> -#include <linux/list.h> #include <linux/kernel.h> -#include <linux/ctype.h> -#include <linux/dcache.h> +#include <linux/list.h> +#include <linux/module.h> +#include <linux/mount.h> +#include <linux/slab.h> #include <linux/statfs.h> +#include <linux/types.h> #include <asm/uaccess.h> -#include <asm/fcntl.h> #include "os.h" -static int init_inode(struct inode *inode, struct dentry *dentry); +static int init_inode(struct inode *inode, struct dentry *dentry, + struct vfsmount *mnt); struct hppfs_data { struct list_head list; @@ -33,6 +35,7 @@ struct hppfs_private { struct hppfs_inode_info { struct dentry *proc_dentry; + struct vfsmount *proc_mnt; ...
Some new uses of get_empty_filp() have crept in, and are not properly taking mnt_want_write()s. This fixes them up. We really need to kill get_empty_filp(). Cc: Erez Zadok <ezk@cs.sunysb.edu> Cc: Trond Myklebust <trond.myklebust@fys.uio.no> Cc: "J Bruce Fields" <bfields@fieldses.org> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/anon_inodes.c | 16 ++++++---------- linux-2.6.git-dave/fs/file_table.c | 6 ++++++ linux-2.6.git-dave/fs/pipe.c | 19 +++++++++---------- 3 files changed, 21 insertions(+), 20 deletions(-) diff -puN fs/anon_inodes.c~fix-up-new-filp-allocators fs/anon_inodes.c --- linux-2.6.git/fs/anon_inodes.c~fix-up-new-filp-allocators 2008-02-15 13:25:43.000000000 -0800 +++ linux-2.6.git-dave/fs/anon_inodes.c 2008-02-15 13:25:43.000000000 -0800 @@ -81,13 +81,10 @@ int anon_inode_getfd(int *pfd, struct in if (IS_ERR(anon_inode_inode)) return -ENODEV; - file = get_empty_filp(); - if (!file) - return -ENFILE; error = get_unused_fd(); if (error < 0) - goto err_put_filp; + return error; fd = error; /* @@ -114,14 +111,15 @@ int anon_inode_getfd(int *pfd, struct in dentry->d_flags &= ~DCACHE_UNHASHED; d_instantiate(dentry, anon_inode_inode); - file->f_path.mnt = mntget(anon_inode_mnt); - file->f_path.dentry = dentry; + error = -ENFILE; + file = alloc_file(anon_inode_mnt, dentry, + FMODE_READ | FMODE_WRITE, fops); + if (!file) + goto err_put_unused_fd; file->f_mapping = anon_inode_inode->i_mapping; file->f_pos = 0; file->f_flags = O_RDWR; - file->f_op = fops; - file->f_mode = FMODE_READ | FMODE_WRITE; file->f_version = 0; file->private_data = priv; @@ -134,8 +132,6 @@ int anon_inode_getfd(int *pfd, struct in err_put_unused_fd: put_unused_fd(fd); -err_put_filp: - put_filp(file); return error; ...
After several posts and bug reports regarding interaction with the NULL
nameidata, here's a patch to clean up the mess with struct file in the
reiserfs xattr code.
As observed in several of the posts, there's really no need for struct file
to exist in the xattr code. It was really only passed around due to the
f_op->readdir() and a_ops->{prepare,commit}_write prototypes requiring it.
reiserfs_prepare_write() and reiserfs_commit_write() don't actually use the
struct file passed to it, and the xattr code uses a private version of
reiserfs_readdir() to enumerate the xattr directories.
Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/reiserfs/xattr.c | 110 +++++++++------------------------
1 file changed, 30 insertions(+), 80 deletions(-)
diff -puN fs/reiserfs/xattr.c~reiserfs-eliminate-private-use-of-struct-file-in-xattr fs/reiserfs/xattr.c
--- linux-2.6.git/fs/reiserfs/xattr.c~reiserfs-eliminate-private-use-of-struct-file-in-xattr 2008-02-15 13:25:41.000000000 -0800
+++ linux-2.6.git-dave/fs/reiserfs/xattr.c 2008-02-15 13:25:41.000000000 -0800
@@ -191,28 +191,11 @@ static struct dentry *get_xa_file_dentry
dput(xadir);
if (err)
xafile = ERR_PTR(err);
- return xafile;
-}
-
-/* Opens a file pointer to the attribute associated with inode */
-static struct file *open_xa_file(const struct inode *inode, const char *name,
- int flags)
-{
- struct dentry *xafile;
- struct file *fp;
-
- xafile = get_xa_file_dentry(inode, name, flags);
- if (IS_ERR(xafile))
- return ERR_PTR(PTR_ERR(xafile));
else if (!xafile->d_inode) {
dput(xafile);
- return ERR_PTR(-ENODATA);
+ xafile = ERR_PTR(-ENODATA);
}
-
- fp = dentry_open(xafile, NULL, O_RDWR);
- /* dentry_open dputs the dentry if it fails */
-
- return fp;
+ return xafile;
}
...The emergency remount code forcibly removes FMODE_WRITE from
filps. The r/o bind mount code notices that this was done
without a proper mnt_drop_write() and properly gives a
warning.
This patch does a mnt_drop_write() to keep everything
balanced.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/super.c | 20 ++++++++++++++++++--
1 file changed, 18 insertions(+), 2 deletions(-)
diff -puN fs/super.c~robind-sysrq-fix fs/super.c
--- linux-2.6.git/fs/super.c~robind-sysrq-fix 2008-02-15 13:25:46.000000000 -0800
+++ linux-2.6.git-dave/fs/super.c 2008-02-15 13:25:46.000000000 -0800
@@ -37,6 +37,7 @@
#include <linux/idr.h>
#include <linux/kobject.h>
#include <linux/mutex.h>
+#include <linux/file.h>
#include <asm/uaccess.h>
@@ -567,10 +568,25 @@ static void mark_files_ro(struct super_b
{
struct file *f;
+retry:
file_list_lock();
list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
- if (S_ISREG(f->f_path.dentry->d_inode->i_mode) && file_count(f))
- f->f_mode &= ~FMODE_WRITE;
+ struct vfsmount *mnt;
+ if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+ continue;
+ if (!file_count(f))
+ continue;
+ if (!(f->f_mode & FMODE_WRITE))
+ continue;
+ f->f_mode &= ~FMODE_WRITE;
+ mnt = f->f_path.mnt;
+ file_list_unlock();
+ /*
+ * This can sleep, so we can't hold
+ * the file_list_lock() spinlock.
+ */
+ mnt_drop_write(mnt);
+ goto retry;
}
file_list_unlock();
}
_
--
AFAICS an mntget() is needed here to make sure mnt doesn't go away after releasing file_list_lock. --
Good catch, but... that's not all. We also have to move drop_file_write_access() in __fput() past the file_kill(), or we'll get a race here - CPU 1 enters that loop, sees ->f_count 1, checks that file is writable, CPU 2 does final fput() and proceeds to do drop_file_write_access() and mnt_drop_write(). CPU 1, in the meanwhile, removes FMODE_WRITE and goes on to do mnt_drop_write() itself. Oh, well... Fixed. --
This patch adds two function mnt_want_write() and mnt_drop_write(). These are
used like a lock pair around and fs operations that might cause a write to the
filesystem.
Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair. When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.
Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/namespace.c | 54 +++++++++++++++++++++++++++++++
linux-2.6.git-dave/include/linux/mount.h | 3 +
2 files changed, 57 insertions(+)
diff -puN fs/namespace.c~r-o-bind-mounts-stub-functions fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-stub-functions 2008-02-15 13:25:45.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c 2008-02-15 13:25:45.000000000 -0800
@@ -80,6 +80,60 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/**
+ * mnt_want_write - get write access to a mount
+ * @mnt: the mount on which to take a write
+ *
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success. When
+ * the write operation is finished, mnt_drop_write()
+ * must be called. This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+ if (__mnt_is_readonly(mnt))
+ return -EROFS;
+ return ...Argh, is there some reason why this couldn't have gotten merged in -rc1, ahead of the rest of the patch series? This one is going to cause more cross-tree merge pain with any filesystem tree that have changes to fs/*/ioctl.c. - Ted --
I wasn't meaning for this to hit the 2.6.25-rc series. We had some review comments just when the merge window opened, and I was expecting them to get stuck back in -mm for another round. -- Dave --
Yeah, but it means that I need one set of patches for -mm, and another set of patches for Linus's mainline. I notice that your patchset is currently missing changes for fs/ext4/ioctl.c --- I think because you dropped them when Mingming picked them up, and then I dropped them when I was trying to prepare the set of patches to push to Linus. No problem, I'm sure I can ressurect them, but it's still the same basic problem that when there are patchsets such as yours which touch multiple trees in -mm, there are almost inevitably patch conflicts. It would be nice if an initial patch which introduces the new functionality you need for r/o bind mounts could get introduced into mainline *first*, and then people could add patches that call mnt_want_write(), et. al into their trees gradually. As it is, I can't see a way around this other than maintaining two separate patch sets, one that works with r/o bind mounts, and one for mainline, since otherwise akpm gets grumpy and starts dropping either your patchset or the ext4 patchset because *he* has to manually fix up the patch conflicts. (So instead I have to deal with it by hand, and then *I* get grumpy. :-/) - Ted --
On Fri, 15 Feb 2008 20:00:30 -0500 Yes, I expect that merging a handful of do-nothing mnt_foo_write() itym "less than usually cheery" --
Haha, once we put pieces in the first three patches would be useful aswell, to easily catch additions in the next cycle that might be adding NULL-vfsmount calls to dentry_open. --
hrm, well, how about putting up a complete and suitably-changelogged patch series for Linus to look at? That's be a Dave thing I guess. I wasn't overawed by the initial patch - why not make those stubs inlined to truly add zero cost?? --
What I sent was the exact same thing I had from the original patch series. They weren't inlined because shortly after that in the series, we make them much bigger. Merging that exact patch is the easiest for me, but I see your point. I'll rework it with the others. -- Dave --
So, we want "[PATCH 07/30] r/o bind mounts: stub functions" and "[PATCH 03/30] check for null vfsmount in dentry_open()" But what's the third patch? -- Dave --
For that latter patch we'd need the reiserfs and hpps fixes. But I think it's too late now, let's just keep them in -mm for the time beeing. --
Make sure no-one calls dentry_open with a NULL vfsmount argument and crap
out with a stacktrace otherwise. A NULL file->f_vfsmnt has always been
problematic, but with the per-mount r/o tracking we can't accept anymore
at all.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/open.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff -puN fs/open.c~check-for-NULL-vfsmount-in-dentry-open fs/open.c
--- linux-2.6.git/fs/open.c~check-for-NULL-vfsmount-in-dentry-open 2008-02-15 13:25:42.000000000 -0800
+++ linux-2.6.git-dave/fs/open.c 2008-02-15 13:25:42.000000000 -0800
@@ -903,6 +903,18 @@ struct file *dentry_open(struct dentry *
int error;
struct file *f;
+ /*
+ * We must always pass in a valid mount pointer. Historically
+ * callers got away with not passing it, but we must enforce this at
+ * the earliest possible point now to avoid strange problems deep in the
+ * filesystem stack.
+ */
+ if (!mnt) {
+ printk(KERN_WARNING "%s called with NULL vfsmount\n", __func__);
+ dump_stack();
+ return ERR_PTR(-EINVAL);
+ }
+
error = -ENFILE;
f = get_empty_filp();
if (f == NULL) {
_
--
My end goal here is to make sure all users of may_open()
return filps. This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().
This patch moves the sys_open flags to namei flags
calculation into fs/namei.c. We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/namei.c | 43 +++++++++++++++++++++++++++++++++---------
linux-2.6.git-dave/fs/open.c | 22 +--------------------
2 files changed, 36 insertions(+), 29 deletions(-)
diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- linux-2.6.git/fs/namei.c~do-namei_flags-calculation-inside-open_namei 2008-02-15 13:25:44.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:44.000000000 -0800
@@ -1676,7 +1676,12 @@ int may_open(struct nameidata *nd, int a
return 0;
}
-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function. Its flags must be in the namei format, not
+ * what get passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
int flag, int mode)
{
int error;
@@ -1695,26 +1700,46 @@ static int open_namei_create(struct name
}
/*
+ * Note that while the flag value (low two bits) for sys_open means:
+ * 00 - read-only
+ * 01 - write-only
+ * 10 - read-write
+ * 11 - special
+ * it is changed into
+ * 00 - no permissions needed
+ * 01 - read-permission
+ * 10 - write-permission
+ * 11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * ...If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().
NFS does just that, and will use this in the next
patch.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---
linux-2.6.git-dave/fs/file_table.c | 19 ++++++++++++++++++-
linux-2.6.git-dave/fs/nfsd/nfs4state.c | 3 ++-
linux-2.6.git-dave/include/linux/file.h | 1 +
3 files changed, 21 insertions(+), 2 deletions(-)
diff -puN fs/file_table.c~create-file_drop_write_access-helper fs/file_table.c
--- linux-2.6.git/fs/file_table.c~create-file_drop_write_access-helper 2008-02-15 13:25:46.000000000 -0800
+++ linux-2.6.git-dave/fs/file_table.c 2008-02-15 13:25:46.000000000 -0800
@@ -211,6 +211,23 @@ void fput(struct file *file)
EXPORT_SYMBOL(fput);
+/**
+ * drop_file_write_access - give up ability to write to a file
+ * @file: the file to which we will stop writing
+ *
+ * This is a central place which will give up the ability
+ * to write to @file, along with access to write through
+ * its vfsmount.
+ */
+void drop_file_write_access(struct file *file)
+{
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+
+ put_write_access(inode);
+}
+EXPORT_SYMBOL_GPL(drop_file_write_access);
+
/* __fput is called from task context when aio completion releases the last
* last use of a struct file *. Do not use otherwise.
*/
@@ -237,7 +254,7 @@ void __fput(struct file *file)
cdev_put(inode->i_cdev);
fops_put(file->f_op);
if (file->f_mode & FMODE_WRITE)
- put_write_access(inode);
+ drop_file_write_access(file);
put_pid(file->f_owner.pid);
file_kill(file);
file->f_path.dentry = NULL;
diff -puN ...Elevate the write count during the vfs_rmdir() call. Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- linux-2.6.git-dave/fs/namei.c | 5 +++++ 1 file changed, 5 insertions(+) diff -puN fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count fs/namei.c --- linux-2.6.git/fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count 2008-02-15 13:25:47.000000000 -0800 +++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:47.000000000 -0800 @@ -2189,7 +2189,12 @@ static long do_rmdir(int dfd, const char error = PTR_ERR(dentry); if (IS_ERR(dentry)) goto exit2; + error = mnt_want_write(nd.path.mnt); + if (error) + goto exit3; error = vfs_rmdir(nd.path.dentry->d_inode, dentry); + mnt_drop_write(nd.path.mnt); +exit3: dput(dentry); exit2: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); _ --
Pretty self-explanatory. Fits in with the rest of the series. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/namei.c | 5 +++++ linux-2.6.git-dave/fs/nfsd/nfs4recover.c | 5 +++++ 2 files changed, 10 insertions(+) diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir fs/namei.c --- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir 2008-02-15 13:25:47.000000000 -0800 +++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:47.000000000 -0800 @@ -2082,7 +2082,12 @@ asmlinkage long sys_mkdirat(int dfd, con if (!IS_POSIXACL(nd.path.dentry->d_inode)) mode &= ~current->fs->umask; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; error = vfs_mkdir(nd.path.dentry->d_inode, dentry, mode); + mnt_drop_write(nd.path.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); diff -puN fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir fs/nfsd/nfs4recover.c --- linux-2.6.git/fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir 2008-02-15 13:25:47.000000000 -0800 +++ linux-2.6.git-dave/fs/nfsd/nfs4recover.c 2008-02-15 13:25:47.000000000 -0800 @@ -41,6 +41,7 @@ #include <linux/nfsd/xdr4.h> #include <linux/param.h> #include <linux/file.h> +#include <linux/mount.h> #include <linux/namei.h> #include <asm/uaccess.h> #include <linux/scatterlist.h> @@ -154,7 +155,11 @@ nfsd4_create_clid_dir(struct nfs4_client dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n"); goto out_put; } + status = mnt_want_write(rec_dir.mnt); + if (status) + goto out_put; status = vfs_mkdir(rec_dir.path.dentry->d_inode, dentry, S_IRWXU); + mnt_drop_write(rec_dir.mnt); out_put: dput(dentry); ...
Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/namei.c | 4 ++++ linux-2.6.git-dave/ipc/mqueue.c | 5 ++++- 2 files changed, 8 insertions(+), 1 deletion(-) diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers fs/namei.c --- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers 2008-02-15 13:25:48.000000000 -0800 +++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:48.000000000 -0800 @@ -2280,7 +2280,11 @@ static long do_unlinkat(int dfd, const c inode = dentry->d_inode; if (inode) atomic_inc(&inode->i_count); + error = mnt_want_write(nd.path.mnt); + if (error) + goto exit2; error = vfs_unlink(nd.path.dentry->d_inode, dentry); + mnt_drop_write(nd.path.mnt); exit2: dput(dentry); } diff -puN ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers ipc/mqueue.c --- linux-2.6.git/ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers 2008-02-15 13:25:48.000000000 -0800 +++ linux-2.6.git-dave/ipc/mqueue.c 2008-02-15 13:25:48.000000000 -0800 @@ -742,8 +742,11 @@ asmlinkage long sys_mq_unlink(const char inode = dentry->d_inode; if (inode) atomic_inc(&inode->i_count); - + err = mnt_want_write(mqueue_mnt); + if (err) + goto out_err; err = vfs_unlink(dentry->d_parent->d_inode, dentry); + mnt_drop_write(mqueue_mnt); out_err: dput(dentry); _ --
This basically audits the callers of xattr_permission(), which calls permission() and can perform writes to the filesystem. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- linux-2.6.git-dave/fs/nfsd/nfs4proc.c | 7 ++++++- linux-2.6.git-dave/fs/xattr.c | 16 ++++++++++++++-- 2 files changed, 20 insertions(+), 3 deletions(-) diff -puN fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c --- linux-2.6.git/fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes 2008-02-15 13:25:49.000000000 -0800 +++ linux-2.6.git-dave/fs/nfsd/nfs4proc.c 2008-02-15 13:25:49.000000000 -0800 @@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st return status; } } + status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt); + if (status) + return status; status = nfs_ok; if (setattr->sa_acl != NULL) status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh, setattr->sa_acl); if (status) - return status; + goto out; status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr, 0, (time_t)0); +out: + mnt_drop_write(cstate->current_fh.fh_export->ex_mnt); return status; } diff -puN fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes fs/xattr.c --- linux-2.6.git/fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes 2008-02-15 13:25:49.000000000 -0800 +++ linux-2.6.git-dave/fs/xattr.c 2008-02-15 13:25:49.000000000 -0800 @@ -11,6 +11,7 @@ #include <linux/slab.h> #include <linux/file.h> #include <linux/xattr.h> +#include <linux/mount.h> #include <linux/namei.h> #include <linux/security.h> #include <linux/syscalls.h> @@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co * filesystem or on an immutable / append-only inode. */ if (mask & ...
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/ncpfs/ioctl.c | 54 +++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)
diff -puN fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl fs/ncpfs/ioctl.c
--- linux-2.6.git/fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl 2008-02-15 13:25:49.000000000 -0800
+++ linux-2.6.git-dave/fs/ncpfs/ioctl.c 2008-02-15 13:25:49.000000000 -0800
@@ -14,6 +14,7 @@
#include <linux/ioctl.h>
#include <linux/time.h>
#include <linux/mm.h>
+#include <linux/mount.h>
#include <linux/highuid.h>
#include <linux/smp_lock.h>
#include <linux/vmalloc.h>
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
}
#endif /* CONFIG_NCPFS_NLS */
-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
struct ncp_server *server = NCP_SERVER(inode);
@@ -822,6 +823,57 @@ outrel:
return -EINVAL;
}
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+ switch (cmd) {
+ case NCP_IOC_GET_FS_INFO:
+ case NCP_IOC_GET_FS_INFO_V2:
+ case NCP_IOC_NCPREQUEST:
+ case NCP_IOC_SETDENTRYTTL:
+ case NCP_IOC_SIGN_INIT:
+ case NCP_IOC_LOCKUNLOCK:
+ case NCP_IOC_SET_SIGN_WANTED:
+ return 1;
+ case NCP_IOC_GETOBJECTNAME:
+ case NCP_IOC_SETOBJECTNAME:
+ case NCP_IOC_GETPRIVATEDATA:
+ case NCP_IOC_SETPRIVATEDATA:
+ case NCP_IOC_SETCHARSETS:
+ case NCP_IOC_GETCHARSETS:
+ case NCP_IOC_CONN_LOGGED_IN:
+ case NCP_IOC_GETDENTRYTTL:
+ case NCP_IOC_GETMOUNTUID2:
+ case NCP_IOC_SIGN_WANTED:
+ case NCP_IOC_GETROOT:
+ case NCP_IOC_SETROOT:
+ return 0;
+ default:
+ /* unkown IOCTL command, assume write */
+ return 1;
+ }
+}
+
+int ncp_ioctl(struct inode *inode, struct ...Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/inode.c | 45 ++++++++++++++++++------------------------
1 file changed, 20 insertions(+), 25 deletions(-)
diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime fs/inode.c
--- linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime 2008-02-15 13:25:50.000000000 -0800
+++ linux-2.6.git-dave/fs/inode.c 2008-02-15 13:25:50.000000000 -0800
@@ -1199,42 +1199,37 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry->d_inode;
struct timespec now;
- if (inode->i_flags & S_NOATIME)
+ if (mnt_want_write(mnt))
return;
+ if (inode->i_flags & S_NOATIME)
+ goto out;
if (IS_NOATIME(inode))
- return;
+ goto out;
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
+ goto out;
- /*
- * We may have a NULL vfsmount when coming from NFSD
- */
- if (mnt) {
- if (mnt->mnt_flags & MNT_NOATIME)
- return;
- if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
-
- if (mnt->mnt_flags & MNT_RELATIME) {
- /*
- * With relative atime, only update atime if the
- * previous atime is earlier than either the ctime or
- * mtime.
- */
- if (timespec_compare(&inode->i_mtime,
- &inode->i_atime) < 0 &&
- timespec_compare(&inode->i_ctime,
- &inode->i_atime) < 0)
- return;
- }
+ if (mnt->mnt_flags & MNT_NOATIME)
+ goto out;
+ if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
+ goto out;
+ if (mnt->mnt_flags & MNT_RELATIME) {
+ /*
+ * With relative atime, only update atime if the previous
+ * atime is earlier than either the ctime or mtime.
+ */
+ if (timespec_compare(&inode->i_mtime, &inode->i_atime) < 0 &&
+ timespec_compare(&inode->i_ctime, &inode->i_atime) < 0)
+ goto out;
}
...Now includes fix for oops seen by akpm.
"never let a libc developer write your kernel code" - hch
"nor, apparently, a kernel developer" - akpm
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/utimes.c | 18 ++++++++++++------
1 file changed, 12 insertions(+), 6 deletions(-)
diff -puN fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes fs/utimes.c
--- linux-2.6.git/fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes 2008-02-15 13:25:50.000000000 -0800
+++ linux-2.6.git-dave/fs/utimes.c 2008-02-15 13:25:50.000000000 -0800
@@ -2,6 +2,7 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/linkage.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/stat.h>
@@ -59,6 +60,7 @@ long do_utimes(int dfd, char __user *fil
struct inode *inode;
struct iattr newattrs;
struct file *f = NULL;
+ struct vfsmount *mnt;
error = -EINVAL;
if (times && (!nsec_valid(times[0].tv_nsec) ||
@@ -79,18 +81,20 @@ long do_utimes(int dfd, char __user *fil
if (!f)
goto out;
dentry = f->f_path.dentry;
+ mnt = f->f_path.mnt;
} else {
error = __user_walk_fd(dfd, filename, (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW, &nd);
if (error)
goto out;
dentry = nd.path.dentry;
+ mnt = nd.path.mnt;
}
inode = dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(mnt);
+ if (error)
goto dput_and_out;
/* Don't worry, the checks are done in inode_change_ok() */
@@ -98,7 +102,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
if (times[0].tv_nsec == ...open_namei() will, in the future, need to take mount write counts
over its creation and truncation (via may_open()) operations. It
needs to keep these write counts until any potential filp that is
created gets __fput()'d.
This gets complicated in the error handling and becomes very murky
as to how far open_namei() actually got, and whether or not that
mount write count was taken. That makes it a bad interface.
All that the current do_filp_open() really does is allocate the
nameidata on the stack, then call open_namei().
So, this merges those two functions and moves filp_open() over
to namei.c so it can be close to its buddy: do_filp_open(). It
also gets a kerneldoc comment in the process.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/namei.c | 100 +++++++++++++++++++---------------
linux-2.6.git-dave/fs/open.c | 19 ------
linux-2.6.git-dave/include/linux/fs.h | 3 -
3 files changed, 59 insertions(+), 63 deletions(-)
diff -puN fs/namei.c~make-open_namei-return-a-filp fs/namei.c
--- linux-2.6.git/fs/namei.c~make-open_namei-return-a-filp 2008-02-15 13:25:44.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:44.000000000 -0800
@@ -1724,17 +1724,13 @@ static inline int open_to_namei_flags(in
}
/*
- * open_namei()
- *
- * namei for open - this is in fact almost the whole open-routine.
- *
* Note that the low bits of "flag" aren't the same as in the open
* system call. See open_to_namei_flags().
- * SMP-safe
*/
-int open_namei(int dfd, const char *pathname, int open_flag,
- int mode, struct nameidata *nd)
+struct file *do_filp_open(int dfd, const char *pathname,
+ int open_flag, int mode)
{
+ struct nameidata nd;
int acc_mode, error;
struct path path;
struct dentry *dir;
@@ -1757,18 +1753,19 @@ int open_namei(int dfd, ...Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/inode.c | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time fs/inode.c --- linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time 2008-02-15 13:25:51.000000000 -0800 +++ linux-2.6.git-dave/fs/inode.c 2008-02-15 13:25:51.000000000 -0800 @@ -1250,10 +1250,13 @@ void file_update_time(struct file *file) struct inode *inode = file->f_path.dentry->d_inode; struct timespec now; int sync_it = 0; + int err; if (IS_NOCMTIME(inode)) return; - if (IS_RDONLY(inode)) + + err = mnt_want_write(file->f_vfsmnt); + if (err) return; now = current_fs_time(inode->i_sb); @@ -1274,6 +1277,7 @@ void file_update_time(struct file *file) if (sync_it) mark_inode_dirty_sync(inode); + mnt_drop_write(file->f_vfsmnt); } EXPORT_SYMBOL(file_update_time); _ --
Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/namei.c | 10 ++++++++++ 1 file changed, 10 insertions(+) diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls fs/namei.c --- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls 2008-02-15 13:25:52.000000000 -0800 +++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:52.000000000 -0800 @@ -2365,7 +2365,12 @@ asmlinkage long sys_symlinkat(const char if (IS_ERR(dentry)) goto out_unlock; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; error = vfs_symlink(nd.path.dentry->d_inode, dentry, from, S_IALLUGO); + mnt_drop_write(nd.path.mnt); +out_dput: dput(dentry); out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); @@ -2460,7 +2465,12 @@ asmlinkage long sys_linkat(int olddfd, c error = PTR_ERR(new_dentry); if (IS_ERR(new_dentry)) goto out_unlock; + error = mnt_want_write(nd.path.mnt); + if (error) + goto out_dput; error = vfs_link(old_nd.path.dentry, nd.path.dentry->d_inode, new_dentry); + mnt_drop_write(nd.path.mnt); +out_dput: dput(new_dentry); out_unlock: mutex_unlock(&nd.path.dentry->d_inode->i_mutex); _ --
Some ioctl()s can cause writes to the filesystem. Take these, and make them
use mnt_want/drop_write() instead.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/ext2/ioctl.c | 49 +++++++----
linux-2.6.git-dave/fs/ext3/ioctl.c | 103 +++++++++++++++---------
linux-2.6.git-dave/fs/fat/file.c | 12 +-
linux-2.6.git-dave/fs/hfsplus/ioctl.c | 40 +++++----
linux-2.6.git-dave/fs/jfs/ioctl.c | 35 +++++---
linux-2.6.git-dave/fs/ocfs2/ioctl.c | 11 +-
linux-2.6.git-dave/fs/reiserfs/ioctl.c | 65 +++++++++------
linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 ++-
8 files changed, 213 insertions(+), 117 deletions(-)
diff -puN fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/ext2/ioctl.c
--- linux-2.6.git/fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls 2008-02-15 13:25:52.000000000 -0800
+++ linux-2.6.git-dave/fs/ext2/ioctl.c 2008-02-15 13:25:52.000000000 -0800
@@ -12,6 +12,7 @@
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include <linux/smp_lock.h>
#include <asm/current.h>
#include <asm/uaccess.h>
@@ -23,6 +24,7 @@ long ext2_ioctl(struct file *filp, unsig
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
+ int ret;
ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg);
@@ -34,14 +36,19 @@ long ext2_ioctl(struct file *filp, unsig
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;
- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ ret = mnt_want_write(filp->f_vfsmnt);
+ if (ret)
+ return ret;
+
+ if (!is_owner_or_cap(inode)) {
+ ret = -EACCES;
+ goto setflags_out;
+ }
...This is the first really tricky patch in the series. It elevates the writer
count on a mount each time a non-special file is opened for write.
We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps. This will cover even those cases, while a
call in may_open() would not have.
There is also an elevated count around the vfs_create() call in open_namei().
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.
Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/file_table.c | 14 ++++++
linux-2.6.git-dave/fs/namei.c | 75 ++++++++++++++++++++++++++++++++-----
linux-2.6.git-dave/fs/open.c | 36 ++++++++++++++++-
linux-2.6.git-dave/ipc/mqueue.c | 16 ++++++-
4 files changed, 127 insertions(+), 14 deletions(-)
diff -puN fs/file_table.c~r-o-bind-mounts-elevate-write-count-opened-files fs/file_table.c
--- linux-2.6.git/fs/file_table.c~r-o-bind-mounts-elevate-write-count-opened-files 2008-02-15 13:25:53.000000000 -0800
+++ linux-2.6.git-dave/fs/file_table.c 2008-02-15 13:25:53.000000000 -0800
@@ -199,6 +199,17 @@ int init_file(struct file *file, struct
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
+
+ /*
+ * These mounts don't really matter in practice
+ * for r/o bind mounts. They aren't userspace-
+ * visible. We do this for consistency, and so
+ * that we can do debugging checks at __fput()
+ */
+ if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+ error = ...This also uses the little helper in the NFS code to make an if() a little bit
less ugly. We introduced the helper at the beginning of the series.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/namei.c | 4 ++++
linux-2.6.git-dave/fs/nfsd/vfs.c | 17 +++++++++++++----
2 files changed, 17 insertions(+), 4 deletions(-)
diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename 2008-02-15 13:25:54.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:54.000000000 -0800
@@ -2751,8 +2751,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;
+ error = mnt_want_write(oldnd.path.mnt);
+ if (error)
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
+ mnt_drop_write(oldnd.path.mnt);
exit5:
dput(new_dentry);
exit4:
diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename fs/nfsd/vfs.c
--- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename 2008-02-15 13:25:54.000000000 -0800
+++ linux-2.6.git-dave/fs/nfsd/vfs.c 2008-02-15 13:25:54.000000000 -0800
@@ -1678,13 +1678,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;
-#ifdef MSNFS
- if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+ if (svc_msnfs(ffhp) &&
((atomic_read(&odentry->d_count) > 1)
|| (atomic_read(&ndentry->d_count) > 1))) {
host_err = -EPERM;
- } else
-#endif
+ goto out_dput_new;
+ }
+
+ host_err = -EXDEV;
+ if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+ goto out_dput_new;
+ host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+ if (host_err)
+ goto ...Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/open.c | 14 ++++++++------ 1 file changed, 8 insertions(+), 6 deletions(-) diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate fs/open.c --- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate 2008-02-15 13:25:55.000000000 -0800 +++ linux-2.6.git-dave/fs/open.c 2008-02-15 13:25:55.000000000 -0800 @@ -244,21 +244,21 @@ static long do_sys_truncate(const char _ if (!S_ISREG(inode->i_mode)) goto dput_and_out; - error = vfs_permission(&nd, MAY_WRITE); + error = mnt_want_write(nd.path.mnt); if (error) goto dput_and_out; - error = -EROFS; - if (IS_RDONLY(inode)) - goto dput_and_out; + error = vfs_permission(&nd, MAY_WRITE); + if (error) + goto mnt_drop_write_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto mnt_drop_write_and_out; error = get_write_access(inode); if (error) - goto dput_and_out; + goto mnt_drop_write_and_out; /* * Make sure that there are no leases. get_write_access() protects @@ -276,6 +276,8 @@ static long do_sys_truncate(const char _ put_write_and_out: put_write_access(inode); +mnt_drop_write_and_out: + mnt_drop_write(nd.path.mnt); dput_and_out: path_put(&nd.path); out: _ --
Elevate the write count during the xfs m/ctime updates.
XFS has to do it's own timestamp updates due to an unfortunate VFS
design limitation, so it will have to track writers by itself aswell.
[hch: split out from the touch_atime patch as it's not related to it at all]
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---
linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c | 7 -------
linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c | 9 ++++++++-
2 files changed, 8 insertions(+), 8 deletions(-)
diff -puN fs/xfs/linux-2.6/xfs_iops.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates fs/xfs/linux-2.6/xfs_iops.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_iops.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates 2008-02-15 13:25:56.000000000 -0800
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c 2008-02-15 13:25:56.000000000 -0800
@@ -157,13 +157,6 @@ xfs_ichgtime_fast(
*/
ASSERT((flags & XFS_ICHGTIME_ACC) == 0);
- /*
- * We're not supposed to change timestamps in readonly-mounted
- * filesystems. Throw it away if anyone asks us.
- */
- if (unlikely(IS_RDONLY(inode)))
- return;
-
if (flags & XFS_ICHGTIME_MOD) {
tvp = &inode->i_mtime;
ip->i_d.di_mtime.t_sec = (__int32_t)tvp->tv_sec;
diff -puN fs/xfs/linux-2.6/xfs_lrw.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates fs/xfs/linux-2.6/xfs_lrw.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_lrw.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates 2008-02-15 13:25:56.000000000 -0800
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c 2008-02-15 13:25:56.000000000 -0800
@@ -51,6 +51,7 @@
#include "xfs_vnodeops.h"
#include <linux/capability.h>
+#include <linux/mount.h>
#include <linux/writeback.h>
@@ -679,10 +680,16 @@ start:
if (new_size > xip->i_size)
xip->i_new_size = new_size;
- if (likely(!(ioflags & IO_INVIS))) {
+ /*
+ * We're not ...It is OK to let access() go without using a mnt_want/drop_write() pair because it doesn't actually do writes to the filesystem, and it is inherently racy anyway. This is a rare case when it is OK to use __mnt_is_readonly() directly. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- linux-2.6.git-dave/fs/open.c | 13 +++++++++++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff -puN fs/open.c~r-o-bind-mounts-make-access-use-mnt-check fs/open.c --- linux-2.6.git/fs/open.c~r-o-bind-mounts-make-access-use-mnt-check 2008-02-15 13:25:57.000000000 -0800 +++ linux-2.6.git-dave/fs/open.c 2008-02-15 13:25:57.000000000 -0800 @@ -459,8 +459,17 @@ asmlinkage long sys_faccessat(int dfd, c if(res || !(mode & S_IWOTH) || special_file(nd.path.dentry->d_inode->i_mode)) goto out_path_release; - - if(IS_RDONLY(nd.path.dentry->d_inode)) + /* + * This is a rare case where using __mnt_is_readonly() + * is OK without a mnt_want/drop_write() pair. Since + * no actual write to the fs is performed here, we do + * not need to telegraph to that to anyone. + * + * By doing this, we accept that this access is + * inherently racy and know that the fs may change + * state before we even see this result. + */ + if (__mnt_is_readonly(nd.path.mnt)) res = -EROFS; out_path_release: _ --
chown/chmod,etc... don't call permission in the same way that the normal "open for write" calls do. They still write to the filesystem, so bump the write count during these operations. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- linux-2.6.git-dave/fs/open.c | 39 ++++++++++++++++++++++++++++++--------- 1 file changed, 30 insertions(+), 9 deletions(-) diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends fs/open.c --- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends 2008-02-15 13:25:55.000000000 -0800 +++ linux-2.6.git-dave/fs/open.c 2008-02-15 13:25:55.000000000 -0800 @@ -567,12 +567,12 @@ asmlinkage long sys_fchmod(unsigned int audit_inode(NULL, dentry); - err = -EROFS; - if (IS_RDONLY(inode)) + err = mnt_want_write(file->f_vfsmnt); + if (err) goto out_putf; err = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto out_putf; + goto out_drop_write; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) mode = inode->i_mode; @@ -581,6 +581,8 @@ asmlinkage long sys_fchmod(unsigned int err = notify_change(dentry, &newattrs); mutex_unlock(&inode->i_mutex); +out_drop_write: + mnt_drop_write(file->f_vfsmnt); out_putf: fput(file); out: @@ -600,13 +602,13 @@ asmlinkage long sys_fchmodat(int dfd, co goto out; inode = nd.path.dentry->d_inode; - error = -EROFS; - if (IS_RDONLY(inode)) + error = mnt_want_write(nd.path.mnt); + if (error) goto dput_and_out; error = -EPERM; if (IS_IMMUTABLE(inode) || IS_APPEND(inode)) - goto dput_and_out; + goto out_drop_write; mutex_lock(&inode->i_mutex); if (mode == (mode_t) -1) @@ -616,6 +618,8 @@ asmlinkage long sys_fchmodat(int dfd, co error = notify_change(nd.path.dentry, &newattrs); mutex_unlock(&inode->i_mutex); ...
If we depend on the inodes for writeability, we will not catch the r/o mounts when implemented. This patches uses __mnt_want_write(). It does not guarantee that the mount will stay writeable after the check. But, this is OK for one of the checks because it is just for a printk(). The other two are probably unnecessary and duplicate existing checks in the VFS. This won't make them better checks than before, but it will make them detect r/o mounts. Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> --- linux-2.6.git-dave/fs/nfs/dir.c | 3 ++- linux-2.6.git-dave/fs/nfsd/vfs.c | 5 +++-- 2 files changed, 5 insertions(+), 3 deletions(-) diff -puN fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly fs/nfs/dir.c --- linux-2.6.git/fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly 2008-02-15 13:25:57.000000000 -0800 +++ linux-2.6.git-dave/fs/nfs/dir.c 2008-02-15 13:25:57.000000000 -0800 @@ -967,7 +967,8 @@ static int is_atomic_open(struct inode * if (nd->flags & LOOKUP_DIRECTORY) return 0; /* Are we trying to write to a read only partition? */ - if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE))) + if (__mnt_is_readonly(nd->mnt) && + (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE))) return 0; return 1; } diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly fs/nfsd/vfs.c --- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly 2008-02-15 13:25:57.000000000 -0800 +++ linux-2.6.git-dave/fs/nfsd/vfs.c 2008-02-15 13:25:57.000000000 -0800 @@ -1874,7 +1874,7 @@ nfsd_permission(struct svc_rqst *rqstp, inode->i_mode, IS_IMMUTABLE(inode)? " immut" : "", IS_APPEND(inode)? " append" : "", - IS_RDONLY(inode)? " ro" : ...
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().
So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.
This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/namei.c | 43 +++++++++++++++++++++++++---------
linux-2.6.git-dave/fs/nfsd/vfs.c | 4 +++
linux-2.6.git-dave/net/unix/af_unix.c | 4 +++
3 files changed, 40 insertions(+), 11 deletions(-)
diff -puN fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create 2008-02-15 13:25:58.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c 2008-02-15 13:25:58.000000000 -0800
@@ -2038,6 +2038,23 @@ int vfs_mknod(struct inode *dir, struct
return error;
}
+static int may_mknod(mode_t mode)
+{
+ switch (mode & S_IFMT) {
+ case S_IFREG:
+ case S_IFCHR:
+ case S_IFBLK:
+ case S_IFIFO:
+ case S_IFSOCK:
+ case 0: /* zero mode translates to S_IFREG */
+ return 0;
+ case S_IFDIR:
+ return -EPERM;
+ default:
+ return -EINVAL;
+ }
+}
+
asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode,
unsigned dev)
{
@@ -2056,12 +2073,19 @@ asmlinkage long sys_mknodat(int dfd, con
if (error)
goto out;
dentry = lookup_create(&nd, 0);
- error = PTR_ERR(dentry);
-
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto out_unlock;
+ }
if (!IS_POSIXACL(nd.path.dentry->d_inode))
mode &= ~current->fs->umask;
- if (!IS_ERR(dentry)) ...This is the real meat of the entire series. It actually implements the tracking of the number of writers to a mount. However, it causes scalability problems because there can be hundreds of cpus doing open()/close() on files on the same mnt at the same time. Even an atomic_t in the mnt has massive scalaing problems because the cacheline gets so terribly contended. This uses a statically-allocated percpu variable. All want/drop operations are local to a cpu as long that cpu operates on the same mount, and there are no writer count imbalances. Writer count imbalances happen when a write is taken on one cpu, and released on another, like when an open/close pair is performed on two Upon a remount,ro request, all of the data from the percpu variables is collected (expensive, but very rare) and we determine if there are any outstanding writers to the mount. I've written a little benchmark to sit in a loop for a couple of seconds in several cpus in parallel doing open/write/close loops. http://sr71.net/~dave/linux/openbench.c The code in here is a a worst-possible case for this patch. It does opens on a _pair_ of files in two different mounts in parallel. This should cause my code to lose its "operate on the same mount" optimization completely. This worst-case scenario causes a 3% degredation in the benchmark. I could probably get rid of even this 3%, but it would be more complex than what I have here, and I think this is getting into acceptable territory. In practice, I expect writing more than 3 bytes to a file, as well as disk I/O to mask any effects that this has. (To get rid of that 3%, we could have an #defined number of mounts in the percpu variable. So, instead of a CPU getting operate only on percpu data when it accesses only one mount, it could stay on percpu data when it only accesses N or fewer mounts.) Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Dave Hansen ...
I think you should also add a cpu_writer->mnt = NULL; here. It's not a bug, but I had to think a bit about why it's not a bug. Miklos --
Yeah, I kinda copied the code from __clear_mnt_count() where keeping ->mnt is actually a mini optimization. But, there is certainly no chance of that mnt popping up again after a __mntput(), so I clear it in there now. I also added a comment explaining. -- Dave --
Originally from: Herbert Poetzl <herbert@13thfloor.at>
This is the core of the read-only bind mount patch set.
Note that this does _not_ add a "ro" option directly to the bind mount
operation. If you require such a mount, you must first do the bind, then
follow it up with a 'mount -o remount,ro' operation:
If you wish to have a r/o bind mount of /foo on bar:
mount --bind /foo /bar
mount -o remount,ro /bar
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/namespace.c | 50 ++++++++++++++++++++++++++-----
linux-2.6.git-dave/include/linux/mount.h | 1
2 files changed, 44 insertions(+), 7 deletions(-)
diff -puN fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time 2008-02-15 13:25:59.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c 2008-02-15 13:25:59.000000000 -0800
@@ -105,7 +105,11 @@ struct vfsmount *alloc_vfsmnt(const char
*/
int __mnt_is_readonly(struct vfsmount *mnt)
{
- return (mnt->mnt_sb->s_flags & MS_RDONLY);
+ if (mnt->mnt_flags & MNT_READONLY)
+ return 1;
+ if (mnt->mnt_sb->s_flags & MS_RDONLY)
+ return 1;
+ return 0;
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);
@@ -299,7 +303,7 @@ void mnt_drop_write(struct vfsmount *mnt
}
EXPORT_SYMBOL_GPL(mnt_drop_write);
-int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_make_readonly(struct vfsmount *mnt)
{
int ret = 0;
@@ -312,15 +316,25 @@ int mnt_make_readonly(struct vfsmount *m
goto out;
}
/*
- * actually set mount's r/o flag here to make
- * __mnt_is_readonly() true, which keeps anyone
- * from doing a successful mnt_want_write().
+ * nobody can do a successful mnt_want_write() with all
+ * of the counts in MNT_DENIED_WRITE and the locks held.
...There have been a few oopses caused by 'struct file's with NULL f_vfsmnts.
There was also a set of potentially missed mnt_want_write()s from
dentry_open() calls.
This patch provides a very simple debugging framework to catch these kinds of
bugs. It will WARN_ON() them, but should stop us from having any oopses or
mnt_writer count imbalances.
I'm quite convinced that this is a good thing because it found bugs in the
stuff I was working on as soon as I wrote it.
[hch: made it conditional on a debug option.
But it's still a little bit too ugly]
[hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
linux-2.6.git-dave/fs/file_table.c | 11 ++++++-
linux-2.6.git-dave/fs/open.c | 12 +++++++-
linux-2.6.git-dave/fs/super.c | 3 ++
linux-2.6.git-dave/include/linux/fs.h | 49 ++++++++++++++++++++++++++++++++++
linux-2.6.git-dave/lib/Kconfig.debug | 10 ++++++
5 files changed, 82 insertions(+), 3 deletions(-)
diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file fs/file_table.c
--- linux-2.6.git/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file 2008-02-15 13:26:00.000000000 -0800
+++ linux-2.6.git-dave/fs/file_table.c 2008-02-15 13:26:00.000000000 -0800
@@ -42,6 +42,7 @@ static inline void file_free_rcu(struct
static inline void file_free(struct file *f)
{
percpu_counter_dec(&nr_files);
+ file_check_state(f);
call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
}
@@ -207,6 +208,7 @@ int init_file(struct file *file, struct
* that we can do debugging checks at __fput()
*/
if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+ file_take_write(file);
error = mnt_want_write(mnt);
WARN_ON(error);
}
@@ -237,8 +239,13 @@ void ...Linus, We've been causing Ted a lot of pain having to keep different copies of his patches for mainline and for -mm since -mm has these functions and mainline doesn't. I'm pretty darn sure at that these are the right API and we just need another run in -mm to make sure that the reset of the series in in good shape. Could we bump this one in a bit ahead of the rest to ease Ted's pain? I think the odds of this breaking anything by itself are pretty low. --- This patch adds two functions: mnt_want_write() and mnt_drop_write(). These are used like a lock pair around and fs operations that might cause a write to the filesystem. Note that these will replace many of the uses of IS_RDONLY(inode) currently in the kernel. Before these can become useful, we must first cover each place in the VFS where writes are performed with a want/drop pair. When that is complete, we can actually introduce code that will safely check the counts before allowing r/w<->r/o transitions to occur. Acked-by: Serge Hallyn <serue@us.ibm.com> Acked-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Christoph Hellwig <hch@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Dave Hansen <haveblue@us.ibm.com> --- linux-2.6.git-dave/fs/namespace.c | 54 +++++++++++++++++++++++++++++++ linux-2.6.git-dave/include/linux/mount.h | 3 + 2 files changed, 57 insertions(+) diff -puN fs/namespace.c~r-o-bind-mounts-stub-functions fs/namespace.c --- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-stub-functions 2008-02-15 13:25:45.000000000 -0800 +++ linux-2.6.git-dave/fs/namespace.c 2008-02-15 13:25:45.000000000 -0800 @@ -80,6 +80,60 @@ struct vfsmount *alloc_vfsmnt(const char return mnt; } +/* + * Most r/o checks on a fs are for operations that take + * discrete amounts of time, like a write() or unlink(). + * We must keep track of when those operations start + * (for permission checks) and when they end, so that + * we can determine when writes ...
