Re: [PATCH 07/30] r/o bind mounts: stub functions

Previous thread: none

Next thread: inode leak in 2.6.24? by Ferenc Wagner on Friday, February 15, 2008 - 4:18 pm. (8 messages)
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This is against current Linus git.

Miklos, if you send me a copy of your current unprivledged mount
code merged against mainline, I'll merge with this.

git://git.kernel.org/pub/scm/linux/kernel/git/daveh/linux-2.6-robind.git

This rolls up all the -mm bugfixes that were accumulated, and
addresses some new review comments from Al.  Also contains some
reworking from hch and a patch from Jeff Dike.

Just posting here to let everyone have a sniff before we resend
it back to -mm.

---

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.

This has a number of uses.  It allows chroots to have parts of
filesystems writable.  It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to somefilesystems.  This also replaces 
patches that vserver has had out of the tree for several years.

It allows security enhancement by making sure that parts of
your filesystem are read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.
I've been using this script:

	http://sr71.net/~dave/linux/robind-test.sh

to test that the feature is working as desired.  It takes a
directory and makes a regular bind and a r/o bind mount of it.
It then performs some normal filesystem operations on the
three directories, including ones that are expected to fail,
like creating a file on the r/o mount.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Here's patch for hppfs that uses vfs_kern_mount to make sure it always
has a procfs instance and passed the vfsmount on through the inode
private data.  Also fixes a procfs file_system_type leak for every attempted   
hppfs mount.

[ jdike - gave this file a style workover, plus deleted hppfs_dentry_ops ]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jeff Dike <jdike@linux.intel.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/hppfs/hppfs_kern.c |  364 ++++++++++++++++---------------
 1 file changed, 188 insertions(+), 176 deletions(-)

diff -puN fs/hppfs/hppfs_kern.c~hppfs-pass-vfsmount-to-dentry_open fs/hppfs/hppfs_kern.c
--- linux-2.6.git/fs/hppfs/hppfs_kern.c~hppfs-pass-vfsmount-to-dentry_open	2008-02-15 13:25:42.000000000 -0800
+++ linux-2.6.git-dave/fs/hppfs/hppfs_kern.c	2008-02-15 13:25:42.000000000 -0800
@@ -1,23 +1,25 @@
 /*
- * Copyright (C) 2002 Jeff Dike (jdike@karaya.com)
+ * Copyright (C) 2002 - 2007 Jeff Dike (jdike@{addtoit,linux.intel}.com)
  * Licensed under the GPL
  */
 
-#include <linux/fs.h>
+#include <linux/ctype.h>
+#include <linux/dcache.h>
 #include <linux/file.h>
-#include <linux/module.h>
+#include <linux/fs.h>
 #include <linux/init.h>
-#include <linux/slab.h>
-#include <linux/list.h>
 #include <linux/kernel.h>
-#include <linux/ctype.h>
-#include <linux/dcache.h>
+#include <linux/list.h>
+#include <linux/module.h>
+#include <linux/mount.h>
+#include <linux/slab.h>
 #include <linux/statfs.h>
+#include <linux/types.h>
 #include <asm/uaccess.h>
-#include <asm/fcntl.h>
 #include "os.h"
 
-static int init_inode(struct inode *inode, struct dentry *dentry);
+static int init_inode(struct inode *inode, struct dentry *dentry,
+		      struct vfsmount *mnt);
 
 struct hppfs_data {
 	struct list_head list;
@@ -33,6 +35,7 @@ struct hppfs_private {
 
 struct hppfs_inode_info {
         struct dentry *proc_dentry;
+	struct vfsmount *proc_mnt;
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Some new uses of get_empty_filp() have crept in, and are
not properly taking mnt_want_write()s.  This fixes them
up.

We really need to kill get_empty_filp().

Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/anon_inodes.c |   16 ++++++----------
 linux-2.6.git-dave/fs/file_table.c  |    6 ++++++
 linux-2.6.git-dave/fs/pipe.c        |   19 +++++++++----------
 3 files changed, 21 insertions(+), 20 deletions(-)

diff -puN fs/anon_inodes.c~fix-up-new-filp-allocators fs/anon_inodes.c
--- linux-2.6.git/fs/anon_inodes.c~fix-up-new-filp-allocators	2008-02-15 13:25:43.000000000 -0800
+++ linux-2.6.git-dave/fs/anon_inodes.c	2008-02-15 13:25:43.000000000 -0800
@@ -81,13 +81,10 @@ int anon_inode_getfd(int *pfd, struct in
 
 	if (IS_ERR(anon_inode_inode))
 		return -ENODEV;
-	file = get_empty_filp();
-	if (!file)
-		return -ENFILE;
 
 	error = get_unused_fd();
 	if (error < 0)
-		goto err_put_filp;
+		return error;
 	fd = error;
 
 	/*
@@ -114,14 +111,15 @@ int anon_inode_getfd(int *pfd, struct in
 	dentry->d_flags &= ~DCACHE_UNHASHED;
 	d_instantiate(dentry, anon_inode_inode);
 
-	file->f_path.mnt = mntget(anon_inode_mnt);
-	file->f_path.dentry = dentry;
+	error = -ENFILE;
+	file = alloc_file(anon_inode_mnt, dentry,
+			  FMODE_READ | FMODE_WRITE, fops);
+	if (!file)
+		goto err_put_unused_fd;
 	file->f_mapping = anon_inode_inode->i_mapping;
 
 	file->f_pos = 0;
 	file->f_flags = O_RDWR;
-	file->f_op = fops;
-	file->f_mode = FMODE_READ | FMODE_WRITE;
 	file->f_version = 0;
 	file->private_data = priv;
 
@@ -134,8 +132,6 @@ int anon_inode_getfd(int *pfd, struct in
 
 err_put_unused_fd:
 	put_unused_fd(fd);
-err_put_filp:
-	put_filp(file);
 	return error;
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

After several posts and bug reports regarding interaction with the NULL
nameidata, here's a patch to clean up the mess with struct file in the
reiserfs xattr code.

As observed in several of the posts, there's really no need for struct file
to exist in the xattr code.  It was really only passed around due to the
f_op->readdir() and a_ops->{prepare,commit}_write prototypes requiring it.

reiserfs_prepare_write() and reiserfs_commit_write() don't actually use the
struct file passed to it, and the xattr code uses a private version of
reiserfs_readdir() to enumerate the xattr directories.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/reiserfs/xattr.c |  110 +++++++++------------------------
 1 file changed, 30 insertions(+), 80 deletions(-)

diff -puN fs/reiserfs/xattr.c~reiserfs-eliminate-private-use-of-struct-file-in-xattr fs/reiserfs/xattr.c
--- linux-2.6.git/fs/reiserfs/xattr.c~reiserfs-eliminate-private-use-of-struct-file-in-xattr	2008-02-15 13:25:41.000000000 -0800
+++ linux-2.6.git-dave/fs/reiserfs/xattr.c	2008-02-15 13:25:41.000000000 -0800
@@ -191,28 +191,11 @@ static struct dentry *get_xa_file_dentry
 	dput(xadir);
 	if (err)
 		xafile = ERR_PTR(err);
-	return xafile;
-}
-
-/* Opens a file pointer to the attribute associated with inode */
-static struct file *open_xa_file(const struct inode *inode, const char *name,
-				 int flags)
-{
-	struct dentry *xafile;
-	struct file *fp;
-
-	xafile = get_xa_file_dentry(inode, name, flags);
-	if (IS_ERR(xafile))
-		return ERR_PTR(PTR_ERR(xafile));
 	else if (!xafile->d_inode) {
 		dput(xafile);
-		return ERR_PTR(-ENODATA);
+		xafile = ERR_PTR(-ENODATA);
 	}
-
-	fp = dentry_open(xafile, NULL, O_RDWR);
-	/* dentry_open dputs the dentry if it fails */
-
-	return fp;
+	return xafile;
 }
 
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

The emergency remount code forcibly removes FMODE_WRITE from
filps.  The r/o bind mount code notices that this was done
without a proper mnt_drop_write() and properly gives a
warning.

This patch does a mnt_drop_write() to keep everything
balanced.
	
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/super.c |   20 ++++++++++++++++++--
 1 file changed, 18 insertions(+), 2 deletions(-)

diff -puN fs/super.c~robind-sysrq-fix fs/super.c
--- linux-2.6.git/fs/super.c~robind-sysrq-fix	2008-02-15 13:25:46.000000000 -0800
+++ linux-2.6.git-dave/fs/super.c	2008-02-15 13:25:46.000000000 -0800
@@ -37,6 +37,7 @@
 #include <linux/idr.h>
 #include <linux/kobject.h>
 #include <linux/mutex.h>
+#include <linux/file.h>
 #include <asm/uaccess.h>
 
 
@@ -567,10 +568,25 @@ static void mark_files_ro(struct super_b
 {
 	struct file *f;
 
+retry:
 	file_list_lock();
 	list_for_each_entry(f, &sb->s_files, f_u.fu_list) {
-		if (S_ISREG(f->f_path.dentry->d_inode->i_mode) && file_count(f))
-			f->f_mode &= ~FMODE_WRITE;
+		struct vfsmount *mnt;
+		if (!S_ISREG(f->f_path.dentry->d_inode->i_mode))
+		       continue;
+		if (!file_count(f))
+			continue;
+		if (!(f->f_mode & FMODE_WRITE))
+			continue;
+		f->f_mode &= ~FMODE_WRITE;
+		mnt = f->f_path.mnt;
+		file_list_unlock();
+		/*
+		 * This can sleep, so we can't hold
+		 * the file_list_lock() spinlock.
+		 */
+		mnt_drop_write(mnt);
+		goto retry;
 	}
 	file_list_unlock();
 }
_
--

From: Miklos Szeredi
Date: Monday, February 18, 2008 - 9:29 am

AFAICS an mntget() is needed here to make sure mnt doesn't go away
after releasing file_list_lock.

--

From: Al Viro
Date: Saturday, February 23, 2008 - 6:38 am

Good catch, but... that's not all.  We also have to move
drop_file_write_access() in __fput() past the file_kill(), or we'll
get a race here - CPU 1 enters that loop, sees ->f_count 1, checks
that file is writable, CPU 2 does final fput() and proceeds to
do drop_file_write_access() and mnt_drop_write().  CPU 1, in the
meanwhile, removes FMODE_WRITE and goes on to do mnt_drop_write()
itself.

	Oh, well...  Fixed.
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This patch adds two function mnt_want_write() and mnt_drop_write().  These are
used like a lock pair around and fs operations that might cause a write to the
filesystem.

Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair.  When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namespace.c        |   54 +++++++++++++++++++++++++++++++
 linux-2.6.git-dave/include/linux/mount.h |    3 +
 2 files changed, 57 insertions(+)

diff -puN fs/namespace.c~r-o-bind-mounts-stub-functions fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-stub-functions	2008-02-15 13:25:45.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c	2008-02-15 13:25:45.000000000 -0800
@@ -80,6 +80,60 @@ struct vfsmount *alloc_vfsmnt(const char
 	return mnt;
 }
 
+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/**
+ * mnt_want_write - get write access to a mount
+ * @mnt: the mount on which to take a write
+ *
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success.  When
+ * the write operation is finished, mnt_drop_write()
+ * must be called.  This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+	if (__mnt_is_readonly(mnt))
+		return -EROFS;
+	return ...
From: Theodore Tso
Date: Friday, February 15, 2008 - 5:32 pm

Argh, is there some reason why this couldn't have gotten merged in
-rc1, ahead of the rest of the patch series?  This one is going to
cause more cross-tree merge pain with any filesystem tree that have
changes to fs/*/ioctl.c.

							- Ted
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 5:49 pm

I wasn't meaning for this to hit the 2.6.25-rc series.  We had some
review comments just when the merge window opened, and I was expecting
them to get stuck back in -mm for another round.

-- Dave

--

From: Theodore Tso
Date: Friday, February 15, 2008 - 6:00 pm

Yeah, but it means that I need one set of patches for -mm, and another
set of patches for Linus's mainline.  I notice that your patchset is
currently missing changes for fs/ext4/ioctl.c --- I think because you
dropped them when Mingming picked them up, and then I dropped them
when I was trying to prepare the set of patches to push to Linus.

No problem, I'm sure I can ressurect them, but it's still the same
basic problem that when there are patchsets such as yours which touch
multiple trees in -mm, there are almost inevitably patch conflicts.

It would be nice if an initial patch which introduces the new
functionality you need for r/o bind mounts could get introduced into
mainline *first*, and then people could add patches that call
mnt_want_write(), et. al into their trees gradually.

As it is, I can't see a way around this other than maintaining two
separate patch sets, one that works with r/o bind mounts, and one for
mainline, since otherwise akpm gets grumpy and starts dropping either
your patchset or the ext4 patchset because *he* has to manually fix up
the patch conflicts.  (So instead I have to deal with it by hand, and
then *I* get grumpy.  :-/)

						- Ted
--

From: Andrew Morton
Date: Friday, February 15, 2008 - 6:11 pm

On Fri, 15 Feb 2008 20:00:30 -0500

Yes, I expect that merging a handful of do-nothing mnt_foo_write()

itym "less than usually cheery"
--

From: Christoph Hellwig
Date: Friday, February 15, 2008 - 11:31 pm

Haha,

once we put pieces in the first three patches would be useful aswell,
to easily catch additions in the next cycle that might be adding
NULL-vfsmount calls to dentry_open.
--

From: Andrew Morton
Date: Friday, February 15, 2008 - 11:46 pm

hrm, well, how about putting up a complete and suitably-changelogged patch
series for Linus to look at?  That's be a Dave thing I guess.

I wasn't overawed by the initial patch - why not make those stubs inlined
to truly add zero cost??


--

From: Dave Hansen
Date: Monday, February 18, 2008 - 12:06 am

What I sent was the exact same thing I had from the original patch
series.  They weren't inlined because shortly after that in the series,
we make them much bigger.  Merging that exact patch is the easiest for
me, but I see your point.  I'll rework it  with the others.

-- Dave

--

From: Dave Hansen
Date: Wednesday, February 20, 2008 - 3:25 pm

So, we want

	"[PATCH 07/30] r/o bind mounts: stub functions"
and 
	"[PATCH 03/30] check for null vfsmount in dentry_open()"

But what's the third patch?

-- Dave

--

From: Christoph Hellwig
Date: Wednesday, February 20, 2008 - 3:58 pm

For that latter patch we'd need the reiserfs and hpps fixes.  But
I think it's too late now, let's just keep them in -mm for the
time beeing.
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Make sure no-one calls dentry_open with a NULL vfsmount argument and crap
out with a stacktrace otherwise.  A NULL file->f_vfsmnt has always been
problematic, but with the per-mount r/o tracking we can't accept anymore
at all.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/open.c |   12 ++++++++++++
 1 file changed, 12 insertions(+)

diff -puN fs/open.c~check-for-NULL-vfsmount-in-dentry-open fs/open.c
--- linux-2.6.git/fs/open.c~check-for-NULL-vfsmount-in-dentry-open	2008-02-15 13:25:42.000000000 -0800
+++ linux-2.6.git-dave/fs/open.c	2008-02-15 13:25:42.000000000 -0800
@@ -903,6 +903,18 @@ struct file *dentry_open(struct dentry *
 	int error;
 	struct file *f;
 
+	/*
+	 * We must always pass in a valid mount pointer.   Historically
+	 * callers got away with not passing it, but we must enforce this at
+	 * the earliest possible point now to avoid strange problems deep in the
+	 * filesystem stack.
+	 */
+	if (!mnt) {
+		printk(KERN_WARNING "%s called with NULL vfsmount\n", __func__);
+		dump_stack();
+		return ERR_PTR(-EINVAL);
+	}
+
 	error = -ENFILE;
 	f = get_empty_filp();
 	if (f == NULL) {
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

My end goal here is to make sure all users of may_open()
return filps.  This will ensure that we properly release
mount write counts which were taken for the filp in
may_open().

This patch moves the sys_open flags to namei flags
calculation into fs/namei.c.  We'll shortly be moving
the nameidata_to_filp() calls into namei.c, and this
gets the sys_open flags to a place where we can get
at them when we need them.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namei.c |   43 +++++++++++++++++++++++++++++++++---------
 linux-2.6.git-dave/fs/open.c  |   22 +--------------------
 2 files changed, 36 insertions(+), 29 deletions(-)

diff -puN fs/namei.c~do-namei_flags-calculation-inside-open_namei fs/namei.c
--- linux-2.6.git/fs/namei.c~do-namei_flags-calculation-inside-open_namei	2008-02-15 13:25:44.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:44.000000000 -0800
@@ -1676,7 +1676,12 @@ int may_open(struct nameidata *nd, int a
 	return 0;
 }
 
-static int open_namei_create(struct nameidata *nd, struct path *path,
+/*
+ * Be careful about ever adding any more callers of this
+ * function.  Its flags must be in the namei format, not
+ * what get passed to sys_open().
+ */
+static int __open_namei_create(struct nameidata *nd, struct path *path,
 				int flag, int mode)
 {
 	int error;
@@ -1695,26 +1700,46 @@ static int open_namei_create(struct name
 }
 
 /*
+ * Note that while the flag value (low two bits) for sys_open means:
+ *	00 - read-only
+ *	01 - write-only
+ *	10 - read-write
+ *	11 - special
+ * it is changed into
+ *	00 - no permissions needed
+ *	01 - read-permission
+ *	10 - write-permission
+ *	11 - read-write
+ * for the internal routines (ie open_namei()/follow_link() etc)
+ * This is more logical, and also allows the 00 "no perm needed"
+ * to be used for symlinks (where the permissions are checked
+ * ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

If someone decides to demote a file from r/w to just
r/o, they can use this same code as __fput().

NFS does just that, and will use this in the next
patch.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Cc: Erez Zadok <ezk@cs.sunysb.edu>
Cc: Trond Myklebust <trond.myklebust@fys.uio.no>
Cc: "J Bruce Fields" <bfields@fieldses.org>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Christoph Hellwig <hch@lst.de>
---

 linux-2.6.git-dave/fs/file_table.c      |   19 ++++++++++++++++++-
 linux-2.6.git-dave/fs/nfsd/nfs4state.c  |    3 ++-
 linux-2.6.git-dave/include/linux/file.h |    1 +
 3 files changed, 21 insertions(+), 2 deletions(-)

diff -puN fs/file_table.c~create-file_drop_write_access-helper fs/file_table.c
--- linux-2.6.git/fs/file_table.c~create-file_drop_write_access-helper	2008-02-15 13:25:46.000000000 -0800
+++ linux-2.6.git-dave/fs/file_table.c	2008-02-15 13:25:46.000000000 -0800
@@ -211,6 +211,23 @@ void fput(struct file *file)
 
 EXPORT_SYMBOL(fput);
 
+/**
+ * drop_file_write_access - give up ability to write to a file
+ * @file: the file to which we will stop writing
+ *
+ * This is a central place which will give up the ability
+ * to write to @file, along with access to write through
+ * its vfsmount.
+ */
+void drop_file_write_access(struct file *file)
+{
+	struct dentry *dentry = file->f_path.dentry;
+	struct inode *inode = dentry->d_inode;
+
+	put_write_access(inode);
+}
+EXPORT_SYMBOL_GPL(drop_file_write_access);
+
 /* __fput is called from task context when aio completion releases the last
  * last use of a struct file *.  Do not use otherwise.
  */
@@ -237,7 +254,7 @@ void __fput(struct file *file)
 		cdev_put(inode->i_cdev);
 	fops_put(file->f_op);
 	if (file->f_mode & FMODE_WRITE)
-		put_write_access(inode);
+		drop_file_write_access(file);
 	put_pid(file->f_owner.pid);
 	file_kill(file);
 	file->f_path.dentry = NULL;
diff -puN ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Elevate the write count during the vfs_rmdir() call.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/namei.c |    5 +++++
 1 file changed, 5 insertions(+)

diff -puN fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-do_rmdir-elevate-write-count	2008-02-15 13:25:47.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:47.000000000 -0800
@@ -2189,7 +2189,12 @@ static long do_rmdir(int dfd, const char
 	error = PTR_ERR(dentry);
 	if (IS_ERR(dentry))
 		goto exit2;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto exit3;
 	error = vfs_rmdir(nd.path.dentry->d_inode, dentry);
+	mnt_drop_write(nd.path.mnt);
+exit3:
 	dput(dentry);
 exit2:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Pretty self-explanatory.  Fits in with the rest of the series.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namei.c            |    5 +++++
 linux-2.6.git-dave/fs/nfsd/nfs4recover.c |    5 +++++
 2 files changed, 10 insertions(+)

diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir	2008-02-15 13:25:47.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:47.000000000 -0800
@@ -2082,7 +2082,12 @@ asmlinkage long sys_mkdirat(int dfd, con
 
 	if (!IS_POSIXACL(nd.path.dentry->d_inode))
 		mode &= ~current->fs->umask;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
 	error = vfs_mkdir(nd.path.dentry->d_inode, dentry, mode);
+	mnt_drop_write(nd.path.mnt);
+out_dput:
 	dput(dentry);
 out_unlock:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
diff -puN fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir fs/nfsd/nfs4recover.c
--- linux-2.6.git/fs/nfsd/nfs4recover.c~r-o-bind-mounts-elevate-mnt-writers-for-callers-of-vfs_mkdir	2008-02-15 13:25:47.000000000 -0800
+++ linux-2.6.git-dave/fs/nfsd/nfs4recover.c	2008-02-15 13:25:47.000000000 -0800
@@ -41,6 +41,7 @@
 #include <linux/nfsd/xdr4.h>
 #include <linux/param.h>
 #include <linux/file.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <asm/uaccess.h>
 #include <linux/scatterlist.h>
@@ -154,7 +155,11 @@ nfsd4_create_clid_dir(struct nfs4_client
 		dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
 		goto out_put;
 	}
+	status = mnt_want_write(rec_dir.mnt);
+	if (status)
+		goto out_put;
 	status = vfs_mkdir(rec_dir.path.dentry->d_inode, dentry, S_IRWXU);
+	mnt_drop_write(rec_dir.mnt);
 out_put:
 	dput(dentry);
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namei.c   |    4 ++++
 linux-2.6.git-dave/ipc/mqueue.c |    5 ++++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff -puN fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers	2008-02-15 13:25:48.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:48.000000000 -0800
@@ -2280,7 +2280,11 @@ static long do_unlinkat(int dfd, const c
 		inode = dentry->d_inode;
 		if (inode)
 			atomic_inc(&inode->i_count);
+		error = mnt_want_write(nd.path.mnt);
+		if (error)
+			goto exit2;
 		error = vfs_unlink(nd.path.dentry->d_inode, dentry);
+		mnt_drop_write(nd.path.mnt);
 	exit2:
 		dput(dentry);
 	}
diff -puN ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers ipc/mqueue.c
--- linux-2.6.git/ipc/mqueue.c~r-o-bind-mounts-elevate-mnt-writers-for-vfs_unlink-callers	2008-02-15 13:25:48.000000000 -0800
+++ linux-2.6.git-dave/ipc/mqueue.c	2008-02-15 13:25:48.000000000 -0800
@@ -742,8 +742,11 @@ asmlinkage long sys_mq_unlink(const char
 	inode = dentry->d_inode;
 	if (inode)
 		atomic_inc(&inode->i_count);
-
+	err = mnt_want_write(mqueue_mnt);
+	if (err)
+		goto out_err;
 	err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+	mnt_drop_write(mqueue_mnt);
 out_err:
 	dput(dentry);
 
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This basically audits the callers of xattr_permission(), which calls
permission() and can perform writes to the filesystem.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/nfsd/nfs4proc.c |    7 ++++++-
 linux-2.6.git-dave/fs/xattr.c         |   16 ++++++++++++++--
 2 files changed, 20 insertions(+), 3 deletions(-)

diff -puN fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c
--- linux-2.6.git/fs/nfsd/nfs4proc.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes	2008-02-15 13:25:49.000000000 -0800
+++ linux-2.6.git-dave/fs/nfsd/nfs4proc.c	2008-02-15 13:25:49.000000000 -0800
@@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
 			return status;
 		}
 	}
+	status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt);
+	if (status)
+		return status;
 	status = nfs_ok;
 	if (setattr->sa_acl != NULL)
 		status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh,
 					    setattr->sa_acl);
 	if (status)
-		return status;
+		goto out;
 	status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
 				0, (time_t)0);
+out:
+	mnt_drop_write(cstate->current_fh.fh_export->ex_mnt);
 	return status;
 }
 
diff -puN fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes fs/xattr.c
--- linux-2.6.git/fs/xattr.c~r-o-bind-mounts-elevate-mount-count-for-extended-attributes	2008-02-15 13:25:49.000000000 -0800
+++ linux-2.6.git-dave/fs/xattr.c	2008-02-15 13:25:49.000000000 -0800
@@ -11,6 +11,7 @@
 #include <linux/slab.h>
 #include <linux/file.h>
 #include <linux/xattr.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/security.h>
 #include <linux/syscalls.h>
@@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co
 	 * filesystem  or on an immutable / append-only inode.
 	 */
 	if (mask & ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/ncpfs/ioctl.c |   54 +++++++++++++++++++++++++++++++++++-
 1 file changed, 53 insertions(+), 1 deletion(-)

diff -puN fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl fs/ncpfs/ioctl.c
--- linux-2.6.git/fs/ncpfs/ioctl.c~r-o-bind-mounts-elevate-write-count-during-entire-ncp_ioctl	2008-02-15 13:25:49.000000000 -0800
+++ linux-2.6.git-dave/fs/ncpfs/ioctl.c	2008-02-15 13:25:49.000000000 -0800
@@ -14,6 +14,7 @@
 #include <linux/ioctl.h>
 #include <linux/time.h>
 #include <linux/mm.h>
+#include <linux/mount.h>
 #include <linux/highuid.h>
 #include <linux/smp_lock.h>
 #include <linux/vmalloc.h>
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
 }
 #endif /* CONFIG_NCPFS_NLS */
 
-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
 	      unsigned int cmd, unsigned long arg)
 {
 	struct ncp_server *server = NCP_SERVER(inode);
@@ -822,6 +823,57 @@ outrel:			
 	return -EINVAL;
 }
 
+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+	switch (cmd) {
+	case NCP_IOC_GET_FS_INFO:
+	case NCP_IOC_GET_FS_INFO_V2:
+	case NCP_IOC_NCPREQUEST:
+	case NCP_IOC_SETDENTRYTTL:
+	case NCP_IOC_SIGN_INIT:
+	case NCP_IOC_LOCKUNLOCK:
+	case NCP_IOC_SET_SIGN_WANTED:
+		return 1;
+	case NCP_IOC_GETOBJECTNAME:
+	case NCP_IOC_SETOBJECTNAME:
+	case NCP_IOC_GETPRIVATEDATA:
+	case NCP_IOC_SETPRIVATEDATA:
+	case NCP_IOC_SETCHARSETS:
+	case NCP_IOC_GETCHARSETS:
+	case NCP_IOC_CONN_LOGGED_IN:
+	case NCP_IOC_GETDENTRYTTL:
+	case NCP_IOC_GETMOUNTUID2:
+	case NCP_IOC_SIGN_WANTED:
+	case NCP_IOC_GETROOT:
+	case NCP_IOC_SETROOT:
+		return 0;
+	default:
+		/* unkown IOCTL command, assume write */
+		return 1;
+	}
+}
+
+int ncp_ioctl(struct inode *inode, struct ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/inode.c |   45 ++++++++++++++++++------------------------
 1 file changed, 20 insertions(+), 25 deletions(-)

diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime fs/inode.c
--- linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-do_sys_utime-and-touch_atime	2008-02-15 13:25:50.000000000 -0800
+++ linux-2.6.git-dave/fs/inode.c	2008-02-15 13:25:50.000000000 -0800
@@ -1199,42 +1199,37 @@ void touch_atime(struct vfsmount *mnt, s
 	struct inode *inode = dentry->d_inode;
 	struct timespec now;
 
-	if (inode->i_flags & S_NOATIME)
+	if (mnt_want_write(mnt))
 		return;
+	if (inode->i_flags & S_NOATIME)
+		goto out;
 	if (IS_NOATIME(inode))
-		return;
+		goto out;
 	if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
-		return;
+		goto out;
 
-	/*
-	 * We may have a NULL vfsmount when coming from NFSD
-	 */
-	if (mnt) {
-		if (mnt->mnt_flags & MNT_NOATIME)
-			return;
-		if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
-			return;
-
-		if (mnt->mnt_flags & MNT_RELATIME) {
-			/*
-			 * With relative atime, only update atime if the
-			 * previous atime is earlier than either the ctime or
-			 * mtime.
-			 */
-			if (timespec_compare(&inode->i_mtime,
-						&inode->i_atime) < 0 &&
-			    timespec_compare(&inode->i_ctime,
-						&inode->i_atime) < 0)
-				return;
-		}
+	if (mnt->mnt_flags & MNT_NOATIME)
+		goto out;
+	if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
+		goto out;
+	if (mnt->mnt_flags & MNT_RELATIME) {
+		/*
+		 * With relative atime, only update atime if the previous
+		 * atime is earlier than either the ctime or mtime.
+		 */
+		if (timespec_compare(&inode->i_mtime, &inode->i_atime) < 0 &&
+		    timespec_compare(&inode->i_ctime, &inode->i_atime) < 0)
+			goto out;
 	}
 
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Now includes fix for oops seen by akpm.

"never let a libc developer write your kernel code" - hch

"nor, apparently, a kernel developer" - akpm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Cc: Valdis Kletnieks <Valdis.Kletnieks@vt.edu>
Cc: Balbir Singh <balbir@in.ibm.com>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/utimes.c |   18 ++++++++++++------
 1 file changed, 12 insertions(+), 6 deletions(-)

diff -puN fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes fs/utimes.c
--- linux-2.6.git/fs/utimes.c~r-o-bind-mounts-elevate-write-count-for-do_utimes	2008-02-15 13:25:50.000000000 -0800
+++ linux-2.6.git-dave/fs/utimes.c	2008-02-15 13:25:50.000000000 -0800
@@ -2,6 +2,7 @@
 #include <linux/file.h>
 #include <linux/fs.h>
 #include <linux/linkage.h>
+#include <linux/mount.h>
 #include <linux/namei.h>
 #include <linux/sched.h>
 #include <linux/stat.h>
@@ -59,6 +60,7 @@ long do_utimes(int dfd, char __user *fil
 	struct inode *inode;
 	struct iattr newattrs;
 	struct file *f = NULL;
+	struct vfsmount *mnt;
 
 	error = -EINVAL;
 	if (times && (!nsec_valid(times[0].tv_nsec) ||
@@ -79,18 +81,20 @@ long do_utimes(int dfd, char __user *fil
 		if (!f)
 			goto out;
 		dentry = f->f_path.dentry;
+		mnt = f->f_path.mnt;
 	} else {
 		error = __user_walk_fd(dfd, filename, (flags & AT_SYMLINK_NOFOLLOW) ? 0 : LOOKUP_FOLLOW, &nd);
 		if (error)
 			goto out;
 
 		dentry = nd.path.dentry;
+		mnt = nd.path.mnt;
 	}
 
 	inode = dentry->d_inode;
 
-	error = -EROFS;
-	if (IS_RDONLY(inode))
+	error = mnt_want_write(mnt);
+	if (error)
 		goto dput_and_out;
 
 	/* Don't worry, the checks are done in inode_change_ok() */
@@ -98,7 +102,7 @@ long do_utimes(int dfd, char __user *fil
 	if (times) {
 		error = -EPERM;
                 if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
-                        goto dput_and_out;
+			goto mnt_drop_write_and_out;
 
 		if (times[0].tv_nsec == ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

open_namei() will, in the future, need to take mount write counts
over its creation and truncation (via may_open()) operations.  It
needs to keep these write counts until any potential filp that is
created gets __fput()'d.

This gets complicated in the error handling and becomes very murky
as to how far open_namei() actually got, and whether or not that
mount write count was taken.  That makes it a bad interface.

All that the current do_filp_open() really does is allocate the
nameidata on the stack, then call open_namei().

So, this merges those two functions and moves filp_open() over
to namei.c so it can be close to its buddy: do_filp_open().  It
also gets a kerneldoc comment in the process.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namei.c         |  100 +++++++++++++++++++---------------
 linux-2.6.git-dave/fs/open.c          |   19 ------
 linux-2.6.git-dave/include/linux/fs.h |    3 -
 3 files changed, 59 insertions(+), 63 deletions(-)

diff -puN fs/namei.c~make-open_namei-return-a-filp fs/namei.c
--- linux-2.6.git/fs/namei.c~make-open_namei-return-a-filp	2008-02-15 13:25:44.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:44.000000000 -0800
@@ -1724,17 +1724,13 @@ static inline int open_to_namei_flags(in
 }
 
 /*
- *	open_namei()
- *
- * namei for open - this is in fact almost the whole open-routine.
- *
  * Note that the low bits of "flag" aren't the same as in the open
  * system call.  See open_to_namei_flags().
- * SMP-safe
  */
-int open_namei(int dfd, const char *pathname, int open_flag,
-		int mode, struct nameidata *nd)
+struct file *do_filp_open(int dfd, const char *pathname,
+		int open_flag, int mode)
 {
+	struct nameidata nd;
 	int acc_mode, error;
 	struct path path;
 	struct dentry *dir;
@@ -1757,18 +1753,19 @@ int open_namei(int dfd, ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/inode.c |    6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff -puN fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time fs/inode.c
--- linux-2.6.git/fs/inode.c~r-o-bind-mounts-elevate-write-count-for-file_update_time	2008-02-15 13:25:51.000000000 -0800
+++ linux-2.6.git-dave/fs/inode.c	2008-02-15 13:25:51.000000000 -0800
@@ -1250,10 +1250,13 @@ void file_update_time(struct file *file)
 	struct inode *inode = file->f_path.dentry->d_inode;
 	struct timespec now;
 	int sync_it = 0;
+	int err;
 
 	if (IS_NOCMTIME(inode))
 		return;
-	if (IS_RDONLY(inode))
+
+	err = mnt_want_write(file->f_vfsmnt);
+	if (err)
 		return;
 
 	now = current_fs_time(inode->i_sb);
@@ -1274,6 +1277,7 @@ void file_update_time(struct file *file)
 
 	if (sync_it)
 		mark_inode_dirty_sync(inode);
+	mnt_drop_write(file->f_vfsmnt);
 }
 
 EXPORT_SYMBOL(file_update_time);
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namei.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-for-link-and-symlink-calls	2008-02-15 13:25:52.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:52.000000000 -0800
@@ -2365,7 +2365,12 @@ asmlinkage long sys_symlinkat(const char
 	if (IS_ERR(dentry))
 		goto out_unlock;
 
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
 	error = vfs_symlink(nd.path.dentry->d_inode, dentry, from, S_IALLUGO);
+	mnt_drop_write(nd.path.mnt);
+out_dput:
 	dput(dentry);
 out_unlock:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
@@ -2460,7 +2465,12 @@ asmlinkage long sys_linkat(int olddfd, c
 	error = PTR_ERR(new_dentry);
 	if (IS_ERR(new_dentry))
 		goto out_unlock;
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
+		goto out_dput;
 	error = vfs_link(old_nd.path.dentry, nd.path.dentry->d_inode, new_dentry);
+	mnt_drop_write(nd.path.mnt);
+out_dput:
 	dput(new_dentry);
 out_unlock:
 	mutex_unlock(&nd.path.dentry->d_inode->i_mutex);
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Some ioctl()s can cause writes to the filesystem.  Take these, and make them
use mnt_want/drop_write() instead.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/ext2/ioctl.c              |   49 +++++++----
 linux-2.6.git-dave/fs/ext3/ioctl.c              |  103 +++++++++++++++---------
 linux-2.6.git-dave/fs/fat/file.c                |   12 +-
 linux-2.6.git-dave/fs/hfsplus/ioctl.c           |   40 +++++----
 linux-2.6.git-dave/fs/jfs/ioctl.c               |   35 +++++---
 linux-2.6.git-dave/fs/ocfs2/ioctl.c             |   11 +-
 linux-2.6.git-dave/fs/reiserfs/ioctl.c          |   65 +++++++++------
 linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_ioctl.c |   15 ++-
 8 files changed, 213 insertions(+), 117 deletions(-)

diff -puN fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls fs/ext2/ioctl.c
--- linux-2.6.git/fs/ext2/ioctl.c~r-o-bind-mounts-elevate-write-count-for-some-ioctls	2008-02-15 13:25:52.000000000 -0800
+++ linux-2.6.git-dave/fs/ext2/ioctl.c	2008-02-15 13:25:52.000000000 -0800
@@ -12,6 +12,7 @@
 #include <linux/time.h>
 #include <linux/sched.h>
 #include <linux/compat.h>
+#include <linux/mount.h>
 #include <linux/smp_lock.h>
 #include <asm/current.h>
 #include <asm/uaccess.h>
@@ -23,6 +24,7 @@ long ext2_ioctl(struct file *filp, unsig
 	struct ext2_inode_info *ei = EXT2_I(inode);
 	unsigned int flags;
 	unsigned short rsv_window_size;
+	int ret;
 
 	ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg);
 
@@ -34,14 +36,19 @@ long ext2_ioctl(struct file *filp, unsig
 	case EXT2_IOC_SETFLAGS: {
 		unsigned int oldflags;
 
-		if (IS_RDONLY(inode))
-			return -EROFS;
-
-		if (!is_owner_or_cap(inode))
-			return -EACCES;
+		ret = mnt_want_write(filp->f_vfsmnt);
+		if (ret)
+			return ret;
+
+		if (!is_owner_or_cap(inode)) {
+			ret = -EACCES;
+			goto setflags_out;
+		}
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This is the first really tricky patch in the series.  It elevates the writer
count on a mount each time a non-special file is opened for write.

We used to do this in may_open(), but Miklos pointed out that __dentry_open()
is used as well to create filps.  This will cover even those cases, while a
call in may_open() would not have.

There is also an elevated count around the vfs_create() call in open_namei(). 
See the comments for more details, but we need this to fix a 'create, remount,
fail r/w open()' race.

Some filesystems forego the use of normal vfs calls to create
struct files.   Make sure that these users elevate the mnt
writer count because they will get __fput(), and we need
to make sure they're balanced.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/file_table.c |   14 ++++++
 linux-2.6.git-dave/fs/namei.c      |   75 ++++++++++++++++++++++++++++++++-----
 linux-2.6.git-dave/fs/open.c       |   36 ++++++++++++++++-
 linux-2.6.git-dave/ipc/mqueue.c    |   16 ++++++-
 4 files changed, 127 insertions(+), 14 deletions(-)

diff -puN fs/file_table.c~r-o-bind-mounts-elevate-write-count-opened-files fs/file_table.c
--- linux-2.6.git/fs/file_table.c~r-o-bind-mounts-elevate-write-count-opened-files	2008-02-15 13:25:53.000000000 -0800
+++ linux-2.6.git-dave/fs/file_table.c	2008-02-15 13:25:53.000000000 -0800
@@ -199,6 +199,17 @@ int init_file(struct file *file, struct 
 	file->f_mapping = dentry->d_inode->i_mapping;
 	file->f_mode = mode;
 	file->f_op = fop;
+
+	/*
+	 * These mounts don't really matter in practice
+	 * for r/o bind mounts.  They aren't userspace-
+	 * visible.  We do this for consistency, and so
+	 * that we can do debugging checks at __fput()
+	 */
+	if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+		error = ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This also uses the little helper in the NFS code to make an if() a little bit
less ugly.  We introduced the helper at the beginning of the series.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/namei.c    |    4 ++++
 linux-2.6.git-dave/fs/nfsd/vfs.c |   17 +++++++++++++----
 2 files changed, 17 insertions(+), 4 deletions(-)

diff -puN fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename	2008-02-15 13:25:54.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:54.000000000 -0800
@@ -2751,8 +2751,12 @@ static int do_rename(int olddfd, const c
 	if (new_dentry == trap)
 		goto exit5;
 
+	error = mnt_want_write(oldnd.path.mnt);
+	if (error)
+		goto exit5;
 	error = vfs_rename(old_dir->d_inode, old_dentry,
 				   new_dir->d_inode, new_dentry);
+	mnt_drop_write(oldnd.path.mnt);
 exit5:
 	dput(new_dentry);
 exit4:
diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename fs/nfsd/vfs.c
--- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-elevate-write-count-over-calls-to-vfs_rename	2008-02-15 13:25:54.000000000 -0800
+++ linux-2.6.git-dave/fs/nfsd/vfs.c	2008-02-15 13:25:54.000000000 -0800
@@ -1678,13 +1678,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
 	if (ndentry == trap)
 		goto out_dput_new;
 
-#ifdef MSNFS
-	if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+	if (svc_msnfs(ffhp) &&
 		((atomic_read(&odentry->d_count) > 1)
 		 || (atomic_read(&ndentry->d_count) > 1))) {
 			host_err = -EPERM;
-	} else
-#endif
+			goto out_dput_new;
+	}
+
+	host_err = -EXDEV;
+	if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+		goto out_dput_new;
+	host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+	if (host_err)
+		goto ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/open.c |   14 ++++++++------
 1 file changed, 8 insertions(+), 6 deletions(-)

diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-do_sys_truncate	2008-02-15 13:25:55.000000000 -0800
+++ linux-2.6.git-dave/fs/open.c	2008-02-15 13:25:55.000000000 -0800
@@ -244,21 +244,21 @@ static long do_sys_truncate(const char _
 	if (!S_ISREG(inode->i_mode))
 		goto dput_and_out;
 
-	error = vfs_permission(&nd, MAY_WRITE);
+	error = mnt_want_write(nd.path.mnt);
 	if (error)
 		goto dput_and_out;
 
-	error = -EROFS;
-	if (IS_RDONLY(inode))
-		goto dput_and_out;
+	error = vfs_permission(&nd, MAY_WRITE);
+	if (error)
+		goto mnt_drop_write_and_out;
 
 	error = -EPERM;
 	if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-		goto dput_and_out;
+		goto mnt_drop_write_and_out;
 
 	error = get_write_access(inode);
 	if (error)
-		goto dput_and_out;
+		goto mnt_drop_write_and_out;
 
 	/*
 	 * Make sure that there are no leases.  get_write_access() protects
@@ -276,6 +276,8 @@ static long do_sys_truncate(const char _
 
 put_write_and_out:
 	put_write_access(inode);
+mnt_drop_write_and_out:
+	mnt_drop_write(nd.path.mnt);
 dput_and_out:
 	path_put(&nd.path);
 out:
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

Elevate the write count during the xfs m/ctime updates.

XFS has to do it's own timestamp updates due to an unfortunate VFS
design limitation, so it will have to track writers by itself aswell.

[hch: split out from the touch_atime patch as it's not related to it at all]

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c |    7 -------
 linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c  |    9 ++++++++-
 2 files changed, 8 insertions(+), 8 deletions(-)

diff -puN fs/xfs/linux-2.6/xfs_iops.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates fs/xfs/linux-2.6/xfs_iops.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_iops.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates	2008-02-15 13:25:56.000000000 -0800
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_iops.c	2008-02-15 13:25:56.000000000 -0800
@@ -157,13 +157,6 @@ xfs_ichgtime_fast(
 	 */
 	ASSERT((flags & XFS_ICHGTIME_ACC) == 0);
 
-	/*
-	 * We're not supposed to change timestamps in readonly-mounted
-	 * filesystems.  Throw it away if anyone asks us.
-	 */
-	if (unlikely(IS_RDONLY(inode)))
-		return;
-
 	if (flags & XFS_ICHGTIME_MOD) {
 		tvp = &inode->i_mtime;
 		ip->i_d.di_mtime.t_sec = (__int32_t)tvp->tv_sec;
diff -puN fs/xfs/linux-2.6/xfs_lrw.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates fs/xfs/linux-2.6/xfs_lrw.c
--- linux-2.6.git/fs/xfs/linux-2.6/xfs_lrw.c~r-o-bind-mounts-elevate-writer-count-for-xfs-timestamp-updates	2008-02-15 13:25:56.000000000 -0800
+++ linux-2.6.git-dave/fs/xfs/linux-2.6/xfs_lrw.c	2008-02-15 13:25:56.000000000 -0800
@@ -51,6 +51,7 @@
 #include "xfs_vnodeops.h"
 
 #include <linux/capability.h>
+#include <linux/mount.h>
 #include <linux/writeback.h>
 
 
@@ -679,10 +680,16 @@ start:
 	if (new_size > xip->i_size)
 		xip->i_new_size = new_size;
 
-	if (likely(!(ioflags & IO_INVIS))) {
+	/*
+	 * We're not ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

It is OK to let access() go without using a mnt_want/drop_write() pair because
it doesn't actually do writes to the filesystem, and it is inherently racy
anyway.  This is a rare case when it is OK to use __mnt_is_readonly()
directly.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/open.c |   13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/open.c~r-o-bind-mounts-make-access-use-mnt-check fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-make-access-use-mnt-check	2008-02-15 13:25:57.000000000 -0800
+++ linux-2.6.git-dave/fs/open.c	2008-02-15 13:25:57.000000000 -0800
@@ -459,8 +459,17 @@ asmlinkage long sys_faccessat(int dfd, c
 	if(res || !(mode & S_IWOTH) ||
 	   special_file(nd.path.dentry->d_inode->i_mode))
 		goto out_path_release;
-
-	if(IS_RDONLY(nd.path.dentry->d_inode))
+	/*
+	 * This is a rare case where using __mnt_is_readonly()
+	 * is OK without a mnt_want/drop_write() pair.  Since
+	 * no actual write to the fs is performed here, we do
+	 * not need to telegraph to that to anyone.
+	 *
+	 * By doing this, we accept that this access is
+	 * inherently racy and know that the fs may change
+	 * state before we even see this result.
+	 */
+	if (__mnt_is_readonly(nd.path.mnt))
 		res = -EROFS;
 
 out_path_release:
_
--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

chown/chmod,etc...  don't call permission in the same way that the normal
"open for write" calls do.  They still write to the filesystem, so bump the
write count during these operations.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/open.c |   39 ++++++++++++++++++++++++++++++---------
 1 file changed, 30 insertions(+), 9 deletions(-)

diff -puN fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends fs/open.c
--- linux-2.6.git/fs/open.c~r-o-bind-mounts-elevate-writer-count-for-chown-and-friends	2008-02-15 13:25:55.000000000 -0800
+++ linux-2.6.git-dave/fs/open.c	2008-02-15 13:25:55.000000000 -0800
@@ -567,12 +567,12 @@ asmlinkage long sys_fchmod(unsigned int 
 
 	audit_inode(NULL, dentry);
 
-	err = -EROFS;
-	if (IS_RDONLY(inode))
+	err = mnt_want_write(file->f_vfsmnt);
+	if (err)
 		goto out_putf;
 	err = -EPERM;
 	if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-		goto out_putf;
+		goto out_drop_write;
 	mutex_lock(&inode->i_mutex);
 	if (mode == (mode_t) -1)
 		mode = inode->i_mode;
@@ -581,6 +581,8 @@ asmlinkage long sys_fchmod(unsigned int 
 	err = notify_change(dentry, &newattrs);
 	mutex_unlock(&inode->i_mutex);
 
+out_drop_write:
+	mnt_drop_write(file->f_vfsmnt);
 out_putf:
 	fput(file);
 out:
@@ -600,13 +602,13 @@ asmlinkage long sys_fchmodat(int dfd, co
 		goto out;
 	inode = nd.path.dentry->d_inode;
 
-	error = -EROFS;
-	if (IS_RDONLY(inode))
+	error = mnt_want_write(nd.path.mnt);
+	if (error)
 		goto dput_and_out;
 
 	error = -EPERM;
 	if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
-		goto dput_and_out;
+		goto out_drop_write;
 
 	mutex_lock(&inode->i_mutex);
 	if (mode == (mode_t) -1)
@@ -616,6 +618,8 @@ asmlinkage long sys_fchmodat(int dfd, co
 	error = notify_change(nd.path.dentry, &newattrs);
 	mutex_unlock(&inode->i_mutex);
 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

If we depend on the inodes for writeability, we will not catch the r/o mounts
when implemented.

This patches uses __mnt_want_write().  It does not guarantee that the mount
will stay writeable after the check.  But, this is OK for one of the checks
because it is just for a printk().

The other two are probably unnecessary and duplicate existing checks in the
VFS.  This won't make them better checks than before, but it will make them
detect r/o mounts.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/nfs/dir.c  |    3 ++-
 linux-2.6.git-dave/fs/nfsd/vfs.c |    5 +++--
 2 files changed, 5 insertions(+), 3 deletions(-)

diff -puN fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly fs/nfs/dir.c
--- linux-2.6.git/fs/nfs/dir.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly	2008-02-15 13:25:57.000000000 -0800
+++ linux-2.6.git-dave/fs/nfs/dir.c	2008-02-15 13:25:57.000000000 -0800
@@ -967,7 +967,8 @@ static int is_atomic_open(struct inode *
 	if (nd->flags & LOOKUP_DIRECTORY)
 		return 0;
 	/* Are we trying to write to a read only partition? */
-	if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
+	if (__mnt_is_readonly(nd->mnt) &&
+	    (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
 		return 0;
 	return 1;
 }
diff -puN fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly fs/nfsd/vfs.c
--- linux-2.6.git/fs/nfsd/vfs.c~r-o-bind-mounts-nfs-check-mnt-instead-of-superblock-directly	2008-02-15 13:25:57.000000000 -0800
+++ linux-2.6.git-dave/fs/nfsd/vfs.c	2008-02-15 13:25:57.000000000 -0800
@@ -1874,7 +1874,7 @@ nfsd_permission(struct svc_rqst *rqstp, 
 		inode->i_mode,
 		IS_IMMUTABLE(inode)?	" immut" : "",
 		IS_APPEND(inode)?	" append" : "",
-		IS_RDONLY(inode)?	" ro" : ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().

So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch and into a helper function
suggested by Christoph.

This also encapsulates a fix for mknod(S_IFREG) that Miklos
found.

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/namei.c         |   43 +++++++++++++++++++++++++---------
 linux-2.6.git-dave/fs/nfsd/vfs.c      |    4 +++
 linux-2.6.git-dave/net/unix/af_unix.c |    4 +++
 3 files changed, 40 insertions(+), 11 deletions(-)

diff -puN fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create fs/namei.c
--- linux-2.6.git/fs/namei.c~r-o-bind-mounts-sys_mknodat-elevate-write-count-for-vfs_mknod-create	2008-02-15 13:25:58.000000000 -0800
+++ linux-2.6.git-dave/fs/namei.c	2008-02-15 13:25:58.000000000 -0800
@@ -2038,6 +2038,23 @@ int vfs_mknod(struct inode *dir, struct 
 	return error;
 }
 
+static int may_mknod(mode_t mode)
+{
+	switch (mode & S_IFMT) {
+	case S_IFREG:
+	case S_IFCHR:
+	case S_IFBLK:
+	case S_IFIFO:
+	case S_IFSOCK:
+	case 0: /* zero mode translates to S_IFREG */
+		return 0;
+	case S_IFDIR:
+		return -EPERM;
+	default:
+		return -EINVAL;
+	}
+}
+
 asmlinkage long sys_mknodat(int dfd, const char __user *filename, int mode,
 				unsigned dev)
 {
@@ -2056,12 +2073,19 @@ asmlinkage long sys_mknodat(int dfd, con
 	if (error)
 		goto out;
 	dentry = lookup_create(&nd, 0);
-	error = PTR_ERR(dentry);
-
+	if (IS_ERR(dentry)) {
+		error = PTR_ERR(dentry);
+		goto out_unlock;
+	}
 	if (!IS_POSIXACL(nd.path.dentry->d_inode))
 		mode &= ~current->fs->umask;
-	if (!IS_ERR(dentry)) ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:37 pm

This is the real meat of the entire series.  It actually
implements the tracking of the number of writers to a mount.
However, it causes scalability problems because there can be
hundreds of cpus doing open()/close() on files on the same mnt at
the same time.  Even an atomic_t in the mnt has massive scalaing
problems because the cacheline gets so terribly contended.

This uses a statically-allocated percpu variable.  All want/drop
operations are local to a cpu as long that cpu operates on the same
mount, and there are no writer count imbalances.  Writer count
imbalances happen when a write is taken on one cpu, and released
on another, like when an open/close pair is performed on two

Upon a remount,ro request, all of the data from the percpu
variables is collected (expensive, but very rare) and we determine
if there are any outstanding writers to the mount.

I've written a little benchmark to sit in a loop for a couple of
seconds in several cpus in parallel doing open/write/close loops.

http://sr71.net/~dave/linux/openbench.c

The code in here is a a worst-possible case for this patch.  It
does opens on a _pair_ of files in two different mounts in parallel.
This should cause my code to lose its "operate on the same mount"
optimization completely.  This worst-case scenario causes a 3%
degredation in the benchmark.

I could probably get rid of even this 3%, but it would be more
complex than what I have here, and I think this is getting into
acceptable territory.  In practice, I expect writing more than 3
bytes to a file, as well as disk I/O to mask any effects that this
has.

(To get rid of that 3%, we could have an #defined number of mounts
in the percpu variable.  So, instead of a CPU getting operate only
on percpu data when it accesses only one mount, it could stay on
percpu data when it only accesses N or fewer mounts.)

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen ...
From: Miklos Szeredi
Date: Monday, February 18, 2008 - 9:10 am

I think you should also add a

		cpu_writer->mnt = NULL;

here.  It's not a bug, but I had to think a bit about why it's not a bug.

Miklos
--

From: Dave Hansen
Date: Wednesday, February 20, 2008 - 2:12 pm

Yeah, I kinda copied the code from __clear_mnt_count() where keeping
->mnt is actually a mini optimization.  But, there is certainly no
chance of that mnt popping up again after a __mntput(), so I clear it in
there now.  I also added a comment explaining.

-- Dave

--

From: Dave Hansen
Date: Friday, February 15, 2008 - 3:38 pm

Originally from: Herbert Poetzl <herbert@13thfloor.at>

This is the core of the read-only bind mount patch set.

Note that this does _not_ add a "ro" option directly to the bind mount
operation.  If you require such a mount, you must first do the bind, then
follow it up with a 'mount -o remount,ro' operation:

If you wish to have a r/o bind mount of /foo on bar:

	mount --bind /foo /bar
	mount -o remount,ro /bar

Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/namespace.c        |   50 ++++++++++++++++++++++++++-----
 linux-2.6.git-dave/include/linux/mount.h |    1 
 2 files changed, 44 insertions(+), 7 deletions(-)

diff -puN fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-honor-r-w-changes-at-do_remount-time	2008-02-15 13:25:59.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c	2008-02-15 13:25:59.000000000 -0800
@@ -105,7 +105,11 @@ struct vfsmount *alloc_vfsmnt(const char
  */
 int __mnt_is_readonly(struct vfsmount *mnt)
 {
-	return (mnt->mnt_sb->s_flags & MS_RDONLY);
+	if (mnt->mnt_flags & MNT_READONLY)
+		return 1;
+	if (mnt->mnt_sb->s_flags & MS_RDONLY)
+		return 1;
+	return 0;
 }
 EXPORT_SYMBOL_GPL(__mnt_is_readonly);
 
@@ -299,7 +303,7 @@ void mnt_drop_write(struct vfsmount *mnt
 }
 EXPORT_SYMBOL_GPL(mnt_drop_write);
 
-int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_make_readonly(struct vfsmount *mnt)
 {
 	int ret = 0;
 
@@ -312,15 +316,25 @@ int mnt_make_readonly(struct vfsmount *m
 		goto out;
 	}
 	/*
-	 * actually set mount's r/o flag here to make
-	 * __mnt_is_readonly() true, which keeps anyone
-	 * from doing a successful mnt_want_write().
+	 * nobody can do a successful mnt_want_write() with all
+	 * of the counts in MNT_DENIED_WRITE and the locks held.
 	 ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 3:38 pm

There have been a few oopses caused by 'struct file's with NULL f_vfsmnts. 
There was also a set of potentially missed mnt_want_write()s from
dentry_open() calls.

This patch provides a very simple debugging framework to catch these kinds of
bugs.  It will WARN_ON() them, but should stop us from having any oopses or
mnt_writer count imbalances.

I'm quite convinced that this is a good thing because it found bugs in the
stuff I was working on as soon as I wrote it.

[hch: made it conditional on a debug option.
      But it's still a little bit too ugly]

[hch: merged forced remount r/o fix from Dave and akpm's fix for the fix]

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 linux-2.6.git-dave/fs/file_table.c    |   11 ++++++-
 linux-2.6.git-dave/fs/open.c          |   12 +++++++-
 linux-2.6.git-dave/fs/super.c         |    3 ++
 linux-2.6.git-dave/include/linux/fs.h |   49 ++++++++++++++++++++++++++++++++++
 linux-2.6.git-dave/lib/Kconfig.debug  |   10 ++++++
 5 files changed, 82 insertions(+), 3 deletions(-)

diff -puN fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file fs/file_table.c
--- linux-2.6.git/fs/file_table.c~keep-track-of-mnt_writer-state-of-struct-file	2008-02-15 13:26:00.000000000 -0800
+++ linux-2.6.git-dave/fs/file_table.c	2008-02-15 13:26:00.000000000 -0800
@@ -42,6 +42,7 @@ static inline void file_free_rcu(struct 
 static inline void file_free(struct file *f)
 {
 	percpu_counter_dec(&nr_files);
+	file_check_state(f);
 	call_rcu(&f->f_u.fu_rcuhead, file_free_rcu);
 }
 
@@ -207,6 +208,7 @@ int init_file(struct file *file, struct 
 	 * that we can do debugging checks at __fput()
 	 */
 	if ((mode & FMODE_WRITE) && !special_file(dentry->d_inode->i_mode)) {
+		file_take_write(file);
 		error = mnt_want_write(mnt);
 		WARN_ON(error);
 	}
@@ -237,8 +239,13 @@ void ...
From: Dave Hansen
Date: Friday, February 15, 2008 - 6:32 pm

Linus,

We've been causing Ted a lot of pain having to keep different copies of
his patches for mainline and for -mm since -mm has these functions and
mainline doesn't.  I'm pretty darn sure at that these are the right API
and we just need another run in -mm to make sure that the reset of the
series in in good shape.

Could we bump this one in a bit ahead of the rest to ease Ted's pain?  I
think the odds of this breaking anything by itself are pretty low.

---

This patch adds two functions: mnt_want_write() and mnt_drop_write().
These are used like a lock pair around and fs operations that might
cause a write to the filesystem.

Note that these will replace many of the uses of IS_RDONLY(inode)
currently in the kernel.

Before these can become useful, we must first cover each place in the VFS
where writes are performed with a want/drop pair.  When that is complete, we
can actually introduce code that will safely check the counts before allowing
r/w<->r/o transitions to occur.

Acked-by: Serge Hallyn <serue@us.ibm.com>
Acked-by: Al Viro <viro@ZenIV.linux.org.uk>
Signed-off-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

 linux-2.6.git-dave/fs/namespace.c        |   54 +++++++++++++++++++++++++++++++
 linux-2.6.git-dave/include/linux/mount.h |    3 +
 2 files changed, 57 insertions(+)

diff -puN fs/namespace.c~r-o-bind-mounts-stub-functions fs/namespace.c
--- linux-2.6.git/fs/namespace.c~r-o-bind-mounts-stub-functions	2008-02-15 13:25:45.000000000 -0800
+++ linux-2.6.git-dave/fs/namespace.c	2008-02-15 13:25:45.000000000 -0800
@@ -80,6 +80,60 @@ struct vfsmount *alloc_vfsmnt(const char
 	return mnt;
 }

+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes ...
Previous thread: none

Next thread: inode leak in 2.6.24? by Ferenc Wagner on Friday, February 15, 2008 - 4:18 pm. (8 messages)