Re: [PATCH 02/24] rearrange may_open() to be r/o friendly

Previous thread: [PATCH 2/2] unify DMA_..BIT_MASK definitions: cleanup drivers/scsi/gdth.c by Borislav Petkov on Monday, September 17, 2007 - 1:51 pm. (1 message)

Next thread: [PATCH 0/1] ppc64: Convert cpu_sibling_map to a per_cpu data array ppc64 v2 by travis on Monday, September 17, 2007 - 2:35 pm. (1 message)
To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

If we can't pull the entire series into -mm, can we just put the
first three patches for now? They can stand on their own.

---

Why do we need r/o bind mounts?

This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.

This has a number of uses. It allows chroots to have parts of
filesystems writable. It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to somefilesystems. This also replaces
patches that vserver has had out of the tree for several years.

It allows security enhancement by making sure that parts of
your filesystem read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.
I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular
bind and a r/o bind mount of it. It then performs some normal
filesystem operations on the three directories, including ones
that are expected to fail, like creating a file on the r/o
mount.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:44 pm

Yes, they're kinda a series of their own. But I still think we really
want this in -mm. As we've seen on the kernel summit there's a pretty
desparate need for it. And there's not many changes in this area in
-mm, maybe the unprivilegued mounts. I'd personally prioritize the
r/o bindmounts over them as they're more needed and we need more reviewing
of the unprivilegued mounts (I'll try to come back to that soon).

FYI: the series you sent out are missing the last two patches, and while
they compile as-is they're actually missing the interesting part :)
-

To: Christoph Hellwig <hch@...>
Cc: Dave Hansen <haveblue@...>, <linux-kernel@...>, Miklos Szeredi <miklos@...>
Date: Wednesday, September 19, 2007 - 5:24 pm

On Wed, 19 Sep 2007 18:44:18 +0100

What's the situation on unprivileged mounts? iirc, it's all a bit stuck.

If unpriv-mounts code isn't going to go into mainline ahead of r/o bind
mounts then it'd make a big mess to prepare the r/o bind mount patches on
top of unprivileged mounts.

It sounds like a better approach would be for me to merge the r/o bind
mounts code and to drop (or maybe rework) the unprivileged mounts patches

-

To: Andrew Morton <akpm@...>
Cc: Christoph Hellwig <hch@...>, <linux-kernel@...>, Miklos Szeredi <miklos@...>
Date: Wednesday, September 19, 2007 - 5:56 pm

I actually don't think they collided too much. There were a couple of
patches, like maybe 2 or 3 that needed any futzing at all.

I'll cook up a set straight on top of mainline if that helps.

-- Dave

-

To: Dave Hansen <haveblue@...>
Cc: Christoph Hellwig <hch@...>, <linux-kernel@...>, Miklos Szeredi <miklos@...>
Date: Wednesday, September 19, 2007 - 6:06 pm

On Wed, 19 Sep 2007 14:56:27 -0700

That sounds good, thanks.

There may be collisions with unionfs too, but if that happens in a
non-trivial way I may just drop unionfs - it doesn't look like it's
going to get there in its present form.
-

To: <akpm@...>
Cc: <haveblue@...>, <hch@...>, <linux-kernel@...>, <miklos@...>
Date: Thursday, September 20, 2007 - 5:58 am

I agree, that per mount r/o is more important than unprivileged
mounts. Especially because fixing up mount(8) and option showing in
/proc/mounts is really nowhere yet, both of which are needed to make
unprivileged mounting really useful.

So if that's convenient for you, just drop the unpriv mount patches,
and I'll resubmit sometime later.

Miklos
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 10:21 am

FWIW this set (through 22) looks good to me.

thanks,
-serge
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/net/unix/af_unix.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff -puN net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime 2007-09-17 09:44:04.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-09-17 09:44:04.000000000 -0700
@@ -729,21 +729,27 @@ static struct sock *unix_find_other(stru
err = path_lookup(sunname->sun_path, LOOKUP_FOLLOW, &nd);
if (err)
goto fail;
+
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto put_path_fail;
+
err = vfs_permission(&nd, MAY_WRITE);
if (err)
- goto put_fail;
+ goto mnt_drop_write_fail;

err = -ECONNREFUSED;
if (!S_ISSOCK(nd.dentry->d_inode->i_mode))
- goto put_fail;
+ goto mnt_drop_write_fail;
u=unix_find_socket_byinode(nd.dentry->d_inode);
if (!u)
- goto put_fail;
+ goto mnt_drop_write_fail;

if (u->sk_type == type)
touch_atime(nd.mnt, nd.dentry);

path_release(&nd);
+ mnt_drop_write(nd.mnt);

err=-EPROTOTYPE;
if (u->sk_type != type) {
@@ -763,7 +769,9 @@ static struct sock *unix_find_other(stru
}
return u;

-put_fail:
+mnt_drop_write_fail:
+ mnt_drop_write(nd.mnt);
+put_path_fail:
path_release(&nd);
fail:
*error=err;
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:35 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Elevate the write count during the vfs_rmdir() call.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 5 +++++
1 file changed, 5 insertions(+)

diff -puN fs/namei.c~do-rmdir-elevate-write-count fs/namei.c
--- lxc/fs/namei.c~do-rmdir-elevate-write-count 2007-09-17 09:44:09.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:44:09.000000000 -0700
@@ -2173,7 +2173,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit3;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:39 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 4 ++++
lxc-dave/ipc/mqueue.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)

diff -puN fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers fs/namei.c
--- lxc/fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-09-17 09:44:08.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:44:08.000000000 -0700
@@ -2253,7 +2253,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit2;
error = vfs_unlink(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers ipc/mqueue.c
--- lxc/ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-09-17 09:44:08.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2007-09-17 09:44:08.000000000 -0700
@@ -747,8 +747,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-
+ err = mnt_want_write(mqueue_mnt);
+ if (err)
+ goto out_err;
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+ mnt_drop_write(mqueue_mnt);
out_err:
dput(dentry);

_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:38 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/inode.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)

diff -puN fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime 2007-09-17 09:44:07.000000000 -0700
+++ lxc-dave/fs/inode.c 2007-09-17 09:44:07.000000000 -0700
@@ -1185,22 +1185,23 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry->d_inode;
struct timespec now;

- if (inode->i_flags & S_NOATIME)
+ if (mnt && mnt_want_write(mnt))
return;
+ if (inode->i_flags & S_NOATIME)
+ goto out;
if (IS_NOATIME(inode))
- return;
+ goto out;
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
+ goto out;

/*
* We may have a NULL vfsmount when coming from NFSD
*/
if (mnt) {
if (mnt->mnt_flags & MNT_NOATIME)
- return;
+ goto out;
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
-
+ goto out;
if (mnt->mnt_flags & MNT_RELATIME) {
/*
* With relative atime, only update atime if the
@@ -1211,16 +1212,19 @@ void touch_atime(struct vfsmount *mnt, s
&inode->i_atime) < 0 &&
timespec_compare(&inode->i_ctime,
&inode->i_atime) < 0)
- return;
+ goto out;
}
}

now = current_fs_time(inode->i_sb);
if (timespec_equal(&inode->i_atime, &now))
- return;
+ goto out;

inode->i_atime = now;
mark_inode_dirty_sync(inode);
+out:
+ if (mnt)
+ mnt_drop_write(mnt);
}
EXPORT_SYMBOL(touch_atime);

_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:36 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().

So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 32 +++++++++++++++++++++-----------
lxc-dave/fs/nfsd/vfs.c | 4 ++++
lxc-dave/net/unix/af_unix.c | 4 ++++
3 files changed, 29 insertions(+), 11 deletions(-)

diff -puN fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c
--- lxc/fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-09-17 09:44:08.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:44:08.000000000 -0700
@@ -1970,12 +1970,25 @@ asmlinkage long sys_mknodat(int dfd, con
if (error)
goto out;
dentry = lookup_create(&nd, 0);
- error = PTR_ERR(dentry);
-
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto out_unlock;
+ }
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
- if (!IS_ERR(dentry)) {
- switch (mode & S_IFMT) {
+ if (S_ISDIR(mode)) {
+ error = -EPERM;
+ goto out_dput;
+ }
+ if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) &&
+ !S_ISFIFO(mode) && !S_ISSOCK(mode) && mode != 0) {
+ error = -EINVAL;
+ goto out_dput;
+ }
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
+ switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
break;
@@ -1986,14 +1999,11 @@ asmlinkage long sys_mknodat(int dfd, con
case S_IFIFO: case S_IFSOCK:
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
break;
- case S_IFDIR:
- error = -EPERM;
- break;
- default:
- error = -EINVAL;
- }
- dput(dentry);
}
+ mnt_drop_write(nd.mnt);
+out...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:38 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Pretty self-explanatory. Fits in with the rest of the series.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 5 +++++
lxc-dave/fs/nfsd/nfs4recover.c | 4 ++++
2 files changed, 9 insertions(+)

diff -puN fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/namei.c
--- lxc/fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-09-17 09:44:01.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:44:01.000000000 -0700
@@ -2051,7 +2051,12 @@ asmlinkage long sys_mkdirat(int dfd, con

if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
diff -puN fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/nfsd/nfs4recover.c
--- lxc/fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-09-17 09:44:01.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4recover.c 2007-09-17 09:44:01.000000000 -0700
@@ -156,7 +156,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
+ status = mnt_want_write(rec_dir.mnt);
+ if (status)
+ goto out_put;
status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+ mnt_drop_write(rec_dir.mnt);
out_put:
dput(dentry);
out_unlock:
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:32 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

This patch adds two function mnt_want_write() and
mnt_drop_write(). These are used like a lock pair around
and fs operations that might cause a write to the filesystem.

Before these can become useful, we must first cover each
place in the VFS where writes are performed with a
want/drop pair. When that is complete, we can actually
introduce code that will safely check the counts before
allowing r/w<->r/o transitions to occur.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namespace.c | 46 +++++++++++++++++++++++++++++++++++++++++
lxc-dave/include/linux/mount.h | 3 ++
2 files changed, 49 insertions(+)

diff -puN fs/namespace.c~add-vfsmount-writer-count fs/namespace.c
--- lxc/fs/namespace.c~add-vfsmount-writer-count 2007-09-17 09:43:57.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-09-17 09:43:57.000000000 -0700
@@ -80,6 +80,52 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}

+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/*
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success. When
+ * the write operation is finished, mnt_drop_write()
+ * must be called. This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+ if (__mnt_is_readonly(mnt))
+ return -EROFS;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mnt_want_write);
+
+/*
+ * Tells the low-level filesystem that we are done
+ * performing a write to it. Must be matched with
+ * mnt_want_write() call above.
+ */
+void mnt_drop_write(struct vfsmount *mnt)
+{
+}
+EXPORT_SYMBOL_GPL(mnt_drop_write);
+
+/*
+ * This shouldn't be used directly ouside of the VFS.
+ * It does ...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:28 pm

kerneldoc comments, please!

Otherwise looks fine.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff -puN fs/namei.c~elevate-write-count-for-link-and-symlink-calls fs/namei.c
--- lxc/fs/namei.c~elevate-write-count-for-link-and-symlink-calls 2007-09-17 09:44:02.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:44:02.000000000 -0700
@@ -2324,7 +2324,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;

+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
@@ -2419,7 +2424,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:33 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Some ioctl()s can cause writes to the filesystem. Take
these, and make them use mnt_want/drop_write() instead.

We need to pass the filp one layer deeper in XFS, but
somebody _just_ pulled it out in February because nobody
was using it, so I don't feel guilty for adding it back.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/ext2/ioctl.c | 46 +++++++++-----
lxc-dave/fs/ext3/ioctl.c | 100 +++++++++++++++++++++-----------
lxc-dave/fs/ext4/ioctl.c | 105 +++++++++++++++++++++-------------
lxc-dave/fs/fat/file.c | 10 +--
lxc-dave/fs/hfsplus/ioctl.c | 39 +++++++-----
lxc-dave/fs/jfs/ioctl.c | 33 ++++++----
lxc-dave/fs/ocfs2/ioctl.c | 11 +--
lxc-dave/fs/reiserfs/ioctl.c | 55 +++++++++++------
lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 +++-
lxc-dave/fs/xfs/linux-2.6/xfs_iops.c | 7 --
lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c | 9 ++
11 files changed, 273 insertions(+), 157 deletions(-)

diff -puN fs/ext2/ioctl.c~ioctl-mnt-takers fs/ext2/ioctl.c
--- lxc/fs/ext2/ioctl.c~ioctl-mnt-takers 2007-09-17 09:43:58.000000000 -0700
+++ lxc-dave/fs/ext2/ioctl.c 2007-09-17 09:43:58.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include <linux/smp_lock.h>
#include <asm/current.h>
#include <asm/uaccess.h>
@@ -23,6 +24,7 @@ int ext2_ioctl (struct inode * inode, st
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;
unsigned short rsv_window_size;
+ int ret;

ext2_debug ("cmd = %u, arg = %lu\n", cmd, arg);

@@ -34,14 +36,19 @@ int ext2_ioctl (struct inode * inode, st
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;

- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ ret = mnt_want_write(filp->f_vfsmnt);
+ if (ret)
+ ...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:31 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/utimes.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff -puN fs/utimes.c~elevate-write-count-for-do-utimes fs/utimes.c
--- lxc/fs/utimes.c~elevate-write-count-for-do-utimes 2007-09-17 09:44:06.000000000 -0700
+++ lxc-dave/fs/utimes.c 2007-09-17 09:44:06.000000000 -0700
@@ -2,6 +2,7 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/linkage.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/stat.h>
@@ -88,8 +89,8 @@ long do_utimes(int dfd, char __user *fil

inode = dentry->d_inode;

- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;

/* Don't worry, the checks are done in inode_change_ok() */
@@ -97,7 +98,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

if (times[0].tv_nsec == UTIME_OMIT)
newattrs.ia_valid &= ~ATTR_ATIME;
@@ -117,22 +118,24 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

if (!is_owner_or_cap(inode)) {
if (f) {
if (!(f->f_mode & FMODE_WRITE))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
} else {
error = vfs_permission(&nd, MAY_WRITE);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
}
}
mutex_lock(&inode->i_mutex);
error = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
if (f)
fput(f);
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:36 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

If we depend on the inodes for writeability, we will not
catch the r/o mounts when implemented.

This patches uses __mnt_want_write(). It does not guarantee
that the mount will stay writeable after the check. But,
this is OK for one of the checks because it is just for a
printk().

The other two are probably unnecessary and duplicate existing
checks in the VFS. This won't make them better checks than
before, but it will make them detect r/o mounts.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/nfs/dir.c | 3 ++-
lxc-dave/fs/nfsd/vfs.c | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)

diff -puN fs/nfs/dir.c~nfs-check-mnt-instead-of-sb fs/nfs/dir.c
--- lxc/fs/nfs/dir.c~nfs-check-mnt-instead-of-sb 2007-09-17 09:44:05.000000000 -0700
+++ lxc-dave/fs/nfs/dir.c 2007-09-17 09:44:05.000000000 -0700
@@ -997,7 +997,8 @@ static int is_atomic_open(struct inode *
if (nd->flags & LOOKUP_DIRECTORY)
return 0;
/* Are we trying to write to a read only partition? */
- if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
+ if (__mnt_is_readonly(nd->mnt) &&
+ (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
return 0;
return 1;
}
diff -puN fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb 2007-09-17 09:44:05.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-09-17 09:44:05.000000000 -0700
@@ -1853,7 +1853,7 @@ nfsd_permission(struct svc_rqst *rqstp,
inode->i_mode,
IS_IMMUTABLE(inode)? " immut" : "",
IS_APPEND(inode)? " append" : "",
- IS_RDONLY(inode)? " ro" : "");
+ __mnt_is_readonly(exp->ex_mnt)? " ro" : "");
dprintk(" owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
#endif
@@ -1864,7 +1864,7 @@ nfsd_permission(struct svc_rqst *rqstp,
*/
if (!(acc & MAY_LOCAL_ACCESS))
if (acc & (MAY_WRITE | MAY_SATTR ...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:36 pm

Ok from me, but can please get a nfs person to look over this
aswell?

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/open.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)

diff -puN fs/open.c~elevate-writer-count-for-do-sys-truncate fs/open.c
--- lxc/fs/open.c~elevate-writer-count-for-do-sys-truncate 2007-09-17 09:44:06.000000000 -0700
+++ lxc-dave/fs/open.c 2007-09-17 09:44:06.000000000 -0700
@@ -244,21 +244,21 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode->i_mode))
goto dput_and_out;

- error = vfs_permission(&nd, MAY_WRITE);
+ error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;

- error = -EROFS;
- if (IS_RDONLY(inode))
- goto dput_and_out;
+ error = vfs_permission(&nd, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;

error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

error = get_write_access(inode);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;

/*
* Make sure that there are no leases. get_write_access() protects
@@ -276,6 +276,8 @@ static long do_sys_truncate(const char _

put_write_and_out:
put_write_access(inode);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:36 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

This also uses the little helper in the NFS code to
make an if() a little bit less ugly. We introduced
the helper at the beginning of the series.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 4 ++++
lxc-dave/fs/nfsd/vfs.c | 15 +++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)

diff -puN fs/namei.c~elevate-write-count-over-calls-to-vfs-rename fs/namei.c
--- lxc/fs/namei.c~elevate-write-count-over-calls-to-vfs-rename 2007-09-17 09:44:04.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:44:04.000000000 -0700
@@ -2655,8 +2655,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;

+ error = mnt_want_write(oldnd.mnt);
+ if (error)
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
+ mnt_drop_write(oldnd.mnt);
exit5:
dput(new_dentry);
exit4:
diff -puN fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename 2007-09-17 09:44:04.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-09-17 09:44:04.000000000 -0700
@@ -1659,13 +1659,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;

-#ifdef MSNFS
- if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+ if (svc_msnfs(ffhp) &&
((atomic_read(&odentry->d_count) > 1)
|| (atomic_read(&ndentry->d_count) > 1))) {
host_err = -EPERM;
- } else
-#endif
+ goto out_dput_new;
+ }
+
+ host_err = -EXDEV;
+ if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+ goto out_dput_new;
+ host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+ if (host_err)
+ goto out_dput_new;
+
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
host_err = nfsd_sync_dir(tdentry);
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:35 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

This basically audits the callers of xattr_permission(), which
calls permission() and can perform writes to the filesystem.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/nfsd/nfs4proc.c | 7 ++++++-
lxc-dave/fs/xattr.c | 16 ++++++++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)

diff -puN fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c
--- lxc/fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes 2007-09-17 09:44:03.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4proc.c 2007-09-17 09:44:03.000000000 -0700
@@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+ status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt);
+ if (status)
+ return status;
status = nfs_ok;
if (setattr->sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh,
setattr->sa_acl);
if (status)
- return status;
+ goto out;
status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
0, (time_t)0);
+out:
+ mnt_drop_write(cstate->current_fh.fh_export->ex_mnt);
return status;
}

diff -puN fs/xattr.c~elevate-mount-count-for-extended-attributes fs/xattr.c
--- lxc/fs/xattr.c~elevate-mount-count-for-extended-attributes 2007-09-17 09:44:03.000000000 -0700
+++ lxc-dave/fs/xattr.c 2007-09-17 09:44:03.000000000 -0700
@@ -11,6 +11,7 @@
#include <linux/slab.h>
#include <linux/file.h>
#include <linux/xattr.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co
* filesystem or on an immutable / append-only inode.
*/
if (mask & MAY_WRITE) {
- if (IS_RDONLY(inode))
- return -EROFS;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
return -EPERM;
}
@@ -235,7 +234,11 @@ sys_setxattr(char __user *path, char ...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:34 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Some ioctls need write access, but others don't. Make a helper
function to decide when write access is needed, and take it.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/ncpfs/ioctl.c | 55 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 54 insertions(+), 1 deletion(-)

diff -puN fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl fs/ncpfs/ioctl.c
--- lxc/fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl 2007-09-17 09:44:01.000000000 -0700
+++ lxc-dave/fs/ncpfs/ioctl.c 2007-09-17 09:44:01.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/ioctl.h>
#include <linux/time.h>
#include <linux/mm.h>
+#include <linux/mount.h>
#include <linux/highuid.h>
#include <linux/smp_lock.h>
#include <linux/vmalloc.h>
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
}
#endif /* CONFIG_NCPFS_NLS */

-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
struct ncp_server *server = NCP_SERVER(inode);
@@ -822,6 +823,58 @@ outrel:
return -EINVAL;
}

+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+ switch (cmd) {
+ case NCP_IOC_GET_FS_INFO:
+ case NCP_IOC_GET_FS_INFO_V2:
+ case NCP_IOC_NCPREQUEST:
+ case NCP_IOC_SETDENTRYTTL:
+ case NCP_IOC_SIGN_INIT:
+ case NCP_IOC_LOCKUNLOCK:
+ case NCP_IOC_SET_SIGN_WANTED:
+ return 1;
+ case NCP_IOC_GETOBJECTNAME:
+ case NCP_IOC_SETOBJECTNAME:
+ case NCP_IOC_GETPRIVATEDATA:
+ case NCP_IOC_SETPRIVATEDATA:
+ case NCP_IOC_SETCHARSETS:
+ case NCP_IOC_GETCHARSETS:
+ case NCP_IOC_CONN_LOGGED_IN:
+ case NCP_IOC_GETDENTRYTTL:
+ case NCP_IOC_GETMOUNTUID2:
+ case NCP_IOC_SIGN_WANTED:
+ case NCP_IOC_GETROOT:
+ case NCP_IOC_SETROOT:
+ return 0;
+ default:
+ /* unkown IOCTL command, assume write */
+ WARN_ON(1);
+ }
+ return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ ...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:33 pm

The WARN_ON is too agressive because it'll happen whenever someone tries
a random, non-existant ioctl on ncpfs. Otherwise the patch looks good.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Christoph H. says this stands on its own and can go in before the
rest of the r/o bind mount set.

---

Some filesystems forego the vfs and may_open() and create their
own 'struct file's.

This patch creates a couple of helper functions which can be
used by these filesystems, and will provide a unified place
which the r/o bind mount code may patch.

Also, rename an existing, static-scope init_file() to a less
generic name.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/configfs/dir.c | 5 +++--
lxc-dave/fs/file_table.c | 34 ++++++++++++++++++++++++++++++++++
lxc-dave/fs/hugetlbfs/inode.c | 22 +++++++++-------------
lxc-dave/include/linux/file.h | 9 +++++++++
lxc-dave/ipc/shm.c | 13 +++++--------
lxc-dave/mm/shmem.c | 7 ++-----
lxc-dave/mm/tiny-shmem.c | 19 +++++++------------
lxc-dave/net/socket.c | 18 +++++++++---------
8 files changed, 78 insertions(+), 49 deletions(-)

diff -puN fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s fs/configfs/dir.c
--- lxc/fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s 2007-09-17 09:43:55.000000000 -0700
+++ lxc-dave/fs/configfs/dir.c 2007-09-17 09:43:55.000000000 -0700
@@ -142,7 +142,7 @@ static int init_dir(struct inode * inode
return 0;
}

-static int init_file(struct inode * inode)
+static int configfs_init_file(struct inode * inode)
{
inode->i_size = PAGE_SIZE;
inode->i_fop = &configfs_file_operations;
@@ -283,7 +283,8 @@ static int configfs_attach_attr(struct c

dentry->d_fsdata = configfs_get(sd);
sd->s_dentry = dentry;
- error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file);
+ error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG,
+ configfs_init_file);
if (error) {
configfs_put(sd);
return error;
diff -puN fs/file_table.c~filesystem-helpers-for-custom-struct-file-s fs/file_table.c
--- lxc/fs/file_tab...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:26 pm

Looks good. But please provide a patch ontop of your patchkit to
add kernel-doc comments for the two new exported functions.

Are there any direct caller of get_empty_filp left after this patch?
We really should get rid of that export to stop people from shooting
themselves in the foot.

-

To: Christoph Hellwig <hch@...>
Cc: <akpm@...>, <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 1:47 pm

Appended. I started to write the comments to describe the function

Yeah, there are a few left. But, they're in the middle of what is
sometimes tricky error handling, so I think we should trickle those in
later.

-- Dave

lxc-dave/fs/file_table.c | 22 +++++++++++++++++++++-
1 file changed, 21 insertions(+), 1 deletion(-)

diff -puN fs/file_table.c~document fs/file_table.c
--- lxc/fs/file_table.c~document 2007-09-19 10:31:13.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-09-19 10:39:34.000000000 -0700
@@ -137,6 +137,18 @@ fail:

EXPORT_SYMBOL(get_empty_filp);

+/**
+ * alloc_file - allocate and initialize a 'struct file' with
+ * the given arguments
+ *
+ * Use this instead of get_empty_filp() to get a new
+ * 'struct file'. Do so because of the same initialization
+ * pitfalls reasons listed for init_file(). This is a
+ * preferred interface to using init_file().
+ *
+ * If all the callers of init_file() are eliminated, its
+ * code should be moved into this function.
+ */
struct file *alloc_file(struct vfsmount *mnt, struct dentry *dentry,
mode_t mode, const struct file_operations *fop)
{
@@ -152,7 +164,15 @@ struct file *alloc_file(struct vfsmount
}
EXPORT_SYMBOL(alloc_file);

-/*
+/**
+ * init_file - initialize a 'struct file'
+ * @file: the already allocated 'struct file' to initialized
+ * @mnt: the vfsmount on which the file resides
+ *
+ * Use this instead of setting the members directly. Doing so
+ * avoids making mistakes like forgetting the mntget() or
+ * forgetting to take a write on the mnt.
+ *
* Note: This is a crappy interface. It is here to make
* merging with the existing users of get_empty_filp()
* who have complex failure logic easier. All users
_

-

To: Dave Hansen <haveblue@...>
Cc: Christoph Hellwig <hch@...>, <akpm@...>, <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 6:07 pm

Hi Dave,

They aren't quite in kernel-doc format. Holler if you need help
with that, or see examples, or

* function_name - short description (on one line)
* @param1: descr1
* @param2: descr2

OK, that one looks better. :)

---
~Randy
*** Remember to use Documentation/SubmitChecklist when testing your code ***
-

To: Randy Dunlap <randy.dunlap@...>
Cc: Christoph Hellwig <hch@...>, <akpm@...>, <linux-kernel@...>
Date: Thursday, September 20, 2007 - 1:25 pm

Should we do comments for every single function argument, or is it OK to
leave them out for the obvious ones?

Seems like a waste of space to me, but I'd be happy to do them if that's
the convention.

-- Dave

-

To: Dave Hansen <haveblue@...>
Cc: Randy Dunlap <randy.dunlap@...>, Christoph Hellwig <hch@...>, <akpm@...>, <linux-kernel@...>
Date: Thursday, September 20, 2007 - 1:30 pm

Yes, we should keep the kerneldoc comments coherent, even if it seems
useless in a few cases.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

This is the first really tricky patch in the series. It
elevates the writer count on a mount each time a
non-special file is opened for write.

This is not completely apparent in the patch because the
two if() conditions in may_open() above the
mnt_want_write() call are, combined, equivalent to
special_file().

There is also an elevated count around the vfs_create()
call in open_namei(). The count needs to be kept elevated
all the way into the may_open() call. Otherwise, when the
write is dropped, a ro->rw transisition could occur. This
would lead to having rw access on the newly created file,
while the vfsmount is ro. That is bad.

Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt writer
count because they will get __fput(), and we need to make
sure they're balanced.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/file_table.c | 9 ++++++++-
lxc-dave/fs/namei.c | 20 ++++++++++++++++----
lxc-dave/ipc/mqueue.c | 3 +++
3 files changed, 27 insertions(+), 5 deletions(-)

diff -puN fs/file_table.c~tricky-elevate-write-count-files-are-open-ed fs/file_table.c
--- lxc/fs/file_table.c~tricky-elevate-write-count-files-are-open-ed 2007-09-17 09:43:57.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-09-17 09:43:57.000000000 -0700
@@ -167,6 +167,10 @@ int init_file(struct file *file, struct
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
+ if (mode & FMODE_WRITE) {
+ error = mnt_want_write(mnt);
+ WARN_ON(error);
+ }
return error;
}
EXPORT_SYMBOL(init_file);
@@ -204,8 +208,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- if (file->f_mode & FMODE_WRITE)
+ if (file->f_mode & FMODE_WRITE) {
put_write_access(inode);
+ if (!special_file(ino...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:30 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

chown/chmod,etc... don't call permission in the same way
that the normal "open for write" calls do. They still
write to the filesystem, so bump the write count during
these operations.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/open.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)

diff -puN fs/open.c~elevate-writer-count-for-chown-and-friends fs/open.c
--- lxc/fs/open.c~elevate-writer-count-for-chown-and-friends 2007-09-17 09:43:59.000000000 -0700
+++ lxc-dave/fs/open.c 2007-09-17 09:43:59.000000000 -0700
@@ -571,12 +571,12 @@ asmlinkage long sys_fchmod(unsigned int

audit_inode(NULL, dentry);

- err = -EROFS;
- if (IS_RDONLY(inode))
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
goto out_putf;
err = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto out_putf;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -585,6 +585,8 @@ asmlinkage long sys_fchmod(unsigned int
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);

+out_drop_write:
+ mnt_drop_write(file->f_vfsmnt);
out_putf:
fput(file);
out:
@@ -604,13 +606,13 @@ asmlinkage long sys_fchmodat(int dfd, co
goto out;
inode = nd.dentry->d_inode;

- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;

error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto out_drop_write;

mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
@@ -620,6 +622,8 @@ asmlinkage long sys_fchmodat(int dfd, co
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);

+out_drop_write:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
@@ -642,9 +646,6 @@ static int chown_common(struct dentry *
printk(KERN_ERR "chown_common: NULL i...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:31 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/inode.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff -puN fs/inode.c~elevate-write-count-for-file_update_time fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-file_update_time 2007-09-17 09:44:03.000000000 -0700
+++ lxc-dave/fs/inode.c 2007-09-17 09:44:03.000000000 -0700
@@ -1241,10 +1241,19 @@ void file_update_time(struct file *file)
struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
+ int err = 0;

if (IS_NOCMTIME(inode))
return;
- if (IS_RDONLY(inode))
+ /*
+ * Ideally, we want to guarantee that 'f_vfsmnt'
+ * is non-NULL here. But, NFS exports need to
+ * be fixed up before we can do that. So, check
+ * it for now. - Dave Hansen
+ */
+ if (file->f_vfsmnt)
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
return;

now = current_fs_time(inode->i_sb);
@@ -1260,6 +1269,8 @@ void file_update_time(struct file *file)

if (sync_it)
mark_inode_dirty_sync(inode);
+ if (file->f_vfsmnt)
+ mnt_drop_write(file->f_vfsmnt);
}

EXPORT_SYMBOL(file_update_time);
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:35 pm

FYI: I have a patch to fix up nfsd, but there are more suckers around,
e.g. reiserfs xattr crapola.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

It is OK to let access() go without using a mnt_want/drop_write()
pair because it doesn't actually do writes to the filesystem,
and it is inherently racy anyway. This is a rare case when it is
OK to use __mnt_is_readonly() directly.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/open.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)

diff -puN fs/open.c~make-access-use-helper fs/open.c
--- lxc/fs/open.c~make-access-use-helper 2007-09-17 09:44:00.000000000 -0700
+++ lxc-dave/fs/open.c 2007-09-17 09:44:00.000000000 -0700
@@ -457,8 +457,17 @@ asmlinkage long sys_faccessat(int dfd, c
if(res || !(mode & S_IWOTH) ||
special_file(nd.dentry->d_inode->i_mode))
goto out_path_release;
-
- if(IS_RDONLY(nd.dentry->d_inode))
+ /*
+ * This is a rare case where using __mnt_is_readonly()
+ * is OK without a mnt_want/drop_write() pair. Since
+ * no actual write to the fs is performed here, we do
+ * not need to telegraph to that to anyone.
+ *
+ * By doing this, we accept that this access is
+ * inherently racy and know that the fs may change
+ * state before we even see this result.
+ */
+ if (__mnt_is_readonly(nd.mnt))
res = -EROFS;

out_path_release:
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:32 pm

Ok.
-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

I'm going to be modifying nfsd_rename() shortly to support
read-only bind mounts. This #ifdef is around the area I'm
patching, and it starts to get really ugly if I just try
to add my new code by itself. Using this little helper
makes things a lot cleaner to use.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/nfsd/vfs.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff -puN fs/nfsd/vfs.c~create-svc_msnfs-helper fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~create-svc_msnfs-helper 2007-09-17 09:43:56.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-09-17 09:43:56.000000000 -0700
@@ -865,6 +865,15 @@ static int nfsd_direct_splice_actor(stru
return __splice_from_pipe(pipe, sd, nfsd_splice_actor);
}

+static inline int svc_msnfs(struct svc_fh *ffhp)
+{
+#ifdef MSNFS
+ return (ffhp->fh_export->ex_flags & NFSEXP_MSNFS);
+#else
+ return 0;
+#endif
+}
+
static __be32
nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
loff_t offset, struct kvec *vec, int vlen, unsigned long *count)
@@ -877,11 +886,9 @@ nfsd_vfs_read(struct svc_rqst *rqstp, st

err = nfserr_perm;
inode = file->f_path.dentry->d_inode;
-#ifdef MSNFS
- if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
- (!lock_may_read(inode, offset, *count)))
+
+ if (svc_msnfs(fhp) && !lock_may_read(inode, offset, *count))
goto out;
-#endif

/* Get readahead parameters */
ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
_
-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:39 pm

The name looks kinda confusing. msnfs? almost makes you think
what's got microsoft to do with it..

-

To: Jan Engelhardt <jengelh@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:45 pm

I'd give it a better name if I had any idea what it means. :)

The macros and names are all preexisting. I just took the liberty of
pulling the #ifdef out into a function because the code got very
unreadable with the other modifications that I need.

If it were some kind of larger API, we could come up with something
better. But, for a single isolated user, I think it is OK.

-- Dave

-

To: Dave Hansen <haveblue@...>
Cc: Jan Engelhardt <jengelh@...>, <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:54 pm

The msnfs option means enable the use of MS Windows share locks, so yes,
the name is appropriate.

Trond

-

To: Trond Myklebust <trond.myklebust@...>
Cc: Dave Hansen <haveblue@...>, Jan Engelhardt <jengelh@...>, <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:59 pm

Yepp. While we're at it, can we please kill the silly ifdefs for it
(once Dave's patches are merged to avoid conflicts). The code has always
been enabled since it was introduced, it's an export option anyway and
to make things worse the ifdef needs to be disabled in multiple files
---end quoted text---
-

To: Christoph Hellwig <hch@...>, Dr. J. Bruce Fields <bfields@...>, Neil Brown <neilb@...>
Cc: Dave Hansen <haveblue@...>, Jan Engelhardt <jengelh@...>, <akpm@...>, <linux-kernel@...>
Date: Wednesday, September 19, 2007 - 2:10 pm

I'm sure that either Neil or Bruce would be happy to take such a patch.

Cheers
Trond

-

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:28 pm

Ok.

-

To: <akpm@...>
Cc: <linux-kernel@...>, <hch@...>, Dave Hansen <haveblue@...>
Date: Monday, September 17, 2007 - 2:27 pm

may_open() calls vfs_permission() before it does checks for
IS_RDONLY(inode). It checks _again_ inside of vfs_permission().

The check inside of vfs_permission() is going away eventually.
With the mnt_want/drop_write() functions, all of the r/o
checks (except for this one) are consistently done before
calling permission(). Because of this, I'd like to use
permission() to hold a debugging check to make sure that
the mnt_want/drop_write() calls are actually being made.

So, to do this:
1. remove the IS_RDONLY() check from permission()
2. enforce that you must mnt_want_write() before
even calling permission()
3. enable a debugging in permission()

We need to rearrange may_open(). Here's the patch.

Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---

lxc-dave/fs/namei.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)

diff -puN fs/namei.c~rearrange-permission-and-ro-checks-in-may_open fs/namei.c
--- lxc/fs/namei.c~rearrange-permission-and-ro-checks-in-may_open 2007-09-17 09:43:56.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-17 09:43:56.000000000 -0700
@@ -228,6 +228,10 @@ int generic_permission(struct inode *ino
int permission(struct inode *inode, int mask, struct nameidata *nd)
{
int retval, submask;
+ struct vfsmount *mnt = NULL;
+
+ if (nd)
+ mnt = nd->mnt;

if (mask & MAY_WRITE) {
umode_t mode = inode->i_mode;
@@ -251,7 +255,7 @@ int permission(struct inode *inode, int
* MAY_EXEC on regular files is denied if the fs is mounted
* with the "noexec" flag.
*/
- if (nd && nd->mnt && (nd->mnt->mnt_flags & MNT_NOEXEC))
+ if (mnt && (mnt->mnt_flags & MNT_NOEXEC))
return -EACCES;
}

@@ -1604,10 +1608,6 @@ int may_open(struct nameidata *nd, int a
if (S_ISDIR(inode->i_mode) && (flag & FMODE_WRITE))
return -EISDIR;

- error = vfs_permission(nd, acc_mode);
- if (error)
- return error;
-
/*
* FIFO's, sockets and devic...

To: Dave Hansen <haveblue@...>
Cc: <akpm@...>, <linux-kernel@...>, <hch@...>
Date: Wednesday, September 19, 2007 - 1:27 pm

Why is this entirely unrelated cleanup in this patch?

Except for that it looks fine.
-

To: Christoph Hellwig <hch@...>
Cc: <akpm@...>, <linux-kernel@...>
Date: Thursday, September 20, 2007 - 2:47 pm

You're right. In this context, it looks completely unrelated. I'll
separate it into another patch.

-- Dave

-

Previous thread: [PATCH 2/2] unify DMA_..BIT_MASK definitions: cleanup drivers/scsi/gdth.c by Borislav Petkov on Monday, September 17, 2007 - 1:51 pm. (1 message)

Next thread: [PATCH 0/1] ppc64: Convert cpu_sibling_map to a per_cpu data array ppc64 v2 by travis on Monday, September 17, 2007 - 2:35 pm. (1 message)