This is against 2.6.23-rc6. Compared to 2.6.23-rc6-mm1, there
are a few (trivial) merge conflicts with the audit git tree,
and one slightly more complicated one with ext2-reservations.patch.Changes from last post:
- added several kerneldoc comments for exported functions
- broke out the 'mnt' variable patch
- removed WARN_ON() in ioctl patchThe first four patches here are very simple cleanups, and can go
in ahead of the rest of the patches.If you are reviewing this, the things that need the most review
are the last two patches that actually add the logic for
tracking mount write counts.---
Why do we need r/o bind mounts?
This feature allows a read-only view into a read-write filesystem.
In the process of doing that, it also provides infrastructure for
keeping track of the number of writers to any given mount.This has a number of uses. It allows chroots to have parts of
filesystems writable. It will be useful for containers in the future
because users may have root inside a container, but should not
be allowed to write to somefilesystems. This also replaces
patches that vserver has had out of the tree for several years.It allows security enhancement by making sure that parts of
your filesystem read-only (such as when you don't trust your
FTP server), when you don't want to have entire new filesystems
mounted, or when you want atime selectively updated.
I've been using the following script to test that the feature is
working as desired. It takes a directory and makes a regular
bind and a r/o bind mount of it. It then performs some normal
filesystem operations on the three directories, including ones
that are expected to fail, like creating a file on the r/o
mount.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
-
If we can't pull just this patch in for now, it would be
great to get everything leading up to here pulled in. I've
re-implemented this several ways, and it has never caused
the preceeding patches to change at all.--
This is the real meat of the entire series. It actually
implements the tracking of the number of writers to a mount.
However, it causes scalability problems because there can
be hundreds of cpus doing open()/close() on files on the
same mnt at the same time. Even an atomic_t in the mnt
has massive scalaing problems because the cacheline gets
so terribly contended.This uses a statically-allocated percpu variable. All
operations are local to a cpu as long that cpu operates on
the same mount, and there are no writer count imbalances.
Writer count imbalances happen when a write is taken on one
cpu, and released on another, like when an open/close pair
is performed on two different cpus because the task moved.Upon a remount,ro request, all of the data from the percpu
variables is collected (expensive, but very rare) and we
determine if there are any outstanding writers to the mount.I've written a little benchmark to sit in a loop for a couple
of seconds in several cpus in parallel doing open/write/close
loops.http://sr71.net/~dave/linux/openbench.c
The code in here is a a worst-possible case for this patch.
It does opens on a _pair_ of files in two different mounts
in parallel. This should cause my code to lose its "operate
on the same mount" optimization completely. This worst-case
scenario causes a 3% degredation in the benchmark.I could probably get rid of even this 3%, but it would be
more complex than what I have here, and I think this is
getting into acceptable territory. In practice, I expect
writing more than 3 bytes to a file, as well as disk I/O
to mask any effects that this has.(To get rid of that 3%, we could have an #defined number
of mounts in the percpu variable. So, instead of a CPU
getting operate only on percpu dat...
As we already say in various messages the percpu counters in here
look rather fishy. I'd recomment to take a look at the per-cpu
superblock counters in XFS as they've been debugged quite well
now and could probably be lifted into a generic library for this
kind of think. The code is mostly in fs/xfs/xfs_mount.c can
can be spotted by beeing under #ifdef HAVE_PERCPU_SB.It also handles cases like hotplug cpu nicely that this code
seems to work around by always iterating over all possible cpus
which might not be nice on a dual core laptop with a distro kernel
that also has to support big iron.-
I'll take a look at xfs to see what I can get out of it.
There are basically two times when you have to do this
for_each_possible_cpu() stuff:
1. when doing a r/w->r/o transition, which is rare, and
certainly not a fast path
2. Where the per-cpu writer count underflows. This requires
a _minimum_ of 1<<16 file opens (configurable) each of which
is closed on a different cpu than it was opened on. Even
if you were trying, I'm not sure you'd notice the overhead.-- Dave
-
On Mon, 24 Sep 2007 12:28:11 -0700
Sounds like what you're doing is more akin to the local_t-based module
refcounting. `grep local_ kernel/module.c'.That code should be converted from NR_CPUS to for_each_possible_cpu()..
-
The basic incompatibility with what that provides and what I need is
that percpu_counters allow some fuzziness in the numbers. One cpu can
be summing the numbers up while another is still adding to the local
percpu counters. That's fine for statistics, but horrible for questions
like, "can anybody write to and corrupt my FS right now?"It could probably be modified to do what I want, but it would still have
a the "invented your own lock" problem, and would likely impact the
scalability of the existing "fuzzy" users.We could introduce fuzzy and coherent variants of the function calls,
but that would probably introduce more code than what I have now for theI think that can get converted to use the percpu_counters pretty easily.
I've coded that up, and sent it [RFC] to lkml. Rusty will forward on
into mainline.-- Dave
-
On Mon, 24 Sep 2007 18:54:11 +0100
hm. How come xfs invented a new version of percpu_counters?
-
This code actually predates the generic percpu_counters even if it
was merged to mainline later. Neither Dave who wrote it nor me
who reviewed it before it was merged thought of percpu_counters
probably. Then again this code is considerably more complex due
to features actually needed in a very hot fastpath.
-
Did you test with lockdep enabled?
=============================================
[ INFO: possible recursive locking detected ]
2.6.23-rc7-mm1 #1
---------------------------------------------
swapper/1 is trying to acquire lock:
(&writer->lock){--..}, at: [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70but task is already holding lock:
(&writer->lock){--..}, at: [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70other info that might help us debug this:
1 lock held by swapper/1:
#0: (&writer->lock){--..}, at: [<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70stack backtrace:
[<c0103ffa>] show_trace_log_lvl+0x1a/0x30
[<c0104b82>] show_trace+0x12/0x20
[<c0104c96>] dump_stack+0x16/0x20
[<c0144dc5>] __lock_acquire+0xde5/0x10a0
[<c01450fa>] lock_acquire+0x7a/0xa0
[<c03e734c>] _spin_lock+0x2c/0x40
[<c0197a32>] lock_and_coalesce_cpu_mnt_writer_counts+0x32/0x70
[<c01982c6>] mntput_no_expire+0x36/0xc0
[<c0188f15>] path_release_on_umount+0x15/0x20
[<c0198930>] sys_umount+0x40/0x230
[<c010070b>] name_to_dev_t+0x9b/0x270
[<c05230c2>] prepare_namespace+0x62/0x1b0
[<c05226ca>] kernel_init+0x21a/0x320
[<c0103b47>] kernel_thread_helper+0x7/0x10
=======================It look like a false positive to me, but really, for a patchset of this
complexity and maturity I cannot fathom how it could have escaped any
lockdep testing.-
I test with lockdep all the time. The problem was that lockdep doesn't
complain until you have 8 nested locks, and I only tested on a 4-cpu
system.I lowered the lockdep nesting limit to 3, and got the warning on my
machine.-- Dave
-
On Mon, 24 Sep 2007 15:06:42 -0700
hm. I saw that warning on my 2-way. It has CONFIG_NR_CPUS=8 so perhaps
the kernel has decided that this machine can possibly have eight CPUs.It's an old super-micro board, doesn't have ACPI.
OEM ID: INTEL Product ID: 440BX APIC at: 0xFEE00000
Processor #0 6:8 APIC version 17
Processor #1 6:8 APIC version 17
I/O APIC #2 Version 17 at 0xFEC00000.
Enabling APIC mode: Flat. Using 1 I/O APICs
Processors: 2...
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 1707.03 BogoMIPS (lpj=3414067)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383fbff 00000000 00000000 00000000 00000000 00000000 00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After all inits, caps: 0383fbff 00000000 00000000 00000040 00000000 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
Compat vDSO mapped to ffffe000.
Checking 'hlt' instruction... OK.
lockdep: not fixing up alternatives.
CPU0: Intel Pentium III (Coppermine) stepping 03
lockdep: not fixing up alternatives.
Booting processor 1/1 eip 2000
Initializing CPU#1
Calibrating delay using timer specific routine.. 1704.02 BogoMIPS (lpj=3408054)
CPU: After generic identify, caps: 0383fbff 00000000 00000000 00000000 00000000 00000000 00000000 00000000
CPU: L1 I cache: 16K, L1 D cache: 16K
CPU: L2 cache: 256K
CPU: After all inits, caps: 0383fbff 00000000 00000000 00000040 00000000 00000000 00000000 00000000
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#1.
CPU1: Intel Pentium III (Coppermine) stepping 03
Total of 2 processors activated (3411.06 BogoMIPS).
ExtINT not setup in hardware but reported by MP table
ENABLING IO-APIC IRQs
..TIMER: vector=0x31 apic1=0 pin1=2 apic2=0 pin2=0
APIC timer registered as dummy, due to nmi_watchdog=1!
checking TSC synchronization [...
Well, it's looking like we only set cpu_possible_map from data we find
in the MP table, which makes sense. The only question is how your
system gets more than ~8 possible cpus. Do you have the .config handy?-- Dave
-
On Mon, 24 Sep 2007 16:05:37 -0700
I've reproduced this. I'll do more runtime testing with all of the
debugging, CONFIG_CPU_HOTPLUG=y, and NR_CPUS to a high value. It should
help catch things like this in the future.-- Dave
-
the code tries to implement per cpu spinlocks, or rather it tries
to bring back the brlocks from way past.... cute.we can educate lockdep about this quite easily; but isn't there some
primitive already in existence that we can use?-
Originally from: Herbert Poetzl <herbert@13thfloor.at>
This is the core of the read-only bind mount patch set.
Note that this does _not_ add a "ro" option directly to
the bind mount operation. If you require such a mount,
you must first do the bind, then follow it up with a
'mount -o remount,ro' operation.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namespace.c | 46 ++++++++++++++++++++++++++++++++++-------
lxc-dave/include/linux/mount.h | 1
2 files changed, 40 insertions(+), 7 deletions(-)diff -puN fs/namespace.c~honor-r-w-changes-at-do-remount-time fs/namespace.c
--- lxc/fs/namespace.c~honor-r-w-changes-at-do-remount-time 2007-09-20 12:16:23.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-09-20 12:16:23.000000000 -0700
@@ -102,7 +102,11 @@ struct vfsmount *alloc_vfsmnt(const char
*/
int __mnt_is_readonly(struct vfsmount *mnt)
{
- return (mnt->mnt_sb->s_flags & MS_RDONLY);
+ if (mnt->mnt_flags & MNT_READONLY)
+ return 1;
+ if (mnt->mnt_sb->s_flags & MS_RDONLY)
+ return 1;
+ return 0;
}
EXPORT_SYMBOL_GPL(__mnt_is_readonly);@@ -275,7 +279,7 @@ void mnt_drop_write(struct vfsmount *mnt
}
EXPORT_SYMBOL_GPL(mnt_drop_write);-int mnt_make_readonly(struct vfsmount *mnt)
+static int mnt_make_readonly(struct vfsmount *mnt)
{
int ret = 0;@@ -288,15 +292,21 @@ int mnt_make_readonly(struct vfsmount *m
goto out;
}
/*
- * actually set mount's r/o flag here to make
- * __mnt_is_readonly() true, which keeps anyone
- * from doing a successful mnt_want_write().
+ * nobody can do a successful mnt_want_write() with all
+ * of the counts in MNT_DENIED_WRITE and the locks held.
*/
+ if (!ret)
+ mnt->mnt_flags |= MNT_READONLY;
out:
mnt_unlock_cpus();
return ret;
}+static void __mnt_unmake_readonly(struct vfsmount *mnt)
+{
+ mnt->mnt_flags &= ~MNT_READONLY;
+}
+
int simple_set_mnt(struct vfsmount *mnt, struct super_block *sb)
{
...
Elevate the write count during the vfs_rmdir() call.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/namei.c | 5 +++++
1 file changed, 5 insertions(+)diff -puN fs/namei.c~do-rmdir-elevate-write-count fs/namei.c
--- lxc/fs/namei.c~do-rmdir-elevate-write-count 2007-09-20 12:16:22.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:22.000000000 -0700
@@ -2148,7 +2148,12 @@ static long do_rmdir(int dfd, const char
error = PTR_ERR(dentry);
if (IS_ERR(dentry))
goto exit2;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit3;
error = vfs_rmdir(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
+exit3:
dput(dentry);
exit2:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
-
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namei.c | 4 ++++
lxc-dave/ipc/mqueue.c | 5 ++++-
2 files changed, 8 insertions(+), 1 deletion(-)diff -puN fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers fs/namei.c
--- lxc/fs/namei.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-09-20 12:16:21.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:21.000000000 -0700
@@ -2228,7 +2228,11 @@ static long do_unlinkat(int dfd, const c
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto exit2;
error = vfs_unlink(nd.dentry->d_inode, dentry);
+ mnt_drop_write(nd.mnt);
exit2:
dput(dentry);
}
diff -puN ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers ipc/mqueue.c
--- lxc/ipc/mqueue.c~elevate-mnt-writers-for-vfs-unlink-callers 2007-09-20 12:16:21.000000000 -0700
+++ lxc-dave/ipc/mqueue.c 2007-09-20 12:16:21.000000000 -0700
@@ -750,8 +750,11 @@ asmlinkage long sys_mq_unlink(const char
inode = dentry->d_inode;
if (inode)
atomic_inc(&inode->i_count);
-
+ err = mnt_want_write(mqueue_mnt);
+ if (err)
+ goto out_err;
err = vfs_unlink(dentry->d_parent->d_inode, dentry);
+ mnt_drop_write(mqueue_mnt);
out_err:
dput(dentry);_
-
This takes care of all of the direct callers of vfs_mknod().
Since a few of these cases also handle normal file creation
as well, this also covers some calls to vfs_create().So that we don't have to make three mnt_want/drop_write()
calls inside of the switch statement, we move some of its
logic outside of the switch.Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namei.c | 32 +++++++++++++++++++++-----------
lxc-dave/fs/nfsd/vfs.c | 4 ++++
lxc-dave/net/unix/af_unix.c | 4 ++++
3 files changed, 29 insertions(+), 11 deletions(-)diff -puN fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create fs/namei.c
--- lxc/fs/namei.c~sys-mknodat-elevate-write-count-for-vfs-mknod-create 2007-09-20 12:16:21.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:21.000000000 -0700
@@ -1945,12 +1945,25 @@ asmlinkage long sys_mknodat(int dfd, con
if (error)
goto out;
dentry = lookup_create(&nd, 0);
- error = PTR_ERR(dentry);
-
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ goto out_unlock;
+ }
if (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
- if (!IS_ERR(dentry)) {
- switch (mode & S_IFMT) {
+ if (S_ISDIR(mode)) {
+ error = -EPERM;
+ goto out_dput;
+ }
+ if (!S_ISREG(mode) && !S_ISCHR(mode) && !S_ISBLK(mode) &&
+ !S_ISFIFO(mode) && !S_ISSOCK(mode) && mode != 0) {
+ error = -EINVAL;
+ goto out_dput;
+ }
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
+ switch (mode & S_IFMT) {
case 0: case S_IFREG:
error = vfs_create(nd.dentry->d_inode,dentry,mode,&nd);
break;
@@ -1961,14 +1974,11 @@ asmlinkage long sys_mknodat(int dfd, con
case S_IFIFO: case S_IFSOCK:
error = vfs_mknod(nd.dentry->d_inode,dentry,mode,0);
break;
- case S_IFDIR:
- error = -EPERM;
- break;
- default:
- error = -EINVAL;
- }
...
This basically audits the callers of xattr_permission(), which
calls permission() and can perform writes to the filesystem.This conflicts with the current (~2.6.23-rc7) audit
git tree in -mm. wiggle'ing the patch merges it.Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/nfsd/nfs4proc.c | 7 ++++++-
lxc-dave/fs/xattr.c | 16 ++++++++++++++--
2 files changed, 20 insertions(+), 3 deletions(-)diff -puN fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes fs/nfsd/nfs4proc.c
--- lxc/fs/nfsd/nfs4proc.c~elevate-mount-count-for-extended-attributes 2007-09-20 12:16:16.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4proc.c 2007-09-20 12:16:16.000000000 -0700
@@ -658,14 +658,19 @@ nfsd4_setattr(struct svc_rqst *rqstp, st
return status;
}
}
+ status = mnt_want_write(cstate->current_fh.fh_export->ex_mnt);
+ if (status)
+ return status;
status = nfs_ok;
if (setattr->sa_acl != NULL)
status = nfsd4_set_nfs4_acl(rqstp, &cstate->current_fh,
setattr->sa_acl);
if (status)
- return status;
+ goto out;
status = nfsd_setattr(rqstp, &cstate->current_fh, &setattr->sa_iattr,
0, (time_t)0);
+out:
+ mnt_drop_write(cstate->current_fh.fh_export->ex_mnt);
return status;
}diff -puN fs/xattr.c~elevate-mount-count-for-extended-attributes fs/xattr.c
--- lxc/fs/xattr.c~elevate-mount-count-for-extended-attributes 2007-09-20 12:16:16.000000000 -0700
+++ lxc-dave/fs/xattr.c 2007-09-20 12:16:16.000000000 -0700
@@ -11,6 +11,7 @@
#include <linux/slab.h>
#include <linux/file.h>
#include <linux/xattr.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/security.h>
#include <linux/syscalls.h>
@@ -32,8 +33,6 @@ xattr_permission(struct inode *inode, co
* filesystem or on an immutable / append-only inode.
*/
if (mask & MAY_WRITE) {
- if (IS_RDONLY...
This is the first really tricky patch in the series. It
elevates the writer count on a mount each time a
non-special file is opened for write.This is not completely apparent in the patch because the
two if() conditions in may_open() above the
mnt_want_write() call are, combined, equivalent to
special_file().There is also an elevated count around the vfs_create()
call in open_namei(). The count needs to be kept elevated
all the way into the may_open() call. Otherwise, when the
write is dropped, a ro->rw transisition could occur. This
would lead to having rw access on the newly created file,
while the vfsmount is ro. That is bad.Some filesystems forego the use of normal vfs calls to create
struct files. Make sure that these users elevate the mnt writer
count because they will get __fput(), and we need to make
sure they're balanced.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/file_table.c | 9 ++++++++-
lxc-dave/fs/namei.c | 20 ++++++++++++++++----
lxc-dave/ipc/mqueue.c | 3 +++
3 files changed, 27 insertions(+), 5 deletions(-)diff -puN fs/file_table.c~tricky-elevate-write-count-files-are-open-ed fs/file_table.c
--- lxc/fs/file_table.c~tricky-elevate-write-count-files-are-open-ed 2007-09-20 12:16:11.000000000 -0700
+++ lxc-dave/fs/file_table.c 2007-09-20 12:16:11.000000000 -0700
@@ -194,6 +194,10 @@ int init_file(struct file *file, struct
file->f_mapping = dentry->d_inode->i_mapping;
file->f_mode = mode;
file->f_op = fop;
+ if (mode & FMODE_WRITE) {
+ error = mnt_want_write(mnt);
+ WARN_ON(error);
+ }
return error;
}
EXPORT_SYMBOL(init_file);
@@ -231,8 +235,11 @@ void fastcall __fput(struct file *file)
if (unlikely(S_ISCHR(inode->i_mode) && inode->i_cdev != NULL))
cdev_put(inode->i_cdev);
fops_put(file->f_op);
- if (file->f_mode & FMODE_WRITE)
+ if (file->f_mode & FMODE_WRITE) {...
For some reason this has started oopsing on me:
[ 39.941273] audit(1196237507.111:5): audit_pid=1882 old=0 by auid=4294967295 subj=system_u:system_r:auditd_t:s0
[ 35.037235] BUG: unable to handle kernel NULL pointer dereference at virtual address 0000006a
[ 35.040536] printing eip: c017815c *pde = 00000000
[ 35.043913] Oops: 0000 [#1] PREEMPT
[ 35.047270] last sysfs file: /devices/platform/i8042/serio0/description
[ 35.050679] Modules linked in: ipv6 autofs4 hidp l2cap bluetooth sunrpc nf_conntrack_netbios_ns ipt_REJECT nf_conntrack_ipv4 xt_state nf_conntrack xt_tcpudp iptable_filter ip_tables x_tables acpi_cpufreq nvram ohci1394 ieee1394 ehci_hcd uhci_hcd sg joydev snd_hda_intel snd_seq_dummy snd_seq_oss snd_seq_midi_event snd_seq snd_seq_device snd_pcm_oss snd_mixer_oss snd_pcm sr_mod snd_timer snd i2c_i801 ipw2200 cdrom ieee80211 pcspkr piix soundcore ieee80211_crypt i2c_core snd_page_alloc button generic ext3 jbd ide_disk ide_core
[ 35.071690]
[ 35.075612] Pid: 2277, comm: hald-addon-acpi Not tainted (2.6.24-rc3-mm2 #8)
[ 35.080517] EIP: 0060:[<c017815c>] EFLAGS: 00010296 CPU: 0
[ 35.084585] EIP is at open_pathname+0x37b/0x6a5
[ 35.088631] EAX: 00000000 EBX: fffffff0 ECX: f61fc000 EDX: c01a02c1
[ 35.092747] ESI: f61fdf20 EDI: 00000008 EBP: f61fdf84 ESP: f61fdefc
[ 35.098762] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068
[ 35.102847] Process hald-addon-acpi (pid: 2277, ti=F61FC000 task=F61FEEB0 task.ti=F61FC000)
[ 35.103028] Stack: 00008000 f6589000 00000000 00000004 00000000 00008001 00000000 00000246
[ 35.107359] 0000023a f5a9f908 f780cbc0 00000000 00000000 00000000 00000101 00000001
[ 35.111671] 00000000 f61fdf5c 00000246 c016cdd9 00000246 f782b6a4 00000004 f782b6a4
[ 35.115978] Call Trace:
[ 35.124180] [<c0104a74>] show_trace_log_lvl+0x12/0x25
[ 35.128422] [<c0104b13>] show_stack_log_lvl+0x8c/0x97
[ 35.132646] [<c0104ba8>] show_registers+0x8a/0x1c0
[ 35.136900] [<c0104dcc>] die+0xee/0...
It does. I can see how I introduced the bug, and your fix does look
like a good one. Thanks.-- Dave
-
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/inode.c | 20 ++++++++++++--------
1 file changed, 12 insertions(+), 8 deletions(-)diff -puN fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-do-sys-utime-and-touch-atime 2007-09-20 12:16:20.000000000 -0700
+++ lxc-dave/fs/inode.c 2007-09-20 12:16:20.000000000 -0700
@@ -1176,22 +1176,23 @@ void touch_atime(struct vfsmount *mnt, s
struct inode *inode = dentry->d_inode;
struct timespec now;- if (inode->i_flags & S_NOATIME)
+ if (mnt && mnt_want_write(mnt))
return;
+ if (inode->i_flags & S_NOATIME)
+ goto out;
if (IS_NOATIME(inode))
- return;
+ goto out;
if ((inode->i_sb->s_flags & MS_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
+ goto out;/*
* We may have a NULL vfsmount when coming from NFSD
*/
if (mnt) {
if (mnt->mnt_flags & MNT_NOATIME)
- return;
+ goto out;
if ((mnt->mnt_flags & MNT_NODIRATIME) && S_ISDIR(inode->i_mode))
- return;
-
+ goto out;
if (mnt->mnt_flags & MNT_RELATIME) {
/*
* With relative atime, only update atime if the
@@ -1202,16 +1203,19 @@ void touch_atime(struct vfsmount *mnt, s
&inode->i_atime) < 0 &&
timespec_compare(&inode->i_ctime,
&inode->i_atime) < 0)
- return;
+ goto out;
}
}now = current_fs_time(inode->i_sb);
if (timespec_equal(&inode->i_atime, &now))
- return;
+ goto out;inode->i_atime = now;
mark_inode_dirty_sync(inode);
+out:
+ if (mnt)
+ mnt_drop_write(mnt);
}
EXPORT_SYMBOL(touch_atime);_
-
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/utimes.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)diff -puN fs/utimes.c~elevate-write-count-for-do-utimes fs/utimes.c
--- lxc/fs/utimes.c~elevate-write-count-for-do-utimes 2007-09-20 12:16:20.000000000 -0700
+++ lxc-dave/fs/utimes.c 2007-09-20 12:16:20.000000000 -0700
@@ -2,6 +2,7 @@
#include <linux/file.h>
#include <linux/fs.h>
#include <linux/linkage.h>
+#include <linux/mount.h>
#include <linux/namei.h>
#include <linux/sched.h>
#include <linux/stat.h>
@@ -75,8 +76,8 @@ long do_utimes(int dfd, char __user *filinode = dentry->d_inode;
- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;/* Don't worry, the checks are done in inode_change_ok() */
@@ -84,7 +85,7 @@ long do_utimes(int dfd, char __user *fil
if (times) {
error = -EPERM;
if (IS_APPEND(inode) || IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;if (times[0].tv_nsec == UTIME_OMIT)
newattrs.ia_valid &= ~ATTR_ATIME;
@@ -104,22 +105,24 @@ long do_utimes(int dfd, char __user *fil
} else {
error = -EACCES;
if (IS_IMMUTABLE(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;if (!is_owner_or_cap(inode)) {
if (f) {
if (!(f->f_mode & FMODE_WRITE))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
} else {
error = vfs_permission(&nd, MAY_WRITE);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;
}
}
}
mutex_lock(&inode->i_mutex);
error = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
if (f)
fput(f);...
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/open.c | 14 ++++++++------
1 file changed, 8 insertions(+), 6 deletions(-)diff -puN fs/open.c~elevate-writer-count-for-do-sys-truncate fs/open.c
--- lxc/fs/open.c~elevate-writer-count-for-do-sys-truncate 2007-09-20 12:16:19.000000000 -0700
+++ lxc-dave/fs/open.c 2007-09-20 12:16:19.000000000 -0700
@@ -244,21 +244,21 @@ static long do_sys_truncate(const char _
if (!S_ISREG(inode->i_mode))
goto dput_and_out;- error = vfs_permission(&nd, MAY_WRITE);
+ error = mnt_want_write(nd.mnt);
if (error)
goto dput_and_out;- error = -EROFS;
- if (IS_RDONLY(inode))
- goto dput_and_out;
+ error = vfs_permission(&nd, MAY_WRITE);
+ if (error)
+ goto mnt_drop_write_and_out;error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto mnt_drop_write_and_out;error = get_write_access(inode);
if (error)
- goto dput_and_out;
+ goto mnt_drop_write_and_out;/*
* Make sure that there are no leases. get_write_access() protects
@@ -276,6 +276,8 @@ static long do_sys_truncate(const char _put_write_and_out:
put_write_access(inode);
+mnt_drop_write_and_out:
+ mnt_drop_write(nd.mnt);
dput_and_out:
path_release(&nd);
out:
_
-
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/inode.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)diff -puN fs/inode.c~elevate-write-count-for-file_update_time fs/inode.c
--- lxc/fs/inode.c~elevate-write-count-for-file_update_time 2007-09-20 12:16:16.000000000 -0700
+++ lxc-dave/fs/inode.c 2007-09-20 12:16:16.000000000 -0700
@@ -1232,10 +1232,19 @@ void file_update_time(struct file *file)
struct inode *inode = file->f_path.dentry->d_inode;
struct timespec now;
int sync_it = 0;
+ int err = 0;if (IS_NOCMTIME(inode))
return;
- if (IS_RDONLY(inode))
+ /*
+ * Ideally, we want to guarantee that 'f_vfsmnt'
+ * is non-NULL here. But, NFS exports need to
+ * be fixed up before we can do that. So, check
+ * it for now. - Dave Hansen
+ */
+ if (file->f_vfsmnt)
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
return;now = current_fs_time(inode->i_sb);
@@ -1251,6 +1260,8 @@ void file_update_time(struct file *file)if (sync_it)
mark_inode_dirty_sync(inode);
+ if (file->f_vfsmnt)
+ mnt_drop_write(file->f_vfsmnt);
}EXPORT_SYMBOL(file_update_time);
_
-
If we depend on the inodes for writeability, we will not
catch the r/o mounts when implemented.This patches uses __mnt_want_write(). It does not guarantee
that the mount will stay writeable after the check. But,
this is OK for one of the checks because it is just for a
printk().The other two are probably unnecessary and duplicate existing
checks in the VFS. This won't make them better checks than
before, but it will make them detect r/o mounts.Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/nfs/dir.c | 3 ++-
lxc-dave/fs/nfsd/vfs.c | 4 ++--
2 files changed, 4 insertions(+), 3 deletions(-)diff -puN fs/nfs/dir.c~nfs-check-mnt-instead-of-sb fs/nfs/dir.c
--- lxc/fs/nfs/dir.c~nfs-check-mnt-instead-of-sb 2007-09-20 12:16:18.000000000 -0700
+++ lxc-dave/fs/nfs/dir.c 2007-09-20 12:16:18.000000000 -0700
@@ -997,7 +997,8 @@ static int is_atomic_open(struct inode *
if (nd->flags & LOOKUP_DIRECTORY)
return 0;
/* Are we trying to write to a read only partition? */
- if (IS_RDONLY(dir) && (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
+ if (__mnt_is_readonly(nd->mnt) &&
+ (nd->intent.open.flags & (O_CREAT|O_TRUNC|FMODE_WRITE)))
return 0;
return 1;
}
diff -puN fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~nfs-check-mnt-instead-of-sb 2007-09-20 12:16:18.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-09-20 12:16:18.000000000 -0700
@@ -1845,7 +1845,7 @@ nfsd_permission(struct svc_rqst *rqstp,
inode->i_mode,
IS_IMMUTABLE(inode)? " immut" : "",
IS_APPEND(inode)? " append" : "",
- IS_RDONLY(inode)? " ro" : "");
+ __mnt_is_readonly(exp->ex_mnt)? " ro" : "");
dprintk(" owner %d/%d user %d/%d\n",
inode->i_uid, inode->i_gid, current->fsuid, current->fsgid);
#endif
@@ -1856,7 +1856,7 @@ nfsd_permission(struct svc_rqst *rqstp,
*/
if (!(acc & MAY_L...
may_open() calls vfs_permission() before it does checks for
IS_RDONLY(inode). It checks _again_ inside of vfs_permission().The check inside of vfs_permission() is going away eventually.
With the mnt_want/drop_write() functions, all of the r/o
checks (except for this one) are consistently done before
calling permission(). Because of this, I'd like to use
permission() to hold a debugging check to make sure that
the mnt_want/drop_write() calls are actually being made.So, to do this:
1. remove the IS_RDONLY() check from permission()
2. enforce that you must mnt_want_write() before
even calling permission()
3. actually add the debugging check to permission()We need to rearrange may_open() to do r/o checks
before calling permission(). Here's the patch.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namei.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)diff -puN fs/namei.c~rearrange-permission-and-ro-checks-in-may_open fs/namei.c
--- lxc/fs/namei.c~rearrange-permission-and-ro-checks-in-may_open 2007-09-20 12:16:09.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:09.000000000 -0700
@@ -1579,10 +1579,6 @@ int may_open(struct nameidata *nd, int a
if (S_ISDIR(inode->i_mode) && (flag & FMODE_WRITE))
return -EISDIR;- error = vfs_permission(nd, acc_mode);
- if (error)
- return error;
-
/*
* FIFO's, sockets and device files are special: they don't
* actually live on the filesystem itself, and as such you
@@ -1597,6 +1593,10 @@ int may_open(struct nameidata *nd, int a
flag &= ~O_TRUNC;
} else if (IS_RDONLY(inode) && (flag & FMODE_WRITE))
return -EROFS;
+
+ error = vfs_permission(nd, acc_mode);
+ if (error)
+ return error;
/*
* An append-only file must be opened in append mode for writing.
*/
_
-
This also uses the little helper in the NFS code to
make an if() a little bit less ugly. We introduced
the helper at the beginning of the series.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/namei.c | 4 ++++
lxc-dave/fs/nfsd/vfs.c | 15 +++++++++++----
2 files changed, 15 insertions(+), 4 deletions(-)diff -puN fs/namei.c~elevate-write-count-over-calls-to-vfs-rename fs/namei.c
--- lxc/fs/namei.c~elevate-write-count-over-calls-to-vfs-rename 2007-09-20 12:16:18.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:18.000000000 -0700
@@ -2630,8 +2630,12 @@ static int do_rename(int olddfd, const c
if (new_dentry == trap)
goto exit5;+ error = mnt_want_write(oldnd.mnt);
+ if (error)
+ goto exit5;
error = vfs_rename(old_dir->d_inode, old_dentry,
new_dir->d_inode, new_dentry);
+ mnt_drop_write(oldnd.mnt);
exit5:
dput(new_dentry);
exit4:
diff -puN fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~elevate-write-count-over-calls-to-vfs-rename 2007-09-20 12:16:18.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-09-20 12:16:18.000000000 -0700
@@ -1651,13 +1651,20 @@ nfsd_rename(struct svc_rqst *rqstp, stru
if (ndentry == trap)
goto out_dput_new;-#ifdef MSNFS
- if ((ffhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
+ if (svc_msnfs(ffhp) &&
((atomic_read(&odentry->d_count) > 1)
|| (atomic_read(&ndentry->d_count) > 1))) {
host_err = -EPERM;
- } else
-#endif
+ goto out_dput_new;
+ }
+
+ host_err = -EXDEV;
+ if (ffhp->fh_export->ex_mnt != tfhp->fh_export->ex_mnt)
+ goto out_dput_new;
+ host_err = mnt_want_write(ffhp->fh_export->ex_mnt);
+ if (host_err)
+ goto out_dput_new;
+
host_err = vfs_rename(fdir, odentry, tdir, ndentry);
if (!host_err && EX_ISSYNC(tfhp->fh_export)) {
host_err = nfsd_sync_dir(tdentry);
_...
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/net/unix/af_unix.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)diff -puN net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime net/unix/af_unix.c
--- lxc/net/unix/af_unix.c~unix-find-other-elevate-write-count-for-touch-atime 2007-09-20 12:16:17.000000000 -0700
+++ lxc-dave/net/unix/af_unix.c 2007-09-20 12:16:17.000000000 -0700
@@ -729,21 +729,27 @@ static struct sock *unix_find_other(stru
err = path_lookup(sunname->sun_path, LOOKUP_FOLLOW, &nd);
if (err)
goto fail;
+
+ err = mnt_want_write(nd.mnt);
+ if (err)
+ goto put_path_fail;
+
err = vfs_permission(&nd, MAY_WRITE);
if (err)
- goto put_fail;
+ goto mnt_drop_write_fail;err = -ECONNREFUSED;
if (!S_ISSOCK(nd.dentry->d_inode->i_mode))
- goto put_fail;
+ goto mnt_drop_write_fail;
u=unix_find_socket_byinode(nd.dentry->d_inode);
if (!u)
- goto put_fail;
+ goto mnt_drop_write_fail;if (u->sk_type == type)
touch_atime(nd.mnt, nd.dentry);path_release(&nd);
+ mnt_drop_write(nd.mnt);err=-EPROTOTYPE;
if (u->sk_type != type) {
@@ -763,7 +769,9 @@ static struct sock *unix_find_other(stru
}
return u;-put_fail:
+mnt_drop_write_fail:
+ mnt_drop_write(nd.mnt);
+put_path_fail:
path_release(&nd);
fail:
*error=err;
_
-
Pretty self-explanatory. Fits in with the rest of the series.
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/namei.c | 5 +++++
lxc-dave/fs/nfsd/nfs4recover.c | 4 ++++
2 files changed, 9 insertions(+)diff -puN fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/namei.c
--- lxc/fs/namei.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-09-20 12:16:14.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:14.000000000 -0700
@@ -2026,7 +2026,12 @@ asmlinkage long sys_mkdirat(int dfd, conif (!IS_POSIXACL(nd.dentry->d_inode))
mode &= ~current->fs->umask;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_mkdir(nd.dentry->d_inode, dentry, mode);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
diff -puN fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir fs/nfsd/nfs4recover.c
--- lxc/fs/nfsd/nfs4recover.c~elevate-mnt-writers-for-callers-of-vfs-mkdir 2007-09-20 12:16:14.000000000 -0700
+++ lxc-dave/fs/nfsd/nfs4recover.c 2007-09-20 12:16:14.000000000 -0700
@@ -156,7 +156,11 @@ nfsd4_create_clid_dir(struct nfs4_client
dprintk("NFSD: nfsd4_create_clid_dir: DIRECTORY EXISTS\n");
goto out_put;
}
+ status = mnt_want_write(rec_dir.mnt);
+ if (status)
+ goto out_put;
status = vfs_mkdir(rec_dir.dentry->d_inode, dentry, S_IRWXU);
+ mnt_drop_write(rec_dir.mnt);
out_put:
dput(dentry);
out_unlock:
_
-
Some ioctl()s can cause writes to the filesystem. Take
these, and make them use mnt_want/drop_write() instead.We need to pass the filp one layer deeper in XFS, but
somebody _just_ pulled it out in February because nobody
was using it, so I don't feel guilty for adding it back.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/ext2/ioctl.c | 46 +++++++++-----
lxc-dave/fs/ext3/ioctl.c | 100 +++++++++++++++++++++-----------
lxc-dave/fs/ext4/ioctl.c | 105 +++++++++++++++++++++-------------
lxc-dave/fs/fat/file.c | 10 +--
lxc-dave/fs/hfsplus/ioctl.c | 40 ++++++++----
lxc-dave/fs/jfs/ioctl.c | 33 ++++++----
lxc-dave/fs/ocfs2/ioctl.c | 11 +--
lxc-dave/fs/reiserfs/ioctl.c | 55 +++++++++++------
lxc-dave/fs/xfs/linux-2.6/xfs_ioctl.c | 15 +++-
lxc-dave/fs/xfs/linux-2.6/xfs_iops.c | 7 --
lxc-dave/fs/xfs/linux-2.6/xfs_lrw.c | 9 ++
11 files changed, 274 insertions(+), 157 deletions(-)diff -puN fs/ext2/ioctl.c~ioctl-mnt-takers fs/ext2/ioctl.c
--- lxc/fs/ext2/ioctl.c~ioctl-mnt-takers 2007-09-20 12:16:12.000000000 -0700
+++ lxc-dave/fs/ext2/ioctl.c 2007-09-20 12:16:12.000000000 -0700
@@ -12,6 +12,7 @@
#include <linux/time.h>
#include <linux/sched.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include <linux/smp_lock.h>
#include <asm/current.h>
#include <asm/uaccess.h>
@@ -20,6 +21,7 @@
int ext2_ioctl (struct inode * inode, struct file * filp, unsigned int cmd,
unsigned long arg)
{
+ int ret;
struct ext2_inode_info *ei = EXT2_I(inode);
unsigned int flags;@@ -33,14 +35,19 @@ int ext2_ioctl (struct inode * inode, st
case EXT2_IOC_SETFLAGS: {
unsigned int oldflags;- if (IS_RDONLY(inode))
- return -EROFS;
-
- if (!is_owner_or_cap(inode))
- return -EACCES;
+ ret = mnt_want_write(filp-&...
On Thu, 20 Sep 2007 12:52:57 -0700
Note that -mm's ext2-reservations.patch adds EXT2_IOC_SETRSVSZ,
and it doesn't do mnt_want_write().
-
That doesn't quite apply to mainline (at least after the patches I just
sent). I'll wait and send you one on top of the next -mm so that I can
get a coherent view of what's going on if that's all right.-- Dave
-
On Fri, 21 Sep 2007 16:39:40 -0700
Sure, that's OK.
But I only noticed it because I happened to have my nose in there fixing a
reject.-
It still creeps me out that we have this sprinkled *all over* the tree and
people will forget to do it and there's no runtime or compile-time checking
that they remembered to do it and when they forget to do it nobody will
notice that it broke until ages and ages later.IOW: it is a sheer horror for maintainability.
Please have a think about what we can do about this. For example, if you'd
thought about this up-front, (and I think it's a big problem), we could
have done something grotty like, in mnt_want_write():current->vfsmnt_im_allowed_to_write_to = inode;
and then check that current->vfsmnt_im_allowed_to_write_to is the correct
inode in __mark_inode_dirty() and various other strategic places. That
sort of thing.We need to do *something*, I think. This code just doesn't look feasibly
maintainable to me.-
We definitely can't use 'current'-based counts for everything. For
example, we might take a write on a mnt at a time when a loopback device
is created. We'll release that write when the device is destroyed, and
these certainly won't always be the same task, or even process.However, there are preciously few places in the code that actually have
to deal with these long-lived, persistent mount writer counts. Those
are basically where we're creating a 'struct file' and keeping it around
for a while.The rest of the calls (95% or so) are much more atomic. They are much
more like a simple lock/unlock pair, and their references will never
persist beyond a single system call. These are also very rarely nested
and certainly never deeply nested.For the "atomic" write counts, I keep track of the mnts on which a
particular task may write. This is a simple array in the task_struct.
I currently check this array on all task_struct free()s to make sure it
is empty. We could also do this at all system call exits, but this is
probably unnecessary because that would almost certainly be a
mnt_want_write() leak.For the persistent users, we actually change the API to be
mnt_want_persistent_write(mnt, inode) to track the mnt and the inode
pair, and we hang the list of these off the superblock. We could
probably do this universally, but the number of persistent users is
quite small, which reduces the code churn required. This part of the
patch is not very scalable, so we'd have to rethink this if we want this
functionality for more than debugging.Did you have this in mind as being a debugging option that we turn on
when we see problems like LOCKDEP, or were you thinking of something
that stays on all the time? This turned out to be more code than I had
hoped, so I'm very open to suggestions for other approaches.We also need a thorough review to look for things like
ext3_mark_inode_dirty() since they never actually call down to the
generic vfs mark_inode_dirty().This is stri...
Just brainstorming a bit... I tried just keeping a writer count in the
superblock which we bump at the same time as the mnt->__mnt_writers,
which makes it quite easy to check whenever we have the inode around.
This causes quite a few false positives with journal code, and I think a
few more with direct block device access. I think we can work around
those, though.The true issue comes when we have a lot of users (more than one) of a
superblock. If anybody has a write request outstanding, then the check
I put in will get skipped. It's easy to mask real problems this way.Putting the "inodes allowed" list in the task (or probably signal struct
to support threads) should be workable, but I worry a bit about the
cases where fds are sent across sockets.We could also break up the mnt_want_write() requests into two classes,
the common mnt_want_write() case and mnt_want_indefinite_write() or
something. The plain (and common) case is the one for things like
rmdir, mknod, or chmod where the write is confined to a single syscall.
mnt_want_indefinite_write() would be for things that do an open() and
later work on a file descriptor. Each of these cases could have a
different variable or flag in (or hanging off) the task struct.In __mark_inode_dirty() we would first check to see if we are currently
in one of the plain, more atomic, mnt_want_write() regions by checking a
flag. If not, we go looking through the task's file descriptors and
seeing if one of them is open for write on the inode. If both of these
conditions fail, then we have a potential bug.-- Dave
-
chown/chmod,etc... don't call permission in the same way
that the normal "open for write" calls do. They still
write to the filesystem, so bump the write count during
these operations.This conflicts with the current (~2.6.23-rc7) audit
git tree in -mm. wiggle'ing the patch merges it.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/open.c | 39 ++++++++++++++++++++++++++++++---------
1 file changed, 30 insertions(+), 9 deletions(-)diff -puN fs/open.c~elevate-writer-count-for-chown-and-friends fs/open.c
--- lxc/fs/open.c~elevate-writer-count-for-chown-and-friends 2007-09-20 12:16:13.000000000 -0700
+++ lxc-dave/fs/open.c 2007-09-20 12:16:13.000000000 -0700
@@ -571,12 +571,12 @@ asmlinkage long sys_fchmod(unsigned intaudit_inode(NULL, inode);
- err = -EROFS;
- if (IS_RDONLY(inode))
+ err = mnt_want_write(file->f_vfsmnt);
+ if (err)
goto out_putf;
err = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto out_putf;
+ goto out_drop_write;
mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
mode = inode->i_mode;
@@ -585,6 +585,8 @@ asmlinkage long sys_fchmod(unsigned int
err = notify_change(dentry, &newattrs);
mutex_unlock(&inode->i_mutex);+out_drop_write:
+ mnt_drop_write(file->f_vfsmnt);
out_putf:
fput(file);
out:
@@ -604,13 +606,13 @@ asmlinkage long sys_fchmodat(int dfd, co
goto out;
inode = nd.dentry->d_inode;- error = -EROFS;
- if (IS_RDONLY(inode))
+ error = mnt_want_write(nd.mnt);
+ if (error)
goto dput_and_out;error = -EPERM;
if (IS_IMMUTABLE(inode) || IS_APPEND(inode))
- goto dput_and_out;
+ goto out_drop_write;mutex_lock(&inode->i_mutex);
if (mode == (mode_t) -1)
@@ -620,6 +622,8 @@ asmlinkage long sys_fchmodat(int dfd, co
error = notify_change(nd.dentry, &newattrs);
mutex_unlock(&inode->i_mutex);+out_drop_write:
+ mnt_drop_write(...
Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namei.c | 10 ++++++++++
1 file changed, 10 insertions(+)diff -puN fs/namei.c~elevate-write-count-for-link-and-symlink-calls fs/namei.c
--- lxc/fs/namei.c~elevate-write-count-for-link-and-symlink-calls 2007-09-20 12:16:15.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:15.000000000 -0700
@@ -2299,7 +2299,12 @@ asmlinkage long sys_symlinkat(const char
if (IS_ERR(dentry))
goto out_unlock;+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_symlink(nd.dentry->d_inode, dentry, from, S_IALLUGO);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
@@ -2394,7 +2399,12 @@ asmlinkage long sys_linkat(int olddfd, c
error = PTR_ERR(new_dentry);
if (IS_ERR(new_dentry))
goto out_unlock;
+ error = mnt_want_write(nd.mnt);
+ if (error)
+ goto out_dput;
error = vfs_link(old_nd.dentry, nd.dentry->d_inode, new_dentry);
+ mnt_drop_write(nd.mnt);
+out_dput:
dput(new_dentry);
out_unlock:
mutex_unlock(&nd.dentry->d_inode->i_mutex);
_
-
Some ioctls need write access, but others don't. Make a helper
function to decide when write access is needed, and take it.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/ncpfs/ioctl.c | 54 +++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 53 insertions(+), 1 deletion(-)diff -puN fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl fs/ncpfs/ioctl.c
--- lxc/fs/ncpfs/ioctl.c~elevate-write-count-during-entire-ncp-ioctl 2007-09-20 12:16:15.000000000 -0700
+++ lxc-dave/fs/ncpfs/ioctl.c 2007-09-20 12:16:15.000000000 -0700
@@ -14,6 +14,7 @@
#include <linux/ioctl.h>
#include <linux/time.h>
#include <linux/mm.h>
+#include <linux/mount.h>
#include <linux/highuid.h>
#include <linux/smp_lock.h>
#include <linux/vmalloc.h>
@@ -261,7 +262,7 @@ ncp_get_charsets(struct ncp_server* serv
}
#endif /* CONFIG_NCPFS_NLS */-int ncp_ioctl(struct inode *inode, struct file *filp,
+static int __ncp_ioctl(struct inode *inode, struct file *filp,
unsigned int cmd, unsigned long arg)
{
struct ncp_server *server = NCP_SERVER(inode);
@@ -822,6 +823,57 @@ outrel:
return -EINVAL;
}+static int ncp_ioctl_need_write(unsigned int cmd)
+{
+ switch (cmd) {
+ case NCP_IOC_GET_FS_INFO:
+ case NCP_IOC_GET_FS_INFO_V2:
+ case NCP_IOC_NCPREQUEST:
+ case NCP_IOC_SETDENTRYTTL:
+ case NCP_IOC_SIGN_INIT:
+ case NCP_IOC_LOCKUNLOCK:
+ case NCP_IOC_SET_SIGN_WANTED:
+ return 1;
+ case NCP_IOC_GETOBJECTNAME:
+ case NCP_IOC_SETOBJECTNAME:
+ case NCP_IOC_GETPRIVATEDATA:
+ case NCP_IOC_SETPRIVATEDATA:
+ case NCP_IOC_SETCHARSETS:
+ case NCP_IOC_GETCHARSETS:
+ case NCP_IOC_CONN_LOGGED_IN:
+ case NCP_IOC_GETDENTRYTTL:
+ case NCP_IOC_GETMOUNTUID2:
+ case NCP_IOC_SIGN_WANTED:
+ case NCP_IOC_GETROOT:
+ case NCP_IOC_SETROOT:
+ return 0;
+ default:
+ /* unkown IOCTL command, assume write */
+ }
+ return 1;
+}
+
+int ncp_ioctl(struct inode *inode, struct file *filp,
+ unsigned int c...
It is OK to let access() go without using a mnt_want/drop_write()
pair because it doesn't actually do writes to the filesystem,
and it is inherently racy anyway. This is a rare case when it is
OK to use __mnt_is_readonly() directly.Acked-by: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/open.c | 13 +++++++++++--
1 file changed, 11 insertions(+), 2 deletions(-)diff -puN fs/open.c~make-access-use-helper fs/open.c
--- lxc/fs/open.c~make-access-use-helper 2007-09-20 12:16:13.000000000 -0700
+++ lxc-dave/fs/open.c 2007-09-20 12:16:13.000000000 -0700
@@ -457,8 +457,17 @@ asmlinkage long sys_faccessat(int dfd, c
if(res || !(mode & S_IWOTH) ||
special_file(nd.dentry->d_inode->i_mode))
goto out_path_release;
-
- if(IS_RDONLY(nd.dentry->d_inode))
+ /*
+ * This is a rare case where using __mnt_is_readonly()
+ * is OK without a mnt_want/drop_write() pair. Since
+ * no actual write to the fs is performed here, we do
+ * not need to telegraph to that to anyone.
+ *
+ * By doing this, we accept that this access is
+ * inherently racy and know that the fs may change
+ * state before we even see this result.
+ */
+ if (__mnt_is_readonly(nd.mnt))
res = -EROFS;out_path_release:
_
-
This patch adds two function mnt_want_write() and
mnt_drop_write(). These are used like a lock pair around
and fs operations that might cause a write to the filesystem.Before these can become useful, we must first cover each
place in the VFS where writes are performed with a
want/drop pair. When that is complete, we can actually
introduce code that will safely check the counts before
allowing r/w<->r/o transitions to occur.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namespace.c | 54 +++++++++++++++++++++++++++++++++++++++++
lxc-dave/include/linux/mount.h | 3 ++
2 files changed, 57 insertions(+)diff -puN fs/namespace.c~add-vfsmount-writer-count fs/namespace.c
--- lxc/fs/namespace.c~add-vfsmount-writer-count 2007-09-20 12:16:10.000000000 -0700
+++ lxc-dave/fs/namespace.c 2007-09-20 12:16:10.000000000 -0700
@@ -77,6 +77,60 @@ struct vfsmount *alloc_vfsmnt(const char
return mnt;
}+/*
+ * Most r/o checks on a fs are for operations that take
+ * discrete amounts of time, like a write() or unlink().
+ * We must keep track of when those operations start
+ * (for permission checks) and when they end, so that
+ * we can determine when writes are able to occur to
+ * a filesystem.
+ */
+/**
+ * mnt_want_write - get write access to a mount
+ * @mnt: the mount on which to take a write
+ *
+ * This tells the low-level filesystem that a write is
+ * about to be performed to it, and makes sure that
+ * writes are allowed before returning success. When
+ * the write operation is finished, mnt_drop_write()
+ * must be called. This is effectively a refcount.
+ */
+int mnt_want_write(struct vfsmount *mnt)
+{
+ if (__mnt_is_readonly(mnt))
+ return -EROFS;
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mnt_want_write);
+
+/**
+ * mnt_drop_write - give up write access to a mount
+ * @mnt: the mount on which to give up write access
+ *
+ * Tells the low-level filesystem that we are done
+ * performing writes to it. Must be matc...
I'm going to be modifying nfsd_rename() shortly to support
read-only bind mounts. This #ifdef is around the area I'm
patching, and it starts to get really ugly if I just try
to add my new code by itself. Using this little helper
makes things a lot cleaner to use.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
Acked-by: Christoph Hellwig <hch@infradead.org>
---lxc-dave/fs/nfsd/vfs.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)diff -puN fs/nfsd/vfs.c~create-svc_msnfs-helper fs/nfsd/vfs.c
--- lxc/fs/nfsd/vfs.c~create-svc_msnfs-helper 2007-09-20 12:16:10.000000000 -0700
+++ lxc-dave/fs/nfsd/vfs.c 2007-09-20 12:16:10.000000000 -0700
@@ -857,6 +857,15 @@ static int nfsd_direct_splice_actor(stru
return __splice_from_pipe(pipe, sd, nfsd_splice_actor);
}+static inline int svc_msnfs(struct svc_fh *ffhp)
+{
+#ifdef MSNFS
+ return (ffhp->fh_export->ex_flags & NFSEXP_MSNFS);
+#else
+ return 0;
+#endif
+}
+
static __be32
nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
loff_t offset, struct kvec *vec, int vlen, unsigned long *count)
@@ -869,11 +878,9 @@ nfsd_vfs_read(struct svc_rqst *rqstp, sterr = nfserr_perm;
inode = file->f_path.dentry->d_inode;
-#ifdef MSNFS
- if ((fhp->fh_export->ex_flags & NFSEXP_MSNFS) &&
- (!lock_may_read(inode, offset, *count)))
+
+ if (svc_msnfs(fhp) && !lock_may_read(inode, offset, *count))
goto out;
-#endif/* Get readahead parameters */
ra = nfsd_get_raparms(inode->i_sb->s_dev, inode->i_ino);
_
-
First of all, this makes the structure jumping look a little
bit cleaner. So, this stands alone as a tiny cleanup. But,
we also need 'mnt' by itself a few more times later in this
series, so this isn't _just_ a cleanup.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/namei.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)diff -puN fs/namei.c~give-may_open-a-local-mnt-variable fs/namei.c
--- lxc/fs/namei.c~give-may_open-a-local-mnt-variable 2007-09-20 12:16:09.000000000 -0700
+++ lxc-dave/fs/namei.c 2007-09-20 12:16:09.000000000 -0700
@@ -230,6 +230,10 @@ int permission(struct inode *inode, int
{
umode_t mode = inode->i_mode;
int retval, submask;
+ struct vfsmount *mnt = NULL;
+
+ if (nd)
+ mnt = nd->mnt;if (mask & MAY_WRITE) {
@@ -254,7 +258,7 @@ int permission(struct inode *inode, int
* the fs is mounted with the "noexec" flag.
*/
if ((mask & MAY_EXEC) && S_ISREG(mode) && (!(mode & S_IXUGO) ||
- (nd && nd->mnt && (nd->mnt->mnt_flags & MNT_NOEXEC))))
+ (mnt && (mnt->mnt_flags & MNT_NOEXEC))))
return -EACCES;/* Ordinary permission routines do not understand MAY_APPEND. */
_
-
You're touching permission, not may_open, though.
-
Christoph H. says this stands on its own and can go in before the
rest of the r/o bind mount set.---
Some filesystems forego the vfs and may_open() and create their
own 'struct file's.This patch creates a couple of helper functions which can be
used by these filesystems, and will provide a unified place
which the r/o bind mount code may patch.Also, rename an existing, static-scope init_file() to a less
generic name.Signed-off-by: Dave Hansen <haveblue@us.ibm.com>
---lxc-dave/fs/configfs/dir.c | 5 ++-
lxc-dave/fs/file_table.c | 60 ++++++++++++++++++++++++++++++++++++++++++
lxc-dave/fs/hugetlbfs/inode.c | 22 ++++++---------
lxc-dave/include/linux/file.h | 9 ++++++
lxc-dave/ipc/shm.c | 13 +++------
lxc-dave/mm/shmem.c | 7 +---
lxc-dave/mm/tiny-shmem.c | 19 ++++---------
lxc-dave/net/socket.c | 18 ++++++------
8 files changed, 104 insertions(+), 49 deletions(-)diff -puN fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s fs/configfs/dir.c
--- lxc/fs/configfs/dir.c~filesystem-helpers-for-custom-struct-file-s 2007-09-20 12:16:08.000000000 -0700
+++ lxc-dave/fs/configfs/dir.c 2007-09-20 12:16:08.000000000 -0700
@@ -142,7 +142,7 @@ static int init_dir(struct inode * inode
return 0;
}-static int init_file(struct inode * inode)
+static int configfs_init_file(struct inode * inode)
{
inode->i_size = PAGE_SIZE;
inode->i_fop = &configfs_file_operations;
@@ -283,7 +283,8 @@ static int configfs_attach_attr(struct cdentry->d_fsdata = configfs_get(sd);
sd->s_dentry = dentry;
- error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG, init_file);
+ error = configfs_create(dentry, (attr->ca_mode & S_IALLUGO) | S_IFREG,
+ configfs_init_file);
if (error) {
configfs_put(sd);
return error;
diff -puN fs/file_table.c~filesystem-helpers-for-custom-struct-file-s fs/file_table.c
--- lxc/fs/file_table.c~filesystem-helper...
| Ingo Molnar | Re: containers (was Re: -mm merge plans for 2.6.23) |
| Greg Kroah-Hartman | [PATCH 009/196] Chinese: add translation of sparse.txt |
| holzheu | Re: [RFC/PATCH] Documentation of kernel messages |
| Vladislav Bolkhovitin | Re: Integration of SCST in the mainstream Linux kernel |
git: | |
| Jarek Poplawski | [PATCH] pkt_sched: Destroy gen estimators under rtnl_lock(). |
| Gerrit Renker | [PATCH 27/37] dccp: Integration of dynamic feature activation - part 2 (server side) |
| David Miller | [GIT]: Networking |
| Antonio Almeida | HTB accuracy for high speed |
