[patch 1/2] VFS: new fgetattr() file operation

Previous thread: realtime preemption performance difference by Jaswinder Singh on Monday, September 24, 2007 - 7:48 am. (5 messages)

Next thread: 2.6.23-rc7-mm1: fix generic_setlease() mismerge by Alexey Dobriyan on Monday, September 24, 2007 - 9:09 am. (1 message)
To: <hch@...>
Cc: <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 8:24 am

Thanks to everyone for the feedback. Here's two of the VFS patches
reworked according to comments. I also plan to rework the setattr()
patch accordingly and perhaps the xattr patch, altough that is the
lowest priority.

Christoph, are these OK with you in this form?

----
From: Miklos Szeredi <mszeredi@suse.cz>

Add a new file operation: f_op->fgetattr(), that is invoked by
fstat(). Fall back to i_op->getattr() if it is not defined.

This is useful for filesystems such as sshfs, which don't have a state
associated with inodes, but do have a state associated with open file
handles, and can perform a getattr operation using these handles.

In these cases there are basically two ways to correctly implement
open-unlink-fstat semantics:

1) use "sillyrenaming"

2) keep track of open files for each inode within the filesystem, and
randomly choose one for each getattr() request on an inode with
i_nlink == 0

3) VFS passes the open file to the filesystem, which can be used to
perform a getattr operation on the file handle

No 3. is by far the simplest solution, but it does require this
interface change in the VFS. It is also the only one that takes care
of the case, when a regular file is unlinked or renamed on the remote
host, while it is still open locally.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/stat.c
===================================================================
--- linux.orig/fs/stat.c 2007-09-24 13:04:37.000000000 +0200
+++ linux/fs/stat.c 2007-09-24 13:06:10.000000000 +0200
@@ -55,6 +55,27 @@ int vfs_getattr(struct vfsmount *mnt, st

EXPORT_SYMBOL(vfs_getattr);

+static int vfs_fgetattr(struct file *file, struct kstat *stat)
+{
+ struct vfsmount *mnt = file->f_path.mnt;
+ struct dentry *dentry = file->f_path.dentry;
+ struct inode *inode = dentry->d_inode;
+ int retval;
+
+ retval = security_inode_getattr(mnt, dentry);
+ if (retval)
+ return retval;
+
+ if (file->f_op &a...

To: Miklos Szeredi <miklos@...>
Cc: <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 8:36 am

Not at all. Attribute operations like this have no business at all
looking at the struct file. Please fix your dreaded filesystem to
implement proper unix semantics intead, and if that means adding
silly rename support so be it.

-

To: <hch@...>
Cc: <miklos@...>, <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 8:48 am

It's a fixed protocol, with servers installed on millions of servers.
The protocol is called SFTP and the server is part of the OpenSSH
package. There's nothing I can change there.

And even if I could change the protocol, it's impossible to implement
full UNIX semantics with a userspace server. Please think a bit about

That's what is done currently.

But it's has various dawbacks, like rmdir doesn't work if there are
open files within an otherwise empty directory.

I'd happily accept suggestions on how to deal with this differenty.

Thanks,
Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <hch@...>, <miklos@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:42 am

NFS has that problem because it really has to sillyrename into the same
directory. I don't see that ssh/sftp needs to do that. Instead it can
sillyrename anywhere in the filesystem.

Alan
-

To: <alan@...>
Cc: <miklos@...>, <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:40 am

I don't think it can. How can we find in a reliable way another
directory, which is writable by the user?

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 8:59 am

Only sillyrename files with nlink > 1? I don't see how attributes can
change anything for a deleted file.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
-

To: <matthew@...>
Cc: <miklos@...>, <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:06 am

I don't quite understand your suggestion.

A file isn't deleted while there are still links or open files
refering to it. So getting the attributes for a file with nlink==0 is
perfectly valid while the file is still open.

If a network filesystem protocol can't handle operations (be it data
or metadata) on an unlinked file, we must do sillirenaming, so that
the file is not actually unlinked.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:16 am

Is it? Why not just pretend that the attributes are wiped when the file

Or you could call getattr right before you unlink and cache the result
in the client.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
-

To: <matthew@...>
Cc: <miklos@...>, <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:21 am

You mean "when finally unlinked"? Delete happens when the file is

The file can still be modified after being unlinked.

And even if we did this caching thing and modify the attributes when
the file is modified, it would not deal with access on the remote end,
and would be much more complex than the other alternatives.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <matthew@...>, <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:10 am

Or not support such a broken protocol at all.

-

To: <hch@...>
Cc: <miklos@...>, <matthew@...>, <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:18 am

Wonder what people would say if we removed support for NFSv[23].

Just because a protocol does not support "perfect" UNIX semantics, it
doesn't mean it's broken. By that standard almost all network
filesystem protocols are severely broken.

Miklos
-

To: Miklos Szeredi <miklos@...>
Cc: <hch@...>, <matthew@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:27 am

Well, they are broken by these and other standards. At least nfs and
cifs maintainers do the workarounds for this brokeness where they belong.
-

To: <hch@...>
Cc: <miklos@...>, <hch@...>, <matthew@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 9:38 am

And my patch is not working around a problem, rather solving a problem
in a correct way.

Let me summarise it:

There are valid reasons we have fstat() in addition to stat/lstat.
For example we want to protect agains races involving unlink/rename on
an open file.

Say I want to implement a network filesystem with a pure unprivileged
userspace sever (this is basically what sshfs does).

I want my filesystem client implementation to keep all these
advantages of fstat(). So how can I do that? There's a simple way:
implement this operation with fstat() on the server. I get all the
advantages on the remote system automatically.

But for that the filesystem needs to have the open file that the
fstat() on the client was performed on.

It's that simple. There's really no ugly hacks going on behind the
scenes. It's just that we do want to delegate some properties of this
operation onto the server, and the simplest and best implementation is
to just let the filesystem have the information it needs.

Why is that such a big problem?

Miklos
-

To: <hch@...>
Cc: <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 8:25 am

From: Miklos Szeredi <mszeredi@suse.cz>

Add a new super block flag, that results in the VFS not checking if
the current process has enough privileges to do an mknod().

If this flag is set, all mounts for this super block will have the
"nodev" flag implied.

This is needed on filesystems, where an unprivileged user may be able
to create a device node, without causing security problems.

One such example is "mountlo" a loopback mount utility implemented
with fuse and UML, which runs as an unprivileged userspace process.
In this case the user does in fact have the right to create device
nodes within the filesystem image, as long as the user has write
access to the image. Since the filesystem is mounted with "nodev",
adding device nodes is not a security concern.

Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
---

Index: linux/fs/namei.c
===================================================================
--- linux.orig/fs/namei.c 2007-09-24 13:52:17.000000000 +0200
+++ linux/fs/namei.c 2007-09-24 13:54:57.000000000 +0200
@@ -1617,7 +1617,7 @@ int may_open(struct nameidata *nd, int a
if (S_ISFIFO(inode->i_mode) || S_ISSOCK(inode->i_mode)) {
flag &= ~O_TRUNC;
} else if (S_ISBLK(inode->i_mode) || S_ISCHR(inode->i_mode)) {
- if (nd->mnt->mnt_flags & MNT_NODEV)
+ if (IS_MNT_NODEV(nd->mnt))
return -EACCES;

flag &= ~O_TRUNC;
@@ -1920,7 +1920,8 @@ int vfs_mknod(struct inode *dir, struct
if (error)
return error;

- if ((S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
+ if (!(dir->i_sb->s_flags & MS_MKNOD_NOCAP) &&
+ (S_ISCHR(mode) || S_ISBLK(mode)) && !capable(CAP_MKNOD))
return -EPERM;

if (!dir->i_op || !dir->i_op->mknod)
Index: linux/include/linux/fs.h
===================================================================
--- linux.orig/include/linux/fs.h 2007-09-24 13:52:17.000000000 +0200
+++ linux/include/linux/fs.h 2007-09-24 13:54:57.00...

To: Miklos Szeredi <miklos@...>
Cc: <hch@...>, <trond.myklebust@...>, <adilger@...>, <linux-kernel@...>, <linux-fsdevel@...>
Date: Monday, September 24, 2007 - 8:38 am

This one looks okay, but I'd prefer to not put it in until we actually
have proper non-privilegued mounts.

-

Previous thread: realtime preemption performance difference by Jaswinder Singh on Monday, September 24, 2007 - 7:48 am. (5 messages)

Next thread: 2.6.23-rc7-mm1: fix generic_setlease() mismerge by Alexey Dobriyan on Monday, September 24, 2007 - 9:09 am. (1 message)