Re: [PATCH 02/18] xstat: Add a pair of system calls to make extended file stats available [ver #6]

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Implement a pair of new system calls to provide extended and further extensible
stat functions.

The second of the associated patches is the main patch that provides these new
system calls:

	ssize_t ret = xstat(int dfd,
			    const char *filename,
			    unsigned atflag,
			    struct xstat_parameters *params,
			    struct xstat *buffer,
			    size_t bufsize);

	ssize_t ret = fxstat(int fd,
			     struct xstat_parameters *params,
			     struct xstat *buffer,
			     size_t bufsize);

which are more fully documented in that patch's description.

These new stat functions provide a number of useful features, in summary:

 (1) More information: creation time, inode generation number, data version
     number, flags/attributes.  A subset of these is available through each of:
     CIFS, NFS, AFS, Ext4, BTRFS and others.

 (2) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server.

 (3) Heavyweight stat: Force a netfs to go to the server, even if it thinks its
     cached attributes are up to date.

 (4) Allow the filesystem to indicate what it can/cannot provide: A filesystem
     can now say it doesn't support a standard stat feature if that isn't
     available.

 (5) Make the fields a consistent size on all arches, and make them large.

 (6) Can be extended by using more request flags and appending further data
     after the end of the standard return data.

Note that no lstat() equivalent is required as that can be implemented through
xstat() with atflag == 0.
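
Features (2) and (4) both rest on the caller saying what it wants and the
filesystem saying what it actually supplied.  The following is only a userspace
model of that idea, not the interface from the main patch: struct xstat_model,
modelfs_fill(), xstat_model_has() and the bit values are invented for
illustration, and only the XSTAT_REQUEST_* names and st_result_mask are taken
from this series.

```c
#include <stdint.h>

/* Illustrative bit values only; the real ones come from the main patch. */
#define XSTAT_REQUEST_INO		(1ULL << 0)
#define XSTAT_REQUEST_GEN		(1ULL << 1)
#define XSTAT_REQUEST_DATA_VERSION	(1ULL << 2)

/* Cut-down stand-in for struct xstat, holding only the fields used here. */
struct xstat_model {
	uint64_t st_result_mask;
	uint64_t st_ino;
	uint64_t st_gen;
	uint64_t st_data_version;
};

/*
 * Hypothetical filesystem that has inode and generation numbers but no data
 * version.  It fills in only what was asked for and reports what it supplied.
 */
void modelfs_fill(uint64_t request_mask, struct xstat_model *buf)
{
	buf->st_result_mask = 0;
	if (request_mask & XSTAT_REQUEST_INO) {
		buf->st_ino = 42;
		buf->st_result_mask |= XSTAT_REQUEST_INO;
	}
	if (request_mask & XSTAT_REQUEST_GEN) {
		buf->st_gen = 7;
		buf->st_result_mask |= XSTAT_REQUEST_GEN;
	}
	/* XSTAT_REQUEST_DATA_VERSION is never set: not supported here. */
}

/* A field may only be trusted if its bit came back in st_result_mask. */
int xstat_model_has(const struct xstat_model *buf, uint64_t bit)
{
	return (buf->st_result_mask & bit) != 0;
}
```

A caller asking for XSTAT_REQUEST_GEN | XSTAT_REQUEST_DATA_VERSION from this
model gets the generation number back and can tell from st_result_mask that
the data version was not provided.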


==================
ADDITIONAL PATCHES
==================

The first patch makes const a bunch of system call userspace string/buffer
arguments.  I can then make sys_xstat()'s filename pointer const too (though
the entire first patch is not required for that).

The third patch makes the AFS filesystem use i_generation for the vnode ID
uniquifier rather than i_version, ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Store the AFS vnode uniquifier in the i_generation field, not the i_version
field of the inode struct.  i_version can then be given the AFS data version
number.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/dir.c      |    8 ++++----
 fs/afs/fsclient.c |    3 ++-
 fs/afs/inode.c    |   10 +++++-----
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index b42d5cc..afb9ff8 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -542,11 +542,11 @@ static struct dentry *afs_lookup(struct inode *dir, struct dentry *dentry,
 	dentry->d_op = &afs_fs_dentry_operations;
 
 	d_add(dentry, inode);
-	_leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%llu }",
+	_leave(" = 0 { vn=%u u=%u } -> { ino=%lu v=%u }",
 	       fid.vnode,
 	       fid.unique,
 	       dentry->d_inode->i_ino,
-	       (unsigned long long)dentry->d_inode->i_version);
+	       dentry->d_inode->i_generation);
 
 	return NULL;
 }
@@ -626,10 +626,10 @@ static int afs_d_revalidate(struct dentry *dentry, struct nameidata *nd)
 		 * been deleted and replaced, and the original vnode ID has
 		 * been reused */
 		if (fid.unique != vnode->fid.unique) {
-			_debug("%s: file deleted (uq %u -> %u I:%llu)",
+			_debug("%s: file deleted (uq %u -> %u I:%u)",
 			       dentry->d_name.name, fid.unique,
 			       vnode->fid.unique,
-			       (unsigned long long)dentry->d_inode->i_version);
+			       dentry->d_inode->i_generation);
 			spin_lock(&vnode->lock);
 			set_bit(AFS_VNODE_DELETED, &vnode->flags);
 			spin_unlock(&vnode->lock);
diff --git a/fs/afs/fsclient.c b/fs/afs/fsclient.c
index 4bd0218..346e328 100644
--- a/fs/afs/fsclient.c
+++ b/fs/afs/fsclient.c
@@ -89,7 +89,7 @@ static void xdr_decode_AFSFetchStatus(const __be32 **_bp,
 			i_size_write(&vnode->vfs_inode, size);
 			vnode->vfs_inode.i_uid = status->owner;
 			vnode->vfs_inode.i_gid = status->group;
-			vnode->vfs_inode.i_version = vnode->fid.unique;
+			vnode->vfs_inode.i_generation = ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Return extended attributes from the CIFS filesystem.  This includes the
following:

 (1) Return the file creation time as btime.  We assume that the creation time
     won't change over the life of the inode.

 (2) FS_AUTOMOUNT_FL on referral/submount directories.

 (3) Deasserting XSTAT_REQUEST_INO in st_result_mask if we made up the inode
     number and didn't get it from the server.

 (4) Map various Windows file attributes to FS_xxx_FL flags in st_inode_flags,
     fetching them from the server if we don't have them yet or don't have a
     current copy.

Furthermore, what cifs_getattr() does can be controlled as follows:

 (1) If AT_FORCE_ATTR_SYNC is indicated, or if the inode flags or creation time
     are requested but not yet collected, then the attributes will be reread
     unconditionally.

 (2) If the basic stats are requested or if the inode flags are requested and
     have been collected previously, then the attributes will be reread if out
     of date.

 (3) Otherwise the cached attributes will be used - even if expired - without
     reference to the server.

Note that cifs_revalidate_dentry() will issue an extra operation to get the
FILE_ALL_INFO in addition to the FILE_UNIX_BASIC_INFO if it needs to collect
creation time and attributes on behalf of cifs_getattr().
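
The three cases above can be read as a small decision function.  This is a
standalone sketch of the rules as described, not the code in cifs_getattr():
the CIFSM_* bits, enum cifsm_policy and cifs_model_policy() are hypothetical
names, and "extras_collected" stands for whether the inode flags and creation
time have already been fetched.

```c
/* Illustrative request bits; not the values used by the patch. */
#define CIFSM_REQ_BASIC_STATS	0x1u
#define CIFSM_REQ_INODE_FLAGS	0x2u
#define CIFSM_REQ_BTIME		0x4u

enum cifsm_policy {
	CIFSM_USE_CACHE,	/* rule (3): use cached attributes as they are */
	CIFSM_REREAD_IF_STALE,	/* rule (2): reread only if out of date */
	CIFSM_REREAD_ALWAYS,	/* rule (1): reread unconditionally */
};

enum cifsm_policy cifs_model_policy(int force_sync, unsigned request_mask,
				    int extras_collected)
{
	unsigned extras = request_mask &
		(CIFSM_REQ_INODE_FLAGS | CIFSM_REQ_BTIME);

	/* (1) Forced, or flags/creation time wanted but never fetched. */
	if (force_sync || (extras && !extras_collected))
		return CIFSM_REREAD_ALWAYS;

	/* (2) Basic stats, or inode flags that were fetched earlier. */
	if (request_mask & (CIFSM_REQ_BASIC_STATS | CIFSM_REQ_INODE_FLAGS))
		return CIFSM_REREAD_IF_STALE;

	/* (3) Anything else is served from the cache, even if expired. */
	return CIFSM_USE_CACHE;
}
```

Note that a request for only the creation time, once collected, falls through
to rule (3); that matches the assumption that the creation time doesn't change
over the life of the inode.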

[NOTE: THIS PATCH IS UNTESTED!]

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/cifs/cifsfs.h   |    2 +
 fs/cifs/cifsglob.h |    5 +++
 fs/cifs/dir.c      |    2 +
 fs/cifs/inode.c    |   76 ++++++++++++++++++++++++++++++++++++++++++++--------
 4 files changed, 71 insertions(+), 14 deletions(-)

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index a7eb65c..50bf70b 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -62,7 +62,7 @@ extern int cifs_rmdir(struct inode *, struct dentry *);
 extern int cifs_rename(struct inode *, struct dentry *, struct inode *,
 		       struct dentry *);
 extern int cifs_revalidate_file(struct file *filp);
-extern int ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Make AFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/dir.c      |    1 +
 fs/afs/internal.h |    1 +
 fs/afs/mntpt.c    |   46 +++++++++++++++-------------------------------
 3 files changed, 17 insertions(+), 31 deletions(-)

diff --git a/fs/afs/dir.c b/fs/afs/dir.c
index afb9ff8..d2dd137 100644
--- a/fs/afs/dir.c
+++ b/fs/afs/dir.c
@@ -65,6 +65,7 @@ static const struct dentry_operations afs_fs_dentry_operations = {
 	.d_revalidate	= afs_d_revalidate,
 	.d_delete	= afs_d_delete,
 	.d_release	= afs_d_release,
+	.d_automount	= afs_d_automount,
 };
 
 #define AFS_DIR_HASHTBL_SIZE	128
diff --git a/fs/afs/internal.h b/fs/afs/internal.h
index 5f679b7..2c700dc 100644
--- a/fs/afs/internal.h
+++ b/fs/afs/internal.h
@@ -583,6 +583,7 @@ extern int afs_abort_to_error(u32);
 extern const struct inode_operations afs_mntpt_inode_operations;
 extern const struct file_operations afs_mntpt_file_operations;
 
+extern struct vfsmount *afs_d_automount(struct path *);
 extern int afs_mntpt_check_symlink(struct afs_vnode *, struct key *);
 extern void afs_mntpt_kill_timer(void);
 
diff --git a/fs/afs/mntpt.c b/fs/afs/mntpt.c
index a9e2303..ea9cfee 100644
--- a/fs/afs/mntpt.c
+++ b/fs/afs/mntpt.c
@@ -24,7 +24,6 @@ static struct dentry *afs_mntpt_lookup(struct inode *dir,
 				       struct dentry *dentry,
 				       struct nameidata *nd);
 static int afs_mntpt_open(struct inode *inode, struct file *file);
-static void *afs_mntpt_follow_link(struct dentry *dentry, struct nameidata *nd);
 static void afs_mntpt_expiry_timed_out(struct work_struct *work);
 
 const struct file_operations afs_mntpt_file_operations = {
@@ -33,7 +32,6 @@ const struct file_operations afs_mntpt_file_operations = {
 
 const struct inode_operations afs_mntpt_inode_operations = {
 	.lookup		= afs_mntpt_lookup,
-	.follow_link	= afs_mntpt_follow_link,
 	.readlink	= page_readlink,
 	.getattr	= ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Add a dentry op (d_automount) to handle automounting directories rather than
abusing the follow_link() inode operation.  The operation is keyed off a new
inode flag (S_AUTOMOUNT).

This makes it easier to add an AT_ flag to suppress terminal segment automount
during pathwalk.  It should also remove the need for the kludge code in the
pathwalk algorithm to handle directories with follow_link() semantics.

I've only changed __follow_mount() to handle automount points, but it might be
necessary to change follow_mount() too.  The latter is only used from
follow_dotdot(), but any automounts on ".." should be pinned whilst we're using
a child of it.

Note that autofs4's use of follow_mount() will need examining if this patch is
committed.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 Documentation/filesystems/Locking |    2 +
 Documentation/filesystems/vfs.txt |   13 ++++++
 fs/namei.c                        |   85 +++++++++++++++++++++++++++++--------
 fs/stat.c                         |    2 +
 include/linux/dcache.h            |    5 ++
 include/linux/fs.h                |    2 +
 6 files changed, 91 insertions(+), 18 deletions(-)

diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
index 96d4293..ccbfa98 100644
--- a/Documentation/filesystems/Locking
+++ b/Documentation/filesystems/Locking
@@ -16,6 +16,7 @@ prototypes:
 	void (*d_release)(struct dentry *);
 	void (*d_iput)(struct dentry *, struct inode *);
 	char *(*d_dname)((struct dentry *dentry, char *buffer, int buflen);
+	struct vfsmount *(*d_automount)(struct path *path);
 
 locking rules:
 	none have BKL
@@ -27,6 +28,7 @@ d_delete:	yes		no		yes		no
 d_release:	no		no		no		yes
 d_iput:		no		no		no		yes
 d_dname:	no		no		no		no
+d_automount:	no		no		no		yes
 
 --------------------------- inode_operations --------------------------- 
 prototypes:
diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index 94677e7..31a9e8f 100644
--- ...
From: Christoph Hellwig
Date: Sunday, July 18, 2010 - 1:50 am

Moving this out of ->follow_link is a good idea, but please submit this
as a separate patch series, as it has very little to do with stat().

--


Except that I want to use it to create a new AT flag for xstat() (and also
fstatat()), but fair enough.

David
--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Provide a mechanism in the kernel by which extra results, beyond those that
have space allocated in the xstat struct, can be returned to userspace.

[I'm not sure this is the best way to do this; it's a bit unwieldy.  However,
 I'd rather not overburden struct kstat with fields for every extra result we
 might want to return as it's allocated on the stack in various places.
 Possibly the pass_result of struct xstat_extra_result could be placed in
 struct kstat to be used if pass_result is non-NULL, and struct kstat could be
 passed to container_of().]

This is modelled on the filldir approach used to read directory entries.  This
allows kernel routines (such as NFSD) to access this information too.

A new inode operation (getattr_extra) is provided that interested filesystems
need to implement.  If this is not provided, then it is assumed that no extra
results will be returned.

The getattr_extra() routine is passed a token to represent the request:

	struct xstat_extra_result {
		u64			request_mask;
		struct kstat		*stat;
		xstat_extra_result_t	pass_result;
	};

The three fields in this struct are: the request_mask (with bits not
representing extra results edited out); the pointer to the kstat structure as
passed to getattr() (stat->query_flags may be useful); and a pointer to a
function to which each individual result should be passed.

The requests can be handled in order with something like the following:

	u64 request_mask = token->request_mask;
	int ret = 0;

	/* Handle each requested extra result, lowest bit first.  Testing the
	 * mask up front avoids calling __ffs64() on an empty mask. */
	while (ret == 0 && request_mask) {
		int request = __ffs64(request_mask);
		request_mask &= ~(1ULL << request);
		switch (request) {
		case ilog2(XSTAT_REQUEST_FOO): {
			struct xstat_foo foo;
			ret = myfs_get_foo(inode, token, &foo);
			if (!ret)
				token->pass_result(token, request,
						   &foo, sizeof(foo));
			break;
		}
		default:
			/* Not an extra result this filesystem provides */
			break;
		}
	}

The caller should probably embed token in something so that they can retrieve
it in the pass_result() function with container_of().
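
As a rough userspace illustration of that embedding, the sketch below wraps
the token in a caller-private context and recovers it in the callback.  It is
not kernel code: the pass_result() signature is assumed from the call shown
above, __builtin_ctzll() stands in for __ffs64(), and struct collect_ctx,
collect_result() and modelfs_getattr_extra() are made-up names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct xstat_extra_result;
/* Assumed callback signature, inferred from the example call. */
typedef int (*xstat_extra_result_t)(struct xstat_extra_result *token,
				    int request, const void *data, size_t len);

struct xstat_extra_result {
	uint64_t		request_mask;
	void			*stat;	/* struct kstat * in the kernel */
	xstat_extra_result_t	pass_result;
};

/* Hypothetical caller context with the token embedded in it. */
struct collect_ctx {
	struct xstat_extra_result token;
	unsigned char	buf[64];
	size_t		used;
	int		nr_results;
};

/* Callback: get back to the context and append the result to its buffer. */
int collect_result(struct xstat_extra_result *token, int request,
		   const void *data, size_t len)
{
	struct collect_ctx *ctx = container_of(token, struct collect_ctx, token);

	(void)request;
	if (len > sizeof(ctx->buf) - ctx->used)
		return -1;	/* out of buffer space */
	memcpy(ctx->buf + ctx->used, data, len);
	ctx->used += len;
	ctx->nr_results++;
	return 0;
}

/* Stand-in for a filesystem's getattr_extra(): one dummy result per bit. */
int modelfs_getattr_extra(struct xstat_extra_result *token)
{
	uint64_t request_mask = token->request_mask;
	int ret = 0;

	while (ret == 0 && request_mask) {
		int request = __builtin_ctzll(request_mask);
		uint32_t value = 100 + request;

		request_mask &= ~(1ULL << request);
		ret = token->pass_result(token, request, &value, sizeof(value));
	}
	return ret;
}
```

With request_mask set to 0x5 and pass_result pointing at collect_result(), the
context ends up holding two results without the filesystem side ever seeing
anything but the token.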

Signed-off-by: ...
From: Christoph Hellwig
Date: Sunday, July 18, 2010 - 1:51 am

As mentioned before, this is total overkill.  The request/response flags
together with the buffer size already provide enough ways to extend
the structure in a backwards-compatible way if needed.

--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Remove the automount through follow_link() kludge code from pathwalk in favour
of using d_automount().

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namei.c |   17 +++--------------
 1 files changed, 3 insertions(+), 14 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index fcec3c6..86068a2 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -845,17 +845,6 @@ fail:
 }
 
 /*
- * This is a temporary kludge to deal with "automount" symlinks; proper
- * solution is to trigger them on follow_mount(), so that do_lookup()
- * would DTRT.  To be killed before 2.6.34-final.
- */
-static inline int follow_on_final(struct inode *inode, unsigned lookup_flags)
-{
-	return inode && unlikely(inode->i_op->follow_link) &&
-		((lookup_flags & LOOKUP_FOLLOW) || S_ISDIR(inode->i_mode));
-}
-
-/*
  * Name resolution.
  * This is the basic name resolution function, turning a pathname into
  * the final dentry. We expect 'base' to be positive and a directory.
@@ -975,7 +964,8 @@ last_component:
 		if (err)
 			break;
 		inode = next.dentry->d_inode;
-		if (follow_on_final(inode, lookup_flags)) {
+		if (inode && unlikely(inode->i_op->follow_link) &&
+		    (lookup_flags & LOOKUP_FOLLOW)) {
 			err = do_follow_link(&next, nd);
 			if (err)
 				goto return_err;
@@ -1888,8 +1878,7 @@ reval:
 		struct inode *inode = path.dentry->d_inode;
 		void *cookie;
 		error = -ELOOP;
-		/* S_ISDIR part is a temporary automount kludge */
-		if (!(nd.flags & LOOKUP_FOLLOW) && !S_ISDIR(inode->i_mode))
+		if (!(nd.flags & LOOKUP_FOLLOW))
 			goto exit_dput;
 		if (count++ == 32)
 			goto exit_dput;

--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Add an AT_NO_AUTOMOUNT flag to suppress terminal automounting of directories
with follow_link semantics.  This can be used by fstatat()/xstat() users to
permit the gathering of attributes on an automount point and also prevent
mass-automounting of a directory of automount points by ls.
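
The effect on pathwalk can be modelled in isolation as below.  This is a
sketch that mirrors the check added to __follow_mount() in this patch, for
reading alongside the diff rather than in place of it; the MODEL_* flag values,
enum automount_action and automount_decide() are invented here and are not the
kernel's definitions.

```c
/* Illustrative flag values; the kernel's LOOKUP_* values differ. */
#define MODEL_LOOKUP_CONTINUE		0x1u	/* more path components follow */
#define MODEL_LOOKUP_NO_AUTOMOUNT	0x2u	/* caller passed AT_NO_AUTOMOUNT */

enum automount_action {
	AUTOMOUNT_FOLLOW,	/* trigger the automount and carry on */
	AUTOMOUNT_STOP,		/* stay on the automount point itself */
	AUTOMOUNT_ELOOP,	/* refuse, as for O_NOFOLLOW */
};

/* Decide what to do on reaching an automount point during pathwalk. */
enum automount_action automount_decide(unsigned lookup_flags, int nofollow)
{
	/* Suppression only applies to the terminal path component. */
	if (!(lookup_flags & MODEL_LOOKUP_CONTINUE)) {
		if (nofollow)
			return AUTOMOUNT_ELOOP;
		if (lookup_flags & MODEL_LOOKUP_NO_AUTOMOUNT)
			return AUTOMOUNT_STOP;
	}
	return AUTOMOUNT_FOLLOW;
}
```

So an automount point in the middle of a path is always mounted and crossed,
whereas one at the end of the path is left alone if the caller asked for that.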

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/namei.c            |   15 ++++++++++-----
 fs/stat.c             |    4 +++-
 include/linux/fcntl.h |    1 +
 include/linux/namei.h |    2 ++
 4 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 86068a2..056427e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -654,7 +654,8 @@ static int follow_automount(struct path *path, int res)
 /* no need for dcache_lock, as serialization is taken care in
  * namespace.c
  */
-static int __follow_mount(struct path *path, unsigned nofollow)
+static int __follow_mount(struct path *path, unsigned nofollow,
+			  struct nameidata *nd)
 {
 	struct vfsmount *mounted;
 	int ret, res = 0;
@@ -674,8 +675,12 @@ static int __follow_mount(struct path *path, unsigned nofollow)
 		}
 		if (!d_automount_point(path->dentry))
 			break;
-		if (nofollow)
-			return -ELOOP;
+		if (!(nd->flags & LOOKUP_CONTINUE)) {
+			if (nofollow)
+				return -ELOOP;
+			if (nd->flags & LOOKUP_NO_AUTOMOUNT)
+				break;
+		}
 		ret = follow_automount(path, res);
 		if (ret < 0)
 			return ret;
@@ -769,7 +774,7 @@ static int do_lookup(struct nameidata *nd, struct qstr *name,
 done:
 	path->mnt = mnt;
 	path->dentry = dentry;
-	ret = __follow_mount(path, 0);
+	ret = __follow_mount(path, 0, nd);
 	if (unlikely(ret < 0))
 		path_put(path);
 	return ret;
@@ -1762,7 +1767,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	if (open_flag & O_EXCL)
 		goto exit_dput;
 
-	error = __follow_mount(path, open_flag & O_NOFOLLOW);
+	error = __follow_mount(path, open_flag & O_NOFOLLOW, nd);
 	if (error < 0)
 		goto exit_dput;
 
diff --git a/fs/stat.c ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Make CIFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

[NOTE: THIS IS UNTESTED!]

[Question:  Why does cifs_dfs_do_refmount() when the caller has already done
	    that and could pass the result through?]

Signed-off-by: David Howells <dhowells@redhat.com>
Cc: Steve French <sfrench@samba.org>
---

 fs/cifs/cifs_dfs_ref.c |  145 +++++++++++++++++++++++-------------------------
 fs/cifs/cifsfs.h       |    6 ++
 fs/cifs/dir.c          |    2 +
 fs/cifs/inode.c        |    8 ++-
 4 files changed, 83 insertions(+), 78 deletions(-)

diff --git a/fs/cifs/cifs_dfs_ref.c b/fs/cifs/cifs_dfs_ref.c
index 4516867..500b952 100644
--- a/fs/cifs/cifs_dfs_ref.c
+++ b/fs/cifs/cifs_dfs_ref.c
@@ -230,8 +230,8 @@ compose_mount_options_err:
 }
 
 
-static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
-		struct dentry *dentry, const struct dfs_info3_param *ref)
+static struct vfsmount *cifs_dfs_do_refmount(struct dentry *mntpt,
+					     const struct dfs_info3_param *ref)
 {
 	struct cifs_sb_info *cifs_sb;
 	struct vfsmount *mnt;
@@ -239,12 +239,12 @@ static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
 	char *devname = NULL;
 	char *fullpath;
 
-	cifs_sb = CIFS_SB(dentry->d_inode->i_sb);
+	cifs_sb = CIFS_SB(mntpt->d_inode->i_sb);
 	/*
 	 * this function gives us a path with a double backslash prefix. We
 	 * require a single backslash for DFS.
 	 */
-	fullpath = build_path_from_dentry(dentry);
+	fullpath = build_path_from_dentry(mntpt);
 	if (!fullpath)
 		return ERR_PTR(-ENOMEM);
 
@@ -262,35 +262,6 @@ static struct vfsmount *cifs_dfs_do_refmount(const struct vfsmount *mnt_parent,
 
 }
 
-static int add_mount_helper(struct vfsmount *newmnt, struct nameidata *nd,
-				struct list_head *mntlist)
-{
-	/* stolen from afs code */
-	int err;
-
-	mntget(newmnt);
-	err = do_add_mount(newmnt, &nd->path, nd->path.mnt->mnt_flags | MNT_SHRINKABLE, mntlist);
-	switch (err) ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Make NFS use the new d_automount() dentry operation rather than abusing
follow_link() on directories.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/dir.c       |    2 +
 fs/nfs/inode.c     |    1 +
 fs/nfs/internal.h  |    1 +
 fs/nfs/namespace.c |   87 ++++++++++++++++++++++++----------------------------
 4 files changed, 44 insertions(+), 47 deletions(-)

diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 782b431..d7e5810 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -927,6 +927,7 @@ const struct dentry_operations nfs_dentry_operations = {
 	.d_revalidate	= nfs_lookup_revalidate,
 	.d_delete	= nfs_dentry_delete,
 	.d_iput		= nfs_dentry_iput,
+	.d_automount	= nfs_d_automount,
 };
 
 static struct dentry *nfs_lookup(struct inode *dir, struct dentry * dentry, struct nameidata *nd)
@@ -1002,6 +1003,7 @@ const struct dentry_operations nfs4_dentry_operations = {
 	.d_revalidate	= nfs_open_revalidate,
 	.d_delete	= nfs_dentry_delete,
 	.d_iput		= nfs_dentry_iput,
+	.d_automount	= nfs_d_automount,
 };
 
 /*
diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 8c6de96..f9737bd 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -296,6 +296,7 @@ nfs_fhget(struct super_block *sb, struct nfs_fh *fh, struct nfs_fattr *fattr)
 					inode->i_op = &nfs_mountpoint_inode_operations;
 				inode->i_fop = NULL;
 				set_bit(NFS_INO_MOUNTPOINT, &nfsi->flags);
+				inode->i_flags |= S_AUTOMOUNT;
 			}
 		} else if (S_ISLNK(inode->i_mode))
 			inode->i_op = &nfs_symlink_inode_operations;
diff --git a/fs/nfs/internal.h b/fs/nfs/internal.h
index d8bd619..48de6f8 100644
--- a/fs/nfs/internal.h
+++ b/fs/nfs/internal.h
@@ -238,6 +238,7 @@ extern char *nfs_path(const char *base,
 		      const struct dentry *droot,
 		      const struct dentry *dentry,
 		      char *buffer, ssize_t buflen);
+extern struct vfsmount *nfs_d_automount(struct path *path);
 
 /* getroot.c */
 extern struct dentry *nfs_get_root(struct super_block *, struct nfs_fh *);
diff --git ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Make automounter filesystems return FS_AUTOMOUNT_FL in st_inode_flags to
xstat().

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/autofs/init.c  |    1 +
 fs/autofs4/init.c |    1 +
 2 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/autofs/init.c b/fs/autofs/init.c
index cea5219..2c06d4b 100644
--- a/fs/autofs/init.c
+++ b/fs/autofs/init.c
@@ -23,6 +23,7 @@ static int autofs_get_sb(struct file_system_type *fs_type,
 static struct file_system_type autofs_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "autofs",
+	.inode_flags	= FS_AUTOMOUNT_FL,
 	.get_sb		= autofs_get_sb,
 	.kill_sb	= autofs_kill_sb,
 };
diff --git a/fs/autofs4/init.c b/fs/autofs4/init.c
index 9722e4b..43df431 100644
--- a/fs/autofs4/init.c
+++ b/fs/autofs4/init.c
@@ -23,6 +23,7 @@ static int autofs_get_sb(struct file_system_type *fs_type,
 static struct file_system_type autofs_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "autofs",
+	.inode_flags	= FS_AUTOMOUNT_FL,
 	.get_sb		= autofs_get_sb,
 	.kill_sb	= autofs4_kill_sb,
 };

--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Make network filesystems return FS_REMOTE_FL in st_inode_flags to xstat().

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/super.c   |    1 +
 fs/ceph/super.c  |    1 +
 fs/cifs/cifsfs.c |    1 +
 fs/coda/inode.c  |    1 +
 fs/ncpfs/inode.c |    1 +
 fs/nfs/super.c   |    7 +++++++
 fs/smbfs/inode.c |    1 +
 7 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/fs/afs/super.c b/fs/afs/super.c
index e932e5a..daaa3d4 100644
--- a/fs/afs/super.c
+++ b/fs/afs/super.c
@@ -40,6 +40,7 @@ static int afs_statfs(struct dentry *dentry, struct kstatfs *buf);
 struct file_system_type afs_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "afs",
+	.inode_flags	= FS_REMOTE_FL,
 	.get_sb		= afs_get_sb,
 	.kill_sb	= kill_anon_super,
 	.fs_flags	= 0,
diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index fa87f51..f486ac8 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -1019,6 +1019,7 @@ static void ceph_kill_sb(struct super_block *s)
 static struct file_system_type ceph_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "ceph",
+	.inode_flags	= FS_REMOTE_FL,
 	.get_sb		= ceph_get_sb,
 	.kill_sb	= ceph_kill_sb,
 	.fs_flags	= FS_RENAME_DOES_D_MOVE,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index ef9a773..eb2c517 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -586,6 +586,7 @@ static int cifs_setlease(struct file *file, long arg, struct file_lock **lease)
 struct file_system_type cifs_fs_type = {
 	.owner = THIS_MODULE,
 	.name = "cifs",
+	.inode_flags = FS_REMOTE_FL,
 	.get_sb = cifs_get_sb,
 	.kill_sb = kill_anon_super,
 	/*  .fs_flags */
diff --git a/fs/coda/inode.c b/fs/coda/inode.c
index d97f993..cb05427 100644
--- a/fs/coda/inode.c
+++ b/fs/coda/inode.c
@@ -308,6 +308,7 @@ static int coda_get_sb(struct file_system_type *fs_type,
 struct file_system_type coda_fs_type = {
 	.owner		= THIS_MODULE,
 	.name		= "coda",
+	.inode_flags	= FS_REMOTE_FL,
 	.get_sb		= coda_get_sb,
 	.kill_sb	= kill_anon_super,
 	.fs_flags	= ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Make special system filesystems return FS_SPECIAL_FL in st_inode_flags to
xstat().

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/ia64/kernel/perfmon.c                |    7 ++++---
 arch/powerpc/platforms/cell/spufs/inode.c |    1 +
 arch/s390/hypfs/inode.c                   |    1 +
 drivers/infiniband/hw/ipath/ipath_fs.c    |    1 +
 drivers/infiniband/hw/qib/qib_fs.c        |    1 +
 drivers/isdn/capi/capifs.c                |    1 +
 drivers/misc/ibmasm/ibmasmfs.c            |    1 +
 drivers/mtd/mtdchar.c                     |    1 +
 drivers/oprofile/oprofilefs.c             |    1 +
 drivers/usb/core/inode.c                  |    1 +
 drivers/usb/gadget/f_fs.c                 |    1 +
 drivers/usb/gadget/inode.c                |    1 +
 drivers/xen/xenfs/super.c                 |    1 +
 fs/anon_inodes.c                          |    1 +
 fs/binfmt_misc.c                          |    1 +
 fs/configfs/mount.c                       |    1 +
 fs/debugfs/inode.c                        |    1 +
 fs/fuse/control.c                         |    1 +
 fs/hostfs/hostfs_kern.c                   |    1 +
 fs/nfsd/nfsctl.c                          |    1 +
 fs/ocfs2/dlmfs/dlmfs.c                    |    1 +
 fs/openpromfs/inode.c                     |    1 +
 fs/pipe.c                                 |    1 +
 fs/proc/root.c                            |    1 +
 fs/sysfs/mount.c                          |    1 +
 ipc/mqueue.c                              |    1 +
 kernel/cgroup.c                           |    1 +
 kernel/cpuset.c                           |    1 +
 net/socket.c                              |    1 +
 net/sunrpc/rpc_pipe.c                     |    1 +
 security/inode.c                          |    1 +
 security/selinux/selinuxfs.c              |    1 +
 security/smack/smackfs.c                  |    1 +
 33 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/arch/ia64/kernel/perfmon.c b/arch/ia64/kernel/perfmon.c
index ...
From: David Howells
Date: Tuesday, July 27, 2010 - 6:41 am

Actually, that last is not true; FS_REMOTE_FL is per-file, not per-fs.  You
can have a filesystem that has both fabricated files and remote files.  For
instance, with kAFS at some point you will be able to go into /afs and do a
lookup for a directory that doesn't exist, but whose name represents a
cell+volume; the filesystem will then fabricate a local directory and attempt
to mount a remote directory on to it.

David
--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Return extended attributes from the NFS filesystem.  This includes the
following:

 (1) The change attribute as st_data_version if NFSv4.

 (2) FS_AUTOMOUNT_FL on referral/submount directories.

Furthermore, what nfs_getattr() does can be controlled as follows:

 (1) If AT_FORCE_ATTR_SYNC is indicated, or mtime, ctime or data_version (NFSv4
     only) are requested then the outstanding writes will be written to the
     server first.

 (2) The inode's attributes may be synchronised with the server:

     (a) If AT_FORCE_ATTR_SYNC is indicated or if atime is requested (and atime
     	 updating is not suppressed by a mount flag) then the attributes will
     	 be reread unconditionally.

     (b) If the data version or any of the basic stats are requested then the
     	 attributes will be reread if the cached attributes have expired.

     (c) Otherwise the cached attributes will be used - even if expired -
     	 without reference to the server.
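
Put together, the two steps might be modelled like this.  It is a sketch of
the description above only: it ignores any file-type check that the real
nfs_getattr() makes, and the NFSM_* bits, struct nfs_model_plan and
nfs_model_decide() are hypothetical names with made-up values.

```c
/* Illustrative request bits; not the values used by the patch. */
#define NFSM_REQ_ATIME		0x01u
#define NFSM_REQ_MTIME		0x02u
#define NFSM_REQ_CTIME		0x04u
#define NFSM_REQ_DATA_VERSION	0x08u
#define NFSM_REQ_OTHER_BASIC	0x10u	/* size, mode, etc. */
#define NFSM_REQ_BASIC_STATS	(NFSM_REQ_ATIME | NFSM_REQ_MTIME | \
				 NFSM_REQ_CTIME | NFSM_REQ_OTHER_BASIC)

enum nfsm_sync {
	NFSM_USE_CACHE,		/* (2c) */
	NFSM_REREAD_IF_EXPIRED,	/* (2b) */
	NFSM_REREAD_ALWAYS,	/* (2a) */
};

struct nfs_model_plan {
	int		flush_writes;	/* step (1) */
	enum nfsm_sync	sync;		/* step (2) */
};

struct nfs_model_plan nfs_model_decide(int force_sync, unsigned request_mask,
				       int is_nfsv4, int atime_suppressed)
{
	struct nfs_model_plan plan;

	/* The data version is only available with NFSv4. */
	if (!is_nfsv4)
		request_mask &= ~NFSM_REQ_DATA_VERSION;

	/* (1) Write back first if the result depends on pending writes. */
	plan.flush_writes = force_sync ||
		(request_mask & (NFSM_REQ_MTIME | NFSM_REQ_CTIME |
				 NFSM_REQ_DATA_VERSION)) != 0;

	/* (2) Decide how hard to try to synchronise with the server. */
	if (force_sync ||
	    ((request_mask & NFSM_REQ_ATIME) && !atime_suppressed))
		plan.sync = NFSM_REREAD_ALWAYS;
	else if (request_mask & (NFSM_REQ_DATA_VERSION | NFSM_REQ_BASIC_STATS))
		plan.sync = NFSM_REREAD_IF_EXPIRED;
	else
		plan.sync = NFSM_USE_CACHE;
	return plan;
}
```

For example, asking only for the data version on an NFSv3 mount comes out as
"no flush, use the cache", since the request bit is dropped before either step.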

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/nfs/inode.c |   46 ++++++++++++++++++++++++++++++++++------------
 1 files changed, 34 insertions(+), 12 deletions(-)

diff --git a/fs/nfs/inode.c b/fs/nfs/inode.c
index 099b351..8c6de96 100644
--- a/fs/nfs/inode.c
+++ b/fs/nfs/inode.c
@@ -495,11 +495,21 @@ void nfs_setattr_update_inode(struct inode *inode, struct iattr *attr)
 int nfs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 {
 	struct inode *inode = dentry->d_inode;
+	unsigned force = stat->query_flags & AT_FORCE_ATTR_SYNC;
 	int need_atime = NFS_I(inode)->cache_validity & NFS_INO_INVALID_ATIME;
 	int err;
 
-	/* Flush out writes to the server in order to update c/mtime.  */
-	if (S_ISREG(inode->i_mode)) {
+	if (NFS_SERVER(inode)->nfs_client->rpc_ops->version < 4)
+		stat->request_mask &= ~XSTAT_REQUEST_DATA_VERSION;
+
+	/* Flush out writes to the server in order to update c/mtime
+	 * or data version if the user wants them */
+	if ((force || stat->request_mask & ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Return extended attributes from the Ext4 filesystem.  This includes the
following:

 (1) The inode creation time (i_crtime) as st_btime.

 (2) The inode i_generation as st_gen if not the root directory.

 (3) The inode i_version as st_data_version if a file with I_VERSION set or a
     directory.

 (4) FS_xxx_FL flags as for FS_IOC_GETFLAGS.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ext4/ext4.h    |    2 ++
 fs/ext4/file.c    |    2 +-
 fs/ext4/inode.c   |   32 +++++++++++++++++++++++++++++---
 fs/ext4/namei.c   |    2 ++
 fs/ext4/symlink.c |    2 ++
 5 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 19a4de5..96823f3 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1571,6 +1571,8 @@ extern int  ext4_write_inode(struct inode *, struct writeback_control *);
 extern int  ext4_setattr(struct dentry *, struct iattr *);
 extern int  ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
 				struct kstat *stat);
+extern int  ext4_file_getattr(struct vfsmount *mnt, struct dentry *dentry,
+				struct kstat *stat);
 extern void ext4_delete_inode(struct inode *);
 extern int  ext4_sync_inode(handle_t *, struct inode *);
 extern void ext4_dirty_inode(struct inode *);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 5313ae4..18c29ab 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -150,7 +150,7 @@ const struct file_operations ext4_file_operations = {
 const struct inode_operations ext4_file_inode_operations = {
 	.truncate	= ext4_truncate,
 	.setattr	= ext4_setattr,
-	.getattr	= ext4_getattr,
+	.getattr	= ext4_file_getattr,
 #ifdef CONFIG_EXT4_FS_XATTR
 	.setxattr	= generic_setxattr,
 	.getxattr	= generic_getxattr,
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 42272d6..822a4ad 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5550,12 +5550,38 @@ err_out:
 int ext4_getattr(struct vfsmount *mnt, struct dentry *dentry,
 		 struct kstat *stat)
 {
-	struct inode ...
From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Return extended attributes from the AFS filesystem.  This includes the
following:

 (1) The vnode uniquifier as st_gen.

 (2) The data version number as st_data_version.

 (3) FS_AUTOMOUNT_FL on mountpoint directories.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/afs/inode.c |   13 ++++++++-----
 1 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/afs/inode.c b/fs/afs/inode.c
index ee3190a..02f115f 100644
--- a/fs/afs/inode.c
+++ b/fs/afs/inode.c
@@ -300,16 +300,19 @@ error_unlock:
 /*
  * read the attributes of an inode
  */
-int afs_getattr(struct vfsmount *mnt, struct dentry *dentry,
-		      struct kstat *stat)
+int afs_getattr(struct vfsmount *mnt, struct dentry *dentry, struct kstat *stat)
 {
-	struct inode *inode;
-
-	inode = dentry->d_inode;
+	struct inode *inode = dentry->d_inode;
 
 	_enter("{ ino=%lu v=%u }", inode->i_ino, inode->i_generation);
 
 	generic_fillattr(inode, stat);
+
+	stat->result_mask |= XSTAT_REQUEST_GEN | XSTAT_REQUEST_DATA_VERSION;
+	stat->gen = inode->i_generation;
+	stat->data_version = inode->i_version;
+	if (test_bit(AFS_VNODE_MOUNTPOINT, &AFS_FS_I(inode)->flags))
+		stat->inode_flags |= FS_AUTOMOUNT_FL;
 	return 0;
 }
 

--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Add a pair of system calls to make extended file stats available, including
file creation time, inode version and data version where available through the
underlying filesystem.

[This depends on the previously posted pair of patches to (a) constify a number
 of syscall string and buffer arguments and (b) rearrange AFS's use of
 i_version and i_generation].

This has a number of uses:

 (1) Creation time: The SMB protocol carries the creation time, which could be
     exported by Samba, which will in turn help CIFS make use of FS-Cache as
     that can be used for coherency data.

     This is also specified in NFSv4 as a recommended attribute and could be
     exported by NFSD [Steve French].

 (2) Lightweight stat: Ask for just those details of interest, and allow a
     netfs (such as NFS) to approximate anything not of interest, possibly
     without going to the server [Trond Myklebust, Ulrich Drepper].

 (3) Heavyweight stat: Force a netfs to go to the server, even if it thinks its
     cached attributes are up to date [Trond Myklebust].

 (4) Inode generation number: Useful for FUSE and userspace NFS servers [Bernd
     Schubert].

 (5) Data version number: Could be used by userspace NFS servers [Aneesh Kumar].

     Can also be used to modify fill_post_wcc() in NFSD which retrieves
     i_version directly, but has just called vfs_getattr().  It could get it
     from the kstat struct if it used vfs_xgetattr() instead.

 (6) BSD stat compatibility: Including more fields from the BSD stat such as
     creation time (st_btime) and inode generation number (st_gen) [Jeremy
     Allison, Bernd Schubert].

 (7) Extra coherency data may be useful in making backups [Andreas Dilger].

 (8) Allow the filesystem to indicate what it can/cannot provide: A filesystem
     can now say it doesn't support a standard stat feature if that isn't
     available.

 (9) Make the fields a consistent size on all arches, and make them large.

(10) Can be extended by using more ...
From: Arnd Bergmann
Date: Thursday, July 15, 2010 - 1:35 pm

I don't think I'd call this general preference. Three of the four
are fixed length and could easily be done inside the structure if you
leave a bit of space instead of a variable-length field at the end.

For the volume id, I could not find any file system that requires more
than 32 bytes here, which is also reasonable to put into the structure.
Make it 36 if you want to cover ascii encoded UUIDs.

That's at most 60 bytes for the extensions you're considering already,
plus the 152 you have already is still less than a cache line on
some machines. Padding it to 256 bytes would make it nice and round,

I'd also still argue that 32 bits would be better since you can
put them into the argument list instead of having to use a pointer
to xstat_parameters. You only use 15 bits so far, so the remaining
17 bits should go a long way. It's not as important to me as the

The resulting syscall I'd hope for would be

int xstat(dfd, const char *filename, unsigned flags,
	  unsigned mask, struct xstat *buf);

Everything else in your patch looks very good and has my full support.

	Arnd
--

From: David Howells
Date: Thursday, July 15, 2010 - 2:53 pm

?

Maybe I wasn't clear: I meant having an extended stat() syscall rather than

You should also include a length.  Volume IDs may be binary rather than


Which we currently allocate on the kernel stack, plus up to a couple of kstat
structs if something like eCryptFS is used.  Admittedly, the base xstat struct
could be kmalloc()'d instead, but why use up all that space if you don't need
it?

David
--

From: Mark Harris
Date: Thursday, July 15, 2010 - 11:22 pm

unsigned?  Existing filesystems support on-disk timestamps
representing times prior to the epoch.
--

From: Arnd Bergmann
Date: Friday, July 16, 2010 - 3:46 am

Ok, I misparsed your statement there. I don't think anyone was
objecting to the use of xstat for this.

The controversial part is only how the extension happens. I would
already feel better about it if you just dropped the
'unsigned long long      st_extra_results[0];' at the end and
added a comment saying that the structure may grow in the future, though

Yes, maybe. There are several possible encodings for this. I was actually
thinking of fixed-length string rather than zero-terminated, but that
is possible as well. If this gets added, we need to audit every possible
use to make sure each of them is covered. My point was mostly that if we


If you're worried about stack utilization, xstat could also be embedded into
kstat, like

struct kstat {
	u64 request_mask;
	struct xstat x;
};

Then you only need one of them on the stack for sys_xstat, or have both
struct kstat and struct stat/stat64 for the other syscalls.

	Arnd
--

From: Arnd Bergmann
Date: Friday, July 16, 2010 - 4:02 am

You could also define the tv_gran_units to be power-of-ten nanoseconds,
making it a decimal floating point number like 

enum {
	XSTAT_NANOSECONDS_GRANULARITY = 0,
	XSTAT_MICROSECONDS_GRANULARITY = 3,
	XSTAT_MILLISECONDS_GRANULARITY = 6,
	XSTAT_SECONDS_GRANULARITY = 9,
};

That would make it easier to define an xstat_time_before() function, though
it means that you could no longer do XSTAT_MINUTES_GRANULARITY and

I wouldn't even go that far if we needed sub-ns (I don't think we do), because
that breaks old compilers that cannot do bit fields.

	Arnd
--

From: David Howells
Date: Friday, July 16, 2010 - 5:38 am

So you're thinking of indicating time (in)equality based on overlapping time
granules?

Your suggestion would suffice, I think.  With a 2:2 split between exponent
(tv_gran_units) and mantissa (tv_granularity), you can do:

	UNIT		SECONDS/UNIT	EXPONENT	MANTISSA
	nanoseconds	0.000000001	-9		1
	microseconds	0.000001	-6		1
	milliseconds	0.001		-3		1
	seconds		1		0		1
	minutes		60		1		6
	hours		3600		2		36
	days		86400		2		864
	weeks		604800		2		6048

Any units beyond that are variable length and not worth considering, IMO.
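
The table above reads as granularity = mantissa * 10^exponent seconds.  A
minimal sketch of decoding such a pair into nanoseconds follows.  The helper
names are hypothetical and not part of the posted patch; the field names are
borrowed from the struct proposed further down this message.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not part of the patch: convert a granularity given as
 * mantissa * 10^exponent seconds into nanoseconds.  Returns 0 if the exponent
 * is below nanoseconds or the result does not fit in 64 bits.
 */
static uint64_t xstat_gran_to_ns(int8_t tv_gran_exp, uint16_t tv_gran_mant)
{
	uint64_t ns = tv_gran_mant;
	int shift = tv_gran_exp + 9;	/* rebase from seconds to nanoseconds */

	if (shift < 0)
		return 0;
	while (shift-- > 0) {
		if (ns > UINT64_MAX / 10)
			return 0;	/* would overflow */
		ns *= 10;
	}
	return ns;
}

/* Check a few rows of the table. */
static void xstat_gran_selfcheck(void)
{
	assert(xstat_gran_to_ns(-9, 1) == 1);			/* nanoseconds */
	assert(xstat_gran_to_ns(0, 1) == 1000000000ULL);	/* seconds */
	assert(xstat_gran_to_ns(1, 6) == 60000000000ULL);	/* minutes */
	assert(xstat_gran_to_ns(2, 864) == 86400000000000ULL);	/* days */
}
```

Rebasing the exponent by 9 is the "make the base unit nS" variant, so the
stored exponent never has to be negative.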

And if you don't want negative numbers in your exponent, you can make the base
unit nS instead of S.

Is it worth allowing a filesystem to indicate that it has granularity smaller
than nS, even if the resolution can't be handled here?  We could even have:

	struct xstat_time {
		signed long long	tv_sec;		/* seconds */
		unsigned int		tv_nsec;	/* nanoseconds */
		unsigned char		tv_psec4;	/* picoseconds/4 */
		signed char		tv_gran_exp;	/* exponent */
		unsigned short		tv_gran_mant;	/* mantissa */
	};

Though it's probably still an unnecessary extravagance to have the pS field.
It's probably best left as padding for now; we can always change our minds
later...

David
--

From: Arnd Bergmann
Date: Friday, July 16, 2010 - 6:32 am

No, just tv_granularity. Most users won't need to care that this

Yes, for example rsync could use this to determine whether a local (e.g. FAT)
and a remote (e.g. NFS) file are identical or not. Right now, you can pass
the granularity in seconds as a command line argument, but it would be nice



There are also two extra bits in tv_nsec ;-). No, I don't think we
need picoseconds any time soon.

One byte padding might not be the worst thing to have in here, like

        struct xstat_time {
                signed long long        tv_sec;         /* seconds */
                unsigned int            tv_nsec;        /* nanoseconds */
                unsigned short          tv_gran_mant;   /* mantissa */
                signed char             tv_gran_exp;    /* exponent */
                unsigned char           unused;
        };

	Arnd
--

From: Mark Harris
Date: Friday, July 16, 2010 - 10:51 pm

At least for the in-tree filesystems, I do not see any that keep
timestamps with a granularity larger than 2s.  For that, a simple
32-bit tv_granularity in nanoseconds (not limited to 1e9) would
suffice, and there is no need for the complexity of dealing with
a separate exponent.

If there is a need to handle larger granularity, its msb could
potentially be used to indicate that the number is in seconds
instead of nanoseconds.  This is convenient because the timestamp
is already broken down into sec and nsec fields.  So this bit would
then indicate that the granularity applies to the tv_sec field, and
that tv_nsec is not in use.  But even this is overkill if no one
uses a granularity larger than 2s.
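
A minimal sketch of that encoding, assuming a 32-bit tv_granularity whose top
bit switches the unit from nanoseconds to seconds.  The flag name and both
helpers are hypothetical, and "may match" is taken here to mean "differ by
less than the granularity", which is only one possible definition.  Note that
2s is 2,000,000,000ns, which still fits below the top bit.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC		1000000000ULL
#define XSTAT_GRAN_IN_SECONDS	0x80000000u	/* hypothetical flag name */

/* Decode the granularity into nanoseconds. */
static uint64_t xstat_granularity_ns(uint32_t tv_granularity)
{
	if (tv_granularity & XSTAT_GRAN_IN_SECONDS)
		return (tv_granularity & ~XSTAT_GRAN_IN_SECONDS) * NSEC_PER_SEC;
	return tv_granularity;
}

/* Return 1 if two timestamps differ by less than the given granularity. */
static int xstat_times_may_match(int64_t a_sec, uint32_t a_nsec,
				 int64_t b_sec, uint32_t b_nsec,
				 uint32_t tv_granularity)
{
	uint64_t gran_ns = xstat_granularity_ns(tv_granularity);
	uint64_t dsec;

	assert(a_nsec < NSEC_PER_SEC && b_nsec < NSEC_PER_SEC);

	/* Order the pair so that a <= b. */
	if (a_sec > b_sec || (a_sec == b_sec && a_nsec > b_nsec)) {
		int64_t s = a_sec;
		uint32_t n = a_nsec;

		a_sec = b_sec;
		a_nsec = b_nsec;
		b_sec = s;
		b_nsec = n;
	}

	dsec = (uint64_t)b_sec - (uint64_t)a_sec;
	if (dsec > gran_ns / NSEC_PER_SEC + 1)
		return 0;	/* far apart; also keeps the product in range */
	return dsec * NSEC_PER_SEC + b_nsec - a_nsec < gran_ns;
}
```

A tool comparing a FAT copy against an NFS copy would pass the coarser of the
two granularities.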

 - Mark
--

From: Arnd Bergmann
Date: Saturday, July 17, 2010 - 2:00 am

Yes, good point. That would indeed be a significant simplification.

	Arnd
--

From: Christoph Hellwig
Date: Sunday, July 18, 2010 - 1:48 am

Adding Uli to the Cc list to make sure this system call is useful
for glibc / can be exported by it.  Otherwise it's rather pointless




Why make them large for the sake of it?  We'll need massive changes
all through libc and applications to ever make use of this.  So please

Just pass this as a single flag by value.  And just make it an unsigned

No point in adding special types here that aren't generically useful.
Also this is the first and only system call using split major/minor



What's the point of the REQUEST in the name?  Also no double
underscores inside the identifier.  Instead adding a _MASK postfix

Please don't overload the FL_ namespace even more.  It's already a
complete mess given that it overloads the extN on-disk namespace.

If you already have a buflen parameter there is absolutely no need for
the extra results field.  Just define new fields at the end and include
them if the bufsize is big enough and it's in the mask of requested

Why add a special case like that?  Especially if we make the request

Please don't introduce tons of special cases.  Instead use a simple rule
like:

 - a filesystem must return all attributes requested, or return an
   error if it can't.
 - a filesystem may return additional attributes, the caller can detect
   this by looking at st_mask.

plus possibly a list of attributes the filesystem must be able to
provide if requested.  I don't see a reason to make that mask different
from the attributes required by Posix.
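
The two-line rule above can be sketched as follows.  The mask bit names, the
helper and the choice of -EOPNOTSUPP are all assumptions made for
illustration; the thread does not settle on names or on an error code.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mask bits. */
#define XSTAT_BTIME_MASK	0x1u
#define XSTAT_GEN_MASK		0x2u
#define XSTAT_DATA_VERSION_MASK	0x4u

/*
 * Apply the rule: fail if anything requested is unsupported, otherwise
 * report everything the filesystem filled in, which may be more than was
 * asked for.
 */
static int xstat_check_request(uint32_t request_mask, uint32_t fs_supported,
			       uint32_t *st_mask)
{
	assert(st_mask);

	if (request_mask & ~fs_supported)
		return -EOPNOTSUPP;	/* error code is an assumption */

	*st_mask = fs_supported;
	return 0;
}
```

The caller then looks at st_mask rather than at field values, so a zero
timestamp is never mistaken for "not stored".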

--

From: Jan Engelhardt
Date: Thursday, July 22, 2010 - 3:52 am

Given xstat.otime=0, how would you determine whether the file is really
tagged with a date of 1970, or whether it's just the fs which did not
store this kind of information?
--

From: David Howells
Date: Thursday, July 22, 2010 - 5:25 am

I was thinking more of stuff that's already in the Linux stat struct, some of
which is fabricated because the underlying fs doesn't support it.

Take RomFS for example: it fabricates all of st_mtime, st_atime, st_ctime,
st_nlinks, st_blocks, st_uid and st_gid because none of them are stored in the
medium.

Similarly, UbiFS fabricates st_blocks and complains in a comment that it makes
no sense for that type of filesystem.

There are other examples.

David
--

From: David Howells
Date: Monday, July 19, 2010 - 7:05 am

There are extra dates and version numbers potentially available.  This may be

So that you can decide not to use it.  Some of our filesystems fabricate things

Otherwise we end up with #ifdefs and duplicated fields of different sizes
within stat structs, and fields of "long" types which vary in size, depending
on the environment.

I just want to make sure that:

       - st_ino is stored as 64-bit
       - st_size and st_blocks are stored 64-bit
       - st.{a,b,c,m}time.tv_sec are stored 64-bit

We could probably stand to make st_blksize 32-bit.  I'd quite like to leave


I can perhaps agree on the device numbers, though some filesystems we have can
store numbers that can't be represented by dev_t.  I think, however, everything
we have can be handled by a 32:32 split.  The numbers could then be encoded as
desired in userspace.

The problem with using extant time structs is they use "long" or "unsigned
long".  And I specifically want to get away from that, since it might be

Perhaps, but it contrasts nicely with request_mask, and makes it easier to






Firstly: Lightweight stat: I want to say that the filesystem may return data
that is out of date if it isn't asked for specifically, but the filesystem has
a copy available.  But I'm not sure that this should apply to non-standard
fields.

Secondly: It doesn't matter what POSIX wants; not all filesystems we support
have everything available.  Where something that's standard is not available,
we have the opportunity to indicate this, whilst still providing a fabricated
result, so that the user can take note of this fact if they choose to, whilst
totally ignoring the indication if they prefer, and just using the fabrication.

David
--

From: Linus Torvalds
Date: Monday, July 19, 2010 - 8:17 am

Ugh. So I think this is pretty disgusting. For a few reasons:

 - that whole xstat buffer handling is just a mess. I think you
already fixed the "xstat_parameters" crud and just made it a simple
unsigned long and a direct argument, but the "buffer+buflen" thing is
still disgusting.

   Why not just leave a few empty fields at the end, and make the rule
be: "We don't just add random crap, so don't expect it to grow widely
in the future".

 - you use "long long" all over the place. Don't do that. If you want
a fixed size, say so, and use "u64/s64". That's the _real_ fixed size,
and "long long" just _happens_ to be the same size on all current
architectures.

   Put another way: "long" just _happened_ to be 32 bits way back when
on pretty much all targets. That's where all the 64-bit compatibility
mess came from. Don't make the same mistake. Besides, if the point is
to make things be the same, _document_ that point by using a type that
is explicitly sized.
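
A minimal sketch of what explicitly sized fields buy, written with the
userspace <stdint.h> types as stand-ins for the kernel's u64/s64.  The struct
name and its field selection are hypothetical, not the layout from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical timestamp layout using explicitly sized types. */
struct xstat_time_fixed {
	int64_t		tv_sec;		/* seconds, signed for pre-epoch times */
	uint32_t	tv_nsec;	/* nanoseconds */
	uint32_t	tv_granularity;	/* granularity in nanoseconds */
};

/* The size is a compile-time fact, not an accident of the ABI (C11). */
static_assert(sizeof(struct xstat_time_fixed) == 16,
	      "xstat_time_fixed must be 16 bytes on every architecture");
static_assert(offsetof(struct xstat_time_fixed, tv_nsec) == 8,
	      "tv_nsec must follow the 64-bit seconds field");
```

With "long long" the same checks happen to pass today; with fixed-width types
the intent is written down in the declaration itself.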

 - why create that new kind of xstat() that realistically absolutely
nobody will use outside of some very special cases, and that has no
real advantages for 99.9% of all people?

   You could make it a "atomic stat+open" by replacing the useless
"size" return value with a "fd" return value, add a flag saying "we're
also interested in opening it" (in the same result set flags), and
instead of that stupid "buflen" input, give the "mode" input that open
needs.

   Tadaa! You now have something that more people might be interested
in, if only because it avoids a system call and might be a performance
win. Who knows. Ask the Wine people what strange

Quite frankly, my gut feel is that once you do "xstat(dfd, filename,
...)" then it's damn stupid to do a separate "fxstat()", when you
might as well say that "xstat(dfd, NULL, ...)" is the same as
"fxstat(fd, ...)"

Now, the difference between adding one or two system calls may not be
huge, but just from a cleanliness angle, I really don't see the point
of having another ...
From: David Howells
Date: Monday, July 19, 2010 - 9:15 am

I was thinking more of an unsigned int argument, since it can't have more than

Because it gets allocated on the kernel stack.  It's already 160 bytes, and
expanding it will eat more kernel stack space.  Now, I can offset that by: (a)
embedding it in struct kstat so that we allocate less stack space in xstat()
overall, and (b) allocating kstat/xstat structs with kmalloc() rather than on

I was following struct stat/stat64 in arch/x86/include/asm/stat.h which do the
same.  Also, if this is going to be seen by userspace, isn't it better to use

The new information is useful for some cases.  Samba for example.  At least
two of the fields I'm adding are also made available through BSD's stat()
call, and will automatically be used for some things by autoconf magic if they
become available.

I'm still trying to get a handle on what people think will be truly useful.  I
can see things *could* be useful, particularly to GUI file managers and ls,
but not everyone is of the same opinion.

Perhaps you or others can offer answers to the following questions as these
might help:

 (1) Should I offer information that's effectively free to come by, but could
     be got through:

     (a) An extra statfs() call - such as whether a file is remote, whether
     	 it's some kernel special file?  Or what the volume label is for this
     	 file?

     (b) An extra getxattr() call - such as a file's security label.

     (c) An extra ioctl() call - such as FS_IOC_GETFLAGS.

 (2) Should I offer information that's appropriate to non-UNIX filesystems
     such as FAT, NTFS or CIFS.  Some of this may map onto other fields, such
     as FS_IOC_GETFLAGS.

 (3) Should I offer information about which results that I've returned are
     actually useful, as opposed to being fabricated on the spot?  Such as
     UID/GID in FAT or blocks in UBIFS.  This may be of use to df or a GUI.
     For instance, a GUI, seeing that UID/GID aren't useful, could ask the
     filesystem to provide information ...
From: Linus Torvalds
Date: Monday, July 19, 2010 - 9:51 am

Using implementation issues like that as a reason for some odd
interface that we'll have to live with for the next decades sounds
bad. It's basically a broken form of versioning, since if you end up
using buffer sizes, everybody will just use "sizeof()" except for some
random crazy developer that decides to re-use a buffer they use for
something else, and then use the size of that instead.

End result: the kernel gets passed in some random constant that
depends on just which version of glibc they were compiled against _or_
on just how crazy they were. And it all just encourages people to do
odd things. For example, the glibc developers, who love adding their
own random fields for crazy "forwards compatibility", will start
extending the xstat structure on their own and then just pass in the
larger size and emulate a few new fields à la that whole vfstat thing.
And then if/when we want to extend on it, we're screwed.

So making it fixed is not only simpler, it avoids all the "I'm passing
in random integers" crud.

You don't need to allocate the whole thing inside the kernel anyway.
Quite the reverse. You probably want to continue using the kernel
"kstat" interface with some extensions. That's the point of kstat,
after all - allowing the filesystem interfaces to share _one_
interface rather than having new interfaces at the VFS level for every
damn new stat implementation we have to do for user space.

In short, your stack space usage is all totally bogus. You should copy
the kstat to the user xstat one field at a time, and NOT allocate an
xstat on the kernel stack at all. There is no advantage to using
"memcpy_to_user()" (after having filled in the kernel struct one field
at a time) over just filling in the user struct directly.

Just do "access_ok() + several __put_user() calls", in other words.

I think you wanted to use "memcpy_to_user()" just because you had that
broken "bufsize" argument to begin with. If you get rid of the
bufsize, you also get rid of the potential ...
From: David Howells
Date: Monday, July 19, 2010 - 10:26 am

That's not what I meant at all.  I meant there may be things out there that
will just use st_btime and st_gen as soon as they appear without anything
having to be done to them because these fields already exist in the BSD stat
struct.

Samba is such an example as this.  It will use st_btime immediately if it

Not having ls cause a mass automount just because you did an ls of a directory

Perhaps.  As previously mentioned, BSD (and other unices) already make some of
these fields available (notably st_btime and st_gen).  We could also make a


I suspect they would, though maybe they can say otherwise.  What about SMB
directory enumeration?  I believe that is effectively getdents-with-stat.
Having to do open+stat for each file for that would be painful.

David
--

From: Linus Torvalds
Date: Monday, July 19, 2010 - 10:46 am

Yeah, but do you need xstat information at all for something like
that? Most people try very hard to make do with the information
returned by readdir itself (d_type and inode number), because if you
end up looking up each name you've already pretty much lost in a
performance model.

(And I do agree that a "readdirplus()" is probably something that a
lot of server people would find useful, but obviously that's another
cross-filesystem nightmare. Only a few filesystems can cheaply give
you anything but d_type/d_ino, and not all do even that),

                      Linus
--

From: Andreas Dilger
Date: Tuesday, July 20, 2010 - 1:28 am

This lightweight stat() interface is exactly needed for things like "color ls",

Having a readdirplus() syscall would be even better, but again only with the ability to request specific attributes.  Otherwise the filesystem may be doing a lot of extra work to collect all of the file attributes, and then userspace will probably be throwing most of them away.

Cheers, Andreas





--

From: David Howells
Date: Thursday, July 22, 2010 - 5:14 am

It is?  It's called crtime in Ext4.  st_btime, however, would be compatible
with BSD's stat, and Samba would just use it by way of autoconf magic if it
appeared.

David
--

From: Volker Lendecke
Date: Thursday, July 22, 2010 - 5:17 am

Samba has the following check:

# recent FreeBSD, NetBSD have creation timestamps called birthtime:             
AC_CHECK_MEMBERS([struct stat.st_birthtimespec.tv_nsec])
AC_CHECK_MEMBERS([struct stat.st_birthtime], AC_CHECK_MEMBERS([struct stat.st_birthtimensec]))

and the supporting code around that. "birth" might also be
where the "b" comes from :-)
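
For illustration, this is roughly how code can consume the macros that those
AC_CHECK_MEMBERS tests define.  It is a sketch, not Samba's actual code; the
helper name is hypothetical, and the HAVE_* macros would normally come from
the generated config.h.

```c
#include <assert.h>
#include <sys/stat.h>
#include <time.h>

/*
 * Hypothetical helper: return 1 and fill *bt if struct stat carries a birth
 * time on this platform, otherwise return 0 with *bt zeroed.
 */
static int stat_birthtime(const struct stat *st, struct timespec *bt)
{
	assert(st && bt);
#if defined(HAVE_STRUCT_STAT_ST_BIRTHTIMESPEC_TV_NSEC)
	*bt = st->st_birthtimespec;
	return 1;
#elif defined(HAVE_STRUCT_STAT_ST_BIRTHTIME)
	bt->tv_sec = st->st_birthtime;
#if defined(HAVE_STRUCT_STAT_ST_BIRTHTIMENSEC)
	bt->tv_nsec = st->st_birthtimensec;
#else
	bt->tv_nsec = 0;
#endif
	return 1;
#else
	(void)st;	/* no birth time in struct stat on this platform */
	bt->tv_sec = 0;
	bt->tv_nsec = 0;
	return 0;
#endif
}
```

If a platform's struct stat gained st_birthtime, the second branch would be
selected at configure time without any source change, which is the "autoconf
magic" referred to earlier in the thread.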

Volker
--

From: Jan Engelhardt
Date: Thursday, July 22, 2010 - 6:05 am

Of course you can find remnants of btime in Linux's BSD-style task 
accounting, but Linux always looked more like SysV than BSD, speaking 
for otime. And if you are using autoconf, the cost of using otime over 

Well, in all reference to the Matrix movie, files aren't born. Except 
for Directory Default ACLs and possibly security labels, they usually 
don't inherit either :)  And on a CS level, it's more like copy than 
inherit, because if the parent changes, the file does not (with the 
potential exception of security relabeling, bla).
--

From: Linus Torvalds
Date: Thursday, July 22, 2010 - 8:14 am

On Thu, Jul 22, 2010 at 5:17 AM, Volker Lendecke

Oh wow. And all of this just convinces me that we should _not_ do any
of this, since clearly it's all totally useless and people can't even
agree on a name.

Let's wait five years and see if there is actually any consensus on it
being needed and used at all, rather than rush into something just
because "we can".

                       Linus
--

From: Volker Lendecke
Date: Thursday, July 22, 2010 - 8:36 am

The nice thing about this is also that if this is supposed
to be fully usable for Windows clients, the birthtime needs
to be changeable. That's what NTFS semantics gives you, thus
Windows clients tend to require it.

Just as a hint, nothing that Linux should necessarily have
to be bothered with, this is Samba's duty :-)

Volker
--

From: Linus Torvalds
Date: Thursday, July 22, 2010 - 8:47 am

On Thu, Jul 22, 2010 at 8:36 AM, Volker Lendecke

Ok. So it's not really a creation date, exactly the same way ctime
isn't at all a creation date.

And maybe that actually hints at a better solution: maybe a better
model is to create a new per-thread flag that says "do ctime updates
the way windows does them".

So instead of adding another "btime" - which isn't actually what even
windows does - just admit that the _real_ issue is that Unix and
Windows semantics are different for the pre-existing "ctime".

The fact is, windows has "access time", "modification time" and
"creation time" _exactly_ like UNIX. It's just that the ctime has
slightly different semantics in windows vs unix. So quite frankly,
it's totally insane to introduce a "birthtime", when that isn't even
what windows wants, just because people cannot face the actual real
difference.

Tell me why we shouldn't just do this right?

                Linus
--

From: Greg Freemyer
Date: Thursday, July 22, 2010 - 9:06 am

On Thu, Jul 22, 2010 at 11:47 AM, Linus Torvalds

I haven't been keeping up with this thread, but I believe NTFS has a
number of timestamps, not just 3.

This blog post references 8 in the left hand column.

The 4 standard (most common) ones are:

File last access
File last modified
File created
MFT last modified

My understanding is that "MFT last modified" has semantics very
similar to Linux ctime.

But there is not a generic equivalent to NTFS created.

Thus if trying to have the Linux kernel match NTFS semantics for the
benefit of Samba is the goal, it seems a new field should be preferred
instead of having linux ctime try to do different jobs.

Greg
--

From: Jeremy Allison
Date: Thursday, July 22, 2010 - 9:27 am

No, ctime isn't the same as Windows "create time". Windows
"create time" semantics are that the timestamp is set to
current time on file creation, but afterwards anyone with
sufficient access can then modify it (!). Which is different
from the "birthtime" spec on *BSD, as they can't be modified.

Currently on *BSD we look for our special EA containing any
modified create times on a file, and return that as "create
time" if found, if not we return the st_birthtime from the
stat struct. That works well enough for systems where you
don't want to allow birthtime to be changed. Having said
that I'm not sure how they cope with doing restores to
a filesystem where you would need to set st_birthtime :-).

Jeremy.
--

From: Linus Torvalds
Date: Thursday, July 22, 2010 - 9:40 am

Umm. What kind of reading problems do you guys have?

I know effin well that ctime isn't the same as Windows create time.
THAT WAS MY POINT.

But the fact is, the Unix ctime semantics are insane and largely
useless. There's a damn good reason almost nobody uses ctime under
unix.

So what I'm suggesting is that we have a flag - either per-process or
per-mount - that just says "use windows semantics for ctime".

And yes, I'm very aware that the "c" in ctime doesn't stand for
"create". But anybody who points that out is - once more - totally
missing the point. My point is that we have three timestamps, and
windows wants three timestamps (somebody claims that NTFS has four
timestamps, but the Windows file time access functions certainly only
show three times, so any potential extra on-disk times have no
relevance because they are invisible to pretty much everybody). We can
have unix semantics for mtime/atime/ctime, or we can have windows
semantics for those three values.

So let's say that we introduce a mount flag that says
"ctime=winctime", which basically just sets a flag that instead of
changing ctime on chmod/chown/etc, it just changes mtime instead (or,
as mentioned, we could make it a process flag instead).

Let's face it, Unix semantics are not sacred.  Especially not
something like ctime, which is pretty damn useless. If you're a samba
server, why not just say "let's do ctime the way windows does creation
times", and let it be at that?

I personally think that Unix ctime is insane. There is no real reason
why "write()" should change mtime, but "chmod" changes ctime. It was
just a random decision way back when, and it's clearly not what samba
wants, and it's equally clearly not what even most _unix_ people want
(just google for "ctime" and "creation time", and watch the confusion
- exactly because unix semantics are simply _random_ and odd semantics
in this area)

I would not be at all surprised if it turns out that people might want
to really turn ctime into ...
From: Jan Engelhardt
Date: Thursday, July 22, 2010 - 10:03 am

I beg to differ. ctime is not completely useless. It reflects changes on 
the inode for when you don't change the content. It's like an mtime
for the metadata. It comes in useful when you go around in your filesystem
trying to figure out which of your co-admins screwed up the permissions on
/etc/passwd... and if the mtime is the same as that of the last backup, 
I can at least have a reasonable assurance that it was /only/ the 
metadata that was tampered with. (SHA1 check, yeah yeah, costly on large 
--

From: Trond Myklebust
Date: Thursday, July 22, 2010 - 10:16 am

Errr... Only if you eliminate utimes() from your syscall table.
Otherwise it is trivial to reset the mtime after changing the file
contents.

Cheers
  Trond

--

From: Jan Engelhardt
Date: Thursday, July 22, 2010 - 10:36 am

Well yes; I had implicitly implied that evil people with malicious intent
are absent.

--

From: Linus Torvalds
Date: Thursday, July 22, 2010 - 10:24 am

Uh. Yes. Except that why is file metadata really different from file
data? Most people really don't care. And a lot of people have asked
for creation dates - and I seriously doubt that Windows people
complain a lot about the fact that there you have mtime for metadata
changes too.

The point being that Unix ctime semantics certainly have well-defined
semantics, but they are in no way "better" than having a real creation
time, and are often worse.

Just imagine what you could do as an MIS person if you actually had a
creation time you could somewhat trust? You talk about seeing somebody
change the permissions of /etc/passwd, but realistically, absent
preexisting semantics, who would really ask for that? The only reason
you mention that as an example of what you can do with ctime is that
that is indeed pretty much the _only_ thing you can do with ctime, and
it really isn't that useful.

In contrast, with a creation date, you see the difference between
people overwriting files by writing to them, or overwriting files by
creating a new one and moving it over the old one. At a guess, that
would be quite as useful to a sysadmin as ctime is now (my gut feel is
that it would be more so, but whatever).

IOW, there really isn't anything magically good about UNIX ctime
semantics, and in fact they are totally broken in the presence of
extended attributes (that's file data, but it only changes ctime? WTF
is up with that? Yes, I know why it happens, and it makes sense within
the insane unix ctime rules, but no way does it make sense in a bigger
picture unless you are in total denial and try to claim that xattrs
are just metadata despite having contents).

And yes, I am also sure that there are applications that do depend on
ctime semantics. Trond mentioned NFS serving, and that's unfortunate.
I bet there are others. That's inevitable when you have 40 years of
history. So I'm not claiming that re-using ctime is painfree, but for
somebody that cares about samba a lot, I bet it's a _lot_ ...
From: Jeremy Allison
Date: Thursday, July 22, 2010 - 11:15 am

Samba mostly ignores ctime, for just the reasons you mention.
But re-using ctime as create time will lead to more horrible
confusion (IMHO).

Easier to add a btime field to stat (or whatever you want to
call it), especially as some of the filesystems already support it,
the code for it exists inside Samba and is working on other UNIX-style
OS'es, and for filesystems that don't support it, just return

Yep. We even have to do that on systems with an immutable
btime to get Windows semantics.

Jeremy.
--

From: Benny Halevy
Date: Thursday, July 22, 2010 - 11:21 am

Yeah, having create time would be important.
That said, having a non user-settable modify timestamp is crucial
for quickly determining whether a file has changed.

--

From: Greg Freemyer
Date: Thursday, July 22, 2010 - 11:45 am

How would "cp --archive" and a host of backup/restore tools work
without user-settable modify timestamps?

Or are you proposing another timestamp?  I do computer forensics, I
like timestamps, but enough is enough.

Greg
--

From: Benny Halevy
Date: Thursday, July 22, 2010 - 12:53 pm

mtime and atime are already user settable and archive programs use
this on the destination, but ctime would be different after
copy/restore.

When updating the archive, just comparing mtime to determine if the source
changed is problematic as it can be set to any value after the change,
but src.ctime would be greater than dest.ctime in this case.
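
A minimal sketch of that check, assuming both timestamps come from clocks that
can be compared directly.  The function names are hypothetical; a real tool
would take the values from the st_mtim and st_ctim fields of struct stat.

```c
#include <assert.h>
#include <time.h>

/* Three-way comparison of two timestamps. */
static int ts_cmp(const struct timespec *a, const struct timespec *b)
{
	if (a->tv_sec != b->tv_sec)
		return a->tv_sec < b->tv_sec ? -1 : 1;
	if (a->tv_nsec != b->tv_nsec)
		return a->tv_nsec < b->tv_nsec ? -1 : 1;
	return 0;
}

/*
 * The archive copy had its mtime set to the source's mtime, and its ctime
 * was stamped by the kernel at copy time.  A later source ctime therefore
 * means the source inode changed after the copy, even if its mtime was
 * reset afterwards.
 */
static int source_may_have_changed(const struct timespec *src_mtime,
				   const struct timespec *src_ctime,
				   const struct timespec *dst_mtime,
				   const struct timespec *dst_ctime)
{
	assert(src_mtime && src_ctime && dst_mtime && dst_ctime);

	if (ts_cmp(src_mtime, dst_mtime) != 0)
		return 1;
	return ts_cmp(src_ctime, dst_ctime) > 0;
}
```

This errs on the side of re-copying: any inode change on the source after the
copy makes it report a change.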

With POSIX semantics (http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_07)
this is not perfect either, as there can be false positives when the file's stat data has changed but
its contents have not, e.g. when st_nlink changed.
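
As a rough sketch of the comparison described above, assuming the POSIX.1-2008
st_ctim field is available: the helper names timespec_after and
source_changed_since_copy are made up for illustration and are not taken from
any existing archive tool.  The check inherits the false positives mentioned
above, since a link count change also bumps ctime.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/stat.h>
#include <time.h>

/* Hypothetical helper: true if timestamp a is strictly later than b. */
bool timespec_after(struct timespec a, struct timespec b)
{
    if (a.tv_sec != b.tv_sec)
        return a.tv_sec > b.tv_sec;
    return a.tv_nsec > b.tv_nsec;
}

/*
 * Hypothetical helper: given stat results for a source file and for the
 * copy made from it, report whether the source looks changed since the
 * copy was written.  mtime is ignored on purpose because the user can set
 * it to any value; ctime is maintained by the kernel.
 */
bool source_changed_since_copy(const struct stat *src, const struct stat *dst)
{
    assert(src != NULL && dst != NULL);
    return timespec_after(src->st_ctim, dst->st_ctim);
}
```

A backup tool would fill both structures with stat() or lstat() and re-copy
the file whenever the function returns true.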

--

From: Greg Freemyer
Date: Thursday, July 22, 2010 - 11:41 am

On Thu, Jul 22, 2010 at 1:24 PM, Linus Torvalds
<torvalds@linux-foundation.org> wrote:


But Windows doesn't work that way, I'm fairly sure.

Windows' mtime is only affected by file content updates.  (I don't
know about xattr updates).

If you look at the first and fourth rows of the table at:

http://blogs.sans.org/computer-forensics/2010/04/12/windows-7-mft-entry-timestamp-prop...

You see that there are a number of activities that update the "$STD
Info MFT Entry Modified Field" that don't update the "$STD Info
Modification Time"

Again, "$STD Info MFT Entry Modified Field" has semantics close to Linux ctime.

And "$STD Info Modification Time" is similar to mtime.

I don't know if there are APIs to present MFT Entry Modified to user
space or if Samba uses that info.  I just know it's part of the
on-disk NTFS filesystem data.

Greg
--

From: Neil Brown
Date: Tuesday, July 27, 2010 - 6:15 pm

On Thu, 22 Jul 2010 10:24:17 -0700

Much as I despise xattrs, this would definitely be my preference.

ctime and mtime have real cache-coherence semantics which require them being
updated by the kernel (whether the cache is on an NFS client, in a backup
archive, or in a .o translation of a .c file).

create-time, on the other hand, would never be updated by the kernel, and
might sometimes be updated by an application.  So it is a very different sort
of attribute, much like a hypothetical 'last archived' time.

The only role the kernel might have would be setting the 'creation time' when
the file was created, but it seems even that isn't always what is wanted,
because people don't so much want the time of creation of the
container-on-disk, but the time of creation of the data-content.

I would want to see a pretty convincing use-case that cannot be solved with
xattrs before 'creation time' was added to a generic kernel interface.

So just use xattrs and don't involve the kernel in any detailed knowledge of
this value.

Maybe xstat should take a list of xattrs to be retrieved as well??  or maybe
not.


But I hope the xstat debate doesn't get bogged down about whether 'create
time' is sensible or not.  Quite apart from the ability to return more
attributes, I think it has real value in being able to return fewer
attributes, and being allowed to ask for 'best guess' values.  Being able to
do an 'fstat' and being certain that you won't be blocked by a non-responsive
NFS server would be a GOOD THING (TM).

NeilBrown
--

From: David Howells
Date: Wednesday, July 28, 2010 - 10:28 am

So does creation time, at least for CIFS caching.  Creation time has potential
for spotting when the object at a pathname has changed for something else,
given the lack of inode number and inode generation from windows servers.

That should be a timestamp in the content itself, not a filesystem metadata

Then there's no point even considering this.  You could emulate the entirety
of stat() with getxattr().  I've previously posted a patch to implement the
retrieval of creation time, inode gen and data version as xattrs and been told

Why not?  BSD has it in its stat struct.  Windows has it in its Win32
equivalents.  Samba for one will look for it there, and use it if it is.

Using an xattr means an extra pathwalk and extra locking per access for any
program that wants it.  It's a reasonable bet such a program will also be
stat'ing the file it wants the creation time for.
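
To illustrate the cost being argued about, here is a minimal sketch of what a
program has to do today if creation time lives in an xattr.  It is Linux
specific.  The attribute name "user.birthtime" and the idea of storing a raw
struct timespec in it are assumptions made up for this example; no such
attribute is defined by the patches or by any filesystem discussed here.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <sys/xattr.h>
#include <time.h>

/* Assumption: this xattr name is invented for the example. */
#define HYPOTHETICAL_BTIME_XATTR "user.birthtime"

/*
 * Fetch the ordinary stat data and then the creation time from an xattr.
 * Returns -1 if stat() fails, 1 if stat() worked but no usable xattr was
 * found, and 0 if both were retrieved.
 */
int stat_then_btime_xattr(const char *path, struct stat *st,
                          struct timespec *btime)
{
    ssize_t n;

    assert(path != NULL && st != NULL && btime != NULL);

    if (stat(path, st) != 0)    /* first path walk */
        return -1;

    /* Second path walk, with its own lookup and locking. */
    n = getxattr(path, HYPOTHETICAL_BTIME_XATTR, btime, sizeof *btime);
    if (n != (ssize_t)sizeof *btime)
        return 1;
    return 0;
}
```

With the creation time carried in the stat structure, the second call and its
path walk would not be needed.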

If we are going to extend stat anyway, then why not make out a short list of
extra things we could usefully return and consider adding them?  Something
like creation time is reasonably easy to come by for little extra overhead.

The idea of xstat() having a variable-length buffer and variable arguments has
been well derided.  It ain't going to happen, much though I'd like it to.  I'd
quite like to offer the opportunity to return the security label, for example.

David
--

From: Neil Brown
Date: Wednesday, July 28, 2010 - 4:04 pm

On Wed, 28 Jul 2010 18:28:02 +0100

This justifies for me why a CIFS client would want to extract the
creation-time from the CIFS protocol, but not why you want to expose it via a
generic interface.
The kernel/filesystem doesn't need to maintain creation-time to meet this
need, only the CIFS server needs to maintain it - the kernel/filesystem just
needs to provide somewhere to store it - xattrs.

Given that we have an extensible attribute framework, it seems wrong to be
adding new attributes to *stat.  If a given filesystem wants to store certain
attributes more efficiently, then it is welcome to intercept xattr calls and
store (say) "cifs.birthtime" directly at a known offset in the inode.

The flip-side of extracting these various attributes is setting them.  One
presumably doesn't want to set st_data_version and possibly not st_gen, but
there seems to be a need to set st_btime, and FS_SYSTEM_FL and FS_TEMPORARY_FL
might want to be set.  Your xstat doesn't give any way to do that; xattrs
already do - you just need to define names for the attributes.

So I'm against adding new attributes that simply involve the fs storing some
information for the application to use.

I'm still pondering those extra flags:
  FS_SPECIAL_FL
  FS_AUTOMOUNT_FL
  FS_AUTOMOUNT_ANY_FL
  FS_REMOTE_FL
  FS_ENCRYPTED_FL
  FS_OFFLINE_FL

They sound like they might be useful, they are not file-metadata (like
btime) but rather implementation details (like st_blocks).  So it is probably
sensible to include them as you have done.

However I would really like to see clear and complete documentation for them.
When exactly should a filesystem set these flags, and what exactly can an
application assume if they are (or are not) set?

If a filesystem is mounted on a network-block-device, or a loop-back of a
file on NFS, is FS_REMOTE_FL set?
Is ROT13 enough for FS_ENCRYPTED_FL to be set?
If the NFS server is "not responding, still trying", should FS_OFFLINE_FL get
set on all files?
And I cannot even ...
From: J. Bruce Fields
Date: Friday, July 30, 2010 - 11:38 am

For what it's worth, the NFSv4 server would also export creation time if
we had it.

--b.
--

From: Jeff Layton
Date: Sunday, August 1, 2010 - 6:40 am

On Thu, 29 Jul 2010 09:04:01 +1000

The problem with the above approach is that you're assuming that the
data in question is always accessed via the CIFS server. If someone
comes along and messes with the data outside of CIFS, then samba won't
have knowledge of that and the birthtime will be wrong.

There's some history behind this as well -- samba tracks windows ACLs
via xattr and it can be very problematic keeping those up to date when
the data is accessed outside of samba.

I think presenting this data via xattr makes the most sense. It's
simple and, as Neil points out, it also provides us with a clearly
settable interface. If we ever get an xstat-like syscall, we can always
present the same data via that as well.

I also think it's quite reasonable to consider tracking birthtime in a
generic inode field. In the absence of that, filesystems could track
this themselves in their filesystem-specific inode structs.

Furthermore, I'll go ahead and propose the following (simple) semantics:

1) birthtime is initialized to the current time when a new inode is
created

2) it's settable via the xattr to an arbitrary value

Either way, the xattr for this ought to be named the same on all
filesystems. Samba shouldn't need to know or care what the underlying
filesystem is, as long as it presents the correct xattr.

That should make samba happy, and be reasonably simple to implement.

-- 
Jeff Layton <jlayton@redhat.com>
--

From: David Howells
Date: Thursday, July 29, 2010 - 9:15 am

It would also be easier for NFSD if the creation time was in struct kstat.
It's included as an optional element in NFSv4.  The same goes for the data
version number.  I'm not sure about the inode generation, I suspect that's used
as part of the FH construction.

However, someone was talking about a userspace NFS daemon, and there they may
want all three bits.  Even Samba may want multiple bits.  Calling getxattr
multiple times per file starts to add up, even for internal values.

Consider further: NFS, for example, could be made to retrieve the creation time
from the server.  This can be merged with the attribute fetch done by the
getattr() call, or it could be done separately by getxattr.  Unless it's stored
in RAM, that's one NFS RPC op versus two.  Okay, that's a bit of an artificial

It's not attribute storage I'm thinking about, but making attribute retrieval

I acknowledge that if we went down the getxattr() route, then that
automatically makes setxattr() the obvious candidate for setting things.

But think about it another way: what if you want to set several attributes?
You have to make a bunch of setxattr() calls.  But what if it were possible to
do all of chmod, chgrp, chown, truncate, utimes, set_btime, etc. all in one go,
atomically?  We more or less have this internally in the kernel, and it might
stand to be exposed to userspace.


I've split these away from ioc flags as ioc flags is very ext2/3/4 centric, and
those filesystems happily create their own ioc flags sets without updating the

Yeah.  I have plans to write documentation for it, but I'd like to have a
clearer idea of what the interface might be before doing that.

But to give you an idea of the flags:

 (*) FS_SPECIAL_FL - Kernel API file from a quasi-filesystem such as /proc or
     /sys - the sort of thing you might not want to expose through NFSD.

 (*) FS_AUTOMOUNT_FL - A named automount/referral point.  You attempt to
     transit this directory and the backing fs will mount something over ...
From: Neil Brown
Date: Monday, August 2, 2010 - 6:13 pm

On Thu, 29 Jul 2010 17:15:15 +0100

Thanks for these.  It particularly helps when you identify how the flag might
be used - guiding GUI icon choice is certainly valid and tells me that if I
don't set the flag 'correctly' (maybe because it is too difficult) then it
isn't the end of the world.

I get the AUTOMOUNT distinction too - FS_AUTOMOUNT_ANY_FL would be good for a
GUI as it could allow you to type in a filename for it to try to follow.

I'm not sure exactly how FS_ENCRYPTED_FL would be used - if the gui might be
prompted to ask for a key there would either need to be a completely general
interface for presenting keys, or the flag should be specific to CIFS and
should mean that a key must be given to CIFS to unlock the file.

Similarly, what can you do with an OFFLINE file?  Do CIFS and AFS offline
files behave the same way?  If not there should be two different flags.  If
so then that behaviour should be specified with the flag ... unless this flag
is just for GUI cosmetics too.


Anyway, I've been thinking more about this and have refined my position
somewhat.  I'll present it here for what it is worth - feel free to ignore
bits you don't like.

Your proposed 'xstat' seems to combine a number of different goals - doing
that is always a bit dangerous as you have to defend it on multiple fronts...

I see the separate goals are:
 A/ allowing attributes to be accessed independently - an explicit list of
    required attributes is given and the FS doesn't need to collect the other
    attributes.
 B/ allowing synthetic attributes to be identified - if the FS doesn't
    natively support some attribute but must synthesise it, you can now
    discover that fact.
 C/ add an ad-hoc collection of new attributes that filesystems can return if
    they happen to support them
 D/ do all the above with a single system call for efficiency.

I think pushing all these together is asking for trouble - arguments about one
aspect will interfere with completion of the others.

Given ...
From: Jim Rees
Date: Thursday, July 22, 2010 - 10:12 am

Linus Torvalds wrote:

  I personally think that Unix ctime is insane. There is no real reason
  why "write()" should change mtime, but "chmod" changes ctime. It was
  just a random decision way back when...

I believe it was done that way so "dump" could backup just the inode and not
the data if only the inode had changed.  Full history here:

http://blog.plover.com/Unix/ctime.html
--

From: Linus Torvalds
Date: Thursday, July 22, 2010 - 10:32 am

Yes, the dump reasoning makes sense, and that history also shows that
originally chmod just changed mtime (since that's the _sane_ thing to
do). So if it wasn't for dump - that nobody uses any more and that was
considered a hack even back when and never supported things like
xattrs etc - unix probably wouldn't have a ctime at all (or would have
implemented a "creation time" because people would have asked for it).

So I'm sure there are reasons for ctime. That just doesn't mean that
it's really "good", the same way there were reasons to name "creat()"
without the "e".

                                  Linus
--

From: Jeremy Allison
Date: Thursday, July 22, 2010 - 11:02 am

Ask NetApp about that :-). They have built a rather large
business on just that fact :-).

Jeremy.
--

From: Jeremy Allison
Date: Thursday, July 22, 2010 - 11:07 am

Get sued out of existence by software patent trolls who have lost
the ability to write code, apparently :-).
--

From: Jeremy Allison
Date: Thursday, July 22, 2010 - 11:07 am

The time is counted in years, not hours :-).
--

From: Trond Myklebust
Date: Thursday, July 22, 2010 - 11:59 am

I said "limited", not "non-existent".

The fact remains that most of us would be hard pressed to name an
application that requires you to share the same dataset to both
Windows/CIFS and posix NFS clients. Everything from ACL models through
caseless vs case-aware filesystems and Windows vs posix locking
semantics tends to discourage mixing the two environments.

   Trond

--

From: Trond Myklebust
Date: Friday, July 30, 2010 - 11:11 am

Your Mac has a perfectly functional CIFS client, as do your Linux boxes.
They both interoperate just fine with Samba, and would presumably
continue to do so if someone were to decide to reuse the ctime field on
your Samba box as storage for a create time.

  Trond

--

From: Phil Pishioneri
Date: Friday, July 30, 2010 - 11:19 am

It didn't, at one point. Some version of Mac OS X would cause a client 
kernel crash when unmounting the CIFS share. I think it's been fixed, 
but we had to have some OS X clients switch to NFS because of it.

-Phil
--

From: Andreas Dilger
Date: Saturday, July 31, 2010 - 11:41 am

CIFS doesn't support symlinks (they just appear as the referenced file), so I've had applications that scan the filesystem recurse indefinitely due to symlinked directories on a CIFS share appearing as hard-linked directories on the client.  This doesn't happen when the filesystem is accessed via NFS.

Cheers, Andreas





--

From: Jan Engelhardt
Date: Saturday, July 31, 2010 - 11:48 am

This shouldn't go on indefinitely - PATH_MAX is reached at some point.
--

From: Trond Myklebust
Date: Saturday, July 31, 2010 - 12:03 pm

Sigh... So please explain how it would be useful to export that
particular filesystem through _both_ CIFS and NFS?

My point was that in most circumstances you want to export either
through CIFS or through NFS, but very rarely both.

I also made the point that converting ctime into a creation time would
break NFS, but it would be a limited breakage, mainly affecting the
client's ability to detect ACL changes, and possibly causing the inode
to get temporarily updated with stale attribute information on occasion
due to out-of-order RPC replies.

  Trond

--

From: Jan Engelhardt
Date: Saturday, July 31, 2010 - 2:20 pm

Seems like a reasonable case for, say, a public "ftp server". For
example, I keep ftp5.gwdg.de:/ftp/pub mounted, that's a little more
convenient than always having to start an ftp client.

Conversely, since NFS is, well, non-existent on Windows, one would
use CIFS there (had it ftp5 opened) to get the same convenience.
--

From: John Stoffel
Date: Thursday, July 22, 2010 - 12:18 pm

>>>>> "Jeremy" == Jeremy Allison <jra@samba.org> writes:


Jeremy> Ask NetApp about that :-). They have built a rather large
Jeremy> business on just that fact :-).

And it does work, as long as you also go with either unix or windows
semantics for the security and permissions bits.  If you try to use
the mixed-mode, you're in for a world of hurt.

Oh yeah, NetApp still uses dump/restore for its backups.  :]  Though
whether it's still dependent on the optimization of ctime being used
to know whether to just dump the inode only or not, I can't say.

John
--


Hi Linus,

 > My point is that we have three timestamps, and
 > windows wants three timestamps (somebody claims that NTFS has four
 > timestamps, but the Windows file time access functions certainly only
 > shows three times, so any potential extra on-disk times have no
 > relevance because they are invisible to pretty much everybody).

Not quite. The underlying structure available to Windows programmers
is this one:

typedef struct _FILE_BASIC_INFORMATION {
  LARGE_INTEGER CreationTime;
  LARGE_INTEGER LastAccessTime;
  LARGE_INTEGER LastWriteTime;
  LARGE_INTEGER ChangeTime;
  ULONG         FileAttributes;
} FILE_BASIC_INFORMATION, *PFILE_BASIC_INFORMATION;

See http://msdn.microsoft.com/en-us/library/ff545762%28v=VS.85%29.aspx

These are the definitions:

CreationTime
    Specifies the time that the file was created. 
LastAccessTime
    Specifies the time that the file was last accessed. 
LastWriteTime
    Specifies the time that the file was last written to. 
ChangeTime
    Specifies the last time the file was changed. 

You are right that the more commonly used APIs (such as
GetFileInformationByHandle()) omit the ChangeTime field in the return
value. The ChangeTime is also not visible via the normal Windows GUI
or command line tools.

But there are APIs that are used by quite a few programs that do get
all 4 timestamps. For example, GetFileInformationByHandleEx() returns
all 4 fields. I include an example program that uses that API to show
all the timestamps below.

And yes, we think that real applications (such as Excel) look at
these values separately.

The other big difference from POSIX timestamps is that the
CreationTime is settable on Windows, and some of the windows UI
behaviour relies on this.

Cheers, Tridge

PS: Sorry for coming into this discussion so late


/* 
   show all 4 file times
   tridge@samba.org, July 2010
*/

#define _WIN32_WINNT 0x0600

#include <stdio.h>
#include <stdlib.h>
#include "windows.h"
#include ...
From: Ted Ts'o
Date: Thursday, July 22, 2010 - 6:21 pm

Well, not POSIX, because POSIX doesn't have CreationTime at all.
BSD's birthtime doesn't allow it to be set, and the question here is
largely philosophical.  Does it literally mean "file creation time" in
terms of when the OS created the file, or does it mean "file" in the
sense of application contents.  For example, if an application edits
the file and saves it out using "write file to foo.new; sync; rename
foo to foo.bak; rename foo.new to foo", should the creation time for
the newly written file "foo" be the time when the editor saved out the
file (i.e., when "foo.new" was created), or copied from the original
file "foo"'s creation time.
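
For concreteness, the save sequence in question can be sketched roughly as
below.  The function name save_with_backup and the buffer size are invented
for this example, and fsync() on the new file stands in for the "sync" step.
The point to notice is that after the final rename the name refers to a new
inode, so a kernel-initialised creation time would be the time of the save
unless the application copies the old value across.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define SAVE_PATH_MAX 4096  /* arbitrary buffer size for this sketch */

/* Hypothetical helper: returns 0 on success, -1 on failure. */
int save_with_backup(const char *path, const char *data, size_t len)
{
    char tmp[SAVE_PATH_MAX], bak[SAVE_PATH_MAX];
    size_t off = 0;
    int fd, n;

    assert(path != NULL && data != NULL);

    n = snprintf(tmp, sizeof tmp, "%s.new", path);
    if (n < 0 || (size_t)n >= sizeof tmp)
        return -1;
    n = snprintf(bak, sizeof bak, "%s.bak", path);
    if (n < 0 || (size_t)n >= sizeof bak)
        return -1;

    /* "write file to foo.new": this creates a brand new inode. */
    fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;
    while (off < len) {
        ssize_t w = write(fd, data + off, len - off);
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        off += (size_t)w;
    }

    /* "sync": fsync() on the one file is used instead of a global sync. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    /* "rename foo to foo.bak": a missing original is not an error. */
    if (rename(path, bak) != 0 && errno != ENOENT) {
        unlink(tmp);
        return -1;
    }

    /* "rename foo.new to foo": the name now refers to the new inode. */
    return rename(tmp, path) == 0 ? 0 : -1;
}
```

Calling it twice on the same path leaves the second contents under the
original name and the first contents under the ".bak" name.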

This is something (whether or not the application is allowed to set
the creation time) that I think makes sense to be either a filesystem
level mount option, or superblock tunable, or even a per-process
personality flag.

However, I think Linus's idea of using a per-process flag to control
whether or not "ctime" has the original POSIX semantics or some new
"creation time" semantics would lead to a huge amount of confusion.
Given that a number of new filesystems, including both ext4 and btrfs,
have creation time, it makes sense for us to have a fourth timestamp.
Whether our creation time is settable or not is a separate
question, and I don't think we need to follow BSD's lead on this.  If
GNOME and/or KDE applications start using it, I could see this
becoming something that gets wide adoption fairly quickly.

						- Ted

--


Hi Ted,

 > Does it literally mean "file creation time" in terms of when the OS
 > created the file, or does it mean "file" in the sense of
 > application contents.  For example, if an application edits the
 > file and saves it out using "write file to foo.new; sync; rename
 > foo to foo.bak; rename foo.new to foo", should the creation time
 > for the newly written file "foo" be the time when the editor saved
 > out the file (i.e., when "foo.new" was created), or copied from the
 > original file "foo"'s creation time.

In Windows this can be controlled by applications, but it is also
done at the filesystem level in NTFS using a technique that Microsoft
call "File System Tunneling". If you create a file with the same name
within a short time (default 15s and settable in the registry) of when
the file previously existed then it will get the same CreationTime as
the previous file.

For details see http://support.microsoft.com/kb/172190

Some applications also do this regardless of the registry setting for
MaximumTunnelEntryAgeInSeconds. They use the ability to set the
CreationTime to get the same behaviour.
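
The tunneling rule described above can be modelled with a toy cache like the
one below.  This is only a sketch of the idea, not how NTFS implements it:
all names (tunnel_cache, tunnel_remember, tunnel_creation_time), the slot
count and the name length limit are invented here, and details such as case
insensitive names are ignored.  Only the 15 second default comes from the
text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

#define TUNNEL_MAX_AGE_SECONDS 15 /* default quoted in the text */
#define TUNNEL_SLOTS 8            /* arbitrary size for this toy cache */
#define TUNNEL_NAME_MAX 64        /* arbitrary name limit for this toy cache */

struct tunnel_entry {
    bool used;
    char name[TUNNEL_NAME_MAX];
    time_t creation_time; /* CreationTime of the file that went away */
    time_t removed_at;    /* when that name stopped existing */
};

/* The caller must zero-initialise the cache before first use. */
struct tunnel_cache {
    struct tunnel_entry slots[TUNNEL_SLOTS];
    size_t next; /* next slot to overwrite, round robin */
};

/* Record the creation time of a name that was just deleted or renamed away. */
void tunnel_remember(struct tunnel_cache *c, const char *name,
                     time_t creation_time, time_t now)
{
    struct tunnel_entry *e;

    assert(c != NULL && name != NULL);
    e = &c->slots[c->next];
    c->next = (c->next + 1) % TUNNEL_SLOTS;
    e->used = true;
    snprintf(e->name, sizeof e->name, "%s", name);
    e->creation_time = creation_time;
    e->removed_at = now;
}

/* Creation time to assign to a file created under "name" at time "now". */
time_t tunnel_creation_time(const struct tunnel_cache *c, const char *name,
                            time_t now)
{
    const struct tunnel_entry *best = NULL;
    size_t i;

    assert(c != NULL && name != NULL);
    for (i = 0; i < TUNNEL_SLOTS; i++) {
        const struct tunnel_entry *e = &c->slots[i];

        if (!e->used || strcmp(e->name, name) != 0)
            continue;
        /* Prefer the most recently removed file of that name. */
        if (best == NULL || e->removed_at > best->removed_at)
            best = e;
    }
    if (best != NULL && now >= best->removed_at &&
        now - best->removed_at <= TUNNEL_MAX_AGE_SECONDS)
        return best->creation_time;
    return now; /* no recent predecessor: an ordinary new file */
}
```

In this model an editor that deletes "foo" and recreates it a few seconds
later gets the old creation time back, while a recreation after the window
has passed gets the current time.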

Cheers, Tridge
--

From: Björn Jacke
Date: Friday, July 23, 2010 - 2:14 am

actually, it can (partly :). But the way it can be done is an insane hack:

<quote "http://ace.delos.com/kirk/">
To provide a sensible birth time for applications that are unaware of the birth
time attribute, we changed the semantics of the "utimes" system call so that if
the birth time was newer than the value of the modification time that it was
setting, it sets the birth time to the same time as the modification time. An
application that is aware of the birth time attribute can set both the birth
time and the modification time by doing two calls to "utimes". First it calls
"utimes" with a modification time equal to the saved birth time, then it calls
"utimes" a second time with a modification time equal to the (presumably newer)
saved modification time.
</quote>

Thus it can also only be set further into the past.
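
A small model of the quoted rule, to make the two-call trick and its one-way
nature concrete.  The struct and function names (toy_times, toy_utimes,
toy_restore_times) are invented for this sketch, and atime, which the real
"utimes" also sets, is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

struct toy_times {
    time_t birthtime;
    time_t mtime;
};

/*
 * Model of the quoted rule: if the birth time is newer than the
 * modification time being set, the birth time is pulled back to it.
 */
void toy_utimes(struct toy_times *t, time_t new_mtime)
{
    assert(t != NULL);
    if (t->birthtime > new_mtime)
        t->birthtime = new_mtime;
    t->mtime = new_mtime;
}

/* The two-call sequence from the quote, as a restore tool might use it. */
void toy_restore_times(struct toy_times *t, time_t saved_birthtime,
                       time_t saved_mtime)
{
    toy_utimes(t, saved_birthtime); /* first call drags birthtime back */
    toy_utimes(t, saved_mtime);     /* second call sets the real mtime */
}
```

Restoring a saved birth time older than the current one works in this model,
but a saved birth time newer than the current one is silently ignored, which
is the limitation noted above.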

Cheers
Björn
-- 
SerNet GmbH, Bahnhofsallee 1b, 37081 Göttingen
phone: +49-551-370000-0, fax: +49-551-370000-9
AG Göttingen, HRB 2816, GF: Dr. Johannes Loxen
--

From: utz lehmann
Date: Friday, July 30, 2010 - 2:22 pm

When abusing an existing time stamp use atime not ctime please.
ctime has its uses. atime was just a mistake and is nearly useless.

And with noatime we already have creation time semantics for atime.


utz


--

From: Jan Engelhardt
Date: Saturday, July 31, 2010 - 1:08 am

noatime was a late afterthought, and because it can interfere with
some programs, relatime came along too.
--

From: utz lehmann
Date: Saturday, July 31, 2010 - 7:43 am

I know mutt uses atime to detect new messages. But there are better and

There are people who prefer noatime over relatime.

Using an existing time stamp for creation time is a bad idea IMHO. But
when doing this use the least important one. Which is atime. For example
ctime is used by backup programs.

Anyway when we want to support creation time it should be an additional
time stamp.


utz


--

From: Jeff Layton
Date: Sunday, August 1, 2010 - 6:25 am

On Fri, 30 Jul 2010 23:22:58 +0200

Ugh. Honestly all of this talk of abusing different time fields seems
like craziness to me. It's going to be very hard to do that without
breaking *something*. There's also very little reason to do this when
xattrs are a much cleaner approach.

Neil Brown has put forth a very reasoned justification for putting the
birthtime in an xattr. After reading it, I think that makes more sense
than anything. It's also something that can be done without any extra
infrastructure. If at some point in the future we get an xstat-like
syscall then we can always add birthtime to that as well.

Ditto for the other fields under discussion (i_generation and the like).

-- 
Jeff Layton <jlayton@redhat.com>
--

From: Jeremy Allison
Date: Thursday, August 5, 2010 - 4:52 pm

Just my 2 cents (as a Samba server implementor). I *hate* the idea
of adding a "virtual" EA for birthtime. If you're going to add it,
just add it to the stat struct like *BSD does. Don't abuse the other
time fields, it's a new one.

Jeff, please don't advocate for an EA for the Samba server to use.
Don't add it as an EA. It's *not* an EA, it's a timestamp.

Jeremy.
--

From: Neil Brown
Date: Thursday, August 5, 2010 - 8:38 pm

On Thu, 5 Aug 2010 16:52:18 -0700

I'm curious.  Why do you particularly care what interface the kernel uses to
provide you with access to this attribute?

And given that it is an attribute that is not part of 'POSIX' or "UNIX", it
would seem to be an extension - an extended attribute.
As the Linux kernel does virtually nothing with this attribute except provide
access, it seems to be a very different class of thing to other timestamps.
Surely it is simply some storage associated with a file which is capable of
storing a timestamp, which can be set or retrieved by an application, and
which happens to be initialised to the current time when a file is created.

Yes, to you it is a timestamp.  But to Linux it is a few bytes of
user-settable metadata.  Sounds like an EA to me.

Or do you really want something like BSD's 'btime' which as I understand it
cannot be set.  Would that be really useful to you?

Is there something important that I am missing?

NeilBrown
--

From: Steve French
Date: Thursday, August 5, 2010 - 8:55 pm

Obviously the CIFS and SMB2 protocols, which the Samba server supports, can
ask the server to set the create time of a file (this is handled
through xattrs today along with the "dos attribute" flags such as
archive/hidden/system), but certainly it is much more common (and

It is another syscall that Samba server would have to make - and xattr
performance is extremely slow on some file systems (although
presumably this one would be more likely to be stored in inode and
perhaps not as bad on ext4, cifs and a few others such as ntfs).


-- 
Thanks,

Steve
--

From: Jeff Layton
Date: Friday, August 6, 2010 - 4:18 am

On Thu, 5 Aug 2010 22:55:06 -0500

Right. One has to consider that samba has to satisfy READDIRPLUS-like
calls, and on a large directory all of those extra syscalls are likely
to impact performance.

In my view, the ideal thing would be to add this field as an EA and
continue work on implementing xstat(). Adding it as an EA gives
userland a way to set this value, without needing to add a new utimes()
variant.

If/when xstat becomes available, samba could use that instead of the EA
for reading this value.

-- 
Jeff Layton <jlayton@redhat.com>
--

From: Neil Brown
Date: Friday, August 6, 2010 - 4:30 pm

On Thu, 5 Aug 2010 22:55:06 -0500

Just a point of clarification - when you say it is common and important to be
able to read the creation time on an existing file, are you still talking in
the context of cifs/smb windows compatibility, or are you talking in the
broader context?
If you are referring to a broader context, could you please give more details,
because I have not heard any mention of any real value of creation-time
outside of Windows interoperability - having such a use clearly documented would
assist the conversation I think.

If on the other hand you are just referring to the Windows interoperability
context ... given that you have to read an EA if the create-time has been
changed, you will always have to read an EA so having something else is

Obviously if we were to make xattrs the preferred way to get create time out
of the filesystem we would want to make sure it is efficient.
It would seem to make perfect sense to add a 'getxattrat' syscall and allow an
AT_NONBLOCK flag (which would probably be useful for statat too).  The
AT_NONBLOCK flag would only get attributes if they were available immediately
without going to storage/network/whatever.

And if it is simply a case of too many syscalls per file, then
getxattrat_multi would seem to be the most general way to go.

NeilBrown
--

From: Steve French
Date: Friday, August 6, 2010 - 4:58 pm

There are other cases, less common than cifs and smb2.   One
that comes to mind is NFS version 4, but there are a few other
cases that I have heard of (backup/archive applications).
The RFC recommends that servers return attribute 50 (creation
time).  See below text:

   time_create         50   nfstime4       R/W      The time of creation
                                                    of the object.  This
                                                    attribute does not
                                                    have any relation to
                                                    the traditional UNIX
                                                    file attribute
                                                    "ctime" or "change
                                                    time".



-- 
Thanks,

Steve
--

From: Neil Brown
Date: Friday, August 6, 2010 - 5:29 pm

On Fri, 6 Aug 2010 18:58:42 -0500

I really don't think NFSv4 is a separate justification.  I'm fairly sure
that attribute was only included in NFSv4 for enhanced Windows
compatibility (windows interoperation was a big issue during the protocol
development).

That leaves hypothetical "backup/archive applications".  Do you have a
concrete example?  Or are we left with just various flavours of Windows
compatibility?  (Not that I have a problem with Windows compatibility, but if
that is the only reason that we have creation-time then I think it is
important to be clear and open about that.)

NeilBrown
--

From: Steve French
Date: Friday, August 6, 2010 - 7:42 pm

Perhaps also useful for MacOS (and other BSD), not just Windows,
although MacOS may use cifs more often than nfs.




-- 
Thanks,

Steve
--

From: Steve French
Date: Friday, August 6, 2010 - 7:54 pm

A quick search for backup applications in Wikipedia came up with a
reference fairly easily (to a backup app which uses creation
time) for Linux:

http://www.aqualab.cs.northwestern.edu/publications/Cornell04VFS.html

Presumably Windows compat. is a stronger motivation than BSD/MacOS
NFSv4 (returning birth time) compat, and backup applications
are a lesser motivation.   There may also be some value in using creation
time as a generation number where no generation number is
available.

Intuitively it seems like creation time would be as "useful" as ctime (and probably
more so) to app developers ... but that is hard to prove.

-- 
Thanks,

Steve
--

From: Neil Brown
Date: Friday, August 6, 2010 - 8:32 pm

On Fri, 6 Aug 2010 21:54:49 -0500

That publication seems to mention 'creation time' only as an abstract concept.
The backup architecture keeps a history of the file all the way back to its
"creation time".
It doesn't appear to need or use a 'creation time' attribute stored with any

I agree, it does seem like an intuitively valuable number - after all we each
have a birthday which we are very aware of and often make use of.  It is
often treated as part of our identity - just like you were mentioning that
the CIFS client uses creation-time to help identify files which lack the
'inode number' identifier that is the common tool in Unix and derivatives.

But I'm not convinced that it is *practically* useful.  The only practical
use beyond windows-compatibility that has been mentioned is a stronger
'identity' tag.  However inode+generation number, or "file-handle-fragment"
are better things to use for identifying a file than "creation time",
especially when the latter is settable.

So if we were to add something for native applications to use, I doubt that
it would be 'creation time' (but I'm still open to hearing a convincing
use-case).

So we are left with an attribute that is needed for windows compatibility,
and so just needs to be understood by samba and wine.  Some filesystems might
support it efficiently, others might require the use of generic
extended-attributes, still others might not support it at all (I guess you
store it in some 'tdb' and hope it works well enough).

Core-linux doesn't really need to know about this - there just needs to be a
channel to pass it between samba/wine and the filesystem. xattr still seems
the best mechanism to pass this stuff around.  Team-samba can negotiate with
fs developers to optimise/accelerate certain attributes, and linux-VFS
doesn't need to know or care (except maybe to provide generic non-blocking or
multiple-access interfaces).

What is 'creation time' used for in the windows world??? Maybe there really
is something ...
--

From: Jeff Layton
Date: Saturday, August 7, 2010 - 3:34 am

On Sat, 7 Aug 2010 13:32:40 +1000

IIUC, you're saying that we should basically just have samba stuff the
current time into an xattr when it creates the file and leave the
filesystems alone. If so, I disagree here.

The problem with treating this as *just* an xattr is that it doesn't
account for files that are created outside of samba but are then shared
out by it.

To handle this correctly, I believe it needs to be initialized by the
kernel to the current time whenever an inode is created, even if samba
doesn't create it. After that, it can be treated as just another xattr.
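
A rough userspace sketch of the "just another xattr" half of this, assuming a
hypothetical attribute name `user.example.btime` and a hypothetical 8-byte
little-endian encoding (neither is what Samba or any filesystem actually
uses). `getxattr()` and `setxattr()` are the standard Linux calls from
`<sys/xattr.h>`:

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/xattr.h>

/* Hypothetical attribute name, chosen only for this example. */
#define BTIME_XATTR "user.example.btime"

/* Encode seconds since the epoch as 8 little-endian bytes. */
static void btime_encode(int64_t sec, unsigned char out[8])
{
	uint64_t v = (uint64_t)sec;

	for (int i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (8 * i));
}

/* Decode the 8-byte little-endian form back into seconds. */
static int64_t btime_decode(const unsigned char in[8])
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v |= (uint64_t)in[i] << (8 * i);
	return (int64_t)v;
}

/* Store the creation time on a file; returns 0 on success, -1 on error. */
static int btime_store(const char *path, int64_t sec)
{
	unsigned char raw[8];

	btime_encode(sec, raw);
	return setxattr(path, BTIME_XATTR, raw, sizeof(raw), 0);
}

/* Load the creation time; returns -1 if it is missing or malformed. */
static int btime_load(const char *path, int64_t *sec)
{
	unsigned char raw[8];
	ssize_t n = getxattr(path, BTIME_XATTR, raw, sizeof(raw));

	if (n != (ssize_t)sizeof(raw))
		return -1;
	*sec = btime_decode(raw);
	return 0;
}
```

A file created by some other program never has `btime_store()` called on it,
so `btime_load()` fails for it. That is the gap described above: unless the
filesystem fills in the value at inode creation, the attribute exists only
for files the server itself created.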

-- 
Jeff Layton <jlayton@samba.org>
--

From: Neil Brown
Date: Saturday, August 7, 2010 - 4:04 am

On Sat, 7 Aug 2010 06:34:00 -0400

I'm not quite saying that (though there is a temptation).  Some attributes
are initialised by the filesystem rather than by common code.  i_uid is a
simple example.  I have no problem with the filesystem initialising the
storage that is used for this well-known-EA to the current time at creation.

If something is created in a different universe, then brought into this one -
when is its date of birth?  The moment of creation, or the moment of entry
into this universe?   If both universes have a common time line (although
with a 10 year offset) then I guess the former, though I think it is a bit of
Yes, I suspect that would be ideal, and trivial for the fs to implement (it
has to initialise it to something after all).

i.e. I agree.

NeilBrown
--

From: Jeremy Allison
Date: Sunday, August 8, 2010 - 5:12 am

It's a matter of taste. The *BSD's have this right IMHO. It
should be part of the stat information. A file timestamp is not
an EA. Making it available that way just feels like an appallingly

It is *already* useful to us, and is widely used in
existing code. The occasions when btime is set are
relatively rare, and at that point we store it in a
separate EA for Windows reporting purposes.

Jeremy.
--

From: Jeff Layton
Date: Sunday, August 8, 2010 - 5:53 am

On Sun, 8 Aug 2010 05:12:09 -0700

It would be more convenient if this were part of stat() but adding a
new stat call is non-trivial. Even if we did that, it still doesn't
solve the problem of being able to set the create time. The fact that
that's rarely done doesn't really matter much -- we ought to shoot for
the semantics that are needed to handle this properly.

If we do settle on a xstat() interface, it might also end up being able
to report things like selinux labels which are also available and
settable via xattr. I don't see a problem with presenting the same data
via multiple interfaces. If presenting this data via xattr solves the
immediate problem of being able to properly store and report the create

If that's the case, don't you have to query for this EA every time you
need to return the create time anyway? If so, then doing this really
isn't any more costly -- you'd just be querying a different EA, right?

-- 
Jeff Layton <jlayton@redhat.com>
--

From: Jeremy Allison
Date: Sunday, August 8, 2010 - 6:05 am

*BSD didn't. They just added something that was useful to UNIX.
I'd be happy with that. We don't need to ape Windows in everything.
The coming ACL disaster will show that (we will go from an ACL
model that is slightly too complex to use, to one that is impossibly

No, we'd be querying an additional EA. The EA we query contains
the DOS attributes as well as the create time.

Jeremy.
--

From: J. Bruce Fields
Date: Friday, August 13, 2010 - 5:54 am

Care to elaborate?

And what would native ACL support mean for Samba?

--b.
--

From: Jeremy Allison
Date: Friday, August 13, 2010 - 10:54 am

POSIX ACLs -> RichACLs (NT-style). Not criticising Andreas here,
people are asking for this. But Windows ACLs are a nightmare
beyond human comprehension :-). In the "too complex to be

RichACLs'll do it, but I feel sorry for the admins :-).

Jeremy.
--

From: Steve French
Date: Friday, August 13, 2010 - 11:09 am

Not much choice - even community colleges now have

Yes - RichACLs and Windows ACLs allow you to set
some strange combinations of permission bits.
RichACLs will make a more natural mapping for
Samba and NFSv4 - and it is far too late to
remove the requirement for Windows and
MacOS (among other clients) support.



-- 
Thanks,

Steve
--

From: Jan Engelhardt
Date: Friday, August 13, 2010 - 12:06 pm

Well, for one, ACLs in NT can be recursive IIRC. You can't say that of Linux
ACLs - instead you have to setfacl -R and setfacl -Rd to give one user access
to a directory and all its subdirs including future new inodes.
--

From: Jeremy Allison
Date: Friday, August 13, 2010 - 12:19 pm

You do realize that Windows does exactly the same thing under
the covers, right ? Watch SMB or SMB2 traffic between a client
and Windows server when someone changes an ACL sometime :-).

Jeremy
--

From: J. Bruce Fields
Date: Monday, August 16, 2010 - 11:04 am

Yeah.  There's some explanation here:

	http://tools.ietf.org/search/rfc5661#section-6.4.3.2

What NT-style ACLs provide is a few bits that help a setfacl-like
application decide how to propagate the change.  But it's still up to
the application to do the recursive traversal.
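
A simplified sketch of how those bits could drive the per-child decision
during such a traversal. The flag values are the ACE4_* inheritance flags
from RFC 5661; the function `ace_inherit()` and its exact rules are an
illustrative assumption, not the logic of any particular implementation:

```c
#include <stdbool.h>
#include <stdint.h>

/* ACE inheritance flags as defined for NFSv4.1 in RFC 5661. */
#define ACE4_FILE_INHERIT_ACE         0x00000001u
#define ACE4_DIRECTORY_INHERIT_ACE    0x00000002u
#define ACE4_NO_PROPAGATE_INHERIT_ACE 0x00000004u
#define ACE4_INHERIT_ONLY_ACE         0x00000008u

#define ACE4_INHERIT_MASK (ACE4_FILE_INHERIT_ACE | \
			   ACE4_DIRECTORY_INHERIT_ACE | \
			   ACE4_NO_PROPAGATE_INHERIT_ACE | \
			   ACE4_INHERIT_ONLY_ACE)

/*
 * Decide whether a child object picks up an ACE from its parent
 * directory.  Returns true and stores the flags the child's copy
 * should carry in *child_flags; returns false if nothing is inherited.
 */
static bool ace_inherit(uint32_t parent_flags, bool child_is_dir,
			uint32_t *child_flags)
{
	uint32_t flags = parent_flags;

	if (!child_is_dir) {
		if (!(flags & ACE4_FILE_INHERIT_ACE))
			return false;
		/* Files have no children, so drop every inheritance bit. */
		*child_flags = flags & ~ACE4_INHERIT_MASK;
		return true;
	}

	if (!(flags & (ACE4_FILE_INHERIT_ACE | ACE4_DIRECTORY_INHERIT_ACE)))
		return false;

	if (flags & ACE4_NO_PROPAGATE_INHERIT_ACE) {
		/* One level only: the copy must not be passed further down. */
		if (!(flags & ACE4_DIRECTORY_INHERIT_ACE))
			return false;
		*child_flags = flags & ~ACE4_INHERIT_MASK;
		return true;
	}

	if (flags & ACE4_DIRECTORY_INHERIT_ACE)
		flags &= ~ACE4_INHERIT_ONLY_ACE; /* applies to the subdirectory */
	else
		flags |= ACE4_INHERIT_ONLY_ACE;  /* carried along for files below */
	*child_flags = flags;
	return true;
}
```

An application changing an ACL on a tree would still have to walk every
directory entry itself and call something like this for each child; the
flags only tell it what to copy.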

--b.
--

From: J. Bruce Fields
Date: Monday, August 16, 2010 - 11:08 am

I was curious whether you can support that with any data (or even just
anecdotes) about real-world sysadmins.

The NT-style ACLs give me a headache, honestly.  But that may just be
because I've been involved with the implementation.  Admins may have the
luxury of using only the subset that they're comfortable with.

--b.
--

From: Jeremy Allison
Date: Monday, August 16, 2010 - 12:07 pm

Just an anecdote, but I remember giving a talk to a room full
of admins, all of whom told me it was essential for Samba to
implement "full Windows ACL compatibility" (we were in the process
of coding it up at the time). I asked them to tell me the difference
between object inherit, container inherit, and inherit only. Only
one hand remained up (out of a room containing a couple of hundred
Windows admins). I asked him where he worked, and the reply was

Yeah. I think most sites set a group as the owner of a share
and the directory so exported, set the directory to inherit
everything down below, and just leave it up to the members
of that group without getting further involved :-).

Jeremy.
--

From: Neil Brown
Date: Sunday, August 8, 2010 - 4:07 pm

On Sun, 8 Aug 2010 05:12:09 -0700

Unfortunately whenever you work on a collaborative project someone has to make
concessions to taste, as we all taste different.. (or have different taste..
or something).

So I think it is very important to clearly differentiate the practical issues
from the aesthetic issues as I think we can hope for unity on the former, but

I'm probably sounding like a scratched record, but when you say "is widely
used" do you mean "is used in samba which is widely used" or do you mean "is
used in a wide variety of applications"?

Because if you are only saying the former, then I don't think we should copy
BSD, but rather I think we should provide exactly the semantics that are most
useful to samba - and that would seem to be creation-time and DOS flags which
the filesystem can store directly in the inode and which samba can access
cheaply.
(and I would prefer to use xattrs, but that is a taste thing and as I'm not
writing the code, I don't get to choose the taste).

But if you are saying the latter, then sharing those details might help us see
that copying BSD is actually the best thing to do, or maybe that something
else is better.

I'm just afraid that if some new interface is added without clear,
comprehensive and up-front justification then we will end up getting a
sub-optimal interface.

NeilBrown
--

From: David Howells
Date: Saturday, July 31, 2010 - 9:53 am

CacheFiles currently uses atime to determine least-recently-usedness.

David
--

From: utz lehmann
Date: Saturday, July 31, 2010 - 11:05 am

How does this work correctly with noatime or relatime (which is the default)?
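
For context, a simplified model of the relatime rule as it is usually
described: atime is refreshed on access only if it is not newer than mtime or
ctime, or if it is at least a day old. The function name and the use of plain
second counts are assumptions made for this sketch; it is not the kernel's
code:

```c
#include <stdbool.h>
#include <time.h>

/*
 * Simplified relatime decision.  All times are seconds since the epoch.
 * Returns true if an access at time 'now' would update atime.
 */
static bool relatime_should_update(time_t atime, time_t mtime,
				   time_t chg_time, time_t now)
{
	if (mtime >= atime)
		return true;	/* modified since the last recorded access */
	if (chg_time >= atime)
		return true;	/* inode changed since the last recorded access */
	if (now - atime >= 24 * 60 * 60)
		return true;	/* recorded access is at least a day old */
	return false;
}
```

Under this rule an atime-based LRU still works, but only with roughly one-day
resolution for files that are read repeatedly without being modified. With
noatime no update happens at all.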

We used FS-Cache with tens of thousands of files cached. Doesn't that mean
that the cleanup has to stat them all?

Why doesn't cachefilesd manage the cache index in a separate database
like other caches?


--

From: David Howells
Date: Saturday, July 31, 2010 - 12:26 pm

Because using atime is much simpler since the filesystem updates it
automatically.  If you have a separate database then you have redundant
information and you need to maintain metadata integrity which has a cost, both
in terms of disk usage and performance.  I'm working on it, but you don't get
it for free.
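
To illustrate why the atime approach is simple, here is a minimal sketch of
picking a cull candidate from access times alone. The `struct cache_entry`
type and `pick_cull_victim()` are hypothetical names for this example and are
not cachefilesd's actual data structures; a real scanner would fill `atime`
from `st_atime` as returned by `stat()`:

```c
#include <stddef.h>
#include <time.h>

/* Hypothetical record for one cached file. */
struct cache_entry {
	const char *name;  /* path of the cache file */
	time_t atime;      /* last access time, e.g. from st_atime */
};

/*
 * Return the index of the least recently used entry, i.e. the one with
 * the oldest atime, or -1 if the table is empty.
 */
static long pick_cull_victim(const struct cache_entry *entries, size_t n)
{
	long victim = -1;

	for (size_t i = 0; i < n; i++) {
		if (victim < 0 || entries[i].atime < entries[victim].atime)
			victim = (long)i;
	}
	return victim;
}
```

No extra state has to be kept consistent with the files, but every candidate
has to be examined, which is the scanning cost raised in the question above.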

David
--

From: Jan Engelhardt
Date: Thursday, July 22, 2010 - 8:46 am

There just is no way currently to store creation times. Abusing ctimes
for write-once archives also stops working once you rsync it from one
place to another. (Which brings me to the side question of why
the ctime isn't settable through futimesnat.)
--

From: David Howells
Date: Thursday, July 22, 2010 - 9:06 am

What do you mean?  Ext4 and BtrFS can both do so; it's just that there's no
user interface to it.

David

--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Return extended attributes from the eCryptFS filesystem, dredged up from the
lower filesystem.

Possibly eCryptFS should also set FS_COMPR_FL on its compressed files.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 fs/ecryptfs/inode.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/fs/ecryptfs/inode.c b/fs/ecryptfs/inode.c
index 31ef525..41bc407 100644
--- a/fs/ecryptfs/inode.c
+++ b/fs/ecryptfs/inode.c
@@ -994,8 +994,10 @@ int ecryptfs_getattr(struct vfsmount *mnt, struct dentry *dentry,
 	struct kstat lower_stat;
 	int rc;
 
-	rc = vfs_getattr(ecryptfs_dentry_to_lower_mnt(dentry),
-			 ecryptfs_dentry_to_lower(dentry), &lower_stat);
+	lower_stat.query_flags = stat->query_flags;
+	lower_stat.request_mask = stat->request_mask | XSTAT_REQUEST_BLOCKS;
+	rc = vfs_xgetattr(ecryptfs_dentry_to_lower_mnt(dentry),
+			  ecryptfs_dentry_to_lower(dentry), &lower_stat);
 	if (!rc) {
 		generic_fillattr(dentry->d_inode, stat);
 		stat->blocks = lower_stat.blocks;

--

From: David Howells
Date: Wednesday, July 14, 2010 - 7:17 pm

Mark arguments to certain system calls as being const where they should be but
aren't.  The list includes:

 (*) The filename arguments of various stat syscalls, execve(), various utimes
     syscalls and some mount syscalls.

 (*) The filename arguments of some syscall helpers relating to the above.

 (*) The buffer argument of various write syscalls.

Signed-off-by: David Howells <dhowells@redhat.com>
---

 arch/alpha/kernel/osf_sys.c             |    6 +++---
 arch/alpha/kernel/process.c             |    2 +-
 arch/arm/kernel/sys_arm.c               |    4 ++--
 arch/arm/kernel/sys_oabi-compat.c       |    6 +++---
 arch/avr32/include/asm/syscalls.h       |    2 +-
 arch/avr32/kernel/process.c             |    3 ++-
 arch/blackfin/kernel/process.c          |    2 +-
 arch/frv/kernel/process.c               |    3 ++-
 arch/h8300/kernel/process.c             |    2 +-
 arch/ia64/include/asm/unistd.h          |    2 +-
 arch/ia64/kernel/process.c              |    2 +-
 arch/m32r/kernel/process.c              |    3 ++-
 arch/m68k/kernel/process.c              |    2 +-
 arch/m68knommu/kernel/process.c         |    2 +-
 arch/microblaze/kernel/sys_microblaze.c |    2 +-
 arch/mips/kernel/syscall.c              |    2 +-
 arch/mn10300/kernel/process.c           |    2 +-
 arch/parisc/hpux/fs.c                   |    7 ++++---
 arch/powerpc/kernel/process.c           |    2 +-
 arch/powerpc/kernel/sys_ppc32.c         |    2 +-
 arch/s390/kernel/compat_linux.c         |   10 +++++-----
 arch/s390/kernel/compat_linux.h         |   10 +++++-----
 arch/s390/kernel/entry.h                |    2 +-
 arch/s390/kernel/process.c              |    2 +-
 arch/sh/include/asm/syscalls_32.h       |    2 +-
 arch/sh/include/asm/syscalls_64.h       |    2 +-
 arch/sh/kernel/process_64.c             |    2 +-
 arch/sparc/kernel/sys_sparc32.c         |    7 ++++---
 arch/um/kernel/exec.c                   |    6 +++---
 arch/um/kernel/internal.h               |    2 +-
 ...